METHODS AND SYSTEMS FOR PROVIDING NETWORK SECURITY IN A PARALLEL PROCESSING ENVIRONMENT
A method of providing network security for executing applications is disclosed. One or more servers including a plurality of microprocessors and a plurality of network processors are provided. A first grouping of microprocessors executes a first application. The first application is executed using the microprocessors in the first grouping. The microprocessors in the first grouping of microprocessors are permitted to communicate with each other via one or more of the network processors. A second grouping of microprocessors executes a second application. At least one server has one or more microprocessors for executing the first application and one or more different microprocessors for executing the second application. The second application is executed using the microprocessors in the second grouping of microprocessors. One or more of the network processors prevent the microprocessors in the first grouping from communicating with the microprocessors in the second grouping during periods of simultaneous execution.
This application claims priority to U.S. Provisional Patent Application No. 61/528,075 filed Aug. 26, 2011, which is incorporated herein by reference.
BACKGROUND OF THE INVENTION

Security is an important part of cloud computing and high performance computing (HPC). While many applications that originated in clusters and private datacenters continue to move to private and public clouds, this migration is not expected to continue unless users are confident that the security infrastructure of the new systems is trustworthy. Various types of attacks require different types of security precautions.
Accordingly, it is desirable to provide a computer architecture that makes unauthorized penetration more difficult to achieve and easier to prevent.
BRIEF DESCRIPTION OF THE INVENTION

In one embodiment, a method of providing network security for executing a plurality of applications is disclosed. The network includes one or more servers. Each server includes a plurality of microprocessors and a plurality of network processors. A first grouping of microprocessors is defined for executing a first application. The first application is executed using the microprocessors in the first grouping of microprocessors. The microprocessors in the first grouping of microprocessors are permitted to communicate with each other via one or more of the network processors. A second grouping of microprocessors is defined for executing a second application. At least one server has one or more microprocessors for executing the first application and one or more different microprocessors for executing the second application. The second application is executed using the microprocessors in the second grouping of microprocessors. Execution of the second application is initiated prior to the completion of execution of the first application. The microprocessors in the second grouping of microprocessors are permitted to communicate with each other via one or more of the network processors. One or more of the network processors prevent the microprocessors in the first grouping of microprocessors from communicating with the microprocessors in the second grouping of microprocessors during periods of simultaneous execution of the first and second applications.
The foregoing summary, as well as the following detailed description of preferred embodiments of the invention, will be better understood when read in conjunction with the appended drawings. For the purpose of illustrating the invention, there are shown in the drawings embodiments which are presently preferred. It should be understood, however, that the invention is not limited to the precise arrangements and instrumentalities shown.
The following definitions are to be applied to terminology used in the application:
Network processor—A processor that connects to multiple nodes and passes messages between those nodes. The network processor is preferably able to perform some operations on the communicated packets, such as performing a check before a packet is forwarded to its proper destination port. Such a check is performed in order to verify that packets sent from the sender of a packet are allowed, according to the rules initialized into the network, to be passed to the destination.
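By way of illustration only, the following sketch shows such a check in software; the rule table, packet fields, and the deliver() helper are hypothetical stand-ins rather than structures from this disclosure, and a network processor would perform the check in hardware.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    source: str       # network address of the sending node
    destination: str  # network address of the intended destination
    payload: bytes

# Rules initialized into the network: the (source, destination) pairs
# that are allowed to communicate.
ALLOWED_PAIRS = {("A", "B"), ("B", "A"), ("C", "D"), ("D", "C")}

def deliver(packet: Packet) -> None:
    print(f"forwarded {packet.source} -> {packet.destination}")

def check_and_forward(packet: Packet) -> bool:
    # Verify that the sender is allowed to reach the destination before
    # the packet is forwarded to its proper destination port.
    if (packet.source, packet.destination) in ALLOWED_PAIRS:
        deliver(packet)
        return True
    return False  # disallowed traffic is not forwarded

check_and_forward(Packet("A", "B", b"hello"))    # forwarded
check_and_forward(Packet("A", "C", b"blocked"))  # dropped
```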
Simultaneous execution—Capability for a first program to be operating in the system at the same time as a second program is also operating in the system. For example, a first program may be checking web pages for certain keywords using Processors A and B, while a second program is deleting redundant web pages on Processors C and D. When processors A, B, C, and D reside on the same physical network, the network processors and/or network switches perform some operations for the first program, and some operations for the second program, often performing operations for the first program using one part of a network processor while other parts of the same network processor are performing operations for the second program.
For example, processor A might be passing a message to processor B while processor C passes a message to processor D. The network processor may receive the messages from A and C before either of those messages has been forwarded on, thereby operating in a situation where the programs are simultaneously executing.
DESCRIPTION OF THE PREFERRED EMBODIMENTS

Certain terminology is used in the following description for convenience only and is not limiting. The words “right”, “left”, “lower”, and “upper” designate directions in the drawings to which reference is made. The terminology includes the above-listed words, derivatives thereof, and words of similar import. Additionally, the words “a” and “an”, as used in the claims and in the corresponding portions of the specification, mean “at least one.”
Referring to the drawings in detail, wherein like reference numerals indicate like elements throughout, methods and systems for providing security to applications executing in a parallel computing architecture are disclosed. The following description of a parallel computing architecture is one example of an architecture that may be used with the network security features of the preferred embodiment. The architecture is further described in commonly assigned U.S. Patent Application Publication No. 2009/0083263 (Felch et al.), which is incorporated by reference herein.
Parallel Computing Architecture
The DRAM memory 2100 is organized into four banks 2110, 2112, 2114 and 2116, and a memory operation requires 4 processor cycles to complete, called a 4-cycle latency. In order to allow such instructions to execute during a single Execute stage of the instruction cycle, eight virtual processors are provided, including new VP#7 (2120) and VP#8 (2122). Thus, the DRAM memories 2100 are able to perform two memory operations for every Virtual Processor cycle by assigning the tasks of two virtual processors to each bank (for example, VP#1 and VP#5 to bank 2110). By elongating the Execute stage to 4 cycles, and maintaining single-cycle stages for the other 4 stages (Instruction Fetch, Decode and Dispatch, Write Results, and Increment PC), it is possible for each virtual processor to complete an entire instruction cycle during each virtual processor cycle.

For example, at hardware processor cycle T=1 Virtual Processor #1 (VP#1) might be at the Fetch instruction stage. At T=2, VP#1 performs the Decode & Dispatch stage. At T=3, VP#1 begins the Execute stage of the instruction cycle, which takes 4 hardware cycles (half a Virtual Processor cycle, since there are 8 Virtual Processors) regardless of whether the instruction is a memory operation or an ALU 1530 function. If the instruction is an ALU instruction, VP#1 might spend cycles 4, 5, and 6 simply waiting. It is noteworthy that although the Virtual Processor is waiting, the ALU is still servicing a different Virtual Processor (processing any non-memory instructions) every hardware cycle and is preferably not idling. The same is true for the rest of the processor except the additional registers consumed by the waiting Virtual Processor, which are in fact idling.

Although this architecture may seem slow at first glance, the hardware is being fully utilized at the expense of the additional hardware registers required by the Virtual Processors. By minimizing the number of registers required for each Virtual Processor, the overhead of these registers can be reduced. Although a reduction in usable registers could drastically reduce the performance of an architecture, the high bandwidth availability of the DRAM memory reduces the penalty paid to move data between the small number of registers and the DRAM memory.
This architecture 1600 implements separate instruction cycles for each virtual processor in a staggered fashion such that at any given moment exactly one VP is performing Instruction Fetch, one VP is Decoding an Instruction, one VP is Dispatching Register Operands, one VP is Executing an Instruction, and one VP is Writing Results. Each VP is performing a step in the instruction cycle that no other VP is performing. The entire resources of processor 1600 are utilized every cycle. Compared to the naïve processor 1500, this new processor could execute instructions six times faster.
As an example processor cycle, suppose that VP#6 is currently fetching an instruction using VP#6 PC 1612 to designate which instruction to fetch, which will be stored in VP#6 Instruction Register 1650. This means that VP#5 is Incrementing VP#5 PC 1610, and VP#4 is Decoding an instruction in VP#4 Instruction Register 1646 that was fetched two cycles earlier. VP#3 is Dispatching Register Operands, which are selected only from VP#3 Registers 1624. VP#2 is Executing the instruction using VP#2 Register 1622 operands that were dispatched during the previous cycle. VP#1 is Writing Results to either VP#1 PC 1602 or a VP#1 Register 1620.
During the next processor cycle, each Virtual Processor will move on to the next stage in the instruction cycle. Since VP#1 just finished completing an instruction cycle it will start a new instruction cycle, beginning with the first stage, Fetch Instruction.
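The staggered schedule can be sketched in a few lines; the stage list and the rotation below are assumptions consistent with the six stages named above, not the actual control logic of processor 1600.

```python
# Each virtual processor occupies a different stage every cycle, so every
# stage of the pipeline is busy on every hardware cycle.
STAGES = [
    "Fetch Instruction",
    "Decode Instruction",
    "Dispatch Register Operands",
    "Execute Instruction",
    "Write Results",
    "Increment PC",
]
NUM_VPS = 6

def stage_of(vp: int, cycle: int) -> str:
    # VP#k runs one stage behind VP#(k+1); the modulo arithmetic rotates
    # every VP through the full instruction cycle.
    return STAGES[(cycle - vp) % NUM_VPS]

for t in range(3):
    occupied = {stage_of(vp, t) for vp in range(1, NUM_VPS + 1)}
    assert len(occupied) == NUM_VPS  # no two VPs share a stage
    print(f"T={t}: " + ", ".join(
        f"VP#{vp}={stage_of(vp, t)}" for vp in range(1, NUM_VPS + 1)))
```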
Note that in the architecture 2160, in conjunction with the additional virtual processors VP#7 and VP#8, the system control 1508 now includes VP#7 IR 2152 and VP#8 IR 2154. In addition, the registers for VP#7 (2132) and VP#8 (2134) have been added to the register block 1522.
To complete the example, during hardware cycle T=7 Virtual Processor #1 (VP#1) performs the Write Results stage, at T=8 it performs the Increment PC stage, and it begins a new instruction cycle at T=9. In another example, the Virtual Processor may perform a memory operation during the Execute stage, which requires 4 cycles, from T=3 to T=6 in the previous example. This enables the architecture to use DRAM 2100 as low-power, high-capacity data storage in place of an SRAM data cache by accommodating the higher latency of DRAM, thus improving power efficiency. A feature of this architecture is that Virtual Processors pay no performance penalty for randomly accessing memory held within their assigned banks. This is quite a contrast to some high-speed architectures that use a high-speed SRAM data cache, which is still typically not fast enough to retrieve data in a single cycle.
Each DRAM memory bank can be architected so as to use a comparable (or lesser) amount of power relative to the power consumption of the processor(s) it is locally serving. One method is to sufficiently share DRAM logic resources, such as those that select rows and read bit lines. During much of a DRAM operation the logic is idling and merely asserting a previously calculated value. Using simple latches in these circuits would allow these assertions to continue and free up the idling DRAM logic resources to serve other banks. Thus, the DRAM logic resources could operate in a pipelined fashion to achieve better area efficiency and power efficiency.
Another method for reducing the power consumption of DRAM memory is to reduce the number of bits that are sensed during a memory operation. This can be done by decreasing the number of columns in a memory bank. This allows memory capacity to be traded for reduced power consumption, thus allowing the memory banks and processors to be balanced and use comparable power to each other.
The DRAM memory 2100 can be optimized for power efficiency by performing memory operations using chunks, also called “words”, that are as small as possible while still being sufficient for performance-critical sections of code. One such method might retrieve data in 32-bit chunks if registers on the CPU use 32 bits. Another method might optimize the memory chunks for use with instruction fetch. For example, such a method might use 80-bit chunks in the case that instructions must often be fetched from data memory and the instructions are typically 80 bits long or are a maximum of 80 bits.
When virtual processors are able to perform their memory operations using only local DRAM memory, the example architecture is able to operate in a real-time fashion because all of these instructions execute for a fixed duration.
Network Security
The network processors 655 may be configured through a separate security network that is not accessible by user applications run on the microprocessors. Microprocessors 2400 in the same group also share an encryption key which is used to encrypt all outgoing data and decrypt incoming data. The encryption keys may be transmitted to the microprocessors 2400 using the security network. The security keys are preferably not directly accessible by the user applications running on the microprocessors so that if malicious code is running on one of these microprocessors it is not able to access the encryption key(s) and it is not able to reconfigure the network processors 655.
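A minimal software sketch of this per-group encryption follows, assuming AES-GCM from the Python cryptography package and a hypothetical group-key table; in the actual system the keys live outside user-accessible memory and the operations are performed in hardware.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# One shared key per grouping of microprocessors, delivered over the
# separate security network; user applications never see this table.
GROUP_KEYS = {"group1": AESGCM.generate_key(bit_length=256)}

def encrypt_outgoing(group: str, plaintext: bytes) -> bytes:
    nonce = os.urandom(12)  # fresh 96-bit nonce per message
    return nonce + AESGCM(GROUP_KEYS[group]).encrypt(nonce, plaintext, None)

def decrypt_incoming(group: str, message: bytes) -> bytes:
    nonce, ciphertext = message[:12], message[12:]
    return AESGCM(GROUP_KEYS[group]).decrypt(nonce, ciphertext, None)

msg = encrypt_outgoing("group1", b"data for a peer in the same group")
assert decrypt_incoming("group1", msg) == b"data for a peer in the same group"
```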
The Key distribution and network initialization server 730 also contains a Master Private Key 750, which it can use to digitally sign the messages that it sends, and a public key which allows verification of the digital signatures produced by the Master Private Key 750. This public key is analogous to the private key that is originally installed in the Security Network Nodes 820 during manufacture. With these keys it is possible for the Key distribution and network initialization server 730 to send data to a specific Security Network Node 820 that can only be read by that specific Security Network Node 820. These keys also allow the Security Network Node 820 to verify that the data was sent by the trusted Key distribution and network initialization server 730. The Key Packet Generator 740 hardware is designed so that the Master Private Key does not have to be loaded into the memory of the Key distribution and network initialization server 730. The danger with loading such a key into memory is that the key could be read by an attacker that has access to the memory. For example, one attack that has been used is to physically read the capacitors of memory using a special device. This works because memory may hold data in capacitors which, depending on how the capacitors were manufactured, can retain detectable data for hours or more after the computer has been turned off. If the Master Private Key is obtained by an attacker, then it is possible for the attacker to initialize the security of the network, thereby exposing subsequent network traffic to spying.
The Key Packet Generator 740 receives a public key from the Public Key Database 745 with which it will encrypt the outgoing message. In step 780 the Key Packet Generator then uses the Key Generator 755 to generate a key that will be used for efficient encryption and decryption, such as a symmetric key for AES-256. Suppose a key such as ABC1 is generated. If ABC1 is a symmetric key, which works with a specific symmetric-key encryption/decryption algorithm, then any node that knows the key can both read and send messages to other nodes that have the same key. Nodes that do not have the key cannot read the messages.
The Key table 760 holds keys that have been generated by the Key Generator 755, which allows the same symmetric key to be sent in multiple messages. Using a hardware solution for the Key Packet Generator prevents the symmetric key from ever being loaded into the memory of the Key distribution and network initialization server 730. It is therefore more difficult for an attacker to discover the symmetric key in order to read messages.
Note that it is possible for the Public Key Database 745 to be implemented within the Key Packet Generator 740 so that it is more difficult for an attacker to insert their own public key into the Public Key Database 745 in the hope of being sent an encrypted message from the Key Packet Generator 740 that the attacker can decrypt.
Two symmetric keys are generated for a given program, one that will not be loaded into user accessible memory and one that will be loaded. The key that will be loaded into memory is more vulnerable to attack. Therefore, a second key is used so that if the first key is discovered by an attacker it is still not possible for the attacker to read all of the messages. While custom hardware can be designed so that keys do not need to be loaded into memory, it may also be necessary to integrate computer hardware that is not custom and uses software to perform encryption and decryption, thereby requiring the key to be loaded into memory. Using the two-key system an attacker will have much more difficulty reading messages that are sent from custom hardware to other custom hardware, when the custom hardware uses keys that are not saved in user-accessible memory at any point in the system.
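The key choice implied by this two-key scheme can be sketched as follows, with hypothetical names for the two keys and for the endpoint capability flags:

```python
def select_key(sender_is_custom_hw: bool, receiver_is_custom_hw: bool,
               key_hw: bytes, key_sw: bytes) -> bytes:
    """Pick which symmetric key protects a given link.

    key_hw is never loaded into user-accessible memory and is only usable
    when both endpoints are custom hardware; key_sw is the memory-resident
    key for links involving software encryption. Compromise of key_sw
    therefore exposes only the software-path traffic.
    """
    if sender_is_custom_hw and receiver_is_custom_hw:
        return key_hw
    return key_sw
```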
In step 782 the symmetric keys are digitally signed using the Master Private Key 750 within the Key Packet Generator 740, and then in step 784 the signed keys are encrypted using the public key previously loaded from the Public Key Database 745. In step 790 the list of recruited processors and servers (called a white list) and the boot data, which have previously been received as input 725, are sent to the Key Packet Generator 740 for signature, encryption, and inclusion in the packet. The signature and encryption keys are the same as those used in steps 782 and 784. In step 786 the packet is sent to the proper Security Network Node over the Security and Initialization Network 765. The loop returns to step 772 until all processors have been initialized, at which point the ending step 788 is reached.
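These packet-building steps can be sketched under stated assumptions: RSA keys for the master signature and for the node's public key, AES-256 for the symmetric key, and the PyPI cryptography package for the primitives. The real Key Packet Generator 740 is hardware precisely so that neither the Master Private Key nor the symmetric key enters server memory; this software version is for illustration only.

```python
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding

def build_key_packet(master_private_key, node_public_key,
                     white_list: bytes, boot_data: bytes) -> bytes:
    # Step 780: generate the symmetric key (e.g., a 256-bit AES key).
    symmetric_key = os.urandom(32)
    # Steps 782/790: digitally sign the key material, white list, and
    # boot data with the Master Private Key.
    signature = master_private_key.sign(
        symmetric_key + white_list + boot_data,
        padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                    salt_length=padding.PSS.MAX_LENGTH),
        hashes.SHA256())
    # Step 784: encrypt with the node's public key so that only the target
    # Security Network Node can read it. (RSA-OAEP seals only a short
    # payload; a real design would carry the signed white list and boot
    # data under the symmetric key itself, a hybrid step elided here.)
    sealed_key = node_public_key.encrypt(
        symmetric_key,
        padding.OAEP(mgf=padding.MGF1(hashes.SHA256()),
                     algorithm=hashes.SHA256(), label=None))
    # Step 786: the assembled packet is sent over the Security and
    # Initialization Network 765.
    return sealed_key + signature
```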
As noted previously, it is possible for the white list to instead be used as a blacklist, in which case the packet is forwarded if the applicable destination is not found, and blocked if the destination address is found in the blacklist. In another embodiment two lists are used, one white and one black, and the packet is forwarded if the destination is found in the white list or if it is not found in the blacklist. This allows broad blacklist rules to block most traffic while the white list separately approves specific source/destination pairs that must be allowed.
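A sketch of the combined decision, with illustrative set-based tables:

```python
def should_forward(source: str, destination: str,
                   white_list: set[tuple[str, str]],
                   black_list: set[tuple[str, str]]) -> bool:
    pair = (source, destination)
    if pair in white_list:         # white-list approval overrides the blacklist
        return True
    return pair not in black_list  # otherwise blocked only if blacklisted

# A pair covered by a broad blacklist rule can still be approved separately:
white = {("A", "N1")}
black = {("A", "N1"), ("A", "X")}
assert should_forward("A", "N1", white, black)     # white list wins
assert not should_forward("A", "X", white, black)  # blacklisted
assert should_forward("A", "B", white, black)      # in neither list
```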
Decryption works in a manner similar to the encryption process described above. In decryption, an incoming packet to the processor 1120 has its data 1140 sent to the Decrypter 1150 and its source address 1130 sent to the Key Selector 1136. The Key Selector 1136 uses the source address 1130 to determine a key 1155, which is then sent to the Decrypter 1150. Once the Decrypter receives both the data payload 1140 and the key 1155, the message is decrypted and sent to the network-on-chip 2410 via channel 1165.
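A sketch of this decryption path, reusing the AES-GCM assumption from the earlier encryption sketch and a hypothetical key-selector table indexed by source address:

```python
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Key Selector 1136: maps a packet's source address to the proper key.
DECRYPTION_KEYS = {"B": AESGCM.generate_key(bit_length=256)}

def receive_packet(source_address: str, data: bytes) -> bytes:
    key = DECRYPTION_KEYS[source_address]  # Key Selector lookup
    nonce, ciphertext = data[:12], data[12:]
    plaintext = AESGCM(key).decrypt(nonce, ciphertext, None)  # Decrypter
    return plaintext  # would be passed on to the network-on-chip
```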
Next, the database is searched at step 1340 to determine whether it already contains data similar to the data record read in step 1330. The results are analyzed in step 1350 and if the data is new the program proceeds to step 1360 via path 1354. In step 1360 the new data is inserted into the database and the program proceeds to step 1370. If the data is not new then the program skips step 1360 and proceeds to step 1370 via path 1358.
Step 1370 checks whether all data records have been processed, and if so the program proceeds to step 1380 via path 1378. In step 1380 the database is saved and the program ends at step 1390. If not all data records have been processed then step 1370 proceeds to step 1330 via path 1374 and the next data record begins being processed.
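This record-processing loop can be sketched in software, with the assumptions that records are hashable values and that “similar” means an exact match:

```python
def process_records(records, database: set) -> None:
    for record in records:           # step 1330: read the next data record
        if record not in database:   # steps 1340/1350: search for similar data
            database.add(record)     # step 1360: insert new data (path 1354)
        # if the data is not new, step 1360 is skipped (path 1358)
    save(database)                   # step 1380: save the database

def save(database: set) -> None:
    print(f"saved {len(database)} records")

db: set = set()
process_records(["page1", "page2", "page1"], db)  # "page1" stored only once
```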
It is possible that messages sent from processor A to processor B might not reach processor B due to unsuccessful forwarding at the network processor NP1 1415. To check if this is the case, the white list for Port 1 1025 is searched for all of the entries for which processor A is a source. Entries 1426, 1427 and 1428 qualify and may contain a valid destination to allow messages to pass to processor B. To fully verify that packets can be transferred from Processor A 1450 to processor B 1460 the first entry is checked and it can be seen that Destination B is indeed valid. (Note that if this was a blacklist the presence of such an entry would invalidate such message passing.) Thus, Program X1 1445 running on processor A 1450 and processor B 1460 can pass messages from Processor A 1450 to processor B 1460.
Because both processor A 1450 and processor B 1460 are running program X1 1445 it may be necessary for processor B 1460 to pass messages to processor A 1450. In order for these messages to be successfully sent and read by processor A there must be a relevant entry in the encryption key selector 1135 with a matching key in the decryption key selector 1136 of processor A 1450. Both tables contain the key ABC1: the first entry designates ABC1 for Destination A 1461, for encrypting messages from Processor B 1460 to processor A 1450, and the second table entry designates use of the same key ABC1 for decrypting messages received by Processor A 1450 from processor B 1460.
The situation for processors C 1470 and D 1480 is the same, except that the relevant key selectors 1135, 1136 specify the key XYZ1 for encryption and decryption. In this case the relevant entries are 1471, 1476, 1481, and 1486.
Processors A and B use key ABC2 for communication with servers N1 and Q1, as designated by key selector entries 1452, 1457, 1462, and 1467. Messages sent from processor A 1450 and processor B 1460 to servers N1 and Q1 are allowed within the network processor NP1 1415 because the corresponding entries 1427, 1428, 1432, and 1433 are present. It may be possible, similarly to the key selector tables, to use one entry to designate both server N1 and server Q1 as valid. This could be implemented, for example, by giving N1 and Q1 contiguous network addresses, as shown in the sketch below. Any number of servers can be allowed, and keys designated, in this way, provided the network addresses are contiguous.
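A sketch of such a range-based entry, with illustrative integer addresses:

```python
def destination_allowed(dest_addr: int,
                        entries: list[tuple[int, int]]) -> bool:
    # Each entry covers the contiguous address range [low, high], so one
    # entry can stand in for any number of servers.
    return any(low <= dest_addr <= high for low, high in entries)

# One entry covering both server N1 (say, address 0x10) and Q1 (0x11):
ENTRIES = [(0x10, 0x11)]
assert destination_allowed(0x10, ENTRIES)      # N1 allowed
assert destination_allowed(0x11, ENTRIES)      # Q1 allowed
assert not destination_allowed(0x12, ENTRIES)  # outside the range
```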
Entries 1437 and 1442 similarly allow program X2 1447 running on processors C 1470 and D 1480 to send messages to server N2. The key that is used for encryption and decryption with server N2 is designated by the key selector 1135, 1136 entries 1472, 1482, and 1487; this key is designated as key XYZ2. Note that servers N1, Q1, and N2 similarly hold the keys ABC2, ABC2, and XYZ2, respectively, which are used for communicating with each other (in the case of N1 and Q1, which are used by program X1) and also with the processors 2400 running their respective programs.
In steps 15025 and 15030 the security network node 820 configures the network processor 1415 and processors 1450, 1460 with the keys, boot data and white list table entries, which involves signaling a reset to initiate boot of the processor 2400 after configuration. In step 15035 program X1 starts, processors A and B use proper keys, and messages are properly disallowed from servers not running or servicing program X1 to destinations running or servicing program X1. Furthermore, messages are disallowed from program X1 to destinations not running or servicing program X1.
Steps 15040-15060 for program X2 proceed similarly to steps 15015-15035 for program X1. After the programs have been initiated and the security has been set up, the process proceeds to step 15065. In step 15065 both program X1 and program X2 are executing simultaneously. Program X1 cannot send messages to program X2, nor can program X2 send messages to program X1. Similarly, program X1 cannot understand messages sent from program X2, and program X2 cannot understand messages sent from program X1.
It will be appreciated by those skilled in the art that changes could be made to the embodiments described above without departing from the broad inventive concept thereof. It is understood, therefore, that this invention is not limited to the particular embodiments disclosed, but it is intended to cover modifications within the spirit and scope of the present invention as defined by the appended claims.
Claims
1. A method of providing network security for executing a plurality of applications, the network including one or more servers, each server including (i) a plurality of microprocessors, and (ii) a plurality of network processors, the method comprising:
- (a) defining a first grouping of microprocessors for executing a first application;
- (b) executing the first application using the microprocessors in the first grouping of microprocessors, wherein the microprocessors in the first grouping of microprocessors are permitted to communicate with each other via one or more of the network processors;
- (c) defining a second grouping of microprocessors for executing a second application, wherein at least one server has one or more microprocessors for executing the first application and one or more different microprocessors for executing the second application;
- (d) executing the second application using the microprocessors in the second grouping of microprocessors, execution of the second application being initiated prior to the completion of execution of the first application, wherein the microprocessors in the second grouping of microprocessors are permitted to communicate with each other via one or more of the network processors; and
- (e) preventing, via one or more of the network processors, the microprocessors in the first grouping of microprocessors from communicating with the microprocessors in the second grouping of microprocessors during periods of simultaneous execution of the first and second application.
2. The method of claim 1 further comprising:
- (f) configuring the network processors to define communication permissions of the groupings of the microprocessors via a second network, wherein the plurality of microprocessors are permanently prevented from accessing the second network.
3. The method of claim 1 wherein the plurality of microprocessors includes encryption/decryption functionality, the method further comprising:
- (f) assigning a first encryption key to the first grouping of microprocessors, and assigning a second encryption key to the second grouping of microprocessors, wherein the first encryption key is different from the second encryption key, and wherein the first and second groupings of microprocessors do not know each other's encryption keys.
Type: Application
Filed: Aug 24, 2012
Publication Date: Mar 7, 2013
Applicant: COGNITIVE ELECTRONICS, INC. (Lebanon, NH)
Inventor: Andrew C. FELCH (Palo Alto, CA)
Application Number: 13/594,207
International Classification: G06F 21/20 (20060101);