Encryption processor with shared memory interconnect
An encryption chip is programmable to process a variety of secret key and public key encryption algorithms. The chip includes a pipeline of processing elements, each of which can process a round within a secret key algorithm. Data is transferred between the processing elements through dual port memories. A central processing unit allows for processing of very wide data words from global memory in single cycle operations. An adder circuit is simplified by using plural relatively small adder circuits with sums and carries looped back in plural cycles. Multiplier circuitry can be shared between the processing elements and the central processor by adapting the smaller processing element multipliers for concatenation as a very wide central processor multiplier.
Before the advent of the Internet, corporate data networks typically consisted of dedicated telecommunications lines leased from a public telephone company. Since the hardware implementation of the data networks was the exclusive property of the telephone company, a regulated utility having an absolute monopoly on the medium, security was not much of a problem; the single provider was contractually obligated to be secure, and the lack of access to the switching network from outside made it more or less resistant to external hacking and tampering.
Today, more and more enterprises are discovering the value of the Internet, which is currently more widely deployed than any other single computer network in the world and is therefore readily available for use by a multi-national corporate network. Since it is also a consumer-level product, Internet access can usually be provided at much lower cost than the same service provided by a dedicated telephone company network. Finally, the availability of the Internet to the end user makes it possible for individuals to easily access the corporate network from home, or other remote locations.
The Internet, however, is run by public companies, using open protocols, and in-band routing and control that is open to scrutiny. This environment makes it a fertile proving ground for hackers. Industrial espionage is a lucrative business today, and companies that do business on the Internet leave themselves open to attack unless they take precautions.
Several standards exist today for privacy and strong authentication on the Internet. Privacy is accomplished through encryption/decryption. Typically, encryption/decryption is performed based on algorithms which are intended to allow data transfer over an open channel between parties while maintaining the privacy of the message contents. This is accomplished by encrypting the data using an encryption key by the sender and decrypting it using a decryption key by the receiver (sometimes, the encryption and decryption keys are the same).
Types of Encryption Algorithms
Encryption algorithms can be classified into public-key and secret key algorithms. In secret-key algorithms, both keys are secret whereas in public-key algorithms, one of the keys is known to the general public. Block ciphers are representative of the secret-key cryptosystems in use today. Usually, for block ciphers, the encryption key is the same as the decryption key. A block cipher takes a block of data, typically 32-128 bits, as input and produces the same number of bits as output. The encryption and decryption are performed using a key, anywhere from 56-128 bits in length. The encryption algorithm is designed such that it is very difficult to decrypt a message without knowing the key.
In addition to block ciphers, Internet security protocols also make heavy use of public-key algorithms. A public key cryptosystem such as the Rivest, Shamir, Adleman (RSA) cryptosystem described in U.S. Pat. No. 5,144,667 issued to Pogue and Rivest uses two keys, only one of which is made public. Once someone publishes a key, anyone may send that person a secret message using that key. However, decryption of the message can only be accomplished by use of the secret key. The advantage of such public-key encryption is that secret keys do not need to be distributed to all parties of a conversation beforehand. In contrast, if only secret-key encryption were used, multiple secret keys would have to be generated, one for each party intended to receive the message, and each secret key would have to be privately communicated. Attempting to communicate the secret key privately results in the same problem as in sending the message itself using only secret-key encryption; this is called the key distribution problem.
Key exchange is another application of public-key techniques. In a key exchange protocol, two parties can agree on a secret key even if their conversation is intercepted by a third party. The Diffie-Hellman exponential key exchange, described in U.S. Pat. No. 4,200,770, is an example of such a protocol.
Most public-key algorithms, such as RSA and Diffie-Hellman key exchange, are based on modular exponentiation, which is the computation of α^x mod p. This expression means “multiply α by itself x times, divide the answer by p, and take the remainder.” This computation is very expensive to perform, for the following reasons. In order to perform this operation, many repeated multiplications and divisions are required, although techniques such as Montgomery's method, described in “Modular Multiplication Without Trial Division,” from Mathematics of Computation, Vol. 44, No. 170 of April 1985, can reduce the number of divisions required. In addition, the numbers used are very large (typically 1024 bits or more), so the multiply and divide instructions found in common CPUs cannot be used directly. Instead, special algorithms that break down the large multiplications and divisions into operations small enough to be performed on a CPU must be used. These algorithms usually have a run time proportional to the square of the number of machine words involved. These factors result in multiplication of large numbers being a very slow operation. For example, a Pentium® can perform a 32×32-bit multiply in 10 clock cycles. A 2048-bit number can be represented in 64 32-bit words. A 2048×2048-bit multiply requires 64×64 separate 32×32-bit multiplications, which takes 40960 clocks on the Pentium. An exponentiation with a 2048-bit exponent requires up to 4096 multiplications if done in the normal way, which requires about 167 million clock cycles. If the Pentium is running at 166 MHz, the entire operation requires roughly one second. This example does not even consider the time required to perform the divisions. Clearly, a common CPU such as a Pentium cannot be expected to do key generation and exchange at any great rate.
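The cost estimate above can be checked with a short Python sketch. It assumes that "the normal way" means bit-by-bit square-and-multiply, which is one reading consistent with the figure of up to 4096 multiplications for a 2048-bit exponent. The function and constant names are illustrative only and do not come from the original text.

```python
def mod_exp(alpha, x, p):
    """Left-to-right square-and-multiply.

    Returns (alpha**x mod p, number of modular multiplications performed).
    """
    result = 1
    multiplies = 0
    for bit in bin(x)[2:]:                  # exponent bits, most significant first
        result = (result * result) % p      # one squaring per exponent bit
        multiplies += 1
        if bit == "1":
            result = (result * alpha) % p   # extra multiply when the bit is set
            multiplies += 1
    return result, multiplies


# Cycle estimate using the figures quoted in the text.
WORDS = 2048 // 32                               # 64 machine words per operand
CLOCKS_PER_BIG_MULTIPLY = WORDS * WORDS * 10     # 64 x 64 word multiplies, 10 clocks each
WORST_CASE_MULTIPLIES = 2 * 2048                 # a square and a multiply per exponent bit
TOTAL_CLOCKS = WORST_CASE_MULTIPLIES * CLOCKS_PER_BIG_MULTIPLY
SECONDS_AT_166_MHZ = TOTAL_CLOCKS / 166e6        # roughly one second
```

The worst case arises when every exponent bit is set, since each bit then costs both a squaring and a multiplication.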
Because public-key algorithms are so computationally intensive, they are typically not used to encrypt entire messages. Instead, private-key cryptosystems are used for message transfer. The private key used to encrypt the message, called the session key, is chosen at random and encrypted using a public key. The encrypted session key, as well as the encrypted message, are then sent to the other party. The other party uses its secret key to decrypt the session key, at which point the message may be decrypted using the session key. A different session key is used for each communication, so that if one session key is ever broken, only the one message encrypted with it may be read. This public-key/private-key method can also be used to protect continuous communications, such as interactive terminal sessions that never terminate in normal operation. In this case, the session key is periodically changed (e.g. once an hour) by repeating the public-key generation technique. Again, frequent changing of the session key limits the amount of data compromised if the encryption is broken.
Prior Art
Network-level encryption devices that allow access to corporate networks using a software-based solution are experiencing widespread usage. Products such as Raptor Eagle Remote and others perform encryption entirely in software. The software limits the encryptor's throughput. Session key generation using public-key techniques may take several minutes. For this reason, session keys are not re-generated as often as some people would like. However, software does have the advantage that the encryption algorithms are easily changed in response to advances in the field.
Other devices use a combination of hardware and software. For example, the Northern Telecom (now Entrust) Sentinel X.25 encryption product uses a DES chip produced by AMD to perform the DES secret-key encryption. Hardware implementations of DES are much faster, since DES was designed for efficient implementation in hardware. A transposition that takes many CPU instructions in software can be done using parallel special-purpose lookup tables and wiring.
The Sentinel also makes use of a Motorola DSP56000 processor to perform the public-key operations. At the time, the single-cycle multiplication ability of the DSP made this approach significantly faster than implementing the public-key algorithms on regular CISC microprocessors.
Most hardware encryption devices are severely limited in the number of algorithms that they can implement. For example, the AMD chip used in the Sentinel performs only DES. More recent devices, from Hi/Fn, can perform DES and RC4. However, implementing either RC5 or IDEA would require another product.
SUMMARY OF THE INVENTION
A preferred high-performance programmable network encryption device, integrated into a single chip, is a parallel-pipelined processor system whose instruction set is optimized for common encryption algorithms. The present invention realizes the advantages of both hardware and software approaches. Since the processor is a programmable processor, any encryption algorithm may be implemented, contrary to a hardware implemented encryption processor which is dedicated to executing only one algorithm. However, the processor's architecture permits parallel computations of a nature useful for encryption, so its performance more closely approximates that of a dedicated hardware device.
In accordance with a preferred implementation of the invention, an electronic encryption device comprises an array of processing elements. Each processing element comprises an instruction memory for storing a round of an encryption algorithm, the round comprising a sequence of instructions. The processing element also includes a processor for implementing the round from the instruction memory and data storage for storing encryption data operands and encrypted data resulting from implementing the round. Each processing element of the array implements one of the rounds and transfers results to successive processing elements such that the array of processing elements implements successive rounds of the encryption algorithm in a processing element pipeline.
In a preferred embodiment, the data storage has a portion thereof which is shared between adjacent processing elements of the linear array for transfer of data between adjacent processing elements of the linear array. The shared data storage is preferably comprised of dual port memories but may also comprise shared registers.
The preferred processing element comprises a control unit and an ALU. The control unit, ALU, instruction memory and data storage, including local data memory and shared data memory, are connected to a local processing element bus. The local bus is segmented by a switch into a local instruction bus segment, connecting the instruction memory and the control unit, and a local data bus segment connecting the ALU, local data memory and shared data memory. The switch permits either independent simultaneous operation on the two local bus segments or a communication between the two bus segments. Each processing element further comprises a multiplier for performing multiplication operations within the processing element.
The preferred encryption device further comprises a global random access memory and a global bus through which data is transferred between the global random access memory and the processing element data storage. A central processor is coupled to the global bus for processing data words which are wider than data words processed by the processing elements. The multipliers of the plural processing elements may be adapted for concatenation as segments of a wider multiplier used by the central processor. Preferably, each multiplier comprises partial product adders having input selection circuitry for selecting a first set of inputs when operating as an individual multiplier and a second set of inputs, including inputs from adjacent processing elements, when concatenated.
Preferably, the central processor comprises a novel adder. In the adder, each of plural adder segments has a carry output and a sum output and each of the adder segments processes a segment of each of two operands. Selectors select the carry outputs as carry inputs to successive adder segments for successive clock cycles so long as any carry results in an adder cycle. Selectors also select each sum output as an operand input to the same adder segment. Accordingly, so long as any carry results in an adder cycle, the sum output of an adder is fed back to its input, and the adder segment receives a carry input generated as a carry output from a preceding segment in a preceding cycle.
Preferably, each processing element performs a modular adjust operation to compute M mod N without using a divide circuit. Each processing element also performs a modulo add/subtract operation to compute A±B mod N. Further, each processing element performs a modulo multiply operation to compute A×B mod N.
The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.
The encryption chip of the present invention may be programmed to perform common data encryption and decryption algorithms on one or more data streams in any application. The principal purpose of the encryption chip is to perform high speed data encryption using algorithms that are expected to be in use on the Internet, at data rates of 100-2000 Mbps.
Example applications are illustrated in
In a link encryption application illustrated in
Three main secret-key block encryption algorithms are in common use today: DES, RC5 and IDEA. The first two are standard Internet Protocol Security (IPSEC) algorithms. IDEA is the algorithm used by PGP, a popular email encryption program.
Typically, block algorithms consist of a number of rounds; each round is a sequence of operations in an encryption algorithm. Anywhere from 8-32 rounds are required to completely implement an encryption algorithm. The operations performed by each round are often the same, although they need not be. In software, each round is implemented with a few machine instructions. In hardware, each round is implemented with dedicated circuitry. The hardware is typically pipelined, with each round being implemented in its own pipeline stage.
Data enters the encryption chip through an input stage 40, which receives network data, typically as a serial bit stream. Ethernet, ATM or any other serialized format may be used. The input stage converts the serial data stream to block-aligned data suitable for processing as an input to an encryption/decryption pipeline. The size of the input blocks is programmable. In the preferred embodiment illustrated in
Data can be transferred among non-adjacent processing elements and/or other elements within the encryption chip via a global data bus 38. Also connected to the global data bus 38 is I/O communication logic 54, which allows communication with a host CPU (not shown). Host CPU communication is required to program the encryption chip prior to use. Global random access memory (RAM) 44 is also connected to the global data bus 38, allowing global communication among the processing elements. A control CPU 52 synchronizes the operations of the encryption pipeline processors. This CPU may be implemented using any available embedded CPU core such as MIPS, ARM or ARC. Furthermore, to allow processing of algorithms which utilize very wide operands such as public-key encryption algorithms, a public-key (PK) core processor 46 is connected to the control CPU 52. The PK core includes a register file 48 consisting of 8-16 512-bit wide registers as well as a PK ALU 50. The PK core processor can make data transfers to and from the global RAM 44 over a 512-bit bus in one system clock cycle. 512-bit operands are processed in the ALU 50, typically in 2-32 clock cycles. The PK core ALU 50 is a coprocessor controlled by the control CPU 52, performing only arithmetic and logic operations along with loads and stores. Other instructions necessary for implementing PK algorithms can be executed within the control CPU 52.
The encryption chip implements the code for each round of a secret key algorithm in a separate processing element of the pipeline. Once computed upon, data from one PE is transferred to the next PE where the next round is implemented. The first PE is then free to process an encryption round for the next block of data coming in. The pipeline process continues for the remaining PEs. Once the pipeline is full, the time required per encrypted block using this architecture is therefore equal to the time required to execute one round.
Many block algorithms use one set of operations to encrypt the data, and a separate set of operations to expand the key. Key expansion is the process of transforming a relatively small key (56-128 bits) into a larger number (512 bits or more) with statistically random properties. This expanded key is distributed into smaller subkeys, where a different portion of the expanded key is used for each round. It is important to note that the expanded key does not change with the data. Therefore, since it is not in the critical path, it can be pre-computed and stored in memory. The sample code discussed later assumes that key information has been pre-computed and stored in the local data memories of each PE.
The basic application of a block algorithm transforms a block of plaintext (unencrypted information) into a similar-sized block of ciphertext (encrypted information) and vice-versa. This operating mode is known as electronic codebook (ECB) mode. Due to its many inherent security weaknesses, methods of introducing feedback into the encryption by cycling some of the basic output back into the input are commonly used. The encryption chip uses the global data bus 38 to perform cipher feedback (CFB). In ECB mode, a new block of data can be encrypted once per pipeline cycle, which can be 10-100 instructions. However, in CFB mode, each datum must pass through the pipeline multiple times. This mode substantially reduces throughput on a single channel. However, peak performance can be achieved by encrypting multiple data channels, interleaved in the pipeline.
A block diagram of one individual processing element PE according to an embodiment of the present invention is shown in
The memories 66 and 68 of a processing element are dual port SRAMs and are shared with the PE of the previous and next pipe stage, respectively. Note that a PE's next-neighbor memory is the same as the next PE's previous-neighbor memory.
These dual ported SRAMs are used to propagate data through the pipeline stages. One processing element writes data to be transferred into its associated next-neighbor shared memory. Then the next-neighbor processing element reads the stored data from its previous-neighbor shared memory, which as described before, is one and the same as the previous processing element's next-neighbor shared memory. Since the memories are dual-ported, there are no timing restrictions on accesses. Synchronization of accesses is performed using static scheduling of machine instructions by the software author or compiler. Furthermore, since the global bus is not used for communication between adjacent PEs, the PEs may all communicate concurrently.
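The neighbor-to-neighbor transfer described above can be modeled with a small Python simulation. This is only a behavioral sketch under simplifying assumptions: each shared memory holds a single block, every PE completes one round per step, and the names `run_pipeline`, `shared` and `rounds` are hypothetical. It illustrates that, once the pipeline is full, one finished block emerges per round time.

```python
def run_pipeline(blocks, rounds):
    """Simulate PEs passing blocks through shared neighbor memories.

    shared[i] models the memory between PE i-1 and PE i; shared[0] is fed by
    the input stage and shared[len(rounds)] is read by the output stage.
    Returns (outputs, step number at which each output was emitted).
    """
    n = len(rounds)
    shared = [None] * (n + 1)
    pending = list(blocks)
    outputs, emit_steps = [], []
    step = 0
    while pending or any(slot is not None for slot in shared):
        step += 1
        # Output stage drains the last PE's next-neighbor memory.
        if shared[n] is not None:
            outputs.append(shared[n])
            emit_steps.append(step)
            shared[n] = None
        # Visit PEs last to first so each one consumes what its
        # predecessor wrote during the previous step.
        for i in reversed(range(n)):
            if shared[i] is not None:
                shared[i + 1] = rounds[i](shared[i])
                shared[i] = None
        # Input stage loads the next block for the first PE.
        if pending:
            shared[0] = pending.pop(0)
    return outputs, emit_steps
```

With three toy "rounds" such as `[lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]`, four input blocks leave the pipeline on four consecutive steps, even though each block individually spends three steps being processed.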
The global memory 44 is connected to the global communication bus. Only one processing element is allowed to access the global memory 44 at any one time. This memory is used to pass data between non-adjacent processing elements, for example, during feedback encryption algorithms and can serve as supplemental storage for individual processing elements.
The PE instruction memory 62 has an instruction set resembling that of a modern RISC processor integer unit. The instruction set is more or less orthogonal in that any register can be used as an operand to any instruction. No floating-point or memory management support need be provided, since neither is useful in encryption. However, the instruction set contains the following useful enhancements: a modular addition/subtraction instruction, a modular multiplication instruction and a modulo adjust instruction.
The modular addition/subtraction instruction computes A±B mod N (the number “M mod N” is the remainder when M is divided by N).
The modulo adjust instruction, illustrated in
In
Although modular addition and subtraction can be performed on conventional processors using only 2-3 instructions, the inclusion of these instructions as special functions of the encryption chip instruction set offers a minor speedup for the specific case of encryption algorithms.
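As a rough illustration of why modular addition and subtraction need no divider, the following Python sketch performs the reduction with a single conditional correction. It assumes both operands are already reduced to the range 0 to N-1, which is an assumption made for this example. The function names are hypothetical, and the sketch does not claim to reproduce the circuit of the modulo adjust figure.

```python
def mod_adjust(m, n):
    """Bring m into [0, n) with one conditional add or subtract.

    Assumes -n <= m < 2*n, which holds for the sum or difference of two
    operands that are each already in [0, n).
    """
    if m >= n:
        return m - n
    if m < 0:
        return m + n
    return m


def mod_add(a, b, n):
    """A + B mod N without division."""
    return mod_adjust(a + b, n)


def mod_sub(a, b, n):
    """A - B mod N without division."""
    return mod_adjust(a - b, n)
```

On a conventional processor each of these is the 2-3 instruction sequence mentioned above: an add or subtract, a compare, and a conditional correction.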
The modular multiplication instruction computes A*B mod N. The multiplier used for this instruction will be described in more detail below. The encryption chip provides a full modular multiply instruction, for reasons which will become clear below.
Table 1 gives a representative sample of the instruction set of the PEs to be used in subsequent examples. Other conventional RISC instructions may also be implemented.
Layout Issues
A general layout of the encryption chip is illustrated in
The individual elements can be compared to the elements of
The layout of a typical processing element with the local bus connections is shown in
Many multiprocessor architectures have been proposed. Most of them are designed for general-purpose multiprocessing, so communication between processing elements is usually done using a switching matrix that can be dynamically configured to switch data from any one PE to any other. These switch designs are extremely complex. Since they are not required for encryption, an embodiment of the present invention uses a simpler linear arrangement of the PEs with much less switching circuitry.
In addition, the use of shared memory as the interconnect technique rather than I/O ports as documented in the literature produces a much simpler and more powerful programming model. Consider two PEs, A and B, connected with a single 32-bit I/O port. In order for A to transfer multiple words of data to B, A must write each word to the I/O port and wait for B to read it. In contrast, if A and B are connected by a shared memory large enough to hold all words of the communication, then A may write out its data without waiting for B to read any. Furthermore, PE B has the freedom to read the words out in any order, or to pick and choose from the data as required by the job at hand. Finally, it should be noted that if some of the shared memory is not required for communication, then it may be used as an extension of the local memory to provide additional local work space.
Public Key Support
Efficient public-key encryption requires efficient modular exponentiation, provided by the public-key co-processor. This unit comprises the following items:
- PK register file 48, consisting of 16 512-bit wide registers
- PK 512×32-bit multiplier 70 made up of concatenated SK multiplier elements (this unit can perform a 512×512 multiply in only 32 clock cycles)
- PK 512-bit adder ALU 50 which can perform addition in 2-16 cycles, typically no more than 2
- global memory 44 organized for 512-bit parallel access from the PK coprocessor for loading and storing 512-bit words in a single clock cycle
The PK core processor accelerates modular multiplication by performing it using 512-bit words. A 512×512 multiply operation using the PK unit of the invention would be implemented by performing 16 512×32 multiplies using the concatenated multiplier elements of the 16 processing elements described below. Assuming each multiply requires 2 clock cycles and 16 such multiplies are required, a 512×512 multiply would require 32 clock cycles and a 2048×2048 multiply would require only 512 clock cycles. The full modular exponentiation operation, requiring 4096 multiplies, would take a total of about 2 million clock cycles. This represents an 80-fold improvement over the Pentium example discussed earlier. The performance improvement of PK algorithms is expected to be similar. This represents a significant performance gain compared to the prior art, and will enable more frequent changing of session keys, thereby increasing security.
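The decomposition and the cycle counts in this paragraph can be reproduced with a few lines of Python. The sketch is functional only: it models each 512×32 multiply as an ordinary integer product, and the two-cycle cost per multiply is the assumption stated in the text. All names are illustrative.

```python
MASK32 = (1 << 32) - 1


def mul_512x512(a, b):
    """512x512 product accumulated from sixteen 512x32 multiplies."""
    acc = 0
    for i in range(16):
        word = (b >> (32 * i)) & MASK32      # i-th 32-bit word of b
        acc += (a * word) << (32 * i)        # one 512x32 multiply, shifted into place
    return acc


# Cycle counts, assuming 2 clock cycles per 512x32 multiply as in the text.
CYCLES_512x32 = 2
CYCLES_512x512 = 16 * CYCLES_512x32                   # 32
CYCLES_2048x2048 = (2048 // 512) ** 2 * CYCLES_512x512  # 16 block products: 512
CYCLES_EXPONENTIATION = 4096 * CYCLES_2048x2048       # 2,097,152 (about 2 million)
PENTIUM_CYCLES = 4096 * 40960                         # from the earlier example
SPEEDUP = PENTIUM_CYCLES / CYCLES_EXPONENTIATION      # 80.0
```

The 80-fold figure is a ratio of clock cycles, so it does not depend on the clock rate of either device.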
512-bit Adder
Adders are not shared between the public key PK and secret key SK units. Rather, since addition and logic operations are common for both PK and SK, each unit has its own adder, so that operations may proceed concurrently.
Within the public-key PK ALU 50, a 512-bit single cycle adder would be extremely complex and would add substantially to the critical path time of the ALU. Accordingly, the 512-bit adder in the ALU 50 is formed of 16 32-bit adders as illustrated in
The operation of the 512-bit adder can be better understood with reference to the following example using four 4-bit binary words instead of the 16 32-bit words of the actual implementation.
Note that two additions were required to arrive at the final sum where no further carries resulted. This is a typical case. Since the adder is being used for encryption operations, it is safe to assume that the numbers being added are more or less randomly distributed. The probability of a carry-out after the first add is quite high. However, the probability that a carry, added back in as a least significant bit, will result in another carry from the most significant bit is quite low. For this reason, most add operations are expected to take only two clock cycles.
Returning to the original problem of constructing a 512-bit adder, if a standard carry lookahead or carry bypass adder design were used, the critical path through the adder would be quite long, since the carry must propagate through some optimized circuitry that operates on 512 bits. This adder would be quite large and slow. In contrast, in one embodiment of the present invention, a 512-bit adder is composed of 32-bit adders, whose design is well-known today and has been well optimized. The maximum clock speed of an individual 32-bit adder is expected to be more than twice that of a 512-bit carry lookahead design. Thus, the two-or-more cycle adder according to the invention, would on average operate faster than a large 512-bit adder, while consuming less chip area.
In a worst case, as illustrated below, it is possible that 16 cycles would be required to completely compute the final sum without carries, for an implementation with 16 32-bit adders. Using the 4-bit binary word example once again for illustrative purposes:
Four additions were required. In general, for n groups of numbers, at most, n additions will be required.
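The looped-back adder can be modeled behaviorally in Python. The sketch below follows the description in the text: every segment adds its two operand slices in the first cycle, and in later cycles adds its own fed-back sum to the carry produced by the segment below it in the preceding cycle. Treating the carry out of the top segment as the overall carry-out, rather than as a reason for a further cycle, is an assumption of this example. The function name and return convention are illustrative.

```python
def segmented_add(a, b, segments=16, width=32):
    """Add two (segments * width)-bit numbers with small looped-back adders.

    Returns (sum, carry_out, cycles), where cycles is the number of adder
    cycles needed until no segment below the top one produces a carry.
    """
    mask = (1 << width) - 1
    sums = [(a >> (width * i)) & mask for i in range(segments)]
    addends = [(b >> (width * i)) & mask for i in range(segments)]
    carry_out = 0
    cycles = 0
    while True:
        cycles += 1
        carries = []
        for i in range(segments):
            total = sums[i] + addends[i]
            sums[i] = total & mask            # sum output, fed back as an operand
            carries.append(total >> width)    # carry output of this segment
        carry_out |= carries[-1]              # top carry leaves the adder
        if not any(carries[:-1]):
            break
        # Next cycle: each segment receives the carry of the segment below it.
        addends = [0] + carries[:-1]
    result = sum(s << (width * i) for i, s in enumerate(sums))
    return result, carry_out, cycles
```

With four 4-bit segments, `segmented_add(0x1234, 0x0FF0, 4, 4)` finishes in two cycles, the typical case, while `segmented_add(0xFFFF, 0x0001, 4, 4)` ripples a single carry through every segment and needs four cycles, the worst case for four segments.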
512×32 Multiplier
Multipliers are large in area. Each secret-key processing element must contain its own multiplier in order to implement any secret key algorithm requiring multiplication, for example IDEA, which will be discussed in more detail below. Collectively, the area taken by the PE multipliers is significant, and this area is therefore reused in implementing the 512×32-bit public key multiplier. To save area, the large 512×32 multiplier is implemented by concatenating the 16 32×32 multipliers, one in each secret key processing element. In other words, the secret and public key units can share the multiplier elements, as is illustrated in the layout of the chip in
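Arithmetically, the concatenation amounts to summing sixteen 32×32 products at word offsets, as the following Python sketch shows. It is a functional model only. In the hardware described here the merging happens inside the partial product adders rather than as a separate final sum, and the name `mul_512x32` is hypothetical.

```python
MASK32 = (1 << 32) - 1


def mul_512x32(a, b32):
    """512x32 product built from sixteen 32x32 products, one per PE multiplier."""
    product = 0
    for pe in range(16):
        a_word = (a >> (32 * pe)) & MASK32   # 32-bit slice of a handled by this PE
        partial = a_word * b32               # 64-bit product from this PE's multiplier
        # The upper 32 bits of each partial product overlap the lower 32 bits
        # of the next PE's product, which is why neighboring multipliers
        # must exchange data when concatenated.
        product += partial << (32 * pe)
    return product
```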
To illustrate the concatenation of the multipliers, a simple design of a combination 4×4/4×N multiplier is illustrated below. Note that more advanced techniques of multiplier design such as Booth encoding and 4:2 compressors are available. The following example provides a simple presentation:
Single-digit multiplication can be easily implemented by using an AND gate. The result, when using two 4-bit operands, consists of 16 bits of partial product. These partial products must be added together efficiently. The partial products could, for example, be added using two 4-bit and one 6-bit full adder, but they would take a substantial time to perform the addition of the partial products, since the carry may have to propagate through several adders. The overall result of such an adder implementation would be too slow. A better approach would consist of an adder whose carry has to go through fewer stages.
The basic component of the preferred multiplier is a full adder, a circuit that takes three inputs and outputs the two-bit sum of their inputs. A full adder is illustrated using the symbol in
A comparison of the performance of various adder and multiplier architectures helps to illustrate the advantages of the multiplier according to the present invention. A naive implementation of a 4-bit adder consists of four full adders A0-A3 in series as shown in
An improved four-bit adder is a carry lookahead design. A three-bit carry lookahead adder is shown in
In the full 4×4 multiplier, the carry save design creates a critical path through two full adders, plus the final carry lookahead adder. An implementation using only full adders would have a longer critical path since a naive adder using chained carries is slower than the carry lookahead adder. Finally, if we used full carry lookahead adders in the first two stages of the partial product summation, then the resulting multiplier would again be slower, since carry lookahead adders are slower than individual full adders. Note that the multiplier design according to the present invention never propagates a carry from one adder to another at the same partial product level. In this manner, the critical path through the multiplier is sure to include no more than two full adders in the first two stages of partial product summation.
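The carry-save idea can be illustrated with a bit-level Python model of a 4×4 multiplier. Each call to `full_adder_row` stands for a row of independent full adders, so no carry propagates between adders at the same level, and only the last step is a carry-propagating addition (a carry lookahead adder in the text). This is a sketch of the general technique with hypothetical names, not a transcription of the wiring in the figures.

```python
def full_adder_row(x, y, z):
    """Row of full adders (3:2 compression): x + y + z == sum_bits + carry_bits."""
    sum_bits = x ^ y ^ z
    carry_bits = ((x & y) | (x & z) | (y & z)) << 1   # carries move one column left
    return sum_bits, carry_bits


def multiply_4x4(a, b):
    """4x4 multiply using AND-gate partial products and carry-save summation."""
    # Each partial product is a ANDed with one bit of b, shifted to its column.
    pp = [(a if (b >> i) & 1 else 0) << i for i in range(4)]
    s1, c1 = full_adder_row(pp[0], pp[1], pp[2])   # first carry-save stage
    s2, c2 = full_adder_row(s1, c1, pp[3])         # second carry-save stage
    return s2 + c2                                  # final carry-propagating add
```

The critical path in this model passes through two full-adder stages and one final adder, matching the structure described above.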
In
For a 4×N multiplier, the neighboring partial products must also be considered. They are labeled in
Note that D′ is the neighboring (either to the left or to the right) equivalent of D. The 8-bit final sum is indicated as S7,S6,S5,S4,S3,S2,S1,S0 and the three lower-order bits of the left-hand neighbor's sum are S2′,S1′,S0′. The 2:1 multiplexer 88 has a selection signal Sel. In general, if Sel is logic 1, then the left input is passed to the output of the multiplexer; otherwise, if Sel is logic 0, the right input is passed to the output of the multiplexer. The Sel signal is also used to gate the AND gates 90. When Sel is logic 1, the other input to the AND gate is passed to the output; otherwise, with Sel at logic 0, the AND gates 90 are disabled and pass a logic 0 regardless of the value of the other input. Thus, in the implementation of
Example Implementations
Example implementations of common encryption algorithms will now be described with reference to the preferred embodiment of the encryption chip discussed above. RC5 is perhaps one of the simplest encryption algorithms to implement. It uses three types of operations: XOR, addition, and rotation, all supported by any one of the processing elements discussed above, as shown in Table 1. Although RC5 has a variable block length, each round of the RC5 algorithm most commonly operates on a 64-bit data block plus a key value stored in Si1 and Si2, which are constants within each processing element, depending only on the round and the key. To encrypt data, a 64-bit input block is split into two 32-bit words, which are stored in locations A and B in the previous-neighbor memory; the output block is written to A_next and B_next in the next-neighbor memory. An example of a round of the RC5 encryption algorithm follows:
Each round requires 11 clock cycles. If the encryption chip is designed using a logic process that can run at up to 400 MHz, then about 36 million blocks can be encrypted per second, or 288 MB/s in ECB mode. If we assume 12 rounds (a typical case for RC5), then compared to a conventional CPU running at the same clock speed, the concurrent execution of multiple PEs according to an embodiment of the present invention results in a 12-fold performance improvement over the conventional software implementation.
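The arithmetic of a single RC5 round on 32-bit words can be sketched in Python as follows. This is a high-level illustration rather than processing element code: the names `rotl32`, `rotr32`, `rc5_round` and `rc5_round_inverse` are hypothetical, and the arguments `s1` and `s2` stand in for the per-round constants Si1 and Si2 mentioned above.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n (only the low 5 bits of n are used)."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32


def rotr32(x, n):
    """Rotate a 32-bit word right by n (only the low 5 bits of n are used)."""
    n &= 31
    return ((x >> n) | (x << (32 - n))) & MASK32


def rc5_round(a, b, s1, s2):
    """One encryption round: XOR, data-dependent rotation, modular addition."""
    a_next = (rotl32(a ^ b, b) + s1) & MASK32
    b_next = (rotl32(b ^ a_next, a_next) + s2) & MASK32
    return a_next, b_next


def rc5_round_inverse(a_next, b_next, s1, s2):
    """Undo rc5_round; used here only to check the sketch."""
    b = rotr32((b_next - s2) & MASK32, a_next) ^ a_next
    a = rotr32((a_next - s1) & MASK32, b) ^ b
    return a, b


if __name__ == "__main__":
    a, b = rc5_round(0x01234567, 0x89ABCDEF, 0xB7E15163, 0x5618CB1C)
    print(hex(a), hex(b))
```

In a pipeline, each processing element would apply this step with its own pair of constants and pass the two resulting words to its successor.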
IDEA is one of the most secure block algorithms available and has a substantially more complex structure. It operates on 64-bit plaintext blocks. A 128-bit key is used. The same algorithm is used both for encryption and decryption. The main philosophy of the algorithm is to mix operations from different algebraic groups, operations such as XOR, addition modulo 2^16, and multiplication modulo 2^16+1. These operations are used to operate on 16-bit blocks.
IDEA therefore makes use of both modular multiplication and addition, which are expensive operations in software. The multiplication is complicated by IDEA's treatment of zero: in a multiply, a zero is interpreted as (−1) modulo 65537. Assuming that the value 65537 has been pre-loaded into register r8 of the processing element's register file, and that register r0 contains zero, the following multiplication macro is presented for illustrative purposes:
Each round of IDEA consists of modular multiplications, modular addition and exclusive-OR. The 128-bit key is broken down into subkeys. Each processing element's subkey is a function solely of the key and the processing element, and therefore can be computed in advance and stored in the PE. The plaintext input to IDEA consists of four 16-bit sub-blocks X1 through X4, as indicated earlier. Each round uses six subkeys K1 through K6, and can be coded as follows:
Since IDEA has eight rounds, the encryption chip hardware implementation according to an embodiment of the present invention accelerates the execution by a factor of eight or more. Additional acceleration is provided by the modular multiply instruction, which is not available on most microprocessors. The above code requires roughly 50 clock cycles to perform one round. At 400 MHz, the encryption chip can encrypt with IDEA at a rate of 64 MB/s, about three times faster than a 25 MHz hardware implementation developed at ETH University in Zurich.
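The IDEA operations described above can be sketched in Python. This follows the standard textbook structure of an IDEA round and is not processing element code: the names `idea_mul`, `idea_add` and `idea_round` are hypothetical, and the ordering of the steps is assumed to match the commonly published description of the cipher.

```python
def idea_mul(a, b):
    """Multiply modulo 2^16 + 1, with a 16-bit zero standing for 2^16 (i.e. -1)."""
    a = a or 0x10000
    b = b or 0x10000
    # A result of 2^16 is stored as 0 in 16 bits.
    return (a * b % 0x10001) & 0xFFFF


def idea_add(a, b):
    """Add modulo 2^16."""
    return (a + b) & 0xFFFF


def idea_round(x, k):
    """One IDEA round on sub-blocks x = (X1..X4) with subkeys k = (K1..K6)."""
    x1, x2, x3, x4 = x
    k1, k2, k3, k4, k5, k6 = k

    # Key mixing: multiply the outer sub-blocks, add to the inner ones.
    y1 = idea_mul(x1, k1)
    y2 = idea_add(x2, k2)
    y3 = idea_add(x3, k3)
    y4 = idea_mul(x4, k4)

    # Multiply-add structure driven by K5 and K6.
    t1 = idea_mul(y1 ^ y3, k5)
    t2 = idea_mul(idea_add(y2 ^ y4, t1), k6)
    t3 = idea_add(t1, t2)

    # XOR the results back in; the two inner sub-blocks swap places.
    return (y1 ^ t2, y3 ^ t2, y2 ^ t3, y4 ^ t3)


if __name__ == "__main__":
    print(idea_round((1, 2, 3, 4), (5, 6, 7, 8, 9, 10)))
```

The zero-handling in `idea_mul` is the part that is awkward on a conventional CPU and that a dedicated modular multiply instruction would absorb.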
Data Encryption Standard, or DES, was originally designed for hardware implementation, and is therefore the most difficult to implement in software. Nevertheless, it can easily be coded in the encryption chip, according to an embodiment of the present invention.
Like the previous two algorithms, DES is also a block cipher encrypting data in 64-bit blocks. A 64-bit block of plaintext is the input and a 64-bit ciphertext is the output. Once again, both encryption and decryption use the same algorithm, making DES a symmetrical algorithm. DES creates subkeys from a single key, in this case 56 bits. The subkeys are a function of the PE and the 56-bit key, so they can be computed in advance.
The basic concept behind DES, as illustrated in
- Expansion: The 64-bit block is divided into two 32-bit pieces 108,110. One piece is unaffected by the encryption. (The pieces are operated on every other round.) The piece that is affected is divided into eight groups of four bits. Each group is expanded by copying the two bits adjacent to it.
- Each expanded group is XOR'ed at 112 with a subkey.
- The six-bit result of the XOR is used to index a 64-entry×4-bit lookup table 114 called an S-box. Each of the eight groups uses its own S-box.
- The output from the S-boxes is permuted at 116: the bits are scrambled. Eight outputs yield 32 bits.
- The 32-bit output is XOR'ed at 118 with the other 32-bit half of the block.
The operations can be coded as follows: expansion is performed by copying the input word, and masking bits such that there are two words: one representing even-numbered S-box inputs and one representing odd-numbered S-box inputs. The two words are XOR'ed with key information. The result is used to index the S-box lookup table. The data in each S-box is pre-permuted, so that the output of the S-box is 32-bit data. The final value is the logical OR of all components. Sample code follows:
This sample code requires 44 clock cycles to perform one round. At 400 MHz, a data rate of 72 MB/s can be achieved. This rate compares favorably with hardware implementations of DES available in the mid-1990s, which encrypt at rates ranging from 1 to 35 MB/s. VLSI Technology's VM007 can encrypt at up to 200 MB/s.
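The structure of the DES round listed above (expansion, subkey XOR, S-box lookup, permutation, XOR with the other half) can be sketched in Python. This is a structural illustration only: the `SBOX` contents and the `permute` function are placeholders, not the real DES tables, and all function names are hypothetical. Only the expansion step follows the actual DES bit selection.

```python
MASK32 = 0xFFFFFFFF

# Placeholder tables: NOT the real DES S-boxes. Eight 64-entry x 4-bit tables.
SBOX = [[(7 * x + 3 * i) & 0xF for x in range(64)] for i in range(8)]


def permute(x):
    """Stand-in for the DES bit permutation: a fixed 32-bit rotation."""
    return ((x << 11) | (x >> 21)) & MASK32


def expand(r):
    """Split a 32-bit half into eight 6-bit groups.

    Each 4-bit group is widened with the one bit on either side of it,
    wrapping around at the ends of the word (bit 0 is the most significant).
    """
    groups = []
    for i in range(8):
        chunk = 0
        for pos in range(4 * i - 1, 4 * i + 5):
            bit = (r >> (31 - (pos % 32))) & 1
            chunk = (chunk << 1) | bit
        groups.append(chunk)
    return groups


def round_function(r, subkey):
    """XOR each expanded group with its 6-bit subkey part, look up, permute."""
    out = 0
    for i, (e, k) in enumerate(zip(expand(r), subkey)):
        out = (out << 4) | SBOX[i][e ^ k]
    return permute(out)


def des_round(left, right, subkey):
    """One Feistel round: the right half passes through unchanged."""
    return right, left ^ round_function(right, subkey)


def des_round_inverse(left, right, subkey):
    """Undo des_round; a Feistel round is invertible for any round function."""
    return right ^ round_function(left, subkey), left


if __name__ == "__main__":
    print(des_round(0x01234567, 0x89ABCDEF, [1, 2, 3, 4, 5, 6, 7, 8]))
```

The coding approach described in the text folds the permutation into pre-permuted 32-bit S-box entries and ORs the results together; the sketch keeps the lookup and the permutation as separate steps for readability.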
In each of the above cases, performance has been shown to be much faster than a software implementation on a conventional CPU, but slower than a dedicated hardware implementation. The advantage of this invention over the hardware implementations is that the encryption chip is programmable, so that it may implement any algorithm, including those that have yet to be conceived.
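The throughput figures quoted for the three algorithms follow from the clock rate, the cycle count per round, and the 8-byte block size, assuming the pipeline is full so that one block completes every round time. The helper name `throughput_mb_per_s` is hypothetical and the calculation is a sketch of that arithmetic, not a measurement.

```python
def throughput_mb_per_s(clock_hz, cycles_per_round, block_bytes=8):
    """Steady-state pipeline throughput in MB/s (1 MB = 10^6 bytes).

    With one processing element per round, a finished block leaves the
    pipeline every `cycles_per_round` clock cycles.
    """
    blocks_per_second = clock_hz / cycles_per_round
    return blocks_per_second * block_bytes / 1e6


if __name__ == "__main__":
    for name, cycles in (("RC5", 11), ("IDEA", 50), ("DES", 44)):
        print(name, round(throughput_mb_per_s(400e6, cycles), 1), "MB/s")
```

For RC5 the unrounded result is slightly above the quoted 288 MB/s, which corresponds to rounding the block rate down to 36 million blocks per second before multiplying by 8 bytes.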
Although no specific public key algorithm examples have been given, it should be noted that similar improvements over existing approaches will result by employing the techniques as discussed in the preferred embodiment of the present invention.
EQUIVALENTS
While this invention has been particularly shown and described with references to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Those skilled in the art will recognize or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described specifically herein. Such equivalents are intended to be encompassed in the scope of the claims.
Claims
1. An electronic encryption device comprising, on a single chip, an array of processing elements, each processing element comprising:
- an instruction memory for storing a round of an encryption algorithm, the round comprising a sequence of instructions;
- a processor for implementing the round from the instruction memory;
- data storage for storing encryption data operands and encrypted data resulting from implementing the round;
- processing elements of the array each implementing one of the rounds and transferring results to successive processing elements such that the array of processing elements implements successive rounds of the encryption algorithm in a processing element pipeline.
2. An electronic encryption device as claimed in claim 1 wherein the data storage has a portion thereof which is shared between adjacent processing elements of the linear array for transfer of data between adjacent processing elements of the linear array.
3. An electronic encryption device as claimed in claim 2 wherein each processing element comprises a control unit and an ALU, the control unit, instruction memory and data storage being connected to a local processing element data bus, the local data bus being segmented into two independent portions by a switch; the control unit and instruction memory being connected to one of the segments, and the ALU and data storage being connected to the other of the segments.
4. An electronic encryption device as claimed in claim 2 wherein each processing element comprises a control unit and an ALU, the control unit, instruction memory, local data memory and shared data storage connected to a local processing element bus, the local bus being segmented by a switch into a local instruction bus segment connecting the instruction memory and the control unit, and a local data bus segment connecting the ALU, local data memory and shared data storage, the switch permitting either independent simultaneous operation on the two local bus segments, or a communication between the two bus segments.
5. An electronic encryption device as claimed in claim 4 wherein each processing element further comprises a multiplier for performing multiplication operations within the processing element.
6. An electronic encryption device as claimed in claim 2 wherein, during the implementation of the encryption algorithm, each processing element in the pipeline writes resulting data into the data storage shared with the subsequent processing element for direct access by the subsequent processing element.
7. An electronic encryption device as claimed in claim 2 wherein the shared data storage of the processing elements are comprised of dual port memories shared between adjacent processing elements of the linear array for transfer of data between adjacent processing elements of the linear array.
8. An electronic encryption device as claimed in claim 7 wherein each processor comprises a control unit and an ALU, the control unit, ALU, instruction memory, local data memory and shared data storage being connected to a local processing element data bus, the local data bus being segmented into two independent segments by a switch; the control unit and instruction memory being connected to one of the segments and the ALU, local data memory and shared data storage being connected to the other of the segments.
9. An electronic encryption device as claimed in claim 1 wherein each processing element further comprises a multiplier for performing multiplication operations within the processing element.
10. An electronic encryption device as claimed in claim 9 wherein the multipliers of plural processing elements are adapted for concatenation as segments of a wider multiplier.
11. An electronic encryption device as claimed in claim 10 wherein each multiplier comprises partial product adders having input selection circuitry for selecting a first set of inputs when operating as an individual multiplier, and a second set of inputs, including inputs from adjacent processing elements, when concatenated.
12. An electronic encryption device as claimed in claim 1 wherein each processor comprises a control unit and an ALU, the control unit, ALU, instruction memory, local data memory and shared data storage being connected to a local processing element data bus, the local data bus being segmented into two independent segments by a switch; the control unit and instruction memory being connected to one of the segments and the ALU, local data memory and shared data storage being connected to the other of the segments.
13. An electronic encryption device as claimed in claim 1 further comprising a global random access memory and a global bus through which data is transferred between the global random access memory and the processing element data storage.
14. An electronic encryption device as claimed in claim 13 further comprising a central processor coupled to the global bus for processing data words which are wider than data words processed by the processing elements.
15. An electronic encryption device as claimed in claim 14 wherein each processing element further comprises a multiplier for performing multiplication operations within the processing element.
16. An electronic encryption device as claimed in claim 15 wherein the multipliers of plural processing elements are adapted for concatenation as segments of a wider multiplier.
17. An electronic encryption device as claimed in claim 16 wherein each multiplier comprises partial product adders having input selection circuitry for selecting a first set of inputs when operating as an individual multiplier, and a second set of inputs, including inputs from adjacent processing elements, when concatenated.
18. An electronic encryption device as claimed in claim 13 wherein the central processor comprises an adder comprising:
- plural adder segments, each of the plural adder segments having a carry output and a sum output, each of the plural adder segments processing a segment of each of two operands;
- carry selectors to select the carry outputs as carry inputs to successive adder segments and for successive clock cycles so long as any carry results in an adder cycle; and
- operand selectors to select each sum output as an operand input to the same adder segment for successive clock cycles so long as any carry results in an adder cycle.
19. An electronic encryption device as claimed in claim 1 wherein each processor of each processing element performs a modulo adjust operation to compute M mod N.
20. An electronic encryption device as claimed in claim 1 wherein each processor of each processing element performs a modulo add or subtract operation to compute A±B mod N.
21. An electronic encryption device as claimed in claim 1 wherein each processor of each processing element performs a modulo multiply operation to compute A×B mod N.
22. An electronic encryption device as claimed in claim 1 wherein the encryption device further comprises an adder comprising:
- plural adder segments, each of the plural adder segments having a carry output and a sum output, each of the plural adder segments processing a segment of each of two operands;
- carry selectors to select the carry outputs as carry inputs to successive adder segments for successive clock cycles so long as any carry results in an adder cycle; and
- operand selectors to select each sum output as an operand input to the same adder segment for successive clock cycles so long as any carry results in an adder cycle.
23. An electronic encryption device comprising, on a single chip, a linear array of processing elements, each processing element comprising:
- an instruction memory for storing code required to implement at least a round of an encryption algorithm;
- a processor for processing the round from the instruction memory;
- local data memory;
- shared data storage between two adjacent processing elements;
- processing elements of the linear array each implementing one of the rounds and transferring results to successive processing elements such that the linear array of processing elements processes successive rounds of the encryption algorithm in a processing element pipeline.
24. An electronic encryption device as claimed in claim 23 wherein, during the implementation of the encryption algorithm, each processing element in the pipeline writes resulting data into the data memory shared with the subsequent processing element for direct access by the subsequent processing element.
25. An encryption data processing system comprising a linear array of processing elements, each processing element comprising:
- an instruction memory;
- a processor for processing instructions from the instruction memory;
- data memory;
- data memories of the processing elements of the linear array comprising dual port memories shared between adjacent processing elements for transfer of data between adjacent processing elements of the linear array.
26. An electronic encryption system as claimed in claim 25 wherein each processor comprises a control unit and an ALU, the control unit, ALU, instruction memory, data memories of the processing elements being connected to a local processing element data bus, the local data bus being segmented into two independent portions by a switch; the control unit and instruction memory being connected to one of the segments and the ALU and local and shared data memories being connected to the other of the segments.
27. An electronic encryption system as claimed in claim 25 wherein each processing element further comprises a multiplier for performing multiplication operations within the processing element.
28. An electronic encryption system as claimed in claim 27 wherein the multipliers of plural processing elements are adapted for concatenation as segments of a wider multiplier.
29. An electronic encryption system as claimed in claim 28 wherein each multiplier comprises partial product adders having input selection circuitry for selecting a first set of inputs when operating as an individual multiplier, and a second set of inputs, including inputs from adjacent processing elements, when concatenated.
30. An electronic encryption system as claimed in claim 25 further comprising a global random access memory and a global bus through which data is transferred between the global random access memory and the processing element data memories.
31. An electronic encryption system as claimed in claim 30 further comprising a central processor coupled to the global bus for processing data words which are wider than data words processed by the processing elements.
32. An electronic encryption system as claimed in claim 31 further comprising a multiplier for performing multiplication operations within the processing element.
33. An electronic encryption system as claimed in claim 32 wherein the multipliers of plural processing elements are adapted for concatenation as segments of a wider multiplier.
34. An electronic encryption system as claimed in claim 33 wherein each multiplier comprises partial product adders having input selection circuitry for selecting a first set of inputs when operating as an individual multiplier, and a second set of inputs, including inputs from adjacent processing elements, when concatenated.
35. An electronic encryption system as claimed in claim 31 wherein the central processor comprises an adder comprising:
- plural adder segments, each of the plural adder segments having a carry output and a sum output, each of the plural adder segments processing a segment of each of two operands;
- carry selectors to select the carry outputs as carry inputs to successive adder segments for successive clock cycles so long as any carry results in an adder cycle; and
- operand selectors to select each sum output as an operand input to the same adder segment for successive clock cycles so long as any carry results in an adder cycle.
36. An electronic encryption system as claimed in claim 25 wherein each processor of each processing element performs a modulo adjust operation to compute M mod N.
37. An electronic encryption system as claimed in claim 25 wherein each processor of each processing element performs a modulo add or subtract operation to compute A±B mod N.
38. An electronic encryption system as claimed in claim 25 wherein each processor of each processing element performs a modulo multiply operation to compute A×B mod N.
39. An electronic encryption system as claimed in claim 25 wherein the encryption device further comprises an adder comprising:
- plural adder segments, each of the plural adder segments having a carry output and a sum output, each of the plural adder segments processing a segment of each of two operands;
- carry selectors to select the carry outputs as carry inputs to successive adder segments for successive clock cycles so long as any carry results in an adder cycle; and
- operand selectors to select each sum output as an operand input to the same adder segment for successive clock cycles so long as any carry results in an adder cycle.
40. An electronic encryption device as claimed in claim 25 wherein, during the implementation of the encryption algorithm, each processing element in the pipeline writes resulting data into the data memory shared with the subsequent processing element for direct access by the subsequent processing element.
41. A multiplier circuit comprising:
- a plurality of multiplier segments, each receiving operand words of a first length; and
- input selectors which select a first set of inputs when the multiplier segments operate as individual multipliers and a second set of inputs to concatenate the multiplier segments as a wider multiplier operating on operands of a second word length,
- wherein the first word length is 32 bits and the second word length is 512 bits.
42. A multiplier as claimed in claim 41 wherein each multiplier segment comprises partial product adders.
43. An adder comprising:
- plural adder segments, each having a carry output and a sum output, each of the adder segments processing a segment of each of two operands;
- carry selectors to select the carry outputs as carry input to successive adder segments for successive clock cycles so long as any carry results in an adder cycle; and
- operand selectors to select each carry output as an operand input to the same adder segment for successive clock cycles so long as any carry results in an adder cycle.
44. An electronic encryption device comprising, on a single chip:
- a linear array of processing elements, each comprising an instruction store, data storage, and a processor which processes a sequence of instructions from the instruction store to operate on data words of a first length, the data storage of the processing elements including dual port memories shared between adjacent processing elements for transfer of data between adjacent processing elements of the array, the processing elements of the linear array having stored in their instruction stores respective rounds of an encryption algorithm and transferring results of the rounds to successive processing elements such that the linear array of processing elements processes successive rounds of the encryption algorithm in a processing element pipeline;
- a global random access memory;
- a global bus to which data is transferred between the global random access memory and the processing element data memories; and
- a public key encryption processor operating on data words of a second length at least an order of magnitude longer than the first length, the public key processor accessing global random access memory in word lengths of the second length.
45. An electronic encryption device comprising, on a single chip, an array of processing elements, each processing element comprising:
- instruction memory means for storing a round of an encryption algorithm;
- processor means for implementing the round from the instruction memory; and
- data storage means for storing encryption data operands and encrypted data resulting from implementing the round.
46. An electronic encryption device as claimed in claim 45 wherein the data storage means has a portion thereof which is shared between adjacent processing elements of the linear array for transfer of data between adjacent processing elements of the linear array.
47. An electronic encryption device as claimed in claim 46 further comprising global random access memory means and a global bus means through which data is transferred between the global random access memory means and the processing element data storage means.
48. An electronic encryption device as claimed in claim 47 further comprising central processing means coupled to the global bus means for processing data words which are wider than the data words processed by the processing elements.
49. A method of encryption comprising:
- in an electronic circuit on a single chip, receiving data to be encrypted;
- applying the data to a pipeline of data processing elements on the chip, each processing element processing an encryption round from a block cipher algorithm and transferring results to successive processing elements such that the processing elements implement successive rounds of the encryption algorithm in a processing element pipeline.
50. A method as claimed in claim 49 wherein results are transferred to successive processing elements through shared memory.
51. A method as claimed in claim 50 further comprising processing encryption algorithms on the chip in a central processor coupled to the processing elements through a global bus, the central processor processing data words which are wider than data words processed by the processing elements.
52. A method as claimed in claim 49 further comprising processing encryption algorithms on the chip in a central processor coupled to the processing elements through a global bus, the central processor processing data words which are wider than data words processed by the processing elements.
53. The multiplier as claimed in claim 41 wherein providing the multiplier segments operate as individual multipliers each individual multiplier is configured to operate as a first length by first length multiplier.
54. The multiplier as claimed in claim 41 wherein the wider multiplier is configured to operate as a first length by first length times a number of the plurality of multiplier segments multiplier.
55. The multiplier as claimed in claim 41 wherein the second word length is equal to the first length times a number of the plurality of multiplier segments.
56. The multiplier as claimed in claim 55 wherein the wider multiplier provides an output having a length equal to the first length plus the second length.
57. The multiplier as claimed in claim 41 wherein each of the multiplier segments is configured to receive a first operand word of the first length and a second operand word of the first length.
58. The multiplier as claimed in claim 57 wherein each of the multiplier segments comprises:
- a plurality of AND gates,
- each AND gate configured to: receive a first bit from the first operand word, receive a second bit from the second operand word, and provide one of a set of partial products.
59. The multiplier as claimed in claim 58 wherein each of the multiplier segments comprises:
- a plurality of partial product adders, and
- input selection circuitry configured to select the set of partial products as inputs to the partial product adders when operating as an individual multiplier, or a second set of inputs including partial products from an adjacent multiplier segment when concatenated.
60. The multiplier as claimed in claim 59 wherein each of the plurality of partial product adders comprises three stages, wherein the first and second stages comprise full adders and the third stage comprises a carry look-ahead adder.
61. The multiplier as claimed in claim 41 wherein each of the plurality of multiplier segments is configurable to operate as a 4 by 4 multiplier.
62. A multiplier circuit comprising:
- a plurality of multiplier segments, each receiving operand words of a first length; and
- input selectors which select a first set of inputs when the multiplier segments operate as individual multipliers and a second set of inputs to concatenate the multiplier segments as a wider multiplier operating on operands of a second word length,
- wherein at least one multiplier segment includes a gate configured to receive a select signal and a partial product signal from a neighboring multiplier segment, the select signal enabling the gate when the multiplier segments are concatenated to pass the partial product signal to a summation logic circuit, the select signal disabling the gate when the multiplier segments operate as individual multipliers to disable the passing of the partial product signal.
63. A multiplier as claimed in claim 62 wherein the gate is an AND gate, the select signal is logic 1 when the multiplier segments are concatenated, and the select signal is logic 0 when the multiplier segments operate as individual multipliers.
64. A multiplier circuit comprising:
- a plurality of multiplier segments, each receiving operand words of a first length; and
- input selectors which select a first set of inputs when the multiplier segments operate as individual multipliers and a second set of inputs to concatenate the multiplier segments as a wider multiplier operating on operands of a second word length,
- wherein at least one multiplier segment includes a plurality of gates each receiving a select signal and a respective one of a plurality of partial product signals from a neighboring multiplier segment, the select signal enabling the gates when the multiplier segments are concatenated to pass the partial product signals to a summation logic circuit, the select signal disabling the gates when the multiplier segments operate as individual multipliers to disable the passing of the partial product signals.
65. A multiplier as claimed in claim 64 wherein the plurality of gates are AND gates, the select signal is logic 1 when the multiplier segments are concatenated, and the select signal is logic 0 when the multiplier segments operate as individual multipliers.
66. A multiplier circuit comprising:
- a plurality of multiplier segments, each receiving operand words of a first length; and
- input selectors which select a first set of inputs when the multiplier segments operate as individual multipliers and a second set of inputs to concatenate the multiplier segments as a wider multiplier operating on operands of a second word length,
- wherein at least one multiplier segment includes a multiplexer receiving a select signal, a sum signal from a neighboring multiplier segment, and a partial product signal, the select signal enabling the multiplexer when the multiplier segments are concatenated to pass the sum signal to a summation logic circuit, the select signal enabling the multiplexer when the multiplier segments operate as individual multipliers to pass the partial product signal to the summation logic circuit.
67. A multiplier circuit comprising:
- a plurality of multiplier segments, each receiving operand words of a first length; and
- input selectors which select a first set of inputs when the multiplier segments operate as individual multipliers and a second set of inputs to concatenate the multiplier segments as a wider multiplier operating on operands of a second word length,
- wherein at least one multiplier segment includes a plurality of multiplexers each receiving a select signal, a respective one of a plurality of sum signals from a neighboring multiplier segment, and a respective one of a plurality of partial product signals, the select signal enabling the multiplexers when the multiplier segments are concatenated to pass the sum signals to a summation logic circuit, the select signal enabling the multiplexers when the multiplier segments operate as individual multipliers to pass the partial product signals to the summation logic circuit.
68. An electronic encryption device as claimed in claim 45, in which said processor means comprises a memory which is shared between adjacent processing elements of the array of processing elements for transfer of data between adjacent processing elements of the array.
3098153 | July 1963 | Heijn |
4004089 | January 18, 1977 | Richard et al. |
4274085 | June 16, 1981 | Marino, Jr. |
4306289 | December 15, 1981 | Lumley |
4439839 | March 27, 1984 | Kneib et al. |
4484025 | November 20, 1984 | Ostermann et al. |
4598170 | July 1, 1986 | Piosenka et al. |
4641238 | February 3, 1987 | Kneib |
4747139 | May 24, 1988 | Taaffe |
4809169 | February 28, 1989 | Sfarti et al. |
4922418 | May 1, 1990 | Dolecek |
5038282 | August 6, 1991 | Gilbert et al. |
5109506 | April 28, 1992 | Begun |
5239654 | August 24, 1993 | Ing-Simmons et al. |
5301340 | April 5, 1994 | Cook |
5343416 | August 30, 1994 | Eisig et al. |
5475856 | December 12, 1995 | Kogge |
5524250 | June 4, 1996 | Chesson et al. |
5546343 | August 13, 1996 | Elliott et al. |
5627966 | May 6, 1997 | Hanko |
5636351 | June 3, 1997 | Lee |
5675164 | October 7, 1997 | Brunner et al. |
5694143 | December 2, 1997 | Fielder et al. |
5708836 | January 13, 1998 | Wilkinson et al. |
5724422 | March 3, 1998 | Shang et al. |
5808928 | September 15, 1998 | Miyoshi |
5864683 | January 26, 1999 | Boebert et al. |
5892960 | April 6, 1999 | Seide |
5907714 | May 25, 1999 | Boutaud et al. |
5917912 | June 29, 1999 | Ginter et al. |
5926546 | July 20, 1999 | Maeda et al. |
5987124 | November 16, 1999 | Matyas, Jr. et al. |
6434699 | August 13, 2002 | Jones et al. |
20010024502 | September 27, 2001 | Ohkuma et al. |
40 16 203 | November 1991 | DE |
0 239 899 | October 1987 | EP |
0 395 348 | October 1990 | EP |
0 577 483 | January 1994 | EP |
0654733 | July 1994 | EP |
2 172 129 | September 1986 | GB |
2 215 496 | September 1989 | GB |
62-271016 | November 1987 | JP |
64-003734 | January 1989 | JP |
03-129384 | June 1991 | JP |
06-214760 | August 1994 | JP |
07-200260 | August 1995 | JP |
08-123670 | May 1996 | JP |
11-506230 | June 1999 | JP |
WO 89/07375 | August 1989 | WO |
WO 91/18460 | November 1991 | WO |
WO 97/09705 | March 1997 | WO |
WO 99/14881 | March 1999 | WO |
WO 99/34514 | July 1999 | WO |
WO 99/44329 | September 1999 | WO |
- “Intel Architecture Software Developer's Manual, vol. 1: Basic Architecture”, Intel, 283 pp. (1997).
- “Intel Architecture Software Developer's Manual, vol. 2: Instruction Set Reference”, Intel, 566 pp. (1997).
- “Pentium Processor Family Developer's Manual, vol. 3: Architecture and Programming Manual”, Intel, 1032 pp. (1995).
- “Pentium Processor Family Developer's Manual”, Intel, 609 pp. (1997).
- Butter, A., et al., “High Performance Cryptographic Hardware Using Pipelined Data Encryption Standard Units,” IBM Technical Disclosure Bulletin, 38(1): 503-507 (Jan. 1995).
- International Search Report in International Application No. PCT/CA99/00176, 5 pages, mailed Dec. 3, 1999.
- International Preliminary Examination Report in International Application No. PCT/CA99/00176, 22 pages, mailed Jul. 6, 2000.
- Wolter, Stefan et al., “On the VLSI Implementation of the International Data Encryption Algorithm IDEA,” IEEE International Symposium on Circuits and Systems (ISCAS), Seattle, WA, Apr. 30-May 3, 1995, vol. 1, Apr. 30, 1995, pp. 397-400.
- Sauerbrey, J., “A Modular Exponentiation Unit Based on Systolic Arrays,” Advances in Cryptology—Auscrypt, Gold Coast, Queensland, Dec. 13-16, 1992, conf. 3, Dec. 13, 1992, pp. 505-516.
- Barrett, Paul, “Implementing The Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor,” Computer Security LTD, Aug. 1986, pp. 311-323.
- Montgomery, Peter L., “Modular Multiplication Without Trial Division,” Mathematics of Computation, vol. 44, No. 170, Apr. 1985, pp. 519-521.
- Jones, D. et al., “A Time-Multiplexed FPGA Architecture for Logic Emulation,” Proceedings of the IEEE 1995 Custom Integrated Circuits Conference, May 1995, pp. 495-498.
- Mirsky, Ethan et al., “Matrix: A Reconfigurable Computing Architecture with Configurable Instruction Distribution and Deployable Resources,” IEEE Symposium on FPGA's For Custom Computing Machines, Published at FCCM '96, Apr. 17-19, 1996, pp. 1-10.
- Chen, Dev C., “A Reconfigurable Multiprocessor IC for Rapid Prototyping of Algorithmic-Specific High-Speed DSP Data Paths,” IEEE Journal of Solid-State Circuits, vol. 27, No. 12, Dec. 1992, pp. 1895-1904.
- Yeung, Alfred K. et al., “A 2.4 GOPS Data-Driven Reconfigurable Multiprocessor IC for DSP,” IEEE International Solid-State Circuits Conference, ISSCC95, Feb. 16, 1995, Session 6, Digital Design Elements, Paper TA 6.3, 3 pages.
Type: Grant
Filed: Sep 4, 2012
Date of Patent: Jan 7, 2014
Assignee: Mosaid Technologies Incorporated (Ottawa, Ontario)
Inventors: David E. Jones (Ottawa), Cormac M. O'Connell (Carp)
Primary Examiner: Benjamin Lanier
Application Number: 13/603,137
International Classification: G06F 21/00 (20130101)