Speculative execution for data ciphering operations
In one embodiment, a computer-implemented method comprises receiving a data cipher operation. The method also comprises processing the data cipher operation. The processing of the operation includes generating a number of portions of ciphertext from plaintext, wherein a load operation associated with the generating of at least one portion of the ciphertext executes prior to a store operation associated with the generating of a prior portion of the ciphertext.
This application claims the benefit of U.S. provisional patent application No. 60/361,247 entitled “Speculative Execution for Data Ciphering Operations,” filed Mar. 1, 2002.
FIELD OF THE INVENTION
The invention relates to the field of computer processing. More specifically, the invention relates to speculative execution for data ciphering.
BACKGROUND OF THE INVENTION
Communication networks and the number of users of such networks continue to increase. On-line sales involving both business-to-business and business-to-consumer transactions over the Internet continue to proliferate. Additionally, the number of people that are telecommuting continues to grow. Both on-line sales and telecommuting are examples of usage of communication networks that typically involve private and sensitive data that needs to be protected during its transmission across the different communication networks.
Accordingly, security protocols (e.g., Transport Layer Security (TLS), Secure Sockets Layer (SSL) 3.0, Internet Protocol Security (IPSec), etc.) have been developed to establish secure sessions between remote systems. These security protocols provide a method for remote systems to establish a secure session through message exchange and calculations, thereby allowing sensitive data being transmitted across the different communication networks to have a measure of security and/or untamperability.
Moreover, different ciphers have been developed to allow for these secure communications using such different security protocols. RC4 is a stream cipher having variable size keys using byte-oriented operations. RC4 employs a Substitution (S)-box in its generation of byte values that are subsequently XORed with the plaintext to generate the ciphertext and/or XORed with ciphertext to generate the corresponding plaintext.
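To help illustrate the byte-oriented operations just described, the following is a minimal sketch of RC4 in Python. The function names rc4_init and rc4_crypt are hypothetical, and the key schedule shown is the commonly published RC4 initialization, which this description does not itself set out.

```python
def rc4_init(key: bytes) -> list[int]:
    """Build the 256-entry S-box from a variable-size key (standard RC4 key schedule)."""
    if not key:
        raise ValueError("key must be non-empty")
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    return s


def rc4_crypt(key: bytes, data: bytes) -> bytes:
    """XOR data with S-box-derived byte values; the same call encrypts and decrypts."""
    s = rc4_init(key)
    i = j = 0
    out = bytearray()
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]        # swap two S-box entries
        t = (s[i] + s[j]) % 256        # index of the byte value to XOR with
        out.append(byte ^ s[t])
    return bytes(out)
```

Because the output is a plain XOR with the generated byte values, applying rc4_crypt a second time with the same key returns the original input.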
However, current approaches for RC4 are limited in their execution speed due to the bottlenecking that occurs while accessing the S-box. In particular, a processing unit executes the RC4 operations in conjunction with a memory that is used to store the S-box needed for the operations. Accordingly, the processing unit accesses the S-box for the different operations. To help illustrate,
Additionally,
Columns 206-216 of waterfall diagram 200 illustrate the change in the variables of pseudo-code 100 (i, j, a, b, t and l). Column 218 illustrates a temporary variable (temp) that is used for the storage of temporary results within the processor. The variables i, j and t are used as indexes into the S-box. The variables a and b are used to temporarily store values retrieved from the S-box based on the index variables i and j, respectively. The variable l is used as an index into the data arrays for both the plaintext and the ciphertext generated therefrom.
Column 220 of waterfall diagram 200 illustrates the reading of the memory (that is coupled to the processor executing pseudo-code 100) for a first cycle, while column 222 illustrates the reading of the memory for a second cycle. Column 224 of waterfall diagram 200 illustrates the writing to the memory.
As will be shown in
Returning to
At process cycle zero (shown at column 204 of waterfall diagram 200), the processor executes code statement 104, wherein the value stored in the S-box (S[ ]) at an offset of i+1 is assigned to the variable of a, as part of the memory accessing (as shown in column 220).
At process cycle one, the processor executes a portion of code statement 106, wherein the variable i is incremented (as shown in column 206). Additionally, within process cycle one (as shown in column 222), the second read cycle associated with the read operation of S[i] is also occurring, as two process cycles are needed to retrieve the data from memory.
At process cycle two, the processor executes the other portion of code statement 106, as the value of j+a is assigned to j. In particular, since the read operation of S[i] takes two cycles to complete (after two process cycles), the result of such operation cannot be employed until three cycles later (process cycle two), wherein j equals the value of a added to the current value of j. Accordingly, the result of the memory operation is added to the current value of j to generate a new value of j (as shown in columns 208 and 210).
At process cycle three, the processor executes code statement 108 wherein the value stored in the S-box (S[ ]) at an offset of j is assigned to the variable b, for a second memory access (as shown in column 220). In particular, the access of the S-box is available three cycles after a prior access to the S-box (process cycle zero).
At process cycle four, none of the code statements in pseudo-code 100 is executed. Rather, in process cycle four, the second cycle associated with the read operation of S[j] is executed (as shown in column 222).
At process cycle five, (as part of the swapping of the data in the S-box) the processor executes a portion of code statement 110, wherein the value of b is assigned to the location within the S-box having an offset of i (as shown in column 224). Accordingly, as shown in
At process cycle six, (as part of the swapping of the data in the S-box) the processor executes another portion of code statement 110, wherein the location within the S-box having an offset of j is set to the value of a (as shown in column 224). Also within process cycle six, the processor retrieves the value at the location within the S-box having an offset of t (as shown in column 220). Additionally within process cycle six, the second read cycle for the retrieval of the value at the location within the array of data to be ciphered (plain[ ]) at an offset of l is read from the memory (as shown in column 222).
Additionally, an overlap in the generation of two different portions of ciphertext is occurring, as the second iteration of the while loop of
At process cycle eight (process cycle one of the subsequent iteration of the while loop of
At process cycle nine (process cycle two of the subsequent iteration of the while loop of
The loop of
In a similar manner to the generation of ciphertext through encryption of plaintext, RC4 can also be used to generate plaintext through the decryption of ciphertext. Thus, RC4, whether encrypting or decrypting, translates input text blocks into output text blocks.
Disadvantageously, as illustrated, the bottleneck of this prior art approach to an RC4 operation is associated with the access of data from the S-box stored in memory. The overlapping of the generation of the two different portions of ciphertext is non-speculative in nature. Specifically, in order to avoid the generation of inaccurate data for the ciphertext, the write operation to the S-box for the generation of a first portion of ciphertext is complete prior to the load operation to the S-box for the generation of a second portion of the ciphertext. As shown, three cycles are needed for each read operation in order to ensure that the data retrieved from the S-box is up-to-date. Accordingly, one byte of the plaintext is encrypted into ciphertext per seven process cycles.
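The ordering constraint of this prior art approach can be sketched in Python as a loop that logs every S-box memory operation. This is an illustrative model only: the function name sequential_rc4 and the trace format are assumptions, and the sketch records only the order of loads and stores, not the process cycles of the waterfall diagram.

```python
def sequential_rc4(sbox, data):
    """Cipher data one byte at a time, logging every S-box memory operation."""
    s = list(sbox)                 # work on a copy of the 256-entry S-box
    i = j = 0
    out = bytearray()
    trace = []                     # entries: (operation, byte number, S-box index)
    for n, byte in enumerate(data):
        i = (i + 1) % 256
        a = s[i]
        trace.append(("load", n, i))
        j = (j + a) % 256
        b = s[j]
        trace.append(("load", n, j))
        s[i] = b                   # the swap is written as two stores
        trace.append(("store", n, i))
        s[j] = a
        trace.append(("store", n, j))
        t = (a + b) % 256
        out.append(byte ^ s[t])
        trace.append(("load", n, t))
    return bytes(out), trace
```

In the returned trace, both stores for byte n always appear before the first load for byte n+1, which is the non-speculative ordering that limits how fully the memory can be utilized.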
SUMMARY OF THE INVENTION
A method, apparatus and system for speculative execution for data ciphering are described. In one embodiment, a computer-implemented method comprises receiving a data cipher operation. The method also comprises processing the data cipher operation. The processing of the operation includes generating a number of portions of ciphertext from plaintext, wherein a load operation associated with the generating of at least one portion of the ciphertext executes prior to a store operation associated with the generating of a prior portion of the ciphertext.
In an embodiment, a computer-implemented method executes in a processor. The method comprises receiving a request to perform data ciphering of plaintext. The method also comprises processing the request based on a data structure stored in a memory coupled to the processor. The processing includes performing a first access of data from the data structure and swapping the data from the first access. The process also includes data ciphering a first portion of the plaintext based on the swapped data from the first access. Additionally, the processing comprises performing a second access of data from the data structure prior to the swapping of the data from the first access. The processing also includes performing the following, upon determining that the data from the first access does not equal the data from the second access: swapping the data from the second access and data ciphering a second portion of the plaintext based on the swapped data from the second access.
In an embodiment, an apparatus comprises a memory to store a data structure. The apparatus also comprises a processing unit coupled to the memory. The processing unit is to execute a data ciphering operation. Additionally, the processing unit is to swap data stored in the data structure for data ciphering of a first portion of plaintext. Moreover, prior to the completion of the swapping of the data stored in the data structure for data ciphering of the first portion of the plaintext, the processing unit is to access data stored in the data structure for data ciphering of a second portion of the plaintext.
Embodiments of the invention may be best understood by referring to the following description and accompanying drawings that illustrate such embodiments. The numbering scheme for the Figures included herein is such that the leading number for a given element in a Figure is associated with the number of the Figure. For example, host execution unit 300 can be located in
In the drawings:
A method, apparatus and system for speculative execution for data ciphering are described. In the following description, numerous specific details are set forth to provide a thorough understanding of the invention. However, it is understood that the invention may be practiced without these specific details. In other instances, well-known circuits, structures and techniques have not been shown in detail in order not to obscure the invention.
Overview
Embodiments of the invention provide for faster execution of data ciphering in comparison to current approaches. As will be described in more detail below, in various embodiments of the invention, a data ciphering scheme is implemented by overlapping operations to generate different output blocks of data through use of speculative execution of memory accesses. In one such embodiment, data ciphering based on RC4 is employed such that the memory storing the S-box (which stores data needed for the RC4 operations) is more fully utilized. RC4 is a data ciphering scheme in which output blocks are weakly coupled, such that there is a relatively low probability that collisions will occur between data used to generate such different output blocks. While embodiments are described in which RC4 is employed, alternative embodiments may employ different data ciphering schemes in which neighboring output blocks are weakly coupled (where the output blocks are generated in an order and the generation of those output blocks that are in close proximity to each other according to that order are weakly coupled). However, embodiments of the invention are not limited to such data ciphering schemes, as embodiments can be incorporated into any of a number of different types of data ciphering schemes (such as Advanced Encryption Standard (AES), RC5, Data Encryption Standard (DES) etc.).
Further, in one embodiment, the RC4 operations are based on modulo 256, wherein there are two inputs (eight bits in length) and one output (eight bits in length). However, embodiments of the invention are not so limited, as other variations of the RC4 operations can be incorporated into embodiments of the invention.
Additionally, while embodiments of the invention are described with reference to the generation of ciphertext through encryption of plaintext, embodiments of the invention are not so limited. Rather, embodiments of the invention are also applicable to the generation of plaintext through decryption of ciphertext. Thus RC4 for example, whether encrypting or decrypting, translates input text blocks into output text blocks.
Apparatus Description
Alternative embodiments of the invention may include additional primitive security operation units or fewer primitive security operation units. Memory 314 is coupled to the primitive security operation units 304, 306, 308, 310 and 312 through bus 335. In one embodiment, bus 335 is also coupled to external memory and/or other execution units. For example, one such configuration is described in further detail in co-pending patent application, entitled “An Interface to a Security Co-Processor” Ser. No. 10/025,512 to Richard E. Kessler, David A. Carlson, Muhammad Raghib Hussain, Robert A. Sanzone and Khaja E. Ahmed, which is hereby incorporated by reference.
In one embodiment, microcode block 302 is used by microcontroller unit 316 to translate a security operation into one or more primitive security operations. Microcontroller unit 316 retrieves from memory 314 the appropriate data for each of the primitive security operations. The primitive security operations are placed into execution queue 318 by microcontroller unit 316. When a primitive security operation's corresponding primitive security operation block is able to perform the primitive security operation, execution queue 318 pushes the primitive security operation to the appropriate primitive security operation unit 304, 306, 308, 310, 312 and/or 320. Once a primitive security operation unit 304, 306, 308, 310 and 312 has executed the primitive security operation, the primitive security operation unit either passes the results to memory 314 or onto bus 335.
In an embodiment, the primitive security operations placed into execution queue 318 also indicate where the results of an operation will be stored. In one embodiment, the results of an operation by a given primitive security operation unit can be (1) stored in memory 314 or in a memory external to execution unit 300 (not shown) and/or (2) served as inputs to another primitive security operation unit. For example, a first primitive security operation unit can generate a result, wherein a second primitive security operation unit reads that result from the first primitive security operation unit for processing within the second primitive security operation unit.
Memory 314 can be any of a number or a combination of different types of memories (e.g., static random access memory (SRAM), dynamic random access memory (DRAM), etc.). Additionally, in an embodiment, memory 314 is to store S-box 330, plaintext 332 and ciphertext 334. Although S-box 330, plaintext 332 and ciphertext 334 can be any of a number of different data structures, in an embodiment, S-box 330, plaintext 332 and ciphertext 334 are data arrays. For example, in other embodiments, S-box 330, plaintext 332 and ciphertext 334 could be different types of tables or classes of objects (such as those used in object-oriented databases). In one embodiment, S-box 330 is a 256-byte array.
As will be described in more detail below, RC4 unit 320 executes a number of operations for data ciphering wherein data stored in plaintext 332 is converted to data stored in ciphertext 334 (or vice versa) based on accesses (both read and write) to S-box 330. In one embodiment, RC4 unit 320 is a hardware state machine for the generation of ciphertext and/or plaintext based on RC4 operations.
While one embodiment is described in which each execution unit has its own microcode block, alternative embodiments have one or more execution units share a single microcode block. Yet other embodiments have a central microcode block (e.g., in SRAM) whose contents are loaded upon power-up into local microcode blocks in each of the execution units. Regardless of the arrangement of the microcode block(s), in certain embodiments the microcode blocks are reprogrammable to allow for flexibility in the selection of the security operations (be they macro and/or primitive security operations) to be performed.
Moreover, embodiments of the invention are not limited to the system illustrated in
Data Ciphering Operations
The data ciphering operations will now be described in terms of a flow diagram illustrated in
As described in more detail below,
As shown in block 502, a request to execute an operation to cipher data is received. With reference to the exemplary embodiment in
In process decision block 504, a decision is made whether the generation of the ciphertext is complete. With reference to the exemplary embodiment in
In process block 506, upon determining that the generation of the ciphertext is complete, the process is complete. In terms of pseudo-code 600 at while statement 602, once the value of l is not less than LEN, the process is complete. With reference to the exemplary embodiment in
In process block 508, upon determining that the generation of the ciphertext is not complete, a first memory access and a second memory access (for this iteration of the while loop in
To better illustrate,
Waterfall diagram 700 includes column 702, which is a list of the operations of pseudo-code 600. Accordingly, as shown, the rows of waterfall diagram 700 correspond to the operations shown in pseudo-code 600. Waterfall diagram 700 also includes column 704, which includes the process cycles that correspond to the execution of pseudo-code 600.
Columns 706-722 of waterfall diagram 700 illustrate the change in the variables of pseudo-code 600, i, j1, j2, a1, a2, b1, b2, t and l, over the execution of the process cycles. The numbers 1 and 2 following such variables designate whether the variable is being used to generate respectively the non-speculative or speculative ciphertext portion for a given iteration of the loop of
The variables i, j1, j2 and t are used as indexes into S-box 330. The variables a1, a2, b1 and b2 are used to temporarily store values retrieved from S-box 330 for the index variables i, j1 and j2. The variable l is used as an index into the data structures for both the plaintext and the ciphertext generated therefrom (plaintext 332 and ciphertext 334, respectively). Additionally, the variables j1, a1 and b1 are used in conjunction with a first memory access and a first swap of elements within S-box 330 for a cipher of a first byte for a given iteration (as will be described in more detail below). The variables j2, a2 and b2 are used in conjunction with the second memory access and second (speculative) swap of elements within S-box 330 for a cipher of a second byte for a given iteration (as will be described in more detail below).
Column 728 illustrates the reading of memory 314 for a first cycle, while column 730 illustrates the reading of memory 314 for a second cycle. Column 732 illustrates the writing of data to memory 314. Additionally, waterfall diagram 700 illustrates process cycles six-nine of an iteration of the while loop in
With regard to process block 508, in terms of the process cycles illustrated in waterfall diagram 700, the first and second memory accesses are within process cycles zero, one, two, three, four and five. At process cycle zero, as shown in column 728 of
At process cycle one, in one embodiment, RC4 unit 320 does not execute one of the code statements of pseudo-code 600. However, within process cycle one (as shown in column 730), the second read cycle associated with the read operation of S[i] is occurring, as, in one embodiment, two process cycles are needed to retrieve the data from memory 314.
At process cycle two, RC4 unit 320 executes code statement 608, code statement 610 and code statement 612. In particular, for the second memory access, RC4 unit 320 executes code statement 608, which causes the value stored in S-box 330 (S[ ]) at an offset of i+1 to be assigned to the variable a2 (as shown in column 728). Additionally, at process cycle two, RC4 unit 320 executes code statement 610, wherein the value of i is incremented (as shown in column 706). Also, at process cycle two (related to the first memory access), RC4 unit 320 executes code statement 612, wherein the variable j1 is assigned the value of j1 added to the value of a1. In other words, the value stored in j1 is increased by the value stored in S[i] (as shown in column 708).
At process cycle three, RC4 unit 320 executes code statement 616, wherein the value stored in S-box 330 at an offset of j1 is stored in variable b1 (as shown in column 728). Moreover, within process cycle three (as shown in column 730), the second read cycle associated with the read operation of S[i] is also occurring.
At process cycle four, RC4 unit 320 executes code statement 614 (relating to the second memory access), wherein the value of j2 is increased by the value of a2 (as shown in column 710). Also, at process cycle four, RC4 unit 320 executes code statement 624, wherein the value of l is incremented (as shown in column 722), thereby incrementing the index into plaintext 332 and ciphertext 334. Additionally, within process cycle four (as shown in column 730), the second read cycle associated with the read operation of S[j1] is also occurring. At process cycle five, RC4 unit 320 executes code statement 618 (related to the second memory access), wherein the variable b2 is assigned the value stored in S-box 330 at an offset of j2 (S[j2]) (as shown in column 728).
Returning to flow diagram 500 of
With reference to waterfall diagram 700, at process cycle four, RC4 unit 320 executes code statement 620, wherein the value of a1 (which is the value within S[i−1], in reference to the current value of i) is stored in S-box 330 at an offset of j1 (as shown in column 732). To help illustrate,
At process cycle five, RC4 unit 320 continues executing code statement 620, wherein the value of b1 (which is the value in S[j1]) is stored in S-box 330 at an offset of i−1 (as shown in column 732). With reference to
Returning to flow diagram 500 of
For the generation of the ciphertext, at process cycle four, RC4 unit 320 retrieves the value stored in plaintext 332 (plain[ ]) at an offset of l (as shown in column 728) (in conjunction with the execution of code statement 622). Additionally, within process cycle five, (as shown in column 730), the second read cycle associated with the read operation of plain[l] is occurring.
Moreover, the value within S-box 330 that will be used in the generation of the first byte of the ciphertext is retrieved. Accordingly, at process cycle five, an index (t) into S-box 330 for this value is generated based on the values stored at S[i−1] (stored in a1) and S[j1] (stored in b1) (as shown in column 720).
The value within S-box 330 at the index location (t) is retrieved. In particular, at process cycle six, RC4 unit 320 retrieves the value stored in S-box 330 at an offset of t (as shown in column 728). Additionally, within process cycle six, the value retrieved from plaintext 332 is stored in a temporary variable, temp1 (as shown in column 724).
Within process cycle seven, the second read cycle associated with the read operation of S[t] is occurring (as shown in column 730). Within process cycle eight, with the needed values returned from memory 314 (for the generation of the ciphertext), RC4 unit 320 sets the value of temp1 to the result of XORing the plaintext element, plain[l] (stored in temp1) with the value from S-box 330, S[t] (as shown in column 724). In an embodiment, the resultant value of this XOR is the first byte of ciphertext for the iteration of the loop in
In process decision block 514, a decision is made on (1) whether there was a collision between the first swap and the second memory access or (2) whether the generation of the ciphertext is complete. With reference to the exemplary embodiment in
In one embodiment, if the index values into S-box 330 for the first swap, i−1 or j1, equals the index values into S-box 330 for the second memory access, i or j2, then a collision has occurred. Therefore, the current values for the second memory access were not retrieved. Accordingly, the values retrieved for the second memory access cannot be used in a second swap to generate a second byte of ciphertext (for this iteration of
With regard to checking for collisions, in reference to pseudo-code 600, code statement 626 checks whether (1) i equals j1, (2) i−1 equals j2 or (3) j1 equals j2. The condition of whether i−1 equals i is not checked, as this condition cannot occur. Moreover, in one embodiment, a collision is forced when the last element in the plaintext is being converted to ciphertext (as shown by the condition that checks whether the variable l is not equal to LEN (the length of the plaintext)).
Upon determining that a collision between the first swap and the second memory access did occur, the second swap and the generation of the second byte of ciphertext for this iteration are aborted and the process returns to process decision block 504, where a decision is made on whether the generation of the ciphertext is complete (as described above). With reference to the exemplary embodiment in
RC4 unit 320 executes code statement 634, which is an else statement that corresponds to the if statement in code statement 626. RC4 unit 320 executes code statement 636, which is part of the else clause under code statement 634. In process cycle six, the execution of code statement 634 causes the decrementing of the value stored in i (as shown in column 706). Accordingly, the second memory access of this current iteration is re-executed as the first memory access of the subsequent iteration.
Conversely, in process block 516, upon determining that a collision between the first swap and the second memory access did not occur, a second swap of data in S-box 330 is performed based on the second memory access. In other words, a store operation associated with the generation of a second portion of ciphertext (the speculative ciphertext block for a given iteration) is performed, upon determining that a store operation associated with the generation of a first portion of ciphertext (the non-speculative block for a given iteration) did not collide with a load operation associated with the generation of a second portion of ciphertext (for the same iteration). In an embodiment, in terms of the code statements in pseudo-code 600 and the process cycles in waterfall diagram 700, the second swap of data in S-box 330 is within code statement 628 and process cycles six and seven, respectively.
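The flow of the first access, the speculative second access, the collision check and the conditional second swap can be sketched in Python as follows. This is a software model under stated assumptions, not the hardware state machine or pseudo-code 600: the function name rc4_speculative, the local names i1 and i2 (standing for i−1 and i after the increment described above) and the returned abort counter are hypothetical, and the sketch models only the order of loads and stores, not process cycles.

```python
def rc4_speculative(sbox, data):
    """Generate up to two output bytes per iteration, the second speculatively."""
    s = list(sbox)                      # copy of the 256-entry S-box
    out = bytearray()
    i = j = 0                           # indexes used for the last completed byte
    aborted = 0                         # speculative bytes discarded on collision
    n = 0                               # index into data (the variable l)
    while n < len(data):
        # First (non-speculative) memory access.
        i1 = (i + 1) % 256
        a1 = s[i1]
        j1 = (j + a1) % 256
        b1 = s[j1]
        # Second (speculative) memory access, loaded before the first swap is stored.
        i2 = (i1 + 1) % 256
        a2 = s[i2]
        j2 = (j1 + a2) % 256
        b2 = s[j2]
        # First swap and first output byte of this iteration.
        s[i1], s[j1] = b1, a1
        out.append(data[n] ^ s[(a1 + b1) % 256])
        n += 1
        i, j = i1, j1
        if n == len(data):
            break                       # last element: no second byte to generate
        # Collision: a speculative load touched an entry the first swap rewrote.
        if i2 == j1 or j2 == i1 or j2 == j1:
            aborted += 1
            continue                    # redo this byte as the next first access
        # Second swap and second output byte, using the speculative loads.
        s[i2], s[j2] = b2, a2
        out.append(data[n] ^ s[(a2 + b2) % 256])
        n += 1
        i, j = i2, j2
    return bytes(out), aborted
```

When the check finds no collision, the speculative loads are identical to what a sequential loop would have read, so the output matches the non-speculative result; on a collision only the second byte is recomputed.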
With reference to the exemplary embodiment in
In process block 518, a second byte of ciphertext (in this iteration) is generated based on the second swap. In an embodiment, in terms of the code statements in pseudo-code 600 and the process cycles in waterfall diagram 700, the generation of the second byte of ciphertext (in this iteration) occurs within code statement 630 and process cycles seven, eight, nine, ten, eleven and sixteen, respectively.
At process cycle seven, RC4 unit 320 retrieves the value stored in plaintext 332 (plain[ ]) at an offset of l (as shown in column 728) (in conjunction with code statement 630). Additionally, within process cycle eight (as shown in column 730), the second read cycle associated with the read operation of plain[l] is occurring.
Moreover, the value within S-box 330 that will be used in the generation of the second byte of the ciphertext is retrieved. Accordingly, at process cycle eight, an index (t) into S-box 330 for this value is generated based on the values stored at S[i] (stored in a2) and S[j2] (stored in b2) (as shown in column 720).
The value within S-box 330 at the index location (t) is retrieved. In particular, at process cycle nine, RC4 unit 320 retrieves the value stored in S-box 330 at an offset of t (as shown in column 728). Additionally, within process cycle nine, the value retrieved from plaintext 332 is stored in a temporary variable, temp2 (as shown in column 726).
Within process cycle ten, the second read cycle associated with the read operation of S[t] is occurring (as shown in column 730). Within process cycle eleven, with the needed values returned from memory 314 (for the generation of the ciphertext), RC4 unit 320 sets the value of temp2 to the result of XORing the plaintext element plain[l] (stored in temp2) with the value from S-box 330, S[t], (as shown in column 726). The resultant value of this XOR is the ciphertext.
In one embodiment, at process cycle sixteen, this resultant value is stored in ciphertext 334, cipher[ ], at an offset of l (as shown in column 732).
Moreover, at process cycle seven, as part of code statement 632, the value stored in l is incremented to proceed to the next element in plaintext 332 that will be converted to ciphertext (as shown in column 722). Also, at process cycle seven, as part of code statement 632, j1 is set to the value stored in j2. In particular, j1 is used in the subsequent iteration of
As shown in waterfall diagram 700, with overlapping among iterations of the while loop of
Moreover, embodiments of the invention can vary the scheduling of the execution of the code statements shown in pseudo-code 600. For example, in an embodiment, if there is a collision between the first swap and the second memory read access, the next iteration of the while loop (in
Additionally, as illustrated, embodiments of the invention can be used in conjunction with data ciphering operations wherein the iterations associated therein are weakly coupled. In particular, in an embodiment, the S-box includes 256 elements wherein only two of such elements change for a given generation of ciphertext. Accordingly, the speculative execution for the generation of an additional byte of ciphertext for a given iteration has a high probability for success.
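How weakly neighboring output bytes are coupled can be estimated with a short Python experiment. The sketch below is an assumption-laden illustration: the function name neighbor_collision_rate is hypothetical, the key schedule is the commonly published RC4 initialization, and the rate is measured over every adjacent pair of output bytes rather than over the exact pairing of iterations described above.

```python
def neighbor_collision_rate(key: bytes, n_bytes: int) -> float:
    """Fraction of adjacent output-byte pairs whose S-box indexes collide."""
    s = list(range(256))
    j = 0
    for i in range(256):                    # standard RC4 key schedule (assumed)
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    i = j = 0
    prev = None                             # (i, j) swapped for the previous byte
    collisions = 0
    for _ in range(n_bytes):
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        if prev is not None:
            prev_i, prev_j = prev
            # Same three conditions as the collision check in the description.
            if i == prev_j or j == prev_i or j == prev_j:
                collisions += 1
        s[i], s[j] = s[j], s[i]
        prev = (i, j)
    return collisions / max(n_bytes - 1, 1)
```

If the indexes behaved as uniformly random values, each of the three conditions would hold about once in 256 pairs, so a rate on the order of one percent would be expected; the function reports the measured fraction for a given key and length.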
System Description
Host memory 904 stores request queue 906, input data 908A-908I and output data 909A-909I. Request queue 906 is illustrated and described in terms of a queue. However, embodiments of the invention are not so limited, as request queue 906 can be any other type of data structure for storage of requests to be transmitted to coprocessor 912, which is described in more detail below. In one embodiment, request queue 906 is a circular queue (ring buffer). In an embodiment, the write pointer for request queue 906 is maintained by request processing unit 934 and the read pointer for request queue 906 is maintained by request unit 914 of coprocessor 912. Accordingly, request processing unit 934 increments its write pointer when storing requests into request queue 906, while request unit 914 decrements its read pointer when extracting or retrieving requests from request queue 906.
Additionally, although input data 908A-908I and output data 909A-909I are data structures that are described as tables, such data can be stored in other types of data structures, such as data objects in an object-oriented environment. In one embodiment, input data 908A-908I are contiguously stored in host memory 904. Accordingly, request unit 914 within coprocessor 912 can extract the input data across multiple requests using one direct memory access (DMA) read operation.
Requests inserted into request queue 906 by request processing unit 934 can include instructions, such as an operation code, the data to be operated on as well as a pointer to other locations in host memory 904 storing data (which is related to the request) that could not be placed into the request inside request queue 906, due to constraints on the size of the requests. In particular, requests within request queue 906 can point to one of input data 908A-908I. In one embodiment, these requests are 32 bytes in size. The types of requests can comprise different security operations including, but not limited to, a request to (1) generate a random number, (2) generate a prime number, (3) perform modular exponentiation, (4) perform a hash operation, (5) generate keys for encryption/decryption, (6) perform a hash-message authentication code (H-MAC) operation, (7) perform a handshake hash operation and (8) perform a finish/verify operation.
Coprocessor 912 includes Peripheral Component Interconnect (PCI) unit 930, lightning data transport (LDT) unit 932, key unit 944, request unit 914, doorbell register 920, execution units 916A-916I, execution units 917A-917I, random number generator unit 918 and request buffer 922, which are coupled together. Additionally, PCI unit 930 and LDT unit 932 are coupled to system bus 910. PCI unit 930 and LDT unit 932 provide communication between the different components in coprocessor 912 and host memory 904, host processor 902 and request processing unit 934. While one embodiment is described in which PCI and LDT units are used to connect to a system bus, alternative embodiments could use different buses.
The number of execution units 916 and 917 and the number of random number generator units 918 are by way of example and not by way of limitation, as a lesser or greater number of such units can be included within coprocessor 912. In one embodiment, execution units 916-917 execute the data ciphering operations described above. For example, in an embodiment, execution unit 300 (illustrated in
Memory described herein includes a machine-readable medium on which is stored a set of instructions (i.e., software) embodying any one, or all, of the methodologies described herein. Software can reside, completely or at least partially, within this memory and/or within processors described herein. For the purposes of this specification, the term “machine-readable medium” shall be taken to include any mechanism that provides (i.e., stores and/or transmits) information in a form readable by a machine (e.g., a computer). For example, a machine-readable medium includes read only memory (“ROM”), random access memory (“RAM”), magnetic disk storage media, optical storage media, flash memory devices, and electrical, optical, acoustical, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.).
Thus, a method, apparatus and system for speculative execution for data ciphering have been described. Although the invention has been described with reference to specific exemplary embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the invention. For example, embodiments of the invention were described such that two portions of the ciphertext are generated for a given iteration. However, in other embodiments, a greater number of portions of the ciphertext can be generated for a given iteration. For example, if a memory access included a four-cycle latency, the memory pipeline can be fully utilized by generating four portions of ciphertext in a given iteration. Moreover, in an embodiment, the plaintext and ciphertext data structures could be stored contiguously within memory. Accordingly, a number of bytes from the plaintext data structure could be retrieved in a single memory access for use across a number of iterations. Additionally, a number of bytes that had been generated across a number of iterations could be stored into the ciphertext data structure in a single memory access. To further illustrate, embodiments of the invention described the generation of ciphertext based on bytes of the plaintext. However, embodiments of the invention are not so limited, as smaller and/or larger portions of the plaintext can be used to generate the ciphertext. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.
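To make the two-portions-per-iteration idea concrete, the following is a minimal Python sketch of the behavior, not the described hardware. The function names, the replay counter and the address-based hazard check are assumptions introduced for illustration. The sketch issues the S-box loads for the second byte before the stores of the first byte's swap, then checks whether those loads touched a location the swap overwrote. As a simplification, it repeats the loads immediately on a conflict rather than deferring them to a subsequent iteration.

```python
def key_schedule(key: bytes) -> list:
    """Standard RC4 key scheduling: build the 256-entry S-box."""
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]
    return s


def rc4_reference(key: bytes, data: bytes) -> bytes:
    """Textbook RC4: each swap completes before the next byte's loads."""
    s, i, j = key_schedule(key), 0, 0
    out = bytearray()
    for byte in data:
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) & 0xFF])
    return bytes(out)


def rc4_speculative(key: bytes, data: bytes):
    """Two bytes per iteration; returns (output, number of replayed loads)."""
    s, i, j = key_schedule(key), 0, 0
    out = bytearray()
    replays = 0
    n = 0
    while n < len(data):
        # Loads for the first byte.
        i1 = (i + 1) & 0xFF
        a = s[i1]
        j1 = (j + a) & 0xFF
        b = s[j1]
        paired = n + 1 < len(data)
        if paired:
            # Speculative loads for the second byte, issued BEFORE the
            # first swap's stores have been written back.
            i2 = (i1 + 1) & 0xFF
            c = s[i2]
            j2 = (j1 + c) & 0xFF
            d = s[j2]
        # Stores of the first swap, then the first output byte.
        s[i1], s[j1] = b, a
        out.append(data[n] ^ s[(a + b) & 0xFF])
        if not paired:
            break
        # Hazard check: a speculative load is stale if it read a location
        # that the first swap has just overwritten.
        if i2 == j1 or j2 == i1 or j2 == j1:
            c = s[i2]
            j2 = (j1 + c) & 0xFF
            d = s[j2]
            replays += 1
        # Stores of the second swap, then the second output byte.
        s[i2], s[j2] = d, c
        out.append(data[n + 1] ^ s[(c + d) & 0xFF])
        i, j = i2, j2
        n += 2
    return bytes(out), replays
```

Since RC4 is symmetric, applying either function to its own output with the same key returns the original input. With a longer memory latency, the same pattern would extend to more speculative bytes per iteration, at the cost of a wider hazard check.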
Claims
1. A computer-implemented method executing in a processor, the method comprising:
- receiving a request to perform data ciphering of plaintext; and
- processing the request based on a data structure stored in a memory coupled to the processor, wherein the processing comprises, performing a first access of data from the data structure; swapping the data from the first access; data ciphering a first portion of the plaintext based on the swapped data from the first access; performing a second access of data from the data structure prior to the swapping of the data from the first access; performing the following, upon determining that the data from the first access does not equal the data from the second access, swapping the data from the second access; and data ciphering a second portion of the plaintext based on the swapped data from the second access in an iteration including data ciphering a first portion of the plaintext based on the swapped data from the first access; and performing the following, upon determining that the data from the first access equals data from the second access, reexecuting the performing of the second access of data from the data structure in an iteration that is subsequent to determining that the data from the first access does not equal the data from the second access; swapping the data from the second access; and data ciphering the second portion of the plaintext based on the swapped data from the second access.
2. The computer-implemented method of claim 1, wherein the data ciphering comprises an RC4 operation.
3. The computer-implemented method of claim 1, wherein the data structure comprises a substitution-box.
4. A co-processor coupled to a host processor and a host memory, the co-processor comprising:
- an interface unit to retrieve a data encryption operation, a substitution (S)-box and plaintext associated with the data encryption operation from the host memory based on an instruction from the host processor; and
- an execution unit coupled to the interface unit, the execution unit comprising, a memory to store the plaintext and the S-box associated with the operation for the data cipher; a microcontroller unit to schedule the data cipher operation; and a RC4 unit to receive the data cipher operation, wherein the RC4 unit is to swap data stored in the S-box for data ciphering of a first portion of the plaintext wherein the RC4 unit is to read data stored in the S-box for data ciphering of a second portion of the plaintext, prior to completion of the swapping of data stored in the S-box for data ciphering of the first portion of the plaintext, and wherein the RC4 unit is to data cipher the second portion of the plaintext upon determining that the data being swapped in the S-box does not equal the data being read from the S-box.
5. The co-processor of claim 4, wherein the RC4 unit is to data cipher the first portion of the plaintext.
6. The co-processor of claim 4, wherein the RC4 unit is to swap data retrieved from the S-box for the data ciphering of the second portion of the plaintext upon determining that the data being swapped for the data ciphering of the first portion of the plaintext does not equal the data read from the S-box for data ciphering of the second portion of the plaintext.
7. A system comprising:
- a host processor;
- a host memory coupled to the host processor, the host memory to include a security operation, wherein the security operation includes a data cipher operation based on RC4, the host memory to include plaintext and a data structure for the data cipher operation;
- a co-processor coupled to the host processor, the co-processor comprising, an interface unit to retrieve the security operation from the host memory based on an instruction from the host processor; an execution unit coupled to the interface unit, the execution unit comprising, a memory to store the plaintext and the data structure associated with the data cipher operation; a microcontroller unit to store the data cipher operation in an execution queue; and an RC4 unit coupled to the execution queue, the RC4 unit to receive the data cipher operation, wherein the RC4 unit is to swap data stored in the S-box for data ciphering of a first portion of the plaintext and wherein the RC4 unit is to read data stored in the S-box for data ciphering of a second portion of the plaintext, prior to completion of the swapping of data stored in the S-box for data ciphering of the first portion of the plaintext, and wherein the RC4 unit is to swap data retrieved from the data structure for the data ciphering of the second portion of the plaintext upon determining that the data being swapped for the data ciphering of the first portion of the plaintext does not equal the data read from the data structure for data ciphering of the second portion of the plaintext.
8. The system of claim 7, wherein the RC4 unit is to data cipher the second portion of the plaintext upon determining that the data being swapped in the data structure does not equal the data being read from the data structure.
9. The system of claim 7, wherein the RC4 unit is to data cipher the first portion of the plaintext.
10. A machine-readable medium that provides instructions, which when executed by a machine, cause said machine to perform operations comprising:
- receiving a request to perform data ciphering of plaintext; and
- processing the request based on a data structure stored in a memory coupled to the processor, wherein the processing comprises, performing a first access of data from the data structure; swapping the data from the first access; data ciphering a first portion of the plaintext based on the swapped data from the first access; performing a second access of data from the data structure prior to the swapping of the data from the first access; performing the following, upon determining that the data from the first access does not equal the data from the second access, swapping the data from the second access; and data ciphering a second portion of the plaintext based on the swapped data from the second access in an iteration including data ciphering a first portion of the plaintext based on the swapped data from the first access; and performing the following, upon determining that the data from the first access equals data from the second access; reexecuting the performing of the second access of data from the data structure in an iteration that is subsequent to determining that the data from the first access does not equal the data from the second access; swapping the data from the second access; and data ciphering the second portion of the plaintext based on the swapped data from the second access.
11. The machine-readable medium of claim 10, wherein the data ciphering comprises an RC4 operation.
12. The machine-readable medium of claim 10, wherein the data structure comprises a substitution-box.
13. The machine-readable medium of claim 10, wherein processing the request for data ciphering of the plaintext comprises data ciphering the plaintext over a number of iterations and wherein the data ciphering of the first portion of the plaintext is in a same iteration as the data ciphering of the second portion of the plaintext.
Type: Grant
Filed: Mar 6, 2002
Date of Patent: Aug 21, 2007
Assignee: Cavium Networks, Inc. (Mountain View, CA)
Inventor: David A. Carlson (Haslet, TX)
Primary Examiner: Nasser Moazzami
Assistant Examiner: David Garcia Cervetti
Attorney: Blakely, Sokoloff, Taylor & Zafman LLP
Application Number: 10/092,328
International Classification: H04K 1/04 (20060101)