Byte Execution Unit for Carrying Out Byte Instructions in a Processor
A disclosed byte execution unit receives byte instruction information and two operands, and performs an operation specified by the byte instruction information upon one or both of the operands, thereby producing a result. The byte instruction specifies either a count ones in bytes operation, an average bytes operation, an absolute differences of bytes operation, or a sum bytes into halfwords operation. In one embodiment, the byte execution unit includes multiple byte units. Each byte unit includes multiple population counters, two compressor units, adder input multiplexer logic, adder logic, and result multiplexer logic. A data processing system is described including a processor coupled to a memory system. The processor includes the byte execution unit. The memory system includes a byte instruction, wherein the byte instruction specifies either the count ones in bytes operation, the average bytes operation, the absolute differences of bytes operation, or the sum bytes into halfwords operation.
This invention relates generally to data processing systems and, more particularly, to instruction execution units of processors of data processing systems.
BACKGROUND OF THE INVENTION
In many audio-visual or multimedia applications involving images, sound, and/or moving pictures (i.e., videos), the basic unit of data is the 8-bit byte. An 8-bit data byte can represent any one of 2^8=256 different binary levels, and two 8-bit bytes can represent any one of 2^16=65,536 different binary levels. The levels may be equally sized (linear quantization) or of different sizes (e.g., logarithmic quantization). For example, in the United States, telephone voice signals are typically encoded with logarithmic μ-law quantization.
Images and individual frames of moving pictures or videos are made up of two-dimensional arrays of picture elements (i.e., “pixels”) called bitmaps. Each pixel is typically represented by a collection of bits conveying intensity and/or color. For example, a single bit allows only two values (e.g., black and white), while 8 bits allow 2^8=256 different values (e.g., black, white, and 254 intermediate shades of gray).
The acronym “MPEG” is commonly used to refer to the family of standards developed by the Moving Picture Experts Group (MPEG) for coding audio-visual information (e.g., movies, video, music) in a digital compressed format. MPEG data compression has greatly facilitated the storing and distribution of digital video and audio signals.
In general, MPEG video data compression predicts motion from frame to frame in time, then uses discrete cosine transforms (DCTs) to organize redundancy in other dimensions (i.e., other “spatial directions”). Motion prediction is typically performed on 16×16 pixel blocks called “macroblocks,” and DCTs are performed on 8×8 pixel blocks of the macroblocks. For example, given a 16×16 macroblock in a current frame, an attempt is made to find a closely matching macroblock in a previous or future frame. If a close match is found, DCTs are performed on differences between the 8×8 pixel blocks of the current macroblock and the close match. On the other hand, if a close match is not found, DCTs are performed directly on the 8×8 pixel blocks of the current macroblock. The resulting DCT coefficients are then divided by a determined value (i.e., “quantized”) and Huffman coded using fixed tables.
In the MPEG standards, the fundamental unit of data is the 8-bit byte. Each pixel of a video frame typically has three color components, each represented by one or more bytes. For example, each pixel may be represented by a 24-bit red-green-blue (RGB) value having one byte for red, one byte for green, and one byte for blue.
It would thus be advantageous to have a computer system capable of efficiently operating on 8-bit data bytes.
SUMMARY OF THE INVENTION
A disclosed byte execution unit receives byte instruction information and two operands, and performs an operation specified by the byte instruction information upon one or both of the operands, thereby producing a result. The byte instruction specifies either a count ones in bytes operation, an average bytes operation, an absolute differences of bytes operation, or a sum bytes into halfwords operation. In one embodiment, the byte execution unit includes multiple byte units. Each byte unit includes multiple population counters, two compressor units, adder input multiplexer logic, adder logic, and result multiplexer logic.
A data processing system is described including a processor coupled to a memory system. The processor includes the byte execution unit. The memory system includes a byte instruction, wherein the byte instruction specifies either the count ones in bytes operation, the average bytes operation, the absolute differences of bytes operation, or the sum bytes into halfwords operation.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention may be understood by reference to the following description taken in conjunction with the accompanying drawings, in which like reference numerals identify similar elements, and in which:
In the following discussion, numerous specific details are set forth to provide a thorough understanding of the present invention. However, those skilled in the art will appreciate that the present invention may be practiced without such specific details. In other instances, well-known elements have been illustrated in schematic or block diagram form in order not to obscure the present invention in unnecessary detail. Additionally, for the most part, details concerning network communications, electromagnetic signaling techniques, and the like, have been omitted inasmuch as such details are not considered necessary to obtain a complete understanding of the present invention, and are considered to be within the understanding of persons of ordinary skill in the relevant art.
It is further noted that, unless indicated otherwise, all functions described herein may be performed in either hardware or software, or some combination thereof. In a preferred embodiment, however, the functions are performed by a processor, such as a computer or an electronic data processor, in accordance with code, such as computer program code, software, and/or integrated circuits that are coded to perform such functions, unless indicated otherwise.
As indicated in
In the embodiment of
In the embodiment of
In one embodiment, the instruction set executable by the processor 102 of
The source register 2 field 202 specifies a register of the register file 112 of
In one embodiment, the byte instruction 110 is the “count ones in bytes” instruction having the mnemonic “CNTB.” The opcode field 200 of the CNTB instruction is an 11-bit field identifying the instruction as the CNTB instruction, the source register 2 field 202 is ignored, the source register 1 field 204 specifies a source register “RA,” and the destination register field 206 specifies a destination register “RT.”
In one embodiment, the register file 112 of
An assembly language instruction using the CNTB instruction is expressed “CNTB RT,RA” wherein the RT register is the destination register and the RA register is the source register. In general, execution of the CNTB instruction involves carrying out the following operations for each of the 16 byte slots of the source register RA and the destination register RT: (i) count the number of logic one bits in a byte slot of the source register RA, and (ii) store the count in the corresponding byte slot of the destination register RT. The following pseudo code expresses the operation the processor 102 of
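As an illustration only, the per-slot behavior of CNTB can be modeled in Python. This sketch assumes the 128-bit register is represented as a 16-element `bytes` object whose element 0 is byte slot RA[0]; the function name `cntb` is hypothetical and does not appear in the disclosure.

```python
def cntb(ra: bytes) -> bytes:
    """Hypothetical model of CNTB RT,RA: count the one bits in each byte slot."""
    if len(ra) != 16:
        raise ValueError("RA must hold exactly 16 byte slots")
    # bin(b).count("1") gives the number of logic ones in one 8-bit slot (0..8),
    # so every count fits in the corresponding 8-bit slot of RT.
    return bytes(bin(b).count("1") for b in ra)


rt = cntb(bytes([0x00, 0xFF, 0x0F, 0x81] + [0] * 12))
print(list(rt[:4]))  # [0, 8, 4, 2]
```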
In another embodiment, the byte instruction 110 is the “average bytes” instruction having the mnemonic “AVGB.” The opcode field 200 of the AVGB instruction is an 11-bit field identifying the instruction as the AVGB instruction, the source register 2 field 202 specifies a source register “RB,” the source register 1 field 204 specifies the source register RA, and the destination register field 206 specifies the destination register RT.
As described above, the source register RA is a 128-bit register containing 16 8-bit byte slots referred to as RA[0] through RA[15], and the destination register RT is a 128-bit register containing 16 8-bit byte slots referred to as RT[0] through RT[15]. Similarly, the source register RB is a 128-bit register containing 16 8-bit byte slots referred to as RB[0] through RB[15].
An assembly language instruction using the AVGB instruction is expressed “AVGB RT,RA,RB” wherein the RT register is the destination register and the RA and RB registers are the source registers. In general, execution of the AVGB instruction involves carrying out the following operations for each of the 16 byte slots of the source and destination registers: (i) compute the average of the values stored in the corresponding byte slots of the RA and RB source registers, rounding upward when their sum is odd, and (ii) store the average in the corresponding byte slot of the destination register RT. The following pseudo code expresses the operation the processor 102 of
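A comparable Python sketch of AVGB follows. It assumes unsigned byte values and uses the rounding rule RT=(RA+RB+1)>>1 that this description gives for AVGB; the function name `avgb` and the `bytes` register representation are illustrative assumptions.

```python
def avgb(ra: bytes, rb: bytes) -> bytes:
    """Hypothetical model of AVGB RT,RA,RB: rounded average of each byte pair."""
    if len(ra) != 16 or len(rb) != 16:
        raise ValueError("RA and RB must each hold exactly 16 byte slots")
    # The 9-bit intermediate sum a + b + 1 is computed exactly, and the shift
    # brings the rounded average back into the 0..255 range of one byte slot.
    return bytes((a + b + 1) >> 1 for a, b in zip(ra, rb))


rt = avgb(bytes([1, 255, 10] + [0] * 13), bytes([2, 255, 11] + [0] * 13))
print(list(rt[:3]))  # [2, 255, 11]
```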
In another embodiment, the byte instruction 110 is the “absolute difference of bytes” instruction having the mnemonic “ABSDB.” The opcode field 200 of the ABSDB instruction is an 11-bit field identifying the instruction as the ABSDB instruction, the source register 2 field 202 specifies the source register RB, the source register 1 field 204 specifies the source register RA, and the destination register field 206 specifies the destination register RT.
An assembly language instruction using the ABSDB instruction is expressed “ABSDB RT,RA,RB” wherein the RT register is the destination register and the RA and RB registers are the source registers. In general, execution of the ABSDB instruction involves carrying out the following operations for each of 16 byte slots of the source and destination registers: (i) subtract a value stored in a byte slot of the RA source register from a value stored in the corresponding byte slot of the RB source register, (ii) compute an absolute value of a result of the subtraction operation, and (iii) store the absolute value of the result of the subtraction operation in the corresponding byte slot of the destination register. The following pseudo code expresses the operation the processor 102 of
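The ABSDB semantics can be sketched the same way. This Python model assumes the byte slots hold unsigned values (0 to 255); the function name `absdb` is hypothetical. Because the absolute value is taken, subtracting RA from RB or RB from RA gives the same result.

```python
def absdb(ra: bytes, rb: bytes) -> bytes:
    """Hypothetical model of ABSDB RT,RA,RB: |RB[i] - RA[i]| for each byte slot."""
    if len(ra) != 16 or len(rb) != 16:
        raise ValueError("RA and RB must each hold exactly 16 byte slots")
    # With unsigned bytes the difference lies in -255..255, so its absolute
    # value always fits in one 8-bit slot of RT.
    return bytes(abs(b - a) for a, b in zip(ra, rb))


rt = absdb(bytes([10, 3, 0] + [0] * 13), bytes([3, 10, 255] + [0] * 13))
print(list(rt[:3]))  # [7, 7, 255]
```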
In another embodiment, the byte instruction 110 is the “sum bytes into half words” instruction having the mnemonic “SUMB.” The opcode field 200 of the SUMB instruction is an 11-bit field identifying the instruction as the SUMB instruction, the source register 2 field 202 specifies the source register RB, the source register 1 field 204 specifies the source register RA, and the destination register field 206 specifies the destination register RT.
An assembly language instruction using the SUMB instruction is expressed “SUMB RT,RA,RB” wherein the RT register is the destination register and the RA and RB registers are the source registers. In general, execution of the SUMB instruction involves carrying out the following operations for each of the 4 32-bit word slots, each of which holds two 16-bit (double byte) halfwords: (i) compute a first sum of the values stored in the 4 consecutive byte slots of the source register RB that make up the word slot, (ii) store the first sum in the first 2 consecutive byte slots (the first halfword) of the word slot of the destination register RT, (iii) compute a second sum of the values stored in the 4 consecutive byte slots of the source register RA that make up the word slot, and (iv) store the second sum in the next 2 consecutive byte slots (the second halfword) of the word slot of the destination register RT. The following pseudo code expresses the operation the processor 102 of
RT[0:1]=RB[0]+RB[1]+RB[2]+RB[3]
RT[2:3]=RA[0]+RA[1]+RA[2]+RA[3]
RT[4:5]=RB[4]+RB[5]+RB[6]+RB[7]
RT[6:7]=RA[4]+RA[5]+RA[6]+RA[7]
RT[8:9]=RB[8]+RB[9]+RB[10]+RB[11]
RT[10:11]=RA[8]+RA[9]+RA[10]+RA[11]
RT[12:13]=RB[12]+RB[13]+RB[14]+RB[15]
RT[14:15]=RA[12]+RA[13]+RA[14]+RA[15]
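For comparison with the pseudo code above, here is a Python sketch of SUMB that returns the eight 16-bit halfwords of RT. The names `sumb` and `pack_halfwords` are hypothetical, and the big-endian packing of each halfword into two byte slots is an assumption based on the MSB-first bit numbering used in this document.

```python
from typing import List


def sumb(ra: bytes, rb: bytes) -> List[int]:
    """Hypothetical model of SUMB RT,RA,RB returning RT's eight halfwords."""
    if len(ra) != 16 or len(rb) != 16:
        raise ValueError("RA and RB must each hold exactly 16 byte slots")
    halfwords = []
    for w in range(4):                      # the four 32-bit word slots
        lo, hi = 4 * w, 4 * w + 4
        halfwords.append(sum(rb[lo:hi]))    # RT[4w:4w+1]   = sum of RB[4w..4w+3]
        halfwords.append(sum(ra[lo:hi]))    # RT[4w+2:4w+3] = sum of RA[4w..4w+3]
    return halfwords


def pack_halfwords(halfwords: List[int]) -> bytes:
    # Big-endian packing places the more significant byte in the lower-numbered
    # byte slot, in keeping with the MSB-first numbering convention.
    return b"".join(h.to_bytes(2, "big") for h in halfwords)


hw = sumb(bytes([255] * 16), bytes(range(16)))
print(hw)  # [6, 1020, 22, 1020, 38, 1020, 54, 1020]
```

The largest possible halfword is 4 × 255 = 1020, so a 16-bit slot always has room for the sum.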
In the embodiment described below, ordered sets of bits are numbered such that higher valued (i.e., more significant) bits have lower numbers than lower valued (i.e., less significant) bits. For example, the A[0:31] operand includes bits A[0] through A[31], wherein the bit A[0] is the highest valued (most significant) bit and bit A[31] is the lowest valued (least significant) bit.
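Because many tools number bits from the least significant end, it can help to translate this MSB-first numbering explicitly. The functions `bit` and `field` below are hypothetical conveniences, not part of the disclosure; `width` is the total width of the ordered set (for example, 32 for A[0:31]).

```python
def bit(value: int, index: int, width: int) -> int:
    """Return bit `index` of a `width`-bit value, where bit 0 is the MSB."""
    if not 0 <= index < width:
        raise IndexError("bit index out of range")
    return (value >> (width - 1 - index)) & 1


def field(value: int, first: int, last: int, width: int) -> int:
    """Return value[first:last] (inclusive) under MSB-first numbering."""
    if not 0 <= first <= last < width:
        raise IndexError("field out of range")
    length = last - first + 1
    return (value >> (width - 1 - last)) & ((1 << length) - 1)


a = 0x12345678
print(hex(field(a, 0, 7, 32)), hex(field(a, 24, 31, 32)), bit(a, 3, 32))  # 0x12 0x78 1
```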
In one embodiment, the four byte units 300A-300D are substantially identical and operate similarly.
The 4:2 compressor 402A receives the B[0:31] operand and produces output signals “F0[0:7],” “F0[8],” and “F1[0:7]” wherein the F0[0] signal conveys a carry value resulting from an addition operation (B[0]+B[8]+B[16]), the F0[1:8] signal conveys a sum vector (see
The portion of the adder input MUX logic 404 shown in
Table 1 below shows the output signals produced by the portion of the adder input MUX logic 404 shown in
As used herein, the suffix “_b” following a signal name indicates the logical complement of the signal. For example, the B_b[0:7] signal is the bitwise logical complement of the B[0:7] signal. The ‘+’ symbols in Table 1 above represent a concatenation operation.
In
The compound adder 408A produces the SUM/S0[0:8] signal by summing X0, Y0, and C0, i.e., SUM/S0[0:8]=X0[0:7]+Y0[0:7]+C0. The most significant bit S0[0] is the carry out, and bits S0[1:8] represent the 8-bit sum. The SUM1/T0[0:8] signal is produced by summing X0, Y0, and a carry in of ‘1’, i.e., SUM1/T0[0:8]=X0[0:7]+Y0[0:7]+1.
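The behavior of the compound adders can be captured in a few lines. The following Python sketch assumes X and Y are 8-bit unsigned integers and C is a single bit, and returns S and T as 9-bit integers; the name `compound_add` is hypothetical. A hardware compound adder produces both results at once, whereas this model simply computes two sums.

```python
from typing import Tuple


def compound_add(x: int, y: int, c: int) -> Tuple[int, int]:
    """Hypothetical model of an 8-bit compound adder.

    Returns (S, T), where S = X + Y + C and T = X + Y + 1, both 9 bits wide.
    Under MSB-first numbering, S[0] is the carry out and S[1:8] the 8-bit sum.
    """
    if not (0 <= x <= 0xFF and 0 <= y <= 0xFF and c in (0, 1)):
        raise ValueError("X and Y must be 8-bit values and C a single bit")
    s = x + y + c      # at most 255 + 255 + 1 = 511, so 9 bits suffice
    t = x + y + 1
    return s, t


s, t = compound_add(200, 100, 0)
print(s >> 8, s & 0xFF, t)  # 1 44 301
```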
The 8-bit compound adder 408B receives the X1, Y1, and C1 signals produced by the portion of the MUX logic 404 shown in
The compound adder 408B produces the SUM/S1[0:8] signal by summing X1, Y1, and C1, i.e., SUM/S1[0:8]=X1[0:7]+Y1[0:7]+C1. The most significant bit S1[0] is the carry out, and bits S1[1:8] represent the 8-bit sum. The SUM1/T1[0:8] signal is produced by summing X1, Y1, and a carry in of ‘1’, i.e., SUM1/T1[0:8]=X1[0:7]+Y1[0:7]+1.
The portion of the result MUX logic 410 shown in
Table 2 below shows the output signals produced by the portion of the result MUX logic 410 shown in
It is noted that, as indicated in Table 2 above, when a SUMB instruction is fetched and executed, the 8-bit compound adder 408B produces a 10-bit result conveyed by the concatenated RESULT[0:7] and RESULT[8:15] signals.
A second population counter unit 400B, a second 4:2 compressor 402B, and another portion of the adder input MUX logic 404 are shown in
The 4:2 compressor 402B receives the A[0:31] operand and produces output signals “F2[0:7],” “F2[8],” and “F3[0:7]” wherein the F2[0] signal conveys a carry value resulting from an addition operation (A[0]+A[8]+A[16]), the F2[1:8] signal conveys a sum vector, the F2[8] signal conveys a sum value resulting from an addition operation (A[7]+A[15]+A[23]+A[31]), and the F3[0:7] signal conveys a carry vector.
The portion of the adder input MUX logic 404 shown in
Table 3 below shows the output signals produced by the multiplexers of the portion of the MUX logic 404 shown in
In Table 3 above, the ‘+’ symbol represents the concatenation operation. In
The compound adder 408C produces the SUM/S2[0:8] signal by summing X2, Y2, and C2, i.e., SUM/S2[0:8]=X2[0:7]+Y2[0:7]+C2. The most significant bit S2[0] is the carry out, and bits S2[1:8] represent the 8-bit sum. The SUM1/T2[0:8] signal is produced by summing X2, Y2, and a carry in of ‘1’, i.e., SUM1/T2[0:8]=X2[0:7]+Y2[0:7]+1.
The 8-bit compound adder 408D receives the X3, Y3, and C3 signals produced by the portion of the MUX logic 404 shown in
The compound adder 408D produces the SUM/S3[0:8] signal by summing X3, Y3, and C3, i.e., SUM/S3[0:8]=X3[0:7]+Y3[0:7]+C3. The most significant bit S3[0] is the carry out, and bits S3[1:8] represent the 8-bit sum. The SUM1/T3[0:8] signal is produced by summing X3, Y3, and a carry in of ‘1’, i.e., SUM1/T3[0:8]=X3[0:7]+Y3[0:7]+1.
The portion of the result MUX logic 410 shown in
Table 4 below shows the output signals produced by the portion of the result MUX logic 410 shown in
It is noted that, as indicated in Table 4 above, when a SUMB instruction is fetched and executed, the 8-bit compound adder 408D produces a 10-bit result conveyed by the concatenated RESULT[16:23] and RESULT[24:31] signals.
The byte execution unit 106 of
Each of the four byte units also includes four 8-bit compound adders that constitute adder logic. In general, the adder logic receives signals produced by the pre-processing logic and performs an addition operation upon the received signals, thereby producing a result. In some cases the result includes a sum signal and a sum+1 signal.
Each of the four byte units also includes result MUX logic forming post-processing logic. The post-processing logic receives the result produced by the adder logic and performs an operation upon the result dependent upon control signals produced by the corresponding control unit.
For example, as described above, an assembly language instruction using the “absolute differences of bytes” instruction ABSDB is expressed “ABSDB RT,RA,RB” wherein the RT register is the destination register and the RA and RB registers are the source registers. In general, for each byte, RT=ABS(RA−RB). The ABSDB instruction may be implemented as RT=((RA+RB_b+EAC) XOR NOT(EAC)), where EAC=end-around-carry=CARRY(RA+RB_b) and NOT(EAC) is replicated across all eight bit positions, so that the sum is inverted when no carry out occurs and passed through unchanged when a carry out occurs. In this situation, the pre-processing logic may produce values X, Y, and C (i.e., CIN) for an 8-bit compound adder such that X=RA, Y=NOT(RB), and CIN=0. In general, the 8-bit compound adder receives the X, Y, and CIN signals, and produces signals “S[0:8]” and “T[0:8]” such that S[0:8]=X+Y and T[0:8]=X+Y+1. The post-processing logic produces a “RESULT” signal such that if S[0]=0 then RESULT=NOT(S[1:8]) else RESULT=T[1:8]. This works because S[0]=1 exactly when RA>RB, in which case T[1:8]=RA−RB; otherwise NOT(S[1:8])=RB−RA.
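The end-around-carry scheme can be checked exhaustively in software. The Python sketch below models one byte slot, following the X, Y, CIN assignments and the post-processing rule described above; the function name `absdb_eac` is hypothetical, and byte values are assumed to be unsigned.

```python
def absdb_eac(ra: int, rb: int) -> int:
    """One ABSDB byte slot computed with the end-around-carry scheme."""
    x, y, cin = ra, ~rb & 0xFF, 0      # pre-processing: X=RA, Y=NOT(RB), CIN=0
    s = x + y + cin                    # compound adder output S[0:8]
    t = x + y + 1                      # compound adder output T[0:8]
    if s >> 8 == 0:                    # S[0]=0: no end-around carry, RA <= RB
        return ~s & 0xFF               # RESULT = NOT(S[1:8]) = RB - RA
    return t & 0xFF                    # RESULT = T[1:8] = RA - RB


# Exhaustive check against the reference definition |RA - RB|.
assert all(absdb_eac(a, b) == abs(a - b) for a in range(256) for b in range(256))
```

The check illustrates why no separate negation step is needed: the carry out alone selects between the inverted sum S and the incremented sum T, both of which the compound adder already provides.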
The “average of bytes” instruction AVGB may be expressed “AVGB RT,RA,RB” wherein the RT register is the destination register and the RA and RB registers are the source registers. In general, for each byte, RT=(RA+RB+1)>>1. The pre-processing logic may produce the values X, Y, and CIN for the 8-bit compound adder such that X=RA, Y=RB, and CIN=1. As described above, the 8-bit compound adder receives the X, Y, and CIN signals, and produces the signals S[0:8] and T[0:8] such that S[0:8]=X+Y+CIN and T[0:8]=X+Y+1; with CIN=1, S[0:8]=RA+RB+1. The post-processing logic produces the RESULT signal such that RESULT=S[0:7], i.e., the upper eight bits of the 9-bit sum, which implements the right shift by one.
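A Python sketch of one AVGB byte slot, assuming (as stated for compound adders 408A through 408D) that the carry in is folded into S. The name `avgb_slot` is hypothetical. Selecting the upper eight of the nine sum bits is the hardware form of the right shift by one, so no shifter is needed.

```python
def avgb_slot(ra: int, rb: int) -> int:
    """One AVGB byte slot computed through the compound adder's S output."""
    x, y, cin = ra, rb, 1          # pre-processing: X=RA, Y=RB, CIN=1
    s = x + y + cin                # S[0:8] = X + Y + CIN, a 9-bit value
    return s >> 1                  # RESULT = S[0:7], the upper eight bits of S


# Compare with "floor of the average, plus one when the sum is odd".
assert all(avgb_slot(a, b) == (a + b) // 2 + (a + b) % 2
           for a in range(256) for b in range(256))
```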
The “count ones in bytes” instruction CNTB may be expressed “CNTB RT,RA” wherein the RT register is the destination register and the RA register is the source register. In general, for each byte, RT=COUNT_ONES(RA). The CNTB instruction may be implemented by counting the number of logic ones in each 4-bit half of RA and adding the two counts. (This approach allows more generalized hardware to be used.) The pre-processing logic may produce the values X, Y, and CIN for the 8-bit compound adder such that X=COUNT_ONES(RA[0:3]), Y=COUNT_ONES(RA[4:7]), and CIN=0. As described above, the 8-bit compound adder receives the X, Y, and CIN signals, and produces the signals S[0:8] and T[0:8] such that S[0:8]=X+Y and T[0:8]=X+Y+1. Because each count is at most 4, the sum is at most 8 and never produces a carry out. The post-processing logic produces the RESULT signal such that RESULT=S[1:8].
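The nibble-based CNTB decomposition can be sketched in Python as follows. The function `popcount4` is a hypothetical stand-in for one 4-bit population counter, and `cntb_slot` models one byte slot; under the MSB-first numbering, RA[0:3] is the high nibble.

```python
def popcount4(nibble: int) -> int:
    """Stand-in for a 4-bit population counter: number of ones in 0..15."""
    return (nibble & 1) + ((nibble >> 1) & 1) + ((nibble >> 2) & 1) + ((nibble >> 3) & 1)


def cntb_slot(ra: int) -> int:
    """One CNTB byte slot: add the population counts of the two nibbles."""
    x = popcount4(ra >> 4)         # X = COUNT_ONES(RA[0:3]), the high nibble
    y = popcount4(ra & 0x0F)       # Y = COUNT_ONES(RA[4:7]), the low nibble
    s = x + y + 0                  # S[0:8] = X + Y with CIN = 0
    return s & 0xFF                # RESULT = S[1:8]; S[0] is always 0 here


assert all(cntb_slot(b) == bin(b).count("1") for b in range(256))
```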
The “sum bytes into half words” instruction SUMB may be expressed “SUMB RT,RA,RB” wherein the RT register is the destination register and the RA and RB registers are the source registers. In general, for each 32-bit word slot (bits numbered 0 through 31 within the slot):
RT[0:15]=RB[0:7]+RB[8:15]+RB[16:23]+RB[24:31]
RT[16:31]=RA[0:7]+RA[8:15]+RA[16:23]+RA[24:31]
A 4:2 compressor receives four 8-bit input vectors and produces two intermediate result vectors: a 9-bit sum vector F0[0:8] and an 8-bit carry vector F1[0:7]. The 8-bit compound adder receives F0[0:7] as X and F1[0:7] as Y, and computes the signal S[0:8] such that S[0:8]=F0[0:7]+F1[0:7]. The post-processing logic produces a 10-bit result signal “R[0:9]” such that R[0:9]={S[0:8], F0[8]}, where the braces denote concatenation; that is, the least significant bit F0[8] bypasses the adder and is appended below the 9-bit sum.
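The following Python sketch shows why these widths suffice for SUMB. The internal structure of the 4:2 compressor is not given in this text, so the sketch assumes a common construction, two chained 3:2 carry-save adders; the names `csa` and `sumb_word` are hypothetical. Because a carry-save stage's carry vector always has a zero least significant bit, the sum of the four bytes equals 2×(F0[0:7]+F1[0:7])+F0[8], so only F0[0:7] and F1[0:7] need to pass through the 8-bit adder.

```python
from typing import Tuple


def csa(a: int, b: int, c: int) -> Tuple[int, int]:
    """3:2 carry-save adder on bit vectors: a + b + c == sum_vec + carry_vec."""
    sum_vec = a ^ b ^ c
    carry_vec = ((a & b) | (a & c) | (b & c)) << 1
    return sum_vec, carry_vec


def sumb_word(b0: int, b1: int, b2: int, b3: int) -> int:
    """Sum four bytes via a 4:2 compressor, a compound adder, and concatenation."""
    s1, c1 = csa(b0, b1, b2)
    s2, c2 = csa(s1, c1, b3)       # two chained 3:2 stages act as a 4:2 compressor
    f0 = s2                        # 9-bit sum vector F0[0:8]; F0[8] is its LSB
    f1 = c2 >> 1                   # 8-bit carry vector F1[0:7]; c2's LSB is always 0
    s = (f0 >> 1) + f1             # compound adder: S[0:8] = F0[0:7] + F1[0:7]
    return (s << 1) | (f0 & 1)     # R[0:9] = {S[0:8], F0[8]}


print(sumb_word(255, 255, 255, 255))  # 1020
```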
The particular embodiments disclosed above are illustrative only, as the invention may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. Furthermore, no limitations are intended to the details of construction or design herein shown, other than as described in the claims below. It is therefore evident that the particular embodiments disclosed above may be altered or modified and all such variations are considered within the scope and spirit of the invention. Accordingly, the protection sought herein is as set forth in the claims below.
Claims
1. A byte execution unit, comprising:
- logic coupled to receive byte instruction information, to receive a first operand from a first source register, to receive a second operand from a second source register, and configured to perform an operation specified by the byte instruction information upon at least one of the first operand or the second operand, thereby producing a result in a destination register, wherein the byte instruction information specifies either a count ones in bytes operation, an average bytes operation, an absolute differences of bytes operation, or a sum bytes into halfwords operation,
- wherein responsive to a count ones in bytes operation, the byte execution unit counts a number of logical one bits in each byte of at least the first operand and stores each result in a corresponding byte of the destination register;
- wherein responsive to an average bytes operation, the byte execution unit averages each byte of the first operand with a corresponding byte of the second operand and stores each result in a corresponding byte of the destination register;
- wherein responsive to an absolute differences of bytes operation, the byte execution unit subtracts each byte of the first operand from a corresponding byte of the second operand to form an intermediate result, determines an absolute value of each intermediate result to form a final result, and stores each final result in a corresponding byte of the destination register; and
- wherein responsive to a sum bytes into halfwords operation, the byte execution unit sums a number of corresponding one-byte portions of the first operand or the second operand and stores each result in a corresponding halfword of the destination register.
2. The byte execution unit as recited in claim 1, wherein each of the first operand and the second operand comprises a plurality of bits, and wherein the bits of each of the first operand and the second operand are grouped to form a plurality of corresponding 8-bit bytes.
3. The byte execution unit as recited in claim 2, wherein each of the first operand and the second operand comprises 128 bits, and wherein the bits of each of the first operand and the second operand are grouped to form 16 corresponding bytes.
4-7. (canceled)
8. A byte execution unit, comprising:
- pre-processing logic coupled to receive a plurality of operands and configured to perform an operation upon the operands dependent upon an operation specified by a byte instruction, thereby producing an intermediate result;
- adder logic coupled to receive the intermediate result and configured to perform an addition operation upon the intermediate result, thereby producing a sum and a sum+1;
- post-processing logic coupled to receive the sum and sum+1 and configured to perform an operation upon the sum and sum+1 dependent upon the operation specified by a byte instruction, thereby producing a result; and
- a control unit coupled to the pre-processing logic, the adder logic, and the post-processing logic wherein responsive to the byte instruction, the control unit sets control signals to configure the pre-processing logic, the adder logic, and the post-processing logic to perform either a count ones in bytes operation, an average bytes operation, an absolute differences of bytes operation, or a sum bytes into halfwords operation.
9. (canceled)
10. (canceled)
11. The byte execution unit as recited in claim 8, wherein the pre-processing logic comprises population counter logic coupled to receive the operands and configured to produce population output signals indicative of numbers of logic ones in portions of the operands.
12. The byte execution unit as recited in claim 8, wherein the pre-processing logic comprises compressor logic coupled to receive the operands and configured to perform a compression function.
13. The byte execution unit as recited in claim 8, wherein the post-processing logic comprises end-around carry logic configured to perform an end-around carry function.
14. The byte execution unit as recited in claim 8, wherein the post-processing logic is configured to perform bit shift operations.
15-28. (canceled)
29. A data processing system, comprising:
- a memory system comprising a byte instruction, wherein the byte instruction specifies either a count ones in bytes operation, an average bytes operation, an absolute differences of bytes operation, or a sum bytes into halfwords operation; and
- a processor coupled to the memory system and configured to fetch and execute instructions from the memory system, wherein the processor comprises: a byte execution unit coupled to receive byte instruction information, to receive a first operand from a first source register, and to receive a second operand from a second source register, and configured to perform an operation specified by the byte instruction information upon at least one of the first operand or the second operand, thereby producing a result in a destination register, wherein responsive to a count ones in bytes operation, the byte execution unit counts a number of logical one bits in each byte of at least the first operand and stores each result in a corresponding byte of the destination register; wherein responsive to an average bytes operation, the byte execution unit averages each byte of the first operand with a corresponding byte of the second operand and stores each result in a corresponding byte of the destination register; wherein responsive to an absolute differences of bytes operation, the byte execution unit subtracts each byte of the first operand from a corresponding byte of the second operand to form an intermediate result, determines an absolute value of each intermediate result to form a final result, and stores each final result in a corresponding byte of the destination register; and wherein responsive to a sum bytes into halfwords operation, the byte execution unit sums a number of corresponding one-byte portions of the first operand or the second operand and stores each result in a corresponding halfword of the destination register.
30. The data processing system as recited in claim 29, wherein each of the first operand and the second operand comprises a plurality of bits, and wherein the bits of each of the first operand and the second operand are grouped to form a plurality of corresponding 8-bit bytes.
31. The data processing system as recited in claim 30, wherein each of the first operand and the second operand comprises 128 bits, and wherein the bits of each of the first operand and the second operand are grouped to form 16 corresponding bytes.
32. The byte execution unit of claim 8, wherein responsive to a count ones in bytes operation, the control unit sets control signals to configure the pre-processing logic, the adder logic, and the post-processing logic to count a number of logical one bits in each byte of at least the first operand and to store each result in a corresponding byte of the destination register.
33. The byte execution unit of claim 8, wherein responsive to an average bytes operation, the control unit sets control signals to configure the pre-processing logic, the adder logic, and the post-processing logic to average each byte of the first operand with a corresponding byte of the second operand and to store each result in a corresponding byte of the destination register.
34. The byte execution unit of claim 8, wherein responsive to an absolute differences of bytes operation, the control unit sets control signals to configure the pre-processing logic, the adder logic, and the post-processing logic to subtract each byte of the first operand from a corresponding byte of the second operand to form an intermediate result, to determine an absolute value of each intermediate result to form a final result, and to store each final result in a corresponding byte of the destination register.
35. The byte execution unit of claim 8, wherein responsive to a sum bytes into halfwords operation, the control unit sets control signals to configure the pre-processing logic, the adder logic, and the post-processing logic to sum a number of corresponding one-byte portions of the first operand or the second operand and to store each result in a corresponding halfword of the destination register.
36. The byte execution unit of claim 8, wherein each of the operands comprises a plurality of bits, and wherein the bits of each of the operands are grouped to form a plurality of corresponding 8-bit bytes.
37. The byte execution unit of claim 36, wherein each of the operands comprises 128 bits, and wherein the bits of each of the operands are grouped to form 16 corresponding bytes.
Type: Application
Filed: Nov 1, 2006
Publication Date: Mar 15, 2007
Inventors: Sang Dhong (Austin, TX), Hwa-Joon Oh (Austin, TX), Brad Michael (Cedar Park, TX), Silvia Mueller (St. Ingbert), Kevin Tran (Austin, TX)
Application Number: 11/555,513
International Classification: G06F 9/44 (20060101);