Distributing Of Vector Data To Vector Registers Patents (Class 712/4)
-
Patent number: 12135968
Abstract: Techniques for converting FP16 to BF8 using bias are described.
Type: Grant
Filed: December 26, 2020
Date of Patent: November 5, 2024
Assignee: Intel Corporation
Inventors: Alexander Heinecke, Naveen Mellempudi, Robert Valentine, Mark Charney, Christopher Hughes, Evangelos Georganas, Zeev Sperber, Amit Gradstein, Simon Rubanovich
-
Patent number: 12099439
Abstract: In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardware, logic, and memory layout functionality to allow for two point and two by two point lookups, and per memory bank load caching capabilities. In addition, decoupled accelerators may be used to offload VPU processing tasks to increase throughput and performance, and a hardware sequencer may be included in a DMA system to reduce programming complexity of the VPU and the DMA system. The DMA and VPU may execute a VPU configuration mode that allows the VPU and DMA to operate without a processing controller for performing dynamic region based data movement operations.
Type: Grant
Filed: August 2, 2021
Date of Patent: September 24, 2024
Assignee: NVIDIA Corporation
Inventors: Ching-Yu Hung, Ravi P Singh, Jagadeesh Sankaran, Yen-Te Shih, Ahmad Itani
-
Patent number: 11550584
Abstract: Various techniques for accelerating Smith-Waterman sequence alignments are provided. For example, threads in a group of threads are employed to use an interleaved cell layout to store relevant data in registers while computing sub-alignment data for one or more local alignment problems. In another example, specialized instructions that reduce the number of cycles required to compute each sub-alignment score are utilized. In another example, threads are employed to compute sub-alignment data for a subset of columns of one or more local alignment problems while other threads begin computing sub-alignment data based on partial result data received from the preceding threads. After computing a maximum sub-alignment score, a thread stores the maximum sub-alignment score and the corresponding position in global memory.
Type: Grant
Filed: September 30, 2021
Date of Patent: January 10, 2023
Assignee: NVIDIA CORPORATION
Inventors: Maciej Piotr Tyrlik, Ajay Sudarshan Tirumala, Shirish Gadre
-
Patent number: 11544214
Abstract: A computer processor comprising a vector unit is disclosed. The vector unit may comprise a vector register file comprising at least one register to hold a varying number of elements. The vector unit may further comprise a vector length register file comprising at least one register to specify the number of operations of a vector instruction to be performed on the varying number of elements in the at least one register of the vector register file. The computer processor may be implemented as a monolithic integrated circuit.
Type: Grant
Filed: May 12, 2015
Date of Patent: January 3, 2023
Assignee: Optimum Semiconductor Technologies, Inc.
Inventors: Mayan Moudgill, Gary J. Nacer, C. John Glossner, Arthur Joseph Hoane, Paul Hurtley, Murugappan Senthilvelan, Pablo Balzola, Vitaly Kalashnikov, Sitij Agrawal
-
Patent number: 11537399
Abstract: In an embodiment, a processor supports one or more compression assist instructions which may be employed in compression software to improve the performance of the processor when performing compression/decompression. That is, the compression/decompression task may be performed more rapidly and consume less power when the compression assist instructions are employed than when they are not. In some cases, the cost of a more effective, more complex compression algorithm may be reduced to the cost of a less effective, less complex compression algorithm.
Type: Grant
Filed: July 12, 2021
Date of Patent: December 27, 2022
Assignee: Apple Inc.
Inventors: Eric Bainville, Ali Sazegari
-
Patent number: 11468003
Abstract: A processor includes a scalar processor core and a vector coprocessor core coupled to the scalar processor core. The scalar processor core is configured to retrieve an instruction stream from program storage, and pass vector instructions in the instruction stream to the vector coprocessor core. The vector coprocessor core includes a register file, a plurality of execution units, and a table lookup unit. The register file includes a plurality of registers. The execution units are arranged in parallel to process a plurality of data values. The execution units are coupled to the register file. The table lookup unit is coupled to the register file in parallel with the execution units. The table lookup unit is configured to retrieve table values from one or more lookup tables stored in memory by executing table lookup vector instructions in a table lookup loop.
Type: Grant
Filed: September 21, 2020
Date of Patent: October 11, 2022
Assignee: Texas Instruments Incorporated
Inventors: Ching-Yu Hung, Shinri Inamori, Jagadeesh Sankaran, Peter Chang
-
Patent number: 11442726
Abstract: Vector pack and unpack instructions are described. An instruction to perform a conversion between one decimal format and another decimal format is executed, in which the one decimal format or the other decimal format is a zoned decimal format. The executing includes obtaining a value from at least one register specified using the instruction. At least a portion of the value is converted from the one decimal format to the other decimal format different from the one decimal format to provide a converted result. A result obtained from the converted result is written into a single register specified using the instruction.
Type: Grant
Filed: February 26, 2021
Date of Patent: September 13, 2022
Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Eric Mark Schwarz, Timothy Slegel, Jonathan D. Bradbury, Michael Klein, Reid Copeland, Xin Guo
-
Patent number: 11436010
Abstract: Disclosed embodiments relate to a new instruction for detecting conflicts in a set of vector elements. In one example, a system includes circuits to fetch, decode, and execute an instruction that includes an opcode, a destination vector identifier, and a source vector identifier, wherein the execution circuit is to, for each data element position of a source vector identified by the source vector identifier, determine a nearest matching data element position in the source vector storing a same data value as stored at the data element position, the nearest matching data element position located between the data element position and a least significant data element position of the source vector, and store, in a corresponding data element position of a destination vector identified by the destination vector identifier, a value identifying the determined nearest data element position.
Type: Grant
Filed: June 30, 2017
Date of Patent: September 6, 2022
Assignee: Intel Corporation
Inventors: Mikhail Plotnikov, Christopher J. Hughes, Andrey Naraikin
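The conflict-detection behavior in the abstract above can be illustrated with a short scalar model. This is only a sketch of one reading of the abstract, not the patented circuit: the function name `nearest_prior_match` and the `no_match` sentinel are hypothetical, and the abstract does not say what is stored when no earlier element matches.

```python
def nearest_prior_match(source, no_match=-1):
    """For each position i, find the closest lower position holding the same value.

    Position 0 is treated as the least significant element. `no_match` is an
    assumed sentinel; the abstract does not define the no-match result.
    """
    result = []
    for i, value in enumerate(source):
        match = no_match
        # Scan from the neighbor just below i down toward the least significant element.
        for j in range(i - 1, -1, -1):
            if source[j] == value:
                match = j
                break
        result.append(match)
    return result


print(nearest_prior_match([7, 3, 7, 7, 3]))  # [-1, -1, 0, 2, 1]
```

A real vector unit would compare all lanes in parallel; the nested loop here only states the result each lane is expected to hold.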
-
Patent number: 11182561
Abstract: A similar expression collecting unit (103) collects a word/phrase of an expression similar to an analysis viewpoint word/phrase and a word embedding corresponding to the word/phrase from distributed representations of words data. A dimension selecting unit (104) selects a dimension of a word embedding depending on the analysis viewpoint word/phrase, and compresses a word embedding corresponding to the word/phrase collected by the similar expression collecting unit (103) in the selected dimension.
Type: Grant
Filed: February 14, 2017
Date of Patent: November 23, 2021
Assignee: MITSUBISHI ELECTRIC CORPORATION
Inventors: Takeyuki Aikawa, Hiroyasu Itsui
-
Patent number: 11126588
Abstract: A processor circuit is disclosed. The processor circuit includes a data path block circuit configured to perform a data path operation to generate one or more results. The processor circuit also includes a data register files circuit, having a first register file, where the first register file has a first quantity of read and write ports. The data register files circuit also includes a second register file, where the second register file has a second different quantity of read and write ports. The processor circuit also includes an instruction decoder circuit configured to provide an operation signal to the data path block circuit, where the operation signal identifies a particular data path operation to be performed by the data path block circuit and identifies one or more read ports of the data register files circuit for retrieving data encoding the first and second operands.
Type: Grant
Filed: July 28, 2020
Date of Patent: September 21, 2021
Assignee: SHENZHEN GOODIX TECHNOLOGY CO., LTD.
Inventor: Jaehoon Heo
-
Patent number: 11113054
Abstract: Methods and apparatuses for determining set-membership using Single Instruction Multiple Data (“SIMD”) architecture are presented herein. Specifically, methods and apparatuses are discussed for compressing or packing, in parallel, multiple fixed-length values into a stream of multiple variable-length values using SIMD architecture.
Type: Grant
Filed: July 15, 2016
Date of Patent: September 7, 2021
Assignee: Oracle International Corporation
Inventors: Shasank K. Chavan, Phumpong Watanaprakornkul, Victor Chen
-
Patent number: 11080049
Abstract: Aspects for matrix multiplication in a neural network are described herein. The aspects may include a controller unit configured to receive a matrix-multiply-matrix (MM) instruction that includes a first starting address of a first matrix, a first size of the first matrix, a second starting address of a second matrix, and a second size of the second matrix; multiple computation modules configured to respectively multiply, in response to the MM instruction, row vectors of the first matrix with column vectors of the second matrix to generate one or more result elements; and an interconnection unit configured to combine the result elements to generate one or more row vectors of a result matrix.
Type: Grant
Filed: October 17, 2019
Date of Patent: August 3, 2021
Assignee: CAMBRICON TECHNOLOGIES CORPORATION LIMITED
Inventors: Xiao Zhang, Shaoli Liu, Tianshi Chen, Yunji Chen
-
Patent number: 11003450
Abstract: A vector data transfer instruction is provided for triggering a data transfer between storage locations corresponding to a contiguous block of addresses and multiple data elements of at least one vector register. The instruction specifies a start address of the contiguous block using a base register and an immediate offset value specified as a multiple of the size of the contiguous block of addresses. This is useful for loop unrolling which can help to improve performance of vectorised code by combining multiple iterations of a loop into a single iteration of an unrolled loop, to reduce the loop control overhead.
Type: Grant
Filed: September 14, 2016
Date of Patent: May 11, 2021
Assignee: ARM Limited
Inventor: Nigel John Stephens
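The addressing scheme in the abstract above, a base register plus an immediate counted in whole blocks, can be sketched as simple arithmetic. The function names and the `block_bytes` parameter below are hypothetical and chosen for illustration; the abstract does not give an encoding or a block size.

```python
def block_start_address(base, imm_offset, block_bytes):
    """Start address of a contiguous block: base plus the immediate scaled by block size."""
    return base + imm_offset * block_bytes


def unrolled_block_addresses(base, block_bytes, unroll):
    """Addresses touched by one iteration of a loop unrolled `unroll` times.

    Each unrolled transfer reuses the same base register and differs only in
    its immediate (0, 1, 2, ...), so no extra address arithmetic is needed
    between transfers.
    """
    return [block_start_address(base, k, block_bytes) for k in range(unroll)]


# Assumed example: 32-byte blocks, loop unrolled four times.
print([hex(a) for a in unrolled_block_addresses(0x1000, 32, 4)])
# ['0x1000', '0x1020', '0x1040', '0x1060']
```

Because the immediate is scaled by the block size rather than given in bytes, the same instruction encoding would keep working if the block size differed between implementations, which is presumably why the scaling is useful for unrolled vector loops.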
-
Patent number: 10963251
Abstract: There is provided an apparatus that includes a set of vector registers, each of the vector registers being arranged to store a vector comprising a plurality of portions. The set of vector registers is logically divided into a plurality of columns, each of the columns being arranged to store a same portion of each vector. The apparatus also includes register access circuitry that comprises a plurality of access blocks. Each access block is arranged to access a portion in a different column when accessing one of the vector registers than when accessing at least one other of the vector registers. The register access circuitry is arranged to simultaneously access portions in any one of: the vector registers and the columns.
Type: Grant
Filed: June 15, 2017
Date of Patent: March 30, 2021
Assignee: ARM Limited
Inventor: Thomas Christopher Grocutt
-
Patent number: 10795678
Abstract: Neural network processors including a vector register file (VRF) having a multi-port memory and related methods are provided. The processor may include tiles to process an N by N matrix of data elements and an N by 1 vector of data elements. The VRF may, in response to a write instruction, store N data elements in a multi-port memory and during each one of P clock cycles provide N data elements to each one of P input interface circuits of the multi-port memory comprising an input lane configured to carry L data elements in parallel. During each one of the P clock cycles the multi-port memory may be configured to receive N data elements via a selected at least one of the P input interface circuits. The VRF may include output interface circuits for providing N data elements in response to a read instruction.
Type: Grant
Filed: April 21, 2018
Date of Patent: October 6, 2020
Assignee: Microsoft Technology Licensing, LLC
Inventors: Jeremy Fowers, Kalin Ovtcharov, Eric S. Chung, Todd Michael Massengill, Ming Gang Liu, Gabriel Leonard Weisz
-
Patent number: 10713046
Abstract: Methods and a system for use on a memory controller are disclosed which provide atomic compute operations of any size using an asynchronous, pipelined message passing interface between clients and the memory controller.
Type: Grant
Filed: December 20, 2017
Date of Patent: July 14, 2020
Assignee: EXTEN TECHNOLOGIES, INC.
Inventors: Daniel B. Reents, Michael Enz, Ashwin Kamath
-
Patent number: 10671392
Abstract: Systems, apparatuses, and methods for performing delta decoding on packed data elements of a source and storing the results in packed data elements of a destination using a single packed delta decode instruction are described. A processor may include a decoder to decode an instruction, and an execution unit to execute the decoded instruction to calculate for each packed data element position of a source operand, other than a first packed data element position, a value that comprises a packed data element of that packed data element position and all packed data elements of packed data element positions that are of lesser significance, store a first packed data element from the first packed data element position of the source operand into a corresponding first packed data element position of a destination operand, and for each calculated value, store the value into a corresponding packed data element position of the destination operand.
Type: Grant
Filed: July 31, 2018
Date of Patent: June 2, 2020
Assignee: INTEL CORPORATION
Inventors: Elmoustapha Ould-Ahmed-Vall, Thomas Willhalm, Tracy Garrett Drysdale
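Delta decoding as described in the abstract above amounts to a running sum across element positions. The sketch below assumes that the value that "comprises" an element and all less significant elements is their arithmetic sum, and it ignores lane-width wraparound; `delta_decode` is a hypothetical name, not an instruction mnemonic from the patent.

```python
from itertools import accumulate


def delta_decode(source):
    """Rebuild absolute values from a first value followed by deltas.

    Element 0 is copied unchanged; every later element becomes the sum of
    itself and all elements at lower positions (an inclusive prefix sum).
    """
    return list(accumulate(source))


# First element is the starting value, the rest are differences.
print(delta_decode([10, 2, -1, 5]))  # [10, 12, 11, 16]
```

A hardware implementation would typically compute the prefix sum with a logarithmic number of shift-and-add steps rather than the sequential pass that `accumulate` performs.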
-
Patent number: 10671292
Abstract: A method of orchestrated shuffling of data in a non-uniform memory access device including a plurality of processing nodes that are connected by interconnects. The method includes running an application on a plurality of threads executing on the plurality of processing nodes. Data to be shuffled is identified from source threads running on source processing nodes among the processing nodes to target threads executing on target processing nodes among the processing nodes. The method further includes generating a plan for orchestrating the shuffling of the data among all of the memory devices associated with the threads and for simultaneously transmitting data over different interconnects to a plurality of different target processing nodes from a plurality of different source processing nodes. The data is shuffled among all of the memory devices based on the plan and each processing node is capable of accessing data from first and second local memory devices.
Type: Grant
Filed: December 1, 2017
Date of Patent: June 2, 2020
Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Yinan Li, Guy M. Lohman, Rene Mueller, Ippokratis Pandis, Vijayshankar Raman
-
Patent number: 10666984
Abstract: Methods and apparatus for coding video information having a plurality of video samples include partitioning samples into groups for transmission within a single clock cycle, wherein the samples are associated with a bit length B, and a group having a group size K. The sample group is mapped to a code number and coded to form a vector-based code comprising a first portion identifying a type of look-up-table used to perform the mapping, and a second portion representing the samples of the group. The look-up-table may be constructed based upon occurrence probabilities of different sample groups. In addition, different types of look-up-tables may be used for different B and K values.
Type: Grant
Filed: March 3, 2017
Date of Patent: May 26, 2020
Assignee: QUALCOMM Incorporated
Inventors: Vijayaraghavan Thirumalai, Natan Haim Jacobson, Rajan Laxman Joshi
-
Patent number: 10534544
Abstract: A method of orchestrated shuffling of data in a non-uniform memory access device that includes a plurality of processing nodes that are connected by interconnects. The method includes running an application on a plurality of threads executing on the plurality of processing nodes. Data to be shuffled is identified from source threads running on source processing nodes among the processing nodes to target threads executing on target processing nodes among the processing nodes. The method further includes generating a plan for orchestrating the shuffling of the data among all of the memory devices associated with the threads and for simultaneously transmitting data over different interconnects to a plurality of different target processing nodes from a plurality of different source processing nodes. The data is shuffled among all of the memory devices based on the plan. The data includes operand data and operational state data of the source threads.
Type: Grant
Filed: December 1, 2017
Date of Patent: January 14, 2020
Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Yinan Li, Guy M. Lohman, Rene Mueller, Ippokratis Pandis, Vijayshankar Raman
-
Patent number: 10521128
Abstract: A method of orchestrated shuffling of data in a non-uniform memory access device that includes a plurality of processing nodes connected by interconnects. The method includes running an application on a plurality of threads executing on the plurality of processing nodes. Data to be shuffled is identified from source threads running on source processing nodes among the processing nodes to target threads executing on target processing nodes among the processing nodes. The method further includes generating a plan for orchestrating the shuffling of the data among all of the memory devices associated with the threads and for simultaneously transmitting data over different interconnects to a plurality of different target processing nodes from a plurality of different source processing nodes. The data is shuffled among all of the memory devices based on the plan. Shifting the data-shifting table includes rotating a first ring with respect to a second ring.
Type: Grant
Filed: December 1, 2017
Date of Patent: December 31, 2019
Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Yinan Li, Guy M. Lohman, Rene Mueller, Ippokratis Pandis, Vijayshankar Raman
-
Patent number: 10430192
Abstract: Data processing apparatus comprises processing circuitry to selectively apply vector processing operations to one or more data items of a data vector comprising a plurality of data items at respective positions in the data vector, according to the state of respective predicate flags associated with the positions; the processing circuitry comprising: instruction decoder circuitry to decode program instructions; and instruction processing circuitry to execute instructions decoded by the instruction decoder circuitry; wherein the instruction decoder circuitry is responsive to a WHILE instruction and a CHANGE instruction, to control the instruction processing dependent upon a number of the predicate flags.
Type: Grant
Filed: July 28, 2016
Date of Patent: October 1, 2019
Assignee: ARM Limited
Inventors: Nigel John Stephens, Grigorios Magklis, Alejandro Martinez Vicente, Nathanael Premillieu, Mbou Eyole
-
Patent number: 10430229
Abstract: To use SIMD lanes efficiently for domain shader execution, domain point data from different domain shader patches may be packed together into a single SIMD thread. To generate an efficient code sequence, each domain point occupies one SIMD lane and all attributes for the domain point reside in their own partition of General Register File (GRF) space. This technique is called the multiple-patch SIMD dispatch mode.
Type: Grant
Filed: December 21, 2015
Date of Patent: October 1, 2019
Assignee: Intel Corporation
Inventors: Jayashree Venkatesh, Guei-Yuan Lueh, Subramaniam Maiyuran
-
Patent number: 10411881
Abstract: A data processing apparatus for rearranging multiple items of data to be input, includes a processor; a memory; and an input unit configured to receive as input a rearrangement number with which a rearrangement pattern of the data can be identified. The processor executes calculating a rearrangement destination for each of the items of the data based on the rearrangement number; and rearranging the data based on the rearrangement destinations.
Type: Grant
Filed: April 24, 2017
Date of Patent: September 10, 2019
Assignee: FUJI ELECTRIC CO., LTD.
Inventor: Kenji Takatsukasa
-
Patent number: 10372447
Abstract: An instruction defined to be a looping instruction is obtained and processed. A determination is made as to whether an obtained selected character is an expected selected character. Based on the obtained selected character being the expected selected character, an execution process is used that includes a sequence of operations to perform an operation, the sequence of operations replacing a loop and providing a non-looping sequence to perform the operation on up to a defined number of units of data. The sequence of operations is configured to repeat one or more times and to terminate based on the obtained selected character. Based on the obtained selected character being different than the expected selected character, an alternate execution process is chosen.
Type: Grant
Filed: December 7, 2018
Date of Patent: August 6, 2019
Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventor: Michael K. Gschwind
-
Patent number: 10372448
Abstract: An instruction defined to be a looping instruction is obtained and processed. A determination is made as to whether an obtained selected character is an expected selected character. Based on the obtained selected character being the expected selected character, an execution process is used that includes a sequence of operations to perform an operation, the sequence of operations replacing a loop and providing a non-looping sequence to perform the operation on up to a defined number of units of data. The sequence of operations is configured to repeat one or more times and to terminate based on the obtained selected character. Based on the obtained selected character being different than the expected selected character, an alternate execution process is chosen.
Type: Grant
Filed: December 7, 2018
Date of Patent: August 6, 2019
Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventor: Michael K. Gschwind
-
Patent number: 10268580
Abstract: Processors and methods implementing a machine instruction to perform cache line demotion on multiple cache lines to enable efficient sharing of cache lines between processor cores. One general aspect includes a processor comprising: a plurality of hardware processor cores, where each of the hardware processor cores is to include a first cache. The processor also includes a second cache, communicatively coupled to and shared by the plurality of hardware processor cores. The processor is to support a first machine instruction, the first machine instruction to include a vector register operand identifying a vector register which contains a plurality of data elements each used to identify a cache line. An execution of the first machine instruction by one of the plurality of hardware processor cores is to cause a plurality of identified cache lines to be demoted, such that the demoted cache lines are moved from the first cache to the second cache.
Type: Grant
Filed: September 30, 2016
Date of Patent: April 23, 2019
Assignee: Intel Corporation
Inventors: Kshitij A. Doshi, Namakkal N. Venkatesan, Ren Wang, Andrew J. Herdrich
-
Patent number: 10180838
Abstract: A processor fetches a multi-register gather instruction that includes a destination operand that specifies a destination vector register, and a source operand that identifies content that indicates multiple vector registers, a first set of indexes of each of the vector registers that each identifies a source data element, and a second set of indexes of the destination vector register for each identified source element. The instruction is decoded and executed, causing, for each of the first set of indexes of each of the vector registers, the source data element that corresponds to that index of that vector register to be stored in a set of destination data elements that correspond to the second set of identified indexes of the destination vector register for that source data element.
Type: Grant
Filed: September 19, 2017
Date of Patent: January 15, 2019
Assignee: Intel Corporation
Inventor: Ashish Jha
-
Patent number: 10152323
Abstract: Method, apparatus, and program means for shuffling data. The method of one embodiment comprises receiving a first operand having a set of L data elements and a second operand having a set of L control elements. For each control element, data from a first operand data element designated by the individual control element is shuffled to an associated resultant data element position if its flush to zero field is not set and a zero is placed into the associated resultant data element position if its flush to zero field is set.
Type: Grant
Filed: October 21, 2016
Date of Patent: December 11, 2018
Assignee: Intel Corporation
Inventors: Patrice L. Roussel, William W. Macy, Jr., Huy V. Nguyen, Eric L. Debes
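The shuffle in the abstract above can be modeled in a few lines. This is a sketch under stated assumptions: each control element is represented as a hypothetical `(index, flush_to_zero)` pair rather than a packed bit field, and a zero is written when the flush-to-zero flag is set, with the indexed source element copied otherwise.

```python
def shuffle(data, controls):
    """Build a result where each position is chosen by its control element.

    `controls[i]` is an assumed (index, flush_to_zero) pair: if the flag is
    set the result element is 0, otherwise it is data[index].
    """
    if len(data) != len(controls):
        raise ValueError("both operands must hold the same number of elements")
    result = []
    for index, flush_to_zero in controls:
        result.append(0 if flush_to_zero else data[index])
    return result


data = [10, 20, 30, 40]
controls = [(3, False), (0, False), (1, True), (1, False)]
print(shuffle(data, controls))  # [40, 10, 0, 20]
```

Note that a source element may be selected by several control elements or by none, so the operation is more general than a permutation.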
-
Patent number: 10111049
Abstract: A method, an apparatus, and a computer program product for wireless communication are provided. The apparatus may be a UE. The UE receives at least one of a unicast or a broadcast/multicast communication on a first frequency from a first cell of a serving eNB through a first receive chain. In addition, the UE receives at least one of broadcast/multicast signal, synchronization signal, or reference signal communication on a second frequency from a second cell of the serving eNB through a second receive chain without having received instruction from the serving eNB to receive the at least one of the broadcast/multicast signal, the synchronization signal, or the reference signal communication.
Type: Grant
Filed: July 8, 2013
Date of Patent: October 23, 2018
Assignee: QUALCOMM Incorporated
Inventors: Jack Shyh-Hurng Shauh, Srinivasan Balasubramanian, Kuo-Chun Lee, Sivaramakrishna Veerepalli, Jun Wang
-
Patent number: 10055225
Abstract: A processor fetches a multi-register scatter instruction that includes a source operand and a destination operand. The source operand specifies a source vector register that includes multiple source data elements. The destination operand identifies multiple destination data elements that each specify a destination vector register and an index into that destination vector register. The instruction is decoded and executed, causing, for each of those identified destination data elements, the one of the source data elements that is in a position in the source vector register that corresponds with a position of that destination data element to be stored in the destination vector register at the index specified by that destination data element.
Type: Grant
Filed: December 23, 2011
Date of Patent: August 21, 2018
Assignee: Intel Corporation
Inventor: Ashish Jha
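The multi-register scatter described in the abstract above can be illustrated with a small register-file model. The dictionary of named registers, the `(register, index)` descriptor tuples, and the function name are all assumptions made for this sketch; the abstract does not specify how destination descriptors are encoded.

```python
def multi_register_scatter(registers, source, destinations):
    """Scatter source elements into several destination registers.

    `destinations[p]` is an assumed (register_name, index) pair; the source
    element at the same position p is written to that register at that index.
    """
    if len(source) != len(destinations):
        raise ValueError("one destination descriptor is needed per source element")
    for position, (reg_name, index) in enumerate(destinations):
        registers[reg_name][index] = source[position]
    return registers


regs = {"v1": [0, 0, 0, 0], "v2": [0, 0, 0, 0]}
source = [11, 22, 33, 44]
destinations = [("v1", 2), ("v2", 0), ("v1", 0), ("v2", 3)]
print(multi_register_scatter(regs, source, destinations))
# {'v1': [33, 0, 11, 0], 'v2': [22, 0, 0, 44]}
```

Destination elements that no descriptor names keep their previous contents in this model; whether hardware would preserve or clear them is not stated in the abstract.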
-
Patent number: 10019262
Abstract: A processor comprises a plurality of vector registers, and an execution unit, operatively coupled to the plurality of vector registers, the execution unit comprising a logic circuit implementing a load instruction for loading, into two or more vector registers, two or more data items associated with a data structure stored in a memory, wherein each one of the two or more vector registers is to store a data item associated with a certain position number within the data structure.
Type: Grant
Filed: December 22, 2015
Date of Patent: July 10, 2018
Assignee: Intel Corporation
Inventors: Ashish Jha, Elmoustapha Ould-Ahmed-Vall, Robert Valentine, Mark J. Charney, Milind B. Girkar
-
Patent number: 9894371
Abstract: A method, system and computer program for decompressing video data, storing the compressed video data in such a way that random access is possible and the data can be mapped efficiently into existing memory systems and interface protocols. The compression is accomplished via lossless compression using an algorithm optimized for video data. Due to the compressed format, data transmission consumes less bandwidth than using uncompressed data and prevents degradation in the decoded video.
Type: Grant
Filed: October 19, 2016
Date of Patent: February 13, 2018
Assignee: Entropic Communications, LLC
Inventors: Matthew Damian Bates, Mark Alan Baur
-
Patent number: 9893880
Abstract: A method for secure comparison of encrypted symbols. According to one embodiment, a user may encrypt two symbols, share the encrypted symbols with an untrusted third party that can compute algorithms on these symbols without access to the original data or encryption keys such that the result of running the algorithm on the encrypted data can be decrypted to a result which is equivalent to the result of running the algorithm on the original unencrypted data. In one embodiment the untrusted third party may perform a sequence of operations on the encrypted symbols to produce an encrypted result which, when decrypted by a trusted party, indicates whether the two symbols are the same.
Type: Grant
Filed: November 15, 2013
Date of Patent: February 13, 2018
Assignee: RAYTHEON BBN TECHNOLOGIES CORP.
Inventors: Kurt Rohloff, David Bruce Cousins, Richard Schantz
-
Patent number: 9843551
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for storing and transferring messages. An example method includes providing a queue having an ordered plurality of storage blocks. Each storage block stores one or more respective messages and is associated with a respective time. The times increase from a block designating a head of the queue to a block designating a tail of the queue. The method also includes reading, by each of a plurality of first sender processes, messages from one or more blocks in the queue beginning at the head of the queue. The read messages are sent, by each of the plurality of first sender processes, to a respective recipient. One or more of the blocks are designated as old when they have associated times that are earlier than a first time. A block is designated as a new head of the queue when the block is associated with a time later than or equal to the first time.
Type: Grant
Filed: October 12, 2016
Date of Patent: December 12, 2017
Assignee: Machine Zone, Inc.
Inventor: Igor Milyakov
-
Patent number: 9798541
Abstract: An apparatus and method for propagating conditionally evaluated values are disclosed. For example, a method according to one embodiment comprises: reading each value contained in an input mask register, each value being a true value or a false value and having a bit position associated therewith; for each true value read from the input mask register, generating a first result containing the bit position of the true value; for each false value read from the input mask register following the first true value, adding the vector length of the input mask register to a bit position of the last true value read from the input mask register to generate a second result; and storing each of the first results and second results in bit positions of an output register corresponding to the bit positions read from the input mask register.
Type: Grant
Filed: December 23, 2011
Date of Patent: October 24, 2017
Assignee: INTEL CORPORATION
Inventors: Jayashankar Bharadwaj, Nalini Vasudevan, Victor W. Lee, Daehyun Kim, Albert Hartono, Sara S. Baghsorkhi
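The propagation rule in the abstract above can be traced with a scalar model. The sketch assumes that the "vector length" equals the number of mask bits and that false values before the first true value produce a placeholder (`leading_fill`), since the abstract does not say what is stored for them; the function name is hypothetical.

```python
def propagate_positions(mask, leading_fill=None):
    """Map each mask bit to a position value as described in the abstract.

    True bit at position p        -> p
    False bit after a true bit    -> vector length + position of the last true bit
    False bit before any true bit -> leading_fill (assumed; not specified)
    """
    vector_length = len(mask)
    output = []
    last_true = None
    for position, bit in enumerate(mask):
        if bit:
            last_true = position
            output.append(position)
        elif last_true is not None:
            output.append(vector_length + last_true)
        else:
            output.append(leading_fill)
    return output


print(propagate_positions([0, 1, 0, 0, 1, 0]))  # [None, 1, 7, 7, 4, 10]
```

Adding the vector length keeps propagated entries distinguishable from genuine true positions, because every propagated value is at least as large as the vector length.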
-
Patent number: 9792115Abstract: A processing core is described having execution unit logic circuitry having a first register to store a first vector input operand, a second register to store a second vector input operand and a third register to store a packed data structure containing scalar input operands a, b, c. The execution unit logic circuitry further includes a multiplier to perform the operation (a*(first vector input operand))+(b*(second vector operand))+c.Type: GrantFiled: December 23, 2011Date of Patent: October 17, 2017Assignee: Intel CorporationInventors: Jesus Corbal, Andrew T. Forsyth, Thomas D. Fletcher, Lisa K. Wu, Eric Sprangle
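
As a rough software model of the operation named in this abstract, the sketch below computes a*v1 + b*v2 + c element by element. The function name `scaled_vector_sum` and the tuple used for the packed scalars are hypothetical; the hardware described performs this in execution unit circuitry rather than in a loop.

```python
def scaled_vector_sum(v1, v2, packed_scalars):
    """Compute a*v1[k] + b*v2[k] + c for every element k.

    packed_scalars: a 3-tuple (a, b, c), standing in for the packed data
    structure held in the third register.
    """
    a, b, c = packed_scalars
    if len(v1) != len(v2):
        raise ValueError("vector operands must have the same length")
    return [a * x + b * y + c for x, y in zip(v1, v2)]
```

With `v1 = [1, 2, 3]`, `v2 = [4, 5, 6]` and scalars `(2, 3, 1)`, the first element is 2*1 + 3*4 + 1 = 15.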
-
Patent number: 9766887Abstract: A processor fetches a multi-register gather instruction that includes a destination operand that specifies a destination vector register, and a source operand that identifies content that indicates multiple vector registers, a first set of indexes of each of the vector registers that each identifies a source data element, and a second set of indexes of the destination vector register for each identified source element. The instruction is decoded and executed, causing, for each of the first set of indexes of each of the vector registers, the source data element that corresponds to that index of that vector register to be stored in a set of destination data elements that correspond to the second set of identified indexes of the destination vector register for that source data element.Type: GrantFiled: December 23, 2011Date of Patent: September 19, 2017Assignee: Intel CorporationInventor: Ashish Jha
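
The data movement in this abstract can be modeled as a list of (source register, source index, destination index) triples. The sketch below is one plausible reading under that assumption; the name `multi_register_gather` and the triple encoding are hypothetical and do not reflect the instruction's actual operand format.

```python
def multi_register_gather(dest, selections):
    """Copy selected elements from several source registers into one destination.

    dest: list modeling the destination vector register.
    selections: iterable of (source_register, source_index, dest_index) triples,
        a hypothetical encoding of the two index sets the abstract mentions.
    Returns a new list; destination elements not named by any triple are kept.
    """
    result = list(dest)
    for source, src_idx, dst_idx in selections:
        result[dst_idx] = source[src_idx]
    return result
```

For example, gathering element 3 of one register and element 1 of another into destination slots 0 and 1 leaves the remaining destination slots unchanged.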
-
Patent number: 9760372Abstract: A method for combining data values through associative operations. The method includes, with a processor, arranging any number of data values into a plurality of columns according to natural parallelism of the associative operations and reading each column to a register of an individual processor. The processors are directed to combine the data values in the columns in parallel using a first associative operation. The results of the first associative operation for each column are stored in a register of each processor.Type: GrantFiled: September 1, 2011Date of Patent: September 12, 2017Assignee: Hewlett Packard Enterprise Development LPInventors: Ren Wu, Bin Zhang, Meichun Hsu, Qiming Chen
-
Patent number: 9665368Abstract: Systems, apparatuses, and methods of performing in a computer processor broadcasting data in response to a single vector packed broadcasting instruction that includes a source writemask register operand, a destination vector register operand, and an opcode. In some embodiments, the data of the source writemask register is zero extended prior to broadcasting.Type: GrantFiled: September 28, 2012Date of Patent: May 30, 2017Assignee: Intel CorporationInventors: Christopher J. Hughes, Mark J. Charney, Jesus Corbal, Milind B. Girkar, Elmoustapha Ould-Ahmed_Vall, Bret L. Toll, Robert Valentine
-
Patent number: 9594556Abstract: A circuit arrangement and program product provide support for packed sum of absolute difference operations in a floating point execution unit, e.g., a scalar or vector floating point execution unit. Existing adders in a floating point execution unit may be utilized along with minimal additional logic in the floating point execution unit to support efficient execution of a fixed point packed sum of absolute differences instruction within the floating point execution unit, often eliminating the need for a separate vector fixed point execution unit in a processor architecture, and thereby leading to less logic and circuit area, lower power consumption and lower cost.Type: GrantFiled: March 18, 2016Date of Patent: March 14, 2017Assignee: International Business Machines CorporationInventors: Adam J. Muff, Paul E. Schardt, Robert A. Shearer, Matthew R. Tubbs
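
For readers unfamiliar with the operation, a packed sum of absolute differences can be modeled as follows. This sketch shows only the arithmetic result, not how the floating point adders are reused; the function name `packed_sad` and the default group size of 4 are assumptions, since the abstract does not give a packing width.

```python
def packed_sad(a, b, group=4):
    """Sum |a[k] - b[k]| within each consecutive group of elements.

    a, b: equal-length lists of integers (fixed point values).
    group: number of elements summed into each result (assumed, not from the abstract).
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    sums = []
    for start in range(0, len(a), group):
        pairs = zip(a[start:start + group], b[start:start + group])
        sums.append(sum(abs(x - y) for x, y in pairs))
    return sums
```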
-
Patent number: 9594557Abstract: A method provides support for packed sum of absolute difference operations in a floating point execution unit, e.g., a scalar or vector floating point execution unit. Existing adders in a floating point execution unit may be utilized along with minimal additional logic in the floating point execution unit to support efficient execution of a fixed point packed sum of absolute differences instruction within the floating point execution unit, often eliminating the need for a separate vector fixed point execution unit in a processor architecture, and thereby leading to less logic and circuit area, lower power consumption and lower cost.Type: GrantFiled: March 18, 2016Date of Patent: March 14, 2017Assignee: International Business Machines CorporationInventors: Adam J. Muff, Paul E. Schardt, Robert A. Shearer, Matthew R. Tubbs
-
Patent number: 9594724Abstract: An aspect includes accessing a vector register in a vector register file. The vector register file includes a plurality of vector registers and each vector register includes a plurality of elements. A read command is received at a read port of the vector register file. The read command specifies a vector register address. The vector register address is decoded by an address decoder to determine a selected vector register of the vector register file. An element address is determined for one of the plurality of elements associated with the selected vector register based on a read element counter of the selected vector register. A word is selected in a memory array of the selected vector register as read data based on the element address. The read data is output from the selected vector register based on the decoding of the vector register address by the address decoder.Type: GrantFiled: August 9, 2012Date of Patent: March 14, 2017Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATIONInventors: Bruce M. Fleischer, Thomas W. Fox, Hans M. Jacobson, Ravi Nair
-
Patent number: 9582466Abstract: An aspect includes accessing a vector register in a vector register file. The vector register file includes a plurality of vector registers and each vector register includes a plurality of elements. A read command is received at a read port of the vector register file. The read command specifies a vector register address. The vector register address is decoded by an address decoder to determine a selected vector register of the vector register file. An element address is determined for one of the plurality of elements associated with the selected vector register based on a read element counter of the selected vector register. A word is selected in a memory array of the selected vector register as read data based on the element address. The read data is output from the selected vector register based on the decoding of the vector register address by the address decoder.Type: GrantFiled: August 13, 2012Date of Patent: February 28, 2017Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATIONInventors: Bruce M. Fleischer, Thomas W. Fox, Hans M. Jacobson, Ravi Nair
-
Patent number: 9507601Abstract: An apparatus for processing a plurality of data sets is disclosed, wherein one data set of the plurality of data sets includes N components and has a data type of one of a scalar type and a vector type, wherein N is a positive integer number. The apparatus includes a memory module and a data accessing module. The memory module comprises N memory units configured to store the plurality of data sets. The data accessing module is configured to write the data set into the memory module according to a write data index corresponding to the data set and one of a first writing mapping information and a second writing mapping information, wherein the first writing mapping information is employed when the data type is one of the scalar and the vector type and the second writing mapping information is employed when the data type is the other of the scalar and the vector type.Type: GrantFiled: February 19, 2014Date of Patent: November 29, 2016Assignee: MEDIATEK INC.Inventors: Pei-Kuei Tsung, Mu-Fan Murphy Chang, Po-Chun Fan
-
Patent number: 9471289Abstract: Systems and methods for source-to-source transformation for compiler optimization for many integrated core (MIC) coprocessors, including identifying data dependencies in candidate loops and data elements used in each iteration for arrays, profiling candidate loops to find a proper number m, wherein data transfer and computation for m iterations take an equal amount of time, and creating an outer loop outside the candidate loop, with each iteration of the outer loop executing m iterations of the candidate loop. Data streaming is performed by determining optimum buffer size for one or more arrays and inserting code before the outer loop to create optimum sized buffers, overlapping data transfer between central processing units (CPUs) and MICs with the computation; reusing buffers to reduce memory employed on the MICs, and reusing threads on MICs to repeatedly launch kernels on the MICs for asynchronous data transfer.Type: GrantFiled: March 25, 2015Date of Patent: October 18, 2016Assignee: NEC CorporationInventors: Min Feng, Srimat Chakradhar, Linhai Song
-
Patent number: 9436465Abstract: A processor, which executes m number of arithmetic operations in parallel, executes a partial sum instruction which takes an i-th to (i+m-1)-th elements of an input data series as input elements, so as to obtain first vector data, executes the partial sum instruction which takes a (i+x)-th to (i+x+m-1)-th elements of the input data series as the input elements, so as to obtain second vector data, and performs operations to subtract the p-th element of the first vector data and add the p-th element of the second vector data from and to a sum of the i-th to (i+x-1)-th elements of the input data series in parallel for each of the 0-th to (m-1)-th elements, so as to calculate sums of elements for m sections different from each other in parallel, and moving average processing to calculate a moving average from the sums of elements of the sections.Type: GrantFiled: April 7, 2014Date of Patent: September 6, 2016Assignee: FUJITSU LIMITEDInventors: Makiko Ito, Manabu Kubota, Kazuhiro Nomoto
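
The arithmetic in this abstract can be checked with a short sequential sketch. It assumes that the partial sum instruction produces inclusive prefix sums, which the abstract does not state explicitly, and the names `partial_sum` and `moving_averages` are hypothetical. Under that assumption, subtracting the p-th element of the first vector and adding the p-th element of the second slides a window of width x forward by p+1 positions.

```python
from itertools import accumulate


def partial_sum(data, start, m):
    """Inclusive prefix sums of data[start : start + m] (assumed semantics)."""
    return list(accumulate(data[start:start + m]))


def moving_averages(data, i, x, m):
    """Return m moving averages of window width x, modeled sequentially.

    Section p covers elements i+p+1 through i+x+p of the input series.
    Requires len(data) >= i + x + m.
    """
    first = partial_sum(data, i, m)        # prefix sums starting at element i
    second = partial_sum(data, i + x, m)   # prefix sums starting at element i+x
    base = sum(data[i:i + x])              # sum of elements i .. i+x-1
    # In hardware these m updates would run in parallel; here it is a loop.
    sums = [base - first[p] + second[p] for p in range(m)]
    return [s / x for s in sums]
```

With `data = [1, 2, 3, 4, 5, 6, 7, 8]`, `i = 0`, `x = 3` and `m = 4`, the section sums are 9, 12, 15 and 18, matching the direct sums of `data[1:4]`, `data[2:5]`, `data[3:6]` and `data[4:7]`.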
-
Patent number: 9424031Abstract: Various embodiments are generally directed to overcoming limitations of vector registers in their use with bit-parallel string matching algorithms. An apparatus includes a processor element; and logic to receive a pattern comprising a first string of elements to employ in a string matching operation, instantiate a test bitmask in a first vector register of the processor element, the first vector register comprising multiple lanes, copy bit values at MSB bit positions of the multiple lanes of the first vector register to a first vector mask as a vector value, bit-shift the vector value as a scalar value, bit-shift the first vector register, employ the vector value of the first vector mask to selectively fill LSB bit positions of lanes of a second vector register of the processor element; and OR the second vector register into the first vector register. Other embodiments are described and claimed.Type: GrantFiled: March 13, 2013Date of Patent: August 23, 2016Assignee: INTEL CORPORATIONInventors: Hariharan Thantry, Mani Azimi
-
Patent number: 9350584Abstract: An element selection unit (200) and a method therein for vector element selection. The element selection unit comprises a selector control circuit (404) and a selector data path circuit (406), which data path circuit comprises a plurality of layers of multiplexers. The element selection unit further comprises a receiving circuit (401) configured to receive an instruction to perform a selection of elements from an input vector. The selector control circuit (404) is configured to generate a multiplexer control signal for each multiplexer based on a bit map and on a plurality of relative offset values. The data path circuit is configured to propagate the elements comprised in the input vector through the plurality of layers of multiplexers towards an output vector based on the generated multiplexer control signals. The data path circuit is further configured to write the propagated elements to enabled elements of the output vector.Type: GrantFiled: June 10, 2013Date of Patent: May 24, 2016Assignee: TELEFONAKTIEBOLAGET LM ERICSSON (PUBL)Inventor: David Van Kampen
-
Patent number: 9268626Abstract: An apparatus and method are described for detecting and responding to fault conditions in a processor. For example, one embodiment of a method comprises: reading each active element in succession from a first vector register, each active element specifying an address for a gather or load operation; detecting one or more fault conditions associated with one or more of the active elements; for each active element read in succession prior to a detected fault condition on an element other than the first active element, storing the data loaded from an address associated with the active element in a first output vector register; and for each active element associated with the detected fault condition and following the detected fault condition, setting a bit in an output mask register to indicate the detected fault condition.Type: GrantFiled: December 23, 2011Date of Patent: February 23, 2016Assignee: INTEL CORPORATIONInventors: Jayashankar Bharadwaj, Victor W. Lee, Kim Daehyun, Nalini Vasudevan, Tin-Fook Ngai, Albert Hartono, Sara S. Baghsorkhi
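
The fault-handling behavior in this abstract resembles a gather that stops loading at the first fault and reports the remaining lanes through a mask. The sketch below is a simplified model under several assumptions: memory is a Python dict, a missing key stands in for a fault condition, and a fault on the first active element is raised as an exception because the abstract only covers faults on later elements. The name `first_fault_gather` is hypothetical.

```python
def first_fault_gather(addresses, active, memory):
    """Model a gather that records faults in an output mask.

    addresses: list of addresses, one per lane.
    active: list of bools marking active lanes.
    memory: dict mapping valid addresses to data (a missing key models a fault).
    Returns (out, fault_mask), both lists with one entry per lane.
    """
    lanes = len(addresses)
    out = [None] * lanes
    fault_mask = [False] * lanes
    faulted = False
    seen_active = False
    for lane in range(lanes):
        if not active[lane]:
            continue
        if not faulted and addresses[lane] not in memory:
            if not seen_active:
                # Assumption: a fault on the first active element is delivered normally.
                raise MemoryError(f"fault on first active lane {lane}")
            faulted = True
        if faulted:
            # The faulting lane and every later active lane are flagged, not loaded.
            fault_mask[lane] = True
        else:
            out[lane] = memory[addresses[lane]]
        seen_active = True
    return out, fault_mask
```

A caller could retry the flagged lanes later by passing `fault_mask` back in as the new `active` mask, at which point the faulting lane becomes the first active element.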