Single Instruction, Multiple Data (SIMD) Patents (Class 712/22)
-
Patent number: 10613863
Abstract: Techniques and mechanisms described herein include a signal processor implemented as an overlay on a field-programmable gate array (FPGA) device that utilizes special purpose, hardened intellectual property (IP) modules such as memory blocks and digital signal processing (DSP) cores. A Processing Element (PE) is built from one or more DSP cores connected to additional logic. Interconnected as an array, the PEs may operate in a computational model such as Single Instruction-Multiple Thread (SIMT). A software hierarchy is described that transforms the SIMT array into an effective signal processor.
Type: Grant
Filed: July 3, 2019
Date of Patent: April 7, 2020
Assignee: Nextera Video, Inc.
Inventors: John E. Deame, Steven Kaufmann, Liviu Voicu
-
Patent number: 10592466
Abstract: A GPU architecture employs a crossbar switch to preferentially store operand vectors in a compressed form allowing reduction in the number of memory circuits that must be activated during an operand fetch and to allow existing execution units to be used for scalar execution. Scalar execution can be performed during branch divergence.
Type: Grant
Filed: May 12, 2016
Date of Patent: March 17, 2020
Assignee: Wisconsin Alumni Research Foundation
Inventors: Nam Sung Kim, Zhenhong Liu
-
Patent number: 10514917
Abstract: In-lane vector shuffle operations are described. In one embodiment a shuffle instruction specifies a field of per-lane control bits, a source operand and a destination operand, these operands having corresponding lanes, each lane divided into corresponding portions of multiple data elements. Sets of data elements are selected from corresponding portions of every lane of the source operand according to per-lane control bits. Elements of these sets are copied to specified fields in corresponding portions of every lane of the destination operand. Another embodiment of the shuffle instruction also specifies a second source operand, all operands having corresponding lanes divided into multiple data elements. A set selected according to per-lane control bits contains data elements from every lane portion of a first source operand and data elements from every corresponding lane portion of the second source operand. Set elements are copied to specified fields in every lane of the destination operand.
Type: Grant
Filed: November 2, 2017
Date of Patent: December 24, 2019
Assignee: Intel Corporation
Inventors: Zeev Sperber, Robert Valentine, Benny Eitan, Doron Orenstein
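The defining property of an in-lane shuffle is that every destination element is selected from within the same lane of the source, never across lanes. A minimal Python model of that behavior (lane width, element count, and the function name are illustrative, not the instruction's encoding):

```python
def in_lane_shuffle(src, control, lane_width=4):
    """Model of an in-lane shuffle: destination element i is selected from
    element i's own lane of the source, using the per-element control index
    control[i] (0 .. lane_width-1). No element ever crosses a lane boundary."""
    dst = []
    for i, sel in enumerate(control):
        lane_base = (i // lane_width) * lane_width  # start of this element's lane
        dst.append(src[lane_base + sel])
    return dst

# Two 4-element lanes; the same control pattern reverses each lane independently.
src = [10, 11, 12, 13, 20, 21, 22, 23]
ctrl = [3, 2, 1, 0, 3, 2, 1, 0]
print(in_lane_shuffle(src, ctrl))  # [13, 12, 11, 10, 23, 22, 21, 20]
```

Note that even with an "identity-breaking" control like `[3, 2, 1, 0, ...]`, element 4 can only ever receive values 20-23: the second lane is self-contained.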
-
Patent number: 10514918
Abstract: In-lane vector shuffle operations are described. In one embodiment a shuffle instruction specifies a field of per-lane control bits, a source operand and a destination operand, these operands having corresponding lanes, each lane divided into corresponding portions of multiple data elements. Sets of data elements are selected from corresponding portions of every lane of the source operand according to per-lane control bits. Elements of these sets are copied to specified fields in corresponding portions of every lane of the destination operand. Another embodiment of the shuffle instruction also specifies a second source operand, all operands having corresponding lanes divided into multiple data elements. A set selected according to per-lane control bits contains data elements from every lane portion of a first source operand and data elements from every corresponding lane portion of the second source operand. Set elements are copied to specified fields in every lane of the destination operand.
Type: Grant
Filed: December 21, 2017
Date of Patent: December 24, 2019
Assignee: Intel Corporation
Inventors: Zeev Sperber, Robert Valentine, Benny Eitan, Doron Orenstein
-
Patent number: 10514916
Abstract: In-lane vector shuffle operations are described. In one embodiment a shuffle instruction specifies a field of per-lane control bits, a source operand and a destination operand, these operands having corresponding lanes, each lane divided into corresponding portions of multiple data elements. Sets of data elements are selected from corresponding portions of every lane of the source operand according to per-lane control bits. Elements of these sets are copied to specified fields in corresponding portions of every lane of the destination operand. Another embodiment of the shuffle instruction also specifies a second source operand, all operands having corresponding lanes divided into multiple data elements. A set selected according to per-lane control bits contains data elements from every lane portion of a first source operand and data elements from every corresponding lane portion of the second source operand. Set elements are copied to specified fields in every lane of the destination operand.
Type: Grant
Filed: June 5, 2017
Date of Patent: December 24, 2019
Assignee: Intel Corporation
Inventors: Zeev Sperber, Robert Valentine, Benny Eitan, Doron Orenstein
-
Patent number: 10509652
Abstract: In-lane vector shuffle operations are described. In one embodiment a shuffle instruction specifies a field of per-lane control bits, a source operand and a destination operand, these operands having corresponding lanes, each lane divided into corresponding portions of multiple data elements. Sets of data elements are selected from corresponding portions of every lane of the source operand according to per-lane control bits. Elements of these sets are copied to specified fields in corresponding portions of every lane of the destination operand. Another embodiment of the shuffle instruction also specifies a second source operand, all operands having corresponding lanes divided into multiple data elements. A set selected according to per-lane control bits contains data elements from every lane portion of a first source operand and data elements from every corresponding lane portion of the second source operand. Set elements are copied to specified fields in every lane of the destination operand.
Type: Grant
Filed: December 21, 2017
Date of Patent: December 17, 2019
Assignee: Intel Corporation
Inventors: Zeev Sperber, Robert Valentine, Benny Eitan, Doron Orenstein
-
Patent number: 10497440
Abstract: A crossbar array comprises a plurality of row lines, a plurality of column lines intersecting the plurality of row lines at a plurality of intersections, and a plurality of junctions coupled between the plurality of row lines and the plurality of column lines at a portion of the plurality of intersections. Each junction comprises a resistive memory element, and the junctions are positioned to calculate a matrix multiplication of a first matrix and a second matrix.
Type: Grant
Filed: August 7, 2015
Date of Patent: December 3, 2019
Assignee: HEWLETT PACKARD ENTERPRISE DEVELOPMENT LP
Inventors: Miao Hu, John Paul Strachan, Zhiyong Li, R. Stanley Williams
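The arithmetic such a resistive crossbar performs follows directly from circuit laws: driving row voltages through a matrix of junction conductances produces column currents that are a vector-matrix product. A sketch of that relationship in Python (variable names are illustrative):

```python
def crossbar_multiply(voltages, conductances):
    """Model of a resistive crossbar: row voltages V driven through a
    conductance matrix G yield column currents I[j] = sum_i V[i] * G[i][j]
    (Ohm's law per junction, Kirchhoff's current law per column line) --
    i.e. a vector-matrix product computed in a single analog step."""
    n_cols = len(conductances[0])
    return [sum(v * row[j] for v, row in zip(voltages, conductances))
            for j in range(n_cols)]

V = [1.0, 2.0]            # row-line voltages (one row of the first matrix)
G = [[0.5, 1.0],          # junction conductances encoding the second matrix
     [2.0, 0.5]]
print(crossbar_multiply(V, G))  # [4.5, 2.0]
```

Feeding the rows of the first matrix through one at a time yields the full matrix-matrix product.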
-
Patent number: 10489155
Abstract: Systems and methods relate to a mixed-width single instruction multiple data (SIMD) instruction which has at least a source vector operand comprising data elements of a first bit-width and a destination vector operand comprising data elements of a second bit-width, wherein the second bit-width is either half of or twice the first bit-width. Correspondingly, one of the source or destination vector operands is expressed as a pair of registers, a first register and a second register. The other vector operand is expressed as a single register. Data elements of the first register correspond to even-numbered data elements of the other vector operand expressed as a single register, and data elements of the second register correspond to odd-numbered data elements of the other vector operand expressed as a single register.
Type: Grant
Filed: July 21, 2015
Date of Patent: November 26, 2019
Assignee: QUALCOMM Incorporated
Inventors: Eric Wayne Mahurin, Ajay Anant Ingle
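The even/odd pairing described above can be sketched in a few lines: the narrow operand lives in one register, while its even-indexed elements map to the first register of the wide pair and its odd-indexed elements to the second (a simplified model; register names are illustrative, not the instruction's mnemonics):

```python
def widen_even_odd(narrow):
    """Model of expressing a wide operand as a register pair: elements of
    the first register line up with even-numbered elements of the narrow
    single-register operand, elements of the second register with the
    odd-numbered elements."""
    first = narrow[0::2]   # even-indexed elements -> first register of the pair
    second = narrow[1::2]  # odd-indexed elements  -> second register of the pair
    return first, second

print(widen_even_odd([0, 1, 2, 3, 4, 5, 6, 7]))
# ([0, 2, 4, 6], [1, 3, 5, 7])
```

For a narrowing instruction the same mapping runs in reverse: interleaving the two registers reconstructs the single-register operand.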
-
Patent number: 10437769
Abstract: A method of transition minimized low speed data transfer is described herein. In an embodiment, a data rate of a set of data to be transmitted on a data bus is determined. A one hot value is encoded on the data bus in response to a low data rate. An XOR operation is performed with a previous state of the data bus and the encoded one hot value. Additionally, a resulting value of the XOR operation is driven onto the data bus.
Type: Grant
Filed: December 26, 2013
Date of Patent: October 8, 2019
Assignee: Intel Corporation
Inventor: Daniel Greenspan
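The encoding step above is compact enough to model directly: one-hot encode the value, XOR it with the previous bus state, and drive the result, so exactly one wire toggles per transfer. A sketch (bus width and function name are illustrative):

```python
def encode_low_rate(prev_bus, value):
    """Transition-minimized low-rate transfer (sketch): the value is
    one-hot encoded, XORed with the previous bus state, and the result is
    driven onto the bus -- flipping exactly one bit per transfer."""
    one_hot = 1 << value
    return prev_bus ^ one_hot

bus0 = 0b00000000
bus1 = encode_low_rate(bus0, 3)  # 0b00001000 -- one bit flipped
bus2 = encode_low_rate(bus1, 5)  # 0b00101000 -- again, one bit flipped
# The receiver recovers the value by XORing consecutive bus states:
decoded = (bus1 ^ bus2).bit_length() - 1
print(bin(bus1), bin(bus2), decoded)  # 0b1000 0b101000 5
```

Single-bit transitions keep switching energy and simultaneous-switching noise at a minimum when the data rate does not require the full bus bandwidth.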
-
Patent number: 10419501
Abstract: A data streaming unit (DSU) and a method for operating a DSU are disclosed. In an embodiment the DSU includes a memory interface configured to be connected to a storage unit, a compute engine interface configured to be connected to a compute engine (CE) and an address generator configured to manage address data representing address locations in the storage unit. The data streaming unit further includes a data organization unit configured to access data in the storage unit and to reorganize the data to be forwarded to the compute engine, wherein the memory interface is communicatively connected to the address generator and the data organization unit, wherein the address generator is communicatively connected to the data organization unit, and wherein the data organization unit is communicatively connected to the compute engine interface.
Type: Grant
Filed: December 3, 2015
Date of Patent: September 17, 2019
Assignee: Futurewei Technologies, Inc.
Inventors: Ashish Rai Shrivastava, Alan Gatherer, Sushma Wokhlu
-
Patent number: 10324515
Abstract: Approaches are provided for a predictive electrical appliance power-saving management mode. An approach includes ascertaining a location and pace of a mobile device. The approach further includes calculating an amount of time that it will take to enable or start programs and services upon a computing device waking from a sleep mode or hybrid sleep mode. The approach further includes determining a distance threshold to the computing device that allows for the calculated amount of time to pass such that the programs and services are enabled or started prior to a user of the mobile device arriving at the computing device when the user is returning to the computing device at the ascertained pace. The approach further includes sending a signal to awaken the computing device from the sleep mode or hybrid sleep mode when the mobile device is within the distance threshold.
Type: Grant
Filed: May 8, 2017
Date of Patent: June 18, 2019
Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: James E. Bostick, John M. Ganci, Jr., Sarbajit K. Rakshit, Kimberly G. Starks
-
Patent number: 10311539
Abstract: A SIMD processing unit processes a plurality of tasks which each include up to a predetermined maximum number of work items. The work items of a task are arranged for executing a common sequence of instructions on respective data items. The data items are arranged into blocks, with some of the blocks including at least one invalid data item. Work items which relate to invalid data items are invalid work items. The SIMD processing unit comprises a group of processing lanes configured to execute instructions of work items of a particular task over a plurality of processing cycles. A control module assembles work items into the tasks based on the validity of the work items, so that invalid work items of the particular task are temporally aligned across the processing lanes. In this way the number of wasted processing slots due to invalid work items may be reduced.
Type: Grant
Filed: November 2, 2016
Date of Patent: June 4, 2019
Assignee: Imagination Technologies Limited
Inventors: John Howson, Jonathan Redshaw, Yoong Chert Foo
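The payoff of temporally aligning invalid work items is that a cycle in which every lane holds an invalid item can be skipped outright. A simplified Python model of that assembly policy (data layout and names are illustrative, not the patent's hardware interfaces):

```python
def assemble_tasks(lane_items):
    """Sketch of validity-aware task assembly: within each lane, valid work
    items are packed first, so invalid items line up in the same (late)
    cycles across lanes, and all-invalid cycles can be skipped entirely."""
    packed = [sorted(items, key=lambda w: not w["valid"]) for items in lane_items]
    depth = max(len(items) for items in packed)
    # Count cycles in which at least one lane does useful work.
    useful_cycles = sum(
        any(c < len(items) and items[c]["valid"] for items in packed)
        for c in range(depth))
    return packed, useful_cycles

lanes = [[{"valid": True},  {"valid": False}],   # lane 0: one valid, one invalid
         [{"valid": False}, {"valid": True}]]    # lane 1: one invalid, one valid
packed, cycles = assemble_tasks(lanes)
print(cycles)  # 1 -- without alignment both cycles would be partially wasted
```

Unaligned, the same items would occupy two cycles each with one idle lane; aligned, one fully valid cycle runs and the all-invalid cycle is dropped.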
-
Patent number: 10291514
Abstract: Aspects of this disclosure provide techniques for dynamically configuring flow splitting via software defined network (SDN) signaling instructions. An SDN controller may instruct an ingress network node to split a traffic flow between two or more egress paths, and instruct the ingress network node, and perhaps downstream network nodes, to transport portions of the traffic flow in accordance with a forwarding protocol. In one example, the SDN controller instructs the network nodes to transport portions of the traffic flow in accordance with a link-based forwarding protocol. In other examples, the SDN controller instructs the network nodes to transport portions of the traffic flow in accordance with a path-based or source-based transport protocol.
Type: Grant
Filed: August 28, 2017
Date of Patent: May 14, 2019
Assignee: Huawei Technologies Co., Ltd.
Inventors: Xu Li, Hang Zhang
-
Patent number: 10241802
Abstract: A parallel processor for processing a plurality of different processing instruction streams in parallel is described. The processor comprises a plurality of data processing units; and a plurality of SIMD (Single Instruction Multiple Data) controllers, each connectable to a group of data processing units of the plurality of data processing units, and each SIMD controller arranged to handle an individual processing task with a subgroup of actively connected data processing units selected from the group of data processing units. The parallel processor is arranged to vary dynamically the size of the subgroup of data processing units to which each SIMD controller is actively connected under control of received processing instruction streams, thereby permitting each SIMD controller to be actively connected to a different number of processing units for different processing tasks.
Type: Grant
Filed: November 20, 2015
Date of Patent: March 26, 2019
Assignee: TELEFONAKTIEBOLAGET LM ERICSSON (PUBL)
Inventors: John Lancaster, Martin Whitaker
-
Patent number: 10235732
Abstract: A method and system are described herein for an optimization technique on two aspects of thread scheduling and dispatch when the driver is allowed to pick the scheduling attributes. The present techniques rely on an enhanced GPGPU Walker hardware command and one dimensional local identification generation to maximize thread residency.
Type: Grant
Filed: December 27, 2013
Date of Patent: March 19, 2019
Assignee: INTEL CORPORATION
Inventors: Jayanth N. Rao, Michal Mrozek
-
Patent number: 10228972
Abstract: In some embodiments, the present invention provides an exemplary computing device, including at least: a scheduler processor; a CPU; a GPU; where the scheduler processor is configured to: obtain a computing task; divide the computing task into: a first set of subtasks and a second set of subtasks; submit the first set to the CPU; submit the second set to the GPU; determine, for a first subtask of the first set, a first execution time, a first execution speed, or both; determine, for a second subtask of the second set, a second execution time, a second execution speed, or both; dynamically rebalance an allocation of remaining non-executed subtasks of the computing task to be submitted to the CPU and the GPU, based, at least in part, on at least one of: a first comparison of the first execution time to the second execution time, and a second comparison of the first execution speed to the second execution speed.
Type: Grant
Filed: June 21, 2018
Date of Patent: March 12, 2019
Assignee: Banuba Limited
Inventor: Yury Hushchyn
-
Patent number: 10229468
Abstract: Systems, apparatuses and methods may provide for receiving a general purpose graphics processing unit (GPGPU) workload and converting the GPGPU workload to a three-dimensional (3D) workload. Additionally, the 3D workload may be dispatched to a 3D pipeline. In one example, converting the GPGPU workload to the 3D workload includes identifying a plurality of thread groups in the GPGPU workload and mapping the plurality of thread groups to a 3D matrix of cubes.
Type: Grant
Filed: June 3, 2015
Date of Patent: March 12, 2019
Assignee: Intel Corporation
Inventors: Robert B. Taylor, Abhishek Venkatesh
-
Patent number: 10223334
Abstract: A native tensor processor calculates tensor contractions using a sum of outer products. In one implementation, the native tensor processor preferably is implemented as a single integrated circuit and includes an input buffer and a contraction engine. The input buffer buffers tensor elements retrieved from off-chip and transmits the elements to the contraction engine as needed. The contraction engine calculates the tensor contraction by executing calculations from equivalent matrix multiplications, as if the tensors were unfolded into matrices, but avoiding the overhead of expressly unfolding the tensors. The contraction engine includes a plurality of outer product units that calculate matrix multiplications by a sum of outer products. By using outer products, the equivalent matrix multiplications can be partitioned into smaller matrix multiplications, each of which is localized with respect to which tensor elements are required.
Type: Grant
Filed: July 20, 2017
Date of Patent: March 5, 2019
Assignee: NOVUMIND LIMITED
Inventors: Chien-Ping Lu, Yu-Shuen Tang
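The outer-product decomposition the abstract relies on is just the identity A·B = Σ_k (column k of A) ⊗ (row k of B): each rank-1 term touches only one column of A and one row of B, which is what makes the work easy to partition and localize. A sketch in plain Python:

```python
def outer(u, v):
    """Rank-1 outer product u v^T."""
    return [[a * b for b in v] for a in u]

def matmul_by_outer_products(A, B):
    """Compute A @ B as a sum of outer products: for each k, add
    (column k of A) outer (row k of B). Each term is localized to one
    column of A and one row of B, so terms can be computed independently."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for t in range(k):
        col = [A[i][t] for i in range(n)]   # column t of A
        P = outer(col, B[t])                # rank-1 contribution
        for i in range(n):
            for j in range(m):
                C[i][j] += P[i][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_by_outer_products(A, B))  # [[19, 22], [43, 50]]
```

Splitting the k-loop across hardware units is exactly the partitioning into smaller, localized matrix multiplications that the contraction engine exploits.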
-
Patent number: 10218635
Abstract: A network interface controller (NC) that can provide a connection for a device to a network. The NC can include a sideband port controller. The sideband port controller can provide a sideband connection between the network and a sideband endpoint circuit that can communicate information with the network via the sideband. The sideband port controller can include a receive data route that has an input for receiving packets of data from the network and an output for passing the packets of data received from the network to the sideband endpoint circuit. The receive data route may include a buffer to receive the packets of data from the network and to pass the packets of data received from the network to the sideband endpoint.
Type: Grant
Filed: September 18, 2015
Date of Patent: February 26, 2019
Assignee: International Business Machines Corporation
Inventors: Jean-Paul Aldebert, Claude Basso, Jean-Luc Frenoy, Fabrice J. Verplanken
-
Patent number: 10216704
Abstract: A native tensor processor calculates tensor contractions using a sum of outer products. In one implementation, the native tensor processor preferably is implemented as a single integrated circuit and includes an input buffer and a contraction engine. The input buffer buffers tensor elements retrieved from off-chip and transmits the elements to the contraction engine as needed. The contraction engine calculates the tensor contraction by executing calculations from equivalent matrix multiplications, as if the tensors were unfolded into matrices, but avoiding the overhead of expressly unfolding the tensors. The contraction engine includes a plurality of outer product units that calculate matrix multiplications by a sum of outer products. By using outer products, the equivalent matrix multiplications can be partitioned into smaller matrix multiplications, each of which is localized with respect to which tensor elements are required.
Type: Grant
Filed: July 20, 2017
Date of Patent: February 26, 2019
Assignee: NOVUMIND LIMITED
Inventors: Chien-Ping Lu, Yu-Shuen Tang
-
Patent number: 10218634
Abstract: A network interface controller for providing a connection for a device to a network. The network interface controller may include a sideband port controller. The sideband port controller may provide a sideband connection between the network and a sideband endpoint circuit that is operative to communicate information with the network via the sideband. The sideband port controller may include a transmit data route having an input for receiving packets from the sideband endpoint circuit and an output for passing packets received from the sideband endpoint to the network. A packet parser is connected to the transmit data route. The packet parser is operative to read data from packets received from the sideband endpoint and is further operative to analyze the data.
Type: Grant
Filed: September 18, 2015
Date of Patent: February 26, 2019
Assignee: International Business Machines Corporation
Inventors: Jean-Paul Aldebert, Claude Basso, Jean-Luc Frenoy, Fabrice J. Verplanken
-
Patent number: 10186069
Abstract: A graphics processing system groups plural initial pilot shader programs into a set of initial pilot shader programs and associates the set of initial pilot shader programs with a set of indexes. The initial pilot shader programs each contain constant program expressions to be executed on behalf of an original shader program. The index for an initial pilot shader program is then used to obtain the instructions contained in the initial pilot shader program for executing the constant program expressions of the initial pilot shader program. The threads for executing a subset of the initial pilot shader programs are also grouped into a thread group and the threads of the thread group are executed in parallel. The graphics processing system provides for efficient preparation and execution of plural initial pilot shader programs.
Type: Grant
Filed: February 15, 2017
Date of Patent: January 22, 2019
Assignee: Arm Limited
Inventors: Alexander Galazin, Jörg Wagner, Andreas Due Engh-Halstvedt
-
Patent number: 10152329
Abstract: One embodiment of the present disclosure sets forth an optimized way to execute pre-scheduled replay operations for divergent operations in a parallel processing subsystem. Specifically, a streaming multiprocessor (SM) includes a multi-stage pipeline configured to insert pre-scheduled replay operations into a multi-stage pipeline. A pre-scheduled replay unit detects whether the operation associated with the current instruction is accessing a common resource. If the threads are accessing data which are distributed across multiple cache lines, then the pre-scheduled replay unit inserts pre-scheduled replay operations behind the current instruction. The multi-stage pipeline executes the instruction and the associated pre-scheduled replay operations sequentially. If additional threads remain unserviced after execution of the instruction and the pre-scheduled replay operations, then additional replay operations are inserted via the replay loop, until all threads are serviced.
Type: Grant
Filed: February 9, 2012
Date of Patent: December 11, 2018
Assignee: NVIDIA CORPORATION
Inventors: Michael Fetterman, Stewart Glenn Carlton, Jack Hilaire Choquette, Shirish Gadre, Olivier Giroux, Douglas J. Hahn, Steven James Heinrich, Eric Lyell Hill, Charles McCarver, Omkar Paranjape, Anjana Rajendran, Rajeshwaran Selvanesan
-
Patent number: 10073816
Abstract: A native tensor processor calculates tensor contractions using a sum of outer products. In one implementation, the native tensor processor preferably is implemented as a single integrated circuit and includes an input buffer and a contraction engine. The input buffer buffers tensor elements retrieved from off-chip and transmits the elements to the contraction engine as needed. The contraction engine calculates the tensor contraction by executing calculations from equivalent matrix multiplications, as if the tensors were unfolded into matrices, but avoiding the overhead of expressly unfolding the tensors. The contraction engine includes a plurality of outer product units that calculate matrix multiplications by a sum of outer products. By using outer products, the equivalent matrix multiplications can be partitioned into smaller matrix multiplications, each of which is localized with respect to which tensor elements are required.
Type: Grant
Filed: July 20, 2017
Date of Patent: September 11, 2018
Assignee: NovuMind Limited
Inventors: Chien-Ping Lu, Yu-Shuen Tang
-
Patent number: 10061591
Abstract: A method for reducing execution of redundant threads in a processing environment. The method includes detecting threads that include redundant work among many different threads. Multiple threads from the detected threads are grouped into one or more thread clusters based on determining same thread computation results. Execution of all but a particular one thread in each of the one or more thread clusters is suppressed. The particular one thread in each of the one or more thread clusters is executed. Results determined from execution of the particular one thread in each of the one or more thread clusters are broadcasted to other threads in each of the one or more thread clusters.
Type: Grant
Filed: February 26, 2015
Date of Patent: August 28, 2018
Assignee: Samsung Electronics Company, Ltd.
Inventors: Boris Beylin, John Brothers, Santosh Abraham, Lingjie Xu, Maxim Lukyanov, Alex Grosul
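The cluster-suppress-broadcast scheme above can be modeled concisely: threads whose inputs imply identical results form a cluster, one representative executes, and its result is broadcast to the rest. A sketch (the input-equality clustering criterion is a simplification of the patent's redundancy detection):

```python
def run_with_dedup(thread_inputs, work_fn):
    """Sketch of redundant-thread suppression: cluster threads by input
    (threads with equal inputs would compute equal results), execute only
    one representative per cluster, broadcast its result to the cluster."""
    clusters = {}                       # input value -> list of thread ids
    for tid, x in enumerate(thread_inputs):
        clusters.setdefault(x, []).append(tid)
    results = [None] * len(thread_inputs)
    executed = 0
    for x, tids in clusters.items():
        r = work_fn(x)                  # only the representative runs
        executed += 1
        for tid in tids:                # broadcast to the whole cluster
            results[tid] = r
    return results, executed

res, n_exec = run_with_dedup([2, 3, 2, 2, 3], lambda x: x * x)
print(res, n_exec)  # [4, 9, 4, 4, 9] 2
```

Five threads' worth of results are produced with only two executions of the work function.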
-
Patent number: 10042641
Abstract: An asynchronous processing system comprising an asynchronous scalar processor and an asynchronous vector processor coupled to the scalar processor. The asynchronous scalar processor is configured to perform processing functions on input data and to output instructions. The asynchronous vector processor is configured to perform processing functions in response to a very long instruction word (VLIW) received from the scalar processor. The VLIW comprises a first portion and a second portion, at least the first portion comprising a vector instruction.
Type: Grant
Filed: September 8, 2014
Date of Patent: August 7, 2018
Assignee: Huawei Technologies Co., Ltd.
Inventors: Qifan Zhang, Wuxian Shi, Yiqun Ge, Tao Huang, Wen Tong
-
Patent number: 10019410
Abstract: An apparatus, computer-readable medium, and computer-implemented method for parallelization of a computer program on a plurality of computing cores includes receiving a computer program comprising a plurality of commands, decomposing the plurality of commands into a plurality of node networks, each node network corresponding to a command in the plurality of commands and including one or more nodes corresponding to execution dependencies of the command, mapping the plurality of node networks to a plurality of systolic arrays, each systolic array comprising a plurality of cells and each non-data node in each node network being mapped to a cell in the plurality of cells, and mapping each cell in each systolic array to a computing core in the plurality of computing cores.
Type: Grant
Filed: April 6, 2017
Date of Patent: July 10, 2018
Assignee: CORNAMI, INC.
Inventors: Solomon Harsha, Paul Master
-
Patent number: 10019264
Abstract: Methods and apparatuses relating to processors that contextually optimize instructions at runtime are disclosed. In one embodiment, a processor includes a fetch circuit to fetch an instruction from an instruction storage, a format of the instruction including an opcode, a first source operand identifier, and a second source operand identifier; wherein the instruction storage includes a sequence of sub-optimal instructions preceded by a start-of-sequence instruction and followed by an end-of-sequence instruction.
Type: Grant
Filed: February 24, 2016
Date of Patent: July 10, 2018
Assignee: Intel Corporation
Inventors: Taylor W. Kidd, Matt S. Walsh
-
Patent number: 10013652
Abstract: Deep Neural Networks (DNNs) with many hidden layers and many units per layer are very flexible models with a very large number of parameters. As such, DNNs are challenging to optimize. To achieve real-time computation, embodiments disclosed herein enable fast DNN feature transformation via optimized memory bandwidth utilization. To optimize memory bandwidth utilization, a rate of accessing memory may be reduced based on a batch setting. A memory, corresponding to a selected given output neuron of a current layer of the DNN, may be updated with an incremental output value computed for the selected given output neuron as a function of input values of a selected few non-zero input neurons of a previous layer of the DNN in combination with weights between the selected few non-zero input neurons and the selected given output neuron, wherein a number of the selected few corresponds to the batch setting.
Type: Grant
Filed: April 29, 2015
Date of Patent: July 3, 2018
Assignee: Nuance Communications, Inc.
Inventors: Jan Vlietinck, Stephan Kanthak, Rudi Vuerinckx, Christophe Ris
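The batched incremental update above amounts to: skip zero input neurons, and fold a batch of non-zero inputs into each output location per memory access rather than touching it once per input. A simplified dense-Python model (layer sizes, names, and the batching interface are illustrative):

```python
def sparse_layer_update(outputs, weights, inputs, batch):
    """Sketch of the batched sparse feature transform: each output neuron
    accumulates contributions only from non-zero input neurons, and one
    read-modify-write of the output covers a whole batch of such inputs,
    reducing the memory access rate by roughly the batch factor."""
    nonzero = [i for i, v in enumerate(inputs) if v != 0.0]
    for start in range(0, len(nonzero), batch):
        group = nonzero[start:start + batch]
        for j in range(len(outputs)):
            # One update to outputs[j] for the entire batch of inputs.
            outputs[j] += sum(inputs[i] * weights[i][j] for i in group)
    return outputs

out = sparse_layer_update(
    outputs=[0.0, 0.0],
    weights=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],  # weights[i][j]: input i -> output j
    inputs=[1.0, 0.0, 2.0],                        # input neuron 1 is zero and skipped
    batch=2)
print(out)  # [11.0, 14.0]
```

With ReLU-style activations a large fraction of inputs are exactly zero, which is what makes the non-zero selection worthwhile.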
-
Patent number: 9979904
Abstract: The present invention relates to reading out sensor array pixels. In particular, the present invention provides an approach according to which only a region of interest may be read out from the sensor array, thus leading to substantial time savings. In order to achieve this, a circuitry for configuring a region of interest for the sensor array is provided, as well as a reading-out circuitry for reading out pixels belonging to the region of interest. In addition, the corresponding methods for programming the region of interest and for reading out the region of interest are provided. The circuitry for programming and/or reading out the region of interest includes per-pixel storage elements for storing an indication of whether a pixel belongs to a region of interest (ROI). These are configured by the programming circuitry and used when reading out the ROI so that only the pixels of the ROI are read out.
Type: Grant
Filed: January 24, 2014
Date of Patent: May 22, 2018
Assignee: INNOVACIONES MICROELECTRÓNICAS S.L. (ANAFOCUS)
Inventors: Rafael Dominguez Castro, Sergio Morillas Castillo, Rafael Romay Juárez, Fernando Medeiro Hidalgo
-
Patent number: 9891912
Abstract: An array processor includes a managing element having a load streaming unit coupled to multiple processing elements. The load streaming unit provides input data portions to each of a first subset of processing elements and receives output data from each of a second subset of the processing elements based on a comparatively sorted combination of the input data portions. Each processing element is configurable by the managing element to compare input data portions received from the load streaming unit or two or more of the other processing elements. Each processing unit can further select an input data portion to be output data based on the comparison, and in response to selecting the input data portion, remove a queue entry corresponding to the selected input data portion. Each processing element can provide the selected output data portion to the managing element or as an input to one of the processing elements.
Type: Grant
Filed: October 31, 2014
Date of Patent: February 13, 2018
Assignee: International Business Machines Corporation
Inventors: Ganesh Balakrishnan, Bartholomew Blaner, John J. Reilly, Jeffrey A. Stuecheli
-
Patent number: 9841957
Abstract: An apparatus stores a program including a description of loop processing of iterating a plurality of instructions, and rearranges an execution sequence of the plurality of instructions in the program such that the loop processing is pipelined by software pipelining. The apparatus inserts an instruction to use a register for single instruction multiple data (SIMD) extension instructions into the description of the loop processing in the program.
Type: Grant
Filed: April 19, 2016
Date of Patent: December 12, 2017
Assignee: FUJITSU LIMITED
Inventor: Shun Kamatsuka
-
Patent number: 9830164
Abstract: A system and method for efficiently processing instructions in hardware parallel execution lanes within a processor. In response to a given divergent point within an identified loop, a compiler arranges instructions within the identified loop into very large instruction words (VLIWs). At least one VLIW includes instructions intermingled from different basic blocks between the given divergence point and a corresponding convergence point. The compiler generates code that, when executed, assigns instructions within a given VLIW at runtime to multiple parallel execution lanes within a target processor. The target processor includes a single instruction multiple data (SIMD) micro-architecture. The assignment for a given lane is based on the branch direction found at runtime for the given lane at the given divergent point. The target processor includes a vector register for storing indications indicating which given instruction within a fetched VLIW for an associated lane to execute.
Type: Grant
Filed: January 29, 2013
Date of Patent: November 28, 2017
Assignee: Advanced Micro Devices, Inc.
Inventor: Reza Yazdani
-
Patent number: 9832478
Abstract: One exemplary video encoding method has the following steps: determining a size of a parallel motion estimation region according to encoding-related information; and encoding a plurality of pixels by at least performing motion estimation based on the size of the parallel motion estimation region. One exemplary video decoding method has the following steps: decoding a video parameter stream to obtain a decoded size of a parallel motion estimation region; checking validity of the decoded size of the parallel motion estimation region, and accordingly generating a checking result; when the checking result indicates that the decoded size of the parallel motion estimation region is invalid, entering an error handling process to decide a size of the parallel motion estimation region; and decoding a plurality of pixels by at least performing motion estimation based on the decided size of the parallel motion estimation region.
Type: Grant
Filed: May 6, 2014
Date of Patent: November 28, 2017
Assignee: MEDIATEK INC.
Inventors: Tung-Hsing Wu, Kun-Bin Lee
-
Patent number: 9804848
Abstract: Method, apparatus, and program for performing a string comparison operation. The apparatus includes execution resources to execute a first instruction. In response to the first instruction, the execution resources store a result of a comparison between each data element of a first and second operand corresponding to a first and second text string, respectively.
Type: Grant
Filed: December 5, 2014
Date of Patent: October 31, 2017
Assignee: Intel Corporation
Inventors: Michael A. Julier, Jeffrey D. Gray, Srinivas Chennupaty, Sean P. Mirkes, Mark P. Seconi
-
Patent number: 9804826
Abstract: System and method for pseudo-random number generation based on a recursion with significantly increased multithreaded parallelism. A single pseudo-random generator program is assigned multiple threads to process in parallel. N state elements indexed incrementally are arranged into a matrix comprising x rows, where a respective adjacent pair of state elements in the same column are related by g=(M+j) mod N, wherein j and g represent the indexes of the pair of state elements. x can be determined through a modular multiplicative inverse of M and N. The matrix can be divided into sections, each section having a number of columns, and each thread is assigned a section. In this manner, the majority of the requisite interactions among the state elements occur without expensive inter-thread communications, and each thread may only need to communicate with a single other thread a small number of times.
Type: Grant
Filed: December 5, 2014
Date of Patent: October 31, 2017
Assignee: Nvidia Corporation
Inventors: Przemyslaw Tredak, John Clifton Woolley, Jr.
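A minimal sketch of the state-matrix layout described above, under the assumption that gcd(M, N) = 1 so the recursion chain visits every state index (the function name and the choice of x are illustrative, not from the patent):

```python
from math import gcd

def arrange_state(N, M, x):
    """Lay out state indices 0..N-1 column by column so that each
    vertically adjacent pair (j, g) in a column satisfies
    g == (M + j) % N, as in the abstract's recursion relation."""
    assert gcd(M, N) == 1 and N % x == 0
    # Walking the chain j -> (M + j) % N from 0 enumerates i*M % N.
    order = [(i * M) % N for i in range(N)]
    # Cut the chain into columns of x rows each.
    return [order[c * x:(c + 1) * x] for c in range(N // x)]

cols = arrange_state(12, 5, 3)
print(cols)
```

Assigning contiguous groups of columns to threads then keeps most column-wise interactions thread-local, which is the parallelism benefit the abstract claims.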
-
Patent number: 9772864
Abstract: When an OpenCL kernel is to be executed, a bitfield index representation to be used for the indices of the kernel invocations is determined based on the number of bits needed to represent the maximum value that will be needed for each index dimension for the kernel. A bitfield placement data structure 33 describing how the bitfield index representation is partitioned is then prepared together with a maximum value data structure 32 indicating the maximum index dimension values to be used for the kernel. A processor then executes the kernel invocations 36 across the index space indicated by the maximum value data structure 32. A bitfield index representation 35, 37, 38 configured in accordance with the bitfield placement data structure 33 is associated with each kernel invocation to indicate its index.
Type: Grant
Filed: April 16, 2013
Date of Patent: September 26, 2017
Assignee: ARM LIMITED
Inventor: Jorn Nystad
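As a hypothetical sketch of the idea (function names and the packing order are assumptions, not ARM's implementation): size each bitfield from the maximum value of its index dimension, then pack a multi-dimensional invocation index into one word.

```python
def make_placement(max_dims):
    """Bits needed per index dimension, plus each field's shift
    (the 'bitfield placement' role from the abstract, illustratively)."""
    bits = [max(1, (m - 1).bit_length()) for m in max_dims]
    shifts = [sum(bits[:i]) for i in range(len(bits))]
    return bits, shifts

def pack_index(idx, shifts):
    # pack each dimension's value into its bitfield
    return sum(v << s for v, s in zip(idx, shifts))

def unpack_index(packed, bits, shifts):
    # mask each bitfield back out to recover the per-dimension index
    return tuple((packed >> s) & ((1 << b) - 1)
                 for b, s in zip(bits, shifts))

bits, shifts = make_placement([600, 40, 8])  # e.g. a 600x40x8 index space
packed = pack_index((599, 39, 7), shifts)
print(unpack_index(packed, bits, shifts))
```

The payoff is that a 3-D invocation index fits in one register-sized word whose layout adapts per kernel, instead of three fixed-width counters.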
-
Patent number: 9766888
Abstract: A processor of an aspect includes packed data registers, and a decode unit to decode an instruction. The instruction may indicate a first source packed data to include at least four data elements, indicate a second source packed data to include at least four data elements, and indicate a destination storage location. An execution unit is coupled with the packed data registers and the decode unit. The execution unit, in response to the instruction, is to store a result packed data in the destination storage location. The result packed data may include at least four indexes that may identify corresponding data element positions in the first and second source packed data. The indexes may be stored in positions in the result packed data that are to represent a sorted order of corresponding data elements in the first and second source packed data.
Type: Grant
Filed: March 28, 2014
Date of Patent: September 19, 2017
Assignee: Intel Corporation
Inventors: Shay Gueron, Vlad Krasnov
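A rough functional model of such an instruction (the index encoding — positions into the concatenation of the two sources — is an assumption for illustration): the result holds indexes, not values, in sorted order of the data they point at.

```python
def sort_indexes(a, b):
    """Model of a sorted-index instruction: return indexes into the
    concatenation a + b, ordered so that gathering by them yields
    the elements of both source vectors in sorted order."""
    merged = a + b
    return sorted(range(len(merged)), key=lambda i: merged[i])

idx = sort_indexes([7, 1, 9, 3], [2, 8, 0, 5])
print(idx)
```

Producing indexes rather than values lets software permute several parallel arrays with one computed shuffle pattern.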
-
Patent number: 9760530
Abstract: An apparatus, computer-readable medium, and computer-implemented method for parallelization of a computer program on a plurality of computing cores includes receiving a computer program comprising a plurality of commands, decomposing the plurality of commands into a plurality of node networks, each node network corresponding to a command in the plurality of commands and including one or more nodes corresponding to execution dependencies of the command, mapping the plurality of node networks to a plurality of systolic arrays, each systolic array comprising a plurality of cells and each non-data node in each node network being mapped to a cell in the plurality of cells, and mapping each cell in each systolic array to a computing core in the plurality of computing cores.
Type: Grant
Filed: March 31, 2017
Date of Patent: September 12, 2017
Assignee: CORNAMI, INC.
Inventors: Solomon Harsha, Paul Master
-
Patent number: 9760531
Abstract: An apparatus, computer-readable medium, and computer-implemented method for parallelization of a computer program on a plurality of computing cores includes receiving a computer program comprising a plurality of commands, decomposing the plurality of commands into a plurality of node networks, each node network corresponding to a command in the plurality of commands and including one or more nodes corresponding to execution dependencies of the command, mapping the plurality of node networks to a plurality of systolic arrays, each systolic array comprising a plurality of cells and each non-data node in each node network being mapped to a cell in the plurality of cells, and mapping each cell in each systolic array to a computing core in the plurality of computing cores.
Type: Grant
Filed: April 6, 2017
Date of Patent: September 12, 2017
Assignee: CORNAMI, INC.
Inventors: Solomon Harsha, Paul Master
-
Patent number: 9727380
Abstract: Global register protection in a multi-threaded processor is described. In an embodiment, global resources within a multi-threaded processor are protected by performing checks, before allowing a thread to write to a global resource, to determine whether the thread has write access to the particular global resource. The check involves accessing one or more local control registers or a global control field within the multi-threaded processor; in an example, a local register associated with each other thread in the multi-threaded processor is accessed and checked to see whether it contains an identifier for the particular global resource. Only if none of the accessed local registers contain such an identifier is the instruction issued and the thread allowed to write to the global resource. Otherwise, the instruction is blocked and an exception may be raised to alert the program that issued the instruction that the write failed.
Type: Grant
Filed: February 19, 2015
Date of Patent: August 8, 2017
Assignee: Imagination Technologies Limited
Inventors: Guixin Wang, Hugh Jackson, Robert Graham Isherwood
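The check described in the example can be modeled in a few lines (a sketch, assuming a per-thread set of claimed resource identifiers; names are illustrative, not from the patent):

```python
def may_write(thread_id, resource_id, local_regs):
    """Return True if no *other* thread's local control register
    holds an identifier for resource_id; only then may the write
    instruction issue."""
    return all(resource_id not in claimed
               for tid, claimed in local_regs.items()
               if tid != thread_id)

# Thread 0 has claimed R1, thread 1 has claimed R2.
local_regs = {0: {"R1"}, 1: {"R2"}, 2: set()}
print(may_write(0, "R2", local_regs))  # blocked: another thread claims R2
```

Note that a thread's own local register never blocks it, which is why the scan skips `thread_id` itself.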
-
Patent number: 9710171
Abstract: Aspects include communicating synchronous input/output (I/O) commands between an operating system and a recipient. Communicating synchronous I/O commands includes issuing a first synchronous I/O command with a first initiation bit set, where the first synchronous I/O command causes a first mailbox command to be initiated by the recipient with respect to a first storage control unit. Communicating synchronous I/O commands further includes issuing a second synchronous I/O command with a second initiation bit set, where the second synchronous I/O command causes a second mailbox command to be initiated by the recipient with respect to at least one subsequent storage control unit. Communicating synchronous I/O commands also includes issuing a third synchronous I/O command with a first completion bit set in response to the first mailbox command being initiated, and issuing a fourth synchronous I/O command with a second completion bit set in response to the second mailbox command being initiated.
Type: Grant
Filed: October 1, 2015
Date of Patent: July 18, 2017
Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: David F. Craddock, Peter G. Sutton, Harry M. Yudenfriend
-
Patent number: 9710172
Abstract: Aspects include communicating synchronous input/output (I/O) commands between an operating system and a recipient. Communicating synchronous I/O commands includes issuing a first synchronous I/O command with a first initiation bit set, where the first synchronous I/O command causes a first mailbox command to be initiated by the recipient with respect to a first storage control unit. Communicating synchronous I/O commands further includes issuing a second synchronous I/O command with a second initiation bit set, where the second synchronous I/O command causes a second mailbox command to be initiated by the recipient with respect to at least one subsequent storage control unit. Communicating synchronous I/O commands also includes issuing a third synchronous I/O command with a first completion bit set in response to the first mailbox command being initiated, and issuing a fourth synchronous I/O command with a second completion bit set in response to the second mailbox command being initiated.
Type: Grant
Filed: June 14, 2016
Date of Patent: July 18, 2017
Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: David F. Craddock, Peter G. Sutton, Harry M. Yudenfriend
-
Patent number: 9703966
Abstract: A data processing system includes a single instruction multiple data register file and single instruction multiple data processing circuitry. The single instruction multiple data processing circuitry supports execution of cryptographic processing instructions for performing parts of a hash algorithm. The operands are stored within the single instruction multiple data register file. The cryptographic support instructions do not follow normal lane-based processing and generate output operands in which different portions of the output operand depend upon multiple different elements within the input operand.
Type: Grant
Filed: July 7, 2015
Date of Patent: July 11, 2017
Assignee: ARM LIMITED
Inventors: Matthew James Horsnell, Richard Roy Grisenthwaite, Stuart David Biles, Daniel Kershaw
-
Patent number: 9652435
Abstract: An apparatus, computer-readable medium, and computer-implemented method for parallelization of a computer program on a plurality of computing cores includes receiving a computer program comprising a plurality of commands, decomposing the plurality of commands into a plurality of node networks, each node network corresponding to a command in the plurality of commands and including one or more nodes corresponding to execution dependencies of the command, mapping the plurality of node networks to a plurality of systolic arrays, each systolic array comprising a plurality of cells and each non-data node in each node network being mapped to a cell in the plurality of cells, and mapping each cell in each systolic array to a computing core in the plurality of computing cores.
Type: Grant
Filed: October 18, 2016
Date of Patent: May 16, 2017
Assignee: CORNAMI, INC.
Inventors: Solomon Harsha, Paul Master
-
Patent number: 9645821
Abstract: A processor includes decoder logic to decode a compare instruction, and an execution unit to execute the compare instruction. The compare instruction is to cause the processor to compare integer data elements of a first 64-bit SIMD integer operand with integer data elements of a second 64-bit SIMD integer operand. The integer data elements of the first 64-bit SIMD integer operand to be compared with the integer data elements of the second 64-bit SIMD integer operand are to be in the same data element positions. The compare instruction is also to cause the processor to store a plurality of indicators of whether the compared integer data elements of the first and second 64-bit SIMD integer operands are equal. The plurality of indicators are expanded data elements, each of a first multi-bit size.
Type: Grant
Filed: December 5, 2014
Date of Patent: May 9, 2017
Assignee: Intel Corporation
Inventors: Michael A. Julier, Jeffrey D. Gray, Srinivas Chennupaty, Sean P. Mirkes, Mark P. Seconi
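The "expanded indicator" result can be modeled simply (a sketch of the general element-wise compare-to-mask pattern, not Intel's microarchitecture; the 16-bit element width is an assumed example): each lane becomes all-ones on equality and all-zeros otherwise.

```python
def simd_compare_eq(a, b, width_bits=16):
    """Element-wise equality compare producing expanded multi-bit
    indicators: an all-ones mask per equal lane, zero otherwise."""
    ones = (1 << width_bits) - 1  # e.g. 0xFFFF for 16-bit elements
    return [ones if x == y else 0 for x, y in zip(a, b)]

print(simd_compare_eq([1, 2, 3, 4], [1, 9, 3, 0]))
```

Full-width masks are convenient because they can feed directly into bitwise select/blend operations without further widening.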
-
Patent number: 9633407
Abstract: A thread on one processor may be used to enable another processor to lock or release a mutex. For example, a central processing unit thread may be used by a graphics processing unit to secure a mutex for a shared memory.
Type: Grant
Filed: July 29, 2011
Date of Patent: April 25, 2017
Assignee: Intel Corporation
Inventors: Boris Ginzburg, Esfirush Natanzon, Ilya Osadchiy, Yoav Zach
-
Patent number: 9626402
Abstract: Techniques for performing database operations using vectorized instructions are provided. In one technique, data compaction is performed using vectorized instructions to identify a shuffle mask based on matching bits and update an output array based on the shuffle mask and an input array. In a related technique, a hash table probe involves using vectorized instructions to determine whether each key in one or more hash buckets matches a particular input key.
Type: Grant
Filed: August 1, 2013
Date of Patent: April 18, 2017
Assignee: Oracle International Corporation
Inventors: Rajkumar Sen, Sam Idicula, Nipun Agarwal
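The shuffle-mask compaction technique can be sketched as follows (a scalar model of the vectorized idea; the 4-element lane width and table layout are illustrative assumptions): a precomputed table maps each match bitmask to the shuffle pattern that gathers the surviving elements.

```python
def build_shuffle_table(width=4):
    """table[mask] lists the positions of the set bits in mask --
    the shuffle pattern that compacts matching elements to the front."""
    return [[i for i in range(width) if (m >> i) & 1]
            for m in range(1 << width)]

def compact(values, match_bits, table, width=4):
    """Per 'vector' of width elements: form the match bitmask, look up
    its shuffle pattern, and append the selected elements to the output."""
    out = []
    for base in range(0, len(values), width):
        mask = 0
        for k in range(width):
            mask |= match_bits[base + k] << k
        out.extend(values[base + p] for p in table[mask])
    return out

table = build_shuffle_table()
print(compact([10, 20, 30, 40], [1, 0, 1, 1], table))
```

The table lookup replaces a data-dependent branch per element with one shuffle per vector, which is what makes the approach SIMD-friendly.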
-
Patent number: 9612838
Abstract: Instructions for generating flags according to operands' data sizes, and instruction sets handled by a RISC data processor including an instruction capable of executing an operation on operands in more than one data size, are disclosed. An identical operation process is conducted on the small-size operand and on the low-order bits of the large-size operand, and flags capable of coping with the respective data sizes are generated regardless of the data size of each operand subjected to the operation. Thus, a reduction in the instruction code space of the RISC data processor can be achieved.
Type: Grant
Filed: May 21, 2014
Date of Patent: April 4, 2017
Assignee: Renesas Electronics Corporation
Inventor: Fumio Arakawa
-
Patent number: 9600852
Abstract: A graphics processing unit having an implementation of a hierarchical hash table thereon, a method of establishing a hierarchical hash table in a graphics processing unit, and a GPU computing system are disclosed herein. In one embodiment, the graphics processing unit includes: (1) a plurality of parallel processors, wherein each of the plurality of parallel processors includes parallel processing cores, a shared memory coupled to each of the parallel processing cores, and registers, wherein each one of the registers is uniquely associated with one of the parallel processing cores, and (2) a controller configured to employ at least one of the registers to establish a hierarchical hash table for a key-value pair of a thread processing on one of the parallel processing cores.
Type: Grant
Filed: May 10, 2013
Date of Patent: March 21, 2017
Assignee: Nvidia Corporation
Inventor: Julien Demouth
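A hierarchy like the one above can be modeled abstractly (a sketch only; the three tiers stand in for per-core registers, block-shared memory, and a global table, and all names are hypothetical): lookups try the fastest, most local tier first and fall through.

```python
def hierarchical_lookup(key, reg_slot, shared_table, global_table):
    """Probe a per-thread register slot (a single cached key-value
    pair), then a block-shared table, then the global table."""
    if reg_slot is not None and reg_slot[0] == key:
        return reg_slot[1]          # hit in the thread's own register
    if key in shared_table:
        return shared_table[key]    # hit in block-shared memory
    return global_table.get(key)    # fall back to the global table

print(hierarchical_lookup("k", ("k", 1), {}, {"k": 3}))
```

The design choice mirrors the GPU memory hierarchy: register hits cost nothing, shared-memory hits avoid global traffic, and only misses pay full latency.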