Patents by Inventor Jeffrey T. Huynh
Jeffrey T. Huynh has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 11232016
Abstract: Techniques disclosed herein relate generally to debugging complex computing systems, such as those executing neural networks. A neural network processor includes a processing engine configured to execute instructions to implement multiple layers of a neural network. The neural network processor includes a debugging circuit configured to generate error detection codes for input data to the processing engine or error detection codes for output data generated by the processing engine. The neural network processor also includes an interface to a memory device, where the interface is configured to save the error detection codes generated by the debugging circuit into the memory device. The error detection codes generated by the debugging circuit are compared with expected error detection codes generated using a function model of the neural network to identify defects of the neural network.
Type: Grant
Filed: September 21, 2018
Date of Patent: January 25, 2022
Assignee: Amazon Technologies, Inc.
Inventors: Jeffrey T. Huynh, Ron Diamant, Sundeep Amirineni, Randy Renfu Huang
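As a rough illustration of the comparison flow this abstract describes, the sketch below computes a per-layer error detection code for a hardware run and for a functional-model run and reports the first layer whose codes disagree. CRC32 and all function names here are stand-ins chosen for the example, not details from the patent.

```python
import zlib
import numpy as np

def edc(tensor: np.ndarray) -> int:
    """Error detection code for a tensor; CRC32 is used here as a stand-in."""
    return zlib.crc32(np.ascontiguousarray(tensor).tobytes())

def compare_layer_edcs(device_outputs, reference_outputs):
    """Compare per-layer codes from the hardware run against codes from a
    functional model to localize the first mismatching layer."""
    for layer, (dev, ref) in enumerate(zip(device_outputs, reference_outputs)):
        if edc(dev) != edc(ref):
            return layer          # first layer whose output disagrees
    return None                   # no mismatch found

# Toy usage: layer 2 of the "device" run is corrupted.
rng = np.random.default_rng(0)
reference = [rng.standard_normal((4, 4)).astype(np.float32) for _ in range(4)]
device = [t.copy() for t in reference]
device[2][0, 0] += 1.0
print(compare_layer_edcs(device, reference))   # -> 2
```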
-
Patent number: 11218410
Abstract: Embodiments of the present invention are directed to a wildcard matching solution that uses a combination of static random access memories (SRAMs) and ternary content addressable memories (TCAMs) in a hybrid solution. In particular, the wildcard matching solution uses a plurality of SRAM pools for lookup and a spillover TCAM pool for unresolved hash conflicts.
Type: Grant
Filed: November 4, 2015
Date of Patent: January 4, 2022
Assignee: Marvell Asia PTE, LTD.
Inventors: Jeffrey T. Huynh, Weihuang Wang, Tsahi Daniel, Srinath Atluri, Mohan Balan
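A minimal software model of the hybrid lookup idea is sketched below: several hash-indexed pools stand in for the SRAMs, and a small list of (value, mask) entries stands in for the spillover TCAM that absorbs unresolved hash conflicts. The pool sizing, the hashing, and the rule that wildcarded entries go straight to the TCAM are assumptions made for the example, not details of the patented design.

```python
class HybridWildcardTable:
    """Toy model of exact-hash SRAM pools with a spillover TCAM for entries
    that collide in every pool. Real hardware details differ."""

    def __init__(self, num_pools=4, pool_size=256):
        self.pools = [dict() for _ in range(num_pools)]   # SRAM pools
        self.pool_size = pool_size
        self.tcam = []                                    # spillover (value, mask, action)

    def _index(self, key, pool_id):
        return hash((pool_id, key)) % self.pool_size

    def insert(self, key, mask, action):
        if mask != 0xFFFFFFFF:
            # Wildcarded entries go straight to the TCAM in this toy model.
            self.tcam.append((key & mask, mask, action))
            return
        for pool_id, pool in enumerate(self.pools):
            slot = self._index(key, pool_id)
            if slot not in pool:
                pool[slot] = (key, action)
                return
        # Unresolved hash conflict: spill to the TCAM as an exact-match entry.
        self.tcam.append((key, 0xFFFFFFFF, action))

    def lookup(self, key):
        for pool_id, pool in enumerate(self.pools):
            entry = pool.get(self._index(key, pool_id))
            if entry and entry[0] == key:
                return entry[1]
        for value, mask, action in self.tcam:   # searched in insertion order
            if key & mask == value:
                return action
        return None

table = HybridWildcardTable()
table.insert(0x0A000001, 0xFFFFFFFF, "host-route")
table.insert(0x0A000000, 0xFFFFFF00, "subnet-route")   # wildcard entry
print(table.lookup(0x0A000001), table.lookup(0x0A0000FE))
```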
-
Publication number: 20210247984
Abstract: Techniques are disclosed for reordering operations of a neural network to improve runtime efficiency. In some examples, a compiler receives a description of the neural network comprising a plurality of operations. The compiler may determine which execution engine of a plurality of execution engines is to perform each of the plurality of operations. The compiler may determine an order of performance associated with the plurality of operations. The compiler may identify a runtime inefficiency based on the order of performance and a hardware usage for each of the plurality of operations. An operation may be reordered to reduce the runtime inefficiency. Instructions may be compiled based on the plurality of operations, which include the reordered operation.
Type: Application
Filed: April 28, 2021
Publication date: August 12, 2021
Inventors: Jeffrey T. Huynh, Drazen Borkovic, Jindrich Zejda, Randy Renfu Huang, Ron Diamant
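The sketch below illustrates the kind of inefficiency such a reorder can remove, assuming per-engine in-order execution: an independent activation operation queued behind one that waits on a long matmul idles the activation engine, and hoisting the independent operation shortens the schedule. Engine names and durations are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    engine: str              # e.g. "pe_array", "activation"
    duration: int
    deps: list = field(default_factory=list)

def simulate(ops):
    """Each engine executes its ops in program order; an op also waits for
    its dependencies. Returns the makespan of the schedule."""
    finish, engine_free = {}, {}
    for op in ops:
        start = max([engine_free.get(op.engine, 0)] +
                    [finish[d] for d in op.deps])
        finish[op.name] = start + op.duration
        engine_free[op.engine] = finish[op.name]
    return max(finish.values())

matmul = Op("matmul", "pe_array", 50)
act_dep = Op("act_dep", "activation", 5, deps=["matmul"])   # waits on matmul
act_ind = Op("act_ind", "activation", 5)                    # independent

print(simulate([matmul, act_dep, act_ind]))   # 60: act_ind stuck behind act_dep
print(simulate([matmul, act_ind, act_dep]))   # 55: independent op hoisted
```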
-
Publication number: 20210158131
Abstract: Methods and apparatuses for hierarchical partitioning of operators of a neural network for execution on an acceleration engine are provided. Neural networks are built in machine learning frameworks using neural network operators. The neural network operators are compiled into executable code for the acceleration engine. Development of new framework-level operators can exceed the capability to map the newly developed framework-level operators onto the acceleration engine. To enable neural networks to be executed on an acceleration engine, hierarchical partitioning can be used to partition the operators of the neural network. The hierarchical partitioning can identify operators that are supported by a compiler for execution on the acceleration engine, operators to be compiled for execution on a host processor, and operators to be executed on the machine learning framework.
Type: Application
Filed: November 27, 2019
Publication date: May 27, 2021
Inventors: Animesh Jain, Yizhi Liu, Hongbin Zheng, Jeffrey T. Huynh, Haichen Li, Drazen Borkovic, Jindrich Zejda, Richard John Heaton, Randy Renfu Huang, Zhi Chen, Yida Wang
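A toy partitioner in the spirit of this hierarchy is sketched below: operators supported by a (hypothetical) accelerator compiler go to the accelerator, a second set is compiled for the host processor, and anything else stays on the framework. The operator sets and names are illustrative, not taken from the publication.

```python
# Operators the accelerator compiler supports, and operators that can at
# least be compiled for the host CPU; everything else falls back to the
# framework. These sets are illustrative only.
ACCEL_OPS = {"conv2d", "dense", "relu", "max_pool"}
HOST_OPS  = {"softmax", "argmax", "concat"}

def partition(graph_ops):
    """Assign each operator (given in topological order) to one of three
    targets, grouping consecutive operators with the same target into
    subgraphs."""
    subgraphs = []
    for name, op_type in graph_ops:
        if op_type in ACCEL_OPS:
            target = "accelerator"
        elif op_type in HOST_OPS:
            target = "host"
        else:
            target = "framework"
        if subgraphs and subgraphs[-1][0] == target:
            subgraphs[-1][1].append(name)
        else:
            subgraphs.append((target, [name]))
    return subgraphs

model = [("c1", "conv2d"), ("r1", "relu"), ("p1", "max_pool"),
         ("nms", "non_max_suppression"),          # unsupported framework-level op
         ("d1", "dense"), ("s1", "softmax")]
for target, ops in partition(model):
    print(target, ops)
```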
-
Publication number: 20210158132
Abstract: A computer-implemented method includes receiving a neural network model for implementation using a processing element array, where the neural network model includes a convolution operation on a set of input feature maps and a set of filters. The method also includes determining, based on the neural network model, that the convolution operation utilizes less than a threshold number of rows in the processing element array for applying a set of filter elements to the set of input feature maps, where the set of filter elements includes one filter element in each filter of the set of filters. The method further includes generating, for the convolution operation and based on the neural network model, a first instruction and a second instruction for execution by respective rows in the processing element array, where the first instruction and the second instruction use different filter elements of a filter in the set of filters.
Type: Application
Filed: November 27, 2019
Publication date: May 27, 2021
Inventors: Jeffrey T. Huynh, Ron Diamant, Hongbin Zheng, Yizhi Liu, Animesh Jain, Yida Wang, Vinod Sharma, Richard John Heaton, Randy Renfu Huang, Sundeep Amirineni, Drazen Borkovic
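The idea of spreading one filter's elements across otherwise idle rows can be illustrated in one dimension, as in the sketch below: each filter element is issued as its own single-weight pass over a shifted view of the input, and summing those partial results reproduces the ordinary convolution. The helper names and the 1-D simplification are assumptions for the example, not the patented method itself.

```python
import numpy as np

def conv1d_valid(x, w):
    """Reference 1-D valid convolution (correlation form)."""
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])

def conv1d_split_rows(x, w):
    """Each filter element is assigned to its own 'row': row j multiplies a
    shifted view of the input by the single weight w[j]. Summing the rows'
    partial results reproduces the full convolution, which is the idea behind
    spreading one filter's elements across otherwise-idle PE rows."""
    k, n_out = len(w), len(x) - len(w) + 1
    partials = [w[j] * x[j:j + n_out] for j in range(k)]   # one instruction per row
    return np.sum(partials, axis=0)

x = np.arange(10.0)
w = np.array([1.0, -2.0, 0.5])
print(np.allclose(conv1d_valid(x, w), conv1d_split_rows(x, w)))   # True
```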
-
Patent number: 11016775
Abstract: Techniques are disclosed for reordering operations of a neural network to improve runtime efficiency. In some examples, a compiler receives a description of the neural network comprising a plurality of operations. The compiler may determine which execution engine of a plurality of execution engines is to perform each of the plurality of operations. The compiler may determine an order of performance associated with the plurality of operations. The compiler may identify a runtime inefficiency based on the order of performance and a hardware usage for each of the plurality of operations. An operation may be reordered to reduce the runtime inefficiency. Instructions may be compiled based on the plurality of operations, which include the reordered operation.
Type: Grant
Filed: June 26, 2019
Date of Patent: May 25, 2021
Assignee: Amazon Technologies, Inc.
Inventors: Jeffrey T. Huynh, Drazen Borkovic, Jindrich Zejda, Randy Renfu Huang, Ron Diamant
-
Patent number: 11003429
Abstract: Scheduling of the operations of an integrated circuit device such as a hardware accelerator, including scheduling of movement of data into and out of the accelerator, can be performed by a compiler that produces program code for the accelerator. The compiler can produce a graph that represents operations to be performed by the accelerator. Using the graph, the compiler can determine estimated execution times for the operations represented by each node in the graph. The compiler can schedule operations by determining an estimated execution time for a set of dependent operations that depend from an operation. The compiler can then select an operation that has a shortest estimated execution time from among a set of operations and which has a set of dependent operations that has a longest estimated execution time as compared to other sets of dependent operations.
Type: Grant
Filed: February 4, 2019
Date of Patent: May 11, 2021
Assignee: Amazon Technologies, Inc.
Inventors: Jindrich Zejda, Jeffrey T. Huynh, Tobias Joseph Kastulus Edler von Koch, Drazen Borkovic, Taemin Kim
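One possible reading of this selection rule is sketched below: among the ready operations, prefer the one whose transitive dependents have the longest total estimated time, breaking ties by the operation's own (shortest) estimated time. The graph, the durations, and the tie-breaking detail are assumptions made for the example.

```python
from collections import defaultdict

def downstream_time(op, children, duration, memo):
    """Total estimated time of everything that transitively depends on op."""
    if op not in memo:
        deps, stack = set(), list(children[op])
        while stack:
            d = stack.pop()
            if d not in deps:
                deps.add(d)
                stack.extend(children[d])
        memo[op] = sum(duration[d] for d in deps)
    return memo[op]

def schedule(duration, edges):
    """Greedy order: among ready ops, prefer the one whose dependents have the
    longest total estimated time, breaking ties by the shortest own time."""
    children, indegree = defaultdict(list), defaultdict(int)
    for a, b in edges:                      # a must run before b
        children[a].append(b)
        indegree[b] += 1
    ready = [op for op in duration if indegree[op] == 0]
    order, memo = [], {}
    while ready:
        ready.sort(key=lambda op: (-downstream_time(op, children, duration, memo),
                                   duration[op]))
        op = ready.pop(0)
        order.append(op)
        for c in children[op]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    return order

duration = {"load_a": 2, "load_b": 2, "matmul": 8, "act": 3, "store": 1}
edges = [("load_a", "matmul"), ("load_b", "act"),
         ("matmul", "store"), ("act", "store")]
print(schedule(duration, edges))
```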
-
Publication number: 20210097375
Abstract: In one example, a neural network accelerator can execute a set of instructions to: load a first weight data element from a memory into a systolic array, the first weight data element having first coordinates; extract, from the instructions, information indicating a first subset of input data elements to be obtained from the memory, the first subset being based on a stride of a transposed convolution operation and second coordinates of the first weight data element in a rotated array of weight data elements; based on the information, obtain the first subset of input data elements from the memory; load the first subset of input data elements into the systolic array; and control the systolic array to perform first computations based on the first weight data element and the first subset of input data elements to generate output data elements of an array of output data elements.
Type: Application
Filed: September 27, 2019
Publication date: April 1, 2021
Inventors: Jeffrey T. Huynh, Vignesh Vivekraja
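The per-weight-element decomposition can be shown in one dimension, as in the sketch below: each weight tap contributes only to a strided set of output positions determined by the stride and the tap's coordinate, so it can be issued as a single pass over its input subset. This sketch uses the scatter form of the transposed convolution and does not model the kernel rotation mentioned in the abstract; all names are illustrative.

```python
import numpy as np

def transposed_conv1d_reference(x, w, stride):
    """Direct definition: each input element scatters a scaled copy of the
    kernel into the output."""
    y = np.zeros((len(x) - 1) * stride + len(w))
    for i, xi in enumerate(x):
        y[i * stride:i * stride + len(w)] += xi * w
    return y

def transposed_conv1d_per_tap(x, w, stride):
    """Per-weight-element decomposition: weight tap j contributes only to
    output positions i*stride + j, multiplying input element i. Each tap can
    therefore be issued as one pass over its subset of inputs."""
    y = np.zeros((len(x) - 1) * stride + len(w))
    for j, wj in enumerate(w):                 # one weight element at a time
        outputs = np.arange(len(x)) * stride + j
        y[outputs] += wj * x                   # the input subset for this tap
    return y

x = np.array([1.0, 2.0, -1.0, 0.5])
w = np.array([0.5, 1.0, 0.25])
print(np.allclose(transposed_conv1d_reference(x, w, 2),
                  transposed_conv1d_per_tap(x, w, 2)))   # True
```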
-
Publication number: 20210096823
Abstract: Provided are integrated circuits and methods for transposing a tensor using processing element array operations. In some cases, it may be necessary to transpose elements of a tensor to perform a matrix operation. The tensor may be decomposed into blocks of data elements having dimensions consistent with the dimensions of a systolic array. An identity multiplication may be performed on each block of data elements loaded into a systolic array and the multiplication products summed in column partitions of a results buffer. The data elements in the column partitions of the results buffer can then be mapped to row partitions of a buffer memory for further processing.
Type: Application
Filed: December 15, 2020
Publication date: April 1, 2021
Inventors: Haichen Li, Ron Diamant, Jeffrey T. Huynh, Yu Zhou, Se jong Oh
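A rough software analogue of the block-wise mapping is sketched below: each array-sized block is passed through an identity multiplication (standing in for the systolic-array pass) and its result columns are written into the row positions of the transposed output. The block size and helper names are assumptions for the example.

```python
import numpy as np

def transpose_via_identity(tensor, block=4):
    """Block-wise transpose. Each block is passed through an identity matmul
    (a stand-in for loading it into the PE array), and its result columns are
    written into the row partitions of the transposed position."""
    rows, cols = tensor.shape
    assert rows % block == 0 and cols % block == 0
    out = np.zeros((cols, rows), dtype=tensor.dtype)
    identity = np.eye(block, dtype=tensor.dtype)
    for r in range(0, rows, block):
        for c in range(0, cols, block):
            product = identity @ tensor[r:r + block, c:c + block]   # PE-array pass
            # Columns of the product become rows of the destination block.
            out[c:c + block, r:r + block] = product.T
    return out

a = np.arange(64, dtype=np.float32).reshape(8, 8)
print(np.array_equal(transpose_via_identity(a), a.T))   # True
```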
-
Patent number: 10884707
Abstract: Provided are systems and methods for transposing a tensor using processing element array operations. In some cases, it may be necessary to transpose elements of a tensor to perform a matrix operation. The tensor may be decomposed into blocks of data elements having dimensions consistent with the dimensions of a systolic array. An identity multiplication may be performed on each block of data elements loaded into a systolic array and the multiplication products summed in column partitions of a results buffer. The data elements in the column partitions of the results buffer can then be mapped to row partitions of a buffer memory for further processing.
Type: Grant
Filed: June 27, 2019
Date of Patent: January 5, 2021
Assignee: Amazon Technologies, Inc.
Inventors: Haichen Li, Ron Diamant, Jeffrey T. Huynh, Yu Zhou, Se jong Oh
-
Publication number: 20200409717
Abstract: Techniques are disclosed for reordering operations of a neural network to improve runtime efficiency. In some examples, a compiler receives a description of the neural network comprising a plurality of operations. The compiler may determine which execution engine of a plurality of execution engines is to perform each of the plurality of operations. The compiler may determine an order of performance associated with the plurality of operations. The compiler may identify a runtime inefficiency based on the order of performance and a hardware usage for each of the plurality of operations. An operation may be reordered to reduce the runtime inefficiency. Instructions may be compiled based on the plurality of operations, which include the reordered operation.
Type: Application
Filed: June 26, 2019
Publication date: December 31, 2020
Inventors: Jeffrey T. Huynh, Drazen Borkovic, Jindrich Zejda, Randy Renfu Huang, Ron Diamant
-
Publication number: 20200409664
Abstract: Provided are systems and methods for transposing a tensor using processing element array operations. In some cases, it may be necessary to transpose elements of a tensor to perform a matrix operation. The tensor may be decomposed into blocks of data elements having dimensions consistent with the dimensions of a systolic array. An identity multiplication may be performed on each block of data elements loaded into a systolic array and the multiplication products summed in column partitions of a results buffer. The data elements in the column partitions of the results buffer can then be mapped to row partitions of a buffer memory for further processing.
Type: Application
Filed: June 27, 2019
Publication date: December 31, 2020
Inventors: Haichen Li, Ron Diamant, Jeffrey T. Huynh, Yu Zhou, Se jong Oh
-
Publication number: 20200410036
Abstract: In one example, a non-transitory computer readable medium stores instructions that, when executed by one or more hardware processors, cause the one or more hardware processors to: load a first weight data element of an array of weight data elements from a memory into a systolic array; select a subset of input data elements from the memory into the systolic array to perform first computations of a dilated convolution operation, the subset being selected based on a rate of the dilated convolution operation and coordinates of the weight data element within the array of weight data elements; and control the systolic array to perform the first computations based on the first weight data element and the subset to generate first output data elements of an output data array. An example of a compiler that generates the instructions is also provided.
Type: Application
Filed: June 28, 2019
Publication date: December 31, 2020
Inventors: Jeffrey T. Huynh, Ron Diamant
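In one dimension the input-subset selection can be illustrated as in the sketch below: for dilation rate r, weight tap j only ever reads the window starting at j*r, so each tap becomes one strided load plus multiply-accumulate. The reference implementation and all names are written for the example, not taken from the publication.

```python
import numpy as np

def dilated_conv1d_reference(x, w, rate):
    """Direct definition of a 1-D dilated convolution (correlation form)."""
    span = (len(w) - 1) * rate + 1
    n_out = len(x) - span + 1
    return np.array([sum(w[j] * x[o + j * rate] for j in range(len(w)))
                     for o in range(n_out)])

def dilated_conv1d_per_tap(x, w, rate):
    """Per-weight-element form: tap j only touches the input subset
    x[j*rate : j*rate + n_out], a window chosen from the tap's coordinate and
    the dilation rate, so each tap can be issued as one strided load + MAC."""
    span = (len(w) - 1) * rate + 1
    n_out = len(x) - span + 1
    y = np.zeros(n_out)
    for j, wj in enumerate(w):
        y += wj * x[j * rate: j * rate + n_out]   # the selected input subset
    return y

x = np.arange(12.0)
w = np.array([1.0, -1.0, 2.0])
print(np.allclose(dilated_conv1d_reference(x, w, 2),
                  dilated_conv1d_per_tap(x, w, 2)))   # True
```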
-
Publication number: 20200410354
Abstract: Techniques are disclosed for debugging a neural network execution on a target processor. A reference processor may generate a plurality of first reference tensors for the neural network. The neural network may be repeatedly reduced to produce a plurality of lengths. For each of the lengths, a compiler converts the neural network into first machine instructions, the target processor executes the first machine instructions to generate a first device tensor, and the debugger program determines whether the first device tensor matches a first reference tensor. A shortest length is identified for which the first device tensor does not match the first reference tensor. Tensor output is enabled for a lower-level intermediate representation of the shortest neural network, and the neural network is converted into second machine instructions, which are executed by the target processor to generate a second device tensor.
Type: Application
Filed: June 27, 2019
Publication date: December 31, 2020
Inventors: Jindrich Zejda, Jeffrey T. Huynh, Drazen Borkovic, Se jong Oh, Ron Diamant, Randy Renfu Huang
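A toy version of the shrink-and-compare loop is sketched below: network prefixes of increasing length are run on a stand-in "device" and against a reference, and the shortest prefix whose tensors diverge marks where to enable more detailed dumping. The injected defect and all function names are fabricated for the illustration.

```python
import numpy as np

def reference_run(layers, x):
    for f in layers:
        x = f(x)
    return x

def device_run(layers, x, buggy_layer=3):
    """Stand-in for executing compiled machine instructions on the target:
    identical to the reference except that one layer misbehaves."""
    for i, f in enumerate(layers):
        x = f(x)
        if i == buggy_layer:
            x = x + 1e-3          # injected defect
    return x

def shortest_failing_prefix(layers, x, atol=1e-6):
    """Repeatedly shrink the network to prefixes of increasing length and
    return the shortest one whose device tensor no longer matches the
    reference tensor; that prefix is where detailed dumping is enabled."""
    for n in range(1, len(layers) + 1):
        prefix = layers[:n]
        if not np.allclose(device_run(prefix, x, buggy_layer=3),
                           reference_run(prefix, x), atol=atol):
            return n
    return None

layers = [lambda t: t * 2, lambda t: t + 1, np.tanh, lambda t: t * 0.5, np.abs]
print(shortest_failing_prefix(layers, np.ones(4)))   # -> 4 (first prefix containing the defect)
```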
-
Patent number: 10678479
Abstract: Provided are integrated circuits and methods for operating integrated circuits. An integrated circuit can include a plurality of memory banks and an execution engine including a set of execution components. Each execution component can be associated with a respective memory bank, and can read from and write to only the respective memory bank. The integrated circuit can further include a set of registers each associated with a respective memory bank from the plurality of memory banks. The integrated circuit can further be operable to load to or store from the set of registers in parallel, and load to or store from the set of registers serially. A parallel operation followed by a serial operation enables data to be moved from many memory banks into one memory bank. A serial operation followed by a parallel operation enables data to be moved from one memory bank into many memory banks.
Type: Grant
Filed: November 29, 2018
Date of Patent: June 9, 2020
Assignee: Amazon Technologies, Inc.
Inventors: Ron Diamant, Randy Renfu Huang, Sundeep Amirineni, Jeffrey T. Huynh
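A toy model of the many-banks-to-one-bank movement is sketched below: a parallel load pulls one element from every bank into its associated register, and a serial store then drains the registers into consecutive addresses of a single bank. The class and method names are invented for the example.

```python
class BankedMemory:
    """Toy model: each execution component can touch only its own bank, plus
    one register per bank. Parallel ops act on all registers at once; serial
    ops step through the registers one at a time against a single bank."""

    def __init__(self, num_banks, bank_size):
        self.banks = [[0] * bank_size for _ in range(num_banks)]
        self.regs = [0] * num_banks

    def load_parallel(self, addr):
        """Each register loads from the same address of its own bank."""
        for b in range(len(self.banks)):
            self.regs[b] = self.banks[b][addr]

    def store_serial(self, bank, base):
        """Registers are written one after another into consecutive
        addresses of a single bank: a many-banks-to-one-bank gather."""
        for i, value in enumerate(self.regs):
            self.banks[bank][base + i] = value

mem = BankedMemory(num_banks=4, bank_size=8)
for b in range(4):                 # scatter some per-bank data
    mem.banks[b][0] = 10 + b
mem.load_parallel(addr=0)          # parallel load: one element from every bank
mem.store_serial(bank=0, base=4)   # serial store: all of it lands in bank 0
print(mem.banks[0])                # [10, 0, 0, 0, 10, 11, 12, 13]
```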
-
Publication number: 20160134537
Abstract: Embodiments of the present invention are directed to a wildcard matching solution that uses a combination of static random access memories (SRAMs) and ternary content addressable memories (TCAMs) in a hybrid solution. In particular, the wildcard matching solution uses a plurality of SRAM pools for lookup and a spillover TCAM pool for unresolved hash conflicts.
Type: Application
Filed: November 4, 2015
Publication date: May 12, 2016
Inventors: Jeffrey T. Huynh, Weihuang Wang, Tsahi Daniel, Srinath Atluri, Mohan Balan
-
Publication number: 20100202292
Abstract: A processing engine to accomplish a multiplicity of tasks has a multiplicity of processing tribes, each tribe comprising a multiplicity of context register sets and a multiplicity of processing resources for concurrent processing of a multiplicity of threads to accomplish the tasks, a memory structure having a multiplicity of memory blocks, each block storing data for processing threads, and an interconnect structure and control system enabling tribe-to-tribe migration of contexts to move threads from tribe to tribe. The processing engine is characterized in that individual ones of the tribes have preferential access to individual ones of the multiplicity of memory blocks.
Type: Application
Filed: January 29, 2010
Publication date: August 12, 2010
Inventors: Mario D. Nemirovsky, Enrique Musoll, Jeffrey T. Huynh, Stephen W. Melvin
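Very loosely, the locality idea might be modeled as in the sketch below: a thread's context migrates to the tribe whose preferred memory block holds the data it is about to touch, so its accesses stay local. This is only a schematic illustration; the names and the migration policy are assumptions, not details of the described engine.

```python
class Tribe:
    """Toy model of one processing tribe: it holds thread contexts and has
    preferential (local) access to exactly one memory block."""

    def __init__(self, tribe_id):
        self.tribe_id = tribe_id      # also the index of its preferred block
        self.contexts = []            # resident thread contexts

def migrate_for_locality(tribes, context, target_block):
    """Move a thread's context to the tribe whose preferred memory block is
    the one the thread is about to work on."""
    for tribe in tribes:
        if context in tribe.contexts:
            tribe.contexts.remove(context)
    tribes[target_block].contexts.append(context)
    return tribes[target_block]

tribes = [Tribe(i) for i in range(4)]
tribes[0].contexts.append("thread-7")           # thread starts in tribe 0
home = migrate_for_locality(tribes, "thread-7", target_block=2)
print(home.tribe_id, home.contexts)             # 2 ['thread-7']
```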
-
Publication number: 20100205608
Abstract: A mechanism is disclosed for implementing resource locking in a massively multi-threaded environment. The mechanism receives from a stream a request to obtain a lock on a resource. In response, the mechanism determines whether the resource is currently locked. If so, the mechanism adds the stream to a wait list. At some point, based upon the wait list, the mechanism determines that it is the stream's turn to lock the resource; thus, the mechanism grants the stream a lock. In this manner, the mechanism enables the stream to reserve and to obtain a lock on the resource. By implementing locking in this way, a stream is able to submit only one lock request. When it is its turn to obtain a lock, the stream is granted that lock. This lock reservation methodology makes it possible to implement resource locking efficiently in a massively multi-threaded environment.
Type: Application
Filed: February 2, 2010
Publication date: August 12, 2010
Inventors: Mario D. Nemirovsky, Jeffrey T. Huynh
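A minimal sketch of the lock-reservation scheme described here follows: a stream requests once, is queued on a wait list if the resource is busy, and is granted the lock automatically when its turn comes. The class and method names are invented for the example.

```python
from collections import deque

class ReservationLock:
    """Toy model of the lock-reservation scheme: a stream asks once; if the
    resource is busy it is queued, and on release the next queued stream is
    granted the lock without having to re-request it."""

    def __init__(self):
        self.owner = None
        self.wait_list = deque()
        self.granted = []            # grant notifications, in order

    def request(self, stream):
        if self.owner is None:
            self.owner = stream
            self.granted.append(stream)
        else:
            self.wait_list.append(stream)   # reservation recorded

    def release(self, stream):
        assert stream == self.owner, "only the owner may release"
        self.owner = self.wait_list.popleft() if self.wait_list else None
        if self.owner is not None:
            self.granted.append(self.owner)  # its turn has come

lock = ReservationLock()
for s in ("stream-a", "stream-b", "stream-c"):
    lock.request(s)                 # one request per stream, no polling
lock.release("stream-a")
lock.release("stream-b")
print(lock.granted)                 # ['stream-a', 'stream-b', 'stream-c']
```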