Patents by Inventor Joel Emer
Joel Emer has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 11966835
Abstract: A sparse convolutional neural network accelerator system that dynamically and efficiently identifies fine-grained parallelism in sparse convolution operations. The system determines matching pairs of non-zero input activations and weights from the compacted input activation and weight arrays utilizing a scalable, dynamic parallelism discovery unit (PDU) that performs a parallel search on the input activation array and the weight array to identify reducible input activation and weight pairs.
Type: Grant
Filed: January 23, 2019
Date of Patent: April 23, 2024
Assignee: NVIDIA Corp.
Inventors: Ching-En Lee, Yakun Shao, Angshuman Parashar, Joel Emer, Stephen W. Keckler
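The matching step the abstract describes can be illustrated with a short sketch (not the patented hardware, and all names here are hypothetical): pair up non-zero activations and weights that share an input-channel coordinate, the kind of "reducible pair" a parallelism discovery unit would search for in the compacted arrays.

```python
# Illustrative sketch: find multiplyable (activation, weight) pairs from
# compacted sparse arrays by matching on a shared channel coordinate.

def find_reducible_pairs(activations, weights):
    """activations/weights: lists of (channel, value) for non-zero entries."""
    by_channel = {}
    for c, v in weights:
        by_channel.setdefault(c, []).append(v)
    pairs = []
    for c, a in activations:
        for w in by_channel.get(c, []):
            pairs.append((a, w))  # each pair feeds one multiplier
    return pairs

acts = [(0, 1.5), (2, -2.0)]           # compacted: channels 0 and 2 non-zero
wts  = [(0, 0.5), (1, 3.0), (2, 4.0)]  # channels 0, 1, 2 non-zero
print(find_reducible_pairs(acts, wts))  # → [(1.5, 0.5), (-2.0, 4.0)]
```

A hardware PDU would perform this search in parallel across many entries at once; the sequential loop above only shows which pairs qualify.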
-
Publication number: 20220076110
Abstract: A distributed deep neural net (DNN) utilizing a distributed, tile-based architecture includes multiple chips, each with a central processing element, a global memory buffer, and a plurality of additional processing elements. Each additional processing element includes a weight buffer, an activation buffer, and vector multiply-accumulate units to combine, in parallel, the weight values and the activation values using stationary data flows.
Type: Application
Filed: November 19, 2021
Publication date: March 10, 2022
Applicant: NVIDIA Corp.
Inventors: Yakun Shao, Rangharajan Venkatesan, Miaorong Wang, Daniel Smith, William James Dally, Joel Emer, Stephen W. Keckler, Brucek Khailany
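The stationary data flow the abstract mentions can be sketched as follows (a hedged illustration; the function and variable names are invented, not from the patent): each processing element keeps a weight vector resident in its weight buffer while activation vectors stream past, accumulating into a running sum.

```python
# Sketch of a weight-stationary vector multiply-accumulate: the weights
# are loaded once and reused against every activation vector that arrives.

def pe_vector_mac(stationary_weights, activation_stream):
    acc = 0.0
    for activations in activation_stream:      # activations move; weights stay
        acc += sum(w * a for w, a in zip(stationary_weights, activations))
    return acc

weights = [1.0, 2.0, 3.0]              # loaded once into the PE's weight buffer
stream = [[1, 1, 1], [0, 1, 0]]        # two activation vectors streamed in
print(pe_vector_mac(weights, stream))  # → 8.0
```

Keeping the weights stationary is what lets the hardware amortize one weight-buffer read across many multiply-accumulates, which is the point of the data flow described above.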
-
Patent number: 11270197
Abstract: A distributed deep neural net (DNN) utilizing a distributed, tile-based architecture includes multiple chips, each with a central processing element, a global memory buffer, and a plurality of additional processing elements. Each additional processing element includes a weight buffer, an activation buffer, and vector multiply-accumulate units to combine, in parallel, the weight values and the activation values using stationary data flows.
Type: Grant
Filed: November 4, 2019
Date of Patent: March 8, 2022
Assignee: NVIDIA Corp.
Inventors: Yakun Shao, Rangharajan Venkatesan, Miaorong Wang, Daniel Smith, William James Dally, Joel Emer, Stephen W. Keckler, Brucek Khailany
-
Publication number: 20200293867
Abstract: A distributed deep neural net (DNN) utilizing a distributed, tile-based architecture includes multiple chips, each with a central processing element, a global memory buffer, and a plurality of additional processing elements. Each additional processing element includes a weight buffer, an activation buffer, and vector multiply-accumulate units to combine, in parallel, the weight values and the activation values using stationary data flows.
Type: Application
Filed: November 4, 2019
Publication date: September 17, 2020
Applicant: NVIDIA Corp.
Inventors: Yakun Shao, Rangharajan Venkatesan, Miaorong Wang, Daniel Smith, William James Dally, Joel Emer, Stephen W. Keckler, Brucek Khailany
-
Publication number: 20190370645
Abstract: A sparse convolutional neural network accelerator system that dynamically and efficiently identifies fine-grained parallelism in sparse convolution operations. The system determines matching pairs of non-zero input activations and weights from the compacted input activation and weight arrays utilizing a scalable, dynamic parallelism discovery unit (PDU) that performs a parallel search on the input activation array and the weight array to identify reducible input activation and weight pairs.
Type: Application
Filed: January 23, 2019
Publication date: December 5, 2019
Inventors: Ching-En Lee, Yakun Shao, Angshuman Parashar, Joel Emer, Stephen W. Keckler
-
Patent number: 9740617
Abstract: Methods and apparatuses to control cache line coherence are described. A hardware processor may include a first processor core with a cache to store a cache line, a second set of processor cores that each include a cache to store a copy of the cache line, and cache coherence logic to aggregate in a tag directory an acknowledgment message from each of the second set of processor cores in response to a request from the first processor core to modify the copy of the cache line in each of the second set of processor cores and send a consolidated acknowledgment message to the first processor core.
Type: Grant
Filed: December 23, 2014
Date of Patent: August 22, 2017
Assignee: Intel Corporation
Inventors: Samantika Sury, Simon Steely, Jr., William Hasenplaugh, Joel Emer, David Webb
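A small model can illustrate the aggregation described above (the class and method names are hypothetical, purely for illustration): a tag directory collects one invalidation acknowledgment per sharing core and forwards a single consolidated acknowledgment to the requesting core, rather than having every sharer message the requester directly.

```python
# Illustrative model: the directory tracks which sharers still owe an ack;
# the requester hears nothing until the last ack arrives.

class TagDirectory:
    def __init__(self, sharers):
        self.pending = set(sharers)    # cores that must still acknowledge

    def receive_ack(self, core_id):
        self.pending.discard(core_id)
        # True → all sharers acked, send one consolidated ack to the requester
        return len(self.pending) == 0

directory = TagDirectory(sharers={1, 2, 3})
print(directory.receive_ack(1))  # → False
print(directory.receive_ack(2))  # → False
print(directory.receive_ack(3))  # → True (consolidated ack goes out)
```

Consolidating at the directory reduces the acknowledgment traffic the requesting core must absorb from O(sharers) messages to one.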
-
Publication number: 20160179674
Abstract: Methods and apparatuses to control cache line coherence are described. A hardware processor may include a first processor core with a cache to store a cache line, a second set of processor cores that each include a cache to store a copy of the cache line, and cache coherence logic to aggregate in a tag directory an acknowledgment message from each of the second set of processor cores in response to a request from the first processor core to modify the copy of the cache line in each of the second set of processor cores and send a consolidated acknowledgment message to the first processor core.
Type: Application
Filed: December 23, 2014
Publication date: June 23, 2016
Inventors: Samantika Sury, Simon Steely, Jr., William Hasenplaugh, Joel Emer, David Webb
-
Patent number: 9286128
Abstract: A processor is described having an out-of-order core to execute a first thread and a non-out-of-order core to execute a second thread. The processor also includes statistics collection circuitry to support calculation of the following: the first thread's performance on the out-of-order core; an estimate of the first thread's performance on the non-out-of-order core; the second thread's performance on the non-out-of-order core; an estimate of the second thread's performance on the out-of-order core.Type: Grant
Filed: March 15, 2013
Date of Patent: March 15, 2016
Assignee: Intel Corporation
Inventors: Aamer Jaleel, Kenzo Van Craeynest, Paolo Narvaez, Joel Emer
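One use such statistics enable is a thread-swap decision, sketched below under stated assumptions: the IPC figures and the swap policy are invented for illustration and are not claimed by the patent. Compare each thread's measured performance on its current core with its estimated performance on the other core type, and swap when the estimated total improves.

```python
# Hypothetical scheduling decision built on measured + estimated IPC.

def should_swap(ipc_a_on_ooo, est_ipc_a_on_inorder,
                ipc_b_on_inorder, est_ipc_b_on_ooo):
    current = ipc_a_on_ooo + ipc_b_on_inorder      # today's placement
    swapped = est_ipc_a_on_inorder + est_ipc_b_on_ooo  # estimated alternative
    return swapped > current

# Thread A barely benefits from the out-of-order core; thread B would.
print(should_swap(1.1, 1.0, 0.6, 1.5))  # → True
```

The cross-core estimates are what make this comparison possible without actually migrating the threads to find out.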
-
Publication number: 20140282565
Abstract: A processor is described having an out-of-order core to execute a first thread and a non-out-of-order core to execute a second thread. The processor also includes statistics collection circuitry to support calculation of the following: the first thread's performance on the out-of-order core; an estimate of the first thread's performance on the non-out-of-order core; the second thread's performance on the non-out-of-order core; an estimate of the second thread's performance on the out-of-order core.
Type: Application
Filed: March 15, 2013
Publication date: September 18, 2014
Inventors: Aamer Jaleel, Kenzo Van Craeynest, Paolo Narvaez, Joel Emer
-
Publication number: 20140201506
Abstract: A processing engine includes separate hardware components for control processing and data processing. The instruction execution order in such a processing engine may be efficiently determined in a control processing engine based on inputs received by the control processing engine. For each instruction of a data processing engine: a status of the instruction may be set to "ready" based on a trigger for the instruction and the input received in the control processing engine; and execution of the instruction in the data processing engine may be enabled if the status of the instruction is set to "ready" and at least one processing element of the data processing engine is available. The trigger for each instruction may be a function of one or more predicate registers of the control processing engine, FIFO status signals, or information regarding tags.
Type: Application
Filed: December 30, 2011
Publication date: July 17, 2014
Inventors: Angshuman Parashar, Michael Pellauer, Michael Adler, Joel Emer
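The trigger-based dispatch described above can be sketched minimally (the instruction set and state fields here are invented for illustration): each data-processing instruction carries a trigger predicate; the control engine marks an instruction "ready" when its trigger holds over the current predicate/FIFO state, and dispatches it only while a processing element is free.

```python
# Illustrative triggered-instruction scheduler: a trigger is a function of
# control state (predicate registers, FIFO status); ready instructions are
# dispatched up to the number of available processing elements.

def schedule(instructions, state, free_pes):
    """instructions: list of (name, trigger); trigger: state dict -> bool."""
    ready = [name for name, trigger in instructions if trigger(state)]
    return ready[:free_pes]

state = {"p0": True, "in_fifo_nonempty": True}
insts = [
    ("load",  lambda s: s["in_fifo_nonempty"]),
    ("store", lambda s: s["p0"] and not s["in_fifo_nonempty"]),
]
print(schedule(insts, state, free_pes=1))  # → ['load']
```

Because readiness is recomputed from control state rather than from a program counter, execution order emerges from the triggers instead of being fixed in advance.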
-
Patent number: 8769201
Abstract: A technique to enable resource allocation optimization within a computer system. In one embodiment, a gradient partition algorithm (GPA) module is used to continually measure performance and adjust allocation to shared resources among a plurality of data classes in order to achieve optimal performance.
Type: Grant
Filed: December 2, 2008
Date of Patent: July 1, 2014
Assignee: Intel Corporation
Inventors: William Hasenplaugh, Joel Emer, Tryggve Fossum, Aamer Jaleel, Simon Steely
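A gradient-style allocator in the spirit of the GPA module above can be sketched as follows; this is a hedged illustration, and the toy performance model stands in for the real measurement hardware. Nudge the share given to one data class, keep the change if measured performance improves, and otherwise revert.

```python
# Illustrative gradient step over a two-class resource partition: trial a
# perturbed allocation, keep it only if the measured score improves.

def gpa_step(shares, cls, delta, measure):
    trial = dict(shares)
    trial[cls] += delta
    other = next(k for k in trial if k != cls)   # assumes exactly two classes
    trial[other] -= delta                        # shares trade off directly
    return trial if measure(trial) > measure(shares) else shares

# toy "performance" that is best when class "a" holds ~70% of the resource
measure = lambda s: -abs(s["a"] - 0.7)
shares = {"a": 0.5, "b": 0.5}
shares = gpa_step(shares, "a", 0.1, measure)
print(shares)
```

Iterating such steps walks the allocation along the measured performance gradient, which is the "continually measure and adjust" loop the abstract describes.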
-
Patent number: 8707012
Abstract: In one embodiment, the present invention includes an apparatus having a register file to store vector data, an address generator coupled to the register file to generate addresses for a vector memory operation, and a controller to generate an output slice from one or more slices each including multiple addresses, where the output slice includes addresses each corresponding to a separately addressable portion of a memory. Other embodiments are described and claimed.
Type: Grant
Filed: October 12, 2012
Date of Patent: April 22, 2014
Assignee: Intel Corporation
Inventors: Roger Espasa, Joel Emer, Geoff Lowney, Roger Gramunt, Santiago Galan, Toni Juan, Jesus Corbal, Federico Ardanaz, Isaac Hernandez
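The slicing idea above can be illustrated with a short sketch (the bank-mapping convention via low-order address bits is an assumption, not taken from the patent): from a vector of element addresses, build an output slice containing at most one address per separately addressable memory bank, so every address in the slice can be issued in the same cycle; conflicting addresses wait for a later slice.

```python
# Illustrative conflict-free slice builder for a banked memory.

def make_output_slice(addresses, num_banks):
    slice_out, leftover, used = [], [], set()
    for addr in addresses:
        bank = addr % num_banks          # assumed bank-interleaving rule
        if bank in used:
            leftover.append(addr)        # bank conflict → defer to next slice
        else:
            used.add(bank)
            slice_out.append(addr)
    return slice_out, leftover

addrs = [0, 4, 5, 9]                          # banks 0, 0, 1, 1 with 4 banks
print(make_output_slice(addrs, num_banks=4))  # → ([0, 5], [4, 9])
```

Repeating the call on the leftover list yields the sequence of slices a controller would emit for one gather or scatter operation.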
-
Publication number: 20130036268
Abstract: In one embodiment, the present invention includes an apparatus having a register file to store vector data, an address generator coupled to the register file to generate addresses for a vector memory operation, and a controller to generate an output slice from one or more slices each including multiple addresses, where the output slice includes addresses each corresponding to a separately addressable portion of a memory. Other embodiments are described and claimed.
Type: Application
Filed: October 12, 2012
Publication date: February 7, 2013
Inventors: Roger Espasa, Joel Emer, Geoff Lowney, Roger Gramunt, Santiago Galan, Toni Juan, Jesus Corbal, Federico Ardanaz, Isaac Hernandez
-
Patent number: 8316216
Abstract: In one embodiment, the present invention includes an apparatus having a register file to store vector data, an address generator coupled to the register file to generate addresses for a vector memory operation, and a controller to generate an output slice from one or more slices each including multiple addresses, where the output slice includes addresses each corresponding to a separately addressable portion of a memory. Other embodiments are described and claimed.
Type: Grant
Filed: October 21, 2009
Date of Patent: November 20, 2012
Assignee: Intel Corporation
Inventors: Roger Espasa, Joel Emer, Geoff Lowney, Roger Gramunt, Santiago Galan, Toni Juan, Jesus Corbal, Federico Ardanaz, Isaac Hernandez
-
Publication number: 20100138609
Abstract: A technique to enable resource allocation optimization within a computer system. In one embodiment, a gradient partition algorithm (GPA) module is used to continually measure performance and adjust allocation to shared resources among a plurality of data classes in order to achieve optimal performance.
Type: Application
Filed: December 2, 2008
Publication date: June 3, 2010
Inventors: William Hasenplaugh, Joel Emer, Tryggve Fossum, Aamer Jaleel, Simon Steely
-
Publication number: 20100042779
Abstract: In one embodiment, the present invention includes an apparatus having a register file to store vector data, an address generator coupled to the register file to generate addresses for a vector memory operation, and a controller to generate an output slice from one or more slices each including multiple addresses, where the output slice includes addresses each corresponding to a separately addressable portion of a memory. Other embodiments are described and claimed.
Type: Application
Filed: October 21, 2009
Publication date: February 18, 2010
Inventors: Roger Espasa, Joel Emer, Geoff Lowney, Roger Gramunt, Santiago Galan, Toni Juan, Jesus Corbal, Federico Ardanaz, Isaac Hernandez
-
Patent number: 7627735
Abstract: In one embodiment, the present invention includes an apparatus having a register file to store vector data, an address generator coupled to the register file to generate addresses for a vector memory operation, and a controller to generate an output slice from one or more slices each including multiple addresses, where the output slice includes addresses each corresponding to a separately addressable portion of a memory. Other embodiments are described and claimed.
Type: Grant
Filed: October 21, 2005
Date of Patent: December 1, 2009
Assignee: Intel Corporation
Inventors: Roger Espasa, Joel Emer, Geoff Lowney, Roger Gramunt, Santiago Galan, Toni Juan, Jesus Corbal, Federico Ardanaz, Isaac Hernandez
-
Patent number: 7558920
Abstract: A method and apparatus for partitioning a shared cache of a chip multi-processor are described. In one embodiment, the method includes a request of a cache block from system memory if a cache miss within a shared cache is detected according to a received request from a processor. Once the cache block is requested, a victim block within the shared cache is selected according to a processor identifier and a request type of the received request. In one embodiment, selection of the victim block according to a processor identifier and request type is based on a partition of a set-associative, shared cache to limit the selection of the victim block from a subset of available cache ways according to the cache partition. Other embodiments are described and claimed.
Type: Grant
Filed: June 30, 2004
Date of Patent: July 7, 2009
Assignee: Intel Corporation
Inventors: Matthew Mattina, Antonio Juan-Hormigo, Joel Emer, Ramon Matas-Navarro
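The way-partitioned victim selection described above can be sketched as follows; the partition table and the LRU stand-in are illustrative assumptions. Each (processor identifier, request type) pair maps to a subset of the ways in a set, and the victim is chosen only from that subset.

```python
# Illustrative victim selection restricted to a requester's cache partition.

def select_victim(partition, proc_id, req_type, lru_order):
    allowed = partition[(proc_id, req_type)]   # ways this requester may evict
    # evict the least-recently-used way among the allowed subset
    return next(way for way in lru_order if way in allowed)

partition = {(0, "read"): {0, 1}, (1, "read"): {2, 3}}  # 4-way set, split 2/2
lru_order = [3, 0, 2, 1]                       # least-recently-used first
print(select_victim(partition, 0, "read", lru_order))  # → 0
print(select_victim(partition, 1, "read", lru_order))  # → 3
```

Restricting eviction to a per-processor subset of ways is what keeps one core's misses from flushing another core's working set out of the shared cache.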
-
Publication number: 20070094477
Abstract: In one embodiment, the present invention includes an apparatus having a register file to store vector data, an address generator coupled to the register file to generate addresses for a vector memory operation, and a controller to generate an output slice from one or more slices each including multiple addresses, where the output slice includes addresses each corresponding to a separately addressable portion of a memory. Other embodiments are described and claimed.
Type: Application
Filed: October 21, 2005
Publication date: April 26, 2007
Inventors: Roger Espasa, Joel Emer, Geoff Lowney, Roger Gramunt, Santiago Galan, Toni Juan, Jesus Corbal, Federico Ardanaz, Isaac Hernandez
-
Publication number: 20070022348
Abstract: Embodiments of apparatuses and methods for reducing the uncorrectable error rate in a lockstepped dual-modular redundancy system are disclosed. In one embodiment, an apparatus includes two processor cores, a micro-checker, a global checker, and fault logic. The micro-checker is to detect whether a value from a structure in one core matches a value from the corresponding structure in the other core. The global checker is to detect lockstep failures between the two cores. The fault logic is to cause the two cores to be resynchronized if there is a lockstep error but the micro-checker has detected a mismatch.
Type: Application
Filed: June 30, 2005
Publication date: January 25, 2007
Inventors: Paul Racunas, Joel Emer, Arijit Biswas, Shubhendu Mukherjee, Steven Raasch
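The recovery decision described above reduces to a small rule, sketched here with illustrative names: on a lockstep failure, consult the micro-checker; if it already flagged a structure-level mismatch, the divergence has an identified cause and the cores can be resynchronized instead of the system declaring an uncorrectable error.

```python
# Illustrative fault-logic decision for a lockstepped dual-modular
# redundancy pair, following the rule in the abstract.

def handle_lockstep_failure(micro_checker_mismatch):
    if micro_checker_mismatch:
        return "resynchronize"        # fault was localized by the micro-checker
    return "uncorrectable_error"      # divergence with no identified cause

print(handle_lockstep_failure(True))   # → resynchronize
print(handle_lockstep_failure(False))  # → uncorrectable_error
```

The micro-checker thus converts a class of would-be uncorrectable lockstep errors into recoverable ones, which is the error-rate reduction the abstract claims.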