Patents by Inventor Joel Emer

Joel Emer has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 11966835
    Abstract: A sparse convolutional neural network accelerator system that dynamically and efficiently identifies fine-grained parallelism in sparse convolution operations. The system determines matching pairs of non-zero input activations and weights from the compacted input activation and weight arrays utilizing a scalable, dynamic parallelism discovery unit (PDU) that performs a parallel search on the input activation array and the weight array to identify reducible input activation and weight pairs.
    Type: Grant
    Filed: January 23, 2019
    Date of Patent: April 23, 2024
    Assignee: NVIDIA Corp.
    Inventors: Ching-En Lee, Yakun Shao, Angshuman Parashar, Joel Emer, Stephen W. Keckler
  • Publication number: 20220076110
    Abstract: A distributed deep neural net (DNN) utilizing a distributed, tile-based architecture includes multiple chips, each with a central processing element, a global memory buffer, and a plurality of additional processing elements. Each additional processing element includes a weight buffer, an activation buffer, and vector multiply-accumulate units to combine, in parallel, the weight values and the activation values using stationary data flows.
    Type: Application
    Filed: November 19, 2021
    Publication date: March 10, 2022
    Applicant: NVIDIA Corp.
    Inventors: Yakun Shao, Rangharajan Venkatesan, Miaorong Wang, Daniel Smith, William James Dally, Joel Emer, Stephen W. Keckler, Brucek Khailany
  • Patent number: 11270197
    Abstract: A distributed deep neural net (DNN) utilizing a distributed, tile-based architecture includes multiple chips, each with a central processing element, a global memory buffer, and a plurality of additional processing elements. Each additional processing element includes a weight buffer, an activation buffer, and vector multiply-accumulate units to combine, in parallel, the weight values and the activation values using stationary data flows.
    Type: Grant
    Filed: November 4, 2019
    Date of Patent: March 8, 2022
    Assignee: NVIDIA Corp.
    Inventors: Yakun Shao, Rangharajan Venkatesan, Miaorong Wang, Daniel Smith, William James Dally, Joel Emer, Stephen W. Keckler, Brucek Khailany
  • Publication number: 20200293867
    Abstract: A distributed deep neural net (DNN) utilizing a distributed, tile-based architecture includes multiple chips, each with a central processing element, a global memory buffer, and a plurality of additional processing elements. Each additional processing element includes a weight buffer, an activation buffer, and vector multiply-accumulate units to combine, in parallel, the weight values and the activation values using stationary data flows.
    Type: Application
    Filed: November 4, 2019
    Publication date: September 17, 2020
    Applicant: NVIDIA Corp.
    Inventors: Yakun Shao, Rangharajan Venkatesan, Miaorong Wang, Daniel Smith, William James Dally, Joel Emer, Stephen W. Keckler, Brucek Khailany
  • Publication number: 20190370645
    Abstract: A sparse convolutional neural network accelerator system that dynamically and efficiently identifies fine-grained parallelism in sparse convolution operations. The system determines matching pairs of non-zero input activations and weights from the compacted input activation and weight arrays utilizing a scalable, dynamic parallelism discovery unit (PDU) that performs a parallel search on the input activation array and the weight array to identify reducible input activation and weight pairs.
    Type: Application
    Filed: January 23, 2019
    Publication date: December 5, 2019
    Inventors: Ching-En Lee, Yakun Shao, Angshuman Parashar, Joel Emer, Stephen W. Keckler
  • Patent number: 9740617
    Abstract: Methods and apparatuses to control cache line coherence are described. A hardware processor may include a first processor core with a cache to store a cache line, a second set of processor cores that each include a cache to store a copy of the cache line, and cache coherence logic to aggregate in a tag directory an acknowledgment message from each of the second set of processor cores in response to a request from the first processor core to modify the copy of the cache line in each of the second set of processor cores and send a consolidated acknowledgment message to the first processor core.
    Type: Grant
    Filed: December 23, 2014
    Date of Patent: August 22, 2017
    Assignee: Intel Corporation
    Inventors: Samantika Sury, Simon Steely, Jr., William Hasenplaugh, Joel Emer, David Webb
  • Publication number: 20160179674
    Abstract: Methods and apparatuses to control cache line coherence are described. A hardware processor may include a first processor core with a cache to store a cache line, a second set of processor cores that each include a cache to store a copy of the cache line, and cache coherence logic to aggregate in a tag directory an acknowledgment message from each of the second set of processor cores in response to a request from the first processor core to modify the copy of the cache line in each of the second set of processor cores and send a consolidated acknowledgment message to the first processor core.
    Type: Application
    Filed: December 23, 2014
    Publication date: June 23, 2016
    Inventors: Samantika Sury, Simon Steely, Jr., William Hasenplaugh, Joel Emer, David Webb
  • Patent number: 9286128
    Abstract: A processor is described having an out-of-order core to execute a first thread and a non-out-of-order core to execute a second thread. The processor also includes statistics collection circuitry to support calculation of the following: the first thread's performance on the out-of-order core; an estimate of the first thread's performance on the non-out-of-order core; the second thread's performance on the non-out-of-order core; an estimate of the second thread's performance on the out-of-order core.
    Type: Grant
    Filed: March 15, 2013
    Date of Patent: March 15, 2016
    Assignee: Intel Corporation
    Inventors: Aamer Jaleel, Kenzo Van Craeynest, Paolo Narvaez, Joel Emer
  • Publication number: 20140282565
    Abstract: A processor is described having an out-of-order core to execute a first thread and a non-out-of-order core to execute a second thread. The processor also includes statistics collection circuitry to support calculation of the following: the first thread's performance on the out-of-order core; an estimate of the first thread's performance on the non-out-of-order core; the second thread's performance on the non-out-of-order core; an estimate of the second thread's performance on the out-of-order core.
    Type: Application
    Filed: March 15, 2013
    Publication date: September 18, 2014
    Inventors: Aamer Jaleel, Kenzo Van Craeynest, Paolo Narvaez, Joel Emer
  • Publication number: 20140201506
    Abstract: A processing engine includes separate hardware components for control processing and data processing. The instruction execution order in such a processing engine may be efficiently determined in a control processing engine based on inputs received by the control processing engine. For each instruction of a data processing engine: a status of the instruction may be set to “ready” based on a trigger for the instruction and the input received in the control processing engine; and execution of the instruction in the data processing engine may be enabled if the status of the instruction is set to “ready” and at least one processing element of the data processing engine is available. The trigger for each instruction may be a function of one or more predicate registers of the control processing engine, FIFO status signals or information regarding tags.
    Type: Application
    Filed: December 30, 2011
    Publication date: July 17, 2014
    Inventors: Angshuman Parashar, Michael Pellauer, Michael Adler, Joel Emer
  • Patent number: 8769201
    Abstract: A technique to enable resource allocation optimization within a computer system. In one embodiment, a gradient partition algorithm (GPA) module is used to continually measure performance and adjust allocation to shared resources among a plurality of data classes in order to achieve optimal performance.
    Type: Grant
    Filed: December 2, 2008
    Date of Patent: July 1, 2014
    Assignee: Intel Corporation
    Inventors: William Hasenplaugh, Joel Emer, Tryggve Fossum, Aamer Jaleel, Simon Steely
  • Patent number: 8707012
    Abstract: In one embodiment, the present invention includes an apparatus having a register file to store vector data, an address generator coupled to the register file to generate addresses for a vector memory operation, and a controller to generate an output slice from one or more slices each including multiple addresses, where the output slice includes addresses each corresponding to a separately addressable portion of a memory. Other embodiments are described and claimed.
    Type: Grant
    Filed: October 12, 2012
    Date of Patent: April 22, 2014
    Assignee: Intel Corporation
    Inventors: Roger Espasa, Joel Emer, Geoff Lowney, Roger Gramunt, Santiago Galan, Toni Juan, Jesus Corbal, Federico Ardanaz, Isaac Hernandez
  • Publication number: 20130036268
    Abstract: In one embodiment, the present invention includes an apparatus having a register file to store vector data, an address generator coupled to the register file to generate addresses for a vector memory operation, and a controller to generate an output slice from one or more slices each including multiple addresses, where the output slice includes addresses each corresponding to a separately addressable portion of a memory. Other embodiments are described and claimed.
    Type: Application
    Filed: October 12, 2012
    Publication date: February 7, 2013
    Inventors: Roger Espasa, Joel Emer, Geoff Lowney, Roger Gramunt, Santiago Galan, Toni Juan, Jesus Corbal, Federico Ardanaz, Isaac Hernandez
  • Patent number: 8316216
    Abstract: In one embodiment, the present invention includes an apparatus having a register file to store vector data, an address generator coupled to the register file to generate addresses for a vector memory operation, and a controller to generate an output slice from one or more slices each including multiple addresses, where the output slice includes addresses each corresponding to a separately addressable portion of a memory. Other embodiments are described and claimed.
    Type: Grant
    Filed: October 21, 2009
    Date of Patent: November 20, 2012
    Assignee: Intel Corporation
    Inventors: Roger Espasa, Joel Emer, Geoff Lowney, Roger Gramunt, Santiago Galan, Toni Juan, Jesus Corbal, Federico Ardanaz, Isaac Hernandez
  • Publication number: 20100138609
    Abstract: A technique to enable resource allocation optimization within a computer system. In one embodiment, a gradient partition algorithm (GPA) module is used to continually measure performance and adjust allocation to shared resources among a plurality of data classes in order to achieve optimal performance.
    Type: Application
    Filed: December 2, 2008
    Publication date: June 3, 2010
    Inventors: William Hasenplaugh, Joel Emer, Tryggve Fossum, Aamer Jaleel, Simon Steely
  • Publication number: 20100042779
    Abstract: In one embodiment, the present invention includes an apparatus having a register file to store vector data, an address generator coupled to the register file to generate addresses for a vector memory operation, and a controller to generate an output slice from one or more slices each including multiple addresses, where the output slice includes addresses each corresponding to a separately addressable portion of a memory. Other embodiments are described and claimed.
    Type: Application
    Filed: October 21, 2009
    Publication date: February 18, 2010
    Inventors: Roger Espasa, Joel Emer, Geoff Lowney, Roger Gramunt, Santiago Galan, Toni Juan, Jesus Corbal, Federico Ardanaz, Isaac Hernandez
  • Patent number: 7627735
    Abstract: In one embodiment, the present invention includes an apparatus having a register file to store vector data, an address generator coupled to the register file to generate addresses for a vector memory operation, and a controller to generate an output slice from one or more slices each including multiple addresses, where the output slice includes addresses each corresponding to a separately addressable portion of a memory. Other embodiments are described and claimed.
    Type: Grant
    Filed: October 21, 2005
    Date of Patent: December 1, 2009
    Assignee: Intel Corporation
    Inventors: Roger Espasa, Joel Emer, Geoff Lowney, Roger Gramunt, Santiago Galan, Toni Juan, Jesus Corbal, Federico Ardanaz, Isaac Hernandez
  • Patent number: 7558920
    Abstract: A method and apparatus for partitioning a shared cache of a chip multi-processor are described. In one embodiment, the method includes a request of a cache block from system memory if a cache miss within a shared cache is detected according to a received request from a processor. Once the cache block is requested, a victim block within the shared cache is selected according to a processor identifier and a request type of the received request. In one embodiment, selection of the victim block according to a processor identifier and request type is based on a partition of a set-associative, shared cache to limit the selection of the victim block from a subset of available cache ways according to the cache partition. Other embodiments are described and claimed.
    Type: Grant
    Filed: June 30, 2004
    Date of Patent: July 7, 2009
    Assignee: Intel Corporation
    Inventors: Matthew Mattina, Antonio Juan-Hormigo, Joel Emer, Ramon Matas-Navarro
  • Publication number: 20070094477
    Abstract: In one embodiment, the present invention includes an apparatus having a register file to store vector data, an address generator coupled to the register file to generate addresses for a vector memory operation, and a controller to generate an output slice from one or more slices each including multiple addresses, where the output slice includes addresses each corresponding to a separately addressable portion of a memory. Other embodiments are described and claimed.
    Type: Application
    Filed: October 21, 2005
    Publication date: April 26, 2007
    Inventors: Roger Espasa, Joel Emer, Geoff Lowney, Roger Gramunt, Santiago Galan, Toni Juan, Jesus Corbal, Federico Ardanaz, Isaac Hernandez
  • Publication number: 20070022348
    Abstract: Embodiments of apparatuses and methods for reducing the uncorrectable error rate in a lockstepped dual-modular redundancy system are disclosed. In one embodiment, an apparatus includes two processor cores, a micro-checker, a global checker, and fault logic. The micro-checker is to detect whether a value from a structure in one core matches a value from the corresponding structure in the other core. The global checker is to detect lockstep failures between the two cores. The fault logic is to cause the two cores to be resynchronized if there is a lockstep error but the micro-checker has detected a mismatch.
    Type: Application
    Filed: June 30, 2005
    Publication date: January 25, 2007
    Inventors: Paul Racunas, Joel Emer, Arijit Biswas, Shubhendu Mukherjee, Steven Raasch