Patents by Inventor Choong Ng

Choong Ng has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20240046088
    Abstract: A machine learning hardware accelerator architecture and associated techniques are disclosed. The architecture features multiple memory banks of very wide SRAM that may be concurrently accessed by a large number of parallel operational units. Each operational unit supports an instruction set specific to machine learning, including optimizations for performing tensor operations and convolutions. Also disclosed are optimized addressing, an optimized shift reader, and variations on a multicast network that permutes and copies data and associates it with an operational unit to support those operations.
    Type: Application
    Filed: October 16, 2023
    Publication date: February 8, 2024
    Inventors: Jeremy Bruestle, Choong Ng
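The central idea of this abstract, multiple wide SRAM banks read in the same cycle by parallel operational units, can be sketched in a few lines. This is an illustrative model only, not code from the patent; the bank layout, sizes, and function names are made up.

```python
# Hypothetical model of parallel operational units each reading a
# wide word from its own SRAM bank in the same "cycle": no bank is
# accessed twice, so all reads can proceed concurrently.

NUM_BANKS = 8
WORDS_PER_BANK = 4
WIDTH = 4  # elements per wide word

# Wide SRAM: banks[b][row] is one wide word (a list of WIDTH elements).
banks = [[[b * 100 + r * 10 + e for e in range(WIDTH)]
          for r in range(WORDS_PER_BANK)]
         for b in range(NUM_BANKS)]

def concurrent_read(row):
    """One modeled cycle: operational unit u reads row `row` of bank u."""
    return [banks[u][row] for u in range(NUM_BANKS)]

words = concurrent_read(row=2)
print(len(words))   # 8 wide words fetched in one modeled cycle
print(words[3])     # bank 3, row 2 -> [320, 321, 322, 323]
```

Because each unit touches a distinct bank, the model has no access conflicts, which is what lets the hardware scale the number of units with the number of banks.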
  • Patent number: 11816572
    Abstract: A machine learning hardware accelerator architecture and associated techniques are disclosed. The architecture features multiple memory banks of very wide SRAM that may be concurrently accessed by a large number of parallel operational units. Each operational unit supports an instruction set specific to machine learning, including optimizations for performing tensor operations and convolutions. Also disclosed are optimized addressing, an optimized shift reader, and variations on a multicast network that permutes and copies data and associates it with an operational unit to support those operations.
    Type: Grant
    Filed: October 14, 2021
    Date of Patent: November 14, 2023
    Assignee: Intel Corporation
    Inventors: Jeremy Bruestle, Choong Ng
  • Patent number: 11790267
    Abstract: An architecture and associated techniques of an apparatus for hardware accelerated machine learning are disclosed. The architecture features multiple memory banks storing tensor data. The tensor data may be concurrently fetched by a number of execution units working in parallel. Each operational unit supports an instruction set specific to certain primitive operations for machine learning. An instruction decoder is employed to decode a machine learning instruction and reveal one or more of the primitive operations to be performed by the execution units, as well as the memory addresses of the operands of the primitive operations as stored in the memory banks. The primitive operations, when performed by the execution units, may generate output that can be saved into the memory banks. The fetching of the operands and the saving of the output may involve permutation and duplication of the data elements involved.
    Type: Grant
    Filed: October 14, 2020
    Date of Patent: October 17, 2023
    Assignee: Intel Corporation
    Inventors: Jeremy Bruestle, Choong Ng
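The decode step this abstract describes, turning a machine learning instruction into a primitive operation plus operand addresses in the memory banks, can be sketched as follows. The instruction encoding, opcode table, and bank layout here are hypothetical, invented purely for illustration.

```python
# Hypothetical sketch: decode an instruction into a primitive op and the
# banks holding its operands, execute it, and save the result to a bank.

PRIMITIVES = {
    0: lambda a, b: a + b,   # elementwise add
    1: lambda a, b: a * b,   # elementwise multiply
}

def decode(instr):
    """instr = (opcode, src_bank_a, src_bank_b, dst_bank) -- a made-up encoding."""
    opcode, src_a, src_b, dst = instr
    return PRIMITIVES[opcode], src_a, src_b, dst

def execute(instr, banks):
    op, src_a, src_b, dst = decode(instr)
    # Fetch operands from the source banks, apply the primitive elementwise,
    # and write the output back into the destination bank.
    banks[dst] = [op(x, y) for x, y in zip(banks[src_a], banks[src_b])]

banks = {0: [1, 2, 3], 1: [10, 20, 30], 2: None}
execute((0, 0, 1, 2), banks)   # opcode 0 = elementwise add
print(banks[2])                # [11, 22, 33]
```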
  • Patent number: 11704548
    Abstract: In one embodiment, a system to deterministically transfer partitions of contiguous computer readable data in constant time includes a computer readable memory and a modulo address generator. The computer readable memory is organized into D banks, to contain contiguous data including a plurality of data elements of size M which are constituent data elements of a vector with N data elements, the data elements to start at an offset address O. The modulo address generator is to generate the addresses of the data elements of the vector stored in the computer readable memory, the modulo address generator including at least one forward permutation to permute data elements with addresses of the form O+M*i, where 0<=i<N. Other embodiments are described and claimed.
    Type: Grant
    Filed: August 10, 2021
    Date of Patent: July 18, 2023
    Assignee: Intel Corporation
    Inventors: Jeremy Bruestle, Choong Ng
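The address arithmetic in this abstract is concrete enough to demonstrate directly: element i of the vector lives at address O + M*i, and taking that address modulo the bank count D selects a bank. A minimal sketch, using the abstract's variable names but parameter values chosen for illustration (the point, that picking D relatively prime to the stride M spreads consecutive elements across all banks, is a standard number-theoretic fact, not a quote from the claims):

```python
# Modulo address generation: element i at address O + M*i, bank = addr % D.
D = 7          # number of banks; chosen relatively prime to the stride M
M = 4          # element size
N = 7          # data elements in the vector
O = 12         # starting offset

addrs = [O + M * i for i in range(N)]
banks = [a % D for a in addrs]
print(banks)                  # 7 consecutive elements, 7 distinct banks
print(len(set(banks)) == D)   # True: one access per bank -> deterministic time
```

With gcd(M, D) = 1, the sequence of bank indices cycles through all D residues before repeating, so a D-element partition can be moved with exactly one access per bank.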
  • Publication number: 20220067522
    Abstract: A machine learning hardware accelerator architecture and associated techniques are disclosed. The architecture features multiple memory banks of very wide SRAM that may be concurrently accessed by a large number of parallel operational units. Each operational unit supports an instruction set specific to machine learning, including optimizations for performing tensor operations and convolutions. Also disclosed are optimized addressing, an optimized shift reader, and variations on a multicast network that permutes and copies data and associates it with an operational unit to support those operations.
    Type: Application
    Filed: October 14, 2021
    Publication date: March 3, 2022
    Inventors: Jeremy Bruestle, Choong Ng
  • Publication number: 20210374512
    Abstract: In one embodiment, a system to deterministically transfer partitions of contiguous computer readable data in constant time includes a computer readable memory and a modulo address generator. The computer readable memory is organized into D banks, to contain contiguous data including a plurality of data elements of size M which are constituent data elements of a vector with N data elements, the data elements to start at an offset address O. The modulo address generator is to generate the addresses of the data elements of the vector stored in the computer readable memory, the modulo address generator including at least one forward permutation to permute data elements with addresses of the form O+M*i, where 0<=i<N.
    Type: Application
    Filed: August 10, 2021
    Publication date: December 2, 2021
    Inventors: Jeremy Bruestle, Choong Ng
  • Patent number: 11170294
    Abstract: A machine learning hardware accelerator architecture and associated techniques are disclosed. The architecture features multiple memory banks of very wide SRAM that may be concurrently accessed by a large number of parallel operational units. Each operational unit supports an instruction set specific to machine learning, including optimizations for performing tensor operations and convolutions. Also disclosed are optimized addressing, an optimized shift reader, and variations on a multicast network that permutes and copies data and associates it with an operational unit to support those operations.
    Type: Grant
    Filed: January 5, 2017
    Date of Patent: November 9, 2021
    Assignee: Intel Corporation
    Inventors: Jeremy Bruestle, Choong Ng
  • Patent number: 11120329
    Abstract: Neural network specific hardware acceleration optimizations are disclosed, including an optimized multicast network and an optimized DRAM transfer unit that perform in constant or linear time. The multicast network is a set of switch nodes organized into layers and configured to operate as a Beneš network. Configuration data may be accessed by all switch nodes in the network. Each layer is configured to perform a Beneš network transformation of the previous layer within a computer instruction. Since the computer instructions are pipelined, the entire network of switch nodes may be configured in constant or linear time. Similarly, a DRAM transfer unit configured to access memory in strides organizes memory into banks indexed by prime or relatively prime number amounts. The index value is selected so as not to cause memory address collisions. Upon receiving a memory specification, the DRAM transfer unit may calculate out strides, thereby accessing an entire tile of a tensor in constant or linear time.
    Type: Grant
    Filed: May 5, 2017
    Date of Patent: September 14, 2021
    Assignee: Intel Corporation
    Inventors: Jeremy Bruestle, Choong Ng
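The prime-banked striding this abstract describes can be shown with a small sketch: when the bank count D is prime, any stride that is not a multiple of D visits all D banks before repeating, so a strided access (for example, walking down a column of a row-major matrix) hits no bank twice. The parameters below are hypothetical, not taken from the claims.

```python
# Strided access with a prime bank count: column 0 of a row-major matrix
# of width `cols` has address stride `cols`; with D prime and D not a
# divisor of the stride, D consecutive accesses land in D distinct banks.

D = 5                   # prime number of DRAM banks
cols = 8                # matrix width -> column stride is 8 elements

column_addrs = [r * cols for r in range(D)]   # first D elements of column 0
banks = [a % D for a in column_addrs]
print(banks)            # 8 % 5 = 3, so banks advance by 3 (mod 5): all distinct
print(sorted(banks))    # [0, 1, 2, 3, 4] -- every bank used exactly once
```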
  • Publication number: 20210049508
    Abstract: An architecture and associated techniques of an apparatus for hardware accelerated machine learning are disclosed. The architecture features multiple memory banks storing tensor data. The tensor data may be concurrently fetched by a number of execution units working in parallel. Each operational unit supports an instruction set specific to certain primitive operations for machine learning. An instruction decoder is employed to decode a machine learning instruction and reveal one or more of the primitive operations to be performed by the execution units, as well as the memory addresses of the operands of the primitive operations as stored in the memory banks. The primitive operations, when performed by the execution units, may generate output that can be saved into the memory banks. The fetching of the operands and the saving of the output may involve permutation and duplication of the data elements involved.
    Type: Application
    Filed: October 14, 2020
    Publication date: February 18, 2021
    Inventors: Jeremy Bruestle, Choong Ng
  • Patent number: 10817802
    Abstract: An architecture and associated techniques of an apparatus for hardware accelerated machine learning are disclosed. The architecture features multiple memory banks storing tensor data. The tensor data may be concurrently fetched by a number of execution units working in parallel. Each operational unit supports an instruction set specific to certain primitive operations for machine learning. An instruction decoder is employed to decode a machine learning instruction and reveal one or more of the primitive operations to be performed by the execution units, as well as the memory addresses of the operands of the primitive operations as stored in the memory banks. The primitive operations, when performed by the execution units, may generate output that can be saved into the memory banks. The fetching of the operands and the saving of the output may involve permutation and duplication of the data elements involved.
    Type: Grant
    Filed: May 5, 2017
    Date of Patent: October 27, 2020
    Assignee: Intel Corporation
    Inventors: Jeremy Bruestle, Choong Ng
  • Patent number: 10592213
    Abstract: Techniques to preprocess tensor operations prior to code generation to optimize compilation are disclosed. A computer readable representation of a linear algebra or tensor operation is received. A code transformation software component performs transformations including output reduction and fraction removal. The result is a set of linear equations of a single variable with integer coefficients. Such a set lends itself to more efficient code generation during compilation by a code generation software component. Use cases disclosed include targeting a machine learning hardware accelerator, receiving code in the form of an intermediate language generated by a cross-compiler with multiple front ends supporting multiple programming languages, and cloud deployment and execution scenarios.
    Type: Grant
    Filed: October 18, 2017
    Date of Patent: March 17, 2020
    Assignee: Intel Corporation
    Inventors: Jeremy Bruestle, Choong Ng
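The "fraction removal" transformation this abstract names has a standard form that is easy to illustrate: multiply a linear equation with rational coefficients through by the least common multiple of the denominators, leaving integer coefficients. The sketch below shows only that single step; the patent's actual transformation pipeline (including output reduction) is more involved, and the function name is made up.

```python
# Fraction removal: scale rational coefficients by the LCM of their
# denominators so the resulting linear equation has integer coefficients.
from fractions import Fraction
from math import lcm

def remove_fractions(coeffs):
    """coeffs: rational coefficients of a0*x0 + a1*x1 + ... = c."""
    m = lcm(*(f.denominator for f in coeffs))
    return [int(f * m) for f in coeffs]

# (1/2)x + (2/3)y = 5/6  ->  multiply through by lcm(2, 3, 6) = 6
eq = [Fraction(1, 2), Fraction(2, 3), Fraction(5, 6)]
print(remove_fractions(eq))   # [3, 4, 5], i.e. 3x + 4y = 5
```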
  • Publication number: 20190200369
    Abstract: A facility for employing multiple frequencies in a secure distributed hierarchical convergence network is described. The facility receives a signal in a first frequency, converts the received signal to an internal representation, applies a business rule to the converted signal, and, when the business rule indicates that the signal should be transmitted in a second frequency, causes the internal representation of the signal to be translated to a second frequency and transmitted in the second frequency.
    Type: Application
    Filed: October 8, 2018
    Publication date: June 27, 2019
    Inventors: Mark L. Tucker, Jeremy Bruestle, Riley Eller, Brian Retford, Choong Ng
  • Patent number: 10098132
    Abstract: A facility for employing multiple frequencies in a secure distributed hierarchical convergence network is described. The facility receives a signal in a first frequency, converts the received signal to an internal representation, applies a business rule to the converted signal, and, when the business rule indicates that the signal should be transmitted in a second frequency, causes the internal representation of the signal to be translated to a second frequency and transmitted in the second frequency.
    Type: Grant
    Filed: October 19, 2015
    Date of Patent: October 9, 2018
    Assignee: COCO COMMUNICATIONS CORP
    Inventors: Mark L. Tucker, Jeremy Bruestle, Riley Eller, Brian Retford, Choong Ng
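The receive / convert / apply-rule / retransmit flow in this abstract can be sketched as a small pipeline. Everything here is hypothetical: the internal representation, the example rule, and the frequency values are all invented for illustration.

```python
# Hypothetical sketch of the abstract's flow: receive on one frequency,
# convert to an internal representation, apply a business rule, and
# retransmit on a second frequency if the rule requires it.

def receive(signal, freq):
    return {"payload": signal, "received_on": freq}   # internal representation

def business_rule(msg):
    """Made-up rule: forward 'emergency' traffic onto the 900 MHz band."""
    if "emergency" in msg["payload"]:
        return 900.0
    return None   # no retransmission required

def process(signal, freq):
    msg = receive(signal, freq)
    out_freq = business_rule(msg)
    if out_freq is not None:
        return ("transmit", msg["payload"], out_freq)
    return ("drop", msg["payload"], None)

print(process("emergency: flood warning", 2400.0))  # retransmitted on 900 MHz
print(process("routine telemetry", 2400.0))         # no second transmission
```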
  • Publication number: 20180107456
    Abstract: Techniques to preprocess tensor operations prior to code generation to optimize compilation are disclosed. A computer readable representation of a linear algebra or tensor operation is received. A code transformation software component performs transformations including output reduction and fraction removal. The result is a set of linear equations of a single variable with integer coefficients. Such a set lends itself to more efficient code generation during compilation by a code generation software component. Use cases disclosed include targeting a machine learning hardware accelerator, receiving code in the form of an intermediate language generated by a cross-compiler with multiple front ends supporting multiple programming languages, and cloud deployment and execution scenarios.
    Type: Application
    Filed: October 18, 2017
    Publication date: April 19, 2018
    Inventors: Jeremy Bruestle, Choong Ng
  • Publication number: 20170337468
    Abstract: Neural network specific hardware acceleration optimizations are disclosed, including an optimized multicast network and an optimized DRAM transfer unit that perform in constant or linear time. The multicast network is a set of switch nodes organized into layers and configured to operate as a Beneš network. Configuration data may be accessed by all switch nodes in the network. Each layer is configured to perform a Beneš network transformation of the previous layer within a computer instruction. Since the computer instructions are pipelined, the entire network of switch nodes may be configured in constant or linear time. Similarly, a DRAM transfer unit configured to access memory in strides organizes memory into banks indexed by prime or relatively prime number amounts. The index value is selected so as not to cause memory address collisions.
    Type: Application
    Filed: May 5, 2017
    Publication date: November 23, 2017
    Inventors: Jeremy Bruestle, Choong Ng
  • Publication number: 20170323224
    Abstract: An architecture and associated techniques of an apparatus for hardware accelerated machine learning are disclosed. The architecture features multiple memory banks storing tensor data. The tensor data may be concurrently fetched by a number of execution units working in parallel. Each operational unit supports an instruction set specific to certain primitive operations for machine learning. An instruction decoder is employed to decode a machine learning instruction and reveal one or more of the primitive operations to be performed by the execution units, as well as the memory addresses of the operands of the primitive operations as stored in the memory banks. The primitive operations, when performed by the execution units, may generate output that can be saved into the memory banks. The fetching of the operands and the saving of the output may involve permutation and duplication of the data elements involved.
    Type: Application
    Filed: May 5, 2017
    Publication date: November 9, 2017
    Inventors: Jeremy Bruestle, Choong Ng
  • Publication number: 20170200094
    Abstract: A machine learning hardware accelerator architecture and associated techniques are disclosed. The architecture features multiple memory banks of very wide SRAM that may be concurrently accessed by a large number of parallel operational units. Each operational unit supports an instruction set specific to machine learning, including optimizations for performing tensor operations and convolutions. Also disclosed are optimized addressing, an optimized shift reader, and variations on a multicast network that permutes and copies data and associates it with an operational unit to support those operations.
    Type: Application
    Filed: January 5, 2017
    Publication date: July 13, 2017
    Inventors: Jeremy Bruestle, Choong Ng
  • Patent number: 9374277
    Abstract: A facility for publishing information in a distributed network without a central management infrastructure is described. In various embodiments, the facility receives an indication of a new node and a destination node, the new node omitted from a contact list associated with the destination node, the contact list having an approximately logarithmic distribution of neighboring nodes; introduces the new node to the destination node via a permanent circuit; and causes the destination node to add the new node to the contact list when adding the new node improves the logarithmic distribution of neighboring nodes.
    Type: Grant
    Filed: March 30, 2015
    Date of Patent: June 21, 2016
    Assignee: CoCo Communications Corp.
    Inventors: Mark L. Tucker, Jeremy Bruestle, Riley Eller, Brian Retford, Choong Ng
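An "approximately logarithmic distribution of neighboring nodes" can be modeled by grouping contacts into buckets by the order of magnitude of their distance and capping each bucket, in the spirit of distance-bucketed overlay routing tables. The bucket scheme (XOR distance, bit-length bucketing) and the capacity below are illustrative assumptions, not details from the patent.

```python
# Hypothetical contact list with a logarithmic neighbor distribution:
# nodes fall into bucket floor(log2(distance)) and a new node is added
# only while its bucket is underfilled, keeping few near-duplicates.

BUCKET_CAPACITY = 1   # made-up capacity, small so the rejection case shows

def bucket_index(self_id, node_id):
    return (self_id ^ node_id).bit_length() - 1   # ~ floor(log2(distance))

def maybe_add(contacts, self_id, node_id):
    """Add node_id only if doing so improves the bucket distribution."""
    b = bucket_index(self_id, node_id)
    bucket = contacts.setdefault(b, [])
    if len(bucket) < BUCKET_CAPACITY:
        bucket.append(node_id)
        return True
    return False

contacts = {}
added = [maybe_add(contacts, 0b0000, n) for n in (0b0001, 0b0011, 0b0010, 0b1000)]
print(added)   # [True, True, False, True]: third node's bucket was already full
```

Each bucket covers an exponentially larger distance range than the last, so the accepted contacts are dense nearby and sparse far away, the logarithmic shape the abstract refers to.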
  • Publication number: 20160143075
    Abstract: A facility for employing multiple frequencies in a secure distributed hierarchical convergence network is described. The facility receives a signal in a first frequency, converts the received signal to an internal representation, applies a business rule to the converted signal, and, when the business rule indicates that the signal should be transmitted in a second frequency, causes the internal representation of the signal to be translated to a second frequency and transmitted in the second frequency.
    Type: Application
    Filed: October 19, 2015
    Publication date: May 19, 2016
    Inventors: Mark L. Tucker, Jeremy Bruestle, Riley Eller, Brian Retford, Choong Ng
  • Patent number: D905333
    Type: Grant
    Filed: November 7, 2019
    Date of Patent: December 15, 2020
    Assignee: A&S DISTRIBUTION SDN. BHD.
    Inventor: Yew Choong Ng