Patents by Inventor Sean Lie

Sean Lie has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

WAVELET REPRESENTATION FOR ACCELERATED DEEP LEARNING

Publication number: 20190258919

Abstract: Techniques in advanced deep learning provide improvements in one or more of accuracy, performance, and energy efficiency. An array of processing elements performs flow-based computations on wavelets of data. Each processing element has a compute element with dedicated storage and a routing element. Each router enables communication with nearest neighbors in a 2D mesh. The communication is via wavelets in accordance with a representation comprising an index specifier, a virtual channel specifier, a task specifier, a data element specifier, and an optional control/data specifier. The virtual channel specifier and the task specifier are associated with one or more instructions. The index specifier and the data element specifier are optionally associated with operands of the one or more instructions.

Type: Application

Filed: April 15, 2018

Publication date: August 22, 2019

Applicant: Cerebras Systems Inc.

Inventors: Sean LIE, Gary R. LAUTERBACH, Michael Edwin JAMES, Michael MORRISON, Srikanth AREKAPUDI
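
The abstract lists the fields a wavelet carries but not how they are encoded. The sketch below packs those fields into a single integer under purely hypothetical bit widths (16-bit data, 16-bit index, 6-bit task, 5-bit virtual channel, 1-bit control flag); the widths, field order, and the `Wavelet` class are illustrative assumptions, not the encoding described in the application.

```python
from dataclasses import dataclass

# Hypothetical layout, packed from the least significant bit upward.
# The abstract names the fields but does not specify widths or order.
FIELDS = (
    ("data", 16),
    ("index", 16),
    ("task", 6),
    ("vc", 5),
    ("control", 1),
)


@dataclass(frozen=True)
class Wavelet:
    vc: int           # virtual channel specifier
    task: int         # task specifier
    index: int        # index specifier (optionally an instruction operand)
    data: int         # data element specifier (optionally an instruction operand)
    control: int = 0  # optional control/data specifier

    def pack(self) -> int:
        """Concatenate the fields into one integer, rejecting values that overflow."""
        word, shift = 0, 0
        for name, width in FIELDS:
            value = getattr(self, name)
            if not 0 <= value < (1 << width):
                raise ValueError(f"{name}={value} does not fit in {width} bits")
            word |= value << shift
            shift += width
        return word

    @classmethod
    def unpack(cls, word: int) -> "Wavelet":
        """Split a packed integer back into its named fields."""
        values, shift = {}, 0
        for name, width in FIELDS:
            values[name] = (word >> shift) & ((1 << width) - 1)
            shift += width
        return cls(**values)
```

For example, `Wavelet(vc=3, task=7, index=42, data=0xBEEF).pack()` yields an integer whose low 16 bits are `0xBEEF`, and `unpack` recovers the original fields. Keeping the layout in one table means a different width assumption changes only `FIELDS`.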

ACCELERATED DEEP LEARNING

Publication number: 20180314941

Abstract: Techniques in advanced deep learning provide improvements in one or more of accuracy, performance, and energy efficiency, such as accuracy of learning, accuracy of prediction, speed of learning, performance of learning, and energy efficiency of learning. An array of processing elements performs flow-based computations on wavelets of data. Each processing element has a respective compute element and a respective routing element. Each compute element has processing resources and memory resources. Each router enables communication via wavelets with at least nearest neighbors in a 2D mesh. Stochastic gradient descent, mini-batch gradient descent, and continuous propagation gradient descent are techniques usable to train weights of a neural network modeled by the processing elements. Reverse checkpoint is usable to reduce memory usage during the training.

Type: Application

Filed: February 23, 2018

Publication date: November 1, 2018

Inventors: Sean LIE, Michael MORRISON, Michael Edwin JAMES, Gary R. LAUTERBACH, Srikanth AREKAPUDI
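
The abstract names mini-batch gradient descent as one technique for training the weights. As a generic illustration only (not the Cerebras dataflow implementation, and without continuous propagation or reverse checkpointing), the sketch fits a one-variable linear model `y = w*x + b` by averaging the squared-error gradient over small shuffled batches; the function name and hyperparameters are assumptions.

```python
import random


def minibatch_sgd(xs, ys, lr=0.05, batch_size=4, epochs=500, seed=0):
    """Fit y ~ w*x + b with mini-batch gradient descent on mean squared error."""
    rng = random.Random(seed)          # fixed seed for reproducible shuffling
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Accumulate the gradient of the squared error over the mini-batch.
            grad_w = grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                grad_w += 2 * err * xs[i]
                grad_b += 2 * err
            # One weight update per mini-batch, using the batch-averaged gradient.
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
    return w, b
```

With `batch_size=1` the same loop reduces to plain stochastic gradient descent, and with `batch_size=len(xs)` to full-batch gradient descent, which is why the three are usually presented as points on one spectrum.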

Thin provisioning architecture for high seek-time devices

Patent number: 9734081

Abstract: A compute server accomplishes physical address to virtual address translation to optimize physical storage capacity via thin provisioning techniques. The thin provisioning techniques can minimize disk seeks during command functions by utilizing a translation table and free list stored to both one or more physical storage devices as well as to a cache. The cached translation table and free list can be updated directly in response to disk write procedures. A read-only copy of the cached translation table and free list can be created and stored to a physical storage device for use in building the cached translation table and free list upon a boot of the compute server. The copy may also be used to repair the cached translation table in the event of a power failure or other event affecting the cache.

Type: Grant

Filed: December 10, 2014

Date of Patent: August 15, 2017

Assignee: Advanced Micro Devices, Inc.

Inventor: Sean Lie
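
A thin-provisioned volume allocates physical blocks only when a virtual block is first written. The in-memory sketch below keeps a cached translation table and free list, updates them directly on writes, and takes a read-only snapshot that can rebuild the cache; the class, block size, and snapshot format are hypothetical, and real persistence to disk (and any writes made after the snapshot) is not modeled.

```python
class ThinVolume:
    """Toy thin-provisioned volume: virtual blocks get physical blocks on first write."""

    def __init__(self, physical_blocks: int, block_size: int = 4):
        self.block_size = block_size
        self.physical = [bytes(block_size)] * physical_blocks
        self.table = {}                               # cached translation table: virtual -> physical
        self.free_list = list(range(physical_blocks))  # cached free list of physical blocks

    def write(self, vblock: int, data: bytes) -> None:
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block")
        pblock = self.table.get(vblock)
        if pblock is None:
            if not self.free_list:
                raise RuntimeError("out of physical capacity")
            pblock = self.free_list.pop(0)
            self.table[vblock] = pblock               # update the cached table directly on write
        self.physical[pblock] = data

    def read(self, vblock: int) -> bytes:
        pblock = self.table.get(vblock)
        # Unallocated virtual blocks read back as zeros without a disk access.
        return bytes(self.block_size) if pblock is None else self.physical[pblock]

    def snapshot(self) -> tuple:
        """Read-only copy of table and free list, e.g. to store for rebuilding at boot."""
        return (tuple(sorted(self.table.items())), tuple(self.free_list))

    def restore(self, snap: tuple) -> None:
        """Rebuild the cached table and free list from a snapshot."""
        table_items, free = snap
        self.table = dict(table_items)
        self.free_list = list(free)
```

Because lookups hit the cached table rather than an on-disk structure, a read or write costs at most one seek to the data block itself, which is the property that matters for high seek-time devices.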

Distributed packet switching in a source routed cluster server

Patent number: 9331958

Abstract: A cluster compute server includes nodes coupled in a network topology via a fabric that source routes packets based on location identifiers assigned to the nodes, the location identifiers representing the locations in the network topology. Host interfaces at the nodes may be associated with link layer addresses that do not reflect the location identifier associated with the nodes. The nodes therefore implement locally cached link layer address translations that map link layer addresses to corresponding location identifiers in the network topology. In response to originating a packet directed to one of these host interfaces, the node accesses the local translation cache to obtain a link layer address translation for a destination link layer address of the packet. When a node experiences a cache miss, the node queries a management node to obtain the specified link layer address translation from a master translation table maintained by the management node.

Type: Grant

Filed: December 31, 2012

Date of Patent: May 3, 2016

Assignee: Advanced Micro Devices, Inc.

Inventors: Sean Lie, Vikrama Ditya, Gary R. Lauterbach
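
The lookup path in this abstract is a local cache backed by a master table. The sketch below models it with two hypothetical classes: a node resolves a destination link-layer (MAC) address to a location identifier from its own cache and queries the management node only on a miss. The `(x, y, z)` location tuples and the MAC strings are illustrative assumptions.

```python
class ManagementNode:
    """Holds the master table mapping link-layer addresses to fabric locations."""

    def __init__(self, master_table: dict):
        self.master_table = master_table   # link-layer address -> location identifier
        self.queries = 0                   # how many cache misses reached this node

    def lookup(self, mac: str):
        self.queries += 1
        return self.master_table[mac]


class Node:
    """A compute node with a locally cached set of address translations."""

    def __init__(self, manager: ManagementNode):
        self.manager = manager
        self.cache = {}

    def route_for(self, dest_mac: str):
        """Return the location identifier used to source-route a packet to dest_mac."""
        loc = self.cache.get(dest_mac)
        if loc is None:                    # cache miss: ask the management node
            loc = self.manager.lookup(dest_mac)
            self.cache[dest_mac] = loc     # later packets resolve locally
        return loc
```

Only the first packet to a given destination pays the round trip to the management node; subsequent packets are routed from the local cache.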

THIN PROVISIONING ARCHITECTURE FOR HIGH SEEK-TIME DEVICES

Publication number: 20150331807

Abstract: A compute server accomplishes physical address to virtual address translation to optimize physical storage capacity via thin provisioning techniques. The thin provisioning techniques can minimize disk seeks during command functions by utilizing a translation table and free list stored to both one or more physical storage devices as well as to a cache. The cached translation table and free list can be updated directly in response to disk write procedures. A read-only copy of the cached translation table and free list can be created and stored to a physical storage device for use in building the cached translation table and free list upon a boot of the compute server. The copy may also be used to repair the cached translation table in the event of a power failure or other event affecting the cache.

Type: Application

Filed: December 10, 2014

Publication date: November 19, 2015

Inventor: Sean Lie

Hop-by-hop error detection in a server system

Patent number: 9176799

Abstract: A server system performs error detection on a hop-by-hop basis at multiple compute nodes, thereby facilitating the detection of a compute node experiencing failure. The server system communicates a packet from an originating node to a destination node by separating the packet into multiple flow control digits (flits) and routing the flits using a series of hops over a set of intermediate nodes. The packet's final flit includes error detection information, such as checksum data. As each intermediate node receives the final flit, it performs error detection using the error detection information. The pattern of nodes that detect an error indicates which intermediate node has experienced a failure.

Type: Grant

Filed: December 31, 2012

Date of Patent: November 3, 2015

Assignee: Advanced Micro Devices, Inc.

Inventors: Min Xu, Sean Lie, Gene Shen
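
The localization idea is that every node downstream of a fault sees a bad checksum, so the node just before the first detector is the suspect. The sketch below splits a payload into flits, appends a CRC-32 (via Python's `zlib.crc32`) as the final flit, and has each hop check it before forwarding; the 4-byte flit size, the choice of CRC-32, and the way a faulty node corrupts data are all hypothetical modeling choices.

```python
import zlib

FLIT_SIZE = 4  # hypothetical flit payload size in bytes


def to_flits(payload: bytes) -> list:
    """Split a packet into flits; the final flit carries a CRC-32 of the payload."""
    flits = [payload[i:i + FLIT_SIZE] for i in range(0, len(payload), FLIT_SIZE)]
    flits.append(zlib.crc32(payload).to_bytes(4, "big"))
    return flits


def check(flits: list) -> bool:
    """Recompute the CRC over the data flits and compare it with the final flit."""
    payload = b"".join(flits[:-1])
    return zlib.crc32(payload).to_bytes(4, "big") == flits[-1]


def send(payload: bytes, path: list, faulty=None) -> list:
    """Forward flits hop by hop along path; return the nodes whose check failed."""
    flits = to_flits(payload)
    detected = []
    for node in path:
        if not check(flits):               # every hop checks what it received
            detected.append(node)
        if node == faulty:
            # Model a failing node corrupting the first flit as it forwards it.
            flits = [bytes(b ^ 0xFF for b in flits[0])] + flits[1:]
    return detected


def locate_failure(path: list, detected: list):
    """The node immediately before the first detecting node is the suspect."""
    if not detected:
        return None
    return path[path.index(detected[0]) - 1]
```

With `path = ["src", "n1", "n2", "n3", "dst"]` and `faulty="n2"`, nodes `n3` and `dst` both report an error, and `locate_failure` points at `n2`. An end-to-end check alone would only tell the destination that something failed somewhere along the path.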

PASS-THROUGH ROUTING AT INPUT/OUTPUT NODES OF A CLUSTER SERVER

Publication number: 20150036681

Abstract: Node locations in the topology of a cluster compute server are designated as input/output (I/O) nodes that provide input and output for the cluster compute server. Examples of I/O nodes include network nodes that provide an interface for the cluster compute server to an external network, and storage nodes that provide access to storage devices for the cluster compute server. The I/O nodes are configured to analyze received messages and identify whether the message is targeted to the receiving I/O node or to another node of the cluster compute server. Those messages targeted to the I/O node are provided to a processing module of the I/O node for processing.

Type: Application

Filed: May 7, 2014

Publication date: February 5, 2015

Applicant: Advanced Micro Devices, Inc.

Inventors: Sean Lie, Timothy Botsford, Min Xu
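
The decision an I/O node makes here is a simple classification of each received message. In the sketch below, an I/O node compares a message's destination with its own location and hands it either to its processing module or back to the fabric; the `IONode` class, the `"dest"` key, and the callback interface are hypothetical, and the abstract does not say how pass-through forwarding itself is implemented.

```python
from typing import Callable


class IONode:
    """An I/O node that processes its own traffic and passes everything else through."""

    def __init__(self, location, process: Callable[[dict], None],
                 forward: Callable[[dict], None]):
        self.location = location   # this node's location in the cluster topology
        self.process = process     # the node's processing module
        self.forward = forward     # pass-through path back into the fabric

    def receive(self, message: dict) -> None:
        if message["dest"] == self.location:
            self.process(message)  # targeted to this I/O node
        else:
            self.forward(message)  # targeted to another node: pass it through
```

Usage: construct `IONode((0, 3), processed.append, forwarded.append)` with two lists as sinks; a message addressed to `(0, 3)` lands in `processed`, and any other destination lands in `forwarded`.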

HOP-BY-HOP ERROR DETECTION IN A SERVER SYSTEM

Publication number: 20140189443

Abstract: A server system performs error detection on a hop-by-hop basis at multiple compute nodes, thereby facilitating the detection of a compute node experiencing failure. The server system communicates a packet from an originating node to a destination node by separating the packet into multiple flow control digits (flits) and routing the flits using a series of hops over a set of intermediate nodes. The packet's final flit includes error detection information, such as checksum data. As each intermediate node receives the final flit, it performs error detection using the error detection information. The pattern of nodes that detect an error indicates which intermediate node has experienced a failure.

Type: Application

Filed: December 31, 2012

Publication date: July 3, 2014

Applicant: Advanced Micro Devices, Inc.

Inventors: Min Xu, Sean Lie, Gene Shen

DISTRIBUTED PACKET SWITCHING IN A SOURCE ROUTED CLUSTER SERVER

Publication number: 20140185611

Abstract: A cluster compute server includes nodes coupled in a network topology via a fabric that source routes packets based on location identifiers assigned to the nodes, the location identifiers representing the locations in the network topology. Host interfaces at the nodes may be associated with link layer addresses that do not reflect the location identifier associated with the nodes. The nodes therefore implement locally cached link layer address translations that map link layer addresses to corresponding location identifiers in the network topology. In response to originating a packet directed to one of these host interfaces, the node accesses the local translation cache to obtain a link layer address translation for a destination link layer address of the packet. When a node experiences a cache miss, the node queries a management node to obtain the specified link layer address translation from a master translation table maintained by the management node.

Type: Application

Filed: December 31, 2012

Publication date: July 3, 2014

Applicant: Advanced Micro Devices, Inc.

Inventors: Sean Lie, Vikrama Ditya, Gary R. Lauterbach

RAW FABRIC INTERFACE FOR SERVER SYSTEM WITH VIRTUALIZED INTERFACES

Publication number: 20140188996

Abstract: A server system allows the system's nodes to access a fabric interconnect of the server system directly, rather than via an interface that virtualizes the fabric interconnect as a network or storage interface. The server system also employs controllers to provide an interface to the fabric interconnect via a standard protocol, such as a network protocol or a storage protocol. The server system thus facilitates efficient and flexible transfer of data between the server system's nodes.

Type: Application

Filed: December 31, 2012

Publication date: July 3, 2014

Applicant: Advanced Micro Devices, Inc.

Inventors: Sean Lie, Gary Lauterbach

UNIFIED SCHEDULER FOR A PROCESSOR MULTI-PIPELINE EXECUTION UNIT AND METHODS

Publication number: 20120144173

Abstract: A unified scheduler for a processor execution unit and methods are disclosed for providing faster throughput of micro-instruction/operation execution with respect to a multi-pipeline processor execution unit. In one example, an execution unit has a plurality of pipelines that operate at a predetermined clock rate, each pipeline configured to process a selected subset of microinstructions. The execution unit has a scheduler that includes a unified queue configured to queue microinstructions for all of the pipelines and a picker configured to direct a queued microinstruction to an appropriate pipeline for processing based on an indication of readiness for picking. Preferably, when all of the pipelines are ready to receive a microinstruction for processing and there is at least one microinstruction queued that is ready for picking for each pipeline, the picker picks and directs a queued microinstruction to each of the pipelines in a single clock cycle.

Type: Application

Filed: December 1, 2010

Publication date: June 7, 2012

Applicant: Advanced Micro Devices, Inc.

Inventors: Mike Butler, Ganesh Venkataramanan, Sean Lie
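
The key point of the abstract is that one queue feeds every pipeline and the picker can fill all of them in the same cycle. The sketch below models one picker cycle sequentially: for each pipeline it issues the oldest ready micro-op that pipeline accepts; hardware would make these selections in parallel. The `MicroOp` class, the `kind` labels, and the oldest-first policy are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(eq=False)   # identity equality: two distinct ops never compare equal
class MicroOp:
    name: str
    kind: str          # class of pipeline that can execute it, e.g. "alu" or "mem"
    ready: bool = True


def pick(queue: list, pipelines: dict) -> dict:
    """One picker cycle over a single unified queue (ordered oldest first).

    pipelines maps a pipeline name to the set of micro-op kinds it accepts.
    Each pipeline receives the oldest ready micro-op it can execute, so all
    pipelines can be fed in the same cycle.
    """
    issued = {}
    for pipe, kinds in pipelines.items():
        for op in queue:
            if op.ready and op.kind in kinds and op not in issued.values():
                issued[pipe] = op
                break
    for op in issued.values():
        queue.remove(op)   # picked ops leave the unified queue
    return issued
```

For a queue holding `add1` (alu), `ld1` (mem), a not-yet-ready `add2` (alu), and `add3` (alu), with two ALU pipelines and one memory pipeline, one cycle issues `add1`, `add3`, and `ld1` and leaves only `add2` queued.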

LOAD BALANCING WHEN ASSIGNING OPERATIONS IN A PROCESSOR

Publication number: 20120110594

Abstract: A method and apparatus for assigning operations in a processor are provided. An incoming instruction is received. The incoming instruction is capable of being processed only by a first processing unit (PU), only by a second PU, or by either the first or the second PU. The processing of the first and second PUs is load balanced by assigning the received instructions capable of being processed by either the first or the second PU based on a metric representing differential loads placed on the first and the second PUs.

Type: Application

Filed: October 28, 2010

Publication date: May 3, 2012

Applicant: Advanced Micro Devices, Inc.

Inventors: Emil Talpes, Ganesh Venkataramanan, Sean Lie
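
One simple choice of differential-load metric is a signed counter that rises when work goes to the first unit and falls when it goes to the second. The sketch below uses that counter to place flexible instructions on the less-loaded unit; the counter, the `"A"`/`"B"`/`"AB"` labels, and the tie-break toward A are assumptions, since the abstract does not define the metric.

```python
def assign(instructions, threshold=0):
    """Assign each instruction to processing unit "A" or "B".

    instructions: iterable of "A" (A only), "B" (B only) or "AB" (either unit).
    The metric is a running difference: +1 for work sent to A, -1 for work
    sent to B. Flexible instructions go to the unit with less accumulated load.
    """
    metric = 0
    out = []
    for cap in instructions:
        if cap not in ("A", "B", "AB"):
            raise ValueError(f"unknown capability {cap!r}")
        if cap == "AB":
            unit = "B" if metric > threshold else "A"
        else:
            unit = cap                     # fixed-unit instructions have no choice
        metric += 1 if unit == "A" else -1
        out.append(unit)
    return out
```

For the stream `["A", "A", "AB", "AB", "B", "AB"]` the two A-only instructions push the metric to 2, so the next two flexible instructions go to B; after the B-only instruction the metric is -1 and the last flexible instruction goes to A, giving `["A", "A", "B", "B", "B", "A"]`.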

Processing pipeline having stage-specific thread selection and method thereof

Patent number: 8086825

Abstract: One or more processor cores of a multiple-core processing device each can utilize a processing pipeline having a plurality of execution units (e.g., integer execution units or floating point units) that together share a pre-execution front-end having instruction fetch, decode and dispatch resources. Further, one or more of the processor cores each can implement dispatch resources configured to dispatch multiple instructions in parallel to multiple corresponding execution units via separate dispatch buses. The dispatch resources further can opportunistically decode and dispatch instruction operations from multiple threads in parallel so as to increase the dispatch bandwidth. Moreover, some or all of the stages of the processing pipelines of one or more of the processor cores can be configured to implement independent thread selection for the corresponding stage.

Type: Grant

Filed: December 31, 2007

Date of Patent: December 27, 2011

Assignee: Advanced Micro Devices, Inc.

Inventors: Gene Shen, Sean Lie, Marius Evers

Immediate and displacement extraction and decode mechanism

Patent number: 7908463

Abstract: An extraction and decode mechanism for acquiring and processing instructions and the corresponding constant(s) embedded within the instructions. The extraction and decode mechanism may be included within a processing unit, and may comprise an instruction decode unit and at least one constant steer network. During operation, the instruction decode unit may obtain and decode instructions which are to be executed by the processing unit. For each instruction, the instruction decode unit may also determine the location of one or more constants embedded within the instruction. The constant steer network may receive the location information from the instruction decode unit. While the instruction decode unit decodes the instruction, the constant steer network may obtain the constant(s) embedded within the instruction based on the location information and store the constant(s). The constant(s) embedded within the instruction may be immediate or displacement (imm/disp) constant(s).

Type: Grant

Filed: June 26, 2007

Date of Patent: March 15, 2011

Assignee: GLOBALFOUNDRIES Inc.

Inventor: Sean Lie

Mechanism for using performance counters to identify reasons and delay times for instructions that are stalled during retirement

Patent number: 7895421

Abstract: A system and method of accounting for lost clock cycles in a microprocessor. A method includes detecting a first reason which prevents exit of an entry from an instruction retirement queue, and incrementing a first count corresponding to the first reason, wherein the first count is incremented while the first reason prevents exit of the entry from the queue. A first point in time is determined when said first reason no longer prevents exit of the entry from the queue. A second reason which prevents exit of the entry from the queue is detected, wherein the second reason came into existence prior to said first point in time. A second count corresponding to the second reason is incremented, wherein incrementing the second count begins at the first point in time.

Type: Grant

Filed: July 12, 2007

Date of Patent: February 22, 2011

Assignee: GLOBALFOUNDRIES Inc.

Inventors: Nhon Quach, Sean Lie
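
The accounting rule in the abstract charges each stalled cycle to exactly one reason: the first reason keeps being counted while it persists, and only when it clears does a second, already-present reason start being counted. The sketch below simulates that rule over a list of per-cycle reason sets; the reason names and the tie-break by name for reasons that appear in the same cycle are assumptions.

```python
from collections import Counter


def account_stalls(cycles):
    """Attribute each stalled cycle to exactly one reason.

    cycles: list of sets; each set holds the reasons preventing the oldest
    retirement-queue entry from exiting in that cycle (empty set = no stall).
    The reason being charged keeps its counter running while it persists;
    when it clears, charging moves to the longest-standing remaining reason
    starting from that cycle.
    """
    counts = Counter()
    current = None
    since = {}                                 # reason -> cycle it became active
    for t, reasons in enumerate(cycles):
        for r in reasons:
            since.setdefault(r, t)
        for r in list(since):
            if r not in reasons:
                del since[r]                   # reason no longer blocks the entry
        if current not in reasons:
            current = min(since, key=lambda r: (since[r], r)) if since else None
        if current is not None:
            counts[current] += 1
    return counts
```

In the test trace, a data-cache miss and a TLB miss overlap for two cycles. Counting every active reason every cycle would charge the TLB miss for 4 cycles. This rule charges it only for the 2 cycles after the cache miss resolves, so the counters sum to the 6 cycles actually lost.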

Method and apparatus for length decoding variable length instructions

Patent number: 7818542

Abstract: A mechanism for superscalar decode of variable length instructions. The decode mechanism may be included within a processing unit, and may comprise a length decode unit. The length decode unit may obtain a plurality of instruction bytes. The instruction bytes may be associated with a plurality of variable length instructions, which are to be executed by the processing unit. The length decode unit may perform a length decode operation for each of the plurality of instruction bytes. For each instruction byte, the length decode unit may estimate the instruction length of a current variable length instruction associated with a current instruction byte. Furthermore, during the length decode operation, for each instruction byte, the length decode unit may estimate the start of a next variable length instruction based on the estimated instruction length of the current variable length instruction, and store a first pointer to the estimated start of the next variable length instruction.

Type: Grant

Filed: July 10, 2007

Date of Patent: October 19, 2010

Assignee: GLOBALFOUNDRIES Inc.

Inventors: Gene W. Shen, Sean Lie

Method and apparatus for length decoding and identifying boundaries of variable length instructions

Patent number: 7818543

Abstract: A mechanism for superscalar decode of variable length instructions. A length decode unit may obtain a plurality of instruction bytes based on a scan window of a predetermined size. The instruction bytes may be associated with a plurality of variable length instructions, which are scheduled to be executed by a processing unit. The length decode unit may, for each instruction byte, estimate the start of a next variable length instruction following a current variable length instruction, and store a first pointer. A pre-pick unit may, for each instruction byte, use the first pointer to estimate the start of a subsequent variable length instruction following the next variable length instruction within the scan window, and store a second pointer. A pick unit may use a start pointer and related first and second pointers to determine the actual start of the variable length instructions within the scan window, and generate instruction pointers.

Type: Grant

Filed: July 10, 2007

Date of Patent: October 19, 2010

Assignee: GlobalFoundries Inc.

Inventors: Gene W. Shen, Sean Lie
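
The two length-decoding abstracts describe a speculative scheme. Every byte in the scan window is decoded as if an instruction started there, giving a first pointer to the next start. Chaining first pointers then gives a second pointer two instructions ahead. A pick step starts from a known boundary and follows the pointers, consuming up to two instructions per step. The sketch below uses a hypothetical toy encoding in which the low two bits of the first byte give the length minus one; real x86 length decoding (prefixes, ModRM, SIB, immediates) is far more involved.

```python
def toy_length(byte: int) -> int:
    # Hypothetical encoding: low two bits of the first byte = length - 1 (1..4 bytes).
    return (byte & 0x3) + 1


def find_starts(window: bytes, start: int) -> list:
    """Return the instruction start offsets in window, beginning at start."""
    n = len(window)
    # Length decode: assume an instruction starts at every byte and record a
    # first pointer to where the next instruction would then begin.
    first = [i + toy_length(window[i]) for i in range(n)]
    # Pre-pick: chain the pointers to reach the instruction after the next one.
    second = [first[p] if p < n else p for p in first]
    # Pick: walk from the known start, covering up to two instructions per step.
    starts = []
    pos = start
    while pos < n:
        starts.append(pos)
        nxt = first[pos]
        if nxt < n:
            starts.append(nxt)
        pos = second[pos]
    return starts
```

In the window `01 00 03 00 00 00 02 00 00`, the instruction at offset 0 is 2 bytes long, the one at offset 2 is 4 bytes, and the one at offset 6 is 3 bytes. Starting from offset 0, `find_starts` therefore returns `[0, 2, 6]`. Because every byte is decoded up front, the serial part of the work is reduced to pointer chasing.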

Processing pipeline having parallel dispatch and method thereof

Patent number: 7793080

Abstract: One or more processor cores of a multiple-core processing device each can utilize a processing pipeline having a plurality of execution units (e.g., integer execution units or floating point units) that together share a pre-execution front-end having instruction fetch, decode and dispatch resources. Further, one or more of the processor cores each can implement dispatch resources configured to dispatch multiple instructions in parallel to multiple corresponding execution units via separate dispatch buses. The dispatch resources further can opportunistically decode and dispatch instruction operations from multiple threads in parallel so as to increase the dispatch bandwidth. Moreover, some or all of the stages of the processing pipelines of one or more of the processor cores can be configured to implement independent thread selection for the corresponding stage.

Type: Grant

Filed: December 31, 2007

Date of Patent: September 7, 2010

Inventors: Gene Shen, Sean Lie

Multiple-core processor with hierarchical microcode store

Patent number: 7743232

Abstract: A multiple-core processor having a hierarchical microcode store. A processor may include multiple processor cores, each configured to independently execute instructions defined according to a programmer-visible instruction set architecture (ISA). Each core may include a respective local microcode unit configured to store microcode entries. The processor may also include a remote microcode unit accessible by each of the processor cores. Any given one of the processor cores may be configured to generate a given microcode entrypoint corresponding to a particular microcode entry including one or more operations to be executed by the given processor core, and to determine whether the particular microcode entry is stored within the respective local microcode unit of the given core. In response to determining that the particular microcode entry is not stored within the respective local microcode unit, the given core may convey a request for the particular microcode entry to the remote microcode unit.

Type: Grant

Filed: July 18, 2007

Date of Patent: June 22, 2010

Assignee: Advanced Micro Devices, Inc.

Inventors: Gene W. Shen, Bruce R. Holloway, Sean Lie, Michael G. Butler
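
In the hierarchy this abstract describes, each core first looks for a microcode entry in its own local unit and sends a request to the shared remote unit only when the entry is absent. The sketch below models that lookup with two hypothetical classes; the entrypoint values and micro-op names are made up, and the abstract does not say whether fetched entries are retained locally, so none are.

```python
class RemoteMicrocodeUnit:
    """Shared microcode store accessible by every core."""

    def __init__(self, entries: dict):
        self.entries = entries          # entrypoint -> list of micro-operations
        self.requests = 0

    def fetch(self, entrypoint: int):
        self.requests += 1
        return self.entries[entrypoint]


class Core:
    """A core with its own local microcode unit."""

    def __init__(self, local_entries: dict, remote: RemoteMicrocodeUnit):
        self.local = local_entries      # per-core local microcode store
        self.remote = remote

    def microcode(self, entrypoint: int):
        if entrypoint in self.local:    # hit in the local unit
            return self.local[entrypoint]
        return self.remote.fetch(entrypoint)  # miss: request it from the remote unit
```

A plausible motivation is that frequently used sequences can live in every core's local store while rarely used ones are kept once in the shared unit, trading some latency on rare paths for less total microcode storage.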

Distributed dispatch with concurrent, out-of-order dispatch

Patent number: 7725690

Abstract: In one embodiment, a processor comprises an instruction buffer and a pick unit. The instruction buffer is coupled to receive instructions fetched from an instruction cache. The pick unit is configured to select up to N instructions from the instruction buffer for concurrent transmission to respective slots of a plurality of slots, where N is an integer greater than one. Additionally, the pick unit is configured to transmit an oldest instruction of the selected instructions to any of the plurality of slots even if a number of the selected instructions is greater than one. The pick unit is configured to concurrently transmit other ones of the selected instructions to other slots of the plurality of slots based on the slot to which the oldest instruction is transmitted. Some embodiments comprise a computer system including the processor and a communication device configured to communicate with another computer system.

Type: Grant

Filed: February 13, 2007

Date of Patent: May 25, 2010

Assignee: Advanced Micro Devices, Inc.

Inventors: Gene W. Shen, Sean Lie
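
The slot-assignment rule in this abstract can be expressed compactly: the oldest selected instruction may go to any slot, and the remaining selected instructions occupy the slots that follow it. The sketch below maps up to `n` instructions from the front of a buffer onto slots starting at `first_slot` and wrapping around; how `first_slot` is chosen (for example, continuing round-robin after the last slot used) is an assumption, as is the function itself.

```python
def dispatch(buffer: list, n: int, num_slots: int, first_slot: int) -> dict:
    """Select up to n instructions (oldest first) and map them to slots.

    The oldest selected instruction goes to first_slot, which may be any
    slot; the others follow in consecutive slots, wrapping around.
    Selected instructions are removed from the buffer.
    """
    if n > num_slots:
        raise ValueError("cannot select more instructions than slots")
    if not 0 <= first_slot < num_slots:
        raise ValueError("first_slot out of range")
    selected = buffer[:n]
    del buffer[:n]
    return {(first_slot + k) % num_slots: insn for k, insn in enumerate(selected)}
```

With a buffer `["i0", "i1", "i2", "i3"]`, four slots, and `first_slot=2`, three instructions dispatch as `{2: "i0", 3: "i1", 0: "i2"}` and `"i3"` stays buffered. Letting the oldest instruction start at any slot avoids always loading slot 0 first when fewer than four instructions are available.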