Patents by Inventor Yun Du

Yun Du has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

PROGRAMMABLE BLENDING IN A GRAPHICS PROCESSING UNIT

Publication number: 20080094410

Abstract: Techniques for implementing blending equations for various blending modes with a base set of operations are described. Each blending equation may be decomposed into a sequence of operations. In one design, a device includes a processing unit that implements a set of operations for multiple blending modes and a storage unit that stores operands and results. The processing unit receives a sequence of instructions for a sequence of operations for a blending mode selected from the plurality of blending modes and executes each instruction in the sequence to perform blending in accordance with the selected blending mode. The processing unit may include (a) an ALU that performs at least one operation in the base set, e.g., a dot product, (b) a pre-formatting unit that performs gamma correction and alpha scaling of inbound color values, and (c) a post-formatting unit that performs gamma compression and alpha scaling of outbound color values.

Type: Application

Filed: October 19, 2006

Publication date: April 24, 2008

Inventors: Guofang Jiao, Chun Yu, Lingjun Chen, Yun Du

3-D CLIPPING IN A GRAPHICS PROCESSING UNIT

Publication number: 20080094412

Abstract: A graphics processing unit (GPU) efficiently performs 3-dimensional (3-D) clipping using processing units used for other graphics functions. The GPU includes first and second hardware units and at least one buffer. The first hardware unit performs 3-D clipping of primitives using a first processing unit used for a first graphics function, e.g., an ALU used for triangle setup, depth gradient setup, etc. The first hardware unit may perform 3-D clipping by (a) computing clip codes for each vertex of each primitive, (b) determining whether to pass, discard or clip each primitive based on the clip codes for all vertices of the primitive, and (c) clipping each primitive to be clipped against clipping planes. The second hardware unit computes attribute component values for new vertices resulting from the 3-D clipping, e.g., using an ALU used for attribute gradient setup, attribute interpolation, etc. The buffer(s) store intermediate results of the 3-D clipping.

Type: Application

Filed: October 23, 2006

Publication date: April 24, 2008

Inventors: Guofang Jiao, Chun Yu, Lingjun Chen, Yun Du
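
Steps (a) and (b) of the 3-D clipping abstract above can be illustrated with a minimal sketch. It assumes homogeneous clip-space vertices and six clipping planes (-w <= x, y, z <= w); that plane layout, and the names `clip_code` and `classify`, are illustrative assumptions and are not taken from the application.

```python
# Hypothetical sketch of clip-code classification; the plane layout is assumed.
from typing import Sequence, Tuple

Vertex = Tuple[float, float, float, float]  # (x, y, z, w) in clip space


def clip_code(v: Vertex) -> int:
    """Return a 6-bit mask with one bit set per plane the vertex lies outside."""
    x, y, z, w = v
    code = 0
    if x < -w:
        code |= 1 << 0
    if x > w:
        code |= 1 << 1
    if y < -w:
        code |= 1 << 2
    if y > w:
        code |= 1 << 3
    if z < -w:
        code |= 1 << 4
    if z > w:
        code |= 1 << 5
    return code


def classify(primitive: Sequence[Vertex]) -> str:
    """Decide whether to pass, discard, or clip a primitive from its clip codes."""
    or_all = 0
    and_all = 0b111111
    for v in primitive:
        code = clip_code(v)
        or_all |= code
        and_all &= code
    if or_all == 0:
        return "pass"      # every vertex is inside every plane
    if and_all != 0:
        return "discard"   # all vertices are outside the same plane
    return "clip"          # the primitive straddles at least one plane
```

Only primitives classified as "clip" would go on to step (c), the actual clipping against the planes, which this sketch does not attempt.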

GRAPHICS PROCESSING UNIT WITH UNIFIED VERTEX CACHE AND SHADER REGISTER FILE

Publication number: 20080074430

Abstract: Techniques are described for processing computerized images with a graphics processing unit (GPU) using a unified vertex cache and shader register file. The techniques include creating a shared shader coupled to the GPU pipeline and a unified vertex cache and shader register file coupled to the shared shader to substantially eliminate data movement within the GPU pipeline. The GPU pipeline sends image geometry information based on an image geometry for an image to the shared shader. The shared shader performs vertex shading to generate vertex coordinates and attributes of vertices in the image. The shared shader then stores the vertex attributes in the unified vertex cache and shader register file, and sends only the vertex coordinates of the vertices back to the GPU pipeline. The GPU pipeline processes the image based on the vertex coordinates, and the shared shader processes the image based on the vertex attributes.

Type: Application

Filed: September 27, 2006

Publication date: March 27, 2008

Inventors: Guofang Jiao, Chun Yu, Yun Du

Graphics Processors With Parallel Scheduling and Execution of Threads

Publication number: 20080074433

Abstract: A graphics processor capable of parallel scheduling and execution of multiple threads, and techniques for achieving parallel scheduling and execution, are described. The graphics processor may include multiple hardware units and a scheduler. The hardware units are operable in parallel, with each hardware unit supporting a respective set of operations. The hardware units may include an ALU core, an elementary function core, a logic core, a texture sampler, a load control unit, some other hardware unit, or a combination thereof. The scheduler dispatches instructions for multiple threads to the hardware units concurrently. The graphics processor may further include an instruction cache to store instructions for threads and register banks to store data. The instruction cache and register banks may be shared by the hardware units.

Type: Application

Filed: September 21, 2006

Publication date: March 27, 2008

Inventors: Guofang Jiao, Yun Du, Chun Yu

DEPENDENT INSTRUCTION THREAD SCHEDULING

Publication number: 20080059966

Abstract: A thread scheduler includes context units for managing the execution of threads where each context unit includes a load reference counter for maintaining a counter value indicative of a difference between a number of data requests and a number of data returns associated with the particular context unit. A context controller of the thread context unit is configured to refrain from forwarding an instruction of a thread when the counter value is nonzero and the instruction includes a data dependency indicator indicating the instruction requires data returned by a previous instruction.

Type: Application

Filed: August 29, 2006

Publication date: March 6, 2008

Inventors: Yun Du, Guofang Jiao, Chun Yu
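
The forwarding rule in the abstract above can be sketched in a few lines. The class and field names (`ContextUnit`, `issues_load`, `depends_on_load`) are hypothetical; the sketch only models the counter and the hold-back decision, not a real scheduler.

```python
# Hypothetical model of a context unit with a load reference counter.
from dataclasses import dataclass


@dataclass
class Instruction:
    name: str
    issues_load: bool = False      # the instruction sends a data request
    depends_on_load: bool = False  # data dependency indicator is set


class ContextUnit:
    def __init__(self) -> None:
        # Number of data requests minus number of data returns.
        self.load_ref_count = 0

    def can_forward(self, instr: Instruction) -> bool:
        """Hold back only dependent instructions while loads are outstanding."""
        return not (instr.depends_on_load and self.load_ref_count != 0)

    def forward(self, instr: Instruction) -> bool:
        if not self.can_forward(instr):
            return False
        if instr.issues_load:
            self.load_ref_count += 1
        return True

    def data_returned(self) -> None:
        if self.load_ref_count == 0:
            raise RuntimeError("data return without an outstanding request")
        self.load_ref_count -= 1
```

In this model an independent instruction is still forwarded while a load is in flight; only an instruction flagged as dependent waits until the counter returns to zero.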

RELATIVE ADDRESS GENERATION

Publication number: 20080059756

Abstract: Techniques to efficiently handle relative addressing are described. In one design, a processor includes an address generator and a storage unit. The address generator receives a relative address comprised of a base address and an offset, obtains a base value for the base address, sums the base value with the offset, and provides an absolute address corresponding to the relative address. The storage unit receives the base address and provides the base value to the address generator. The storage unit also receives the absolute address and provides data at this address. The address generator may derive the absolute address in a first clock cycle of a memory access. The storage unit may provide the data in a second clock cycle of the memory access. The storage unit may have multiple (e.g., two) read ports to support concurrent address generation and data retrieval.

Type: Application

Filed: August 31, 2006

Publication date: March 6, 2008

Inventors: Yun Du, Chun Yu, Guofang Jiao
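
As a rough illustration of the two-step access described in the abstract above, the sketch below resolves a relative address (base address plus offset) into an absolute address and then reads the data. `StorageUnit`, `resolve`, and `load_relative` are hypothetical names, and the clock cycles are represented only as comments.

```python
# Hypothetical sketch of relative address generation; not a hardware model.
from typing import List


class StorageUnit:
    def __init__(self, size: int) -> None:
        self.cells: List[int] = [0] * size

    def read(self, address: int) -> int:
        return self.cells[address]

    def write(self, address: int, value: int) -> None:
        self.cells[address] = value


def resolve(storage: StorageUnit, base_address: int, offset: int) -> int:
    """First cycle: fetch the base value and add the offset."""
    base_value = storage.read(base_address)
    return base_value + offset


def load_relative(storage: StorageUnit, base_address: int, offset: int) -> int:
    """Second cycle: read the data at the absolute address."""
    absolute = resolve(storage, base_address, offset)
    return storage.read(absolute)
```

Because both steps read from the same storage unit, two read ports would let the base-value read for one access overlap with the data read for another, which is the concurrency the abstract mentions.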

Processing of Command Sub-Lists by Multiple Graphics Processing Units

Publication number: 20080055326

Abstract: Techniques to allow multiple graphics processing units to operate in parallel, even with limited storage space, are described. An apparatus includes first and second processing units and a memory. The first processing unit performs pre-processing on a batch of graphics application data for an image (e.g., for vertices in the image) and generates command sub-lists for the batch. The second processing unit performs post-processing on the command sub-lists (e.g., for pixels of the image) and generates output data for the image. The first and second processing units may operate in parallel on different command sub-lists. The memory stores the command sub-lists and may also store a header for each command sub-list, a look-up table of memory addresses for the command sub-lists, a write counter indicating the most recently generated command sub-list, and a read counter indicating the most recently post-processed command sub-list.

Type: Application

Filed: September 5, 2006

Publication date: March 6, 2008

Inventors: Yun Du, Chun Yu, Guofang Jiao, Lingjun Chen
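
The write and read counters in the abstract above suggest a producer/consumer arrangement over limited storage. The sketch below is one way to picture it, assuming a fixed number of slots reused in ring order; the ring layout and the `SubListQueue` name are assumptions, and headers and the address look-up table are omitted.

```python
# Hypothetical sketch: command sub-lists shared through a fixed set of slots.
from typing import List, Optional


class SubListQueue:
    def __init__(self, slots: int) -> None:
        self.slots: List[Optional[list]] = [None] * slots
        self.write_count = 0  # sub-lists generated so far (pre-processing side)
        self.read_count = 0   # sub-lists consumed so far (post-processing side)

    def can_write(self) -> bool:
        return self.write_count - self.read_count < len(self.slots)

    def can_read(self) -> bool:
        return self.read_count < self.write_count

    def write(self, sub_list: list) -> bool:
        """Store a new sub-list, or return False if every slot is still in use."""
        if not self.can_write():
            return False
        self.slots[self.write_count % len(self.slots)] = sub_list
        self.write_count += 1
        return True

    def read(self) -> Optional[list]:
        """Return the oldest unread sub-list, or None if nothing is pending."""
        if not self.can_read():
            return None
        sub_list = self.slots[self.read_count % len(self.slots)]
        self.read_count += 1
        return sub_list
```

Comparing the two counters is enough for each side to know whether it may proceed, so the two processing units can work on different sub-lists at the same time.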

Multi-stage floating-point accumulator

Publication number: 20080046495

Abstract: A multi-stage floating-point accumulator includes at least two stages and is capable of operating at higher speed. In one design, the floating-point accumulator includes first and second stages. The first stage includes three operand alignment units, two multiplexers, and three latches. The three operand alignment units operate on a current floating-point value, a prior floating-point value, and a prior accumulated value. A first multiplexer provides zero or the prior floating-point value to the second operand alignment unit. A second multiplexer provides zero or the prior accumulated value to the third operand alignment unit. The three latches couple to the three operand alignment units. The second stage includes a 3-operand adder to sum the operands generated by the three operand alignment units, a latch, and a post alignment unit.

Type: Application

Filed: August 18, 2006

Publication date: February 21, 2008

Inventors: Yun Du, Chun Yu, Guofang Jiao

Graphics processing unit with extended vertex cache

Publication number: 20080030513

Abstract: Techniques are described for processing computerized images with a graphics processing unit (GPU) using an extended vertex cache. The techniques include creating an extended vertex cache coupled to a GPU pipeline to reduce an amount of data passing through the GPU pipeline. The GPU pipeline receives an image geometry for an image, and stores attributes for vertices within the image geometry in the extended vertex cache. The GPU pipeline only passes vertex coordinates that identify the vertices and vertex cache index values that indicate storage locations of the attributes for each of the vertices in the extended vertex cache to other processing stages along the GPU pipeline. The techniques described herein defer the setup of attribute gradients to just before attribute interpolation in the GPU pipeline. The vertex attributes may be retrieved from the extended vertex cache for attribute gradient setup just before attribute interpolation in the GPU pipeline.

Type: Application

Filed: August 3, 2006

Publication date: February 7, 2008

Inventors: Guofang Jiao, Brian Evan Ruttenberg, Chun Yu, Yun Du

GRAPHICS PROCESSING UNIT WITH SHARED ARITHMETIC LOGIC UNIT

Publication number: 20080030512

Abstract: This disclosure describes a graphics processing unit (GPU) pipeline that uses one or more shared arithmetic logic units (ALUs). In order to facilitate such sharing of ALUs, the stages of the disclosed GPU pipeline may be rearranged relative to conventional GPU pipelines. In addition, by rearranging the stages of the GPU pipeline, efficiencies may be achieved in the image processing. Unlike conventional GPU pipelines, for example, an attribute gradient setup stage can be located much later in the pipeline, and the attribute interpolator stage may immediately follow the attribute gradient setup stage. This allows sharing of an ALU by the attribute gradient setup and attribute interpolator stages. Several other techniques and features for the GPU pipeline are also described, which may improve performance and possibly achieve additional processing efficiencies.

Type: Application

Filed: October 17, 2006

Publication date: February 7, 2008

Inventors: Guofang Jiao, Brian Ruttenberg, Chun Yu, Yun Du

Tiled cache for multiple software programs

Publication number: 20080028152

Abstract: Caching techniques for storing instructions, constant values, and other types of data for multiple software programs are described. A cache provides storage for multiple programs and is partitioned into multiple tiles. Each tile is assignable to one program. Each program may be assigned any number of tiles based on the program's cache usage, the available tiles, and/or other factors. A cache controller identifies the tiles assigned to the programs and generates cache addresses for accessing the cache. The cache may be partitioned into physical tiles. The cache controller may assign logical tiles to the programs and may map the logical tiles to the physical tiles within the cache. The use of logical and physical tiles may simplify assignment and management of the tiles.

Type: Application

Filed: July 25, 2006

Publication date: January 31, 2008

Inventors: Yun Du, Guofang Jiao, Chun Yu, De Dzwo Hsu
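
The logical-to-physical tile mapping in the abstract above can be pictured with a small sketch. Everything here (`TiledCacheController`, first-free tile assignment, offset-based addressing) is an illustrative assumption rather than the design claimed in the application.

```python
# Hypothetical sketch of a cache controller that maps logical to physical tiles.
from typing import Dict, List


class TiledCacheController:
    def __init__(self, num_tiles: int, tile_size: int) -> None:
        self.tile_size = tile_size
        self.free_tiles: List[int] = list(range(num_tiles))
        # program id -> physical tile numbers, indexed by logical tile number
        self.tile_map: Dict[str, List[int]] = {}

    def assign(self, program: str, count: int) -> List[int]:
        """Give a program `count` more tiles, taken from the free list."""
        if count > len(self.free_tiles):
            raise MemoryError("not enough free tiles")
        tiles = [self.free_tiles.pop(0) for _ in range(count)]
        self.tile_map.setdefault(program, []).extend(tiles)
        return tiles

    def cache_address(self, program: str, program_offset: int) -> int:
        """Translate a program-relative offset into a cache address."""
        if program_offset < 0:
            raise IndexError("offset must be non-negative")
        logical_tile, within = divmod(program_offset, self.tile_size)
        tiles = self.tile_map.get(program, [])
        if logical_tile >= len(tiles):
            raise IndexError("offset is outside the program's assigned tiles")
        return tiles[logical_tile] * self.tile_size + within
```

Each program sees its logical tiles as one contiguous range even when the physical tiles are scattered, which is one way the logical/physical split can simplify assignment.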

Unified virtual addressed register file

Publication number: 20070296729

Abstract: A multi-threaded processor is provided, such as a shader processor, having an internal unified memory space that is shared by a plurality of threads and is dynamically assigned to threads as needed. A mapping table maps virtual registers to available internal addresses in the unified memory space so that thread registers can be stored in contiguous or non-contiguous memory addresses. Dynamic sizing of the virtual registers allows flexible allocation of the unified memory space depending on the type and size of data in a thread register. Yet another feature provides an efficient method for storing graphics data in the unified memory space to improve fetch and store operations from the memory space. In particular, pixel data for four pixels in a thread are stored across four memory devices having independent input/output ports that permit the four pixels to be read in a single clock cycle for processing.

Type: Application

Filed: June 21, 2006

Publication date: December 27, 2007

Inventors: Yun Du, Guofang Jiao, Chun Yu, De Dzwo Hsu

Convolution filtering in a graphics processor

Publication number: 20070292047

Abstract: Techniques for performing convolution filtering using hardware normally available in a graphics processor are described. Convolution filtering of an arbitrary H×W grid of pixels is achieved by partitioning the grid into smaller sections, performing computation for each section, and combining the intermediate results for all sections to obtain a final result. In one design, a command to perform convolution filtering on a grid of pixels with a kernel of coefficients is received, e.g., from a graphics application. The grid is partitioned into multiple sections, where each section may be 2×2 or smaller. Multiple instructions are generated for the multiple sections, with each instruction performing convolution computation on at least one pixel in one section. Each instruction may include pixel position information and applicable kernel coefficients. Instructions to combine the intermediate results from the multiple instructions are also generated.

Type: Application

Filed: June 14, 2006

Publication date: December 20, 2007

Inventors: Guofang Jiao, Yun Du, Chun Yu, Lingjun Chen
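
The partition-and-combine idea in the abstract above can be sketched as follows. The sketch computes a single filtered value for an H×W grid by summing per-section partial results, with sections of at most 2×2; the function names are hypothetical, and a plain Python sum stands in for the generated combine instructions.

```python
# Hypothetical sketch: one convolution result built from 2x2-or-smaller sections.
from typing import Iterator, List, Tuple

Grid = List[List[float]]


def partition(height: int, width: int, section: int = 2) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (row, col, rows, cols) for each section of at most section x section."""
    for r in range(0, height, section):
        for c in range(0, width, section):
            yield r, c, min(section, height - r), min(section, width - c)


def section_sum(pixels: Grid, kernel: Grid, r: int, c: int, rows: int, cols: int) -> float:
    """Intermediate result for one section: pixel times coefficient, summed."""
    return sum(
        pixels[i][j] * kernel[i][j]
        for i in range(r, r + rows)
        for j in range(c, c + cols)
    )


def convolve_grid(pixels: Grid, kernel: Grid) -> float:
    """Combine the intermediate results of all sections into the final value."""
    height, width = len(kernel), len(kernel[0])
    partials = [section_sum(pixels, kernel, *s) for s in partition(height, width)]
    return sum(partials)
```

For a 3×3 grid this produces four sections (2×2, 2×1, 1×2, and 1×1), and the combined result equals the direct sum over all nine pixel-coefficient products.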

Processor core stack extension

Publication number: 20070282928

Abstract: In general, the disclosure is directed to techniques for controlling stack overflow. The techniques described herein utilize a portion of a common cache or memory located outside of the processor core as a stack extension. A processor core monitors a stack within the processor core and transfers the content of the stack to the stack extension outside of the processor core when the processor core stack exceeds a maximum number of entries. When the processor core determines the stack within the processor core falls below a minimum number of entries, the processor core transfers at least a portion of the content maintained in the stack extension into the stack within the processor core. The techniques prevent malfunction and crash of threads executing within the processor core by utilizing stack extensions outside of the processor core.

Type: Application

Filed: June 6, 2006

Publication date: December 6, 2007

Inventors: Guofang Jiao, Yun Du, Chun Yu
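
The spill and refill behavior in the abstract above can be modeled with two lists. This is a minimal sketch under assumed policies: the whole in-core stack is spilled once it exceeds the maximum, and a refill brings back as many entries as fit. The `CoreStack` name and both thresholds are hypothetical.

```python
# Hypothetical sketch of a core stack backed by an external stack extension.
from typing import List


class CoreStack:
    def __init__(self, max_entries: int, min_entries: int) -> None:
        if not 0 < min_entries <= max_entries:
            raise ValueError("need 0 < min_entries <= max_entries")
        self.max_entries = max_entries
        self.min_entries = min_entries
        self.core: List[int] = []       # stack inside the core, top at the end
        self.extension: List[int] = []  # older entries held outside the core

    def push(self, value: int) -> None:
        self.core.append(value)
        if len(self.core) > self.max_entries:
            # Spill: transfer the core stack content to the extension.
            self.extension.extend(self.core)
            self.core.clear()

    def pop(self) -> int:
        if len(self.core) < self.min_entries and self.extension:
            # Refill: bring the newest spilled entries back under the core entries.
            k = min(len(self.extension), self.max_entries - len(self.core))
            self.core[:0] = self.extension[-k:]
            del self.extension[-k:]
        if not self.core:
            raise IndexError("pop from empty stack")
        return self.core.pop()
```

From the caller's point of view the stack still behaves as last-in, first-out regardless of how many entries currently sit in the extension.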

Multi-threaded processor with deferred thread output control

Publication number: 20070283356

Abstract: A multi-threaded processor is provided that internally reorders output threads thereby avoiding the need for an external output reorder buffer. The multi-threaded processor writes its thread results back to an internal memory buffer to guarantee that thread results are outputted in the same order in which the threads are received. A thread scheduler within the multi-threaded processor manages thread ordering control to avoid the need for an external reorder buffer. A compiler for the multi-threaded processor converts instructions that would normally send processed results directly to an external reorder buffer so that the processed thread results are instead sent to the internal memory buffer of the multi-threaded processor.

Type: Application

Filed: May 31, 2006

Publication date: December 6, 2007

Inventors: Yun Du, Guofang Jiao, Chun Yu

Graphics processor with arithmetic and elementary function units

Publication number: 20070273698

Abstract: A graphics processor capable of efficiently performing arithmetic operations and computing elementary functions is described. The graphics processor has at least one arithmetic logic unit (ALU) that can perform arithmetic operations and at least one elementary function unit that can compute elementary functions. The ALU(s) and elementary function unit(s) may be arranged such that they can operate in parallel to improve throughput. The graphics processor may also include fewer elementary function units than ALUs, e.g., four ALUs and a single elementary function unit. The four ALUs may perform an arithmetic operation on (1) four components of an attribute for one pixel or (2) one component of an attribute for four pixels. The single elementary function unit may operate on one component of one pixel at a time. The use of a single elementary function unit may reduce cost while still providing good performance.

Type: Application

Filed: May 25, 2006

Publication date: November 29, 2007

Inventors: Yun Du, Guofang Jiao, Chun Yu, Alexei V. Bourd

Graphics system with dynamic reposition of depth engine

Publication number: 20070268289

Abstract: A graphics system includes a graphics processor comprising a plurality of units configured to process a graphics image and a depth engine configured to receive and process data selected from one of two units based on a selection value.

Type: Application

Filed: May 16, 2006

Publication date: November 22, 2007

Inventors: Chun Yu, Brian Ruttenberg, Guofang Jiao, Yun Du

Graphics system with configurable caches

Publication number: 20070252843

Abstract: A graphics system includes a graphics processor and a cache memory system. The graphics processor includes processing units that perform various graphics operations to render graphics images. The cache memory system may include fully configurable caches, partially configurable caches, or a combination of configurable and dedicated caches. The cache memory system may further include a control unit, a crossbar, and an arbiter. The control unit may determine memory utilization by the processing units and assign the configurable caches to the processing units based on memory utilization. The configurable caches may be assigned to achieve good utilization of these caches and to avoid memory access bottleneck. The crossbar couples the processing units to their assigned caches. The arbiter facilitates data exchanges between the caches and a main memory.

Type: Application

Filed: April 26, 2006

Publication date: November 1, 2007

Inventors: Chun Yu, Guofang Jiao, Yun Du
