Patents by Inventor Yun Du

Yun Du has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20080094412
    Abstract: A graphics processing unit (GPU) efficiently performs 3-dimensional (3-D) clipping using processing units used for other graphics functions. The GPU includes first and second hardware units and at least one buffer. The first hardware unit performs 3-D clipping of primitives using a first processing unit used for a first graphics function, e.g., an ALU used for triangle setup, depth gradient setup, etc. The first hardware unit may perform 3-D clipping by (a) computing clip codes for each vertex of each primitive, (b) determining whether to pass, discard or clip each primitive based on the clip codes for all vertices of the primitive, and (c) clipping each primitive to be clipped against clipping planes. The second hardware unit computes attribute component values for new vertices resulting from the 3-D clipping, e.g., using an ALU used for attribute gradient setup, attribute interpolation, etc. The buffer(s) store intermediate results of the 3-D clipping.
    Type: Application
    Filed: October 23, 2006
    Publication date: April 24, 2008
    Inventors: Guofang Jiao, Chun Yu, Lingjun Chen, Yun Du
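
Steps (a) and (b) of the clipping abstract above (publication 20080094412) can be illustrated with a small software sketch. This is not the patented hardware design. The clip-space convention (-w <= x, y, z <= w), the bit layout, and the names `clip_code` and `classify` are assumptions made for illustration. Step (c), the actual clipping against the planes, is not shown.

```python
# One bit per clipping plane. Assumed convention: a vertex (x, y, z, w)
# is inside the view volume when -w <= x, y, z <= w.
LEFT, RIGHT, BOTTOM, TOP, NEAR, FAR = (1 << i for i in range(6))


def clip_code(vertex):
    """Return a bitmask with one bit set for each plane the vertex is outside of."""
    x, y, z, w = vertex
    code = 0
    if x < -w:
        code |= LEFT
    if x > w:
        code |= RIGHT
    if y < -w:
        code |= BOTTOM
    if y > w:
        code |= TOP
    if z < -w:
        code |= NEAR
    if z > w:
        code |= FAR
    return code


def classify(primitive):
    """Decide whether to pass, discard, or clip a primitive from its vertices' clip codes."""
    or_all = 0
    and_all = ~0  # all bits set, so the first AND yields the first code
    for vertex in primitive:
        code = clip_code(vertex)
        or_all |= code
        and_all &= code
    if or_all == 0:
        return "pass"      # every vertex is inside every plane
    if and_all != 0:
        return "discard"   # every vertex is outside the same plane
    return "clip"          # the primitive may straddle a plane


inside = [(0, 0, 0, 1), (0.5, 0, 0, 1), (0, 0.5, 0, 1)]
outside = [(2, 0, 0, 1), (3, 0, 0, 1), (2, 1, 0, 1)]
straddling = [(0, 0, 0, 1), (2, 0, 0, 1), (0, 0.5, 0, 1)]
print(classify(inside), classify(outside), classify(straddling))
```
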
  • Publication number: 20080074430
    Abstract: Techniques are described for processing computerized images with a graphics processing unit (GPU) using a unified vertex cache and shader register file. The techniques include creating a shared shader coupled to the GPU pipeline and a unified vertex cache and shader register file coupled to the shared shader to substantially eliminate data movement within the GPU pipeline. The GPU pipeline sends image geometry information based on an image geometry for an image to the shared shader. The shared shader performs vertex shading to generate vertex coordinates and attributes of vertices in the image. The shared shader then stores the vertex attributes in the unified vertex cache and shader register file, and sends only the vertex coordinates of the vertices back to the GPU pipeline. The GPU pipeline processes the image based on the vertex coordinates, and the shared shader processes the image based on the vertex attributes.
    Type: Application
    Filed: September 27, 2006
    Publication date: March 27, 2008
    Inventors: Guofang Jiao, Chun Yu, Yun Du
  • Publication number: 20080074433
    Abstract: A graphics processor capable of parallel scheduling and execution of multiple threads, and techniques for achieving parallel scheduling and execution, are described. The graphics processor may include multiple hardware units and a scheduler. The hardware units are operable in parallel, with each hardware unit supporting a respective set of operations. The hardware units may include an ALU core, an elementary function core, a logic core, a texture sampler, a load control unit, some other hardware unit, or a combination thereof. The scheduler dispatches instructions for multiple threads to the hardware units concurrently. The graphics processor may further include an instruction cache to store instructions for threads and register banks to store data. The instruction cache and register banks may be shared by the hardware units.
    Type: Application
    Filed: September 21, 2006
    Publication date: March 27, 2008
    Inventors: Guofang Jiao, Yun Du, Chun Yu
  • Publication number: 20080055326
    Abstract: Techniques to allow multiple graphics processing units to operate in parallel, even with limited storage space, are described. An apparatus includes first and second processing units and a memory. The first processing unit performs pre-processing on a batch of graphics application data for an image (e.g., for vertices in the image) and generates command sub-lists for the batch. The second processing unit performs post-processing on the command sub-lists (e.g., for pixels of the image) and generates output data for the image. The first and second processing units may operate in parallel on different command sub-lists. The memory stores the command sub-lists and may also store a header for each command sub-list, a look-up table of memory addresses for the command sub-lists, a write counter indicating the most recently generated command sub-list, and a read counter indicating the most recently post-processed command sub-list.
    Type: Application
    Filed: September 5, 2006
    Publication date: March 6, 2008
    Inventors: Yun Du, Chun Yu, Guofang Jiao, Lingjun Chen
  • Publication number: 20080059756
    Abstract: Techniques to efficiently handle relative addressing are described. In one design, a processor includes an address generator and a storage unit. The address generator receives a relative address comprised of a base address and an offset, obtains a base value for the base address, sums the base value with the offset, and provides an absolute address corresponding to the relative address. The storage unit receives the base address and provides the base value to the address generator. The storage unit also receives the absolute address and provides data at this address. The address generator may derive the absolute address in a first clock cycle of a memory access. The storage unit may provide the data in a second clock cycle of the memory access. The storage unit may have multiple (e.g., two) read ports to support concurrent address generation and data retrieval.
    Type: Application
    Filed: August 31, 2006
    Publication date: March 6, 2008
    Inventors: Yun Du, Chun Yu, Guofang Jiao
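
The two-step access described in the relative addressing abstract above (publication 20080059756) can be modeled in software as follows. This is a minimal sketch, not the patented circuit: the class `StorageUnit` and the functions `generate_address` and `load_relative` are hypothetical names, and the two clock cycles are represented only as two sequential reads.

```python
class StorageUnit:
    """A flat array of cells standing in for the storage unit."""

    def __init__(self, size):
        self.cells = [0] * size

    def read(self, address):
        return self.cells[address]


def generate_address(storage, base_address, offset):
    """First step: fetch the base value and add the offset to get the absolute address."""
    base_value = storage.read(base_address)
    return base_value + offset


def load_relative(storage, base_address, offset):
    """Second step: read the data stored at the absolute address."""
    absolute_address = generate_address(storage, base_address, offset)
    return storage.read(absolute_address)


storage = StorageUnit(16)
storage.cells[2] = 8     # cell 2 holds the base value 8
storage.cells[11] = 42   # the data lives at absolute address 8 + 3 = 11
print(load_relative(storage, base_address=2, offset=3))
```

In hardware, the abstract notes that multiple read ports let the address generation for one access overlap with the data retrieval for another; this sequential sketch does not model that overlap.
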
  • Publication number: 20080059966
    Abstract: A thread scheduler includes context units for managing the execution of threads where each context unit includes a load reference counter for maintaining a counter value indicative of a difference between a number of data requests and a number of data returns associated with the particular context unit. A context controller of the thread context unit is configured to refrain from forwarding an instruction of a thread when the counter value is nonzero and the instruction includes a data dependency indicator indicating the instruction requires data returned by a previous instruction.
    Type: Application
    Filed: August 29, 2006
    Publication date: March 6, 2008
    Inventors: Yun Du, Guofang Jiao, Chun Yu
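
The forwarding rule in the thread scheduler abstract above (publication 20080059966) can be sketched as a counter plus a single check. The names `Instruction`, `ContextUnit`, `needs_prior_data`, and `can_forward` are assumptions for illustration; the abstract does not name them.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    name: str
    needs_prior_data: bool = False  # stands in for the data dependency indicator


class ContextUnit:
    """Tracks outstanding data requests for one thread context."""

    def __init__(self):
        # Number of data requests minus number of data returns.
        self.load_ref_count = 0

    def on_data_request(self):
        self.load_ref_count += 1

    def on_data_return(self):
        self.load_ref_count -= 1

    def can_forward(self, instruction):
        """Hold a dependent instruction while any requested data is still outstanding."""
        return not (instruction.needs_prior_data and self.load_ref_count != 0)


ctx = ContextUnit()
ctx.on_data_request()                                # a load is issued
dependent = Instruction("add", needs_prior_data=True)
independent = Instruction("mov")
print(ctx.can_forward(dependent), ctx.can_forward(independent))
ctx.on_data_return()                                 # the load data arrives
print(ctx.can_forward(dependent))
```
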
  • Publication number: 20080046495
    Abstract: A multi-stage floating-point accumulator includes at least two stages and is capable of operating at higher speed. In one design, the floating-point accumulator includes first and second stages. The first stage includes three operand alignment units, two multiplexers, and three latches. The three operand alignment units operate on a current floating-point value, a prior floating-point value, and a prior accumulated value. A first multiplexer provides zero or the prior floating-point value to the second operand alignment unit. A second multiplexer provides zero or the prior accumulated value to the third operand alignment unit. The three latches couple to the three operand alignment units. The second stage includes a 3-operand adder to sum the operands generated by the three operand alignment units, a latch, and a post alignment unit.
    Type: Application
    Filed: August 18, 2006
    Publication date: February 21, 2008
    Inventors: Yun Du, Chun Yu, Guofang Jiao
  • Publication number: 20080030512
    Abstract: This disclosure describes a graphics processing unit (GPU) pipeline that uses one or more shared arithmetic logic units (ALUs). In order to facilitate such sharing of ALUs, the stages of the disclosed GPU pipeline may be rearranged relative to conventional GPU pipelines. In addition, by rearranging the stages of the GPU pipeline, efficiencies may be achieved in the image processing. Unlike conventional GPU pipelines, for example, an attribute gradient setup stage can be located much later in the pipeline, and the attribute interpolator stage may immediately follow the attribute gradient setup stage. This allows sharing of an ALU by the attribute gradient setup and attribute interpolator stages. Several other techniques and features for the GPU pipeline are also described, which may improve performance and possibly achieve additional processing efficiencies.
    Type: Application
    Filed: October 17, 2006
    Publication date: February 7, 2008
    Inventors: Guofang Jiao, Brian Ruttenberg, Chun Yu, Yun Du
  • Publication number: 20080030513
    Abstract: Techniques are described for processing computerized images with a graphics processing unit (GPU) using an extended vertex cache. The techniques include creating an extended vertex cache coupled to a GPU pipeline to reduce an amount of data passing through the GPU pipeline. The GPU pipeline receives an image geometry for an image, and stores attributes for vertices within the image geometry in the extended vertex cache. The GPU pipeline only passes vertex coordinates that identify the vertices and vertex cache index values that indicate storage locations of the attributes for each of the vertices in the extended vertex cache to other processing stages along the GPU pipeline. The techniques described herein defer the setup of attribute gradients to just before attribute interpolation in the GPU pipeline. The vertex attributes may be retrieved from the extended vertex cache for attribute gradient setup just before attribute interpolation in the GPU pipeline.
    Type: Application
    Filed: August 3, 2006
    Publication date: February 7, 2008
    Inventors: Guofang Jiao, Brian Evan Ruttenberg, Chun Yu, Yun Du
  • Publication number: 20080028152
    Abstract: Caching techniques for storing instructions, constant values, and other types of data for multiple software programs are described. A cache provides storage for multiple programs and is partitioned into multiple tiles. Each tile is assignable to one program. Each program may be assigned any number of tiles based on the program's cache usage, the available tiles, and/or other factors. A cache controller identifies the tiles assigned to the programs and generates cache addresses for accessing the cache. The cache may be partitioned into physical tiles. The cache controller may assign logical tiles to the programs and may map the logical tiles to the physical tiles within the cache. The use of logical and physical tiles may simplify assignment and management of the tiles.
    Type: Application
    Filed: July 25, 2006
    Publication date: January 31, 2008
    Inventors: Yun Du, Guofang Jiao, Chun Yu, De Dzwo Hsu
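
The logical-to-physical tile mapping in the caching abstract above (publication 20080028152) can be sketched as a per-program lookup table. The abstract does not say how cache addresses are computed, so the arithmetic below (tile index times tile size plus offset within the tile), the first-free allocation policy, and the names `TiledCache`, `assign`, and `cache_address` are all assumptions.

```python
class TiledCache:
    """A cache split into equal-sized physical tiles that are assigned to programs."""

    def __init__(self, num_tiles, tile_size):
        self.tile_size = tile_size
        self.free_tiles = list(range(num_tiles))
        # program -> physical tile numbers; the list index is the logical tile number
        self.tile_map = {}

    def assign(self, program, count):
        """Give `count` free physical tiles to a program as its next logical tiles."""
        if count > len(self.free_tiles):
            raise ValueError("not enough free tiles")
        tiles = [self.free_tiles.pop(0) for _ in range(count)]
        self.tile_map.setdefault(program, []).extend(tiles)

    def cache_address(self, program, program_offset):
        """Translate an offset in the program's logical space to a cache address."""
        logical_tile, within_tile = divmod(program_offset, self.tile_size)
        physical_tile = self.tile_map[program][logical_tile]
        return physical_tile * self.tile_size + within_tile


cache = TiledCache(num_tiles=4, tile_size=16)
cache.assign("A", 1)   # A gets physical tile 0
cache.assign("B", 2)   # B gets physical tiles 1 and 2
cache.assign("A", 1)   # A grows into physical tile 3, which is not contiguous with tile 0
print(cache.cache_address("A", 20), cache.cache_address("B", 5))
```

Program A sees two contiguous logical tiles even though its physical tiles are 0 and 3, which is one way the logical layer can simplify tile management.
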
  • Publication number: 20070296729
    Abstract: A multi-threaded processor is provided, such as a shader processor, having an internal unified memory space that is shared by a plurality of threads and is dynamically assigned to threads as needed. A mapping table maps virtual registers to available internal addresses in the unified memory space so that thread registers can be stored in contiguous or non-contiguous memory addresses. Dynamic sizing of the virtual registers allows flexible allocation of the unified memory space depending on the type and size of data in a thread register. Yet another feature provides an efficient method for storing graphics data in the unified memory space to improve fetch and store operations from the memory space. In particular, pixel data for four pixels in a thread are stored across four memory devices having independent input/output ports that permit the four pixels to be read in a single clock cycle for processing.
    Type: Application
    Filed: June 21, 2006
    Publication date: December 27, 2007
    Inventors: Yun Du, Guofang Jiao, Chun Yu, De Dzwo Hsu
  • Publication number: 20070292047
    Abstract: Techniques for performing convolution filtering using hardware normally available in a graphics processor are described. Convolution filtering of an arbitrary H×W grid of pixels is achieved by partitioning the grid into smaller sections, performing computation for each section, and combining the intermediate results for all sections to obtain a final result. In one design, a command to perform convolution filtering on a grid of pixels with a kernel of coefficients is received, e.g., from a graphics application. The grid is partitioned into multiple sections, where each section may be 2×2 or smaller. Multiple instructions are generated for the multiple sections, with each instruction performing convolution computation on at least one pixel in one section. Each instruction may include pixel position information and applicable kernel coefficients. Instructions to combine the intermediate results from the multiple instructions are also generated.
    Type: Application
    Filed: June 14, 2006
    Publication date: December 20, 2007
    Inventors: Guofang Jiao, Yun Du, Chun Yu, Lingjun Chen
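
The partition-and-combine idea in the convolution abstract above (publication 20070292047) can be sketched for a single output value. This is plain software, not the generated graphics-processor instructions; the function names `partition`, `section_sum`, and `convolve` are hypothetical, and the grid and kernel are assumed to have the same H×W shape.

```python
def partition(height, width, size=2):
    """Split an H x W grid into sections of at most size x size cells."""
    sections = []
    for row in range(0, height, size):
        for col in range(0, width, size):
            sections.append((row, min(row + size, height), col, min(col + size, width)))
    return sections


def section_sum(pixels, kernel, section):
    """Intermediate result for one section: sum of pixel times coefficient."""
    row_start, row_end, col_start, col_end = section
    total = 0
    for r in range(row_start, row_end):
        for c in range(col_start, col_end):
            total += pixels[r][c] * kernel[r][c]
    return total


def convolve(pixels, kernel):
    """Combine the intermediate results of all sections into the final value."""
    sections = partition(len(kernel), len(kernel[0]))
    partials = [section_sum(pixels, kernel, s) for s in sections]
    return sum(partials), partials


pixels = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
result, partials = convolve(pixels, kernel)
print(result, partials)
```

A 3×3 grid yields four sections (2×2, 2×1, 1×2, and 1×1), matching the abstract's "2×2 or smaller".
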
  • Publication number: 20070282928
    Abstract: In general, the disclosure is directed to techniques for controlling stack overflow. The techniques described herein utilize a portion of a common cache or memory located outside of the processor core as a stack extension. A processor core monitors a stack within the processor core and transfers the content of the stack to the stack extension outside of the processor core when the processor core stack exceeds a maximum number of entries. When the processor core determines that the stack within the processor core has fallen below a minimum number of entries, the processor core transfers at least a portion of the content maintained in the stack extension into the stack within the processor core. The techniques prevent malfunction and crash of threads executing within the processor core by utilizing stack extensions outside of the processor core.
    Type: Application
    Filed: June 6, 2006
    Publication date: December 6, 2007
    Inventors: Guofang Jiao, Yun Du, Chun Yu
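
The spill-and-refill behavior in the stack overflow abstract above (publication 20070282928) can be sketched with two lists. The abstract does not specify how many entries move in each transfer, so the policy here is an assumption: on overflow, all but the newest `min_entries` entries are spilled, and on underflow the core stack is refilled back up to `min_entries`. The class name `ExtendedStack` and its attributes are hypothetical.

```python
class ExtendedStack:
    """A small in-core stack backed by an extension in outside memory."""

    def __init__(self, max_entries, min_entries):
        if not 1 <= min_entries <= max_entries:
            raise ValueError("need 1 <= min_entries <= max_entries")
        self.max_entries = max_entries
        self.min_entries = min_entries
        self.core = []       # stack inside the processor core, newest entry last
        self.extension = []  # stands in for the cache or memory outside the core

    def push(self, value):
        self.core.append(value)
        if len(self.core) > self.max_entries:
            # Overflow: move the oldest entries out, keeping the newest min_entries.
            spill = len(self.core) - self.min_entries
            self.extension.extend(self.core[:spill])
            del self.core[:spill]

    def pop(self):
        value = self.core.pop()
        if len(self.core) < self.min_entries and self.extension:
            # Underflow: bring the most recently spilled entries back under the core stack.
            refill = min(len(self.extension), self.min_entries - len(self.core))
            self.core[:0] = self.extension[-refill:]
            del self.extension[-refill:]
        return value


stack = ExtendedStack(max_entries=4, min_entries=2)
for n in range(1, 8):
    stack.push(n)
print(stack.core, stack.extension)
print([stack.pop() for _ in range(7)])
```
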
  • Publication number: 20070283356
    Abstract: A multi-threaded processor is provided that internally reorders output threads thereby avoiding the need for an external output reorder buffer. The multi-threaded processor writes its thread results back to an internal memory buffer to guarantee that thread results are outputted in the same order in which the threads are received. A thread scheduler within the multi-threaded processor manages thread ordering control to avoid the need for an external reorder buffer. A compiler for the multi-threaded processor converts instructions that would normally send processed results directly to an external reorder buffer so that the processed thread results are instead sent to the internal memory buffer of the multi-threaded processor.
    Type: Application
    Filed: May 31, 2006
    Publication date: December 6, 2007
    Inventors: Yun Du, Guofang Jiao, Chun Yu
  • Publication number: 20070273698
    Abstract: A graphics processor capable of efficiently performing arithmetic operations and computing elementary functions is described. The graphics processor has at least one arithmetic logic unit (ALU) that can perform arithmetic operations and at least one elementary function unit that can compute elementary functions. The ALU(s) and elementary function unit(s) may be arranged such that they can operate in parallel to improve throughput. The graphics processor may also include fewer elementary function units than ALUs, e.g., four ALUs and a single elementary function unit. The four ALUs may perform an arithmetic operation on (1) four components of an attribute for one pixel or (2) one component of an attribute for four pixels. The single elementary function unit may operate on one component of one pixel at a time. The use of a single elementary function unit may reduce cost while still providing good performance.
    Type: Application
    Filed: May 25, 2006
    Publication date: November 29, 2007
    Inventors: Yun Du, Guofang Jiao, Chun Yu, Alexei V. Bourd
  • Publication number: 20070268289
    Abstract: A graphics system includes a graphics processor comprising a plurality of units configured to process a graphics image and a depth engine configured to receive and process data selected from one of two units based on a selection value.
    Type: Application
    Filed: May 16, 2006
    Publication date: November 22, 2007
    Inventors: Chun Yu, Brian Ruttenberg, Guofang Jiao, Yun Du
  • Publication number: 20070252843
    Abstract: A graphics system includes a graphics processor and a cache memory system. The graphics processor includes processing units that perform various graphics operations to render graphics images. The cache memory system may include fully configurable caches, partially configurable caches, or a combination of configurable and dedicated caches. The cache memory system may further include a control unit, a crossbar, and an arbiter. The control unit may determine memory utilization by the processing units and assign the configurable caches to the processing units based on memory utilization. The configurable caches may be assigned to achieve good utilization of these caches and to avoid memory access bottleneck. The crossbar couples the processing units to their assigned caches. The arbiter facilitates data exchanges between the caches and a main memory.
    Type: Application
    Filed: April 26, 2006
    Publication date: November 1, 2007
    Inventors: Chun Yu, Guofang Jiao, Yun Du