Bit plane memory system

A data processing system comprising: a transpose system configured to receive video ordered pixel data and to generate bit plane blocks of data; a write buffer configured to receive the bit plane blocks of data and to generate bit plane data frames; a memory controller configured to receive the bit plane data frames and to write a first bit plane data frame to a memory while simultaneously reading a second bit plane data frame from the memory; and a read buffer configured to receive the second bit plane data frame and to convert the second bit plane data frame to a digital display device format.

Description
RELATED APPLICATIONS

The present application claims priority to U.S. Provisional Patent Application No. 61/951,942, filed Mar. 12, 2014, which is hereby incorporated by reference for all purposes as if set forth herein in its entirety.

TECHNICAL FIELD

The present disclosure pertains generally to image display systems using display devices, including spatial light modulators and emissive displays, and more particularly, to formatting, storing, and delivering data to the display or modulator.

BACKGROUND OF THE INVENTION

Display devices convert electrical signals into light levels that make up a displayed image. Digital display devices are a subset of display devices, and are capable of displaying a finite number of discrete light levels, or gray shades, at any instance in time. Binary (two state) displays are a subset of digital display devices that can display only two light levels at any instance in time, the two light levels being fully on or fully off.

Video display devices can show a sequence of images, providing the appearance of moving pictures. Video display devices can be analog, digital, or binary.

Examples of digital display devices include: the Digital Micromirror Device (DMD) from Texas Instruments (Dallas, Tex.), the Digital Liquid Crystal on Silicon (D-LCOS) device, the VueG8 technology from Syndiant (Dallas, Tex.), the Plasma Display Panel (PDP), and light emitting diode (LED) displays. Some analog imaging devices can also be operated as a digital display, including the D-ILA device from JVC-Kenwood (Kanagawa, Japan).

A video source provides video signals to a display. The video is comprised of a sequence of images, or frames. Video signals can be analog or digital and can be converted between analog and digital forms.

An image or video frame is composed of rows and columns of pixels. The rows are also referred to as lines. Each pixel of a digital video frame has associated data that represents the light intensity and, in multicolor displays, the color of the pixel. The data is comprised of one or more binary bits (zeroes or ones). The value each bit represents may be a binary weighting (powers of 2), or some other, possibly arbitrary, weighting.

The typical structure of digital video feeding a display is a stream of digital values representing the light level to be displayed for each video pixel. Typically, the order of the pixels in the stream is from left to right for an entire line (i.e. row) and then moving down one line and repeating for every line in the display. This ordering of pixels is referred to herein as “video order.”

Each image, or video frame, is displayed for an amount of time called the frame time. The frame time can be subdivided into time slots, known as bit segments. A digital display shows each bit segment for an amount of time that is proportional to the desired weight of the bit segment. The bit segments can be all the same weight (i.e. length in time), or they can vary by segment. If the illumination is variable, this will also affect the weight of the bit segments. Some digital displays (e.g. DMD) can produce shorter bit segments if one or more adjacent bit segments are lengthened. Short bit segments are desired for high effective bit depth, but require more data bandwidth and device speed.

SUMMARY OF THE INVENTION

One aspect of the present disclosure is a bit plane memory system using SDRAM while achieving high efficiency. The bit plane memory system converts video ordered image data to bit plane ordered data, stores it to SDRAM, and then reads it out as bit planes to supply data to a digital imaging device. The system comprises several processing blocks, including transpose, write buffer, memory controller, SDRAM, and read buffer. The transpose, write buffer, and memory controller each perform part of the function of transforming video-ordered pixel data into bit plane data. A unified memory structure may be used, instead of a double buffer structure, avoiding duplication of controllers, memories, buffers, and other resources.

An advantage of the disclosure is that it enables the use of standard SDRAM for data storage. While many different types of memory could be used in a bit plane memory system, DDR3 is desirable because it has large storage capacity, low cost, high speed, and is readily available. However, using DDR3 requires special care to construct a memory system structure and control scheme to avoid excessive overhead that wastes memory bandwidth. For the implementation of high performance video display systems at a reasonable cost, memory efficiencies above 90% are desired.

Note that the present disclosure could instead use DDR2 memory, with slight performance reductions and perhaps some increase in cost. Newer and planned versions of DDR SDRAM, such as DDR4 and successors, can also be used effectively with the techniques disclosed herein.

An example embodiment of the disclosure is a DDR3-based bit plane memory system that operates with memory efficiency (bus utilization) greater than 94%.

BRIEF DESCRIPTION OF DRAWINGS

Aspects of the disclosure can be better understood with reference to the following drawings. The components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Moreover, in the drawings, like reference numerals designate corresponding parts throughout the several views, and in which:

FIG. 1 is a view of bit planes contained in two video frames;

FIG. 2 is a block diagram of a digital display system;

FIG. 3 is a block diagram of a bit plane memory system;

FIG. 4 shows an example transpose operation for 4-bit data;

FIG. 5 shows rotation of a generic rectangular block by three shear operations;

FIG. 6 shows rotation of a block of data by three shear operations;

FIG. 7 shows an example transpose of a square block of pixel data, implemented by using only shear operations and data reordering;

FIG. 8 shows an example transpose of a non-square block of pixel data, implemented by using only shear operations and data reordering;

FIG. 9 shows an example transpose embodiment;

FIG. 10 shows a timing diagram of an idealized 4-bit transpose, showing input versus output;

FIG. 11 shows a block diagram of a write buffer with a single data input;

FIG. 12 shows an example dual-port memory address map that has four write addresses for every read address;

FIG. 13 shows an example address mapping for a write buffer;

FIG. 14 shows a block diagram of a write buffer with four data inputs;

FIG. 15 shows a block diagram of a read buffer;

FIG. 16 shows an example dual-port memory address map that has one write address for every four read addresses;

FIG. 17 shows a block diagram of a memory controller and associated SDRAM;

FIG. 18 represents two logical views of the address signals for a SDRAM device;

FIG. 19 shows a mapping of plane, chunk, and frame signals to SDRAM address signals, LSBs first;

FIG. 20 shows a mapping of plane, chunk, and frame signals to SDRAM address signals with frame interleave;

FIG. 21 shows the transformation from separate plane and chunk numbers to z-order;

FIG. 22 shows a Z-ordered mapping of plane, chunk, and frame signals to SDRAM address signals;

FIG. 23 shows a first example embodiment of the disclosure;

FIG. 24 shows a second example embodiment of the disclosure;

FIG. 25 shows an SDRAM address mapping;

FIG. 26 shows a mapping of bit plane number and chunk number to SDRAM locations;

FIG. 27 shows the order in which bit plane chunks are written into a DDR3 SDRAM;

FIG. 28 shows the order in which bit plane chunks are read from a DDR3 SDRAM.

DETAILED DESCRIPTION OF THE INVENTION

In the description that follows, like parts are marked throughout the specification and drawings with the same reference numerals. The drawing figures might not be to scale and certain components can be shown in generalized or schematic form and identified by commercial designations in the interest of clarity and conciseness.

In order to enable a digital display system to show more gray shades than the intrinsic capabilities of the digital imaging device, some sort of modulation in time of the digital imaging device can be used (e.g. pulse width modulation, or PWM). Typically, the digital imaging device is modulated with a signal such that the intensities of the displayed pixels average to the desired gray shade over a time frame short enough that the human vision system will perceive these average pixel levels, rather than the modulated signal.

One approach to generating this digital imaging device modulating signal is to convert the incoming video levels into bit planes. A bit plane may also be referred to as a plane. If each pixel is represented by an N-bit value, each image frame will have N bit planes. Each bit plane represents a portion of the video level. This portion of the video level is typically referred to as the weight, or bit weight, of the bit plane. Bit weights are often binary (i.e. powers of two), but are not limited to binary ratios. For example, a binary 4-bit video signal may have 4 bit planes, with bit weights of 0.5, 0.25, 0.125, and 0.0625. Equivalently, the weights may be stated in integer form: (8, 4, 2, and 1), as it is the ratio of the bit weights that is the salient aspect.
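
As a simple illustration of the bit plane weighting just described, the following sketch (illustrative only; the function name and Python form are not part of the disclosure) splits a 4-bit pixel value into its bit planes and recovers the gray level from the bit weights:

```python
def split_into_bit_planes(pixel_value, weights=(8, 4, 2, 1)):
    # weights are listed most significant first; binary weights are shown here,
    # but, as noted above, only the ratio of the weights matters
    n = len(weights)
    plane_bits = [(pixel_value >> (n - 1 - i)) & 1 for i in range(n)]
    gray_level = sum(w * b for w, b in zip(weights, plane_bits))
    return plane_bits, gray_level

# a 4-bit value of 11 (binary 1011) is ON in bit planes 3, 1, and 0,
# giving a gray level of 8 + 2 + 1 = 11
assert split_into_bit_planes(11) == ([1, 0, 1, 1], 11)
```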

A bit plane comprises one bit of the incoming video data for every pixel within the video frame. For example, every pixel could show bit 3 of the associated video data. Each bit plane is displayed in one or more bit segments, with the weight of each bit plane being equal to the sum of the weights of the associated bit segments. A bit plane that is shown in more than one bit segment in a given frame is said to be ‘split’ or ‘repeated’. The weight of a bit plane is proportional to both the length of time each bit plane is displayed, as well as the integral of the illumination during the bit plane display. For a binary display, during a bit segment, the pixels will be ON or OFF depending on the related bit plane data.

The arrangement of the bit segments in time and their associated bit weights and bit planes is called the bit sequence (the “sequence”). The design of bit sequences involves reconciling various aspects of display image quality, including bit depth, dither noise, bandwidth, light efficiency, color artifacts, and motion artifacts.

Using multi-level halftoning (multitoning), the incoming image data can be converted to a representation using more, or fewer, bits per pixel. Multitoning can also convert from a binary (bits are powers of 2) representation to a representation with arbitrary weights per bit. This provides the ability to use arbitrary numbers of bit planes, with arbitrary bit weights.

Typically, not all possible combinations of bit planes are used. For example, a cinema display running at 24 frames per second and using a DMD with an average bit segment of 170 us can display about 260 bit segments per frame. Using one bit plane per bit segment, if every possible combination of bit planes was used, there would be 2^260 combinations, or about 10^78. This is obviously more than is required or practical. In addition, many combinations are redundant, as they have the same or very similar bit weight. In practice, a subset of combinations is chosen, with a total count ranging from dozens to hundreds of combinations. Each chosen combination of bit planes, termed a “bit code,” has an aggregate bit weight, and thus a gray level, as well as a bit vector representing the bit planes that should be ON, or ‘1’.

A bit plane memory system converts the incoming video stream from video order to bit plane order, stores the bit planes in memory, and recalls the bit planes from memory in the order required by the bit sequence, and sends the bit plane data to the display device.

Frame memories are normally constructed from one of the families of Synchronous Dynamic Random Access Memory (SDRAM), often referred to simply as DRAM. Existing bit plane memory systems either use slow and obsolete Double Data Rate SDRAM (DDR), or use expensive and limited-availability memory technologies such as Reduced Latency DRAM (RLDRAM), Low Latency DRAM (LLDRAM), Rambus DRAM (RDRAM), or Extreme Data Rate DRAM (XDR-DRAM). More available and less expensive technologies, such as Double Data Rate SDRAM 2 (DDR2) and Double Data Rate SDRAM 3 (DDR3) memories have generally not been used, due to difficulties achieving adequate data bandwidth.

Effective memory speed is determined by the raw speed of the memory devices, as well as the bus utilization of the memory system. In this context, bus utilization means the percent of time that the memory is able to either read or write data.

FIG. 1 shows an example of how 5-bit video can be converted to a simple sequence, comprising 5 bit segments, one bit segment per bit plane. Two video frames are depicted. The Y axis of this graph represents the displayed light intensity, and for binary imaging devices, this light intensity is either on or off, depending on the state of the bit plane data. Note that each bit plane consists of one bit of data for all pixels, so the bit plane data for all pixels must be available to the display, in bit plane order, rather than in conventional pixel order.

More complicated sequences that include many split and repeated bit planes can be designed to improve image quality of the video display, but the fundamental requirement of converting incoming video ordered data to bit planes for every pixel in the video frame remains.

The purposes of a bit plane memory system include:

    • 1. Convert the incoming video stream from video ordered pixel data to bit plane order,
    • 2. Store the bit plane ordered data in memory,
    • 3. Recall bit planes from memory in the order required by the bit sequence, and
    • 4. Send the bit plane data to the display device.

FIG. 2 shows an exemplary location of bit plane memory system 201 within a typical display system based on a digital imaging device. Video input receiver 203, video processing 204, bit plane memory processor 205, and display device interface 207 can be implemented using a Field Programmable Gate Array (FPGA), an Application-Specific Integrated Circuit (ASIC) 209, discrete devices or in other suitable manners. SDRAM 206 as it is used for the bit plane memory system can be located outside of the FPGA, ASIC, discrete devices or other components, but could be located internally.

Given the structure of incoming video and the definition of a bit plane, the bit plane memory system typically must be large enough to store at least two entire frames of video. One frame of storage is used to store incoming image data while the other frame is used to read out stored video bit planes to feed the digital imaging device.

For optimal video display system performance, the bit plane memory system can be fast enough to handle high speed video inputs at the same time that it delivers bit planes to the display device as fast as the display device can accept the bit plane data. The input video can be converted to bit plane format, written to the memory, and then later read from the memory as bit planes. In bit plane memory applications, the memory can be continuously switching between writing new video data into memory and reading bit planes from memory.

While many different types of memory could be used for a bit plane memory system, the large size and high speed requirements can be met by Double Data Rate 3 (DDR3) SDRAM. DDR3 SDRAM can be used because it has large storage capacity, low cost, high speed, and is readily available, due to the use of DDR3 SDRAM in many consumer and commercial products. However, using DDR3 SDRAM does require special care to construct a memory system structure and control scheme to avoid excessive overhead that degrades effective memory speed (i.e. memory bus utilization).

Effective memory speed is determined by the raw speed of the memory devices, as well as the bus utilization of the memory system. In this context, bus utilization can mean the percent of time that the memory is able to either read or write data. In prior bit plane memory applications, conventional DDR3 SDRAM systems have suffered from operational overhead that reduces the attainable bus utilization, typically to less than 50%. This low efficiency is mainly due to the cost (in time) of the DDR3's mode switching between read and write, addressing, refresh, and other non-data operations.

One purpose of the bit plane memory system is to convert streaming video-ordered pixel data into bit planes that are the width and height of the video frame. If the input video is X pixels wide and Y pixels high with N bits per pixel, then each input frame of video consists of X by Y by N bits. After conversion to bit planes, there will be N bit planes that are X by Y by 1 bit. The bit plane memory can support simultaneously writing video ordered pixel data and reading bit plane data, as both the video input and the bit plane output are expected to run concurrently. A bit plane memory system could also alternate between writing and reading on a bit-by-bit or a pixel-by-pixel basis, or in other suitable manners.
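
A behavioral sketch of this conversion is shown below; the frame_to_bit_planes name and the list-of-lists frame layout are assumptions for illustration, not a description of the hardware that follows:

```python
def frame_to_bit_planes(frame, n_bits):
    # frame: Y rows of X pixel values in video order, each pixel n_bits wide;
    # result: n_bits planes, each X wide by Y high by 1 bit
    return [[[(pixel >> b) & 1 for pixel in row] for row in frame]
            for b in range(n_bits)]

# a 2 x 2 frame of 2-bit pixels yields two 2 x 2 single-bit planes
planes = frame_to_bit_planes([[3, 1], [0, 2]], n_bits=2)
assert planes[0] == [[1, 1], [0, 0]]   # bit plane 0 (least significant)
assert planes[1] == [[1, 0], [0, 1]]   # bit plane 1
```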

If conversion from pixels to bit planes cannot efficiently take place in one step, such as due to speed limitations of the memory, a cascade approach can be used. FIG. 3 is a block diagram of a bit plane memory system that efficiently fulfills the requirements of a bit plane memory. FIG. 3 shows a generic (non-specific) block diagram of the cascade approach, which comprises one or more of each of the following elements: transpose 301, write buffer 302, memory controller 303, memory 304, and read buffer 305. At each stage, the data is transformed to become more like the desired data format (i.e. bit plane ordered pixel data). Note that each of these elements can be replicated to provide processing of more than one pixel per clock, if required. The bit plane memory system 201 of FIG. 2 can correspond to box 306 in FIG. 3.

Referring to FIG. 3, the first stage is transpose 301. It performs a matrix transpose on successive blocks of video ordered pixels, converting them to bit plane blocks. A bit plane block may also be referred to as a block. Each input pixel is N bits wide, with a new pixel arriving every clock cycle. Transpose 301 processes T pixels at a time. Every clock cycle, transpose 301 produces an output bit plane block T pixels wide that contains the same bit plane for each pixel. Thus, transpose 301 converts from N bits of one pixel per clock to T pixels of one bit per clock. Typically, N is equal to T and is relatively small, for example 32 or 128 pixels. Note that multiple transpose units may be used to process more than one pixel per clock cycle.

The second stage is a write buffer 302, which gathers the bit plane blocks from transpose(s) 301 into larger blocks of bit plane data, called bit plane chunks, of width W. Bit plane chunks are wider than T, typically an integer multiple of T, such as 512. A bit plane chunk may also be referred to as a chunk.

The third stage is memory controller 303, which gathers the bit plane chunks into full frame-size bit planes in memory 304, using a scatter-write, linear-read approach. After this point, the data in memory 304 comprises contiguous full frame-size bit planes, one after another, ready to be read out linearly by memory controller 303. Note that a linear-write, scatter-read approach could be used, with similar results.

The fourth stage is read buffer 305, which provides data continuously to the digital display device, even when memory controller 303 is temporarily busy and cannot provide data to the read buffer. This can occur during memory write operations, as well as for required memory overhead operations, such as refresh and calibration, or in other suitable manners. Read buffer 305 can also be used to convert from the bus width R of memory controller 303 to the bus width D of the digital display device.

To minimize memory and other resource use, a unified memory structure can be used for the SDRAM, instead of a double buffered structure. This requires memory controller 303 to arbitrate memory access between writes, reads, and memory maintenance operations such as refresh and calibration. The arbitration scheme, along with the elasticity of a large read buffer and an appropriately sized write buffer, allows the system to simultaneously ingest a high speed video stream and continuously supply bit planes of data to the digital display device at maximum speed.

Transpose Detail

One method to generate the transpose of a block of data is to first rotate the block of data by 90 degrees. Then the transposed data is obtained by reordering the rotated data. Note that reordering does not consume any circuit resources. FIG. 4 shows an example transpose operation for 4-bit data. Referring to FIG. 4, input data block 401 is four pixels wide by 4 bit planes high. After rotating 90 degrees counterclockwise, rotated data 402 is 4 bit planes wide by 4 pixels high, but the pixel order is reversed. Reordering the data gives output data block 403.
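
A minimal sketch of this rotate-then-reorder view of the transpose, using plain Python lists (the function names are illustrative assumptions), is:

```python
def rotate_ccw(block):
    # rotate a rows x cols block of bits 90 degrees counterclockwise
    rows, cols = len(block), len(block[0])
    return [[block[r][cols - 1 - i] for r in range(rows)] for i in range(cols)]

def transpose_by_rotation(block):
    # rotation reverses the pixel order, so a simple row reordering
    # (pure wiring, no circuit resources) yields the transpose
    return list(reversed(rotate_ccw(block)))

block = [[1, 2, 3, 4],      # bit plane 0 of pixels A..D
         [5, 6, 7, 8],      # bit plane 1
         [9, 10, 11, 12],   # bit plane 2
         [13, 14, 15, 16]]  # bit plane 3
assert transpose_by_rotation(block) == [[1, 5, 9, 13], [2, 6, 10, 14],
                                        [3, 7, 11, 15], [4, 8, 12, 16]]
```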

One method to rotate a block of data is to perform three successive shear operations (see Paeth, A. W. (1986) “A Fast Algorithm for General Raster Rotation,” Proceedings, Graphics Interface '86, Canadian Information Processing Society, Vancouver, pp 77-81). FIG. 5 shows rotation of a generic rectangular block by three shear operations. Referring to FIG. 5, input block 501 has corners labeled A, B, C, and D. A first shearing in the X (horizontal) direction results in block 502. A second shearing in the Y (vertical) direction produces block 503. A third shearing in the X (horizontal) direction produces block 504, which is the input block 501 rotated counterclockwise by 90 degrees.

FIG. 6 shows rotation of a block of data by three shear operations. Referring to FIG. 6, input block 601 has corners labeled A, D, P, and M. A first shearing in the X (horizontal) direction results in block 602. A second shearing in the Y (vertical) direction produces block 603. A third shearing in the X (horizontal) direction produces block 604, which is the input block 601 rotated counterclockwise by 90 degrees.

FIG. 7 shows an example transpose of a square block of pixel data, implemented by using only shear operations and data reordering. Referring to FIG. 7, input block 701 is 4 pixels wide and 4 bit planes high. The labeling of each bit is of the form ‘Pixel BitPlane’, e.g. B2 is pixel B and bit plane 2. A first shearing in the X (horizontal) direction produces block 702. A second shearing in the Y (vertical) direction produces block 703. A third shearing in the X (horizontal) direction produces block 704, which is input block 701 rotated counterclockwise by 90 degrees. Reordering the bits produces block 705, which is the transpose of input block 701. Note that reordering data does not increase circuit resource usage.

FIG. 8 shows an example transpose of a non-square block of pixel data, implemented by using only shear operations and data reordering. Referring to FIG. 8, input block 801 is 5 pixels wide and 4 bit planes high. The labeling of each bit is of the form ‘Pixel BitPlane’, e.g. B2 is pixel B and bit plane 2. A first shearing in the X (horizontal) direction produces block 802. A second shearing in the Y (vertical) direction produces block 803. A third shearing in the X (horizontal) direction produces block 804, which is input block 801 rotated counterclockwise by 90 degrees. Reordering the bits produces block 805, which is the transpose of input block 801. This demonstrates that this technique works for square and rectangular blocks of data.
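
For a 90 degree rotation, Paeth's three shears reduce to integer shears of the (pixel, bit plane) coordinates, so the rotation, and hence the transpose after reordering, is a pure permutation of bit positions. A small sketch of the coordinate arithmetic, written for illustration and not taken from the figures, is shown below:

```python
def rotate90_ccw_by_shears(x, y):
    # Paeth rotation by 90 degrees: shear in X, shear in Y, shear in X
    x = x - y          # first X shear
    y = y + x          # Y shear (uses the already-sheared x)
    x = x - y          # second X shear
    return x, y        # equals (-y_original, x_original): a 90 degree CCW turn

# every cell of a block, square or not, lands on a unique rotated position
original = [(x, y) for x in range(5) for y in range(4)]   # 5 pixels x 4 planes
rotated = [rotate90_ccw_by_shears(x, y) for (x, y) in original]
assert rotated == [(-y, x) for (x, y) in original]
```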

An advantage of using only shear operations is that they can be implemented in digital logic using only shift registers (i.e. delay lines) for X shear and barrel shifters for Y shear. Prior art methods require expensive multi-port memories (refer to U.S. Patent No.) and/or multiple sets of memories to perform the transpose operation (refer to U.S. Pat. No. 5,255,100, U.S. Pat. No. 5,663,749, and U.S. Pat. No. 6,118,500).

FIG. 9 shows an example transpose embodiment. Referring to FIG. 9, input bus 901 provides video-ordered pixel data with 4 bits per pixel. Input delay 902 delays each bit of the pixel data by a different amount, proportional to the number of the bit (e.g. bit 2 of each pixel is delayed by 2 clocks), providing a shear in the X (horizontal) direction. Input Pixel Number 907 provides a pixel number, resetting to zero at the beginning of the frame (or at the beginning of each video line if desired). Modulo unit 908 calculates PN mod T, where PN is the pixel number and T is the transpose output width. For this example T is 4. The output of Modulo unit 908 counts from zero to three and then repeats. Complement unit 909 inverts the count from Modulo unit 908. The output of Complement unit 909 counts from three down to zero and then repeats. Barrel shifter 903 rotates the delayed data from input delay 902 by the amount selected by the output of Complement unit 909, providing a shear in the Y (vertical) direction. Note that the complement unit could be eliminated by connecting the barrel shifter output data 910 in reverse bit order. Output delay 904 delays each bit by an amount which cancels the delay of input delay 902, providing a shear in the X (horizontal) direction. The output of Output delay 904 has been sheared in X, Y, and X and is input bus 901 data rotated by 90 degrees. The data from Output delay 904 is connected in reverse bit order in box 905, which produces the transpose of the input data 901 at output bus 906. Note that different output width values (T) can be used; a typical system may have T of 16, 18, 24, or 32, or some other size.

The transpose processing technique as shown in FIG. 9 provides flow-thru operation, meaning that a new block of bit plane ordered data is produced every clock. The transpose can use clocked shift registers or data registers as delay lines for processing. During startup (start of each frame), it takes a number of clock cycles for the data registers to fill and start producing outputs. The number of startup clock cycles is approximately equal to the number of bits per input pixel. The transpose data registers are typically kept full between video lines. At the end of the last active video line, the data registers will still contain partially processed pixel data, so the transpose must process the data contained in the data registers, even though no new data is available at the input of the transpose. This is called a flush. A convenient signal to initiate the flush is the leading edge of the video vertical sync signal. Alternately, the transpose could process the data, regardless of sync and the availability of video data, removing the requirement for a flush operation.
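
The flow-through behavior described above can be modeled in software. The following sketch is a behavioral model only: it assumes one particular convention for the delay amounts and rotation direction, intended to illustrate the same idea as FIG. 9 (equivalent up to the bit-reversal wiring), and it is not a register-level description of the circuit. The extra clock cycles at the end of the loop correspond to the flush described above.

```python
from collections import deque

def streaming_transpose(pixels, N):
    # Behavioral model of a flow-through shear-based transpose for N = T.
    # After a startup latency of N-1 clocks, one N-pixel bit plane block is
    # produced per clock; block k's bit planes appear on clocks
    # k*N + N-1 .. k*N + 2N-2, most significant plane first.
    pre = [deque([0] * (N - 1 - b)) for b in range(N)]   # first X shear
    post = [deque([0] * (N - 1 - b)) for b in range(N)]  # second X shear
    outputs = []
    for t in range(len(pixels) + 2 * N):   # extra clocks flush the delay lines
        pixel = pixels[t] if t < len(pixels) else 0
        # first X shear: stagger the bits of the incoming pixel in time
        d1 = []
        for b in range(N):
            pre[b].append((pixel >> b) & 1)
            d1.append(pre[b].popleft())
        # Y shear: barrel-rotate the staggered word by (N-1) - (t mod N)
        s = (N - 1) - (t % N)
        d2 = [d1[(i + s) % N] for i in range(N)]
        # second X shear: cancel the stagger introduced at the input
        word = []
        for b in range(N):
            post[b].append(d2[b])
            word.append(post[b].popleft())
        outputs.append(word)   # word[i] = one bit plane of pixel (block base + i)
    return outputs

# 4-bit example: pixel p carries value p, so bit plane 0 of pixels 0-3 is 0,1,0,1
out = streaming_transpose([0, 1, 2, 3, 4, 5, 6, 7], N=4)
assert out[6] == [0, 1, 0, 1]   # clock 6: bit plane 0 of pixels 0-3
assert out[3] == [0, 0, 0, 0]   # clock 3: bit plane 3 of pixels 0-3
```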

FIG. 10 shows a timing diagram of an idealized 4-bit transpose, showing input versus output, neglecting pipeline delays and other practicalities. The pixel data is labelled with an integer plus decimal fraction. The integer part represents the pixel number of the data and the decimal fraction part represents the bit plane number of the data. For example, 1.0 represents pixel 1 bit plane 0 data and 5.3 represents pixel 5 bit plane 3 data.

Due to limitations in the speed of processing of FPGAs or ASICs and the need to support very high video frame rates, more than one transpose module can be used in parallel, providing the ability to process more than one pixel per clock.

Write Buffer Detail

A write buffer assembles bit plane blocks (of width T) received from one or more transpose units into larger blocks of bit plane data, called bit plane chunks (of width W). The bit plane blocks consist of data from a single bit plane from T pixels. Bit plane chunks are wider than T bits, typically an integer multiple of T, such as 512.

The width W of the bit plane chunks is primarily derived from the size of the smallest block of data that can be written by the memory controller to the SDRAM. This size is determined by the SDRAM bus width and burst mode. For example, a DDR3 SDRAM with an I/O bus width of n bits has a prefetch buffer size of 8n, resulting in a burst size of 8 words per memory access. For a DDR3 SDRAM-based memory of width 64 and operated in a burst 8 mode, a block of 512 bits is the smallest block that can be uniquely addressed within the DDR3 SDRAM. In this case, a bit plane chunk width W of 512 would be appropriate, as would integer multiples of 512.

The write buffer is typically also used to cross clock domains from the input pixel clock domain to the memory system clock domain.

FIG. 11 shows a block diagram of a write buffer with a single data input.

Referring to FIG. 11, bit plane blocks 1101 are the input to dual-port memory 1102. The bit plane block data is T bits wide. Bit plane chunks 1103 are the output from the write buffer. Synchronization signals 1105 and 1106 are used to control the transport of data into and out of the write buffer. Controller 1104 uses the incoming synchronization signals 1105 to generate write enable and write address signals for the write side of dual-port memory 1102. Controller 1104 uses synchronization signals 1106 to generate read enable and read address signals for the read side of dual-port memory 1102.

Referring again to FIG. 11, one function of the dual-port memory 1102 is to translate from the data width T of the incoming bit plane block data 1101 to the data width W of the outgoing bit plane chunk data 1103. FIG. 12 shows an example dual-port memory address map that has four write addresses for every read address. Referring to FIG. 12, the same memory is represented by write address map 1201 and read address map 1202. For example, data written to write addresses 0, 1, 2, and 3 can be read from read address 0, as the concatenation of the four written data values. This is normal operation for dual-port memories with different data widths on the two ports.

FIG. 13 shows an example address mapping for a write buffer that inputs bit plane blocks (width T=32 pixels and N=32 bit planes) and outputs bit plane chunks of width W=128. Referring to FIG. 13, the same memory locations are represented by both write address map 1301 and read address map 1302. Bit plane blocks are written 32 pixels at a time and bit plane chunks are read 128 pixels at a time. Each write or read has data for a single bit plane. For writing, bit plane 0 pixels 0-31 are written to write address 0. Bit plane 1 pixels 32-63 are written to write address 5, and so on. For reading, bit plane 0 pixels 0-127 are read from read address 0. Bit plane 22 pixels 0-127 are read from read address 22.
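
The FIG. 13 mapping can be expressed arithmetically. The sketch below (the function names are assumptions, and it covers only the first chunk's worth of addresses within one logical buffer) reproduces the example addresses given above:

```python
T, W, N = 32, 128, 32            # block width, chunk width, bit planes
BLOCKS_PER_CHUNK = W // T        # 4 write addresses per read address

def write_address(bit_plane, first_pixel):
    # each write stores T pixels of one bit plane; W // T consecutive
    # write addresses make up one read address (one bit plane chunk)
    block = (first_pixel // T) % BLOCKS_PER_CHUNK
    return bit_plane * BLOCKS_PER_CHUNK + block

def read_address(bit_plane):
    # one W-pixel bit plane chunk per read address
    return bit_plane

assert write_address(0, 0) == 0     # bit plane 0, pixels 0-31
assert write_address(1, 32) == 5    # bit plane 1, pixels 32-63
assert read_address(22) == 22       # bit plane 22, pixels 0-127
```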

FIG. 14 shows a block diagram of a write buffer with four data inputs. Referring to FIG. 14, bit plane blocks 1401 are the inputs to dual-port memory 1402. Each bit plane block is T bits wide, giving a total of 4T input bits. The concatenation of the four inputs 1401 is written to the dual-port memory. Bit plane chunks 1403 are the output from the write buffer. Synchronization signals 1405 and 1406 are used to control the transport of data into and out of the write buffer. Controller 1404 uses the incoming synchronization signals 1405 to generate write_enable and write_address signals for the write side of dual port memory 1402. Controller 1404 uses synchronization signals 1406 to generate read_enable and read_address for the read side of dual port memory 1402.

Typically, the memory locations in the dual-port memory are logically divided into two buffers, each half the size of the total memory. These buffers are used in a ping-pong fashion: at any point in time, one buffer is used for writing and the other is used for reading, and the roles of the two buffers are swapped (write becomes read and vice versa) when incoming data has completely filled the buffer that is being used for writing. The write buffer controller (FIG. 11, 1104 or FIG. 14, 1404) generates a data_ready signal that indicates that unread data is available in the buffer that is being used for reads.

Instead of dividing the memory locations into two logical buffers, an alternate write buffer embodiment uses the dual-port memory as a first-in-first-out (FIFO) memory. This configuration has the advantage that less total memory is needed, but control of the memory is more complicated.

It is desirable for the size of each logical buffer to be as small as possible, to minimize the time required to empty the write buffer. Multiple constraints limit how small the buffers can be. At a minimum, each buffer should be large enough to hold N bit plane chunks of width W. Less than this can prevent or hinder the assembly of bit plane chunks for all N bit planes, as the last bit plane block of the first bit plane chunk arrives at the write buffer just after the next to last bit plane block of the last bit plane chunk. Referring again to FIG. 13, recall that for this example N=32, T=32, and W=128. Write order 1303 shows the order that the locations in the write buffer are filled. Note that the final block of the bit plane chunk containing bit plane 0 pixels 0-127 is not written into the buffer until the 97th clock, at write address 3. Reading of read address 0 cannot begin until after this point. Note that a FIFO structure could use less memory, but would require the ability to predict in advance that enough data is present to begin SDRAM writes.

Referring again to FIG. 3, during maximum bandwidth operation, the memory controller 303 must empty the half of write buffer 302 used for reads in less time than it takes the data from transpose 301 to fill the half of the write buffer 302 used for writes, to prevent buffer overflow. Bit plane block data is streamed into write buffer 302 (i.e. cannot be throttled or interrupted), so write buffer 302 must respond immediately. Data is burst out of write buffer 302 at a high rate when memory controller 303 is ready to accept it. Since the width of write buffer 302 output W is normally larger than the width of write buffer 302 input T, emptying the buffer happens faster than filling, subject to the write and read clock speeds. Another constraint on the minimum logical buffer size for write buffer 302 is that low level SDRAM activities, such as refresh and periodic read calibration, can preempt reads from write buffer 302 and delay its emptying. This can result in write buffer overflow and must be avoided. Increasing the size of the logical buffers, usually by an integer factor (e.g. 2), gives extra time for interruptions from low level operations.

Memory Controller Detail

Referring again to FIG. 3, memory controller 303, in conjunction with the SDRAM 304, performs the third stage of the cascade operation. Using a scatter writing and reading scheme, memory controller 303 scatter-writes bit plane chunks of width W from write buffer 302 to SDRAM 304 in chunk order, such that reading from SDRAM 304 in bit plane order results in contiguous full frame bit planes of data. Memory controller 303 accepts bit plane chunks of size W from write buffer 302 and delivers full video frame size bit plane data to the display device or display device interface, via the read buffer. Memory controller 303 performs this block reordering function by manipulating SDRAM write and read addresses appropriately.

FIG. 17 shows a block diagram of a memory controller and associated SDRAM. Referring to FIG. 17, bit plane chunk data 1701 are the input to memory controller 1704. The bit plane chunk data is W bits wide. Bit plane ordered data 1703 of width R is output to the read buffer. Synchronization signals 1705 and 1706 are used to control the transport of data into and out of memory controller 1704. Memory controller 1704 transfers data to or from SDRAM 1702, as well as performing periodic refresh and other low level maintenance operations.

Referring again to FIG. 17, one function of memory controller 1704 is to translate from the data width W of the incoming bit plane chunk data 1701 to the data width M of the SDRAM 1702. Another function of memory controller 1704 is to translate from the data width M of SDRAM 1702 to the data width R of the outgoing bit plane ordered data 1703.

Memory overhead can be minimized for highest possible system performance, but care must be taken in how bit planes are mapped in to the SDRAM address space. The memory map can be defined in such a way that both write and read access to the SDRAM can occur without interruption when SDRAM bank or row changes are required.

SDRAM are typically addressed using three types of address signals, each in the form of a binary number. Row address signals select a row of internal memory, which is the fundamental internal data block of SDRAM, i.e. changing the row address requires accessing the internal memory, which is a relatively slow operation. Column address signals select a portion of a row. Column addresses can be changed much more quickly than row addresses, usually changing for every burst of data to or from the SDRAM. Bank address signals select between multiple memory subsystems in the SDRAM device, where these subsystems can operate mostly in parallel. Bank addresses can also change for every burst of data to or from the SDRAM. Common DDR3 SDRAM typically has 10 column address signals, 3 bank address signals, and enough row address signals such that all of the memory can be addressed.

FIG. 18 represents two logical views of the address signals for an SDRAM device. Referring to FIG. 18, separate row, bank, and column address signals 1801 are depicted concatenated with row in the most significant bits, bank in the middle bits, and column in the least significant bits. Other orderings are possible, but this is a common ordering used with SDRAM devices. Logical address signals 1802 are shown to indicate that address signals 1801 are mapped to the equivalent logical address signals 1802, with no loss of generality. Note that, depending on the minimum burst length of a SDRAM device, one or more of the least significant column address bits may be set to a constant value (usually zero), for maximum transfer speed. For example, if the minimum burst length is 8, the three least significant column address bits may be either unused or set to a constant value. For bit plane memory applications, typically all transfers are a full SDRAM burst, to achieve maximum speed and efficiency. For unusual size transfers, the least significant column address bits can be used, but at the cost of reduced speed and efficiency.

Since SDRAM row address signals require the longest delay between changes, the other address signals (bank and column) can be used to separate row address changes by as much time as required by the SDRAM device, in order to achieve maximum efficiency. This separation can be accomplished by mapping the bit plane chunk data to the SDRAM address space in a particular way, as will be described below. Note that there are several ways to accomplish this mapping that meet the requirement of separating row address changes by sufficient time for the SDRAM device.

There are ChunksPerFrame bit plane chunks in each bit plane of an image frame, where ChunksPerFrame=(Pixels per Frame)/W. Bit plane chunks are numbered from 0 to (ChunksPerFrame−1). For example, a 4096 by 2160 image has 8,847,360 pixels. If W is 512, then there are 17,280 bit plane chunks per frame per bit plane. These bit plane chunks are numbered 0 to 17,279. Bit planes are numbered from 0 to N−1.
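
As a quick arithmetic check of the example above (illustrative only):

```python
pixels_per_frame = 4096 * 2160            # 8,847,360 pixels
W = 512                                   # bit plane chunk width
chunks_per_frame = pixels_per_frame // W  # 17,280 chunks per bit plane
# chunks are numbered 0 .. chunks_per_frame - 1, i.e. 0 .. 17,279
assert chunks_per_frame == 17280
```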

The bit plane number and bit plane chunk number are both represented in binary form. Binary numbers are made up of binary bits. Each binary bit can be either 0 or 1. Each binary bit represents a power of 2, with more significant bits representing higher powers of two. It is customary for the bits in a binary number to be represented in increasing power of 2 and significance from right to left. Thus, bits to the left are higher in significance than bits to the right. A well-known property of binary numbers is that when a binary number counts either up or down, the least significant bits change first and fastest. This means that while counting, before a given bit in a binary number changes, all the bits of lesser significance will have counted through all possible combinations of binary values of the bits of lesser significance.

During writes to the SDRAM, bit plane chunks containing W pixels of a single bit plane are written to the SDRAM. Each successive bit plane chunk contains pixels of a different bit plane. After a bit plane chunk has been written for each bit plane, the next chunk is written for each bit plane, starting at bit plane zero. This means the bit plane number is changing with each write, and the bit plane chunk number changes after N bit plane chunks have been written, where N is the number of bit planes in the data. Thus, during writes, the bit plane number changes the fastest, with the bit plane chunk number changing slower by a factor of N.

During reads from the SDRAM, bit plane chunks, containing R pixels of a single bit plane, are read from the SDRAM. Each successive bit plane chunk is from the same bit plane, until an entire frame of bit plane data has been read. This means that the bit plane chunk number is changing with each read, and the bit plane number changes only after all chunks of that bit plane have been read. Thus, during reads, the bit plane chunk number changes the fastest, with the bit plane number changing slower by a factor of ChunksPerFrame.

Bit plane chunk data must be mapped to a SDRAM location that is unique for each bit plane number, each bit plane chunk number, and each image frame. One method for this mapping is to connect subsections of the binary numbering of the plane, chunk, and frame to the SDRAM address signals. FIG. 19 shows a mapping of plane, chunk, and frame signals to SDRAM address signals. Referring to FIG. 19, the least significant four bits of the chunk number 1901 and the three least significant bits of the bit plane number 1902 are connected to SDRAM column address signals 1903. The least significant three bits of SDRAM column address signals are driven by constant zeroes 1904. Note that the most significant 7 column address signals 1903 are completely interchangeable with each other, i.e. the order in which the least significant bit plane number signals 1902 and least significant chunk number signals are assigned is arbitrary and has no effect on performance. The next most significant two bits of the bit plane number 1905 are connected to two of the SDRAM bank address signals 1907. The next most significant bit of the chunk number signals 1906 is connected to the third SDRAM bank address signal. The remainder of the SDRAM address signals are driven by the remaining bit plane number bits, bit plane chunk number bits, and frame number bits. For both bit plane number and chunk number, this meets the requirement that least significant bits are connected to column address signals and that the next most significant bit is connected to a bank signal, resulting in the greatest separation in time between row address changes, as bit plane number and chunk number increment or decrement.

A second method for mapping bit plane chunk data to SDRAM locations is to again connect subsections of the binary numbering of the plane, chunk, and frame to the SDRAM address signals. FIG. 20 shows a mapping of plane, chunk, and frame signals to SDRAM address signals. Referring to FIG. 20, the least significant four bits of the chunk number 2001 and the three least significant bits of the bit plane number 2002 are connected to SDRAM column address signals 2003. The least significant three bits of SDRAM column address signals are driven by constant zeroes 2004. Note again that the most significant 7 column address signals 2003 are completely interchangeable with each other, i.e. the order in which the least significant bit plane number signals 2002 and least significant chunk number signals are assigned is arbitrary and has no effect on performance. The next most significant bit of the bit plane number 2005 is connected to a first SDRAM bank address signal 2008. The next most significant bit of the chunk number signals 2006 is connected to a second SDRAM bank address signal. The least significant frame count bit 2007 is connected to a third SDRAM bank address signal. The remainder of the SDRAM address signals are driven by the remaining bit plane number bits, bit plane chunk number bits, and frame number bits. For both bit plane number and chunk number, this configuration meets the requirement that least significant bits are connected to column address signals and that the next most significant bit is connected to a bank signal, resulting in the greatest separation in time between row address changes, as bit plane number and chunk number increment or decrement. In addition, the connection of the least significant frame count bit to an SDRAM bank address signal causes reads and writes to use a different subset of the available banks. This configuration provides for zero overhead switching between read and write operations. Since the read and write frame numbers normally differ by one, the least significant bit of the frame count will be different for reads and writes to the SDRAM device.
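
One way to express a mapping of this kind in software is sketched below. The column and bank field assignments follow the FIG. 20 description above, but the exact packing of the remaining bits into the row field, and the field widths, are assumptions chosen only for illustration:

```python
def sdram_logical_address(plane, chunk, frame):
    # column field: 4 LSBs of the chunk number, 3 LSBs of the plane number,
    # and 3 zeroed burst bits (10 column bits total)
    col = (((chunk & 0xF) << 3) | (plane & 0x7)) << 3
    # bank field: next plane bit, next chunk bit, and the frame count LSB
    # (the frame LSB in a bank bit allows zero-overhead read/write switching)
    bank = ((plane >> 3) & 1) | (((chunk >> 4) & 1) << 1) | ((frame & 1) << 2)
    # row field: the remaining plane, chunk, and frame bits; the packing here
    # is arbitrary as long as every (plane, chunk, frame) combination is unique
    row = ((plane >> 4) & 0x3) | (((chunk >> 5) & 0xFFFF) << 2) | ((frame >> 1) << 18)
    # concatenate as row : bank : column, as in FIG. 18
    return (row << 13) | (bank << 10) | col
```

With this arrangement, consecutive writes (bit plane number changing fastest) and consecutive reads (chunk number changing fastest) both cycle through many column and bank combinations before any row bit changes, and the frame count LSB keeps read and write traffic in different banks.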

A third method for mapping bit plane chunk data to SDRAM locations is to use Z-ordered, or Morton ordered, addressing. Z-order, or Morton order, is a function which maps multi-dimensional data to one dimension while preserving locality of the data points; see Morton, G. M. (1966), “A Computer Oriented Geodetic Data Base; and a New Technique in File Sequencing,” Technical Report, Ottawa, Canada: IBM Ltd. The z-value of a point in multi-dimensions can be calculated by interleaving the binary representations of its coordinate values. In the case of bit plane chunk data, there are two dimensions: the bit plane number and the bit plane chunk number. FIG. 21 shows the transformation from separate plane and chunk numbers to z-order. Referring to FIG. 21, eight bits of bit plane number 2101 and 16 bits of bit plane chunk number 2102 are combined to give Z-order 2103 by interleaving the binary bits of the bit plane number and bit plane chunk number. The bits of the Z-ordered value correspond to the logical address 2104, which can be used to address the SDRAM. The frame number could be considered a third dimension and included in the Z-ordered address, but it changes slowly enough to have little effect on memory efficiency. If zero-overhead switching between read and write operations is required, the least significant bit of the frame count may be connected to an SDRAM bank address signal.
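
A Morton interleave of the two numbers can be computed as in the sketch below. The 8-bit plane number and 16-bit chunk number widths follow FIG. 21; how the chunk bits beyond the plane width are packed above the interleave is an assumption:

```python
def z_order(plane, chunk, plane_bits=8):
    # interleave chunk bits (even positions) with plane bits (odd positions);
    # chunk bits beyond the plane width are appended above the interleave
    z = 0
    for i in range(plane_bits):
        z |= ((chunk >> i) & 1) << (2 * i)
        z |= ((plane >> i) & 1) << (2 * i + 1)
    z |= (chunk >> plane_bits) << (2 * plane_bits)
    return z

assert z_order(0, 1) == 1    # chunk LSB lands in the lowest z-order bit
assert z_order(1, 0) == 2    # plane LSB lands one bit above it
```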

FIG. 22 shows a Z-ordered mapping of plane, chunk, and frame signals to SDRAM address signals. Referring to FIG. 22, the SDRAM address signals 2201 are connected to three constant zero bits 2202, the least significant Z-order bits 2203, the least significant frame count bit 2204, and the most significant Z-order bits 2205. Note again that the most significant seven column address signals are completely interchangeable with each other. Due to the characteristics of the Z-order (preservation of locality), the separation in time between row address changes is maximized. The connection of the least significant frame count bit to an SDRAM bank address signal causes reads and writes to use a different subset of the available banks. This provides for zero overhead switching between read and write operations. Since the read and write frame numbers normally differ by one, the least significant bit of the frame count will be different for reads and writes to the SDRAM device.

SDRAM write address generation is based on video counters (memory block, line, and frame counters) that are advanced by the incoming video syncs. SDRAM read address generation is based on bit plane requests from a display controller and on a read memory block counter that counts memory block reads since a display controller request. The read memory block counter is reset upon a new display controller request. The memory controller includes a display controller request FIFO and can hold multiple pending requests. The memory controller read address generator compares the read memory block counter to the calculated memory block stop count (calculated from the display controller request parameters) and asserts a read stop signal when the requested number of read blocks have been transferred to the display device. The memory controller can swap read and write frames in SDRAM based on incoming video vertical syncs.

In addition to reordering bit plane blocks via SDRAM address manipulation, the memory controller must also arbitrate between writes and reads to the SDRAM, as well as SDRAM low-level maintenance operations, such as refresh and periodic read calibration. The memory controller arbiter monitors write buffer and read buffer data ready signals and starts SDRAM writes or reads as required. The memory access priority scheme is defined in Table 1.

TABLE 1

Priority    Memory Burst Size    Memory Operation
Highest     Smallest             SDRAM low level functions (refresh, periodic read, etc.)
Medium      Medium               Bit Plane Data Writes
Lowest      Largest              Bit Plane Data Reads

In general, memory operation priority is in inverse order to the length of time each operation takes. Low level SDRAM maintenance functions are the fastest and the highest priority. Writing bit plane data from the write buffer is the next highest priority. In general, the write buffer is sized as small as possible while still being able to gather transpose blocks into bit plane chunks, but it must also be large enough to buffer the input video stream during low level SDRAM functions. Additional constraints on write transactions include the minimum number of writes between switching SDRAM banks and the time required for the SDRAM to switch between writing and reading. Reading bit plane data to the read buffer is the lowest priority and takes the longest time per transaction. The read buffer should be sized large enough to buffer up enough data to deliver a continuous stream of bit plane data to the display device while periodically being interrupted by bursts of write data into the SDRAM and low level SDRAM functions. For maximum performance, the read buffer should be large enough such that one read buffer worth of data delivered to the display device spans at least the time interval of one input video line. This allows the input video horizontal blanking time to be used to prevent the read buffer from under-flowing during maximum SDRAM bandwidth operation. The write buffer converts the input video stream into small high frequency bursts of data to be written into the SDRAM.
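
The priority scheme of Table 1 amounts to a fixed-priority selection, sketched below with illustrative signal names (not taken from the disclosure):

```python
def next_sdram_operation(maintenance_due, write_buffer_ready, read_buffer_hungry):
    # fixed priority, inverse to the length of each operation (Table 1)
    if maintenance_due:          # refresh, periodic read calibration, ...
        return "low_level_maintenance"
    if write_buffer_ready:       # a write buffer half is full and waiting
        return "bit_plane_write_burst"
    if read_buffer_hungry:       # the read buffer has room for more bit plane data
        return "bit_plane_read_burst"
    return "idle"
```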

Read Buffer Detail

A read buffer accepts bit plane ordered data (of width R) received from a memory controller. The read buffer outputs bit plane ordered data (of width D) to a digital display device or a digital display device interface. The bit plane ordered data usually comprises full frames of single bit planes.

The width R of the bit plane data to the read buffer is primarily derived from the size of the smallest block of data that can be read by the memory controller from the SDRAM. This is generally the same width as W, the output of the write buffer. This size is determined by the SDRAM bus width and burst mode. For example, a DDR3 SDRAM with an I/O bus width of n bits has a prefetch buffer size of 8n, resulting in a burst size of 8 words per memory access. For a DDR3 SDRAM of width 64 and operated in a burst 8 mode, a block of 512 bits is the smallest block that can be uniquely addressed within a DDR3 SDRAM. In this case, a bit plane data width R of 512 would be appropriate.

The read buffer is typically also used to cross clock domains from the memory system clock domain to the digital display device clock domain.

FIG. 15 shows a block diagram of a read buffer. Referring to FIG. 15, bit plane data 1501 are the input to dual-port memory 1502. The bit plane data is R bits wide. Bit plane data 1503 is the output from the read buffer. Synchronization signals 1505 and 1506 are used to control the transport of data into and out of the read buffer. Controller 1504 uses the incoming synchronization signals 1505 to generate write enable and write address signals for the write side of dual-port memory 1502. Controller 1504 uses synchronization signals 1506 to generate read enable and read address signals for the read side of dual-port memory 1502.

Referring again to FIG. 15, one function of the dual-port memory 1502 is to translate from the data width R of the incoming bit plane data 1501 to the data width D of the outgoing bit plane data 1503. FIG. 16 shows an example dual-port memory address map that has one write address for every four read addresses. Referring to FIG. 16, the same memory is represented by write address map 1601 and read address map 1602. For example, data written to write address 0 can be read from read addresses 0, 1, 2, and 3, as the concatenation of the four read data values. This is normal operation for dual-port memories with different data widths on the two ports, as will be apparent to one skilled in the art.

Typically, the memory locations in the read buffer's dual-port memory are logically divided into two buffers, each half the size of the total memory. These buffers are used in a ping-pong fashion: at any point in time, one buffer is used for writing and the other is used for reading, and the roles of the two buffers are swapped (write becomes read and vice versa) when incoming data has completely filled the buffer that is being used for writing. Referring again to FIG. 15, the read buffer controller 1504 generates a data_request signal that indicates that space is available for writing to the dual-port memory 1502.

Instead of dividing the memory locations into two logical buffers, an alternate read buffer embodiment uses the dual-port memory as a first-in-first-out (FIFO) memory. This has the advantage that less total memory is needed, but control of the memory is more complicated.

Referring again to FIG. 3, during maximum bandwidth operation, the memory controller 303 must fill the half of read buffer 305 used for writes in less time than it takes the display device to empty the half of read buffer 305 used for reads, to prevent buffer underflow. Bit plane block data is streamed out of the read buffer (i.e. cannot be throttled or interrupted) and the read buffer must respond immediately. Data is burst into the read buffer 305 at a high rate from memory controller 303 when read buffer 305 is ready to accept it. The primary purpose of the read buffer 305 is to provide bit plane data continuously to the digital display device, even when the memory controller 303 is temporarily busy and cannot provide data to the read buffer. This can occur during memory write operations, as well as for required memory overhead operations, such as refresh and calibration. The read buffer 305 can also be used to convert from the bus width R of the memory controller 303 to the bus width D of the digital display device. Additional functions that may be included in the read buffer include vertical flip, horizontal flip, and reordering of lines of the image bit plane data.

A First Example Embodiment

FIG. 23 shows a first example embodiment of the disclosure. Referring to FIG. 23, this example system processes 32-bit pixels 2309 and uses one rank of 64-bit DDR3 SDRAM 2304. Typically, multiple DDR3 memory chips are used in parallel, e.g. (4) 16-bit DDR3 chips for a 64-bit bus. Bit plane ordered data 2314 is supplied to a digital display device by a 128 bit bus. All signal processing is synchronized by one or more clock signals. Other data sizes and SDRAM sizes and types could be used, as will be apparent to those skilled in the art.

Referring to FIG. 23, the transformation of video ordered pixel data to bit plane data occurs in a cascade of operations. The first stage of the cascade operation is performed in the 32-bit transpose 2301, which transposes video ordered pixels of 32 bits 2309 into bit plane blocks of 32 pixels 2310. The second stage of the cascade operation is performed by write buffer 2302, which gathers 32-pixel wide bit plane blocks 2310 into 512-pixel wide bit plane chunks 2311. The input to the write buffer is 32 bits wide, matching the output of the transpose. The output from the write buffer is 512 bits wide 2311, matching the input of memory controller 2303. The memory controller, in conjunction with the DDR3 SDRAM 2304, performs the third stage of the cascade operation. Using a scatter writing and reading scheme, the memory controller scatter-writes bit plane chunks of width W=512 from the write buffer 2302 to SDRAM 2304 in chunk order, such that subsequent reading from the SDRAM in bit plane order results in contiguous, full frame bit planes of data. The fourth stage is read buffer 2305, which provides data continuously to the digital display device, even when the memory controller 2303 is temporarily busy and cannot provide data to the read buffer. This can occur during memory write operations, as well as for required memory overhead operations, such as refresh and calibration. The read buffer also converts from the 512 bit bus of the memory controller to the 128 bit bus 2314 of the digital display device. Other combinations of data widths could be used at each stage of the cascade, as will be apparent to those skilled in the art. Also double-data-rate (DDR) techniques may be used to reduce the number of interconnecting signals.
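The width ratios of this cascade can be checked with a few lines of illustrative Python (reference numerals follow FIG. 23; this is bookkeeping only, not a model of the hardware):

    T = 32     # transpose output: one 32-pixel bit plane block per clock (2310)
    W = 512    # write buffer output / memory controller bus width (2311)
    D = 128    # read buffer output bus width to the display device (2314)

    blocks_per_chunk = W // T   # 16 bit plane blocks gathered into each chunk
    words_per_chunk  = W // D   # each 512-bit memory word feeds 4 display-bus words
    print(blocks_per_chunk, words_per_chunk)   # prints: 16 4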

A Second Example Embodiment

FIG. 24 shows a second example embodiment of the disclosure. Referring to FIG. 24, this example system processes four 32-bit pixel streams 2409 and uses one rank of 64-bit DDR3 SDRAM 2404. Typically, multiple DDR3 memory chips are used in parallel, e.g. (4) 16-bit DDR3 chips for a 64-bit bus. Bit plane ordered data 2414 is supplied to a digital display device by a 128 bit bus. For this example, a 4096×2160 pixel display will be assumed. All signal processing is synchronized by one or more clock signals. Other data sizes and SDRAM sizes and types could be used, as will be apparent to those skilled in the art.

Referring to FIG. 24, the transformation of video ordered pixel data to bit plane data occurs in a cascade of operations. The first stage of the cascade operation is performed in the four 32-bit transpose modules 2401, each of which transposes video ordered pixels of 32 bits 2409 into bit plane blocks of 32 pixels 2410, for a total of 128 bits of bit plane block data per pixel clock 2406. The second stage of the cascade operation is performed by write buffer 2402, which gathers four 32-pixel wide bit plane blocks 2410 into 512-pixel wide bit plane chunks 2411. Each of the four inputs to the write buffer is 32 bits wide, matching the output of the transpose modules. The output from the write buffer is 512 bits wide 2411, matching the input of memory controller 2403. The memory controller, in conjunction with the DDR3 SDRAM 2404, performs the third stage of the cascade operation. Using a scatter writing and reading scheme, the memory controller scatter-writes bit plane chunks of width W=512 from the write buffer 2402 to SDRAM 2404 in chunk order, such that subsequent reading from the SDRAM in bit plane order results in contiguous, full frame bit planes of data. The fourth stage is read buffer 2405, which provides data continuously to the digital display device, even when the memory controller 2403 is temporarily busy and cannot provide data to the read buffer. This can occur during memory write operations, as well as for required memory overhead operations, such as refresh and calibration. The read buffer also converts from the 512 bit bus of the memory controller to the 128 bit bus 2414 of the digital display device. Other combinations of data widths could be used at each stage of the cascade, as will be apparent to those skilled in the art. Also double-data-rate (DDR) techniques may be used to reduce the number of interconnecting signals.

In general, the write buffer is sized as small as possible to gather transpose blocks into bit plane chunks. An additional constraint on the minimum logical buffer size for the write buffer is that low level SDRAM activities, such as refresh and periodic read calibration, can preempt reading from the write buffer; this can cause write buffer overflow, which must be avoided. Referring to FIG. 24, the write buffer accepts bit plane blocks 2410 with an aggregate width T=128, clocked at 300 MHz by Pixel Clock 2406. The write buffer outputs bit plane chunks 2411 with a width W=512, clocked at 233 MHz by Memory Clock 2407. The number of bit planes is N=32. The minimum size of a logical buffer in the write buffer is therefore N*W/T=128 bit plane blocks, or equivalently N=32 bit plane chunks. The time to write 128 blocks into the write buffer is tW=128/300 MHz, or about 427 ns. The time to read 32 chunks from the write buffer is tR=32/233 MHz, or about 137 ns. If tOH is the worst case duration of low level overhead operations, then tOH must be less than tW−tR, about 290 ns, to avoid write buffer overflow. Since DDR3 SDRAM has a worst case tOH of about 300 ns, the size of the logical buffers must be increased above the minimum needed for gathering bit plane blocks into bit plane chunks. Increasing the size of the logical buffers, usually by an integer factor (e.g. 2), gives extra time for interruptions from low level operations. In this example embodiment the write buffer size is doubled to allow for low level SDRAM operations.
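The sizing arithmetic above can be reproduced with the following illustrative Python (the figures are exactly those given in this paragraph; no hardware behavior is modeled):

    T       = 128       # aggregate transpose output width per pixel clock (bits)
    W       = 512       # bit plane chunk width (bits)
    N       = 32        # bit planes per pixel
    f_pixel = 300e6     # pixel clock (Hz)
    f_mem   = 233e6     # memory clock (Hz)

    blocks_per_buffer = N * W // T     # 128 bit plane blocks per logical buffer
    chunks_per_buffer = N              # 32 bit plane chunks per logical buffer

    t_write = blocks_per_buffer / f_pixel   # time to fill a logical buffer
    t_read  = chunks_per_buffer / f_mem     # time to drain a logical buffer
    t_slack = t_write - t_read              # margin left for SDRAM overhead

    # Prints roughly 426.7, 137.3 and 289.3 (ns); with worst case DDR3
    # overhead near 300 ns, the logical buffers are doubled as stated.
    print(round(t_write * 1e9, 1), round(t_read * 1e9, 1), round(t_slack * 1e9, 1))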

The DDR3 SDRAM 2404 includes four 2 Gbit DDR3 SDRAM parts, each organized as 128M×16 and operating at 933 MHz. DDR3 SDRAM with an I/O bus width of n bits has a prefetch buffer size of 8n, resulting in a burst of eight data words per memory access. For an SDRAM of width 64 operated in burst 8 mode, a block of 512 bits is therefore the smallest block that can be uniquely addressed. For the highest possible system performance, bit plane memory overhead must be minimized, so care must be taken in how bit planes are mapped into the SDRAM address space. The memory map must be defined in such a way that both write and read access to the SDRAM can occur without interruption when SDRAM bank and row address changes are required. A mapping that accomplishes this is shown in FIG. 25. Referring to FIG. 25, SDRAM address signals 2502 are divided into three classes: column address signals Column[9:0] 2503, bank address signals Bank[2:0] 2504, and row address signals Row[13:0] 2505. Bit plane chunk number Chunk[14:0], bit plane number Plane[4:0], and frame count Frame[3:0] drive the SDRAM address signals as follows: the least significant three column address signals, Column[2:0], are set to zeroes 2506; Column[6:3] is driven by Chunk[3:0] 2507; Column[9:7] is driven by Plane[2:0] 2508; Bank[1:0] is driven by Plane[4:3] 2509; Bank[2] is driven by Chunk[4] 2510; Row[9:0] is driven by Chunk[14:5] 2513; and Row[13:10] is driven by Frame[3:0] 2514.

FIG. 26 shows the mapping of bit plane number and chunk number to SDRAM locations that results from the connections of FIG. 25. Referring to FIG. 26, the intersection of the column labeled AddrH and the row labeled AddrL represents the memory location at address = AddrH + AddrL. Each entry is in the form ‘plane.chunk’, where plane is the bit plane number and chunk is the bit plane chunk number. For example, the intersection of AddrH=128 and AddrL=33 contains 10.01, meaning that chunk number 1 of bit plane 10 is stored at memory location 161.
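The FIG. 25 wiring and the FIG. 26 example can be checked together with a short Python sketch. The field assignments follow the paragraph before last; the flattening of row, bank, and 512-bit column into a single chunk location is an assumption chosen only to reproduce the AddrH/AddrL numbering of FIG. 26.

    def sdram_address(frame, plane, chunk):
        # Map Frame[3:0], Plane[4:0], Chunk[14:0] onto Row/Bank/Column per FIG. 25.
        column = ((chunk & 0xF) << 3) | ((plane & 0x7) << 7)          # Column[9:0], Column[2:0]=0
        bank   = ((plane >> 3) & 0x3) | (((chunk >> 4) & 0x1) << 2)   # Bank[2:0]
        row    = ((chunk >> 5) & 0x3FF) | ((frame & 0xF) << 10)       # Row[13:0]
        return row, bank, column

    # Cross-check against the FIG. 26 example (bit plane 10, chunk 1, frame 0).
    # Flattening into 512-bit chunk locations as (row, then bank, then
    # chunk-sized column) reproduces AddrH=128, AddrL=33, location 161.
    row, bank, column = sdram_address(frame=0, plane=10, chunk=1)
    assert (row * 8 + bank) * 128 + (column >> 3) == 161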

Assuming a 4096 wide by 2160 high display device, there are 8 bit plane chunks per bit plane per display row (i.e. video line). There are 17280 bit plane chunks per frame per bit plane. Each DDR3 SDRAM row is logically divided into 8 slots. Each slot holds 16 bit plane chunks, where each bit plane chunk contains 512 bits of a single bit plane. Each slot is used for a different bit plane. DDR3 SDRAM banks are grouped into sets: set 1 consists of banks 0 and 4; set 2 consists of banks 1 and 5; set 3 consists of banks 2 and 6; set 4 consists of banks 3 and 7. Each memory bank set is used for different bit planes. For all video frames, set 1 holds bit planes 0 to 7, set 2 holds bit planes 8 to 15, set 3 holds bit planes 16 to 23, and set 4 holds bit planes 24 to 31. For memory writes, the bank usage order is 0, 1, 2, 3, 0, 1, 2, 3 . . . until 32 bit plane blocks of every bit plane have been written, and then 4, 5, 6, 7, 4, 5, 6, 7 . . . for the next 32 bit plane blocks of every bit plane; the bank usage pattern then repeats. For memory reads, the bank usage order is 0, 4, 0, 4 . . . or 1, 5, 1, 5 . . . or 2, 6, 2, 6 . . . or 3, 7, 3, 7 . . . depending on which bit plane is being fetched. This memory bank usage pattern allows seamless memory bank changes (no overhead) for both reads and writes.
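A few illustrative Python lines confirm the chunk counts and the bank grouping implied by the FIG. 25 wiring (this is bookkeeping, not a model of the memory controller):

    width, height, chunk_bits = 4096, 2160, 512

    chunks_per_row   = width // chunk_bits       # 8 chunks per bit plane per display row
    chunks_per_plane = chunks_per_row * height   # 17280 chunks per bit plane per frame
    assert (chunks_per_row, chunks_per_plane) == (8, 17280)

    def bank_for(plane, chunk):
        # Bank implied by FIG. 25: set from Plane[4:3], half from Chunk[4].
        return ((plane >> 3) & 0x3) | (((chunk >> 4) & 0x1) << 2)

    # Bit planes 0-7 land only in bank set {0, 4}, planes 8-15 in {1, 5},
    # and so on, so a read of one bit plane alternates between two banks
    # while writes in chunk order cycle through four banks before reuse.
    assert {bank_for(p, c) for p in range(8) for c in range(32)} == {0, 4}
    assert {bank_for(p, c) for p in range(8, 16) for c in range(32)} == {1, 5}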

FIG. 27 shows the order in which bit plane chunks are written into the DDR3 SDRAM. Referring to FIG. 27, only 8 writes into the DDR3 SDRAM are performed before a bank change is required. Four DDR3 SDRAM banks are used in sequence before any bank is reused, and the time between successive writes to the same bank is used to hide the bank setup time.

FIG. 28 shows the order in which bit plane chunks are read from the DDR3 SDRAM. Referring to FIG. 28, sixteen reads from DDR3 SDRAM are performed before a bank change, allowing cycling between only two banks while providing adequate time to fully hide the bank setup time for reads.

DDR3 SDRAM write address generation is based on video counters (memory block, line and frame counters) that are advanced by incoming video syncs. DDR3 SDRAM read address generation is based on bit plane and display device data requests from the display device controller and on a read memory block counter that counts memory block reads since a display device controller request. The read memory block counter is reset upon a new display device controller request.
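A heavily simplified Python sketch of the write-side ordering is shown below. The counter names and the plane-innermost ordering are assumptions (chosen to be consistent with the eight-writes-per-bank pattern of FIG. 27), not a description of the actual counter hardware.

    CHUNKS_PER_ROW = 8    # 4096-pixel line divided into 512-bit chunks
    PLANES = 32

    def write_order(block, line, frame):
        # For one memory block position on one video line, emit the
        # (frame, plane, chunk) tuples that the FIG. 25 mapping would
        # translate into an SDRAM row/bank/column for each chunk write.
        chunk = line * CHUNKS_PER_ROW + block
        for plane in range(PLANES):
            yield frame, plane, chunk

    # First writes of frame 0, line 0, block 0: all 32 planes of chunk 0.
    assert list(write_order(0, 0, 0))[:2] == [(0, 0, 0), (0, 1, 0)]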

The read buffer is sized to hold enough data to deliver a continuous stream of bit plane data to the display device while being interrupted repeatedly by bursts of write data into the SDRAM. For maximum performance, the read buffer must be large enough that one read buffer's worth of data delivered to the display device spans at least one input video line. This allows the input video horizontal blanking time to be used to prevent the read buffer from underflowing during maximum DDR3 SDRAM bandwidth operation.
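The sizing rule can be illustrated with rough Python arithmetic. The line width and pixel clock follow the second example embodiment; the display bus clock is an assumption made only to show the calculation, since the disclosure does not specify one.

    line_pixels  = 4096       # pixels per input video line (second embodiment)
    pixel_rate   = 300e6      # pixel clock per stream (Hz), per the second embodiment
    streams      = 4          # four parallel pixel streams
    line_time    = line_pixels / (pixel_rate * streams)   # time to receive one line

    display_bus_bits = 128
    display_rate     = 400e6  # assumed display bus clock (Hz)

    # One read buffer's worth of display data must last at least line_time.
    min_buffer_bits = line_time * display_rate * display_bus_bits
    print(round(min_buffer_bits))   # minimum buffer size in bits under these assumptions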

REFERENCES

U.S. Patent Documents

  • U.S. Pat. No. 5,255,100
  • U.S. Pat. No. 5,663,749
  • U.S. Pat. No. 6,118,500

Other Documents

  • Paeth, A. W. (1986), “A Fast Algorithm for General Raster Rotation,” Proceedings, Graphics Interface '86, Canadian Information Processing Society, Vancouver, pp. 77-81.
  • Morton, G. M. (1966), A Computer Oriented Geodetic Data Base; and a New Technique in File Sequencing, Technical Report, Ottawa, Canada: IBM Ltd.

Claims

1. A data processing system comprising:

a transpose system configured to receive video ordered pixel data and to generate bit plane blocks of data using a shift register that delays each pixel of a plurality of pixels by a different amount;
a write buffer configured to receive the bit plane blocks of data using a pixel clock having a first clock rate and to generate bit plane data frames using a memory clock having a second clock rate;
a memory controller configured to receive the bit plane data frames and to write a first bit plane data frame to a memory while simultaneously reading a second bit plane data frame from memory; and
a read buffer configured to receive the second bit plane data frame and to convert the second bit plane data frame to a digital display device format;
wherein the memory controller is configured to use a scatter reading and writing process to write the first bit plane data frame to the memory while simultaneously reading the second bit plane data frame from the memory to perform a third stage of a cascade operation to write bit plane chunks from the write buffer to memory in chunk order and to subsequently read contiguous, full frame bit planes of data from the memory in a bit plane order.

2. The data processing system of claim 1 wherein the memory is no larger than two frames of video data and the memory controller arbitrates between writes and reads to the memory.

3. The data processing system of claim 1 wherein the memory comprises a Double Data Rate 3 Synchronous Dynamic Random Access Memory block.

4. The data processing system of claim 1 wherein the input to the write buffer is wider than the output of the write buffer.

5. The data processing system of claim 4 wherein the memory controller assigns a lowest priority to bit plane data writes, a medium priority to bit plane data reads and a highest priority to low level memory functions, and arbitrates between writes and reads to the memory based on priority.

6. The data processing system of claim 1 wherein the read buffer is coupled to a memory clock and to a display clock, and the memory controller arbitrates operations involving the memory based on a priority, where the priority is in inverse order to a length of time each operation takes.

7. The data processing system of claim 6 wherein the first clock rate and the second clock rate are different from a third clock rate of the display clock.

8. The data processing system of claim 1 wherein the transpose system further comprises:

a plurality of delays, each delay having a different delay value; and
a barrel shifting circuit coupled to the plurality of delays.

9. The data processing system of claim 1 wherein the transpose system further comprises:

a first plurality of delays;
a barrel shifting circuit having an input coupled to the first plurality of delays and an output; and
a second plurality of delays coupled to the barrel shifting circuit.

10. The data processing system of claim 1 wherein the transpose system further comprises:

a first delay having a delay value of zero;
a second delay having a delay value of one; and
a barrel shifting circuit coupled to the first delay and the second delay.

11. The data processing system of claim 1 wherein the write buffer is coupled to a pixel clock and to a memory clock, and the read buffer is coupled to the memory clock and to a display clock.

12. The data processing system of claim 7 wherein the first clock rate is different from a display clock rate, and the display clock rate is different from the second clock rate.

13. The data processing system of claim 1 wherein the transpose buffer comprises a plurality of parallel transpose buffers that each receive a separate stream of video ordered pixel data and the memory controller performs a third stage of a cascade operation.

14. In a data processing system having a transpose system configured to receive video ordered pixel data and to generate bit plane blocks of data, a write buffer configured to receive the bit plane blocks of data and to generate bit plane data frames, a memory controller configured to receive the bit plane data frames and to write a first bit plane data frame to a memory while simultaneously reading a second bit plane data frame from memory, a read buffer configured to receive the second bit plane data frame and to convert the second bit plane data frame to a digital display device format, wherein the memory is no larger than two frames of video data, wherein the memory comprises a Double Data Rate 3 Synchronous Dynamic Random Access Memory block, wherein the memory controller uses a scatter reading and writing process to write the first bit plane data frame to the memory while simultaneously reading the second bit plane data frame from the memory, wherein the write buffer is coupled to a first pixel clock and to a second memory clock, wherein a first pixel clock rate is different from a second memory clock rate, wherein the read buffer is coupled to a first memory clock and to a second display clock, wherein a first memory clock rate is different from a second display clock rate, wherein the transpose system includes a plurality of delays, a barrel shifting circuit coupled to the plurality of delays, and a second plurality of delays coupled to the barrel shifting circuit, wherein the plurality of delays includes a first delay having a delay value of zero and a second delay having a delay value of one, wherein the write buffer is coupled to a first pixel clock and to a second memory clock, and the read buffer is coupled to the second memory clock and to a third display clock, and wherein the transpose buffer comprises a plurality of parallel transpose buffers that each receive a separate stream of video ordered pixel data, a method, comprising:

receiving the video ordered pixel data at the transpose system and generating the bit plane blocks of data with the transpose system;
receiving the bit plane blocks of data at the write buffer and generating the bit plane data frames with the write buffer;
receiving the bit plane data frames at the memory controller and writing the first bit plane data frame to the memory while simultaneously reading the second bit plane data frame from memory using the memory controller;
receiving the second bit plane data frame at a read buffer and converting the second bit plane data frame to a digital display device format using the read buffer;
wherein writing the first bit plane data frame to the memory while simultaneously reading the second bit plane data frame from memory using the memory controller comprises using the scatter reading and writing process to write the first bit plane data frame to the memory while simultaneously reading the second bit plane data frame from the memory;
receiving the first pixel clock signal and the second memory clock signal at the write buffer;
wherein the first pixel clock signal is received at a different rate than the second memory clock rate signal; and
wherein generating the bit plane blocks of data with the transpose system comprises:
delaying different bits of the video ordered pixel data with the plurality of delays; and
shifting the different bits of delayed video ordered pixel data with the barrel shifting circuit.

15. A data processing system comprising:

a transpose system configured to receive video ordered pixel data and to generate bit plane blocks of data using a shift register that delays each pixel of a plurality of pixels by a different amount;
a write buffer configured to receive the bit plane blocks of data using a pixel clock having a first clock rate and to generate bit plane data frames using a memory clock having a second clock rate;
a memory controller configured to receive the bit plane data frames and to write a first bit plane data frame to a memory while simultaneously reading a second bit plane data frame from memory; and
a read buffer configured to receive the second bit plane data frame and to convert the second bit plane data frame to a digital display device format;
wherein the read buffer is coupled to a memory clock and to a display clock, and the memory controller arbitrates operations involving the memory based on a priority, where the priority is in inverse order to a length of time each operation takes;
wherein the first clock rate and the second clock rate are different from a third clock rate of the display clock.

16. The data processing system of claim 15 wherein the memory is no larger than two frames of video data and the memory controller arbitrates between writes and reads to the memory.

17. The data processing system of claim 15 wherein the memory comprises a Double Data Rate 3 Synchronous Dynamic Random Access Memory block.

18. The data processing system of claim 15 wherein the input to the write buffer is wider than the output of the write buffer.

19. The data processing system of claim 18 wherein the memory controller assigns a lowest priority to bit plane data writes, a medium priority to bit plane data reads and a highest priority to low level memory functions, and arbitrates between writes and reads to the memory based on priority.

20. The data processing system of claim 15 wherein the transpose system further comprises:

a plurality of delays, each delay having a different delay value; and
a barrel shifting circuit coupled to the plurality of delays.
References Cited
U.S. Patent Documents
5255100 October 19, 1993 Urbanus
5663749 September 2, 1997 Farris et al.
6118500 September 12, 2000 Kunzman
6658583 December 2, 2003 Kudo
20020021449 February 21, 2002 Demarest
20020154129 October 24, 2002 Goyins
20030020926 January 30, 2003 Miron
20040169698 September 2, 2004 Miyajima
20050001928 January 6, 2005 Takagi
20050057479 March 17, 2005 Richards
20050231650 October 20, 2005 Kobori
20080151195 June 26, 2008 Pacheco
20080297525 December 4, 2008 Rai
20110074792 March 31, 2011 Li
20120268655 October 25, 2012 MacInnis
Foreign Patent Documents
0 530 759 March 1993 EP
0 827 129 March 1998 EP
Other references
  • European Patent Office; International Search Report and Written Opinion; PCT Application No. PCT/US2015/020202; dated May 12, 2015.
  • Morton, G.M.; A Computer Oriented Geodetic Data Base; And a New Technique in File Sequencing; IBM Ltd. Technical Report, Ottawa, Canada, Mar. 1, 1966.
  • Paeth, A.W.; A Fast Algorithm for General Raster Rotation; Computer Graphics Laboratory, Dec. 31, 1986.
Patent History
Patent number: 9858902
Type: Grant
Filed: Mar 12, 2015
Date of Patent: Jan 2, 2018
Patent Publication Number: 20150262557
Assignee: BRASS ROOTS TECHNOLOGIES, LLC (Plano, TX)
Inventors: Matthew John Fritz (Salida, CO), Bradley William Walker (Dallas, TX)
Primary Examiner: Ke Xiao
Assistant Examiner: Kim-Thanh T Tran
Application Number: 14/656,143
Classifications
Current U.S. Class: Pulse Or Interrupted Continuous Wave Modulator (332/106)
International Classification: G09G 5/395 (20060101); H04N 5/937 (20060101); G09G 5/393 (20060101);