SCALABILITY OF MULTI-DIRECTIONAL VIDEO STREAMING
Aspects of the present disclosure provide techniques for reducing latency and improving image quality of a viewport extracted from multi-directional video communications. According to such techniques, first streams of coded video data are received from a source. The first streams include coded data for each of a plurality of tiles representing a multi-directional video, where each tile corresponds to a predetermined spatial region of the multi-directional video, and at least one tile of the plurality of tiles in the first streams contains a current viewport location at a receiver. The techniques include decoding the first streams and displaying the tile containing the current viewport location. When the viewport location at the receiver changes to include a new tile of the plurality of tiles, the techniques include retrieving and decoding first streams for the new tile, displaying the decoded content for the changed viewport location, and transmitting the changed viewport location to the source.
The present disclosure relates to coding techniques for multi-directional imaging applications.
Some modern imaging applications capture image data from multiple directions about a camera. Some cameras pivot during image capture, which allows a camera to capture image data across an angular sweep that expands the camera's effective field of view. Some other cameras have multiple imaging systems that capture image data in several different fields of view. In either case, an aggregate image may be created that merges image data captured from these multiple views.
A variety of rendering applications are available for multi-directional content. One rendering application involves extraction and display of a subset of the content contained in a multi-directional image. For example, a viewer may employ a head mounted display and change the orientation of the display to identify a portion of the multi-directional image in which the viewer is interested. Alternatively, a viewer may employ a stationary display and identify a portion of the multi-directional image in which the viewer is interested through user interface controls. In these rendering applications, a display device extracts a portion of image content from the multi-directional image (called a “viewport” for convenience) and displays it. The display device would not display other portions of the multi-directional image that are outside an area occupied by the viewport.
In communication applications, aggregate source image data at a transmitter exceeds the data that is needed to display a rendering of a viewport at a receiver. Coding techniques for transmitting source data may account for a current viewport of the receiving rendering device. However, when accounting for a moving viewport, these coding techniques incur coding and transmission latency and coding inefficiency.
Aspects of the present disclosure provide techniques for reducing latency and improving image quality of a viewport extracted from multi-directional video communications. According to such techniques, first streams of coded video data are received from a source. The first streams include coded data for each of a plurality of tiles representing a multi-directional video, where each tile corresponds to a predetermined spatial region of the multi-directional video, and at least one tile of the plurality of tiles in the first streams contains a current viewport location at a receiver. The techniques include decoding the first streams corresponding to the at least one tile containing the current viewport location, and displaying the decoded content for the current viewport location. When the viewport location at the receiver changes to include a new tile of the plurality of tiles, the techniques further include retrieving first streams for the new tile, decoding the retrieved first streams, displaying the decoded content for the changed viewport location, and transmitting information representing the changed viewport location to the source.
The sink terminal 120 may determine a viewport location in a three-dimensional space represented by the multi-directional image. The sink terminal 120 may select a portion of decoded video to be displayed, for example, based on the terminal's orientation in free space.
The network 130 represents any number of computer and/or communication networks that extend from the source terminal 110 to the sink terminal 120. The network 130 may include one or a combination of circuit-switched and/or packet-switched communication networks. The network 130 may communicate data between the source terminal 110 and the sink terminal 120 by any number of wireline and/or wireless communication media. The architecture and operation of the network 130 is immaterial to the present discussion unless otherwise noted herein.
Aspects of the present disclosure may apply video compression techniques according to any of a number of coding protocols. For example, the source terminal 110 (
In an aspect, individual frames of multi-directional content may be parsed into individual spatial regions, herein called “tiles”, and coded as independent data streams.
In an aspect, the tiles described here may be a special case of the tiles used in some standards, such as HEVC. In this aspect, the tiles used herein may be “motion constrained tile sets,” where all frames are segmented using the exact same tile partitioning, and each tile in every frame is permitted to use prediction only from co-located tiles in other frames. Filtering in the decoder loop may also be disallowed across tiles, providing decoding independence between tiles.
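This constraint amounts to a simple membership test. The following is a minimal sketch, not drawn from the disclosure, of the rule that a motion-constrained tile may predict only from its co-located tile in a reference frame; the tile identifiers are hypothetical:

```python
def reference_allowed(current_tile_id: int, reference_tile_id: int) -> bool:
    """Motion-constrained tile sets: a block in one tile may use prediction
    only from the co-located tile (same partition index) in another frame."""
    return current_tile_id == reference_tile_id

print(reference_allowed(3, 3))  # True: co-located tile in a reference frame
print(reference_allowed(3, 4))  # False: prediction would cross a tile border
```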
The coded tile 410 also may contain a number of differential codings 460-480, each coded differentially with respect to the coded data of the tier 0 representation and each having a bandwidth tied to the bandwidth of another bandwidth tier. Thus, in an example where the tier 0 coding is generated at a 500 Kbps representation and the tier 1 coding is generated at a 2 Mbps representation, the tier 1 differential coding 460 may be coded at a 1.5 Mbps representation (1.5 Mbps = 2 Mbps − 500 Kbps). The other differential codings 470, 480 may have data rates that match the differences between the data rates of their base tiers 440, 450 and the data rate of the tier 0 coding 420. In an aspect, elements of the differential codings 460, 470, 480 may be coded predictively using content from a corresponding chunk of the tier 0 coding as a prediction reference; in such an embodiment, the differential codings 460, 470, 480 may be generated as enhancement layers according to a scalable coding protocol in which tier 0 serves as a base layer for those encodings.
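The differential-rate arithmetic above can be illustrated with a short Python sketch; only the 500 Kbps and 2 Mbps figures come from the example in the text, while the tier 2 and tier 3 rates below are assumed values:

```python
# Hypothetical per-tile bit rates (bits per second); tiers 0 and 1 follow
# the 500 Kbps / 2 Mbps example above, tiers 2 and 3 are assumed.
tier_rates = {0: 500_000, 1: 2_000_000, 2: 4_000_000, 3: 8_000_000}

# Each differential coding is coded with respect to tier 0, so its rate is
# approximately its base tier's rate minus the tier 0 rate.
differential_rates = {tier: rate - tier_rates[0]
                      for tier, rate in tier_rates.items() if tier != 0}

print(differential_rates)  # {1: 1500000, 2: 3500000, 3: 7500000}
```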
The codings 420-480 of the tile are shown as partitioned into individual chunks (e.g., chunks 420.1-420.N for tier 0 420, chunks 430.1-430.N for tier 1 430, etc.). Each chunk may be referenced by its own network identifier. During operation, a client device 120 (
The operations illustrated in
In an embodiment, a sink terminal 120 may identify a location of the current viewport by identifying a spatial location within the multiview image at which the viewport is located, for example, by identifying its location within a coordinate space defined for the image (see,
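The disclosure does not fix a particular mapping from viewport coordinates to tiles; the following is a minimal sketch under the assumption of a frame divided into a uniform tile grid, with the viewport given as a bounding box in the frame's coordinate space:

```python
def tiles_for_viewport(vp_x, vp_y, vp_w, vp_h, frame_w, frame_h, cols, rows):
    """Return the (col, row) indices of every tile overlapped by a viewport.

    Assumes a uniform tile grid over a single 2D image plane; wrap-around at
    the 360-degree seam is ignored for brevity.
    """
    tile_w = frame_w / cols
    tile_h = frame_h / rows
    first_col = int(vp_x // tile_w)
    last_col = int((vp_x + vp_w - 1) // tile_w)
    first_row = int(vp_y // tile_h)
    last_row = int((vp_y + vp_h - 1) // tile_h)
    return [(c, r)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]

# A viewport straddling two horizontally adjacent tiles:
print(tiles_for_viewport(900, 200, 400, 300, 3840, 1920, 4, 2))  # [(0, 0), (1, 0)]
```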
In the example of
If the source terminal 110 provides the media content of the differential tier (msg. 650) before chunk Y must be rendered, the sink terminal 120 may render chunk Y (box 660) using content developed from the content provided in messages 630 and 650. If not, the sink terminal 120 may render chunk Y (box 660) using content developed from the tier 0 level of service (msg. 630).
The example of
The example of
A switch from differential tiers to higher quality tiers (e.g., tier 3) may occur for chunks for which download requests are made after the viewport switch occurs. Thus, when a viewport changes from one tile to another, a sink terminal 120 may determine what tiers to request for the new tile from its operating state and the transmission latency in the system. In some cases, there will be a transitional period after the viewport moves and before the sink terminal can render the new viewport location at a high quality of service (such as tier 3 for chunk Y+3 and later in
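A sketch of such a tier-selection rule follows; the tier names and the latency model (a fixed number of chunk durations between request and render) are assumptions for illustration, not the disclosure's method:

```python
LOW, DIFFERENTIAL, HIGH = "tier-low", "tier-diff", "tier-high"

def tiers_for_chunk(tile_in_viewport, chunks_since_switch, latency_chunks):
    """Choose the tiers to request for one tile's next chunk.

    tile_in_viewport:    True if the tile currently contains the viewport.
    chunks_since_switch: chunks requested since the viewport moved onto it.
    latency_chunks:      request-to-render latency, in chunk durations.
    """
    if not tile_in_viewport:
        return [LOW]                    # background coverage only
    if chunks_since_switch < latency_chunks:
        return [LOW, DIFFERENTIAL]      # transitional period: cheap upgrade
    return [HIGH]                       # steady state: full quality

print(tiers_for_chunk(True, 1, 3))  # ['tier-low', 'tier-diff']
print(tiers_for_chunk(True, 5, 3))  # ['tier-high']
```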
As discussed, a sink terminal 120 (
In an aspect, a lower quality tier may be provided for all tiles. In another aspect, a lower quality tier may be provided for only a portion of the frame 800. For example, a lower quality tier may be provided only for 180 degrees of view centered on the current viewport (instead of 360 degrees), or the lower quality tier may be provided only in areas of frame 800 where the viewport is likely to move next.
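As an illustration of the 180-degree option, the following sketch selects the tiles whose centers fall within ±90 degrees of the viewport's yaw; the uniform division of 360 degrees of yaw into equal tiles is an assumption:

```python
def tiles_in_halfsphere(viewport_yaw_deg, tile_count):
    """Select the tiles covering the 180 degrees centered on the viewport.

    Assumes tile_count tiles evenly spanning 360 degrees of yaw; a tile is
    kept when its center lies within +/-90 degrees of the viewport yaw.
    """
    tile_span = 360.0 / tile_count
    selected = []
    for i in range(tile_count):
        center = (i + 0.5) * tile_span
        # shortest angular distance, accounting for wrap-around at 360
        delta = abs((center - viewport_yaw_deg + 180.0) % 360.0 - 180.0)
        if delta <= 90.0:
            selected.append(i)
    return selected

print(tiles_in_halfsphere(10.0, 8))  # [0, 1, 6, 7]: half the sphere
```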
In an aspect, frame 800 may be encoded according to a layered coding protocol, where one tier is coded as a base layer, and other tiers are encoded as enhancement layers of the base layer. An enhancement layer may be predicted from one or more lower layers. For example, a first enhancement layer may be predicted from the base layer, and a second, higher enhancement layer may be predicted from either the base layer or from the first, lower enhancement layer.
An enhancement layer may be differentially or predictively coded from one or more lower layers. Non-enhancement layers, such as a base layer, may be encoded independently of other layers. Reconstruction at a decoder of a differentially coded layer will require both the encoded data segment of the differentially coded layer and the segment(s) of the layer(s) from which it is predicted. In the case of a predictively coded layer, sending that layer may include sending both the discrete encoded data segment of the predictively coded layer and the discrete encoded data segment of the layer(s) used as a prediction reference. In an example of differential layered coding of frame 800, a lower base layer may be sent to sink terminal 120 for all tiles, while discrete data segments for a higher differential layer (that is coded using predictions from the base layer) may be sent only for tiles 810.5 and 810.6, because the viewport 830 is included in those tiles.
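The segment-selection rule in the frame 800 example might be sketched as follows; the tile identifiers mirror those above, while the total tile count and the segment labels are hypothetical:

```python
# Assumed: eight tiles named 810.1 .. 810.8; the viewport 830 spans two.
all_tiles = [f"810.{i}" for i in range(1, 9)]
viewport_tiles = {"810.5", "810.6"}

segments_to_send = []
for tile in all_tiles:
    segments_to_send.append((tile, "base"))       # base layer for every tile
    if tile in viewport_tiles:
        # The enhancement layer is differentially coded against the base
        # layer, so the base segment above is also needed to reconstruct it.
        segments_to_send.append((tile, "enhancement"))

print(segments_to_send)
```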
The distribution server 1010 may include a storage system 1040 on which pre-encoded multi-directional videos are stored in a variety of tiers for download by the client device 1020. The distribution server 1010 may store several coded representations of a video content item, shown as tiers 1, 2, and 3, which have been coded with different coding parameters. The video content item includes a manifest file containing pointers to chunks of encoded video data for each tier.
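A manifest consistent with this description might look like the following sketch; the field names and URL scheme are hypothetical, not a documented format:

```python
# Hypothetical manifest layout: one entry per tile, each tier listing the
# network identifiers of its chunks.
manifest = {
    "video": "example-item",
    "tiles": {
        "tile-0": {
            "tier-1": ["https://cdn.example/t0/tier1/chunk{:03d}.bin".format(n)
                       for n in range(1, 4)],
            "tier-2": ["https://cdn.example/t0/tier2/chunk{:03d}.bin".format(n)
                       for n in range(1, 4)],
            "tier-3": ["https://cdn.example/t0/tier3/chunk{:03d}.bin".format(n)
                       for n in range(1, 4)],
        },
        # ... one entry per remaining tile
    },
}

# A client resolves (tile, tier, chunk index) to a download address:
print(manifest["tiles"]["tile-0"]["tier-2"][0])
```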
In the example of
The example of
Times A, B, C, and D are depicted in
In an aspect, multi-directional image data may include depth maps and/or occlusion information. Depth maps and/or occlusion information may be included as separate channel(s) and manifest 1050 may include references to these separate channel(s) for depth maps and/or occlusion information.
In an aspect, in steady state when a viewport is not moving, client 1020 may extract a viewport image from the high-quality reconstruction of tier 2. During a transitional period, when the viewport moves into a new spatial tile, client 1020 may extract a viewport image from the reconstructed combination of tier 1 and enhancement layer tier 3, and then return to a steady state by extracting a viewport image from tier 2 once tier 2 is again available at client 1020. An example of this is illustrated in Tables 1 and 2 for a viewport of client 1020 that jumps from viewport location 1130 to viewport location 1140 at time C. The tiers requested by client 1020 for each tile are listed in Table 1, and the tiers from which a viewport image is extracted are listed in Table 2.
Under the initial steady state condition during time period t1, the viewport is not moving and viewport location 1130 is fully contained in tile 1110.0. Tier 2, being the higher quality tier, may be requested by client 1020 from server 1010 for tile 1110.0 at time A, as indicated in Table 1. For tiles not included in the viewport at location 1130 (tiles 1110.1-1110.n), the lower quality and more highly compressed tier 1 is requested instead. Hence, tier 1 chunks are requested for time period t1 at time A for all tiles other than tile 1110.0. The viewport is then extracted from the reconstruction of tier 2 by client 1020 starting at time A.
At time B, the viewport has not yet moved, so the same tiers are requested by client 1020 for the same tiles as at time A, but the requests are for the specific chunks corresponding to time period t2. At time C, the viewport of client 1020 may jump from viewport location 1130 to location 1140. At time C, which falls somewhere between the beginning and end of t2, lower quality tier 1 has already been requested for the new location of the viewport, tile 1110.5, so a viewport can be extracted immediately from tier 1 at time C when the viewport moves. Also at time C, tier 3 may be requested, and as soon as it is available, the combination of tier 1 and enhancement layer tier 3 can be used for extracting a viewport image at client 1020. At time D, client 1020 may go back to a steady state by requesting tier 2 for tiles containing the viewport location and tier 1 for tiles not containing the viewport location.
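Since Tables 1 and 2 are not reproduced here, the following sketch restates the request schedule described in the prose for the jump from tile 1110.0 to tile 1110.5 at time C; the dictionary layout is illustrative only:

```python
schedule = {
    # time: (viewport tile, requests per tile group)
    "A": ("1110.0", {"1110.0": ["tier-2"], "all other tiles": ["tier-1"]}),
    "B": ("1110.0", {"1110.0": ["tier-2"], "all other tiles": ["tier-1"]}),
    # Viewport jumps to 1110.5 at C; its tier-1 chunk is already in hand,
    # so enhancement tier-3 is newly requested alongside tier-1.
    "C": ("1110.5", {"1110.5": ["tier-1", "tier-3"],
                     "all other tiles": ["tier-1"]}),
    # Steady state again: high-quality tier-2 for the viewport tile.
    "D": ("1110.5", {"1110.5": ["tier-2"], "all other tiles": ["tier-1"]}),
}

for time, (viewport_tile, requests) in schedule.items():
    print(time, viewport_tile, requests)
```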
The video decoder 1240 may invert coding operations performed by the video encoder 1230 to obtain a reconstructed picture from the coded video data. Typically, the coding processes applied by the video coder 1230 are lossy processes, which cause the reconstructed picture to possess various errors when compared to the original picture. The video decoder 1240 may reconstruct pictures of select coded pictures, which are designated as “reference pictures,” and store the decoded reference pictures in the reference picture store 1250. In the absence of transmission errors, the decoded reference pictures may replicate decoded reference pictures obtained by a decoder (not shown in
The predictor 1260 may select prediction references for new input pictures as they are coded. For each portion of the input picture being coded (called a “pixel block” for convenience), the predictor 1260 may select a coding mode and identify a portion of a reference picture that may serve as a prediction reference for the pixel block being coded. The coding mode may be an intra-coding mode, in which case the prediction reference may be drawn from a previously-coded (and decoded) portion of the picture being coded. Alternatively, the coding mode may be an inter-coding mode, in which case the prediction reference may be drawn from another previously-coded and decoded picture. In one aspect of layered coding, prediction references may be pixel blocks previously decoded from another layer, typically a layer lower than the layer currently being encoded. In the case of two layers that encode two different projection formats of multi-directional video, a function such as an image warp function may be applied to a reference image in one projection format at a first layer to predict a pixel block in a different projection format at a second layer.
In another aspect of a layered coding system, a differentially coded enhancement layer may be coded with restricted prediction references to enable seeking or layer/tier switching into the middle of an encoded enhancement layer chunk. In a first aspect, predictor 1260 may restrict the prediction references of every frame in an enhancement layer to frames of a base layer or other lower layer. When every frame of an enhancement layer is predicted without reference to other frames of the enhancement layer, a decoder may switch to the enhancement layer at any frame efficiently because previous enhancement layer frames will never be needed as prediction references. In a second aspect, predictor 1260 may require that every Nth frame (such as every other frame) within a chunk be predicted only from a base layer or other lower layer to enable seeking to every Nth frame within an encoded data chunk.
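A sketch of the second aspect's restriction follows; the reference labels are illustrative, and n = 2 corresponds to the "every other frame" example:

```python
def allowed_references(frame_index, n):
    """Every n-th enhancement-layer frame may reference only lower layers,
    making it a safe point for seeking or tier switching mid-chunk; other
    frames may also reference earlier enhancement-layer frames."""
    refs = ["lower-layer"]
    if frame_index % n != 0:
        refs.append("previous-enhancement-frame")
    return refs

for i in range(4):
    print(i, allowed_references(i, 2))
# frames 0 and 2 reference only the lower layer -> safe switch points
```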
When an appropriate prediction reference is identified, the predictor 1260 may furnish the prediction data to the video coder 1230. The video coder 1230 may code input video data differentially with respect to prediction data furnished by the predictor 1260. Typically, prediction operations and the differential coding operate on a pixel block-by-pixel block basis. Prediction residuals, which represent pixel-wise differences between the input pixel blocks and the prediction pixel blocks, may be subject to further coding operations to reduce bandwidth further.
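The pixel-wise residual computation can be shown in a few lines; the 4x4 sample values below are arbitrary:

```python
# Differential coding sketch: the residual is the per-pixel difference
# between an input pixel block and its prediction.
input_block = [[10, 12, 11, 13],
               [ 9, 11, 10, 12],
               [ 8, 10,  9, 11],
               [ 7,  9,  8, 10]]
prediction  = [[10, 11, 11, 12],
               [ 9, 10, 10, 11],
               [ 8,  9,  9, 10],
               [ 7,  8,  8,  9]]

residual = [[a - b for a, b in zip(in_row, pr_row)]
            for in_row, pr_row in zip(input_block, prediction)]
print(residual)  # mostly zeros and ones -> cheap to entropy-code
```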
As indicated, the coded video data output by the video coder 1230 should consume less bandwidth than the input data when transmitted and/or stored. The coding system 1200 may output the coded video data to an output device 1270, such as a transceiver, that may transmit the coded video data across a communication network 130 (
The transceiver 1270 also may receive viewport information from a decoding terminal (
The video sink 1340, as indicated, may consume decoded video generated by the decoding system 1300. Video sinks 1340 may be embodied by, for example, display devices that render decoded video. In other applications, video sinks 1340 may be embodied by computer applications, for example, gaming applications, virtual reality applications and/or video editing applications, that integrate the decoded video into their content. In some applications, a video sink may process the entire multi-directional field of view of the decoded video for its application but, in other applications, a video sink 1340 may process a selected sub-set of content from the decoded video. For example, when rendering decoded video on a flat panel display, it may be sufficient to display only a selected subset of the multi-directional video. In another application, decoded video may be rendered in a multi-directional format, for example, in a planetarium.
The transceiver 1310 also may send viewport information provided by the controller 1370, such as a viewport location and/or a preferred projection format, to the source of encoded video, such as terminal 1200 of
Controller 1370 may determine viewport information based on a viewport location. In one example, the viewport information may include just a viewport location, and the encoded video source may then use the location to identify which encoded layers to provide to decoding system 1300 for specific spatial tiles. In another example, viewport information sent from the decoding system may include specific requests for specific layers of specific tiles, leaving much of the viewport location mapping in the decoding system. In yet another example, viewport information may include a request for a particular projection format based on the viewport location.
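The three kinds of viewport information might be serialized as in the following sketch; every field name here is an assumption for illustration:

```python
import json

# 1) Location only: the source maps the location to layers and tiles itself.
viewport_location_only = {
    "type": "viewport-location",
    "yaw": 10.0, "pitch": -5.0, "width": 90.0, "height": 60.0,
}

# 2) Explicit requests: the receiver does the mapping and names tiers/tiles.
explicit_layer_request = {
    "type": "layer-request",
    "requests": [{"tile": "1110.5", "tier": 3}, {"tile": "1110.4", "tier": 1}],
}

# 3) Projection-format preference derived from the viewport location.
projection_format_request = {"type": "projection-request", "format": "cube-map"}

print(json.dumps(viewport_location_only))
```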
The principles of the present disclosure find application with a variety of projection formats of multi-directional images. In an aspect, one may convert between the various projection formats of
Coding of cube map images may occur in several ways. In one coding application, the cube map image 1530 may be coded directly, which includes coding of null regions 1537.1-1537.4 that do not have image content. The encoding techniques of
In other coding applications, the cube map image 1530 may be repacked to eliminate null regions 1537.1-1537.4 prior to coding, shown as image 1540. The techniques described in
In an aspect, cameras, such as the cameras 1410, 1510, and 1610 in
In an aspect, enhancement layer 1710 frames may also be predicted from previous enhancement layer frames, as indicated by optional dashed arrows in
In an aspect, a sink terminal may switch to a new layer or new tier on non-safe-switching frames when some decoded quality drift may be tolerated. A non-safe switching frame may be decoded without having access to the reference frames used for its prediction, and quality gradually gets worse as errors from incorrect predictions accumulate into what may be called quality drift. Error concealment techniques may be used to mitigate the quality drift due to switching at non-safe-switching enhancement layer frames. Example error concealment techniques include predicting from a frame similar to the missing reference frame, and periodic intra-refresh mechanisms. By tolerating some quality drift caused by switching at non-safe-switching frames, the latency can be reduced between moving a viewport and presenting images of the new viewport location.
In one aspect, multiple projection formats may be combined to form a better reconstruction of a region of interest (ROI) than can be produced from a single projection format. A reconstructed region of interest, ROIcombo, may be produced from a weighted sum of the encoded projections or may be produced from a filtered sum of the encoded projections. For example, the region of interest in the scene of
ROIcombo = f(ROI1, ROI2)
where f( ) is a function for combining two region of interest images; the first region of interest image, ROI1, may be, for example, the equirectangular region of interest image from ROI 1812, and the second region of interest image, ROI2, may be, for example, the cube map region of interest image from ROI 1822. If f( ) is a weighted sum,
ROIcombo = alpha * ROI1 + beta * ROI2
where alpha and beta are predetermined constants, and alpha+beta=1. In cases where pixel locations do not exactly correspond in the projection formats being combined, a projection format conversion function may be used, as in:
ROIcombo = alpha * PConv(ROI1) + beta * ROI2
where PConv( ) is a function that converts an image in a first projection format into a second projection format. For example, PConv( ) may simply be an up-sample or a down-sample function.
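Putting the pieces together, the following sketch computes ROIcombo = alpha * PConv(ROI1) + beta * ROI2, with PConv stubbed as a 2x nearest-neighbor up-sample (one of the simple conversions suggested above); the sample values are arbitrary:

```python
def pconv(roi):
    """Toy stand-in for projection conversion: 2x nearest-neighbor
    up-sampling of a list-of-rows image."""
    out = []
    for row in roi:
        wide = [p for p in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out

def combine(roi1, roi2, alpha=0.5, beta=0.5):
    """ROIcombo = alpha * PConv(ROI1) + beta * ROI2, per pixel."""
    roi1 = pconv(roi1)
    return [[alpha * a + beta * b for a, b in zip(r1, r2)]
            for r1, r2 in zip(roi1, roi2)]

roi1 = [[100, 110], [120, 130]]       # e.g., a low-resolution ROI
roi2 = [[102, 104, 112, 114],
        [104, 106, 114, 116],
        [122, 124, 132, 134],
        [124, 126, 134, 136]]         # e.g., the same ROI in another format
print(combine(roi1, roi2))
```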
In another aspect, the best projection format for encoding an entire multi-directional scene, such as for encoding a base layer, may be different than the best projection format for encoding only a region of interest, such as for encoding in an enhancement layer. Hence a multi-tiered encoding of the scene of
The foregoing discussion has described operation of the aspects of the present disclosure in the context of video coders and decoders. Commonly, these components are provided as electronic devices. Video decoders and/or controllers can be embodied in integrated circuits, such as application specific integrated circuits, field programmable gate arrays and/or digital signal processors. Alternatively, they can be embodied in computer programs that execute on camera devices, personal computers, notebook computers, tablet computers, smartphones or computer servers. Such computer programs include processor instructions and typically are stored in physical storage media such as electronic-, magnetic-, and/or optically-based storage devices, where they are read by a processor and executed. Decoders commonly are packaged in consumer electronics devices, such as smartphones, tablet computers, gaming systems, DVD players, portable media players and the like; and they also can be packaged in consumer software applications such as video games, media players, media editors, and the like. And, of course, these components may be provided as hybrid systems that distribute functionality across dedicated hardware components and programmed general-purpose processors, as desired.
Claims
1.-21. (canceled)
22. A video reception method, comprising:
- receiving a coded bitstream of multi-directional video of a scene including a first version of a frame in a first projection format and a second version of the frame in a second projection format;
- first decoding the first version to produce a first decoded image in the first projection format;
- second decoding the second version to produce a second decoded image in the second projection format;
- converting the first decoded image from the first projection format to the second projection format;
- combining the first decoded image in the second projection format with the second decoded image in the second projection format to produce a combined image in the second projection format; and
- outputting the combined image as a decoded version of the frame.
23. The method of claim 22, wherein:
- the first projection format is an equirectangular projection; and
- the second projection format is a cube map projection.
24. The method of claim 22, wherein:
- the combined image represents a region of interest that corresponds to a subset of the first projection format and a subset of the second projection format; and
- pixels in the combined image are based on a weighted combination of corresponding pixels in the first decoded image with corresponding pixels in the second decoded image.
25. The method of claim 22, wherein:
- the coded bitstream was encoded with a layered coding technique;
- the first version is a base layer of the layered coding technique;
- the second version is an interlayer prediction residual for an enhancement layer of the layered coding technique;
- the converting predicts an enhancement layer output from the first version; and
- the combining combines the predicted enhancement layer with the interlayer prediction residual to produce a decoded enhancement layer output.
26. The method of claim 25, wherein:
- the base layer of the first version spatially includes the entire multi-directional scene; and
- the enhancement layer of the second version includes a spatial region of interest that is a subset of the entire multi-directional scene.
27. The method of claim 25, wherein:
- the first projection format is equirectangular projection and the base layer of the first version spatially includes the entire multi-directional scene; and
- the second projection format is a cube map projection and the enhancement layer of the second version includes one face of the cube map projection and is a subset of the entire multi-directional scene.
28. A video reception system, comprising:
- a receiver for receiving, from a source, a coded bitstream of multi-directional video of a scene including a first version of a frame in a first projection format and a second version of the frame in a second projection format;
- a decoder for decoding the coded bitstream;
- a controller to control the decoder to cause: first decoding the first version to produce a first decoded image in the first projection format; second decoding the second version to produce a second decoded image in the second projection format; converting the first decoded image from the first projection format to the second projection format; combining the first decoded image in the second projection format with the second decoded image in the second projection format to produce a combined image in the second projection format; and outputting the combined image as a decoded version of the frame.
29. The system of claim 28, wherein:
- the first projection format is an equirectangular projection; and
- the second projection format is a cube map projection.
30. The system of claim 28, wherein:
- the combined image represents a region of interest that corresponds to a subset of the first projection format and a subset of the second projection format; and
- pixels in the combined image are based on a weighted combination of corresponding pixels in the first decoded image with corresponding pixels in the second decoded image.
31. The system of claim 28, wherein:
- the coded bitstream was encoded with a layered coding technique;
- the first version is a base layer of the layered coding technique;
- the second version is an interlayer prediction residual for an enhancement layer of the layered coding technique;
- the converting predicts an enhancement layer output from the first version; and
- the combining combines the predicted enhancement layer with the interlayer prediction residual to produce a decoded enhancement layer output.
32. The system of claim 31, wherein:
- the base layer of the first version spatially includes the entire multi-directional scene; and
- the enhancement layer of the second version includes a spatial region of interest that is a subset of the entire multi-directional scene.
33. The system of claim 31, wherein:
- the first projection format is equirectangular projection and the base layer of the first version spatially includes the entire multi-directional scene; and
- the second projection format is a cube map projection and the enhancement layer of the second version includes one face of the cube map projection and is a subset of the entire multi-directional scene.
34. A non-transitory computer readable medium comprising instructions that, when executed by a processor, cause:
- receiving a coded bitstream of multi-directional video of a scene including a first version of a frame in a first projection format and a second version of the frame in a second projection format;
- first decoding the first version to produce a first decoded image in the first projection format;
- second decoding the second version to produce a second decoded image in the second projection format;
- converting the first decoded image from the first projection format to the second projection format;
- combining the first decoded image in the second projection format with the second decoded image in the second projection format to produce a combined image in the second projection format; and
- outputting the combined image as a decoded version of the frame.
35. The computer readable medium of claim 34, wherein:
- the first projection format is an equirectangular projection; and
- the second projection format is a cube map projection.
36. The computer readable medium of claim 34, wherein:
- the combined image represents a region of interest that corresponds to a subset of the first projection format and a subset of the second projection format; and
- pixels in the combined image are based on a weighted combination of corresponding pixels in the first decoded image with corresponding pixels in the second decoded image.
37. The computer readable medium of claim 34, wherein:
- the coded bitstream was encoded with a layered coding technique;
- the first version is a base layer of the layered coding technique;
- the second version is an interlayer prediction residual for an enhancement layer of the layered coding technique;
- the converting predicts an enhancement layer output from the first version; and
- the combining combines the predicted enhancement layer with the interlayer prediction residual to produce a decoded enhancement layer output.
38. The computer readable medium of claim 37, wherein:
- the base layer of the first version spatially includes the entire multi-directional scene; and
- the enhancement layer of the second version includes a spatial region of interest that is a subset of the entire multi-directional scene.
39. The computer readable medium of claim 37, wherein:
- the first projection format is equirectangular projection and the base layer of the first version spatially includes the entire multi-directional scene; and
- the second projection format is a cube map projection and the enhancement layer of the second version includes one face of the cube map projection and is a subset of the entire multi-directional scene.