SCALABILITY OF MULTI-DIRECTIONAL VIDEO STREAMING
Aspects of the present disclosure provide techniques for reducing latency and improving image quality of a viewport extracted from multi-directional video communications. According to such techniques, first streams of coded video data are received from a source. The first streams include coded data for each of a plurality of tiles representing a multi-directional video, where each tile corresponds to a predetermined spatial region of the multi-directional video, and at least one tile of the plurality of tiles in the first streams contains a current viewport location at a receiver. The techniques include decoding the first streams and displaying the tile containing the current viewport location. When the viewport location at the receiver changes to include a new tile of the plurality of tiles, the techniques include retrieving and decoding first streams for the new tile, displaying the decoded content for the changed viewport location, and transmitting the changed viewport location to the source.
The present disclosure relates to coding techniques for multi-directional imaging applications.
Some modern imaging applications capture image data from multiple directions about a camera. Some cameras pivot during image capture, which allows a camera to capture image data across an angular sweep that expands the camera's effective field of view. Some other cameras have multiple imaging systems that capture image data in several different fields of view. In either case, an aggregate image may be created that merges image data captured from these multiple views.
A variety of rendering applications are available for multi-directional content. One rendering application involves extraction and display of a subset of the content contained in a multi-directional image. For example, a viewer may employ a head mounted display and change the orientation of the display to identify a portion of the multi-directional image in which the viewer is interested. Alternatively, a viewer may employ a stationary display and identify a portion of the multi-directional image in which the viewer is interested through user interface controls. In these rendering applications, a display device extracts a portion of image content from the multi-directional image (called a “viewport” for convenience) and displays it. The display device would not display other portions of the multi-directional image that are outside an area occupied by the viewport.
In communication applications, aggregate source image data at a transmitter exceeds the data that is needed to display a rendering of a viewport at a receiver. Coding techniques for transmitting source data may account for a current viewport of the receiving rendering device. However, when accounting for a moving viewport, these coding techniques incur coding and transmission latency and coding inefficiency.
Aspects of the present disclosure provide techniques for reducing latency and improving image quality of a viewport extracted from multi-directional video communications. According to such techniques, first streams of coded video data are received from a source. The first streams include coded data for each of a plurality of tiles representing a multi-directional video, where each tile corresponds to a predetermined spatial region of the multi-directional video, and at least one tile of the plurality of tiles in the first streams contains a current viewport location at a receiver. The techniques include decoding the first streams corresponding to the at least one tile containing the current viewport location, and displaying the decoded content for the current viewport location. When the viewport location at the receiver changes to include a new tile of the plurality of tiles, the techniques further include retrieving first streams for the new tile, decoding the retrieved first streams, displaying the decoded content for the changed viewport location, and transmitting information representing the changed viewport location to the source.
The sink terminal 120 may determine a viewport location in a three-dimensional space represented by the multi-directional image. The sink terminal 120 may select a portion of decoded video to be displayed, for example, based on the terminal's orientation in free space.
The network 130 represents any number of computer and/or communication networks that extend from the source terminal 110 to the sink terminal 120. The network 130 may include one or a combination of circuit-switched and/or packet-switched communication networks. The network 130 may communicate data between the source terminal 110 and the sink terminal 120 by any number of wireline and/or wireless communication media. The architecture and operation of the network 130 is immaterial to the present discussion unless otherwise noted herein.
Aspects of the present disclosure may apply video compression techniques according to any of a number of coding protocols. For example, the source terminal 110 (
In an aspect, individual frames of multi-directional content may be parsed into individual spatial regions, herein called “tiles”, and coded as independent data streams.
In an aspect, the tiles described here may be a special case of the tiles used in some standards, such as HEVC. In this aspect, the tiles used herein may be “motion constrained tile sets,” where all frames are segmented using the exact same tile partitioning, and each tile in every frame is permitted to use prediction only from co-located tiles in other frames. Filtering in the decoder loop may also be disallowed across tiles, providing decoding independence between tiles.
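This constraint amounts to a simple membership test. The following is a minimal sketch, not drawn from the disclosure, of the rule that a motion-constrained tile may predict only from its co-located tile in a reference frame; the tile identifiers are hypothetical:

```python
def reference_allowed(current_tile_id: int, reference_tile_id: int) -> bool:
    """Motion-constrained tile sets: a block in one tile may use prediction
    only from the co-located tile (same partition index) in another frame."""
    return current_tile_id == reference_tile_id

print(reference_allowed(3, 3))  # True: co-located tile in a reference frame
print(reference_allowed(3, 4))  # False: prediction would cross a tile border
```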
The coded tile 410 also may contain a number of differential codings 460-480, each coded differentially with respect to the coded data of the tier 0 representation and each having a bandwidth tied to the bandwidth of another bandwidth tier. Thus, in an example where the tier 0 coding is generated at a 500 Kbps representation and the tier 1 coding is generated at a 2 Mbps representation, the tier 1 differential coding 460 may be coded at a 1.5 Mbps representation (1.5 Mbps = 2 Mbps − 500 Kbps). The other differential codings 470, 480 may have data rates that match the differences between the data rates of their base tiers 440, 450 and the data rate of the tier 0 coding 420. In an aspect, elements of the differential codings 460, 470, 480 may be coded predictively using content from a corresponding chunk of the tier 0 coding as a prediction reference; in such an embodiment, the differential codings 460, 470, 480 may be generated as enhancement layers according to a scalable coding protocol in which tier 0 serves as a base layer for those encodings.
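The differential-rate arithmetic above can be illustrated with a short Python sketch; only the 500 Kbps and 2 Mbps figures come from the example in the text, while the tier 2 and tier 3 rates below are assumed values:

```python
# Hypothetical per-tile bit rates (bits per second); tiers 0 and 1 follow
# the 500 Kbps / 2 Mbps example above, tiers 2 and 3 are assumed.
tier_rates = {0: 500_000, 1: 2_000_000, 2: 4_000_000, 3: 8_000_000}

# Each differential coding is coded with respect to tier 0, so its rate is
# approximately its base tier's rate minus the tier 0 rate.
differential_rates = {tier: rate - tier_rates[0]
                      for tier, rate in tier_rates.items() if tier != 0}

print(differential_rates)  # {1: 1500000, 2: 3500000, 3: 7500000}
```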
The codings 420-480 of the tile are shown as partitioned into individual chunks (e.g., chunks 420.1-420.N for tier 0 420, chunks 430.1-430.N for tier 1 430, etc.). Each chunk may be referenced by its own network identifier. During operation, a client device 120 (
The operations illustrated in
In an embodiment, a sink terminal 120 may identify a location of the current viewport by identifying a spatial location within the multiview image at which the viewport is located, for example, by identifying its location within a coordinate space defined for the image (see,
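The disclosure does not fix a particular mapping from viewport coordinates to tiles; the following is a minimal sketch under the assumption of a frame divided into a uniform tile grid, with the viewport given as a bounding box in the frame's coordinate space:

```python
def tiles_for_viewport(vp_x, vp_y, vp_w, vp_h, frame_w, frame_h, cols, rows):
    """Return the (col, row) indices of every tile overlapped by a viewport.

    Assumes a uniform tile grid over a single 2D image plane; wrap-around at
    the 360-degree seam is ignored for brevity.
    """
    tile_w = frame_w / cols
    tile_h = frame_h / rows
    first_col = int(vp_x // tile_w)
    last_col = int((vp_x + vp_w - 1) // tile_w)
    first_row = int(vp_y // tile_h)
    last_row = int((vp_y + vp_h - 1) // tile_h)
    return [(c, r)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]

# A viewport straddling two horizontally adjacent tiles:
print(tiles_for_viewport(900, 200, 400, 300, 3840, 1920, 4, 2))  # [(0, 0), (1, 0)]
```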
In the example of
If the source terminal 110 provides the media content of the differential tier (msg. 650) before chunk Y must be rendered, the sink terminal 120 may render chunk Y (box 660) using content developed from the content provided in messages 630 and 650. If not, the sink terminal 120 may render chunk Y (box 660) using content developed from the tier 0 level of service (msg. 630).
The example of
The example of
A switch from differential tiers to higher quality tiers (e.g., tier 3) may occur for chunks for which download requests are made after the viewport switch occurs. Thus, when a viewport changes from one tile to another, a sink terminal 120 may determine what tiers to request for the new tile from its operating state and the transmission latency in the system. In some cases, there will be a transitional period after the viewport moves and before the sink terminal can render the new viewport location at a high quality of service (such as tier 3 for chunk Y+3 and later in
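A sketch of such a tier-selection rule follows; the tier names and the latency model (a fixed number of chunk durations between request and render) are assumptions for illustration, not the disclosure's method:

```python
LOW, DIFFERENTIAL, HIGH = "tier-low", "tier-diff", "tier-high"

def tiers_for_chunk(tile_in_viewport, chunks_since_switch, latency_chunks):
    """Choose the tiers to request for one tile's next chunk.

    tile_in_viewport:    True if the tile currently contains the viewport.
    chunks_since_switch: chunks requested since the viewport moved onto it.
    latency_chunks:      request-to-render latency, in chunk durations.
    """
    if not tile_in_viewport:
        return [LOW]                    # background coverage only
    if chunks_since_switch < latency_chunks:
        return [LOW, DIFFERENTIAL]      # transitional period: cheap upgrade
    return [HIGH]                       # steady state: full quality

print(tiers_for_chunk(True, 1, 3))  # ['tier-low', 'tier-diff']
print(tiers_for_chunk(True, 5, 3))  # ['tier-high']
```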
As discussed, a sink terminal 120 (
In an aspect, a lower quality tier may be provided for all tiles. In another aspect, a lower quality tier may be provided for only a portion of the frame 800. For example, a lower quality tier may be provided only for 180 degrees of view centered on the current viewport (instead of 360 degrees), or the lower quality tier may be provided only in areas of frame 800 where the viewport is likely to move next.
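As an illustration of the 180-degree option, the following sketch selects the tiles whose centers fall within ±90 degrees of the viewport's yaw; the uniform division of 360 degrees of yaw into equal tiles is an assumption:

```python
def tiles_in_halfsphere(viewport_yaw_deg, tile_count):
    """Select the tiles covering the 180 degrees centered on the viewport.

    Assumes tile_count tiles evenly spanning 360 degrees of yaw; a tile is
    kept when its center lies within +/-90 degrees of the viewport yaw.
    """
    tile_span = 360.0 / tile_count
    selected = []
    for i in range(tile_count):
        center = (i + 0.5) * tile_span
        # shortest angular distance, accounting for wrap-around at 360
        delta = abs((center - viewport_yaw_deg + 180.0) % 360.0 - 180.0)
        if delta <= 90.0:
            selected.append(i)
    return selected

print(tiles_in_halfsphere(10.0, 8))  # [0, 1, 6, 7]: half the sphere
```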
In an aspect, frame 800 may be encoded according to a layered coding protocol, where one tier is coded as a base layer, and other tiers are encoded as enhancement layers of the base layer. An enhancement layer may be predicted from one or more lower layers. For example, a first enhancement layer may be predicted from the base layer, and a second, higher enhancement layer may be predicted from either the base layer or from the first, lower enhancement layer.
An enhancement layer may be differentially or predictively coded from one or more lower layers. Non-enhancement layers, such as a base layer, may be encoded independently of other layers. Reconstruction at a decoder of a differentially coded layer will require both the encoded data segment of the differentially coded layer and the segment(s) of the layer(s) from which it is predicted. In the case of a predictively coded layer, sending that layer may include sending both the discrete encoded data segment of the predictively coded layer and the discrete encoded data segment of the layer(s) used as a prediction reference. In an example of differential layered coding of frame 800, a lower base layer may be sent to sink terminal 120 for all tiles, while discrete data segments for a higher differential layer (that is coded using predictions from the base layer) may be sent only for tiles 810.5 and 810.6, because the viewport 830 is included in those tiles.
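The segment-selection rule in the frame 800 example might be sketched as follows; the tile identifiers mirror those above, while the total tile count and the segment labels are hypothetical:

```python
# Assumed: eight tiles named 810.1 .. 810.8; the viewport 830 spans two.
all_tiles = [f"810.{i}" for i in range(1, 9)]
viewport_tiles = {"810.5", "810.6"}

segments_to_send = []
for tile in all_tiles:
    segments_to_send.append((tile, "base"))       # base layer for every tile
    if tile in viewport_tiles:
        # The enhancement layer is differentially coded against the base
        # layer, so the base segment above is also needed to reconstruct it.
        segments_to_send.append((tile, "enhancement"))

print(segments_to_send)
```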
The distribution server 1010 may include a storage system 1040 on which pre-encoded multi-directional videos are stored in a variety of tiers for download by the client device 1020. The distribution server 1010 may store several coded representations of a video content item, shown as tiers 1, 2, and 3, which have been coded with different coding parameters. The video content item includes a manifest file containing pointers to chunks of encoded video data for each tier.
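A manifest consistent with this description might look like the following sketch; the field names and URL scheme are hypothetical, not a documented format:

```python
# Hypothetical manifest layout: one entry per tile, each tier listing the
# network identifiers of its chunks.
manifest = {
    "video": "example-item",
    "tiles": {
        "tile-0": {
            "tier-1": ["https://cdn.example/t0/tier1/chunk{:03d}.bin".format(n)
                       for n in range(1, 4)],
            "tier-2": ["https://cdn.example/t0/tier2/chunk{:03d}.bin".format(n)
                       for n in range(1, 4)],
            "tier-3": ["https://cdn.example/t0/tier3/chunk{:03d}.bin".format(n)
                       for n in range(1, 4)],
        },
        # ... one entry per remaining tile
    },
}

# A client resolves (tile, tier, chunk index) to a download address:
print(manifest["tiles"]["tile-0"]["tier-2"][0])
```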
In the example of
The example of
Times A, B, C, and D are depicted in
In an aspect, multi-directional image data may include depth maps and/or occlusion information. Depth maps and/or occlusion information may be included as separate channel(s) and manifest 1050 may include references to these separate channel(s) for depth maps and/or occlusion information.
In an aspect, in steady state when a viewport is not moving, client 1020 may extract a viewport image from the high-quality reconstruction of tier 2. During a transitional period, when the viewport moves into a new spatial tile, client 1020 may extract a viewport image from the reconstructed combination of tier 1 and enhancement layer tier 3, and then return to a steady state by extracting a viewport image from tier 2 once tier 2 is again available at client 1020. An example of this is illustrated in Tables 1 and 2 for a viewport of client 1020 that jumps from viewport location 1130 to viewport location 1140 at time C. The tiers requested by client 1020 for each tile are listed in Table 1, and the tiers from which a viewport image is extracted are listed in Table 2.
Under the initial steady state condition during time period t1, the viewport is not moving and viewport location 1130 is fully contained in tile 1110.0. Tier 2, being the higher quality tier, may be requested by client 1020 from server 1010 for tile 1110.0 at time A, as indicated in Table 1. For tiles not included in the viewport at location 1130 (tiles 1110.1-1110.n), the lower quality and more highly compressed tier 1 is requested instead. Hence, tier 1 chunks are requested for time period t1 at time A for all tiles other than tile 1110.0. The viewport is then extracted from the reconstruction of tier 2 by client 1020 starting at time A.
At time B, the viewport has not yet moved, so the same tiers are requested by client 1020 for the same tiles as at time A, but the requests are for the specific chunks corresponding to time period t2. At time C, the viewport of client 1020 may jump from viewport location 1130 to location 1140. At time C, which falls somewhere between the beginning and end of t2, lower quality tier 1 has already been requested for the new location of the viewport, tile 1110.5, so a viewport can be extracted immediately from tier 1 at time C when the viewport moves. Also at time C, tier 3 may be requested, and as soon as it is available, the combination of tier 1 and enhancement layer tier 3 can be used for extracting a viewport image at client 1020. At time D, client 1020 may go back to a steady state by requesting tier 2 for tiles containing the viewport location and tier 1 for tiles not containing the viewport location.
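Since Tables 1 and 2 are not reproduced here, the following sketch restates the request schedule described in the prose for the jump from tile 1110.0 to tile 1110.5 at time C; the dictionary layout is illustrative only:

```python
schedule = {
    # time: (viewport tile, requests per tile group)
    "A": ("1110.0", {"1110.0": ["tier-2"], "all other tiles": ["tier-1"]}),
    "B": ("1110.0", {"1110.0": ["tier-2"], "all other tiles": ["tier-1"]}),
    # Viewport jumps to 1110.5 at C; its tier-1 chunk is already in hand,
    # so enhancement tier-3 is newly requested alongside tier-1.
    "C": ("1110.5", {"1110.5": ["tier-1", "tier-3"],
                     "all other tiles": ["tier-1"]}),
    # Steady state again: high-quality tier-2 for the viewport tile.
    "D": ("1110.5", {"1110.5": ["tier-2"], "all other tiles": ["tier-1"]}),
}

for time, (viewport_tile, requests) in schedule.items():
    print(time, viewport_tile, requests)
```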
The video decoder 1240 may invert coding operations performed by the video encoder 1230 to obtain a reconstructed picture from the coded video data. Typically, the coding processes applied by the video coder 1230 are lossy processes, which cause the reconstructed picture to possess various errors when compared to the original picture. The video decoder 1240 may reconstruct pictures of select coded pictures, which are designated as “reference pictures,” and store the decoded reference pictures in the reference picture store 1250. In the absence of transmission errors, the decoded reference pictures may replicate decoded reference pictures obtained by a decoder (not shown in
The predictor 1260 may select prediction references for new input pictures as they are coded. For each portion of the input picture being coded (called a “pixel block” for convenience), the predictor 1260 may select a coding mode and identify a portion of a reference picture that may serve as a prediction reference for the pixel block being coded. The coding mode may be an intra-coding mode, in which case the prediction reference may be drawn from a previously-coded (and decoded) portion of the picture being coded. Alternatively, the coding mode may be an inter-coding mode, in which case the prediction reference may be drawn from another previously-coded and decoded picture. In one aspect of layered coding, prediction references may be pixel blocks previously decoded from another layer, typically a layer lower than the layer currently being encoded. In the case of two layers that encode two different projection formats of multi-directional video, a function such as an image warp function may be applied to a reference image in one projection format at a first layer to predict a pixel block in a different projection format at a second layer.
In another aspect of a layered coding system, a differentially coded enhancement layer may be coded with restricted prediction references to enable seeking or layer/tier switching into the middle of an encoded enhancement layer chunk. In a first aspect, predictor 1260 may restrict the prediction references of every frame in an enhancement layer to frames of a base layer or other lower layer. When every frame of an enhancement layer is predicted without reference to other frames of the enhancement layer, a decoder may switch to the enhancement layer at any frame efficiently because previous enhancement layer frames will never be needed as prediction references. In a second aspect, predictor 1260 may require that every Nth frame (such as every other frame) within a chunk be predicted only from a base layer or other lower layer to enable seeking to every Nth frame within an encoded data chunk.
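A sketch of the second aspect's restriction follows; the reference labels are illustrative, and n = 2 corresponds to the "every other frame" example:

```python
def allowed_references(frame_index, n):
    """Every n-th enhancement-layer frame may reference only lower layers,
    making it a safe point for seeking or tier switching mid-chunk; other
    frames may also reference earlier enhancement-layer frames."""
    refs = ["lower-layer"]
    if frame_index % n != 0:
        refs.append("previous-enhancement-frame")
    return refs

for i in range(4):
    print(i, allowed_references(i, 2))
# frames 0 and 2 reference only the lower layer -> safe switch points
```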
When an appropriate prediction reference is identified, the predictor 1260 may furnish the prediction data to the video coder 1230. The video coder 1230 may code input video data differentially with respect to prediction data furnished by the predictor 1260. Typically, prediction operations and the differential coding operate on a pixel block-by-pixel block basis. Prediction residuals, which represent pixel-wise differences between the input pixel blocks and the prediction pixel blocks, may be subject to further coding operations to reduce bandwidth further.
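The pixel-wise residual computation can be shown in a few lines; the 4x4 sample values below are arbitrary:

```python
# Differential coding sketch: the residual is the per-pixel difference
# between an input pixel block and its prediction.
input_block = [[10, 12, 11, 13],
               [ 9, 11, 10, 12],
               [ 8, 10,  9, 11],
               [ 7,  9,  8, 10]]
prediction  = [[10, 11, 11, 12],
               [ 9, 10, 10, 11],
               [ 8,  9,  9, 10],
               [ 7,  8,  8,  9]]

residual = [[a - b for a, b in zip(in_row, pr_row)]
            for in_row, pr_row in zip(input_block, prediction)]
print(residual)  # mostly zeros and ones -> cheap to entropy-code
```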
As indicated, the coded video data output by the video coder 1230 should consume less bandwidth than the input data when transmitted and/or stored. The coding system 1200 may output the coded video data to an output device 1270, such as a transceiver, that may transmit the coded video data across a communication network 130 (
The transceiver 1270 also may receive viewport information from a decoding terminal (
The video sink 1340, as indicated, may consume decoded video generated by the decoding system 1300. Video sinks 1340 may be embodied by, for example, display devices that render decoded video. In other applications, video sinks 1340 may be embodied by computer applications, for example, gaming applications, virtual reality applications and/or video editing applications, that integrate the decoded video into their content. In some applications, a video sink may process the entire multi-directional field of view of the decoded video for its application but, in other applications, a video sink 1340 may process a selected sub-set of content from the decoded video. For example, when rendering decoded video on a flat panel display, it may be sufficient to display only a selected subset of the multi-directional video. In another application, decoded video may be rendered in a multi-directional format, for example, in a planetarium.
The transceiver 1310 also may send viewport information provided by the controller 1370, such as a viewport location and/or a preferred projection format, to the source of encoded video, such as terminal 1200 of
Controller 1370 may determine viewport information based on a viewport location. In one example, the viewport information may include just a viewport location, and the encoded video source may then use the location to identify which encoded layers to provide to decoding system 1300 for specific spatial tiles. In another example, viewport information sent from the decoding system may include specific requests for specific layers of specific tiles, leaving much of the viewport location mapping in the decoding system. In yet another example, viewport information may include a request for a particular projection format based on the viewport location.
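The three kinds of viewport information might be serialized as in the following sketch; every field name here is an assumption for illustration:

```python
import json

# 1) Location only: the source maps the location to layers and tiles itself.
viewport_location_only = {
    "type": "viewport-location",
    "yaw": 10.0, "pitch": -5.0, "width": 90.0, "height": 60.0,
}

# 2) Explicit requests: the receiver does the mapping and names tiers/tiles.
explicit_layer_request = {
    "type": "layer-request",
    "requests": [{"tile": "1110.5", "tier": 3}, {"tile": "1110.4", "tier": 1}],
}

# 3) Projection-format preference derived from the viewport location.
projection_format_request = {"type": "projection-request", "format": "cube-map"}

print(json.dumps(viewport_location_only))
```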
The principles of the present disclosure find application with a variety of projection formats of multi-directional images. In an aspect, one may convert between the various projection formats of
Coding of cube map images may occur in several ways. In one coding application, the cube map image 1530 may be coded directly, which includes coding of null regions 1537.1-1537.4 that do not have image content. The encoding techniques of
In other coding applications, the cube map image 1530 may be repacked to eliminate null regions 1537.1-1537.4 prior to coding, shown as image 1540. The techniques described in
In an aspect, cameras, such as the cameras 1410, 1510, and 1610 in
In an aspect, enhancement layer 1710 frames may also be predicted from previous enhancement layer frames, as indicated by optional dashed arrows in
In an aspect, a sink terminal may switch to a new layer or new tier on non-safe-switching frames when some decoded quality drift may be tolerated. A non-safe switching frame may be decoded without having access to the reference frames used for its prediction, and quality gradually gets worse as errors from incorrect predictions accumulate into what may be called quality drift. Error concealment techniques may be used to mitigate the quality drift due to switching at non-safe-switching enhancement layer frames. Example error concealment techniques include predicting from a frame similar to the missing reference frame, and periodic intra-refresh mechanisms. By tolerating some quality drift caused by switching at non-safe-switching frames, the latency can be reduced between moving a viewport and presenting images of the new viewport location.
In one aspect, multiple projection formats may be combined to form a better reconstruction of a region of interest (ROI) than can be produced from a single projection format. A reconstructed region of interest, ROIcombo, may be produced from a weighted sum of the encoded projections or may be produced from a filtered sum of the encoded projections. For example, the region of interest in the scene of
ROIcombo = f(ROI1, ROI2)
where f( ) is a function for combining two region of interest images; the first region of interest image, ROI1, may be, for example, the equirectangular region of interest image from ROI 1812, and the second region of interest image, ROI2, may be, for example, the cube map region of interest image from ROI 1822. If f( ) is a weighted sum,
ROIcombo = alpha * ROI1 + beta * ROI2
where alpha and beta are predetermined constants, and alpha+beta=1. In cases where pixel locations do not exactly correspond in the projection formats being combined, a projection format conversion function may be used, as in:
ROIcombo = alpha * PConv(ROI1) + beta * ROI2
where PConv( ) is a function that converts an image in a first projection format into a second projection format. For example, PConv( ) may simply be an up-sample or a down-sample function.
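Putting the pieces together, the following sketch computes ROIcombo = alpha * PConv(ROI1) + beta * ROI2, with PConv stubbed as a 2x nearest-neighbor up-sample (one of the simple conversions suggested above); the sample values are arbitrary:

```python
def pconv(roi):
    """Toy stand-in for projection conversion: 2x nearest-neighbor
    up-sampling of a list-of-rows image."""
    out = []
    for row in roi:
        wide = [p for p in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out

def combine(roi1, roi2, alpha=0.5, beta=0.5):
    """ROIcombo = alpha * PConv(ROI1) + beta * ROI2, per pixel."""
    roi1 = pconv(roi1)
    return [[alpha * a + beta * b for a, b in zip(r1, r2)]
            for r1, r2 in zip(roi1, roi2)]

roi1 = [[100, 110], [120, 130]]       # e.g., a low-resolution ROI
roi2 = [[102, 104, 112, 114],
        [104, 106, 114, 116],
        [122, 124, 132, 134],
        [124, 126, 134, 136]]         # e.g., the same ROI in another format
print(combine(roi1, roi2))
```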
In another aspect, the best projection format for encoding an entire multi-directional scene, such as for encoding a base layer, may be different than the best projection format for encoding only a region of interest, such as for encoding in an enhancement layer. Hence a multi-tiered encoding of the scene of
The foregoing discussion has described operation of the aspects of the present disclosure in the context of video coders and decoders. Commonly, these components are provided as electronic devices. Video decoders and/or controllers can be embodied in integrated circuits, such as application specific integrated circuits, field programmable gate arrays and/or digital signal processors. Alternatively, they can be embodied in computer programs that execute on camera devices, personal computers, notebook computers, tablet computers, smartphones or computer servers. Such computer programs include processor instructions and typically are stored in physical storage media such as electronic-, magnetic-, and/or optically-based storage devices, where they are read by a processor and executed. Decoders commonly are packaged in consumer electronics devices, such as smartphones, tablet computers, gaming systems, DVD players, portable media players and the like; and they also can be packaged in consumer software applications such as video games, media players, media editors, and the like. And, of course, these components may be provided as hybrid systems that distribute functionality across dedicated hardware components and programmed general-purpose processors, as desired.
Claims
1.-21. (canceled)
22. A video reception method, comprising:
- receiving a coded bitstream of multi-directional video of a scene including a first version of a frame in a first projection format and a second version of the frame in a second projection format;
- first decoding the first version to produce a first decoded image in the first projection format;
- second decoding the second version to produce a second decoded image in the second projection format;
- converting the first decoded image from the first projection format to the second projection format;
- combining the first decoded image in the second projection format with the second decoded image in the second projection format to produce a combined image in the second projection format; and
- outputting the combined image as a decoded version of the frame.
23. The method of claim 22, wherein:
- the first projection format is an equirectangular projection; and
- the second projection format is a cube map projection.
24. The method of claim 22, wherein:
- the combined image represents a region of interest that corresponds to a subset of the first projection format and a subset of the second projection format; and
- pixels in the combined image are based on a weighted combination of corresponding pixels in the first decoded image with corresponding pixels in the second decoded image.
25. The method of claim 22, wherein:
- the coded bitstream was encoded with a layered coding technique;
- the first version is a base layer of the layered coding technique;
- the second version is an interlayer prediction residual for an enhancement layer of the layered coding technique;
- the converting predicts an enhancement layer output from the first version; and
- the combining combines the predicted enhancement layer with the interlayer prediction residual to produce a decoded enhancement layer output.
26. The method of claim 25, wherein:
- the base layer of the first version spatially includes the entire multi-directional scene; and
- the enhancement layer of the second version includes a spatial region of interest that is a subset of the entire multi-directional scene.
27. The method of claim 25, wherein:
- the first projection format is equirectangular projection and the base layer of the first version spatially includes the entire multi-directional scene; and
- the second projection format is a cube map projection and the enhancement layer of the second version includes one face of the cube map projection and is a subset of the entire multi-directional scene.
28. A video reception system, comprising:
- a receiver for receiving, from a source, a coded bitstream of multi-directional video of a scene including a first version of a frame in a first projection format and a second version of the frame in a second projection format;
- a decoder for decoding the coded bitstream;
- a controller to control the decoder to cause: first decoding the first version to produce a first decoded image in the first projection format; second decoding the second version to produce a second decoded image in the second projection format; converting the first decoded image from the first projection format to the second projection format; combining the first decoded image in the second projection format with the second decoded image in the second projection format to produce a combined image in the second projection format; and outputting the combined image as a decoded version of the frame.
29. The system of claim 28, wherein:
- the first projection format is an equirectangular projection; and
- the second projection format is a cube map projection.
30. The system of claim 28, wherein:
- the combined image represents a region of interest that corresponds to a subset of the first projection format and a subset of the second projection format; and
- pixels in the combined image are based on a weighted combination of corresponding pixels in the first decoded image with corresponding pixels in the second decoded image.
31. The system of claim 28, wherein:
- the coded bitstream was encoded with a layered coding technique;
- the first version is a base layer of the layered coding technique;
- the second version is an interlayer prediction residual for an enhancement layer of the layered coding technique;
- the converting predicts an enhancement layer output from the first version; and
- the combining combines the predicted enhancement layer with the interlayer prediction residual to produce a decoded enhancement layer output.
32. The system of claim 31, wherein:
- the base layer of the first version spatially includes the entire multi-directional scene; and
- the enhancement layer of the second version includes a spatial region of interest that is a subset of the entire multi-directional scene.
33. The system of claim 31, wherein:
- the first projection format is equirectangular projection and the base layer of the first version spatially includes the entire multi-directional scene; and
- the second projection format is a cube map projection and the enhancement layer of the second version includes one face of the cube map projection and is a subset of the entire multi-directional scene.
34. A non-transitory computer readable medium comprising instructions that, when executed by a processor, cause:
- receiving a coded bitstream of multi-directional video of a scene including a first version of a frame in a first projection format and a second version of the frame in a second projection format;
- first decoding the first version to produce a first decoded image in the first projection format;
- second decoding the second version to produce a second decoded image in the second projection format;
- converting the first decoded image from the first projection format to the second projection format;
- combining the first decoded image in the second projection format with the second decoded image in the second projection format to produce a combined image in the second projection format; and
- outputting the combined image as a decoded version of the frame.
35. The computer readable medium of claim 34, wherein:
- the first projection format is an equirectangular projection; and
- the second projection format is a cube map projection.
36. The computer readable medium of claim 34, wherein:
- the combined image represents a region of interest that corresponds to a subset of the first projection format and a subset of the second projection format; and
- pixels in the combined image are based on a weighted combination of corresponding pixels in the first decoded image with corresponding pixels in the second decoded image.
37. The computer readable medium of claim 34, wherein:
- the coded bitstream was encoded with a layered coding technique;
- the first version is a base layer of the layered coding technique;
- the second version is an interlayer prediction residual for an enhancement layer of the layered coding technique;
- the converting predicts an enhancement layer output from the first version; and
- the combining combines the predicted enhancement layer with the interlayer prediction residual to produce a decoded enhancement layer output.
38. The computer readable medium of claim 37, wherein:
- the base layer of the first version spatially includes the entire multi-directional scene; and
- the enhancement layer of the second version includes a spatial region of interest that is a subset of the entire multi-directional scene.
39. The computer readable medium of claim 37, wherein:
- the first projection format is equirectangular projection and the base layer of the first version spatially includes the entire multi-directional scene; and
- the second projection format is a cube map projection and the enhancement layer of the second version includes one face of the cube map projection and is a subset of the entire multi-directional scene.