EXPERIENCE OR "SENTIO" CODECS, AND METHODS AND SYSTEMS FOR IMPROVING QoE AND ENCODING BASED ON QoE EXPERIENCES

Certain embodiments teach a variety of experience or “sentio” codecs, and methods and systems for enabling an experience platform, as well as a Quality of Experience (QoE) engine which allows the sentio codec to select a suitable encoding engine or device. The sentio codec is capable of encoding and transmitting data streams that correspond to participant experiences with a variety of different dimensions and features. As will be appreciated, the following description provides one paradigm for understanding the multi-dimensional experience available to the participants, and as implemented utilizing a sentio codec. There are many suitable ways of describing, characterizing and implementing the sentio codec and experience platform contemplated herein.

Description
CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of U.S. patent application Ser. No. 13/363,187, entitled “Experience or ‘Sentio’ Codecs, and Methods and Systems for Improving QoE and Encoding Based on QoE Experiences,” filed Jan. 31, 2012, which is a continuation of U.S. patent application Ser. No. 13/136,870, entitled “Experience or ‘Sentio’ Codecs, and Methods and Systems for Improving QoE and Encoding Based on QoE Experiences,” filed Aug. 12, 2011 (now U.S. Pat. No. 9,172,979), which claims the benefit of and priority to U.S. Provisional Patent Application No. 61/373,236, entitled “Experience or ‘Sentio’ Codecs, and Methods and Systems for Improving QoE and Encoding Based on QoE Experiences,” filed on Aug. 12, 2010. The contents of the above-identified applications are incorporated herein by reference in their entirety. This application is therefore entitled to a priority date of Aug. 12, 2010.

FIELD OF INVENTION

The present teaching relates to experience or “sentio” codecs enabling encoding and transmission for data streams involving a variety of dimensions and data types including video, group participation, gesture recognition, heterogeneous device use, emotions, etc.

SUMMARY OF THE INVENTION

The present invention contemplates a variety of experience or “sentio” codecs, and methods and systems for enabling an experience platform, as well as a Quality of Experience (QoE) engine which allows the sentio codec to select a suitable encoding engine or device. As will be described in more detail below, the sentio codec is capable of encoding and transmitting data streams that correspond to participant experiences with a variety of different dimensions and features. As will be appreciated, the following description provides one paradigm for understanding the multi-dimensional experience available to the participants, and as implemented utilizing a sentio codec. There are many suitable ways of describing, characterizing and implementing the sentio codec and experience platform contemplated herein.

BRIEF DESCRIPTION OF DRAWINGS

These and other objects, features and characteristics of the present invention will become more apparent to those skilled in the art from a study of the following detailed description in conjunction with the appended claims and drawings, all of which form a part of this specification. In the drawings:

FIG. 1 illustrates a system architecture for composing and directing user experiences;

FIG. 2 is a block diagram of an experience agent;

FIG. 3 is a block diagram of a sentio codec; and

FIG. 4 provides a screen shot useful for illustrating how a hybrid encoding scheme can be used to accomplish low-latency transmission.

DETAILED DESCRIPTION OF THE INVENTION

The present invention contemplates a variety of experience or “sentio” codecs, and methods and systems for enabling an experience platform, as well as a Quality of Experience (QoE) engine which allows the sentio codec to select a suitable encoding engine or device. As will be described in more detail below, the sentio codec is capable of encoding and transmitting data streams that correspond to participant experiences with a variety of different dimensions and features. (The term “sentio” is Latin, roughly corresponding to perception or to perceive with one's senses, hence the nomenclature “sentio codec.”)

The primary goal of a video codec is to achieve the maximum compression rate for digital video while maintaining high picture quality; audio codecs have a similar goal. But video and audio codecs alone are insufficient to generate and capture a full experience, such as a real-time experience enabled by hybrid encoding, or the encoding of other experience aspects such as gestures, emotions, etc.

FIG. 4 will now be described to provide an example experience, involving four layers, for which video encoding alone is inadequate. (The “layer” concept will be described below in more detail with reference to FIGS. 1-3.) A first layer is generated by Autodesk 3ds Max instantiated on a suitable layer source, such as an experience server or a content server. A second layer is an interactive frame around the 3ds Max layer, and in this example is generated on a client device by an experience agent. A third layer is the black box in the bottom-left corner with the text “FPS” and “bandwidth”, and is generated on the client device but pulls data by accessing a service engine available on the service platform. A fourth layer is a red-green-yellow grid which demonstrates an aspect of a low-latency transfer protocol (e.g., different regions being selectively encoded); it is generated and computed on the service platform, and then merged with the 3ds Max layer on the experience server.

FIG. 4 illustrates how a hybrid encoding approach can be used to accomplish low-latency transmission. The first layer provides an Autodesk 3ds Max image including a rotating teapot; the first layer thus contains moving images, static or nearly static images, and graphic and/or text portions. Rather than encoding all the information with a video encoder alone, a hybrid approach, encoding some regions with a video encoder, other regions with a picture encoder, and other portions as commands, yields better transmission results, and can be optimized based on factors such as the state of the network and the capabilities of end devices. These different encoding regions are illustrated by the different coloring of the red-green-yellow grid of the fourth layer. One example of this low-latency protocol is described in more detail in Vonog et al.'s U.S. patent application Ser. No. 12/569,876, filed Sep. 29, 2009, and incorporated herein by reference for all purposes, including the low-latency protocol and related features such as the network engine and network stack arrangement.

As is seen from the example of FIG. 4, a video codec alone is inadequate to accomplish the hybrid encoding scheme covering video, pictures and commands. While it is theoretically possible to encode the entire first layer using only a video codec, latency and other issues can prohibit real-time and/or quality experiences. A low-latency protocol can solve this problem by efficiently encoding the data.
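By way of illustration only, the following minimal sketch (hypothetical names throughout; not the claimed implementation) shows how such region-based routing might be organized, with each region of a frame dispatched to the encoder suited to its content:

```python
# Minimal sketch of region-based hybrid encoding (hypothetical names; not
# the claimed implementation). Each region of the frame is routed to the
# encoder suited to its content instead of pushing the whole frame through
# a single video encoder.
from dataclasses import dataclass
from enum import Enum, auto

class RegionKind(Enum):
    MOVING = auto()   # e.g., the rotating teapot: best served by a video codec
    STATIC = auto()   # static or nearly static imagery: a picture codec suffices
    VECTOR = auto()   # graphics/text reproducible from drawing commands

@dataclass
class Region:
    x: int
    y: int
    width: int
    height: int
    kind: RegionKind

# Assumed mapping from content kind to encoder family.
ENCODER_FOR_KIND = {
    RegionKind.MOVING: "video",    # e.g., an H.264-class encoder
    RegionKind.STATIC: "picture",  # e.g., a still-image encoder
    RegionKind.VECTOR: "command",  # transmit drawing commands, not pixels
}

def encode_frame(regions: list) -> list:
    """Return (encoder_name, region) pairs for one frame."""
    return [(ENCODER_FOR_KIND[r.kind], r) for r in regions]

# Example frame: a moving teapot region, a static banner, a text overlay.
frame = [
    Region(0, 0, 640, 480, RegionKind.MOVING),
    Region(0, 480, 640, 120, RegionKind.STATIC),
    Region(500, 10, 140, 40, RegionKind.VECTOR),
]
for encoder_name, region in encode_frame(frame):
    print(encoder_name, region.kind.name)
```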

In another example, a multiplicity of video codecs can be used to improve encoding and transmission. For example, H.264 can be used if a hardware decoder is available, thus saving battery life and improving performance, or a better-suited video codec (e.g., a low-latency codec) can be used if the device fails to support H.264.
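For illustration, a minimal sketch of this device-aware selection, assuming a simple capability dictionary; the capability key and codec labels are invented for this example:

```python
# Hedged sketch of device-aware video codec selection: prefer H.264 when a
# hardware decoder is present (saving battery and improving performance),
# otherwise fall back to a low-latency software codec. Capability keys and
# codec labels are illustrative assumptions.
def select_video_codec(device_caps: dict) -> str:
    if device_caps.get("hw_h264_decoder", False):
        return "h264"            # hardware decode path
    return "low_latency_sw"      # software path favoring low latency

print(select_video_codec({"hw_h264_decoder": True}))   # -> h264
print(select_video_codec({"hw_h264_decoder": False}))  # -> low_latency_sw
```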

As yet another example, consider the case of multiple media, where an ability to take into account the nature of human perception would be beneficial. For example, assume we have video and audio information. If network quality degrades, it could be better to prioritize the audio and allow the video to degrade. Doing so requires using psychoacoustics to improve the QoE.
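The following sketch illustrates one such psychoacoustically informed policy under an assumed bandwidth budget; the bitrates and names are illustrative assumptions, not values taken from this disclosure:

```python
# Sketch of a psychoacoustically informed bandwidth policy: protect audio
# first and let video absorb the loss, since listeners perceive audio
# artifacts before video artifacts. The bitrates below are assumptions
# chosen for illustration only.
AUDIO_FLOOR_KBPS = 64  # assumed minimum for acceptable audio

def allocate_bandwidth(total_kbps: int, audio_target: int = 128,
                       video_target: int = 2000) -> dict:
    # Give audio up to its target, never less than the floor if possible.
    audio = min(audio_target, max(AUDIO_FLOOR_KBPS, total_kbps // 4))
    audio = min(audio, total_kbps)       # cannot exceed what is available
    video = min(video_target, max(0, total_kbps - audio))
    return {"audio_kbps": audio, "video_kbps": video}

print(allocate_bandwidth(2128))  # healthy network: both streams at target
print(allocate_bandwidth(300))   # degraded: audio protected, video reduced
```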

Accordingly, the present teaching contemplates an experience or sentio codec capable of encoding and transmitting data streams that correspond to experiences with a variety of different dimensions and features. These dimensions include the known audio and video dimensions, but further may include any conceivable element of a participant experience, such as gestures, gestures combined with voice commands, “game mechanics” (which can be used to maintain QoE when current conditions, such as network conditions, would otherwise degrade it, e.g., by applying a sound distortion effect specific to a given experience when data loss occurs), emotions (perhaps as detected via voice or facial expressions), various sensor data, microphone input, etc.

It is also contemplated that virtual experiences can be encoded via the sentio codec. According to one embodiment, virtual goods are evolved into virtual experiences. Virtual experiences overcome the limitations of virtual goods by adding additional dimensions to them. By way of example, User A transmits flowers as a virtual good to User B. The transmission of the virtual flowers is enhanced by adding emotion, for example by way of sound. The virtual flowers also become a virtual experience when User B can do something with the flowers; for example, User B can affect the flowers through any sort of motion or gesture. User A can also transmit the virtual goods to User B by making a “throwing” gesture using a mobile device, so as to “toss” the virtual goods to User B.

The sentio codec improves the QoE for a consumer or experience participant on the device of their choice. This is accomplished through a variety of mechanisms, selected and implemented, possibly dynamically, based on the specific application and available resources. In certain embodiments, the sentio codec encodes multi-dimensional data streams in real time, adapting to network capability. A QoE engine operating within the sentio codec makes decisions on how to use the different available codecs. The network stack can be implemented as a hybrid stack, as described above and in further detail with reference to Vonog et al.'s U.S. patent application Ser. No. 12/569,876.

The sentio codec can include 1) a variety of codecs for each segment of experience described above, 2) a hybrid network stack with network intelligence, 3) data about available devices, and 4) a QoE engine that makes decisions on how to encode. It will be appreciated that QoE is achieved through various strategies that work differently for each given experience (say, a zombie karaoke game versus a live stadium rock concert), adapt in real time to the network and other available resources, account for the devices involved, and take advantage of various psychological tricks to conceal imperfections which inevitably arise, particularly when the provided experience is scaled to many participants and devices.
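For illustration only, the four enumerated elements might be composed as in the following structural sketch; all names are hypothetical, and the trivial selection strategy stands in for the adaptive strategies described above:

```python
# Structural sketch of the four elements just enumerated (hypothetical
# names; not the patent's implementation): (1) per-dimension codecs,
# (2) a hybrid network stack, (3) device data, and (4) a QoE engine.
from dataclasses import dataclass, field

@dataclass
class DeviceInfo:
    name: str
    hw_decoders: set = field(default_factory=set)  # e.g., {"h264"}

class QoEEngine:
    def choose(self, options: list, devices: list) -> str:
        # Trivial stand-in strategy: pick a codec every device can decode
        # in hardware, else the first option. Real strategies differ per
        # experience and adapt in real time, as noted above.
        for option in options:
            if all(option in d.hw_decoders for d in devices):
                return option
        return options[0]

@dataclass
class SentioCodec:
    codecs: dict              # (1) dimension -> candidate codec names
    network_stack: object     # (2) hybrid stack (see Ser. No. 12/569,876)
    devices: list             # (3) DeviceInfo for each participating device
    qoe_engine: QoEEngine     # (4) encoding decision maker

    def pick_codec(self, dimension: str) -> str:
        return self.qoe_engine.choose(self.codecs[dimension], self.devices)

sc = SentioCodec(
    codecs={"video": ["h264", "low_latency_sw"], "audio": ["audio_default"]},
    network_stack=None,
    devices=[DeviceInfo("phone", {"h264"})],
    qoe_engine=QoEEngine(),
)
print(sc.pick_codec("video"))  # -> h264
```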

FIG. 1 illustrates a block diagram of a system 10. The system 10 can be viewed as an “experience platform” or system architecture for composing and directing a participant experience. As will be appreciated, the experience platform described herein provides, by way of example only, one platform suitable for incorporating and taking advantage of the sentio codec described herein. In one embodiment, the experience platform 10 is provided by a service provider to enable an experience provider to compose and direct a participant experience. The participant experience can involve one or more experience participants. The experience provider can create an experience with a variety of dimensions, as will now be explained further. The sentio codec enables the encoding and transmission of data streams representing this variety of dimensions. As will be appreciated, the following description provides one paradigm for understanding the multi-dimensional experience available to the participants. There are many suitable ways of describing, characterizing and implementing the experience platform contemplated herein.

In general, services are defined at an API layer of the experience platform. The services are categorized into “dimensions.” The dimension(s) can be recombined into “layers.” The layers combine to make features in the experience. The sentio codec enables encoding and transmission of the data streams representing the various dimensions and features.
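By way of illustration only, a toy data model for this services-to-dimensions-to-layers-to-features hierarchy might look as follows; the class and field names are assumptions, not an API defined by this disclosure:

```python
# Toy data model (assumed names) for the hierarchy just described:
# services are grouped into dimensions, dimensions recombine into layers,
# and layers combine to make features of the experience.
from dataclasses import dataclass, field

@dataclass
class Dimension:
    name: str                                     # e.g., "Video", "Game Mechanics"
    services: list = field(default_factory=list)  # services from the API layer

@dataclass
class Layer:
    name: str
    dimensions: list = field(default_factory=list)

@dataclass
class Feature:
    name: str
    layers: list = field(default_factory=list)

video = Dimension("Video", services=["streaming", "panning"])
live = Layer("live-feed", dimensions=[video])
concert = Feature("concert-view", layers=[live])
print(concert.name, [layer.name for layer in concert.layers])
```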

By way of example, the following are some of the dimensions that can be supported on a suitable experience platform, and the related data streams encoded by a suitable sentio codec. It will be appreciated that not all dimensions are necessarily available or needed on specific experience platforms or specific devices, and that the sentio codec can be implemented with general, all-encompassing capabilities, with only those capabilities needed for the specific implementation, or with a suitable subset.

Video—is the near or substantially real-time streaming of the video portion of a video or film with near real-time display and interaction.

Audio—is the near or substantially real-time streaming of the audio portion of a video, film, karaoke track, or song, with near real-time sound and interaction.

Live—is the live display and/or access to a live video, film, or audio stream in near real-time that can be controlled by another experience dimension. A live display is not limited to a single data stream.

Encore—is the replaying of a live video, film or audio content. This replaying can be the raw version as it was originally experienced, or some type of augmented version that has been edited, remixed, etc.

Graphics—is a display that contains graphic elements such as text, illustration, photos, freehand geometry and the attributes (size, color, location) associated with these elements. Graphics can be created and controlled using the experience input/output command dimension(s) (see below).

Input/Output Command(s)—are the ability to control the video, audio, picture, display, sound or interactions with human or device-based controls. Some examples of input/output commands include physical gestures or movements, voice/sound recognition, and keyboard or smart-phone device input(s).

Interaction—is how devices and participants interchange and respond with each other and with the content (user experience, video, graphics, audio, images, etc.) displayed in an experience. Interaction can include the defined behavior of an artifact or system and the responses provided to the user and/or player.

Game Mechanics—are rule-based system(s) that facilitate and encourage players to explore the properties of an experience space and other participants through the use of feedback mechanisms. Some services on the experience Platform that could support the game mechanics dimensions include leader boards, polling, like/dislike, featured players, star-ratings, bidding, rewarding, role-playing, problem-solving, etc.

Ensemble—is the interaction of several separate but often related parts of video, song, picture, story line, players, etc. that when woven together create a more engaging and immersive experience than if experienced in isolation.

Auto Tune—is the near real-time correction of pitch in vocal and/or instrumental performances. Auto Tune is used to disguise off-key inaccuracies and mistakes, and allows singer/players to hear back perfectly tuned vocal tracks without the need to sing in tune.

Auto Filter—is the near real-time augmentation of vocal and/or instrumental performances. Types of augmentation could include speeding up or slowing down the playback, increasing/decreasing the volume or pitch, or applying a celebrity-style filter to an audio track (like a Lady Gaga or Heavy-Metal filter).

Remix—is the near real-time creation of an alternative version of a song, track, video, image, etc. made from an original version or multiple original versions of songs, tracks, videos, images, etc.

Viewing 360°/Panning—is the near real-time viewing of the 360° horizontal movement of a streaming video feed on a fixed axis. Also the ability for the player(s) to control and/or display alternative video or camera feeds from any point designated on this fixed axis.

Turning back to FIG. 1, the experience platform 10 includes a plurality of devices 12 and a data center 40. The devices 12 may include devices such as an iPhone 22, an Android device 24, a set top box 26, a desktop computer 28, and a netbook 30. At least some of the devices 12 may be located in proximity with each other and coupled via a wireless network. In certain embodiments, a participant utilizes multiple devices 12 to enjoy a heterogeneous experience, such as using the iPhone 22 to control operation of the other devices. Multiple participants may also share devices at one location, or the devices may be distributed across various locations for different participants.

Each device 12 has an experience agent 32. The experience agent 32 includes a sentio codec and an API. The sentio codec can include 1) a variety of codecs for each segment of experience described above, 2) a hybrid network stack with network intelligence, 3) data about available devices, and 4) a QoE decision engine that makes decisions on how to encode. It will be appreciated that QoE is achieved through various strategies that work differently for each given experience (say, a zombie karaoke game versus a live stadium rock concert), adapt in real time to the network and other available resources, account for the devices involved, and take advantage of various psychological tricks to conceal imperfections which inevitably arise, particularly when the provided experience is scaled to many participants and devices.

The sentio codec and the API enable the experience agent 32 to communicate with and request services of the components of the data center 40. The experience agent 32 also facilitates direct interaction among local devices. Because of the multi-dimensional aspect of the experience, the sentio codec and API should fully enable the desired experience. However, the functionality of the experience agent 32, including the sentio codec, is typically tailored to the needs and capabilities of the specific device 12 on which the experience agent 32 is instantiated. In some embodiments, services implementing experience dimensions are implemented in a distributed manner across the devices 12 and the data center 40. In other embodiments, the devices 12 have a very thin experience agent 32 with little functionality beyond a minimum API and sentio codec, and the bulk of the services, and thus the composition and direction of the experience, are implemented within the data center 40.

Data center 40 includes an experience server 42, a plurality of content servers 44, and a service platform 46. As will be appreciated, data center 40 can be hosted in a distributed manner in the “cloud,” and typically the elements of the data center 40 are coupled via a low latency network. The experience server 42, servers 44, and service platform 46 can be implemented on a single computer system, or more likely distributed across a variety of computer systems, and at various locations.

The experience server 42 includes at least one experience agent 32, an experience composition engine 48, and an operating system 50. The experience agent 32 again includes a sentio codec with the various capabilities described herein. In one embodiment, the experience composition engine 48 is defined and controlled by the experience provider to compose and direct the experience for one or more participants utilizing devices 12. Direction and composition is accomplished, in part, by merging various content layers and other elements into dimensions generated from a variety of sources such as the experience server 42, the devices 12, the content servers 44, and/or the service platform 46.

The content servers 44 may include a video server 52, an ad server 54, and a generic content server 56. Any content suitable for encoding by the sentio codec of an experience agent can be included as an experience layer. These include well-known forms such as video, audio, graphics, and text. As described in more detail earlier and below, other forms of content such as gestures, emotions, temperature, proximity, etc., are contemplated for encoding and inclusion in the experience via a sentio codec, and are suitable for creating dimensions and features of the experience.

The service platform 46 includes at least one experience agent 32, a plurality of service engines 60, third party service engines 62, and a monetization engine 64. In some embodiments, each service engine 60 or 62 has a unique, corresponding experience agent with a corresponding sentio codec. The sentio codecs may have separate code bases and may utilize different local hardware, or combinations of the same local hardware. As will be appreciated, the implementation may be distinct to each application. In other embodiments, a single experience agent 32 can support multiple service engines 60 or 62. The service engines and the monetization engine 64 can be instantiated on one server, or can be distributed across multiple servers. The service engines 60 correspond to engines generated by the service provider and can provide services such as audio remixing, gesture recognition, and other services referred to in the context of dimensions above. Third party service engines 62 are services included in the service platform 46 by other parties. The service platform 46 may have the third-party service engines instantiated directly therein, or, within the service platform 46, these may correspond to proxies which in turn make calls to servers under the control of the third parties.

Monetization of the service platform 46 can be accomplished in a variety of manners. For example, the monetization engine 64 may determine how and when to charge the experience provider for use of the services, as well as tracking for payment to third-parties for use of services from the third-party service engines 62.

FIG. 2 illustrates a block diagram of an experience agent 100. The experience agent 100 includes an application programming interface (API) 102 and a sentio codec 104. The API 102 is an interface which defines available services, and enables the different agents to communicate with one another and request services.

The sentio codec 104 is a combination of hardware and/or software which enables encoding of many types of data streams for operations such as transmission and storage, and decoding for operations such as playback and editing. These data streams can include standard data such as video and audio. Additionally, the data can include graphics, sensor data, gesture data, and emotion data.

FIG. 3 illustrates a block diagram of one embodiment of a sentio codec 200. The sentio codec 200 includes a plurality of codecs such as video codecs 202, audio codecs 204, graphic language codecs 206, sensor data codecs 208, and emotion codecs 210. The sentio codec 200 further includes a quality of experience (QoE) decision engine 212 and a network engine 214. The codecs, the QoE decision engine 212, and the network engine 214 work together to encode one or more data streams and transmit the encoded data according to a low-latency transfer protocol supporting the various encoded data types. One suitable low-latency protocol and more details related to the network engine 214 can be found in Vonog et al.'s U.S. patent application Ser. No. 12/569,876.
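For illustration, the following sketch traces this flow: a stand-in QoE decision engine selects a codec per incoming stream type, and a stand-in network engine transmits the encoded result. The codec registry, selection rule, and transport stub are assumptions; the low-latency protocol itself is not reproduced here:

```python
# Sketch of the FIG. 3 flow (illustrative names only): the QoE decision
# engine selects a codec per stream type; the network engine transmits
# the encoded result. The registry below is an assumption mirroring the
# codec families 202-210.
CODEC_FAMILIES = {
    "video":   ["h264_hw", "low_latency_sw"],
    "audio":   ["audio_default"],
    "graphic": ["draw_commands"],
    "sensor":  ["sensor_pack"],
    "emotion": ["emotion_v1"],
}

class QoEDecisionEngine:
    def choose(self, stream_type: str, device_caps: dict) -> str:
        for codec in CODEC_FAMILIES[stream_type]:
            if device_caps.get(codec, True):  # assume supported unless flagged
                return codec
        return CODEC_FAMILIES[stream_type][0]

class NetworkEngine:
    def send(self, codec: str, payload: bytes) -> None:
        # Placeholder for the low-latency transfer protocol of Ser. No.
        # 12/569,876; the protocol itself is not reproduced here.
        print(f"sending {len(payload)} bytes encoded with {codec}")

qoe, net = QoEDecisionEngine(), NetworkEngine()
for stream_type, data in [("audio", b"pcm..."), ("emotion", b"joy")]:
    net.send(qoe.choose(stream_type, device_caps={}), data)
```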

The sentio codec 200 can be designed to take all aspects of the experience platform into consideration when executing the transfer protocol. The parameters and aspects include available network bandwidth, transmitting device characteristics, and receiving device characteristics. Additionally, the sentio codec 200 can be implemented to be responsive to commands from an experience composition engine or other outside entity in determining how to prioritize data for transmission. In many applications, because of human perceptual response, audio is the most important component of an experience data stream. However, a specific application may desire to emphasize video or gesture commands instead.
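A small sketch of such commanded prioritization follows; the default ordering and the override are illustrative assumptions only:

```python
# Sketch of commanded prioritization (assumed orderings): audio leads by
# default, but an experience composition engine or other outside entity
# may supply an override, e.g., to emphasize gesture commands.
DEFAULT_PRIORITY = ["audio", "video", "gesture", "graphics", "sensor"]

def transmission_order(streams, override=None):
    ranking = override or DEFAULT_PRIORITY
    return sorted(streams, key=ranking.index)

print(transmission_order(["video", "gesture", "audio"]))
# -> ['audio', 'video', 'gesture']
print(transmission_order(["video", "gesture", "audio"],
                         override=["gesture", "audio", "video"]))
# -> ['gesture', 'audio', 'video']
```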

The sentio codec provides the capability of encoding data streams corresponding to many different senses or dimensions of an experience. For example, a device 12 may include a video camera capturing video images and audio from a participant. The user image and audio data may be encoded and transmitted directly or, perhaps after some intermediate processing, via the experience composition engine 48, to the service platform 46 where one or a combination of the service engines can analyze the data stream to make a determination about an emotion of the participant. This emotion can then be encoded by the sentio codec and transmitted to the experience composition engine 48, which in turn can incorporate this into a dimension of the experience. Similarly a participant gesture can be captured as a data stream, e.g. by a motion sensor or a camera on device 12, and then transmitted to the service platform 46, where the gesture can be interpreted, and transmitted to the experience composition engine 48 or directly back to one or more devices 12 for incorporation into a dimension of the experience.
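By way of illustration, this round trip might look as follows in skeletal form; detect_emotion, the tag byte, and the wire format are hypothetical stand-ins for a service engine and an emotion codec:

```python
# Skeletal sketch of the round trip described above (hypothetical names
# and wire format): a service engine infers an emotion from captured
# audio/video, and an emotion codec encodes it as its own data stream for
# the experience composition engine 48.
def detect_emotion(av_frames: bytes) -> str:
    # Stand-in for a service engine analyzing voice/facial expressions.
    return "joy"

EMOTION_TAG = 0x05  # assumed tag byte identifying an emotion stream

def encode_emotion(emotion: str) -> bytes:
    # Toy wire format: tag byte + UTF-8 label.
    return bytes([EMOTION_TAG]) + emotion.encode("utf-8")

captured = b"...camera and microphone frames..."
stream = encode_emotion(detect_emotion(captured))
print(stream)  # would be transmitted to the experience composition engine
```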

The sentio codec delivers the best QoE to a consumer on the device of their choice over the current network. This is accomplished through a variety of mechanisms, selected and implemented based on the specific application and available resources. In certain embodiments, the sentio codec encodes multi-dimensional data streams in real time, adapting to network capability. A QoE engine operating within the sentio codec makes decisions on how to use the different available codecs. The network stack can be implemented as a hybrid stack, as described above and in further detail with reference to Vonog et al.'s U.S. patent application Ser. No. 12/569,876.

In addition to the above mentioned examples, various other modifications and alterations of the invention may be made without departing from the invention. Accordingly, the above disclosure is not to be considered as limiting and the appended claims are to be interpreted as encompassing the true spirit and the entire scope of the invention.

Claims

1. A hybrid codec for encoding and decoding a plurality of multi-dimensional data streams for a multi-dimensional experience, the hybrid codec comprising:

a plurality of codecs suitable for encoding and decoding multi-dimensional experience data streams related to a multi-dimensional experience shared over a network between one or more transmitting devices and a receiving device;
a Quality of Experience (QoE) decision engine configured to: receive an encoded output associated with the multi-dimensional experience, the encoded output including the plurality of multi-dimensional data streams; wherein the encoded output is divided into a plurality of regions; analyze the encoded output in each of the plurality of regions; and for each of the plurality of regions, select one codec from the plurality of codecs to decode the encoded output in that region; wherein the selection of the codec is made to improve a human perception of the multi-dimensional experience and is based on data associated with the capabilities of a transmitting device, the capabilities of the receiving device, and the characteristics of the multi-dimensional experience; and
a network engine configured to implement a low-latency transfer protocol for transmitting and receiving of encoded multi-dimensional data streams;
wherein the low latency transfer protocol takes into account the current conditions of the network, the capabilities of the one or more transmitting devices, and the capabilities of the receiving device.

2. The hybrid codec of claim 1, wherein the plurality of codecs includes an audio codec and a video codec.

3. The hybrid codec of claim 2, wherein the plurality of codecs further includes a gesture command codec.

4. The hybrid codec of claim 2, wherein the plurality of codecs further includes a sensor data codec.

5. The hybrid codec of claim 2, wherein the plurality of codecs further includes an emotion data codec.

6. The hybrid codec of claim 1, wherein improving human perception of the multi-dimensional experience includes prioritizing the encoding and decoding of audio streams over the encoding and decoding of video streams.

7. The hybrid codec of claim 1, wherein the capabilities of the transmitting and receiving device include the availability of hardware codecs.

8. The hybrid codec of claim 1, wherein the capabilities of the receiving device include graphical processing capabilities.

9. The hybrid codec of claim 1, wherein the received encoded output associated with the multi-dimensional experience includes a plurality of layers, each of the plurality of layers associated with a subset of the plurality of multi-dimensional data streams.

10. A computer implemented method for providing an experience using a hybrid codec for encoding, decoding, and transmitting experiences, the hybrid codec including a quality of experience (QoE) decision engine, a network engine, and plurality of codecs suitable for encoding and decoding multi-dimensional data streams associated with the experience, the computer implemented method comprising:

receiving, by the QoE decision engine, an output including a plurality of data streams, the plurality of data streams including video, audio, graphics, text, gestures and at least one emotion;
wherein the output is divided into a plurality of regions;
analyzing, by the QoE decision engine, the output in each of the plurality of regions;
for each of the plurality of regions, selecting, by the QoE decision engine, one codec from the plurality of codecs to encode the output within that region;
wherein the selection of the codec is made to improve a human perception of the multi-dimensional experience and is based on data associated with the capabilities of a transmitting device, the capabilities of the receiving device, and the characteristics of the multi-dimensional experience;
wherein the selection is informed by the network engine, the network engine including a hybrid network stack with network intelligence configured to implement a low-latency transfer protocol; and
for each of the plurality of regions, encoding, by the selected codec, the plurality of data streams associated with the output in that region.

11. The computer implemented method of claim 10, wherein the experience includes a plurality of layers, and the encoding generates the plurality of layers.

12. The computer implemented method of claim 10, wherein the quality of experience engine affects the encoding by taking into consideration the nature and type of devices involved in providing the experience.

13. The computer implemented method of claim 10 further comprising encoding and transmitting virtual goods as part of the experiences.

14. The computer implemented method of claim 10, wherein the received output includes a plurality of layers, each of the plurality of layers associated with a subset of the plurality of multi-dimensional data streams.

15. The computer implemented method of claim 10, wherein improving human perception of the multi-dimensional experience includes prioritizing the encoding and decoding of audio streams over the encoding and decoding of video streams.

16. The computer implemented method of claim 10, further characterized in that a network engine provides instructions on how to encode, the network engine utilizing network information including bandwidth, latency, and jitter.

17. A system comprising:

a plurality of codecs suitable for encoding and decoding multi-dimensional data streams;
one or more processors; and
a memory unit having instructions stored thereon, which when executed by the one or more processors, cause the system to: receive an output including a plurality of data streams, the plurality of data streams including video, audio, graphics, text, gestures and at least one emotion, wherein the output is divided into a plurality of regions; analyze the output in each of the plurality of regions; for each of the plurality of regions, select one codec from the plurality of codecs to encode the output within that region; wherein the selection of the codec is made to improve a human perception of the multi-dimensional experience and is based on data associated with the capabilities of a transmitting device, the capabilities of the receiving device, and the characteristics of the multi-dimensional experience; wherein the selection is informed by a network engine, the network engine including a hybrid network stack with network intelligence configured to implement a low-latency transfer protocol; and for each of the plurality of regions, cause the selected codec to encode the plurality of data streams associated with the output in that region.
Patent History
Publication number: 20160219279
Type: Application
Filed: Mar 31, 2016
Publication Date: Jul 28, 2016
Inventors: Stanislav Vonog (San Francisco, CA), Nikolay Surin (San Francisco, CA), Tara Lemmey (San Francisco, CA)
Application Number: 15/087,657
Classifications
International Classification: H04N 19/156 (20060101); H04N 21/63 (20060101); H04N 21/6379 (20060101); H04N 19/164 (20060101); H04N 21/2343 (20060101);