AUTOMATIC GENERATION OF SEMANTIC-BASED CINEMAGRAPHS

Described are various technologies that pertain to automatically generating looped videos or cinemagraphs by selecting objects to animate from an input video. In one implementation, a group of semantically labeled objects from an input video is received. Candidate objects from the input video that can appear as a moving object in an output cinemagraph or looped video are identified. Candidate video loops are generated using these candidate objects. One or more of these candidate video loops are then selected to create a final cinemagraph. The selection of candidate video loops used to create the final cinemagraph can be made by a user or by a predictive model trained to evaluate the subjective attractiveness of the candidate video loops.

Description

Visual imagery commonly can be classified as either a static image (e.g., photograph, painting, etc.) or dynamic imagery (e.g., video, animation, etc.). A static image captures a single instant in time. For instance, a static photograph often derives its power by what is implied beyond its spatial and temporal boundaries (e.g., outside the frame and in moments before and after the photograph was taken). Typically, a viewer's imagination can fill in what is left out of the static image (e.g., spatially and/or temporally). In contrast, video loses some of that power; yet, by being dynamic, video can provide an unfolding temporal narrative through time.

Differing types of short videos can be created from an input video. The input video can be any video of any length, or a portion of a longer video. It can even be a short video clip itself or a short burst of images (e.g., 12 images captured at 10 frames per second). Examples of the short videos that can be created include cinemagraphs and cliplets, which selectively freeze, play, and loop video regions to achieve compelling effects. The contrasting juxtaposition of looping elements against a still background can help grab the attention of a viewer. For instance, cinemagraphs can commonly combine static scenes with small repeating movements (e.g., a hair wisp blowing in the wind); thus, some motion and narrative can be captured in a cinemagraph. In a cinemagraph, the dynamic element is commonly looping in a sequence of frames.

Various techniques are conventionally employed to create video loops that look natural or visually pleasing. For example, some approaches define video textures by locating pairs of similar video frames to create a sparse transition graph. A stochastic traversal of this graph can generate non-repeating video; however, finding compatible frames may be difficult for scenes with many independently moving elements when employing such techniques. Other traditional approaches for creating video loops synthesize videos using a Markov Random Field (MRF) model. Such approaches can successively merge video patches offset in space and/or time, and determine an optimal merging seam using a binary graph cut. Introducing constraints can allow for creation of video loops with a specified global period. Other conventional techniques attempt to create panoramic video textures from a panning video sequence. Accordingly, a user can select a static background layer image and can draw masks to identify dynamic regions. For each region, a natural periodicity can be automatically determined. Then a 3D MRF model can be solved using a multi-label graph cut on a 3D grid. Still other techniques attempt to create panoramic stereo video textures by blending the overlapping video in the space-time volume.

Various approaches for interactive authoring of cinemagraphs have been developed. For example, regions of motion in a video can be automatically isolated. Moreover, a user can select which regions to make looping and which reference frame to use for each region. Looping can be achieved by finding matching frames or regions. Some conventional techniques for creating cinemagraphs can selectively stabilize motions in video. Accordingly, a user can sketch differing types of strokes to indicate regions to be made static, immobilized, or fully dynamic, where the strokes can be propagated across video frames using optical flow. The video can further be warped for stabilization and a 3D MRF problem can be solved to seamlessly merge the video with static content. Other recent techniques provide a set of idioms (e.g., static, play, loop and mirror loop) to allow a user to combine several spatiotemporal segments from a source video. These segments can be stabilized and composited together to emphasize scene elements or to form a narrative.

SUMMARY

Described herein are various technologies that pertain to automatically generating looped videos or cinemagraphs by selecting the best objects to animate from an input video.

In some cinemagraph generator implementations described herein, an input video is received, where the input video includes values at pixels over a time range. The frames of the input video are semantically segmented and the pixels of the frames are assigned semantic object labels. An optimization can be performed to determine a respective input time interval for each pixel from the pixels in the input video. The respective input time interval for a particular pixel can include a per-pixel loop period and a per-pixel start time of a loop at the particular pixel. Moreover, an output video in the form of a looped video or cinemagraph can be created based upon the values at the pixels over the respective input time intervals for the pixels in the input video.

In general, in one implementation described herein, a group of per-pixel semantic object labels for all pixels of an input video is received. The selection of objects in the input video to loop is performed by first selecting candidate objects that can appear as a moving object in a candidate video loop. Candidate video loops are generated with the selected candidate objects and one or more of these candidate video loops are selected to create a final cinemagraph. In some implementations, the selection of candidate video loops used to create the final cinemagraph is made by a model trained to evaluate the subjective attractiveness of the candidate video loops.

The candidate objects to animate in some implementations are found by receiving the per-pixel semantic object labels of all of the pixels in the input video. The best label (as determined by a semantic segmentation algorithm) for each pixel and each frame in the input video is found independently. Given the best label map per frame, histograms are constructed by sum-pooling the label maps across all of the frames and pixels. To this end, given the best label map per frame, histograms are created for each pixel with respect to all types of object labels in the video. The histograms are aggregated across all frames to create a global histogram for each of the semantic object label types of the input video clip. Given the set that includes all possible object labels, object labels associated with inherently static objects are discarded. Furthermore, object labels that have low dynamicity (in terms of intensity variation across time or motion) are also discarded. (The dynamicity of an object label is measured by summing intensity variations across time or magnitudes of optical flow motion over the pixels involved in the label.) Object labels whose connected components are too small on average are also discarded. The top K object labels that have the K largest histogram values of the global histogram over the remaining (not discarded) label set are selected as the candidate objects that will appear as moving objects in a candidate video-loop. An object candidate can be a single object type (e.g., face, tree, car, etc.) or a combination of possible object types.

Once the most appropriate objects to animate are selected, one or more candidate object loops are created using the selected candidate objects. In some implementations, candidate loops are created by generating rough explicit masks of the region of each specific candidate object. A feature set is constructed for each mask region associated with a candidate object. The features in the feature set are clustered, and the centroids of the clusters form the feature basis that represents the object label for the masked region. When looping start times and looping periods are computed for an object candidate, pixels in the input video that have a feature similar to this feature basis are penalized for being designated as static (namely, they are encouraged to be dynamic) in the optimization process. For example, if a pixel that would otherwise be selected as static during the optimization process has a feature similar to one of the feature bases, then the pixel is penalized in the optimization computations so that the pixel can have a period greater than one frame. Candidate loops are then created using the dynamic pixels.

In some implementations, a trained model is used to evaluate the subjective attractiveness of the animation of object types to a user. This trained model can be used to select the candidate loops with which to create a looped video or cinemagraph. To this end, a model is trained to determine the attractiveness of looping certain types of objects in a scene. In some implementations, this model is then used to automatically determine which candidate loops to use to create the cinemagraph. In order to create the model, a training set of looped training image sequences is received. A plurality of features from each looped training image sequence in the set of looped training image sequences is extracted. A human subjective quality rating is received for each of the looped training image sequences. A predictive model is then generated from a combination of the features extracted from each looped training image sequence and the corresponding human subjective quality ratings of the looped training image sequences. Such a predictive model can be learned, for example, through direct regression (support vector regression or random forests), prediction by clustering, or collaborative filtering.

Once created, the predictive model can then be used to rank a set of candidate loops for generating a looped image sequence, for example, a cinemagraph. In order to use the predictive model to rank the set of candidate loops, a plurality of candidate loops for generating a looped image sequence, such as a cinemagraph, is received. The predictive model is applied to the candidate loops of the candidate set to generate a quality score for each candidate loop. The quality score defines a subjective quality of the corresponding candidate loop. The candidate loops are then ranked based on the quality scores. A looping image sequence (e.g., cinemagraph) is then generated using a prescribed number of the highest ranked loops. In some implementations, the quality score of a candidate loop must exceed a predetermined threshold in order to be ranked.

The cinemagraph generator implementations described herein are advantageous in that they allow cinemagraphs to be automatically generated without requiring manual user identification of regions or objects to make dynamic. Additionally, these cinemagraph generator implementations create cinemagraphs much more quickly than prior methods of generating cinemagraphs and are therefore more computationally efficient. Furthermore, cinemagraphs generated by these implementations are more realistic and visually pleasing to users because a predictive model can be used to create these cinemagraphs.

The above summary presents a simplified summary in order to provide a basic understanding of some aspects of the systems and/or methods discussed herein. This summary is not an extensive overview of the systems and/or methods discussed herein. It is not intended to identify key/critical elements or to delineate the scope of such systems and/or methods. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a functional block diagram of an exemplary system that generates a video loop or cinemagraph from an input video.

FIG. 2 illustrates an exemplary input video V(x,t) and a corresponding exemplary output video L(x,t).

FIG. 3 illustrates an exemplary time-mapping from an input video to an output video.

FIG. 4 illustrates a functional block diagram of an exemplary system that controls rendering of an output video.

FIG. 5 is a flow diagram that illustrates an exemplary methodology for generating a video cinemagraph according to implementations described herein.

FIG. 6 is an exemplary flow diagram of selecting objects in an input video to animate in a cinemagraph or other looping video.

FIG. 7 is an exemplary flow diagram for creating and selecting candidate loops.

FIG. 8 is a flow diagram that illustrates an exemplary methodology for displaying an output video on a display screen of a device.

FIG. 9 is an exemplary flow diagram for creating/using a model to evaluate the attractiveness of a looping video.

FIG. 10 illustrates an exemplary computing device.

DETAILED DESCRIPTION

Various technologies pertaining to generating a spectrum of video loops from an input video are now described with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of one or more aspects. It may be evident, however, that such aspect(s) may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing one or more aspects. Further, it is to be understood that functionality that is described as being carried out by certain system components may be performed by multiple components. Similarly, for instance, a component may be configured to perform functionality that is described as being carried out by multiple components.

Moreover, the term “or” is intended to mean an inclusive “or” rather than an exclusive “or.” That is, unless specified otherwise, or clear from the context, the phrase “X employs A or B” is intended to mean any of the natural inclusive permutations. That is, the phrase “X employs A or B” is satisfied by any of the following instances: X employs A; X employs B; or X employs both A and B. In addition, the articles “a” and “an” as used in this application and the appended claims should generally be construed to mean “one or more” unless specified otherwise or clear from the context to be directed to a singular form.

1.0 Introduction/Overview.

The following paragraphs provide an introduction and an overview of a system for generating a cinemagraph based on various technologies described herein.

Referring now to the drawings, FIG. 1 illustrates a system 100 that generates a video loop (e.g., cinemagraph) from an input video 102. The system 100 receives the input video 102, where the input video 102 includes values at pixels over a time range. The input video 102 can be denoted as a three-dimensional (3D) volume V(x,t), with two-dimensional (2D) pixel location x and frame time t. The 2D pixel location x is also referred to herein as the pixel x. The received input video 102 can be semantically segmented so that pixels are labeled with semantic object labels. However, the system can also optionally include a semantic segmenting component 104 that can include a semantic segmentation algorithm 126 that segments the input video into semantic objects (e.g., a semantically segmented video 106 with pixels that are labeled with semantic object labels) if the input video 102 is not already semantically segmented with semantic object labels.

The system 100 automates forming looping content from the input video 102. Certain motions in a scene included in the input video 102 can be rendered in an output video 114. It is contemplated that such motions can be stochastic or semi-periodic such as, for example, a person's face, swaying grasses, swinging branches, rippling puddles, and pulsing lights. These moving elements in a scene often have different loop periods; accordingly, the system 100 can automatically identify a respective per-pixel loop period for each pixel of the input video 102 as well as a respective per-pixel start time for each pixel of the input video 102. At a given pixel, a combination of a per-pixel loop period and a per-pixel start time can define an input time interval in the input video 102. A length of the input time interval is the per-pixel loop period, and a first frame of the input time interval is the per-pixel start time. Moreover, it is contemplated that some moving objects in the input video 102 can be static (e.g., frozen) in the output video 114.

Conventional techniques for forming loops typically rely on user identification of spatial regions of the scene that are looping and user specification of a loop period for each of the identified spatial regions. Such conventional techniques also commonly rely on user identification of spatial regions of the scene that are static. In contrast to traditional approaches, the system 100 formulates video loop creation as an optimization (via an optimizer component 110) in which a per-pixel loop period can be determined for each pixel of the input video 102. Moreover, it is contemplated that the per-pixel loop period of one or more of the pixels of the input video 102 may be unity, whereby a pixel becomes static. Therefore, the optimization can advantageously automatically segment a scene into regions with naturally occurring periods, as well as regions that are static.

Further, looping content can be parameterized to preserve phase coherence, which can cause the optimization to be more tractable. For each pixel, there can be one degree of freedom available to temporally shift a video loop (e.g., the repeating time interval identified from the input video 102 using the per-pixel loop period and the per-pixel start time) in the output video 114. Thus, different delays can be introduced at each pixel, where a delay for a given pixel influences when the given pixel begins a loop in the output video 114. These delays can be set so as to preserve phase coherence, which can enhance spatiotemporal consistency. Accordingly, if two adjacent pixels are assigned the same per-pixel loop period and have respective input time intervals with non-zero overlap, then the pixel values within the time overlap can concurrently appear for both pixels in the output video 114. By way of illustration, if pixel C and pixel D have a common per-pixel loop period, and pixel C has a start frame that is 2 frames earlier than pixel D, then the loop at pixel D in the output video 114 can be shifted by 2 frames relative to the loop at pixel C such that content of the pixel C and the pixel D appears to be synchronized.

The system 100 can be at least part of a dedicated interactive tool that allows the output video 114 to be produced from the input video 102, for example. According to another example, the system 100 can be at least part of a set of dedicated interactive tools, which can include a dedicated interactive tool for forming a video loop from the input video 102 and a disparate dedicated interactive tool for producing the output video 114 from the formed video loop. By way of another example, it is contemplated that the system 100 can be included in a device that captures the input video 102; thus, the system 100 can be configured for execution by a processor of the device that captures the input video 102. Following this example, a camera of a smartphone can capture the input video 102, and a user can employ the smartphone to create the output video 114 using the system 100 (e.g., executed by a processor of the smartphone that captured the input video 102). Pursuant to a further example, a portion of the system 100 can be included in a device that captures the input video 102 (e.g., configured for execution by a processor of the device that captures the input video 102) and a remainder of the system 100 can be included in a disparate device (e.g., configured for execution by a processor of the disparate device); following this example, the portion of the system 100 included in the device that captures the input video 102 can form a video loop, while the remainder of the system 100 included in the disparate device can create the output video 114 from the formed video loop.

The input video 102 can be received from substantially any source. For example, the input video 102 can be received from a camera that captures the input video 102. According to another example, the input video 102 can be received from a data repository that retains the input video 102. It is to be appreciated, however, that the claimed subject matter is not limited to the foregoing examples.

Many types of devices, such as smartphones, cameras, tablet computers, laptop computers, mobile gaming consoles, and the like, can capture the input video 102. For instance, it is to be appreciated that such types of devices can capture high-definition video as well as photographs. Moreover, with increased parallel processing, the gap in resolution between these two media is narrowing. Thus, it may become more commonplace to archive short bursts of video rather than still frames. Accordingly, looping content can be automatically formed from the short bursts of captured video using the system 100.

The input video 102 received may have previously been stabilized (e.g., prior to receipt), for example. According to another example, the input video 102 can be stabilized subsequent to being received. Stabilization of the input video 102 can be performed automatically or with user guidance.

The system 100 further includes a cinemagraph generator 108 that can cause an optimizer component 110 to perform an optimization to determine a respective input time interval within the time range of the input video 102 for each pixel from the pixels in the input video 102. A respective input time interval for a particular pixel can include a per-pixel loop period and a per-pixel start time of a loop at the particular pixel within the time range from the input video 102. For example, the cinemagraph generator 108 can cause the optimizer component 110 to perform the optimization to determine the respective input time intervals within the time range of the input video 102 for the pixels that optimize an objective function. The cinemagraph generator 108 can also optionally include a semantic segmenting component (not shown) that segments the pixels of the input video 102 into semantic objects and labels each pixel with a semantic object label if the pixels of the input video have not previously been semantically segmented.

The system 100 further includes a candidate object selector 116 that can identify and select candidate object types to animate in a loop using the cinemagraph generator 108. The candidate object selector 116 finds the candidate objects to animate in some implementations by receiving the per-pixel semantic labels of all of the pixels in the input video. The best label (e.g., the most probable label or the best label as defined by the semantic segmentation algorithm) for each pixel and each frame in the input video is found independently. Given the best label map per frame, histograms are constructed by sum-pooling the label maps across all of the frames and pixels. To this end, given the best label map per frame, histograms are created for each pixel with respect to all types of object labels in the video. The histograms are aggregated across all frames to create a global histogram of the input video clip. Given the set which includes all possible object labels, object labels that correspond to inherently static objects are discarded. Furthermore, object labels that have low dynamicity (in terms of intensity variation across time or motion) are also discarded. (The dynamicity of an object label is measured by summing intensity variations across time or magnitudes of optical flow motion over the pixels involved in the label.) Object labels whose connected components are too small on average are also discarded. The top K object labels that have the K largest histogram values of the global histogram over the remaining (not discarded) label set are selected as the candidate objects that will appear as moving objects in a candidate video-loop. An object candidate can be a single object type (e.g., face, tree, car, etc.) or a combination of possible object types.

Once candidate objects are selected by the candidate object selector, candidate video loops are then generated by using a candidate video loop generator 118. Some of these candidate video loops are selected to create a looped video or cinemagraph. In some implementations, candidate loops are created by generating rough explicit masks of the region of each specific candidate object. A feature set is constructed for each mask region associated with a candidate object. The features in the feature set are clustered, and the centroids of the clusters form the feature basis that represents the object label for the masked region. When looping start times and looping periods are computed for an object candidate, pixels whose features are similar to this feature basis are encouraged to be dynamic. For example, when a pixel is selected to be static during the optimization process, if the pixel has a feature similar to one of the feature bases, then the pixel is penalized so as to be dynamic in the optimization (e.g., so that the pixel can have a looping period greater than one frame). Candidate loops are then created using the dynamic pixels.

The system 100 further includes a loop selector 120 that selects which candidate loops to use to create the output video 114. In some implementations loop attractiveness of candidate loops is determined by a loop attractiveness evaluator 122. The loop attractiveness evaluator 122 in some implementations employs a predictive model 124 to determine which candidate loops to use. The predictive model 124 will be discussed in greater detail in the paragraphs below.

Moreover, the system 100 includes a viewer component 112 that can create the output video 114 based upon the values at the pixels over the respective input time intervals for the pixels in the input video 102. The viewer component 112 can generate the output video or cinemagraph 114 based upon the video loop created by the cinemagraph generator 108. The output video 114 can include looping content and/or static content. The output video 114 can be denoted as a 3D volume L(x,t), with the 2D pixel location x and frame time t. The viewer component 112 can cause the output video 114 to be rendered on a display screen of a device.

As discussed above, in some cinemagraph generator implementations as described herein, the loop attractiveness evaluator 122 employs a trained predictive model 124 that is used to evaluate the attractiveness of the animation of types of objects to a user. In these implementations, the model 124 is trained to determine the attractiveness of looping certain objects in a scene, and this trained model is then used to automatically determine the most attractive objects to animate in a cinemagraph. In order to create the predictive model 124, a training set of looped training image sequences is received. A plurality of features from each looped training image sequence in the set of looped training image sequences is extracted. A human subjective quality rating is received for each of the looped training image sequences. The human subjective quality rating can be provided by one or more users who rate how much they like each looped training image sequence that is displayed to them. The predictive model 124 is then generated from a combination of the features extracted from each looped training image sequence and the corresponding human subjective quality ratings of the looped training image sequences. Such a predictive model can be learned through direct regression (support vector regression or random forests), prediction by clustering, or collaborative filtering. These methods are well-known in the literature.

Once created, the predictive model 124 of the loop attractiveness evaluator 122 can then be used to rank a set of candidate loops for generating a looped image sequence, for example, a cinemagraph, in the output video 114. In order to use the predictive model 124 to rank the set of candidate loops, a plurality of candidate loops for generating a looped image sequence, such as a cinemagraph, is received. The predictive model 124 is applied to the candidate loops to generate a quality score for each candidate loop. The quality score defines a subjective quality of a corresponding candidate loop. The candidate loops are then ranked based on the quality scores. A looping image sequence (e.g., a cinemagraph) is then generated using a prescribed number of the highest ranked candidate loops. In some implementations, the quality score of a candidate loop must exceed a predetermined threshold in order for the candidate loop to be ranked.
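
For illustration, the following is a minimal sketch (not the patented implementation) of how such a loop-attractiveness model might be trained and applied, assuming that per-loop feature vectors are produced by a hypothetical extract_loop_features callable and that a random forest is used for the direct regression; the threshold and top-N values are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_attractiveness_model(training_loops, human_ratings, extract_loop_features):
    """Fit a regressor that maps loop features to subjective quality ratings."""
    X = np.stack([extract_loop_features(loop) for loop in training_loops])
    y = np.asarray(human_ratings, dtype=float)
    return RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

def rank_candidate_loops(model, candidate_loops, extract_loop_features,
                         score_threshold=0.5, top_n=3):
    """Score candidate loops, drop those below the threshold, and return the top-N."""
    X = np.stack([extract_loop_features(loop) for loop in candidate_loops])
    scores = model.predict(X)
    ranked = [(s, loop) for s, loop in zip(scores, candidate_loops) if s >= score_threshold]
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked[:top_n]
```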

2.0 Input Time Interval Computation

The following paragraphs provide an explanation, including exemplary mathematical computations, of how input time intervals can be computed in some cinemagraph generator implementations described herein.

Reference is again made to FIG. 1. As set forth herein, the cinemagraph generator 108 can cause the optimizer component 110 to perform the optimization to determine the respective input time intervals within the time range of the input video 102 for the pixels that optimize an objective function. Video loop construction can be formulated as an MRF problem. Accordingly, the per-pixel start times s={sx} and the per-pixel loop periods p={px} that minimize the following objective function can be identified:


$$E(s,p) = E_{\text{consistency}}(s,p) + E_{\text{static}}(s,p)$$

In the foregoing objective function, the first term can encourage pixel neighborhoods in the video loop to be consistent both spatially and temporally with those in the input video 102. Moreover, the second term in the above noted objective function can penalize assignment of static loop pixels except in regions of the input video 102 that are static. In contrast to conventional approaches, the MRF graph can be defined over a 2D spatial domain rather than a full 3D video volume. Also, in contrast to conventional approaches, the set of unknowns can include a per-pixel loop period at each pixel.

According to an example, the cinemagraph generator 108 can cause the optimizer component 110 to solve the MRF optimization using a multi-label graph cut algorithm, where the set of pixel labels is the outer product of candidate start times {s} and periods {p}. In addition, as indicated previously, the discovery of regions for looping takes into consideration the semantics of the scene to force pixels within the same semantic segment to have the same looping parameters. By employing semantic segmentation to prevent breaking up semantically meaningful objects (e.g., face, animals, man-made objects such as motorcycles and cars, to name a few), the loops generated for that object cover the entire region containing the object. This concept can employ any semantic segmentation technique, such as the technique described in J. Long, E. Shelhamer, and T. Darrell, “Fully Convolutional Networks for Semantic Segmentation,” IEEE Conference on Computer Vision and Pattern Recognition, 2015 for general objects; and in D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun, “Joint Cascade Face Detection and Alignment,” European Conference on Computer Vision, 2014 for faces.

Before describing the creation of semantically meaningful video loops in more detail, it will be helpful to understand how a semantic-free video loop could be created. In the generated video loop created by the cinemagraph generator 108, spatiotemporal neighbors of each pixel can look similar to those in the input video 102. Because the domain graph is defined on the 2D spatial grid, the objective function can distinguish both spatial and temporal consistency:


$$E_{\text{consistency}}(s,p) = \beta\, E_{\text{spatial}}(s,p) + E_{\text{temporal}}(s,p)$$

The spatial consistency term Espatial can measure compatibility for each pair of adjacent pixels x and z, averaged over time frames in the video loop.

$$E_{\text{spatial}} = \sum_{\|x-z\|=1} \frac{\gamma_s(x,z)}{T} \sum_{t=0}^{T-1} \Big( \big\|V(x,\phi(x,t)) - V(x,\phi(z,t))\big\|^2 + \big\|V(z,\phi(x,t)) - V(z,\phi(z,t))\big\|^2 \Big)$$

The period T is the least common multiple (LCM) of the per-pixel loop periods of the pixels in the input video 102. Accordingly, the objective can be formulated as lim_{T→∞} E_spatial, the average spatial consistency over an infinitely looping video. Further, pixel value differences at both pixels x and z can be computed for symmetry. Moreover, the factor γs(x,z) can be as follows:

$$\gamma_s(x,z) = \frac{1}{1 + \lambda_s\, \operatorname{MAD}_t \big\|V(x,t) - V(z,t)\big\|}$$

The factor γs(x,z) can reduce the consistency cost between pixels when the temporal median absolute deviation (MAD) of the color values (e.g., differences of color values) in the input video 102 is large because inconsistency may be less perceptible. It is contemplated that MAD can be employed rather than variance due to MAD being less sensitive to outliers; yet, it is to be appreciated that the claimed subject matter is not so limited. Pursuant to another example, the MAD metric can be defined in terms of respective neighborhoods of the pixel x and the pixel z, instead of single pixel values V(x,t) and V(z,t). According to a further example, λs can be set to 100; however, the claimed subject matter is not so limited.
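
For illustration, the following is a minimal sketch of this attenuation factor, assuming the input video V is a NumPy array of shape (T, H, W, 3) and using the example value λs = 100 from the text; the MAD is taken over the per-frame color distances between the two pixels.

```python
import numpy as np

def gamma_s(V, x, z, lambda_s=100.0):
    """1 / (1 + lambda_s * MAD over time of the color distance between pixels x and z)."""
    diff = np.linalg.norm(V[:, x[0], x[1]] - V[:, z[0], z[1]], axis=-1)  # per-frame color distance
    mad = np.median(np.abs(diff - np.median(diff)))                      # median absolute deviation over time
    return 1.0 / (1.0 + lambda_s * mad)
```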

The energy Espatial(x,z) can be simplified for various scenarios, which can enable efficient evaluation. According to an exemplary scenario, pixels x and z can both be static. Thus, the energy can reduce to:


$$E_{\text{spatial}}(x,z) = \big\|V(x,s_x) - V(x,s_z)\big\|^2 + \big\|V(z,s_x) - V(z,s_z)\big\|^2.$$

In accordance with another exemplary scenario, pixel x can be static and pixel z can be looping. Accordingly, the energy can simplify to:

$$E_{\text{spatial}}(x,z) = \frac{1}{T} \sum_{t=0}^{T-1} \Big( \big\|V(x,s_x) - V(x,\phi(z,t))\big\|^2 + \big\|V(z,s_x) - V(z,\phi(z,t))\big\|^2 \Big)$$

For each of the two summed vector norms and for each color coefficient v_c ∈ V, the sum can be obtained as:

$$\frac{1}{T} \sum_{t=0}^{T-1} \big( v_c(x,s_x) - v_c(x,\phi(z,t)) \big)^2 = v_c^2(x,s_x) - \frac{2\, v_c(x,s_x)}{p_z} \sum_{t=s_z}^{s_z+p_z-1} v_c(x,t) + \frac{1}{p_z} \sum_{t=s_z}^{s_z+p_z-1} v_c^2(x,t)$$

The two sums above can be evaluated in constant time by pre-computing temporal cumulative sum tables on V and V².
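
The constant-time evaluation can be illustrated with a small sketch, assuming v is a 1-D NumPy array holding one color coefficient of a single pixel over time, s_x indexes the static frame of pixel x, and [s_z, s_z + p_z) is the loop of pixel z.

```python
import numpy as np

def build_cumsum_tables(v):
    """Prefix sums of v and v**2, with a leading zero so a sum over [a, b) is table[b] - table[a]."""
    c1 = np.concatenate(([0.0], np.cumsum(v)))
    c2 = np.concatenate(([0.0], np.cumsum(v ** 2)))
    return c1, c2

def mean_squared_diff(v, c1, c2, s_x, s_z, p_z):
    """Average of (v[s_x] - v[t])**2 over the loop t in [s_z, s_z + p_z), in O(1) time."""
    a = v[s_x]
    sum_v = c1[s_z + p_z] - c1[s_z]
    sum_v2 = c2[s_z + p_z] - c2[s_z]
    return a * a - 2.0 * a * sum_v / p_z + sum_v2 / p_z
    # Equivalent brute force: np.mean((v[s_x] - v[s_z:s_z + p_z]) ** 2)
```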

In accordance with another exemplary scenario, when both pixels x and z are looping with the same period, px=pz, the energy can reduce to:

$$E_{\text{spatial}}(x,z) = \frac{1}{p_x} \sum_{t=0}^{p_x-1} \Big( \big\|V(x,\phi(x,t)) - V(x,\phi(z,t))\big\|^2 + \big\|V(z,\phi(x,t)) - V(z,\phi(z,t))\big\|^2 \Big)$$

Further, the zero value terms for which φ(x,t)=φ(z,t) can be detected and ignored. Thus, as previously illustrated in FIG. 3, for the case where start times are similar, significant time intervals marked with arrows in FIG. 3 can be ignored.

According to another exemplary scenario, when the pixels have differing loop periods, generally the sum is computed using T=LCM(px,pz). However, when the two loop periods are relatively prime (e.g., LCM(px,pz)=pxpz), then the following can be evaluated:

$$\frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} (a_i - b_j)^2 = \frac{1}{m} \sum_{i=0}^{m-1} a_i^2 + \frac{1}{n} \sum_{j=0}^{n-1} b_j^2 - \frac{2}{mn} \Big( \sum_{i=0}^{m-1} a_i \Big)\Big( \sum_{j=0}^{n-1} b_j \Big)$$

In the foregoing, a and b correspond to coefficients in V(x,•) and V(z,•). Thus, the precomputed cumulative sum tables from the exemplary scenario noted above where pixel x is static and pixel z is looping can be reused to evaluate these terms in constant time.

Moreover, it is contemplated that the expected squared difference can be used as an approximation even when the periods px and pz are not relatively prime. Such approximation can provide a speed up without appreciably affecting result quality.

Moreover, as noted above, the objective function can include a temporal consistency objective term Etemporal.

$$E_{\text{temporal}} = \sum_x \Big( \big\|V(x,s_x) - V(x,s_x+p_x)\big\|^2 + \big\|V(x,s_x-1) - V(x,s_x+p_x-1)\big\|^2 \Big)\, \gamma_t(x)$$

The aforementioned temporal consistency objective term can compare, for each pixel, the value at the per-pixel start time of the loop sx and the value after the per-pixel end time of the loop sx+px (e.g., from a next frame after the per-pixel end time) and, for symmetry, the value before the per-pixel start time of the loop sx−1 (e.g., from a previous frame before the per-pixel start time) and the value at the per-pixel end time of the loop sx+px−1.

Because looping discontinuities are less perceptible when a pixel varies significantly over time in the input video 102, the consistency cost can be attenuated using the following factor:

$$\gamma_t(x) = \frac{1}{1 + \lambda_t\, \operatorname{MAD}_t \big\|V(x,t) - V(x,t+1)\big\|}$$

The foregoing factor can estimate the temporal variation at the pixel based on the median absolute deviation of successive pixel differences. According to an example, λt can be set to 400; yet, the claimed subject matter is not so limited.
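
For illustration, the following is a minimal sketch of the temporal attenuation γt(x) and the per-pixel temporal consistency cost, assuming V has shape (T, H, W, 3), λt = 400 as in the text, and 1 ≤ sx with sx + px < T so that the boundary frames exist.

```python
import numpy as np

def gamma_t(V, x, lambda_t=400.0):
    """Attenuation from the MAD of successive-frame color differences at pixel x."""
    d = np.linalg.norm(np.diff(V[:, x[0], x[1]], axis=0), axis=-1)
    mad = np.median(np.abs(d - np.median(d)))
    return 1.0 / (1.0 + lambda_t * mad)

def e_temporal_pixel(V, x, s_x, p_x):
    """Squared color differences across the loop boundary, attenuated by gamma_t(x)."""
    px_vals = V[:, x[0], x[1]]
    cost = (np.sum((px_vals[s_x] - px_vals[s_x + p_x]) ** 2) +
            np.sum((px_vals[s_x - 1] - px_vals[s_x + p_x - 1]) ** 2))
    return cost * gamma_t(V, x)
```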

For a pixel assigned as being static (e.g., px=1), Etemporal can compute the pixel value difference between successive frames, and therefore, can favor pixels with zero optical flow in the input video 102. While such behavior can be reasonable, it may be found that moving objects can be inhibited from being frozen in a static image. According to another example, it is contemplated that the temporal energy can be set to zero for a pixel assigned to be static.

According to an example, for looping pixels, a factor of 1/p_x can be utilized to account for shorter loops revealing temporal discontinuities more frequently relative to longer loops. However, it is to be appreciated that the claimed subject matter is not limited to utilization of such a factor.

A neighborhood N of a pixel can refer to a spatiotemporal neighborhood of the pixel. Thus, the neighborhood of a given pixel can be a set of pixels within a specified window in both space and time around the given pixel, optionally weighted by a kernel (e.g., a Gaussian kernel) that reduces influence of pixel values that are farther away in space or time from the given pixel. Moreover, it is contemplated that the specified window for a neighborhood of a given pixel can include the given pixel while lacking other pixels (e.g., a neighborhood of a given pixel in the input video 102 can be the given pixel itself).

The term Estatic can be utilized to adjust the energy objective function based on whether the neighborhood N of each pixel has significant temporal variance in the input video 102. In one implementation, if a pixel is assigned a static label, it can incur a cost penalty cstatic. Such penalty can be reduced according to the temporal variance of the neighborhood N of a pixel. Thus, E_static = Σ_{x : p_x = 1} E_static(x) can be defined with:

$$E_{\text{static}}(x) = c_{\text{static}} \cdot \min\!\Big(1,\ \lambda_{\text{static}}\, \operatorname{MAD}_t \big\|N(x,t) - N(x,t+1)\big\| \Big)$$

In the foregoing, λstatic can be set to 100, and N can be a Gaussian weighted spatial temporal neighborhood with σx=0.9 and σt=1.2; yet, the claimed subject matter is not so limited.
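
For illustration, the following is a minimal sketch of this static-assignment penalty, assuming V has shape (T, H, W, 3); the Gaussian-weighted spatiotemporal neighborhood is reduced here to a small spatial window for brevity, and the c_static value is an illustrative placeholder rather than a value given in the text.

```python
import numpy as np

def e_static_pixel(V, x, c_static=10.0, lambda_static=100.0, radius=1, sigma_x=0.9):
    """Static penalty scaled by the temporal MAD of the Gaussian-weighted neighborhood change."""
    y0, x0 = x
    T, H, W, _ = V.shape
    ys = slice(max(0, y0 - radius), min(H, y0 + radius + 1))
    xs = slice(max(0, x0 - radius), min(W, x0 + radius + 1))
    patch = V[:, ys, xs, :]                                          # neighborhood N(x, t)
    gy, gx = np.mgrid[ys, xs]
    weights = np.exp(-((gy - y0) ** 2 + (gx - x0) ** 2) / (2.0 * sigma_x ** 2))
    weights /= weights.sum()
    change = np.sqrt((np.diff(patch, axis=0) ** 2).sum(axis=-1))     # per-pixel frame-to-frame change
    change = (change * weights).sum(axis=(1, 2))                     # Gaussian-weighted neighborhood change
    mad = np.median(np.abs(change - np.median(change)))
    return c_static * min(1.0, lambda_static * mad)
```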

The foregoing looping procedure can be modified by computing an additional semantic term, which encourages neighboring pixels within the same semantic object to not separate in the final result. The same object is tracked over time, and the faster the object moves, the stronger the semantic factor is used to discourage adjacent pixel breakup. In one implementation, a temporal semantic consistency term is added to the Etemporal(x,z) term, and a spatial semantic consistency term is used in the Espatial(x,z) term, to discourage adjacent pixel breakup in semantic object regions.

In the implementation involving adding semantic consistency terms to the Etemporal(x,z) and Espatial(x,z) terms, the following formulation is employed. Denoting start times s={sx}, periods p={px}, and labels l={lx}, where lx={sx,px}, the video-loop problem can be formulated as the following MRF problem,

$$\arg\min_{s,p}\ \sum_x \Big( E_{\text{temp.}}(l_x) + \alpha_1 E_{\text{static}}(l_x) + \alpha_2 \sum_{z \in N(x)} E_{\text{spa.}}(l_x, l_z) \Big), \quad (1)$$

where each term involves semantic consistency as well as photometric consistency. The temporal consistency term Etemp. measures the consistency near the loop start frame sx and end frame sx+px, as


$$E_{\text{temp.}}(l_x) = \gamma_t(x)\big[ (1-w)\,\Phi_V(x) + w\,\Phi_F(x) \big], \quad (2)$$

where the temporal photometric consistency ΦV(x) and the temporal semantic consistency ΦF(x) are defined as follows:

$$\Phi_V(x) = \frac{1}{\dim(V)} \Big( \big\|V(x,s_x) - V(x,s_x+p_x)\big\|^2 + \big\|V(x,s_x-1) - V(x,s_x+p_x-1)\big\|^2 \Big),$$
$$\Phi_F(x) = \frac{1}{\dim(F)} \Big( \big\|F(x,s_x) - F(x,s_x+p_x)\big\|^2 + \big\|F(x,s_x-1) - F(x,s_x+p_x-1)\big\|^2 \Big).$$

The semantic feature F(x,t) encodes semantic information of the scene, e.g., a label map, semantic segmentation responses, or intermediate activations of a Convolutional Neural Network (CNN). How F(x,t) is computed will be described shortly.
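
For illustration, the following is a minimal sketch of evaluating the semantic-augmented temporal term of Eq. (2), assuming V has shape (T, H, W, 3), the semantic feature volume F has shape (T, H, W, D), and w in [0, 1] is the weight balancing photometric against semantic consistency (its value is left to the caller).

```python
import numpy as np

def boundary_consistency(volume, x, s_x, p_x):
    """Phi: normalized squared differences across the loop boundary at pixel x."""
    px_vals = volume[:, x[0], x[1]]
    d = (np.sum((px_vals[s_x] - px_vals[s_x + p_x]) ** 2) +
         np.sum((px_vals[s_x - 1] - px_vals[s_x + p_x - 1]) ** 2))
    return d / px_vals.shape[-1]                     # divide by dim(V) or dim(F)

def e_temp(V, F, x, s_x, p_x, w, gamma_t_x):
    phi_v = boundary_consistency(V, x, s_x, p_x)     # temporal photometric consistency
    phi_f = boundary_consistency(F, x, s_x, p_x)     # temporal semantic consistency
    return gamma_t_x * ((1.0 - w) * phi_v + w * phi_f)
```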

The static term Estatic assigns a penalty to static pixels to prevent a trivial all-static solution as


$$E_{\text{static}}(l_x) = c_{\text{static}} \cdot \delta[p_x = 1]. \quad (3)$$

The spatial consistency term Espa. is designed to measure the spatial neighbor compatibility of color profiles over all spatially adjacent pixels x and z across a loop period. This is extended to measure semantic compatibility between neighbor pixels as well. Then, Espa. is defined as


$$E_{\text{spa.}}(l_x, l_z) = \gamma_s(x,z)\big[ (1-w)\,\Psi_V(x,z) + w\,\Psi_F(x,z) \big]. \quad (4)$$

The spatial photometric consistency ΨV(x,z) and the spatial semantic consistency ΨF(x,z) are defined as follows:

$$\Psi_V(x,z) = \frac{1}{\dim(V)\cdot T} \sum_{t=0}^{T-1} \Big( \big\|V(x,\phi(x,t)) - V(x,\phi(z,t))\big\|^2 + \big\|V(z,\phi(x,t)) - V(z,\phi(z,t))\big\|^2 \Big),$$
$$\Psi_F(x,z) = \frac{1}{\dim(F)\cdot T} \sum_{t=0}^{T-1} \Big( \big\|F(x,\phi(x,t)) - F(x,\phi(z,t))\big\|^2 + \big\|F(z,\phi(x,t)) - F(z,\phi(z,t))\big\|^2 \Big),$$

where the period T is the least common multiple (LCM) of per-pixel periods.

The connectivity potential γs(x,z) in Eq. 4 is employed to account for spatial coherence (consistency) at each pixel. The connectivity potential defined previously is computed on the basis of the deviation of the intensity difference across time. However, large intensity variation alone can break loop coherence apart, regardless of real object boundaries or motion. In view of this, in one implementation, the connectivity potential is instead measured by computing the difference of semantic label occurrence distributions. The semantic label occurrence measure is a per-pixel histogram of object labels across time. If two histograms are similar, it indicates that the two pixels have similar semantic occurrence behavior.

The α1 and α2 balance parameters in Eq. 1 can be tuned. In one implementation, the semantic information is employed to accomplish this balance parameter adjustment. Given per-pixel semantic label information, the balance parameters are adaptively adjusted to control the dynamicity of labels. To this end, objects are classified as being natural or non-natural objects to encourage the diversity of loop labels or to synchronize loop labels, respectively. The natural category denotes objects like trees, water, grass, waterfalls, and so on, which are natural things and are often easily loopable. The non-natural category denotes objects like a person, animal, car, and so on, which are sensitive to label incompatibility (incoherence). Additionally, it is observed that long periods are favored for natural objects because they appear perceptually natural, while short periods are perceptibly unnatural. In order to encourage long periods, a high penalty is added on natural object regions for short-period labels.

2.1 Candidate Object Selection and Candidate Loop Selection

Referring again to FIG. 1, given candidate segments (per-pixel category responses obtained from semantic segmentation), potential object candidates are selected that will appear as a moving object in each candidate video-loop. An object candidate can be a single object type or a combination of possible candidates. In one implementation, the candidate object selector 116 uses the following procedure to select candidate objects that are used to create candidate loops:

    • 1. Given the per-pixel semantic responses of the input video, the best label l*_x is picked for each pixel x and each frame independently by l*_x = argmax_l score_x(l), where score_x(l) denotes the category response of an object label l at the pixel x obtained from the semantic segmentation algorithm that is used to semantically segment the semantic objects.
    • 2. Given the best label map for each frame computed in the previous step, histograms h⃗_f = [h_f(1), …, h_f(l), …, h_f(L)] are constructed for each frame f with respect to all the L types of object labels. The histograms are aggregated by h⃗′ = Σ_f h⃗_f across all frames, i.e., sum-pooling. Then, one has a global label histogram h⃗′ of the input video clip.
    • 3. Given a set that includes all possible object labels, object labels that correspond to inherently static objects are discarded from the set.
    • 4. Object labels that have low dynamicity are also discarded from the set. The dynamicity of an object label l is measured by summing intensity variations across time, i.e., γ_t(x), or magnitudes of optical flow motion v(x), i.e., ∥v(x)∥, over the pixels x involved in the label l.
    • 5. Object labels whose connected components are too small on average are also discarded from the set.
    • 6. The top-K object labels {θ} that have the K largest values of the histogram h⃗′ over the remaining label set are picked, so that the discarded labels are never regarded as potential object candidates.
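
For illustration, the following is a minimal sketch of this selection procedure, assuming the per-pixel semantic responses are available as a NumPy array of shape (T, H, W, L), that the set of inherently static label indices and the per-label dynamicity scores are precomputed, and that the thresholds are illustrative placeholders.

```python
import numpy as np
from scipy import ndimage

def mean_component_area(best_labels, label):
    """Average connected-component area of `label` over all frames."""
    areas = []
    for frame in best_labels:
        components, count = ndimage.label(frame == label)
        if count:
            areas.extend(np.bincount(components.ravel())[1:])   # drop the background bin
        else:
            areas.append(0)
    return float(np.mean(areas))

def select_candidate_objects(responses, static_labels, dynamicity,
                             dyn_threshold=1e-3, min_area=50, K=3):
    T, H, W, L = responses.shape
    best_labels = responses.argmax(axis=-1)                     # step 1: best label per pixel and frame
    global_hist = np.zeros(L)
    for f in range(T):                                          # step 2: per-frame histograms, sum-pooled
        global_hist += np.bincount(best_labels[f].ravel(), minlength=L)
    survivors = [l for l in range(L)
                 if l not in static_labels                      # step 3: discard inherently static classes
                 and dynamicity[l] >= dyn_threshold             # step 4: discard low-dynamicity labels
                 and mean_component_area(best_labels, l) >= min_area]  # step 5: discard tiny components
    survivors.sort(key=lambda l: global_hist[l], reverse=True)  # step 6: rank by the global histogram
    return survivors[:K]
```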

Once the candidate objects are selected, candidate loops are generated using the candidate loop generator 118. Given the top-K specific candidate object types (labels) {θ} as described above, the candidate loop generator 118 computes rough explicit masks for each specific candidate by thresholding the number of occurrences of the specified candidate label across time. Specifically, label histograms h⃗_x = [h_x(1), …, h_x(l), …, h_x(L)] are constructed at each pixel x by accumulating across all the frames in a similar way as described above. Then, for each candidate label θ, a binary mask is computed by thresholding the histogram value h_x(θ) for all the pixels. The threshold value is an integer number of frames, so that the binary mask is set to 1 for the regions where the object θ occurs more times than the threshold. In summary, the candidate loop generator 118 constructs the candidate masks by the following steps:

    • 1. Constructing label histograms h⃗_x by accumulating across all the frames.
    • 2. For a candidate label θ, generating a mask for θ by binarizing h_x(θ) with a threshold value.
    • 3. Iterating step 2 for all the candidate labels θ in the candidate label set.

Using the candidate masks, the candidate loops are constructed. Any combination of the remaining labels in the label set can also be an element of the candidate set, so that various types of candidate video loops are considered; e.g., for remaining labels {1,2,3}, the candidates are {1, 2, 3, {1,2}, {1,3}, {2,3}, {1,2,3}}. The elements of a label combination, e.g., {2,3}, can be analogously taken into account by taking a union operation (merging) of the two binary masks independently generated from the respective label elements, i.e., θ=2 and θ=3. The subsequent processes apply similarly.
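
For illustration, the following is a minimal sketch of the candidate-mask construction, assuming `best_labels` is the per-frame best-label map of shape (T, H, W), `candidates` holds the top-K label indices, and `tau` is the occurrence-count threshold (an integer number of frames); label combinations are formed by taking the union of the individual masks as described above.

```python
import numpy as np
from itertools import combinations

def per_pixel_label_histograms(best_labels, L):
    """h_x(l): the number of frames in which pixel x is assigned label l."""
    T, H, W = best_labels.shape
    hist = np.zeros((H, W, L), dtype=np.int32)
    for l in range(L):
        hist[:, :, l] = (best_labels == l).sum(axis=0)
    return hist

def candidate_masks(best_labels, candidates, L, tau):
    hist = per_pixel_label_histograms(best_labels, L)
    masks = {(theta,): hist[:, :, theta] >= tau for theta in candidates}
    # Label combinations are handled by taking the union (merge) of the individual masks.
    for r in range(2, len(candidates) + 1):
        for combo in combinations(candidates, r):
            masks[combo] = np.any([masks[(theta,)] for theta in combo], axis=0)
    return masks
```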

Then, a representative feature basis of θ on the masked region can be extracted. Given the mask for the label θ, a feature set is constructed by gathering all the semantic features F(x,t) over the masked region R and the frames t in which the label θ occurs, i.e., the set

{F(x,t) | x ∈ R and θ = argmax_{l′} score_{x,t}(l′) at pixel x in frame t}, where score_{x,t}(l′) denotes the category response of an object label l′ at the pixel x and the frame t obtained from the semantic segmentation algorithm. Namely, this feature set consists of semantic features relating to the label θ. In order to estimate the representative feature bases, the features in this set are clustered by K-means clustering or a Gaussian mixture model (GMM). The centroid vectors {v⃗_i}_{i=1}^{K} of the K clusters are regarded as the representative feature bases. In summary, the candidate loops can be calculated by the candidate loop generator 118 as follows:

    • 1. Given the mask w.r.t. the label θ, construct the feature set for θ as described above.
    • 2. Cluster the features in that set. The centroids of the clusters are the feature bases that represent the object label θ.
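
For illustration, the following is a minimal sketch of this feature-basis extraction, assuming F is the semantic feature volume of shape (T, H, W, D), `mask` is the (H, W) boolean mask for the label θ, and `best_labels` is the per-frame best-label map of shape (T, H, W); K-means comes from scikit-learn, and a Gaussian mixture model could be substituted.

```python
import numpy as np
from sklearn.cluster import KMeans

def feature_bases(F, best_labels, mask, theta, K=5):
    T = F.shape[0]
    feats = []
    for t in range(T):
        sel = mask & (best_labels[t] == theta)   # masked pixels where theta is the best label in frame t
        feats.append(F[t][sel])
    feats = np.concatenate(feats, axis=0)
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(feats)
    return km.cluster_centers_                   # the centroid vectors {v_i}, i.e., the feature bases
```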

In the optimization step for start frames and periods, using this feature basis, one can encourage some regions to be dynamic when the regions have features similar to the representative feature bases. Specifically, one can modify the static term in Eq. (3) such that the similarity between a semantic feature and the feature bases is taken into account, encouraging a pixel with high similarity to be dynamic (in other words, penalizing the case where the period px=1, i.e., static), i.e.,

$$E_{\text{static}}(l_x) = \Big( c_{\text{static}} + \lambda\, \max_i\ \operatorname{sim}\!\big(\vec{v}_i,\ F(x,s_x)\big) \Big)\, \delta[p_x = 1], \quad (5)$$

where sim(•,•) denotes a similarity measure between two vectors, such as normalized correlation, and λ is a balance parameter. This imposes a soft penalty that encourages pixels associated with the label θ to be dynamic in the resulting video loop candidates, and it is relatively insensitive to failures of the hard candidate mask, which is often very rough.
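
For illustration, the following is a minimal sketch of the modified static term of Eq. (5), using normalized correlation as the similarity measure; `bases` are the centroid vectors from the clustering step, and the c_static and lambda_ values are illustrative placeholders rather than values given in the text.

```python
import numpy as np

def normalized_correlation(a, b):
    a = a - a.mean()
    b = b - b.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def e_static_semantic(F, x, s_x, p_x, bases, c_static=10.0, lambda_=5.0):
    if p_x != 1:                                  # the penalty applies only to static assignments
        return 0.0
    feature = F[s_x, x[0], x[1]]                  # semantic feature F(x, s_x)
    similarity = max(normalized_correlation(v, feature) for v in bases)
    return c_static + lambda_ * similarity
```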

Since the size of the label space is the outer product of candidate start times s and periods p, i.e., |s|×|p|, directly optimizing Eq. (1) may produce poor local minima. This is because graph cut α-expansion only deals with a single new candidate label at a time. A multi-stage approach is therefore introduced, which can be regarded as an alternating directional optimization. The procedure is as follows:

    • 1. For each candidate looping period p>1, the per-pixel start times sx|p that create the best video-loop L given the fixed p, denoted L|p, are found by solving a multi-label graph cut. This multi-label graph cut is initialized by the start frames sx|p that minimize Etemp. per pixel independently.
    • 2. Given a candidate object label, one solves for per-pixel periods px≧1 that define the best video-loop (px, sx|px) using the pixel start times obtained in the first stage, again by solving a multi-label graph cut for the candidate video-loop generation. In this stage, the set of labels includes all the periods p>1 considered in the first stage, together with all possible frames sx for the static case p=1.
    • 3. Due to the restriction to the paired labels formed in the first stage, the optimization can get stuck in a poor solution. In this stage, the periods of each pixel obtained from the second stage are fixed and one solves a multi-label graph cut only for sx.

This single alternating optimization over the second and third stages produces a better solution than a two-stage approach. Furthermore, since several candidate video-loops are generated for specified objects, the multi-label graph cuts are solved several times. Fortunately, the candidate-specific regularization is not involved when the paired initial labels are made in the first stage. The paired initial labels are therefore shared across all the candidate video-loop generation procedures, i.e., the second and third stages.

The number of labels optimized in the first stage is equal to the number of feasible start frames |s| for each period. In the second stage, the number of labels is |p|+|s|: the optimized loops found in the first stage for the periods p plus the static loops corresponding to the frames s of the input video. Lastly, the number of labels is |s| in the third stage. Given these numbers of labels, the computational complexity of this multi-stage approach is significantly lower than a direct one-shot multi-label graph cut with |p|×|s| labels.
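
For illustration, the following is a schematic sketch of the three-stage procedure. The multi-label graph cut is abstracted behind a hypothetical solve_graph_cut(labels, cost) callable (e.g., wrapping an alpha-expansion solver), and the three cost constructors are likewise placeholder callables supplied by the caller; none of these names are a real API.

```python
def multistage_optimization(loop_periods, start_frames, solve_graph_cut,
                            cost_for_fixed_period, cost_for_period_selection,
                            cost_for_fixed_periods):
    # Stage 1: for each candidate period p > 1, solve for the per-pixel start times s_x|p
    # that give the best loop L|p for that fixed period.
    starts_per_period = {p: solve_graph_cut(start_frames, cost_for_fixed_period(p))
                         for p in loop_periods if p > 1}

    # Stage 2: choose per-pixel periods p_x >= 1; the label set contains every looping
    # period from stage 1 plus one static label per input frame (the p = 1 case).
    period_labels = [p for p in loop_periods if p > 1] + [(1, s) for s in start_frames]
    periods = solve_graph_cut(period_labels, cost_for_period_selection(starts_per_period))

    # Stage 3: fix the stage-2 periods and re-solve a graph cut over start times only.
    starts = solve_graph_cut(start_frames, cost_for_fixed_periods(periods))
    return periods, starts
```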

As aforementioned, the semantic feature F(x,t) can be any type of feature vector that can measure the semantic difference between two pixels, e.g., a label map, semantic segmentation responses, intermediate activations of a CNN, and so on. For example, in one tested implementation, the semantic segmentation responses obtained from the semantic segmentation approaches described in J. Long, E. Shelhamer, and T. Darrell, “Fully Convolutional Networks for Semantic Segmentation,” IEEE Conference on Computer Vision and Pattern Recognition, 2015; and J. Dai, K. He, and J. Sun, “Convolutional Feature Masking for Joint Object and Stuff Segmentation,” IEEE Conference on Computer Vision and Pattern Recognition 2015 were used as the semantic feature F(x,t). These semantic segmentations are applied to each frame of the input video. In one tested implementation, the models learned using the approach described in R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille, “The Role of Context for Object Detection and Semantic Segmentation in the Wild” IEEE Conference on Computer Vision and Pattern Recognition 2014 were used for the two semantic segmentations. This produces a 60-dimensional output vector for each pixel. It is noted that when feeding this vector representing F(x,t) into the optimization, it is normalized so that its sum is one.

In one implementation, an ensemble approach is taken to exploit the semantic information as much as possible. Although the aforementioned J. Long et al. approach involves fully convolutional features, the fixed kernel size employed in the approach can affect the final segmentation results depending on the input resolution. In view of this, in one implementation, the J. Long et al. approach is applied over a scale pyramid with 3 levels, and the semantic response passing through the softmax layer employed in the approach is upsampled to match the resolution of the input. Then, the responses are aggregated by average pooling. The responses obtained from the aforementioned J. Dai et al. approach are then aggregated with the J. Long et al. responses. This produces a voting effect from the different models.

The aforementioned 60-dimensional feature can introduce high-dimensional vector computation and possibly high computational complexity. In view of this, in one implementation, a principal component analysis is applied to project the features into a lower-dimensional space. The semantic feature terms can be seamlessly replaced in this manner because the cost functions do not require any explicit label information. Thus, an implicit low-dimensional feature can be used to measure the degree of semantic similarity between two pixels.
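
For illustration, the following is a minimal sketch of this PCA projection, assuming the aggregated semantic responses F have shape (T, H, W, 60) and that the number of retained components (here 8) is a free choice rather than a value given in the text.

```python
import numpy as np
from sklearn.decomposition import PCA

def project_semantic_features(F, n_components=8):
    T, H, W, D = F.shape
    flat = F.reshape(-1, D).astype(np.float64)
    flat /= np.maximum(flat.sum(axis=1, keepdims=True), 1e-12)   # normalize each vector to sum to one
    pca = PCA(n_components=n_components).fit(flat)
    return pca.transform(flat).reshape(T, H, W, n_components)
```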

3.0 Output Video

The system 100 attempts to maintain spatiotemporal consistency in the output video 114 (e.g., a loop can avoid undesirable spatial seams or temporal pops that can occur when content of the output video 114 is not locally consistent with the input video 102). Due to stabilization of the input video 102, the output video 114 can be formed by the viewer component 112 retrieving, for each pixel of the output video 114, content associated with the same pixel in the input video 102. The content retrieved from the input video 102 and included in the output video 114 by the viewer component 112 can be either static or looping. More particularly, the content can be represented as a temporal interval [sx, sx+px) from the input video 102, where sx is a per-pixel start time of a loop for a pixel x and px is a per-pixel loop period for the pixel x. The per-pixel start time sx and the per-pixel loop period px can be expressed in units of frames. A static pixel thus corresponds to the case px=1.

Turning to FIG. 2, illustrated is an exemplary input video 200, V(x,t), and a corresponding exemplary output video 202, L(x,t). Input time intervals can be determined for each pixel in the input video 200 (e.g., by the cinemagraph generator 108 of FIG. 1). As shown, pixels included in a spatial region 204 of the input video 200 each have a per-pixel start time of sx and a per-pixel loop period of px. Further, as depicted, pixels included in a spatial region 206 of the input video 200 and pixels included in a spatial region 208 of the input video 200 each have a common per-pixel loop period; however, a per-pixel start time of the pixels included in the spatial region 206 of the input video 200 differs from a per-pixel start time of the pixels included in the spatial region 208 of the input video 200. Moreover, pixels included in a spatial region 210 of the input video 200 are static (e.g., a unity per-pixel loop period).

Values from the respective input time intervals for the pixels from the input video 200 can be time-mapped to the output video 202. For example, the input time interval from the input video 200 for the pixels included in the spatial region 206 can be looped in the output video 202 for the pixels included in the spatial region 206. Also, as depicted, static values for the pixels included in the spatial region 210 from the specified time of the input video 200 can be maintained for the pixels included in the spatial region 210 over a time range of the output video 202.

The time-mapping function utilized to map the input time intervals from the input video 200 to the output video 202 can preserve phase differences between differing spatial regions, which can assist maintaining spatial consistency across adjacent pixels in differing spatial regions with a common per-pixel loop period and differing per-pixel start times. Thus, an offset between the input time interval for the pixels in the spatial region 206 and the input time interval for the pixels in the spatial region 208 from the input video 200 can be maintained in the output video 202 to provide synchronization.

Again, reference is made to FIG. 1. The viewer component 112 can time-map a respective input time interval for a particular pixel in the input video 102 to the output video 114 utilizing a modulo-based time-mapping function. An output of the modulo-based time-mapping function for the particular pixel can be based on the per-pixel loop period and the per-pixel start time of the loop at the particular pixel from the input video 102. Accordingly, a relation between the input video 102 and the output video 114 can be defined as:


L(x,t)=V(x,φ(x,t)),t≧0.

In the foregoing, φ(x,t) is the time-mapping function set forth as follows:


φ(x,t)=sx+((t−sx) mod px).

Due to the above modulo arithmetic of the time-mapping function, if two adjacent pixels are looping with the same period in the input video 102, then the viewer component 112 can cause such adjacent pixels to be in-phase in the output video 114 (e.g., in an output loop).
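
A minimal sketch of this modulo-based time-mapping, assuming the per-pixel start times and loop periods are stored as arrays as in the earlier sketch, is as follows; the array layout of the input video is an assumption for the example.

```python
# Minimal sketch (illustration only) of the modulo-based time-mapping
# phi(x,t) = s_x + ((t - s_x) mod p_x) and of forming L(x,t) = V(x, phi(x,t)).
# The input video is assumed to be an array of shape (T, H, W, 3), and
# start_time and period are per-pixel arrays as in the earlier sketch.
import numpy as np

def phi(t, start_time, period):
    # Per-pixel input frame index for output time t (vectorized over pixels).
    return start_time + np.mod(t - start_time, period)

def output_frame(V, t, start_time, period):
    _, H, W, _ = V.shape
    idx = phi(t, start_time, period)             # (H, W) input frame indices
    rows, cols = np.indices((H, W))
    return V[idx, rows, cols]                    # per-pixel content retrieval

# Adjacent pixels with equal periods stay in phase, because phi depends on t
# only through (t - s_x) mod p_x, which wraps both pixels consistently.
```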

FIG. 3 illustrates an exemplary time-mapping from an input video 300 to an output video 302. In the depicted example of FIG. 3, pixel x and pixel z are spatially adjacent. The pixels x and z have the same per-pixel loop period, and thus, px=pz. Further, a start time for the pixel x, sx, differs from a start time for the pixel z, sz.

Content from the input time interval [sx, sx+px) of the input video 300 can be retrieved for the pixel x and content from the input time interval [sz, sz+pz) of the input video 300 can be retrieved for the pixel z. Although the start times sx and sz differ, the input time intervals can have significant overlap, as illustrated by the arrows between the input time intervals in the input video 300. Since the adjacent pixels x and z have the same loop period and similar start times, the in-phase time-mapping function set forth above can automatically preserve spatiotemporal consistency over a significant portion of the output timeline shown in FIG. 3 for the output video 302 (represented by the arrows). The time-mapping function can wrap the respective input time intervals for the pixel x and the pixel z in the output timeline to maintain adjacency within the temporal overlap, and thus, can automatically preserve spatial consistency.

Solving for start times can encourage phase coherence to be maintained between adjacent pixels. Moreover, loops within the input video 102 can have regions that loop in-phase with a common optimized period, but with staggered per-pixel start times for differing regions. In contrast to determining start times for pixels, some conventional approaches solve for time offsets between output and input videos.

While many of the examples set forth herein pertain to time-mapping where loops from an input video move forward in time in an output video, other types of time-mappings are intended to fall within the scope of the hereto appended claims. For instance, time-mappings such as mirror loops, reverse loops, or reverse mirror loops can be employed, and thus, optimization for such other types of time-mappings can be performed.

4.0 Rendering

Turning to FIG. 4, illustrated is a system 400 that controls rendering of the output video 114. The system 400 includes the viewer component 410, which can obtain a source video 402 and parameters 404. The source video 402, for example, can be the input video 102 of FIG. 1. The parameters 404 can include, for each pixel, the per-pixel start time sx, the per-pixel loop period px, and the static time sx′.

The viewer component 410 can include a formation component 406 and a render component 408. The formation component 406 can create the output video 412 based upon the source video 402 and the parameters 404. The parameters 404 can encode a respective input time interval within a time range of the source video 402 for each pixel in the source video 402. Moreover, a respective input time interval for a particular pixel can include a per-pixel loop period of a loop at the particular pixel within the time range from the source video 402. The respective input time interval for the particular pixel can also include a per-pixel start time of the loop at the particular pixel within the time range from the source video 402. Further, the render component 408 can render the output video 412 on a display screen of a device.

Various other exemplary aspects generally related to the claimed subject matter are described below. It is to be appreciated, however, that the claimed subject matter is not limited to the following examples.

5.0 Methodology and Exemplary Processes

FIGS. 5-9 illustrate exemplary methodologies for generating looping video. While the methodologies are shown and described as being a series of acts that are performed in a sequence, it is to be understood and appreciated that the methodologies are not limited by the order of the sequence. For example, some acts can occur in a different order than what is described herein. In addition, an act can occur concurrently with another act. Further, in some instances, not all acts may be required to implement a methodology described herein.

Moreover, the acts described herein may be computer-executable instructions that can be implemented by one or more processors and/or stored on a computer-readable medium or media. The computer-executable instructions can include a routine, a sub-routine, programs, a thread of execution, and/or the like. Still further, results of acts of the methodologies can be stored in a computer-readable medium, displayed on a display device, and/or the like.

FIG. 5 illustrates an exemplary computer-implemented methodology 500 for generating a video loop and/or cinemagraph according to the technologies described herein. An input video is received, wherein the input video comprises a sequence of frames each frame of which comprises pixels that may be labeled with semantic object labels, as shown in block 502. As shown in block 504, if the frames of the input video are not previously semantically segmented, the frames are semantically segmented to identify regions in the frames that correspond to semantic objects with semantic object labels as previously discussed above. Semantic objects are selected as candidate objects to animate, as shown in block 506. (FIG. 6 and the associated discussion below provide an example of how these candidate objects can be selected.) Candidate loops of the selected candidate objects are generated, as shown in block 508. (FIG. 7 and the associated discussion below provide an exemplary illustration of how these candidate loops can be generated). Once the candidate loops are generated, one or more of the candidate loops are selected to create the cinemagraph, as shown in block 510. Selection of the candidate loops can be by a user or can be automatically performed, for example, by using a predictive model. A cinemagraph is created from the selected candidate loops using the input time intervals computed for the pixels of the frames of the input video, wherein the cinemagraph exhibits regions that appear static to a viewer and regions comprising dynamic video loops of the selected candidate objects that appear to the viewer to be changing over time, as shown in block 512.

FIG. 6 is an exemplary computer-implemented process 600 that depicts how objects to animate in a cinemagraph are selected from an input video or burst of images. As shown in block 602, a set of semantic objects identified in the input video is received in the form of per-pixel semantic labels of all of the pixels in the input video. The best label for each pixel and each frame in the input video is found independently, as shown in block 604. Given the best label map per frame, histograms are constructed by sum-pooling the label maps across all of the frames and pixels to create a global histogram of the labels of the input video, as shown in block 606. To this end, histograms are created for each pixel with respect to all types of object labels in the video, and these histograms are aggregated across all frames to create a global histogram of the input video clip. As shown in block 608, given the set that includes all possible object labels, object labels that represent inherently static objects are discarded. Furthermore, as shown in block 610, object labels that have low dynamicity (in terms of intensity variation across time or motion) are also discarded. (The dynamicity of an object label is measured by summing intensity variations across time or magnitudes of optical flow motion over the pixels involved in the label.) Additionally, as shown in block 612, object labels whose connected components are too small on average are also discarded. The top K object labels, namely those that have the K largest values of the global histogram over the remaining (not discarded) label set, are selected as the candidate objects that will appear as moving objects in a candidate video loop, as shown in block 614. An object candidate can be a single object type (e.g., face, tree, car, etc.) or a combination of possible candidate types.
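
By way of illustration only, the following minimal sketch shows one way the above selection might be realized, under simplifying assumptions: the per-frame best label maps, the optical-flow magnitudes, the set of inherently static labels, and the thresholds are all placeholders rather than values from the tested implementation.

```python
# Minimal sketch (illustration only) of the candidate-object selection of FIG. 6.
# Assumptions: labels is a (T, H, W) array of best per-pixel label ids, flow_mag
# is a (T, H, W) array of optical-flow magnitudes, static_labels is a hand-picked
# set of inherently static label ids, and the thresholds are placeholders.
import numpy as np
from scipy.ndimage import label as connected_components

def select_candidate_objects(labels, flow_mag, static_labels, num_labels,
                             K=2, min_dynamicity=1.0, min_component_size=50):
    # Global histogram: sum-pool label occurrences over all frames and pixels.
    hist = np.bincount(labels.ravel(), minlength=num_labels).astype(float)

    keep = []
    for lab in range(num_labels):
        if hist[lab] == 0 or lab in static_labels:
            continue                                     # inherently static objects
        mask = (labels == lab)
        dynamicity = flow_mag[mask].mean()               # mean motion over the label's pixels
        if dynamicity < min_dynamicity:
            continue                                     # too little motion
        sizes = []                                       # average connected-component size
        for t in range(labels.shape[0]):
            _, n = connected_components(mask[t])
            if n > 0:
                sizes.append(mask[t].sum() / n)
        if not sizes or np.mean(sizes) < min_component_size:
            continue                                     # components too small on average
        keep.append(lab)

    # Top K remaining labels by global histogram value.
    return sorted(keep, key=lambda lab: hist[lab], reverse=True)[:K]
```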

FIG. 7 is an exemplary computer-implemented process 700 that depicts how candidate video loops are generated from the selected objects to create a looped video or cinemagraph. In some implementations, candidate loops are created by generating rough explicit masks of each specific candidate object, as shown in block 702. A feature set is constructed for each mask region associated with a candidate object, as shown in block 704. The features in the feature set are clustered, and the centroids of the clusters form the feature basis that represents the object label for the masked region, as shown in block 706. When this feature basis is used to calculate object candidate looping start times and looping periods, pixels in the masked region that have a similar feature basis and a similar representative feature are encouraged to be dynamic (e.g., a penalty is added for these pixels when applying previously discussed equation (5) so that they are assigned non-static looping periods), as shown in block 708. Candidate loops are then created using the objects associated with the dynamic regions, as shown in block 710.
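
A minimal sketch of deriving such a feature basis for one candidate object's mask region, and of turning it into a per-pixel value that encourages those pixels to be dynamic, follows; the feature contents, the number of clusters, and the penalty form are assumptions for illustration.

```python
# Minimal sketch (illustration only) of blocks 704-708: cluster the per-pixel
# features inside one candidate object's mask, keep the cluster centroids as
# the feature basis, and derive a per-pixel value that encourages pixels close
# to that basis to be dynamic. Feature contents, cluster count, and the penalty
# form are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def feature_basis_for_mask(features, mask, n_clusters=5):
    """features: (H, W, D) per-pixel features; mask: (H, W) boolean mask of the
    candidate object. Returns (n_clusters, D) centroids representing the region."""
    samples = features[mask]                                   # (N, D) masked features
    return KMeans(n_clusters=n_clusters, n_init=10).fit(samples).cluster_centers_

def dynamic_encouragement(features, basis, sigma=1.0):
    # Higher values for pixels whose feature is close to the basis; such a map
    # can be added as a penalty against assigning those pixels a static period.
    dists = np.linalg.norm(features[..., None, :] - basis, axis=-1)   # (H, W, n_clusters)
    return np.exp(-dists.min(axis=-1) ** 2 / (2.0 * sigma ** 2))      # (H, W) in [0, 1]
```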

With reference to FIG. 8, illustrated is a methodology 800 for displaying an output video on a display screen of a device. At 802, the input video is received, and the output video is created based upon values from the input video (804). Finally, at 806, the output video can be rendered on the display screen of the device. For example, the output video can be created in real-time per frame as needed for rendering. Thus, using a time-mapping function as described herein, values at each pixel can be retrieved from respective input time intervals in the input video.

FIG. 9 is an exemplary flow diagram that depicts how, in one computer-implemented implementation 900, a trained model is used to evaluate the attractiveness of the animation of objects to a user. This trained model can be used to select the most (subjectively) attractive candidate loops with which to create a looped video or cinemagraph. To this end, a predictive model is trained to determine the attractiveness of looping certain types of objects in a scene. This model is then used to automatically determine the most attractive objects in an input video to animate in a cinemagraph.

In order to create the model, as shown in block 902, a training set of looped training image sequences is received. A plurality of features from each looped training image sequence in the set of looped training image sequences is extracted, as shown in block 904. A human subjective quality rating is received for each of the looped training image sequences, as shown in block 906. A predictive model is then generated from a combination of the features extracted from each looped training image sequence and the corresponding human subjective quality ratings of the looped training image sequences, as shown in block 908. Such a predictive model can be learned through direct regression (e.g., support vector regression or random forests), by prediction through clustering, or by collaborative filtering. These methods are well known in the literature. Furthermore, the features can include, for example, features relevant to the face, sharpness, motion features, trajectory, loopability, color/texture features (magnitude variations, smoothness, shape), object instance type, and semantic features, among others.
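
By way of illustration only, the following minimal sketch shows training such a predictive model by direct regression with random forests (one of the options named above); the feature matrix, rating scale, and model settings are placeholders.

```python
# Minimal sketch (illustration only) of blocks 902-908 using direct regression
# with a random forest (one of the options named above). loop_features is an
# assumed (N, D) feature matrix for N looped training sequences and ratings is
# a length-N vector of human subjective quality ratings; all values are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_attractiveness_model(loop_features, ratings):
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(loop_features, ratings)          # regress ratings from loop features
    return model

# Usage with placeholder data: 100 training loops, 32 features each, ratings in [0, 5].
rng = np.random.default_rng(0)
model = train_attractiveness_model(rng.random((100, 32)), rng.random(100) * 5)
```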

Once created, the predictive model can then be used to rank a set of candidate loops for generating a looped image sequence such as, for example, a cinemagraph. In order to use the predictive model to rank the set of looping candidates, a plurality of looping candidates for generating a looped image sequence, such as a cinemagraph, is received, as shown in block 910. The predictive model is applied to the candidate loops to generate a quality score for each candidate loop, as shown in block 912. The quality score defines a subjective quality of the corresponding candidate loop. The candidate loops are then ranked based on the quality scores, as shown in block 914. A looping image sequence (e.g., a cinemagraph) is then generated using the highest ranked candidate loop or loops, as shown in block 916. In some implementations, the quality score of a looping candidate must exceed a predetermined threshold in order to be ranked.
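
A minimal sketch of scoring, thresholding, and ranking candidate loops with such a model follows; the threshold value, the placeholder model, and the placeholder features are assumptions for illustration only.

```python
# Minimal sketch (illustration only) of blocks 910-916: score candidate loops
# with a trained model, keep those above a threshold, and rank them. The model,
# feature matrix, and threshold below are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rank_candidate_loops(model, candidate_features, threshold=2.5):
    """candidate_features: (M, D) features for M candidate loops. Returns the
    indices of eligible candidates ordered from highest to lowest score."""
    scores = model.predict(candidate_features)       # quality score per candidate loop
    eligible = np.flatnonzero(scores >= threshold)   # must exceed the threshold to be ranked
    return eligible[np.argsort(scores[eligible])[::-1]], scores

# Placeholder model and candidates, only to make the sketch runnable end to end.
rng = np.random.default_rng(1)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(
    rng.random((100, 32)), rng.random(100) * 5)
ranked, scores = rank_candidate_loops(model, rng.random((10, 32)))
selected = ranked[:1]   # highest ranked candidate loop(s) used to build the cinemagraph
```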

6.0 Computing Environment

Referring now to FIG. 10, the systems and methodologies disclosed herein for generating a cinemagraph that includes one or more video loops are operational within numerous types of general purpose or special purpose computing system environments or configurations. FIG. 10 illustrates a simplified example of a general-purpose computer system on which various implementations and elements for generating a cinemagraph, as described herein, may be implemented. It is noted that any boxes that are represented by broken or dashed lines in the simplified computing device 1010 shown in FIG. 10 represent alternate implementations of the simplified computing device. As described below, any or all of these alternate implementations may be used in combination with other alternate implementations that are described throughout this document. The simplified computing device 1010 is typically found in devices having at least some minimum computational capability such as personal computers (PCs), server computers, handheld computing devices, laptop or mobile computers, communications devices such as cell phones and personal digital assistants (PDAs), multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and audio or video media players.

To allow a device to realize the cinemagraph generating implementations described herein, the device should have a sufficient computational capability and system memory to enable basic computational operations. In particular, the computational capability of the simplified computing device 1010 shown in FIG. 10 is generally illustrated by one or more processing unit(s) 1012, and may also include one or more graphics processing units (GPUs) 1014, either or both in communication with system memory 1016. Note that the processing unit(s) 1012 of the simplified computing device 1010 may be specialized microprocessors (such as a digital signal processor (DSP), a very long instruction word (VLIW) processor, a field-programmable gate array (FPGA), or other micro-controller) or can be conventional central processing units (CPUs) having one or more processing cores.

In addition, the simplified computing device 1010 may also include other components, such as, for example, a communications interface 1018. The simplified computing device 1010 may also include one or more conventional computer input devices 1020 (e.g., touchscreens, touch-sensitive surfaces, pointing devices, keyboards, audio input devices, voice or speech-based input and control devices, video input devices, haptic input devices, devices for receiving wired or wireless data transmissions, and the like) or any combination of such devices.

Similarly, various interactions with the simplified computing device 1010 and with any other component or feature of the cinemagraph generating implementations described herein, including input, output, control, feedback, and response to one or more users or other devices or systems associated with the cinemagraph generating implementations, are enabled by a variety of Natural User Interface (NUI) scenarios. The NUI techniques and scenarios enabled by the cinemagraph generating implementations include, but are not limited to, interface technologies that allow one or more users to interact with the cinemagraph generating implementations in a “natural” manner, free from artificial constraints imposed by input devices such as mice, keyboards, remote controls, and the like.

Such NUI implementations are enabled by the use of various techniques including, but not limited to, using NUI information derived from user speech or vocalizations captured via microphones or other sensors (e.g., speech and/or voice recognition). Such NUI implementations are also enabled by the use of various techniques including, but not limited to, information derived from a user's facial expressions and from the positions, motions, or orientations of a user's hands, fingers, wrists, arms, legs, body, head, eyes, and the like, where such information may be captured using various types of 2D or depth imaging devices such as stereoscopic or time-of-flight camera systems, infrared camera systems, RGB (red, green and blue) camera systems, and the like, or any combination of such devices. Further examples of such NUI implementations include, but are not limited to, NUI information derived from touch and stylus recognition, gesture recognition (both onscreen and adjacent to the screen or display surface), air or contact-based gestures, user touch (on various surfaces, objects or other users), hover-based inputs or actions, and the like. Such NUI implementations may also include, but are not limited to, the use of various predictive machine intelligence processes that evaluate current or past user behaviors, inputs, actions, etc., either alone or in combination with other NUI information, to predict information such as user intentions, desires, and/or goals. Regardless of the type or source of the NUI-based information, such information may then be used to initiate, terminate, or otherwise control or interact with one or more inputs, outputs, actions, or functional features of the cinemagraph generating implementations described herein.

However, it should be understood that the aforementioned exemplary NUI scenarios may be further augmented by combining the use of artificial constraints or additional signals with any combination of NUI inputs. Such artificial constraints or additional signals may be imposed or generated by input devices such as mice, keyboards, and remote controls, or by a variety of remote or user-worn devices such as accelerometers, electromyography (EMG) sensors for receiving myoelectric signals representative of electrical signals generated by a user's muscles, heart-rate monitors, galvanic skin conduction sensors for measuring user perspiration, wearable or remote biosensors for measuring or otherwise sensing user brain activity or electric fields, wearable or remote biosensors for measuring user body temperature changes or differentials, and the like. Any such information derived from these types of artificial constraints or additional signals may be combined with any one or more NUI inputs to initiate, terminate, or otherwise control or interact with one or more inputs, outputs, actions, or functional features of the cinemagraph generating implementations described herein.

The simplified computing device 1010 may also include other optional components such as one or more conventional computer output devices 1022 (e.g., display device(s) 1024, audio output devices, video output devices, devices for transmitting wired or wireless data transmissions, and the like). Note that typical communications interfaces 1018, input devices 1020, output devices 1022, and storage devices 1026 for general-purpose computers are well known to those skilled in the art, and will not be described in detail herein.

The simplified computing device 1010 shown in FIG. 10 may also include a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 1010 via storage devices 1026, and can include both volatile and nonvolatile media that is either removable 1028 and/or non-removable 1030, for storage of information such as computer-readable or computer-executable instructions, data structures, programs, sub-programs, or other data. Computer-readable media includes computer storage media and communication media. Computer storage media refers to tangible computer-readable or machine-readable media or storage devices such as digital versatile disks (DVDs), blu-ray discs (BD), compact discs (CDs), floppy disks, tape drives, hard drives, optical drives, solid state memory devices, random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), CD-ROM or other optical disk storage, smart cards, flash memory (e.g., card, stick, and key drive), magnetic cassettes, magnetic tapes, magnetic disk storage, magnetic strips, or other magnetic storage devices. Further, a propagated signal is not included within the scope of computer-readable storage media.

Retention of information such as computer-readable or computer-executable instructions, data structures, programs, sub-programs, and the like, can also be accomplished by using any of a variety of the aforementioned communication media (as opposed to computer storage media) to encode one or more modulated data signals or carrier waves, or other transport mechanisms or communications protocols, and can include any wired or wireless information delivery mechanism. Note that the terms “modulated data signal” or “carrier wave” generally refer to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. For example, communication media can include wired media such as a wired network or direct-wired connection carrying one or more modulated data signals, and wireless media such as acoustic, radio frequency (RF), infrared, laser, and other wireless media for transmitting and/or receiving one or more modulated data signals or carrier waves.

Furthermore, software, programs, sub-programs, and/or computer program products embodying some or all of the various cinemagraph generating implementations described herein, or portions thereof, may be stored, received, transmitted, or read from any desired combination of computer-readable or machine-readable media or storage devices and communication media in the form of computer-executable instructions or other data structures. Additionally, the claimed subject matter may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computer to implement the disclosed subject matter. The term “article of manufacture” as used herein is intended to encompass a computer program accessible from any computer-readable device, or media.

The cinemagraph generating implementations described herein may be further described in the general context of computer-executable instructions, such as programs and sub-programs, being executed by a computing device. Generally, sub-programs include routines, programs, objects, components, data structures, and the like, that perform particular tasks or implement particular abstract data types. The cinemagraph generating implementations may also be practiced in distributed computing environments where tasks are performed by one or more remote processing devices, or within a cloud of one or more devices, that are linked through one or more communications networks. In a distributed computing environment, sub-programs may be located in both local and remote computer storage media including media storage devices. Additionally, the aforementioned instructions may be implemented, in part or in whole, as hardware logic circuits, which may or may not include a processor.

Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include FPGAs, application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip systems (SOCs), complex programmable logic devices (CPLDs), and so on.

As used herein, the terms “component” and “system” are intended to encompass computer-readable data storage that is configured with computer-executable instructions that cause certain functionality to be performed when executed by a processor. The computer-executable instructions may include a routine, a function, or the like. It is also to be understood that a component or system may be localized on a single device or distributed across several devices.

Further, as used herein, the term “exemplary” is intended to mean “serving as an illustration or example of something.”

What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable modification and alteration of the above devices or methodologies for purposes of describing the aforementioned aspects, but one of ordinary skill in the art can recognize that many further modifications and permutations of various aspects are possible. Accordingly, the described aspects are intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims. Furthermore, to the extent that the term “includes” is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term “comprising” as “comprising” is interpreted when employed as a transitional word in a claim.

While cinemagraph generation has been described by specific reference to implementations thereof, it is understood that variations and modifications thereof can be made without departing from its true spirit and scope. It is noted that any or all of the implementations that are described in the present document and any or all of the implementations that are illustrated in the accompanying drawings may be used and thus claimed in any combination desired to form additional hybrid implementations. In addition, although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

What has been described above includes example implementations. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the claimed subject matter, but one of ordinary skill in the art may recognize that many further combinations and permutations are possible. Accordingly, the claimed subject matter is intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims.

In regard to the various functions performed by the above described components, devices, circuits, systems and the like, the terms (including a reference to a “means”) used to describe such components are intended to correspond, unless otherwise indicated, to any component which performs the specified function of the described component (e.g., a functional equivalent), even though not structurally equivalent to the disclosed structure, which performs the function in the herein illustrated exemplary aspects of the claimed subject matter. In this regard, it will also be recognized that the foregoing implementations include a system as well as a computer-readable storage media having computer-executable instructions for performing the acts and/or events of the various methods of the claimed subject matter.

There are multiple ways of realizing the foregoing implementations (such as an appropriate application programming interface (API), tool kit, driver code, operating system, control, standalone or downloadable software object, or the like), which enable applications and services to use the implementations described herein. The claimed subject matter contemplates this use from the standpoint of an API (or other software object), as well as from the standpoint of a software or hardware object that operates according to the implementations set forth herein. Thus, various implementations described herein may have aspects that are wholly in hardware, or partly in hardware and partly in software, or wholly in software.

The aforementioned systems have been described with respect to interaction between several components. It will be appreciated that such systems and components can include those components or specified sub-components, some of the specified components or sub-components, and/or additional components, and according to various permutations and combinations of the foregoing. Sub-components can also be implemented as components communicatively coupled to other components rather than included within parent components (e.g., hierarchical components).

Additionally, it is noted that one or more components may be combined into a single component providing aggregate functionality or divided into several separate sub-components, and any one or more middle layers, such as a management layer, may be provided to communicatively couple to such sub-components in order to provide integrated functionality. Any components described herein may also interact with one or more other components not specifically described herein but generally known by those of skill in the art.

Claims

1. A system for generating a cinemagraph of one or more video loops, comprising:

a cinemagraph generator comprising one or more computing devices, said computing devices being in communication with each other via a computer network whenever there is a plurality of computing devices, and a computer program having a plurality of sub-programs executable by said computing devices, wherein the sub-programs configure said computing devices for, receiving an input video, wherein the input video comprises a sequence of frames each frame of which comprises pixels, if the input video is not semantically segmented, semantically segmenting the frames of the input video to identify regions in the frames that correspond to semantic objects, selecting semantic objects as candidate objects to animate; generating candidate loops of the selected candidate objects; selecting one or more of the candidate loops to be used to create the cinemagraph; and creating a cinemagraph from the selected candidate loops using the input time intervals computed for the pixels of the frames of the input video, wherein the cinemagraph exhibits regions that appear static to a viewer and regions comprising dynamic video loops of the selected candidate objects that appear to the viewer to be changing over time.

2. The system of claim 1, wherein the sub-program to select one or more of the candidate objects to be used to create the cinemagraph selects the candidate objects by:

receiving, for a set of semantic objects identified in the input video, per-pixel semantic labels of all of the pixels in the input video;
constructing histograms by sum-pooling the label maps across all of the frames and pixels to generate a global object label histogram;
discarding object labels that correspond to inherently static objects;
discarding object labels that have low dynamicity in terms of intensity variation across time or motion;
discarding object labels for which connected components are too small on average; and
selecting the top K object labels in terms of high frequency of the global object label histogram as the candidate objects that will appear as moving objects in each candidate video-loop.

3. The system of claim 2 wherein a candidate object can be a single object type or a combination of possible candidate object types.

4. The system of claim 1 wherein the candidate loops are generated by:

generating rough explicit masks of the region of each specific candidate object;
constructing a feature set for each mask region associated with a candidate object;
clustering the features in the feature set and selecting the centroids of the clusters as the feature basis that represents the object label for the masked region;
using the feature basis when calculating object candidate looping start times and looping periods, encouraging pixels that have a similar feature basis and a similar representative feature to be dynamic;
creating candidate loops using the dynamic pixels.

5. The system of claim 1 wherein the candidate loops are selected by using a predictive model that ranks the subjective attractiveness of each candidate loop.

6. The system of claim 1, further comprising computing input time intervals for the pixels of the frames of the input video, wherein:

an input time interval for a particular pixel comprises a per-pixel loop period and a per-pixel start time of a loop at the particular pixel, and
the input time interval of a pixel is based, in part, on one or more semantic factors which keep pixels associated with the same semantic object in the same video loop.

7. The system of claim 6, wherein the input time intervals computed for the pixels of the frames of the input video are used to animate the selected candidate loops.

8. A computer-implemented process for evaluating the attractiveness of a looped image sequence, comprising:

using a computing device for:
receiving a set of training candidates of looped image sequences for determining a human subjective quality rating of a looped image sequence;
automatically extracting a plurality of features from each training candidate in the set of training candidates;
receiving a human subjective quality rating for each of the training candidates;
generating a predictive model that evaluates the subjective attractiveness of a cinemagraph from a combination of the features extracted from each training candidate and the corresponding human subjective quality ratings of the training candidate.

9. The computer-implemented process of claim 8, further comprising using the predictive model to rank a set of candidate loops for generating a looped image sequence by:

receiving a set comprising a plurality of candidate loops for generating a looped image sequence;
applying the predictive model to the candidate loops of the candidate set to generate a quality score for each candidate loop, the quality score defining a subjective quality of the corresponding candidate loop; and
ranking the candidate loops based on the quality scores;
automatically generating a looping image sequence using one or more of the highest ranked candidate loops.

10. The computer-implemented process of claim 9, wherein the quality score of a candidate loop exceeds a predetermined threshold in order to be ranked.

11. The computer-implemented process of claim 8 wherein the features further comprise features relevant to face, sharpness, motion and loopability.

12. The computer-implemented process of claim 9, wherein the generated looping image sequence is used to generate one or more portions of a cinemagraph.

13. A system for generating a cinemagraph of one or more video loops, comprising:

a cinemagraph generator comprising one or more computing devices, said computing devices being in communication with each other via a computer network whenever there is a plurality of computing devices, and a computer program having a plurality of sub-programs executable by said computing devices, wherein the sub-programs configure said computing devices for, receiving semantic object labels extracted from an input video as candidate objects to animate; generating candidate loops of the selected candidate objects; selecting one or more of the candidate loops to be used to create the cinemagraph; and creating a cinemagraph from the selected candidate loops.

14. The system of claim 13 further comprising using input time intervals and looping periods computed for the pixels of the semantic objects to determine the candidate loops.

15. The system of claim 13 wherein the cinemagraph exhibits regions that appear static to a viewer and regions comprising dynamic video loops of the selected candidate objects that appear to the viewer to be changing over time.

16. The system of claim 13, wherein one or more of the candidate objects used to create the cinemagraph are selected by:

receiving for a set of semantic objects identified in the input video per-pixel semantic labels of all of the pixels in the input video;
constructing histograms by sum-pooling the label maps across all of the frames and pixels;
discarding object labels that correspond to inherently static objects;
discarding object labels that have low dynamicity in terms of intensity variation across time or motion;
discarding object labels for which connected components are too small on average; and
selecting the top K object labels in terms of high frequency of the histogram as the candidate objects that will appear as moving objects in each candidate video-loop.

17. The system of claim 13 wherein the candidate loops are selected by using a predictive model trained to assess the subjective attractiveness of looping types of objects in a cinemagraph.

18. The system of claim 13, wherein the candidate loops are generated by using a Markov Random Field operation.

19. The system of claim 13, wherein one or more of the candidate loops used to create the cinemagraph are selected by a human being.

20. The system of claim 13, wherein the one or more of the candidate loops used to create the cinemagraph are automatically selected by a predictive model.

Patent History
Publication number: 20180025749
Type: Application
Filed: Sep 14, 2016
Publication Date: Jan 25, 2018
Inventors: Tae-Hyun Oh (Daejeon), Sing Bing Kang (Redmond, WA), Neel Suresh Joshi (Seattle, WA), Baoyuan Wang (Sammamish, WA)
Application Number: 15/265,461
Classifications
International Classification: G11B 27/00 (20060101); G11B 27/034 (20060101); G06K 9/66 (20060101); G06T 13/80 (20060101); G06T 5/40 (20060101); G06K 9/62 (20060101); H04N 9/82 (20060101); G06T 7/00 (20060101);