REAL-TIME MOTION TRACKING IN MOVING SCENES

A method comprising: receiving a high frame rate video stream of a scene; continuously dividing, in real time, the video stream into consecutive sequences of n frames each; with respect to each current sequence: (i) estimating pixel motion between pairs of frames in the sequence, to calculate a current motion vector field for each pixel in the sequence, (ii) co-locating all of the pixels to current representative pixel positions associated with a desired time point in the sequence, and (iii) calculating an inter-sequence motion vector field, based on estimating motion between the current representative pixel positions and an immediately preceding sequence of the sequences; and outputting, in real time, at a rate that is lower than the high frame rate, at least one of (x) the current motion vector field, (y) the inter-sequence motion vector field, and (z) pixel values associated with the current representative pixel positions.

Description
CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of priority from U.S. Provisional Patent Application No. 62/871,941, filed Jul. 9, 2019, entitled, “SYSTEM AND METHOD FOR REAL-TIME IMAGE GENERATION IN MOVING SCENES,” the contents of which are hereby incorporated by reference in their entirety.

FIELD OF INVENTION

The invention relates to the field of computer image processing.

BACKGROUND OF THE INVENTION

Visual tracking plays a critical role in computer vision with numerous applications such as surveillance, robotics, autonomous driving, and behavior analysis. Visual tracking is still considered a challenging task due to several complicating factors under real world conditions, e.g., background clutter, illumination variation, partial occlusions and object transformation. Tremendous efforts have been focused on establishing robust appearance models to handle these difficulties. However, most existing tracking algorithms do not explicitly consider the motion blur contained in video sequences, which degrades their performance in real life.

The foregoing examples of the related art and limitations related therewith are intended to be illustrative and not exclusive. Other limitations of the related art will become apparent to those of skill in the art upon a reading of the specification and a study of the figures.

SUMMARY OF THE INVENTION

The following embodiments and aspects thereof are described and illustrated in conjunction with systems, tools and methods which are meant to be exemplary and illustrative, not limiting in scope.

In some embodiments, the present disclosure provides a system, method, and computer program product for efficient real-time computation of a motion vector field through using a high frame rate camera, together with a spatial-temporal calculation scheme and simplified optical flow algorithms. In an embodiment, the conditions set by the invention, specifically the fast frame rate and the spatial-temporal derivatives at multiple resolutions and multiple temporal spacings, allow utilizing 100% of the photons impinging on the sensor, thus providing optimal SNR conditions for the computation. In another embodiment, the conditions set by the invention, specifically optimal SNR, the fast frame rate, and the spatial-temporal derivatives at multiple resolutions and multiple temporal spacings, bring the motion detection problem to a regime where a single iteration is required for solving the aperture problem and generating the motion field vectors at a pixel resolution. Accordingly, this enables (i) computing an average image frame from multiple high-rate frames under fast motion conditions; and (ii) extrapolating to produce a stream of real-time color-motion vector fields at the output time point. By providing a color-motion vector field at given time points, the present disclosure may generate motion-based image segmentation and object detection in scenes containing fast motion.

There is provided, in an embodiment, a system comprising at least one hardware processor; and a non-transitory computer-readable storage medium having stored thereon program instructions, the program instructions executable by the at least one hardware processor to: receive a high frame rate video stream of a scene, wherein the scene comprises at least one object in motion relative to an imaging device acquiring the video stream, continuously divide, in real time, the video stream into consecutive sequences of n frames each, with respect to each current sequence: (i) estimate pixel motion between at least some pairs of frames in the sequence, to calculate a current motion vector field for each pixel in the sequence, (ii) co-locate, based on the calculated motion vector fields, all of the pixels to current representative pixel positions associated with a desired time point in the sequence, and (iii) calculate an inter-sequence motion vector field, based, at least in part, on estimating motion between the current representative pixel positions and the representative pixel positions associated with an immediately preceding sequence of the sequences, and output, in real time, at a rate that is lower than the high frame rate, at least one of (x) the current motion vector field, (y) the inter-sequence motion vector field, and (z) pixel values associated with the current representative pixel positions.

There is also provided, in an embodiment a method comprising receiving a high frame rate video stream of a scene, wherein the scene comprises at least one object in motion relative to an imaging device acquiring the video stream; continuously dividing, in real time, the video stream into consecutive sequences of n frames each; with respect to each current sequence: (i) estimating pixel motion between at least some pairs of frames in the sequence, to calculate a current motion vector field for each pixel in the sequence, (ii) co-locating, based on the calculated motion vector fields, all of the pixels to current representative pixel positions associated with a desired time point in the sequence, and (iii) calculating an inter-sequence motion vector field, based, at least in part, on estimating motion between the current representative pixel positions and the representative pixel positions associated with an immediately preceding sequence of the sequences; and outputting, in real time, at a rate that is lower than the high frame rate, at least one of (x) the current motion vector field, (y) the inter-sequence motion vector field, and (z) pixel values associated with the current representative pixel positions.

There is further provided, in an embodiment, a computer program product comprising a non-transitory computer-readable storage medium having program instructions embodied therewith, the program instructions executable by at least one hardware processor to: receive a high frame rate video stream of a scene, wherein the scene comprises at least one object in motion relative to an imaging device acquiring the video stream; continuously divide, in real time, the video stream into consecutive sequences of n frames each; with respect to each current sequence: (i) estimate pixel motion between at least some pairs of frames in the sequence, to calculate a current motion vector field for each pixel in the sequence, (ii) co-locate, based on the calculated motion vector fields, all of the pixels to current representative pixel positions associated with a desired time point in the sequence, and (iii) calculate an inter-sequence motion vector field, based, at least in part, on estimating motion between the current representative pixel positions and the representative pixel positions associated with an immediately preceding sequence of the sequences; and output, in real time, at a rate that is lower than the high frame rate, at least one of (x) the current motion vector field, (y) the inter-sequence motion vector field, and (z) pixel values associated with the current representative pixel positions.

In some embodiments, the inter-sequence motion vector field is calculated based on the current representative pixel positions associated with more than one preceding sequence of the sequences.

In some embodiments, the calculating is based, at least in part, on average acceleration in the more than one preceding sequence of the sequences.

In some embodiments, the at least one of estimating and calculating is based, at least in part, on solving multi-frame multi-level temporal-spatial smoothness constraints with respect to the inter-sequence motion vector field.

In some embodiments, at least some of the pairs of frames are adjacent pairs of frames.

In some embodiments, the estimating is performed with respect to a non-adjacent subset of the frames in the sequence.

In some embodiments, the estimating is initialized with at least one of: the estimating associated with a preceding one of the pairs in the sequence; the estimating associated with a preceding time point in the video stream; and the estimating associated with a higher hierarchical motion estimation.

In some embodiments, the estimating is performed using an optical flow algorithm.

In some embodiments, the estimating is based, at least in part, on a down-sampled resolution level.

In some embodiments, the calculating is further refined over (i) a subset of the frames in the sequence, and (ii) a subset of resolution levels, by jointly solving multi-frame multi-level temporal-spatial smoothness constraints with respect to the motion vector field.

In some embodiments, the high frame rate is between 60-10,000 frames per second (fps).

In some embodiments, the program instructions are further executable to generate, and the method comprises generating, a current representative frame with respect to the sequence, based on the co-locating, wherein the current representative frame aggregates, for each of the current representative pixel positions, pixel values from all frames in the sequence.

In some embodiments, the program instructions are further executable to select, and the method comprises selecting, a region of interest (ROI) in the frames in the current sequence, and wherein the estimating, co-locating, and calculating are performed with respect to the ROI.

In some embodiments, the program instructions are further executable to apply, and the method comprises applying, to the output one of: a key-point extraction algorithm, a feature extraction algorithm, and an object detection algorithm, and wherein the output comprises only results of the applying.

In some embodiments, the program instructions are further executable to receive, and the method comprises receiving, one of: depth information, position-motion sensor information, and ego-motion information, with respect to the scene, wherein the outputting comprises outputting the stream of at least one of the pixel positions in relation to three-dimensional (3D) world coordinates.

In some embodiments, the program instructions are further executable to perform, and the method comprises, the following steps: (i) receive two video streams of the scene, and (ii) determine depth information with respect to the scene based, at least in part, on the two video streams.

In some embodiments, the system comprises a parallel hardware computing system comprising 3D die stacked packaging.

In addition to the exemplary aspects and embodiments described above, further aspects and embodiments will become apparent by reference to the figures and by study of the following detailed description.

BRIEF DESCRIPTION OF THE DRAWINGS

Exemplary embodiments are illustrated in referenced figures. Dimensions of components and features shown in the figures are generally chosen for convenience and clarity of presentation and are not necessarily shown to scale. The figures are listed below.

FIG. 1 shows an exemplary system for automated real-time motion vector field detection, object detection and tracking in a scene, using a high frame rate imaging device, according to exemplary embodiments of the present invention;

FIGS. 2A-2B show a flowchart detailing the functional steps in a process for real-time calculating of pixel-level color motion vector fields in a continuous input stream acquired using a high frame rate imaging device, according to exemplary embodiments of the present invention;

FIGS. 3A-3B are schematic illustrations of an iterative process for real-time calculating of pixel-level color motion vector fields in a continuous input stream acquired using a high frame rate imaging device, according to exemplary embodiments of the present invention;

FIG. 4 describes an exemplary variation of a system of the present disclosure including more than one imaging device, according to exemplary embodiments of the present invention;

FIG. 5 illustrates an exemplary implementation of a system of the present disclosure using 3D Die stacking, according to exemplary embodiments of the present invention; and

FIGS. 6A-6C illustrate a highly parallel HW implementation of a system of the present disclosure, according to exemplary embodiments of the present invention.

DETAILED DESCRIPTION

Described herein are a system, method, and computer program product for automated real-time object detection and tracking in a scene, using a high frame rate imaging device.

In some embodiments, the present disclosure may be configured to output, in real time, color motion vector fields representing the scene motion per pixel, at desired or required time points in the continuous high-frame-rate input stream. In some embodiments, real time detection and tracking of a moving object in the scene may be performed, based, at least in part, on the outputted motion vector fields.

In some embodiments, the present disclosure is particularly useful for motion vector field detection, object detection and motion tracking on fast-moving objects in a scene, using a stream of frames, e.g., a video stream, acquired by a high frame rate imaging device, when objects in the scene and/or the imaging device are in quick relative motion.

In some embodiments, this real time continuous process may be enabled because the present disclosure provides for reduced computational overhead requirements, by utilizing a high-frame-rate input stream which represents relatively small frame-to-frame motion rates. Thus, motion estimation can be calculated quickly and efficiently on commonly used imaging and computing platforms, e.g., mobile devices, without the need for offline processing.

Many emerging applications require tracking targets in video. Most existing visual tracking methods do not work well when the target is motion-blurred, especially due to fast motion. The imperfectness of the target's appearance under motion blur jeopardizes image features, such as image gradients, sum-of-square-differences (SSD), and color histograms, and thus invalidates the image matching model or the measurement model in tracking. Although deblurring methods that improve the image quality have been widely investigated in the literature, these studies are often based on the assumption that the entire image is subject to the same global motion blur (e.g., when there is camera motion). However, in real applications, more complicated and challenging situations may be observed where motion blurs are only present in parts of the image. These local motion blurs can be produced by, e.g., fast movements of the targets, or insufficient lighting that reduces the shutter speed of auto-exposure cameras.

Motion blur and signal noise are primary sources of image quality degradation in digital imaging. In low light conditions, the image quality is often a tradeoff between motion blur and noise. Long exposure time is required in low illumination level in order to obtain adequate signal to noise ratio. On the other hand, the risk of motion blur due to camera-induced motion or subject motion increases as exposure time becomes longer.

Motion blur occurs when the camera or the subject moves during the exposure period. When this happens, the image of the subject moves to different areas of the camera sensor photosensitive surface during the exposure time. Thus, when the exposure time is long, camera movements or the movement of an object in the scene are likely to become visible in the image.

Digital camera noise includes multiple noise sources, e.g., noise created by photosensor components; noise generated by photosensor voltage leaks; pattern noise associated with the nonuniformity of the image sensor pixels; or the dominant noise source, photon shot noise, which is associated with the randomness of the number of photons captured by a sensor pixel during a given time interval. Due to the Poisson distribution of the shot noise, the relative noise decreases as the amount of light increases; therefore, a longer exposure time improves the SNR.

This is particularly the case in mobile device cameras, such as smartphone cameras. Because of their smaller size, pixels receive a smaller number of photons within the same exposure time. In addition, random noise caused by various sources is present in the obtained signal. Thus, the most effective way to reduce the relative amount of noise in the image (i.e., increase the SNR) is to use longer exposure times, which allow more photons to be captured by the sensor. However, in the case of long exposure times, the risk of motion blur increases, which may degrade the performance of motion tracking algorithms.

Known methods which attempt to mitigate these issues include:

    • Applying convolution kernels to increase image resolution, however these techniques provide only limited improvement, usually when the motion extent is small and global.
    • Using optical or electronic image stabilization to mitigate small-extent camera motion and vibrations. However, while useful for compensating camera shake, these techniques are ineffective when the motion extent is large or motion is not global.
    • Grabbing multiple frames and selecting frames based on sharpness criteria, which fails to deal with non-uniform motion in the scene.

However, these known methods typically cannot effectively handle low light conditions, concurrent camera motion and objects motion, and/or large frame-to-frame motion rate.

Accordingly, in some embodiments, the present disclosure provides for a process which captures a sequence of high-frame-rate, short-exposure-time image frames of a scene, e.g., using a high frame rate imaging device.

In some embodiments, the present disclosure then calculates pixel-level motion between pairs of adjacent frames within the sequence of frames.

In some embodiments, the present disclosure then calculates a motion vector field for each pixel across all frames in the sequence. In some embodiments, the present disclosure then co-locates, based on the calculated motion vector fields, all of the pixels to current representative pixel positions associated with a desired time point in the sequence. In some embodiments, the present disclosure may also provide for calculating aggregate pixel values for each pixel position across all frames in the sequence where the aggregation compensates the motion between pixels of different frames.

In some embodiments, the present disclosure may then calculate an inter-sequence motion vector field between the current representative pixel positions and the representative pixel positions associated with an immediately preceding sequence, to account for variations in speed, i.e., acceleration, and overcome noise.

In some embodiments, the present disclosure may then output in real time, at a rate that is lower than the high frame rate, the current motion vector field, the inter-sequence motion vector field, and pixel values associated with the current representative pixel positions.

In some embodiments, aggregate pixel values may then be used for generating a continuous output of color motion vector fields representing the scene motion per pixel, wherein the motion vector fields may be outputted at desired or required time points along the continuous input stream. For example, such time points may be any time point during a current sequence, e.g., the middle of a sequence, the end of a sequence, and any point in between. In some embodiments, the time point may be outside the sequence, in order to compensate for HW-SW latencies and produce a true real-time motion output, generated by extrapolating the motion fields to the required time point.
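By way of illustration only, the following sketch extrapolates per-pixel representative positions to a later output time point using the estimated motion field and an optional acceleration term. The function name, array shapes, and the use of NumPy are assumptions made for this example, not part of the disclosure:

```python
import numpy as np

def extrapolate_positions(positions, velocity, acceleration, dt):
    """Extrapolate per-pixel positions from the sequence's representative
    time point to a later output time point (e.g., to hide HW-SW latency).

    positions    : (H, W, 2) array of representative (x, y) pixel positions
    velocity     : (H, W, 2) per-pixel motion field, in pixels per frame
    acceleration : (H, W, 2) per-pixel acceleration estimate (may be zeros)
    dt           : offset to the required output time point, in frames
    """
    return positions + velocity * dt + 0.5 * acceleration * dt ** 2

# Toy usage: a 4x4 field moving 0.5 px/frame to the right, output 8 frames ahead.
h, w = 4, 4
xs, ys = np.meshgrid(np.arange(w, dtype=float), np.arange(h, dtype=float))
positions = np.stack([xs, ys], axis=-1)
velocity = np.zeros((h, w, 2)); velocity[..., 0] = 0.5
acceleration = np.zeros((h, w, 2))
print(extrapolate_positions(positions, velocity, acceleration, dt=8.0)[0, 0])  # [4. 0.]
```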

In some embodiments, the present disclosure may further be configured to output a representative image of all frames in the sequence at the desired timepoint or any other time point, in which each pixel position reflects an increased amount of captured photons, and thus higher SNR, with reduced motion blur as compared to any individual frame in the sequence. In some embodiments, aggregating or combining pixel values across multiple short-exposure frames works to increase an effective exposure time for the resulting representative image, without incurring any motion blur penalty.

In some embodiments, this process may be performed iteratively, e.g., on consecutive sequences from a continuous video stream. In some embodiments, with respect to each such sequence, the present disclosure provides for outputting a continuous output of color motion vector fields representing the scene motion per pixel at desired time points.

In some embodiments, the present disclosure further provides for outputting a corresponding stream of representative frames at a frame rate that is lower than the higher acquisition frame rate of the input stream. In some embodiments, an output stream frame rate of the present disclosure may be determined based on a desired output frame rate, e.g., in conjunction with the requirements of a downstream processing algorithm, such as an object detection and tracking algorithm.

In some embodiments, the present disclosure provides for performing this process in real time, e.g., by generating in real time a continuous output of motion vector field calculations, as well as a lower-frame-rate stream of high SNR images free of motion blur, from a received high-frame-rate image stream, e.g., a video stream.

In some embodiments, the present disclosure is based on aggregating a representative image frame exposure time from multiple high-frame-rate, short-exposure-time frames, which may provide for a greater dynamic range of illumination, because in strong light, this prevents saturation due to the limited full-well capacity of the image sensor, and enables 100% photon collection during the full imaging time, thus improving SNR and motion detection accuracy. In low-light conditions, this prevents saturation by strong light sources in the scene, and uses them as strong motion anchors while enabling 100% photon collection of darker areas in the scene.

In some embodiments, by using higher frame rate image acquisition, for a given scenario with a desired frame rate output, the present disclosure increases the ability to operate in a wide dynamic range of illumination conditions. In strong light, this prevents saturation due to the limited full-well capacity of the image sensor, and enables 100% photon collection during the full imaging time, thus improving SNR and motion detection accuracy. In low-light conditions, this prevents saturation by strong light sources in the scene, and uses them as strong motion anchors while enabling 100% photon collection of darker areas in the scene.

In some embodiments, the present disclosure may be particularly suited for implementing using commonly available technologies and devices, such as:

    • Low cost, high speed CMOS image sensors,
    • high frame rate image sensors which can operate at rates of, e.g., between 60-10,000 frames per second (fps),
    • mobile devices such as smartphones, which incorporate image sensors, memory, and suitable processing capabilities within a small unit with low power consumption,
    • efficient, highly parallel processing modules using small processing units, such as graphics processing units (GPU), tensor processing units (TPU), and artificial intelligence processing units (AIU), or other arrays of multiply accumulate (MAC) operations which can be incorporated within a smart-camera or mobile device, and/or
    • low cost, high accuracy depth sensors, based on phase detection (PD), stereo detection, and/or radar or Lidar.

A potential advantage of the present disclosure is, therefore, in that it provides for efficient real-time computation of a motion vector field in a sequence of high frame rate images or a video stream, which allows outputting a color motion vector field at desired time points in the continuous stream, to enable object detection and tracking in the input stream.

The present disclosure may be particularly useful in the context of, e.g., consumer grade cameras, e.g., in mobile devices, which typically exhibit poor results when filming scenes in motion under low light. In addition, the present disclosure may be implemented in the context of robotics, autonomous vehicles, and advanced driver assistance systems (ADAS), which may be installed on platforms moving at high speeds and thus suffer from motion blur under low light conditions. These systems typically rely on high quality imaging to perform such tasks as object detection, segmentation, and target motion estimation and tracking. Thus, motion blur at driving speed may affect object detection performance.

As used herein, the term ‘image’ refers to a two-dimensional array of pixel values. An image can be a two-dimensional subset of another image. A digital image includes one or more digital image channels, each comprising a two-dimensional array of pixels, wherein each pixel value relates to the amount of light received by an electronic image sensor corresponding to the geometrical domain of the pixel. For color imaging applications, a digital image will typically consist of red, green, and blue digital image channels; however, other configurations are also practiced. For monochrome applications, the digital image consists of one digital image channel. In some embodiments, the present disclosure can be applied to, but is not limited to, a digital image for any of the above-mentioned applications.

Although the present disclosure describes a digital image channel as a two-dimensional array of pixel values arranged by rows and columns, the present disclosure can be applied to mosaic arrays, such as Bayer array, with equal effect. Similarly, the present disclosure can be applied to color image sensors where pixel color sensors are laid one on top of the other.

In some embodiments, the present disclosure describes replacing an original pixel value with processed pixel values, to form a new digital image with the processed pixel values, however, retaining original pixel values is also contemplated.

FIG. 1 illustrates an exemplary system 100 for automated real-time generating of high SNR images of a scene, using a high frame rate imaging device, in accordance with some embodiments of the present invention.

System 100 as described herein is only an exemplary embodiment of the present invention, and in practice may have more or fewer components than shown, may combine two or more of the components, or may have a different configuration or arrangement of the components. The various components of system 100 may be implemented in hardware, software, or a combination of both hardware and software. In various embodiments, system 100 may comprise a dedicated hardware device, or may form an addition to or extension of an existing device.

In some embodiments, system 100 may comprise a processing unit 110 and memory storage device 114. In some embodiments, system 100 may store in a non-volatile memory thereof, such as storage device 114, software instructions or components configured to operate a processing unit (also "hardware processor," "CPU," or simply "processor"), such as processing unit 110. In some embodiments, the software components may include an operating system, including various software components and/or drivers for controlling and managing general system tasks (e.g., memory management, storage device control, power management, etc.) and facilitating communication between various hardware and software components. In some embodiments, system 100 may comprise one or more graphic processing units (GPUs). In some embodiments, processing unit 110 comprises, e.g., a GPU, a TPU, an AIU, or other arrays of multiply accumulate (MAC) operations, which can be incorporated within any desktop, smart camera, or mobile computing device.

In some embodiments, system 100 may further comprise one or more of, e.g., an IMU sensor used for compensating self-vibrations; and a 3D sensor or depth sensor that produces dense or sparse 3D information.

The software instructions and/or components operating processing unit 110 may include instructions for receiving and analyzing multiple frames captured by a suitable imaging device. For example, processing unit 110 may comprise image processing module 111 and convolutional network module 112. Image processing module 111 receives, e.g., a video stream and applies one or more image processing algorithms thereto. In some embodiments, image processing module 111 comprises one or more algorithms configured to perform motion estimation, object detection, classification, and/or any other similar operation, using any suitable image processing algorithm, technique, and/or feature extraction process.

The incoming image stream may come from various imaging devices. The image stream received by the image processing module 111 may vary in resolution, frame rate (e.g., between 60 and 10,000 fps), format, and protocol according to the characteristics and purpose of their respective source device. Depending on the embodiment, the image processing module 111 can route video streams through various processing functions, or to an output circuit that sends the processed video stream for presentation, e.g., on a display, to a recording system, across a network, or to another logical destination. The image processing module 111 may perform image stream processing algorithms alone or in combination. Image processing module 111 may also facilitate logging or recording operations with respect to an image stream.

Convolutional network module 112 may comprise a network of convolutional layers that perform motion detection or compensation through derivative convolution kernels, down- and up-sampling kernels, or low-pass kernels, or spatial shift kernels, within a single frame or between two or more frames in a sequence.
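For illustration only, the sketch below shows the kinds of kernels such a convolutional layer might apply (derivative, low-pass, and spatial-shift kernels). The specific kernel values, frame sizes, and the use of OpenCV's filter2D are assumptions made for this example rather than the disclosure's implementation:

```python
import numpy as np
import cv2

# Example derivative, low-pass, and shift kernels (illustrative choices).
kx = np.array([[-0.5, 0.0, 0.5]], np.float32)            # horizontal derivative kernel
ky = kx.T                                                  # vertical derivative kernel
low_pass = np.ones((3, 3), np.float32) / 9.0               # box low-pass kernel
shift = np.zeros((1, 3), np.float32); shift[0, 0] = 1.0    # one-pixel spatial shift kernel

frame_a = np.random.rand(64, 64).astype(np.float32)
frame_b = np.roll(frame_a, 1, axis=1)                      # frame_a moved 1 px to the right

Ix = cv2.filter2D(frame_a, -1, kx)                         # spatial x-derivative
Iy = cv2.filter2D(frame_a, -1, ky)                         # spatial y-derivative
It = frame_b - frame_a                                     # temporal derivative between frames
smoothed = cv2.filter2D(frame_a, -1, low_pass)             # low-pass filtered frame
shifted = cv2.filter2D(frame_a, -1, shift)                 # frame shifted one pixel to the right
```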

Convolutional network module 112 may comprise a convolutional network (i.e., which includes one or more convolutional neural network layers), and can be implemented to embody any appropriate convolutional neural network architecture, e.g., U-Net, Mask R-CNN, DeepLab, and the like. In a particular example, convolutional network module 112 may include an input layer followed by a sequence of shared convolutional neural network layers. The output of the final shared convolutional neural network layer may be provided to a sequence of one or more additional neural network layers that are configured to generate the object detection data. However, other appropriate neural network processes may also be used. The output of the final shared convolutional neural network layers may be provided to a different sequence of one or more additional neural network layers.

In some embodiments, the convolutional network operates on the aggregate color-motion field. In some embodiments, the convolutional network operates on each frame before aggregation. In some embodiments, the object detection is produced by a convolutional neural network that operates on the output of a sequence. In some embodiments, the object detection is produced by a convolutional neural network such that each iteration receives the output of an earlier sequence as an input. In some embodiments, the motion field is produced by a convolutional neural network that processes each frame. In some embodiments, the motion field is produced by a convolutional neural network that processes each frame and a subset of earlier frames. In some embodiments, the motion field is produced by a convolutional neural network such that each iteration receives the output of an earlier frame as an input. In some embodiments, the motion field is produced by a convolutional neural network such that each iteration receives the output of an earlier sequence as an input. In some embodiments, the convolutional neural network has an architecture of a recurrent neural network, a Long Short-Term Memory (LSTM) network, or the like.

In some embodiments, system 100 may also be configured to employ suitable algorithms to estimate motion between image frames, i.e., determine a motion vector field that describes the transformation from every point in one frame to points in another frame (usually, between adjacent frames in a sequence or between all frames and one representative frame). Motion estimation may be defined as the process of finding corresponding points between two images (e.g., frames), wherein the points that correspond to each other in two views of a scene or object may be considered to be the same point in that scene or on that object. In some embodiments, the present disclosure may apply a high density optical flow algorithm based on inter- and intra-frame derivatives plus a global constraint to estimate motion between frames. See, e.g., B. K. P. Horn and B. G. Schunck, "Determining optical flow." Artificial Intelligence, vol 17, pp 185-203, 1981. In some embodiments, the components and conditions set forth by the present disclosure enable solving the global constraint in a single or a few iterations, thus enabling efficient computation of the dense optical flow in real time.

In some embodiments, the present disclosure may apply optical flow and/or another and/or similar computer vision technique or algorithm to estimate motion between frames. See, e.g., Farneback G. (2003), Two-Frame Motion Estimation Based on Polynomial Expansion. In: Bigun J., Gustavsson T. (eds) Image Analysis. SCIA 2003. Lecture Notes in Computer Science, vol 2749. Springer, Berlin, Heidelberg.

For consecutive image sequences such as found in video presentations, optical flow may be defined as the velocity field which warps one image into another image (usually representing minute positional changes). In some embodiments, an optical flow estimate comprises an estimate of a translation that describes any motion of a pixel from a position in one image to a position in a subsequent image. In some embodiments, optical flow estimation returns, with respect to each pixel and/or group of pixels, a change in coordinates (x, y) of the pixel. In some embodiments, pixel motion between pairs of images may be estimated using additional and/or other methods. In some embodiments, system 100 may also compute the cumulative pixel coordinate difference acquired over a sequence of image frames.
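As a concrete, purely illustrative example, the snippet below estimates a dense optical flow field between two synthetic frames using OpenCV's Farneback algorithm. This is a publicly available two-frame method and not necessarily the simplified algorithm of the present disclosure; the frame sizes, the 2-pixel synthetic shift, and the parameter values are assumptions:

```python
import numpy as np
import cv2

# Two 8-bit grayscale frames; the second is the first shifted 2 px to the right,
# standing in for the small frame-to-frame motion of a high-rate stream.
prev = (np.random.rand(120, 160) * 255).astype(np.uint8)
curr = np.roll(prev, 2, axis=1)

# Dense flow; flow[..., 0] and flow[..., 1] hold the per-pixel displacements
# in x and y. Positional parameters: pyr_scale, levels, winsize, iterations,
# poly_n, poly_sigma, flags.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)

print("median dx:", float(np.median(flow[..., 0])))   # expected to be near 2
print("median dy:", float(np.median(flow[..., 1])))   # expected to be near 0
```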

FIGS. 2A-2B show a flowchart detailing the functional steps in a process for automated real-time generating of lower-noise, high dynamic range, motion compensated images of a scene, using a high frame rate imaging device, in accordance with some embodiments of the present invention.

In some embodiments, at step 202, a system such as exemplary system 100 in FIG. 1 may be configured to receive an input image stream depicting, e.g., a scene, which may comprise one or more moving objects, such as humans, pets, vehicles, and the like, or a relative motion between the camera and the scene.

In some embodiments, the input stream may be acquired using a high frame rate imaging device, e.g., between 60-10,000 fps. In some embodiments, an imaging device with a lower and/or higher frame rate may be used.

In some embodiments, the scene or parts of the scene depicted in the stream may be dimly lit, e.g., the image stream may be acquired under low light ambient conditions. In some embodiments, parts of the scene contain dark shaded regions. In some embodiments, the scene may comprise objects moving at a relatively high rate of motion. In some embodiments, the camera is moving at a relatively high rate of motion or angular motion relative to the scene.

In some embodiments, at step 204, the image frame stream may be continuously divided into consecutive sequences of n frames each, e.g., between 10-100 frames per sequence.

In some embodiments, the number of frames in each sequence may be determined by a combination of parameters, including, but not limited to, type and architecture of the computing platform, desired speed and quality outcomes, and the like. In some embodiments, the number of frames in a sequence may be dictated, e.g., by the computing power and processing times of the associated computing platform on which the process is to be performed. In some embodiments, the number of frames in a sequence may be dynamically adjusted, based, at least in part, on instant response times of the computing platform. Thus, for example, a first sequence may comprise a specified number of frames assuming frame processing time of, e.g., 80 ms, whereas a subsequent sequence may comprise, e.g., a larger number of frames, where instant processing times may have reduced to, e.g., 40 ms.
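The following generator is a hypothetical sketch of this division step: it groups an incoming frame stream into sequences of n frames and adjusts n between sequences based on the measured processing time of the previous sequence. The function name, the initial value of n, and the timing heuristic are assumptions for illustration only:

```python
import time

def split_into_sequences(frame_source, n_initial=32, n_min=10, n_max=100,
                         budget_s=0.08):
    """Yield consecutive sequences of n frames, adapting n to the time the
    consumer spends processing each yielded sequence (illustrative heuristic)."""
    n = n_initial
    sequence = []
    for frame in frame_source:
        sequence.append(frame)
        if len(sequence) >= n:
            start = time.perf_counter()
            yield list(sequence)                      # downstream pipeline runs here
            elapsed = time.perf_counter() - start
            if elapsed < 0.5 * budget_s:              # plenty of headroom: grow n
                n = min(n_max, n * 2)
            elif elapsed > budget_s:                  # falling behind: shrink n
                n = max(n_min, n // 2)
            sequence = []

# Usage sketch: iterate sequences from a (here, dummy) frame source.
for seq in split_into_sequences(range(300)):
    pass  # e.g., process_sequence(seq)
```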

In some embodiments, at step 206, the present disclosure may estimate pixel motion throughout pairs of adjacent frames in the sequence, in a continuous optical flow process which estimates pixel motion frame-to-frame throughout the sequence. In some embodiments, at step 206, the present disclosure uses multiple down-sampled levels of a pair of frames to estimate pixel motion throughout pairs of adjacent frames.

In some embodiments, at step 206, the present disclosure estimates pixel motion throughout a subset of multiple pairs of non-adjacent frames in the sequence, in a continuous optical flow process which estimates pixel motion through the sequence. In some embodiments, the estimated frame-to-frame pixel motion calculated at step 206 is jointly refined over a subset of frames and a subset of resolution levels using multi-frame temporal-spatial constraints over the motion vector field.

FIG. 3A is a schematic illustration of the iterative process of step 206. Accordingly, as illustrated in FIG. 3A, in some embodiments, system 100 may receive a current sequence comprising n frames, e.g., sequence i comprising frames N−2 through N+2, where n=5 is the number of frames in the sequence.

System 100 may then perform the following sub-steps of step 206 with respect to each current sequence.

    • (i) Step 206a: Estimate pixel-level motion, e.g., between:
      • a. frames [N−2]−[N−1],
      • b. frames [N−1]−[N],
      • c. frames [N]−[N+1], and
      • d. frames [N+1]−[N+2]; and
    • (ii) Step 206b: Calculate pixel-level motion field vector for each pixel over current sequence using the motion estimated over all resolution levels and frame pairs.

In some embodiments, pixel-level motion estimation may be refined over multiple down-sampled resolution levels, between each adjacent pair of frames in current sequence i.
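One way to obtain a per-pixel motion field from each frame to the representative time point is to chain the adjacent-pair flows, sampling the already-accumulated flow at the displaced pixel positions. The sketch below assumes float32 flow arrays shaped (H, W, 2) and uses cv2.remap for the bilinear sampling; only the frames preceding the reference are handled, and all names are illustrative assumptions rather than the disclosure's implementation:

```python
import numpy as np
import cv2

def compose_flows(flow_ab, flow_bc):
    """Chain two dense flows: displacement A->C equals A->B plus B->C sampled
    where each pixel lands after A->B (bilinear sampling via cv2.remap)."""
    h, w = flow_ab.shape[:2]
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    map_x = xs + flow_ab[..., 0]
    map_y = ys + flow_ab[..., 1]
    flow_bc_sampled = cv2.remap(flow_bc, map_x, map_y, cv2.INTER_LINEAR)
    return flow_ab + flow_bc_sampled

def flows_to_reference(pairwise_flows, ref_index):
    """From adjacent-pair flows [f(0->1), f(1->2), ...], return per-pixel motion
    fields from each frame to the reference frame at ref_index (frames before
    the reference only, for brevity)."""
    to_ref = {ref_index: np.zeros_like(pairwise_flows[0])}
    for k in range(ref_index - 1, -1, -1):
        to_ref[k] = compose_flows(pairwise_flows[k], to_ref[k + 1])
    return to_ref
```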

In some embodiments, adjacent pair motion estimation may be initialized using a result from another, e.g., preceding, pair in the sequence.

In some embodiments, pixel-level motion estimation according to the present disclosure may be performed using any suitable method, e.g., any suitable image motion estimation algorithms, such as any optical flow or 2D motion flow algorithm.

Optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene. Optical flow can also be defined as the distribution of apparent velocities of movement of brightness pattern in an image.

Thus, the optical flow methods try to calculate the motion between two image frames which are taken at times t and t+Δt at every pixel or voxel position. These methods are called differential, because they are based on local Taylor series approximations of the image signal; that is, they use partial derivatives with respect to the spatial and temporal coordinates.

For a 2D+t dimensional case, a voxel at location (x, y, t) with intensity I(x, y, t) will have moved by Δx, Δy and Δt between the two image frames, and the following brightness constancy constraint can be given:


I(x, y, t) = I(x + Δx, y + Δy, t + Δt).

Assuming the movement to be small, the image constraint at I(x, y, t) can be developed with a Taylor series to get:

I(x + Δx, y + Δy, t + Δt) = I(x, y, t) + (∂I/∂x)Δx + (∂I/∂y)Δy + (∂I/∂t)Δt + H.O.T.

From these equations it follows that:

(∂I/∂x)Δx + (∂I/∂y)Δy + (∂I/∂t)Δt = 0,

or, dividing by Δt,

(∂I/∂x)(Δx/Δt) + (∂I/∂y)(Δy/Δt) + (∂I/∂t)(Δt/Δt) = 0,

which results in

(∂I/∂x)Vx + (∂I/∂y)Vy + ∂I/∂t = 0,

where Vx, Vy are the x and y components of the velocity or optical flow of I(x, y, t), and ∂I/∂x, ∂I/∂y, and ∂I/∂t are the derivatives of the image at (x, y, t) in the corresponding directions. Writing Ix, Iy, and It for these derivatives, the constraint becomes:

IxVx + IyVy = −It,

or

∇Iᵀ·V = −It.

The above-detailed method describes computing the spatial and temporal derivatives of every pixel in the image Ix, Iy, It. However, computing the motion vector field for every pixel requires solving the ‘aperture problem’ of motion estimation. Known methods for solving the aperture problem include:

    • Phase correlation;
    • Block-based methods (minimizing the sum of squared differences or sum of absolute differences, or maximizing normalized cross-correlation);
    • Differential methods of estimating optical flow, based on partial derivatives of the image signal and/or the sought flow field and higher-order partial derivatives, such as:
      • Lucas-Kanade method: Regarding image patches and an affine model for the flow field,
      • Horn-Schunck method: Optimizing a functional based on residuals from the brightness constancy constraint, and a particular regularization term expressing the expected smoothness of the flow field,
      • Buxton-Buxton method: Based on a model of the motion of edges in image sequences,
      • Black-Jepson method: Coarse optical flow via correlation, and
      • General variational methods: A range of modifications/extensions of Horn-Schunck, using other data terms and other smoothness terms.
    • Discrete optimization methods: The search space is quantized, and then image matching is addressed through label assignment at every pixel, such that the corresponding deformation minimizes the distance between the source and the target image. The optimal solution is often recovered through Max-flow min-cut theorem algorithms, linear programming or belief propagation methods.

The main drawbacks of the optical flow methods are in the computation of derivatives: First, when the motion is too fast, the motion blur effect prevents accurate estimation of the derivatives, and reduces the accuracy of the method. Second, when short exposure is used to prevent motion blur, the distance due to spatial motion between frames is too large for correct estimation of the spatial derivatives. In most applications, these limitations restrict such solutions to low motion rates only.

In some embodiments, the present disclosure utilizes the Horn-Schunck method of estimating optical flow, which is a global method that introduces a global constraint of smoothness to solve the aperture problem. The method is iterative by nature; however, under certain conditions, such as small motion, it may be solved by one or a few iterations only.
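A minimal sketch of the classical Horn-Schunck iteration is given below, assuming grayscale floating-point frames, simple derivative filters, and the standard neighborhood-averaging kernel. The parameter values are illustrative assumptions; with the small frame-to-frame motion of a high-rate stream, n_iter may be kept at one or a few, as noted above:

```python
import numpy as np
import cv2

def horn_schunck(frame1, frame2, alpha=1.0, n_iter=1):
    """Classical Horn-Schunck flow between two grayscale frames.
    Returns the per-pixel velocity components (u, v)."""
    f1 = frame1.astype(np.float32)
    f2 = frame2.astype(np.float32)
    kx = np.array([[-0.5, 0.0, 0.5]], np.float32)
    Ix = cv2.filter2D(f1, -1, kx)          # spatial x-derivative
    Iy = cv2.filter2D(f1, -1, kx.T)        # spatial y-derivative
    It = f2 - f1                           # temporal derivative
    u = np.zeros_like(f1)
    v = np.zeros_like(f1)
    avg = np.array([[1, 2, 1], [2, 0, 2], [1, 2, 1]], np.float32) / 12.0
    for _ in range(n_iter):
        u_avg = cv2.filter2D(u, -1, avg)   # neighborhood averages (smoothness term)
        v_avg = cv2.filter2D(v, -1, avg)
        common = (Ix * u_avg + Iy * v_avg + It) / (alpha ** 2 + Ix ** 2 + Iy ** 2)
        u = u_avg - Ix * common            # Horn-Schunck update equations
        v = v_avg - Iy * common
    return u, v
```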

Variations of the technique may include:

    • Applying hierarchical down-sampling and using multi-scale resolution;
    • using key-point matching for better estimation on corners and junctions; or
    • feeding the motion from earlier frames as an initial guess.

In some embodiments, the present disclosure may be expanded to 3D world coordinates motion by introducing 3D or distance measurements, either dense or sparse, to estimate 3D velocities. In some embodiments, the present disclosure may be expanded to 3D world coordinates motion by using the ego-motion of the camera or system. In some embodiments, real time 3D point cloud or real-world position and orientation are fed as an input, and the ego-motion is computed using these and the objects, features or key-points output of the 2D motion detection.

Accordingly, in some embodiments, the present disclosure provides for calculating a motion vector field for each pixel in each frame in a current sequence, based on computation of the spatial and temporal derivatives in each adjacent frame pair, e.g., frames N−1, N in FIG. 3A, as described above. In some embodiments, spatial derivatives may be calculated by shifting a frame of a pair of adjacent frames in the x and y directions, and subtracting the result from the unshifted image, to produce the x, y derivatives Ix, Iy. In some embodiments, the pair of frames N−1, N are subtracted to produce the temporal derivative It.

In some embodiments, the present disclosure provides for calculating a motion vector field for each pixel in each frame in a current sequence, based on computation of the spatial and temporal derivatives between a subset of non-adjacent frame pairs, e.g., frames N, M, as described above. In some embodiments, spatial derivatives may be calculated by shifting a frame of a pair of non-adjacent frames in the x and y directions, and subtracting the result from the unshifted image, to produce the x, y derivatives Ix, Iy. In some embodiments, the pair of frames N, M are subtracted to produce the temporal derivative It.
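The shift-and-subtract derivative computation described above can be sketched as follows; the function name and the handling of the image borders are assumptions for illustration:

```python
import numpy as np

def shift_subtract_derivatives(frame_prev, frame_curr):
    """Spatial derivatives by shifting one frame of the pair by one pixel in x
    and in y and subtracting the unshifted image; temporal derivative by
    subtracting the frame pair. Border rows/columns are simply left at zero."""
    f = frame_prev.astype(np.float32)
    g = frame_curr.astype(np.float32)
    Ix = np.zeros_like(f)
    Iy = np.zeros_like(f)
    Ix[:, :-1] = f[:, 1:] - f[:, :-1]   # shift in x, subtract unshifted image
    Iy[:-1, :] = f[1:, :] - f[:-1, :]   # shift in y, subtract unshifted image
    It = g - f                          # subtract the pair of frames
    return Ix, Iy, It
```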

In some embodiments, the aperture problem may then be solved using a predefined input method, such as Lucas-Kanade, Horn-Schunck, or any other suitable method. In some embodiments, solving the aperture problem comprises, e.g., controlling input parameters of the process, e.g., block size, smoothness regulation term, or any other relevant parameter. In some embodiments, this process may further be initialized using a motion vector field estimate which may be taken from the result of another adjacent frame pair in the sequence.

In some embodiments, motion vector field calculation may utilize multi-scale resolution, also known as hierarchical resolution or pyramidal resolution. Accordingly, in some embodiments, frame pair N, M may be down-sampled at several ratios, such as, but not limited to, 2:1, 4:1, and up to a maximal ratio max:1, before the derivatives are calculated as described above.
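A simple way to obtain the 2:1, 4:1, and coarser levels is a Gaussian image pyramid; the sketch below uses OpenCV's pyrDown purely for illustration, with frame size and level count chosen arbitrarily:

```python
import numpy as np
import cv2

def build_pyramid(frame, max_levels=3):
    """Return [frame, frame at 2:1, frame at 4:1, ...]; each pyrDown call
    roughly halves the width and height."""
    levels = [frame]
    for _ in range(max_levels):
        levels.append(cv2.pyrDown(levels[-1]))
    return levels

pyr = build_pyramid(np.random.rand(128, 128).astype(np.float32))
print([p.shape for p in pyr])   # [(128, 128), (64, 64), (32, 32), (16, 16)]
```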

In some embodiments, the present process comprises detecting and matching key-points, possibly on multi-scale image pairs. In some embodiments, key-points may be output together with, or instead of, the motion field vector data.

In some embodiments, the present disclosure provides a significant reduction of the output stream bandwidth by outputting only processed data such as key-points, a feature list, or an object list. In some embodiments, the reduced data stream includes a 3D point cloud of key-points, features, or objects.

In some embodiments, the calculated derivatives may be used for solving the aperture problem, to produce the output motion vector field.

Accordingly, in some embodiments, a current frame sequence, e.g., sequence i in FIG. 3A, may be processed according to the iterative process of step 206 to produce a motion vector field for each pixel position in the frames comprising sequence i.

In some embodiments, at step 208 in FIG. 2A, the present disclosure may provide for outputting in real time the color motion vector fields calculated at step 206 at one or more desired time points, e.g., time point ti, for each current sequence, e.g., sequence i in FIG. 3A. In some embodiments, the output comprises all pixels of at least some of the frames in sequence i co-located into a representative pixel position, based on average motion vector fields and occlusion state calculations performed in step 206. For example, the present process may calculate and output motion vector fields at a desired and/or required point within each sequence, e.g., time points t1 and ti in FIG. 3A. For example, the output may reflect color-motion vector field as well as pixel values (e.g., intensity) at a point at the middle of the sequence, end of the sequence, and/or any point in between or after the end time of the sequence.

In some embodiments, with continued reference to step 208 in FIG. 2A, pixel values for all co-located pixels in representative frame nr may be combined to form an image output. Thus, the charge from same-colored pixels can be combined or binned, e.g., based on combining signal levels, a weighted average of values associated with pixel charges, and/or any other suitable method. In some embodiments, combining the pixels increases the SNR relative to the uncombined signal. In some embodiments, combining the pixels increases the dynamic range relative to the uncombined signal.

In some embodiments, at step 208, the present process may further be configured to generate a representative frame nr for each current sequence, e.g., sequence i in FIG. 3A, based on the outputted motion vector fields. In some embodiments, representative frame nr comprises all pixels of at least some of the frames in sequence i co-located into the representative pixel position, based on average motion vector fields and occlusion state calculations performed in step 206.

In some embodiments, occlusion state is defined per pixel per frame. In some embodiments, the occlusion is detected using pixel neighborhood similarity metrics between frames. In some embodiments, the occlusion is detected by detecting voids in the motion vector field.

Accordingly, in some embodiments, the present disclosure calculates the average motion over at least some of the frames, e.g., all frames or a subset of frames, in sequence i. In some embodiments, the present disclosure then applies motion-compensated average on the frames in sequence i, to produce an average ‘frozen’ image that represents an aggregate of all pixel values at each pixel position in frames in sequence i. In some embodiments, the aggregation is through averaging the pixel intensity values.

In some embodiments, step 208 comprises a global aligning, shifting, registration, and/or warping operation on at least a subset of frames in sequence i, to co-locate corresponding pixels in each frame in sequence i on a selected frame, e.g., a center frame, an intermediate frame, a middle frame, an end frame, etc., within sequence i.
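As an illustrative sketch of the co-location and averaging, the function below warps each frame onto the representative pixel positions using its per-pixel flow toward the reference and averages the co-located values. The inverse warp is approximated by negating the frame-to-reference flow, which is reasonable only for the small motions assumed here; occlusion handling and weighting are omitted, and all names and shapes are assumptions:

```python
import numpy as np
import cv2

def motion_compensated_average(frames, flows_to_ref):
    """Average the frames of a sequence after co-locating their pixels onto the
    representative (reference) positions.

    frames        : list of (H, W) float32 frames
    flows_to_ref  : list of (H, W, 2) float32 flows, frame k -> reference frame
    """
    h, w = frames[0].shape[:2]
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    acc = np.zeros((h, w), np.float32)
    for frame, flow in zip(frames, flows_to_ref):
        # For each reference position p, sample the source frame near p minus
        # its displacement toward the reference (small-motion approximation).
        map_x = xs - flow[..., 0]
        map_y = ys - flow[..., 1]
        acc += cv2.remap(frame, map_x, map_y, cv2.INTER_LINEAR)
    return acc / len(frames)
```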

In some embodiments, steps 206 and/or 208 may be performed with respect to a sparse frame sequence, comprising, e.g., a subset of frames in sequence i, e.g., the first, second, fourth, eighth frames, etc., or any other selected subset. FIG. 3B is a schematic illustration of the iterative process of step 206 performed with respect to a sparse frame sequence.

In some embodiments, the output of step 208 may contain the motion vector and color intensity fields data, as well as time stamps. In some embodiments, at step 208, the output may contain key-points plus time stamps. In some embodiments, at step 208, the output may contain objects, segments or features, plus time stamps.

In some embodiments, at step 208, the output may contain color intensity and motion vectors, attached to a list of objects, segments or key-points.

In some embodiments, the present disclosure may be configured to perform continuous motion detection. In this implementation, the present method must account for variations in speed, i.e., acceleration, of motion in the scene, and overcome noise and errors in the motion estimation, so as to provide continuity of motion across sequences and output time points. In practice, the average motion within a sequence or period of time differs from the average motion between a sequence and the preceding sequence, where the differences may be due to noise, computation errors, or acceleration. Accordingly, the present disclosure provides for computing inter-sequence motion vector fields to optimize for noise rejection and estimation of the acceleration component, based on motion fields and inter-sequence motion from preceding sequences. The difference between the motion field (V) and the inter-sequence motion field (Vinter) is, in practice, equivalent to the acceleration (Acc), as can be seen from the discrete motion formulas:


Vt = X(t+0.5) − X(t−0.5),

Vt−1 = X(t−0.5) − X(t−1.5),

Acc = Vt − Vt−1 = X(t+0.5) − 2X(t−0.5) + X(t−1.5).

However, a simplistic calculation of subtracting two noisy estimations leads to accumulation of errors. An alternative approach would be to approximate

Vinter = X(t) − X(t−1) ≈ [X(t+0.5) + X(t−0.5)]/2 − [X(t−0.5) + X(t−1.5)]/2 = [X(t+0.5) − X(t−0.5)]/2 + [X(t−0.5) − X(t−1.5)]/2 = (Vt + Vt−1)/2, such that Acc = 2(Vt − Vinter).
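A short numerical check of these relations, for a single pixel undergoing constant acceleration, is shown below (values are arbitrary illustrative units):

```python
# Positions sampled at t-1.5, t-0.5 and t+0.5 for x(s) = 0.5 * a * s^2.
a = 0.4                                                  # true acceleration
x = {s: 0.5 * a * s ** 2 for s in (-1.5, -0.5, 0.5)}

v_t     = x[0.5] - x[-0.5]       # average motion within the current sequence
v_prev  = x[-0.5] - x[-1.5]      # average motion within the preceding sequence
v_inter = (v_t + v_prev) / 2     # inter-sequence motion estimate
acc     = v_t - v_prev

print(acc, 2 * (v_t - v_inter))  # both print 0.4, the true acceleration a
```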

Accordingly, in some embodiments, at step 210, the present disclosure provides for the concurrent computation of two motion fields, e.g., the rate of motion (speed) at two time points, or a combination of rate of motion and acceleration, such that one is optimized for estimating the average motion at the specific sequence in time, while the other is a better estimation of the slower changes in motion over time.

In some embodiments, the present disclosure further utilizes information from earlier time points. Accordingly, in some embodiments, the present disclosure provides for inter-sequence motion vector field calculations, referring to both forms: acceleration or second motion. In an embodiment, at step 210, the inter-sequence motion of the representative image and pixels of the present sequence is computed. In some embodiments, at step 210, the output may contain the inter-sequence motion, provided in the form of a vector field. In an embodiment, in one simple form, the motion field is computed as the average motion within a sequence, while the inter-sequence motion is computed as the average of the present and previous motion fields. In an embodiment, in another simple form, the inter-sequence motion is computed as the difference between the present and the previous motion fields. In some implementations, the inter-sequence motion is computed by taking outputs from multiple earlier times, and applying algorithms that perform fitting and smoothing over one or more earlier results. In some embodiments, the algorithms also take the internal frame-to-frame motion and multiple-hierarchy motion within multiple frames to compute an even more accurate inter-sequence motion field. In an embodiment, at step 210, the inter-sequence motion is smoothed by linear averaging over motion fields from two or more sequences. In an embodiment, the inter-sequence motion is smoothed by at least one of: solving a higher level of polynomial fit using a linear regression over multiple time points; using a non-linear regression over multiple sequences; using a Kalman filter having the position, velocity and acceleration per pixel as internal states, wherein the relation between the current and previous Kalman states is used for computing the inter-sequence motion; using the fitted parameters of motion for computing the inter-sequence motion; and using optical flow algorithms. In an embodiment, a regression operation may be computed in a single-pixel manner, or over similar adjacent pixels, such that pixels with similar look and motion will be averaged together and fed into the regression. In an embodiment, the inter-sequence motion field across sequences is computed by solving the regression problem over a global space-time data set of consecutive sequences and hierarchies, using a smoothness condition over color and motion, and using an initial guess from earlier calculated motion estimation. In an embodiment, the inter-sequence motion is computed per object, segment or key-point.
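To make the Kalman-filter option concrete, the sketch below runs a constant-acceleration Kalman filter for one pixel and one motion component, with position, velocity and acceleration as internal states and the measured representative position as the observation. The noise levels, the scalar state layout, and the toy measurement sequence are assumptions for illustration; a full implementation would run such a filter per pixel (or per object, segment or key-point) and per component:

```python
import numpy as np

def kalman_motion_step(x, P, z, dt=1.0, q=1e-3, r=1e-1):
    """One predict/update cycle with state [position, velocity, acceleration]
    and a scalar position measurement z (illustrative noise levels q, r)."""
    F = np.array([[1.0, dt, 0.5 * dt ** 2],
                  [0.0, 1.0, dt],
                  [0.0, 0.0, 1.0]])           # constant-acceleration model
    H = np.array([[1.0, 0.0, 0.0]])           # only position is observed
    Q = q * np.eye(3)
    R = np.array([[r]])
    x = F @ x                                  # predict
    P = F @ P @ F.T + Q
    y = np.array([[z]]) - H @ x                # update with the measured position
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ y
    P = (np.eye(3) - K @ H) @ P
    return x, P                                # x[1]: velocity, x[2]: acceleration

# Toy usage: noisy positions of a pixel accelerating at 0.2 px/sequence^2.
rng = np.random.default_rng(0)
x, P = np.zeros((3, 1)), np.eye(3)
for t in range(1, 30):
    z = 0.5 * 0.2 * t ** 2 + rng.normal(scale=0.3)
    x, P = kalman_motion_step(x, P, z)
print("estimated velocity, acceleration:", float(x[1, 0]), float(x[2, 0]))
```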

In some embodiments, at step 212, the present disclosure may be configured for outputting, in real time, the intra-sequence motion vector field, the inter-sequence motion vector field, and/or the pixel values associated with representative pixel positions at the desired time point.

In some embodiments, the amount of output data transferred over the output connection may sometimes be required to be reduced. Accordingly, in some embodiments, the output stream is reduced by implementing any combination of video compression algorithms or HW accelerators to compress the motion and color intensity outputs. In some embodiments, at step 212, the output stream is reduced by outputting only key-points, features, segments, and/or objects, instead of the full color and vector motion fields. In some embodiments, at step 212, the output may contain the inter-sequence motion, provided in the form of a vector attached to a list of objects, segments or key-points.

In some embodiments, at step 212, the output may be used in conjunction with object detection and motion tracking of objects in the input stream.

For example, at steps 214-216, object detection and tracking may be performed using any suitable algorithm, based, e.g., in step 214, on clustering neighboring pixels with similar motion vectors, wherein such pixels may be associated with an object. In other embodiments, the present disclosure may use a network of convolutional layers, e.g., convolutional network module 112 in FIG. 1, that performs motion detection or compensation through derivative convolution kernels, down- and up-sampling kernels, low-pass kernels, or spatial shift kernels, within a single frame or between two or more frames in a sequence. In some embodiments, object detection or segmentation may be performed directly on the output images, as in step 216. In other embodiments, as illustrated in step 218, depth or other 3D information, such as from position-motion sensors, ego-motion estimation, and other sources, can be used for estimating 3D motion of objects, features and key-points in the scene. In other examples, additional and/or other motion detection algorithms may be used.
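
By way of non-limiting illustration, the following sketch clusters neighboring pixels with similar motion vectors using a simple region-growing pass; this is only one possible clustering approach, and the tolerance parameter and function name are illustrative assumptions.

```python
import numpy as np
from collections import deque

def cluster_by_motion(motion_field: np.ndarray, tol: float = 0.5) -> np.ndarray:
    """Label 4-connected pixels whose motion vectors differ by less than tol.

    motion_field: H x W x 2 per-pixel (dx, dy) vectors.
    Returns an H x W integer label map; each label may correspond to an object.
    """
    h, w, _ = motion_field.shape
    labels = np.zeros((h, w), dtype=np.int32)  # 0 = unlabeled
    current = 0
    for sr in range(h):
        for sc in range(w):
            if labels[sr, sc]:
                continue
            current += 1
            labels[sr, sc] = current
            queue = deque([(sr, sc)])
            while queue:
                r, c = queue.popleft()
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if 0 <= nr < h and 0 <= nc < w and not labels[nr, nc]:
                        # Grow the region only where the motion is similar.
                        if np.linalg.norm(motion_field[r, c] -
                                          motion_field[nr, nc]) < tol:
                            labels[nr, nc] = current
                            queue.append((nr, nc))
    return labels
```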

In some embodiments, a system of the present disclosure, such as system 100 in FIG. 1, may comprise more than one imaging device, e.g., cameras, in a stereo arrangement. In such cases, the two cameras are synchronized using, e.g., HW synch or a communication timing clock. Features and motion data may then be directly communicated between the two cameras, such that each camera unit can independently compute 3D motion from a stereo pair of synched frame streams. In some embodiments, an external processing unit may take the output of each single channel, including the averaged image and motion vectors, to compute the 3D motion vectors, and optionally a 3D point cloud output. FIG. 4 illustrates an exemplary variation of system 100 in FIG. 1. In some embodiments, system 100 may be expanded to include more than one imaging device, e.g., imaging devices 401, 402, in a stereo arrangement. The two imaging devices 401, 402 are synchronized using HW synch or a communication timing clock. Features and motion data may be directly passed between the two imaging devices, such that each unit can independently compute 3D motion from stereo. Alternatively, an external processor unit 420 may take the output of each imaging device 401, 402, including the averaged image and motion vectors, to compute the 3D motion vectors, as well as a 3D point cloud output.
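
By way of non-limiting illustration, the following sketch derives a coarse per-pixel 3D motion estimate from a rectified, synchronized stereo pair, assuming that per-pixel disparity maps for two consecutive representative frames and the 2D motion field between them have already been computed; the focal length, baseline and function names are illustrative assumptions, not the specific computation of the specification.

```python
import numpy as np

def depth_from_disparity(disparity: np.ndarray, focal_px: float,
                         baseline_m: float) -> np.ndarray:
    """Depth (meters) from disparity (pixels) for a rectified stereo pair."""
    d = np.where(disparity > 0, disparity, np.nan)  # guard against div-by-zero
    return focal_px * baseline_m / d

def motion_3d(depth_prev: np.ndarray, depth_curr: np.ndarray,
              motion_field: np.ndarray, focal_px: float) -> np.ndarray:
    """Coarse per-pixel 3D motion from two depth maps and a 2D motion field.

    motion_field: H x W x 2 (dx, dy) pixel displacements between the two
    representative frames; depth_* are H x W depth maps in meters.
    """
    dz = depth_curr - depth_prev
    # First-order conversion of pixel displacement to lateral metric motion
    # at the observed depth.
    dx = motion_field[..., 0] * depth_curr / focal_px
    dy = motion_field[..., 1] * depth_curr / focal_px
    return np.stack([dx, dy, dz], axis=-1)
```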

In some embodiments, the present disclosure may use the representative frames generated at step 208 to perform image analysis, object detection, segmentation, and the like.

In some embodiments, a low-volume, low-cost and low-power implementation of a system such as exemplary system 100 in FIG. 1 may use a 3D die stacking process. Thus, an image sensor die may be directly connected through interconnects such as micro-pillars or TSV, to a memory module die, which in turn may be connected to a processor, and/or an HW accelerator die module, such as DLA, video compression accelerator, and the like. Other schemes may also be realized, such as having the memory, processor and other HW accelerators on a single die.

FIG. 5 illustrates an exemplary implementation of a system of the present disclosure using 3D Die stacking. In FIG. 5, stacked chip 500 comprises an image sensor die 510, which may be directly connected through interconnects 540 (such as micro-pillars or TSV) to a memory module die 520, which in turn may be connected to a processor, e.g., an HW accelerator die module 530, such as DLA, video compression accelerator, etc. Other schemes may also be realized, such as having the memory, processor and other HW accelerators on a single die.

FIGS. 6A-6C illustrate an alternative highly parallel HW implementation of a system of the present disclosure. In some embodiments, an alternative highly-parallel HW implementation of the present system may comprise an image sensor 600 divided into smaller blocks of pixels, e.g., blocks of 256×256 pixels, wherein each block has its own dedicated pixel processing block 610. In some embodiments, memory 630 and processing HW 640 capabilities may be stacked atop. As shown in FIG. 6C, each processing HW block 610 may contain, e.g., a processor module 650, memory 660, logic 670, and/or HW accelerators 680, such as DLA, and the like. Each block may also be connected to the adjacent blocks by a local bus 690. In an embodiment, the parallelized HW is controlled by a main controller CPU. In an embodiment, the main controller is packaged together with the highly-parallel HW. Altogether, these blocks may form a highly parallel computation HW on board the sensor, with the advantage of being low-volume, low-cost and low-power due to the short distances, which makes the system useful as a sensor-edge-computing system for robotics, autonomous driving, surveillance and other IoT applications. In an embodiment, the present invention describes a system based on a ‘smart camera’, i.e., an image sensor with ‘edge processing’, which includes a fast image sensor, onboard memory, onboard processing power, and an onboard computation accelerator. In an embodiment, the present invention describes an autonomous robotic machine system, with smart camera visual sensors, utilizing the described method for scene object detection and motion tracking.
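
By way of non-limiting illustration, the following software sketch emulates the block-parallel layout by splitting a frame into 256×256 tiles and handing each tile to its own worker; the tile size, the thread-pool emulation and the placeholder per-block computation are illustrative assumptions and do not represent the on-sensor hardware itself.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

BLOCK = 256  # pixels per side of each sensor block (example value)

def process_block(tile: np.ndarray) -> np.ndarray:
    # Placeholder for the per-block motion computation that would run on the
    # block's dedicated processing hardware.
    return tile.astype(np.float64)

def process_frame_in_blocks(frame: np.ndarray) -> list:
    """Emulate the block-parallel layout: split the frame into 256x256 tiles
    and process each tile independently, as each HW block would."""
    h, w = frame.shape
    tiles = [frame[r:r + BLOCK, c:c + BLOCK]
             for r in range(0, h, BLOCK)
             for c in range(0, w, BLOCK)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(process_block, tiles))
```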

In some embodiments, the present method may be implemented using a monochrome sensor, where only a single intensity color plane is produced. In other embodiments, the sensor may be a color sensor such as a Bayer, a quad-Bayer (where each Bayer pixel is split into 4 sub-pixels), a nona-Bayer (9-sub-pixel Bayer), and/or any other spatial mosaic or depth-overlaid color arrangement. In the case of color sensors, the color data may be first converted into a monochrome intensity field, which is used for producing the motion vector field, and each color is motion-averaged separately before being de-mosaicked to produce the color image. In another embodiment, the color data is first de-mosaicked, and each color is used for computing the spatial-temporal derivatives for the motion field computation, wherein the gradients are computed over the sub-Bayer pixels of quad-Bayer, nona-Bayer, etc., and the spatial gradients are used for color de-mosaicking. In some embodiments, the temporal gradients are computed before de-mosaicking using single-color interpolation, while in other embodiments, the temporal gradients are computed after multi-color de-mosaicking. In some embodiments, the derivatives are computed per pixel of the image sensor resolution (e.g., 40 MP), while the optical flow is estimated at a lower resolution such as the output resolution (e.g., 4K/UHD/8 MP). In some embodiments, the motion field is computed per pixel of the image sensor resolution (e.g., 40 MP), while the output is at a lower resolution (e.g., 4K/UHD/8 MP), such that the initial resolution is used for sub-resolution shifts before down-sampling to the target output resolution.
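
By way of non-limiting illustration, the following sketch shows the first ordering described above for an RGGB Bayer mosaic: the raw mosaic is collapsed into a monochrome intensity plane that drives the motion-field computation, and each color plane is averaged separately before de-mosaicking. The RGGB layout, the half-resolution averaging, and the assumption that the frames have already been motion co-located are illustrative choices.

```python
import numpy as np

def bayer_to_monochrome(raw: np.ndarray) -> np.ndarray:
    """Collapse an RGGB Bayer mosaic into a half-resolution intensity plane
    by averaging each 2x2 cell; this plane drives the motion-field step."""
    return 0.25 * (raw[0::2, 0::2] + raw[0::2, 1::2] +
                   raw[1::2, 0::2] + raw[1::2, 1::2])

def average_color_planes(frames: list) -> dict:
    """Average each Bayer color plane over the sequence before de-mosaicking.

    The frames are assumed to have already been motion co-located, so a plain
    mean per color plane stands in for the motion-compensated average.
    """
    stack = np.stack(frames, axis=0).astype(np.float64)
    mean = stack.mean(axis=0)
    return {"R": mean[0::2, 0::2],
            "G1": mean[0::2, 1::2],
            "G2": mean[1::2, 0::2],
            "B": mean[1::2, 1::2]}
```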

In an embodiment, a region of interest (ROI) is defined dynamically, such that the fast frame rate motion analysis is performed over the ROI only. In an embodiment, concurrently with the motion detection in the ROI, the full frame is averaged over the fast-rate sequence, and motion detection analysis is performed on the averaged full frame at a low rate. In an embodiment, the ROI is dynamically changed using externally provided input. In an embodiment, the ROI is dynamically changed based on the motion detected over the averaged full frame at the low rate. In an embodiment, the ROI is dynamically changed based on an analysis of a low-resolution, low-rate calculation of the spatio-temporal gradients without solving the detailed motion field.
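
By way of non-limiting illustration, the following sketch selects a rectangular ROI from a low-rate temporal-gradient analysis of two consecutive averaged full frames, without solving the detailed motion field; the threshold, margin, and bounding-box form of the ROI are illustrative assumptions.

```python
import numpy as np

def roi_from_temporal_gradient(avg_prev: np.ndarray, avg_curr: np.ndarray,
                               thresh: float = 10.0, margin: int = 16):
    """Pick a bounding-box ROI where the low-rate temporal gradient is large.

    avg_prev, avg_curr: consecutive averaged full frames (H x W intensity).
    Returns (r0, r1, c0, c1) or None if nothing exceeds the threshold.
    """
    dt = np.abs(avg_curr.astype(np.float64) - avg_prev.astype(np.float64))
    mask = dt > thresh
    if not mask.any():
        return None
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    h, w = dt.shape
    return (max(rows[0] - margin, 0), min(rows[-1] + margin, h),
            max(cols[0] - margin, 0), min(cols[-1] + margin, w))
```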

In some embodiments, the present system may further provide for time stamping of each frame and of any output frame, using a virtual frame counter, while working at a frame rate dictated by an external synchronization signal.
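
By way of non-limiting illustration, the following sketch attaches a virtual frame counter and a derived time stamp to each frame, while the actual frame timing is dictated by an external synchronization signal; the nominal-period parameter and class name are illustrative assumptions.

```python
class VirtualFrameCounter:
    """Assigns a monotonically increasing index and time stamp to each frame,
    independently of the externally synchronized capture timing."""

    def __init__(self, nominal_period_s: float):
        self.nominal_period_s = nominal_period_s
        self.count = 0

    def stamp(self, frame):
        ts = self.count * self.nominal_period_s
        record = {"frame": frame, "index": self.count, "timestamp_s": ts}
        self.count += 1
        return record
```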

The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.

The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire. Rather, the computer readable storage medium is a non-transient (i.e., non-volatile) medium.

Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

These computer readable program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A system comprising:

at least one hardware processor; and
a non-transitory computer-readable storage medium having stored thereon program instructions, the program instructions executable by the at least one hardware processor to:
receive a high frame rate video stream of a scene, wherein said scene comprises at least one object in motion relative to an imaging device acquiring said video stream,
process, in real time, each current sequence of n frames in said video stream to: (i) estimate pixel motion between a first pair of adjacent frames in said current sequence, (ii) iteratively estimate pixel motion with respect to each new frame in said current sequence, relative to a cumulative value of all preceding pixel motion estimations in said current sequence, to calculate a current motion vector field for each pixel in said current sequence, (iii) calculate, based on all of said current motion vector fields, aggregated pixel values corresponding to each of said pixels in said current sequence, and (iv) repeat (i)-(iii) with respect to a next one of said sequences in said video stream, and
output, in real time, at least one of (x) said current motion vector field, (y) a cumulative value of all of said calculated current motion vector fields, and (z) said aggregate pixel values.

2. The system of claim 1, wherein said program instructions are further executable to repeat steps (i)-(iv) over consecutive pairs of said processed current sequences, and wherein said output further comprises at least one of: (a) said current motion vector fields for each pixel over a current one of said consecutive pair of said sequences, (b) a cumulative value of all of said calculated current motion vector fields, and (c) said aggregate pixel values for each pixel over said current consecutive pair of said sequences.

3. (canceled)

4. The system of claim 1, wherein said cumulative value of all of said calculated current motion vector fields is calculated based, at least in part, on solving multi-frame multi-level temporal-spatial smoothness constraints.

5. (canceled)

6. The system of claim 1, wherein said estimating is performed with respect to at least one non-adjacent pair of frames in said current sequence.

7. (canceled)

8. The system of claim 1, wherein said pixel motion estimations are performed using an optical flow algorithm.

9. (canceled)

10. (canceled)

11. The system of claim 1, wherein said high frame rate is between 60-10,000 frames per second (fps).

12. The system of claim 1, wherein said program instructions are further executable to generate a current representative frame with respect to each of said current sequences, wherein said current representative frame is associated with a desired time point in said sequence, and wherein each pixel in said current representative frame is assigned (i) a pixel position associated with said desired time point, and (ii) said respective aggregate pixel value.

13. The system of claim 12, wherein said program instructions are further executable to output each of said current representative frames at a frame rate that is lower than said high frame rate.

14. (canceled)

15. The system of claim 12, wherein said program instructions are further executable to receive one of: depth information, position-motion sensor information, and ego-motion information, with respect to said scene, wherein said pixel positions comprise position information in relation to three-dimensional (3D) world coordinates.

16. The system of claim 1, wherein said program instructions are further executable to (i) receive two of said video streams of said scene, and (ii) determine depth information with respect to said scene based, at least in part, on said two video streams.

17. (canceled)

18. A method comprising:

receiving a high frame rate video stream of a scene, wherein said scene comprises at least one object in motion relative to an imaging device acquiring said video stream;
processing, in real time, each current sequence of n frames in said video stream by:
(i) estimating pixel motion between a first pair of adjacent frames in said current sequence,
(ii) iteratively estimating pixel motion with respect to each new frame in said current sequence, relative to a cumulative value of all preceding pixel motion estimations in said current sequence, to calculate a current motion vector field for each pixel in said current sequence,
(iii) calculating, based on all of said current motion vector fields, aggregated pixel values corresponding to each of said pixels in said current sequence, and
(iv) repeating (i)-(iii) with respect to a next one of said sequences in said video stream; and
outputting, in real time, at least one of (x) said current motion vector field, (y) a cumulative value of all of said calculated current motion vector fields, and (z) said aggregate pixel values.

19. The method of claim 18, further comprising repeating steps (i)-(iv) over consecutive pairs of said processed current sequences, and wherein said outputting further comprises at least one of: (a) said current motion vector fields for each pixel over a current one of said consecutive pair of said sequences, (b) a cumulative value of all of said calculated current motion vector fields, and (c) said aggregate pixel values for each pixel over said current consecutive pair of said sequences.

20. (canceled)

21. The method of claim 18, wherein said cumulative value of all of said calculated current motion vector fields is calculated based, at least in part, on solving multi-frame multi-level temporal-spatial smoothness constraints.

22. (canceled)

23. The method of claim 18, wherein said pixel motion estimating is performed with respect to at least one non-adjacent pair of frames in said sequence.

24. (canceled)

25. The method of claim 18, wherein said estimating is performed using an optical flow algorithm.

26. (canceled)

27. (canceled)

28. The method of claim 18, wherein said high frame rate is between 60-10,000 frames per second (fps).

29. The method of claim 18, further comprising generating a current representative frame with respect to each of said current sequences, wherein said current representative frame is associated with a desired time point in said sequence, and wherein each pixel in said current representative frame is assigned (i) a pixel position associated with said desired time point, and (ii) said respective aggregate pixel values.

30. The method of claim 18, further comprising outputting each of said current representative frames at a frame rate that is lower than said high frame rate.

31. (canceled)

32. The method of claim 30, further comprising receiving one of: depth information, position-motion sensor information, and ego-motion information, with respect to said scene, wherein said pixel positions comprise position information in relation to three-dimensional (3D) world coordinates.

33. (canceled)

34. A computer program product comprising a non-transitory computer-readable storage medium having program instructions embodied therewith, the program instructions executable by at least one hardware processor to:

receive a high frame rate video stream of a scene, wherein said scene comprises at least one object in motion relative to an imaging device acquiring said video stream;
process, in real time, each sequence of n frames in said video stream to:
(i) estimate pixel motion between a first pair of adjacent frames in said current sequence,
(ii) iteratively estimate pixel motion with respect to each new frame in said current sequence, relative to a cumulative value of all preceding pixel motion estimations in said current sequence, to calculate a current motion vector field for each pixel in said current sequence,
(iii) calculate, based on all of said current motion vector fields, aggregated pixel values corresponding to each of said pixels in said current sequence, and
(iv) repeat (i)-(iii) with respect to a next one of said sequences in said video stream; and
output, in real time, at least one of (x) said current motion vector field, (y) a cumulative value of all of said motion vector fields, and (z) said aggregate pixel values.

35.-49. (canceled)

Patent History
Publication number: 20220279203
Type: Application
Filed: Jul 9, 2020
Publication Date: Sep 1, 2022
Inventor: Shmuel MANGAN (Nes Ziona)
Application Number: 17/625,751
Classifications
International Classification: H04N 19/513 (20060101); H04N 19/587 (20060101); H04N 19/597 (20060101); H04N 7/01 (20060101);