METHOD AND SYSTEM FOR OPTIMIZING DEPTH IMAGING
There is provided a system and method for optimizing depth imaging. The method includes: illuminating one or more scenes with illumination patterns; capturing one or more images of each of the scenes; reconstructing the scenes; estimating the reconstruction error and a gradient of the reconstruction error; iteratively performing until the reconstruction error reaches a predetermined error condition: determining a current set of control vectors and current set of reconstruction parameters; illuminating the one or more scenes with the illumination patterns governed by the current set of control vectors; capturing one or more images of each of the scenes while the scene is being illuminated with at least one of the illumination patterns; reconstructing the scenes from the one or more captured images using the current reconstruction parameters; and estimating an updated reconstruction error and gradient; and outputting at least one of control vectors and reconstruction parameters.
The following relates generally to image processing, and more specifically, to a method and system for optimizing depth imaging.
BACKGROUND
From natural user interfaces to self-driving cars and 3D printers, there is an ever-increasing need for sensors to capture the world in three dimensions (3D), and to do so in real time, accurately, and robustly. A particular type of camera, called an RGB-D camera, offers a source of input of 3D images. Generally, RGB-D cameras rely on some form of projected structured-light pattern or patterns to actively illuminate objects being imaged.
Fast and accurate structured-light imaging is getting increasingly popular. Already, the high pixel counts of modern smartphones and home-theater projectors theoretically allow 3D accuracies of 100 microns or less. Similar advances are occurring in the domain of time-of-flight (ToF) imaging as well, with inexpensive continuous-wave ToF sensors, programmable lasers, and spatial modulators becoming increasingly available. Unfortunately, despite the wide availability of all these devices, achieving optimal performance in a given structured-light imaging system is still a substantial challenge.
SUMMARY
In an aspect, there is provided a computer-implemented method for optimizing depth imaging, the method comprising: illuminating one or more scenes with illumination patterns governed by an initial set of control vectors; capturing one or more images of each of the scenes while the scene is being illuminated with at least one of the illumination patterns; reconstructing the scenes from the captured images with reconstruction parameters; estimating the reconstruction error and a gradient of the reconstruction error with respect to the control vectors and the reconstruction parameters; iteratively performing until the reconstruction error reaches a predetermined error condition: determining a current set of control vectors and current set of reconstruction parameters by updating at least one of the set of control vectors and the set of reconstruction parameters to reduce the reconstruction error; illuminating the one or more scenes with the illumination patterns governed by the current set of control vectors; capturing one or more images of each of the scenes while the scene is being illuminated with at least one of the illumination patterns; reconstructing the scenes from the one or more captured images using the current reconstruction parameters; and estimating an updated reconstruction error and an updated gradient of the reconstruction error with respect to the current control vectors and the current reconstruction parameters; and outputting at least one of the current control vectors and the current reconstruction parameters.
In a particular case of the method, estimating the reconstruction error comprises evaluating a function that penalizes depth errors with respect to a ground truth, and wherein iteratively reducing the reconstruction error comprises performing at least one of stochastic gradient descent and derivative-free optimization.
In another case, the initial control vectors comprise at least one of pre-existing control vectors, random control vectors, or low-contrast random control vectors.
In yet another case, updating the set of control vectors also comprises incorporating user-defined constraints comprising at least one of frequency content of the illumination patterns, amplitude of the illumination patterns, and total energy consumption of the illumination patterns.
In yet another case, the one or more scenes are computationally generated and restricted to lie in a selected subset of 3D space, wherein illuminating the one or more scenes with the illumination pattern comprises a computational simulation, wherein capturing the one or more images comprises computationally simulating image formation, and wherein estimating the gradient of the reconstruction error comprises determining a derivative based on an image formation model.
In yet another case, the one or more scenes comprise at least one surface, illuminating the one or more scenes with the illumination patterns comprises optical illumination, capturing the one or more images comprises optically capturing the one or more images, and estimating the gradient of the reconstruction error comprises optically estimating an image Jacobian with respect to the control vectors.
In yet another case, the one or more scenes comprise a randomly-textured surface that exhibits at least one of direct surface reflection, sub-surface scattering, or surface inter-reflection.
In yet another case, the control vectors comprise at least one of a discretized time-varying illumination pattern and a discretized time-varying pixel demodulation function.
In another aspect, there is provided a system for optimizing depth imaging, the system comprising one or more processors in communication with a data storage, the one or more processors configurable to execute: an illumination module to direct illumination of one or more scenes with illumination patterns governed by an initial set of control vectors; a capture module to receive one or more captured images of each of the scenes while the scene is being illuminated with at least one of the illumination patterns; a reconstruction module to: reconstruct the scenes from the captured images with reconstruction parameters; estimate the reconstruction error and a gradient of the reconstruction error with respect to the control vectors and the reconstruction parameters; and iteratively perform until the reconstruction error reaches a predetermined error condition: determining a current set of control vectors and current set of reconstruction parameters by updating at least one of the set of control vectors and the set of reconstruction parameters to reduce the reconstruction error; illuminating the one or more scenes with the illumination patterns governed by the current set of control vectors; capturing one or more images of each of the scenes while the scene is being illuminated with at least one of the illumination patterns; reconstructing the scenes from the one or more captured images using the current reconstruction parameters; and estimating an updated reconstruction error and an updated gradient of the reconstruction error with respect to the current control vectors and the current reconstruction parameters; and an output interface to output at least one of the current control vectors and the current reconstruction parameters.
In a particular case of the system, estimating the reconstruction error comprises evaluating a function that penalizes depth errors with respect to a ground truth, and wherein iteratively reducing the reconstruction error comprises performing at least one of stochastic gradient descent and derivative-free optimization.
In another case, the initial control vectors comprise at least one of pre-existing control vectors, random control vectors, or low-contrast random control vectors.
In yet another case, updating the set of control vectors also comprises incorporating user-defined constraints comprising at least one of frequency content of the illumination patterns, amplitude of the illumination patterns, and total energy consumption of the illumination patterns.
In yet another case, the one or more scenes are computationally generated and restricted to lie in a selected subset of 3D space, wherein illuminating the one or more scenes with the illumination pattern comprises a computational simulation, wherein capturing the one or more images comprises computationally simulating image formation, and wherein estimating the gradient of the reconstruction error comprises determining a derivative based on an image formation model.
In yet another case, the one or more scenes comprise at least one surface, illuminating the one or more scenes with the illumination patterns comprises optical illumination, capturing the one or more images comprises optically capturing the one or more images, and estimating the gradient of the reconstruction error comprises optically estimating an image Jacobian with respect to the control vectors.
In another aspect, there is provided a computer-implemented method for generating a depth image of a scene, the method comprising: illuminating the scene with one or more illumination patterns, each pattern comprising a plurality of discretized elements, intensity of each element governed by a code vector; capturing one or more images of the scene while the scene is being illuminated; for each pixel, generating an observation vector comprising at least one intensity recorded at the pixel for each of the captured images; for each pixel, determining the code vector that best corresponds with the respective observation vector by maximizing the zero-mean normalized cross-correlation (ZNCC); for each pixel, determining a depth value from the best-corresponding code vector; and outputting the depth values as a depth image.
In a particular case of the method, each observation vector incorporates intensities of neighbouring image pixels, and wherein each code vector incorporates neighbouring discretized intensities.
In another case, the method further comprising: using a trained artificial neural network to transform each observation vector to a higher-dimensional feature vector; and using a trained artificial neural network to transform each code vector to a higher-dimensional feature vector, wherein determining the code vector that best corresponds with the respective observation vector comprises maximizing the ZNCC between the transformed respective observation vector and the transformed code vectors.
In yet another case, each illumination pattern is a discretized two-dimensional pattern that is projected onto a scene from a viewpoint that is distinct from the captured images, wherein each element in the pattern is a projected pixel, and wherein determining the depth value from the best-corresponding code vector comprises triangulation.
In yet another case, each illumination pattern comprises multiple wavelength bands, wherein the observation vector at each pixel comprises the raw or demosaiced intensities of each wavelength band for the respective pixel.
In yet another case, the discretized elements of each illumination pattern comprise a discretized time-varying pattern that modulates the intensity of a light source, each element in the pattern is associated with a time-of-flight delay and a code vector, and wherein determining the depth value from the best-corresponding code vector comprises multiplication by the speed of light.
These and other aspects are contemplated and described herein. It will be appreciated that the foregoing summary sets out representative aspects of systems and methods for optimizing depth imaging to assist skilled readers in understanding the following detailed description.
A greater understanding of the embodiments will be had with reference to the Figures, in which:
Embodiments will now be described with reference to the figures. For simplicity and clarity of illustration, where considered appropriate, reference numerals may be repeated among the Figures to indicate corresponding or analogous elements. In addition, numerous specific details are set forth in order to provide a thorough understanding of the embodiments described herein. However, it will be understood by those of ordinary skill in the art that the embodiments described herein may be practiced without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to obscure the embodiments described herein. Also, the description is not to be considered as limiting the scope of the embodiments described herein.
Various terms used throughout the present description may be read and understood as follows, unless the context indicates otherwise: “or” as used throughout is inclusive, as though written “and/or”; singular articles and pronouns as used throughout include their plural forms, and vice versa; similarly, gendered pronouns include their counterpart pronouns so that pronouns should not be understood as limiting anything described herein to use, implementation, performance, etc. by a single gender; “exemplary” should be understood as “illustrative” or “exemplifying” and not necessarily as “preferred” over other embodiments. Further definitions for terms may be set out herein; these may apply to prior and subsequent instances of those terms, as will be understood from a reading of the present description.
Any module, unit, component, server, computer, terminal, engine or device exemplified herein that executes instructions may include or otherwise have access to computer readable media such as storage media, computer storage media, or data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Computer storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by an application, module, or both. Any such computer storage media may be part of the device or accessible or connectable thereto. Further, unless the context clearly indicates otherwise, any processor or controller set out herein may be implemented as a singular processor or as a plurality of processors. The plurality of processors may be arrayed or distributed, and any processing function referred to herein may be carried out by one or by a plurality of processors, even though a single processor may be exemplified. Any method, application or module herein described may be implemented using computer readable/executable instructions that may be stored or otherwise held by such computer readable media and executed by the one or more processors.
Generally, structured light applies a predefined illumination pattern which can be used in conjunction with three-dimensional (3D) reconstruction algorithms to arrive at a 3D reconstruction of an imaged item or scene. The present inventors have advantageously determined illumination patterns, having greater performance than other approaches, using machine learning-based optimization.
Generally, the present inventors have determined that performance of a given pattern can depend on the precise imaging system hardware (i.e., the choice of projector and the choice of camera). The present embodiments, advantageously, make it possible to automatically learn patterns that are fine-tuned to the specific hardware, yielding up to orders of magnitude higher accuracy in some cases compared to other approaches. In addition to generating the patterns, the present embodiments also provide for “decoding” of such patterns; for example, transforming captured images into precise 3D geometry.
Accordingly, the present embodiments provide a machine learning based optimization approach for automatically generating structured-light patterns that are optimized to produce high 3D measurement accuracy. The present embodiments also provide a “decoding” algorithm to convert intensities observed at a specific pixel across two or more images into a 3D distance measurement (for example, “depth”). The present embodiments also provide a machine learning based optimization approach that can automatically generate structured-light patterns that are customized for a particular hardware system, for even higher 3D accuracy. The present embodiments also provide a machine learning based optimization approach that jointly determines (a) best possible patterns and (b) best possible “decoding” algorithms to turn pixel intensities into 3D measurements.
As an example, the present embodiments can address the problem of automatically generating sequences of structured-light patterns for active stereo triangulation of a static scene. Unlike other approaches that use predetermined patterns and reconstruction algorithms tied to them, embodiments described herein, as an example, can generate patterns on-the-fly in response to certain specifications: number of patterns, projector-camera arrangement, workspace constraints, spatial frequency content, and the like. Pattern sequences can be specifically optimized to minimize an expected rate of correspondence errors under specifications for an unknown scene, and can be coupled to a sequence-independent algorithm for per-pixel disparity estimation. To achieve this, embodiments described herein can be used to derive an objective function that is relatively easy to optimize within a maximum-likelihood framework. By minimizing this objective function, pattern sequences can be discovered automatically. For example, the present inventors generated such sequences in under three minutes on a laptop, and the sequences were determined to outperform other triangulation techniques.
For structured-light triangulation, the choice of projection patterns generally has a great effect on usefulness. Over the years, the field has seen significant boosts in performance, in robustness, 3D accuracy, speed and versatility, due to new types of projection patterns, and new vision algorithms tailored to them. Underlying such advancements is the question of what are the optimal patterns to use and what algorithm should process the images they create? This question was posed more than twenty years ago but the answer was generally deemed intractable. Generally, pattern design has largely been driven by practical considerations and by intuitive concepts borrowed from many fields (for example, communications, coding theory, number theory, numerical analysis, and the like).
The present embodiments provide an approach to determination of optimal patterns for structured light. In an application of the embodiments, an approach is shown for projecting a sequence of patterns one by one onto a static scene and using a camera to estimate per-pixel depth by triangulation. Starting from first principles, an objective function is derived over the space of pattern sequences that quantifies the expected number of incorrect stereo correspondences, and then it is minimized.
In an example, an optimization using the present embodiments takes as input a projector's resolution and the desired number of projection patterns. In addition to these parameters, the present embodiments can generate patterns that are precisely optimized for 3D accuracy using a particular system (see, for example,
In embodiments of the present disclosure, a maximum-likelihood decoding approach can be used for determining stereo correspondences independently of projection pattern. This approach is not only computationally competitive with pattern-specific decoders, but also makes the pattern optimization problem itself tractable. In this way, by giving a way to quantify the expected errors a pattern sequence will cause, the present embodiments lead to an objective function over sequences that can be optimized numerically.
Advantageously, the present embodiments can turn structured-light imaging from a problem of algorithm design (for example, for creating patterns, unwrapping phases, computing correspondences, handling projector defocus) into one of problem specification (how many patterns, what working volume, what imaging system, etc.). Also advantageously, the present embodiments can demonstrate discovery of pattern sequences that can outperform other encoding schemes on hard cases: low numbers of patterns, geometrically-complex scenes, low signal-to-noise ratios, and the like. Also advantageously, the present embodiments provide for the emergence of imaging systems that can confer robustness to indirect light without restrictions on frequency content, giving newfound degrees of freedom for pattern optimization; this larger design space can be explored automatically with the present approach. Also advantageously, the present embodiments can provide a formulation that gives rise to new families of pattern sequences with unique properties, including (1) sequences designed to recover approximate, rather than exact, correspondences, and (2) sequences designed with information about free space and stereo geometry already built in. This encodes geometric scene constraints directly into the optical domain for added reliability, via the patterns themselves, rather than enforcing them by post-processing less reliable 3D data.
Generally, structured-light triangulation requires addressing two basic questions: (1) what patterns to project onto a scene and (2) how to determine projector-camera stereo correspondences from the images captured of the scene. Generally, a “good” set of projection patterns can be thought of as solving a one-dimensional position encoding problem for pixels on an epipolar line. Conversely, determining the stereo correspondence of a camera pixel can be generally thought of as a position decoding problem.
For determining a code matrix, a set of K projection patterns implicitly assigns a K-dimensional code vector cp to each pixel p on the epipolar line (see the example of
For position decoding, a camera pixel q is considered. The K intensities observed at that pixel define a K-dimensional observation vector oq. Given this vector and the code matrix C, the goal of position decoding is to infer its corresponding projector pixel p*. This can be a difficult problem because observations are corrupted by measurement noise and because the relation between observation vectors and code vectors can be highly non-trivial for general scenes. Inferring the projector pixel p* can be formulated as a maximum-likelihood (ML) problem:
where Pr(oq|cp) is the likelihood that the code vector of pixel q's true stereo correspondence is column p of C. While this formulation may be vaguely close, in spirit, to Bayesian time-of-flight depth estimation, the image formation model and decoding procedure are very different. Note that the inferred correspondence p* may or may not agree with the true correspondence p (see the example of
For position encoding, the code matrix C can be chosen to minimize decoding error. For a given projector-camera system and a specific scene, this error is quantified by counting the incorrect correspondences produced by a decoder (such as a machine learning decoder of the present embodiments):
where Match(q) is the true stereo correspondence of image pixel q; ε is a tolerance threshold that permits small correspondence errors; 1( ) is the indicator function; and the summation is over all pixels on the epipolar line. Note that evaluating the error function in Equation (3) for a given scene and imaging system requires optimization, i.e., solving the decoding problem in Equation (2).
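The error count described above can be sketched in a few lines of Python. This is only an illustration under an assumption: a decoded correspondence is taken to be incorrect when it differs from Match(q) by more than ε pixels, which is one reading of the tolerance threshold. The function name `count_errors` and the sample values are hypothetical.

```python
def count_errors(decoded, match, eps=0):
    """Count camera pixels whose decoded projector pixel is off by more than eps.

    decoded[q] is the decoder's output for camera pixel q and match[q] is the
    true correspondence. The test abs(decoded - match) > eps is an ASSUMED
    reading of the tolerance threshold described in the text.
    """
    return sum(1 for p, m in zip(decoded, match) if abs(p - m) > eps)


# Hypothetical decoded and true correspondences for four camera pixels.
decoded = [3, 7, 10, 2]
match = [3, 8, 14, 2]
errors_exact = count_errors(decoded, match, eps=0)  # pixels 1 and 2 are wrong
errors_tol = count_errors(decoded, match, eps=1)    # only pixel 2 is wrong
```

With a tolerance of one pixel, the off-by-one decoding at the second camera pixel is no longer counted as an error.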
An optimal position encoding can be formulated as the problem of finding a code matrix Cε* that minimizes the expected number of incorrect correspondences:
where E[ ] denotes expectation over a user-specified domain of plausible scenes and imaging conditions. Cε* is referred to as the optimal code matrix for tolerance ε.
The present embodiments provide an approach to solving the nested optimization problem in Equation (4) that is efficient to compute and can exploit imaging-system-specific information and user constraints. In an embodiment, the problem is cast as an optimization in the space of plausible epipolar transport matrices. The present embodiments can thus use a correlation-based maximum-likelihood (ML) decoder for structured-light reconstruction that is nearly optimal in low-noise settings. Using this decoder, the present embodiments provide a softmax-based approximation to the objective function of Equation (4) and minimize it to obtain patterns that minimize the expected number of stereo mismatches.
To simplify formal analysis, it can be assumed that all light transport is epipolar. Specifically, it is assumed that observation vectors depend only on code vectors on the corresponding epipolar line. This condition applies to conventionally-acquired images when global light transport, projector defocus and camera defocus are negligible. It also applies to all images captured by an epipolar-only imaging system regardless of scene content, even in the presence of severe global light transport.
When epipolar-only imaging holds and the system has been calibrated radiometrically, the relation between code vectors and observation vectors is given by (see the example of
where o1, . . . , oM are the observation vectors of all pixels on an epipolar line; a1, . . . , aM are contributions of ambient illumination to these pixels; 1 is a column vector of all ones; matrix e is the observation noise; and T is the N×M epipolar transport matrix. Element T[p, q] of this matrix describes the total flux transported from projector pixel p to camera pixel q by direct surface reflection, global transport, and projector or camera defocus. An example of observation matrix O is shown in
The epipolar-only model of Equation (5) encodes the geometry and reflectance of the scene as well as the scene's imaging conditions. It follows that the expectation in the position-encoding objective function of Equation (4) is expressed most appropriately as an expectation over plausible epipolar transport matrices T, ambient vectors a, and noise matrices e.
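As an illustration of the image formation model, the following Python sketch simulates observation vectors from a code matrix, a transport matrix, ambient contributions, and noise. It assumes the model takes the form O = C T + 1 [a1 ... aM] + e, which is inferred from the symbol definitions above rather than stated explicitly; the function name `form_observations` and all numeric values are hypothetical.

```python
import random


def form_observations(C, T, ambient, noise_std, rng):
    """Simulate the K x M observation matrix O.

    ASSUMED model (inferred from the symbol definitions in the text):
    O = C T + 1 [a_1 ... a_M] + e, with C of size K x N, T of size N x M,
    one ambient term per camera pixel, and zero-mean Gaussian noise e.
    """
    K, N, M = len(C), len(T), len(T[0])
    O = [[0.0] * M for _ in range(K)]
    for k in range(K):
        for q in range(M):
            signal = sum(C[k][p] * T[p][q] for p in range(N))
            O[k][q] = signal + ambient[q] + rng.gauss(0.0, noise_std)
    return O


# Hypothetical example: K = 2 patterns, N = 3 projector pixels, M = 2 camera pixels.
C = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
T = [[0.5, 0.0],   # camera pixel 0 sees projector pixel 0
     [0.0, 0.0],
     [0.0, 0.8]]   # camera pixel 1 sees projector pixel 2
ambient = [0.1, 0.2]
O = form_observations(C, T, ambient, noise_std=0.0, rng=random.Random(0))
```

With the noise switched off, each column of O is the matching code vector scaled by the transport element and shifted by the ambient term.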
For the space of plausible matrices T, even though the space of N×M matrices is extremely large, the matrices relevant to structured-light imaging belong to a much smaller space. This is because the elements of T associated with indirect light generally have far smaller magnitude than direct elements, and can thus be ignored. This in turn makes likelihoods and expectations very efficient to compute. In particular, the embodiments consider ML-decoding and optimal encoding for the following three families:
- (A) Direct-only T, unconstrained: The non-zero elements of T represent direct surface reflections and each camera pixel receives light from at most one projector pixel. It follows that each column of T contains at most one non-zero element. Moreover, the location of that element can be considered a true stereo correspondence. The observation vector is therefore a noisy scaled-and-shifted code vector:
  where vector eq denotes noise. It is assumed that the location of the non-zero element in each column of T is drawn randomly from the set {1, . . . , N} and its value, T[p, q], is an i.i.d. uniform random variable over [0,1]. This amounts to being completely agnostic about the location and magnitude of T's non-zero elements.
- (B) Direct-only T with geometry constraints: The above family is restricted to exclude geometrically-implausible stereo correspondences. These are elements of T whose associated 3D rays either intersect behind the image plane or outside a user-specified working volume (see the example of FIG. 4A). These invalid elements are specified with a binary indicator matrix G (see the examples of FIGS. 4B and 4C). Given this matrix, it can be assumed that the location of the non-zero element in each column of T is drawn uniformly from the column's valid elements. FIG. 4B is a geometric illustration of T being lower triangular because the 3D rays of all other elements intersect behind the camera. FIG. 4C is a geometric illustration of how T's non-zero elements are restricted even further by knowledge of the working volume (e.g., black square in (a)): its depth range (red) and its angular extent from the projector (green) and the camera (blue) define regions in T whose intersection contains all valid correspondences.
- (C) Direct-only T with projector defocus: The above two families do not model projector defocus. In some cases, this not only can prevent correct modeling of the defocused projection patterns that may illuminate some points, but also may ignore the rich shape information available in the defocus cue. Since a camera pixel may receive light from multiple projector pixels, the observation vector can be a noisy scaled-and-shifted mixture of code vectors:
  where T is a direct-only transport matrix from families (A) or (B). The coefficients bipq in Equation (7) account for the defocus kernel. This kernel is depth dependent and thus each matrix element T[p, q] is associated with a different set of coefficients. The coefficients themselves can be computed by calibrating the projector. Equation (7) can be made to conform to the epipolar image formation model of Equation (5) by setting the scene's transport matrix to be a new matrix T′ whose i-th row is T′[i, q]=T[p, q]bipq.
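Drawing a direct-only transport matrix from families (A) and (B) can be sketched as follows. This is a minimal illustration, not the embodiments' implementation: the mask `valid` is a hypothetical stand-in for the information carried by matrix G, with True marking geometrically plausible elements (the polarity is an assumption), and the function name is invented for the example.

```python
import random


def sample_direct_only_T(valid, rng):
    """Draw a direct-only N x M transport matrix and its true correspondences.

    valid[p][q] is True where projector pixel p is a plausible match for camera
    pixel q (an ASSUMED encoding of the geometric constraints). Each column
    receives one non-zero element at a uniformly chosen valid row, with a
    value drawn uniformly from [0, 1].
    """
    N, M = len(valid), len(valid[0])
    T = [[0.0] * M for _ in range(N)]
    match = []
    for q in range(M):
        candidates = [p for p in range(N) if valid[p][q]]
        p = rng.choice(candidates)  # raises IndexError if a column has no valid row
        T[p][q] = rng.uniform(0.0, 1.0)
        match.append(p)
    return T, match


# Family (A): every element is allowed.
N, M = 5, 4
unconstrained = [[True] * M for _ in range(N)]
T_a, match_a = sample_direct_only_T(unconstrained, random.Random(1))

# Family (B), hypothetical lower-triangular constraint: only rows p >= q are valid.
lower = [[p >= q for q in range(M)] for p in range(N)]
T_b, match_b = sample_direct_only_T(lower, random.Random(1))
```

The returned list `match` records the row of the non-zero element in each column, which plays the role of the true stereo correspondence when evaluating a decoder.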
For the observation noise and ambient vector, the optimality of the ML position decoder generally relies on noise being signal independent and normally distributed. The position encoder, on the other hand, can accommodate any model of sensor noise as long as its parameters are known. In some cases, it can be assumed that the elements of the ambient vector a follow a uniform distribution over [0, amax], where amax is the maximum contribution of ambient light expressed as a fraction of the maximum pixel intensity.
In an example, suppose a code matrix C and an observation vector oq, which conforms to the epipolar-only image formation model, are given. A task is to identify the stereo correspondence of pixel q by seeking a generic solution to this problem that does not impose constraints on the contents of the code matrix: it can contain code vectors defined a priori, such as MPS or XOR codes, or be a general matrix computed automatically through optimization.
To solve the above, the present embodiments can determine a zero-mean normalized cross-correlation (ZNCC) between oq and the code vectors, and choose the one that maximizes it. This approach becomes optimal as noise goes to zero and as the variances of individual code vectors become the same.
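A minimal Python sketch of this decoding step is given below. It uses the standard definition of ZNCC (subtract each vector's mean, normalize to unit length, take the dot product); the function names, the handling of constant vectors, and the example code vectors are assumptions made for illustration.

```python
import math


def zncc(u, v):
    """Zero-mean normalized cross-correlation of two equal-length vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    du = [x - mean_u for x in u]
    dv = [x - mean_v for x in v]
    norm = math.sqrt(sum(x * x for x in du)) * math.sqrt(sum(x * x for x in dv))
    if norm == 0.0:
        return 0.0  # constant vector: ZNCC is undefined, treated as 0 in this sketch
    return sum(a * b for a, b in zip(du, dv)) / norm


def zncc_decode(observation, code_vectors):
    """Return the index p of the code vector that maximizes ZNCC with the observation."""
    scores = [zncc(observation, c) for c in code_vectors]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical code matrix stored as a list of its N = 4 columns, with K = 3 patterns.
codes = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
# Noise-free observation: code vector 1 scaled by 0.4 and shifted by an ambient term of 0.2.
obs = [0.4 * x + 0.2 for x in codes[1]]
p_star = zncc_decode(obs, codes)
```

Because ZNCC is invariant to positive scaling and to a constant offset, the scaled-and-shifted observation still correlates perfectly with its own code vector, so the decoder needs no knowledge of the albedo or the ambient level.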
For decoding, if observation vectors and code vectors are related according to Equation (6) then:
where v is the variance of the variances of the N code vectors:
mean( ) and var( ) are over the elements of a code vector, σ is the noise standard deviation, and Pr(oq|cp) is defined by marginalizing over ambient contributions and values of T[p, q]:
where the ZNCC Decoder is defined as:
For defocused decoding, if observation vectors and code vectors are related according to Equation (7) then:
where the N×N matrix Tq holds the defocus kernel at camera pixel q for all possible corresponding pixels p, i.e., Tq[i,p]=bipq.
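One way to picture defocused decoding is to blur the code vectors with the defocus kernel before correlating. The Python sketch below assumes that the decoder compares the observation against the mixtures formed by summing bipq-weighted code vectors over i, one mixture per candidate p; this is an inference from the description of Equation (7), and the function names and kernel values are hypothetical.

```python
import math


def zncc(u, v):
    """Zero-mean normalized cross-correlation (0 if either vector is constant)."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    du = [x - mean_u for x in u]
    dv = [x - mean_v for x in v]
    norm = math.sqrt(sum(x * x for x in du)) * math.sqrt(sum(x * x for x in dv))
    return sum(a * b for a, b in zip(du, dv)) / norm if norm else 0.0


def defocused_codes(code_vectors, Tq):
    """Mix code vectors with the defocus kernel: mixture p = sum_i Tq[i][p] * c_i."""
    N, K = len(code_vectors), len(code_vectors[0])
    return [[sum(Tq[i][p] * code_vectors[i][k] for i in range(N)) for k in range(K)]
            for p in range(N)]


def defocused_decode(observation, code_vectors, Tq):
    """ASSUMED decoder: maximize ZNCC against the defocus-mixed code vectors."""
    scores = [zncc(observation, m) for m in defocused_codes(code_vectors, Tq)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical example with N = 3 code vectors and a three-tap blur kernel.
codes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
Tq = [[0.50, 0.25, 0.00],
      [0.25, 0.50, 0.25],
      [0.00, 0.25, 0.50]]
mixed = defocused_codes(codes, Tq)
obs = [0.6 * x + 0.1 for x in mixed[2]]  # true match p = 2, scaled and shifted
p_star = defocused_decode(obs, codes, Tq)
```

In this sketch the kernel is the same for every camera pixel; in the text the kernel is depth dependent, so a separate Tq would be supplied for each q.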
The near-optimality of the ZNCC decoder is advantageous for at least two reasons. First, it suggests that there is potentially no accuracy advantage to be gained by designing decoding algorithms tailor-made for specific codes (see for example
The approach can begin by developing a continuous approximation to the function Error( ) in Equation (3). This function counts the decoding errors that occur when a given code matrix C is applied to a specific scene and imaging condition, i.e., a specific transport matrix T, observation noise e, and ambient vector a. To evaluate the position-encoding objective function on matrix C, S fair samples are drawn over T, e and a:
In some cases, a softmax approximation can be used for counting decoding errors. Consider a binary variable that indicates whether or not the optimal decoder matched camera pixel q to a projector pixel p. This variable can be approximated by a continuous function in three steps using Equations (15) to (17) below. Equation (15) states that in order for projector pixel p to be matched to q, the likelihood of p's code vector must be greater than all others. Equation (16) then follows, allowing the replacement of likelihoods with ZNCC scores. Lastly, Equation (17) approximates the indicator variable with a softmax ratio; as the scalar μ goes to infinity, the ratio tends to 1 if pixel p's ZNCC score is the largest and tends to 0 otherwise:
To count all correct matches on an epipolar line, the softmax ratio can be evaluated at the true stereo match of every pixel q, and then their sum is computed. Using the notation in Equation (18):
Finally, incorporating the tolerance parameter ε to permit small errors in stereo correspondences:
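The softmax ratio and the tolerance-aware count described above can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, the ZNCC scores are made-up values, and the way the tolerance ε is handled here (summing the softmax weights of all columns within ε of the true match) is an assumption, since the typeset equations are not reproduced in this text.

```python
import math

def softmax_ratios(scores, mu):
    """Softmax over ZNCC scores with multiplier mu (shifted for numerical stability)."""
    top = max(scores)
    exps = [math.exp(mu * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_correct(scores, true_p, mu, eps=0):
    """Soft indicator that the decoded column lies within eps of true_p."""
    weights = softmax_ratios(scores, mu)
    lo = max(0, true_p - eps)
    hi = min(len(weights) - 1, true_p + eps)
    return sum(weights[lo:hi + 1])

# Hypothetical ZNCC scores of one camera pixel against 4 projector columns.
scores = [0.10, 0.92, 0.95, 0.20]
sharp = soft_correct(scores, true_p=2, mu=300.0)             # nearly a hard decision
soft = soft_correct(scores, true_p=2, mu=1.0)                # far from a hard decision
tolerant = soft_correct(scores, true_p=1, mu=300.0, eps=1)   # a neighbour of the winner counts
```

Summing `soft_correct` over every camera pixel on an epipolar line gives a differentiable stand-in for the number of correct matches, which is what makes gradient-based optimization of the code matrix possible.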
For sampling of scenes and imaging conditions, a direct-only matrix is constructed whose geometric constraints are a matrix G. Firstly, a valid stereo correspondence is randomly assigned to each camera pixel according to G; in this way, in some cases, the correspondences can be generated to be restricted to lie in a particular subset of 3D space, governed by matrix G. This specifies the location of the single non-zero element in each column of T (see for example
For optimization, an Adam optimizer is used to perform stochastic gradient descent on the objective function in Equation (13) with a fixed learning rate, for example, of 0.01. In some cases, user-specified parameters can be (1) the number of projector pixels N; (2) the number of camera pixels M; (3) the number of projection patterns K; (4) the desired tolerance parameter ε; and (5) the geometric constraint matrix G. The result of the optimization is a code matrix Cε*.
In an example, the optimization is initialized with a random K×N code matrix C, and a total of S=500 samples (T, e, a) are drawn at iteration 1 to define the objective function of Equation (13). These samples act as a “validation set” and remain fixed until a predetermined error condition is reached (for example, until the error is below a threshold value, until the error is minimized, or until convergence). For gradient calculations, a minibatch is used containing two new randomly-drawn samples per iteration. In an example, optimization converges in around 250 iterations (152 seconds on an 8-core 2.3 GHz laptop for a six-pattern matrix). It was found that increasing the number of samples had no appreciable effect on the quality of Cε* (i.e., the number of decoding errors on other randomly-generated scenes and imaging conditions). In contrast, it was found that the value of the softmax multiplier μ has an appreciable effect; there is significant degradation in quality for μ<300, but increasing it beyond that value has little effect. In this example, μ=300 was used for all results shown.
For frequency-constrained projection patterns, many structured-light techniques advocate use of projection patterns with spatial frequency no larger than a user-specified threshold F. This can be viewed as an additional design constraint on the optimal code matrix. To explicitly enforce it, the embodiments can project the code matrix computed at each iteration onto the space of matrices satisfying the constraint.
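One common way to realize such a projection is to zero every Fourier coefficient above the threshold in each pattern, which gives the closest band-limited pattern in the least-squares sense. The sketch below assumes that approach and that F is measured in cycles per pattern width; neither detail is stated in this text, and the function names are hypothetical. A naive O(N²) discrete Fourier transform is used so that the example has no dependencies.

```python
import cmath
import math

def lowpass_row(row, max_freq):
    """Project one pattern onto signals with spatial frequency <= max_freq."""
    n = len(row)
    spectrum = [sum(row[x] * cmath.exp(-2j * math.pi * f * x / n) for x in range(n))
                for f in range(n)]
    for f in range(n):
        # Bins f and n - f describe the same absolute frequency.
        if min(f, n - f) > max_freq:
            spectrum[f] = 0.0
    return [sum(spectrum[f] * cmath.exp(2j * math.pi * f * x / n)
                for f in range(n)).real / n
            for x in range(n)]

def project_code_matrix(code_matrix, max_freq):
    """Apply the frequency constraint to every pattern (row) of a K x N matrix."""
    return [lowpass_row(row, max_freq) for row in code_matrix]

# Example: a pattern mixing frequency 1 (allowed) and frequency 6 (not allowed).
N = 16
row = [math.cos(2 * math.pi * x / N) + math.cos(2 * math.pi * 6 * x / N)
       for x in range(N)]
filtered = lowpass_row(row, max_freq=4)
projected = project_code_matrix([row, filtered], max_freq=4)
```

In an iterative optimizer this projection would be applied after each gradient step; any clipping to the projector's valid intensity range would be a separate step.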
For advanced sensor noise modeling, although the ZNCC decoder is generally optimal for additive Gaussian noise, the objective function in Equation (13) can incorporate any sensor noise model; for example, samples are simply drawn of e from the camera's noise distribution. The present inventors determined that this can improve significantly the real-world performance of the optimized codes.
To generate a space of optimal code matrices, in an example experiment of the present embodiments,
For the example illustration of
- Row 1: The maximum spatial frequency of the patterns is set to F=4 and the image PSNR to be maximal for our imaging conditions (frame rate=50 Hz, camera gain=1, known read noise, pixel intensity that spans the full interval [0, 1]). Then the optimal code matrix is computed for a 608-pixel projector for different numbers of patterns and no other constraints.
- Row 2: Then K=4 is selected and optimal matrices are computed for different bounds on the maximum spatial frequency, with everything else fixed as above.
- Row 3: The frequency is set to 8 and optimal matrices are computed for different values of pixel PSNR (i.e., the maximum image intensity gets increasingly smaller), again with everything else fixed as above.
- Rows 4 and 5: The same approach is followed for different lower bounds on disparity (i.e., the maximum scene depth is increasingly being restricted), and different tolerances in correspondence error.
In an example experiment described herein, images were captured at 50 Hz and 8 bits with a 1280×1024 monochrome camera supplied by IDS (model IDS UI-3240CP-M), fitted with a Lensation F/1.6 lens (model CVM0411). For pattern projection, a 100-lumen DLP projector by Keynote Photonics (model LC3000) was used with a native resolution of 608×684 and only the red LED turned on. Gamma correction was disabled, the system's linear radiometric response was verified, and the sensor's photon transfer curve was measured. This made it possible to get a precise measure of PSNR independently for each pixel on the target. Three different models of pixel noise were experimented with for position-encoding optimization: (1) additive Gaussian, (2) Poisson shot noise with additive read noise, and (3) exponential noise with additive read noise.
For ground truth, a random noise pattern of bounded frequency was printed onto a white sheet of paper and placed on a planar target 60 cm away from the stereo pair (see for example
For quantitative evaluation, focus was placed on the most challenging cases: very small number of patterns and low PSNR. To evaluate low-PSNR performance, the aperture was reduced so that the brightest pixel intensity under a white projection pattern is 60, and the pixels are counted whose correspondences are within ε of the ground truth. The example of
In the top row and the first two columns of the bottom row of
The qualitative results of the example experiments for reconstructions of several objects are shown in
The top of
Advantageously, the embodiments described herein, with the position-encoding objective function, can be viewed as an extremely simple one-layer neural network.
Embodiments described herein provide a method and system to provide three-dimensional (3D) imaging using a projector with a set of patterns and a camera to capture intensities of light reflected from a scene to create accurate 3D models of that scene.
Generally, the principle of triangulation is used to determine correspondence between points or pixels projected by the projector and points or pixels captured by the camera. In this way, the system needs to determine, for approximately every point on the projector, its correspondence with a point on the camera.
In order to determine this correspondence, a process of projecting different patterns onto the scene and capturing the reflected light at the camera is repeated. For each pixel, the camera senses different intensities by measuring intensities for each respective projected pattern, given knowledge of the intensity of the pixel that was projected. Typically, the correspondence of pixels can be determined by projecting a large number of patterns of light. However, this can be problematic where there is limited time or energy, where patterns need to be projected quickly (such as for moving objects), or where imaging is done outdoors and it is not desirable to expend a large amount of energy projecting very bright patterns.
Embodiments described herein can advantageously be used to get good geometry determinations of the scene by determining correspondence with a relatively small number of patterns, for example 20 patterns, and a relatively low amount of energy.
Embodiments described herein can be used to design patterns that are custom designed for a particular system arrangement and setting; for example, where it is known where the camera is positioned and where the projector is positioned. In this case, tailored patterns that optimize for that system can be determined very quickly, for example within a couple of minutes. Embodiments described herein can be used to determine geometry in a way that is relatively robust to noise, especially for low light conditions that have more noise relative to signal.
Additionally, embodiments described herein can be used to generate correspondence algorithms that are independent of the patterns that are being generated. Thus, algorithms presented herein provide pixel correspondence that is simple and general, and can be used regardless of what patterns are used. Thus, in some cases, the correspondence algorithms can make any structured light system more accurate by capturing geometry for any pattern.
Also provided herein is a method and system for determining correspondence regardless of the projector and camera used, and their respective settings. Instead of assuming information about the camera and the projector is known, methods of the present embodiments allow the system to discover such properties of the camera and the projector.
Embodiments of the method and system use neural networks to learn optimal projection patterns to generalize previous approaches and give significant improvements in accuracy.
In a method of the present embodiments, an object of known geometry is placed in the scene, with the projector projecting onto it and the camera receiving light reflected off it. In a particular case, the object is a planar board with one of its faces directed between the projector and the camera. In a particular case, this planar board has a pattern (texture) affixed to it; for example, a random greyscale image.
In this example, a planar board is used because the geometry of the board is easily known. The texture is used because it can force the system to resolve correspondence regardless of what a local neighborhood of a particular point looks like.
In an exemplary case, determining correspondence for each pixel received by the camera on the image with a corresponding projector pixel can be done by considering a neighborhood of that pixel, for example typically 3-pixels-wide-by-3-pixels-high. In this example, the projected patterns are separated one-dimensional strips (columns) that are 1-pixel-wide with 3 or more pixels in height. In some cases, each column can have the same intensity.
In this example, to train the neural network, many patterns are projected onto the known planar board to most or all of the points on the planar board; in some cases, 30, 50, or 100 patterns depending on the desired accuracy. With all these projected patterns, it can be expected that resulting captured training dataset will likely give reasonably good geometry. Then the system fits the planar surface to the captured training dataset. Then for each pixel, because the system fits an object of known geometry to the captured training dataset, the system can know which captured pixel generally corresponds to each projected pixel. Because it is a known planar board, even if there are a few outliers, the system can use it as a ground truth.
The system can project ‘K’ patterns onto the scene of known geometry to yield potentially thousands of training samples (one per image row). The system can then capture images of the scene and randomly sample, for example, 15% of rows. A gradient is determined using:
is evaluated at the samples.
In this way, the system measures how a small intensity change at pixel q of projection pattern k affects the intensity of camera pixel p. The system thus projects the pattern k in a current iteration and captures the image. The system can then modify the pattern by adding a small value to pixel q. The modified pattern is projected and a new image is captured. The above gradient is determined from their difference.
In this way, the encoding scheme is generated in real time, and optimized for a particular setup and signal-to-noise ratio of actual capture session.
To determine accuracy of the neural network, the system can project, for example, four predetermined patterns onto the planar board. The patterns are captured by the pixels of the camera, passed through the neural network, and correspondence is outputted. This correspondence can be checked to ensure that it is correct with respect to what is expected for a planar surface. This checking can produce a loss function that can be optimized against the ground truth. In this way, the system can trust that the geometry is captured accurately.
When accuracy is evaluated, the system determines what fraction of pixels get the correspondence exactly correct, or determines an area to see how well the neural network performs in matching pixels together. For example, ϵ0 is a measurement of how many are exactly correct, ϵ1 is a measurement of how many are correct within one pixel away, ϵ2 is a measurement of how many are correct within two pixels away, and so on.
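The tolerance-based accuracy measures described above can be computed as in the following sketch. The function name and the sample correspondences are hypothetical and serve only to illustrate the counting.

```python
def accuracy_within(predicted, ground_truth, eps):
    """Fraction of pixels whose decoded column is within eps of the ground truth."""
    hits = sum(1 for p, g in zip(predicted, ground_truth) if abs(p - g) <= eps)
    return hits / len(ground_truth)

# Made-up decoded columns and ground-truth columns for five camera pixels.
predicted = [10, 12, 15, 20, 31]
ground_truth = [10, 13, 17, 20, 40]

# Accuracy at tolerances of 0, 1, and 2 pixels.
acc = {eps: accuracy_within(predicted, ground_truth, eps) for eps in (0, 1, 2)}
```

Here the per-pixel errors are 0, 1, 2, 0, and 9 columns, so the fraction counted as correct grows as the tolerance is relaxed, while the gross outlier is never counted.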
In an example of the above, four patterns can be used and captured as input to the neural network. With a 3×3 matrix of captured pixels, and four different patterns, there are 36 pixels in total that describe a local neighborhood across all the projected patterns; thus, a 36-dimensional vector. This 36-dimensional vector can be passed through, for example, a neural network having convolutional layers of 50 dimensions. The system then does the same for the projected pixels. In this example, a column 3 pixels high, and four different patterns, produces a 12-dimensional vector. This vector is passed through the 50-dimensional convolutional layers.
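The construction of the two feature vectors can be sketched as follows. The function names, the nested-list image layout, and the treatment of each projected column as a strip of repeated intensity are assumptions made for illustration; the neural network layers themselves are not shown.

```python
def camera_feature(images, row, col):
    """Stack the 3x3 neighbourhood of (row, col) across all K captured images."""
    return [img[r][c]
            for img in images
            for r in range(row - 1, row + 2)
            for c in range(col - 1, col + 2)]

def projector_feature(patterns, col, height=3):
    """Stack a 1-pixel-wide strip of `height` pixels at column col across K patterns.

    Each pattern is a list of per-column intensities; every pixel in a column
    is assumed to share that intensity, so the strip repeats the column value.
    """
    return [pat[col] for pat in patterns for _ in range(height)]

# Four synthetic 5x5 "captured images" and four 3-column "patterns".
images = [[[k * 100 + r * 10 + c for c in range(5)] for r in range(5)]
          for k in range(4)]
patterns = [[0.1 * (k + 1), 0.2 * (k + 1), 0.3 * (k + 1)] for k in range(4)]

cam_vec = camera_feature(images, row=2, col=2)    # 4 patterns x 9 pixels
proj_vec = projector_feature(patterns, col=1)     # 4 patterns x 3 pixels
```

The two vectors have different lengths (36 and 12), which is why each is mapped into a common 50-dimensional space before any similarity score is computed between them.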
In this example, the pixels can be matched by passing the above output through a Zero-mean Normalized Cross-Correlation (ZNCC). This output is then passed through softmax to determine which neighborhoods provide most likely correspondence. In this way, the neural network can learn weights of most likely correspondence between the pixels. In experimentation, this gives a high degree of accuracy, for example, at or above 70% accuracy.
Advantageously, the embodiments described herein can start with random patterns, and cameras and projectors with unknown properties, and learn pixel correspondence itself. This allows the system to determine depth and geometry without having to use specified equipment, even though different types of cameras and projectors work differently even with the same structured light patterns. It also allows a user to swap out different equipment or patterns as necessary.
Additionally, conventional systems typically use grey-scale cameras and projectors. With the present embodiments, the system can use color patterns and color cameras, which can possibly mean using fewer patterns and thus having comparably better performance.
In some cases, it may be useful to use a material of the known training object to train the system if the user is ultimately trying to scan a class of objects with that material because it can provide even better performance and accuracy.
The system of the present embodiments is thus able to reconstruct (almost) anything, quickly, with a low power source, at high accuracy (for a given system), and with relatively high spatial density. Additionally, the system may be able to generalize these abilities to new imaging systems without any calibration or new programming, or prior training data.
In some cases, the system can use post-processing; for example, clipping, local cleanup, global optimization, or the like.
In embodiments of the present disclosure, the present inventors developed optical auto-tuning for optimal performance of a structured-light imaging system. Optical auto-tuning allows for optimization that can learn on the fly, at least, (1) optimal illuminations to use for multi-shot depth acquisition of a static scene, and (2) optimal mapping from the captured shots to the scene's depth map. See for example
In the present embodiments, optical auto-tuning can proceed by controlling in real-time the system it is optimizing, and capturing images with it. In some cases, the only inputs to the optimization required are the number of shots and an optional penalty function to be applied to the depth error of each pixel. In some cases, present embodiments of optical auto-tuning can be completely automatic, requiring no manual initialization, parameter tuning, system calibration, or prior training data. In some cases, present embodiments of optical auto-tuning can minimize a rigorously-derived estimate of the expected reconstruction error for the system at hand. In some cases, present embodiments of optical auto-tuning can optimize this objective without having a precise image formation model for the system or the scenes of interest.
In some cases of the present embodiments of optical auto-tuning, the hardest computations in the optimization, such as calculating derivatives that depend on an accurate model of the system, can be performed in the optical domain, which provides demonstrable computational efficiency. Advantageously, present embodiments of optical auto-tuning can treat the imaging system as a perfect (or near perfect) “end-to-end model” of itself, with realistic noise and optical imperfections all included. See for example
The present disclosure provides, in an embodiment, an optimization approach that runs partly in the numerical and partly in the optical domain. Optical auto-tuning starts from a random set of K illuminations; uses them to illuminate an actual scene; captures real images to estimate the gradient of the expected reconstruction error; and updates its illuminations according to Stochastic Gradient Descent (SGD). In some cases, the system's light sources can be flexible enough to allow small adjustments to their illumination, and an independent mechanism is available to repeatedly acquire higher-accuracy (but possibly still noisy) depth maps of that scene.
Previous approaches and techniques generally require very precise models of the system or extensive training data, whereas the present embodiments may not require either. Further, the present embodiments advantageously replace “hard” numerical computations with “easy” optical ones. Further, optical auto-tuning can, in some cases, train a small neural network with a problem-specific loss; noisy labels and noisy gradients; and with training and data-augmentation strategies implemented partly in the optical domain.
Advantageously, present embodiments of optical auto-tuning allow for a common computational framework for the optimization of many types of systems, from grayscale, to color, to coded imaging, making optimization possible regardless of modality. Advantageously, present embodiments of optical auto-tuning remove many of the calibration steps required for high accuracy structured-light imaging (color and radiometric calibration, defocus modeling, and the like). Advantageously, present embodiments of optical auto-tuning produce patterns of much higher frequency than used by other approaches. This suggests that the bandwidth of spatial frequencies useful for structured light is far broader and can lead to accuracy improvements when exploited.
Referring now to
The output interface 106 enables another electronic device or computing device to transmit data or receive the outputs from the system 100, as described herein. In some embodiments, the output interface 106 enables users to view such outputs via, for example, a display or monitor. In some cases, the outputs from the system 100 can also be stored in the data storage 104. The input interface 110, alone or in conjunction with the output interface 106, taking direction from the illumination module 108 and/or the capture module 109, can communicate with certain devices, such as an image sensor 130 and a projector 140, which can be internal or external to the system 100. The image sensor 130 can be any suitable image acquisition device; for example, a visible spectrum camera, an infrared camera, a smartphone camera, a per-pixel coded-imaging camera, or the like. The projector 140 can be any suitable device for projecting illumination, in any suitable spectrum, onto the scene; for example, a digital micromirror device (DMD)-based projector, a laser-based projector, a Liquid Crystal on Silicon (LCoS)-based projector, or the like. The projector 140 has a level of granularity or spatio-temporal resolution as described herein.
The projector 140 projects structured light onto a scene and can be used to control image formation in an extremely fine-grained, almost continuous, manner. In some cases, the projector 140 can adjust a scene's illumination at the resolution of individual gray levels of a single projector pixel. In some cases, the projector 140 can comprise spatial light modulators that can do likewise for phase or polarization. In some cases, the projector 140 can comprise programmable laser drivers that can smoothly control the temporal waveform of a laser at sub-microsecond scales. In some cases, the projector 140 can comprise sensors with coded-exposure or correlation capabilities that can adjust their spatio-temporal response at pixel and microsecond scales.
The system 100 can be used to optimize programmable imaging systems that use the projector 140 for fine-grained control of illumination and sensing. For example, the system 100 can approximate a differentiable imaging system. Generally, differentiable imaging systems have the property that a small adjustment to their settings can cause a small, predictable change to the image they output (as exemplified in
In the present embodiments, an imaging system is considered differentiable if the following two conditions hold:
- 1) The behaviour of its sources, sensors and optics during the exposure time is governed by a single N-dimensional vector, called a control vector, that takes continuous values; and
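- 2) For a stationary scene S, the directional derivatives of the image with respect to the system's control vector; i.e.,
- $$D_{\mathbf{a}}\,\mathrm{img}(\mathbf{c}, S) \;=\; \lim_{h \to 0} \frac{\mathrm{img}(\mathbf{c} + h\,\mathbf{a},\, S) - \mathrm{img}(\mathbf{c},\, S)}{h},$$
- are well defined for control vectors c and adjustments a, where img(c, S) is the noise-less image.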
Advantageously, differentiable imaging systems open the possibility of optical auto-tuning, iteratively adjusting their behaviour in real time via optical-domain differentiation, to optimize performance in a given task.
For depth imaging, the optimization module 112 determines a solution to the optimization. The determination uses:
- a differentiable imaging system that outputs a noisy intensity image i in response to a control vector c;
- a differentiable reconstruction function that estimates a depth map d from a sequence of K≥1 images acquired with control vectors c1, . . . , cK
- where θ is a vector of additional tunable parameters (which comprise ‘reconstruction parameters’ referred to herein); and
- an error function err( ) that penalizes differences between the estimated depth map and the ground-truth depth map g.
The optimization module 112 determines the solution to the optimization by determining the parameters that minimize expected reconstruction error:
with expectation taken over noise and a space of plausible scenes.
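In the notation of this section, the reconstruction function and the objective described above can be written as shown below. This is a reconstruction from the surrounding definitions rather than a reproduction of the original typeset equations; in particular, the name rec( ) for the reconstruction function and the choice to apply err( ) to the difference d − g are assumptions.

$$\mathbf{d} = \mathrm{rec}(\mathbf{i}_1, \mathbf{c}_1, \ldots, \mathbf{i}_K, \mathbf{c}_K, \theta), \qquad \underset{\mathbf{c}_1, \ldots, \mathbf{c}_K,\, \theta}{\arg\min}\;\; \mathbb{E}_{\text{scenes},\,\text{noise}}\big[\, \mathrm{err}(\mathbf{d} - \mathbf{g}) \,\big]$$

Here each image i_k is the noisy output of the imaging system for control vector c_k, so the control vectors influence the objective both through the images they produce and, where applicable, directly through the reconstruction function.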
At block 202, the illumination module 108 directs the projector 140 to illuminate the scene with an initial set of illuminations. At block 204, the capture module 109 communicates with the image sensor 130 to capture an image of the scene during the illumination. At block 206, the optimization module 112 estimates a gradient of a reconstruction error. At block 208, the reconstruction module 114 determines a reconstruction, the reconstruction comprising iteratively performing until the reconstruction error is minimized: at block 210, the optimization module 112 updates the illuminations by minimizing the gradient of the reconstruction error; at block 212, the illumination module 108 directs the projector 140 to illuminate the scene with updated illuminations; at block 214, the capture module 109 communicates with the image sensor 130 to capture an updated image of the scene during the illumination; at block 216, the reconstruction module 114 estimates a reconstruction depth map; and at block 218, the optimization module 112 estimates an updated gradient of the reconstruction error. At block 220, the output interface outputs the reconstruction.
In some cases, the initial set of illuminations can be selected at random. In some cases, the optimization module 112 further determines a control vector for each of the sets of illuminations, the control vector comprising a pattern for the illuminations. In further cases, the reconstruction module 114 further determines a differentiable reconstruction function to estimate a depth map for each image captured, the differentiable reconstruction function comprising the respective control vectors. In further cases, estimating the gradient of the reconstruction error comprises penalizing a difference between the estimated reconstruction depth map and a ground-truth depth map. In further cases, the ground-truth depth map is determined using a randomly-textured surface. In further cases, minimizing the gradient of the reconstruction error comprises determining the control vectors that minimize the reconstruction error using a trainable Stochastic Gradient Descent (SGD). In further cases, the gradient of the reconstruction error is determined using an image Jacobian comprising the control vectors and the pixels in the respective captured images. In further cases, estimating the reconstruction depth map comprises determining directional derivatives using the image Jacobian.
In further cases, determining the reconstruction further comprises determining stereo correspondence, comprising: treating intensities observed at a neighbourhood of pixels of the captured image as a feature vector; comparing the captured pixel intensities to a vector of intensities at linear segments of the structured light pattern projected at the scene; and using a trained artificial neural network, selecting the portions of the captured image that are most similar to portions of the structured light pattern according to the zero-mean normalized cross-correlation (ZNCC) score.
In the present embodiments, it is assumed that both images and depth maps are represented as row vectors of M pixels. Different combinations of light source, sensor, reconstruction function and error function lead to different instances of the system optimization problem (as exemplified in
In the hypothetical case where there is a perfect forward model for an image formation process, there would be a perfect model for (1) the system's light sources, optics, and sensors, (2) the scenes to be imaged, and (3) the light transport between them. In this case, optimization techniques, for example, Stochastic Gradient Descent (SGD), allow for minimization of a system-optimization objective numerically, by approximating it with a sum that evaluates reconstruction error for realistic noise and for a large set of fairly-drawn, synthetic training scenes. The gradient of this sum is then evaluated with respect to the unknowns θ, c1, . . . , cK, and SGD can be applied to (locally) minimize it.
Replacing the first expectation in the error function with a sum, there is provided:
where dt, gt are the reconstructed shape and ground-truth shape of the t-th training scene, St, respectively, and xerr( ) is its expected reconstruction error.
Practically, there may not be sufficient information about the imaging system and its noise properties to reproduce them exactly, or the forward image formation model may be too complex or expensive to simulate. Differentiable imaging systems of the present embodiments can allow the system 100 to overcome these technical limitations by implementing the difficult gradient calculations directly in the optical domain.
In an embodiment, SGD can be used to evaluate a gradient with respect to θ and c1, . . . , cK of the expected error:
with points of evaluation omitted for brevity and T denoting the matrix transpose. Of all the individual terms in the above equations, only one depends on a precise model of the system and scene: the image Jacobian J(c,S).
Because the system 100 captures an M-pixel image in response to an N-element control vector, J(c,S) is an N×M matrix. Element [n, m] of this matrix tells the system how the intensity of image pixel m will change if element n of the control vector is adjusted by an infinitesimal amount. As such, it is related to the system's directional image derivatives by a matrix-vector product:
$$D_{\mathbf{a}}\,\mathrm{img}(\mathbf{c}, S) \;=\; \mathbf{a}\, J(\mathbf{c}, S),$$
where the left-hand side denotes the directional derivative of the image along the adjustment a, and a is treated as a 1×N row vector.
It follows that having physical access to both a differentiable imaging system and a scene S, the system 100 can compute individual columns of the above matrix without necessarily requiring any computational model of the system or the scene. The system 100 just needs to implement a discrete version of the matrix-vector product in the optical domain, as illustrated in the example of
The above optical subroutine makes it possible to turn numerical SGD, which depends on system and scene models, into a ‘free’ optical approach. In view of such approach, the system 100 can replace image-capture operations that require modeling of systems and scenes.
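The optical subroutine can be illustrated with a forward-difference sketch in which the physical projector and camera are replaced by a stand-in function. Everything named here is hypothetical: `simulated_capture` is a noise-free toy model used only so the example can run, and the step size is an arbitrary choice. In a real system, each call to the capture function would project a control vector and return a noisy image, so several captures would likely need to be averaged.

```python
import math

def directional_derivative(capture, control, adjustment, step=1e-3):
    """Estimate the image change along `adjustment` from two captures."""
    base = capture(control)
    nudged = capture([c + step * a for c, a in zip(control, adjustment)])
    return [(n - b) / step for n, b in zip(nudged, base)]

# Stand-in for the projector/camera pair: N = 3 control elements, M = 2 pixels.
# Each pixel sees a saturating response to a weighted sum of the controls.
WEIGHTS = [[0.5, 0.1],
           [0.2, 0.7],
           [0.0, 0.3]]

def simulated_capture(control):
    return [math.tanh(sum(control[n] * WEIGHTS[n][m] for n in range(3)))
            for m in range(2)]

control = [0.2, 0.4, 0.1]
# Adjusting only control element n = 1 probes the Jacobian entries J[1, m]
# for every pixel m, without using any knowledge of WEIGHTS or tanh.
grad_row = directional_derivative(simulated_capture, control, [0.0, 1.0, 0.0])
```

The estimator treats the capture function as a black box, which mirrors the point made above: the derivative is obtained from image differences rather than from a model of the system or scene.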
In other cases, other optimization approaches and/or algorithms can be used; for example, those which do not rely on derivatives (called derivative-free optimization algorithms) can be used to optimize the reconstruction error without necessarily requiring estimating the derivatives and the Jacobian. One example of such approach is Particle-Swarm-Optimization (PSO), which updates the optimization parameters in each iteration based on the history of evaluated objective functions in the previous iterations. However, this type of approach may not be as efficient in terms of convergence rate as SGD.
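A derivative-free update of this kind can be sketched with a minimal particle swarm optimizer. The inertia and attraction coefficients below are commonly used textbook values, not parameters taken from the present embodiments, and the quadratic objective is a hypothetical stand-in for a reconstruction error that would, in practice, be measured by capturing and reconstructing images.

```python
import random

def particle_swarm_minimize(objective, dim, n_particles=20, iters=100,
                            inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Minimal PSO; returns (best_position, best_value, initial_best_value)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]              # best position seen by each particle
    best_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]  # best position seen by the swarm
    initial = g_val
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                # Velocity mixes momentum with pulls toward the two remembered bests.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * rng.random() * (best_pos[i][d] - pos[i][d])
                             + c_global * rng.random() * (g_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val, initial

# Hypothetical objective: squared distance to an "ideal" two-element parameter vector.
TARGET = [0.3, -0.2]

def toy_error(params):
    return sum((p - t) ** 2 for p, t in zip(params, TARGET))

best_params, final_error, initial_error = particle_swarm_minimize(toy_error, dim=2)
```

Only objective values are used, never derivatives, which is the property that removes the need for the Jacobian; the cost is that every objective evaluation would correspond to a full capture-and-reconstruct cycle.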
Practical implementations of optical-domain SGD can face a number of technical challenges, for example: (1) imaging a large set of real-world training objects is objectively hard, (2) a closed-form expression generally must be derived for a scene's expected reconstruction error in order to evaluate its gradient, and (3) the image Jacobian is generally too large to acquire by brute force. The system 100 addresses these technical challenges as described herein; for example, by exploiting the structure of the system-optimization problem for triangulation-based systems.
In
In both of the above approaches, the optimization starts with initializing the optimization parameters (namely the control vectors and reconstruction parameters). The choice of initialization parameters can have a noticeable impact on the optimization. For example, in the present embodiments, three types of initializations can be used: 1) initializing all the control vectors and reconstruction parameters with random values; 2) initializing the optimization parameters with down-scaled random values offset by a constant (which results in low-contrast random values); and 3) initializing the control vectors with predetermined functions, such as those used previously or as part of the literature. Starting with pre-existing parameters can lead to faster and better convergence. For example, in the case of structured-light 3D imaging systems (where the control vectors refer to illumination patterns), the parameters can be initialized with sinusoidal patterns, Micro-Phase shifting patterns, Gray code, or the like. For example, in the case of Time-of-Flight 3D imaging systems (where control vectors refer to modulation and demodulation signals), the initialization can be set to sinusoidal patterns, a train of pulses, a step function, Hamiltonian functions, or the like.
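The three initialization types can be sketched as follows for K control vectors of N elements each. The function names, the scale and offset of the low-contrast case, and the choice of equally phase-shifted cosines for the predetermined case are illustrative assumptions, with intensities normalized to the interval [0, 1].

```python
import math
import random

def init_random(k, n, rng):
    """Type 1: independent uniform random values in [0, 1)."""
    return [[rng.random() for _ in range(n)] for _ in range(k)]

def init_low_contrast(k, n, rng, scale=0.1, offset=0.5):
    """Type 2: down-scaled random values offset by a constant."""
    return [[offset + scale * (rng.random() - 0.5) for _ in range(n)]
            for _ in range(k)]

def init_sinusoidal(k, n, periods=1):
    """Type 3: K equally phase-shifted sinusoids, a predetermined starting point."""
    return [[0.5 + 0.5 * math.cos(2 * math.pi * (periods * x / n - j / k))
             for x in range(n)]
            for j in range(k)]

rng = random.Random(0)          # fixed seed so the example is repeatable
random_init = init_random(4, 8, rng)
low_contrast_init = init_low_contrast(4, 8, rng)
sinusoidal_init = init_sinusoidal(3, 8)
```

All three return the same K×N layout, so an optimizer can switch between them without any other change.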
In some cases, in both numerical and optical SGD, the user can define a set of constraints for the optimized control vectors. Although these constraints can potentially refer to any user-defined functions, three specific constraints are contemplated with respect to the present embodiments: 1) the frequency content of control vectors; 2) the maximum amplitude of the control vectors; and 3) the total energy consumption caused by the control vectors. For example, in the case of structured-light 3D imaging and ToF imaging, the control vectors may refer to illumination patterns, and the systems in practice can impose constraints on the amplitude and the frequency content of the projection patterns, and their total energy consumption.
The reconstruction module 114 can address the problem of optimizing projector-camera systems for structured-light triangulation (as exemplified in
In an example, suppose an object is placed in front of the image sensor 130 whose ground-truth correspondence map, g, is known. In principle, since the column correspondence of each camera pixel must be estimated independently of all others, each pixel can be thought of as a separate instance of the reconstruction task. To reduce correlations between these instances, the reconstruction module 114 can use a randomly-textured surface for training. This allows the reconstruction module 114 to treat each camera row as a different “training scene” of randomly-textured points (an example is shown in
In an experiment conducted by the present inventors,
In a similar approach, a different randomly-textured surface which exhibits subsurface scattering, surface inter-reflection or other forms of indirect light can be used as a training scene. Such a training scene can lead the optical auto tuning framework to particularly optimize the patterns for reconstructing scenes with indirect light. In an experiment conducted by the present inventors,
In an embodiment, the system 100 can treat the projector 140 and image sensor 130 as two non-linear “black-box” functions proj( ) and cam( ), respectively. These account for device non-linearities as well as internal low-level processing of patterns and images (for example, non-linear contrast enhancement, color processing, demosaicing, denoising, or the like). An example of image formation in general projector-camera systems is illustrated in
Between the projector 140 and image sensor 130, light propagation is linear and can thus be modeled by a transport matrix T(S). In some cases, this matrix is unknown and generally depends on the scene's shape and material properties, as well as the system's optics. It follows that the image and its Jacobian are given by
where noise may include a signal-dependent component and irr denotes the vector of irradiances incident on the pixels of the image sensor 130. Thus, performing optical auto-tuning in the absence of indirect light will force the system 100 to account for its inherent non-linearities, optical imperfections, and noise properties.
In an embodiment, for linear systems and low signal-independent noise, correspondence can be determined to be optimal in a maximum-likelihood sense by: (1) treating the intensities i1[m], . . . , iK[m] observed at pixel m as a K-dimensional “feature vector,” (2) comparing it to the vector of intensities at each projector column, and (3) choosing the column that is most similar according to the zero-mean normalized cross-correlation (ZNCC) score:
where for two vectors v1, v2, their ZNCC score is the normalized cross correlation of v1−mean(v1) and v2−mean(v2).
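The three decoding steps above can be sketched in a few lines. This is an illustrative sketch, not the implementation of the reconstruction module 114: the names zncc and decode_pixel, the toy code vectors, and the convention that a constant vector scores zero are assumptions introduced for the example.

```python
import math

def zncc(v1, v2):
    """Normalized cross-correlation of v1 - mean(v1) and v2 - mean(v2)."""
    m1 = sum(v1) / len(v1)
    m2 = sum(v2) / len(v2)
    a = [x - m1 for x in v1]
    b = [x - m2 for x in v2]
    denom = math.sqrt(sum(x * x for x in a) * sum(x * x for x in b))
    if denom == 0.0:
        return 0.0  # assumption: a constant vector carries no code information
    return sum(x * y for x, y in zip(a, b)) / denom

def decode_pixel(observed, codes):
    """Return the projector column whose code vector has the highest ZNCC score."""
    scores = [zncc(observed, code) for code in codes]
    return max(range(len(codes)), key=scores.__getitem__)

# codes[n] holds the K pattern intensities at projector column n (toy values).
codes = [[0.1, 0.9, 0.5, 0.2], [0.8, 0.2, 0.4, 0.9], [0.5, 0.5, 0.9, 0.1]]
# A pixel that sees column 1 through an albedo of 0.3 plus ambient light of 0.05.
observed = [0.3 * c + 0.05 for c in codes[1]]
column = decode_pixel(observed, codes)
```

Because the score subtracts the mean and normalizes, the unknown albedo and ambient term in the toy observation do not change which column is selected.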
The reconstruction module 114 can generalize the above approach in three ways. First, by expanding feature vectors to include their 3×1 neighborhood, i.e., the intensities ik[m−1], ik[m+1] in each image and ck[n−1], ck[n+1] in each pattern. This makes it possible to exploit intensity correlations that may exist in tiny image neighborhoods:
where fm, fn are vectors collecting these intensities. Second, the reconstruction module 114 can model the projector's response curve as an unknown monotonic, scalar function g( ) consisting of a predetermined number of linear segments; for example, 32 segments. This introduces a learnable component to the reconstruction function, whose 32-dimensional parameter vector can be optimized by optical SGD along with c1, . . . , cK. Third, the reconstruction module 114 can add a second learnable component to better exploit neighborhood correlations, and to account for noise and system non-linearities that cannot be captured by the scalar response g( ) alone. In an embodiment, this learnable component can comprise two residual neural network (ResNet) blocks for the camera and projector, respectively; however, any suitable machine learning paradigm can be used.
where ( ) and ( ) are neural nets with two fully-connected layers of dimension 3K×3K and a rectified linear unit (ReLU) in between. Thus, in this embodiment, the total number of learnable parameters in the reconstruction function, and thus in the reconstruction parameter vector, is 36K²+32.
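The parameter count stated above can be checked with a short calculation. The sketch below assumes that the fully-connected layers carry no bias terms, which is the reading under which the stated total is reproduced; the function name is hypothetical.

```python
def reconstruction_param_count(k, response_segments=32):
    """Count learnable reconstruction parameters for K patterns (no biases assumed)."""
    feature_dim = 3 * k                       # 3x1 neighborhood across K images
    per_layer = feature_dim * feature_dim     # one 3K x 3K weight matrix
    per_net = 2 * per_layer                   # two fully-connected layers per net
    nets = 2 * per_net                        # one net for camera, one for projector
    return nets + response_segments           # plus the piecewise-linear response g()

count_for_four_patterns = reconstruction_param_count(4)
```

For K=4 this gives 2 × 2 × 144 + 32 = 608 parameters.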
For linear projector-camera systems and low signal-independent noise, a tight approximation to the expected error of a row can be obtained from the ZNCC score vectors of its pixels:
where denotes dot product; T is the softmax temperature; zm is given above; index is a vector whose i-th element is equal to its index i; and err( ) is defined herein. Strictly speaking, this approximation to the row-specific expected error may not apply to ZNCC3 and ZNCC3-NN similarities or general non-linear systems. Nevertheless, the present inventors use it in the optical SGD objective as it was found to be very effective in practice.
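One plausible reading of this approximation is sketched below: the softmax of each pixel's ZNCC score vector is used as a set of weights over column indices, the weighted index is compared with the ground-truth correspondence, and the per-pixel errors are averaged over the row. Several details are assumptions made for illustration: scores are divided by the temperature, indices start at zero, and err( ) defaults to the absolute difference because its definition appears elsewhere in the disclosure.

```python
import math

def softmax(scores, temperature):
    """Numerically stable softmax of scores / temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def expected_row_error(score_vectors, ground_truth, temperature=0.01, err=abs):
    """Average err(softmax(z_m) . index - g[m]) over the pixels of one row."""
    total = 0.0
    for z_m, g_m in zip(score_vectors, ground_truth):
        weights = softmax(z_m, temperature)
        soft_index = sum(i * w for i, w in enumerate(weights))  # dot with index
        total += err(soft_index - g_m)
    return total / len(ground_truth)

# Two pixels, three candidate columns each (toy ZNCC scores).
scores = [[0.1, 1.0, -0.2], [0.9, 0.0, 0.1]]
row_error = expected_row_error(scores, ground_truth=[1, 0])
```

A small temperature makes the soft index approach the arg-max column, while a larger temperature yields a smoother objective whose gradient is easier to follow.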
Although the image Jacobian in the present embodiments can be very large, it is also generally very sparse. This makes it possible to acquire several rows of the Jacobian “in parallel” from just one invocation of the optical-domain subroutine. In particular, an adjustment vector with N/L equally-spaced non-zero elements will produce an image whose pixels will be the sum of N/L rows of the Jacobian. It follows that if L is large enough to avoid overlap between the non-zero elements in these rows, the rows can be recovered exactly.
In an embodiment, to generate more distinct sets of correspondences for optical auto-tuning, the reconstruction module 114 can circularly shift the patterns by a random number of pixels every few iterations. Shifting the patterns effectively leads to training on a different batch of scenes, and can provide a more accurate approximation for the SGD error. Moreover, with a circular shift, the images captured during the optimization need not cover the whole field of view of the projector. Thus, it can help speed up the optimization by looking at a smaller region of the camera image.
Although the optimized patterns generalize well to other imaging conditions, the system 100 can optimize the system under the specific desired imaging scenario to get the best performance. One noteworthy example is the low-SNR regime (due to the presence of severe noise, limited irradiance on the scene, and the like). However, the Jacobian computed in such a scene may be dominated by noise, which can prevent auto-tuning of the system directly in very low-light scenes. While minor noise can help the optimization be more robust, it may be very hard to learn with extreme noise. In such cases, data augmentation can be used to synthetically generate less noisy scene samples in low-light conditions for training. In this way, not only is the captured image (consisting of multiple rows) used for evaluating the update in each iteration, but also the down-scaled (i.e., darker) version of the image. This approach can also be seen as synthetically introducing more varied scenes to the optimization. The present inventors' example experiments indicate that this approach has a noticeable impact on the generalization of the optimized patterns to low-SNR conditions.
Many structured-light techniques require the user to choose a specific frequency as a building block. For instance, ZNCC-optimized patterns generally rely on an upper bound on their frequency content, and multiple phase shifting (MPS) generally needs the user to select the main frequency of its constituent sinusoidal patterns. The choice of frequency can have a tremendous effect on the performance of these techniques. The selection of the optimal frequency depends on the scene and the imaging system and can be a tedious task. However, advantageously, the present embodiments do not require frequency input from a user. In this way, the patterns can automatically update their frequency content in response to the specific characteristics of the system.
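The parallel acquisition described above can be illustrated with a simulated system. In the sketch below, the Jacobian is stored with one list per camera pixel, each pixel is assumed to respond only to pattern elements within half_width of a known center (for example, its ground-truth correspondence), and apply_jacobian stands in for one invocation of the optical-domain subroutine. All names, the band-shaped sparsity, and the toy values are assumptions made for illustration.

```python
def probe_banded_jacobian(apply_jacobian, n_cols, centers, half_width, step):
    """Recover a banded Jacobian with `step` probes instead of `n_cols` probes."""
    if step <= 2 * half_width:
        raise ValueError("step must exceed the band width to avoid overlap")
    jac = [[0.0] * n_cols for _ in centers]
    for offset in range(step):
        # Adjustment vector with equally spaced non-zero elements (spacing = step).
        adjustment = [1.0 if n % step == offset else 0.0 for n in range(n_cols)]
        response = apply_jacobian(adjustment)      # one "optical" invocation
        for m, center in enumerate(centers):
            for n in range(offset, n_cols, step):
                if abs(n - center) <= half_width:  # only one probed n can match
                    jac[m][n] = response[m]
    return jac

# Simulated ground-truth Jacobian: pixel m responds near column centers[m].
centers = [0, 3, 7, 12, 19]
true_jac = [[(1.0 + m + 0.1 * n) if abs(n - c) <= 2 else 0.0 for n in range(20)]
            for m, c in enumerate(centers)]
invocations = []

def apply_jacobian(adjustment):
    invocations.append(1)
    return [sum(r * a for r, a in zip(row, adjustment)) for row in true_jac]

recovered = probe_banded_jacobian(apply_jacobian, 20, centers, 2, 5)
```

In this toy setting the full 5×20 Jacobian is recovered from 5 invocations rather than 20, because each pixel's band contains at most one of the probed elements per invocation.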
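The effect of a circular shift on the training data can be sketched as follows. The helper names are hypothetical, and the sketch assumes that the scene is static, so each camera pixel keeps looking at the same projector column while the code displayed at that column changes.

```python
import random

def circular_shift(pattern, shift):
    """Shift a 1D pattern to the right by `shift` elements, wrapping around."""
    n = len(pattern)
    return [pattern[(j - shift) % n] for j in range(n)]

def effective_correspondence(ground_truth, shift, n_cols):
    """Column of the unshifted pattern that each pixel now effectively observes."""
    return [(g - shift) % n_cols for g in ground_truth]

rng = random.Random(0)
pattern = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
ground_truth = [5, 6, 7]                       # columns seen by three pixels
shift = rng.randrange(len(pattern))            # random shift, redrawn periodically
shifted = circular_shift(pattern, shift)
targets = effective_correspondence(ground_truth, shift, len(pattern))
```

After the shift, the pixels observe the code values of different columns of the original pattern, which is why each shift behaves like a new batch of training correspondences.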
In an example case, the SGD optimizer can be the RMSprop neural network optimizer, with TensorFlow as the framework. The patterns can be initialized with a constant matrix plus small uniform noise. The learning rate can be set to, for example, 0.001, and can be decayed by half every, for example, 500 iterations. A step size of, for example, L=7 can be used for training on the board and L=23 for training on objects with indirect light. The present inventors have noticed that the Jacobian changes only slightly between two subsequent iterations. Therefore, to speed up the optimization, in some cases, the Jacobian can be estimated every, for example, 15 iterations, and the same Jacobian can be used to evaluate the overall gradients in that span. In some cases, a random circular shift can be applied to the patterns every 15 iterations. In the example case, the number of camera rows used for auto-tuning the system can be empirically set to 15% of the total number of rows. Since the scene can be sensitive to small vibrations, the system 100 can capture the ground truth every 50 iterations to ensure its robustness, by projecting, for example, 30 ZNCC-optimized patterns. The present inventors validated this choice of ground-truth measurement by comparing it with projecting 160 conventional phase-shifted patterns. In the example experiment, for all the scenes with a limited amount of indirect light (including the training board), exact correspondence matches can exceed 97% of the pixels and the remaining 3% are one pixel away. In this experiment, it was found that the optimization takes less than an hour for auto-tuning 4 patterns with standard consumer-level projectors and cameras, and converges in less than 1000 iterations.
In an example, the present inventors measured the performance of an optically-optimized sequence of patterns, and its generalization to different imaging conditions. In this example experiment, the optical auto-tuning framework described herein was used to generate the optimized sequence of grey-scale patterns for a particular 3D imaging system consisting of a non-linear consumer-level projector (LG-PH550) and a linear camera (IDS-U13240CP). All the patterns were optimized with a textured board as the training object (as exemplified on the left side of
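The example schedule above can be summarized in a small helper that reports, for a given iteration, the learning rate and which periodic actions are due. The staircase form of the decay and the function names are assumptions made for illustration; the numerical values are the example values given above.

```python
def learning_rate(iteration, base=0.001, halve_every=500):
    """Staircase decay (assumed form): halve the rate every `halve_every` iterations."""
    return base * 0.5 ** (iteration // halve_every)

def actions_due(iteration, jacobian_every=15, shift_every=15, ground_truth_every=50):
    """Report which periodic steps of the example schedule fall on this iteration."""
    return {
        "learning_rate": learning_rate(iteration),
        "refresh_jacobian": iteration % jacobian_every == 0,
        "shift_patterns": iteration % shift_every == 0,
        "recapture_ground_truth": iteration % ground_truth_every == 0,
    }

plan = [actions_due(t) for t in range(1000)]
jacobian_refreshes = sum(step["refresh_jacobian"] for step in plan)
```

Over 1000 iterations this schedule re-estimates the Jacobian 67 times and recaptures the ground truth 20 times, rather than doing either at every iteration.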
TABLE 2 demonstrates a full quantitative comparison with other encoding schemes (K=4) for the scene shown in
TABLE 2 illustrates MPS and ZNCC where the best maximum frequencies (16 and 32 respectively) were selected. For max-ZNCC3-NN, the neural network was trained for each pattern individually. Since the projector is non-linear, to evaluate other techniques, the system 100 was linearized through calibration. In some cases, the optical patterns run on the native system without any calibration or any specification of their frequency content.
In the example experiments, the general optical auto-tuned patterns were found to perform well with a wide variety of objects and imaging conditions (for example, different shapes and materials, low-SNR conditions, and the like). In some cases, if there exists any prior knowledge about the system, objects or imaging conditions, the system 100 can tune the patterns for the desired setup. For instance, optical auto-tuning can be performed on an object with indirect light, to specifically optimize the system for reconstructing other scenes with indirect light.
As another example experiment,
The top of
To explore the capability of the optical auto-tune framework of the present embodiments, the optimization approach was applied to totally different systems. First, as shown in
The optical auto-tuning framework of the present embodiments provides an approach for, at least, learning optimal illumination patterns for active 3D triangulation. Although the patterns may be learned on a specific object, they are shown to be generalizable to a wide variety of shapes, materials, and imaging conditions. In this way, the optical auto-tuning framework not only can be very effective in optimizing structured-light systems, but also can be applied to other inverse problems in computational imaging where the image formation model may not be obvious.
In another embodiment, the reconstruction and/or optimization approaches described herein can be used for Time-of-Flight 3D imaging. In a particular case, using Continuous-Wave Time-of-Flight (C-ToF) cameras can present a different approach for 3D imaging, where a projector 140 comprising a modulating light source (for example, a modulated laser light source) emits multiple periodic light signals (called modulation signal) to the scene. In this case, the modulation signal defines a time-varying illumination pattern for illuminating the scene. The image sensor 130 captures the received light during a full cycle with a corresponding exposure profile (called demodulation signal) for each emitted signal. The reconstruction module 114 can estimate a scene depth at each pixel using observations captured by the capture module 109 for each pair of modulation and demodulation functions. In an example illustrated in
In a particular case, to formulate the image formation model, without loss of generality, it can be assumed that the projector and image sensor are collocated. The image formation model for C-ToF imaging system can be formulated as:
where oq denotes the vector of observations at pixel q, bq refers to the albedo at pixel q, aq is the ambient light for pixel q in the captured images, and eq is the vector of noise in the observations. Furthermore, d(q) specifies the depth at pixel q. F(d(q)) denotes the vector consisting of the cross-correlation between the shifted modulation signal (corresponding to depth d) and the demodulation function for each pair of signals:
where Fi(d) denotes the i-th element of vector F(d(q)); Di(t) and Mi(t) denote the i-th pair of demodulation and modulation functions respectively; and c refers to the speed of light. The above formulation treats F(d) as the code-vector corresponding to the depth d.
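A code-vector of this kind can be approximated numerically. The sketch below assumes that each element is the integral, over one modulation period, of the demodulation function multiplied by the modulation function delayed by the round-trip time 2d/c, and approximates that integral with a Riemann sum. The sinusoidal signals, the 30 MHz frequency, and all function names are hypothetical choices for the example.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def sinusoid(freq, phase=0.0):
    """A periodic signal in [0, 1]; used here for both modulation and demodulation."""
    return lambda t: 0.5 + 0.5 * math.cos(2.0 * math.pi * freq * t - phase)

def code_vector(depth, mod_fns, demod_fns, period, samples=1000):
    """Correlate each demodulation signal with its delayed modulation signal."""
    delay = 2.0 * depth / SPEED_OF_LIGHT        # round-trip time of flight
    dt = period / samples
    vector = []
    for mod, demod in zip(mod_fns, demod_fns):
        acc = 0.0
        for s in range(samples):
            t = s * dt
            acc += demod(t) * mod(t - delay)
        vector.append(acc * dt)                 # Riemann-sum approximation
    return vector

freq = 30e6                                     # hypothetical modulation frequency
k = 3                                           # number of signal pairs
mods = [sinusoid(freq) for _ in range(k)]
demods = [sinusoid(freq, 2.0 * math.pi * i / k) for i in range(k)]
f_at_1m = code_vector(1.0, mods, demods, period=1.0 / freq)
```

With these sinusoidal signals the code-vector repeats every c/(2·freq), roughly 5 m, so depths beyond that range would be ambiguous in this toy configuration.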
In a similar manner to structured-light triangulation, as described herein, the system 100 can achieve optimal performance for estimating the depth using the captured images corresponding to each pair of modulation-demodulation signal by determining optimal modulation and demodulation functions for achieving the best performance in depth estimation.
In an embodiment, the system 100 can convert the ToF decoding problem to a discrete problem by discretizing the range of depths, and determine the depth bin which contains the actual scene's depth. Then the decoding can determine the depth as described herein for structured light triangulation: given a set of observations and the cross-correlation code-vectors at each depth bin, determine which depth bin maximizes a likelihood function. The ZNCC decoder described herein can be used to determine an optimization for detecting the corresponding code-vector and consequently to estimate the depth for each pixel. More specifically, the depth can be estimated as
where p is the index of each bin, di refers to the center of the i-th depth bin, and N is the number of depth bins, which specifies the level of discretization.
In a similar manner to structured-light triangulation, as described herein, the optical-domain SGD and numerical SGD presented in TABLE 1 can be used for optimizing the control vectors that refer to each pair of discretized modulation and demodulation signals (as shown in
While embodiments of the present disclosure describe optimization of control vectors and projection patterns, it is understood that the optimization techniques can be applicable to other suitable applications; for example, optimizing energy usage.
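The discretized decoding can be sketched as a search over depth bins. In the example below, the closed-form sinusoidal code-vectors, the 30 MHz frequency, the 4 m range, the 50 bins, and the albedo and ambient values are all hypothetical; the observation is synthesized as albedo times code-vector plus ambient light, with noise omitted for clarity.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def zncc(v1, v2):
    """Zero-mean normalized cross-correlation of two vectors."""
    m1 = sum(v1) / len(v1)
    m2 = sum(v2) / len(v2)
    a = [x - m1 for x in v1]
    b = [x - m2 for x in v2]
    denom = math.sqrt(sum(x * x for x in a) * sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / denom if denom > 0.0 else 0.0

def sinusoid_code(depth, freq, k):
    """Hypothetical code-vector for k phase-shifted sinusoidal signal pairs."""
    phase = 2.0 * math.pi * freq * 2.0 * depth / SPEED_OF_LIGHT
    return [0.25 + 0.125 * math.cos(2.0 * math.pi * i / k - phase) for i in range(k)]

def decode_depth(observation, bin_centers, code_fn):
    """Pick the depth bin whose code-vector maximizes ZNCC with the observation."""
    scores = [zncc(observation, code_fn(d)) for d in bin_centers]
    best = max(range(len(bin_centers)), key=scores.__getitem__)
    return best, bin_centers[best]

freq, k, n_bins, max_depth = 30e6, 3, 50, 4.0
bin_centers = [(p + 0.5) * max_depth / n_bins for p in range(n_bins)]
true_code = sinusoid_code(bin_centers[17], freq, k)
observation = [0.6 * f + 0.2 for f in true_code]      # albedo 0.6, ambient 0.2
best_bin, depth = decode_depth(observation, bin_centers,
                               lambda d: sinusoid_code(d, freq, k))
```

The number of bins sets the depth resolution of this decoder: with 50 bins over 4 m, each bin spans 8 cm, and a finer discretization trades computation for resolution.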
Although the invention has been described with reference to certain specific embodiments, various modifications thereof will be apparent to those skilled in the art without departing from the spirit and scope of the invention as outlined in the claims appended hereto. The entire disclosures of all references recited above are incorporated herein by reference.
Claims
1. A computer-implemented method for generating a depth image of a scene, the method comprising:
- illuminating the scene with one or more illumination patterns, each pattern comprising a plurality of discretized elements, intensity of each element governed by a code vector;
- capturing one or more images of the scene while the scene is being illuminated;
- for each pixel, generating an observation vector comprising at least one intensity recorded at the pixel for each of the captured images;
- for each pixel, determining the code vector that best corresponds with the respective observation vector by maximizing the zero-mean normalized cross-correlation (ZNCC);
- for each pixel, determining a depth value from the best-corresponding code vector; and
- outputting the depth values as a depth image.
2. The method of claim 1, wherein each observation vector incorporates intensities of neighbouring image pixels, and wherein each code vector incorporates neighbouring discretized intensities.
3. The method of claim 2, further comprising:
- using a trained artificial neural network to transform each observation vector to a higher-dimensional feature vector; and
- using a trained artificial neural network to transform each code vector to a higher-dimensional feature vector,
- wherein determining the code vector that best corresponds with the respective observation vector comprises maximizing the ZNCC between the transformed respective observation vector and the transformed code vectors.
4. The method of claim 1, wherein each illumination pattern is a discretized two-dimensional pattern that is projected onto a scene from a viewpoint that is distinct from the captured images, wherein each element in the pattern is a projected pixel, and wherein determining the depth value from the best-corresponding code vector comprises triangulation.
5. The method of claim 1, wherein each illumination pattern comprises multiple wavelength bands, wherein the observation vector at each pixel comprises the raw or demosaiced intensities of each wavelength band for the respective pixel.
6. The method of claim 1, wherein the discretized elements of each illumination pattern comprise a discretized time-varying pattern that modulates the intensity of a light source, each element in the pattern is associated with a time-of-flight delay and a code vector, and wherein determining the depth value from the best-corresponding code vector comprises multiplication by the speed of light.
Type: Application
Filed: Mar 30, 2022
Publication Date: Jul 21, 2022
Inventors: Kiriakos Neoklis KUTULAKOS (Toronto), Seyed Parsa MIRDEHGHAN (Richmond Hill), Wenzheng CHEN (Toronto)
Application Number: 17/657,243