PROBABILISTIC KEYPOINT REGRESSION WITH UNCERTAINTY
Keypoints are predicted in an image. A neural network is executed that is configured to predict each of the keypoints as a 2D random variable, normally distributed with a 2D position and 2×2 covariance matrix. The neural network is trained to maximize a log-likelihood that samples from each of the predicted keypoints equal a ground truth. The trained neural network is used to predict keypoints of an image without generating a heatmap.
This application claims the benefit of and priority to U.S. Provisional Application No. 63/317,436, filed Mar. 7, 2022, the entire contents of which are incorporated herein by reference.
BACKGROUND
Landmark detection is a computer vision task where keypoints of an image, for example of a human body or face (e.g., characteristic points) or more generally any a priori object, are detected and localized in images and video. Keypoints can be used, for example, to detect a person's head position and rotation. Landmark detection can be challenging due to variability as well as a number of factors such as pose and occlusions. It is with respect to these considerations and others that the disclosure made herein is presented.
SUMMARY
Landmarks often play a key role in image analysis, but many aspects of identity or expression cannot be represented by a limited number of landmarks. In order to reconstruct, for example, faces more accurately, landmarks are often combined with additional signals such as depth images, or techniques such as differentiable rendering. The present disclosure provides a way to use more landmarks in an efficient and cost-effective manner. Besides faces, the present disclosure may be applied more generally to other types of images.
In an embodiment, synthetic training data may be used to guarantee perfect landmark annotations. By fitting a morphable model to these dense landmarks, state-of-the-art results for monocular 3D face reconstruction may be achieved with real-time responsiveness. Dense landmarks are an ideal signal for integrating face shape information across frames, which can be demonstrated with accurate and expressive facial performance capture in both monocular and multi-view scenarios.
Keypoint confidence, or certainty, is useful when later algorithms consume the keypoints. For example, when fitting a 3D model to 2D keypoints, if a keypoint confidence is low, that keypoint may be considered to be unreliable and discounted during model fitting. This may occur if that keypoint is occluded, for example. Estimating uncertainty may also be useful to train better landmark estimators.
The present disclosure includes an algorithm for directly predicting keypoints (2D points of interest) in an image, with uncertainty. In this way a system, e.g., a neural network, can expose how confident it is about each keypoint. Referring to
The present disclosure enables direct regression for real-time applications without the need, for example, of a heatmap. As used herein, a heatmap may be an image where the value stored at each pixel corresponds to the likelihood for the landmark to be at that pixel's location in the image. In an embodiment, the task may be reformulated from a 2D point estimation problem to a 2D random variable estimation problem. Each keypoint may be predicted as a 2D random variable, normally distributed with location (x, y) and standard deviation sigma. The network may be trained to maximize the log-likelihood that samples from each predicted keypoint equal the ground truth. Keypoint uncertainty arises during training since the network is penalized for being wrong about keypoint location, as well as being uncertain. The following provides a derivation (assuming a uniform prior on sigma):
The ground truth consists of a set of keypoint coordinates Y∈ℝ^{N×2}:
Each keypoint is predicted as a 2D random variable, normally distributed with location (x′, y′) and (circular) standard deviation σ. For a predicted keypoint random variable, the relative likelihood that a sample from that random variable will equal the ground truth keypoint location (x, y) is:
p(x,y)=1/(2πσ^{2})·exp(−((x−x′)^{2}+(y−y′)^{2})/(2σ^{2}))
For the full set of N keypoints, a set of standard deviations S∈ℝ^{N} is predicted alongside the coordinates:
S=[σ_{0},σ_{1}, . . . ,σ_{N}]
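We work with log-likelihoods as they are more convenient. Since it is desired to maximize the log-likelihood that samples from each predicted keypoint equal the ground truth, a loss is minimized that is the sum of negative log-likelihoods. Dropping the constant log(2π) term:
Loss=Σ_{i}[2 log(σ_{i})+((x_{i}−x′_{i})^{2}+(y_{i}−y′_{i})^{2})/(2σ_{i}^{2})]
For clarity, this is split into two parts:
Loss_{σ}=Σ_{i}2 log(σ_{i})
Loss_{μ}=Σ_{i}((x_{i}−x′_{i})^{2}+(y_{i}−y′_{i})^{2})/(2σ_{i}^{2})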
Loss_{σ} penalizes the network for being too uncertain about keypoint predictions, and Loss_{μ} penalizes the network for making poorly localized keypoint predictions. Additionally, the symmetric Gaussian in the above example may be extended to a non-symmetric Gaussian, in some cases.
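The two-part loss described above may be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `gaussian_nll_loss` and its argument layout are hypothetical, the Gaussian is circular, and the constant log(2π) term is dropped because it does not affect training.

```python
import math


def gaussian_nll_loss(pred_xy, pred_sigma, true_xy):
    """Return (loss_sigma, loss_mu) summed over all keypoints.

    pred_xy    : list of predicted (x', y') positions      (hypothetical layout)
    pred_sigma : list of predicted standard deviations, each > 0
    true_xy    : list of ground-truth (x, y) positions
    """
    loss_sigma = 0.0
    loss_mu = 0.0
    for (px, py), sigma, (tx, ty) in zip(pred_xy, pred_sigma, true_xy):
        sq_dist = (px - tx) ** 2 + (py - ty) ** 2
        # Penalizes large uncertainty: -log(1 / sigma^2) = 2 log(sigma).
        loss_sigma += 2.0 * math.log(sigma)
        # Penalizes localization error, scaled down when sigma is large.
        loss_mu += sq_dist / (2.0 * sigma ** 2)
    return loss_sigma, loss_mu
```

For a fixed localization error, the sum of the two parts is smallest at an intermediate sigma, so the network is not rewarded for reporting either unlimited uncertainty or unwarranted certainty.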
The distribution of uncertainty values can also be influenced at training time by introducing a suitable prior. The formulation above assumes a uniform prior over the predicted sigmas. A natural choice of prior on sigma (or precision) of the (2D) Gaussian distribution is the Wishart distribution (the conjugate prior of a Gaussian distribution) although others can be used. This prior is a gamma distribution in the univariate case. This has the effect of encouraging the network at training time to allocate more neural resources to cases where it is currently doing poorly (where sigma is large) and less neural resources to where it is already comparatively certain (where sigma is small) in order to balance the usefulness of its keypoint predictions to downstream model fitting. See below for the derivation assuming Gamma prior.
Definitions (Precision, Gaussian Distribution, Gamma or Wishart)
a and b are manually tuned constants, the “shape” and “inverse scale” of the Gamma distribution.
The first term of −log(Gam) does not depend on τ; it is constant, so it has no effect on training and can be ignored. The following is added to the loss instead of −log(Gam):
=>Loss+=bτ−(a−1)log(τ)
The same in terms of σ, taking the precision as τ=1/σ^{2}:
=>Loss+=b/σ^{2}+2(a−1)log(σ)
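The Gamma-prior penalty may be sketched in code as below. This is an illustrative sketch only: the function names are hypothetical, and the precision is assumed to be τ = 1/σ², the usual definition for a circular Gaussian. The constant part of −log(Gam) is omitted, as described above.

```python
import math


def gamma_prior_penalty_tau(tau, a, b):
    """Non-constant part of -log Gam(tau | a, b): b*tau - (a - 1)*log(tau)."""
    return b * tau - (a - 1.0) * math.log(tau)


def gamma_prior_penalty_sigma(sigma, a, b):
    """The same penalty written in terms of sigma, assuming tau = 1 / sigma^2."""
    return b / sigma ** 2 + 2.0 * (a - 1.0) * math.log(sigma)


def neg_log_gamma_density(tau, a, b):
    """Full -log Gam(tau | a, b), including the constant first term."""
    constant = math.lgamma(a) - a * math.log(b)
    return constant + gamma_prior_penalty_tau(tau, a, b)
```

Either penalty would be added per keypoint to the training loss; `a` and `b` remain manually tuned constants.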
Additionally, object detection via keypoint uncertainty may be implemented. A sliding window may be applied over an image, and average keypoint confidence for each window may be measured. If a window with a high average keypoint certainty is not found, it can be determined that the object is not in the image. Otherwise, the window which reported the highest average keypoint confidence may be taken to contain the object.
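The sliding-window detection described above may be sketched as follows. The sketch makes several assumptions that are not part of the disclosure: `predict_sigmas` is a hypothetical callable standing in for the trained network, a lower mean sigma is treated as a higher average confidence, and `max_mean_sigma` is a hypothetical threshold above which no window is accepted.

```python
def detect_object(image, window, stride, predict_sigmas, max_mean_sigma):
    """Return (top, left) of the most confident window, or None if no window qualifies.

    image          : 2D list of pixel rows
    window, stride : window side length and step, in pixels
    predict_sigmas : hypothetical callable, crop -> list of per-keypoint sigmas
    max_mean_sigma : hypothetical threshold on the mean sigma
    """
    height, width = len(image), len(image[0])
    best = None  # (mean_sigma, top, left)
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            crop = [row[left:left + window] for row in image[top:top + window]]
            sigmas = predict_sigmas(crop)
            mean_sigma = sum(sigmas) / len(sigmas)
            if best is None or mean_sigma < best[0]:
                best = (mean_sigma, top, left)
    if best is None or best[0] > max_mean_sigma:
        return None  # no window was confident enough: object taken to be absent
    return best[1], best[2]
```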
Some of the use cases enabled using the described techniques include receiving image input from regular color (RGB) cameras rather than depth cameras, prediction of many more landmarks, and use cases in combination with a model fitter that predicts intrinsic camera parameters (e.g., focal length). This is important to achieve good results for recovering 3D structure from RGB images taken by a variety of cameras. Additional use cases include performing 3D reconstruction from multiple views, where the uncertainty in each view is taken into account, and where the extrinsic parameters of each camera are simultaneously optimized. More generally, the image inputs may be received from various types of cameras such as web cameras, depth cameras, cameras on a head-mounted display (HMD), IR cameras, event cameras, etc., and the images can be RGB, depth mapped, IR, etc. Placement of the cameras can be outside-in (e.g., sensors are stationary as in web cams), or introspective positional tracking, where the cameras or sensors are located on the device being tracked (e.g., HMD). In the case of HMDs, the dense landmarks on the observed parts of the face may also be used to localize the HMD itself relative to the face.
This Summary is not intended to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.
The Detailed Description is described with reference to the accompanying FIGS. In the FIGS., the leftmost digit(s) of a reference number identifies the FIG. in which the reference number first appears. The same reference numbers in different FIGS. indicate similar or identical items.
A keypoint is typically localized in an image by having a neural network generate a heatmap image, where the heatmap has high pixel values in parts of the image close to the keypoint, and low pixel values in parts of the image far away from the keypoint. Another algorithm may be run on the heatmap image (argmax) to find the largest value which is the peak. The location of the peak is the 2D location of the keypoint. The value of the peak of the heatmap may sometimes be used as a measure of uncertainty.
The disclosure provides a way to directly regress a 2D coordinate and a measure of uncertainty without using heatmap images. This has several technical advantages:
Generating heatmap images is computationally expensive. In fact, drawing these heatmaps can be the most expensive part of the neural network. With direct keypoint regression, this computation time can be saved.
In a multi-core system, transferring heatmap images from one core to another can be a bottleneck. Instead, transferring just the keypoint coordinates and confidence values uses much less bandwidth. For a 128×128 image, this corresponds to over a 99% improvement in bandwidth.
In a computationally constrained system, running argmax on the heatmap images can be expensive. By directly regressing keypoints, running argmax can be avoided. Additionally, argmax is only pixel-precise, whereas direct regression can be more precise at the sub-pixel level.
Directly predicting keypoints can allow for scaling up to more keypoints than would be computationally possible with heatmaps.
By predicting uncertainty in the manner described, it may be clearer how to best use uncertainty estimates in further systems, since sigma will be in the same units as the prediction. When heatmap-peak values are used as a measure of uncertainty, they do not have an obvious unit for integration in further systems.
By penalizing the uncertainties at training time with a suitable prior, the resources of the neural network can be automatically allocated in order to achieve greater equality of outcome with regards to its prediction accuracy, with more capacity as necessary being assigned to harder keypoints.
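The bandwidth figure quoted above for a 128×128 image can be checked with simple arithmetic. The sketch below assumes one heatmap per keypoint at the stated resolution, three directly regressed values per keypoint (x, y and sigma), and the same number of bytes per value in both cases; the function name is hypothetical.

```python
def bandwidth_reduction(heatmap_side, values_per_keypoint=3):
    """Fraction of per-keypoint data saved by sending (x, y, sigma) instead of a heatmap."""
    heatmap_values = heatmap_side * heatmap_side  # 128 * 128 = 16384 values
    return 1.0 - values_per_keypoint / heatmap_values


reduction = bandwidth_reduction(128)  # 1 - 3/16384, i.e. more than 99%
```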
In some embodiments, the disclosed techniques can be used to train a neural network that predicts 2D keypoints in RGB images. The disclosed techniques can also be used, for example, to efficiently create depth maps. The rigid pose may then be computed by using a Perspective-n-Point solver, where these uncertainty values are used to weigh the 2D projection error of each keypoint. The keypoints occluded by objects or with higher motion blur may have higher uncertainty values.
The disclosed techniques may be applied to a multi-camera fitting scenario, where a model of an object (e.g., face) is fitted to 2D probabilistic keypoint estimates that come from multiple cameras. Some keypoints under certain views may be less certain compared to the same keypoint under a different view. For example, when a face is viewed from the front, both eyes may be clearly visible with high confidence. But, when viewed from the side, only one eye may be visible, and the other eye may be discounted by high uncertainty when fitting to that view. In this way, although all keypoints are predicted for all views, each view of the face contributes most strongly to the portion of the face best visible. Additionally, when using a probabilistic keypoint energy over multiple cameras, it is possible to use the camera calibration itself. In some embodiments, the extension of fitting a model to landmarks can allow for camera calibration to be performed together with other parameters.
Regarding object detection via keypoint uncertainty, in existing methods, training object detectors requires “positive” images that contain the object, as well as “negative” images that do not contain the object. With the disclosed techniques, no negative images are required, making it possible to create an object detector trained with only positive images. This can be useful when it is difficult to reliably collect negative images, e.g., images without a person in the images.
Landmarks are points in correspondence across all faces, such as the tip of the nose or the corner of the eye. They often play a role in face-related computer vision, e.g., being used to extract facial regions of interest, or helping to constrain 3D model fitting. However, many aspects of facial identity or expression cannot be encoded by a typical sparse set of 68 landmarks alone. For example, without landmarks on the cheeks, we cannot tell whether or not someone has high cheekbones. Or, without landmarks around the outer eye region, we cannot tell if someone is softly closing their eyes, or scrunching up their face.
In order to reconstruct faces more accurately, previous work has therefore used additional signals beyond color images, such as depth images or optical flow. However, these signals may not be available or reliable to compute. Instead, given color images alone, others have approached the problem using analysis-by-synthesis: minimizing a photometric error between a generative 3D face model and an observed image using differentiable rendering.
However, these approaches are limited by the approximations that must be made in order for differentiable rendering to be computationally feasible. In reality, faces are not purely Lambertian, and many important illumination effects are not explained using spherical harmonics alone, e.g., ambient occlusion or shadows cast by the nose.
It would be desirable to just use more landmarks, which disclosed embodiments describe. Examples of the present disclosure include a method that predicts over 700 landmarks both accurately and robustly. Instead of only the frontal “hockey-mask” portion of the face, the described landmarks cover the entire head, including the ears, eyeballs, and teeth. As shown in
Even with as few as 68 landmarks, it is difficult to precisely define landmarks that do not align with a salient image feature. Thus synthetic training data may be used which guarantees consistent annotations. Furthermore, instead of representing each landmark as just a 2D coordinate, we predict each one as a 2D random variable: a 2D circular Gaussian with position and uncertainty. This allows the predictor to express uncertainty about certain landmarks, e.g., occluded landmarks on the back of the head, and also improves landmark accuracy.
Since the disclosed dense landmarks represent points of correspondence across all faces, we can perform 3D face reconstruction by fitting a morphable face model to them. Although previous approaches have fit morphable models to landmarks in a similar way, the present disclosure shows that landmarks are the only signal required to achieve state-of-the-art results for monocular face reconstruction in the wild. All that is required is enough landmarks, predicted with sufficient accuracy.
The probabilistic nature of the predictions also makes them ideal for fitting a 3D model over a temporal sequence, or across multiple views. An optimizer can discount uncertain landmarks, and rely on more certain ones. The present disclosure demonstrates this with accurate and expressive results for both multi-view and monocular facial performance capture. Finally, the present disclosure shows that predicting dense landmarks and model fitting can be highly efficient, demonstrating real-time facial performance capture at over 60 FPS on a single CPU thread.
The present disclosure shows that parametric appearance models, illumination models, or differentiable rendering are not needed for high-quality 3D face reconstruction, but only a number of accurate 2D landmarks and a 3D model to fit to them. In addition, the present disclosure shows that combining probabilistic landmarks and model fitting enables intelligent aggregation of face information across multiple images by demonstrating robust and expressive results for both multi-view and monocular facial performance capture.
In recent years, methods for 3D face reconstruction have become increasingly complicated, involving differentiable rendering and complex neural network training strategies. In contrast, the present disclosure proposes effective methods by maintaining simplicity. The disclosed approach consists of two stages: First we predict probabilistic dense 2D landmarks L using a traditional convolutional neural network (CNN). Then, we fit a 3D face model, parameterized by Φ, to the 2D landmarks by minimizing an energy function E(Φ; L). Images themselves are not part of this optimization; the only data used are 2D landmarks. Referring to
One difference between the disclosure and previous approaches is the number and quality of landmarks. Previous approaches have not predicted many 2D landmarks with such accuracy. This allows high quality 3D face reconstruction results by fitting a 3D model to these landmarks alone.
Probabilistic landmark regression. We predict each landmark as a random variable with probability density function of a circular 2D Gaussian.
p(x,σ)=𝒩(x|σ)p(σ)
p(σ)=1
So Li={μi, σi}, where μi=[xi, yi] is the expected position of that landmark, and σi (the standard deviation) is a measure of uncertainty. Our training data includes labels for landmark positions μ′i=[x′i, y′i], but not for σ. The network learns to output σ to show that it is certain about some landmarks, e.g., visible landmarks on the front of the face, and uncertain about others, e.g., landmarks hidden behind hair. Referring to
Loss_{σ} penalizes the network for being too uncertain, and Loss_{μ} penalizes the network for making poorly localized landmark predictions while penalizing certainty. λ_{i} is a per-landmark weight that focuses the loss on certain parts of the face. This is the only loss term used during training.
The disclosed embodiments enable prediction of occluded keypoints and keypoints outside of an image, which can be applied in at least two ways—to the ground truth, or to the network predictions. For ground truth, the use of synthetic training data makes it easier to create representative/accurate ground truth for keypoints that are not observed by the camera/sensor. For network predictions, it is not possible to predict keypoints outside of the image when using a heatmap approach, which is an important advantage of the keypoint regression as disclosed herein.
Landmarks are commonly predicted via heatmaps. However, generating heatmaps is computationally expensive. Instead, we keep things simple, and directly regress position and uncertainty using a traditional CNN. We take any off-the-shelf architecture, e.g., ResNet, and change the final fully-connected layer to output three values per landmark: two for position and one for uncertainty.
Landmark coordinates are normalized from [0, S] to [−1, 1], for a square image of size S×S. Rather than directly outputting σ, we predict log σ, and take its exponential to ensure σ is positive. We train PyTorch models from the timm library using the AdamW optimizer.
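The coordinate normalization and the log σ parameterization described above may be sketched as follows. The function names are hypothetical, and the interleaved output layout (x, y, log σ for each landmark in turn) is an assumption made only for illustration; the disclosure does not specify how the final layer's outputs are ordered.

```python
import math


def normalize_coords(xy_pixels, size):
    """Map pixel coordinates in [0, size] to [-1, 1] for a size x size image."""
    return [(2.0 * x / size - 1.0, 2.0 * y / size - 1.0) for x, y in xy_pixels]


def denormalize_coords(xy_norm, size):
    """Inverse mapping, from [-1, 1] back to pixel coordinates in [0, size]."""
    return [((x + 1.0) * size / 2.0, (y + 1.0) * size / 2.0) for x, y in xy_norm]


def decode_outputs(raw):
    """Split a flat output vector into (x, y, sigma) triples.

    Assumed layout: [x_0, y_0, log_sigma_0, x_1, y_1, log_sigma_1, ...].
    The exponential turns the unconstrained log sigma into a positive sigma.
    """
    landmarks = []
    for i in range(0, len(raw), 3):
        x, y, log_sigma = raw[i:i + 3]
        landmarks.append((x, y, math.exp(log_sigma)))
    return landmarks
```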
Given probabilistic dense 2D landmarks L, one goal is to find optimal model parameters Φ* that minimize the following energy:
E_{landmarks }is the only term that encourages the 3D model to explain the observed 2D landmarks. The other terms use prior knowledge to regularize the fit.
An advantage of the disclosed approach is how naturally it scales to multiple images and cameras. Described herein is the general form of the disclosed method, suitable for F frames over C cameras, i.e., multiview performance capture. In the monocular case, C=1, simplifying some terms.
3D face model. A linear face model may be used, comprising N=7,667 vertices and K=4 skeletal joints (the head, neck, and two eyes). Vertex positions may be determined by the mesh-generating function ℳ(β,ψ,θ): ℝ^{|β|+|ψ|+|θ|}→ℝ^{3N}, which takes parameters β∈ℝ^{|β|} for identity, ψ∈ℝ^{|ψ|} for expression, and θ∈ℝ^{3K+3} for skeletal pose (including root joint translation).
ℳ(β,ψ,θ)=ℒ(𝒯(β,ψ),θ,𝒥(β);W)
where ℒ(V, θ, J; W) is a standard linear blend skinning (LBS) function that rotates vertex positions V∈ℝ^{3N} about joint locations J∈ℝ^{3K} by local joint rotations in θ, with per-vertex weights W∈ℝ^{K×N}. The face mesh and joint locations in the bind pose are determined by 𝒯(β,ψ): ℝ^{|β|+|ψ|}→ℝ^{3N} and 𝒥(β): ℝ^{|β|}→ℝ^{3K} respectively.
To simplify notation, it can be assumed that the mesh contains landmark vertices only.
Camera model. Perspective cameras are assumed. Each is described by a world-to-camera rigid transform X∈ℝ^{3×4}=[R|T] comprising rotation and translation, and a pinhole camera projection matrix Π∈ℝ^{3×3}. So, the image-space projection of the j^{th} landmark in the i^{th} camera is x_{i,j}=Π_{i}X_{i}ℳ_{j}. In the monocular case, X is the identity transform and can be ignored.
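The projection of a single landmark vertex through the camera model above may be sketched as follows. The function name and argument layout are hypothetical. The sketch assumes square pixels with a single focal length and a principal point (cx, cy), and it makes the perspective divide explicit, which the compact matrix notation leaves implicit.

```python
def project_point(point_world, rotation, translation, focal, cx, cy):
    """Project a 3D point to 2D pixel coordinates.

    rotation    : 3x3 matrix R (list of rows), world-to-camera
    translation : length-3 vector T, world-to-camera
    focal       : focal length in pixels (square pixels assumed)
    cx, cy      : principal point in pixels
    """
    # Rigid transform X = [R|T]: camera-space point = R * p + T.
    cam = [
        sum(rotation[r][c] * point_world[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]
    # Pinhole matrix [[f, 0, cx], [0, f, cy], [0, 0, 1]] applied to the camera-space point.
    u = focal * cam[0] + cx * cam[2]
    v = focal * cam[1] + cy * cam[2]
    w = cam[2]
    # Perspective divide; the point is assumed to lie in front of the camera (w > 0).
    return (u / w, v / w)
```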
Parameters Φ may be optimized to minimize E. The main parameters of interest control the face, but camera parameters can also be optimized if they are unknown or known to be imprecise.
Facial identity β is shared over a sequence of F frames, but expression Ψ and pose Θ vary per frame. For each of our C cameras we have six degrees of freedom for rotation R and translation T, and a focal length parameter f (assuming square pixels and principal point at image center). In the monocular case, we only optimize focal length.
E_{landmarks }encourages the 3D model to explain the predicted 2D landmarks,
where, for the k^{th} landmark seen by the j^{th} camera in the i^{th} frame, [μ_{ijk}, σ_{ijk}] is the 2D location and uncertainty predicted by our dense landmark CNN, and x_{ijk}=Π_{j}X_{j}ℳ(β, ψ_{i}, θ_{i})_{k} is the 2D projection of that landmark on our 3D model. The similarity of Equation 4 to Loss_{μ} in Equation 3 is no accident: treating landmarks as 2D random variables during both prediction and model fitting allows our approach to elegantly handle uncertainty, taking advantage of landmarks the CNN is confident in, and discounting those it is uncertain about.
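A minimal sketch of the landmark data term is given below. It assumes that the term mirrors Loss_{μ}, i.e. each squared 2D distance between a projected model landmark and its predicted location is divided by 2σ²; the function name and flat list layout are hypothetical, and the sums over frames, cameras and landmarks are collapsed into a single list for brevity.

```python
def landmark_energy(projected, predicted_mu, predicted_sigma):
    """Sum of squared 2D distances, each weighted by 1 / (2 * sigma^2).

    projected       : list of (x, y) model projections
    predicted_mu    : list of (x, y) landmark locations predicted by the network
    predicted_sigma : list of predicted standard deviations, each > 0
    """
    energy = 0.0
    for (x, y), (mx, my), sigma in zip(projected, predicted_mu, predicted_sigma):
        sq_dist = (x - mx) ** 2 + (y - my) ** 2
        # Uncertain landmarks (large sigma) contribute little to the fit.
        energy += sq_dist / (2.0 * sigma ** 2)
    return energy
```

With this weighting, the same one-pixel error costs 0.5 when σ = 1 but only 0.005 when σ = 10, which is how uncertain landmarks are discounted during fitting.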
E_{identity} penalizes unlikely face shape by maximizing the relative log-likelihood of shape parameters β under a multivariate Gaussian Mixture Model (GMM) of G components. This GMM was fit to the library of 3D head scans used to train our 3D face model. So, E_{identity}=−log(p(β)) where p(β)=Σ_{i=1}^{G}γ_{i}𝒩(β|ν_{i}, Σ_{i}). ν_{i} and Σ_{i} are the mean and covariance matrix of the i^{th} component, and γ_{i} is the weight of that component.
E_{expression}=∥ψ∥^{2} and E_{joints}=∥θ_{i:i∈[2,K]}∥^{2} encourage the optimizer to explain the data with as little expression and joint rotation as possible. We do not penalize global translation or rotation by ignoring the root joint θ_{1}.
E_{temporal}=Σ_{i=2,j,k}^{F,C,L}∥x_{i,j,k}−x_{i−1,j,k}∥^{2} reduces jitter by encouraging face mesh vertices x to remain still between neighboring frames i−1 and i.
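The expression, joint and temporal regularizers above may be sketched as follows. The function names and nested-list layouts are hypothetical choices for illustration; in particular, joint rotations are stored one list per joint with the root joint first, and 2D positions are indexed as `frames[i][j][k]` for frame i, camera j and landmark k.

```python
def expression_energy(psi):
    """Squared norm of the expression parameters."""
    return sum(v * v for v in psi)


def joint_energy(theta_per_joint):
    """Squared norm of all joint rotations except the root joint (index 0 here)."""
    return sum(v * v for joint in theta_per_joint[1:] for v in joint)


def temporal_energy(frames):
    """Sum of squared 2D displacements between neighboring frames.

    frames[i][j][k] is the (x, y) position of landmark k in camera j at frame i.
    """
    energy = 0.0
    for prev, curr in zip(frames[:-1], frames[1:]):
        for cam_prev, cam_curr in zip(prev, curr):
            for (x0, y0), (x1, y1) in zip(cam_prev, cam_curr):
                energy += (x1 - x0) ** 2 + (y1 - y0) ** 2
    return energy
```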
E_{intersect }encourages the optimizer to find solutions without intersections between the skin and eyeballs or teeth. Referring to
and use these to solve the symmetric, positive-semidefinite linear system, (J^{T}J+λdiag(J^{T}J))δ_{k}=−J^{T}r via Cholesky decomposition. We then apply the update rule, Φ_{k+1}=Φ_{k}+δ_{k}.
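The damped linear solve and update described above may be sketched as follows, using only the Python standard library. The function names are hypothetical, the matrices are small dense lists of rows, and the damped system is assumed to be strictly positive definite so that the Cholesky factorization exists.

```python
import math


def cholesky(a):
    """Lower-triangular L with L * L^T = a, for a symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def cholesky_solve(a, b):
    """Solve a * x = b by forward and back substitution on the Cholesky factor."""
    low = cholesky(a)
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def lm_step(jtj, jtr, damping):
    """Solve (J^T J + damping * diag(J^T J)) * delta = -J^T r for delta."""
    n = len(jtr)
    damped = [
        [jtj[i][j] + (damping * jtj[i][i] if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    return cholesky_solve(damped, [-g for g in jtr])
```

The returned `delta` would then be added to the current parameters, following the update rule above.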
In practice it is not necessary to form the residual vector r nor the Jacobian matrix J. Instead, for performance reasons, the quantities J^{T}J and J^{T}r are directly computed as we visit each term r_{i}(Φ_{k}) of the energy. Most of the computational cost is incurred in evaluating these products for the landmark data term, as expected. However, the Jacobian of landmark term residuals is not fully dense. Each individual landmark depends on its own subset of expression parameters, and is invariant to other expression parameters. A static analysis of the sparsity of each landmark term may be performed with respect to parameters, ∂r_{i}/∂Φ_{j}, and we use this set of i,j indices to reduce the cost of our outer products from O(n^{2}) to O(m_{i}^{2}), where Φ∈ℝ^{n} and m_{i} is the sparsified dimensionality of ∂r_{i}/∂Φ^{T}. We further enhance the sparsity by ignoring any components of the Jacobian with an absolute value below a certain empirically-determined threshold.
By exploiting sparsity in this way, the landmark term residuals and their derivatives become very efficient to evaluate. This formulation avoids the correspondence problem usually seen with depth images, which requires a more costly optimization. In addition, adding more landmarks only grows the DNN in the last fully-connected layer. It therefore becomes possible to implement a very detailed and well-regularized fitter with a relatively small compute burden, simply by adding a sufficient number of landmarks. The cost of the Cholesky solve for the update δ_{k} is independent of the number of landmarks.
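The sparse accumulation of J^{T}J and J^{T}r may be sketched as follows. This is an illustrative sketch under stated assumptions: each residual is treated as a scalar, its non-zero partial derivatives are supplied as a dictionary keyed by parameter index, and the function name and `threshold` argument are hypothetical.

```python
def accumulate_normal_equations(n_params, residual_terms, threshold=0.0):
    """Build J^T J and J^T r term by term without forming J or r.

    residual_terms : iterable of (r_i, {param_index: d r_i / d param})
    threshold      : derivative entries with absolute value below this are skipped
    """
    jtj = [[0.0] * n_params for _ in range(n_params)]
    jtr = [0.0] * n_params
    for r, grad in residual_terms:
        # Only the m_i parameters this term depends on are visited,
        # so the outer product costs O(m_i^2) rather than O(n^2).
        items = [(p, g) for p, g in grad.items() if abs(g) >= threshold]
        for p, gp in items:
            jtr[p] += gp * r
            for q, gq in items:
                jtj[p][q] += gp * gq
    return jtj, jtr
```

Because the accumulated matrices always have size n×n regardless of how many residual terms are visited, the subsequent Cholesky solve does not grow with the number of landmarks.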
Turning now to
It should also be understood that the illustrated methods can end at any time and need not be performed in their entireties. Some or all operations of the methods, and/or substantially equivalent operations, can be performed by execution of computer-readable instructions included on computer-storage media, as defined herein. The term “computer-readable instructions,” and variants thereof, as used in the description and claims, is used expansively herein to include routines, applications, application modules, program modules, programs, components, data structures, algorithms, and the like. Computer-readable instructions can be implemented on various system configurations, including single-processor or multiprocessor systems, minicomputers, mainframe computers, personal computers, hand-held computing devices, microprocessor-based, programmable consumer electronics, combinations thereof, and the like.
It should be appreciated that the logical operations described herein are implemented (1) as a sequence of computer implemented acts or program modules running on a computing system such as those described herein and/or (2) as interconnected machine logic circuits or circuit modules within the computing system. The implementation is a matter of choice dependent on the performance and other requirements of the computing system. Accordingly, the logical operations may be implemented in software, in firmware, in special purpose digital logic, and any combination thereof. Thus, although the routine 700 is described as running on a system, it can be appreciated that the routine 700 and other operations described herein can be executed on an individual computing device or several devices.
Referring to
Operation 701 may be followed by operation 703. Operation 703 illustrates fitting a 3D model, parameterized by model parameters Φ to the 2D landmarks L by minimizing an energy function E (Φ; L).
Operation 703 may be followed by operation 705. Operation 703 illustrates fitting a 3D model, parameterized by model parameters Φ to the 2D landmarks L by minimizing an energy function E (Φ; L).
Turning now to
Operation 711 may be followed by operation 713. Operation 713 illustrates training, by the computing system, the neural network to maximize a log-likelihood that samples from each of the predicted keypoints equal a ground truth.
Operation 713 may be followed by operation 715. Operation 715 illustrates using the trained neural network to predict keypoints of an image without generating a heatmap.
Turning now to
Referring to
Operation 801 may be followed by operation 803. Operation 803 illustrates training a network to maximize the log-likelihood that samples from each predicted keypoint equal the ground truth.
Turning now to
Operation 811 may be followed by operation 813. Operation 813 illustrates training, by the computing system, a neural network to maximize a log-likelihood that samples from each of the predicted keypoints equal a ground truth.
Operation 813 may be followed by operation 815. Operation 815 illustrates using the trained neural network to predict keypoints of an image without generating a heatmap.
As used herein, a parametric model may be a predefined model that can be modified by some number of parameters. Some of these parameters can control the rotation, translation, and scale of the object (i.e. some global changes) while others control more local deformations (e.g., one number can control how much a face model “smiles”). In the fitting process, the goal is to find the value of all of these parameters that best explain the predicted keypoints.
In the example system illustrated in
An example architecture for a real-time system in which the disclosed embodiments can be implemented is MobileNetV2. An example architecture for an offline system in which the disclosed embodiments can be implemented is ResNet-101. These architectures are known to those skilled in the art.
The various aspects of the disclosure are described herein with regard to certain examples and embodiments, which are intended to illustrate but not to limit the disclosure. It should be appreciated that the subject matter presented herein may be implemented as a computer process, a computercontrolled apparatus, or a computing system or an article of manufacture, such as a computerreadable storage medium. While the subject matter described herein is presented in the general context of program modules that execute on one or more computing devices, those skilled in the art will recognize that other implementations may be performed in combination with other types of program modules. Generally, program modules include routines, programs, components, data structures and other types of structures that perform particular tasks or implement particular abstract data types.
Those skilled in the art will also appreciate that the subject matter described herein may be practiced on or in conjunction with other computer system configurations beyond those described herein, including multiprocessor systems. The embodiments described herein may also be practiced in distributed computing environments, where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.
Networks established by or on behalf of a user to provide one or more services (such as various types of cloudbased computing or storage) accessible via the Internet and/or other networks to a distributed set of clients may be referred to as a service provider.
In some embodiments, a server that implements a portion or all of one or more of the technologies described herein, including the techniques to implement methods for predicting keypoints in an image, may include a general-purpose computer system that includes or is configured to access one or more computer-accessible media.
In various embodiments, computing device 1100 may be a uniprocessor system including one processor 1110 or a multiprocessor system including several processors 1110 (e.g., two, four, eight, or another suitable number). Processors 1110 may be any suitable processors capable of executing instructions. For example, in various embodiments, processors 1110 may be general-purpose or embedded processors implementing any of a variety of instruction set architectures (ISAs), such as the x86, PowerPC, SPARC, or MIPS ISAs, or any other suitable ISA. In multiprocessor systems, each of processors 1110 may commonly, but not necessarily, implement the same ISA.
System memory 1120 may be configured to store instructions and data accessible by processor(s) 1110. In various embodiments, system memory 1120 may be implemented using any suitable memory technology, such as static random access memory (SRAM), synchronous dynamic RAM (SDRAM), nonvolatile/Flash-type memory, or any other type of memory. In the illustrated embodiment, program instructions and data implementing one or more desired functions, such as those methods, techniques and data described above, are shown stored within system memory 1120 as code 1125 and data 1126.
In one embodiment, I/O interface 1130 may be configured to coordinate I/O traffic between the processor 1110, system memory 1120, and any peripheral devices in the device, including network interface 1140 or other peripheral interfaces. In some embodiments, I/O interface 1130 may perform any necessary protocol, timing, or other data transformations to convert data signals from one component (e.g., system memory 1120) into a format suitable for use by another component (e.g., processor 1110). In some embodiments, I/O interface 1130 may include support for devices attached through various types of peripheral buses, such as a variant of the Peripheral Component Interconnect (PCI) bus standard or the Universal Serial Bus (USB) standard, for example. In some embodiments, the function of I/O interface 1130 may be split into two or more separate components. Also, in some embodiments some or all of the functionality of I/O interface 1130, such as an interface to system memory 1120, may be incorporated directly into processor 1110.
Network interface 1140 may be configured to allow data to be exchanged between computing device 1100 and other device or devices 1160 attached to a network or network(s) 1190, such as other computer systems or devices as illustrated in
In some embodiments, system memory 1120 may be one embodiment of a computer-accessible medium configured to store program instructions and data as described above for
Various storage devices and their associated computer-readable media provide nonvolatile storage for the computing devices described herein. Computer-readable media as discussed herein may refer to a mass storage device, such as a solid-state drive, a hard disk or CD-ROM drive. However, it should be appreciated by those skilled in the art that computer-readable media can be any available computer storage media that can be accessed by a computing device.
By way of example, and not limitation, computer storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. For example, computer media includes, but is not limited to, RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, digital versatile disks (“DVD”), HD-DVD, BLU-RAY, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computing devices discussed herein. For purposes of the claims, the phrase “computer storage medium,” “computer-readable storage medium” and variations thereof, does not include waves, signals, and/or other transitory and/or intangible communication media, per se.
Encoding the software modules presented herein also may transform the physical structure of the computer-readable media presented herein. The specific transformation of physical structure may depend on various factors, in different implementations of this description. Examples of such factors may include, but are not limited to, the technology used to implement the computer-readable media, whether the computer-readable media is characterized as primary or secondary storage, and the like. For example, if the computer-readable media is implemented as semiconductor-based memory, the software disclosed herein may be encoded on the computer-readable media by transforming the physical state of the semiconductor memory. For example, the software may transform the state of transistors, capacitors, or other discrete circuit elements constituting the semiconductor memory. The software also may transform the physical state of such components in order to store data thereupon.
As another example, the computer-readable media disclosed herein may be implemented using magnetic or optical technology. In such implementations, the software presented herein may transform the physical state of magnetic or optical media, when the software is encoded therein. These transformations may include altering the magnetic characteristics of particular locations within given magnetic media. These transformations also may include altering the physical features or characteristics of particular locations within given optical media, to change the optical characteristics of those locations. Other transformations of physical media are possible without departing from the scope and spirit of the present description, with the foregoing examples provided only to facilitate this discussion.
In light of the above, it should be appreciated that many types of physical transformations take place in the disclosed computing devices in order to store and execute the software components and/or functionality presented herein. It is also contemplated that the disclosed computing devices may not include all of the illustrated components shown in
Although the various configurations have been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended representations is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as example forms of implementing the claimed subject matter.
Conditional language used herein, such as, among others, “can,” “could,” “might,” “may,” “e.g.,” and the like, unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements, and/or steps. Thus, such conditional language is not generally intended to imply that features, elements, and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without author input or prompting, whether these features, elements, and/or steps are included or are to be performed in any particular embodiment. The terms “comprising,” “including,” “having,” and the like are synonymous and are used inclusively, in an open-ended fashion, and do not exclude additional elements, features, acts, operations, and so forth. Also, the term “or” is used in its inclusive sense (and not in its exclusive sense) so that when used, for example, to connect a list of elements, the term “or” means one, some, or all of the elements in the list.
While certain example embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions disclosed herein. Thus, nothing in the foregoing description is intended to imply that any particular feature, characteristic, step, module, or block is necessary or indispensable. Indeed, the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the inventions disclosed herein. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of certain of the inventions disclosed herein.
It should be appreciated that any reference to “first,” “second,” etc. items and/or abstract concepts within the description is not intended to and should not be construed to necessarily correspond to any reference of “first,” “second,” etc. elements of the claims. In particular, within this Summary and/or the following Detailed Description, items and/or abstract concepts such as, for example, individual computing devices and/or operational states of the computing cluster may be distinguished by numerical designations without such designations corresponding to the claims or even other paragraphs of the Summary and/or Detailed Description. For example, any designation of a “first operational state” and “second operational state” of the computing cluster within a paragraph of this disclosure is used solely to distinguish two different operational states of the computing cluster within that specific paragraph, and not in any other paragraph and particularly not in the claims.
In closing, although the various techniques have been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended representations is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as example forms of implementing the claimed subject matter.
The disclosure presented herein encompasses the subject matter set forth in the following example clauses.

 Clause 1: A method for predicting keypoints by a computing system, the method comprising:
 instantiating, by the computing system, a neural network configured to predict each of the keypoints as a 2D random variable, normally distributed with a 2D position and 2×2 covariance matrix;
 training, by the computing system, the neural network to maximize a log-likelihood that samples from each of the predicted keypoints equal a ground truth; and
 using the trained neural network to predict keypoints of an image without generating a heatmap.
 Clause 2: The method of clause 1, wherein distribution of uncertainty values is influenced during the training by introducing a suitable prior.
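 Clause 3: The method of any of clauses 1-2, wherein the 2D random variable is a 2D circular Gaussian with position and uncertainty.
 Clause 4: The method of any of clauses 1-3, wherein training the neural network comprises training with a Gaussian negative log likelihood (GNLL) loss:
$$\mathrm{Loss} = \sum_{i=1}^{|L|} \lambda_i \left( \underbrace{\log\left(\sigma_i^2\right)}_{\mathrm{Loss}_\sigma} + \underbrace{\frac{\lVert \mu_i - \mu_i' \rVert^2}{2\sigma_i^2}}_{\mathrm{Loss}_\mu} \right)$$
 Clause 5: The method of any of clauses 1-4, wherein a relative likelihood that a sample from the 2D random variable will equal the ground truth keypoint location is:
$$\frac{1}{2\pi\sigma^2} \, e^{-\frac{(x - x')^2 + (y - y')^2}{2\sigma^2}}$$
wherein the sample is at (x′, y′) and the ground truth is at (x, y).
 Clause 6: The method of any of clauses 1-5, further comprising minimizing a loss that is a sum of negative log-likelihoods.
 Clause 7: The method of clauses 1-6, wherein the ground truth consists of a set of keypoint coordinates Y ∈ ℝ^{N×2}:
$$Y = \begin{bmatrix} x_0 & y_0 \\ x_1 & y_1 \\ \vdots & \vdots \\ x_N & y_N \end{bmatrix}$$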
 Clause 8: The method of any of clauses 1-7, further comprising predicting occluded keypoints and keypoints outside of the image.
 Clause 9: The method of any of clauses 1-8, wherein the keypoints are predicted at a subpixel precision.
 Clause 10: The method of any of clauses 1-9, wherein the image is received from a set of introspective sensors attached to a HMD, the method further comprising:
 performing 3D facial reconstruction from multiple views of the introspective sensors.
 Clause 11: The method of any of clauses 1-10, further comprising localizing or calibrating the HMD or accessories thereof.
 Clause 12: The method of any of clauses 1-11, wherein ground truth images and keypoints comprise pairs of images captured by a camera or sensor with annotated ground truth.
 Clause 13: The method of any of clauses 1-12, wherein ground truth images and keypoints are generated or rendered by a computer.
 Clause 14: The method of any of clauses 1-13, wherein training data for the neural network include keypoints that are occluded or outside of ground truth images.
 Clause 15: A computing system for fitting a model using observation data, the computing system comprising:
 one or more processors; and
 a computer-readable storage medium having computer-executable instructions stored thereupon which, when executed by the processor, cause the computing system to perform operations comprising:
 executing a neural network configured to predict keypoints as a 2D random variable, normally distributed with location (x, y) and standard deviation sigma;
 training a network to maximize a log-likelihood that samples from each predicted keypoint equal a ground truth; and
 using the trained network to predict keypoints of an image.
 Clause 16: The computing system of clause 15, wherein each of the keypoints is predicted as a random variable with probability density function of a circular 2D Gaussian p(x, σ) = p(x | σ) p(σ); p(σ) = 1.
 Clause 17: The computing system of any of clauses 15 and 16, wherein a relative likelihood that a sample from the 2D random variable will equal the ground truth keypoint location is:
$$\frac{1}{2\pi\sigma^2} \, e^{-\frac{(x - x')^2 + (y - y')^2}{2\sigma^2}}$$
wherein the sample is at (x′, y′) and the ground truth is at (x, y).
 Clause 18: The computing system of any of clauses 15-17, further comprising computer-executable instructions stored thereupon which, when executed by the processor, cause the computing system to perform operations comprising minimizing a loss that is a sum of negative log-likelihoods.
 Clause 19: The computing system of any of clauses 15-18, wherein the ground truth consists of a set of keypoint coordinates Y ∈ ℝ^{N×2}:
$$Y = \begin{bmatrix} x_0 & y_0 \\ x_1 & y_1 \\ \vdots & \vdots \\ x_N & y_N \end{bmatrix}$$
 Clause 20: A computer-readable storage medium having computer-executable instructions stored thereupon which, when executed by a processor of a computing system, cause the computing system to perform operations comprising:
 executing a neural network configured to generate predictions for keypoints as a 2D random variable, normally distributed with location (x, y) and standard deviation sigma;
 training a network to maximize a log-likelihood that samples from each predicted keypoint equal a ground truth; and
 using the trained network to predict keypoints of an image.
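By way of illustration only, and not limitation, the Gaussian negative log likelihood (GNLL) loss named in the example clauses above may be sketched in Python as follows. The sketch follows the form recited in claim 4: for each keypoint, a log-variance term plus the squared distance between predicted and ground truth positions divided by twice the variance, weighted per keypoint and summed. The function name `gnll_loss`, its argument names, and the default weight of 1.0 per keypoint are assumptions made for this sketch and do not appear in the disclosure; an actual embodiment would typically evaluate the loss inside a neural network training framework.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


def gnll_loss(
    pred_means: Sequence[Point],
    pred_sigmas: Sequence[float],
    targets: Sequence[Point],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Sum over keypoints of w_i * (log(sigma_i^2) + ||mu_i - mu_i'||^2 / (2 sigma_i^2)).

    pred_means  : predicted 2D positions, one (x, y) pair per keypoint
    pred_sigmas : predicted standard deviations (must be > 0)
    targets     : ground truth 2D positions
    weights     : optional per-keypoint weights (assumed 1.0 when omitted)
    """
    n = len(pred_means)
    if len(pred_sigmas) != n or len(targets) != n:
        raise ValueError("all inputs must describe the same number of keypoints")
    if weights is None:
        weights = [1.0] * n
    elif len(weights) != n:
        raise ValueError("weights must have one entry per keypoint")

    total = 0.0
    for (mx, my), sigma, (tx, ty), w in zip(pred_means, pred_sigmas, targets, weights):
        if sigma <= 0.0:
            raise ValueError("sigma must be positive")
        variance = sigma * sigma
        squared_distance = (mx - tx) ** 2 + (my - ty) ** 2
        loss_sigma = math.log(variance)                 # penalizes large uncertainty
        loss_mu = squared_distance / (2.0 * variance)   # penalizes position error
        total += w * (loss_sigma + loss_mu)
    return total
```

In this sketch, a keypoint whose position is predicted poorly can lower its position term by reporting a larger sigma, but only at the cost of a larger log-variance term, which is the trade-off that lets the network report meaningful uncertainty.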
The disclosure presented herein encompasses the subject matter set forth in the following additional example clauses.

 Clause 1: A method for predicting keypoints by a computing system, the method comprising:
 generating, by the computing system, predictions for each of the keypoints of an image as a 2D random variable, normally distributed with location (x, y) and standard deviation sigma;
 training, by the computing system, a neural network to maximize a log-likelihood that samples from each of the predicted keypoints equal a ground truth; and
 using the trained neural network to predict keypoints of an image without generating a heatmap.
 Clause 2: The method of clause 1, further comprising fitting a parametric model to the predicted keypoints.
 Clause 3: The method of any of clauses 1-2, wherein distribution of uncertainty values is influenced during the training by introducing a suitable prior.
 Clause 4: The method of any of clauses 1-3, wherein the image is of an object for which an a priori model is available, the method further comprising preprocessing the image by:
 running a sliding window over the image;
 measuring average keypoint confidence for each window of the sliding window;
 if a window with a high average keypoint certainty is not found, determining that the object is not in the image; and
 otherwise, taking the window which reported a highest average keypoint confidence to contain the object.
 Clause 5: The method of any of clauses 1-4, wherein the image is input from one or more of regular color (RGB) cameras, depth cameras, IR sensors, head-mounted cameras, event cameras, or web cameras.
 Clause 6: The method of any of clauses 1-5, wherein the method is performed in combination with a model fitter that predicts intrinsic camera parameters.
 Clause 7: The method of clauses 1-6, wherein the model fitter is configured for a single view where uncertainty of each landmark is taken into account.
 Clause 8: The method of any of clauses 1-7, wherein the model fitter is configured for multiple views.
 Clause 9: The method of any of clauses 1-8, wherein an energy function for fitting the parametric model comprises

 Clause 10: A computing system for fitting a model using observation data, the computing system comprising:
 one or more processors; and
 a computer-readable storage medium having computer-executable instructions stored thereupon which, when executed by the processor, cause the computing system to perform operations comprising:
 generating predictions for keypoints as a 2D random variable, normally distributed with location (x, y) and standard deviation sigma;
 training a neural network to maximize a log-likelihood that samples from each predicted keypoint equal a ground truth;
 using the trained neural network to predict keypoints of an image; and
 fitting a parametric model to the predicted keypoints.
 Clause 11: The computing system of clause 10, wherein distribution of uncertainty values is influenced during the training by introducing a suitable prior.
 Clause 12: The computing system of any of clauses 10 and 11, wherein the image is of an object, further comprising computer-executable instructions stored thereupon which, when executed by the processor, cause the computing system to perform operations for preprocessing the image, comprising:
 running a sliding window over the image;
 measuring average keypoint confidence for each window;
 if a window with a high average keypoint certainty is not found, determining that the object is not in the image; and
 otherwise, taking the window which reported the highest average keypoint confidence to contain the object.
 Clause 13: The computing system of any of clauses 10-12, wherein the predicting and training is performed in combination with a model fitter that predicts intrinsic camera parameters.
 Clause 14: The computing system of any of clauses 10-13, wherein the model fitter is configured for a single view where uncertainty of each landmark is taken into account.
 Clause 15: The computing system of any of clauses 10-14, further comprising computer-executable instructions stored thereupon which, when executed by the processor, cause the computing system to perform operations comprising performing 3D reconstruction from multiple views from multiple cameras, where an uncertainty in each view is taken into account, and wherein extrinsic parameters of each camera are concurrently optimized.
 Clause 16: The computing system of any of clauses 10-15, wherein the cameras are HMD cameras.
 Clause 17: The computing system of any of clauses 10-16, further comprising computer-executable instructions stored thereupon which, when executed by the processor, cause the computing system to perform operations comprising using uncertainties of the neural network to estimate which parts of an object being tracked are visible.
 Clause 18: A method for predicting parameters of a 3D model that corresponds to image data using a convolutional neural network (CNN) running on a computing system, the method comprising:
 predicting probabilistic dense 2D landmarks L using the convolutional neural network (CNN);
 fitting a 3D model, parameterized by model parameters Φ, to the 2D landmarks L by minimizing an energy function E(Φ; L); and
 outputting the fitted 3D model.
 Clause 19: The method of clause 18, wherein a model of an object is fitted to 2D probabilistic keypoint estimates from multiple cameras.
 Clause 20: The method of any of clauses 18 or 19, further comprising fitting a 3D model over a temporal sequence.
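By way of illustration only, and not limitation, the sliding-window preprocessing recited in the example clauses above may be sketched in Python as follows. The sketch assumes a hypothetical callback `predict_confidences` that runs the keypoint network on one window and returns one confidence value per keypoint; how a confidence value is derived from a predicted uncertainty is not specified here and is left to the caller. The square window shape, the `stride`, the threshold `min_avg_confidence`, and the function name `find_object_window` are likewise assumptions made for this sketch.

```python
from typing import Callable, Optional, Sequence, Tuple

# A window is (left, top, width, height) in pixel coordinates.
Window = Tuple[int, int, int, int]


def find_object_window(
    image_width: int,
    image_height: int,
    window_size: int,
    stride: int,
    predict_confidences: Callable[[Window], Sequence[float]],
    min_avg_confidence: float,
) -> Optional[Window]:
    """Return the window with the highest average keypoint confidence,
    or None when no window reaches min_avg_confidence (object not found)."""
    best_window: Optional[Window] = None
    best_score = float("-inf")

    # Run a sliding window over the image.
    for top in range(0, image_height - window_size + 1, stride):
        for left in range(0, image_width - window_size + 1, stride):
            window = (left, top, window_size, window_size)
            confidences = predict_confidences(window)
            if not confidences:
                continue
            # Measure the average keypoint confidence for this window.
            score = sum(confidences) / len(confidences)
            if score > best_score:
                best_score = score
                best_window = window

    # No sufficiently confident window: the object is taken to be absent.
    if best_window is None or best_score < min_avg_confidence:
        return None
    # Otherwise the most confident window is taken to contain the object.
    return best_window
```

A caller might wrap a trained network in `predict_confidences` so that each window is cropped, resized to the network input size, and evaluated once; those steps are outside the scope of this sketch.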
Claims
1. A method for predicting keypoints by a computing system, the method comprising:
 instantiating, by the computing system, a neural network configured to predict each of the keypoints as a 2D random variable, normally distributed with a 2D position and 2×2 covariance matrix;
 training, by the computing system, the neural network to maximize a log-likelihood that samples from each of the predicted keypoints equal a ground truth; and
 using the trained neural network to predict keypoints of an image without generating a heatmap.
2. The method of claim 1, wherein distribution of uncertainty values is influenced during the training by introducing a suitable prior.
3. The method of claim 1, wherein the 2D random variable is a 2D circular Gaussian with position and uncertainty.
4. The method of claim 1, wherein training the neural network comprises training with a Gaussian negative log likelihood (GNLL) loss:
$$\mathrm{Loss} = \sum_{i=1}^{|L|} \lambda_i \left( \underbrace{\log\left(\sigma_i^2\right)}_{\mathrm{Loss}_\sigma} + \underbrace{\frac{\lVert \mu_i - \mu_i' \rVert^2}{2\sigma_i^2}}_{\mathrm{Loss}_\mu} \right) \qquad (3)$$
5. The method of claim 1, wherein a relative likelihood that a sample from the 2D random variable will equal the ground truth keypoint location is:
$$\frac{1}{2\pi\sigma^2} \, e^{-\frac{(x - x')^2 + (y - y')^2}{2\sigma^2}}$$
 wherein the sample is at (x′, y′) and the ground truth is at (x, y).
6. The method of claim 1, further comprising minimizing a loss that is a sum of negative log-likelihoods.
7. The method of claim 1, wherein the ground truth consists of a set of keypoint coordinates Y ∈ ℝ^{N×2}:
$$Y = \begin{bmatrix} x_0 & y_0 \\ x_1 & y_1 \\ \vdots & \vdots \\ x_N & y_N \end{bmatrix}.$$
8. The method of claim 1, further comprising predicting occluded keypoints and keypoints outside of the image.
9. The method of claim 1, wherein the keypoints are predicted at a subpixel precision.
10. The method of claim 1, wherein the image is received from a set of introspective sensors attached to a HMD, the method further comprising:
 performing 3D facial reconstruction from multiple views of the introspective sensors.
11. The method of claim 10, further comprising localizing or calibrating the HMD or accessories thereof.
12. The method of claim 1, wherein ground truth images and keypoints comprise pairs of images captured by a camera or sensor with annotated ground truth.
13. The method of claim 1, wherein ground truth images and keypoints are generated or rendered by a computer.
14. The method of claim 12, wherein training data for the neural network include keypoints that are occluded or outside of ground truth images.
15. A computing system for fitting a model using observation data, the computing system comprising:
 one or more processors; and
 a computer-readable storage medium having computer-executable instructions stored thereupon which, when executed by the processor, cause the computing system to perform operations comprising:
 executing a neural network configured to predict keypoints as a 2D random variable, normally distributed with location (x, y) and standard deviation sigma;
 training a network to maximize a log-likelihood that samples from each predicted keypoint equal a ground truth; and
 using the trained network to predict keypoints of an image.
16. The computing system of claim 15, wherein each of the keypoints is predicted as a random variable with probability density function of a circular 2D Gaussian p(x, σ) = p(x | σ) p(σ); p(σ) = 1.
17. The computing system of claim 15, wherein a relative likelihood that a sample from the 2D random variable will equal the ground truth keypoint location is:
$$\frac{1}{2\pi\sigma^2} \, e^{-\frac{(x - x')^2 + (y - y')^2}{2\sigma^2}}$$
 wherein the sample is at (x′, y′) and the ground truth is at (x, y).
18. The computing system of claim 15, further comprising computer-executable instructions stored thereupon which, when executed by the processor, cause the computing system to perform operations comprising minimizing a loss that is a sum of negative log-likelihoods.
19. The computing system of claim 15, wherein the ground truth consists of a set of keypoint coordinates Y ∈ ℝ^{N×2}:
$$Y = \begin{bmatrix} x_0 & y_0 \\ x_1 & y_1 \\ \vdots & \vdots \\ x_N & y_N \end{bmatrix}.$$
20. A computer-readable storage medium having computer-executable instructions stored thereupon which, when executed by a processor of a computing system, cause the computing system to perform operations comprising:
 executing a neural network configured to generate predictions for keypoints as a 2D random variable, normally distributed with location (x, y) and standard deviation sigma;
 training a network to maximize a log-likelihood that samples from each predicted keypoint equal a ground truth; and
 using the trained network to predict keypoints of an image.
Type: Application
Filed: Jun 28, 2022
Publication Date: Sep 7, 2023
Inventors: Thomas Joseph CASHMAN (Cambridge), Erroll William WOOD (Cambridge), Martin DE LA GORCE (Cambridge), Tadas BALTRUSAITIS (Cambridge), Daniel Stephen WILDE (Cambridge), Jingjing SHEN (Cambridge), Matthew Alastair JOHNSON (Cambridge), Julien Pascal Christophe VALENTIN (Zurich)
Application Number: 17/851,933