Systems and Methods for Rapid Development of Object Detector Models
A computer vision system configured for detection and recognition of objects in video and still imagery in a live or historical setting uses a teacher-student object detector training approach to yield a merged student model capable of detecting all of the classes of objects any of the teacher models is trained to detect. Further, training is simplified by providing an iterative training process wherein a relatively small number of images is labeled manually as initial training data, after which an iterated model cooperates with a machine-assisted labeling process and an active learning process such that detector model accuracy improves with each iteration, yielding improved computational efficiency. Further, synthetic data is generated by which an object of interest can be placed in a variety of settings sufficient to permit training of models. A user interface guides the operator in the construction of a custom model capable of detecting a new object.
This application is a divisional and claims the benefit of U.S. patent application Ser. No. 17/938,042 filed Oct. 4, 2022, now U.S. Pat. No. 11,636,312 issued Apr. 25, 2023, which in turn claims the benefit of U.S. Patent Application Ser. No. 63/337,595 filed May 2, 2022, and further claims the benefit of U.S. Patent Application Ser. No. 63/329,327 filed Apr. 8, 2022. It is also a continuation-in-part of U.S. patent application Ser. No. 17/866,396 filed Jul. 15, 2022, which is a 371 conversion of PCT Application PCT/US21/13940 filed Jan. 19, 2021, which in turn is a continuation-in-part of U.S. patent application Ser. No. 16/120,128 filed Aug. 31, 2018, which in turn claims the benefit of U.S. Patent Application Ser. No. 62/553,725 filed Sep. 1, 2017. Further, this application is a continuation-in-part of U.S. patent application Ser. No. 17/866,389 which is a 371 conversion of PCT Application No. PCT/US21/13932, both of which in turn claim the benefit of U.S. Patent Applications Ser. Nos. 62/962,928 and 62/962,929, both filed Jan. 17, 2020, and also U.S. Patent Application Ser. No. 63/072,934, filed Aug. 31, 2020. The present application claims the benefit of each of the foregoing, all of which are incorporated herein by reference.
FIELD OF THE INVENTION
The present invention relates generally to computer vision systems configured for detection and recognition of objects in video and still imagery in a live or historical setting, and more particularly relates to the development of teacher-student object detector models that improve computational efficiency and, in related aspects, enable training of a network with reduced numbers of training images through the use of machine assisted labeling, active learning and iterative techniques to achieve desired levels of accuracy in object detector models.
BACKGROUND OF THE INVENTION
Conventional computer vision and machine learning systems are configured to identify objects, including people, cars, trucks, etc., by developing a computer vision model trained to recognize features of the object or objects. More generally, conventional imagery processing systems utilize one or more learned models to detect objects of interest in still frame images or in frames of video.
Deep learning methods and techniques have become standard for use in computer vision applications as well as other areas of artificial intelligence. In particular, Convolutional Neural Networks (CNNs) are generally regarded as providing state-of-the-art results. An approach to building a computer vision model using CNNs can be generalized as having four steps, where the first step includes a key difference between classifiers and detectors. In either case, the first step is to create a dataset comprised of annotated images, if one does not already exist. For classifiers, the annotations simply indicate that a given image includes at least one occurrence of an object of interest, and the image receives only one label indicating the class of the object of interest regardless of how many such objects occur in that image. Detectors, in contrast, both identify and locate each occurrence of an object of interest in an image, where a bounding box is drawn around each occurrence with a label for each bounding box. For example, if the object of interest is a dog and a given image includes two dogs, a classifier will label the entire image “Dog”. A detector will put a bounding box around each of the two dogs, and label both boxes “Dog.”
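To make the distinction concrete, the following sketch shows the two annotation styles as simple data structures; the format and field names are hypothetical, chosen only for illustration.

```python
# Hypothetical annotation formats contrasting a classifier label with
# detector labels for an image containing two dogs.

# Classifier: a single label for the whole image, no matter how many
# dogs appear in it.
classifier_annotation = {"image": "park_scene.jpg", "label": "Dog"}

# Detector: one tight bounding box per occurrence, each with its own
# label. Boxes here are (x_min, y_min, x_max, y_max) in pixels.
detector_annotation = {
    "image": "park_scene.jpg",
    "objects": [
        {"label": "Dog", "bbox": (34, 120, 210, 305)},
        {"label": "Dog", "bbox": (402, 98, 575, 290)},
    ],
}
```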
As a second step, features pertinent to the task at hand are extracted from each image. This is a key point in modeling the problem. For example, the features used to recognize faces, features based on facial criteria, are obviously not the same as those used to recognize tourist attractions or human organs, although some models are trained to detect many classes of objects. Then, as a third step, a deep learning model is trained based on the isolated features. Training means feeding the machine learning model many images, from which it learns, based on those features, how to solve the task at hand, i.e., detecting images that include objects with those features. Training typically includes both positive and negative training, where negative training refers to images that do not include the objects of interest. Last, the fourth step is to evaluate the model using images that were not used in the training phase. By doing so, the accuracy of the trained model can be tested.
While the foregoing strategy works for the initial development of a model, it presents a number of challenges, especially for detectors as opposed to classifiers, where there is a desire to be able to detect an additional class of object for which the original model was not trained. Among those challenges is the need, in most systems, to have a dataset with a large number of images. A common rule of thumb is 1,000 images for each class. A further issue for developing object detection models as opposed to classifiers is the time required to achieve the labeling necessary to develop an object detection model from scratch. One of the challenges of labeling is the confusion that can result when every instance of the object of interest in an image is not labeled. The requirement for both accuracy and a sizeable number of images makes it difficult for persons who have not received specialized training to develop an object detection model that will perform with reasonable accuracy. It should also be understood that the images need to be diverse and representative of real-world conditions to yield good results.
When there is a need to build a model capable of detecting a different object than the one the existing model was designed to detect, prior art solutions face even more challenges. One significant challenge with adding new object types is the increased labeling cost, while a second significant challenge is the increased computational expense. For example, suppose a model M1 has been trained on dataset D1 for a set of objects O1. The customer is now interested in an additional set of objects O2, with instances of O2 in the images of a new dataset D2. The new dataset D2 is as yet unlabeled for any objects, which results in the aforesaid rising labeling costs and rising run time costs.
The rising labeling costs result because one cannot simply label O2 objects in D2 and then train a combined (O1+O2) detector on the dataset (D1+D2). This is because D1 may have occurrences of objects from the set of objects O2 that will not be labeled since the user was initially interested in only the set of objects O1. Where objects appear in an image, but are not labeled, during training the detector identifies the unlabeled objects as not being of interest, and so detection of those objects is excluded from the model. The result is that training a combined (O1+O2) detector on (D1+D2) in this manner will lead to false negatives. To avoid this, in conventional approaches the images of D1 must be revisited to label any of the set of objects O2 present in those images. For similar reasons, all O1 objects would need to be labeled in the images of dataset D2. Both labeling tasks can be very laborious, as the list of objects, and the amount of data keeps growing over time.
An alternative approach might be to label only O2 objects in D2, and build a new model M2 that only detects O2 objects. In the production environment, for each new image, the model M1 would first be run on the production images, followed by running the model M2 on the production images. The downside is that such a process results in additional computational time per image, which increases every time the customer is interested in a new set of objects. If the process of running an additional model for every new set of objects is continued indefinitely, there will be a time when the total computational expense becomes prohibitive.
When the objective of the system is only classification, rather than detection, a technique termed Transfer Learning has been used to reduce computational expense, for example through the use of a Teacher-Student model. In classifiers using such Transfer Learning techniques, an already-trained model serves as a starting point. Typically that already-trained model has seen a large number of images and learned to distinguish among the classes. If so, that classifier can be taught a few new classes in the same domain (i.e., generally the same type of objects) based on a relatively small number of training images. However, the greater complexity of object detection models has made such conventional Transfer Learning techniques unworkable for detection.
Because of the foregoing challenges, typical end-users have refrained from developing their own object detection models and instead have relied on third parties to develop such models. This reliance makes it difficult for an end-user to develop a model where the mere knowledge that a search is being performed for the object of interest is highly confidential.
As a result, there has been a long-felt need for a system and method that would provide the benefit of Transfer Learning to object detection models. Further, there has been a long-felt need for a system and method for developing object detection models that can be executed by an end-user without significant specialized training and allows the subject of the search to remain confidential within the end-user organization. Still further, there has been a long-felt need for a schema for developing an object detection model for a new object where only comparatively few images are available for training while still yielding acceptable accuracy and minimizing computational expense.
SUMMARY OF THE INVENTION
The present invention substantially resolves the limitations of conventional systems performing object detection in that it provides a process and system by which a user without specialized training can develop custom object detection models using substantially fewer images than in conventional systems, while permitting the developer of the model to maintain confidentiality regarding the object of interest. In an embodiment, the system receives from a user an image dataset comprising a quantity of images, either in the form of still frame images or in the form of video snippets comprising a sequence of video frames. The volume of images is reduced compared to conventional systems, and at least in some embodiments forms a small dataset.
A batch of images is randomly selected from the image dataset and occurrences of the object of interest are labeled on the images included in the batch. Those labeled images form training data for a deep learning network, and once the network is trained a first iteration of a custom model is developed, where that model typically is specific to a much more limited number of classes, and in at least some cases just one class. The system also includes a system production model, which in some embodiments has been extensively trained to detect a variety of classes of objects. The system production model and the iterated model operate as teacher models in a teacher-student network, where the classes that each of the teacher models are trained to detect are combined in an optimization process to yield a merged, or student, model that can detect all of the objects that either teacher model can detect. In an exemplary embodiment the optimization process comprises a classifier and a regressor run at anchor boxes, which are specific locations in the image at various aspect ratios. The merged model is run against a production dataset which can comprise either still frame images or frames from video sequences; the results are fed back to the original images and video for correction of labeling errors and/or updating of missed labelings. By iterating through several rounds of selecting a batch of images and/or correcting any mislabeled images from a prior batch, the merged or student model converges and yields usable results.
To further improve results, the iterated model is also provided to a machine assisted labeling process as well as an active learning process. The outputs of both are also fed back to the original images and video to allow correction of mislabeled images or partly labeled images. In addition, the video output is supplied to a tracking process that identifies the location of objects in sequential frames of the video, whereby only a single frame need be labeled initially; the tracking process receives that labeling data and can apply it to the remaining frames of the video snippet. The output of the tracking process then is combined with the labeled still frame images to yield training data for the custom model. It will be appreciated by those skilled in the art that, in implementations where the system production model does not yet exist, the development of the system production model can be efficiently achieved by the process of developing the iterated model, in some embodiments including the machine assisted labeling and active learning subprocesses described in greater detail hereinafter.
For instances in which an object, or at least a CAD or similar model thereof, is available but a suitable volume of different images is not, in a related aspect of the invention a wide variety of synthetic images can be generated.
It is therefore one object of the present invention to provide a deep learning system capable of combining two or more teacher models trained for detection of different objects into a single student model capable of detecting all of the objects of the two or more teacher models.
It is another object of the present invention to provide a deep learning system capable of combining two or more teacher models, trained for detection of different objects where at least one is trained for detection of a plurality of objects, into a single student model configured to detect all of the objects of the two or more teacher models.
A further object of the present invention is to provide a deep learning system capable of combining two or more teacher models, each trained for detection of one or more objects wherein at least one of the teacher models is trained to detect objects different from the objects the other teacher models are trained to detect, into a single student model configured to detect a combination of objects comprising one or more objects from each teacher model.
It is a further object of the present invention to provide a deep learning system capable of combining two or more teacher models trained for classification and detection of different objects into a single student model that is able to detect all the objects.
It is a further object of the invention to provide a system wherein new classes of objects can be added to a previously developed and trained object detector network without requiring an operator to relabel objects in the training data of the previously developed object detector.
It is yet another object of the present invention to provide a deep learning system and method configured to achieve acceptable accuracy using partially labeled datasets to train a model.
It is another object of the present invention to provide a deep learning system and method that achieves acceptable accuracy while using different sets where only a single or only some objects of interest are labeled.
It is a further object of the present invention to provide a deep learning network and process that uses active learning to reduce the number of labeled images required to develop a model successfully.
It is a still further object of the present invention to provide a system and process for improving object classification and detection accuracy through the use of machine assisted labeling.
Yet another object of the present invention is to provide a system and process for improving object detection through the use of active learning.
It is another object of the present invention to provide a system and process for improving object detection through the iterative application of low shot learning to a reduced image set.
It is an additional object of the present invention to provide a deep learning system capable of combining two or more teacher models trained for classification and detection of different objects into a single student model containing a combination of objects from each of the teacher models.
A still further object of the invention is to provide an optimization process for teacher-student models comprising distillation.
Another object of the present invention is to provide an optimization process for teacher-student models comprising a classifier and a regressor.
Yet a further object of the present invention is to provide a system and method having improved computational efficiency for optimizing a merged model.
It is yet another object of the present invention to provide a system and method for developing a system production model through use of an iterative model augmented with active learning, machine labeling, or both.
A further object of the present invention is to provide a system and method for ordering data such as images according to an uncertainty score.
Yet a further object of the present invention is to provide a system and method for visually differentiating labels proposed by the system for operator review from labels below a threshold.
A still further object of the present invention is to provide a system and method for providing to an operator an opportunity to review images having an uncertainty score above a threshold value, where the operator can be either automated or human.
These and other objects of the invention can be better appreciated from the following Detailed Description of the Invention, taken together with the appended Figures briefly described below.
DETAILED DESCRIPTION OF THE INVENTION
The present invention enables a user to create an object detection model for custom objects, and to then use that custom model to find those objects in video and still frame imagery, where that imagery can be either live or pre-recorded. In an embodiment of an aspect of the invention, the training of the custom object detection model is achieved with a volume of training data substantially less than in many prior art systems. In an embodiment of a further aspect of the invention, the custom model, together with a backbone object detection neural network that is pretrained on a variety of objects, forms the teacher portion of a teacher-student ensemble network which permits development of an optimized student object detection model with significantly improved computational efficiency. In an embodiment, each of the networks is a “Single Shot Multibox Detector” or “SSD” neural network for the detection task, with classification and regression performed at and relative to anchor boxes, where, in at least some implementations, the predefined, fixed grid of anchor boxes is spread uniformly through the image. While the following description assumes a supervised learning model, those skilled in the art will recognize, once they have digested the teachings herein, that unsupervised learning can also be used in at least some embodiments. In particular, if a model is “pre-trained” on a large amount of video data, all using unsupervised data (essentially self-supervision), the amount of fine tuning needed to build a specific model would be significantly reduced. While such a general purpose model that will work for any scene requires substantial compute power and data storage, data with considerable redundancy will reduce compute and data storage needs considerably. Where the data source is a specific camera or group of cameras, which is a common configuration wherein a specific camera will see a highly regular scene with a lot of redundancy, an unsupervised learning system can reduce the tuning time.
To develop a custom object detector model, a set of representative images of the object of interest is gathered. The images can come from an existing or newly captured dataset or, in some embodiments, can be generated synthetically, as discussed in greater detail below in connection with
Each of the images is then labeled by identifying all of the occurrences of the object of interest and drawing a tight bounding box enclosing the entire object without extraneous elements. The minimum number of images for generation of a model can vary depending upon the size of the dataset and the nature of the objects being sought, but is typically between 10 and 1,000, with 50 images an exemplary number.
Once a sufficient quantity of images has been labeled, training is performed by the associated SSD, which may be operating with any of a variety of backbones, for example Resnet50, Resnet34, InceptionV3, or numerous other SSD variations, but, for at least some embodiments, with the weights unfrozen so that the detectors can be fine-tuned for a specific task by propagating the gradient of the loss function from the top to the bottom. The output of the SSD comprises a first model. That model, together with an extensively trained system production model, comprises the “teacher” side of a teacher-student network, where the teacher networks are merged in an optimizing step using a novel form of distillation and the output of that step is a student model capable of detecting objects in all of the classes for which the system production model is trained plus all classes that can be detected by the iterated model. In some embodiments, no system production model will have been previously developed; that case is addressed further below.
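By way of illustration only, the following is a minimal sketch of such fine-tuning with unfrozen weights, assuming PyTorch and torchvision's SSD300/VGG16 variant; the patent does not prescribe a framework, and the learning rate and dummy data here are placeholders.

```python
# Sketch: fine-tuning an SSD with all weights unfrozen, so the gradient
# of the loss propagates from the top of the network to the bottom.
# torchvision's SSD300/VGG16 stands in for the backbones named above.
import torch
from torchvision.models.detection import ssd300_vgg16, SSD300_VGG16_Weights

model = ssd300_vgg16(weights=SSD300_VGG16_Weights.DEFAULT)

# Explicitly mark every parameter trainable (i.e., weights unfrozen),
# rather than freezing the pretrained backbone.
for param in model.parameters():
    param.requires_grad = True

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# One illustrative training step on a dummy labeled image; in practice
# the classification head would also be resized to the custom classes.
images = [torch.rand(3, 300, 300)]
targets = [{"boxes": torch.tensor([[50.0, 60.0, 200.0, 220.0]]),
            "labels": torch.tensor([1])}]

model.train()
loss_dict = model(images, targets)  # classification + box-regression losses
loss = sum(loss_dict.values())
optimizer.zero_grad()
loss.backward()
optimizer.step()
```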
The model is then tested against a set of images for validation, which provides an indication of how well the model performs. As discussed below, and depending upon the embodiment, various feedback and iterative techniques can be implemented to improve the model. In at least some embodiments, it is desirable to provide interoperability between the system production model and the custom, or iterated, model. Thus, in an embodiment, the two teacher models use a common vocabulary of object classes, where an operator seeking to designate a new class can see the previously trained classes and thus avoid duplication. Further, in an embodiment, the models use the same deep neural network framework, although such commonality is not required in all embodiments. In other embodiments, interoperability can be achieved where the neural network models are understandable in both frameworks, for example using the ONNX format although the ONNX format does not always yield successful results without operator intervention. It will be appreciated by those skilled in the art that, if the networks are interoperable, the custom model can be merged with the system production model. Further, should the system production model yield poor results, for example as the result of poor labeling, the images from the system production model can be supplied to the image set of the present invention such that any labeling errors can be corrected, resulting in a more accurate production model.
At step 115, the classes of objects are defined, for example, “red ball”, or “sunflower”, or any other appropriate term. The descriptors for the class are assigned by the operator in many embodiments, although it will be appreciated that, if synthetic data is used, the object is already defined and, as with step 110, no human involvement is required. Next, at step 120, at least some of the images from the collected image set are labeled by applying bounding boxes tightly around each occurrence of the object in the images. While human intervention is required to apply bounding boxes for many types of images, for at least synthetic images the labeling can be performed automatically, since the process of generating a synthetic image includes knowing where the object is within the image. Next, at step 125, the model is trained by processing the labeled images in an appropriate neural network, where the result is an iterated model 130. The training process typically uses an SSD as described above, although in some instances a Low Shot Learning approach can yield an iterated model faster, with less labor in acquiring training data. Other types of deep learning networks suitable for detecting objects in imagery are also acceptable. In an embodiment, the backbone or deep residual network of the SSD can be the Resnet50 architecture, although architectures such as InceptionV3, Resnet34 with an additional layer for smaller objects, or any other functionally equivalent architecture may also be acceptable.
The output of the iterated model 130 is a set of images and labeling data, where the top layer classifier for the iterated model will have two outputs, specifically new-class versus background. That output is supplied to an optimization process 135, described in more detail in connection with
Active learning 145, discussed in greater detail in connection with
Through repetition of the cycle of labeling, training, creating the iterated model, testing for uncertainty, then sending the least certain images back to the operator for reassessment, the model is iteratively improved. Because the uncertainty threshold or selection process can be adjusted according to any convenient criteria, the size of the group of images sent back for review by the operator can be comparatively small compared to the full dataset, with the result that a relatively small volume of images can, through iterative assessment, refine the iterated model 130 until it achieves acceptable accuracy. This reduces the labor involved and can also reduce computational expense.
As noted above, the output of the iterated model 130 is also supplied to an optimization process 135, which also receives as an input the images and a system production model 150. The system production model 150 and the iterated model 130 form the teacher pair of networks, where each is trained for different objects and, through optimization process 135, their trainings are combined into a single student model, specifically merged model 155, trained to detect any object or objects that could have been detected by either (or both) the system production model or the iterated model. The merged model will have (N+1)+1=(N+2) outputs where the last “+1” is for the background class. Omitted from
The output of the merged model is then deployed, step 160, where it is applied to the production data 165. The results of that deployment are then fed back to step 120, as were the images labeled by the machine-assisted labeling process 140 and the active learning process 145, to allow the operator to correct the labeling of any images that the operator determines were mislabeled. It will be appreciated that, depending upon the embodiment, the feedback from one or more of the feedback sources 140, 145 and 165 is optional.
Further, in implementations where the system production model still needs to be developed, the foregoing steps can be used to create the system production model simply by executing the above-described process steps but without inclusion of the system production model and its associated dataset as inputs. As just one example, in an embodiment, the first execution of the process of the invention, including the aforementioned feedback as desired, classifies and detects a first object. That model, while capable of classifying and detecting only a first object, can be used as a nascent system production model, where each successive execution of the process adds an additional object to the objects that can be detected by that developing system production model. The collection of training data developed through successive addition of objects to the developing system production model becomes the system production training dataset. For purposes of the present invention, the foregoing description of the development of the system production model is not intended to be limiting, and the system production model can be developed in any suitable manner. The following description of the invention assumes a pre-existing system production model unless specifically stated to the contrary, although it will be apparent to those skilled in the art, upon digesting the details presented hereinafter, how to modify those processes and systems to develop the system production model if one does not yet exist.
Referring next to
To begin, in an embodiment a user assigns a name to an object of interest and then labels a batch of unlabeled images 200A. In some embodiments, the batch may range in size from about ten images to 1000 or more images, at least partly based on the size of the production data set. The images in the batch are then labeled, step 210, where step 120 of
The result of the training step 125 is the first iteration of iterated model 130, which also functions as a teacher as discussed further below and shown in simplified form in
Still referring to
Still referring to
In an additional aspect of some embodiments of the invention, tracking of video snippets, indicated at 270 in
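A minimal sketch of such label propagation appears below, assuming OpenCV's CSRT tracker (available in opencv-contrib builds); the file name, box coordinates and class label are hypothetical.

```python
# Sketch: propagating an operator's single-frame label through a video
# snippet with a visual tracker, so only the first frame is labeled by
# hand. CSRT requires an opencv-contrib build; names are hypothetical.
import cv2

cap = cv2.VideoCapture("snippet.mp4")
ok, first_frame = cap.read()

initial_box = (120, 80, 64, 48)  # operator label: (x, y, width, height)
label = "red ball"

tracker = cv2.TrackerCSRT_create()
tracker.init(first_frame, initial_box)

# Each tracked box becomes labeling data for its frame.
propagated = [(0, label, initial_box)]
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame_idx += 1
    tracked, box = tracker.update(frame)
    if tracked:
        propagated.append((frame_idx, label, tuple(int(v) for v in box)))
cap.release()
```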
Referring again to
While distillation is known where the task is to classify an image into different categories, the present invention extends this concept to a detection task, where the model is required to report not only whether an object of a particular class exists in an image, but also the location of that object in the image, with the location typically represented by a tight bounding box around the object. In an embodiment, the present invention enables combining two or more teacher models trained for different objects into a single student model containing all the objects, and also enables using only partially labeled datasets to train a model. That is, at least some embodiments of the invention enable using different datasets where only a single one or only some objects of interest are labeled, thus saving substantial effort in that it becomes unnecessary to review all the data and relabel all the objects in all the images.
Thus, in
In an embodiment of the invention, a “Single Shot Multibox Detector” (SSD) neural network is used for the detection task. Classification and regression are performed at and relative to a predefined, fixed grid of boxes called “anchor boxes”. For a large set of anchor boxes spread uniformly through the image, the SSD algorithm trains a network to perform two tasks, classification and regression, where classification determines the probability distribution of the presence of any of the objects of interest, or the background, at an anchor box, and regression determines the bounding box of the object that is detected at the anchor box.
Classification is modeled as a softmax function to output the confidence of a foreground class or the background class:

P(Ck|X)=exp(sk(X))/(exp(sB(X))+Σj exp(sj(X)))

for foreground classes Ck and background class B, for the anchor box X, where sk(X) denotes the network's score for class Ck at X. Note here that background is treated as just one class amongst all the classes modeled by the softmax function. The background class is trained by extracting negative examples around the positive examples in the labeled images. The loss function for training the classifier is a cross-entropy loss defined for every association of anchor box to a label denoted by xLabel,Anchor [Eq. 1, below]:

Lconf(C,X)=−Σ xLabel,Anchor log P(Label|X)  [Eq. 1]
Regression is modeled as a non-linear multivariate regression function that outputs a four-dimensional vector representing the center coordinates, width and height of the bounding box enclosing the object in the image. The loss function for training the regressor is a smoothL1 loss function Lloc(C,X). Only foreground objects are used for training the regressor, as the background class has no boundaries [Eq. 2, below]:

Lloc(C,X)=Σ xLabel,Anchor×smoothL1(ΔxPred−ΔxBox)  [Eq. 2]
Here xLabel,Anchor is 1 for an association between a positive label and a predefined anchor box. ΔxBox is the offset of the ground truth label relative to the associated anchor box, ΔxPred is the predicted bounding box from the network.
For training a standard SSD[3] model, parameters are learned that minimize Lconf(C,X)+Lloc(C,X) defined over a selective set of positive and negative anchor boxes X, chosen using the labels from manually annotated images. These labels are called hard labels, with one-hot encoding for positive samples.
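The following sketch illustrates the combined objective Lconf+Lloc in PyTorch, simplified to omit anchor matching and hard-negative mining; the tensor shapes and toy values are assumptions for illustration.

```python
# Simplified sketch of the SSD objective Lconf + Lloc over a set of
# anchors, omitting anchor matching and hard-negative mining. Shapes
# and toy values are illustrative only.
import torch
import torch.nn.functional as F

def ssd_losses(class_logits, box_deltas, anchor_labels, box_targets):
    # class_logits: (A, K+1) scores per anchor, K foreground classes
    #               plus background (class 0)
    # box_deltas:   (A, 4) predicted offsets relative to each anchor
    # anchor_labels:(A,)   hard label per anchor, 0 = background
    # box_targets:  (A, 4) ground-truth offsets relative to each anchor

    # Eq. 1: cross-entropy over softmax class confidences.
    l_conf = F.cross_entropy(class_logits, anchor_labels)

    # Eq. 2: smooth-L1 on foreground anchors only, since the background
    # class has no bounding box.
    positive = anchor_labels > 0
    l_loc = F.smooth_l1_loss(box_deltas[positive], box_targets[positive])
    return l_conf + l_loc

# Toy example: 8 anchors, 2 foreground classes + background.
logits = torch.randn(8, 3)
deltas = torch.randn(8, 4)
labels = torch.tensor([0, 0, 1, 0, 2, 0, 0, 1])
targets = torch.randn(8, 4)
print(ssd_losses(logits, deltas, labels, targets))
```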
As a part of the workflow for the present invention, an operator will train multiple detectors by labeling multiple sets of data where only a particular object of interest is labeled in each dataset. Distillation enables an operator to train a single student model from multiple teacher models without losing accuracy, and without requiring the operator to label all the objects on all the datasets. The advantage of doing this is the performance gain resulting from running a single detector instead of multiple detectors.
The teacher in this case constitutes multiple networks of similar complexity, where each network is able to detect a new class of object as trained by the user. The student is a new network of similar complexity as the teacher models, where the goal is to distill the knowledge from multiple teacher models into a single student model.
While the distillation process can be performed on any number of teacher networks, as an example, the algorithm can be illustrated by using two teacher networks M1 and M2 to train a student network M. The teacher networks are trained to detect class C1 and C2 with the respective “background” classes B1 and B2. “Background”, in this context, means regions that do not contain the object of interest. (Labeled-Data)1 and (Labeled-Data)2 are employed for training M1 and M2 that have only their respective classes labeled.
In an embodiment, the student model is a single deepnet model M with two classes and a single background class B that is an intersection of classes B1 and B2. The probability mapping for the combined model can be performed as follows. For the input X, the models for (Labeled-Data)1 and (Labeled-Data)2 have class probabilities P(C1|M1,X) and P(C2|M2,X) respectively. The corresponding background probabilities are P(B1|M1,X) and P(B2|M2,X) respectively. The probabilities for the teacher models are computed as follows:
PTeacher(B|X)=P(B1|M1,X)×P(B2|M2,X)
PTeacher(C1|X)=P(C1|M1,X)×P(B2|M2,X)
PTeacher(C2|X)=P(C2|M2,X)×P(B1|M1,X)
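These mappings translate directly into code. The sketch below merges the per-anchor outputs of two single-class teachers exactly as in the three equations above; the numeric example is illustrative only.

```python
# Direct transcription of the three teacher-probability equations above
# for two single-class teachers M1 and M2; inputs would be each model's
# softmax outputs at a given anchor box X.
def merge_teacher_probs(p_c1_m1, p_b1_m1, p_c2_m2, p_b2_m2):
    p_b = p_b1_m1 * p_b2_m2    # background in both models
    p_c1 = p_c1_m1 * p_b2_m2   # C1 present, M2 sees background
    p_c2 = p_c2_m2 * p_b1_m1   # C2 present, M1 sees background
    return p_b, p_c1, p_c2

# Example: M1 is confident of C1 while M2 sees only background.
print(merge_teacher_probs(0.9, 0.1, 0.05, 0.95))  # (0.095, 0.855, 0.005)
```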
In this example, the loss terms for training the SSD comprise a loss term for the classifier and a loss term for the regressor, shown in Eq. 1 and Eq. 2, above. In the present invention, the loss function for training a student model is a linear combination of two loss functions:
- Loss1: Positive labels are hard labels that are extracted from (Labeled-Data)1 and (Labeled-Data)2 where only positive labels are sampled and no negative samples are extracted because it isn't known whether a negative sample for class C1 has a class C2 object (and vice-versa).
- a. For training the classifier, only positive examples are used in the cross-entropy loss of Eq. 1, above. Here xLabel,Anchor is 1 for the correct class and 0 for the rest of the classes.
- b. For training the regressor, only bounding boxes for positive labels are required. The smooth-L1 loss is used, as defined in the loss function of Eq. 2, above.
- Loss2: For each object, extract a quantity (for example, 400) of the top detection bounding boxes Pos1 and Pos2 with a score greater than 0.01 from models M1 and M2 respectively:
- a. These are soft labels for the SSD classifier and are used in a cross-entropy loss for training the classifier. Instead of using hard binary targets, soft targets are used in the cross-entropy loss for training the student model M:
- Loss2,conf=−Σ PTeacher(C|X) log P(C|M,X)
- b. For training the regressor, for each sample, compute the regression target by weighting the smooth-L1 loss by the classification score:
- Loss2,loc=Σ PTeacher(C|X)×smoothL1(ΔxPred−Δx)
Here, X represents the anchor box associated with positive soft labels and Δx represents the difference between the soft label and the associated anchor box X. Thus a highly confident classification score will have more influence in optimizing the corresponding regression loss (the smoothL1 loss). A bounding box that does not have a high-confidence C1 or C2 score will most likely be background and will not have any significant influence on the regression function.
The combined loss is α*Loss1+(1−α)*Loss2, where α is used to control the weights of the combined loss and, in an embodiment, is set to 0.25. Note that any amount of representative unlabeled data can also be used to train a student model from the teacher models M1 and M2. There, only the Loss2 term is employed, as there are only soft labels from the models, and no hard labels as used in the Loss1 term.
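A minimal sketch of this combined objective follows, assuming PyTorch; Loss1 is passed in precomputed, Loss2 is reduced to its soft-target classification term, and the score-weighted regression term is omitted for brevity.

```python
# Sketch of the combined objective alpha*Loss1 + (1-alpha)*Loss2 with
# alpha = 0.25 as above. Loss1 (hard labels) is passed in precomputed;
# Loss2 is reduced here to its soft-target classification component.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, hard_loss, teacher_probs, alpha=0.25):
    # teacher_probs: merged teacher soft targets at the same anchors.
    log_p_student = F.log_softmax(student_logits, dim=-1)
    soft_loss = -(teacher_probs * log_p_student).sum(dim=-1).mean()
    return alpha * hard_loss + (1.0 - alpha) * soft_loss

# Toy example: 4 anchors, 2 foreground classes + background.
logits = torch.randn(4, 3)
teacher = torch.softmax(torch.randn(4, 3), dim=-1)
print(distillation_loss(logits, torch.tensor(0.7), teacher))
```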
Referring next to
Referring next to
The system organizes the unlabeled data according to each datum's uncertainty score, after which the operator is invited to label a batch of the unlabeled data having the highest uncertainty scores. The model is then retrained using all of the labeled data, yielding an improved result. This cyclic process of labeling, training and querying is continued until the model converges or the validation accuracy is deemed satisfactory by the user. By using active learning, the customers are able to train a model with high accuracy by only labeling a small subset of the raw data, for example as few as ten images for some models and as many as 1000 images or more for other models, based at least in part on the size of the dataset.
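The query step of this cycle might be sketched as follows; the entropy-based uncertainty score and the detector interface are assumptions, as the patent does not fix a particular formula.

```python
# Sketch of the active-learning query step: score each unlabeled image
# by model uncertainty and send the most uncertain batch to the
# operator. Entropy over per-detection class distributions is one
# plausible score; detector() is a hypothetical interface.
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_batch(unlabeled, detector, batch_size=10):
    # detector(image) -> list of per-detection class probability
    # distributions, each a list of floats summing to 1.
    scored = []
    for image in unlabeled:
        dists = detector(image)
        score = max((entropy(d) for d in dists), default=0.0)
        scored.append((score, image))
    scored.sort(key=lambda s: s[0], reverse=True)  # most uncertain first
    return [image for _, image in scored[:batch_size]]
```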
In some instances, the object is available physically but there are insufficient images of the object in context, i.e., with an appropriate background, to create a dataset adequate to train a model to yield sufficiently accurate results. In other cases, no physical example exists, but a 3D computer model is available. In such circumstances, the generation of synthetic images can offer a number of advantages. An embodiment of such an approach can be appreciated from
The details of the object are then provided from 525 to a blending process, 535, which also receives data representative of at least color, tone, texture and scale of the scene depicted in a background image, 540, as well as characterizing information specifying position and angle of view of a virtual camera, 545, together with characteristics of the virtual camera such as distortion, foreshortening, compression, etc. The virtual camera can be defined by any suitable digital representation of a model of camera. The process 535 modifies the object in accordance with the context of the background image, including color and texture matching as well as scaling the object to be consistent with its location in the background image, and adjusts the image of the object by warping, horizontally or vertically tilting the object, and other similar photo post-processing techniques to give the synthetic representation of the object proper scale, perspective, distortion representative of the camera lens, noise, and related camera characteristics. The blended and scaled object image from step 555 is then provided to a renderer 560 which places the blended and scaled object into the background image. To achieve that result, the renderer 560 also receives the background image 540 and the camera information, 545 and 550. The result is a synthetic image 565 of the object in the background image, usable in dataset 200 of
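As a greatly simplified sketch of the compositing stage, the following uses Pillow to scale an object and alpha-composite it into a background; the full blending pipeline described above (color, tone and texture matching, lens distortion, and so on) is not reproduced, and the file names and placement are hypothetical.

```python
# Greatly simplified sketch of the compositing stage using Pillow:
# scale the rendered object for its location, alpha-composite it into
# the background, and derive the label automatically.
from PIL import Image

background = Image.open("parking_lot.jpg").convert("RGBA")
obj = Image.open("rendered_object.png").convert("RGBA")  # alpha mask

# Scale the object to be consistent with its position in the scene.
scale = 0.25
obj = obj.resize((int(obj.width * scale), int(obj.height * scale)))

paste_xy = (430, 310)
background.alpha_composite(obj, dest=paste_xy)
background.convert("RGB").save("synthetic_image.jpg")

# Because the placement is known, the bounding-box label comes free.
bbox = (paste_xy[0], paste_xy[1],
        paste_xy[0] + obj.width, paste_xy[1] + obj.height)
```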
Referring next to
Next with reference to
The multisensor processor 615 can be a server computer such as maintained on premises or in a cloud network, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a smartphone, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions 635 (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term “machine” is to be understood to include any collection of machines that individually or jointly execute instructions 635 to perform any one or more of the methods or processes discussed herein.
In at least some embodiments, the multisensor processor 615 comprises one or more processors 650. Each processor of the one or more processors 650 can comprise a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), a controller, one or more application specific integrated circuits (ASICs), one or more radio-frequency integrated circuits (RFICs), or any combination of these. In an embodiment, the machine 615 further comprises static memory 655 together with main memory 645, which are configured to communicate with each other via bus 660. The machine 615 can further include one or more visual displays as well as associated interfaces, all indicated at 665, for displaying messages or data. The visual displays may be of any suitable type, such as monitors, head-up displays, windows, projectors, touch enabled devices, and so on. At least some embodiments further comprise an alphanumeric input device 670 (such as a keyboard, touchpad, touchscreen or similar), together with a pointing or other cursor control device 675 (such as a mouse, a trackball, a joystick, a motion sensor, a touchpad, a tablet, and so on), a storage unit or machine-readable medium 640 wherein the machine-readable instructions 635 are stored, a signal generation device 680 such as a speaker, and a network interface device 685. A user device interface 690 communicates bidirectionally with user devices 620 (
Although shown in
While machine-readable medium or storage device 640 is shown in an example embodiment to be a single medium, the term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) able to store instructions (e.g., instructions 635). The term “machine-readable medium” includes any medium that is capable of storing instructions (e.g., instructions 635) for execution by the machine and that cause the machine to perform any one or more of the methodologies disclosed herein. The term “machine-readable medium” includes, but is not limited to, data repositories in the form of solid-state memories, optical media, and magnetic media. The storage device 640 can be the same device as data store 630 (
Where the multisensor data from inputs 700A-700n includes full motion video from terrestrial or other sensors, the processor 615 can, in an embodiment, comprise a face detector 720 chained with a recognition module 725 which comprises an embedding extractor, and an object detector 730. In an embodiment, the face detector 720 and object detector 730 can employ a single shot multibox detector (SSD) network, which is a form of convolutional neural network. SSDs characteristically perform the tasks of object localization and classification in a single forward pass of the network, using a technique for bounding box regression such that the network both detects objects and also classifies those detected objects. Using, for example, the FaceNet neural network architecture, the face recognition module 725 represents each face with an “embedding”, which is a 128-dimensional vector designed to capture the identity of the face, and to be invariant to nuisance factors such as viewing conditions, the person's age, glasses, hairstyle, etc. Alternatively, various other architectures, of which SphereFace is one example, can also be used. In embodiments having other types of sensors, other appropriate detectors and recognizers may be used. Machine learning algorithms may be applied to combine results from the various sensor types to improve detection and classification of the objects, e.g., faces or inanimate objects. In an embodiment, the embeddings of the faces and objects comprise at least part of the data saved by the data saver 710 and encoders 705 to the data store 630. The embedding and entity detections, as well as the raw data, can then be made available for querying, which can be performed in near real time or at some later time.
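A minimal sketch of querying such embeddings follows, using cosine similarity over 128-dimensional vectors with numpy; the gallery contents, threshold and similarity measure are illustrative assumptions.

```python
# Sketch of querying stored 128-dimensional face embeddings with cosine
# similarity; the gallery, threshold and metric are illustrative.
import numpy as np

def best_match(query, gallery, threshold=0.6):
    # query: (128,) embedding; gallery: name -> (128,) embedding.
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {name: cosine(query, emb) for name, emb in gallery.items()}
    name, score = max(scores.items(), key=lambda kv: kv[1])
    return (name, score) if score >= threshold else (None, score)

gallery = {"person_a": np.random.rand(128), "person_b": np.random.rand(128)}
print(best_match(np.random.rand(128), gallery))
```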
Queries to the data are initiated by analysts or other users through a user interface 735 which connects bidirectionally to a reasoning engine 740, typically through network 620 (
Queries are processed in the processor 615 by a query process 755. The user interface 735 allows querying of the multisensor data for faces and objects (collectively, entities) and activities. One exemplary query can be “Find all images in the data from multiple sensors where the person in a given photograph appears”. Another example might be, “Did John Doe drive into the parking lot in a red car, meet Jane Doe, who handed him a bag?”. Alternatively, in an embodiment, a visual GUI can be helpful for constructing queries. The reasoning engine 740, which typically executes in processor 615, takes queries from the user interface via web services and quickly reasons through, or examines, the entity data in data store 630 to determine if there are entities or activities that match the analysis query. In an embodiment, the system geo-correlates the multisensor data to provide a comprehensive visualization of all relevant data in a single model. Once that visualization of the relevant data is complete, a report generator module 760 in the processor 615 saves the results of various queries and generates a report through the report generation step 765. In an embodiment, the report can also include any related analysis or other data that the user has input into the system.
The data saver 715 receives output from the processing system and saves the data on the data store 630, although in some embodiments the functions may be integrated. In an embodiment, the data from processing is stored in a columnar data storage format, such as Parquet as just one example, that can be loaded by the search backend and searched for specific embeddings or object types quickly. The search data can be stored in the cloud (e.g., AWS S3), on premises using HDFS (Hadoop Distributed File System), NFS, or some other scalable storage. In some embodiments, web services 745 together with user interface (UI) 735 provide users such as analysts with access to the platform of the invention through a web-based interface. The web-based interface provides a REST API to the UI. The web-based interface, in turn, communicates with the various components with remote procedure calls implemented using Apache Thrift. This allows the various components to be written in different languages.
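A minimal sketch of such columnar storage, assuming pyarrow's Parquet bindings, follows; the schema and file name are illustrative.

```python
# Sketch of columnar storage for detection records, assuming pyarrow's
# Parquet bindings; the schema and file name are illustrative.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "frame_id": [101, 102],
    "object_type": ["face", "car"],
    "embedding": [np.random.rand(128).tolist(),
                  np.random.rand(128).tolist()],
})
pq.write_table(table, "detections.parquet")

# The search backend loads only the columns it needs.
loaded = pq.read_table("detections.parquet",
                       columns=["frame_id", "embedding"])
```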
In an embodiment, the UI is implemented using React and node.js, and is a fully featured client-side application. The UI retrieves content from the various back-end components via REST calls to the web service. The User Interface supports upload and processing of recorded or live data, as well as generation of query data by examining the recorded or live data. For example, in the case of video, it supports generation of face snippets from an uploaded photograph or from live video, to be used for querying. Upon receiving results from the Reasoning Engine via the Web Service, the UI displays results on a webpage.
A user interface comprises another aspect of the present invention, and various screens of an embodiment of a user interface are shown in
If the operator decides that the existing models would not yield the desired results, the operator can click on “New”, shown at 925, in which case in an embodiment a screen such as shown in
The operator is invited to define a new object by clicking on “New Object”, 1115, which causes, in an embodiment, the screen 1120 of
When the screen 1250 of
In
Depending upon the embodiment, the process of
Once the model has been trained sufficiently, such that the merged model 155 (
To increase or decrease the number of detections, the confidence threshold can be adjusted to any desired level, for example by slider 1580, shown in
The display of confidence percentages can also vary depending upon the selections of the data to be displayed to the operator. For example, in an embodiment of the Analysis Results display, confidence percentages are hidden by default in the video player, and by default also hidden for objects displayed in the larger view shown at 1555. At the same time, by default all detections exceeding a default low confidence threshold, for example one percent, may be returned as search results, optionally arranged by confidence percentage. In contrast, the defaults for Live Monitoring Alerts may be, for example, to return all detections above a default threshold of 20% confidence, with confidence percentages always visible. As noted above, the default values can be adjusted via the settings accessible at icon 1560.
In an embodiment, “inspect” mode reveals to the operator all detections of any searched object or objects above a default confidence level, for example 20%, with the identities of the searched objects visible at 1590. Optionally, the user can be permitted to select which of the objects shown at 1590 are revealed in inspect mode, surrounded by their respective bounding boxes. Again, the confidence threshold can be adjusted in at least some embodiments. Alternatively, inspect mode can also be configured to reveal all objects detected by the system, whether or not a given object is part of the analysis results, or can be configured to allow the operator to incrementally add types or classes of objects that the system will reveal in inspect mode. Inspect mode can thus be used by an operator to reveal associations between or among detected objects, where the types of detections to be revealed varies with each iteration of a search. Inspect mode can also be used as a verification step, to ensure that the system is successfully detecting all objects in a frame or a video sequence regardless of whether they are included in a given search. In any of the modes a given scene can be captured by clicking on “capture scene”, shown at 1595.
Having fully described a preferred embodiment of the invention and various alternatives, those skilled in the art will recognize, given the teachings herein, that numerous alternatives and equivalents exist which do not depart from the invention. It is therefore intended that the invention not be limited by the foregoing description, but only by the appended claims.
Claims
1. A method for developing in one or more processors a student machine learned model for classification and detection of one or more previously specified objects and at least one newly specified object, comprising the steps of
- providing in one or more processors and associated storage a first teacher model comprising a first machine learned model capable of detecting and classifying at least one previously specified object,
- providing in one or more processors and associated storage a second teacher model comprising a second machine learned model configured for being trained to detect and classify at least one newly specified object,
- providing to the first teacher model and the second teacher model a first training dataset representative of the previously specified objects,
- providing to the first teacher model and the second teacher model a second training dataset representative of a newly specified object identified through the use of at least some bounding boxes,
- processing, in both the first teacher model and the second teacher model, at least the first training dataset and the second training dataset and generating a first training output and a second training output, respectively,
- optimizing the first and second training outputs to generate an optimized training output from the processing step by applying classification algorithms for determining the probability distribution, at an anchor box, of the presence of either the background or any of the objects of interest and applying regression algorithms for determining the bounding box of an object that is detected at the anchor box,
- supplying the optimized training output as the student machine learned model configured to classify and detect at least one of the one or more previously specified objects and at least one of the newly specified objects.
2. The method of claim 1 wherein at least the second training dataset comprises in part video snippets.
3. The method of claim 1 wherein at least the second training dataset comprises in part synthetic data.
4. The method of claim 1 wherein the first teacher model and the second teacher model are interoperable.
5. The method of claim 4 wherein at least one of the first teacher model and the second teacher model is a single shot multibox detector.
6. The method of claim 1 wherein the second training output is provided to an operator for correction and the corrected output is processed in a second iteration of the processing step.
7. The method of claim 1 comprising the further step of providing a validation dataset to the first teacher model and the second teacher model.
8. The method of claim 1 wherein at least some images are provided to an operator as the result of an uncertainty calculation for distinguishing an object from background.
9. The method of claim 8 wherein the uncertainty calculation is based in part on a variable threshold.
10. The method of claim 1 wherein a grid of anchor boxes is distributed uniformly throughout an image.
11. The method of claim 1 wherein classification is modeled as a softmax function to output confidence of a foreground class or a background class.
12. The method of claim 1 wherein regression is modeled as a non-linear multivariate regression function.
13. The method of claim 12 wherein the multivariate regression function outputs a four-dimensional vector representing center coordinates, width and height of the bounding box enclosing the object in the image.
14. The method of claim 1 wherein the system training output is provided to an operator for correction and the corrected output is processed in a second iteration of the processing step.
15. The method of claim 1 in which the second teacher model comprises a plurality of second teacher models, each comprising a second machine learned model capable, following training, of detecting and classifying at least one newly specified object.
16. The method of claim 1 further comprising the step of applying at least one of active learning and machine assisted labeling to the output of the second teacher model for correction of missed or mislabeled images.
17. A system for developing a student machine learned model for classification and detection of one or more previously specified objects and at least one newly specified object comprising
- one or more processors and associated storage coupled to the one or more processors and having stored therein instructions executable by the processors wherein the instructions when executed comprise a first machine learned model configured as a first teacher model capable of detecting and classifying one or more previously specified objects identified through the use of at least some anchor bounding boxes, a second machine learned model configured as a second teacher model capable, following training, of detecting and classifying at least one newly specified object, a first training dataset representative of the previously specified objects and a second training dataset representative of a newly specified object identified through the use of at least some bounding boxes,
- the processors being operable when executing the instructions to process, in both the first machine learned model and the second machine learned model, at least the first training dataset and the second training dataset to generate a first training output and a second training output, respectively to optimize the first training output and the second training output in order to generate an optimized training output by applying classification algorithms for determining the probability distribution, at an anchor box, of the presence of either the background or any of the objects of interest and applying regression algorithms for determining the bounding box of an object that is detected at the anchor box, and to supply the optimized training output as the student machine learned model configured to classify and detect at least some of the one or more previously specified objects and at least one of the newly specified objects.
18. The system of claim 17 wherein at least one of the first and second machine learned models is selected from a group comprising a single shot multibox detector and a low shot learning detector.
19. The system of claim 17 in which the second machine learned model comprises a plurality of teacher models, each capable, following training, of detecting and classifying at least one newly specified object.
20. The system of claim 17 wherein new unlabeled data is processed in both the first teacher model and the second teacher model.
21. The system of claim 17 wherein the optimized training output is provided to an operator for correction and the instructions cause the processor to reiterate execution of the process including the corrected output.
22. The system of claim 17 wherein at least one of active learning and machine assisted labeling is applied to the output of the second teacher model for correction of missed or mislabeled images.
23. One or more computer-readable non-transitory storage media embodying software that is operable when executed to:
- provide a first teacher model comprising a first machine learned model capable of detecting and classifying one or more previously specified objects identified through the use of at least some anchor bounding boxes,
- provide a second teacher model comprising a second machine learned model capable, following training, of detecting and classifying at least one newly specified object,
- provide a first training dataset representative of the previously specified objects to the first teacher model and the second teacher model,
- provide a second training dataset representative of a newly specified object identified through the use of at least some bounding boxes to the first teacher model and the second teacher model,
- process, in both the first teacher model and the second teacher model, at least the first training dataset and the second training dataset and generate a first training output and a second training output, respectively,
- optimize the first training output and the second training output by applying classification algorithms for determining the probability distribution, at an anchor box, of the presence of either the background or any of the objects of interest and applying regression algorithms for determining the bounding box of an object that is detected at the anchor box to generate an optimized training output,
- supply the optimized training output as the student machine learned model configured to classify and detect at least some of the one or more previously specified objects and at least one of the newly specified objects.
24. The storage media of claim 23 where at least the second training dataset comprises in part video snippets.
25. The storage media of claim 23 wherein the second training dataset comprises at least in part video snippets.
26. The storage media of claim 23 wherein the second training dataset comprises at least in part synthetic data.
27. The storage media of claim 23 wherein classification is modeled as a softmax function to output confidence of a foreground class or a background class.
28. The storage media of claim 23 wherein regression is modeled as a non-linear multivariate regression function.
29. The storage media of claim 28 wherein the multivariate regression function outputs a four-dimensional vector representing center coordinates, width and height of the bounding box enclosing the object in the image.
30. The storage media of claim 23 wherein the software, when executed, applies to the output of the second teacher model at least one of active learning and machine assisted labeling for correction of missed or mislabeled images.
Type: Application
Filed: Apr 24, 2023
Publication Date: Oct 19, 2023
Applicant: PERCIPIENT .AI INC. (Santa Clara, CA)
Inventors: Vasudev Parameswaran (Fremont, CA), Atul KANAUJIA (San Jose, CA), Simon CHEN (Pleasanton, CA), Jerome BERCLAZ (San Jose, CA), Ivan KOVTUN (San Jose, CA), Alison HIGUERA (San Jose, CA), Vidyadayini TALAPADY (Sunnyvale, CA), Derek YOUNG (Carbondale, CO), Balan AYYAR (Oakton, VA), Rajendra SHAH (Cupertino, CA), Timo PYLVANAINEN (Menlo Park, CA)
Application Number: 18/138,604