Systems and methods for contextualizing computer vision generated tags using natural language processing
This disclosure relates to systems, methods, and computer readable media for performing filtering of computer vision generated tags in a media file for the individual user in a multi-format, multi-protocol communication system. One or more media files may be received at a user client. The one or more media files may be automatically analyzed using computer vision models, and computer vision generated tags may be generated in response to analyzing the media file. The tags may then be filtered using Natural Language Processing (NLP) models, and information obtained during NLP tag filtering may be used to train and/or fine-tune one or more of the computer vision models and the NLP models.
This application is a continuation of U.S. patent application Ser. No. 14/986,219, filed Dec. 31, 2015, and entitled “SYSTEMS AND METHODS FOR FILTERING OF COMPUTER VISION GENERATED TAGS USING NATURAL LANGUAGE PROCESSING” which is hereby incorporated by reference in its entirety.
TECHNICAL FIELD
This disclosure relates generally to systems, methods, and computer readable media for filtering of computer vision generated tags using natural language processing and computer vision feedback loops.
BACKGROUND
The proliferation of personal computing devices in recent years, especially mobile personal computing devices, combined with a growth in the number of widely-used communications formats (e.g., text, voice, video, image) and protocols (e.g., SMTP, IMAP/POP, SMS/MMS, XMPP, etc.), has led to a communication experience that many users find fragmented and difficult to search for relevant information. Users desire a system that will discern meaningful information about visual media sent and/or received across multiple formats and communication protocols and provide more relevant universal search capabilities, with ease and accuracy.
In a multi-protocol system, messages can include shared items that contain files, or pointers to files, that may have visual properties. These files can include images and/or videos that lack meaningful tags or descriptions about the nature of the image or video, leaving users unable to discover said content in the future via search or any means other than direct user lookup (i.e., a user specifically navigating to a precise file in a directory or an attachment in a message). For example, a user may have received email messages containing visual media from various sources over the user's lifetime. However, due to the passage of time, the user may be unaware where the particular visual media (e.g., image/picture and video) may have been stored or archived. Therefore, the user may have to manually search through the visual images or videos so as to identify an object, e.g., an animal or a plant that the user remembers viewing in the visual media when it was initially received. This can be time-consuming, inefficient, and frustrating for the user. In some cases wherein the frequency of visual media sharing is high, this process can result in a user not being able to recall any relevant detail of the message for lookup (such as exact timeframe, sender, filename, etc.) and therefore “lose” the visual media, even though the visual media is still resident in its original system or file location.
Recently, a great deal of progress has been made in large-scale object recognition and localization of information in images. Most of this success has been achieved by enabling efficient learning of deep neural networks (DNN), i.e., neural networks with several hidden layers. Although deep learning has been successful in identifying some information in images, a human-comparable automatic annotation of images and videos (i.e., producing natural language descriptions solely from visual data or efficiently combining several classification models) is still far from being achieved.
In large systems, recognition parameters are not personalized at a user level. For example, recognition parameters may not account for user preferences when searching for content in the future, and can return varying outputs based on a likely query type, importance, or object naming that is used conventionally (e.g., what a user calls a coffee cup versus what other users may call a tea cup, etc.). Therefore, the confidence of the output results may change based on the query terms or object naming.
The subject matter of the present disclosure is directed to overcoming, or at least reducing the effects of, one or more of the problems set forth above. To address these and other issues, techniques that enable filtering or “de-noising” computer vision-generated tags or annotations in images and videos using feedback loops are described herein.
Disclosed are systems, methods, and computer readable media for extracting meaningful information about the nature of a visual item in computing devices that have been shared with participants in a network across multiple formats and multiple protocol communication systems. More particularly, but not by way of limitation, this disclosure relates to systems, methods, and computer readable media to permit computing devices, e.g., smartphones, tablets, laptops, wearable devices, and the like, to detect and establish meaningful information in visual images across multi-format/multi-protocol data objects that can be stored in one or more centralized servers. Also, the disclosure relates to systems, methods, and computer-readable media to run visual media through user-personalized computer vision learning services to extract meaningful information about the nature of the visual item, so as to serve the user more relevant and more universal searching capability. For simplicity and ease of understanding, many examples and embodiments are discussed with respect to communication data objects of one type (e.g., images). However, unless otherwise noted, the examples and embodiments may apply to other data object types as well (e.g., audio, video data, emails, MMS messages).
As noted above, the proliferation of personal computing devices and data object types has led to a searching experience that many users find fragmented and difficult. Users desire a system that will provide instant and relevant search capabilities whereby the searcher may easily locate a specific image or video which has been shared with them using any type of sharing method and which may or may not contain any relevant text-based identification matching the search query string, such as a descriptive filename, metadata, user-generated tags, etc.
As used herein, computer vision can refer to methods for acquiring, processing, analyzing, and understanding images or videos in order to produce meaningful information from the images or videos.
In at least one embodiment, a system, method, and computer-readable media for filtering Computer Vision (CV) generated tags or annotations on media files is disclosed. The embodiment may include running or implementing one or more image analyzer (IA) models from an image analyzer (IA) server on the media files for generating CV tags. In an embodiment, the models can include object segmentation, object localization, object detection/recognition, natural language processing (NLP), and a relevance feedback loop model for training and filtering.
In another embodiment, the image analyzers (IA) may be sequenced based on a particular user and the evolving nature of the underlying algorithms. For example, the sequencing of IA models may change as algorithms for actual NLP detection, classification, tagging, etc. evolve. The sequencing of IA models may also be changed based on the user. For example, knowing that user A typically searches for people and not scenery, the IA sequencing may be adjusted to run additional models for facial recognition and action detection, while avoiding models for scene detection.
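As a rough, non-limiting illustration of this kind of per-user sequencing, the sketch below selects an ordered subset of analyzer models from a toy profile of the user's past search topics. The `UserProfile` structure, the model names, and the `select_ia_models` helper are hypothetical conveniences for the illustration and are not drawn from the disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    """Toy per-user profile: relative frequency of the user's past search topics."""
    search_topic_counts: dict = field(default_factory=dict)

# Candidate image-analyzer (IA) models, keyed by the kind of content they target.
IA_MODELS = {
    "faces": "facial_recognition_model",
    "actions": "action_detection_model",
    "scenery": "scene_recognition_model",
    "objects": "generic_object_detection_model",
}

def select_ia_models(profile: UserProfile, top_k: int = 2) -> list:
    """Return an ordered subset of IA models, prioritizing the content types the
    user actually searches for (e.g., people and actions over scenery)."""
    ranked_topics = sorted(profile.search_topic_counts,
                           key=profile.search_topic_counts.get,
                           reverse=True)
    selected = [IA_MODELS[t] for t in ranked_topics if t in IA_MODELS][:top_k]
    if IA_MODELS["objects"] not in selected:      # always keep a generic detector
        selected.append(IA_MODELS["objects"])
    return selected

if __name__ == "__main__":
    user_a = UserProfile({"faces": 42, "actions": 17, "scenery": 2})
    print(select_ia_models(user_a))
    # ['facial_recognition_model', 'action_detection_model', 'generic_object_detection_model']
```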
In another embodiment, the relevance feedback model can include a feedback loop in which ‘generic’ tags created for objects may be processed or filtered with personalized NLP; the system then searches for the filtered tags in the outputs of the ‘specific object’ or ‘segmentation’ models, and, if there is a match, the tags' confidence may be increased. This loop may be repeated until a desired overall confidence threshold is reached.
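A minimal sketch of such a loop, assuming tags are represented as label-to-confidence mappings and that some personalized NLP filter is available as a callable (both of which are assumptions made for the illustration, not details taken from the disclosure):

```python
def relevance_feedback(generic_tags, specific_model_tags, nlp_filter,
                       boost=0.15, threshold=0.9, max_rounds=5):
    """Sketch of the feedback loop: filter the 'generic' tags with an NLP step,
    look for the surviving tags among those produced by the 'specific object' /
    'segmentation' models, and raise their confidence on a match, repeating
    until the overall confidence reaches the desired threshold.

    generic_tags / specific_model_tags: dict of tag -> confidence in [0, 1].
    nlp_filter: callable returning the subset of tags judged relevant.
    """
    tags = dict(generic_tags)
    for _ in range(max_rounds):
        kept = nlp_filter(tags)                    # personalized NLP filtering
        for tag in kept:
            if tag in specific_model_tags:         # match in specific/segmentation output
                tags[tag] = min(1.0, tags[tag] + boost)
        overall = sum(tags.values()) / max(len(tags), 1)
        if overall >= threshold:                   # desired overall confidence reached
            break
    return tags

# Example with a trivial NLP filter that keeps tags above a low confidence floor.
generic = {"dog": 0.62, "animal": 0.70, "street": 0.20}
specific = {"dog": 0.81, "labrador retriever": 0.77}
print(relevance_feedback(generic, specific,
                         lambda t: [k for k, v in t.items() if v > 0.3]))
```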
In another embodiment, an object segmentation model may be run on image files that may have been shared with the user in a multi-protocol, multi-format communication system. The object segmentation model may be configured to analyze pictures using one or more algorithms, so as to identify or determine distinct objects in the picture. In an embodiment, an object localization model may be performed on the image, along with each of the detected ‘pixel-level masks’ (i.e., the precise area that the object covers in the image), to identify locations of distinct objects in the image. Object localization may be used to determine an approximation of what the objects are and where the objects are located in the image.
In an embodiment, deep object detection may be implemented by using one or more image corpora together with NLP models to filter CV generated tags. NLP methods may be used to represent words and contextually analyze tags in text form. An NLP model may allow for a semantically meaningful way to filter the tags and identify outliers in the CV generated tags.
In another embodiment, a relevance feedback loop may be implemented, whereby the NLP engine may filter, or “de-noise,” the CV generated tags by detecting conceptual similarities to prioritize similar tags and deprioritize irrelevant tags. For example, when the system detects a questionable tag (i.e., confidence level is low), the system may recheck the tag to ascertain whether discarding the tag is advised. Furthermore, a CV tag-filtering engine based on a training set annotated at the bounding-box level (object's location) may create rules related to the spatial layout of objects and therefore adapt the NLP classifier to filter related definitions based on these layouts. For example, in everyday photos/images, the ‘sky’ is usually above the ‘sea’. The system may search for pictures from external datasets based on the subject of the discarded tag to verify whether removing the outlier was accurate. Results obtained from the search may be used to train NLP and computer vision using the images in the image dataset of the subject matter of the discarded tag.
In a non-limiting example, a user might want to find a picture or image that a certain person (e.g., his friend Bob) sent to him that depicts a certain subject (e.g., Bob and Bob's pet Llama), via a general query. The universal search approach of this disclosure allows a user to search for specific items, but in a general way, using natural language, regardless of the format or channel through which the message/file came. So, the user could, for example, search for “the picture Bob sent me of him with his Llama” without having to tell the system to search for a JPEG file or the like. The user could also simply search for “Llama” or “‘Bob’ and ‘animal’” to prompt the search system to identify the image via its CV tags (which contain general concepts such as “animal” and specific concepts such as “Bob” and “Llama”), as opposed to locating the image via filename, metadata, message body context, or any other standard parameter.
As new data/content is on-boarded into the system, the data/content can be categorized and sharded, and insights derived from analyzing the data, for example, language patterns, can be used to create an overarching user-personality profile containing key information about the user. That key information can be used to influence the weights of the various criteria of the index analyzer for that particular user. The index analyzer for a particular user can be automatically updated on an ongoing, as-needed, as-appropriate, or periodic basis, for example. Additionally, a current instance of an analyzer can be used by a user to perform a search, while another (soon to be more current) instance of the analyzer updates. Thus, for example, the words and expressions that a particular user uses when searching can become part of a machine-learned pattern. If a user on-boards email accounts, an index analyzer will pull historical data from the accounts and analyze that data. One or more analyzers discussed herein can comprise one or more variations of algorithms running independently or in combination, sequentially, or in parallel.
[Figure descriptions omitted; they describe server 106 in the server-entry point network architecture infrastructure 100.]
System unit 205 may be programmed to perform methods in accordance with this disclosure. System unit 205 comprises one or more processing units, input-output (I/O) bus 225, and memory 215. Access to memory 215 can be accomplished using the communication bus 225. Processing unit 210 may include any programmable controller device including, for example, a mainframe processor, a mobile phone processor, or one or more members of the INTEL® ATOM™, INTEL® XEON™, and INTEL® CORE™ processor families from Intel Corporation and the Cortex and ARM processor families from ARM. (INTEL, INTEL ATOM, XEON, and CORE are trademarks of the Intel Corporation. CORTEX is a registered trademark of the ARM Limited Corporation. ARM is a registered trademark of the ARM Limited Company). Memory 215 may include one or more memory modules and comprise random access memory (RAM), read only memory (ROM), programmable read only memory (PROM), programmable read-write memory, and solid-state memory.
The processing unit core 210 is shown including execution logic 280 having a set of execution units 285-1 through 285-N. Some embodiments may include a number of execution units dedicated to specific functions or sets of functions. Other embodiments may include only one execution unit or one execution unit that can perform a particular function. The execution logic 280 performs the operations specified by code instructions.
After completion of execution of the operations specified by the code instructions, back end logic 290 retires the instructions of the code 250. In one embodiment, the processing unit core 210 allows out of order execution but requires in order retirement of instructions. Retirement logic 295 may take a variety of forms as known to those of skill in the art (e.g., re-order buffers or the like). In this manner, the processing unit core 210 is transformed during execution of the code 250, at least in terms of the output generated by the decoder, the hardware registers and tables utilized by the register renaming logic 262, and any registers (not shown) modified by the execution logic 280.
Data flow 300 starts at 302 where messaging content may be received and imported into a multi-protocol, multi-format communication system on a user client device (or user-client). For example, messaging content may be received as messages and/or other shared items that can include media files or point to media files within the message. Media files may include visual properties such as, for example, pictures or videos that may be included in the messaging content. In an embodiment, the messaging content including the media files (for example, pictures/images or videos) may be displayed to the user as messaging content in a user interface at a client application.
Next, one or more image analyzer (IA) models may be automatically run on the images and videos to determine computer vision tags or annotations for one or more distinct objects in the images (in 304) or videos (in 306). Media files that are received may be separated into images and videos, and one or more IA models may be run on the images and videos based on the format of the media files.
Object detection may be run on the image in 308. In an embodiment, object detection may be implemented as one or more object detection models to determine generic classes of objects. The object detection model analyzes the image to determine tags for generic categories of items in the image, for example, tags at different abstraction levels such as person, automobile, plant, or animal, as well as tags at more specific levels such as dog, domestic dog, or Labrador dog. Inter-model fusion may be performed in 316, whereby tags obtained from running several object detection models on the image may be combined to generate tags in 324 defining labels for each detected object.
Object localization may be run on the image in 310. In an embodiment, object localization may be implemented as one or more object localization models. For example, one or more object localization models may be performed on the image to identify locations of distinct objects in the image. Object localization may be used to determine an approximation of what the objects are (i.e., labels) and where the objects are located (i.e., an object window defining pixel coordinates (x, y, width, height) on the image). Inter-model fusion may be performed in 318, whereby tags obtained from running several object localization models on the image may be combined to generate tags in 326 defining labels and boundaries for each detected object.
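One simple, non-limiting way to picture the tags produced in 326 is as labels with confidences that carry their object windows; the `LocalizationTag` structure below is only an illustrative representation assumed for the example, not a data format specified by the disclosure.

```python
from dataclasses import dataclass

@dataclass
class LocalizationTag:
    """One localization result: an approximate label plus the object window
    (pixel coordinates of the region the object occupies in the image)."""
    label: str
    confidence: float
    x: int        # left edge of the object window
    y: int        # top edge of the object window
    width: int
    height: int

# Example: two objects localized in a 1920x1080 image.
tags_326 = [
    LocalizationTag("person", 0.88, x=410, y=220, width=260, height=640),
    LocalizationTag("dog",    0.74, x=900, y=610, width=320, height=240),
]
```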
Object segmentation may be run on the image in 312. Object segmentation may be implemented as one or more object segmentation models. In an embodiment, an object segmentation model may analyze the image to identify or determine distinct objects in the image (i.e., labels) and the segmentation mask/object outline of each object (i.e., the pixels identified to the cluster in which they belong), such as, for example, ‘animal’ and its mask or ‘table’ and its mask. In an example of a picture/image of a conference room having chairs and a conference table, object segmentation may be performed to segment the image by identifying one or more objects in the picture, for example, identifying three objects, each of which may be one of the chairs in the image. In an embodiment, one or more additional object segmentation models may be applied to recognize faces and humans in the image. Object segmentation may generate a segmentation map that may be used to filter tags obtained in other IA models. Inter-model fusion may be performed in 320, whereby tags obtained from running several object segmentation models on the image may be combined to generate tags in 328 that define labels and segmentation mask/object outline for each detected object.
Scene/place recognition may be performed on the image in 314. In an embodiment, scene/place recognition may be implemented as one or more scene/place recognition models that may be trained to recognize the scenery depicted in the image, for example, scenery defining outdoors, indoors, sea or ocean, seashore, beach, or the like. Inter-model fusion may be performed in 322, whereby tags obtained from running several scene recognition models on the image may be combined to generate tags in 330 that define scenes in the image. For example, the scene/place recognition model may be used to enrich the set of tags obtained from models 308, 310, 312 and drive the filtering of tags in 308, 310, 312 by filtering out conceptual mismatches, i.e., by determining whether an object detected by another model 308, 310, 312 can plausibly be found at the detected location in the image (for example, a dog cannot be detected at a location where sky is identified in the image).
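To make the conceptual-mismatch check concrete, the following sketch rejects a detection whose window falls entirely inside a scene region it cannot plausibly occupy. The rule table, box representation, and `conceptually_consistent` helper are assumptions made for illustration only.

```python
def conceptually_consistent(detection, scene_regions):
    """Reject a detected object whose window lies entirely inside a scene region
    where that object cannot occur (e.g., a 'dog' inside a 'sky' region).

    detection: dict with 'label' and 'box' (x, y, w, h).
    scene_regions: list of dicts with 'label' and 'box'.
    """
    IMPLAUSIBLE = {("dog", "sky"), ("car", "sea"), ("boat", "indoor")}  # illustrative rules

    def inside(inner, outer):
        ix, iy, iw, ih = inner
        ox, oy, ow, oh = outer
        return ix >= ox and iy >= oy and ix + iw <= ox + ow and iy + ih <= oy + oh

    for region in scene_regions:
        if ((detection["label"], region["label"]) in IMPLAUSIBLE
                and inside(detection["box"], region["box"])):
            return False
    return True

# A 'dog' detection located fully inside the 'sky' region is filtered out.
sky = {"label": "sky", "box": (0, 0, 1920, 400)}
dog = {"label": "dog", "box": (800, 100, 120, 90)}
print(conceptually_consistent(dog, [sky]))  # False
```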
In an embodiment, deep detection may use a deep neural network (DNN) that may produce meaningful tags that provide a higher precision of detection after proper training on a large set of images belonging to all desired categories. For training the DNN, one may use one or more sets of annotated images (generally referred to as a dataset or corpus) as a baseline. An image dataset/corpus may be a set of annotated images with known relational information that have been manually tagged and curated. In one example, a baseline image dataset/corpus that may be used can be a subset of the image-net dataset (which is available at http://www.image-net.org/). In an example, the image dataset/corpus may be augmented by web crawling other image sources and combining these image sources into the baseline dataset/corpus for training. In another embodiment, an image dataset/corpus may be trained by using textual information that may be received in a message that has been shared with the user. For example, textual information received in a message, either in the body or subject line, such as “an image of a plane,” may be used to identify tags or annotations that may be used for content in the image.
In an embodiment, after generic classification (in 308), localization (in 310), segmentation (in 312), or scene detection (in 314), the image in 304 may be further analyzed through a specific model based on one or more categories that were identified in the image. For example, if one of the pieces of the image was classified as belonging to a plant category, that piece may be analyzed against a specific plant dataset/corpus to identify the specific type of plant. Alternatively, if a piece of the image was classified as belonging to a glass category, it may be further classified as a specific utensil such as, for example, a cup. These insights may be gathered for the entire image using models that are implemented based on the categories that were identified for the objects in the image. Particularly, the system may gather insights (i.e., identification of tags for the image) while implementing one or more of the specific models on the pieces of the image and store these tags in memory. In an embodiment, results that are obtained from implementing one or more models may be ranked based on a confidence level.
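The category-driven refinement could be pictured roughly as follows, where a generic label selects a more specific classifier for the corresponding piece of the image. The registry, the toy classifiers, and the `refine_by_category` helper are illustrative stand-ins for models the disclosure leaves unspecified.

```python
# Hypothetical registry mapping a generic category to a category-specific classifier.
# The lambdas stand in for real models trained on category-specific corpora.
SPECIFIC_MODELS = {
    "plant":  lambda crop: [("fern", 0.71), ("ficus", 0.18)],
    "glass":  lambda crop: [("cup", 0.83), ("vase", 0.09)],
    "animal": lambda crop: [("labrador retriever", 0.77), ("cat", 0.05)],
}

def refine_by_category(generic_label, image_crop):
    """Run the category-specific model, if one exists, and return its tags ranked
    by confidence; otherwise fall back to the generic label alone."""
    model = SPECIFIC_MODELS.get(generic_label)
    if model is None:
        return [(generic_label, None)]
    return sorted(model(image_crop), key=lambda t: t[1], reverse=True)

print(refine_by_category("glass", image_crop=None))  # [('cup', 0.83), ('vase', 0.09)]
```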
Next, in 332, after generic classification (in 308), localization (in 310), segmentation (in 312), or scene detection (in 314), intra-model fusion may be performed on the tags determined in steps 324, 326, 328, and 330. In an embodiment, the system may combine the tags obtained from each model (in 324, 326, 328, and 330) so as to merge the insights from the several models and determine tags of different natures. For example, the results from combining insights may be concatenated. The concatenated information is used to break up the image intelligently so that each object does not include portions of other objects (i.e., an object contour does not include portions of other objects in the image). For example, in an image with a person and a car, the image may be intelligently broken up so that the face of the person is distinct from portions associated with the car, allowing the system to identify the objects in the image, how big they are in relation to other objects, and where they are located in the image. The output of intra-model fusion may be tags for the objects in the image together with confidence values for those tags.
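As one way to picture intra-model fusion, the sketch below concatenates the tag lists from 324, 326, 328, and 330 and combines confidences when several models agree on a label. The particular combination rule (a noisy-OR over per-model confidences) is an assumption chosen for the illustration, not a rule stated in the disclosure.

```python
from collections import defaultdict

def intra_model_fusion(detection_tags, localization_tags, segmentation_tags, scene_tags):
    """Concatenate the (label, confidence) lists produced in 324 (detection),
    326 (localization), 328 (segmentation), and 330 (scene), fusing confidences
    for labels that several models agree on via 1 - product(1 - c)."""
    fused = defaultdict(lambda: 1.0)
    for source in (detection_tags, localization_tags, segmentation_tags, scene_tags):
        for label, conf in source:
            fused[label] *= (1.0 - conf)           # accumulate the 'miss' probability
    return {label: round(1.0 - miss, 3) for label, miss in fused.items()}

fused = intra_model_fusion(
    detection_tags=[("dog", 0.70), ("animal", 0.80)],
    localization_tags=[("dog", 0.60)],
    segmentation_tags=[("dog", 0.55), ("table", 0.40)],
    scene_tags=[("indoor", 0.90)],
)
print(fused)  # {'dog': 0.946, 'animal': 0.8, 'table': 0.4, 'indoor': 0.9}
```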
In an embodiment, in intra-level fusion (in 332), the system may weight the importance of the objects in the image using a depth model. The depth model may determine depth or focus in the image in order to perceive whether the objects identified in the image are farther back or closer to the front. For example, based on a determination that an identified object is farther back, a rule may be implemented that rates the object as less important. Similarly, another rule may weight an object as more important if it has less depth. An index of weights for the image may be determined based on the depth model that may be implemented on the image.
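A minimal sketch of such a depth-based weighting rule, assuming a normalized per-object depth estimate is available (an assumption, since the disclosure does not fix any particular depth representation):

```python
def weight_by_depth(tags_with_depth):
    """Objects estimated to be farther back receive a lower importance weight;
    objects with less depth (closer to the camera) receive a higher one.

    tags_with_depth: list of (label, confidence, normalized_depth), where
    normalized_depth is 0.0 for the closest plane and 1.0 for the farthest.
    Returns an index of weights keyed by label.
    """
    return {
        label: round(confidence * (1.0 - 0.5 * depth), 3)   # closer objects count more
        for label, confidence, depth in tags_with_depth
    }

print(weight_by_depth([("person", 0.9, 0.1), ("mountain", 0.8, 0.95)]))
# {'person': 0.855, 'mountain': 0.42}
```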
Next, in 334, a Natural Language Processing (NLP) model may be implemented to filter the tags that are generated in intra-model fusion (in 332). In some embodiments, tag filtering can include inter-level and intra-level tag filtering. Filtering may be used to filter the automatically generated tags by selecting tags having the highest confidence values and/or selecting tags that are conceptually closer.
Inter-Level Tag Filtering
Object detection models may be of similar nature or not, i.e., trained to detect a large variety of objects (e.g., hundreds of object classes), herein called ‘generic,’ or trained to detect specific objects (e.g., tens of classes or even a single class, such as human faces, pedestrians, etc.), herein called ‘specific.’
Running object detection models of similar nature, i.e., of only ‘generic’ or only ‘specific’ nature, may produce competing lists of tags with the same or similar properties that may also contain different assessed confidence values. Inter-level tag filtering may use confidence re-ranking and NLP-based methods to filter and prioritize those tags by, for example, 1) selecting the tags that are conceptually closer; and 2) accumulating the confidence of those tags and selecting the most confident ones.
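The following sketch shows one way such inter-level filtering could proceed: pool the competing lists, keep the labels that are conceptually close to the rest of the pool according to an NLP similarity function, accumulate their confidences, and return the most confident ones. The toy similarity function and concept groups are assumptions made purely for the example.

```python
from collections import defaultdict

def inter_level_filter(tag_lists, similarity, keep=5, sim_threshold=0.5):
    """Pool competing (label, confidence) lists from models of similar nature,
    drop labels that are conceptually distant from the rest of the pool, and
    return the remaining labels ranked by accumulated confidence."""
    accumulated = defaultdict(float)
    for tags in tag_lists:
        for label, conf in tags:
            accumulated[label] += conf

    labels = list(accumulated)

    def closeness(label):   # mean similarity of this label to all other pooled labels
        others = [l for l in labels if l != label]
        return sum(similarity(label, o) for o in others) / max(len(others), 1)

    kept = [l for l in labels if closeness(l) >= sim_threshold]
    return sorted(kept, key=accumulated.get, reverse=True)[:keep]

# Toy similarity: labels sharing a hand-written concept group count as close;
# a real system might back this with word-vector distances instead.
GROUPS = {"dog": "animal", "labrador": "animal", "animal": "animal", "street": "place"}
sim = lambda a, b: 1.0 if GROUPS.get(a) == GROUPS.get(b) else 0.0

print(inter_level_filter(
    [[("dog", 0.7), ("street", 0.4)], [("dog", 0.6), ("labrador", 0.5)], [("animal", 0.8)]],
    similarity=sim,
))  # ['dog', 'animal', 'labrador'] -- 'street' is dropped as conceptually distant
```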
Intra-Level Tag Filtering
Running object detection models of different nature, i.e., of ‘generic’ and ‘specific’ nature, may produce competing or complementary lists of tags and confidence values (e.g., tags such as ‘Labrador Retriever’, ‘gun dog’, ‘dog’, ‘domestic dog’, ‘Canis lupus familiaris’, ‘animal’, ‘cat’, ‘street’). Intra-level filtering based on NLP methods may produce a natural hierarchy of those tags by removing the outliers (‘cat’, ‘street’) as in the inter-level filtering case and by also creating an abstract-to-less-abstract hierarchy (‘animal’, ‘dog’, ‘domestic dog’, ‘gun dog’, ‘Labrador Retriever’, ‘Canis lupus familiaris’).
Using NLP methods to represent words and contextually analyze text, the NLP model may learn to map each discrete word in a given vocabulary (e.g., a Wikipedia corpus) into a low-dimensional continuous vector space based on simple frequencies of occurrence. This low-dimensional representation may allow for a geometrically meaningful way of measuring distance between words, which are treated as points in a mathematically tractable manifold.
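The outlier-removal half of this idea can be sketched with toy word vectors and cosine similarity, as below. The hand-written vectors stand in for embeddings an NLP model would actually learn from a large corpus, and the similarity threshold is arbitrary; building the abstract-to-less-abstract hierarchy would additionally require hypernym or lexical-resource information, which the sketch does not attempt.

```python
import math

# Toy word vectors standing in for embeddings learned from a large corpus.
VEC = {
    "animal":   (0.90, 0.10, 0.00),
    "dog":      (0.85, 0.20, 0.05),
    "labrador": (0.80, 0.25, 0.10),
    "gun_dog":  (0.82, 0.22, 0.08),
    "street":   (0.05, 0.10, 0.95),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def remove_outliers(tags, min_similarity=0.6):
    """Treat each tag as a point in the embedding space, compute its mean
    similarity to the other tags, and drop tags that sit far from the rest of
    the list (the conceptual outliers)."""
    kept = []
    for tag in tags:
        others = [t for t in tags if t != tag]
        mean_sim = sum(cosine(VEC[tag], VEC[o]) for o in others) / len(others)
        if mean_sim >= min_similarity:
            kept.append(tag)
    return kept

print(remove_outliers(["labrador", "gun_dog", "dog", "animal", "street"]))
# ['labrador', 'gun_dog', 'dog', 'animal'] -- 'street' is removed as the outlier
```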
In an embodiment, a relevance feedback loop may be implemented whereby the NLP engine may “de-noise” the CV generated tags by detecting conceptual similarities to prioritize similar tags and de-prioritize irrelevant tags. For example, when the system detects a questionable tag (i.e., confidence level is low), the system may recheck the tag to ascertain whether discarding the tag is advised. Furthermore, the CV tag engine based on a training set annotated at the bounding-box level (object's location) may create rules related to the spatial layout of objects and therefore adapt the NLP classifier to filter related definitions based on these layouts. For example, in everyday photos/images, the ‘sky’ is usually above the ‘sea’. The system may search for pictures from external datasets based on the subject of the discarded tag to verify whether removing the outlier was accurate. Results obtained from the search may be used to train NLP and computer vision using the images in the image dataset of the subject matter of the discarded tag.
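A spatial-layout rule of the kind described (e.g., ‘sky’ usually above ‘sea’) could be checked roughly as follows; the rule list, the box format, and the `violates_spatial_rules` helper are illustrative assumptions rather than material taken from the disclosure.

```python
def violates_spatial_rules(detections, rules=None):
    """Flag tag pairs whose bounding boxes contradict layout rules of the form
    'label A usually appears above label B' (rules that a real system might
    learn from a training set annotated at the bounding-box level).

    detections: dict of label -> (x, y, w, h) bounding box.
    Returns the list of (upper_label, lower_label) rules that are violated.
    """
    rules = rules or [("sky", "sea")]              # (upper_label, lower_label)
    flagged = []
    for upper, lower in rules:
        if upper in detections and lower in detections:
            upper_center_y = detections[upper][1] + detections[upper][3] / 2
            lower_center_y = detections[lower][1] + detections[lower][3] / 2
            if upper_center_y > lower_center_y:    # image y grows downward: 'sky' ended up below 'sea'
                flagged.append((upper, lower))
    return flagged

boxes = {"sky": (0, 500, 1920, 300), "sea": (0, 0, 1920, 400)}
print(violates_spatial_rules(boxes))  # [('sky', 'sea')] -- the layout contradicts the rule
```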
The following examples pertain to further embodiments.
Example 1 is a non-transitory computer readable medium comprising computer readable instructions, which, upon execution by one or more processing units, cause the one or more processing units to: receive a media file for a user, wherein the media file includes one or more objects; automatically analyze the media file using computer vision models responsive to receiving the media file; generate tags for the image responsive to automatically analyzing the media file; filter the tags using Natural Language Processing (NLP) models; and utilize information obtained during filtering of the tags to fine-tune one or more of the computer vision models and the NLP models, wherein the media file includes one of an image or a video.
Example 2 includes the subject matter of Example 1, wherein the instructions to filter the tags using NLP models further comprise instructions that when executed cause the one or more processing units to select tags that are conceptually closer.
Example 3 includes the subject matter of Example 1, wherein the instructions to train each of the computer vision models and the NLP models further comprise instructions that when executed cause the one or more processing units to recheck outlier tags in an image corpus for accuracy of the outlier tag.
Example 4 includes the subject matter of Example 1, wherein the instructions to automatically analyze the media file further comprise instructions that when executed cause the one or more processing units to automatically analyze the media file using one or more of an object segmentation model, object localization model or object detection model.
Example 5 includes the subject matter of Example 1, wherein the instructions further comprise instructions that when executed cause the one or more processing units to analyze the media file using an object segmentation model for identifying the extent of distinct objects in the image.
Example 6 includes the subject matter of Example 1, wherein the instructions further comprise instructions that when executed cause the one or more processing units to implement an object detection and recognition model and an object localization model in parallel.
Example 7 includes the subject matter of Example 6, wherein the instructions further comprise instructions that when executed cause the one or more processing units to implement the object detection and recognition model to determine tags related to general categories of items in the image.
Example 8 includes the subject matter of Example 1, wherein the instructions further comprise instructions that when executed cause the one or more processing units to implement the object localization model to identify the location of distinct objects in the image.
Example 9 is a system, comprising: a memory; and one or more processing units, communicatively coupled to the memory, wherein the memory stores instructions to cause the one or more processing units to: receive an image for a user, wherein the image includes one or more objects; automatically analyze the image using computer vision models responsive to receiving the media file; generate tags for the image responsive to automatically analyzing the image; filter the tags using Natural Language Processing (NLP) models; and utilize information obtained during filtering of the tags to fine-tune one or more of the computer vision models and the NLP models, wherein the media file includes one of an image or a video.
Example 10 includes the subject matter of Example 9, the memory further storing instructions to cause the one or more processing units to select tags that are conceptually closer responsive to filtering the tags using NLP models.
Example 11 includes the subject matter of Example 9, the memory further storing instructions to cause the one or more processing units to recheck outlier tags in an image corpus for accuracy of the outlier tag.
Example 12 includes the subject matter of Example 9, the memory further storing instructions to cause the one or more processing units to automatically analyze the image using one or more of an object segmentation model, object localization model or object detection model.
Example 13 includes the subject matter of Example 9, the memory further storing instructions to cause the one or more processing units to analyze the media file using an object segmentation model for identifying the extent of distinct objects in the image.
Example 14 includes the subject matter of Example 9, the memory further storing instructions to cause the one or more processing units to implement an object detection model and an object localization model in parallel.
Example 15 includes the subject matter of Example 14, the memory further storing instructions to cause the one or more processing units to implement the object detection model to determine tags related to general categories of items in the image.
Example 16 includes the subject matter of Example 9, the memory further storing instructions to cause the one or more processing units to implement the object localization model for identifying the location of distinct objects in the image.
Example 17 is a computer-implemented method, comprising: receiving an image for a user, wherein the image includes one or more objects; automatically analyzing the image using computer vision models responsive to receiving the media file; generating tags for the image responsive to automatically analyzing the image; filtering the tags using Natural Language Processing (NLP) models; and utilizing information obtained during filtering of the tags to fine-tune one or more of the computer vision models and the NLP models.
Example 18 includes the subject matter of Example 17, further comprising selecting tags that are conceptually closer responsive to filtering the tags.
Example 19 includes the subject matter of Example 17, further comprising rechecking outlier tags in an image corpus for accuracy of the outlier tags.
Example 20 includes the subject matter of Example 17, further comprising automatically analyzing the image using one or more of an object segmentation model, object localization model or object detection model.
Example 21 includes the subject matter of Example 17, further comprising analyzing the media file using an object segmentation model for identifying the extent of distinct objects in the image.
Example 22 includes the subject matter of Example 17, further comprising implementing an object detection model and an object localization model in parallel.
Example 23 includes the subject matter of Example 22, further comprising implementing the object detection model to determine tags related to general categories of items in the image.
Example 24 includes the subject matter of Example 17, further comprising implementing the object localization model to identify a location of distinct objects in the image.
Example 25 includes the subject matter of Example 24, further comprising searching for visually similar objects in a dataset.
Example 26 includes the subject matter of Example 21, further comprising searching for visually similar objects in a dataset.
In the foregoing description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the disclosed embodiments. It will be apparent, however, to one skilled in the art that the disclosed embodiments may be practiced without these specific details. In other instances, structure and devices are shown in block diagram form in order to avoid obscuring the disclosed embodiments. References to numbers without subscripts or suffixes are understood to reference all instances of subscripts and suffixes corresponding to the referenced number. Moreover, the language used in this disclosure has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter, resort to the claims being necessary to determine such inventive subject matter. Reference in the specification to “one embodiment” or to “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least one disclosed embodiment, and multiple references to “one embodiment” or “an embodiment” should not be understood as necessarily all referring to the same embodiment.
It is also to be understood that the above description is intended to be illustrative, and not restrictive. For example, above-described embodiments may be used in combination with each other and illustrative process steps may be performed in an order different than shown. Many other embodiments will be apparent to those of skill in the art upon reviewing the above description. The scope of the invention therefore should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
Claims
1. A non-transitory computer readable medium comprising computer readable instructions, which, upon execution by one or more processing units, cause the one or more processing units to:
- receive one or more media files comprising one or more objects;
- automatically analyze the one or more media files using a plurality of computer vision models responsive to receiving the one or more media files, wherein automatically analyzing the one or more media files comprises: determining content associated with the one or more media files; and running a subset of the plurality of computer vision models based on the content;
- generate computer vision tags for the media file responsive to automatically analyzing the one or more media files, wherein each of the computer vision tags comprises a determined confidence value;
- determine a prioritization of the generated computer vision tags based on one or more Natural Language Processing models;
- filter the generated computer vision tags based on the prioritization;
- identify a set of the generated computer vision tags having highest confidence values of the determined confidence values of the computer vision tags;
- identify one of the generated computer vision tags having one of the determined confidence values at or below a predetermined threshold;
- determine that the one of the generated computer vision tags is an irrelevant tag based on reprocessing the one of the generated computer vision tags using the plurality of computer vision models and the one or more Natural Language Processing models;
- prioritize the filtered generated computer vision tags based on the determined confidence values of the generated computer vision tags, wherein prioritizing the filtered generated computer vision tags comprises ranking the filtered generated computer vision tags by increasing the prioritization of the set of the filtered generated computer vision tags and decreasing the prioritization of the irrelevant tag; and
- associate the prioritized filtered generated computer vision tags with the one or more media files.
2. The non-transitory computer readable medium of claim 1, wherein the one or more image analyzer models comprises one or more models of object segmentation, object localization, object detection/recognition, natural language processing, or a relevance feedback loop.
3. The non-transitory computer readable medium of claim 1, wherein the computer readable instructions which, upon execution by the one or more processing units, cause the one or more processing units to prioritize the filtered generated computer vision tags further comprise instructions that, when executed, cause the one or more processing units to:
- re-rank, via an inter-level tag filtering, the determined confidence values of the generated computer vision tags;
- prioritize the filtered tags based on the re-ranked confidence values of the generated computer vision tags; and
- select the prioritized generated computer vision tags with highest confidence values.
4. The non-transitory computer readable medium of claim 1, wherein the computer readable instructions which, upon execution by the one or more processing units, cause the one or more processing units to prioritize the filtered generated computer vision tags further comprise instructions that, when executed, cause the one or more processing units to:
- detect one or more outliers based on inference of natural meanings of the generated computer vision tags.
5. The non-transitory computer readable medium of claim 4, wherein the computer readable instructions which, upon execution by the one or more processing units, cause the one or more processing units to filter using intra-level filtering to produce a natural hierarchy of the generated computer vision tags by removing the one or more outliers.
6. The non-transitory computer readable medium of claim 1, wherein the computer readable instructions which, upon execution by the one or more processing units, cause the one or more processing units to filter the generated computer vision tags further comprise instructions that, when executed, cause the one or more processing units to:
- de-noise the generated computer vision tags by detecting conceptual similarities to prioritize similar tags and de-prioritize dissimilar tags.
7. The non-transitory computer readable medium of claim 1, wherein the computer readable instructions which, upon execution by the one or more processing units, cause the one or more processing units to filter the generated computer vision tags further comprise instructions that, when executed, cause the one or more processing units to:
- filter the generated computer vision tags based, at least in part, on a spatial layout of the one or more objects within the one or more media files.
8. The non-transitory computer readable medium of claim 1, wherein the computer readable instructions which, upon execution by the one or more processing units, cause the one or more processing units to adjust the priority of one or more of the generated computer vision tags further comprise instructions that, when executed, cause the one or more processing units to:
- adjust the priority of a given generated computer vision tag based on an estimated depth of an object in the image that is associated with the given generated computer vision tag.
9. A system, comprising:
- a memory; and
- one or more processing units, communicatively coupled to the memory, wherein the memory stores instructions, when executed, cause the one or more processing units to: receive a media file comprising an image or a video, wherein the media file includes one or more objects; automatically analyze the media file using computer vision models responsive to receiving the media file, wherein automatically analyzing the media file comprises: determining content associated with the media file; utilizing a subset of the plurality of computer vision models with the content; generate computer vision tags for the media file responsive to automatically analyzing the media file, wherein each of the computer vision tags comprises a determined confidence value; determine a prioritization of the computer vision tags based on one or more Natural Language Processing models; filter the computer vision tags based on the prioritization, wherein the instructions to filter comprise using an inter-level tag filtering to re-rank determined confidence values of the computer vision tags; identify a set of the computer vision tags having highest confidence values of the determined confidence values of the computer vision tags; identify one of the computer vision tags having one of the determined confidence values at or below a predetermined threshold; determine that the one of the computer vision tags is an irrelevant tag based on reprocessing the one of the computer vision tags using the plurality of computer vision models and the one or more Natural Language Processing models; select the computer vision tags based on the determined confidence values of the computer vision tags, wherein selecting the computer vision tags comprises ranking the computer vision tags by increasing the prioritization of the set of the computer vision tags and decreasing the prioritization of the irrelevant tag; and associate the selected computer vision tags with the media file.
10. The system of claim 9, wherein the instructions, when executed, cause the one or more processing units to filter the computer vision tags further comprise instructions that, when executed, cause the one or more processing units to:
- accumulate confidence values of the selected computer vision tags; and
- select the selected computer vision tags with highest confidence values.
11. The system of claim 9, wherein the instructions, when executed, cause the one or more processing units to filter the computer vision tags further comprise instructions that, when executed, cause the one or more processing units to:
- detect one or more outliers based on inference of natural meanings of the computer vision tags, wherein the instructions, when executed, cause the one or more processing units to filter using intra-level filtering to produce a natural hierarchy of the computer vision tags by removing the one or more outliers.
12. The system of claim 11, wherein the computer vision tags for the media file are generated via one or more image analyzer models comprising an object segmentation model, an object localization model, an object detection/recognition model, a natural language processing model, or a relevance feedback loop model.
13. The system of claim 9, wherein the instructions, when executed, cause the one or more processing units to filter the computer vision tags further comprise instructions that, when executed, cause the one or more processing units to:
- de-noise the computer vision tags by detecting conceptual similarities to prioritize similar computer vision tags and de-prioritize dissimilar computer vision tags.
14. The system of claim 9, wherein the instructions, when executed, cause the one or more processing units to filter the computer vision tags further comprise instructions that, when executed, cause the one or more processing units to:
- filter the computer vision tags based, at least in part, on a spatial layout of the one or more objects within the media file.
15. A computer-implemented method, comprising:
- receiving a media file comprising one or more objects;
- automatically analyzing the media file using a plurality of computer vision models responsive to receiving the media file, wherein automatically analyzing the media file comprises: determining content associated with the media file; and running a subset of the plurality of computer vision models using the content;
- generating computer vision tags for the media file responsive to automatically analyzing the media file, wherein each tag comprises a determined confidence value;
- determining a prioritization of the computer vision tags based on one or more Natural Language Processing models;
- filtering the computer vision tags based on the prioritization and via an inter-level tag filtering;
- identify a set of the computer vision tags having highest confidence values of the determined confidence values of the computer vision tags;
- identify one of the computer vision tags having one of the determined confidence values at or below a predetermined threshold;
- determine that the one of the computer vision tags is an irrelevant tag based on reprocessing the one of the computer vision tags using the plurality of computer vision models and the one or more Natural Language Processing models;
- prioritizing the computer vision tags based on the determined confidence values of the computer vision tags, wherein the prioritizing comprises ranking the computer vision tags by increasing the prioritization of the set of the computer vision tags and decreasing the prioritization of the irrelevant tag; and
- associating the prioritized computer vision tags with the media file.
16. The method of claim 15, wherein the computer vision tags for the media file are generated via one or more image analyzer models comprising an object segmentation model, an object localization model, an object detection/recognition model, a natural language processing model, or a relevance feedback loop model.
17. The method of claim 15, wherein the computer vision tags are filtered based on analysis by one or more natural language processing models.
18. The method of claim 15, further comprising:
- accumulating confidence values of the prioritized computer vision tags; and
- selecting the prioritized computer vision tags with highest confidence values.
19. The method of claim 15, further comprising:
- detecting one or more outliers based on inference of natural meanings of the computer vision tags, wherein the computer vision tags are further filtered via an intra-level filtering to produce a natural hierarchy of the computer vision tags by removing the one or more outliers.
20. The method of claim 15, further comprising:
- de-noising the computer vision tags by detecting conceptual similarities to prioritize similar computer vision tags and de-prioritize dissimilar computer vision tags.
Type: Grant
Filed: Jul 28, 2020
Date of Patent: Sep 26, 2023
Patent Publication Number: 20210117467
Assignee: Entefy Inc. (Palo Alto, CA)
Inventors: Konstantinos Rapantzikos (Athens), Alston Ghafourifar (Los Altos Hills, CA)
Primary Examiner: Hosain T Alam
Assistant Examiner: Saba Ahmed
Application Number: 16/941,447
International Classification: G06F 16/58 (20190101); G06F 16/51 (20190101); G06F 16/335 (20190101); G06F 16/33 (20190101); G06F 16/583 (20190101);