Abstract: A system for contextually matching video content based on multimodal metadata extraction. The system processes one or more scenes to extract metadata corresponding to multiple extraction modes and includes an embedding model for each extraction mode. An aggregated embedding model, responsive to the metadata embeddings for each mode, formulates an aggregated embedding. An embedding extractor, responsive to a text input, uses an embedding model coordinated with said per-mode embedding models to produce a query embedding, wherein said embeddings are in the form of vectors. A vector comparison processor determines the distance between the query vector and a vector representing the aggregated embedding. The coordination between embedding models is established by training. The embedding extractor may accept a free-form text query and present one or more subqueries for embedding. A textual inversion engine may be provided to generate an image from the embeddings to provide feedback to a user.
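
By way of illustration only, the aggregation and vector comparison steps described above might be sketched as follows. The abstract does not specify how the aggregated embedding is formed or which distance measure is used, so this sketch assumes a simple element-wise mean and cosine distance. The function names (`aggregate_embeddings`, `cosine_distance`, `rank_scenes`), the mode labels, and the toy vectors are hypothetical. In the described system the vectors would come from trained, coordinated embedding models.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def aggregate_embeddings(mode_embeddings: Dict[str, Sequence[float]]) -> Vector:
    """Combine per-mode embeddings into one vector.

    ASSUMPTION: element-wise mean. The abstract only says an aggregated
    embedding model formulates the aggregate; a learned model could be used.
    """
    if not mode_embeddings:
        raise ValueError("at least one mode embedding is required")
    vectors = list(mode_embeddings.values())
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all mode embeddings must share one dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity (ASSUMPTION: the metric is not specified)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def rank_scenes(
    query_vector: Sequence[float],
    scene_vectors: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Order scenes from closest to farthest from the query vector."""
    scored = [
        (scene_id, cosine_distance(query_vector, vec))
        for scene_id, vec in scene_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy per-mode embeddings for two scenes (hypothetical values).
    scenes = {
        "scene_1": aggregate_embeddings(
            {"visual": [0.9, 0.1, 0.0], "audio": [0.7, 0.3, 0.0]}
        ),
        "scene_2": aggregate_embeddings(
            {"visual": [0.0, 0.2, 0.8], "audio": [0.1, 0.1, 0.8]}
        ),
    }
    # Stand-in for the output of the text-side embedding extractor.
    query = [1.0, 0.0, 0.0]
    for scene_id, distance in rank_scenes(query, scenes):
        print(scene_id, round(distance, 4))
```

A smaller distance indicates a closer contextual match between the text query and the scene. Where the embedding extractor presents several subqueries, each subquery vector could be ranked in the same way.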