Abstract: This application is directed to methods and systems for visual content retrieval using semantic search. An embodiment provides a method for generating media feature vectors from media data segments using jointly trained machine learning models, and storing these vectors, together with entity indicators, in a vector-based search database. An input vector is generated from text or image data, and a processor calculates cosine similarities between the input vector and the stored media feature vectors to retrieve and rank relevant media segments. The method also includes generating a mean feature vector from the retrieved set and comparing it with the mean feature vectors of other entities for ranking. There are other embodiments as well.
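
The retrieval and ranking steps summarized above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function names (`cosine`, `mean_vector`, `retrieve`, `rank_entities`), the in-memory list standing in for the vector-based search database, and the hand-written three-dimensional vectors are all assumptions made for illustration. The jointly trained models that would produce real feature vectors are not shown.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def retrieve(index, query, top_k):
    """Return the top_k (similarity, segment_id, entity, vector) tuples, best first."""
    scored = [(cosine(query, vec), seg_id, entity, vec)
              for seg_id, entity, vec in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


def rank_entities(index, retrieved_vectors):
    """Rank entities by cosine similarity between the mean of the retrieved
    vectors and each entity's own mean feature vector."""
    by_entity = defaultdict(list)
    for _seg_id, entity, vec in index:
        by_entity[entity].append(vec)
    target = mean_vector(retrieved_vectors)
    ranked = [(cosine(target, mean_vector(vecs)), entity)
              for entity, vecs in by_entity.items()]
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked


# Hypothetical stand-in for the search database:
# (segment id, entity indicator, media feature vector).
INDEX = [
    ("seg1", "cat", [1.0, 0.0, 0.0]),
    ("seg2", "cat", [0.9, 0.1, 0.0]),
    ("seg3", "dog", [0.0, 1.0, 0.0]),
    ("seg4", "dog", [0.1, 0.9, 0.0]),
    ("seg5", "car", [0.0, 0.0, 1.0]),
]

# Hypothetical input vector, as if embedded from a text or image query.
query = [1.0, 0.05, 0.0]

hits = retrieve(INDEX, query, top_k=2)
entity_ranking = rank_entities(INDEX, [vec for _, _, _, vec in hits])

for score, seg_id, entity, _ in hits:
    print(f"{seg_id} ({entity}): {score:.4f}")
for score, entity in entity_ranking:
    print(f"{entity}: {score:.4f}")
```

A production system would likely replace the linear scan in `retrieve` with an approximate nearest-neighbor index. The brute-force form is used here only because it makes the cosine ranking and the mean-vector comparison easy to follow.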