Systems and methods for contextual and semantic summarization
A system may, in a first pass: divide content to be summarized into a plurality of chunks and, for each chunk: execute a language model with the chunk and an instruction to summarize the chunk, generate, based on the executed language model, a summary of the chunk. In a subsequent pass, the system may: generate a plurality of groups of summaries, each group of summaries from among the plurality of groups of summaries comprising two or more summaries, each summary corresponding to a respective chunk, for each group of summaries from among the plurality of groups: execute a language model with the group of summaries and an instruction to summarize the group of summaries, generate, based on the executed language model on the group of summaries, a group summary. The system may iteratively repeat the subsequent pass for group summaries until a summary of the content is reached.
This application claims priority to U.S. Provisional Patent Application No. 63/562,662, filed on Mar. 7, 2024, the entire content of which is incorporated by reference in its entirety herein.
BACKGROUNDAutomated content generation has been advancing at a rapid rate. Artificial intelligence (“AI”) systems such as large language models (“LLMs”) trained to generate text and visual models, such as diffusion models trained to generate visuals, are becoming increasingly sophisticated. Research and advancements in generative audio are likewise gaining traction. Furthermore, three-dimensional (“3D”) engines that provide development capabilities for users to render realistic 3D (and two dimensional (2D)) environments and objects are becoming increasingly powerful. Despite these advancements, there are many challenges introduced by these and other related systems.
Features of the present disclosure may be illustrated by way of example and not limited in the following figure(s), in which like numerals indicate like elements, in which:
Various systems and methods relate to multi-modal content generation and retrieval that may address various issues with these and other advanced systems. In particular, the system provides a generative AI platform for generating, storing, searching, and summarizing content. The system may train, retrain, and/or execute various generative AI and other advanced systems to provide an integrated platform for tasks relating to content generation, storage, search, and summarization. The content may include scripts for, movies, shows, commercials, short videos, non-visual, and/or other types of content (“transcripts”). Using advanced systems and generative AI models, the system enables creators to generate content that spans full lifecycle development, from ideation, to storyboarding, character development, scene creation, and completed project. The system enables creators to collaborate with others to iteratively generate various aspects of project creation, including visual (image and video), audio, script, and other elements of the project. The system may be used in other contexts (other than for transcripts) in which one or more issues with generative AI and other advanced systems are implicated.
For example, the mass scale at which content such as text, visuals, audio, 3D objects, and other content can be generated is staggering. Providing meaningful search capabilities for this data can be problematic, leading to difficulties in storing, identifying, and retrieving relevant content. In particular, it may be difficult to find content such as transcripts. Furthermore, content to be processed may be unstructured, which may make it more difficult to perform various generative AI tasks, such as summarization or other processing of specific portions of unstructured content.
Another issue with these systems is that generative AI models may rely on appropriate inputs such as prompts to generate appropriate results. For example, one prompt may not be as effective as another prompt in obtaining an appropriate response from an LLM. Oftentimes it is difficult for users to formulate appropriate prompts, let alone gather contextual, semantic, or other information that may provide sufficient information for the models to generate good outputs. Thus, generating effective prompts to maximize relevance or appropriateness of generative AI model outputs can be problematic.
Furthermore, generative AI models may be non-deterministic: given the same input, the same output may not be generated. This can present various problems, such as when attempting to generate content consistently across different compute nodes in a parallelized architecture. For example, breaking apart long text for parallelized summarization may involve breaking the content into chunks and summarizing each chunk using an LLM. However, a given LLM may summarize a chunk using a different tone (and/or other way) compared to another chunk, resulting in an incohesive overall summary of the original content. This issue is further compounded when different LLMs are used for different chunks.
Even though context window sizes are increasing, enabling larger sized content to be analyzed via a single prompt, it can remain advantageous to parallelize this effort for various reasons. For example, parallelizing LLM or other generative AI tasks may advantageously reduce the computational load of this task by breaking up the task into smaller running instances. Parallelizing LLM or other generative AI tasks may also result in more accurate results because multiple content pieces are being analyzed separately. For example, summarizing long text in a single prompt may be subject to non-deterministic output in the single task. But doing so over multiple smaller chunks and aggregating the results may yield better performance since the chances of each chunk being subjected to non-deterministic output is less than a single task. Put another way, a given chunk may be subject to non-deterministic output but the parallelization may tolerate this since other chunks may not be subject to non-deterministic output.
Another issue with generative AI systems is that they may not produce desired results, whether or not appropriate inputs are provided. For example, an LLM may be tasked with generating text that is later deemed to be inappropriate, not the intended style, or otherwise not considered an appropriate response to a user request. In another example, an image model may inappropriately cut off a portion of an image in response to a request to zoom in on a particular item in a scene. In a related problem, generative AI systems are known to hallucinate. That is, they can generate inaccurate or falsified content. This can present problems in various contexts such as for content summarization or factual recounting.
One ancillary issue with generative AI systems is that as their use proliferates, content generation will become easier but also possibly more prone to human errors, in addition to inherent generative AI errors such as the generation of inappropriate results or hallucinations. For example, a human user writing a transcript with generative AI (or with other systems) may introduce inconsistencies in various aspects of the transcript or related content. To illustrate, a writer may change an aspect of the transcript so that the change conflicts with other parts of the transcript. This problem can (and does) manifest in other contexts, but can be especially acute with generative AI systems. These and other issues exist with generative AI systems and their use.
The computer system 110 is a computational platform having one or more computer devices that generates, summarizes, and semantically searches content to provide a broad range of assistive and generative functionality. The computer system 110 may include a processor 112, a model Application Programming Interface (“API”) endpoint 111, a system API 113, a prompt generator 115, a platform system 120, a content parsing system 130, a generative content system 140, a semantic summarization system 150, a self-correcting content generative system 160, an interface system 170, and/or other features. The computer system 110 may access (such as read, write, delete, and/or update) various databases, such as the content repository 101, the prompt repository 103, and training repository 105.
The computer system 110 may train, retrain, fine-tune, execute, or otherwise activate various computer models. The computer models may include a language model 121, an image model 123, a 3D engine/model 124, a vision model 125, an audio model 127, a scene parsing model 129, a segmentation model 131, an inpainting model 133, a harmonization model 135, and/or other models. At least some of these models are generative AI models. A generative AI model is a computer model that is trained to generate new content based on training data. Different types of content in different formats such as text, visuals, audio, and/or other types and formats of content are contemplated.
Each of the systems 120, 130, 140, 150, 160, 170, and 180 may call or otherwise use one or more of the other systems. For example, the platform system 120 may call the semantic summarization system 150 to generate summaries of content. Similarly, each of the systems 120, 130, 140, 150, 160, 170, and 180 may train, retrain, fine-tune, execute, and/or otherwise activate various computer models such as models 121, 123, 124, 125, 127, 129, 131, 133, and 135.
The processor 112 may include one or more of a digital processor, an analog processor, a digital circuit designed to process information, an analog circuit designed to process information, a state machine, and/or other mechanisms for electronically processing information. Although processor 112 is shown in
System Interfaces, Model Platform Interfaces, and Prompt Generation
The model API endpoint 111 is an API that provides an interface to one or more of the models. Although only one endpoint is shown, there may be multiple endpoints that each interface with a respective model. The system may activate a model via the model API endpoint 111. For example, to activate a model, the computer system 110 may generate or select a prompt via the prompt generator 115 and transmit the prompt as input via the model API endpoint 111. The system API 113 is an API that provides an interface to various system functions of the computer system 110. For example, inputs to and outputs from one or more systems and models of
The prompt generator 115 is a system component that receives an input and generates a prompt for execution by one or more of the models. A prompt is an instruction to a generative AI model to generate an output. The prompt may include a query to be answered and/or a description of the output to be generated. In some instances, the prompt may also include additional information to be used by the model to generate a response. The additional information may include contextual data, desired output formats, constraints, domain-specific knowledge, examples, templates, tone, style, localization information (such as output language, consideration of cultural information, and so forth), and/or other information that may be provided to the model to help shape its response. Thus, generation of the prompt itself can be an important factor in obtaining an appropriate response from one or more of the generative AI models.
Prompts can be in the form of a text prompt for models that can understand text inputs, machine prompts for models that can understand non-text such as vector inputs, and/or other types of prompts depending on the model for which the prompt is intended.
Dynamic Prompt Generation
The prompt generator 115 may receive an input from a user and generate a prompt based on the user input, contextual information, semantic information, and/or other data to generate a custom and targeted prompt. The prompt generator 115 may be programmed with instructions to dynamically generate specific prompts for various situations. The prompt generator 115 may have access to variable data in a runtime environment such that the prompt generator 115 is able to access the variable data. The variable data may include an indication and payload of content (if any) being viewed, generated, or analyzed, user inputs, contextual information, semantic information, and/or other types of data that may influence prompt generation. Further details of dynamic prompt generation and its use are described at
In some instances, the prompt generator 115 may access one or more preconfigured prompts that may be designed by a developer and/or historical prompts previously generated by one or more users. In these instances, the prompt generator 115 may provide a user-selectable listing of the preconfigured prompts. Preconfigured prompts may be advantageous in situations in which a prompt is found to be effective and can be re-used by the same or different users and/or to simplify and streamline prompts. In some instances, the prompt generator 115 may modify a preconfigured prompt for dynamic prompt generation based on the preconfigured prompt. The preconfigured prompts and/or dynamically generated prompts may be stored in the prompt database 103.
Callback Functions for Generative AI Models
In some implementations, the computer system 110 may store and execute one or more callback functions 117 (illustrated as callback functions 117A-N). A callback function 117 is a function executed by the computer system 110 in connection with a request from a generative AI model, such as the language model 121, to execute one or more functions to obtain data, generate data, query data, or otherwise provide additional data for the generative AI model. Typically, though not necessarily, the generative AI model will return a request for one or more callback functions when it requires more information to respond to a request such as a prompt. The computer system 110 may accordingly execute the requested callback functions and return additional information resulting from the one or more callback functions.
In some examples, a callback function 117 may include a clarify function that may be called back by a model, such as the language model 121. The clarify function may include a preconfigured instruction to clarify certain responses. If the model does not understand an input such as a prompt, the model may callback the clarify function that results in text that asks for clarification. The preconfigured instructions may therefore control how clarifications are directed back to users.
In some examples, a callback function 117 may include a virtual calculator function as a callback function. In these examples, the computer system 110 may train a language model 121 to execute a virtual calculator by including functions to press virtual buttons of a virtual calculator. Thus, a prompt to “what is 5+7” may result in a callback function to “press the number 5, press the button “+” then press the button “7.” The virtual calculator will then provide an output of the virtual calculator. This is in contrast to, for example, directly executing a software function such as “print (5+7)” to perform the calculation.
Various systems of the computer system 110 may use callback functions 117 to obtain additional information about transcripts, visuals, audio, and/or other content that is the subject of modeling, whether generative AI modeling or other modeling described herein.
The following is an example flow of using a callback function 117 “query_content” for content generation.
Send the text as-is to the model (along with the rest of the conversation). If the conversation already contains a description of a character named Maggie, the model may pull from that and answer right away.
If the conversation does not include information (such as the description of Maggie), the model will call the “query_content” function and pass back a modified query. For example, “Describe the character Maggie from the show Subversion in great detail. Be sure to include both physical and personality characteristics.”
The query_content function then breaks the document up into chunks and uses the modified query to extract information.
Information from all the chunks (each individual response) is combined into a single document, and then the aggregate is sent to the language model 121 with the original question (not the modified query). The result of this processing is provided to the user.
Computer Models, Including Generative AI Models
The language model 121 is a generative AI model for language. In particular, the language model 121 may be a pretrained deep-learning LLM trained on large language datasets. The language model 121 may be trained to semantically understand natural language and automatically generate new text based on this understanding. Examples of the language model 121 may include, without limitation, one or more variants of: OpenAI GPT, LLaMA from META, Google LaMBDA, BERT from GOOGLE, BigScience BLOOM, Multitask Unified Model (MUM), or other language models.
The computer system 110 may activate the language model 121 with one or more input prompts and one or more model parameter values. The prompt may be generated by the prompt generator 115. A model parameter value is an input that specifies behavior—and therefore output—of the language model 121. For example, a model parameter value may include a temperature parameter that adjusts the level of randomness for automatically generated text. Different temperature parameter values will result in different levels of randomness in the generated text. Thus, the temperature parameter value may be used to control the output of the language model 121. The language model 121 may return the automatically generated text based on the one or more input prompts and any model parameter values.
The image model 123 is a generative AI model for visual data such as images and video. As used herein, the term “visual data” will generally refer to images and video, whether two dimensional (2D) or three dimensional (3D). The image model 123 may be trained on visual data to automatically generate new visual content. The visual content may be two dimensional and/or three dimensional. The image model 123 may be a diffusion model trained to generate new visual content. Examples of image model 123 include diffusion models such as one or more variants of: STABLE diffusion, DALL-E, IMAGEN, and MIDJOURNEY.
Diffusion models may implement score-based generative modeling, denoising diffusion probabilistic models, and Stochastic Differential Equations (“SDE”), each playing a critical role in the model's ability to process and generate complex data. Score-based generative modeling through SDEs map data to a noise distribution (the prior) with an SDE and reverse this SDE for generative modeling. Denoising diffusion probabilistic models are a specific type of diffusion model that focuses on probabilistically removing noise from data. During training, these models learn how noise is added to data over time and how to reverse this process to recover the original data. This involves using probabilities to make educated guesses about what the data looked like before noise was added. SDEs are mathematical tools that describe the noise addition process in diffusion models. They provide a detailed blueprint of how noise is incrementally added to the data over time. This framework is essential because it gives diffusion models the flexibility to work with different types of data and applications, allowing them to be tailored for various generative tasks. Score-based generative models (SGMs) learn to understand and reverse the process of noise addition. Score-based generative modeling teaches the model to start with noisy data and progressively remove noise to reveal clear, detailed images.
The 3D engine/model 124 is a model that can render 2D or 3D scenes or objects based on input parameters. These input parameters may include values, machine instructions such as vector data, and/or other inputs that specify a 2D or 3D rendering. Non-limiting examples of the 3D engine/model 124 include UNITY, UNREAL, and CRYENGINE.
The computer vision model 125 is an AI model that is trained to process, understand, and identify objects in electronic visual data such as images and videos. Examples of computer vision models include GPT-4V, LaVA (Large Language and Vision Assistant), and BakLLaVA. These or other computer vision models 125 may integrate image identification and language understanding that provides an ability to analyze visuals and ask questions of the visuals.
The audio model 127 is a generative AI model that processes and generates audio data. Audio data (or simply “audio”) is data that is intended to be heard, such as voices, music, sound effects, ambient noise, and/or other sounds. The training data can include music, speech, environmental sounds, or any other type of audio the model is designed to generate. By analyzing audio in the training data, the audio model 127 learns the underlying patterns and relationships between different sounds. Examples of audio models 127 include WAVENET, Variational Autoencoders (VAEs), and Generative Adversarial Networks (“GANs”) for audio. Once trained, the audio model 127 can generate new audio content based on, for example, sampling from a learned distribution or starting from prompts or other starting points. Sampling from a learned distribution may involve predicting the probability of different sound elements occurring and sampling from this distribution to create entirely new audio. Starting from prompts or other starting points may involve receiving input from users to influence the generation process based on input audio prompts from users that include such as a melody, rhythm, specific sound effects, or other sounds. The audio model 127 then builds upon these inputs to create new audio. The audio model 127 may generate audio in various contexts, such as to generate musical compositions, sound effects, dialogue, and/or other sounds that may be derived from or otherwise automatically generated based on sounds in the training data or prompts.
The scene parsing model 129 is an AI model that converts unstructured content 202 into structured content. The unstructured content 202 can include text such as natural language text. The structured content can include an output in a format that structures parsed elements. An example structured format can include JavaScript Object Notation (JSON) output, although other structured formats such as XML can be used. For example, the system can parse scenes and its elements from text in screenplay format and generate JSON output.
The segmentation model 131 is a computer vision model that identifies different parts of an image. Image segmentation can be used to delimit an element in the image. For example, image segmentation (such as via the Segment Anything Model (SAM)) can be used to identify and mask sunglasses from an image.
The inpainting model 133 is a generative AI model that fills in a specified space in visual data. The specified space can include the mask generated by the segmentation model 131. The inpainting model 133 may take as input a set of visual data and generates content to fill in the specified space. The content may be consistent with the visual data. In this way, the inpainting model 133 may be tasked to generate content that is consistent with the visual data. In particular, a user may input the set of visual data, be able to infill content that is consistent with the set
The harmonization model 135 is an AI model that can use various other model outputs to determine whether a content element such as lighting, shadows, and/or other elements in the content are appropriate given the context and/or semantics of the content.
The API endpoint 111 is a network location such as a uniform resource locator (“URL”) or uniform resource interface (“URI”) used to interact with a model that exposes an API. Each such model may expose a corresponding API that is reachable through a respective API endpoint 111. Thus, the various systems of the computer system 110 may identify a model to use and interface with that model through its API endpoint 111. For models that do not expose an API, the various systems of the computer system 110 may call the model directly through interfaces (such as a command line interface) exposed by the model.
Generative AI and Other Systems for Content Generation, Summarization, and Querying
The platform system 120 is a generative AI platform for generating, storing, searching, and summarizing content, including transcripts for: movies, shows, commercials, short videos, novels, novellas, short stories, story treatment, non-fiction books, articles, audio such podcasts, game design documents, creative briefs, non-visual, and/or other types of content (“transcripts”). Using advanced systems and generative AI models, the platform system 120 enables creators to generate content that spans full lifecycle development, from ideation, to storyboarding, character development, scene creation, and completed project. The platform system 120 enables creators to collaborate with others to iteratively generate various aspects of project creation, including visual (image and video), audio, transcript, and other elements of the project.
User Profiles and Contextual Information for Generative AI Models
The platform system 120 may register users and/or organizations to access the system. Users may include writers who are writing a transcript, actors or talent agencies who are searching for transcripts, directors or producers seeking their next project, potential investors, or others. The registration may be made in connection with subscription, per-use, royalty, or other fees. Registered users may be assigned a user profile that includes a role, a username, security credential such as a password, user preferences, historical data (such as prior work), and/or other user data. The user profile may be stored at and retrieved from the account repository 107.
The role may specify a type of user, such as a producer, director, agent, writer, artist, sound engineer, investor, and/or other type of role involved in content creation, consumption, or investment. The user preferences may include data about what the user prefers such as genres, aesthetics, audio preferences, actor preferences, user interaction history with the system, and/or other aspects of content that the user prefers.
Some or all of the user profile, such as the role or user preferences may be used for contextual information to perform various functions accessed by the platform system 120 and provided by the other systems and models described herein. For example, when a user logs on to use the system, the platform system 120 may access and use the user preferences as contextual information for generative AI models to create customized content as will be described herein based on the user preferences and/or other aspects of the user profile.
Searchable Repository of Content and Marketplace for Obtaining Rights to Content
The platform system 120 may include or interface with a content repository 101. The content repository 101 may store content generated or updated with the computer system 110. The platform system 120 may provide search capabilities that enable querying and retrieving content from the content repository 101. Search capabilities may include a semantic search, a keyword search, a visual search, an audio search, and/or other types of search.
Semantic search involves querying semantic similarity between a query and the content being queried, such as content stored in the content repository 101. Semantic similarity refers to a measure of similarity based on semantic content (meaning, context, or structure of words) rather than keyword matching. For example, “transportation” may be semantically similar to “automobile” and is not a keyword match. Thus, a search query having the word “Western” may be deemed to be semantically similar to a transcript having the word “Cowboy.” On the other hand, in a keyword search, the response section having the word “Cowboy” would not be returned as a result of a keyword query for “Western.”
Alternatively, or additionally, semantic similarity may refer to the relatedness of words rather than keyword matching. For example, “transportation” may be related to “highway.” In this context, semantic similarity may refer to the similarity in relatedness of words. Thus, a query having the word “transportation” may be deemed to be semantically similar.
Semantic similarity may be measured based on various techniques, such as topological similarity, statistical similarity, semantics-based similarity, and/or other techniques. For example, an initial query may be passed to the semantic search engine that queries the content as word embeddings (vectors) in a vector space. In some examples, the initial query may be classified against known concepts or known entities to influence the vectors that are generated.
Visual search is a search based on a visual query. The visual query may include text, image, video, and/or other query data that is searched against visuals in the content repository 101. For example, the query may include terms such as “show me transcripts with images of Cowboys.” In another example, the query may include an image and the results may include other images or visuals that are similar to the query image. Combinations of query inputs may also be used, such as “show me visuals that have the same colors as this image” along with an upload or selection of the query image.
Audio search is a search based on an audio query. Like the visual query, the audio query may include text, audio, and/or other query data that is searched against audio in the content repository 101. For example, the query may include terms such as “show me transcripts with classical music scores.” In another example, the query may include a sound file and the results may include other audio that are similar to the query sound. Combinations of query inputs may also be used.
The various types of searches may be combined with one another. For example, a semantic search may include a keyword search, a visual search and/or an audio search. To illustrate, a combination search may include terms “show me transcripts in the Western genre with a classical music score.”
One or more of the searches may incorporate contextual information, such as a user profile. For example, the platform system 120 may use some or all of the user profile of a registered user to provide context for searches. To illustrate, a registered actor may submit a query “find me transcripts that I would be interested in.” The platform system 120 may obtain the user preferences and/or other information from the user profile of the registered actor and retrieve transcripts that are relevant to the search query and the user preferences. For example, if the user prefers Western or Sci-Fi genres, the results will be filtered according to these genres. The search results may also or instead be filtered based on user work history. For example, the platform system 120 may return transcripts that have one or more roles that are consistent with the type of role the user has played in the past.
One or more of the searches may be a recurring search that is executed automatically. For example, a registered user may specify one or more search queries that are periodically (such as daily, weekly, monthly, etc.) run. Search results may be transmitted to the registered user and/or be provided to the user via a user interface. To illustrate, a registered agency may enter data such as clients or criteria for specific content. As other users input and/or import content, connections can automatically be made. For example, an agent might have a client looking for a futuristic motorcycle movie. The agency could set up a “semantic notification” based on a recurring search with these or other search parameters, and be notified of any content matching that criteria, including content that is uploaded to the content repository 101 after the recurring search was initially setup.
In some implementations, the platform system 120 may provide an electronic or online marketplace (“marketplace”) for content. The marketplace may enable the search (via one or more of the search capabilities of the platform system 120), use, download, or other permitted action with respect to the content. In this way, users may purchase, license, or otherwise obtain rights for a permitted action. The marketplace may include tracking features that track the use or appropriation of content. For example, various hash digests may be generated on some or all of a given content or its content elements. In some examples, the content may be subject to Digital Rights Management (“DRM”) protection to ensure that creators and other providers of the content may securely share the content with the computer system 110 for dissemination via the marketplace.
User Interfaces
The platform system 120 may provide one or more user interfaces 122 (illustrated as user interfaces 122A-N) for interacting with the computer system 110. Examples of user interfaces are illustrated at, for example,
In some implementations, a user interface 122 may provide asset mapping and displays that show a timeline of where assets such as scenes, actors, props, certain audio, etc. appear in the transcript. For example, this user interface may query a transcript to identify when each assets appears on a timeline, and graphically represent such appearance. In this manner, this user interface may enable identification of a given asset in the transcript, the duration of time that the asset appears, and/or other information about the asset.
Streaming Buffers
Streaming buffers refer to storing content from a model or other source of the content in a buffer and then depicting the content from the buffer. This features allows output to a user to be perceived more favorably by the user by buffering the stream of data before it appears. In that way, the data appears to be fluidly provided and does not distinguish itself from other output. Doing this also avoids the user from jumping in to ask another question during hesitancies to which they were previously provided.
Computer Modeling for Transforming Unstructured Content to Structured Content
The content parsing system 130 may train and use the scene parsing model 129 to transform unstructured content 102 into structured content. The unstructured content 102 may include text, including natural language text in any language. The structured content can include an output in a format that structures parsed elements, such as text, object, record, structure, dictionary, hash table, keyed list, array, properties, name, value, number, hexadecimal, label, codes, metadata, and/or token. An example structured format can include JavaScript Object Notation (“JSON”) output, although other structured formats such as eXtensible Markup Language (“XML”) can be used. The parsed elements can be collected, analyzed, filtered, connected together (relationally and otherwise), ordered, transformed, and stored in many ways.
The content parsing system 130 may train, retrain, fine-tune and use the scene parsing model 129 to recognize data elements from the unstructured content 102 and generate structured content based on the recognized data elements. Data element recognition may be based on specific types of unstructured content 102. Put another way, different types of unstructured content 102 will have different data elements the scene parsing model 129 is trained to recognize. Continuing the illustrative examples used herein, the unstructured content 102 may include a transcript and the content parsing system 130 may specifically train and use the scene parsing model 129 to parse scenes and other transcript elements (such as props, characters, etc.) from transcripts. The content parsing system 130 may generate structured content based on the scenes and transcript elements recognized by the scene parsing model 129. For example, the content parsing system 130 may provide a transcript to the scene parsing model 129, which generates a JSON output that encodes the scenes and other transcript elements in a structured format. It should be noted that the unstructured content 102 can be content other than a transcript and the scene parsing model 129 may be trained to transform other types of unstructured content and data elements into structured content based on the disclosures herein relating to transcript parsing and scene recognition.
Inputs to the content parsing system 130 may include unstructured content having text in various formats such as ASCII text, word processing documents, PDF documents, images having text, and/or other types of content having text or recognizable text.
If the unstructured content does not contain ASCII or other free text, the content parsing system 130 recognize text from the content using Optical Character Recognition (“OCR”) (such as via TESSERACT) or translating, extracting, or reading, such as from native PDF format.
Training and Using the Scene Parsing Model 129
The content parsing system 130 may train and use the scene parsing model 129 to recognize data elements in unstructured content 102. For example, the content parsing system 130 may train the scene parsing model 129 to recognize the start and end of scenes and other elements associated with a transcript. In a transcript, for example, scenes generally starting with the strings “INT”; “EXT”; or “I/E.” To train the scene parsing model 129, the content parsing system 130 may access one or more training messages. Each training message includes an example of at least a portion of unstructured content 102 such as a transcript and the data elements that should be parsed from the unstructured content. A non-limiting example of unstructured content is shown in Table 1 for illustrative purposes. A non-limiting example of structured content transformed from the unstructured content in Table 1 is shown in Table 2.
Example of a training message that labels relevant parts of the unstructured content illustrated in Table 1 so that the model learns how to parse the unstructured content. Full text is omitted for clarity. Different numbers of training messages and their corresponding unstructured content may be used to train the model. In some instances, the number of training messages is 10000, although other numbers of messages may be used for particular needs or implementations. The training data (training messages and corresponding unstructured content) may be stored in the training repository 105. The example shown is a JSON structure with a single value called “messages”. Messages is an array that contains three objects. Each object has a “role” and a “content” property. The role can be one of:
1. System: The high-level message to the LLM. This teaches it its personality, disposition, and objectives. In this case, “I reformat text-based screenplays into JSON data structures.”
2. User: An example of text the user might pass in. In this case, unstructured text from a screenplay.
3. Assistant: How the assistant (LLM) should respond. In this case, with a data structure representing the scene.
Example of structured content. The structured content shown in Table 3 is transformed from the unstructured content illustrated in Table 1. This represents an example of output of the scene parsing model 129.
HITL Retraining/Fine-Tuning
Customized Human In the Loop (“HITL”) processing: If the SP model fails to parse a scene, the user can provide an example of how to parse the failed scene via a GUI or other interface. The system generates a new training message based on the user example and can retrain or amend the model based on the new scene example.
Dynamic Platform Selection
The content parsing system 130 may perform dynamic model platform selection. For example, the content parsing system 130 may dynamically identify and select a model platform to use (OpenAI, MICROSOFT, etc.) based on various selection parameters such as context length, cost, performance, load, network congestion, output, system capabilities, and/or other criteria by which a model platform to use. Oftentimes this will involve selecting a corresponding API endpoint 111 for the selected model platform.
Model Failover—Dynamic Model Selection for Failure Mitigation with HITL Integration
In some implementations, the content parsing system 130 may train and use a plurality of scene parsing models 129 for the same type of unstructured content 102. For example, if a first scene parsing model 129 fails to recognize a scene or other transcript element from a transcript, the content parsing system 130 may use a second scene parsing model 129 trained in a different way, and so forth until the transcript is successfully transformed to a structured format or until all of the second scene parsing models 129 have been tried. If all models have failed, then the content parsing system 130 may transmit a notification to the user of such failure. In some instances, HITL retraining may be used to train a new scene parsing model 129 or retrain a current one. Different scene parsing models 129 may be trained differently based on complexity and execution speed. For example, first training data used to train a first scene parsing model 129 may be simpler and less complex than second training data used to train a second scene parsing model 129. The first scene parsing model 129 in this example will train faster and take less computation resources than the second scene parsing model 129. For example, a first set of training messages for the first scene parsing model 129 may be less complex and have fewer ways to parse the transcript than a second set of training messages for the second scene parsing model 129.
Generative Content System
The generative content system 140 may generate content based on a user input, contextual data, semantic data, and/or other information. For example, the generative content system 140 may generate text using one or more language models 121, visuals using one or more image models 123, 3D scenes or environments using one or more 3D engine/models 124, and/or audio content using one or more audio models 127. The generative content system 140 may interact with these models through one or more iterations of the examples illustrated in
The user input may include natural language text, visual data, audio data, and/or other information related to the text to be generated. The contextual data may include information that provides an understanding of the intent, meaning or understanding of the user input. For example, the contextual information may include a scene of a transcript being viewed or interacted with by a user. The generative content system 140 in this example may use the contextual information to further understand the user input “give me two alternative endings to this scene.” Other contextual information such as a cursor position on a screen to indicate focus on that part of a screen, a screenshot of the screen to understand what the user is viewing, text on screen, any highlighted features such as highlighted text, and/or other aspects of a screen being viewed by a user.
The semantic data may include meanings of underlying content that the generative content system 140 may use to generate the text content. Continuing the previous example, the generative content system 140 may leverage a semantic understanding of the words in the scene to generate two alternative endings to the scene. The generative content system 140 may therefore be aware of the context and/or semantics associated with a user input to generate text content in response to a user input. In particular, the generative content system 140 may use the prompt generator 115 to generate a prompt based on the user input, context, semantics, and/or other information.
Based on the user input, contextual awareness, semantic awareness, and/or other information, the generative content system 140 may perform content generation assistance, query functions, and summarization functions. To further illustrate the generative content system 140, examples using transcript assistance and functionality will be described.
Content Assistive Functions
The generative content system 140 may be used to write, revise, query or otherwise analyze a transcript. The generative content system 140 may a contextual and semantic understanding of the transcript and is able to add text relating to assets such as scenes, props, actors, lighting, etc., according to the overall meaning and context of the transcript. Visuals, audio, and/or other aspects related to the transcript may be added in an integrated and seamless manner through other systems described herein as well. Example User Input: “like I don't like Hank's name, give me some alternatives.” User Input+transcript is provided to LLM, which analyzes the transcript to determine a new name for Hank. Example User Input: “what type of environment and lighting for this scene in this transcript”+transcript→LLM→response based on understanding of the transcript and contextual information identifying the scene being referenced or being viewed by the user.
Automatically Generating Content and Making Suggestions, and/or a System for Multimodal Creation.
For example, if I'm writing a story about a dragon, the system can automatically generate images of dragons, changing the dragon as I provide more detail. It can also generate alternatives for me to select. When I select an image, not only does it become part of my storyboard, but the system can also suggest changing the text to better match the image I selected. For example, if it generated a variation of a dragon with more voluminous wings, and I decide I like that, it can offer to rewrite my description so that the text is more consistent with the image.
Content Generation Based on Multimodal Input.
It may be that I want to adapt a novel to a screenplay; I should be able to start with the novel and have the system suggest a screenplay that I can use to start with.
I might find an image of a car I really like; I should be able to drag and drop it into my editor and get a description of that car generated as text.
If I want to capture a mood that a piece of music conveys, I should be able to import the music, and the system can then write text that conveys the mood of the music.
Multimodal Input System to Control Content
To get a character to pose correctly, turn on your webcam and make the pose yourself.
To give a character the right facial expression, hire an actor, turn on the webcam, and have the actor make the expressions (which are instantly transferred to characters).
Record all the lines for all the characters, then change things like gender, age, accents, etc.
Hum a tune as input for music.
Search Functionality
The generative content system 140 may provide search functionality to assist with content generation. To illustrate, the following is an example processing flow:
User Input: “find me all props in the transcript”
The generative content system 140 obtains the transcript and any contextual data, calls the prompt generator 115 to generate a prompt based on: (1) the user input and the transcript, and (2) any contextual information, any semantic information (such as from the transcript itself), any formatting conditions such as in table format, and/or other information.
Transmit the prompt to a language model 121.
Language model 121 returns a listing of props in a table format (if requested).
The generative content system 140 returns the listing of props via an interface.
Summarization Functions.
Example Flow:
User Input: “Summarize transcript up to this point.”—context-based location summarization.
Generate LLM Prompt: “User is looking at a 50 scene transcript, currently looking at scene 25, and is asking for a summary up to this point. What scenes does the user need?”
LLM Response: 1-25.
System parses scenes 1-25 (such as via the Screen Parser) then provides a prompt to a pretrained model (such as the semantic summarization system 150) that requests a summary along with parsed scenes 1-25.
Summarization Including Semantic Summarization of Content
The semantic summarization system 150 generates summaries, reductions, or distillations of content. For example, the summarizer may generate textual summaries of screenplay transcripts. To generate the summaries, the summarizer implements data processing, such as MapReduce-like functionality, to divide content into chunks (such as different scenes of a transcript as generated by the custom scene parser—see above) and serially generate different resolutions of summaries.
The semantic summarization system 150 may provide preconfigured, user-modified, system-modified, or user-directed prompts to a language model 121 to generate the summaries. Different types of content will have different prompts. For example, a movie transcript will have different prompts to generate summaries than a TV show.
In some implementations, the semantic summarization system 150 may implement a multi-pass architecture for summarizations. To illustrate, reference will be made to
The multi-pass generative AI architecture 600 may include two or more passes 601 (illustrated as passes 601A-N) in which the original content 610 is summarized into a summary 620. The original content 610 may include a transcript and the summary 620 may include a summary, a reduction, or distillation of the transcript. The entertainment industry includes the summary 620 of a transcript as part of “script coverage” for the transcript. The particular number of passes 601 may vary according to particular needs.
The initial pass 601A will summarize different chunks of the original content 610 into first pass summaries 602A, 604A, 606A, 608A, 610A, and so forth. The number of first pass summaries will vary depending on the size and number of chunks of original content 610. For example, for transcripts, the initial pass 601A will include a scene-by-scene summaries. In this example, the semantic summarization system 150 may parse the transcript into different scenes using the content parse system 130. For each scene, the semantic summarization system 150 may use the prompt generator 115 to generate a prompt to summarize the scene and transmit the prompt (which includes the scene, a request to summarize the scene, and any contextual or semantic information) to a language model 121. The language model 121 will return a scene summary, which is illustrated as one of the first pass summaries 602A, 604A, 606A, 608A, and 610A. In some implementations, scenes in the first pass 601A are not merged together, maintaining separate summaries for each scene in the content 610. In particular, the semantic summarization system 150 may take a scene-by-scene approach by identifying scenes, and generating summaries of each scene without breaking apart the scene. This results in fully contextually aware summaries on logical chunks of the content, which may not be possible if chunks spanned two or more scenes. The semantic summarization system 150 may further select which LLM platform to use based on various parameters such as context window, cost, speed, current load, network congestion, output, system capabilities, and/or other parameters.
Each subsequent pass 601 may be a summary of two or more summaries from a previous pass. For example, as illustrated, summary 602B in pass 601B is a summary of summaries 602A and 604A from pass 601A. Likewise, summary 604B in pass 601B is a summary of summaries 608A and 610A from pass 601A. Pass 601B may continue until all previous summaries from 601A are summarized. Although two prior summaries from pass 601A are shown as being summarized into one summary in pass 601B, the number of prior summaries may vary. Summarization at pass 601B and any subsequent passes may be performed as described with respect to pass 601A until the final pass is reached to generate the summary 620. For example, summaries in the final pass (601N as illustrated) may be aggregated together to generate the summary 620. If the transcript were a screenplay transcript, the summary 620 could be a logline or a short summary of the entire screenplay or both. The summaries from pass 601A could be scene by scene summaries while the summaries in pass 601B could be a summary of certain (two or more) scenes.
It should be noted that the multi-pass generative AI architecture 600 may be associated with specific and targeted prompts for each summary (such as 602A, 602B, 604A, 604B, 608A, 610A) and/or each pass 601A-N. Alternatively or additionally, a different model platform/language model 121 may be used for each summary (such as 602A, 602B, 604A, 604B, 608A, 610A) and/or each pass 601A-N.
Harmonization Pass
Oftentimes, generative AI models (such as the language model 121) may generate slightly different content because they can be nondeterministic. Alternatively or additionally, different prompts may be used for each chunk's summary in each pass 601. As a result, each summary in a given pass 601 may have a different tone, style or other characteristic than another summary in the same pass or another pass. An overall summary 620 generated from these summaries may therefore have a combination of different tones, styles, or other characteristics as an artifact of generative AI summarization in the multi-pass generative AI architecture 600.
To reduce or eliminate these artifacts, the semantic summarization system 150 may perform a harmonization pass to make content generated by multiple threads more consistent. To illustrate, reference will be made to
It should be noted that the multi-pass generative AI architecture 600 and harmonization pass 701 may be used in contexts other than transcripts to summarize original content.
Reduction or Elimination of Generative AI Model Hallucination
Oftentimes generative AI models such as LLMs “hallucinate” when generating content. A hallucination is when generative AI models create inaccurate, misleading, or false content. For example, language models 121 may hallucinate because they are trained on large amounts of language data, which can include incomplete or incorrect information. Furthermore, weak prompts may further confuse the language models 121. In the context of summarization, this hallucination can create content such as text in a summary that was not in the original content being summarized or otherwise provide inaccurate summaries of events that did not occur. The semantic summarization system 150 may implement solutions to reduce or eliminate hallucinations. For example, semantic summarization system 150 may use templates for providing guardrails.
The template specifies a starting scene “conference room” and a transition to a next see “kitchen.” The “[summary]” placeholder is a template block that indicates what to summarize (the “conference room” scene. The semantic summarization system 150 may prompt the language model 121 to fill in the [summary] section with a summary of the conference room scene as indicated in the template. The foregoing places guardrails around the content that the language model 121 should consider for summarization and reduces or eliminates hallucinations that may result from conflating the scene with other scenes or otherwise taking into account too much input content. It should be noted that the hallucination guardrail may be used in contexts other than for transcript summaries. For example, guardrail templates may similarly place limits around other types of content, such as limiting summarization to specific parts of an online news article, novel chapter, etc. Furthermore, templated approaches may be used in other generative AI contexts other than summarization. For example, templated approaches may limit the specific material considered to generate new content.
Self-Correcting Content Generation
The self-correcting generative system 160 may analyze and correct content generated by generative AI models across different types of media such as visual, audio, and text. The content to be corrected may include a prompt (generated by a human user and/or the prompt generator 115), a conversation between a user and the computer system 110, automatically generated content (such as by the computer system 110), and/or other content input to the self-correcting generative system 160. A conversation as used herein includes at least one input from a user and at least one response from the computer system 110. In some instances, a conversation may include multiple inputs from the user and/or multiple responses or inputs from the computer system 110. In either example, the computer system 110 may temporarily or permanently store the conversation to use as context for future responses, training, and/or other purposes.
Generative AI models often cannot recognize mistakes in its output, determine that the output is inconsistent with the user's request, or is inconsistent with the user's expectation. 3D models 124 may suffer from similar problems. The self-correcting generative system 160 may mitigate these mistakes through self-correcting iteration of model outputs. For example, the self-correcting iteration may be based on iterative generation and execution of text prompts 201 and/or machine inputs 203 illustrated at
Example Flow 1—Iteratively Correct Visual Output of a Generative AI Model:
User input: “give me a closeup of this character” and uploads an image of the character, user identifies a character in a visual, or contextual information identifies the character (such as user cursor or focus on the character).
The self-correcting generative system 160 uses the prompt generator 115 to generate a text prompt for a language model 121 to understand the request.
The language model 121 returns a text prompt for the image model 123.
The self-correcting generative system 160 provides the text prompt to the image model 123 along with the image that includes the character.
The image model 123 generates an output image.
The self-correcting generative system 160 uses the prompt generator 115 to generate a text prompt “Did what I try to do satisfy the request?”
The language model 121 returns a second text prompt that prompts the image model 123 to verify that the image is a closeup of the character such as with the character's head entirely in the frame and not cutoff.
The self-correcting generative system 160 provides the second text prompt to a computer vision model 125 along with the original image that includes the character and the output image from the image model 123.
The computer vision model 125 returns a response indicating whether or not the request was satisfied based on the second text prompt, the original image, and the output image.
The process is automatically iterated until a satisfactory response is achieved. In some instances, the process is automatically iterated until a maximum number of times at which point the self-correcting generative system 160 may notify the user that there may be a problem with fulfilling the request (along with one or more of the iterative image outputs for review).
Example Flow 2—Iteratively Correct 3D Output of a 3D Model/Engine:
This feature will also incorporate control and interface with various visuals, including 3-D. These features enable natural language or other user inputs to control even 3-D scenes and other visuals.
It should be noted that other aspects such as text, audio, visual may similarly be checked for self-correction individually and in combination with one another based on the workflow examples above. For example, another example workflow may perform self-correction on text generation to ensure that requested text generation has been correctly satisfied. Yet another example workflow may perform self-correction on audio generation to ensure that requested text generation has been correctly satisfied. Combinations of content checking may be performed to ensure that requested generation of a combination of text, audio, visual, and/or other type of content has been satisfied using similar workflows.
It should be further noted that the same models may be used to generate content and verify generated content. For example, the same generative language model 121 may be used to both generate text as requested and correct or validate that the generated text is correct. Alternatively, a first generative language model 121 may be used to generate text as requested and a second generative language model 121 may be used to correct or validate that the generated text is correct.
Segmentation and Inpainting
The segmentation and inpainting system 170 may segment a visual and/or inpaint content into the visual. Image segmentation is a computer vision process that identifies different parts of an image. Image segmentation or simply removal can be used to delimit an element in or portion of an image. For example, image segmentation (such as via the Segment Anything Model (SAM)) can be used to identify and mask layers, portions, or even specific items in an image, such as sunglasses. Inpainting can then be used to fill in the segmented part, such as sunglasses, along with filling in remaining corresponding portions, such as those left after removal of the sunglasses and filling with the eyes. This can be used to add elements as well, such as adding sunglasses to an image of a face by masking off where sunglasses would be in the image of the face and then filling it with an image of sunglasses.
Examples of workflows are provided for illustration and not limitation:
In one example workflow, a user may provide the segmentation and inpainting system 170 with an image of a face wearing sunglasses. For example, the user may upload the image via a user interface 122. The user may further provide an input that makes the following request: “replace the sunglasses in this image with soulful eyes.” The segmentation and inpainting system 170 may generate a prompt (via the prompt generator 115) and transmit the prompt to a language model 121, which may generate an instruction for the segmentation model 131 to mask the sunglasses from the image and an instruction for the inpainting model 133 to fill in images of eyes that are consistent with the requested description. Other examples of workflows may be similarly created to replace one image element with another similar image element (such as “replace the green tie with a red tie.” Still other examples of workflows may include more advanced semantic processing, such as “create a poster for these actors in a western movie” with an input of actor images. The image model 123 may generate a background that is appropriate for the “western” genre, masks off items where characters (actor images) would be placed and add the characters (actor images) to the background. It should be noted that the actor images may similarly be changed by the image model 123 to be consistent with the western genre (such as by masking image items of actor images and replacing them with western themed image items). It should be further noted that the prompt generator 115 may use contextual, semantic, and/or other data described herein to generate the visuals, which can include 2D visuals and 3D visuals.
Multi-Modal Consistency Verification
The multi-modal consistency verification system 180 provides multi-modal consistency checks across all types of content (visual, audio, text, etc.). The system ensures that the various modes (i.e. transcript, props, scene depiction, etc.) are consistent with one another.
To detect the inconsistency, the multi-modal consistency verification system 180 may identify and track each screenplay element (whether text, visual, audio, etc.) with respect to one another. In one example, the multi-modal consistency verification system 180 may track the transcript and its elements according to a graph-based data structure in which each scene is a node having various attributes (scene items such as visuals, text such as dialog or narration, etc.). In this way, a change to the order of scenes may be detected. Likewise, attributes such as tie worn by a character may be checked across nodes.
In some examples, the multi-modal consistency verification system 180 may detect changes to the transcript and/or its elements through diff tracking or embeddings.
In diff tracking, a diff is a change between different versions (such as a change from one version to a next version or a change from a current version to a prior version). Usually, but not necessarily, a root and complete version of the transcript and its elements is stored and any change is stored as a diff. A latest version of the transcript and its elements may be generated by obtaining the root version and applying all the diffs. If a change is made, such as a green tie being changed to a red tie, only this change will be stored as a diff. Through these diffs, the multi-modal consistency verification system 180 may identify inconsistencies that a given diff may have introduced as compared to a previous version of the transcript (such as by reviewing other diffs and/or the root version of the transcript and its elements).
For embeddings, the transcript and its elements may be converted to embeddings, or numerical representations for fast difference checking. When a new change is made, a new embedding representing the changed item or transcript will be generated. Inconsistencies may be detected when embeddings do not match as expected.
In response to a detected inconsistency, the multi-modal consistency verification system 180 may provide a notification (such as via an electronic communication channel or a user interface 122) to a user, guide the user to correct the inconsistency, or automatically correct the inconsistency. To automatically correct the inconsistency, the multi-modal consistency verification system 180 may use the prompt generator 115 to generate a prompt for a language model 121 to make a change to a transcript, a prompt for an image model 123 or 3D model 124 to make a change to a visual, and/or other make other changes to automatically correct a detected inconsistency. To guide the user, the multi-modal consistency verification system 180 may suggest prompts or other actions to take to resolve the inconsistency. For example, the multi-modal consistency verification system 180 may suggest that the user change green ties to red ties in all visuals and all portions of the transcript. In some instances, the multi-modal consistency verification system 180 may suggest an appropriate correction to an inconsistency based on various factors, such as timing, contextual or semantic data. For example, the multi-modal consistency verification system 180 may suggest that green ties in visuals should be changed to red ties because red ties were more recently indicated in the transcript. In another example, the multi-modal consistency verification system 180 may suggest that red ties be used based on contextual information that red ties are more seasonal (due to a holiday season). In another example, the multi-modal consistency verification system 180 may suggest that red ties be used from semantic information about a character that exhibits a preference for the color red in the transcript or other visuals.
Harmonization Modeling
In some implementations, the multi-modal consistency verification system 180 may use the harmonization model 135 to ensure that lighting, shadows, audio, or other aspects of the transcript are appropriate and consistent. This differs from the consistency checks because the lighting, for example, may be consistent from scene-to-scene but may be incorrect for the mood of the transcript. For example, suppose that an apple has been inpainted onto a wooden workbench, but its lighting and shadowing indicate a light source different from that shown by the grainy ridges of the worn wooden workbench on which the apple sits. The harmonization model can check as to whether there is a light source orientation problem (or be told there is such a problem) and then identify and reconcile one or more appropriate orientations and/or light sources, natural or man-made. In another example, the user sees a hat they like in another image. They select it and paste it into another photo. The system understands to remove everything but the hat from the pasted layer, and to resize it appropriately. The harmonization model pass then makes the hat appear to be realistic in the target image.
User input may include user-provided queries, text, visuals, audio, and/or other types of content to request a response from the computer system. For example, a user input may include a query such as “what is the tone of this scene?” or a command such as “show me what Hank should look like.” The user input may also or instead include other types of data such as a visual along with a question “does the hat in this image fit with Hank's character?”
Contextual information may include information that provides context around what is being requested. Contextual information may include some or all of the user profile information, user interface state information, content currently being viewed or interacted with (such as a transcript or scene in the transcript), and/or other information that may provide a context of what is being requested. User interface states may include an identification of the screen being viewed, a focus area on the screen, a highlighted area of the screen, a typed or other input made via the user interface, and/or other information about how the user is interacting with the user interface. For example, contextual information may be used to identify the scene being referred to in the query “what is the tone in this scene?”
Semantic information may include the meaning of content being viewed or data relating to what is being requested. For example, the of words of the dialog or narration in “this scene” may be used to identify the tone in the scene by a downstream model and therefore the prompt generator 115 may dynamically extract the words or other semantic information from “this scene” and generate the text prompt 201 and/or machine input 203 accordingly. Thus, the prompt generator 115 may use some or all of the inputs to dynamically generate the text prompt 201 and/or the machine input 203.
It should be noted that the schematic flows 200-500 illustrated in
Referring to
Referring to
In addition, prompt box 1020 provides a textual interface to enter queries to the platform 100 about the screenplay that is under consideration. The GUI will then convert the user's query into appropriate inputs for one or more of the AI models to generate content appropriate to the user's query. As an example, the user may type in the prompt a question, “What does Ed Rama look like?” In this example, Eduardo Rama is written in the screenplay as one of the characters, and the screenplay may not provide a succinct description of Eduardo Rama, but does contain his lines throughout, possibly some descriptions of his mannerisms or movements, and various contexts about his interactions with other characters. From just this prompt alone, the platform 100 may search through the entirety of the screenplay to find features of Ed Rama that are useful to provide to the one or more AI models for content generation. The prompt generator 115 may then create an appropriate prompt to feed to the one or more AI models that is based on the collective content about Ed Rama in the screenplay. For example, in response to the user's query, the prompt generator 115 may search through the entire screenplay to find any and all evidence about Ed Rama's characteristics. This may be based on lines that the character Ed Rama states, any narrative statements about Ed Rama's movements, clothes he wears, actions he takes, environments he resides in, and/or dialogue with other characters and what tones those may suggest, etc. The prompt generator 115 may then create an appropriate prompt that represents a description of Ed Rama that can be fed to the one or more AI models via the API endpoint 111. The one or more AI models may then produce at least a visual depiction of Ed Rama that the GUI can then display.
In this example, it can be seen that the platform 100, through the GUI, acts as a sort of translator or interpreter of the screenplay to provide appropriate inputs to the AI models. The user merely needs to ask questions about the screenplay without necessarily even knowing the full details of the screenplay, and the platform 100 then performs the work of examining the screenplay to find appropriate details to feed to the various AI models for content generation.
Still referring to
As another example, post 1024 may have been based on the user asking the question, “Tell me the backstory of Ed Rama.” The platform 100 may have developed the answer posted there based on some lines Ed Rama says in the script, descriptions of his character in the screenplay, and so on. As another example, in post 1028, the user may have asked in the prompt box 1020, “What is Ed Rama's exit scene?” While in post 1030, the user may have asked in the prompt box 1020, “What is Ed Rama's intro scene?” These answers may have again been provided based on the information contained in the screenplay and prompts fed to the one or more AI models.
Referring to
A follow-up step after creating the new card would be to populate the card with details derived from analyzing the screenplay. Referring to
At 1206, the GUI may then list the types of extra it found or inferred in the screenplay. At 1208, based on a user prompt or selection in the GUI, the Extras card 1208 may be further populated by visual depictions of what the extras may look like in the context of their scene according to the descriptions in the screenplay. Again, these generated images and lists of extras may be produced by an exchange between the prompt generator 115 and the AI models through the API endpoint 111, in a manner similar to the process described in
Referring to
Referring to
Similar to the examples in
Referring to
Referring to
Referring to
Referring to
Referring to
Referring to
Referring to
Referring to
Referring to
Referring to
Each version of the movie may be associated with a corresponding identifier that can be used to identify and render or otherwise show the corresponding movie. As illustrated, the corresponding identifier may be encoded into respective QR codes 3102A and QR codes 3102B. When scanned or otherwise read by a QR code reader, such as a user's smartphone, each QR code 3102A or 3102B will respective result in custom content 3104A and 3104B. Encodings or other ways to read the identifier may be used instead, such as via barcodes, near field communication (NFC) identifiers, Bluetooth beacons, URLs, and so forth.
It should be noted that with the generative AI systems disclosed herein, such custom content may be generated ahead of time. When generated ahead of time, the custom content may be generated for different contexts such as different geolocations, customs, user preferences, and so forth. In other examples, with the generative AI systems disclosed herein, the custom content may be generated at run-time (such as when the QR code is scanned). In these examples, the custom content may be generated based on contextual information (such as geolocations of the user, customs of the user location, user preferences, and so forth) at runtime.
Referring to
Referring to
System and Method for Interactive Character Persona
In some implementations, custom content generation may be used to generate interactive character personas facilitated by the systems and models described herein. For example, the generative content system 140 illustrated in
The system may provide a template (referred to herein as a “character sheet”) for users to fill out, ask questions (interactive question and answer), enter in a free form fashion, and/or otherwise provide information about a character persona being customized. When the system has sufficient information (textual and/or images), the system will instantiate the character (“bring the character to life”) and allow the user to interact with the character. For example, the user may interactively chat with the character. The user can then ask the character questions to learn more about them, and to ask how the character might respond in certain situations. This character metadata can also be used in making suggestions about how characters might behave and what they might say in specific parts of a creative work (script, novel, etc.).
Each character sheet may have associated contact information that allows the user to send and receive communications. For example, the contact information may include a phone number, an electronic mail address, a social media account handle, and/or other contact information that enables communication between the user and the persona (such as via the systems and models described herein, as well as the dynamic prompt and response illustrated in
The use of this system may help the user understand the character. Interacting with the character via text, voice, and/or video can also help shape the character, and inform how the character contributes to the story. After the creative work is released, the interactive character persona can be used as a marketing “activation.” Fans of the movie or the book can also text, talk, or video chat with their favorite characters, thereby creating a heightened sense of immersion.
Multi-modal refers to at least different modalities. For example, multi-modal content generation can refer to an ability to create different types of content such as text (including natural language text), visuals, audio, and/or other types of content, either alone or in combination. Similarly, multi-modal inputs can refer to different types of input modalities such as text, visual, audio, and/or other types of inputs.
Training the various models as disclosed herein may include supervised, semi-supervised, and unsupervised techniques. For example, when training models with labeled data, such as for the scene parsing model 129, supervised machine learning techniques may be used. Suitable Models: Several machine learning models can be used for this task, depending on the complexity of your data and desired output. Here are some common choices: Supervised Learning Models: Rule-based Systems: If your data has clear patterns and rules for identifying key-value pairs, rule-based systems can be implemented. Conditional Random Fields (CRFs): These models excel at sequence labeling tasks like identifying named entities within text, which can be useful for parsing key-value pairs. Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTMs): These models are powerful for handling sequential data like text and can learn complex relationships between words for accurate parsing. Deep Learning Models: Transformers: This powerful architecture, including models like BERT and RoBERTa, has shown excellent performance in various NLP tasks, including text parsing. Model Training: Splitting Data: Divide your labeled data into training, validation, and testing sets. The training set is used to train the model, the validation set helps fine-tune hyperparameters, and the testing set evaluates the model's final performance on unseen data. Hyperparameter Tuning: Adjust hyperparameters (learning rate, batch size, etc.) of the chosen model to optimize its performance on the validation set. Training the Model: Train the model on the training data. The model learns to identify patterns and relationships within the labeled data to predict the corresponding JSON structure for new, unseen text. Evaluation and Refinement. Testing and Evaluation: Evaluate the model's performance on the testing set using metrics like accuracy (percentage of correctly parsed documents) and F1-score (harmonic mean of precision and recall). Error Analysis and Refinement: Analyze errors made by the model and identify areas for improvement. This may involve collecting more labeled data or refining the model architecture/hyperparameters. Additional Considerations: Handling Ambiguity: Unstructured text can be ambiguous. Define clear guidelines for handling cases where the structure might be unclear or conflicting information exists. Model Explainability: For complex models like LSTMs and Transformers, consider techniques to understand how the model arrives at its predictions, especially in case of errors. Continuous Learning: As you acquire new data, consider retraining the model to improve its accuracy and adaptability over time. By following these steps and considering the additional points, you can train a model to effectively parse unstructured text into structured JSON using labeled data. Remember, the quality and quantity of your labeled data will significantly impact the model's performance.
To ingest the unstructured content 102 (such as transcripts or other content), the computer system 110 may use the system API 113 to provide upload capabilities for client devices 104. This data upload or access may be made via Java Database Connectivity (JDBC), Representational state transfer (RESTful) services, Simple Mail Transfer Protocol (SMTP) protocols, direct file upload, and/or other file transfer services or techniques. In particular, the system API 113 may include a MICROSOFT SHAREPOINT API Connector, an Hyper Text Transfer Protocol (HTTP)/HTTP-secure (HTTPS), a Network Drive Connector, a File Transfer Protocol (FTP) Connector, SMTP Artifact Collector, Object Store Connector, MICROSOFT ONEDRIVE Connector, GOOGLE DRIVE Connector, DROPBOX Connector, and/or other types of connector interfaces.
The computer system 110 and the one or more client devices 104 may be connected to one another via a communication network (not illustrated), such as the Internet or the Internet in combination with various other networks, like local area networks, cellular networks, or personal area networks, internal organizational networks, and/or other networks. It should be noted that the computer system 110 may transmit data, via the communication network, conveying the predictions to one or more of the client devices 104. The data conveying the predictions may be a user interface generated for display at the one or more client devices 104, one or more messages transmitted to the one or more client devices 104, and/or other types of data for transmission. Although not shown, the one or more client devices 104 may each include one or more processors.
Processor 112 may be programmed to execute one or more computer program components. The computer program components may include software programs and/or algorithms coded and/or otherwise embedded in the processor 112. The one or more computer program components or features may include various subsystems such as the platform system 120, the content parsing system 130, the generative content system 140, the semantic summarization system 150, the self-correcting generative system 160, the interface system 170, and/or other components.
Processor 112 may be configured to execute or implement 120, 130, 140, 150, 160, 170, and 180 by software; hardware; firmware; some combination of software, hardware, and/or firmware; and/or other mechanisms for configuring processing capabilities on processor 112. It should be appreciated that although 120, 130, 140, 150, 160, 170, and 180 are illustrated in
Each of the computer system 110 and client devices 104 may also include memory in the form of electronic storage. The electronic storage may include non-transitory storage media that electronically stores information. The electronic storage media of the electronic storages may include one or both of (i) system storage that is provided integrally (e.g., substantially non-removable) with servers or client devices or (ii) removable storage that is removably connectable to the servers or client devices via, for example, a port (e.g., a USB port, a firewire port, etc.) or a drive (e.g., a disk drive, etc.). The electronic storages may include one or more of optically readable storage media (e.g., optical disks, etc.), magnetically readable storage media (e.g., magnetic tape, magnetic hard drive, floppy drive, etc.), electrical charge-based storage media (e.g., EEPROM, RAM, etc.), solid-state storage media (e.g., flash drive, etc.), and/or other electronically readable storage media. The electronic storages may include one or more virtual storage resources (e.g., cloud storage, a virtual private network, and/or other virtual storage resources). The electronic storage may store software algorithms, information determined by the processors, information obtained from servers, information obtained from client devices, or other information that enables the functionalities described herein.
The databases and data stores (such as 101, 103, 105, 107) may be, include, or interface to, for example, an Oracle™ relational database sold commercially by Oracle Corporation. Other databases, such as Informix™, DB2 or other data storage, including file-based, or query formats, platforms, or resources such as OLAP (On Line Analytical Processing), SQL (Structured Query Language), a SAN (storage area network), Microsoft Access™ or others may also be used, incorporated, or accessed. The database may comprise one or more such databases that reside in one or more physical devices and in one or more physical locations. The database may include cloud-based storage solutions. The database may store a plurality of types of data and/or files and associated data or file descriptions, administrative information, or any other data. The various databases may store predefined and/or custom data described herein.
The preceding uses the term LLM. Large language models or LLMs are trained on many different kinds of data and information and similarly, can output many different kinds of data and information. With input, training, and/or modeling, it is understood and appreciated that an LLM can change and be modified and may be called by other names.
Although the foregoing has been described in detail for the purpose of illustration, it is to be understood that such detail is solely for that purpose and that the present patent application is not limited to the disclosed embodiments or implementations, but, on the contrary, is intended to cover modifications and equivalent arrangements. In addition, it is to be understood that the present patent application contemplates that, to the extent possible, one or more features or functions of any embodiment or implementation can be combined with one or more features or functions of any other. Furthermore, the systems and processes described and taught in the foregoing are not limited to the specific implementations or embodiments described herein. In addition, components of each system and each process can be practiced independently and separate from other components and processes described herein. Each component and process can also be used in combination with other assembly packages, processes, implementations, or embodiments. The flow charts and descriptions thereof should be understood to not prescribe a fixed order of performing the method blocks described therein. Rather the method blocks may be performed in any order that is practicable including simultaneous performance of at least some method blocks. Furthermore, each of the methods may be performed by one or more of the system features illustrated in
The foregoing implementations and embodiments have been provided for the purposes of illustration and description. They are not intended to be exhaustive or to limit what is disclosed to the precise forms disclosed. Obviously, many modifications and variations will be apparent to practitioners skilled in this art. In particular, and without limitation, any and all variations described, suggested by this patent application or by the material incorporated by reference are specifically incorporated by reference into the description herein of the embodiments or implementations of the invention. In addition, any and all variations described, suggested, or incorporated by reference herein with respect to any one embodiment or implementation are also to be considered taught with respect to all others. The descriptions herein were chosen and provided to best explain the principles and practical applications, thereby enabling others skilled in the art to understand the invention for various embodiments and implementations and with various modifications as are suited to the particular use contemplated.
In some aspects, the techniques described herein relate to a system, including: a content repository configured to store a multi-modal content including text, visual content, and/or audio content; a processor programmed to: access a prompt including a text query that describes text to be found, a visual query including text that describes a visual to be found, and/or an audio query including text that describes audio to be found; execute a language model based on the prompt to identify content from the content repository; receive, from the language model, a request for a callback function that seeks additional information to satisfy the multi-modal query; execute the callback function to obtain the additional information and provide the additional information to the language model in response to the request for the callback function; re-execute the language model based on the multi-modal query and the additional information; obtain, from the language model, content responsive to the prompt based on the additional information.
In some aspects, the techniques described herein relate to a system, wherein the callback function includes a clarify function that includes an instruction to clarify the prompt or obtain additional information about the content being requested.
In some aspects, the techniques described herein relate to a system, wherein the callback function includes a function to perform a computation.
In some aspects, the techniques described herein relate to a system, wherein the processor is further programmed to: access a user profile associated with a user for which the content is to be found; and provide at least a portion of the user profile to the language model, wherein the language model uses the portion of the user profile as context to identify the content from the content repository, wherein different user profiles result in different content identified from the content repository.
In some aspects, the techniques described herein relate to a system, wherein the portion of the user profile defines a role of the user, and wherein the processor is programmed to identify the content based on the role of the user such that, given the same prompt, different content is identified based on the role of the user.
In some aspects, the techniques described herein relate to a system, wherein the portion of the user profile defines a preference of the user, and wherein the processor is programmed to identify the content based on the preference of the user such that, given the same prompt, different content is identified based on the preference of the user.
In some aspects, the techniques described herein relate to a system, wherein the processor is further programmed to: retrieve, at periodic intervals, the prompt; and perform, at the periodic intervals, a recurring search based on the prompt.
In some aspects, the techniques described herein relate to a system, wherein the processor is further programmed to: access a request to provide a timeline that identifies an order in which one or more objects or entities appear in the content; generate a prompt requesting the timeline; execute the language model with the prompt; and generate the timeline based on the executed language model.
In some aspects, the techniques described herein relate to a system, wherein the processor is further programmed to: store the content in a buffer while the content is being obtained from the language model; and begin transmitting the content only when the content is obtained and stored in the buffer.
In some aspects, the techniques described herein relate to a method, including: accessing a prompt including a text query that describes text to be found, a visual query including text that describes a visual to be found, and/or an audio query including text that describes audio to be found; executing a language model based on the prompt to identify content from a content repository; receiving, from the language model, a request for a callback function that seeks additional information to satisfy the multi-modal query; executing the callback function to obtain the additional information and provide the additional information to the language model in response to the request for the call function; re-executing the language model based on the multi-modal query and the additional information; obtaining, from the language model, content responsive to the prompt based on the additional information.
In some aspects, the techniques described herein relate to a method, wherein the callback function includes a clarify function that includes an instruction to clarify the prompt or obtain additional information about the content being requested.
In some aspects, the techniques described herein relate to a method, wherein the callback function includes a function to perform a computation.
In some aspects, the techniques described herein relate to a method, further including: accessing a user profile associated with a user for which the content is to be found; and providing at least a portion of the user profile to the language model, wherein the language model uses the portion of the user profile as context to identify the content from the content repository, wherein different user profiles result in different content identified from the content repository.
In some aspects, the techniques described herein relate to a method, wherein the portion of the user profile defines a role of the user, and wherein the processor is programmed to identify the content based on the role of the user such that, given the same prompt, different content is identified based on the role of the user.
In some aspects, the techniques described herein relate to a method, wherein the portion of the user profile defines a preference of the user, and wherein the processor is programmed to identify the content based on the preference of the user such that, given the same prompt, different content is identified based on the preference of the user.
In some aspects, the techniques described herein relate to a method, further including: retrieving, at periodic intervals, the prompt; and performing, at the periodic intervals, a recurring search based on the prompt.
In some aspects, the techniques described herein relate to a method, further including: accessing a request to provide a timeline that identifies an order in which one or more objects or entities appear in the content; generating a prompt requesting the timeline; executing the language model with the prompt; and generating the timeline based on the executed language model.
In some aspects, the techniques described herein relate to a method, further including: storing the content in a buffer while the content is being obtained from the language model; and begin transmitting the content only when the content is obtained and stored in the buffer.
In some aspects, the techniques described herein relate to a non-transitory computer readable medium storing instructions that, when executed by a processor, programs the processor to: receive a multi-modal query including a text query that describes text to be found, a visual query including text that describes a visual to be found, and/or an audio query including text that describes audio to be found; identify content, from a content repository, based on the multi-modal query, the identified content including: (i) text that is semantically similar to the semantic query, (ii) visual content that matches a text description of a visual to be found, and/or (iii) audio content that matches an audio description of the visual to be found; and transmit the content responsive to the multi-modal query.
In some aspects, the techniques described herein relate to a non-transitory computer readable medium, wherein the instructions, when executed by the processor, further program the processor to: access a user profile associated with a user for which the content is to be found; and provide at least a portion of the user profile to the language model, wherein the language model uses the portion of the user profile as context to identify the content from the content repository, wherein different user profiles result in different content identified from the content repository.
In some aspects, the techniques described herein relate to a system for multi-pass summarization, including: a processor programmed to: access a request to summarize content; in a first pass, from among the multi-pass summarization: divide the content into a plurality of chunks; for each chunk, from among the plurality of chunks: execute a language model with the chunk and an instruction to summarize the chunk; generate, based on the executed language model, a summary of the chunk; in a subsequent pass, from among the multi-pass summarization: generate a plurality of groups of summaries, each group of summaries from among the plurality of groups of summaries including two or more summaries, each summary corresponding to a respective chunk; for each group of summaries from among the plurality of groups: execute a language model with the group of summaries and an instruction to summarize the group of summaries; generate, based on the executed language model on the group of summaries, a group summary; and iteratively repeat the subsequent pass for group summaries until a summary of the content is reached.
In some aspects, the techniques described herein relate to a system, wherein the processor is further programmed to: access a guardrail template for the content, the guardrail template including one or more delimiters for specific portions of the content to summarize; and generate the summary within one or more delimiters for the specific portions of the content, wherein the one or more delimiters prevent summarization across overlapping portions of the content to reduce artificial intelligence hallucination.
In some aspects, the techniques described herein relate to a system, wherein each chunk has a respective portion of the content.
In some aspects, the techniques described herein relate to a system, wherein a logical chunk does not overlap with a neighboring chunk to generate a respective summary for the logical chunk that is not influenced by the neighboring chunk for artificial intelligence hallucination mitigation.
In some aspects, the techniques described herein relate to a system, wherein the content includes a transcript and each logical chunk corresponds to a scene in the transcript.
In some aspects, the techniques described herein relate to a system, wherein the processor is further programmed to: execute a scene parsing model that is trained to convert unstructured content in the transcript into a structured format for summarization.
In some aspects, the techniques described herein relate to a system, wherein the processor is further programmed to: select one or more language models for the summarization based on one or more parameters.
In some aspects, the techniques described herein relate to a system, wherein the one or more parameters include: a context window size, a cost, a speed, a current load, a level of network congestion, and/or a system capability.
In some aspects, the techniques described herein relate to a system, wherein the processor is further programmed to: execute a harmonization pass on at least two groups of summaries to harmonize the at least two groups of summaries with respect to one another.
In some aspects, the techniques described herein relate to a method, including: accessing a request to summarize content; in a first pass, from among the multi-pass summarization: dividing the content into a plurality of chunks; for each chunk, from among the plurality of chunks: executing a language model with the chunk and an instruction to summarize the chunk; generating, based on the executed language model, a summary of the chunk; in a subsequent pass, from among the multi-pass summarization: generating a plurality of groups of summaries, each group of summaries from among the plurality of groups of summaries including two or more summaries, each summary corresponding to a respective chunk; for each group of summaries from among the plurality of groups: executing a language model with the group of summaries and an instruction to summarize the group of summaries; generating, based on the executed language model on the group of summaries, a group summary; and iteratively repeating the subsequent pass for group summaries until a summary of the content is reached.
In some aspects, the techniques described herein relate to a method, further including: accessing a guardrail template for the content, the guardrail template including one or more delimiters for specific portions of the content to summarize; and generating the summary within one or more delimiters for the specific portions of the content, wherein the one or more delimiters prevent summarization across overlapping portions of the content to reduce artificial intelligence hallucination.
In some aspects, the techniques described herein relate to a method, wherein each chunk has a respective portion of the content.
In some aspects, the techniques described herein relate to a method, wherein a logical chunk does not overlap with a neighboring chunk to generate a respective summary for the logical chunk that is not influenced by the neighboring chunk for artificial intelligence hallucination mitigation.
In some aspects, the techniques described herein relate to a method, wherein the content includes a transcript and each logical chunk corresponds to a scene in the transcript.
In some aspects, the techniques described herein relate to a method, further including: executing a scene parsing model that is trained to convert unstructured content in the transcript into a structured format for summarization.
In some aspects, the techniques described herein relate to a method, further including: selecting one or more language models for the summarization based on one or more parameters.
In some aspects, the techniques described herein relate to a method, wherein the one or more parameters include: a context window size, a cost, a speed, a current load, a level of network congestion, and/or a system capability.
In some aspects, the techniques described herein relate to a method, further including: executing a harmonization pass on at least two groups of summaries to harmonize the at least two groups of summaries with respect to one another.
In some aspects, the techniques described herein relate to a non-transitory computer readable medium storing instructions for multi-pass summarization, the instructions, when executed by a processor, programs the processor to: access a request to summarize content; in a first pass, from among the multi-pass summarization: divide the content into a plurality of chunks; for each chunk, from among the plurality of chunk execute a language model with the chunk and an instruction to summarize the chunk; generate, based on the executed language model, a summary of the chunk; in a subsequent pass, from among the multi-pass summarization: generate a plurality of groups of summaries, each group of summaries from among the plurality of groups of summaries including two or more summaries, each summary corresponding to a respective chunk; for each group of summaries from among the plurality of groups: execute a language model with the group of summaries and an instruction to summarize the group of summaries; generate, based on the executed language model on the group of summaries, a group summary; and iteratively repeat the subsequent pass for group summaries until a summary of the content is reached.
In some aspects, the techniques described herein relate to a non-transitory computer readable medium, wherein the instructions, when executed, further program the processor to: access a guardrail template for the content, the guardrail template including one or more delimiters for specific portions of the content to summarize; and generate the summary within one or more delimiters for the specific portions of the content, wherein the one or more delimiters prevent summarization across overlapping portions of the content to reduce artificial intelligence hallucination.
In some aspects, the techniques described herein relate to a system, including: a processor programmed to: access content including text content, visual content, and/or audio content; perform, based on a harmonization model and/or consistency model, a harmonization check and/or a consistency check on the content; recognize, based on the harmonization check and/or the consistency check, a conflict to be corrected; identify a property of the content that should be changed based on the recognized conflict; and generate a corrective action based on the property of the content that should be changed.
In some aspects, the techniques described herein relate to a system, wherein the content includes multi-modal content and wherein to recognize the conflict to be corrected, the processor is programmed to: identify an inconsistency between text in the multi-modal content and a visual in the multi-modal content.
In some aspects, the techniques described herein relate to a system, wherein the content includes multi-modal content and wherein to recognize the conflict to be corrected, the processor is programmed to: identify an inconsistency between a first visual in the multi-modal content and a second visual in the multi-modal content.
In some aspects, the techniques described herein relate to a system, wherein the content includes a prompt for input to a language model, and wherein to recognize the conflict, the processor is programmed to: identify an inconsistency between one or more first words in the prompt and one or more second words in the prompt.
In some aspects, the techniques described herein relate to a system, wherein to generate a corrective action, the processor is programmed to: generate a recommendation to modify the prompt based on recognized conflict.
In some aspects, the techniques described herein relate to a system, wherein to generate a corrective action, the processor is programmed to: modify the prompt based on recognized conflict.
In some aspects, the techniques described herein relate to a system, wherein the content includes a prompt for input to a language model, and wherein to recognize the conflict, the processor is programmed to: identify an inconsistency between one or more words in the prompt and one or more words in a previous prompt.
In some aspects, the techniques described herein relate to a system, wherein to generate a corrective action, the processor is programmed to: generate a recommendation to modify the prompt based on recognized conflict.
In some aspects, the techniques described herein relate to a system, wherein to generate a corrective action, the processor is programmed to: modify the prompt based on recognized conflict.
In some aspects, the techniques described herein relate to a system, wherein to recognize the conflict to be corrected, the processor is programmed to: identify and store an object in the content; and determine a difference between a first version of the object and a second version of the object, wherein the difference indicates the conflict to be corrected.
In some aspects, the techniques described herein relate to a system, wherein to recognize the conflict to be corrected, the processor is programmed to: identify and store an object in the content; and generate a first embedding for the object at a first time; generate a second embedding for the object at a second time; and determine a difference between the first embedding and the second embedding, wherein the difference indicates the conflict to be corrected.
In some aspects, the techniques described herein relate to a system, wherein to identify and store the object in the content, the processor is programmed to: store the object in a graph database that includes one or more other objects in a timeline of the content.
In some aspects, the techniques described herein relate to a system, wherein to identify a property of the content that should be changed based on the recognized conflict, the processor is further programmed to: determine a first mood of a first portion of the content; determine a second mood of a second portion of the content; determine that the first mood and the second mood conflict with one another; wherein to generate the corrective action, the processor is programmed to: modify the first portion and/or the second portion so that the first mood is consistent with the second mood.
In some aspects, the techniques described herein relate to a system, wherein the first portion includes first text and the second portion includes second text, and wherein the processor is further programmed to: determine the first mood based on the first text; and determine the second mood based on the second text.
In some aspects, the techniques described herein relate to a system, wherein the first portion includes text and the second portion includes an image, and wherein the processor is further programmed to: determine the first mood based on the text; and determine the second mood based on the image.
In some aspects, the techniques described herein relate to a method, including: accessing content including text content, visual content, and/or audio content; performing, based on a harmonization model and/or consistency model, a harmonization check and/or a consistency check on the content; recognizing, based on the harmonization check and/or the consistency check, a conflict to be corrected; identifying a property of the content that should be changed based on the recognized conflict; and generating a corrective action based on the property of the content that should be changed.
In some aspects, the techniques described herein relate to a method, wherein the content includes multi-modal content and wherein recognizing the conflict to be corrected includes: identifying an inconsistency between text in the multi-modal content and a visual in the multi-modal content.
In some aspects, the techniques described herein relate to a method, wherein the content includes multi-modal content and wherein recognizing the conflict to be corrected includes: identifying an inconsistency between a first visual in the multi-modal content and a second visual in the multi-modal content.
In some aspects, the techniques described herein relate to a non-transitory computer readable medium storing instructions that, when executed by a processor, programs the processor to: access content including text content, visual content, and/or audio content; perform, based on a harmonization model and/or consistency model, a harmonization check and/or a consistency check on the content; recognize, based on the harmonization check and/or the consistency check, a conflict to be corrected; identify a property of the content that should be changed based on the recognized conflict; and generate a corrective action based on the property of the content that should be changed.
In some aspects, the techniques described herein relate to a non-transitory computer readable medium, wherein the content includes multi-modal content and wherein to recognize the conflict to be corrected, and wherein the instructions, when executed by the processor, further program the processor to: identify an inconsistency between text in the multi-modal content and a visual in the multi-modal content.
In some aspects, the techniques described herein relate to a system, including: a processor programmed to: receive, during iterative content generation, a first user input; identify a context associated with the first user input; execute, based on the user input and the identified context, a language model trained to generate text and/or an image model trained to generate images; generate, as an output of the language model and/or the image model, content based on the user input and the identified context; receive, during the iterative content generation, a second user input; and modify, by the language model and/or the image model, the content based on the second user input.
In some aspects, the techniques described herein relate to a system, wherein the second user input includes a request to recommend changes to the content, and wherein to modify the content, the processor is programmed to: generate a suggestion based on the content and the request to change the content.
In some aspects, the techniques described herein relate to a system, wherein the first user input includes text and the generated content includes text an image based on the first user input.
In some aspects, the techniques described herein relate to a system, wherein the first user input includes an image and the content to be changed includes text that describes the image, wherein the processor is programmed to: generate a first instruction, for input to an image model, to describe the image; execute the image model with the image and the instruction to generate the text that describes the image; generate a second instruction, for input to the language model, to generate text based on the text that describes the image, wherein the content includes the text generated by the language model.
In some aspects, the techniques described herein relate to a system, wherein the first user input includes an image and the content to be changed includes text that describes the image, wherein the processor is programmed to: generate an instruction, for input to an image model, to describe the image; and execute the image model with the image and the instruction to generate the text that describes the image.
In some aspects, the techniques described herein relate to a system, wherein the first user input includes an image of a pose, and wherein to generate the content, the processor is programmed to: generate an image of a character in the content that matches the pose.
In some aspects, the techniques described herein relate to a system, wherein the first user input includes an image of a face having a facial expression, and wherein to generate the content, the processor is programmed to: generate an image of a character in the content that matches the facial expression.
In some aspects, the techniques described herein relate to a system, wherein the first user input includes audio, and wherein to generate the content, the processor is programmed to: generate audio output for the content that matches based on the audio of the first user input.
In some aspects, the techniques described herein relate to a system, wherein the context includes an area of focus on an interface of the user making the request.
In some aspects, the techniques described herein relate to a system, wherein the context for the second user input includes the content generated in response to the first user input.
In some aspects, the techniques described herein relate to a system, wherein the processor is further programmed to: receive a query input for searching the content; and retrieve at least a portion of the content based on the query input.
In some aspects, the techniques described herein relate to a system, wherein the processor is further programmed to: receive a request to summarize the content; and generate a summary of the content.
In some aspects, the techniques described herein relate to a system, wherein the processor is further programmed to conduct a consistency check between the content generated in response to the first user input and the content generated in response to the first second input.
In some aspects, the techniques described herein relate to a system, wherein the processor is further programmed to conduct a harmonization check between the content generated in response to the first user input and the content generated in response to the first second input.
In some aspects, the techniques described herein relate to a method, including: receiving, during iterative content generation, a first user input; identifying a context associated with the first user input; executing, based on the user input and the identified context, a language model trained to generate text and/or an image model trained to generate images; generating, as an output of the language model and/or the image model, content based on the user input and the identified context; receiving, during the iterative content generation, a second user input; and modifying, by the language model and/or the image model, the content based on the second user input.
In some aspects, the techniques described herein relate to a method, wherein the second user input includes a request to recommend changes to the content, and wherein modifying the content includes: generating a suggestion based on the content and the request to change the content.
In some aspects, the techniques described herein relate to a method, wherein the first user input includes text and the generated content includes text based on the first user input.
In some aspects, the techniques described herein relate to a method, wherein the first user input includes an image and the content to be changed includes text that describes the image, the method further including: generating a first instruction, for input to an image model, to describe the image; executing the image model with the image and the instruction to generate the text that describes the image; generating a second instruction, for input to the language model, to generate text based on the text that describes the image, wherein the content includes the text generated by the language model.
In some aspects, the techniques described herein relate to a non-transitory computer readable medium storing instructions that, when executed by a processor, programs the processor to: receive, during iterative content generation, a first user input; identify a context associated with the first user input; execute, based on the user input and the identified context, a language model trained to generate text and/or an image model trained to generate images; generate, as an output of the language model and/or the image model, content based on the user input and the identified context; receive, during the iterative content generation, a second user input; and modify, by the language model and/or the image model, the content based on the second user input.
In some aspects, the techniques described herein relate to a non-transitory computer readable medium, wherein the first user input includes an image and the content to be changed includes text that describes the image, and wherein the instructions, when executed by the processor, further program the processor to: generate a first instruction, for input to an image model, to describe the image; execute the image model with the image and the instruction to generate the text that describes the image; generate a second instruction, for input to the language model, to generate text based on the text that describes the image, wherein the content includes the text generated by the language model.
This written description uses examples to disclose the implementations and embodiments and to enable any person skilled in the art to practice them, including making and using any devices or systems and performing any incorporated methods. The patentable scope of the disclosure is defined by the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they have structural elements that do not differ from the literal language of the claims, or if they include equivalent structural elements with insubstantial differences from the literal language of the claims.
Claims
1. A system for multi-pass summarization, comprising:
- a processor programmed to: access a request to summarize content; in a first pass, from among the multi-pass summarization: divide the content into a plurality of chunks; for each chunk, from among the plurality of chunks: execute a language model with the chunk and an instruction to summarize the chunk; generate, based on the executed language model, a summary of the chunk;
- in a subsequent pass, from among the multi-pass summarization: generate a plurality of groups of summaries, each group of summaries from among the plurality of groups of summaries comprising two or more summaries, each summary corresponding to a respective chunk; for each group of summaries from among the plurality of groups: execute a language model with the group of summaries and an instruction to summarize the group of summaries; generate, based on the executed language model on the group of summaries, a group summary; and
- iteratively repeat the subsequent pass for group summaries until a summary of the content is reached.
2. The system of claim 1, wherein the processor is further programmed to:
- access a guardrail template for the content, the guardrail template comprising one or more delimiters for specific portions of the content to summarize; and
- generate the summary within one or more delimiters for the specific portions of the content, wherein the one or more delimiters prevent summarization across overlapping portions of the content to reduce artificial intelligence hallucination.
3. The system of claim 1, wherein each chunk has a respective portion of the content.
4. The system of claim 3, wherein a logical chunk does not overlap with a neighboring chunk to generate a respective summary for the logical chunk that is not influenced by the neighboring chunk for artificial intelligence hallucination mitigation.
5. The system of claim 1, wherein the content comprises a transcript and each logical chunk corresponds to a scene in the transcript.
6. The system of claim 5, wherein the processor is further programmed to:
- execute a scene parsing model that is trained to convert unstructured content in the transcript into a structured format for summarization.
7. The system of claim 1, wherein the processor is further programmed to:
- select one or more language models for the summarization based on one or more parameters.
8. The system of claim 1, wherein the one or more parameters comprise: a context window size, a cost, a speed, a current load, a level of network congestion, and/or a system capability.
9. The system of claim 1, wherein the processor is further programmed to:
- execute a harmonization pass on at least two groups of summaries to harmonize the at least two groups of summaries with respect to one another.
10. A method, comprising:
- accessing a request to summarize content;
- in a first pass, from among the multi-pass summarization: dividing the content into a plurality of chunks; for each chunk, from among the plurality of chunks: executing a language model with the chunk and an instruction to summarize the chunk; generating, based on the executed language model, a summary of the chunk;
- in a subsequent pass, from among the multi-pass summarization: generating a plurality of groups of summaries, each group of summaries from among the plurality of groups of summaries comprising two or more summaries, each summary corresponding to a respective chunk; for each group of summaries from among the plurality of groups: executing a language model with the group of summaries and an instruction to summarize the group of summaries; generating, based on the executed language model on the group of summaries, a group summary; and
- iteratively repeating the subsequent pass for group summaries until a summary of the content is reached.
11. The method of claim 10, further comprising:
- accessing a guardrail template for the content, the guardrail template comprising one or more delimiters for specific portions of the content to summarize; and
- generating the summary within one or more delimiters for the specific portions of the content, wherein the one or more delimiters prevent summarization across overlapping portions of the content to reduce artificial intelligence hallucination.
12. The method of claim 10, wherein each chunk has a respective portion of the content.
13. The method of claim 12, wherein a logical chunk does not overlap with a neighboring chunk to generate a respective summary for the logical chunk that is not influenced by the neighboring chunk for artificial intelligence hallucination mitigation.
14. The method of claim 10, wherein the content comprises a transcript and each logical chunk corresponds to a scene in the transcript.
15. The method of claim 14, further comprising:
- executing a scene parsing model that is trained to convert unstructured content in the transcript into a structured format for summarization.
16. The method of claim 10, further comprising:
- selecting one or more language models for the summarization based on one or more parameters.
17. The method of claim 16, wherein the one or more parameters comprise: a context window size, a cost, a speed, a current load, a level of network congestion, and/or a system capability.
18. The method of claim 10, further comprising:
- executing a harmonization pass on at least two groups of summaries to harmonize the at least two groups of summaries with respect to one another.
19. A non-transitory computer readable medium storing instructions for multi-pass summarization, the instructions, when executed by a processor, programs the processor to:
- access a request to summarize content;
- in a first pass, from among the multi-pass summarization: divide the content into a plurality of chunks; for each chunk, from among the plurality of chunk execute a language model with the chunk and an instruction to summarize the chunk; generate, based on the executed language model, a summary of the chunk;
- in a subsequent pass, from among the multi-pass summarization: generate a plurality of groups of summaries, each group of summaries from among the plurality of groups of summaries comprising two or more summaries, each summary corresponding to a respective chunk; for each group of summaries from among the plurality of groups: execute a language model with the group of summaries and an instruction to summarize the group of summaries; generate, based on the executed language model on the group of summaries, a group summary; and
- iteratively repeat the subsequent pass for group summaries until a summary of the content is reached.
20. The non-transitory computer readable medium of claim 19, wherein the instructions, when executed, further program the processor to:
- access a guardrail template for the content, the guardrail template comprising one or more delimiters for specific portions of the content to summarize; and
- generate the summary within one or more delimiters for the specific portions of the content, wherein the one or more delimiters prevent summarization across overlapping portions of the content to reduce artificial intelligence hallucination.
| 4969097 | November 6, 1990 | Levin |
| 5724457 | March 3, 1998 | Fukushima |
| 5805159 | September 8, 1998 | Bertram |
| 5864340 | January 26, 1999 | Bertram |
| 5959629 | September 28, 1999 | Masui |
| 6002390 | December 14, 1999 | Masui |
| 6321158 | November 20, 2001 | Delorme |
| 6337698 | January 8, 2002 | Keely, Jr. |
| 6675169 | January 6, 2004 | Bennett |
| 7171353 | January 30, 2007 | Trower, II |
| 7194404 | March 20, 2007 | Babst |
| 7387457 | June 17, 2008 | Jawerth |
| 7447627 | November 4, 2008 | Jessee |
| 7689684 | March 30, 2010 | Donoho |
| 8015482 | September 6, 2011 | Simova |
| 8396582 | March 12, 2013 | Kaushal |
| 10034645 | July 31, 2018 | Williams |
| 10719301 | July 21, 2020 | Dasgupta |
| 11030485 | June 8, 2021 | Karam |
| 11789837 | October 17, 2023 | Jain |
| 11907674 | February 20, 2024 | Akerlund et al. |
| 11972333 | April 30, 2024 | Horesh |
| 11978437 | May 7, 2024 | Thattai |
| 20020015042 | February 7, 2002 | Robotham |
| 20020024506 | February 28, 2002 | Flack |
| 20020052900 | May 2, 2002 | Freeman |
| 20020156864 | October 24, 2002 | Kniest |
| 20040239681 | December 2, 2004 | Robotham |
| 20050012723 | January 20, 2005 | Pallakoff |
| 20050195221 | September 8, 2005 | Berger |
| 20050223308 | October 6, 2005 | Gunn |
| 20050283364 | December 22, 2005 | Longe |
| 20060020904 | January 26, 2006 | Aaltonen |
| 20060026521 | February 2, 2006 | Hotelling |
| 20060026535 | February 2, 2006 | Hotelling |
| 20060026536 | February 2, 2006 | Hotelling |
| 20060088356 | April 27, 2006 | Jawerth |
| 20060095842 | May 4, 2006 | Lehto |
| 20060101005 | May 11, 2006 | Yang |
| 20060161870 | July 20, 2006 | Hotelling |
| 20060161871 | July 20, 2006 | Hotelling |
| 20060265648 | November 23, 2006 | Rainisto |
| 20060274051 | December 7, 2006 | Longe |
| 20070061704 | March 15, 2007 | Simova |
| 20070263007 | November 15, 2007 | Robotham |
| 20080270138 | October 30, 2008 | Knight et al. |
| 20150278226 | October 1, 2015 | Franks et al. |
| 20150293995 | October 15, 2015 | Chen et al. |
| 20150379429 | December 31, 2015 | Lee |
| 20190147357 | May 16, 2019 | Erlandson |
| 20190318099 | October 17, 2019 | Carvalho |
| 20200372395 | November 26, 2020 | Mahmud |
| 20210232632 | July 29, 2021 | Howard |
| 20210304452 | September 30, 2021 | Lee |
| 20230088796 | March 23, 2023 | Grue et al. |
| 20230315856 | October 5, 2023 | Lee |
| 20230359903 | November 9, 2023 | Cefalu |
| 20240029345 | January 25, 2024 | Selvert |
| 20240029862 | January 25, 2024 | Ahmed |
| 20240054233 | February 15, 2024 | Ohayon |
| 20240078732 | March 7, 2024 | Beith |
| 20240221719 | July 4, 2024 | Kothari |
| 20240257470 | August 1, 2024 | Gupta |
| 20240289863 | August 29, 2024 | Smith Lewis |
| 20240354555 | October 24, 2024 | Knipfing |
| 20240419246 | December 19, 2024 | Ullrich |
| 20250007870 | January 2, 2025 | Kim |
| 20250077487 | March 6, 2025 | Groenewegen |
| 20250110618 | April 3, 2025 | Kumar |
| 20250110975 | April 3, 2025 | Shea |
| 20250111334 | April 3, 2025 | Wilson |
| 20250111335 | April 3, 2025 | Wilson |
| 20250111380 | April 3, 2025 | Wilson |
| 20250124262 | April 17, 2025 | Badr |
| 20250124620 | April 17, 2025 | Arora |
| 20250131020 | April 24, 2025 | Gupta |
| 20250131024 | April 24, 2025 | Gupta |
| 20250140246 | May 1, 2025 | Lee |
| 20250148218 | May 8, 2025 | Gusarov |
| 2024/145209 | July 2024 | WO |
| 2024/166446 | August 2024 | WO |
- Md. Abdur Rahman, “A Survey on Security and Privacy of Multimodal LLMs—Connected Healthcare Perspective”, 2023 IEEE Globecom Workshops (GC Wkshps): Workshop on Edge—AI and IoT for Connected Health, pp. 1807-1812.
- Emily Dinan et al., “Anticipating Safety Issues in E2E Conversational AI: Framework and Tooling”, arXiv preprint arXiv:2107.03451v3 (2021) (Year 2021) (43 pgs.).
- Matthias Urban et al., “CAESURA: Language Models as Multi-Modal Query Planners”, Conference Jul. 17, 2017, Washington DC USA 2024, arXiv preprint arXiv: 2308.03423 (2023) (7 pgs.).
- Qinghua Lu et al., “Building The Future Of Responsible AI: A Reference Architecture For Designing Large Language Model Based Agents”, arXiv preprint arXiv:2311.13148 v2 (Nov. 29, 2023) (14 pgs.).
- Qinghua Lu et al., “A Taxonomy Of Foundation Model Based Systems Throghthe Lens Of Software Architecture”, arXiv e-prints (Jan. 2024), arXiv-2305 v6 (14 pgs.).
- International Search Report and Written Opinion of the International Searching Authority dated Jul. 4, 2025, issued in corresponding International Application No. PCT/US2025/018770 (10 pgs.).
Type: Grant
Filed: Mar 6, 2025
Date of Patent: Oct 7, 2025
Patent Publication Number: 20250284725
Assignee: REVE AI, INC. (Palo Alto, CA)
Inventors: Christian Cantrell (Palo Alto, CA), Sam Breed (Palo Alto, CA), Nick Nikolov (London), Joe Penna (Palo Alto, CA), Alex Peysakhovich (Palo Alto, CA), Sam Pullara (Palo Alto, CA), Michael Storm (Palo Alto, CA), Luke Wroblewski (Palo Alto, CA), Amelia Wattenberger (Palo Alto, CA)
Primary Examiner: Thierry L Pham
Application Number: 19/072,847
International Classification: G06F 16/34 (20250101); G06F 40/289 (20200101);