Abstract: The disclosure describes techniques for integrating natural voice command and motion-based gesture recognition, a tokenizer, and an LLM to enable hands-free control of computing devices without requiring specialized programming. In some implementations, a method includes: obtaining data associated with input by a user at an application interface indicating an intent to control the application interface; tokenizing, using a tokenizer, the data associated with the input by the user to create tokenized user input data; obtaining tokenized UI element data corresponding to a tokenized record of actionable UI elements associated with the application interface; generating, using an LLM, based at least on the tokenized UI element data and the tokenized user input data, events to inject into the application interface to control one or more of the actionable UI elements; injecting the events into the application interface; and providing feedback to the user in accordance with injecting the events.
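The method steps recited above (tokenize the user input, obtain tokenized UI element data, generate events, inject them, provide feedback) can be illustrated with a minimal sketch. Every name here (`UIElement`, `Event`, `AppInterface`, `tokenize`, `stub_model`, `handle_input`) is hypothetical and does not come from the disclosure. The whitespace tokenizer and the keyword-matching `stub_model` are simple stand-ins for a real tokenizer and an LLM, used only so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class UIElement:
    element_id: str  # hypothetical identifier of an actionable UI element
    label: str       # visible text of the element, e.g. "Play"


@dataclass
class Event:
    target_id: str   # element the event is aimed at
    action: str      # e.g. "click"


@dataclass
class AppInterface:
    elements: List[UIElement]
    received: List[Event] = field(default_factory=list)

    def inject(self, event: Event) -> None:
        # Stand-in for event injection: record the event on the interface.
        self.received.append(event)


def tokenize(text: str) -> List[str]:
    # Toy tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def tokenize_elements(elements: List[UIElement]) -> Dict[str, List[str]]:
    # Tokenized record of actionable UI elements, keyed by element id.
    return {e.element_id: tokenize(e.label) for e in elements}


def stub_model(ui_tokens: Dict[str, List[str]], input_tokens: List[str]) -> List[Event]:
    # Stand-in for the LLM (assumption): emit a click for every element
    # whose label tokens all appear in the user input.
    words = set(input_tokens)
    return [
        Event(element_id, "click")
        for element_id, tokens in ui_tokens.items()
        if tokens and set(tokens) <= words
    ]


def handle_input(
    app: AppInterface,
    utterance: str,
    model: Callable[[Dict[str, List[str]], List[str]], List[Event]] = stub_model,
) -> str:
    input_tokens = tokenize(utterance)            # tokenized user input data
    ui_tokens = tokenize_elements(app.elements)   # tokenized UI element data
    events = model(ui_tokens, input_tokens)       # events generated by the model
    for event in events:
        app.inject(event)                         # inject into the interface
    if not events:
        return "No matching control found."
    # Feedback to the user in accordance with the injected events.
    return "Performed: " + ", ".join(f"{e.action} {e.target_id}" for e in events)
```

In this sketch the model is passed in as a callable, so the stub can be swapped for a call to an actual LLM without changing the rest of the flow.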
Abstract: Systems and methods are provided for implementing an interactive application that incorporates voice-driven and motion-driven interactions with media content on a display device. An application instance can be initialized for interacting with media content output to a display device, such as a head-mounted display (HMD). A determination is then made whether a received user interaction can be interpreted as an interactive command defined by an operating system (OS). If the OS can interpret the user interaction, the interactive command can be executed with actions generated by the OS. If the OS cannot interpret the user interaction, an emulation of the interactive command may be executed instead. Subsequently, the media content is presented within a user interface based on the interactive command. For example, the user interaction can be a head movement of the user that is interpreted as a command controlling presentation of web content in the HMD.
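The decision described above, where the OS handles an interaction it can interpret and an emulation runs otherwise, might be sketched as follows. The interaction strings, the `OS_COMMANDS` and `EMULATED` tables, and the `os_interpret` and `dispatch` functions are all illustrative assumptions rather than anything specified in the abstract.

```python
from typing import Callable, Dict, Optional

# Hypothetical interactions the OS can interpret natively.
OS_COMMANDS: Dict[str, str] = {
    "voice:scroll down": "SCROLL_DOWN",
    "voice:go back": "BACK",
}

# Hypothetical emulations for interactions the OS cannot interpret.
EMULATED: Dict[str, Callable[[], str]] = {
    "head:tilt_left": lambda: "emulated PAGE_LEFT",
}


def os_interpret(interaction: str) -> Optional[str]:
    # Return the OS-defined command, or None if the OS cannot interpret it.
    return OS_COMMANDS.get(interaction)


def dispatch(interaction: str) -> str:
    command = os_interpret(interaction)
    if command is not None:
        # The OS understood the interaction: execute the OS-generated action.
        return f"os {command}"
    emulate = EMULATED.get(interaction)
    if emulate is not None:
        # The OS could not interpret it: run an emulation of the command.
        return emulate()
    # Neither path applies: leave the presentation unchanged.
    return "ignored"
```

The returned string stands in for whatever update to the presented media content the command would trigger in a real application.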