Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for selecting actions to be performed by an agent interacting with an environment. In one aspect, a method comprises, at each of one or more time steps: generating a respective action score for each action in a set of possible actions, wherein the set of possible actions comprises: (i) a plurality of atomistic actions, and (ii) one or more optimization actions, wherein each optimization action is associated with a respective objective function that measures performance of the agent on a corresponding auxiliary task; selecting an action from the set of possible actions in accordance with the action scores, wherein the selected action is an optimization action; in response to selecting the optimization action, performing a numerical optimization to identify a sequence of one or more atomistic actions that are predicted to optimize the objective function.
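The selection step described in the abstract above can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in rather than the claimed implementation: the one-dimensional toy environment, the action names, the `objective_reach_target` function, and the fixed planning horizon are all assumptions. An exhaustive search over short action sequences stands in for whatever numerical optimization a real system would use.

```python
import itertools

# Hypothetical atomistic actions for a toy 1-D environment.
ATOMISTIC = ["left", "right", "stay"]
MOVES = {"left": -1, "right": 1, "stay": 0}


def objective_reach_target(sequence, start=0, target=3):
    """Assumed auxiliary-task objective: higher is better the closer the
    agent ends up to the target position after executing the sequence."""
    position = start
    for action in sequence:
        position += MOVES[action]
    return -abs(position - target)


# Each optimization action is associated with one objective function.
OPTIMIZATION_ACTIONS = {"opt:reach_target": objective_reach_target}


def select_action(action_scores):
    """Pick the highest-scoring action (greedy selection is an assumption)."""
    return max(action_scores, key=action_scores.get)


def optimize_sequence(objective, horizon=3):
    """Stand-in optimizer: brute-force search over all sequences of
    atomistic actions of the given length."""
    candidates = itertools.product(ATOMISTIC, repeat=horizon)
    return list(max(candidates, key=objective))


def step(action_scores):
    """Return the atomistic actions to execute at this time step."""
    chosen = select_action(action_scores)
    if chosen in OPTIMIZATION_ACTIONS:
        # An optimization action expands into a planned atomistic sequence.
        return optimize_sequence(OPTIMIZATION_ACTIONS[chosen])
    return [chosen]
```

In this sketch, an optimization action is scored alongside the atomistic actions, and only when it wins does the search run, so the cost of planning is paid only at the time steps where the scores favor it.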
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for selecting actions for an agent in an environment. In one aspect, a method comprises: receiving an agent trajectory that characterizes interaction of an agent with an environment to perform one or more initial tasks in the environment; processing the agent trajectory to generate a classification output that comprises a respective classification score for each agent category in a set of possible agent categories, wherein each possible agent category is associated with a respective task selection policy; classifying the agent as being included in a corresponding agent category based on the classification scores; selecting tasks to be performed by the agent in the environment based on the task selection policy of the corresponding agent category; and transmitting, to the agent, data defining the selected tasks to be performed by the agent in the environment.
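The classify-then-select pipeline in the abstract above can be sketched as follows. The category names, the success-rate feature, the softmax scoring rule, and the per-category task lists are all illustrative assumptions; the abstract does not specify how classification scores are computed. The transmission step is left out.

```python
import math

# Hypothetical task selection policy for each hypothetical agent category.
TASK_POLICIES = {
    "novice": lambda: ["tutorial", "easy_task"],
    "expert": lambda: ["hard_task", "challenge"],
}


def classification_scores(trajectory):
    """Map a trajectory to one score per agent category.

    `trajectory` is assumed to be a non-empty list of
    (task_name, succeeded) pairs from the agent's initial tasks.
    The success-rate feature and softmax are stand-ins for a learned model.
    """
    success_rate = sum(1 for _, ok in trajectory if ok) / len(trajectory)
    logits = {"novice": 1.0 - success_rate, "expert": success_rate}
    normalizer = sum(math.exp(v) for v in logits.values())
    return {cat: math.exp(v) / normalizer for cat, v in logits.items()}


def select_tasks(trajectory):
    """Classify the agent, then apply that category's task selection policy."""
    scores = classification_scores(trajectory)
    category = max(scores, key=scores.get)
    return category, TASK_POLICIES[category]()
```

For example, `select_tasks([("a", True), ("b", True), ("c", False)])` classifies the agent by its two-in-three success rate and returns the task list of the winning category.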
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training an action selection system used to select actions to be performed by an agent interacting with a target environment to perform a task in the target environment. In one aspect, a method comprises: obtaining a target environment model of the target environment; modifying the target environment model of the target environment to generate an obfuscated environment model of an obfuscated environment that represents an obfuscation of the target environment; obtaining, from each of a plurality of users, one or more obfuscated environment trajectories that represent interaction of the user with the obfuscated environment through a corresponding obfuscated environment simulation; mapping each of the obfuscated environment trajectories to a corresponding target environment trajectory; and training the action selection system on the target environment trajectories.
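The obfuscate, collect, map back, and train steps in the abstract above can be sketched on a toy problem. The assumptions here are substantial: the environment model is a small deterministic transition table, obfuscation is a simple one-to-one relabeling of states and actions, and "training" is a majority-vote imitation of the mapped trajectories. None of these choices comes from the abstract; they only make the data flow concrete.

```python
from collections import Counter, defaultdict

# Hypothetical target environment model: (state, action) -> next state.
TARGET_MODEL = {("s0", "a0"): "s1", ("s1", "a1"): "s2"}

# Hypothetical one-to-one relabelings used as the obfuscation.
STATE_MAP = {"s0": "room_A", "s1": "room_B", "s2": "room_C"}
ACTION_MAP = {"a0": "push", "a1": "pull"}


def obfuscate_model(model):
    """Build the obfuscated environment model by relabeling every entry."""
    return {
        (STATE_MAP[s], ACTION_MAP[a]): STATE_MAP[s_next]
        for (s, a), s_next in model.items()
    }


def to_target_trajectory(obfuscated_trajectory):
    """Map a list of obfuscated (state, action) pairs back to target labels."""
    inverse_states = {v: k for k, v in STATE_MAP.items()}
    inverse_actions = {v: k for k, v in ACTION_MAP.items()}
    return [(inverse_states[s], inverse_actions[a])
            for s, a in obfuscated_trajectory]


def train_policy(target_trajectories):
    """Stand-in training: pick the most frequent action seen in each state."""
    counts = defaultdict(Counter)
    for trajectory in target_trajectories:
        for state, action in trajectory:
            counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}
```

Because the relabeling is invertible, users only ever see the obfuscated labels while the trained policy is expressed entirely in target-environment terms.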