VLM-RM: Specifying reward with natural language

We trained reinforcement learning agents to perform complex behaviors like kneeling and doing splits from natural language descriptions of the tasks, by using vision-language models as zero-shot reward models.

Oct 23, 2023
By Juan Rocamonde, Victor Montesinos

Reinforcement learning (RL) is a common method for training robotic agents, but it faces a major practical roadblock: specifying reward functions can be difficult, brittle, and simply infeasible for complex real-world tasks. Approaches like inverse reinforcement learning and imitation learning aim to learn from human examples or feedback, but require large amounts of data.

In our latest research, we’ve made a breakthrough in automating reward specification using the natural language capabilities of large vision-language models. Specifically, we’ve shown that pre-trained base models like CLIP can provide effective rewards for RL agents with just a simple text prompt describing the desired behavior.

For example, using only the prompt “a humanoid robot kneeling,” we trained a humanoid to successfully kneel down without any additional feedback or fine-tuning. This zero-shot approach worked for other complex motor skills as well, like doing the splits. The simplicity of describing tasks through natural language unlocks the potential for generalized robot learning that understands and follows human instructions.

We use CLIP as a reward model to train a MuJoCo humanoid robot to (1) stand with raised arms, (2) sit in a lotus position, (3) do the splits, and (4) kneel on the ground. Each task is specified by a single-sentence text prompt, such as 'a humanoid robot kneeling', and requires no prompt engineering. More videos of our trained agents are available on the paper website.

Vision Language Models as Reward Models

The key innovation behind our approach is using a pre-trained vision-language model like CLIP to directly generate reward signals from natural language descriptions of a task, without any additional training. Prior work has explored similar ideas, but required collecting human-labeled data to fine-tune VLMs for each specific task. In contrast, our method works zero-shot, relying solely on CLIP’s pre-existing capabilities.

CLIP is a deep learning model trained on millions of image-caption pairs from the internet. Its goal is to determine whether or not a given image and caption match. To succeed at this, CLIP has learned to build rich internal representations of both images and text that capture their meaning.

VLM-RM takes advantage of these learned representations. First, we use CLIP to encode an image of the current state of the robotic environment into an abstract representation. Second, we encode the textual task instruction. If the two representations are highly similar, it suggests the environment state matches the instruction goal. This measure of similarity gives us a natural zero-shot reward signal.
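The two encoding steps and the similarity comparison can be sketched in a few lines. This is a minimal illustration, not our exact implementation: it assumes cosine similarity as the similarity measure, and the names make_vlm_reward, encode_image, and encode_text are hypothetical. The toy encoders at the bottom are stand-ins so the sketch runs without model weights; in practice they would be replaced by the image and text encoders of a pre-trained VLM such as CLIP.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as "no similarity"
    return dot / (norm_a * norm_b)


def make_vlm_reward(
    encode_image: Callable[[object], Vector],  # hypothetical: frame -> embedding
    encode_text: Callable[[str], Vector],      # hypothetical: prompt -> embedding
    prompt: str,
) -> Callable[[object], float]:
    """Build a reward function from a text prompt and two encoders."""
    # The prompt is fixed for the whole task, so it is encoded only once.
    text_embedding = encode_text(prompt)

    def reward(frame: object) -> float:
        # Encode the rendered frame and compare it with the prompt embedding.
        return cosine_similarity(encode_image(frame), text_embedding)

    return reward


# Toy stand-in encoders (assumptions for illustration only).
def toy_encode_text(prompt: str) -> Vector:
    return [1.0, 0.0]


def toy_encode_image(frame: object) -> Vector:
    return list(frame)  # pretend the "frame" is already an embedding


reward_fn = make_vlm_reward(
    toy_encode_image, toy_encode_text, "a humanoid robot kneeling"
)
```

A frame whose embedding points in the same direction as the prompt embedding scores close to 1, and an orthogonal one scores close to 0.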

Once we generate our reward signal from just a single human instruction, we can use standard reinforcement learning algorithms, such as Soft Actor-Critic, to find a policy that maximizes the reward signal.
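One way to plug such a reward into a standard RL pipeline is to wrap the environment so that its built-in reward is replaced by the score of the rendered frame. The sketch below is illustrative only: VLMRewardWrapper, ToyEnv, and toy_reward are hypothetical names, the five-value step() return merely mimics a common environment interface, and the toy reward stands in for a VLM similarity score. An off-the-shelf algorithm such as Soft Actor-Critic would then be trained on the wrapped environment.

```python
class VLMRewardWrapper:
    """Replaces an environment's reward with one computed from rendered frames."""

    def __init__(self, env, reward_fn):
        self.env = env
        self.reward_fn = reward_fn  # maps a rendered frame to a scalar reward

    def reset(self):
        return self.env.reset()

    def step(self, action):
        # Discard the environment's own reward and score the rendered frame.
        obs, _env_reward, terminated, truncated, info = self.env.step(action)
        frame = self.env.render()
        return obs, self.reward_fn(frame), terminated, truncated, info


class ToyEnv:
    """Hypothetical 1-D environment; its 'frame' is simply the position."""

    def __init__(self):
        self.position = 0.0

    def reset(self):
        self.position = 0.0
        return self.position

    def step(self, action):
        self.position += action
        return self.position, 0.0, False, False, {}

    def render(self):
        return self.position


def toy_reward(frame):
    # Stand-in for a VLM similarity score: highest when the position is 1.0.
    return -abs(frame - 1.0)


env = VLMRewardWrapper(ToyEnv(), toy_reward)
env.reset()
obs, reward, terminated, truncated, info = env.step(0.25)
```

Because the wrapper only touches the reward, the policy-optimization code does not need to know that the signal comes from a language prompt.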

Implications and future work

The ability to specify complex reinforcement learning tasks with simple natural language prompts has profound implications. It paves the way for language supervision interfaces that can safely steer agent behavior and enable corrigibility. By allowing humans to easily correct or override agent objectives through language, we can build reliable systems that remain aligned with human intentions.

While our method marks an important milestone, there remain exciting opportunities to extend it. The range of tasks that can be solved is currently limited by the visual and language understanding capabilities of the VLM. As larger and more capable VLMs emerge, we expect them to enable specifying an even broader set of behaviors.

We see VLM-RMs as an exciting avenue for scalable oversight in visual environments. An interesting direction for future work is exploring the potential of VLM-RMs to provide automated process supervision. By setting safe guardrails in the behavior of autonomous agents, we can control the means available to achieve a certain end.

Once a reward model is specified, a key challenge in any RL problem is the sample complexity of training policies, which makes real-world online learning infeasible. The best RL algorithms today still require millions of trials, and policies trained in simulators do not transfer reliably to the real world (the “sim2real” problem). We think that VLM-RMs, combined with more accurate simulators, can help bridge this gap by training policies on pre-trained VLM representations, which we found to be surprisingly robust to distributional shift, instead of on pixel observations.


At Vertebra, we are creating mind-computer interfaces that make machine intelligence a seamless extension of the human mind. Our research on VLM-RM demonstrates how natural language interfaces can allow us to robustly convey intentions to agents in complex environments. This is an important step towards developing reliable and corrigible systems that can understand our objectives and acquire new skills on the fly, without relying on huge human-annotated datasets as RLHF does.

Future AI systems should think in the background about how to help us, rather than rushing to provide immediate solutions. VLM-RM offers a way to guide open-ended exploration through language, enabling agents to discover creative solutions to novel problems. We expect to extend this approach to new kinds of exciting problems by tapping into other data modalities, such as video or scientific data.


Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez, David Lindner (2023): "Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning".

Vertebra Technologies, Corp. @VertebraCorp on X. © 2024. 548 Market St., San Francisco CA 94104-5401