BuilderBench – A benchmark for generalist agents

Princeton University

Can AI models build a world that today's generative models can only dream of?

Motivation

Today's AI models learn primarily through mimicry and sharpening, so it is not surprising that they struggle to solve problems beyond the limits set by existing data. To solve novel problems, agents should acquire skills for exploring and learning through experience. BuilderBench is a benchmark designed to facilitate research towards developing such agents. It is designed to capture the challenges of open-ended exploration, embodied reasoning, and broad generalization.

The main features of BuilderBench are:

  • A parallelizable and hardware-accelerated simulator built using MuJoCo and JAX. Training a PPO policy to pick and place a block takes less than 5 minutes on one GPU and twelve CPU threads (a minimal parallel-simulation sketch follows this list).
  • A task-suite of 42 (×4 variations) tasks, where each task requires qualitatively different reasoning capabilities.
  • Single-file JAX implementations of two self-supervised RL algorithms and four standard RL algorithms.
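
To give a sense of the simulation stack, here is a minimal sketch of the general MuJoCo MJX pattern that such a hardware-accelerated simulator builds on: many independent simulation states stepped in parallel with jax.vmap on an accelerator. The toy one-cube model and the batch size of 1024 below are illustrative assumptions, not the actual BuilderBench environment code.

    import jax
    import mujoco
    from mujoco import mjx

    # Toy model: a single free-floating cube above a plane (illustrative only).
    XML = """
    <mujoco>
      <worldbody>
        <geom type="plane" size="1 1 0.1"/>
        <body pos="0 0 0.5">
          <freejoint/>
          <geom type="box" size="0.05 0.05 0.05" mass="0.1"/>
        </body>
      </worldbody>
    </mujoco>
    """

    mj_model = mujoco.MjModel.from_xml_string(XML)
    mjx_model = mjx.put_model(mj_model)   # copy the model to the accelerator

    def init(rng):
        # One simulation state, with a randomized initial cube height.
        data = mjx.make_data(mjx_model)
        return data.replace(qpos=data.qpos.at[2].add(jax.random.uniform(rng)))

    # A batch of 1024 independent environments, stepped in lockstep on the device.
    batch = jax.vmap(init)(jax.random.split(jax.random.PRNGKey(0), 1024))
    step = jax.jit(jax.vmap(mjx.step, in_axes=(None, 0)))
    for _ in range(100):
        batch = step(mjx_model, batch)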


Why building blocks?

A central insight of our paper is that block building is a simple domain that captures both open-endedness and embodied reasoning.

  • Each block is conceptually an atomic unit, and composing blocks lets agents build diverse structures. The skills required to build these structures become open-ended as the number of blocks scales.
  • While research on reasoning has almost become synonymous with large language models, block-building allows us to study whether this sort of reasoning and generalization can emerge from the ground up, through exploration and trial-and-error learning.


Aha! moments: Reasoning in BuilderBench

(None of these tasks were solved by GPT-5 Thinking or Google Gemini Pro at the time of writing; see our paper for more details.)

Tasks are carefully curated so that solving them requires the agent to unlock at least one distinct reasoning ability and to compose various high-level skills sequentially. Tasks require not only motor skills like locomotion, grasping, and throwing, but also higher-level skills such as logical reasoning (commutativity and associativity of pick-and-place ordering), geometrical reasoning (maximizing overhangs, packing problems), and intuitive physics (gravity, friction, toppling, balancing). Tasks also require reasoning about counterweights, buttresses, and temporary scaffolding!

Let's look at some examples:

T-block


This task requires building a simple T-shaped structure with one cube at the base and two cubes on top. The second frame (B) shows what many people envision as the solution. However, as shown in that frame, this configuration isn't stable. Solving the task requires the insight to rotate the bottom cube by about 45°. Since the diagonal of the cube's top face is longer than its edge, the rotated base provides sufficient support for both top cubes, enabling a stable T-shaped structure (C).
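
The geometric argument can be checked with a line of arithmetic. The sketch below is a toy calculation with unit edge length (not code from the benchmark); it compares the offset of each top cube's centre of mass with the support offered by an axis-aligned versus a 45°-rotated base.

    import math

    s = 1.0                  # cube edge length (arbitrary units)
    top_com_offset = s / 2   # each top cube's centre sits s/2 from the base centre

    # Axis-aligned base: support reaches only s/2 from its centre, so each top
    # cube's centre of mass sits exactly on the edge of the support and it topples.
    support_axis_aligned = s / 2

    # Base rotated by 45 degrees: along the row of top cubes, support now reaches
    # half the face diagonal, s * sqrt(2) / 2, roughly 0.71 * s.
    support_rotated = s * math.sqrt(2) / 2

    print(top_com_offset < support_axis_aligned)   # False -> unstable (frame B)
    print(top_com_offset < support_rotated)        # True  -> stable   (frame C)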

Leaning Tower


Maximum overhang problem


BuilderBench Environment and Task-suite

The BuilderBench environment consists of a single robot hand with five degrees of freedom. The hand can navigate in 3D space and interact with n cubes. All interactions approximate Newtonian physics, simulated using MuJoCo. Each task corresponds to a physically stable target structure built using a subset of the cubes in the environment. To specify a task, we provide the vector of target cube positions.
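
Concretely, a task specification is just an array of target cube positions. The sketch below shows the shape of such a specification for a three-cube stack together with a simple success check; the coordinates and the position tolerance are illustrative assumptions, not values taken from the benchmark.

    import jax.numpy as jnp

    # One (x, y, z) target per cube, shape (n_cubes, 3). Illustrative values only.
    target = jnp.array([
        [0.0, 0.0, 0.05],   # base cube resting on the table
        [0.0, 0.0, 0.15],   # second cube stacked on the first
        [0.0, 0.0, 0.25],   # third cube on top
    ])

    def success(cube_positions, target, tol=0.02):
        """True if every cube lies within `tol` of its target (tolerance assumed)."""
        return bool(jnp.all(jnp.linalg.norm(cube_positions - target, axis=-1) < tol))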

The BuilderBench task suite contains a total of 42 tasks. Roughly, we designed five tasks for each environment with 1-9 cubes. The primary design principles while curating tasks were:

  • Solving different tasks should require distinct high level skills.
  • Most tasks should be solvable by humans.
  • Tasks should range from very easy to extremely hard.
  • Tasks should include some whose solutions are unknown even to the authors.


BuilderBench Training and evaluation


To evaluate open-ended exploration and generalization, we design a self-supervised protocol. We also provide a supervised protocol as a debugging tool that gives researchers additional feedback.

Self-supervised protocol: The agent interacts with the environment but does not receive any task specification during training. Its goal is to explore the environment and acquire general knowledge and skills that might help it solve future tasks. The agent must output a task-conditioned policy, which is evaluated on held-out tasks from the task-suite.

Supervised protocol: In this standard RL protocol, the agent interacts with a single environment to solve a single task from the task-suite. It is trained and tested on the same goal.
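
Schematically, the two protocols differ only in where the goals come from during training. The toy sketch below makes that difference explicit; all names are hypothetical stand-ins (a 1-D goal space, a placeholder agent), not the BuilderBench API.

    import random

    NUM_TRAIN_STEPS = 200

    class ToyAgent:
        """Stand-in for a task-conditioned agent: it simply remembers practiced goals."""
        def __init__(self):
            self.practiced = []
        def propose_goal(self):
            return random.random()            # autotelic goal in a 1-D toy goal space
        def practice(self, goal):
            self.practiced.append(goal)
        def solves(self, task, tol=0.05):
            return any(abs(g - task) < tol for g in self.practiced)

    def self_supervised_protocol(agent, eval_tasks):
        # Training: no task specification; the agent chooses what to practice.
        for _ in range(NUM_TRAIN_STEPS):
            agent.practice(agent.propose_goal())
        # Evaluation: zero-shot on held-out tasks from the task-suite.
        return [agent.solves(task) for task in eval_tasks]

    def supervised_protocol(agent, task):
        # Training and evaluation use the same single task.
        for _ in range(NUM_TRAIN_STEPS):
            agent.practice(task)
        return agent.solves(task)

    print(self_supervised_protocol(ToyAgent(), eval_tasks=[0.3, 0.7]))
    print(supervised_protocol(ToyAgent(), task=0.3))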


Benchmarking

TL;DR

We evaluate six algorithms across the self-supervised (agents explore freely) and supervised (train on the test task) protocols. Current methods, whether self-supervised or supervised, do not scale to complex tasks. We believe that research into developing new algorithms (or revisiting old ones) is required.


Self-supervised protocol: We implemented two algorithms, Sampling For Learnability (SFL) and Maximum Entropy Gain Exploration (MEGA). Both algorithms are implemented on top of proximal policy optimization (PPO).

Both algorithms achieve only trivial performance on tasks with three cubes. While these results indicate that the tested algorithms do not directly scale to complex tasks, they primarily underscore the inherent difficulty of the task setup itself. We believe that research into developing new algorithms (or revisiting old ones) is required to solve these tasks.


Supervised protocol: For this protocol, we benchmark four RL algorithms: proximal policy optimization (PPO), soft actor-critic (SAC), contrastive RL (CRL), and random network distillation (RND).

Training on the test goals improves both the returns and the success rates achieved by the best agents. However, as the number of cubes and the complexity of the tasks increase, current algorithms are unable to achieve non-zero success.

Evaluating Large Language Models

Evaluating LLMs on BuilderBench. See our paper for more details.
Task Name            ChatGPT-5    Gemini 2.5 Pro
T block              ✗            ✗
Four cube packing    ✗            ✗
Hexagonal Portal     ✗            ✗
Leaning tower        ✗            ✗
Maximum Overhang     ✗            ✗
It has been shown that scaling pretraining and inference-time compute can significantly enhance the reasoning abilities of language models. To test whether the latest proprietary models can solve tasks from our task-suite, we evaluated ChatGPT-5 and Gemini 2.5 Pro on some of our tasks. Each model was given a descriptive prompt about the environment and the task. The model's goal was to produce a high-level, open-loop plan in language such that following this plan would stably build the target structure. A simple example task with a correct solution was also included in the prompt. The table above shows that both models, despite using inference-time compute, are unable to provide a correct high-level plan for any of the tasks. While this is not meant to be an extensive evaluation of current models' abilities, it highlights how solving our tasks requires non-obvious reasoning steps that are beyond what current models can achieve through scaling alone.
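
For a flavor of the setup, here is an illustrative prompt skeleton written as a Python string; the wording and the in-context example are hypothetical, and the exact prompts used in the evaluation are given in the paper.

    PROMPT = """
    You control a robot hand that can pick up, rotate, and place identical cubes on a
    flat table governed by ordinary rigid-body physics (gravity, friction, toppling).

    Example task: stack two cubes.
    Example solution:
      1. Place cube 1 on the table.
      2. Place cube 2 directly on top of cube 1.

    Target task: <task description>.
    Give a numbered, high-level open-loop plan such that executing the steps in order
    leaves the target structure physically stable.
    """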

Research opportunities and open questions

What are the ingredients for building generalist agents that learn through self-supervised exploration? Most tasks in the self-supervised protocol remain unsolved by the algorithms we tried, though ours is by no means an exhaustive evaluation. It will be exciting to see which approaches lead to agents that can solve the most complex tasks purely through self-supervised trial and error.

Why do standard RL algorithms struggle to make progress on the more complex tasks? All RL algorithms we tried achieve zero success on tasks with more than 4 cubes. What are the primary reasons? Is it exploration, the curse of horizon, model size, the number of training steps, hyperparameters, or something else?

How to perform RL pretraining? It is easy to come up with target structures that are stable and easy to build; by easy, we mean tasks that can be solved with simple pick-and-place primitives. This is in contrast to tasks that require unique reasoning skills (see our task-suite for more examples), which are not trivial to design or to solve. We can train multi-task RL agents using dense rewards to solve these easy-to-build tasks. Will the pretrained model provide a better initialization for solving the novel / unsolvable tasks we really care about? This is akin to pretraining in LLMs, where low-quality data is cheap, but data for novel / unsolvable tasks is not available (by definition).
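
As one concrete shaping choice for such pretraining (an illustrative assumption, not necessarily the reward used in the paper), a dense reward can simply be the negative distance between the current and target cube positions:

    import jax.numpy as jnp

    def dense_reward(cube_positions, target_positions):
        """Negative mean Euclidean distance between each cube and its target slot.

        Both arrays have shape (n_cubes, 3). This is a common shaping choice,
        shown here for illustration only.
        """
        return -jnp.mean(jnp.linalg.norm(cube_positions - target_positions, axis=-1))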

Two-dimensional scaling. The self-supervised protocol allows one to study scaling along two new dimensions: compute allocated to sampling autotelic tasks (exploration in task-space) and compute allocated to solving those tasks via trial and error (exploration in trajectory-space). How should the two be balanced optimally?

What is the scaling behavior of current self-supervised exploration algorithms? While our results show that self-supervised exploration algorithms are not able to solve complex tasks, ours is only a preliminary evaluation. BuilderBench allows one to study the scaling behavior of existing self-supervised algorithms. We argue that a key bottleneck for scaling self-supervised algorithms is the availability of suitable benchmarks: existing benchmarks rarely allow agents to practice skills ranging from exploration to prediction, and from low-level control to high-level reasoning.

A new type of scaling law? Typically, the x-axis in scaling laws corresponds to compute. BuilderBench allows one to systematically put task hardness on the x-axis (build a pyramid from 1, 2, 3, ... blocks). Can we reliably predict how RL algorithms scale with task hardness?

Why is the performance of off-policy RL algorithms (like SAC and CRL) so much worse than PPO? In our experiments, we found that off-policy algorithms (SAC and CRL) were much worse in terms of sample efficiency and final performance than PPO and RND, despite making more gradient updates.

Our codebase makes it easy to start exploring these questions. Please reach out (rg9360@princeton.edu) or open a GitHub issue if you have any questions or comments.

BibTeX


  @misc{ghugare2025builderbenchbenchmarkgeneralist,
      title={BuilderBench -- A benchmark for generalist agents}, 
      author={Raj Ghugare and Catherine Ji and Kathryn Wantlin and Jin Schofield and Benjamin Eysenbach},
      year={2025},
      eprint={2510.06288},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.06288}, 
  }