Training and Evaluating Agents
Once you have installed BEHAVIOR and its dependencies, you are ready to evaluate your agents on the benchmark. To be evaluated, an agent must be instantiated as an object of the class
Agent or of a class derived from it. We provide an initial derived class in
behavior/benchmark/agents/users_agent.py for you to fill in with your agent. At a minimum, your agent needs a
reset function and an
act function that provides an action at each timestep, given an observation from the environment.
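As a minimal sketch of the interface described above, the following defines a stand-in Agent base class and a derived agent that always returns a zero action. The base class here is a placeholder written for illustration, not the one shipped with BEHAVIOR; the name ZeroAgent, the constant ACTION_DIM, the method signatures, and the dictionary-shaped observation are all assumptions.

```python
from typing import Any, Dict, List


class Agent:
    """Stand-in for the benchmark's Agent base class (assumption, not the real one)."""

    def reset(self) -> None:
        raise NotImplementedError

    def act(self, observations: Dict[str, Any]) -> List[float]:
        raise NotImplementedError


# Hypothetical action dimensionality; use the one of your robot's action space.
ACTION_DIM = 4


class ZeroAgent(Agent):
    """Illustrative agent that counts timesteps and always outputs a zero action."""

    def __init__(self) -> None:
        self.steps = 0

    def reset(self) -> None:
        # Clear per-episode state before a new activity instance starts.
        self.steps = 0

    def act(self, observations: Dict[str, Any]) -> List[float]:
        # A real agent would map the observation to an action here.
        self.steps += 1
        return [0.0] * ACTION_DIM


agent = ZeroAgent()
agent.reset()
action = agent.act({"rgb": None})  # placeholder observation
```

Keeping per-episode state in reset and per-step logic in act mirrors the two functions the benchmark requires; check the actual Agent class in the repository for the exact signatures before adapting this.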
Once your agent is implemented as a class derived from the
Agent class, you can evaluate it in BEHAVIOR activities. How you do this depends on your installation type: manual or Docker. The sections below give evaluation instructions for both.
Evaluating an agent on BEHAVIOR after a manual installation
To evaluate locally, use the functionality in
behavior/benchmark/behavior_benchmark.py. As an example, the following command benchmarks a random agent on the single activity indicated in the environment config file:
python -m behavior.benchmark.behavior_benchmark
Once you have implemented your agent in the class
CustomAgent, you can evaluate it instead of the random agent by replacing the previous command with:
python -m behavior.benchmark.behavior_benchmark --agent-class Custom
The code in
behavior_benchmark.py can be used to evaluate an agent following the official benchmark rules: the agent is evaluated in the indicated activity or activities on nine instances of increasing complexity. These are three instances of the activity that were available for training, three instances where everything is the same as in training except that the small objects have different initial locations, and three instances where the furniture in the scenes is also different. The code computes the benchmark metrics and saves the values to files in the
The example above evaluates the random agent in a single activity specified in the environment’s config file. However, you can select the activity you want to benchmark the agent on with the option
--split and the name of the activity (check all activities here and video examples here), or benchmark on the entire set of 100 activities by specifying
--split dev or
--split test, to use the development or test activity instances.
For example, to benchmark a provided PPO agent (reinforcement learning) loading a specific policy checkpoint only for the activity
cleaning_toilet, you can execute:
python -m behavior.benchmark.behavior_benchmark --agent-class PPO --ckpt-path /tmp/my_checkpoint --split cleaning_toilet
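If you launch many evaluations from a script, it can help to assemble the command programmatically. The sketch below builds the argument list from the three options mentioned above (--agent-class, --ckpt-path, --split); the helper name build_benchmark_command is hypothetical, and the sketch only prints the command rather than running it.

```python
import shlex
from typing import List, Optional


def build_benchmark_command(
    agent_class: Optional[str] = None,
    ckpt_path: Optional[str] = None,
    split: Optional[str] = None,
) -> List[str]:
    """Assemble the argv list for the benchmark module from the documented options."""
    cmd = ["python", "-m", "behavior.benchmark.behavior_benchmark"]
    if agent_class is not None:
        cmd += ["--agent-class", agent_class]
    if ckpt_path is not None:
        cmd += ["--ckpt-path", ckpt_path]
    if split is not None:
        cmd += ["--split", split]
    return cmd


cmd = build_benchmark_command("PPO", "/tmp/my_checkpoint", "cleaning_toilet")
print(shlex.join(cmd))
# To launch it, pass the list to subprocess.run(cmd, check=True).
```

Omitting all arguments reproduces the default random-agent command, so the same helper covers both cases.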
We provide pretrained checkpoints in
behavior/benchmark/agents/checkpoints. Note that, due to refactoring of the
BehaviorRobot, the dimensionality of the action space may have changed, so these checkpoints may not match the current action space.
Evaluating on a single activity instance
Instead of evaluating agents following the benchmark rules (nine instances per activity), you can also evaluate on a single activity instance or a custom set of instances by directly calling the method
BehaviorBenchmark.evaluate_agent_on_one_activity and providing a list of instances.
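To make the relation between the official nine-instance schedule and a custom subset concrete, here is a small sketch that enumerates the three complexity tiers with three instances each and then filters them. The tier labels and helper names are hypothetical, and the sketch deliberately does not call the benchmark: how these pairs map to the instance identifiers that evaluate_agent_on_one_activity expects is not specified here and should be checked in the code.

```python
from itertools import product
from typing import List, Tuple

# Three complexity tiers described by the benchmark rules (labels are hypothetical).
TIERS = ("training_instance", "new_object_locations", "new_furniture")
INSTANCES_PER_TIER = 3


def official_schedule() -> List[Tuple[str, int]]:
    """All nine (tier, index) pairs, ordered by increasing complexity."""
    return list(product(TIERS, range(INSTANCES_PER_TIER)))


def custom_schedule(tiers: List[str]) -> List[Tuple[str, int]]:
    """Subset of the schedule restricted to the requested tiers."""
    return [(tier, idx) for tier, idx in official_schedule() if tier in tiers]


schedule = official_schedule()
subset = custom_schedule(["new_furniture"])
```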
Evaluating an agent on BEHAVIOR after a Docker installation
We provide several scripts to evaluate agents on the minival and
dev splits. The
minival split serves to evaluate on a single activity. The following command evaluates a random agent on the
minival split using a local Docker image:
./benchmark/scripts/test_minival_docker.sh --docker-name my_submission --dataset-path my/path/to/dataset
Here, my_submission is the name of the Docker image, and
my/path/to/dataset is the path to the iGibson and BEHAVIOR Datasets and to the
igibson.key obtained by following the installation instructions.
You can also evaluate locally for the
dev split (all activities) by executing:
./benchmark/scripts/test_dev_docker.sh --docker-name my_submission --dataset-path my/path/to/dataset
Both scripts call
behavior/benchmark/scripts/evaluate_agent.sh. You can modify this script, or the Docker scripts, to evaluate your agent.
Submitting to the BEHAVIOR public leaderboard on EvalAI
If you use the Docker installation, you can submit your solution to be evaluated and included in the public leaderboard. To do so, first register for our benchmark on EvalAI here. Then follow the instructions in the
submit tab on EvalAI, which we summarize here:
# Installing EvalAI Command Line Interface
pip install "evalai>=1.2.3"
# Set EvalAI account token
evalai set_token <your EvalAI participant token>
# Push docker image to EvalAI docker registry
evalai push my_submission:latest --phase <track-name>
There are two valid benchmark tracks, depending on whether your agent uses only onboard sensing or assumes full observability.
Once we receive your submission, we evaluate and return the results.
Because the evaluation process is time- and resource-consuming, each participant may submit at most once per week and at most 4 times per month.