Training and Evaluating Agents
Once you have installed BEHAVIOR and its dependencies, you are ready to evaluate your agents on the benchmark. To be evaluated, an agent must be an instance of the Agent class or of a class derived from it. We provide an initial derived class, CustomAgent, in behavior/benchmark/agents/users_agent.py, to be filled in with your agent's logic. The minimal functionality your agent must provide is a reset function and an act function that returns an action at each timestep, given an observation from the environment.
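A minimal sketch of such a derived class is shown below. The import path of the Agent base class, the constructor, and the action dimensionality are assumptions for illustration and may differ in your installation:
import numpy as np

from behavior.benchmark.agents.agent import Agent  # assumed location of the Agent base class


class CustomAgent(Agent):
    def __init__(self, action_dim=28):
        # action_dim is a placeholder; use the action dimensionality of your robot/environment
        self.action_dim = action_dim

    def reset(self):
        # Called at the start of each activity instance; clear any internal state here
        pass

    def act(self, observation):
        # Called at every timestep; must return an action given the current observation.
        # This placeholder simply returns a zero (no-op) action.
        return np.zeros(self.action_dim)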
Once your agent is implemented as a class derived from Agent, you can evaluate it on BEHAVIOR activities. How to do so depends on your installation type: manual or Docker. In the following, we include evaluation instructions for both installation types.
Evaluating an agent on BEHAVIOR after a manual installation
To evaluate locally, use the functionality in behavior/benchmark/behavior_benchmark.py. As an example, the following commands benchmark a random agent on the single activity indicated in the specified environment config file:
export CONFIG_FILE=path/to/your/config/for/example/behavior/configs/behavior_onboard_sensing.yaml
export OUTPUT_DIR=path/to/your/output/dir/for/example/tmp
python -m behavior.benchmark.behavior_benchmark
Once you have implemented your agent in the class CustomAgent, you can evaluate it instead of the random agent by changing the last command to:
python -m behavior.benchmark.behavior_benchmark --agent-class Custom
The code in behavior_benchmark.py can be used to evaluate an agent following the official benchmark rules. The agent is evaluated on the indicated activity/activities over nine instances of increasing complexity: three instances of the activity that were available for training, three instances that are identical to the training ones except for the initial locations of the small objects, and three instances where the furniture in the scenes is also different. The code computes the benchmark metrics and saves their values to files in the OUTPUT_DIR.
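If you want to inspect the saved results programmatically, a rough sketch is shown below; the file names and format written to OUTPUT_DIR are assumptions (JSON is assumed here) and may differ:
import glob
import json
import os

# Assumed: metric files are written to OUTPUT_DIR as JSON; adjust to the actual format.
output_dir = os.environ.get("OUTPUT_DIR", "/tmp")
for path in sorted(glob.glob(os.path.join(output_dir, "*.json"))):
    with open(path) as f:
        metrics = json.load(f)
    print(path)
    for name, value in metrics.items():
        print(f"  {name}: {value}")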
The example above evaluates the random agent on the single activity specified in the environment’s config file. However, you can select the activity to benchmark the agent on by passing --split with the name of the activity (check all activities here and video examples here), or benchmark on the entire set of 100 activities by specifying --split dev or --split test to use the dev or test activity instances.
For example, to benchmark the provided PPO agent (reinforcement learning), loading a specific policy checkpoint, on only the activity cleaning_toilet, you can execute:
export CONFIG_FILE=path/to/your/config/for/example/behavior/configs/behavior_onboard_sensing.yaml
export OUTPUT_DIR=path/to/your/output/dir/for/example/tmp
python -m behavior.benchmark.behavior_benchmark --agent-class PPO --ckpt-path /tmp/my_checkpoint --split cleaning_toilet
We provide pretrained checkpoints in behavior/benchmark/agents/checkpoints. Note that, due to refactoring changes in the BehaviorRobot, the dimensionality of the action space may have changed since these checkpoints were trained.
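To check whether a checkpoint matches the current action space, you can query the environment's action dimensionality before loading it. The sketch below assumes a gym-style BehaviorEnv class is available; the class name, import path, and constructor arguments are assumptions and may differ across iGibson/BEHAVIOR versions:
import os

# Assumed class name and import path; they may differ in your iGibson/BEHAVIOR version.
from igibson.envs.behavior_env import BehaviorEnv

env = BehaviorEnv(config_file=os.environ["CONFIG_FILE"])
print("Current action space shape:", env.action_space.shape)
env.close()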
Evaluating on a single activity instance
Instead of evaluating agents following the benchmark rules (nine instances per activity), you can also evaluate on a single instance, or on a custom set of activity instances, by directly calling the method BehaviorBenchmark.evaluate_agent_on_one_activity and providing a list of instances.
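A rough sketch of such a call is shown below; the BehaviorBenchmark constructor arguments, the instance identifiers, and the exact signature of evaluate_agent_on_one_activity are assumptions and should be checked against behavior/benchmark/behavior_benchmark.py:
from behavior.benchmark.agents.users_agent import CustomAgent
from behavior.benchmark.behavior_benchmark import BehaviorBenchmark

# Assumed: the benchmark reads CONFIG_FILE and OUTPUT_DIR from the environment, and
# the method accepts an agent, an activity name, and a list of instance identifiers.
agent = CustomAgent()
benchmark = BehaviorBenchmark()
benchmark.evaluate_agent_on_one_activity(
    agent,
    activity="cleaning_toilet",  # hypothetical activity name, taken from the example above
    instance_ids=[0, 1],         # hypothetical list of activity instances to evaluate
)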
Evaluating an agent on BEHAVIOR after a Docker installation
We provide several scripts to evaluate agents on the minival and dev splits. The minival split serves to evaluate on a single activity. The following command evaluates a random agent on the minival split using a local Docker image:
./benchmark/scripts/test_minival_docker.sh --docker-name my_submission --dataset-path my/path/to/dataset
where my_submission is the name of the Docker image, and my/path/to/dataset is the path to the iGibson and BEHAVIOR datasets, together with the igibson.key file, obtained by following the installation instructions.
You can also evaluate locally on the dev split (all activities) by executing:
./benchmark/scripts/test_dev_docker.sh --docker-name my_submission --dataset-path my/path/to/dataset
Both scripts call behavior/benchmark/scripts/evaluate_agent.sh. You can modify this script, or the Docker scripts, to evaluate your own agent.
Submitting to the BEHAVIOR public leaderboard on EvalAI
If you use the Docker installation, you can submit your solution to be evaluated and included in the public leaderboard. To do so, you first need to register for our benchmark on EvalAI here. Then follow the instructions in the submit tab on EvalAI, which we summarize here:
# Installing EvalAI Command Line Interface
pip install "evalai>=1.2.3"
# Set EvalAI account token
evalai set_token <your EvalAI participant token>
# Push docker image to EvalAI docker registry
evalai push my_submission:latest --phase <track-name>
There are two valid benchmark tracks, depending on whether your agent uses only onboard sensing or assumes full observability: behavior-test-onboard-sensing-1190 and behavior-test-full-observability-1190.
Once we receive your submission, we will evaluate it and return the results. Because the evaluation process is time- and resource-consuming, each participant is restricted to one submission per week, with a maximum of 4 submissions per month.