Abstract: The systematic evaluation and understanding of computer vision models under varying conditions require large amounts of data with comprehensive and customized labels, which real-world vision datasets rarely satisfy. While current synthetic data generators offer a promising alternative, particularly for embodied AI tasks, they often fall short for computer vision tasks due to low asset and rendering quality, limited diversity, and unrealistic physical properties. We introduce the BEHAVIOR Vision Suite (BVS), a set of tools and assets to generate fully customized synthetic data for systematic evaluation of computer vision models, based on the newly developed embodied AI benchmark, BEHAVIOR-1K. BVS supports a large number of adjustable parameters at the scene level (e.g., lighting, object placement), the object level (e.g., joint configuration, attributes such as "filled" and "folded"), and the camera level (e.g., field of view, focal length). Researchers can arbitrarily vary these parameters during data generation to perform controlled experiments. We showcase three example application scenarios: systematically evaluating the robustness of models across different continuous axes of domain shift, evaluating scene understanding models on the same set of images, and training and evaluating simulation-to-real transfer for a novel vision task: unary and binary state prediction. Project website:

What problem does this paper attempt to address?

This paper introduces the BEHAVIOR Vision Suite (BVS), a toolkit for generating customizable synthetic data to systematically evaluate and understand computer vision models. Real-world datasets rarely provide the comprehensive, customized labels needed to study models under varying conditions, and existing synthetic data generators are limited in asset and rendering quality, diversity, and physical realism. BVS, built on the BEHAVIOR-1K embodied AI benchmark, exposes a large number of adjustable parameters at the scene level (such as lighting and object placement), the object level (such as joint configuration and attributes like "filled" and "folded"), and the camera level (such as field of view and focal length). Researchers can vary these parameters freely to generate data for controlled experiments. The paper demonstrates three applications: 1) evaluating model robustness along different continuous axes of domain shift; 2) evaluating scene understanding models on the same set of images; 3) training and evaluating simulation-to-real transfer for a novel vision task, unary and binary state prediction. BVS emphasizes high visual quality, physical plausibility, and a high degree of customization, and it provides rich annotations such as scene graphs, point clouds, and depth. It covers a wide range of indoor scenes and objects and supports physical interaction and modification of object attribute states. Compared with existing datasets, 3D reconstruction datasets, synthetic datasets, and 3D simulators, BVS offers advantages in customization and visual quality. It provides user-friendly tools for generating customized data and addresses limitations of existing datasets, such as expensive annotation, static images, and fixed data distributions. Through experiments, the paper demonstrates the value of BVS for model robustness evaluation, scene understanding, and training for new tasks.
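
To make the idea of a controlled experiment concrete, the following is a minimal Python sketch of sweeping one parameter while holding the others fixed. It does not use the BVS API: `RenderConfig`, its field names, and `sweep` are hypothetical names chosen for illustration, loosely mirroring the scene, object, and camera levels described above.

```python
from dataclasses import dataclass, fields, replace
from typing import Iterable, Iterator


@dataclass(frozen=True)
class RenderConfig:
    # Hypothetical parameters, one per level named in the abstract.
    light_intensity: float = 1.0      # scene level
    joint_open_fraction: float = 0.0  # object level (e.g., a cabinet door)
    field_of_view_deg: float = 60.0   # camera level


def sweep(base: RenderConfig, axis: str, values: Iterable[float]) -> Iterator[RenderConfig]:
    """Yield copies of `base` that differ from it only along `axis`."""
    valid = {f.name for f in fields(base)}
    if axis not in valid:
        raise ValueError(f"unknown axis {axis!r}; expected one of {sorted(valid)}")
    for value in values:
        yield replace(base, **{axis: value})


base = RenderConfig()
# Vary only the camera field of view; lighting and joint state stay fixed.
configs = list(sweep(base, "field_of_view_deg", [40.0, 60.0, 90.0]))
```

Each resulting configuration would then be handed to whatever renderer is in use, so that any change in model accuracy across the sweep can be attributed to the single varied axis.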

BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation
