🌐 Website | 📑 Paper | 🤗 Caption Dataset
We introduce Generate Any Scene, a data engine that systematically enumerates scene graphs representing the combinatorial array of possible visual scenes. Generate Any Scene dynamically constructs scene graphs of varying complexity from a structured taxonomy of objects, attributes, and relations. Given a sampled scene graph, Generate Any Scene translates it into a caption for text-to-image or text-to-video generation; it also translates it into a set of visual question-answer pairs that enable automatic evaluation and reward modeling of semantic alignment.
Follow the steps below to set up the environment and use the repository:
# Clone the repository
git clone https://github.com/RAIVNLab/GenerateAnyScene.git
cd ./GenerateAnyScene
git submodule update --init --recursive
# Create and activate a Python virtual environment:
conda create -n GenerateAnyScene python==3.10.14
conda activate GenerateAnyScene
# Install the required dependencies:
pip install -r requirements.txt
pip install transformers==4.47.1 --force-reinstall --no-warn-conflicts
For the OpenSora 1.2 model, please refer to the installation instructions provided in its official repository.
For text-to-3D generation, please follow the setup guide from the ThreeStudio repository.
If you prefer using pre-configured environments, we provide Docker images for both Text-to-Video and Text-to-3D tasks. To download the Docker containers:
docker pull uwziqigao/3d-gen:latest
docker pull uwziqigao/video_gen:latest
After downloading and setting up the repo, you can generate captions using the following command:
python generation.py \
--metadata_path ./metadata \
--output_dir ./output \
--total_prompts 1000 \
--num_workers 1 \
--min_complexity 4 \
--max_complexity 7 \
--min_attributes 1 \
--max_attributes 4 \
--modality_type text2image
Set --num_workers greater than 1 to enable parallelism. Use --modality_type text2video or text2threed for the other modalities.
We also support evaluation with a wide range of models and metrics, including:
- Text-to-Image Models
  - Stable Diffusion 3 Medium
  - Stable Diffusion 2.1
  - SDXL
  - PixArt-α
  - PixArt-Σ
  - Wuerstchen v2
  - DeepFloyd IF XL
  - FLUX.1-dev
  - FLUX.1-schnell
  - Playground v2.5
- Text-to-Video Models
  - Text2Video-Zero
  - ZeroScope
  - VideoCrafter2
  - AnimateDiff
  - AnimateLCM
  - FreeInit
  - ModelScope
  - Open-Sora 1.2
  - CogVideoX-2B
- Text-to-3D Models
  - Latent-NeRF
  - DreamFusion-sd
  - DreamFusion-IF
  - Magic3D-sd
  - Magic3D-IF
  - ProlificDreamer
  - Fantasia3D
  - SJC
- Metrics
  - CLIP Score
  - PickScore
  - ImageReward Score
  - TIFA Score
  - VQAScore
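Several of these metrics reduce to a scaled embedding similarity. As background, CLIP Score is commonly computed as a scaled, non-negative cosine similarity between image and text embeddings. The sketch below illustrates that formula with random vectors standing in for real CLIP embeddings; it is not the repository's implementation.

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray, w: float = 100.0) -> float:
    """Scaled, clipped cosine similarity: w * max(cos(image, text), 0).

    The embeddings here are placeholders; in practice they come from a CLIP
    image encoder and text encoder applied to a generation and its caption.
    """
    cos = float(image_emb @ text_emb
                / (np.linalg.norm(image_emb) * np.linalg.norm(text_emb)))
    return w * max(cos, 0.0)

# Random stand-ins for CLIP embeddings of an image and a caption.
rng = np.random.default_rng(0)
img, txt = rng.normal(size=512), rng.normal(size=512)
print(round(clip_score(img, txt), 2))
```

A perfectly aligned pair scores `w` (here 100), and anti-correlated pairs are clipped to 0 rather than going negative.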
The demo.py script allows you to generate and evaluate images, videos, or 3D scenes using prompts from a JSON file. Below are the instructions for running the script:
python demo.py --input_file <path_to_json_file> --gen_type <generation_type> [--models <model1> <model2> ...] [--metrics <metric1> <metric2> ...] [--output_dir <path_to_output_directory>]
- --input_file: Path to the JSON file containing captions. The JSON file must include captions generated by generation.py.
- --gen_type: Type of generation you want to perform. Choose one of:
  - image for image generation.
  - video for video generation.
  - 3d for 3D scene generation.
Generate images using default models and metrics:
python demo.py --input_file output/prompts_batch_0.json --gen_type image
Generate images using specific models and metrics:
python demo.py --input_file output/prompts_batch_0.json --gen_type image --models stable-diffusion-2-1 --metrics ProgrammaticDSGTIFAScore
Modern text-to-vision models produce high-fidelity visuals but struggle with compositional generalization and semantic alignment. Existing real-world datasets (such as CC3M) are noisy, weakly compositional, and lack dense, high-quality annotations.
Constructing a compositional dataset requires that we first define the space of the visual content. Scene graphs are one such representation of the visual space, grounded in cognitive science. A scene graph represents objects in a scene as individual nodes in a graph. Each object is modified by attributes, which describe its properties. Relationships are edges that connect the nodes. Scene graphs provide explicit compositional structure.
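As a concrete illustration of this structure, a scene graph can be modeled as objects carrying attributes, connected by relation edges. The dataclass names below are hypothetical, chosen for clarity; they are not the repository's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str
    attributes: list[str] = field(default_factory=list)  # properties of this object

@dataclass
class Relation:
    subject: int    # index of the subject object in SceneGraph.objects
    predicate: str  # e.g. "catching", "next to"
    target: int     # index of the target object

@dataclass
class SceneGraph:
    objects: list[SceneObject]
    relations: list[Relation]

    def complexity(self) -> int:
        # One natural complexity measure: total count of objects,
        # attributes, and relations in the graph.
        return (len(self.objects)
                + sum(len(o.attributes) for o in self.objects)
                + len(self.relations))

graph = SceneGraph(
    objects=[SceneObject("dog", ["brown"]), SceneObject("frisbee", ["red"])],
    relations=[Relation(0, "catching", 1)],
)
print(graph.complexity())  # 5: 2 objects + 2 attributes + 1 relation
```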
We propose Generate Any Scene (GAS), a controllable scene-graph-driven data engine that produces synthetic captions and QA for training, evaluation, reward modeling, and robustness improvement in text-to-vision systems. GAS systematically enumerates scene graphs representing the combinatorial array of possible visual scenes.
Metadata Types:
| Metadata Type | Number | Source |
|---|---|---|
| Objects | 28,787 | WordNet |
| Attributes | 1,494 | Wikipedia, etc. |
| Relations | 10,492 | Robin |
| Scene Attributes | 2,193 | Places365, etc. |
Caption Generation Process:
Pipeline:
- Enumerate diverse scene graph structures under user-defined constraints.
- Populate structures with sampled objects, attributes, and relations.
- Sample scene attributes such as style, perspective, or time span.
- Translate scene graph and attributes into coherent captions.
- Automatically generate QA pairs covering all elements for evaluation and reward modeling.
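Steps 2–4 of the pipeline can be sketched as a small sampling-and-templating loop. The metadata pools and function names below are tiny illustrative stand-ins, not the engine's real metadata or API.

```python
import random

# Illustrative slices of the metadata pools described above.
OBJECTS = ["cat", "lamp", "bicycle"]
ATTRIBUTES = {"cat": ["fluffy", "black"], "lamp": ["brass"], "bicycle": ["vintage"]}
RELATIONS = ["next to", "on top of", "behind"]
SCENE_ATTRIBUTES = {"style": ["oil painting", "photo"], "perspective": ["close-up"]}

def sample_caption(num_objects: int = 2, seed: int = 0) -> str:
    """Populate a linear scene-graph structure and render it as a caption."""
    rng = random.Random(seed)
    objs = rng.sample(OBJECTS, num_objects)
    # Attach one sampled attribute to each object.
    phrases = [f"a {rng.choice(ATTRIBUTES[o])} {o}" for o in objs]
    # Link consecutive objects with a sampled relation.
    parts = [phrases[0]]
    for nxt in phrases[1:]:
        parts.append(f"{rng.choice(RELATIONS)} {nxt}")
    # Append a sampled scene attribute.
    style = rng.choice(SCENE_ATTRIBUTES["style"])
    return f"{' '.join(parts)}, in the style of {style}"

print(sample_caption())
```

Seeding the sampler makes each caption reproducible, which is convenient when the same scene graph must also drive QA generation.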
We iteratively improve models by leveraging Generate Any Scene captions. Given a model's generated images, we select the best generations and fine-tune the model on them, leading to performance boosts. Stable Diffusion v1.5 gains around a 5% performance improvement, outperforming even fine-tuning on real CC3M captions.
We identify and distill the specific strengths of proprietary models into open-source counterparts. For example, DALL·E 3's compositional prowess is transferred to Stable Diffusion v1.5, effectively closing the performance gap in handling complex multi-object scenes.
Fine-tuning Stable Diffusion v1.5 with fewer than 800 GAS-targeted synthetic captions yields about a 10% TIFA improvement on these capabilities.
GAS generates exhaustive scene-graph-based QA, enabling a naturally verifiable reward signal for autoregressive text-to-image models. Instead of relying only on CLIP-style similarity rewards, GAS creates exhaustive QA from scene graphs and uses answer accuracy as a semantic reward. With GRPO on SimpleAR-0.5B-SFT, this scene-graph-based reward surpasses CLIP-based methods.
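The reward idea can be sketched as answer accuracy over the scene-graph QA pairs. The VQA-answering step is stubbed out with a dictionary here; this is an illustration of the reward shape, not the paper's implementation.

```python
def qa_reward(qa_pairs, vqa_answer) -> float:
    """Fraction of scene-graph QA pairs a VQA model answers correctly.

    qa_pairs: list of (question, expected_answer) derived from the scene graph.
    vqa_answer: callable mapping a question about the generated image to an answer.
    """
    if not qa_pairs:
        return 0.0
    correct = sum(
        vqa_answer(q).strip().lower() == a.strip().lower() for q, a in qa_pairs
    )
    return correct / len(qa_pairs)

# Toy stand-in for a VQA model queried on a generated image.
fake_vqa = {"Is there a dog?": "yes", "Is the dog brown?": "no"}.get
pairs = [("Is there a dog?", "yes"), ("Is the dog brown?", "yes")]
print(qa_reward(pairs, fake_vqa))  # 0.5
```

Because every object, attribute, and relation in the sampled scene graph yields a question, the reward verifies semantic alignment element by element rather than through a single global similarity score.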
By using Generate Any Scene to generate challenging synthetic data, we train content moderators to better detect generated imagery. This robustifies detectors across different models and datasets, improving the reliability and safety of generative AI.
metadata:
- attributes.json: Each key is an attribute category, and the value is a list of attributes.
- objects.json: A list of object metadata.
- relations.json: Each key is a relation category, and the value is a list of relations.
- scene_attributes.json: A two-level nested dictionary, where the first level is the scene attribute category, the second level is the subcategory, and the value is a list of corresponding scene attributes.
- taxonomy.json: The taxonomy of objects. Each entry is a directed edge from parent to child, indicating that the first concept is a super-concept of the second.
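Based on the layouts described above, the metadata can be loaded and inspected as plain JSON. The contents below are tiny stand-ins mirroring the documented structures, not the real metadata files.

```python
import json

# Tiny stand-ins mirroring the documented JSON layouts.
attributes = json.loads('{"color": ["red", "blue"], "material": ["wooden"]}')
scene_attributes = json.loads('{"style": {"art": ["oil painting"], "photo": ["macro"]}}')
taxonomy = json.loads('[["animal", "dog"], ["animal", "cat"]]')  # parent -> child edges

def count_attributes(attrs: dict) -> int:
    # attributes.json / relations.json: category -> list of entries.
    return sum(len(v) for v in attrs.values())

def count_scene_attributes(sa: dict) -> int:
    # scene_attributes.json: category -> subcategory -> list of entries.
    return sum(len(vals) for sub in sa.values() for vals in sub.values())

def children_of(tax: list, parent: str) -> list:
    # taxonomy.json: directed edges from super-concept to sub-concept.
    return [child for p, child in tax if p == parent]

print(count_attributes(attributes))              # 3
print(count_scene_attributes(scene_attributes))  # 2
print(children_of(taxonomy, "animal"))           # ['dog', 'cat']
```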
If you find Generate Any Scene helpful in your work, please cite:
@inproceedings{gao2026generate,
title={Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training},
author={Ziqi Gao and Weikai Huang and Jieyu Zhang and Aniruddha Kembhavi and Ranjay Krishna},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026}
}