Puzzles: Unbounded Video-Depth Augmentation for Scalable End-to-End 3D Reconstruction

arXiv | Project Page

Method overview

Puzzles is a novel data-augmentation framework that synthesizes diverse, view-consistent video-depth sequences from single images or video clips to train feedforward 3D reconstruction networks at scale. Existing video-based 3D reconstruction datasets often contain redundant frames because camera trajectories are fixed once recorded (A.1). Puzzles increases view diversity by generating novel viewpoints from a single image or video clip (B.1) and simulates realistic camera trajectories (B.2) to support 3D reconstruction training.

Check our website for videos and reconstruction results!

TODO List

  • Release Image-to-Clips tutorial and demo.
  • Release Clips-to-Clips tutorial and demo.
  • Release evaluation and training code.

Installation Guide

Follow these steps to set up the Puzzles codebase on your system.

1. Clone this repository

git clone https://github.com/Jiahao-Ma/puzzles-code.git
cd puzzles-code

2. Create conda environment

conda create -n puzzles python=3.9 cmake=3.14.0
conda activate puzzles
conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=11.8 -c pytorch -c nvidia  # use the CUDA version that matches your system

pip install -r requirements.txt
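
Optionally, confirm that the PyTorch build matches your CUDA setup and can see a GPU:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"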

3. Install Open3D (required)

  • With GUI support:

    pip install -U -f https://www.open3d.org/docs/latest/getting_started.html open3d
  • Headless version (no display server):

    Follow the Open3D headless rendering guide.
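
Either way, you can verify the installation afterwards (assuming it completed without errors):

python -c "import open3d as o3d; print(o3d.__version__)"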

4. Build CUDA kernels for RoPE

cd croco/models/curope/
python setup.py build_ext --inplace
cd ../../../
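
If the build succeeded, a compiled extension (a *.so file on Linux) should now sit next to the CUDA sources; a quick optional check:

ls croco/models/curope/*.so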

5. Download the DUSt3R checkpoint

# Create a directory for checkpoints and download the model
mkdir checkpoints
cd checkpoints
wget https://download.europe.naverlabs.com/ComputerVision/DUSt3R/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth
cd ..  # return to the repository root; the commands below reference ./checkpoints/...

Quick Start

The Quick Start section helps you get up and running with Image-to-Clips in just a few commands:

1. Interactive tutorial

Launches a Jupyter notebook that walks you through the core concepts of Image-to-Clips, with live examples of turning a single frame into a mini video sequence.
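
The tutorial's notebook filename isn't listed here; assuming a standard Jupyter setup inside the puzzles environment, you can start the server from the repository root and open the tutorial from the file browser:

pip install notebook   # only needed if Jupyter is not already in the environment
jupyter notebook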

2. Command-line usage

Demonstrates two minimal examples:

  • Without Puzzles augmentation: sample ordered overlapping patches to form a video clip.
  • With Puzzles augmentation: additionally applies augmentation to a subset of those patches, introducing novel viewpoints for greater diversity.
# Without Puzzles augmentation
python Image2Clips.py --input examples/I2C/alien.jpg --output examples/I2C --threshold 0.01 \
                          --NPuzzles 8 --min_box_size_scale 0.4 --max_box_size_scale 0.8 \
                          --min_overlap 0.1 --max_overlap 0.3 

# With Puzzles augmentation
python Image2Clips.py --input examples/I2C/lamp.jpg --output examples/I2C --threshold 0.01 \
                          --NPuzzles 6 --min_box_size_scale 0.2 --max_box_size_scale 0.5 \
                          --min_overlap 0.1 --max_overlap 0.3 --Augment \
                          --rotate_ratio 0.2 --rotate_min_coverage 0.2 --rotate_max_coverage 0.8 \
                          --max_rotation_angle 60.0 --front_back_ratio_thresh 0.6                          

Parameters:

  • --input: Path to the input image.
  • --output: Path to save the output video clips.
  • --threshold: Depth-map threshold used to remove discontinuous edges in the depth map.
  • --NPuzzles: Number of puzzles to generate.
  • --min_box_size_scale and --max_box_size_scale: Scale range for the size of the puzzle patches.
  • --min_overlap and --max_overlap: Overlap range between puzzle patches (see the illustrative sketch after this list).
  • --Augment: Enable Puzzles augmentation.
  • --rotate_ratio: Ratio of puzzles to rotate.
  • --rotate_min_coverage and --rotate_max_coverage: Coverage range for rotated puzzles.
  • --max_rotation_angle: Maximum rotation angle for puzzles.
  • --front_back_ratio_thresh: Threshold on the front-back ratio used to filter out puzzles (corresponds to the front-surface coverage test in Section 3.1 of the paper).
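
To make the box-size and overlap flags more concrete, here is a small, self-contained Python sketch of how such parameters could drive patch sampling. It is purely illustrative and not the code in Image2Clips.py; the function name sample_patch_chain and the horizontal-shift heuristic are assumptions made for this example.

import random

def sample_patch_chain(img_w, img_h, n_puzzles=8,
                       min_box_size_scale=0.4, max_box_size_scale=0.8,
                       min_overlap=0.1, max_overlap=0.3, seed=0):
    """Sample n_puzzles ordered, overlapping boxes (x0, y0, w, h) in pixels."""
    rng = random.Random(seed)
    # First patch: size drawn from the scale range, placed at a random position.
    w = int(rng.uniform(min_box_size_scale, max_box_size_scale) * img_w)
    h = int(rng.uniform(min_box_size_scale, max_box_size_scale) * img_h)
    x0 = rng.randint(0, img_w - w)
    y0 = rng.randint(0, img_h - h)
    boxes = [(x0, y0, w, h)]
    for _ in range(n_puzzles - 1):
        # Shift the previous box so that roughly an `overlap` fraction of its
        # width is shared with the next box (horizontal shifts only, for brevity).
        overlap = rng.uniform(min_overlap, max_overlap)
        x0 = min(max(0, x0 + int((1.0 - overlap) * w)), img_w - w)
        boxes.append((x0, y0, w, h))
    return boxes

if __name__ == "__main__":
    for box in sample_patch_chain(1920, 1080, n_puzzles=8):
        print(box)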

What to expect:

  • The first command (--NPuzzles 8) produces 8 contiguous video clips.

  • The second command (with --Augment and rotation flags) intersperses dynamic, rotated views for improved viewpoint diversity.

  • Try the examples in the examples/I2C directory to see how different parameters affect the output. Please refer to scripts.md for more examples.

Training and Evaluation

Datasets

We use Habitat, ScanNet, ARKitScenes, and BlendedMVS to train the model. Please refer to data_preprocess.md for details on dataset preparation.

Training

Below are the commands to train Spann3R on the unified dataset of BlendedMVS, Habitat, ScanNet, and ARKitScenes, both with and without Puzzles augmentation. Adjust --nproc_per_node, --nnodes, and the rendezvous settings (--rdzv_*) to match your cluster.
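
If you are not launching through Slurm, a plausible single-node alternative (an assumption about your setup, not a command from the original instructions) is to drop srun and the explicit rendezvous flags and let torchrun configure a local rendezvous:

# hypothetical single-node launch: --standalone sets up a local rendezvous, so
# the --rdzv_* flags and srun are unnecessary; keep the train.py arguments
# exactly as shown in the commands below
torchrun --standalone --nproc_per_node=4 --max_restarts=0 train.py [train.py arguments as below]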

1. Train with Puzzles augmentation

This run enables Puzzles transforms in the training datasets to simulate novel viewpoints and increase diversity.

srun torchrun \
    --nproc_per_node=4 \
    --nnodes=1 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    --rdzv_id=spann3r_bs_256_run \
    --max_restarts=0 \
    train.py \
    --model "Spann3R(dus3r_name='./checkpoints/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth', use_feat=False, mem_pos_enc=False)" \
    --batch_size 16 \
    --num_workers 8 \
    --num_workers_test 8 \
    --train_dataset "10000 @ BlendMVS(split='train', ROOT='./data/blendedmvs/', resolution=224, \
                        puzzle_augment={'overlap_depth_threshold': None,\
                                     'puzzle_min_overlap': 0.7,\
                                     'puzzle_max_overlap': 0.95,\
                                     'transform3d':True,\
                                     'transform2d':False,\
                                     'min_random_augment_ratio': 0.4,\
                                     'max_random_augment_ratio': 0.7,\
                                     }) \
                    + 10000 @ habitat(split='train', ROOT='./data/habitat/', resolution=224, \
                        puzzle_augment={'overlap_depth_threshold': 0.3, \
                                         'puzzle_min_overlap': 0.7, \
                                         'puzzle_max_overlap': 0.9, \
                                         'transform3d':True, \
                                         'transform2d':False, \
                                         'min_random_augment_ratio': 0.4, \
                                         'max_random_augment_ratio': 0.7, \
                                         }) \
                    + 10000 @ Scannet(split='train', ROOT='./data/scannet/', resolution=224, max_thresh = 100, \
                        puzzle_augment={'overlap_depth_threshold': 0.3, \
                                            'puzzle_min_overlap': 0.7, \
                                            'puzzle_max_overlap': 0.9, \
                                            'transform3d':True, \
                                            'transform2d':False, \
                                            'min_random_augment_ratio': 0.4, \
                                            'max_random_augment_ratio': 0.7, \
                                            }) \
                    + 10000 @ ArkitScene(split='train', ROOT='./data/arkitscenes/', resolution=224, max_thresh = 100, \
                        puzzle_augment={'overlap_depth_threshold': 0.3, \
                                            'puzzle_min_overlap': 0.7, \
                                            'puzzle_max_overlap': 0.9, \
                                            'transform3d':True, \
                                            'transform2d':False, \
                                            'min_random_augment_ratio': 0.4, \
                                            'max_random_augment_ratio': 0.7, \
                                            })"
    --test_dataset "Scannet(split='val', ROOT='./data/scannet/', resolution=224, num_seq=1, kf_every=10, seed=777, full_video=True, puzzle_augment=None)"
    --output_dir './output/blendmvs_habitat_scannet_arkit_with_puzzles' \

2. Train without Puzzles augmentation

This run disables all Puzzles transforms, using the same datasets and hyperparameters for a direct baseline comparison.

srun torchrun \
    --nproc_per_node=4 \
    --nnodes=1 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    --rdzv_id=spann3r_bs_256_run \
    --max_restarts=0 \
    train.py \
    --model "Spann3R(dus3r_name='./checkpoints/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth', use_feat=False, mem_pos_enc=False)" \
    --batch_size 16 \
    --num_workers 8 \
    --num_workers_test 8 \
    --train_dataset "10000 @ BlendMVS(split='train', ROOT='./data/blendedmvs/', resolution=224, \
                        puzzle_augment=None) \
                    + 10000 @ habitat(split='train', ROOT='./data/habitat/', resolution=224, \
                        puzzle_augment=None) \
                    + 10000 @ Scannet(split='train', ROOT='./data/scannet/', resolution=224, max_thresh = 100, \
                        puzzle_augment=None) \
                    + 10000 @ ArkitScene(split='train', ROOT='./data/arkitscenes/', resolution=224, max_thresh = 100, \
                        puzzle_augment=None)"
    --test_dataset "Scannet(split='val', ROOT='./data/scannet/', resolution=224, num_seq=1, kf_every=10, seed=777, full_video=True, puzzle_augment=None)"
    --output_dir './output/blendmvs_habitat_scannet_arkit_without_puzzles' 

Notes:

  • The only difference between the two runs is the presence or absence of the puzzle_augment settings in each dataset.
  • Customize --output_dir to organize your results.

Evaluation

Please use the following command to evaluate the model:

python eval.py --exp_path ./output/blendmvs_habitat_scannet_arkit_with_puzzles --exp_name ckpt --conf_thresh 0.0 \
               --ckpt checkpoint-last.pth # or checkpoint-best.pth

By running the training and evaluation commands above, you should reproduce values similar to those reported in the paper, along with the relative improvements. We found that Spann3R with Puzzles had not fully converged after 120 epochs; however, to ensure a fair comparison, we used the same number of training epochs for all experiments. If you increase the epoch count, you may achieve even better performance.

Acknowledgement

Our code, data preprocessing pipeline, and evaluation scripts are based on several awesome repositories. We thank the authors for releasing their code!

Citation

If you find our code or paper useful for your research, please consider citing:

    @article{ma2025puzzles,
        title={Puzzles: Unbounded Video-Depth Augmentation for Scalable End-to-End 3D Reconstruction},
        author={Ma, Jiahao and Wang, Lei and Ahmedt-Aristizabal, David and Nguyen, Chuong and others},
        journal={arXiv preprint arXiv:2506.23863},
        year={2025}
    }
