ADL4D - 4D Human Activity Dataset

Overview

ADL4D bridges the gap between isolated object interactions and complex, real-world activity planning.

While previous benchmarks focused on single “atomic” actions (like holding a mug), ADL4D captures the Action Plan: the messy, continuous flow where a user opens a fridge, moves items, pours milk, and hands it to a partner. This project introduces a large-scale dataset of two-subject, multi-object interactions and a novel machine vision pipeline to annotate them.

A continuous sequence from the dataset showing complex interactions, multiple objects, and seamless transitions.

Technical Innovation: Automated ReID for Triangulation

Annotating heavily occluded scenes with an 8 to 15 camera rig is a “degenerate” geometric problem: the usual correspondence constraints often admit more than one match.

The Problem: Ghost Hands

Standard triangulation relies on epipolar geometry: if you see a point in Camera A, its corresponding point in Camera B must lie on a specific line.

Failure Mode: In multiview geometry, algorithmic epipolar matching fails when a subject’s landmarks in one view pass over epipolar lines from a different hand (either their own or a second subject’s) captured from another view, especially when the target hand is occluded in the first view. This degenerate case repeats frequently in sparse multi-camera scenarios attempting to cover 360-degree scenes. The effect is further exacerbated by our focus on hands, multiple subjects, and complex inter-subject interactions.
Result: Naive triangulation matches these disparate points, creating “Ghost Hands”—3D clusters that vanish or explode when the subjects move.
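
The ambiguity described above can be illustrated with a minimal sketch. It assumes an idealised, rectified camera pair whose fundamental matrix is the skew-symmetric matrix of a pure horizontal translation, so every epipolar line is a horizontal image row. The matrix, the pixel coordinates, the tolerance, and the function names are all hypothetical and are not taken from the ADL4D pipeline.

```python
import math

def epipolar_line(F, x1):
    """Epipolar line l = F @ x1 in the second view, as (a, b, c) with a*u + b*v + c = 0."""
    h = (x1[0], x1[1], 1.0)  # homogeneous pixel coordinates
    return tuple(sum(F[r][c] * h[c] for c in range(3)) for r in range(3))

def epipolar_distance(F, x1, x2):
    """Pixel distance from x2 (second view) to the epipolar line of x1 (first view)."""
    a, b, c = epipolar_line(F, x1)
    return abs(a * x2[0] + b * x2[1] + c) / math.hypot(a, b)

def epipolar_candidates(F, x1, detections, tol_px=2.0):
    """Indices of second-view detections that pass the epipolar test for x1."""
    return [i for i, x2 in enumerate(detections)
            if epipolar_distance(F, x1, x2) <= tol_px]

# Assumed fundamental matrix of a rectified pair (translation along x only).
F_RECTIFIED = [[0, 0, 0],
               [0, 0, -1],
               [0, 1, 0]]

wrist_a = (320.0, 240.0)        # wrist landmark seen in camera A
detections_b = [
    (300.0, 240.5),             # the same wrist in camera B
    (410.0, 239.0),             # a different hand that happens to sit on the line
    (300.0, 300.0),             # an unrelated landmark far from the line
]

matches = epipolar_candidates(F_RECTIFIED, wrist_a, detections_b)
print(matches)  # both the true wrist and the other hand pass the test
```

Geometry alone cannot separate the first two detections here; triangulating the wrong pair is what yields a ghost hand.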

The Solution: Robust 3D Hand Identification

We developed a Dynamic Matching algorithm that treats triangulation as a Re-Identification (ReID) problem rather than just a geometric one.

Subspace Clustering: Instead of matching points directly, we generate all possible 3D candidates (including valid hands and ghost hands) and cluster them in 3D space.
Temporal Consistency (Tracking Mode): We propagate the unique identity of a hand cluster from previous frames. If a hand is visible in only 2 cameras (normally insufficient for stable clustering), our algorithm uses the projected trajectory from the previous frame to “lock” the identity.
Human-in-the-Loop: We built a custom GUI where a human validator can visually verify the “locked” tracks. Our entire test set is annotated with human supervision. The Training and Validation sets are annotated automatically, without human verification; there, the pipeline masks out the frames in which it could not cluster and triangulate the hand correctly (usually a very small number of frames).
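
The first two steps can be sketched roughly as follows. This is an illustrative toy, not the ADL4D implementation: it assumes detections are already back-projected to 3D rays, swaps in midpoint triangulation and a greedy distance-based clustering, and uses hypothetical thresholds (`max_gap`, `radius`, `lock_radius`, `min_support`) and function names.

```python
import math
from itertools import combinations

def _sub(a, b): return tuple(x - y for x, y in zip(a, b))
def _dot(a, b): return sum(x * y for x, y in zip(a, b))
def _along(c, d, s): return tuple(ci + s * di for ci, di in zip(c, d))

def centroid(points):
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def midpoint_triangulate(c1, d1, c2, d2):
    """Midpoint of the shortest segment between two rays, plus the gap between them."""
    w = _sub(c1, c2)
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None                      # near-parallel rays
    s, t = (b * e - c * d) / denom, (a * e - b * d) / denom
    if s <= 0 or t <= 0:
        return None                      # intersection behind a camera
    p1, p2 = _along(c1, d1, s), _along(c2, d2, t)
    return centroid([p1, p2]), math.dist(p1, p2)

def candidate_points(rays, max_gap=0.02):
    """Triangulate every cross-camera ray pair; real hands and ghosts both survive."""
    out = []
    for (cam_a, c1, d1), (cam_b, c2, d2) in combinations(rays, 2):
        if cam_a == cam_b:
            continue
        hit = midpoint_triangulate(c1, d1, c2, d2)
        if hit is not None and hit[1] <= max_gap:
            out.append(hit[0])
    return out

def cluster(points, radius=0.05):
    """Greedy clustering: join the first cluster whose centroid is within radius."""
    clusters = []
    for p in points:
        for members in clusters:
            if math.dist(p, centroid(members)) <= radius:
                members.append(p)
                break
        else:
            clusters.append([p])
    return clusters

def pick_hand(clusters, prior=None, lock_radius=0.10, min_support=3):
    """With a prior from the previous frame, lock onto the nearest cluster;
    otherwise fall back to the best-supported cluster."""
    centers = [(centroid(m), len(m)) for m in clusters]
    if prior is not None:
        near = [(math.dist(c, prior), c) for c, _ in centers
                if math.dist(c, prior) <= lock_radius]
        if near:
            return min(near)[1]
    strong = [(n, c) for c, n in centers if n >= min_support]
    return max(strong)[1] if strong else None

def ray(cam_id, center, target):
    return cam_id, center, _sub(target, center)

# Synthetic scene (metres): hand_1 is seen by three cameras, hand_2 by only two.
cams = {0: (1.0, 0.0, 0.0), 1: (-1.0, 0.0, 0.0), 2: (0.0, 1.0, 0.0)}
hand_1, hand_2 = (0.0, 0.0, 1.0), (0.3, 0.0, 1.0)
rays = [ray(0, cams[0], hand_1), ray(0, cams[0], hand_2),
        ray(1, cams[1], hand_1), ray(1, cams[1], hand_2),
        ray(2, cams[2], hand_1)]

clusters = cluster(candidate_points(rays))
unlocked = pick_hand(clusters)                        # best-supported cluster
locked = pick_hand(clusters, prior=(0.28, 0.0, 1.0))  # previous position of hand_2
print(len(clusters), unlocked, locked)
```

In this toy scene the two-camera hand has the same support as each ghost candidate, so only the prior from the previous frame singles it out, which mirrors the motivation for the tracking mode.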

Left: The automated pipeline block diagram. Right: The "Human-in-the-Loop" GUI allowing for rapid correction of track drifts.

Achievements & Impact

The primary contribution of ADL4D is the Quality and Variety of the data. By capturing long-form interactions, we generate poses that standard “atomic” datasets miss.

Absolute Pose Accuracy: Validated on external datasets, with an absolute MPJPE of 5.36 mm on H2O and 8.56 mm on DexYCB.
Scale: 1.1 Million frames of annotated RGB-D data.
Diversity: Includes “in-between” actions—transitions, handovers, and idle adjustments—that are critical for training robust robots.
Annotation Robustness: In a test using off-the-shelf MediaPipe, clustering without our tracking method skipped 1088 frames on ADL4D, whereas our robust tracking method skipped only 22 (a 98% reduction). On the H2O dataset the absolute gain is larger still: 6302 skipped frames drop to 213 (about 97%).

Pose Variety

Our dataset covers a significantly wider distribution of hand poses compared to existing benchmarks (H2O, DexYCB).

Left: t-SNE plot showing ADL4D (Red) covering a broader pose space than H2O (Blue) and DexYCB (Green). Right: Visualization of the diverse poses.

Absolute Pose Accuracy

We validated the annotation pipeline by running it on external datasets (H2O, DexYCB) and comparing its output against their ground truth, achieving state-of-the-art accuracy.

Metrics

Dataset	abs MPJPE (mm)	AUC
H2O	5.36	0.8930
DexYCB	8.56	0.8651

Among the configurations we evaluated, “Tracking Mode” with the Reprojection (Repr) criterion achieves the lowest error.
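
For readers unfamiliar with the two metrics, here is a minimal sketch of how they are commonly computed. The AUC is taken as the normalised area under the PCK curve (fraction of joints within a distance threshold); the 0 to 50 mm threshold range and the function names are assumptions, since the exact protocol is not stated here.

```python
import math

def joint_errors(pred, gt):
    """Per-joint Euclidean error between predicted and ground-truth 3D joints."""
    return [math.dist(p, g) for p, g in zip(pred, gt)]

def mpjpe(pred, gt):
    """Mean per-joint position error, in the units of the input (mm here).
    'Absolute' means no root alignment is applied before comparing."""
    errors = joint_errors(pred, gt)
    return sum(errors) / len(errors)

def pck_auc(errors_mm, max_mm=50.0, steps=100):
    """Normalised area under the PCK curve between 0 and max_mm (assumed range)."""
    thresholds = [max_mm * i / steps for i in range(steps + 1)]
    pck = [sum(e <= t for e in errors_mm) / len(errors_mm) for t in thresholds]
    # Trapezoidal rule, then divide by the threshold range so the result lies in [0, 1].
    area = sum((pck[i] + pck[i + 1]) / 2 for i in range(steps)) * (max_mm / steps)
    return area / max_mm

# Toy example with two joints, coordinates in millimetres.
pred = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
gt = [(3.0, 4.0, 0.0), (10.0, 0.0, 0.0)]

toy_mpjpe = mpjpe(pred, gt)
toy_auc = pck_auc(joint_errors(pred, gt))
print(toy_mpjpe, toy_auc)
```

Lower MPJPE and higher AUC both indicate annotations closer to the reference.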

Qualitative

Our automated pipeline generates annotations that closely match ground truth, even in challenging dynamic scenarios.

Comparison of H2O Ground Truth (Left, Blue) vs. Our Dynamic Annotation (Right, Red). Note the high alignment.

Downstream Tasks

We benchmarked the dataset on three critical computer vision tasks.

1. Hand Mesh Recovery (HMR)

HMR Model Quality

We achieved high-quality hand reconstruction results using ADL4D.

Visual results of the HMR model trained on ADL4D, showing accurate mesh recovery in complex interactions.

Cross-Dataset Generalization

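Models trained on ADL4D generalize better to unseen datasets: when tested on H2O, MPJPE drops from 44.96 mm (trained on DexYCB) to 32.76 mm (trained on ADL4D), a relative reduction of about 27%.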

Train Set	Test Set	MPJPE (mm)
DexYCB	H2O	44.96
ADL4D	H2O	32.76

Qualitative results showing cross-dataset generalization on H2O sequences.

2. Hand Action Segmentation

The precise 3D pose history from ADL4D improves action segmentation. Pose features score highest on every metric (57.15% accuracy, against 45.02% for SF, the best video feature by accuracy), as they remain invariant to the dynamic background motion inherent in multi-view capture systems, whereas standard video features (I3D, X3D, SF) struggle.

Overview of the Action Segmentation task.

Features	Acc.	Edit	F1@10	F1@25	F1@50
I3D	32.77	41.66	24.59	18.21	7.12
X3D	28.99	34.15	28.50	19.27	6.85
SF	45.02	40.78	36.73	28.86	16.94
Pose (ADL4D)	57.15	53.19	56.77	50.89	35.81
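
The column headers follow the usual temporal action segmentation metrics. The sketch below shows one common formulation of frame-wise accuracy, the segmental edit score, and the segmental F1 at an IoU threshold (F1@10, F1@25 and F1@50 correspond to thresholds 0.10, 0.25 and 0.50). It is an assumption that the numbers in the table were computed exactly this way; the function names are illustrative.

```python
from itertools import groupby

def segments(labels):
    """Collapse frame-wise labels into (label, start, end) runs, end exclusive."""
    out, start = [], 0
    for label, run in groupby(labels):
        n = len(list(run))
        out.append((label, start, start + n))
        start += n
    return out

def frame_accuracy(pred, gt):
    """Percentage of frames whose predicted label matches the ground truth."""
    return 100.0 * sum(p == g for p, g in zip(pred, gt)) / len(gt)

def edit_score(pred, gt):
    """100 * (1 - normalised Levenshtein distance between segment label sequences)."""
    a = [s[0] for s in segments(pred)]
    b = [s[0] for s in segments(gt)]
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return 100.0 * (1 - prev[-1] / max(len(a), len(b)))

def f1_at(pred, gt, iou_threshold):
    """Segmental F1: a predicted segment is a true positive if it overlaps an
    unmatched ground-truth segment of the same label with IoU >= threshold."""
    p_segs, g_segs = segments(pred), segments(gt)
    used = [False] * len(g_segs)
    tp = 0
    for label, ps, pe in p_segs:
        best_iou, best_j = 0.0, -1
        for j, (g_label, gs, ge) in enumerate(g_segs):
            if g_label != label or used[j]:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            if inter / union > best_iou:
                best_iou, best_j = inter / union, j
        if best_j >= 0 and best_iou >= iou_threshold:
            used[best_j] = True
            tp += 1
    fp, fn = len(p_segs) - tp, len(g_segs) - tp
    return 100.0 * 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Toy sequences: one boundary is off by a frame, then an over-segmented variant.
gt = list("AAAABBBBCC")
shifted = list("AAABBBBBCC")
oversegmented = list("AABABBBBCC")

print(frame_accuracy(shifted, gt), edit_score(shifted, gt), f1_at(shifted, gt, 0.5))
print(edit_score(oversegmented, gt))
```

Frame accuracy rewards long correct stretches, while the edit and F1 scores penalise over-segmentation, which is why the table reports all three.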

Left: The Action Segmentation Pipeline. Right: Qualitative samples of annotated activities (sampled at the center of each action window).

3. Zero-Shot Object Pose Tracking

We evaluated zero-shot object pose tracking methods on ADL4D test sequences.

Model	ADD	ADD-S
FoundationPose	0.47	0.64
ICG+	0.53	0.74

Qualitative analysis suggests ICG+ provides smoother predictions during severe hand-object occlusions compared to FoundationPose.
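
For reference, the two distances behind the column names can be sketched as below. ADD averages the distance between corresponding model points under the predicted and ground-truth poses; ADD-S uses the closest point instead, which makes it tolerant of object symmetries. The table values are presumably an aggregate score (such as an AUC or recall) over these distances; that, the toy object, and the function names are assumptions.

```python
import math

def transform(points, R, t):
    """Apply the rigid transform x -> R @ x + t to a list of 3D points."""
    return [tuple(sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3))
            for p in points]

def add_distance(points, R_pred, t_pred, R_gt, t_gt):
    """ADD: mean distance between corresponding model points under both poses."""
    a = transform(points, R_pred, t_pred)
    b = transform(points, R_gt, t_gt)
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(points)

def add_s_distance(points, R_pred, t_pred, R_gt, t_gt):
    """ADD-S: mean distance from each ground-truth point to its nearest predicted point."""
    a = transform(points, R_pred, t_pred)
    b = transform(points, R_gt, t_gt)
    return sum(min(math.dist(q, p) for p in a) for q in b) / len(points)

# Toy object with four-fold symmetry about the z axis.
model = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
rot_z_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
origin = (0.0, 0.0, 0.0)

# The prediction is off by a 90 degree turn that maps the object onto itself.
toy_add = add_distance(model, rot_z_90, origin, identity, origin)
toy_add_s = add_s_distance(model, rot_z_90, origin, identity, origin)
print(toy_add, toy_add_s)
```

Because the nearest-point distance can never exceed the corresponding-point distance, accuracy-style scores built on ADD-S are at least as high as those built on ADD, which matches the ordering in the table.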

Demos

ADL4D Sequences

ADL4D Sequence 1

ADL4D Sequence 3

ADL4D Sequence 2 (View 1)

ADL4D Sequence 2 (View 2)

DexYCB Sequences

Sequence 4

Sequence 4

Sequence 6

Sequence 6

H2O

Sequence 2

Sequence 9

Conclusion & Future Improvements

ADL4D demonstrates that context is the missing link in Human-Object Interaction. By capturing the full “Action Plan”—preparation, interaction, transition, and conclusion—we provide a benchmark that forces models to learn the temporal logic of activity.

Looking forward, we identify several key areas for evolution:

Egocentric Perspectives: While our studio setup supports egocentric capture, this dataset focuses on third-person views. Integrating egocentric cameras that record from the subjects’ point of view would provide a critical “user-eye” signal for imitation learning.
Dense Landmark Models: The current pose estimation pipeline could be significantly enhanced by adopting emerging dense landmark models, offering finer granularity than sparse keypoints.
Expanded Data Variety: Future iterations should expand beyond kitchen-focused scenarios to cover a broader range of daily living environments and interaction types.

The release of this dataset serves as a foundational step for the next generation of “context-aware” robotic assistants.