ADL4D

4D Human Activity Dataset

Towards a Contextually Rich Dataset for 4D Activities of Daily Living

Zakour*  ·  Nath*  |  arXiv 2402.17758

ADL4D banner 1 / 16

The Gap: Long-Horizon Activity Capture

Prior Work

  • DexYCB-style: Pick / Place / Handover
  • H2O3D-style: Hold and Move
  • H2O-style: Hold and Interact
  • Atomic / minimal complexity — no temporal chaining

ADL4D

Activity modelled as Chain / Graph of actions

Activities: Making Dinner / Coffee / Breakfast, Cleaning up

Actions w Hand-Obj pair: Boil / Heat  ·  Pour / Place / Pick  ·  Handover / Receive  ·  Stir / Scoop  ·  Push / Open / Close

Dataset Complex Actions 3D Hands 6D Obj Inter-Subj #Hands/Seq Avg Fr/Seq #Img #Cam
FPHA Frame 1 87 105K 1
DexYCB 1 73 582K 8
H2O Frame 2 620 571K 5
H2O3D 2 234 76K 5
MECCANO Frame 2 15K 300K 1
OakInk 2 44 230K 4
HOI4D Frame 1 480 2.4M 1
As.Hands Frame 2 3M 12
ARCTIC Frame 2 688 2.1M 9
ADL4D Hand 2–4 1833 1.1M 8
2 / 16

Studio & Hardware

Studio setup diagram
  • 📷  (8-15+ego)× Intel RealSense D435 — synchronized RGB-D (colour + depth)
  • 🔵  8× OptiTrack Prime 13X — 3D marker tracking / 6 DoF rigid pose
  • 💡  4× Spotlights + diffusers — controlled, uniform illumination
  • ⚡  Sync options — hardware clock/post-processed synchronisation in software
  • 🏗️  Custom cage rig — plug-and-play up to 15 cameras; rapid-calibration sequence
3 / 16

Annotation Challenge

Prior Datasets (easier)

  • Single subjects, single hands
  • High FPS (>30) → easier triangulation & tracking
  • Short, small sequences; minimal hand pose variety

ADL4D (harder)

  • Low FPS (20) / long-tailed distribution
  • Multi-subject, multi-hand (up to 4 hands)
  • Sparse camera coverage

✓ Rigid Body Tracking (OptiTrack)

  • Reflective markers on rigid props → worked reliably
  • Marker removal via inpainting succeeded
  • Clean RGB output, no artefacts

✗ Hand Marker Tracking (ARCTIC-style)

  • Semi-spherical/flat markers on dorsal hand surface
  • Near-flat geometry → high triangulation error, marker inpainting poor
  • Parametric Hand Fitting to 3D markers work, but multisubject interhand occlusions trigger frequent track drops
4 / 16

Landmark Detectors as Marker Proxy

Adopted Approach

  • Follow DexYCB / H2O / H2O3D — use 2D landmark detector (MediaPipe/RCNNs) as marker proxy
  • Triangulate across views → 3D joints; no physical markers
  • Works well for single-hands but breaks in multi-hand with exponentially scaling failure

Core Failure Mode

ReID of landmarks is core to triangulation. Core to multi-view ReID is epipolar matching + filtering:

Why It Fails

  • Occluded hand's epipolar lines crosses different subject's detected hand within accepted error thresholds → incorrect 3D hand
  • Enhancing with stereo patch similarity / detector embeddings — tested on full-body, not cropped hand ROI; prone to failure
  • Loose MANO 2D constraints — minimal improvement; often hands start merging together
  • Trained detector-side classifier — best performance but limited to single-subject sequences; doesn't scale
5 / 16

Solution 1: GUI Human in the Loop

Annotation GUI suite
  • Custom CVAT-style GUI for visually verifying and correcting locked tracks across all views
  • Supports hotkey track switching, new detection triggering, live triangulation and temporal matching
  • Used to annotate test set in a fully human-supervised manner
6 / 16

The Automation Solve: 3D Inlier Clustering & ReID

  • Project ReID matching as a 3D inlier search / clustering problem
  • Generate all pairwise 3D hand candidates; set assignment creates clusters that:
    • Minimises/fixes number of clusters K
    • Maximises 2D source detections per cluster
    • Minimises final reprojection error
    • Minimises pairwise 3D distance within cluster
  • Propagate temporal tracks when available — enhance 3D track, prevent track drop, restore from prior dropped track
  • Flag lost-track frames contrastively: interpolate hand position; mark frames as explicitly untracked
Methodology block diagram

Multi-subject markerless pipeline

  • Handles any number of hands in the scene
  • Fully automated — no GUI needed for train/val splits
  • Validated at up to 4 subjects (8 hands) at 20 FPS
  • Scales to the full 1.1M frame ADL4D and external datasets without finetuning/engineering
7 / 16

Annotation Robustness

Skipped frames (off-the-shelf MediaPipe)

Dataset Without Method With Method Reduction
ADL4D 1 088 22 98%
H2O 6 302 213 97%

Absolute pose accuracy on external datasets

Dataset abs MPJPE (mm) ↓ AUC ↑
H2O 5.36 0.8930
DexYCB 8.56 0.8651
8 / 16

Dataset Scale & Diversity

1.1 Million annotated RGB-D frames

t-SNE pose variation plot

ADL4D (Red) covers broader pose space than H2O (Blue) and DexYCB (Green)

Dataset pose visualisations

Includes "in-between" transitions, handovers, and idle adjustments

9 / 16

Downstream: Hand Mesh Recovery (HMR)

Train Set Test Set MPJPE (mm) ↓
DexYCB (baseline) H2O 44.96
ADL4D H2O 32.76
Cross-dataset HMR qualitative results

27% reduction in rr MPJPE — better cross-dataset generalisation. Significantly better mesh quality.

10 / 16

Downstream: Action Segmentation & Object Pose

Action Segmentation

Features Acc. ↑ Edit ↑ F1@10 F1@25 F1@50
I3D 32.77 41.66 24.59 18.21 7.12
X3D 28.99 34.15 28.50 19.27 6.85
SlowFast 45.02 40.78 36.73 28.86 16.94
Pose (ADL4D) 57.15 53.19 56.77 50.89 35.81

Zero-Shot Object Pose Tracking

Model ADD ↑ ADD-S ↑
FoundationPose 0.47 0.64
ICG+ 0.53 0.74

Pose features remain camera-motion invariant; ADL4D validates as a challenging multi-subject benchmark for both action understanding and object pose methods.

11 / 16

Demo: ADL4D Sequences

12 / 16

Demo: DexYCB Sequence 4

13 / 16

Demo: DexYCB Sequence 6

14 / 16

Demo: H2O Sequences

15 / 16

Impact & Future Directions

Current Impact

  • Deployed at TUM — Project Lab Human Activity Understanding (HAU)
  • Multiple MSc and PhD projects built with markerless data from the system
  • Grounds a core research topic in long-horizon context analysis with 6D human poses
  • Comparable scale to frontier AI lab datasets (Meta HOT3D, CMU Panoptic) at a fraction of the cost
  • Published on arXiv: 2402.17758

Future Directions

  • Egocentric cameras — include already integrated ego feed into dataset
  • Dense landmark models — finer granularity beyond sparse keypoints
  • Beyond kitchen — expand to diverse daily-living environments
16 / 16
1 / —