BPC - Bin Picking Challenge 25
Zero-Shot 6D Pose Estimation pipeline for the Bin Picking Challenge using Render-and-Compare.
Overview
The Bin Picking Challenge (BPC) presents a notoriously difficult problem in computer vision: identifying and estimating the 6D pose of texture-less, reflective, and entangled objects in a bin.
I built this project to participate in the One-Shot Track of the OpenCV Bin Picking Challenge. The goal was to determine whether I could achieve robust zero-shot pose estimation without object-specific model tuning, relying instead on existing foundation models.
Methodology
The approach is a multi-stage pipeline that combines vision foundation models with 3D rendering for pose refinement.
1. Instance Segmentation
Segment Anything Model 2 (SAM2) and FastSAM are used to generate high-quality instance masks from the RGB/monochrome input. This helps isolate even entangled or partially occluded objects as clean proposals.
2. Zero-Shot Matching (DINOv2 / CLIP)
Instead of training a custom detector, segmented proposals are matched, via DINOv2 and CLIP embeddings, against a database of 3D object models rendered from multiple angles and under multiple lighting conditions. This allows the system to identify objects based on semantic similarity rather than just visual pattern matching.
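The matching step can be pictured as a nearest-neighbor lookup in embedding space. The sketch below is a minimal illustration and not the project's actual code: it assumes the proposal and template embeddings have already been computed (for example by DINOv2 or CLIP), and the function names, the cosine-similarity metric, and the `min_score` threshold are all hypothetical choices.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def match_proposals(proposal_emb, template_emb, template_labels, min_score=0.5):
    """Assign each proposal the label of its most similar template.

    proposal_emb:    (P, D) embeddings of segmented crops (assumed precomputed)
    template_emb:    (T, D) embeddings of rendered templates (assumed precomputed)
    template_labels: length-T sequence with the object id of each template
    min_score:       hypothetical rejection threshold on cosine similarity

    Returns a list of (label or None, score, template_index) per proposal.
    """
    p = l2_normalize(np.asarray(proposal_emb, dtype=np.float64))
    t = l2_normalize(np.asarray(template_emb, dtype=np.float64))
    sim = p @ t.T                      # (P, T) cosine similarities
    best = sim.argmax(axis=1)          # best template per proposal
    results = []
    for i, j in enumerate(best):
        score = float(sim[i, j])
        label = template_labels[j] if score >= min_score else None
        results.append((label, score, int(j)))
    return results
```

Since each template is rendered from a known viewpoint, the returned template index could also serve as a coarse orientation hint for the later refinement stage, although whether the pipeline uses it that way is an assumption here.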
3. Pose Optimization (Render-and-Compare)
Once an object is identified and roughly positioned via the depth map, its 6D pose is refined using a Render-and-Compare loop:
- Engine: PyTorch3D is used to render the 3D model at the estimated pose in all views.
- Optimization: A weighted combination of mask loss (IoU) and texture loss is minimized. To handle clutter, objects can be optimized individually or all together in a single batch.
- Pruning: Class-wise 3D Non-Maximum Suppression (3D-NMS) is applied every N steps to de-duplicate instances, as initial matching spawns multiple hypotheses per object from different camera views.
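The loss and pruning steps above can be sketched in a few lines. This is a NumPy illustration under stated assumptions, not the pipeline's PyTorch3D implementation: the texture term is assumed to be a mean absolute difference inside the target mask, the weights `w_mask` and `w_tex` are placeholders, and the duplicate test in the NMS uses a simple center-distance `radius` because the source does not say how overlap between 3D hypotheses is measured.

```python
import numpy as np

def soft_iou_loss(pred_mask, target_mask, eps=1e-6):
    """1 - IoU for soft masks with values in [0, 1]."""
    inter = (pred_mask * target_mask).sum()
    union = (pred_mask + target_mask - pred_mask * target_mask).sum()
    return 1.0 - inter / (union + eps)

def combined_loss(pred_mask, target_mask, pred_img, target_img,
                  w_mask=1.0, w_tex=0.5):
    """Weighted sum of mask and texture terms (weights are hypothetical)."""
    region = target_mask > 0.5
    # Assumed texture term: mean absolute difference inside the target mask.
    tex = np.abs(pred_img - target_img)[region].mean() if region.any() else 0.0
    return w_mask * soft_iou_loss(pred_mask, target_mask) + w_tex * tex

def classwise_3d_nms(centers, class_ids, scores, radius):
    """Keep the best-scoring hypothesis per location and class.

    A hypothesis is suppressed when an already kept hypothesis of the same
    class lies closer than `radius` (assumed duplicate criterion).
    Returns the indices of the kept hypotheses, best score first.
    """
    centers = np.asarray(centers, dtype=np.float64)
    class_ids = np.asarray(class_ids)
    order = np.argsort(-np.asarray(scores, dtype=np.float64))
    keep = []
    for idx in order:
        duplicate = any(
            class_ids[k] == class_ids[idx]
            and np.linalg.norm(centers[k] - centers[idx]) < radius
            for k in keep
        )
        if not duplicate:
            keep.append(int(idx))
    return keep
```

In the actual loop the masks and images would come from a differentiable renderer so that gradients flow back to the pose parameters; the NumPy version only shows how the terms are combined and how duplicates spawned from different camera views might be collapsed.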
Visuals
Pipeline Progression
Automated Template Rendering
To enable Zero-Shot matching, a database of active object templates is rendered and cached during inference. These templates cover various viewpoints and are used to generate the embedding database.
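As an illustration of how template viewpoints might be chosen, the sketch below places cameras roughly evenly on a sphere around the object with a Fibonacci lattice and builds a look-at rotation for each. The sampling scheme, the function names, and the camera axis convention (z axis pointing from the camera toward the object at the origin) are assumptions for this example; the project's actual viewpoint selection is not described, and renderers such as PyTorch3D use their own conventions.

```python
import numpy as np

def fibonacci_sphere(n_views, radius=1.0):
    """Return (n_views, 3) camera positions spread evenly over a sphere."""
    i = np.arange(n_views) + 0.5
    z = 1.0 - 2.0 * i / n_views                  # heights in (-1, 1)
    r = np.sqrt(1.0 - z * z)                     # ring radius at each height
    phi = i * np.pi * (3.0 - np.sqrt(5.0))       # golden-angle increments
    return radius * np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def look_at_rotation(cam_pos, up=(0.0, 0.0, 1.0)):
    """Rotation whose rows are the camera x, y, z axes in world coordinates.

    The z axis points from the camera toward the origin (assumed convention).
    """
    cam_pos = np.asarray(cam_pos, dtype=np.float64)
    z_axis = -cam_pos / np.linalg.norm(cam_pos)
    up = np.asarray(up, dtype=np.float64)
    if abs(np.dot(z_axis, up)) > 0.999:          # view direction parallel to up
        up = np.array([0.0, 1.0, 0.0])
    x_axis = np.cross(up, z_axis)
    x_axis /= np.linalg.norm(x_axis)
    y_axis = np.cross(z_axis, x_axis)
    return np.stack([x_axis, y_axis, z_axis])

# Example: 42 viewpoints at 0.5 units from the object center.
positions = fibonacci_sphere(42, radius=0.5)
rotations = np.stack([look_at_rotation(p) for p in positions])
```

Each position and rotation pair would then be handed to the renderer to produce one template image, whose embedding goes into the matching database.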
Findings
- Image Space Noise: The image-space reconstruction loss was often too noisy for the optimization to converge reliably.
- Asset Variance: Some assets converged significantly better than others, likely due to distinct geometric features or texture that aided the optimizer.
Conclusion
- Unfortunately, I wasn’t able to dedicate more than three weekends to this project and had to drop it in its current state.
- Optimizing the pose in an encoded latent space rather than in raw image space might have worked better, but hindsight is 20/20.
- Overall, it was a fun distraction and a valuable experiment in zero-shot 6D pose estimation.