Skip to main content

Case Studies: Real-World Multi-Sensor Integration

These case studies showcase how leading humanoid robots integrate camera systems, depth sensors, IMUs, and sensor fusion to solve real-world challenges. Use these examples as design inspiration for your capstone project, not as blueprints to copy. Each robot made specific trade-offs based on their operational requirements, cost constraints, and technical priorities.


πŸ“Š Case Study 1: Boston Dynamics Atlas - Multi-Sensor SLAM for Dynamic Environments​

Robot Overview​

Boston Dynamics Atlas is designed for navigation and manipulation in disaster response, construction sites, and unstructured outdoor environments. Atlas must handle rough terrain, dynamic obstacles (moving debris, humans), and perform complex whole-body manipulation tasks like lifting objects and opening doors.

Sensor Configuration​

Sensor TypeModel/TechnologySpecificationsPlacementUpdate RateEstimated Cost
Stereo CamerasCarnegie Robotics Multisense SL2MP CMOS, Bayer filter, 60Β° FOVHead (pan-tilt mount)30 FPS$15,000+
LiDARHokuyo UTM-30LX-EW (in Multisense SL)270Β° planar, 30m range, Β±30mm accuracyHead (rotating spindle at 30 RPM)40 Hz (1,081 points/scan)Included in Multisense
RGB-D Depth CameraTime-of-Flight (ToF) sensors640Γ—480 depth resolution, 0.5-5m rangeChest/torso (fixed)30 Hz$500-1,500
IMUHigh-precision MEMS IMU (9-axis)Accelerometer, gyroscope, magnetometerTorso (center of mass)1,000 Hz$2,000-5,000
Joint EncodersAbsolute and incremental encodersSub-degree accuracy, 28 DOFAll joints (28 total)4,000 Hz$500-1,000 each

Total Sensor Suite Cost: Approximately $25,000-35,000 (excluding joint encoders)

Integration Approach​

Sensor Fusion Algorithm: Atlas uses a tightly-coupled LiDAR-Inertial-Visual SLAM system that fuses:

  • LiDAR point clouds (from rotating Hokuyo) for 3D structure mapping and obstacle detection at 5-30m range
  • Stereo camera images for visual features, texture-based localization, and object recognition
  • IMU data at 1 kHz for orientation tracking, motion prediction, and dynamic balance control
  • Joint encoder feedback at 4 kHz for proprioceptive state estimation (limb positions, center-of-mass calculation)

ROS2 Integration (conceptual, Atlas uses proprietary middleware but principles apply):

  • /multisense/left/image_raw (sensor_msgs/Image): Left stereo camera for visual odometry
  • /multisense/lidar_scan (sensor_msgs/LaserScan): Planar LiDAR scan, rotated to build 3D point cloud
  • /torso/imu (sensor_msgs/Imu): High-rate orientation, angular velocity, linear acceleration
  • /joint_states (sensor_msgs/JointState): 28 DOF encoder positions for balance computation
  • /slam/odometry (nav_msgs/Odometry): Fused state estimate (position, orientation, velocity) from EKF

Key Design Decision: The Multisense SL sensor head combines stereo cameras and a rotating LiDAR in a single unit, simplifying calibration and reducing sensor mounting complexity. The rotating LiDAR builds a 3D point cloud by sweeping a 2D scan line, achieving 360Β° coverage with mechanical rotation.

Lessons Applied​

From Lesson 1 (Camera Systems):

  • Stereo vision provides passive depth estimation for nearby obstacles (0.5-10m) without IR interference in outdoor sunlight
  • Active perception: Pan-tilt head allows Atlas to "look" at objects of interest, adjusting FOV dynamically

From Lesson 2 (Depth Sensing):

  • LiDAR for outdoor reliability: Unlike RGB-D sensors, LiDAR works in direct sunlight and provides long-range (30m) obstacle detection
  • Rotating vs. solid-state: Mechanical rotation achieves 360Β° coverage with a single planar LiDAR, trading mechanical complexity for cost savings vs. 3D LiDAR ($100k+)

From Lesson 3 (IMU and Proprioception):

  • High-rate IMU (1 kHz) enables dynamic balance control during parkour, backflips, and rough terrain walking
  • Joint encoders as proprioception: 4 kHz encoder feedback provides accurate limb position for center-of-mass estimation, critical for balance

From Lesson 4 (Sensor Fusion):

  • Tightly-coupled SLAM: LiDAR, camera, and IMU data are fused in a single optimization framework, improving robustness vs. loosely-coupled systems
  • Redundancy strategy: If camera fails (e.g., mud on lens), LiDAR + IMU provide basic navigation; if LiDAR fails, stereo vision + IMU enable short-range operation

Design Insights​

Sensor Placement Philosophy: Head-mounted sensors (Multisense SL) provide wide FOV and active perception via pan-tilt, but create occlusion when manipulating objects near the chest. Atlas supplements with chest-mounted ToF depth camera for "blind spot" coverage during manipulation tasks.

Failure Handling: Atlas prioritizes redundancy for safety-critical tasks:

  • Dual depth modalities (stereo + LiDAR): If one fails, the other provides fallback
  • IMU drift correction: Visual and LiDAR loop closure re-initializes IMU bias every 10-30 seconds
  • Emergency stop: If all perception fails, IMU-only balance control attempts safe shutdown (crouch, then sit)

Relevance to Capstone Scenarios:

  • Scenario B (Outdoor Delivery): Atlas' LiDAR + stereo camera approach is ideal for outdoor navigation with long-range obstacle detection
  • Scenario D (Warehouse): Multi-sensor redundancy ensures high reliability for safety-critical environments

πŸ“Š Case Study 2: Agility Robotics Digit - Visual-Inertial Odometry for GPS-Free Indoor Navigation​

Robot Overview​

Agility Robotics Digit is designed for warehouse logistics and last-mile delivery in indoor and semi-structured outdoor environments (sidewalks, building entrances). Digit prioritizes autonomous navigation without GPS, using Visual-Inertial Odometry (VIO) for localization. The robot must carry packages up to 16 kg while navigating through doorways, elevators, and around human workers.

Sensor Configuration​

Sensor TypeModel/TechnologySpecificationsPlacementUpdate RateEstimated Cost
RGB-D Depth CamerasIntel RealSense D435i (Γ—4)Stereo + IR depth, 1280Γ—720 RGB, 1280Γ—720 depth, IMU integratedHead (forward), torso (left/right/rear)30 FPS (RGB), 90 FPS (depth)$200-300 each ($800-1,200 total)
LiDAR2D planar LiDAR (model unspecified)270Β° FOV, 30m rangeChest (horizontal plane)10-20 Hz$1,500-5,000
IMUBosch BMI055 (integrated in D435i) + MEMS IMU6-axis (accel + gyro), Β±16g, Β±2000Β°/sTorso (center) + cameras200 Hz (torso), 200 Hz (D435i)$50-100 (integrated in D435i)
Absolute EncodersJoint position sensorsProprioceptive feedback for 20 DOF legs/armsAll leg and arm joints100+ Hz$300-500 each

Total Sensor Suite Cost: Approximately $5,000-10,000 (excluding encoders)

Integration Approach​

Sensor Fusion Algorithm: Digit uses Visual-Inertial Odometry (VIO) combining:

  • Four Intel RealSense D435i cameras provide 360Β° RGB-D coverage and integrated IMU data
  • Stereo visual features from multiple cameras enable robust visual odometry with redundancy (if one camera fails, others compensate)
  • IMU data fuses with visual odometry to handle rapid motion, camera occlusion, and reduce drift during dynamic walking
  • LiDAR supplements for obstacle detection and loop closure detection in large warehouses

ROS2 Integration (Digit uses proprietary software, but ROS2 equivalent would be):

  • /camera/front/color/image_raw, /camera/left/color/image_raw, etc. (sensor_msgs/Image): 4 cameras for multi-view VIO
  • /camera/front/depth/image_rect_raw (sensor_msgs/Image): Depth images for obstacle avoidance
  • /camera/front/imu (sensor_msgs/Imu): Integrated IMU from D435i cameras
  • /torso/imu (sensor_msgs/Imu): High-precision torso IMU for balance control
  • /scan (sensor_msgs/LaserScan): 2D LiDAR for obstacle detection
  • /odometry/vio (nav_msgs/Odometry): Fused VIO output combining cameras + IMU

Key Design Decision: Digit uses four RealSense D435i cameras instead of a single high-end camera or 3D LiDAR. This provides:

  1. Redundancy: If one camera's view is blocked (e.g., carrying a large package), others maintain localization
  2. 360Β° coverage: Front, left, right, rear cameras eliminate blind spots
  3. Cost efficiency: Four $250 cameras ($1,000 total) vs. one 3D LiDAR ($8,000+)

Lessons Applied​

From Lesson 1 (Camera Systems):

  • Multi-camera visual odometry: Tracks features across overlapping camera views, increasing robustness vs. single stereo pair
  • RGB-D integration: D435i combines RGB camera with active IR stereo depth, enabling operation in low-light warehouses

From Lesson 2 (Depth Sensing):

  • Active IR stereo: D435i projects IR pattern for texture-less surfaces (white walls, uniform floors), overcoming passive stereo limitations
  • Depth + LiDAR complementarity: RGB-D provides dense depth (0.3-3m) for close-range manipulation; LiDAR provides sparse long-range (5-30m) for navigation

From Lesson 3 (IMU and Proprioception):

  • VIO requires tight camera-IMU synchronization: D435i integrates IMU on camera PCB, ensuring hardware time-synchronization (<1 ms jitter)
  • Balance control vs. navigation IMU: Torso IMU (200 Hz) for dynamic balance; camera IMUs (200 Hz) for VIO

From Lesson 4 (Sensor Fusion):

  • VIO (Visual-Inertial Odometry): Fuses visual features with IMU using Extended Kalman Filter (EKF) or graph optimization (e.g., VINS-Mono algorithm)
  • Loop closure: When Digit revisits a location, visual recognition triggers re-localization, correcting accumulated drift

Design Insights​

GPS-Free Localization Philosophy: Digit operates in environments where GPS is unavailable (indoor warehouses, building interiors). VIO provides drift-bounded localization:

  • Short-term (<1 min): IMU-predicted motion between camera frames (<0.1% drift)
  • Medium-term (1-10 min): Visual odometry bounds IMU drift to <1% of distance traveled
  • Long-term (10+ min): Loop closure re-initializes position when revisiting known areas

Failure Handling:

  • Single camera failure: VIO continues with remaining 3 cameras at reduced accuracy (Β±5 cm β†’ Β±10 cm localization error)
  • IMU drift: If visual features are lost (e.g., uniform white hallway), IMU dead-reckoning for <5 seconds until features reappear
  • Complete vision failure: Fall back to joint encoders + IMU for basic balance, stop autonomous navigation, request manual intervention

Sensor Placement Strategy: RealSense cameras placed at torso height (not head) to minimize occlusion during package carrying. Forward and side cameras provide frontal FOV overlap for robust stereo matching.

Relevance to Capstone Scenarios:

  • Scenario A (Home Assistant): Digit's VIO approach ideal for GPS-free indoor navigation with cost-effective RGB-D cameras
  • Scenario C (Human Interaction): Multi-camera setup provides 360Β° awareness for safe human proximity
  • Scenario D (Warehouse): Proven VIO in real warehouse deployments (Amazon, GXO Logistics)

πŸ“Š Case Study 3: Tesla Optimus - Multi-Camera Fusion for Cost-Optimized Manipulation​

Robot Overview​

Tesla Optimus (also known as Tesla Bot) is designed for general-purpose humanoid tasks in factories, homes, and service environments. Optimus prioritizes cost reduction by leveraging Tesla's automotive Full Self-Driving (FSD) computer and camera-only perception (no LiDAR). The robot must perform manipulation tasks (assembly, picking, placing) and navigate indoor/outdoor environments.

Sensor Configuration​

Sensor TypeModel/TechnologySpecificationsPlacementUpdate RateEstimated Cost
Head Cameras8Γ— automotive-grade cameras1280Γ—960 resolution, wide/standard/narrow FOV mixHead (3Γ— forward, 2Γ— side, 1Γ— rear, 2Γ— downward)30-60 FPS$50-200 each ($400-1,600 total)
Wrist Cameras2Γ— high-resolution cameras1920Γ—1080 or higher for manipulationWrists (gripper-mounted)30 FPS$100-300 each ($200-600 total)
IMU6-axis MEMS IMUΒ±8g accel, Β±1000Β°/s gyroTorso (center of mass)100-200 Hz$20-50
Joint Force/Torque SensorsActuator-integrated sensorsMeasure joint torque for compliant controlAll 28+ joints200+ HzIntegrated in actuators
Foot Force SensorsPressure sensors or load cellsGround contact force for gait stabilityBoth feet200+ Hz$50-200 each

Total Sensor Suite Cost: Approximately $1,000-3,000 (excluding actuator-integrated sensors)

Notable Omission: No LiDAR or dedicated depth sensors. Depth estimated via monocular depth prediction neural networks.

Integration Approach​

Sensor Fusion Algorithm: Optimus uses multi-camera neural network fusion running on Tesla FSD computer:

  • Eight automotive cameras provide overlapping 360Β° coverage for navigation and obstacle avoidance
  • Monocular depth estimation: Neural networks trained on millions of driving scenarios predict depth from single camera images
  • Multi-camera occupancy grid: Fuses 8 camera views into 3D occupancy map (similar to Tesla's BEV - Bird's Eye View - representation for cars)
  • Wrist cameras for manipulation: Visual servoing during grasping, object pose estimation
  • IMU + vision fusion: IMU provides motion prediction between camera frames; neural network compensates for camera motion

ROS2 Integration (conceptual for educational purposes):

  • /camera/head_forward_wide/image_raw, /camera/head_forward_narrow/image_raw, etc. (sensor_msgs/Image): 8 head cameras
  • /camera/wrist_left/image_raw, /camera/wrist_right/image_raw (sensor_msgs/Image): Gripper-mounted cameras
  • /imu/data (sensor_msgs/Imu): Torso IMU for balance and motion prediction
  • /joint_states (sensor_msgs/JointState): Actuator positions and torques
  • /depth/prediction (sensor_msgs/Image): Neural network monocular depth prediction (not sensor_msgs/PointCloud2, as depth is estimated, not measured)
  • /perception/occupancy_grid (nav_msgs/OccupancyGrid): Fused 3D obstacle map from multi-camera neural network

Key Design Decision: Vision-only perception (no LiDAR) reduces sensor cost from $8,000-20,000 (3D LiDAR) to <$2,000 (cameras only). Trade-off: depth estimation is less accurate (Β±10-20 cm at 5m vs. Β±2 cm for LiDAR) but sufficient for navigation and manipulation tasks in structured environments.

Lessons Applied​

From Lesson 1 (Camera Systems):

  • Multi-camera stereo: Overlapping FOV cameras enable stereo depth triangulation (e.g., forward-wide + forward-narrow cameras)
  • Wrist-mounted cameras: Eye-in-hand configuration provides unoccluded view of grasped objects until contact

From Lesson 2 (Depth Sensing):

  • Monocular depth neural networks: Trained on massive datasets (Tesla FSD data), predict depth without structured light or ToF hardware
  • Trade-off acceptance: Lower depth accuracy acceptable for navigation (collision avoidance requires ~10 cm precision, not <1 cm)

From Lesson 3 (IMU and Proprioception):

  • Force/torque sensing: Joint actuators measure applied forces, enabling compliant manipulation (e.g., gentle handoff, adaptive grip)
  • Foot force sensors: Ground reaction forces for zero-moment point (ZMP) balance control

From Lesson 4 (Sensor Fusion):

  • Neural network as fusion algorithm: Instead of traditional EKF/UKF, Tesla uses transformers to fuse 8 camera views + IMU into unified 3D occupancy representation
  • Temporal fusion: Neural network processes camera sequences (not single frames), implicitly learning visual odometry

Design Insights​

Cost Reduction Philosophy: Optimus targets consumer/factory price points ($20,000-30,000 estimated), requiring aggressive sensor cost reduction:

  • No LiDAR: Saves $8,000-20,000 per robot
  • Automotive-grade cameras: Leverage Tesla's supply chain ($50-200 per camera vs. $500-2,000 for robotics-grade cameras)
  • FSD computer reuse: Amortizes R&D costs across automotive and robotics platforms

Wrist Camera Benefits:

  • Unoccluded view: Unlike head cameras, wrist cameras maintain view of object until gripper closes
  • Close-range precision: 1080p cameras at 0.2-0.5m provide sub-millimeter pixel resolution for grasp refinement
  • Lighting control: Wrist-mounted LED ring illuminates workspace, reducing sensitivity to ambient lighting

Failure Handling:

  • Camera redundancy: 8 head cameras mean 1-2 camera failures still allow degraded navigation (7 cameras β†’ reduced FOV coverage)
  • Monocular depth uncertainty: Neural network outputs confidence scores; low-confidence depth estimates trigger cautious behavior (slow down, use other cameras)
  • IMU drift: Visual odometry from multiple cameras re-initializes IMU bias every 1-2 seconds

Computational Strategy: FSD computer (custom ASIC) runs neural networks at 144 TOPS (tera operations per second), enabling real-time processing of 8 camera streams + depth prediction + object detection. Lower-cost robots cannot afford this compute power, making LiDAR + simpler algorithms more practical.

Relevance to Capstone Scenarios:

  • Scenario A (Home Assistant): Vision-only approach viable for indoor tasks if compute budget allows neural network depth estimation
  • Scenario C (Human Interaction): Wrist cameras essential for safe object handoff and manipulation
  • Cost-constrained designs: Demonstrates trade-off between sensor hardware (LiDAR) vs. computation (neural networks for depth estimation)

πŸ“Š Case Study 4: Comparative Analysis - Sensor Suite Trade-offs​

This table summarizes key design decisions across the three case studies, highlighting how different operational requirements drive sensor selection.

CriterionAtlas (Boston Dynamics)Digit (Agility Robotics)Optimus (Tesla)
Primary EnvironmentOutdoor/disaster (unstructured)Indoor warehouse (semi-structured)Indoor factory/home (structured)
Depth TechnologyLiDAR (rotating 2D) + StereoRGB-D (active IR stereo) + LiDARMonocular depth neural networks
Depth Sensor Cost$15,000+ (Multisense SL)$1,000-1,500 (4Γ— RealSense D435i + LiDAR)$0 (cameras only, depth estimated)
Camera Count2 (stereo pair)8 (4Γ— RGB-D cameras = 8 image streams)10 (8 head + 2 wrist)
IMU Update Rate1,000 Hz (high-precision)200 Hz (torso + integrated)100-200 Hz (standard MEMS)
Fusion AlgorithmTightly-coupled LiDAR-Visual-Inertial SLAMVIO (Visual-Inertial Odometry) with loop closureMulti-camera neural network fusion (transformer-based)
Localization Drift<0.1% (LiDAR + loop closure)<1% (VIO with loop closure)<2% (monocular depth less accurate)
Compute PlatformCustom high-performance CPU/GPUDual Intel i7 CPUs + optional JetsonTesla FSD computer (144 TOPS custom ASIC)
Estimated Sensor Cost$25,000-35,000$5,000-10,000$1,000-3,000
Cost OptimizationPerformance-first (disaster response requires reliability)Balanced (warehouse profitability requires cost control)Cost-first (consumer/factory scale requires low price)
Failure RedundancyDual depth modalities (stereo + LiDAR)Multi-camera redundancy (4 cameras, lose 1-2 OK)8 head cameras, partial failure tolerated
Outdoor SunlightLiDAR handles sunlight wellRGB-D IR pattern may wash out, relies on LiDAR fallbackMonocular depth neural networks handle sunlight (trained on outdoor driving)
Applicable Capstone ScenariosScenario B (Outdoor Delivery), Scenario D (Warehouse)Scenario A (Home Assistant), Scenario D (Warehouse)Scenario A (Home Assistant), Scenario C (Human Interaction)

Key Takeaway for Capstone Design​

There is no "best" sensor suiteβ€”only appropriate trade-offs for specific scenarios:

  1. If outdoor operation is required (Scenario B): LiDAR is nearly mandatory due to sunlight interference with IR-based RGB-D sensors. Atlas demonstrates this principle.

  2. If indoor GPS-free navigation is required (Scenarios A, C, D): VIO (camera + IMU fusion) provides drift-bounded localization. Digit proves this works at commercial scale.

  3. If cost is paramount and compute is available (Scenario A with consumer budget): Monocular depth estimation can replace hardware depth sensors, as Optimus demonstrates. However, this requires significant neural network training and inference compute.

  4. If manipulation precision is critical (Scenarios A, C): Wrist-mounted cameras (Optimus approach) provide unoccluded view of objects during grasping. Head cameras alone create blind spots when arms reach forward.

  5. If safety is critical (Scenario C: human interaction): Redundant depth sensing (e.g., Atlas' stereo + LiDAR) ensures zero collision risk even if one sensor fails.

When designing your capstone project, justify sensor choices with explicit trade-offs (cost vs. accuracy, indoor vs. outdoor, compute vs. hardware). Reference these case studies to demonstrate awareness of real-world design decisions.


Summary: Applying Case Studies to Your Capstone​

Use these case studies as conceptual references, not blueprints:

  1. Atlas shows: When reliability and outdoor operation are paramount, invest in LiDAR and redundant sensors (Scenarios B, D)
  2. Digit shows: VIO with multiple RGB-D cameras is viable for indoor logistics at moderate cost (Scenarios A, D)
  3. Optimus shows: Vision-only approaches can work if neural network compute is available and accuracy requirements are relaxed (Scenarios A, C)

Design Process:

  1. Identify your scenario constraints (indoor vs. outdoor, cost budget, precision requirements)
  2. Map constraints to sensor technologies from Lessons 1-4
  3. Reference case studies for real-world validation (e.g., "Digit uses VIO for warehouse navigation, so VIO is appropriate for Scenario D")
  4. Justify trade-offs explicitly (e.g., "Chose stereo cameras over RGB-D because Scenario B requires outdoor operation where sunlight interferes with IR depth sensors, as seen in Atlas' LiDAR-first design")

What Makes a Strong Capstone Design:

  • Sensor choices reference specific limitations from Lessons 1-4 (e.g., "RGB-D fails on glass surfaces, Lesson 2 Section 3.3")
  • Fusion strategy matches sensor characteristics (e.g., "VIO appropriate for camera + IMU, not LiDAR + magnetometer")
  • Failure modes demonstrate awareness of real-world issues (e.g., "Atlas handles camera mud occlusion with LiDAR fallback")
  • ROS2 architecture is realistic and implementable (message types, topic names, QoS policies correct)

These case studies provide the evidence base for your design justifications. When explaining sensor selection, cite these robots as proof that your approach works in practice.