Chapter 2: Isaac ROS VSLAM/Perception
Learning Objectives
By the end of this chapter, you will be able to:
- Understand Visual SLAM (VSLAM) algorithms and their role in robotics
- Explain depth perception techniques for 3D environment understanding
- Implement sensor fusion for robust perception
- Use Isaac ROS packages for VSLAM and perception tasks
- Configure and tune VSLAM parameters for different environments
- Process point clouds and depth images for obstacle detection
- Integrate perception pipelines with navigation systems
1. Introduction to Visual SLAM
Visual Simultaneous Localization and Mapping (VSLAM) is a technique that enables robots to build a map of an unknown environment while simultaneously tracking their position within that map using visual sensors (cameras).
Why VSLAM?
Traditional localization methods require:
- Pre-built maps
- External infrastructure (GPS, beacons)
- Expensive sensors (LiDAR)
VSLAM offers:
- ✅ Camera-only localization (low cost)
- ✅ Dense 3D reconstruction
- ✅ Works indoors and outdoors
- ✅ Scales to large environments
- ✅ Enables autonomous exploration
VSLAM Pipeline Overview
A typical VSLAM pipeline processes each camera frame through front-end tracking (feature extraction and matching against the current map), local mapping (triangulating new map points and refining recent poses), and back-end optimization (loop closure detection followed by pose-graph optimization or bundle adjustment).
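The skeleton below sketches this loop in code. The class and method names (`VSLAMPipeline`, `track`, `map_locally`, and so on) are illustrative placeholders only, not an actual Isaac ROS or ORB-SLAM3 API; they simply mirror the stages described above.

```python
"""
Skeleton of a frame-by-frame VSLAM loop (illustrative placeholder names).
"""


class VSLAMPipeline:
    def __init__(self):
        self.map_points = []      # 3D landmarks estimated so far
        self.keyframes = []       # selected frames with poses and features
        self.current_pose = None  # latest camera pose estimate

    def track(self, frame):
        """Front end: extract features and estimate the camera pose
        by matching against the local map."""

    def map_locally(self, frame):
        """Local mapping: triangulate new map points and refine
        recent keyframe poses."""

    def close_loops(self):
        """Back end: detect revisited places, then run pose-graph
        optimization / bundle adjustment over the global map."""

    def should_add_keyframe(self, frame):
        """Heuristic: add a keyframe when tracking quality drops or the
        camera has moved far enough since the last keyframe."""
        return False

    def process_frame(self, frame):
        # One iteration of the pipeline per camera frame
        self.track(frame)
        if self.should_add_keyframe(frame):
            self.map_locally(frame)
            self.close_loops()
```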
2. VSLAM Algorithms
2.1 Feature-Based VSLAM
Feature-based methods extract distinctive keypoints from images and track them across frames.
ORB-SLAM3
ORB-SLAM3 is one of the most popular open-source VSLAM systems.
Key Features:
- Monocular, stereo, and RGB-D camera support
- Multi-map system for disconnected environments
- Real-time loop closure detection
- Place recognition
Algorithm Steps:
1. Feature Extraction: Extract ORB (Oriented FAST and Rotated BRIEF) features
2. Tracking: Match features between consecutive frames and estimate the camera pose
3. Local Mapping: Create and optimize local map points
4. Loop Closing: Detect revisited places and correct accumulated drift
5. Global Optimization: Pose-graph optimization and bundle adjustment for global consistency
Example: ORB Feature Detection
"""
Simple ORB feature detection example.
"""
import cv2
import numpy as np
def detect_orb_features(image_path):
"""
Detect ORB features in an image.
Args:
image_path: Path to input image
Returns:
keypoints: List of detected keypoints
descriptors: Feature descriptors
"""
# Load image
img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
# Create ORB detector
orb = cv2.ORB_create(
nfeatures=1000, # Maximum number of features
scaleFactor=1.2, # Pyramid decimation ratio
nlevels=8, # Number of pyramid levels
edgeThreshold=31, # Border size
firstLevel=0, # First pyramid level
WTA_K=2, # Number of points for descriptor
scoreType=cv2.ORB_HARRIS_SCORE,
patchSize=31 # Patch size for descriptor
)
# Detect keypoints and compute descriptors
keypoints, descriptors = orb.detectAndCompute(img, None)
# Draw keypoints
img_keypoints = cv2.drawKeypoints(
img,
keypoints,
None,
color=(0, 255, 0),
flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS
)
print(f"Detected {len(keypoints)} ORB features")
return keypoints, descriptors, img_keypoints
# Usage
keypoints, descriptors, viz = detect_orb_features('camera_frame.png')
cv2.imshow('ORB Features', viz)
cv2.waitKey(0)
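To illustrate the tracking step, the sketch below matches ORB descriptors between two consecutive frames with a brute-force Hamming matcher. The file names frame_t0.png and frame_t1.png are placeholders for two frames from your camera stream, and the match limit is an arbitrary choice for illustration.

```python
"""
ORB feature matching between consecutive frames (tracking step sketch).
"""
import cv2


def match_orb_features(img_prev, img_curr, max_matches=100):
    """Match ORB descriptors between two grayscale frames."""
    orb = cv2.ORB_create(nfeatures=1000)

    # Detect and describe features in both frames
    kp_prev, des_prev = orb.detectAndCompute(img_prev, None)
    kp_curr, des_curr = orb.detectAndCompute(img_curr, None)

    # Brute-force matcher with Hamming distance (appropriate for binary ORB descriptors)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_prev, des_curr)

    # Keep the best matches (smallest descriptor distance)
    matches = sorted(matches, key=lambda m: m.distance)[:max_matches]
    return kp_prev, kp_curr, matches


# Usage: the two file names are placeholders for consecutive camera images
prev_frame = cv2.imread('frame_t0.png', cv2.IMREAD_GRAYSCALE)
curr_frame = cv2.imread('frame_t1.png', cv2.IMREAD_GRAYSCALE)
kp_prev, kp_curr, matches = match_orb_features(prev_frame, curr_frame)
print(f"Found {len(matches)} cross-checked matches")

# In a full VSLAM front end, these matches would feed pose estimation,
# e.g. essential-matrix estimation with RANSAC (cv2.findEssentialMat).
```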
2.2 Direct Methods (Dense VSLAM)
Direct methods use pixel intensity values directly, without feature extraction.
LSD-SLAM
LSD-SLAM (Large-Scale Direct SLAM) operates on image intensities.
Advantages:
- Works in low-texture environments
- Dense or semi-dense reconstruction
- No feature extraction overhead
Disadvantages:
- Sensitive to lighting changes
- Higher computational cost
- Requires good initialization
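The core idea behind direct methods is a photometric error: rather than matching descriptors, they compare raw pixel intensities between a reference image and a second image warped by a candidate motion. The snippet below is a heavily simplified, conceptual sketch of that residual for a pure 2D pixel shift (a real system such as LSD-SLAM warps by a full 6-DoF pose using depth estimates and minimizes the error with gradient-based optimization); none of it is actual LSD-SLAM code.

```python
"""
Minimal photometric error sketch for a direct method (2D translation only).
"""
import numpy as np


def photometric_error(img_ref, img_cur, offset):
    """Mean squared intensity difference between img_ref and img_cur
    sampled at an integer pixel offset (dx, dy)."""
    dx, dy = offset
    h, w = img_ref.shape

    # Overlapping region of the two images for this candidate shift
    ref_patch = img_ref[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    cur_patch = img_cur[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)]

    residual = ref_patch.astype(np.float64) - cur_patch.astype(np.float64)
    return float(np.mean(residual ** 2))


# Usage: exhaustively test a small window of candidate shifts and keep the
# one with the lowest photometric error. Direct SLAM systems instead minimize
# this error over a 6-DoF camera pose using image gradients.
img_ref = np.random.randint(0, 256, (120, 160), dtype=np.uint8)
img_cur = np.roll(img_ref, shift=3, axis=1)  # simulate a 3-pixel horizontal motion

best = min(
    ((dx, dy) for dx in range(-5, 6) for dy in range(-5, 6)),
    key=lambda off: photometric_error(img_ref, img_cur, off),
)
print(f"Estimated shift: {best}")
```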
2.3 Hybrid Methods
Modern VSLAM systems combine feature-based and direct methods for robustness.
3. Depth Perception
3.1 Stereo Vision
Stereo cameras mimic human binocular vision to estimate depth.
Stereo Depth Estimation
"""
Stereo depth estimation using OpenCV.
"""
import cv2
import numpy as np
def compute_stereo_depth(left_image, right_image, baseline=0.12, focal_length=700):
"""
Compute depth map from stereo image pair.
Args:
left_image: Left camera image (grayscale)
right_image: Right camera image (grayscale)
baseline: Distance between cameras in meters
focal_length: Camera focal length in pixels
Returns:
depth_map: Depth map in meters
"""
# Create StereoBM object
stereo = cv2.StereoBM_create(
numDisparities=16*5, # Must be divisible by 16
blockSize=15 # Matching block size (odd number)
)
# Compute disparity map
disparity = stereo.compute(left_image, right_image)
# Convert to float and normalize
disparity = disparity.astype(np.float32) / 16.0
# Avoid division by zero
disparity[disparity == 0] = 0.1
# Compute depth: depth = (baseline * focal_length) / disparity
depth_map = (baseline * focal_length) / disparity
# Clip to reasonable range (0.1m to 10m)
depth_map = np.clip(depth_map, 0.1, 10.0)
return depth_map
# Load stereo images
left = cv2.imread('left_camera.png', cv2.IMREAD_GRAYSCALE)
right = cv2.imread('right_camera.png', cv2.IMREAD_GRAYSCALE)
# Compute depth
depth = compute_stereo_depth(left, right)
# Visualize
depth_vis = cv2.applyColorMap(
cv2.convertScaleAbs(depth, alpha=25),
cv2.COLORMAP_JET
)
cv2.imshow('Depth Map', depth_vis)
cv2.waitKey(0)
3.2 RGB-D Cameras
RGB-D cameras (e.g., RealSense, Kinect) provide aligned color and depth images.
Advantages:
- Direct depth measurement
- No stereo correspondence needed
- Works in textureless environments
Limitations:
- Limited range (0.3m - 10m typically)
- Outdoor performance issues (IR-based)
- Higher power consumption
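A common first processing step with an RGB-D camera is deprojecting the depth image into a 3D point cloud using the camera intrinsics. The sketch below assumes a pinhole model; the intrinsic values (FX, FY, CX, CY) and depth scale are placeholders and must be replaced with your camera's actual calibration (for example, from its camera_info topic).

```python
"""
Deproject an RGB-D depth image into a 3D point cloud (pinhole model sketch).
"""
import numpy as np

# Placeholder intrinsics -- replace with your camera's calibration
FX, FY = 615.0, 615.0   # focal lengths in pixels
CX, CY = 320.0, 240.0   # principal point in pixels
DEPTH_SCALE = 0.001     # raw depth units to meters (1 mm per unit assumed)


def depth_to_point_cloud(depth_image):
    """Convert a depth image (H x W, raw units) into an (N, 3) array of
    points in the camera frame, dropping invalid (zero-depth) pixels."""
    h, w = depth_image.shape
    z = depth_image.astype(np.float32) * DEPTH_SCALE

    # Pixel coordinate grids (u along columns, v along rows)
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    # Pinhole back-projection: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy
    x = (u - CX) * z / FX
    y = (v - CY) * z / FY

    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # keep only pixels with valid depth


# Usage with a synthetic depth image (a real one would come from the sensor driver)
depth_raw = np.full((480, 640), 1500, dtype=np.uint16)  # 1.5 m everywhere
cloud = depth_to_point_cloud(depth_raw)
print(f"Point cloud with {cloud.shape[0]} points, e.g. {cloud[0]}")
```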
3.3 LiDAR
LiDAR provides accurate 3D point clouds but at higher cost.
Comparison:
| Sensor | Range | Accuracy | Update Rate (Hz) | Cost | Power |
|---|---|---|---|---|---|
| Stereo Camera | 0.5-20m | Medium | 30-60 | $100-$500 | Low |
| RGB-D | 0.3-10m | High | 30-90 | $200-$600 | Medium |
| LiDAR | 1-100m | Very High | 10-20 | $1k-$75k | High |