Dense vs. Sparse Rewards: The Core Tradeoff
Sparse rewards — task completion only, no intermediate signal — are theoretically clean and leave nothing to hack. But sparse rewards fail catastrophically for manipulation tasks exceeding roughly 10 sequential steps because of credit assignment: when the policy sees a reward only at episode completion, it has no signal to explain which of the 200 preceding actions contributed to success. Sample efficiency degrades roughly exponentially with task horizon under sparse rewards, since random exploration must stumble onto the entire successful action sequence before any learning signal appears at all.
Dense rewards — a per-step progress signal based on some proxy for task advancement — dramatically improve sample efficiency. A policy learning to place an object with a dense distance-to-goal reward converges in 10–50× fewer environment steps than the sparse equivalent. The cost is reward hacking: the agent finds ways to maximize your proxy without solving your actual task.
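The tradeoff is easiest to see side by side. A minimal sketch, with illustrative function names and 3-D positions as NumPy arrays (the tolerance and normalization constant are assumptions, not values from a specific environment):

```python
import numpy as np

def sparse_reward(obj_pos, target_pos, tol=0.05):
    """Task-completion-only signal: +1 if the object is within
    `tol` metres of the target, else 0. Nothing to hack, but no
    gradient of progress for the policy to follow."""
    return 1.0 if np.linalg.norm(obj_pos - target_pos) < tol else 0.0

def dense_reward(obj_pos, target_pos, workspace_diag=1.0):
    """Per-step proxy signal: negative distance to goal, divided by
    the workspace diagonal so the component stays in [-1, 0]."""
    return -np.linalg.norm(obj_pos - target_pos) / workspace_diag
```

The dense variant gives the policy a usable signal on every step; the price, as described below, is that it is now a proxy that can be gamed.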
Potential-Based Shaping: The Theoretically Safe Dense Reward
The Ng, Harada, and Russell (1999) theorem on potential-based reward shaping provides a mathematically grounded approach to adding dense rewards without changing the optimal policy. The construction is: F(s, a, s') = γΦ(s') − Φ(s), where Φ is any real-valued potential function over states. Adding F to your base reward r produces a shaped reward r + F that shares the same set of optimal policies as r alone.
For manipulation, the most natural potential function is negative distance to goal: Φ(s) = −d(object, target_position). This produces a shaping term that gives positive reward when the agent moves the object toward the target and negative reward when it moves away — without ever rewarding the agent for just being near the object without having grasped it (which would be a non-potential-based dense reward and would change optimal behavior).
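A minimal sketch of this construction with the distance potential above, assuming object and target positions are NumPy arrays (function names are illustrative):

```python
import numpy as np

def potential(obj_pos, target_pos):
    """Phi(s) = -d(object, target): potential increases as the
    object approaches the goal."""
    return -np.linalg.norm(obj_pos - target_pos)

def shaped_reward(r_base, obj_pos, next_obj_pos, target_pos, gamma=0.99):
    """Potential-based shaping term F(s, a, s') = gamma*Phi(s') - Phi(s).
    By the policy-invariance theorem, r_base + F shares the same set
    of optimal policies as r_base alone."""
    F = (gamma * potential(next_obj_pos, target_pos)
         - potential(obj_pos, target_pos))
    return r_base + F
```

Note that F is positive when the transition moves the object toward the target and negative when it moves away, exactly the per-step progress signal described above.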
Manipulation-Specific Reward Components
- End-Effector to Object Distance: Normalize by workspace diagonal (typically 0.8–1.2m) to keep the component in [0, 1]. Weight: 0.1–0.3× total reward scale. Purpose: encourage approaching the object. Disable after first contact to avoid the hover-near-object hack.
- Object to Target Distance: Activate only after grasp is detected (contact force > threshold). Normalize by workspace. Weight: 0.4–0.6×. This is the primary dense signal for the transport phase.
- Grasp Quality: A binary or continuous measure of contact stability — e.g., whether contact normals form a positive force closure, or GQ-CNN score if you have it. Weight: 0.1–0.2×. Without this, the agent learns to drag objects rather than grasp them.
- Action Smoothness: Penalize jerk (the third derivative of joint position) to discourage abrupt motions that damage hardware. Compute it as the finite difference of joint accelerations, ||q̈_{t} − q̈_{t−1}|| (proportional to jerk up to a factor of 1/Δt). Weight: 0.05–0.1× as a negative penalty. Critical for sim-to-real transfer, since jerky sim policies fail on real hardware due to unmodeled joint elasticity.
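The four components above can be combined into a single weighted reward. The sketch below is illustrative, not a reference implementation: the weights sit inside the ranges listed above, the workspace diagonal, grasp detection, and grasp-quality score are assumed inputs, and the smoothness term uses the finite-difference form from the list:

```python
import numpy as np

WORKSPACE_DIAG = 1.0  # metres; assumed, within the 0.8-1.2 m range above

def manipulation_reward(ee_pos, obj_pos, target_pos,
                        grasp_quality, grasped,
                        accel, prev_accel,
                        w_reach=0.2, w_transport=0.5,
                        w_grasp=0.15, w_smooth=0.05):
    """Weighted sum of normalized components. Each distance term is
    clipped into [0, 1] by the workspace diagonal before weighting."""
    # Approach term: active only before first contact (anti-hover).
    reach = 0.0 if grasped else 1.0 - min(
        np.linalg.norm(ee_pos - obj_pos) / WORKSPACE_DIAG, 1.0)
    # Transport term: active only after grasp is detected.
    transport = (1.0 - min(np.linalg.norm(obj_pos - target_pos)
                           / WORKSPACE_DIAG, 1.0)) if grasped else 0.0
    # Smoothness penalty: finite-difference jerk on joint accelerations.
    jerk = np.linalg.norm(accel - prev_accel)
    return (w_reach * reach + w_transport * transport
            + w_grasp * grasp_quality - w_smooth * jerk)
```

Gating `reach` and `transport` on the grasp flag is what enforces the "disable after first contact" and "activate only after grasp" rules from the component list.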
Reward Hacking: Examples and Fixes
The following hacks appear repeatedly across manipulation reward designs. They are listed as warnings, not discoveries — your agent will find them independently:
- Hover Exploitation: The agent learns to move the end-effector to just outside contact range of the object and stay there, accumulating positive proximity reward without ever grasping. Fix: disable end-effector proximity reward after first contact, and add a time penalty for non-contact phases exceeding a threshold.
- Table Push for Termination: With negative reward for unfinished tasks, the agent discovers that pushing the object off the table terminates the episode faster (less accumulated negative reward) than attempting a difficult grasp. Fix: add a large negative terminal reward for object-out-of-bounds that swamps the ongoing negative signal, and verify your episode termination conditions.
- Early Termination Exploitation: Related to the above — if the success condition is checked on a rolling window (e.g., object within 5 cm of target for a single time step), agents learn to rapidly oscillate the object through the target zone. Fix: require the success condition to hold for 10+ consecutive time steps.
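The fix for the oscillation hack reduces to a consecutive-step counter. A minimal sketch (class name and interface are hypothetical):

```python
class SuccessChecker:
    """Declare success only after the condition has held for
    `hold_steps` consecutive steps, defeating the rapid-oscillation
    hack: any step outside the target zone resets the counter."""

    def __init__(self, hold_steps=10):
        self.hold_steps = hold_steps
        self.count = 0

    def update(self, in_target_zone: bool) -> bool:
        self.count = self.count + 1 if in_target_zone else 0
        return self.count >= self.hold_steps
```

Called once per environment step with the raw success predicate, this replaces the single-step check in the episode termination logic.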
Anti-Hacking Design Principles
- Terminal Reward Dominance: The task completion terminal reward should be at least 10× the maximum possible cumulative dense reward over an episode. This ensures the policy always prefers actually completing the task over exploiting dense reward accumulation.
- Component Normalization: Keep all dense reward components in [-1, 1] and set explicit weights. Un-normalized reward components at different scales interact unpredictably and make the reward function very sensitive to implementation details.
- Adversarial Testing: Before full training runs, test your reward function with a scripted adversarial policy that deliberately tries to maximize reward without completing the task. If the adversary finds a high-reward path that isn't task completion, redesign before wasting GPU time.
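The Terminal Reward Dominance principle can be checked mechanically before training. A sketch, assuming you can bound the per-step dense reward and know the episode horizon (function name and the 10× factor follow the principle stated above):

```python
def check_terminal_dominance(r_terminal, max_step_dense, horizon,
                             factor=10.0):
    """Verify the terminal reward is at least `factor` times the
    worst-case cumulative dense reward an agent could accumulate
    over one episode of `horizon` steps."""
    max_cumulative = max_step_dense * horizon
    return r_terminal >= factor * max_cumulative
```

With all dense components normalized into [-1, 1] as recommended above, `max_step_dense` is simply the sum of the positive component weights, which makes this bound easy to compute from the reward specification alone.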
SVRC's simulation environment includes reference reward implementations for standard manipulation tasks with anti-hacking safeguards built in. See the RL environment documentation for available tasks and reward specifications.