What Is a Robot Trajectory?
A robot trajectory is the sequence of states and actions recorded during one episode of robot operation. Formally: {(s₀, a₀), (s₁, a₁), ..., (s_T, a_T)}, where s_t is the full sensor observation at time step t (camera images + joint states + force readings) and a_t is the action taken.
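As a concrete sketch, a trajectory can be stored as a list of (observation, action) pairs. The field names and shapes below are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Step:
    """One (s_t, a_t) pair. Field names/shapes are illustrative only."""
    image: np.ndarray      # camera observation, e.g. (H, W, 3) uint8
    joint_pos: np.ndarray  # joint angles, shape (n_joints,)
    force: np.ndarray      # force/torque reading, shape (6,)
    action: np.ndarray     # action commanded at this step

# A 100-step episode of placeholder data (e.g. 2 s at 50 Hz)
trajectory = [
    Step(
        image=np.zeros((64, 64, 3), dtype=np.uint8),
        joint_pos=np.zeros(7),
        force=np.zeros(6),
        action=np.zeros(7),
    )
    for _ in range(100)
]
```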
For training, the action sequence is what the policy must learn to reproduce. But "action" is not a single universal format — it depends on your design choices, and those choices have significant consequences for policy generalization and training difficulty.
Joint Space Representation
The most direct representation: the action a_t is a vector of absolute joint angles q ∈ ℝⁿ. The robot controller receives joint angle targets and drives each joint to the commanded position.
- Advantages: No IK required — angles go directly to the motor controllers. Deterministic: the same action always produces the same joint configuration. Fast execution — no intermediate computation.
- Disadvantages: Arm-specific. A policy trained on a 6-DOF arm cannot be transferred to a 7-DOF arm without retraining, even for the same task. Not intuitive: angle values do not correspond to meaningful task concepts.
- Standard use: ALOHA and ACT use absolute joint space as the primary action representation. For fixed-workspace, single-arm tasks this works well.
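A minimal sketch of joint-space replay: each action is an absolute joint-angle vector handed directly to the controller's position-command interface. `send_targets` here is a hypothetical stand-in for that interface, not a real robot API:

```python
import numpy as np

def execute_joint_trajectory(actions, send_targets):
    """Replay absolute joint-angle actions.

    send_targets: callable standing in for the robot controller's
    joint position-command interface (hypothetical).
    """
    for q in actions:
        # No IK step: the angles go straight to the motor controllers.
        send_targets(np.asarray(q, dtype=float))

# Log commands instead of driving hardware, for demonstration.
log = []
actions = [np.linspace(0.0, 0.1, 6) * t for t in range(3)]
execute_joint_trajectory(actions, log.append)
```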
Cartesian Task Space
The action a_t represents the desired end-effector pose: (x, y, z, qx, qy, qz, qw) — three position components and a quaternion for orientation.
- Advantages: Intuitive — the action directly specifies where the end-effector should be. More transferable across robots with different joint configurations but similar workspaces.
- Disadvantages: Requires IK to convert to joint commands — adds latency and singularity risk. Rotation representation is tricky: Euler angles have gimbal lock (avoid them), and quaternions are compact but not unique — q and −q encode the same rotation (the double cover of SO(3)).
- Rotation representation note: Always use quaternions (not Euler angles) in code. For neural network outputs, consider 6D rotation representation (first two columns of the rotation matrix) — it is continuous and singularity-free.
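The 6D representation and its inverse can be sketched in a few lines: take the first two columns of the rotation matrix as the network output, and recover a valid rotation via Gram–Schmidt orthonormalization:

```python
import numpy as np

def rotmat_to_6d(R):
    """First two columns of a 3x3 rotation matrix, flattened to shape (6,)."""
    return R[:, :2].reshape(-1, order="F")

def sixd_to_rotmat(v):
    """Recover a rotation matrix from a (possibly noisy) 6D vector
    via Gram-Schmidt: normalize the first column, orthogonalize the
    second against it, take the cross product for the third."""
    a, b = v[:3], v[3:]
    c1 = a / np.linalg.norm(a)
    b = b - np.dot(c1, b) * c1
    c2 = b / np.linalg.norm(b)
    c3 = np.cross(c1, c2)
    return np.stack([c1, c2, c3], axis=1)
```

Because the decoding step orthonormalizes, any 6D vector a network emits maps to a valid rotation — there are no discontinuities or invalid outputs to handle.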
Delta Actions
Instead of absolute targets, delta actions specify the change from the current state: a_t = Δq (change in joint angles) or a_t = (Δx, Δy, Δz, Δ rotation) (change in Cartesian pose).
- Why deltas are easier to learn: The magnitude of delta actions is small and roughly constant across a trajectory. Networks learn to predict small, bounded values more easily than large absolute coordinates that vary with workspace position.
- Implicit safety: Clipping delta actions to a maximum magnitude bounds the robot's speed — an important safety property.
- Standard use: Diffusion Policy, RT-1, Octo, and most VLA models use Cartesian delta actions as their primary action space.
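The clipping-for-safety idea is simple to implement. The 1 cm-per-step limit below is an assumed value for illustration; real limits depend on the robot and control rate:

```python
import numpy as np

MAX_DELTA_POS = 0.01  # assumed limit: 1 cm per step caps end-effector speed

def apply_delta_action(current_pos, delta):
    """Clip a Cartesian position delta before applying it.

    Clipping bounds the per-step motion regardless of what the
    policy outputs -- the implicit safety property of delta actions.
    """
    delta = np.clip(delta, -MAX_DELTA_POS, MAX_DELTA_POS)
    return current_pos + delta

pos = np.array([0.3, -0.1, 0.15])
# The oversized x-delta (0.5 m) is clipped to 0.01 m before applying.
pos = apply_delta_action(pos, np.array([0.5, 0.0, -0.002]))
```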
Absolute vs. Object-Relative Representations
A fundamental generalization question: should actions be represented in the robot's workspace frame, or relative to the object being manipulated?
- Absolute (workspace frame): Action values depend on where in the workspace the object is. If the table is moved 10 cm, the policy fails. Good for fixed setups; poor for deployment generalization.
- Object-relative: Actions are expressed as offsets from the detected object pose. Policy learns "grasp from 5 cm above the object" rather than "move to (0.3, -0.1, 0.15)". Requires reliable object detection or pose estimation, but generalizes dramatically better to new table heights, positions, and even new environments.
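A minimal sketch of the object-relative idea, positions only (rotation handling omitted for brevity, and the detected object poses below are made-up values):

```python
import numpy as np

def relative_to_workspace(object_pos, offset):
    """Convert an object-relative offset into a workspace-frame target.

    The policy outputs `offset` (e.g. 'grasp from 5 cm above'); the
    detected object pose anchors it in the workspace.
    """
    return object_pos + offset

grasp_offset = np.array([0.0, 0.0, 0.05])  # 5 cm above the object

# The same policy output works wherever the object is detected:
target_a = relative_to_workspace(np.array([0.3, -0.1, 0.15]), grasp_offset)
target_b = relative_to_workspace(np.array([0.5, 0.2, 0.10]), grasp_offset)
```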
Trajectory Length and Padding
Tasks vary in duration: a simple grasp might take 2 seconds (100 steps at 50 Hz); a multi-step assembly might take 30 seconds (1,500 steps). This creates a practical challenge for batch training.
- Fixed-length with padding: Pad shorter episodes to a maximum length with a special "no-op" action token. Use attention masking in Transformer-based policies so the network ignores padding tokens. Simple to implement.
- Variable-length with masking: Process each episode at its natural length. Requires careful batching (group by similar length or use dynamic padding).
- Action chunking: ACT-style: break the trajectory into fixed-length chunks (e.g., 100 steps) and train on chunks independently. Naturally handles variable episode lengths while maintaining fixed-size model inputs.
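The fixed-length-with-padding approach can be sketched as follows: pad each action sequence to a common length with a no-op action and return a boolean mask that the attention layers use to ignore the padding:

```python
import numpy as np

def pad_and_mask(episodes, max_len, noop):
    """Pad variable-length action sequences to max_len with a no-op
    action; return the batch plus a mask marking real (non-pad) steps."""
    batch = np.tile(noop, (len(episodes), max_len, 1))
    mask = np.zeros((len(episodes), max_len), dtype=bool)
    for i, ep in enumerate(episodes):
        T = len(ep)
        batch[i, :T] = ep
        mask[i, :T] = True  # attention attends only where mask is True
    return batch, mask

# Two episodes of 3 and 5 steps, 2-dimensional actions, zero as the no-op.
episodes = [np.ones((3, 2)), np.ones((5, 2))]
batch, mask = pad_and_mask(episodes, max_len=5, noop=np.zeros(2))
```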
Representation Comparison
| Representation | Intuitive | Generalizable | IK Needed | Used In |
|---|---|---|---|---|
| Absolute joint angles | No | Low | No | ALOHA, ACT, RoboAgent |
| Absolute Cartesian pose | Yes | Moderate | Yes | Classic manipulation research |
| Cartesian delta actions | Yes | High | Yes (incremental) | Diffusion Policy, RT-1, Octo |
| Object-relative Cartesian | Yes | Very High | Yes | Generalizable grasping research |
| Waypoint sequences | Yes | High | Yes | Long-horizon task planning |
Trajectory representation is one of the most impactful design decisions in a robot learning system. Spend time on this choice before investing in large-scale data collection. See the SVRC platform for tools that support all of the above representations with built-in format conversion.