What Is a Robot Trajectory?
A robot trajectory is the sequence of states and actions recorded during one episode of robot operation. Formally: {(s₀, a₀), (s₁, a₁), ..., (s_T, a_T)}, where s_t is the full sensor observation at time step t (camera images + joint states + force readings) and a_t is the action taken.
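As a concrete sketch, a trajectory can be stored as a list of (observation, action) pairs. The field names and shapes below are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Step:
    """One (s_t, a_t) pair. Field names/shapes are illustrative only."""
    image: np.ndarray      # camera observation, e.g. (H, W, 3) uint8
    joint_pos: np.ndarray  # joint angles, shape (n_joints,)
    force: np.ndarray      # force/torque reading, shape (6,)
    action: np.ndarray     # action commanded at this step

# A 100-step episode of placeholder data (e.g. 2 s at 50 Hz)
trajectory = [
    Step(
        image=np.zeros((64, 64, 3), dtype=np.uint8),
        joint_pos=np.zeros(7),
        force=np.zeros(6),
        action=np.zeros(7),
    )
    for _ in range(100)
]
```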
For training, the action sequence is what the policy must learn to reproduce. But "action" is not a single universal format — it depends on your design choices, and those choices have significant consequences for policy generalization and training difficulty.
Joint Space Representation
The most direct representation: the action a_t is a vector of absolute joint angles q ∈ ℝⁿ. The robot controller receives joint angle targets and drives each joint to the commanded position.
- Advantages: No IK required — angles go directly to the motor controllers. Deterministic: the same action always produces the same joint configuration. Fast execution — no intermediate computation.
- Disadvantages: Arm-specific. A policy trained on a 6-DOF arm cannot be transferred to a 7-DOF arm without retraining, even for the same task. Not intuitive: angle values do not correspond to meaningful task concepts.
- Standard use: ALOHA and ACT use absolute joint space as the primary action representation. For fixed-workspace, single-arm tasks this works well.
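A minimal sketch of joint-space replay: each action is an absolute joint-angle vector handed directly to the controller's position-command interface. `send_targets` here is a hypothetical stand-in for that interface, not a real robot API:

```python
import numpy as np

def execute_joint_trajectory(actions, send_targets):
    """Replay absolute joint-angle actions.

    send_targets: callable standing in for the robot controller's
    joint position-command interface (hypothetical).
    """
    for q in actions:
        # No IK step: the angles go straight to the motor controllers.
        send_targets(np.asarray(q, dtype=float))

# Log commands instead of driving hardware, for demonstration.
log = []
actions = [np.linspace(0.0, 0.1, 6) * t for t in range(3)]
execute_joint_trajectory(actions, log.append)
```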
Cartesian Task Space
The action a_t represents the desired end-effector pose: (x, y, z, qx, qy, qz, qw) — three position components and a quaternion for orientation.
- Advantages: Intuitive — the action directly specifies where the end-effector should be. More transferable across robots with different joint configurations but similar workspaces.
- Disadvantages: Requires IK to convert to joint commands — adds latency and singularity risk. Rotation representation is tricky: Euler angles have gimbal lock (avoid them), and quaternions are compact but not unique — q and −q encode the same rotation (the double cover of SO(3)).
- Rotation representation note: Always use quaternions (not Euler angles) in code. For neural network outputs, consider 6D rotation representation (first two columns of the rotation matrix) — it is continuous and singularity-free.
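The 6D representation and its inverse can be sketched in a few lines: take the first two columns of the rotation matrix as the network output, and recover a valid rotation via Gram–Schmidt orthonormalization:

```python
import numpy as np

def rotmat_to_6d(R):
    """First two columns of a 3x3 rotation matrix, flattened to shape (6,)."""
    return R[:, :2].reshape(-1, order="F")

def sixd_to_rotmat(v):
    """Recover a rotation matrix from a (possibly noisy) 6D vector
    via Gram-Schmidt: normalize the first column, orthogonalize the
    second against it, take the cross product for the third."""
    a, b = v[:3], v[3:]
    c1 = a / np.linalg.norm(a)
    b = b - np.dot(c1, b) * c1
    c2 = b / np.linalg.norm(b)
    c3 = np.cross(c1, c2)
    return np.stack([c1, c2, c3], axis=1)
```

Because the decoding step orthonormalizes, any 6D vector a network emits maps to a valid rotation — there are no discontinuities or invalid outputs to handle.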
Delta Actions
Instead of absolute targets, delta actions specify the change from the current state: a_t = Δq (change in joint angles) or a_t = (Δx, Δy, Δz, Δ rotation) (change in Cartesian pose).
- Why deltas are easier to learn: The magnitude of delta actions is small and roughly constant across a trajectory. Networks learn to predict small, bounded values more easily than large absolute coordinates that vary with workspace position.
- Implicit safety: Clipping delta actions to a maximum magnitude bounds the robot's speed — an important safety property.
- Standard use: Diffusion Policy, RT-1, Octo, and most VLA models use Cartesian delta actions as their primary action space.
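The clipping-for-safety idea is simple to implement. The 1 cm-per-step limit below is an assumed value for illustration; real limits depend on the robot and control rate:

```python
import numpy as np

MAX_DELTA_POS = 0.01  # assumed limit: 1 cm per step caps end-effector speed

def apply_delta_action(current_pos, delta):
    """Clip a Cartesian position delta before applying it.

    Clipping bounds the per-step motion regardless of what the
    policy outputs -- the implicit safety property of delta actions.
    """
    delta = np.clip(delta, -MAX_DELTA_POS, MAX_DELTA_POS)
    return current_pos + delta

pos = np.array([0.3, -0.1, 0.15])
# The oversized x-delta (0.5 m) is clipped to 0.01 m before applying.
pos = apply_delta_action(pos, np.array([0.5, 0.0, -0.002]))
```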
Absolute vs. Object-Relative Representations
A fundamental generalization question: should actions be represented in the robot's workspace frame, or relative to the object being manipulated?
- Absolute (workspace frame): Action values depend on where in the workspace the object is. If the table is moved 10 cm, the policy fails. Good for fixed setups; poor for deployment generalization.
- Object-relative: Actions are expressed as offsets from the detected object pose. Policy learns "grasp from 5 cm above the object" rather than "move to (0.3, -0.1, 0.15)". Requires reliable object detection or pose estimation, but generalizes dramatically better to new table heights, positions, and even new environments.
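A minimal sketch of the object-relative idea, positions only (rotation handling omitted for brevity, and the detected object poses below are made-up values):

```python
import numpy as np

def relative_to_workspace(object_pos, offset):
    """Convert an object-relative offset into a workspace-frame target.

    The policy outputs `offset` (e.g. 'grasp from 5 cm above'); the
    detected object pose anchors it in the workspace.
    """
    return object_pos + offset

grasp_offset = np.array([0.0, 0.0, 0.05])  # 5 cm above the object

# The same policy output works wherever the object is detected:
target_a = relative_to_workspace(np.array([0.3, -0.1, 0.15]), grasp_offset)
target_b = relative_to_workspace(np.array([0.5, 0.2, 0.10]), grasp_offset)
```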
Trajectory Length and Padding
Tasks vary in duration: a simple grasp might take 2 seconds (100 steps at 50 Hz); a multi-step assembly might take 30 seconds (1,500 steps). This creates a practical challenge for batch training.
- Fixed-length with padding: Pad shorter episodes to a maximum length with a special "no-op" action token. Use attention masking in Transformer-based policies so the network ignores padding tokens. Simple to implement.
- Variable-length with masking: Process each episode at its natural length. Requires careful batching (group by similar length or use dynamic padding).
- Action chunking: ACT-style: break the trajectory into fixed-length chunks (e.g., 100 steps) and train on chunks independently. Naturally handles variable episode lengths while maintaining fixed-size model inputs.
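The fixed-length-with-padding approach can be sketched as follows: pad each action sequence to a common length with a no-op action and return a boolean mask that the attention layers use to ignore the padding:

```python
import numpy as np

def pad_and_mask(episodes, max_len, noop):
    """Pad variable-length action sequences to max_len with a no-op
    action; return the batch plus a mask marking real (non-pad) steps."""
    batch = np.tile(noop, (len(episodes), max_len, 1))
    mask = np.zeros((len(episodes), max_len), dtype=bool)
    for i, ep in enumerate(episodes):
        T = len(ep)
        batch[i, :T] = ep
        mask[i, :T] = True  # attention attends only where mask is True
    return batch, mask

# Two episodes of 3 and 5 steps, 2-dimensional actions, zero as the no-op.
episodes = [np.ones((3, 2)), np.ones((5, 2))]
batch, mask = pad_and_mask(episodes, max_len=5, noop=np.zeros(2))
```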
Representation Comparison
| Representation | Intuitive | Generalizable | IK Needed | Used In |
|---|---|---|---|---|
| Absolute joint angles | No | Low | No | ALOHA, ACT, RoboAgent |
| Absolute Cartesian pose | Yes | Moderate | Yes | Classic manipulation research |
| Cartesian delta actions | Yes | High | Yes (incremental) | Diffusion Policy, RT-1, Octo |
| Object-relative Cartesian | Yes | Very High | Yes | Generalizable grasping research |
| Waypoint sequences | Yes | High | Yes | Long-horizon task planning |
Trajectory representation is one of the most impactful design decisions in a robot learning system. Spend time on this choice before investing in large-scale data collection. See the SVRC platform for tools that support all of the above representations with built-in format conversion.