Why Coordinate Frames Matter

Every sensor, every robot link, and every object in the environment has its own coordinate system — its own definition of "forward," "up," and "right." When your camera detects an object at position (0.1, -0.05, 0.3) in the camera's frame, the robot cannot act on this directly. It needs the object's position in the robot's frame to command the arm.

Transforming positions and orientations between frames is the job of coordinate transforms. Getting these wrong is the single most common source of hard-to-debug errors in robot systems.

Homogeneous Transforms

A homogeneous transform T ∈ SE(3) is a 4×4 matrix that simultaneously encodes rotation and translation:

T = [[R, t], [0, 0, 0, 1]]

where R is a 3×3 rotation matrix (orthogonal, det(R)=1) and t is a 3×1 translation vector.

The notation T_AB means "the pose of frame B expressed in frame A" — or equivalently, "the transform that converts coordinates in frame B to coordinates in frame A." A point p_B in frame B is transformed to frame A by: p_A = T_AB · p_B (using homogeneous coordinates [x, y, z, 1]ᵀ).
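The assembly of T from R and t, and the point transformation above, can be sketched in a few lines of numpy (frame names and values here are illustrative):

```python
import numpy as np

def make_transform(R, t):
    """Assemble a 4x4 homogeneous transform from a 3x3 rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Example: frame B is rotated 90 degrees about z and offset from frame A.
Rz90 = np.array([[0.0, -1.0, 0.0],
                 [1.0,  0.0, 0.0],
                 [0.0,  0.0, 1.0]])
T_AB = make_transform(Rz90, np.array([1.0, 2.0, 0.0]))

# Transform a point from frame B to frame A using homogeneous coordinates.
p_B = np.array([0.1, -0.05, 0.3, 1.0])  # [x, y, z, 1]
p_A = T_AB @ p_B
print(p_A[:3])  # rotation is applied first, then the translation is added
```

Note that the inverse transform T_BA is not the elementwise inverse: it is [[Rᵀ, -Rᵀt], [0, 0, 0, 1]], which `np.linalg.inv(T_AB)` computes correctly for a valid SE(3) matrix.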

Rotation Representations

Rotation is a 3-DOF quantity, but there are multiple ways to represent it. Choosing the wrong one causes bugs.

  • Rotation matrix (3×3): The most direct representation — the columns are the rotated frame's axes. The orthogonality constraint (9 values for 3 DOF) makes it over-parameterized. Use when you need to compose many transforms — composition is a single matrix multiplication, though very long chains should be re-orthonormalized occasionally to counter floating-point drift.
  • Euler angles (roll, pitch, yaw): Intuitive: "rotate 45° around z-axis." But suffer from gimbal lock — at certain configurations, two axes align and you lose a degree of freedom. Do not use Euler angles in code that runs continuously on hardware. Only use for human-readable logging.
  • Quaternion (qx, qy, qz, qw): 4 values (unit constraint gives 3 DOF). No gimbal lock. Compact. Interpolates smoothly (SLERP). The standard for robot software. Use this in your code. Note the double cover: q and -q represent the same rotation — check for sign flips when comparing. Also beware that libraries disagree on component order (scalar-last vs scalar-first); verify the convention before copying values between them.
  • 6D rotation (two columns of R): Best for neural network outputs — continuous representation without discontinuities. Reconstruct full R via Gram-Schmidt. Used in modern robot learning policies.
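The conversions among these representations can be exercised with scipy; the snippet below also demonstrates the quaternion double cover and the Gram-Schmidt reconstruction for the 6D representation (the variable names are illustrative):

```python
import numpy as np
from scipy.spatial.transform import Rotation

# One rotation, several representations.
r = Rotation.from_euler("xyz", [30.0, 0.0, 45.0], degrees=True)  # roll, pitch, yaw
R_mat = r.as_matrix()    # 3x3 rotation matrix
q = r.as_quat()          # quaternion [qx, qy, qz, qw] (scipy uses scalar-last order)
rotvec = r.as_rotvec()   # axis-angle (rotation vector)

# Double cover: q and -q encode the same rotation.
assert np.allclose(Rotation.from_quat(-q).as_matrix(), R_mat)

# 6D representation: keep the first two columns of R, rebuild via Gram-Schmidt.
sixd = R_mat[:, :2]
a = sixd[:, 0] / np.linalg.norm(sixd[:, 0])
b = sixd[:, 1] - (a @ sixd[:, 1]) * a
b /= np.linalg.norm(b)
c = np.cross(a, b)  # third column from the right-hand rule
R_rebuilt = np.stack([a, b, c], axis=1)
assert np.allclose(R_rebuilt, R_mat)
```

For a matrix that is already orthonormal the Gram-Schmidt step recovers it exactly; its value is that it also maps the unconstrained 6 numbers a network emits onto a valid rotation.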

Common Robot Frames

A typical manipulation setup involves at least four coordinate frames:

  • World frame: The fixed global reference. Usually defined at the robot's mounting point or at a calibration target on the table. All other frames are ultimately expressed relative to this.
  • Robot base frame: Origin at the base of the robot, where it bolts to its mount. The arm's forward kinematics are computed relative to this frame.
  • End-effector / tool frame: Origin at the tip of the gripper (the "tool center point" or TCP). This is what you command when you specify a Cartesian target.
  • Camera frame: Origin at the optical center of the camera. The z-axis points along the camera's optical axis (into the scene). Object detections are initially expressed in this frame.

Transform Chaining

To convert an object pose detected in the camera frame into the world frame, chain the transforms:

T_world_obj = T_world_robot × T_robot_ee × T_ee_camera × T_camera_obj

Read this right to left: T_camera_obj is the detector's output — the object's pose in the camera frame. T_ee_camera then maps camera coordinates into the end-effector frame, T_robot_ee maps those into the robot base frame, and T_world_robot maps base coordinates into the world frame.

Each transform is a 4×4 matrix; the chain is just matrix multiplication. In ROS2, the tf2 library manages all of these transforms automatically and lets you look up any frame-to-frame transform with a single call.
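A minimal numpy sketch of the chain, using placeholder transforms (in a real system T_robot_ee comes from forward kinematics, T_ee_camera from hand-eye calibration, and T_camera_obj from the detector):

```python
import numpy as np

def make_transform(R, t):
    """Assemble a 4x4 homogeneous transform from a rotation and translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

I3 = np.eye(3)
# Placeholder values for illustration only (identity rotations, simple offsets).
T_world_robot = make_transform(I3, [0.0, 0.0, 0.5])   # robot base 0.5 m above world origin
T_robot_ee    = make_transform(I3, [0.3, 0.0, 0.4])   # end-effector pose from FK
T_ee_camera   = make_transform(I3, [0.0, 0.05, 0.1])  # from hand-eye calibration
T_camera_obj  = make_transform(I3, [0.1, -0.05, 0.3]) # object detection in camera frame

# Chaining is plain matrix multiplication, composed left to right.
T_world_obj = T_world_robot @ T_robot_ee @ T_ee_camera @ T_camera_obj
print(T_world_obj[:3, 3])  # object position in the world frame
```

With identity rotations the translations simply accumulate; with real rotations each translation is rotated into the parent frame before being added, which is exactly what the matrix product does for you.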

Hand-Eye Calibration

To use the transform chain above, you need T_ee_camera — the pose of the camera relative to the robot's end-effector. This is determined by hand-eye calibration.

The process: move the robot to many different joint configurations (at least 15–20) while a calibration target (ArUco marker or checkerboard) is visible in the scene. At each configuration, record the robot's FK-computed end-effector pose and the camera's observation of the target. This gives a system of equations AX = XB, where X is the unknown T_ee_camera. Standard solvers (OpenCV's calibrateHandEye) solve this in seconds.
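The AX = XB structure can be verified synthetically without a real robot. The sketch below (pure numpy, illustrative names; it does not call OpenCV) constructs a ground-truth X = T_ee_camera and a fixed target, derives the camera observations from the chain T_base_target = T_base_ee · X · T_cam_target, and checks that the relative motions A and B between two configurations satisfy AX = XB:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_transform(rng):
    """Random rigid transform: QR gives an orthonormal matrix, then fix the sign."""
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1.0  # force a proper rotation, det = +1
    T = np.eye(4)
    T[:3, :3] = Q
    T[:3, 3] = rng.standard_normal(3)
    return T

X = random_transform(rng)               # unknown T_ee_camera (ground truth here)
T_base_target = random_transform(rng)   # the target is fixed in the base frame

# Two robot configurations; each camera observation is implied by the chain
# T_base_target = T_base_ee @ X @ T_cam_target.
T_base_ee = [random_transform(rng) for _ in range(2)]
T_cam_target = [np.linalg.inv(T_be @ X) @ T_base_target for T_be in T_base_ee]

# Relative motion of the end-effector (A) and of the observed target (B).
A = np.linalg.inv(T_base_ee[1]) @ T_base_ee[0]
B = T_cam_target[1] @ np.linalg.inv(T_cam_target[0])
print(np.allclose(A @ X, X @ B))  # True: the hand-eye equation holds
```

In practice you would feed the recorded rotation/translation pairs to a solver such as OpenCV's calibrateHandEye rather than knowing X in advance; each pair of configurations contributes one AX = XB constraint, which is why many diverse poses are needed.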

Re-calibrate whenever: the camera is physically moved or bumped, a new camera is installed, or you observe systematic grasping offset errors.

Practical Python Tools

  • scipy.spatial.transform.Rotation: Convert between rotation matrix, quaternion, Euler angles, and axis-angle. The go-to for offline data processing.
  • ROS2 tf2 (Python rclpy): Real-time transform tree management. Use TransformBroadcaster to publish transforms and TransformListener to look them up.
  • pytransform3d: Pure Python library for all SE(3) operations with excellent visualization for debugging.
  • numpy: Matrix multiplication of 4×4 homogeneous transforms is just T_AB @ T_BC.

For more robotics resources and tutorials, see the SVRC resources section.