Running Inference on the Real Arm
Deployment means running your trained checkpoint in real time, feeding live camera and joint observations into the network and executing the output actions on the physical arm. The inference script handles the observation-action loop at 50Hz.
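The loop described above can be sketched as follows. This is a minimal illustration, not the actual inference script: `policy`, `camera`, and `arm` are hypothetical interfaces standing in for whatever your stack provides, and the only real point is holding the 50 Hz control rate while reading observations and sending actions.

```python
# Sketch of a 50 Hz observation-action loop. `policy`, `camera`, and
# `arm` are placeholder objects -- substitute your own interfaces.
import time

CONTROL_HZ = 50
DT = 1.0 / CONTROL_HZ  # 20 ms control period

def run_episode(policy, camera, arm, max_steps=500):
    for step in range(max_steps):
        t_start = time.monotonic()
        obs = {
            "image": camera.read(),         # latest RGB frame
            "qpos": arm.joint_positions(),  # current joint angles
        }
        action = policy.predict(obs)        # network forward pass
        arm.send_joint_command(action)
        # Sleep for the remainder of the period to hold 50 Hz.
        elapsed = time.monotonic() - t_start
        time.sleep(max(0.0, DT - elapsed))
```

If the forward pass ever takes longer than 20 ms, the loop silently runs slower than 50 Hz; a production script should log or abort on missed deadlines rather than drift.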
For the first deployment run, keep your hand near the physical E-stop. A freshly deployed policy can make unexpected movements whenever its live observations drift outside the training distribution — different lighting, a shifted camera, or an object pose it never saw. Treat the first few episodes as untrusted, and only relax once you have watched the arm behave consistently.
For comprehensive deployment and production guidance including safety envelopes and watchdog timers, see the OpenArm Production Guide.
Evaluation Methodology
Do not evaluate your policy informally. Use a structured protocol — it is the only way to know if a change you make (more data, different checkpoint, different task framing) actually improved performance:
| Protocol Item | Specification |
|---|---|
| Number of episodes per evaluation | 10 minimum, 20 for high-confidence results |
| Object starting position | Fixed. Use tape marks. Same position every episode. |
| Object type | Same object as training. Lighting must match training conditions. |
| What counts as success | Object placed within 3cm of target. Arm returns to home. No human intervention during episode. |
| Failure classification | Log failure type: missed grasp / dropped object / wrong target / timeout. This tells you what to fix. |
| Report metric | Success rate = successful episodes / total episodes. Report with episode count (e.g., "7/10 = 70%"). |
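The protocol table above is easy to mirror in a small logging helper. This is a sketch under the assumption that you record one dict per episode; the function name and dict keys are illustrative, but the failure categories and the reporting format ("7/10 = 70%") follow the table directly.

```python
# Summarize a structured evaluation run: success rate plus a
# failure-type breakdown, matching the protocol table.
from collections import Counter

FAILURE_TYPES = {"missed_grasp", "dropped_object", "wrong_target", "timeout"}

def summarize(episodes):
    """episodes: list of dicts like {"success": bool, "failure": str or None}."""
    n = len(episodes)
    successes = sum(1 for e in episodes if e["success"])
    failures = Counter(
        e["failure"]
        for e in episodes
        if not e["success"] and e["failure"] in FAILURE_TYPES
    )
    return {
        "success_rate": f"{successes}/{n} = {100 * successes / n:.0f}%",
        "failure_breakdown": dict(failures),
    }
```

Logging the failure type per episode is what makes the "Analyze" step of the flywheel possible: the breakdown tells you which failure videos to watch first.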
The Data Flywheel: How to Get Better
A policy that succeeds 7/10 times is a good start — but the path to 9/10 or beyond is through the data flywheel. This is the core loop of robot learning in production:
1. Collect: Record demonstrations, including failure cases your current policy struggles with.
2. Train: Retrain (or fine-tune) on your expanded dataset with the new demonstrations added.
3. Evaluate: Run the structured eval protocol. Did the success rate improve? What failure modes remain?
4. Analyze: Watch the failure videos. Identify the specific state where the policy breaks down. Collect targeted data there, and the loop repeats.
The key insight of the flywheel: targeted data beats random data. Instead of recording 50 more random demonstrations, watch your failure videos and identify the exact moment things go wrong. Record 20 demonstrations that specifically cover that difficult state (e.g., the grasp at the edge of the workspace, or the object at an unusual angle). Your success rate will improve faster with 20 targeted demos than 50 random ones.
Common Failure Modes and How to Fix Them
- Arm overshoots the grasp position: The policy's action chunks are too large or your data had high velocity variance. Record 10 more demos at slow speed near the grasp point, or reduce `chunk_size` from 100 to 50 in the training config.
- Arm succeeds on the training object but fails on slightly different objects: Your training data lacked object position diversity. Record 20 demos with the object at 5 different positions within a 10cm radius. This teaches the policy to generalize.
- Policy freezes or produces repeated motions: The CVAE style variable is collapsing. This often means your dataset has too much variance — the model cannot find a consistent style. Check for mixed demonstrations (different operators, different task framings) and clean your dataset.
Unit 6 Complete When...
Your arm completes the pick-and-place task autonomously 7 out of 10 times in a structured evaluation run. You have watched the 3 failure videos and identified what went wrong. You understand the data flywheel well enough to plan your next improvement iteration. This is the end of the structured path — but it is the beginning of your robot learning practice.
What's Next
You have the foundation. Here is where to go next: