Physical AI in 2026: What It Is, Key Models, and How to Build It

Physical AI is the convergence of foundation models and robotics — AI systems that perceive, reason about, and act in the real world. This guide covers the core concepts, the key models driving the field, and the practical infrastructure you need to build physical AI systems in 2026.

What Is Physical AI?

Physical AI refers to artificial intelligence systems that operate in and interact with the physical world through robotic embodiment. Unlike digital AI — which processes text, images, or code in purely computational environments — physical AI must handle the continuous, noisy, and unforgiving physics of real objects, surfaces, and forces.

The distinction matters because the physical world introduces challenges that digital AI never encounters. A language model can retry a generation in milliseconds. A robot arm that drops a glass cannot undo gravity. The consequences of actions are immediate, irreversible, and governed by physics rather than software abstractions.

The Grounding Problem

Large language models understand the word "grasp" as a statistical pattern in text. They can describe grasping, write code to plan a grasp, and even reason about when grasping is appropriate. What they cannot do is translate the concept of "grasp" into the precise sequence of motor commands that close a robotic hand around an irregularly shaped object with the right amount of force — not so much that the object is crushed, not so little that it slips.

This is the grounding problem: connecting abstract language understanding to concrete physical actions. Physical AI solves it by training on embodied data — recordings of actual robots performing actual tasks in actual environments. The model learns not just what "grasp" means semantically, but what it looks like as a trajectory of joint angles, forces, and visual observations.

Physical AI vs Classical Robotics

Classical robotics solves manipulation through explicit programming: an engineer writes code that specifies exactly how to move the arm, when to open the gripper, and how to handle each object. This works well for fixed tasks in structured environments (car assembly lines, semiconductor fabrication), where the same motion repeats millions of times on identical parts. It fails in environments with variation — different objects, different positions, different lighting — because every variation requires additional code.

Physical AI inverts this approach. Instead of programming the robot with explicit rules, you show the robot what to do through demonstrations, and a neural network learns the mapping from sensor observations to motor actions. The trained policy generalizes to variations it was not explicitly shown, because the neural network learns the underlying structure of the task rather than memorizing specific trajectories. A classical pick-and-place program knows the coordinates of one specific mug. A physical AI policy trained on 200 diverse demonstrations can pick up mugs it has never seen, in positions it has never encountered.
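Behavior cloning, the simplest form of this demonstration-based learning, is just supervised regression from observations to expert actions. The sketch below uses a toy linear "expert" and synthetic demonstrations (all names and numbers are illustrative, not from any real robot stack); the point is that the fitted policy handles an observation it never saw during training:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "demonstrations": observations (e.g., object x/y position) paired with
# the expert's actions. The expert here is a fixed linear map plus noise,
# a stand-in for teleoperated trajectories.
true_W = np.array([[0.8, -0.2],
                   [0.1,  0.9]])
obs = rng.uniform(-1, 1, size=(200, 2))               # 200 demonstrations
actions = obs @ true_W.T + 0.01 * rng.standard_normal((200, 2))

# Behavior cloning = supervised regression from observations to actions.
W_hat, *_ = np.linalg.lstsq(obs, actions, rcond=None)

# The learned policy generalizes to an observation not in the training set.
novel_obs = np.array([0.33, -0.71])
predicted_action = novel_obs @ W_hat
expert_action = novel_obs @ true_W.T
```

Real policies replace the linear map with a deep network and the 2-D toy observations with camera images and joint states, but the training signal, expert state-action pairs, is the same.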

The tradeoff is predictability. A classical program always produces the same output for the same input. A neural network policy may produce slightly different trajectories each time, and debugging failure modes is harder when the controller is a learned model rather than interpretable code. For safety-critical applications, this unpredictability remains a significant deployment barrier.

NVIDIA's "Physical AI" Framing

NVIDIA has been instrumental in popularizing the term "Physical AI" as a market category, positioning it alongside generative AI as the next major platform shift. Their framing centers on three pillars: simulation (Omniverse/Isaac Sim for generating synthetic training data), foundation models (Project GR00T for humanoid robots), and compute infrastructure (Jetson Thor for onboard robot inference). This framing is commercially motivated — NVIDIA sells the GPUs that train these models — but it accurately captures the convergence of large-scale learning, simulation, and embodied robotics that defines the field in 2026.

The NVIDIA framing is useful because it highlights that physical AI is not just a research concept. It is an engineering discipline with specific compute requirements, data pipelines, and deployment constraints that differ fundamentally from training a chatbot or image generator.

Key Physical AI Milestones (2023–2026)

The pace of progress in physical AI has accelerated dramatically since 2023. The following milestones define the trajectory:

RT-2 (Google DeepMind, July 2023)

RT-2 was the first large-scale Vision-Language-Action (VLA) model. Google trained a 55-billion parameter model that could take a camera image and a natural language instruction ("pick up the empty can") and output robot motor commands directly. The breakthrough was demonstrating that the semantic knowledge in large vision-language models — understanding what a "can" is, what "empty" means — could transfer to physical manipulation tasks the robot had never seen during training.

The technical mechanism was surprisingly simple: treat robot actions as text tokens. A 7-DOF joint position becomes a string of discretized numbers that the language model generates exactly as it would generate the next word in a sentence. This reuse of language model infrastructure meant Google could apply all of its existing scaling techniques — massive parallelism, curriculum learning, instruction tuning — to the robot action domain.
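As a rough sketch of that mechanism (the bin count and action range below are common choices, not RT-2's actual tokenizer), discretizing a continuous action into token IDs and back looks like this:

```python
import numpy as np

def actions_to_tokens(action, low=-1.0, high=1.0, n_bins=256):
    """Discretize a continuous action vector into integer token IDs.

    Each dimension is clipped to [low, high] and mapped to one of n_bins
    uniform bins, so a language model can emit actions as ordinary tokens.
    """
    a = np.clip(np.asarray(action, dtype=np.float64), low, high)
    return np.round((a - low) / (high - low) * (n_bins - 1)).astype(int)

def tokens_to_actions(tokens, low=-1.0, high=1.0, n_bins=256):
    """Invert the discretization (error is at most half a bin width)."""
    return low + np.asarray(tokens, dtype=np.float64) / (n_bins - 1) * (high - low)

# A 7-DOF action becomes 7 tokens the model can generate autoregressively.
joint_action = [0.0, 0.5, -0.5, 1.0, -1.0, 0.25, 0.0]
tokens = actions_to_tokens(joint_action)
recovered = tokens_to_actions(tokens)
```

With 256 bins over a [-1, 1] range, the round-trip quantization error per dimension is at most half a bin width, about 0.004, which is well below typical motor noise.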

RT-2's limitation was scale: it required Google's proprietary RT-X robot fleet and massive compute to train. No one outside Google could reproduce or fine-tune it. The 55B parameter count also made real-time inference impractical on any hardware a robotics lab could reasonably afford.

OpenVLA (Stanford/Berkeley, June 2024)

OpenVLA democratized the VLA concept. A 7.5-billion parameter open-source model trained on the Open X-Embodiment dataset, OpenVLA proved that VLA architectures work at a scale that university labs and startups can actually fine-tune. A single A100 GPU could fine-tune OpenVLA for a specific robot and task in under 24 hours using parameter-efficient methods such as LoRA. This transformed VLAs from a Google-exclusive research result into a practical tool for the broader robotics community.

OpenVLA's dual vision encoder design (SigLIP for semantic features plus DinoV2 for spatial features) became the template for subsequent VLA architectures. The SigLIP encoder understands what objects are; the DinoV2 encoder understands where they are and how they are oriented. This complementary pairing addresses one of the weaknesses of single-encoder designs, where spatial precision suffered when the encoder was optimized for semantic classification.

Within six months of release, OpenVLA had been fine-tuned on over a dozen different robot platforms by research groups worldwide. It became the de facto baseline for VLA research, similar to how ResNet became the baseline for computer vision a decade earlier.

π0 (Physical Intelligence, October 2024)

Physical Intelligence's π0 ("pi-zero") introduced flow matching as an action generation mechanism, replacing the discrete token prediction used by RT-2 and OpenVLA. Flow matching produces smooth, continuous action trajectories that are better suited to contact-rich manipulation tasks like folding laundry or assembling components. π0 demonstrated the broadest task generalization of any robot policy to date, performing over 10 distinct manipulation tasks across multiple robot embodiments from a single model.
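The flow-matching idea can be sketched with a toy example. In π0 a transformer predicts the velocity field from observations; here we substitute the known closed-form field that carries any starting point straight to a target action, so the integration loop, the part that actually runs at inference time, is the focus. Everything below is illustrative, not Physical Intelligence's implementation:

```python
import numpy as np

def velocity_field(x, t, target):
    """Stand-in for the learned network: the closed-form velocity that
    carries any point to `target` along a straight line by t = 1.
    (In pi-0, a transformer predicts this velocity from observations.)"""
    return (target - x) / (1.0 - t)

def generate_action(noise, target, n_steps=100):
    """Integrate the velocity field from t=0 (noise) toward t=1 (action)."""
    x = np.asarray(noise, dtype=np.float64)
    dt = (1.0 - 1e-3) / n_steps          # stop just short of t = 1
    t = 0.0
    for _ in range(n_steps):
        x = x + dt * velocity_field(x, t, target)
        t += dt
    return x

rng = np.random.default_rng(0)
target_action = np.array([0.2, -0.4, 0.7])   # hypothetical 3-DOF target
action = generate_action(rng.standard_normal(3), target_action)
```

Because the output is produced by integrating a smooth vector field rather than sampling discrete tokens, the resulting trajectories are continuous by construction, which is the property that matters for contact-rich manipulation.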

The significance of π0 was not just technical but commercial. Physical Intelligence raised $400M+ to build a general-purpose robot foundation model, betting that a single policy could eventually control any robot on any task. The company's recruitment of top researchers from Google DeepMind, UC Berkeley, Stanford, and CMU signaled that the field's best talent viewed general-purpose robot policies as the next major research frontier.

SmolVLA (HuggingFace, March 2025)

SmolVLA proved that VLA models do not need billions of parameters. At 450 million parameters, SmolVLA matches or exceeds OpenVLA performance on standard benchmarks while running on a single consumer GPU. SmolVLA uses a chunked action prediction head (inspired by ACT) rather than single-step token prediction, producing action sequences that are smoother and more efficient. The model is fully open-source with training code, weights, and evaluation scripts available on HuggingFace.

SmolVLA's efficiency breakthrough came from two design choices: using a compact vision-language backbone (SmolVLM, 500M parameters) instead of a 7B LLM, and predicting action chunks (50 timesteps at once) instead of single actions. The chunked prediction means SmolVLA calls the neural network once every 50 control steps instead of every step, reducing compute by 50x at inference time. This makes SmolVLA fast enough for reactive manipulation on consumer hardware — 15–30 Hz on an RTX 4090.
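The chunking arithmetic is easy to verify with a stub policy (the class below is a hypothetical stand-in, not SmolVLA's real API): a 500-step episode with 50-step chunks needs only 10 forward passes.

```python
class ChunkedPolicy:
    """Hypothetical stand-in for a chunked VLA: one forward pass predicts
    a whole chunk of future actions."""

    def __init__(self, chunk_size=50, action_dim=7):
        self.chunk_size = chunk_size
        self.action_dim = action_dim
        self.forward_calls = 0

    def predict_chunk(self, observation):
        self.forward_calls += 1
        # A real model runs the vision-language backbone here; we return zeros.
        return [[0.0] * self.action_dim for _ in range(self.chunk_size)]

def run_episode(policy, n_control_steps=500):
    """Control loop that refills the action buffer only when it runs dry."""
    buffer, executed = [], 0
    for step in range(n_control_steps):
        if not buffer:
            observation = f"camera_frame_{step}"  # placeholder observation
            buffer = policy.predict_chunk(observation)
        buffer.pop(0)        # send this action to the robot here
        executed += 1
    return executed

policy = ChunkedPolicy(chunk_size=50)
steps = run_episode(policy, n_control_steps=500)
```

The tradeoff is reactivity: within a chunk the policy executes open-loop, so chunk size is chosen to balance compute savings against how quickly the robot can respond to disturbances.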

For the physical AI research community, SmolVLA's significance is accessibility. A graduate student with a single consumer GPU can now fine-tune a VLA model on their robot in a few hours. This was impossible 18 months earlier.

Helix (Figure AI, Early 2026)

Figure AI's Helix is the first VLA designed specifically for humanoid robots. While previous VLAs were tested primarily on tabletop manipulation arms, Helix integrates whole-body control — coordinating locomotion, arm manipulation, and hand dexterity in a single model. Details remain limited (Helix is not open-source), but Figure's BMW deployment demonstrates multi-step manufacturing tasks that require walking to a station, picking parts, and performing assembly operations.

Helix's whole-body approach addresses a fundamental limitation of earlier VLAs: they controlled arms in isolation, assuming a stationary base. For humanoid robots, locomotion and manipulation are coupled — the robot must balance while reaching, step forward to extend its workspace, and coordinate two arms simultaneously. Helix reportedly uses a hierarchical architecture with separate policies for locomotion and manipulation that share a common vision-language backbone, allowing the model to decide when to walk, when to reach, and when to do both.

The Broader Milestone Pattern

Looking at these milestones collectively, a clear pattern emerges: the field is progressing along two axes simultaneously. The first axis is capability — from single-task policies (2022) to multi-task VLAs (2023) to cross-embodiment generalist policies (2024–2025) to whole-body humanoid control (2026). The second axis is accessibility — from closed-source 55B models requiring Google-scale compute (RT-2) to open-source 450M models that fine-tune on a consumer GPU (SmolVLA). Both axes are advancing roughly in parallel, which means the state of the art is becoming simultaneously more powerful and more accessible.

| Year | Milestone | Parameters | Open Source | Min Fine-Tune Hardware |
|------|-----------|------------|-------------|------------------------|
| 2023 | RT-2 | 55B | No | N/A (closed) |
| 2024 | OpenVLA | 7.5B | Yes | 2x A100 |
| 2024 | Octo | 93M | Yes | 1x RTX 3090 |
| 2024 | π0 | ~3B | No | N/A (closed) |
| 2025 | SmolVLA | 450M | Yes | 1x RTX 4090 |
| 2026 | Helix | Undisclosed | No | N/A (closed) |

The Data Problem at the Heart of Physical AI

Every physical AI model, regardless of architecture, is limited by its training data. This is the fundamental bottleneck of the field, and understanding it is essential for anyone building physical AI systems.

Why Physical AI Needs More Data Than LLMs

Language models train on the internet — trillions of tokens of text generated by billions of humans over decades. Physical AI has no equivalent data source. Every robot demonstration must be collected by a human operator physically controlling a robot through a task. The entire Open X-Embodiment dataset (the largest public robot dataset) contains approximately 2 million trajectories. GPT-4 trained on trillions of tokens. The data gap is roughly 1,000x to 1,000,000x depending on how you count.

This gap explains why physical AI generalization lags behind language AI generalization. An LLM can write a poem about a topic it has never explicitly seen because it has seen millions of related texts. A robot policy cannot fold a towel it has never seen because it has only encountered a few hundred towels during training.

Quality vs. Quantity: Expert Teleoperation Data

Not all robot data is equal. Early attempts to scale robot datasets focused on quantity — collecting thousands of demonstrations quickly using low-cost methods (video scraping, crowd-sourced teleoperation). The results were disappointing. Policies trained on noisy, inconsistent data learned noisy, inconsistent behaviors.

The field has converged on a quality-first approach: fewer demonstrations collected by expert operators with high-quality hardware produce better policies than larger datasets of mediocre quality. A dataset of 200 expert demonstrations with consistent grasp strategies, smooth trajectories, and near-100% task success can outperform a dataset of 2,000 demonstrations with varied quality. This is why SVRC's data collection services emphasize trained operators and standardized collection protocols.

Cross-Embodiment Transfer: Can One Dataset Train Any Robot?

The Open X-Embodiment project (Google DeepMind, 2023) demonstrated that training on data from multiple robot types improves generalization compared to training on data from a single robot. The resulting RT-2-X model, trained on data from 22 different robot types, performed better on each individual robot than models trained on that robot's data alone.

However, cross-embodiment transfer has limits. The morphological gap between a 6-DOF arm and a 30-DOF humanoid is vast. Current models transfer semantic understanding (what to grasp, where to place) but struggle to transfer motor strategies (how to move joints) across very different robot bodies. The most practical approach in 2026 is to pre-train on cross-embodiment data for semantic grounding, then fine-tune on target-robot data for motor control.

The data format standardization effort has been critical to making cross-embodiment transfer practical. The RLDS format (used by Open X-Embodiment) and the LeRobot format (used by HuggingFace) provide common schemas that normalize action spaces across different robots. Without this standardization, every dataset would require custom preprocessing to be compatible with VLA training — a friction that historically prevented the field from pooling data across labs and robot platforms.

For teams collecting new data, the practical recommendation is to record in a standardized format from the start. SVRC's data pipeline outputs both RLDS and LeRobot formats, ensuring that your demonstrations contribute to the growing ecosystem of cross-embodiment training data and remain compatible with future models.

SVRC's Role: Structured Expert Demonstrations at Scale

SVRC addresses the data bottleneck directly. Our Mountain View and Allston facilities maintain multiple robot platforms (OpenArm 101, DK1, Unitree G1) with standardized data collection infrastructure. Our trained operators produce demonstrations in HDF5 and RLDS formats compatible with LeRobot, ACT, and Diffusion Policy training pipelines. A typical data collection campaign produces 100–500 expert demonstrations per task at a cost of $2,500 for a pilot (100 demos) or $8,000 for a full campaign (500 demos).

Hardware Requirements for Physical AI Research

Building physical AI systems requires specific hardware at three levels: compute for training, robot platforms for data collection, and sensors for perception.

Compute: What You Actually Need

| Task | Minimum Hardware | Recommended | Cloud Cost (est.) |
|------|------------------|-------------|-------------------|
| Fine-tune SmolVLA (450M) | 1x RTX 4090 (24 GB) | 1x A100 (80 GB) | $2–4/hr |
| Fine-tune OpenVLA (7.5B) | 2x A100 (80 GB) | 4x A100 (80 GB) | $8–16/hr |
| Train ACT policy from scratch | 1x RTX 3090 (24 GB) | 1x A100 (40 GB) | $1–2/hr |
| Train Diffusion Policy | 1x A100 (40 GB) | 2x A100 (80 GB) | $4–8/hr |
| Full VLA pre-training (7B+) | 8x H100 (80 GB) | 32x H100 | $200–800/hr |
| Real-time VLA inference | 1x RTX 4070 (12 GB) | NVIDIA Jetson Orin | N/A (onboard) |

The key insight: fine-tuning existing VLA models is accessible (a single A100 for a day), but pre-training new foundation models from scratch requires resources only large labs can afford. This is why open-source models like OpenVLA and SmolVLA are so important — they give smaller teams a starting point that bypasses the most expensive training phase.

Robot Arms: DOF Requirements and Encoder Resolution

Physical AI research requires robot arms with sufficient degrees of freedom (DOF) and encoder resolution to perform the manipulation tasks you want to learn. The minimum practical configuration for manipulation research is 6 DOF (3 for position, 3 for orientation) plus a gripper. For dexterous tasks, 7+ DOF is preferred to allow null-space motion and redundancy.

| Robot Arm | DOF | Payload | Price | Best For |
|-----------|-----|---------|-------|----------|
| OpenArm 101 | 6+1 (gripper) | 1.5 kg | $4,500 | Tabletop manipulation, education, data collection |
| DK1 (bimanual) | 2x 6+1 | 1.5 kg per arm | $12,000 | Bimanual tasks, ALOHA-style data collection |
| Franka Emika Panda | 7+1 | 3 kg | ~$30,000 | Research benchmark standard, torque sensing |
| ViperX 300 | 6+1 | 0.75 kg | $6,500 | Budget research arm, ALOHA-2 platform |
| Kinova Gen3 | 7+1 | 4 kg | ~$35,000 | Assistive robotics, mobile manipulation |

Encoder resolution matters because imitation learning policies are sensitive to the precision of recorded joint positions. Absolute encoders with ≥14-bit resolution (≈0.022° per count) are recommended. OpenArm 101 uses 14-bit magnetic encoders that meet this threshold at its price point.

Repeatability is equally important. If the robot cannot return to a recorded position with sub-millimeter accuracy, the correspondence between recorded demonstrations and policy-commanded actions breaks down. For tabletop manipulation research, ≤0.5 mm repeatability is sufficient. For precision assembly tasks (inserting USB connectors, threading screws), ≤0.1 mm is required, which limits practical options to the Franka Panda class of research arms.
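The resolution numbers above follow from simple arithmetic, sketched here (the 0.5 m reach is an assumed example, not a spec of any particular arm):

```python
import math

def encoder_resolution_deg(bits):
    """Smallest resolvable joint angle for an absolute encoder with `bits` bits."""
    return 360.0 / 2 ** bits

res_14 = encoder_resolution_deg(14)   # ~0.022 deg per count
res_12 = encoder_resolution_deg(12)   # ~0.088 deg, 4x coarser

# At an assumed 0.5 m reach, one 14-bit count moves the end effector by
# about 0.19 mm of arc: inside the 0.5 mm repeatability target for
# tabletop work, but marginal for 0.1 mm precision-assembly tasks.
tip_motion_mm = 0.5 * math.radians(res_14) * 1000
```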

Sensors: The Perception Stack

A complete physical AI data collection setup requires multiple sensor modalities:

  • RGB cameras (minimum 3): One wrist-mounted, two third-person views at different angles. 720p minimum, 30 fps. Intel RealSense D435 or similar. Budget: $200–400 per camera.
  • Depth sensing: At least one RGBD camera for point cloud generation. Useful for grasp planning and scene reconstruction. Often combined with one of the RGB cameras (RealSense provides both).
  • Tactile sensing: For contact-rich tasks (insertion, assembly, deformable objects), tactile feedback significantly improves policy performance. Paxini Gen3 tactile gloves provide 22+ DOF with force feedback for dexterous data collection.
  • Proprioception: Joint position, velocity, and torque readings from the robot's encoders. This is typically provided by the robot's control API at 100–1000 Hz.
  • Force/torque sensing: A 6-axis F/T sensor at the wrist provides critical data for contact tasks. ATI Nano25 or OnRobot HEX are common choices ($3,000–$8,000).

Data Collection Infrastructure Cost Breakdown

| Component | Budget Option | Recommended Option |
|-----------|---------------|--------------------|
| Robot arm | OpenArm 101 ($4,500) | DK1 bimanual ($12,000) |
| Cameras (3x) | Logitech C920 ($60 ea.) | Intel RealSense D435 ($350 ea.) |
| Compute (recording) | Laptop with USB3 ($0, existing) | Dedicated workstation ($2,000) |
| Teleoperation input | SpaceMouse Compact ($200) | Paxini Gen3 gloves (contact SVRC) |
| F/T sensor | None | OnRobot HEX ($4,000) |
| Mounting / workspace | Table + clamps ($200) | T-slot frame ($800) |
| Total | ~$5,200 | ~$20,000 |

For teams that need data but do not want to build and maintain collection infrastructure, SVRC's data collection services provide a turnkey alternative starting at $2,500 for a pilot campaign.

Physical AI Research Workflows at SVRC

A typical physical AI research project at SVRC follows five steps, from task definition through iterative evaluation. The same workflow applies whether you are training a task-specific ACT policy or fine-tuning a large VLA model.

Step 1: Define the Task and Collect Expert Demonstrations

The researcher defines a manipulation task (e.g., "pick up a mug from a table and place it on a shelf"). An SVRC operator or the researcher themselves uses the OpenArm 101 or DK1 with a SpaceMouse or Paxini Gen3 gloves to teleoperate the robot through the task. Each demonstration is recorded as a synchronized trajectory containing joint positions, camera images, and optionally force/torque data.

Typical collection rate: 20–40 demonstrations per hour for simple pick-and-place tasks, 10–20 per hour for complex multi-step tasks.

Step 2: Format Data to RLDS/LeRobot-Compatible Datasets

Raw demonstration data is converted into standardized formats. SVRC's data pipeline outputs both HDF5 (for RoboMimic/ACT) and RLDS/LeRobot format (for OpenVLA, SmolVLA, and Octo). The conversion process normalizes action spaces, aligns camera timestamps, and generates metadata files required by each training framework.

Data preprocessing includes several critical steps that significantly impact downstream policy performance: action space normalization (scaling all action dimensions to [-1, 1]), image resizing and augmentation, outlier filtering (removing demonstrations with unusually long durations or abnormal trajectories), and temporal alignment (ensuring all sensor streams are synchronized to a common clock). Skipping or poorly executing these steps is one of the most common sources of training failures in physical AI research.
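Two of these steps, action normalization and outlier filtering, can be sketched in a few lines (the joint limits and thresholds below are illustrative; real pipelines derive them from the dataset and robot spec):

```python
import numpy as np

def normalize_actions(actions, low, high):
    """Scale each action dimension from its recorded range [low, high] to [-1, 1]."""
    low, high = np.asarray(low, dtype=float), np.asarray(high, dtype=float)
    return 2.0 * (np.asarray(actions, dtype=float) - low) / (high - low) - 1.0

def filter_outlier_episodes(episode_lengths, max_ratio=2.0):
    """Keep demonstrations no longer than max_ratio times the median duration."""
    lengths = list(episode_lengths)
    median = sorted(lengths)[len(lengths) // 2]
    return [i for i, n in enumerate(lengths) if n <= max_ratio * median]

# Hypothetical joint limits for a 7-DOF arm (radians).
low = np.full(7, -3.14)
high = np.full(7, 3.14)
raw = np.zeros((10, 7))                    # 10 timesteps at the zero pose
norm = normalize_actions(raw, low, high)   # maps to the center of [-1, 1]

# Episode 4 (2000 steps) is far longer than the others and gets dropped.
keep = filter_outlier_episodes([300, 310, 295, 305, 2000, 298])
```

A median-ratio filter is used here rather than a z-score because with only a handful of episodes a single extreme outlier inflates the standard deviation enough to hide itself.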

Step 3: Train a Policy

Choose a training approach based on your goals:

  • ACT (Action Chunking with Transformers): Best for learning from small datasets (50–100 demos). Trains in 2–8 hours on a single GPU. Good for specific tasks with limited variation. See our ACT guide.
  • Diffusion Policy: Better than ACT for multi-modal action distributions (tasks with multiple valid strategies). Requires slightly more data (100–200 demos) and compute.
  • VLA fine-tuning (OpenVLA or SmolVLA): Best for language-conditioned tasks or when you want zero-shot generalization to variations. Requires 200+ demos and at least one A100 GPU. See our VLA model comparison.

Step 4: Evaluate on Robot

Deploy the trained policy on the real robot and run evaluation trials. A standard evaluation protocol runs 20+ trials with randomized initial conditions (object positions, orientations). Success rate, completion time, and failure modes are logged. Typical first-iteration success rates: 40–70%. After dataset refinement and retraining, 80–95% success rates are achievable for well-defined tasks.
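A minimal version of this protocol, with a dummy rollout function standing in for the real robot (all names, workspace bounds, and the failure pattern are illustrative):

```python
import random

def evaluate_policy(run_trial, n_trials=20, seed=0):
    """Run repeated trials with randomized initial conditions and log outcomes.

    `run_trial(x, y, theta)` stands in for one real-robot rollout and
    returns (success: bool, completion_time_s: float).
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        # Randomize the object pose inside the workspace (units: m, rad).
        x = rng.uniform(0.2, 0.6)
        y = rng.uniform(-0.3, 0.3)
        theta = rng.uniform(-3.14, 3.14)
        results.append(run_trial(x, y, theta))
    successes = [r for r in results if r[0]]
    return {
        "success_rate": len(successes) / n_trials,
        "mean_time_s": (sum(t for _, t in successes) / len(successes))
                       if successes else float("nan"),
        "n_trials": n_trials,
    }

def dummy_trial(x, y, theta):
    # "Fails" whenever the object starts too far left: the kind of
    # systematic failure mode that targeted recollection (Step 5) fixes.
    return (y > -0.2, 8.0 + x)

report = evaluate_policy(dummy_trial, n_trials=20)
```

Logging the randomized initial conditions alongside each success/failure is what makes the failure analysis in Step 5 possible.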

Step 5: Iterate on Failure Cases

The evaluation phase reveals systematic failure modes: the robot drops objects in a specific orientation, fails when the object is near the workspace boundary, or produces a collision trajectory in certain configurations. The most effective improvement strategy is targeted data collection — recording additional demonstrations specifically for the failure cases. If the policy fails on objects placed on the left side of the table, collect 30–50 more demonstrations with left-side placements. This targeted approach is significantly more efficient than collecting more data uniformly across the task distribution.

This iterate-collect-retrain loop typically converges within 2–3 cycles for well-defined tasks, with each cycle improving success rate by 10–20 percentage points. SVRC's data collection services support iterative campaigns where follow-up data collection is scoped based on evaluation results from the previous round.

Physical AI Applications in 2026

Physical AI is not a future technology — it is being deployed today in specific domains where the data collection and evaluation infrastructure is mature enough to support reliable operation.

Manufacturing and Assembly

Manufacturing is the first domain where physical AI has reached production deployment. The key advantage: manufacturing tasks are repetitive, which means a moderate dataset (200–500 demonstrations) can capture the task distribution comprehensively. Figure AI's deployment at BMW and similar deployments by several Chinese automotive manufacturers using VLA-based assembly policies represent the leading edge. The tasks are narrow — inserting specific components, routing cables along predefined paths — but they demonstrate that learned policies can match the reliability required for production environments.

Logistics and Warehousing

Picking diverse objects from bins and shelves is a natural fit for physical AI because the object variety makes hand-programmed solutions impractical. Companies like Covariant (now part of Amazon Robotics) and Dexterity have deployed VLA-like models for warehouse picking that handle thousands of distinct SKUs. The challenge is reliability at scale: a 95% pick success rate means 1 failure per 20 picks, which in a warehouse processing 100,000 picks per day translates to 5,000 failures requiring human intervention.

Laboratory Automation

Research labs in biology, chemistry, and materials science are adopting physical AI for repetitive experimental procedures — pipetting, plate handling, sample preparation. The high cost of skilled lab technicians and the need for 24/7 operation make the economics compelling even at current capability levels. SVRC has worked with several academic and pharmaceutical labs on data collection campaigns for lab automation tasks.

The key advantage of physical AI in laboratory settings is adaptability. Traditional laboratory automation systems (like liquid handling robots) are programmed for specific protocols on specific labware. Physical AI-based systems can be retrained for new protocols by collecting 50–100 demonstrations of the new procedure, without modifying any hardware or writing new control code. For labs that run dozens of different experimental protocols, this flexibility reduces the automation barrier from months of system integration to days of demonstration collection.

Food Service and Agriculture

Food handling presents unique challenges for physical AI: deformable objects (lettuce, bread), variable textures, and hygiene requirements. Early deployments focus on structured tasks like sandwich assembly or fruit sorting where the object variation is bounded. Agricultural robotics (fruit picking, plant inspection) benefits from physical AI's ability to handle natural variation in plant morphology and lighting conditions.

Tactile sensing is particularly important in food handling, where visual appearance does not reliably indicate ripeness, firmness, or fragility. Physical AI systems that combine visual and tactile inputs — using sensors like Paxini Gen3 tactile gloves for data collection — can learn grasp force modulation that adapts to each object's physical properties. This multi-modal approach is one of the most active areas of physical AI research and has direct applications in food processing, agriculture, and any domain involving soft or deformable objects.

Where Physical AI Is Going

World Models: Learning Physics, Not Just Actions

The next frontier in physical AI is world models — neural networks that predict the physical consequences of actions before executing them. Instead of mapping directly from observation to action (as current VLAs do), a world model predicts "if I push this object with force F, it will slide distance D and rotate angle R." This internal physics simulation enables planning: the robot can evaluate thousands of possible actions mentally before committing to one.
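A toy version of world-model planning shows the structure: sample many candidate actions, score each by the model's predicted outcome, execute the best. The linear "physics" and all values below are purely illustrative; real world models predict video or full scene state from far richer inputs.

```python
import numpy as np

def toy_world_model(state, action):
    """Stand-in for a learned dynamics model: predicts the object's next
    position given a push action. The linear dynamics are illustrative."""
    return state + 0.1 * action

def plan_with_world_model(state, goal, n_candidates=1000, seed=0):
    """Random shooting: imagine many candidate actions, execute the best."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-1, 1, size=(n_candidates, state.shape[0]))
    predicted = toy_world_model(state, candidates)      # batched "imagination"
    costs = np.linalg.norm(predicted - goal, axis=1)    # distance to goal
    return candidates[np.argmin(costs)]

state = np.array([0.0, 0.0])
goal = np.array([0.05, -0.03])                          # desired object pose
best_action = plan_with_world_model(state, goal)
next_state = toy_world_model(state, best_action)        # what actually happens
```

The planner never touches the (simulated) world until the final step; all 1,000 candidates are evaluated inside the model, which is exactly the appeal of world models for robots that cannot afford trial-and-error in reality.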

Google's Genie 2, NVIDIA's Cosmos, and several academic projects (UniSim, DayDreamer) are pursuing this direction. World models trained on large video datasets show promising results in predicting object dynamics, but generating accurate contact predictions remains unsolved. The gap between predicted and actual outcomes narrows with more data, which again reinforces the importance of large-scale physical interaction datasets.

Sim-to-Real: Closing the Gap

Simulation has always been attractive for physical AI because it offers unlimited free data. The problem is the sim-to-real gap: policies trained in simulation often fail when deployed on real robots because the simulation does not perfectly capture friction, deformation, lighting, and sensor noise. Recent advances in domain randomization (randomizing simulation parameters to cover real-world variation) and photorealistic rendering (using neural radiance fields to generate training images indistinguishable from real cameras) are closing this gap.

The practical approach in 2026 is hybrid training: pre-train on simulation data for basic sensorimotor skills, then fine-tune on a small amount of real-world data (50–100 demonstrations) for the target task. This reduces real-world data requirements by 5–10x while maintaining deployment performance.
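Domain randomization itself is conceptually simple: sample a fresh set of physics and rendering parameters for every simulated episode so the real world looks like just another sample. A sketch (parameter names and ranges are illustrative, not tied to any particular simulator):

```python
import random

def sample_sim_params(rng):
    """One randomized simulation configuration. In practice the ranges are
    tuned to bracket the measured properties of the real robot and workspace."""
    return {
        "friction_coeff":   rng.uniform(0.4, 1.2),   # table/object friction
        "object_mass_kg":   rng.uniform(0.05, 0.5),
        "motor_latency_s":  rng.uniform(0.00, 0.05),
        "light_intensity":  rng.uniform(0.5, 1.5),   # relative to nominal
        "camera_jitter_px": rng.uniform(0.0, 2.0),
    }

rng = random.Random(7)
configs = [sample_sim_params(rng) for _ in range(10_000)]  # one per episode
```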

NVIDIA's Isaac Sim and Isaac Lab have become the standard simulation platforms for physical AI research, offering GPU-accelerated physics, photorealistic rendering, and built-in integration with popular training frameworks. MuJoCo (acquired by Google DeepMind, now open-source) remains the preferred simulator for contact-rich manipulation research due to its superior contact physics accuracy. The choice between simulators depends on your task: Isaac Sim for visual fidelity and large-scale parallelism, MuJoCo for contact physics accuracy.

For teams considering sim-to-real approaches, the critical investment is in building an accurate simulation model of your specific robot and task environment. Generic simulation environments transfer poorly to specific real-world setups. The cost of building a high-fidelity simulation (CAD modeling, physics parameter identification, visual domain randomization) typically runs $10,000–$50,000 in engineering time, which is why direct real-world data collection (starting at $2,500 through SVRC) is often more cost-effective for small-to-medium scale projects.

Embodied AGI: The Long Horizon

Physical AI is often cited as a prerequisite for artificial general intelligence (AGI). The argument is that an intelligence that cannot interact with the physical world — cannot experiment, build, test, and iterate in reality — is fundamentally limited in its understanding. Whether or not you subscribe to this view, the practical trajectory is clear: physical AI systems are becoming more general, more capable, and more autonomous year over year. The gap between "can fold one specific towel" and "can fold any towel" is narrowing. The gap between "can fold a towel" and "can cook dinner" remains large but is no longer obviously unbridgeable.

The key open questions for embodied AGI include: long-horizon planning (chaining dozens of subtasks without error accumulation), common-sense physical reasoning (knowing that a cup of water will spill if tilted, even if the policy has never seen a cup), and safe exploration (learning new behaviors without breaking things or hurting people). Current physical AI systems address none of these comprehensively, but each is an active area of research with measurable year-over-year progress.

The Scaling Question

The dominant question in physical AI research is whether scaling laws — the empirical observation that larger models trained on more data consistently improve — apply to robotics as they do to language and vision. Early evidence from RT-2 and π0 suggests yes: larger models with more diverse training data do generalize better to new tasks and environments. But the data collection bottleneck means that scaling robot datasets by 10x costs millions of dollars, not thousands. Whether physical AI follows a smooth scaling curve or hits a plateau at current data scales is the billion-dollar question that will determine investment returns for companies like Physical Intelligence and Figure AI.

For practitioners building physical AI systems today, the implication is pragmatic: invest in high-quality data collection infrastructure now, because data is the limiting factor regardless of which model architecture prevails. The models will improve; your ability to collect structured demonstrations efficiently is the durable competitive advantage.

Getting Started: Your First Physical AI Project

For teams new to physical AI, the most effective starting point is a single-task imitation learning project. This scopes the problem to something achievable in weeks rather than months, while teaching the core skills (data collection, training, evaluation) that apply to every physical AI project.

Recommended First Project

  1. Choose a simple manipulation task: Pick-and-place with a single object type (e.g., "pick up a mug and place it on a coaster"). Avoid multi-step tasks, deformable objects, and precision insertion for your first project.
  2. Collect 100 demonstrations: Use a SpaceMouse or leader-follower arm. This takes 3–5 hours for a simple task. Record at least 2 camera views plus joint positions.
  3. Train an ACT policy: ACT is the most forgiving algorithm for small datasets, and it is trained from scratch on your demonstrations rather than fine-tuned. Training takes 2–8 hours on a single GPU. Use the LeRobot framework for the simplest setup experience.
  4. Evaluate with 20 trials: Randomize the mug position and orientation. Record success/failure for each trial. A first-attempt success rate of 50–70% is normal and encouraging.
  5. Iterate: Collect 30–50 more demonstrations targeting failure cases. Retrain. Expect 75–90% success after one iteration.
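Steps 4 and 5 above reduce to simple bookkeeping: randomize the scene, record pass/fail, and keep the failure poses so the next round of demonstrations targets them. A minimal sketch in plain Python, where `run_trial` is a hypothetical placeholder for your actual policy rollout (reset the scene to a pose, run the policy, report success):

```python
import random

def evaluate(run_trial, n_trials=20, seed=0):
    """Run n_trials randomized rollouts and summarize the results.

    run_trial(pose) -> bool is a stand-in for your real rollout loop.
    """
    rng = random.Random(seed)
    results = []
    for i in range(n_trials):
        # Randomize mug position (x, y in metres) and yaw (radians)
        pose = (rng.uniform(-0.1, 0.1),
                rng.uniform(-0.1, 0.1),
                rng.uniform(-3.14, 3.14))
        results.append({"trial": i, "pose": pose, "success": run_trial(pose)})
    rate = sum(r["success"] for r in results) / n_trials
    # Failure poses tell you where to collect the next 30-50 demos
    failures = [r["pose"] for r in results if not r["success"]]
    return rate, failures
```

Twenty trials is a coarse estimate (each trial moves the success-rate resolution by 5 percentage points), which is why the 50–70% band in step 4 is quoted so loosely.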

Example: Minimal Physical AI Training Script

# Train an ACT policy on your demonstrations using LeRobot
# Assumes data is already collected and formatted as a LeRobot dataset
# (exact import paths and config fields vary between LeRobot releases;
# check the version you have installed)

# pip install lerobot
from lerobot.common.policies.act.configuration_act import ACTConfig

# Configure ACT for a 7-DoF arm with two cameras
config = ACTConfig(
    chunk_size=50,           # Predict 50 future actions per inference
    n_obs_steps=1,           # Condition on 1 observation frame
    input_shapes={
        "observation.images.top": [3, 480, 640],
        "observation.images.wrist": [3, 480, 640],
        "observation.state": [7],       # joint positions
    },
    output_shapes={"action": [7]},      # joint position targets
)

# Launch training from the CLI, which builds this config from flags:
# python lerobot/scripts/train.py \
#     --policy.type=act \
#     --dataset.repo_id=your_org/mug_pick_place \
#     --training.num_epochs=2000 \
#     --training.batch_size=8
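The chunk_size=50 setting above is worth unpacking. At deployment, the policy predicts a chunk of 50 future actions from one observation, the robot executes some prefix of that chunk, then re-predicts from a fresh observation. A minimal sketch of that control loop with a stubbed policy (`predict_chunk`, `get_observation`, and `send_action` are hypothetical placeholders, not LeRobot APIs; real ACT deployments also smooth overlapping chunks with temporal ensembling, which this sketch omits):

```python
def run_chunked_control(predict_chunk, get_observation, send_action,
                        n_steps=200, chunk_size=50, execute_horizon=25):
    """Chunked execution with periodic re-prediction.

    predict_chunk(obs) -> list of chunk_size actions (stub for the policy).
    Executing only the first execute_horizon actions before re-predicting
    trades open-loop smoothness for closed-loop reactivity.
    """
    executed = []
    step = 0
    while step < n_steps:
        obs = get_observation()
        chunk = predict_chunk(obs)            # e.g. 50 future actions
        for action in chunk[:execute_horizon]:
            send_action(action)
            executed.append(action)
            step += 1
            if step >= n_steps:
                break
    return executed

# Stub demo: a "policy" that counts up from the current observation
log = []
actions = run_chunked_control(
    predict_chunk=lambda obs: [obs + i for i in range(50)],
    get_observation=lambda: len(log),
    send_action=log.append,
)
```

Shorter execution horizons react faster to disturbances but discard more of each prediction; the right trade-off depends on how quickly your scene changes.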

Cost and Timeline for a First Project

| Component | DIY | With SVRC Services |
|---|---|---|
| Robot hardware | OpenArm 101: $4,500 (purchase) or $800/mo (lease) | Included in data service |
| Data collection (100 demos) | 3–5 hours operator time | $2,500 (pilot campaign) |
| Training compute | ~$10 cloud GPU or own hardware | ~$10 cloud GPU |
| Timeline | 2–4 weeks (including setup) | 1–2 weeks |
| Total cost | $4,500–5,500 | $2,500–3,300 |
Getting started with Physical AI at SVRC: Whether you need robot hardware, expert demonstration data, or guidance on training pipelines, SVRC provides end-to-end support for physical AI research. Start with a $2,500 data collection pilot or lease a robot platform from $800/month. Contact our solutions team to scope your project.

Key Takeaways

  • Physical AI is real and deployable today for specific, well-scoped manipulation tasks in structured environments. It is not yet a general-purpose replacement for human workers.
  • Data is the bottleneck, not models. Open-source VLA models (OpenVLA, SmolVLA) are good enough for most applications. The limiting factor is always the quantity and quality of task-specific demonstration data.
  • Start small. A single-task project with 100 demonstrations, one robot arm, and an ACT policy is the best way to learn physical AI workflows and assess whether the technology fits your use case.
  • Quality over quantity for data. Expert demonstrations from trained operators consistently outperform larger datasets of variable quality. Invest in collection infrastructure or use professional services.
  • The field is moving fast. Models that were state-of-the-art 12 months ago (OpenVLA) are now outperformed by models 16x smaller (SmolVLA). Standardized data formats ensure your demonstrations remain valuable as models improve.
  • Hardware costs have dropped dramatically. A complete physical AI research setup (robot + cameras + compute) costs $5,000–$20,000, down from $100,000+ three years ago.
  • Cross-embodiment is the future. Collecting data in standardized formats (RLDS, LeRobot) contributes to a growing shared data ecosystem that benefits all participants.
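The standardized formats mentioned in the takeaways share a common shape: an episode is a sequence of timesteps, each pairing observations (images, proprioceptive state) with the action taken. A schematic of that layout in plain NumPy; this is not the actual RLDS or LeRobot API, and the field names are illustrative:

```python
import numpy as np

def make_episode(n_steps=100, state_dim=7, img_hw=(480, 640)):
    """Build one demonstration episode in an RLDS-like nested layout.

    Real formats also store per-camera streams, language instructions,
    and embodiment metadata; this sketch keeps only the core
    observation/action pairing that every format shares.
    """
    h, w = img_hw
    episode = {
        "observation": {
            "images_top": np.zeros((n_steps, h, w, 3), dtype=np.uint8),
            "state": np.zeros((n_steps, state_dim), dtype=np.float32),
        },
        "action": np.zeros((n_steps, state_dim), dtype=np.float32),
        "is_terminal": np.zeros(n_steps, dtype=bool),
    }
    episode["is_terminal"][-1] = True   # mark the final timestep
    return episode

ep = make_episode()
```

Because the layout is embodiment-agnostic (the arrays carry their own shapes and dtypes), episodes recorded on different robots can be pooled into one training corpus, which is what makes cross-embodiment datasets possible.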

Related Reading

Related: VLA Models Compared · Action Chunking Transformers · Data Services · SpaceMouse Teleoperation Guide · LeRobot Framework Guide · Hardware Catalog · Robotics Glossary