The Hidden Scale of Tesla's Robotics Ambition
From Billions of Inputs to a Few Outputs
Tesla’s Full Self-Driving (FSD) system is an elegant study in asymmetry and information compression. In AI, a token is a small chunk of data, such as a pixel, a word (or part of a word), or a numerical input, that a neural network processes. If it helps, think of a token as a raw ingredient the model consumes, while parameters are the model’s internal knobs that learn patterns from those tokens. As Ashok Elluswamy explained at ICCV 2025, the car’s onboard computer handles a staggering torrent of data. Seven to eight five-megapixel cameras streaming at 36 frames per second generate a firehose of sensory input, equating to roughly two billion visual tokens over a 30-second driving window. Tesla’s end-to-end neural network distills those billions of signals into only a few control outputs: steering, acceleration, and braking, represented mathematically as negative acceleration. From billions of pixels and frames to just three commands: that compression is both the magic and the challenge.
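Those figures hang together as rough arithmetic. The back-of-envelope sketch below walks through it in Python; the camera count, frame rate, and especially the patch size used for tokenization are illustrative assumptions rather than details Tesla has published.

```python
# Back-of-envelope: how a 30-second clip becomes roughly two billion visual tokens.
# All figures below are illustrative assumptions, not Tesla's published pipeline.

cameras = 8                # roughly seven to eight cameras per vehicle
pixels_per_frame = 5_000_000  # 5 MP per frame
fps = 36                   # frames per second per camera
window_s = 30              # length of the driving window

frames = cameras * fps * window_s        # ~8,640 frames in the window
raw_pixels = frames * pixels_per_frame   # ~43 billion raw pixel values

# Assume each token summarizes a small patch of pixels (hypothetical 5x5 patch).
patch_pixels = 5 * 5
tokens = raw_pixels // patch_pixels      # ~1.7 billion tokens, on the order of 2e9

print(f"frames in window: {frames:,}")
print(f"raw pixels:       {raw_pixels:,}")
print(f"visual tokens (assumed 5x5 patches): {tokens:,}")
```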
This process exemplifies how Tesla has shifted from traditional hand-coded perception and planning systems toward a unified neural architecture. Instead of writing thousands of lines of code to define every possible condition, a near-impossible task given the effectively infinite range of driving scenarios, Tesla trains a single model to infer human-like decisions directly from video data. The shift brings deterministic latency, richer correlations across cameras, and emergent human-like driving behavior. It’s the same architectural principle that underpins all advanced robotics: the fewer interfaces and handoffs between modules, the smoother the intelligence that emerges.
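To make the “pixels in, controls out” idea concrete, here is a minimal toy policy sketched in PyTorch. It is not Tesla’s network; the layer sizes, camera count, and two-channel output head are assumptions chosen only to show the end-to-end shape of such a model.

```python
import torch
import torch.nn as nn

class ToyDrivingPolicy(nn.Module):
    """Minimal end-to-end policy: camera frames in, control values out.

    A toy illustration of the pixels-to-controls pattern; the structure and
    sizes are arbitrary and bear no relation to Tesla's actual network.
    """

    def __init__(self, n_cameras: int = 8):
        super().__init__()
        # Shared convolutional encoder compresses each camera's frame.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Small head fuses all camera embeddings into two control outputs:
        # steering and acceleration (braking is just negative acceleration).
        self.head = nn.Sequential(
            nn.Linear(32 * n_cameras, 64), nn.ReLU(),
            nn.Linear(64, 2), nn.Tanh(),
        )
        self.n_cameras = n_cameras

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, n_cameras, 3, height, width)
        b, c, ch, h, w = frames.shape
        feats = self.encoder(frames.view(b * c, ch, h, w))
        return self.head(feats.view(b, -1))  # (batch, 2): steer, accel

# One eight-camera "instant" at reduced resolution, just to show the shapes.
policy = ToyDrivingPolicy()
controls = policy(torch.randn(1, 8, 3, 96, 160))
print(controls.shape)  # torch.Size([1, 2])
```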
Why Humanoids Multiply the Challenge
Brett Winton of ARK Invest recently suggested that Tesla’s humanoid robot, Optimus, could require up to 200,000 times more AI compute than the FSD system. That figure may seem extravagant at first glance, but a closer look at the input-output structure of both systems makes it plausible. The FSD network operates with a narrow range of outputs in a structured environment: steering, throttle, and braking on roads designed for predictability. By contrast, a humanoid robot operates in the unstructured complexity of the human world.
Humanoids inherit the same visual backbone as FSD but add an entirely new dimension of sensory input: joint encoders, torque and force sensors, tactile fingertips, inertial measurement units (IMUs), audio sensors, and eventually, haptic feedback. Each of these adds its own high-frequency data stream. On the output side, control expands from three continuous commands to dozens of actuators coordinating in three-dimensional space. A single Optimus robot might control 20 to 40 degrees of freedom, with each limb, finger, and joint requiring microsecond-level synchronization. Where an FSD model controls smooth trajectories on flat asphalt, a humanoid must grasp fragile objects, balance on uneven surfaces, and adapt in milliseconds to novel environments.
Every additional sensor and actuator expands the dimensionality of the control space. What was once a car predicting two output tokens (steering and acceleration) becomes a bipedal organism managing hundreds of interdependent micro-actions per second. The complexity grows exponentially, not linearly, as the sketch below illustrates. That’s why scaling from cars to humanoids isn’t a matter of repurposing software; it’s about reimagining intelligence itself.
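A crude way to see that growth: discretize each continuous control channel into a handful of levels and count the joint combinations. The five-level discretization and the channel counts below are arbitrary choices made for illustration, but the arithmetic makes the point.

```python
# Why control complexity explodes: discretize each control channel into a few
# levels and count the joint combinations. Numbers are purely illustrative.

levels = 5  # hypothetical coarse discretization of each continuous channel

car_channels = 3        # steering, acceleration, braking
humanoid_channels = 40  # roughly 20-40 degrees of freedom; take the high end

car_actions = levels ** car_channels            # 125 joint actions per timestep
humanoid_actions = levels ** humanoid_channels  # ~9.1e27 joint actions per timestep

print(f"car:      {car_actions:,} joint action combinations per timestep")
print(f"humanoid: {humanoid_actions:.2e} joint action combinations per timestep")

# Real controllers never enumerate these combinations, but the count shows why
# adding degrees of freedom multiplies, rather than adds to, the control space.
```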
The Economics of Compute
Tesla currently spends around ten billion dollars annually on AI, supporting an infrastructure of roughly eighty-five thousand H100-class GPUs. That spending funds both the training clusters that evolve FSD’s neural policies and the inference hardware embedded in every vehicle. Extend that to humanoid robotics, and ARK’s estimate of one hundred to two hundred billion dollars in total AI compute begins to make sense. It represents not only the capital cost of hardware but the continuous investment required to collect, simulate, and train across an ever-expanding dataset.
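As a rough sanity check, the sketch below relates the figures quoted above. The dollar amounts and GPU count come from this article; treating annual spend per installed GPU as an all-in rate that covers hardware, power, data collection, and engineering is a simplifying assumption.

```python
# Back-of-envelope on the compute economics quoted above.
# Dollar and GPU figures are from the article; the framing is a simplification.

annual_ai_spend = 10e9            # ~$10B per year on AI
current_gpus = 85_000             # ~85,000 H100-class GPUs
ark_low, ark_high = 100e9, 200e9  # ARK's total AI compute estimate

# ARK's range expressed as multiples of today's annual spend.
print(f"{ark_low / annual_ai_spend:.0f}x to {ark_high / annual_ai_spend:.0f}x "
      "one year of Tesla's current AI spending")

# Implied all-in annual spend per installed GPU at today's run rate.
print(f"~${annual_ai_spend / current_gpus:,.0f} per GPU per year")
```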
The magnitude of data required to teach Optimus even basic human-like dexterity is extraordinary. If FSD represents a Niagara Falls of video input, Optimus will demand the equivalent of a planetary-scale weather system of multimodal data: constant, self-generating, and diverse. Every grasp, every balance correction, every misstep becomes a training datapoint. And each one contributes to a model that must eventually generalize to the full complexity of human motion and decision-making.
The Strategic Advantage
Tesla’s true advantage lies not only in model scale or compute power but in its data feedback loop. Millions of Teslas on the road act as both sensors and teachers, streaming real-world edge cases that fuel the model’s evolution. Soon, humanoid robots deployed in Tesla’s own factories will do the same: observing, working, learning, and sending their data back into the neural hive. Each robot becomes both a productive asset and a node in Tesla’s distributed learning system.
This flywheel effect is nearly impossible to replicate. For competitors, the barrier is not just financial but logistical. Gathering diverse, real-world data at this scale requires a fleet of learning agents embedded in reality: cars, robots, and drones that continuously observe and interact with the world. Tesla already has that infrastructure in motion. If ARK’s projection is directionally correct, the high cost of compute could itself form a protective moat, cementing Tesla’s lead for years to come.
The Broader Takeaway
Autonomous vehicles taught Tesla how to perceive the world; humanoid robots will teach it how to participate in it. The leap from three control outputs to forty may appear incremental, but in computational terms it’s exponential. Every additional sensor, degree of freedom, or millisecond of latency compounds across the training pipeline. The FSD system learned to see; Optimus must learn to act, adapt, and collaborate. It’s the difference between vision and embodiment, between observation and agency.
Ashok Elluswamy’s remark about “two billion tokens to two outputs” captures the essence of Tesla’s philosophy. It is the foundation for the next industrial revolution, where the neural compression of perception into purposeful motion becomes the blueprint for artificial intelligence in the physical world. As Tesla bridges the gap between autonomous driving and general-purpose robotics, we are witnessing not just an engineering milestone but a paradigm shift in how intelligence interacts with reality.