Embedded ML

Giving AI agents real hardware to learn from

Tiny devices have hard limits. Embedded Arena asks whether agents can optimize models by compiling, flashing, measuring, and trying again.

Embedded ML · Hardware Feedback · Benchmark

Without hardware feedback, frontier agents had 0% deployment success in the benchmark. With a hardware-in-the-loop setup, agents achieved successful deployments and sometimes surpassed human expert results.

The tiny-computer problem

A model that runs beautifully on a laptop can fail completely on a microcontroller. It may exceed the device's memory, run too hot, draw too much power, or simply fail to compile for the target. These constraints matter for wildlife cameras, wearables, health devices, and privacy-preserving sensing systems that need local inference.
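To make the memory constraint concrete, here is a minimal sketch of a fit check in Python. The budget figures (1 MiB of flash, 256 KiB of RAM) and the names `DeviceBudget`, `weight_bytes`, and `fits` are illustrative assumptions, not part of Embedded Arena, and real deployments also have to account for firmware, runtime, and stack overhead.

```python
from dataclasses import dataclass


@dataclass
class DeviceBudget:
    """Hypothetical memory budget for a microcontroller."""
    flash_bytes: int  # storage available for model weights
    ram_bytes: int    # working memory available for activations


def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Bytes needed to store the weights, rounded up to a whole byte."""
    return (num_params * bits_per_weight + 7) // 8


def fits(num_params: int, bits_per_weight: int,
         peak_activation_bytes: int, budget: DeviceBudget) -> bool:
    """True if weights fit in flash and peak activations fit in RAM."""
    return (weight_bytes(num_params, bits_per_weight) <= budget.flash_bytes
            and peak_activation_bytes <= budget.ram_bytes)


# Assumed budget: 1 MiB flash, 256 KiB RAM (illustrative numbers only).
budget = DeviceBudget(flash_bytes=1024 * 1024, ram_bytes=256 * 1024)

# A 2M-parameter float32 model needs 8,000,000 bytes of weights: too big.
print(fits(2_000_000, 32, 100_000, budget))
# A 500k-parameter int8 model needs 500,000 bytes of weights: it fits.
print(fits(500_000, 8, 100_000, budget))
```

A static check like this only catches the most obvious failures. Heat, power draw, and compiler behavior still have to be measured on the device itself.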

Today, expert engineers often tune these deployments by hand. They compress models, adjust firmware, measure real hardware behavior, and iterate through messy tradeoffs.

What Embedded Arena changes

Embedded Arena turns that engineering loop into a benchmark. An agent cannot merely write plausible code. It must compile, flash, run, measure, and respond to feedback from physical hardware.
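The compile, flash, run, measure, and respond loop can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the benchmark's actual harness: `SimulatedBoard` stands in for physical hardware, its latency model (1 ms per 1,000 bytes) is made up, and the "agent" reacts to every failure by halving the model size.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Measurement:
    ok: bool
    reason: str
    latency_ms: float = 0.0


class SimulatedBoard:
    """Hypothetical stand-in for a real device; all numbers are invented."""

    def __init__(self, flash_bytes: int, latency_budget_ms: float):
        self.flash_bytes = flash_bytes
        self.latency_budget_ms = latency_budget_ms

    def compile_flash_run(self, model_bytes: int) -> Measurement:
        # "Compile/flash" step: fails if the image does not fit in flash.
        if model_bytes > self.flash_bytes:
            return Measurement(False, "flash overflow")
        # "Run/measure" step: toy latency model, 1 ms per 1,000 bytes.
        latency = model_bytes / 1000.0
        if latency > self.latency_budget_ms:
            return Measurement(False, "too slow", latency)
        return Measurement(True, "ok", latency)


def optimize(board: SimulatedBoard, model_bytes: int,
             max_iters: int = 10) -> Optional[int]:
    """Shrink the model until the board accepts it, or give up."""
    for attempt in range(max_iters):
        result = board.compile_flash_run(model_bytes)
        print(f"attempt {attempt}: {model_bytes} bytes -> {result.reason}")
        if result.ok:
            return model_bytes
        # Respond to hardware feedback: here, simply halve the model.
        model_bytes //= 2
    return None


board = SimulatedBoard(flash_bytes=1_000_000, latency_budget_ms=300.0)
best = optimize(board, model_bytes=4_000_000)
print(best)  # 250000 under the toy assumptions above
```

In this sketch the first two attempts fail at the flash step and the next two fail the latency budget, so the loop only succeeds because it observes both kinds of feedback. A real agent would choose a different remedy for each failure, such as quantizing, pruning, or changing the firmware.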

That feedback is the key. In our experiments, hardware-in-the-loop optimization enabled large compressions: 250x for vision models with less than 3.3% accuracy loss, and 400x for audio models with less than 6% degradation in feature error rate. We demonstrated the approach in two deployments: an elk-detection camera trap and a phonetic-transcription wearable for child development research.
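For a sense of scale, a compression ratio is the original model size divided by the compressed size. The short Python sketch below works through that arithmetic; the starting sizes (25 MB and 40 MB) are hypothetical examples chosen for round numbers, not figures from the experiments.

```python
def compression_ratio(original_bytes: float, compressed_bytes: float) -> float:
    """Original size divided by compressed size."""
    if compressed_bytes <= 0:
        raise ValueError("compressed size must be positive")
    return original_bytes / compressed_bytes


def compressed_size(original_bytes: float, ratio: float) -> float:
    """Size after compressing by the given ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return original_bytes / ratio


# Hypothetical sizes: a 25 MB vision model at 250x becomes 100 KB,
# and a 40 MB audio model at 400x also becomes 100 KB.
print(compressed_size(25_000_000, 250))
print(compressed_size(40_000_000, 400))
print(compression_ratio(25_000_000, 100_000))
```

Under these assumed starting sizes, both compressed models would land around 100 KB, which is the kind of footprint a microcontroller's flash can hold.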

Why it matters

The world does not run only in cloud APIs. Many important AI systems need to work offline, cheaply, privately, and on a battery. Embedded Arena makes those constraints visible to agents, and it gives the research community a way to measure whether agentic coding systems can handle physical reality rather than just software benchmarks.

Read the paper · Project page