
Building an AI Agent That Controls a Phone

The architecture behind an AI agent that reliably operates a real Android device — vision, decisions, actions, verification, and learning — and the sharp edges you hit building it yourself.


Auten Team

May 31, 2026 · 10 min read

What does it actually take for an AI to operate a phone reliably? It is far more than "send a screenshot to a model and hope." A production agent needs a tight perception-decision-action-verification loop with memory, plus a transport layer that survives real networks. Here is the architecture Auten runs for every task — and the sharp edges you will hit if you build it yourself.

1. See the screen properly

The device returns an annotated screenshot with numbered markers on every interactive element — a technique called set-of-marks (SoM). The model reads the image and a structured element list, so it knows exactly what it can tap and where. Vision alone is not enough: models misjudge coordinates from raw pixels, but with numbered markers and an element table the action becomes "tap marker 12," which is precise and verifiable.
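As a minimal sketch of why markers make actions precise, assume the device returns an element table in which each entry carries its marker number, a label, and pixel bounds. The `Element` type, its field names, and `resolve_tap` are hypothetical names chosen for illustration, not Auten's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    marker: int                          # number drawn on the screenshot
    label: str                           # visible text or content description
    bounds: tuple[int, int, int, int]    # left, top, right, bottom in pixels


def resolve_tap(elements: list[Element], marker: int) -> tuple[int, int]:
    """Turn "tap marker N" into the centre point of that element's bounds."""
    by_marker = {e.marker: e for e in elements}
    if marker not in by_marker:
        # The model named a marker that is not on screen: reject, do not guess.
        raise KeyError(f"no element with marker {marker}")
    left, top, right, bottom = by_marker[marker].bounds
    return ((left + right) // 2, (top + bottom) // 2)


# Hypothetical element table for a two-button dialog.
elements = [
    Element(11, "Cancel", (40, 1800, 500, 1920)),
    Element(12, "Send", (580, 1800, 1040, 1920)),
]
print(resolve_tap(elements, 12))  # (810, 1860)
```

The model only ever chooses a marker number. The coordinates come from the element table, so a marker that does not exist is rejected before anything touches the screen.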

2. Decide and act, one step at a time

The agent calls discrete tools — tap, type, scroll, open app, key, set/paste clipboard — taking a single action and then re-observing. Acting on freshly observed reality, rather than a plan imagined up front, is what keeps it from drifting when the screen does something unexpected like a popup or a slow load.
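The step loop can be sketched as follows, assuming three callbacks supplied by the caller: `observe` returns the current screen, `decide` is the model call that picks one tool, and `act` executes it on the device. All names and the dictionary shapes are illustrative assumptions.

```python
from typing import Callable

Observation = dict   # hypothetical shape: {"screen": ..., "elements": [...]}
Action = dict        # hypothetical shape: {"tool": "tap", "marker": 12}


def run_steps(
    observe: Callable[[], Observation],
    decide: Callable[[Observation], Action],
    act: Callable[[Action], None],
    max_steps: int = 20,
) -> list[Action]:
    """Take one action per fresh observation until the model says 'done'."""
    history: list[Action] = []
    for _ in range(max_steps):
        obs = observe()            # always look before acting
        action = decide(obs)       # exactly one tool call per step
        if action["tool"] == "done":
            break
        act(action)
        history.append(action)
    return history
```

Because `decide` only ever sees the latest observation, an unexpected popup simply becomes the next screen to handle instead of invalidating a plan made earlier.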

The single most important rule

After each action the agent checks whether the screen actually changed. If a tap did nothing, repeating it blindly is the classic failure mode that ruins naive agents. Observing again — and changing strategy after a couple of failed attempts — is what separates a reliable agent from a flaky demo.
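One way to implement this check, sketched under the assumption that a screen can be summarized by the labels of its visible elements: hash the labels before and after each action and count consecutive no-ops. `screen_fingerprint`, `NoOpGuard`, and the threshold of two are illustrative choices, not a description of Auten's internals.

```python
import hashlib


def screen_fingerprint(element_labels: list[str]) -> str:
    """Hash the visible element labels; equal hashes mean 'nothing changed'."""
    joined = "\n".join(element_labels).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()


class NoOpGuard:
    """Counts consecutive actions that left the screen unchanged."""

    def __init__(self, max_noops: int = 2) -> None:
        self.max_noops = max_noops
        self.noops = 0

    def record(self, before: str, after: str) -> str:
        """Return 'ok', 'retry', or 'switch_strategy' for the last action."""
        if before != after:
            self.noops = 0          # progress was made, reset the counter
            return "ok"
        self.noops += 1
        return "switch_strategy" if self.noops >= self.max_noops else "retry"
```

A fingerprint based only on labels will miss purely visual changes such as a spinner, so a real implementation would likely combine it with an image comparison.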

3. Text entry is harder than it looks

Typing into Android fields is a minefield: some apps accept text set directly through the accessibility API (setText, exposed on Android as ACTION_SET_TEXT), others need an input method (IME) to commit text, and some only accept a clipboard paste. A robust agent tries these strategies in order and detects which one worked. This is exactly the kind of unglamorous edge case that determines whether real tasks succeed.
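The fallback chain can be sketched like this, assuming each strategy is a callable supplied by the device layer and that the field's contents can be read back afterwards. `enter_text`, `Strategy`, and `read_field` are hypothetical names.

```python
from typing import Callable, Optional

# A strategy is a (name, callable) pair. Success is judged by reading the
# field back, not by trusting that the call returned without error.
Strategy = tuple[str, Callable[[str], None]]


def enter_text(
    text: str,
    strategies: list[Strategy],
    read_field: Callable[[], str],
) -> Optional[str]:
    """Try each strategy in order; return the name of the one that worked."""
    for name, attempt in strategies:
        try:
            attempt(text)
        except Exception:
            continue                 # this path is unsupported here, try the next
        if read_field() == text:     # verify by reading the field back
            return name
    return None                      # nothing worked; the caller must handle it
```

Returning the strategy name lets the caller remember which method a given app needs. A fuller version would also clear the field between attempts in case a failed strategy left partial text behind.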

4. Verify the goal was reached

A separate verification step asks: did this task actually achieve what was asked? If not, the agent receives the reason as feedback and tries a different approach — pressing through a popup, taking another path — instead of declaring victory or giving up. Without verification, agents confidently report success on tasks they never completed.
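A minimal sketch of the verify-and-retry wrapper, assuming the agent run and the verifier are both passed in as callables. `Verdict`, `run_with_verification`, and the limit of three attempts are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    achieved: bool
    reason: str = ""   # why the goal was not met, fed back to the agent


def run_with_verification(
    goal: str,
    attempt: Callable[[str, str], None],   # (goal, feedback) -> runs the agent
    verify: Callable[[str], Verdict],      # goal -> independent check
    max_attempts: int = 3,
) -> Verdict:
    """Run, verify, and retry with the verifier's reason as feedback."""
    feedback = ""
    verdict = Verdict(achieved=False, reason="not attempted")
    for _ in range(max_attempts):
        attempt(goal, feedback)
        verdict = verify(goal)
        if verdict.achieved:
            return verdict
        feedback = verdict.reason    # the next attempt sees what went wrong
    return verdict                   # report failure honestly, with the reason
```

Keeping `verify` separate from `attempt` is the point of the design: the component that acted is not the one that decides whether the action succeeded.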

5. Learn so it never has to think twice

Every action records an edge in a screen graph: from this screen, this action led to that screen. A successful run distills into a clean plan — the minimal action sequence that worked. Next time, the plan replays deterministically, with no model call. We cover this in depth in how Auten learns.
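The graph and the distillation step can be sketched as below, assuming each screen has a stable identifier and each action a stable string form. `ScreenGraph`, `distill_plan`, and `expected_screens` are hypothetical names, and the loop-cutting rule is one simple way to reach a minimal sequence, not necessarily the one Auten uses.

```python
from collections import defaultdict


class ScreenGraph:
    """Directed graph: (screen, action) -> resulting screen."""

    def __init__(self) -> None:
        self.edges: dict[str, dict[str, str]] = defaultdict(dict)

    def record(self, src: str, action: str, dst: str) -> None:
        self.edges[src][action] = dst


def distill_plan(trace: list[tuple[str, str, str]]) -> list[str]:
    """Drop no-ops and detours from a successful trace of (src, action, dst)."""
    plan: list[tuple[str, str]] = []   # (screen the action was taken on, action)
    for src, action, dst in trace:
        if src == dst:
            continue                   # the action changed nothing
        # Back on a screen already in the plan: everything since was a detour.
        for i, (seen_src, _) in enumerate(plan):
            if seen_src == src:
                del plan[i:]
                break
        plan.append((src, action))
    return [action for _, action in plan]


def expected_screens(graph: ScreenGraph, start: str, plan: list[str]) -> list[str]:
    """Screens the plan should pass through; a replay can compare against these."""
    screens = [start]
    for action in plan:
        screens.append(graph.edges[screens[-1]][action])
    return screens
```

During replay, comparing each observed screen against `expected_screens` gives a cheap way to notice that an app has changed and the cached plan should fall back to the model.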

6. The transport problem nobody mentions

A phone behind mobile NAT cannot be reached directly. You need the device to open an outbound connection (a reverse tunnel) so your backend can send it commands. Building this — with reconnection, heartbeats, and routing across many devices — is a real piece of infrastructure that has nothing to do with AI but everything to do with whether your agent works in the field.
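Two small pieces of that infrastructure can be sketched without any networking code: the reconnect schedule on the device side and heartbeat tracking on the backend side. `backoff_delays`, `DeviceRegistry`, and the default timings are assumptions for illustration only.

```python
import time
from typing import Iterator, Optional


def backoff_delays(base: float = 1.0, cap: float = 30.0) -> Iterator[float]:
    """Yield reconnect delays: base, 2*base, 4*base, ... capped at `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= 2


class DeviceRegistry:
    """Backend side: tracks which devices have a live outbound connection."""

    def __init__(self, heartbeat_timeout: float = 15.0) -> None:
        self.heartbeat_timeout = heartbeat_timeout
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, device_id: str, now: Optional[float] = None) -> None:
        """Record that a device was heard from (defaults to the current time)."""
        self.last_seen[device_id] = time.monotonic() if now is None else now

    def is_online(self, device_id: str, now: Optional[float] = None) -> bool:
        """A device is online if its last heartbeat is within the timeout."""
        now = time.monotonic() if now is None else now
        seen = self.last_seen.get(device_id)
        return seen is not None and now - seen <= self.heartbeat_timeout
```

A production version would usually add random jitter to the delays so that many devices do not reconnect in lockstep after an outage, and would route commands only to devices that `is_online` reports as live.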

Why the learning loop matters so much

Pure LLM-per-action automation is slow and expensive, and it makes the same decisions over and over. Caching proven plans turns repeat tasks into deterministic replays that cost no model calls. That is the difference between an impressive demo and something you can run thousands of times a day at a sane cost.

Perception, action, verification, memory, transport. Drop any one and reliability collapses.

Build it yourself vs use a platform

You can assemble this from scratch — a vision model, a tool layer, device transport, a verifier, and a plan store — but each piece hides sharp edges: accessibility quirks across vendors, IME text entry, NAT traversal, plan invalidation when apps update, and the cost control that makes it viable. Auten packages the whole loop behind an API so you can focus on the task, not the harness. If you want to try the finished version, start with the SDK.

Frequently asked questions

Why not just send screenshots to a vision model in a loop?

It works for demos but is slow, expensive, and unreliable: no verification, no memory, poor coordinate accuracy, and no handling of text-entry or transport edge cases.

What is set-of-marks?

A technique that overlays numbered markers on interactive elements so the model refers to "marker 12" instead of guessing pixel coordinates — far more accurate.

How does the agent avoid loops?

It checks whether each action changed the screen and changes strategy after repeated no-ops, rather than blindly retrying.

Try Auten

Grab an API key at auten.ai, connect a phone or spin up a hosted virtual device, and send your first natural-language task in minutes. The free tier needs no credit card.
