Building an AI Agent That Controls a Phone
The architecture behind an AI agent that reliably operates a real Android device — vision, decisions, actions, verification, and learning — and the sharp edges you hit building it yourself.
Auten Team

What does it actually take for an AI to operate a phone reliably? It is far more than "send a screenshot to a model and hope." A production agent needs a tight perception-decision-action-verification loop with memory, plus a transport layer that survives real networks. Here is the architecture Auten runs for every task — and the sharp edges you will hit if you build it yourself.
1. See the screen properly
The device returns an annotated screenshot with numbered markers on every interactive element — a technique called set-of-marks (SoM). The model reads the image and a structured element list, so it knows exactly what it can tap and where. Vision alone is not enough: models misjudge coordinates from raw pixels, but with numbered markers and an element table the action becomes "tap marker 12," which is precise and verifiable.
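To make "tap marker 12" concrete, here is a minimal sketch of resolving a marker to a tap point. It assumes a hypothetical element table in which each entry carries a marker number, a label, and pixel bounds; the names Element and resolve_marker are illustrative and not part of any Auten API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Element:
    marker: int                       # number drawn on the annotated screenshot
    label: str                        # visible text or content description
    bounds: Tuple[int, int, int, int]  # left, top, right, bottom in pixels (assumed layout)


def resolve_marker(elements: List[Element], marker: int) -> Tuple[int, int]:
    """Return the center of the element carrying `marker`, or raise if it is absent."""
    for el in elements:
        if el.marker == marker:
            left, top, right, bottom = el.bounds
            return ((left + right) // 2, (top + bottom) // 2)
    raise KeyError(f"marker {marker} is not on the current screen")


# Hypothetical element table for one screen.
screen = [
    Element(11, "Cancel", (40, 1800, 500, 1920)),
    Element(12, "Send", (580, 1800, 1040, 1920)),
]
tap_point = resolve_marker(screen, 12)
```

Because the model only ever names a marker, a bad choice fails loudly with a missing marker instead of silently landing on the wrong pixel.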
2. Decide and act, one step at a time
The agent calls discrete tools — tap, type, scroll, open app, key, set/paste clipboard — taking a single action and then re-observing. Acting on freshly observed reality, rather than a plan imagined up front, is what keeps it from drifting when the screen does something unexpected like a popup or a slow load.
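The observe, decide, act cycle can be sketched as a small loop. This is an illustration under stated assumptions: observe, decide, and act are hypothetical callbacks standing in for the device and the model, and the dictionary shapes for observations and actions are invented for the example.

```python
from typing import Callable, Dict, List

Observation = Dict[str, str]   # hypothetical shape: {"screen": "..."}
Action = Dict[str, object]     # hypothetical shape: {"tool": "tap", "marker": 3}


def run_steps(
    observe: Callable[[], Observation],
    decide: Callable[[str, Observation], Action],
    act: Callable[[Action], None],
    goal: str,
    max_steps: int = 20,
) -> List[Action]:
    """Take one action per fresh observation until the decider reports it is done."""
    history: List[Action] = []
    for _ in range(max_steps):
        obs = observe()             # always re-read the real screen first
        action = decide(goal, obs)  # exactly one tool call per turn
        if action["tool"] == "done":
            break
        act(action)
        history.append(action)
    return history


# Fake device: a popup covers the inbox until it is dismissed.
state = {"screen": "popup"}


def observe() -> Observation:
    return {"screen": state["screen"]}


def decide(goal: str, obs: Observation) -> Action:
    if obs["screen"] == "popup":
        return {"tool": "tap", "marker": 3}  # the dismiss button
    return {"tool": "done"}


def act(action: Action) -> None:
    state["screen"] = "inbox"


history = run_steps(observe, decide, act, goal="open inbox")
```

The popup was never part of any plan; the loop handles it only because the decision is made against the screen that is actually there.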
The single most important rule
3. Text entry is harder than it looks
Typing into Android fields is a minefield: some apps accept accessibility setText, others need an input method (IME) to commit text, and some only accept a clipboard paste. A robust agent tries these strategies in order and detects which worked. This is exactly the kind of unglamorous edge case that determines whether real tasks succeed.
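One way to structure the fallback is to treat each entry method as a strategy and confirm the result by reading the field back. The sketch below assumes hypothetical strategy functions and a read_field callback; it does not call any real Android API, and the three fakes only simulate the failure modes described above.

```python
from typing import Callable, List, Optional, Tuple

Strategy = Callable[[str], None]  # attempts to put text into the focused field


def enter_text(
    text: str,
    strategies: List[Tuple[str, Strategy]],
    read_field: Callable[[], str],
) -> Optional[str]:
    """Try strategies in order; return the name of the first that verifiably worked."""
    for name, attempt in strategies:
        try:
            attempt(text)
        except Exception:
            continue  # strategy rejected outright; move to the next one
        if read_field() == text:
            return name  # the field really holds the text now
    return None


# Fake field that ignores setText, has no IME connection, but accepts a paste.
field = {"value": ""}


def set_text(text: str) -> None:
    pass  # silently ignored by this app


def ime_commit(text: str) -> None:
    raise RuntimeError("no IME connection")


def paste(text: str) -> None:
    field["value"] = text


worked = enter_text(
    "hello",
    [("set_text", set_text), ("ime", ime_commit), ("paste", paste)],
    lambda: field["value"],
)
```

Reading the field back matters because the worst failures are silent: the call returns normally and the field stays empty. A real implementation would likely also clear partial input between attempts.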
4. Verify the goal was reached
A separate verification step asks: did this task actually achieve what was asked? If not, the agent receives the reason as feedback and tries a different approach — pressing through a popup, taking another path — instead of declaring victory or giving up. Without verification, agents confidently report success on tasks they never completed.
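The verify-then-retry idea can be sketched as a wrapper around the task run. Everything here is illustrative: attempt and verify are hypothetical callbacks, and a real verifier would judge the final screen with a model rather than a string comparison.

```python
from typing import Callable, List, Optional, Tuple


def run_with_verification(
    attempt: Callable[[Optional[str]], str],
    verify: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 3,
) -> Tuple[bool, List[str]]:
    """Retry until the verifier accepts, feeding each rejection reason into the next try."""
    feedback: Optional[str] = None
    reasons: List[str] = []
    for _ in range(max_attempts):
        final_screen = attempt(feedback)
        ok, reason = verify(final_screen)
        if ok:
            return True, reasons
        reasons.append(reason)
        feedback = reason  # the next attempt knows why the last one failed
    return False, reasons


def attempt(feedback: Optional[str]) -> str:
    # First try stops at a popup; with feedback, the agent dismisses it first.
    return "popup" if feedback is None else "order_confirmed"


def verify(screen: str) -> Tuple[bool, str]:
    return (screen == "order_confirmed", "a popup is covering the confirmation page")


ok, reasons = run_with_verification(attempt, verify)
```

Keeping the verifier separate from the acting loop means the agent cannot grade its own work, and bounding the attempts keeps a hopeless task from running forever.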
5. Learn so it never has to think twice
Every action records an edge in a screen graph: from this screen, this action led to that screen. A successful run distills into a clean plan — the minimal action sequence that worked. Next time, the plan replays deterministically, with no model call. We cover this in depth in how Auten learns.
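A screen graph of this kind can be sketched as a nested dictionary. The distillation step below uses breadth-first search to find the fewest recorded actions between two screens; that is one plausible way to get a minimal plan and an assumption of this example, not a description of how Auten does it. Screen and action names are made up.

```python
from collections import deque
from typing import Dict, List, Optional


class ScreenGraph:
    def __init__(self) -> None:
        # edges[screen][action] = screen reached after that action
        self.edges: Dict[str, Dict[str, str]] = {}

    def record(self, src: str, action: str, dst: str) -> None:
        self.edges.setdefault(src, {})[action] = dst

    def shortest_plan(self, start: str, goal: str) -> Optional[List[str]]:
        """Breadth-first search for the fewest recorded actions from start to goal."""
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            screen, plan = queue.popleft()
            if screen == goal:
                return plan
            for action, nxt in self.edges.get(screen, {}).items():
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, plan + [action]))
        return None  # goal was never reached in any recorded run


# A run that wandered into settings before finding the right path.
graph = ScreenGraph()
graph.record("home", "open:settings", "settings")
graph.record("settings", "back", "home")
graph.record("home", "tap:search", "search")
graph.record("search", "type:query", "results")

plan = graph.shortest_plan("home", "results")
```

The detour through settings stays in the graph as knowledge about the app, but it drops out of the plan that gets replayed.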
6. The transport problem nobody mentions
A phone behind mobile NAT cannot be reached directly. You need the device to open an outbound connection (a reverse tunnel) so your backend can send it commands. Building this — with reconnection, heartbeats, and routing across many devices — is a real piece of infrastructure that has nothing to do with AI but everything to do with whether your agent works in the field.
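Two of those pieces are small enough to sketch without any networking: the device-side reconnect schedule and the backend-side liveness check. The example assumes exponential backoff with full jitter and a fixed heartbeat timeout; the function and class names, the defaults, and the device id are all hypothetical.

```python
import random
from typing import Callable, Dict, Iterator


def backoff_delays(
    base: float = 1.0,
    cap: float = 60.0,
    attempts: int = 8,
    rng: Callable[[], float] = random.random,
) -> Iterator[float]:
    """Yield reconnect delays in seconds: exponential growth, capped, with full jitter."""
    for n in range(attempts):
        yield rng() * min(cap, base * 2 ** n)


class DeviceRegistry:
    """Backend-side view of which devices currently hold a live tunnel."""

    def __init__(self, timeout: float = 30.0) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, device_id: str, now: float) -> None:
        self.last_seen[device_id] = now

    def is_online(self, device_id: str, now: float) -> bool:
        seen = self.last_seen.get(device_id)
        return seen is not None and now - seen <= self.timeout


# With rng pinned to 1.0 the jitter disappears and the ceilings are visible.
ceilings = list(backoff_delays(rng=lambda: 1.0))

registry = DeviceRegistry(timeout=30.0)
registry.heartbeat("pixel-7", now=100.0)
online_at_120 = registry.is_online("pixel-7", now=120.0)
online_at_131 = registry.is_online("pixel-7", now=131.0)
```

Jitter keeps a fleet of phones from reconnecting in lockstep after an outage, and the heartbeat timeout lets the backend stop routing commands to a tunnel that has quietly died.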
Why the learning loop matters so much
Pure LLM-per-action automation is slow and expensive, and it makes the same decisions over and over. Caching proven plans turns repeat tasks into deterministic replays that need no model call. That is the difference between an impressive demo and something you can run thousands of times a day at a sane cost.
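A plan cache with a fallback path might look like the sketch below. It assumes each cached step stores the screen expected after the action, so a replay can notice when the app has changed and drop the stale plan; that check, and the names run_task, do_action, and run_with_model, are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str]  # (action, screen expected after it)


def run_task(
    task: str,
    plans: Dict[str, List[Step]],
    do_action: Callable[[str], str],
    run_with_model: Callable[[str], List[Step]],
) -> str:
    """Replay a cached plan if every step lands where expected; otherwise relearn."""
    plan = plans.get(task)
    if plan is not None:
        for action, expected in plan:
            if do_action(action) != expected:
                del plans[task]  # the app changed; this plan is stale
                break
        else:
            return "replayed"    # no model call was needed
    plans[task] = run_with_model(task)
    return "learned"


# Fake app and fake model so the example runs on its own.
screens = {"open:mail": "inbox", "tap:compose": "draft"}
model_calls: List[str] = []


def do_action(action: str) -> str:
    return screens[action]


def run_with_model(task: str) -> List[Step]:
    model_calls.append(task)
    return list(screens.items())


plans: Dict[str, List[Step]] = {}
outcomes = [run_task("compose", plans, do_action, run_with_model) for _ in range(3)]
screens["tap:compose"] = "new_draft_v2"  # simulate an app update
outcomes.append(run_task("compose", plans, do_action, run_with_model))
```

Four runs cost two model sessions here: one to learn the task and one to relearn it after the simulated update.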
Perception, action, verification, memory, transport. Drop any one and reliability collapses.
Build it yourself vs use a platform
You can assemble this from scratch — a vision model, a tool layer, device transport, a verifier, and a plan store — but each piece hides sharp edges: accessibility quirks across vendors, IME text entry, NAT traversal, plan invalidation when apps update, and the cost control that makes it viable. Auten packages the whole loop behind an API so you can focus on the task, not the harness. If you want to try the finished version, start with the SDK.
Frequently asked questions
Why not just send screenshots to a vision model in a loop?
It works for demos but is slow, expensive, and unreliable: no verification, no memory, poor coordinate accuracy, and no handling of text-entry or transport edge cases.
What is set-of-marks?
A technique that overlays numbered markers on interactive elements so the model refers to "marker 12" instead of guessing pixel coordinates, which makes each tap precise and verifiable.
How does the agent avoid loops?
It checks whether each action changed the screen and changes strategy after repeated no-ops, rather than blindly retrying.
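A no-op check of this kind can be sketched by fingerprinting the element list before and after each action. The class name NoOpGuard, the use of a SHA-256 hash over element labels, and the limit of two are assumptions for the example.

```python
import hashlib
from typing import List


class NoOpGuard:
    """Counts consecutive actions that left the screen unchanged."""

    def __init__(self, limit: int = 2) -> None:
        self.limit = limit
        self.no_ops = 0

    @staticmethod
    def fingerprint(elements: List[str]) -> str:
        return hashlib.sha256("\n".join(elements).encode()).hexdigest()

    def should_switch(self, before: List[str], after: List[str]) -> bool:
        """Return True once `limit` actions in a row changed nothing."""
        if self.fingerprint(before) == self.fingerprint(after):
            self.no_ops += 1
        else:
            self.no_ops = 0  # progress was made; start counting again
        return self.no_ops >= self.limit


guard = NoOpGuard(limit=2)
login = ["Email", "Password", "Sign in"]
first = guard.should_switch(login, login)                # one no-op: keep going
second = guard.should_switch(login, login)               # two in a row: change strategy
third = guard.should_switch(login, ["Inbox", "Compose"])  # screen changed: counter resets
```

When should_switch returns True, the agent would pick a different tool or path instead of repeating the same tap.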