Mobile AI Agents Still Fail Most Real Tasks — What the Numbers Actually Mean

A 2026 benchmark ran four mobile AI agents through 65 real Android tasks. The best one finished 43%. Here is why the reliability gap exists, what separates the agents that work from the ones that don't, and how to ship anyway.

Auten Team

June 8, 2026 · 9 min read
[Image: abstract illustration of a smartphone surrounded by branching task paths, some glowing green for success and others dimmed for failure]

If you have shipped anything built on a mobile AI agent, you already know the gap between the demo and the dashboard. The demo taps through three screens flawlessly. The dashboard, two weeks later, shows a success rate you would never put on a slide. A 2026 benchmark just put a hard number on that feeling: across 65 real Android tasks, the best mobile agent tested finished 43% of them. The worst finished 7%. That is the honest state of the art, and it is more useful to understand than to be disappointed by.

What the benchmark actually measured

The test, published by AIMultiple, ran four open mobile agents (DroidRun, Mobile-Agent, AutoDroid, and AppAgent) through the AndroidWorld framework on an Android emulator. The 65 tasks were the unglamorous, real things people automate: creating a calendar event, adding a contact, taking a photo, recording audio, moving files around. Nothing exotic. These are tasks any human does without thinking, which is exactly what makes them a fair test for an agent that claims to use a phone "like a person."

The spread in results is the interesting part:

  • DroidRun — 43% success, roughly $0.075 per successful task (~3,225 tokens)
  • Mobile-Agent — 29% success, ~$0.025 per task
  • AutoDroid — 14% success, the cheapest at ~$0.017 per task
  • AppAgent — 7% success, and the most expensive at ~$0.90 per task

Two things jump out. First, no agent cracked the halfway mark on everyday tasks — this is a hard problem, not a solved one. Second, the cheapest agent was not the worst, and the most expensive was not the best. AppAgent burned the most money per task and still finished last. Cost and reliability are not the same axis, and conflating them is how teams end up paying premium prices for premium failure rates.
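One way to see that cost and reliability are separate axes is to rank the four agents both ways. The snippet below is a minimal sketch that uses only the figures quoted above; the variable names are illustrative. The benchmark figures quote DroidRun's cost per successful task and the others per task, so treating the cost column as directly comparable is an assumption.

```python
# Figures as quoted in this article: (success rate, approximate cost in USD).
# Assumption: the costs are treated as comparable, even though DroidRun's is
# quoted "per successful task" and the others "per task".
results = {
    "DroidRun": (0.43, 0.075),
    "Mobile-Agent": (0.29, 0.025),
    "AutoDroid": (0.14, 0.017),
    "AppAgent": (0.07, 0.90),
}

# Rank once by success rate (best first) and once by cost (cheapest first).
by_success = sorted(results, key=lambda name: results[name][0], reverse=True)
by_cost = sorted(results, key=lambda name: results[name][1])

print("Best to worst by success:", by_success)
print("Cheapest to priciest:    ", by_cost)
```

The two orderings disagree: the most reliable agent sits third on cost, and the priciest agent sits last on reliability.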

Why the best agent won: reasoning and state, not just sight

DroidRun's edge came from a multi-step reasoning architecture: it keeps explicit state about what it has done, generates an action plan before it acts, and tracks whether each step landed. That costs more tokens per task than a bare "look, tap, look again" loop — but it converts those tokens into the one thing that matters, completed tasks. The lesson is not "spend more." It is "spend on structure." An agent that knows where it is in a workflow recovers from a surprise; an agent that re-derives the world from a fresh screenshot every step does not.
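As a rough illustration of "explicit state, a plan up front, and a check after each step," here is a minimal sketch. It is not DroidRun's code; `TaskState`, `run_plan`, `act`, and `landed` are hypothetical names invented for this example, and a Python set stands in for the device.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class TaskState:
    """Explicit memory of the goal, the plan, and the steps that have landed."""
    goal: str
    plan: List[str]
    done: List[str] = field(default_factory=list)

    def next_step(self) -> Optional[str]:
        # The next step comes from the plan, not from a fresh look at the screen.
        if len(self.done) < len(self.plan):
            return self.plan[len(self.done)]
        return None


def run_plan(state: TaskState,
             act: Callable[[str], None],
             landed: Callable[[str], bool]) -> bool:
    """Run the plan in order, recording only steps that verifiably landed."""
    while (step := state.next_step()) is not None:
        act(step)
        if not landed(step):
            return False  # stop with state intact so a caller can replan or retry
        state.done.append(step)
    return True


# Toy usage: a set stands in for the phone's observable state.
screen = set()
state = TaskState(goal="add contact",
                  plan=["open contacts", "tap add", "type name", "save"])
ok = run_plan(state, act=screen.add, landed=lambda s: s in screen)
print(ok, state.done)
```

Because `done` only grows when a step is confirmed, a failure leaves a record of exactly where the workflow stopped, which is what makes recovery possible.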

AppAgent sits at the other end and explains itself by contrast. Its approach processes labeled screenshots through a multimodal model on every single interaction. That is pure vision: powerful in principle, but it pays full price for perception at each step and still loses the thread of the larger task. Vision tells you what is on the screen right now. It does not, on its own, tell you what you were trying to do or whether you are making progress.

The structure-vs-vision trade-off

Pure-vision agents are seductive because they work on anything with a screen. But reading the structured accessibility tree of an app — when it is available — is faster, cheaper, and far more reliable than re-recognizing pixels every turn. The agents that win in 2026 use vision to fill gaps, not as the whole strategy. We wrote about why this matters on Android specifically in our piece on the Android accessibility API for automation.
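A "structure first, vision to fill gaps" lookup might be sketched as follows. The nested-dict tree shape, the `center` field, and the `vision_locate` callable are assumptions made for this example; they do not mirror the real Android accessibility API or any particular agent.

```python
from typing import Callable, Optional, Tuple

Point = Tuple[int, int]


def find_in_tree(tree: dict, label: str) -> Optional[Point]:
    """Depth-first search of a nested node dict for a node whose text matches."""
    if tree.get("text") == label:
        return tree.get("center")
    for child in tree.get("children", []):
        hit = find_in_tree(child, label)
        if hit is not None:
            return hit
    return None


def locate(label: str,
           tree: Optional[dict],
           vision_locate: Callable[[str], Optional[Point]]
           ) -> Tuple[Optional[Point], str]:
    """Prefer the structured tree; call the costly vision path only on a miss."""
    if tree is not None:
        hit = find_in_tree(tree, label)
        if hit is not None:
            return hit, "tree"
    return vision_locate(label), "vision"


# Toy usage with a hand-written tree and stubbed vision functions.
tree = {"text": "root", "children": [{"text": "Save", "center": (540, 1820)}]}
print(locate("Save", tree, vision_locate=lambda _: None))
print(locate("Skip", tree, vision_locate=lambda _: (90, 120)))
```

The second return value records which path produced the answer, so the share of vision fallbacks can be tracked over time.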

The four failure modes behind the 57%

Benchmarks report a number; production teams need to know what the failures look like so they can design around them. From this test and from running agents on real devices, the failures cluster into four kinds:

  • Lost context — the agent completes step 3 but has forgotten the goal it set in step 1, so it taps something locally reasonable and globally wrong.
  • Misread screen — a multimodal model mislabels a button, a modal, or a loading state, and acts on a phantom. More common with pure-vision approaches.
  • No recovery — something unexpected appears (a permission dialog, a cookie banner, an A/B-tested layout) and the agent has no plan for "this is not what I expected."
  • Silent success-failure — the agent reports done, but the task did not actually complete. This is the most dangerous because it poisons your metrics, not just your run.

Notice that only one of these is fundamentally a perception problem. The other three are reasoning, memory, and verification problems. Throwing a bigger vision model at them does not help — which is exactly why AppAgent's expensive per-step vision did not buy it reliability.
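If you tag each failed run with one of the four modes, you can measure how much of your own failure budget is actually perception. The sketch below is one hypothetical way to do that; the enum, the function name, and the sample log are all invented for illustration and are not benchmark data.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    LOST_CONTEXT = "lost context"
    MISREAD_SCREEN = "misread screen"
    NO_RECOVERY = "no recovery"
    SILENT_FAILURE = "silent success-failure"


# Of the four modes, only misreading the screen is a perception problem.
PERCEPTION = {FailureMode.MISREAD_SCREEN}


def perception_share(failures):
    """Fraction of logged failures that fall in a perception mode."""
    counts = Counter(failures)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[mode] for mode in PERCEPTION) / total


# Made-up log for illustration only; these are not measured results.
log = [FailureMode.LOST_CONTEXT, FailureMode.MISREAD_SCREEN,
       FailureMode.NO_RECOVERY, FailureMode.SILENT_FAILURE]
print(perception_share(log))
```

A low share is a signal that a bigger vision model is unlikely to move your success rate much.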

How to ship on top of a 43% world

A sub-50% raw success rate sounds unusable. It is not — if you stop treating a single agent run as the unit of work. The teams getting real value from phone automation in 2026 do three things.

First, they verify instead of trusting. Every task ends with an explicit check — did the contact actually get created, is the event on the calendar — rather than believing the agent's self-report. This kills the silent success-failure mode outright. Second, they retry intelligently: a fresh attempt with the agent's own failure as context recovers a large share of first-try misses, because most failures are situational, not fundamental. Third, and most important, they do not pay the reasoning tax twice. The expensive part of an agent run is the reasoning. Once an agent has figured out how to complete a task on a given screen, that path can be captured and replayed deterministically — no model in the loop, near-zero cost, and 100% repeatability until the UI changes. We go deep on this in how Auten learns a screen graph and replays it.
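The three practices can be combined in a single wrapper. This is a sketch under stated assumptions, not Auten's implementation: `agent`, `verify`, and `replay` are hypothetical callables you would supply, and the cache is a plain dict keyed by task text.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical signatures, not a real agent API:
#   agent(task, hint) -> list of actions it performed (the "path")
#   verify(task)      -> True only if the task's effect is observable on the device
#   replay(path)      -> re-executes a stored path with no model in the loop
Agent = Callable[[str, Optional[str]], List[str]]


def run_task(task: str,
             agent: Agent,
             verify: Callable[[str], bool],
             replay: Callable[[List[str]], None],
             cache: Dict[str, List[str]],
             max_attempts: int = 2) -> bool:
    # 1. Replay a known-good path first; it costs no reasoning.
    if task in cache:
        replay(cache[task])
        if verify(task):
            return True
        del cache[task]  # the UI probably changed; fall back to reasoning

    # 2. Reason, then verify independently of the agent's self-report.
    hint: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        path = agent(task, hint)
        if verify(task):
            cache[task] = path  # 3. capture the path for future replays
            return True
        # Feed the failure back so the next attempt starts with context.
        hint = f"attempt {attempt} did not pass verification; path was {path}"
    return False
```

The function trusts only `verify`, never the agent's own report, and a cached path is dropped as soon as a replay stops passing verification.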

Stack those three and the math inverts. A 43% first-pass agent, wrapped in verification and one retry, clears about two-thirds of tasks even if the retry is no better than the first attempt (1 − 0.57² ≈ 0.68), and more if the failure context helps. The tasks it clears also get cheaper every time you run them, because the second run is a replay, not a reasoning session. The benchmark measures cold, single-shot reasoning. Production is a warm, repeated, verified system. They are not the same game.
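The arithmetic behind that claim can be checked in a few lines. The sketch below makes two simplifying assumptions that are not benchmark results: attempts are independent with the same success probability (a retry that carries failure context should do better), and a replay costs nothing (the article's "near-zero"). The function names are illustrative.

```python
def success_after_attempts(p_first: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries passes verification."""
    return 1 - (1 - p_first) ** attempts


def avg_cost_per_run(reasoning_cost: float, replay_cost: float, runs: int) -> float:
    """Average cost per run when the first run reasons and every later run replays."""
    return (reasoning_cost + replay_cost * (runs - 1)) / runs


# 43% first-pass success with one retry.
print(round(success_after_attempts(0.43, 2), 4))

# DroidRun's quoted ~$0.075 reasoning cost, amortized over 10 runs with free replays.
print(round(avg_cost_per_run(0.075, 0.0, 10), 4))
```

Under these assumptions one retry lifts 43% to roughly 68%, and ten runs of the same task average about a tenth of the original reasoning cost.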

An honest caveat about the number

This was a 65-task benchmark on an emulator using AndroidWorld, with a specific set of open agents and model choices. Real devices, different apps, and newer or proprietary agents will score differently — and a 43% top result will not stay the high-water mark for long. Treat the figure as a snapshot of difficulty, not a law of nature. The durable takeaway is the shape of the failures, not the decimal.

FAQ

Is a 43% success rate good or bad? It is honest. On novel, single-shot tasks with no replay and no verification, it reflects how hard general phone use is for an agent. The number you should care about is your system's end-to-end rate after verification and retries, which is much higher.

Why did the most expensive agent perform worst? Because its cost came from running a vision model on every interaction, and most failures are reasoning, memory, and verification problems — not perception problems. You cannot buy your way out of a structural weakness with more pixels.

Does using the accessibility tree instead of screenshots fix this? It removes a major failure mode — misreading the screen — and it is faster and cheaper. It does not, by itself, fix lost context or missing recovery logic. You still need reasoning, state, and verification around it. See our Auten vs. Appium comparison for how this plays against script-based tooling.

Will these numbers improve? Almost certainly, and fast — model and agent progress in this space is steep. But the engineering lesson outlives any single benchmark: design for failure, verify everything, and never pay for reasoning you have already done.

The takeaway

The reliability gap in mobile AI agents is real, and pretending otherwise is how projects die in pilot. But a sub-50% raw benchmark is a statement about cold single-shot reasoning, not about what a well-built system can ship. Lean on structure over raw vision, verify instead of trust, retry with context, and replay what you have already learned. That is the difference between a great demo and a production system you can put a number on with pride.

Want to build on an agent that already does the structured-reading, verification, and replay parts for you? Grab an API key at auten.ai, connect a phone or spin up a hosted virtual device, and send your first natural-language task in minutes. The free tier needs no card.
