Open-source phone agents just got real — and the 62% reality check
Open-AutoGLM puts a capable 9B phone-control model in everyone’s hands. A new benchmark shows the best agents still fail 2 in 5 real-app tasks. Here’s what that means for teams automating Android.
Auten Team

For most of the last two years, the strongest phone-control agents lived behind closed APIs. That changed quietly this spring. Open-AutoGLM, an open-source phone-agent framework from the team behind the GLM models, now ships a 9-billion-parameter vision-language model that runs on your own hardware and drives a real Android device through natural language. It crossed 25,000 GitHub stars within weeks. If you build mobile automation, this is a milestone worth understanding — and a reality check worth taking seriously.
What Open-AutoGLM actually is
Open-AutoGLM is a framework plus two open-weight models. The models, AutoGLM-Phone-9B and a multilingual variant, are published on Hugging Face and ModelScope and are small enough (9B parameters) to self-host. The framework is a Python program that runs on your computer and controls a connected device the same way a developer would: over ADB (Android Debug Bridge) for Android 7.0+, or HDC for HarmonyOS. You enable USB debugging and describe a task in plain language, such as “open the shopping app and find wireless earbuds under €30”. The agent then perceives the screen visually, plans a step, taps or types, re-reads the screen, and continues.
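The perceive, plan, act cycle can be sketched in a few lines of Python. This is a minimal illustration, not Open-AutoGLM's actual code: the `Action` type, `run_task`, and the `plan` callback (a stand-in for the vision-language model) are hypothetical names. Only the `adb` invocations themselves (`exec-out screencap -p`, `shell input tap`, `shell input text`) are standard Android tooling.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    """Hypothetical action format; a real model's output schema will differ."""
    kind: str  # "tap", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


def adb_command(action: Action) -> Optional[List[str]]:
    """Translate an Action into an adb argument list (None means finished)."""
    if action.kind == "tap":
        return ["adb", "shell", "input", "tap", str(action.x), str(action.y)]
    if action.kind == "type":
        # `input text` does not accept raw spaces; %s is the usual escape.
        escaped = action.text.replace(" ", "%s")
        return ["adb", "shell", "input", "text", escaped]
    if action.kind == "done":
        return None
    raise ValueError(f"unknown action kind: {action.kind}")


def capture_screen() -> bytes:
    """Grab a PNG screenshot from the connected device."""
    result = subprocess.run(
        ["adb", "exec-out", "screencap", "-p"],
        check=True,
        capture_output=True,
    )
    return result.stdout


def run_adb(cmd: List[str]) -> None:
    """Run one adb command and raise if it fails."""
    subprocess.run(cmd, check=True)


def run_task(
    task: str,
    plan: Callable[[str, bytes], Action],
    capture: Callable[[], bytes] = capture_screen,
    execute: Callable[[List[str]], None] = run_adb,
    max_steps: int = 20,
) -> int:
    """Loop: look at the screen, ask the planner, act. Returns steps used."""
    for step in range(1, max_steps + 1):
        screenshot = capture()            # perceive
        action = plan(task, screenshot)   # plan (the model call in a real agent)
        cmd = adb_command(action)
        if cmd is None:                   # planner says the task is complete
            return step
        execute(cmd)                      # act, then loop to re-read the screen
    raise RuntimeError(f"task not finished after {max_steps} steps")
```

Passing `capture` and `execute` as parameters is a deliberate choice in this sketch: the loop can be exercised with fakes before any device is plugged in, and the step cap keeps a confused planner from tapping forever.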
Two design choices stand out. First, it is multimodal and visual: instead of parsing a brittle view hierarchy, the model looks at pixels, which is why a general VLM can adapt across apps it was never scripted for. Second, it keeps a human in the loop where it matters — there is a sensitive-operation confirmation step and a manual-takeover path for logins and verification codes, the two places where unattended agents do the most damage.
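A confirmation step of this kind can be approximated with a small gate in front of the executor. The sketch below is an assumption about the general pattern, not the framework's implementation: the keyword list, `needs_confirmation`, and `guarded_execute` are all hypothetical, and a real system would likely classify actions with something more robust than keywords.

```python
import re
from typing import Callable

# Hypothetical trigger words; tune these for your own apps and risk profile.
SENSITIVE_WORDS = {"pay", "purchase", "delete", "transfer", "send"}


def needs_confirmation(description: str) -> bool:
    """True if the planned step mentions any sensitive word (whole words only)."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    return bool(words & SENSITIVE_WORDS)


def guarded_execute(
    description: str,
    execute: Callable[[], None],
    confirm: Callable[[str], bool],
) -> bool:
    """Run `execute` unless the step is sensitive and the human declines.

    Returns True if the step ran, False if it was blocked.
    """
    if needs_confirmation(description) and not confirm(description):
        return False
    execute()
    return True
```

In an interactive prototype, `confirm` could simply wrap `input()`. In a fleet, it would more plausibly enqueue the step for review and block until someone approves it.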
The short version
Why open weights change the calculus
When the only capable agents were hosted, three things were effectively decided for you: where your screen data went, how much each task cost, and whether you could run offline. Open weights reopen all three questions.
- Data residency. Screens of a phone often contain personal or customer data. A self-hosted model means screenshots never leave your infrastructure.
- Cost shape. Hosted agents bill per action or per token; a local model trades that for fixed GPU cost. For high-volume, repetitive workflows the local economics can win.
- Control. You can fine-tune, pin a version, and audit behavior — none of which you get from a black-box endpoint.
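The cost-shape trade-off reduces to a break-even calculation. The function below is a rough sketch; its name and parameters are illustrative, and the prices you feed it should come from your own hosted-API invoices and hardware budget.

```python
import math


def break_even_tasks(
    fixed_monthly_cost: float,
    hosted_cost_per_task: float,
    local_cost_per_task: float = 0.0,
) -> int:
    """Monthly task count at which a self-hosted model matches hosted cost.

    fixed_monthly_cost: GPU, power, and maintenance for the local setup.
    hosted_cost_per_task: what the hosted agent bills per completed task.
    local_cost_per_task: marginal cost of one local run, if any.
    """
    saving_per_task = hosted_cost_per_task - local_cost_per_task
    if saving_per_task <= 0:
        raise ValueError("local runs must be cheaper per task to ever break even")
    return math.ceil(fixed_monthly_cost / saving_per_task)
```

Below the returned volume the hosted option is cheaper; above it the fixed GPU cost amortizes in your favor. Engineering time is not in this formula, and it belongs in `fixed_monthly_cost` if you want an honest number.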
There is a catch attached to each benefit: running the model locally needs a real GPU (the project lists roughly 24 GB of VRAM for local deployment), plus the operational work of keeping devices, drivers, and ADB connections healthy. Open weights move the cost from a per-call invoice to your own engineering time. That is a trade, not a free lunch.
The 62% reality check
Here is the part the launch hype tends to skip. A new benchmark called AndroidDaily measured mobile GUI agents on 350 realistic daily tasks across 94 real, closed-source Android apps — shopping, transport, social, local services — the apps people actually use, not simulated sandboxes. The strongest model evaluated reached a 62.0% success rate.
Sit with that number. On everyday tasks in real apps, the best agents fail roughly two times in five. AndroidDaily is also notable for how it grades: because closed-source apps do not expose their internal state, the authors built GRADE, a process-aware reviewer that checks observable obligations, output quality, and forbidden actions. GRADE agrees with human judges 87.37% of the time, so the 62% figure is a fairly trustworthy measurement rather than an artifact of lenient grading.
Most existing benchmarks rely on simulated or open-source apps. Real closed-source applications — the ones people use daily — were largely unevaluated.
This matches what we wrote earlier about the reliability gap in mobile AI agents: a model that can do a task once in a demo is a different thing from a system that does it 1,000 times unattended. Open-AutoGLM lowers the cost of the first. It does not, by itself, close the gap to the second.
Model versus system: where the other 38% lives
A capable model is one component of a production agent, not the whole thing. The failures that make up that missing 38% are mostly not “the model can’t read the screen” — they are systems problems:
1. Recovery. An ad interstitial, a permission dialog, a network hiccup. A demo retries by hand; production needs automatic detection and a recovery policy.
2. State and idempotency. Did the order actually submit, or did the tap miss? Re-running a half-finished task can double-charge a user. You need verification, not optimism.
3. Fleet and isolation. One phone on a desk is easy. A hundred devices with sessions, accounts, and rate limits is an infrastructure problem.
4. Observability. When a run fails at 3am, you need a replayable trace of what the agent saw and did, or you are debugging blind.
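The recovery and idempotency points share one rule: check the observable state before every attempt instead of retrying blindly. Here is a minimal sketch of that rule, with hypothetical names throughout. In practice `already_done` would inspect the screen or a backend record, for example by looking for an order-confirmation page.

```python
from typing import Callable


def run_with_verification(
    action: Callable[[], None],
    already_done: Callable[[], bool],
    max_attempts: int = 3,
) -> int:
    """Perform `action` until `already_done()` reports success.

    Verifies BEFORE each attempt, so an action that already took effect
    (for example, an order that did submit) is never repeated.
    Returns how many times `action` was actually executed.
    """
    executed = 0
    for _ in range(max_attempts):
        if already_done():
            return executed
        action()
        executed += 1
    if already_done():
        return executed
    raise RuntimeError(f"not verified after {max_attempts} attempts")
```

Calling this a second time on a completed task executes the action zero times, which is the property that prevents a double charge. Logging each check and attempt from inside this wrapper is also a natural place to start building the replayable trace.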
This is exactly the layer we focus on at Auten: treating the model as a swappable brain and investing in the orchestration, screen-graph memory, recovery, and audit trail around it. If you are weighing build-versus-buy, that boundary is the decision. Open-AutoGLM is a strong argument that the model should be commoditized and open. It is also a clear demonstration that the model is the easy 62%.
How to think about adopting it
Concretely, if Open-AutoGLM is on your radar:
- Prototype with it. For research, internal tooling, or low-stakes automation on a single device, a self-hosted open model is a genuinely good place to start — and you can read about building a phone-controlling agent for the mechanics.
- Benchmark on your own apps. The 62% headline is an aggregate over 350 tasks in 94 apps; your specific workflow may be far better or far worse. Measure before you commit.
- Budget for the system, not the model. Assume the model is the cheap part and the reliability layer is where your time goes.
- Keep the human-in-the-loop hooks. The confirmation and takeover steps are not friction to remove — they are the difference between a useful agent and an expensive accident.
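When you benchmark on your own apps, a handful of runs gives a noisy estimate, so it helps to report an interval rather than a single percentage. The sketch below uses the Wilson score interval, which is our choice of method and not something taken from the AndroidDaily paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success rate (z=1.96 is roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return centre - half_width, centre + half_width


# Assumption: 62.0% of 350 tasks corresponds to 217 successes.
low, high = wilson_interval(217, 350)
print(f"350 tasks: {low:.3f} to {high:.3f}")

# The same 62% rate measured on only 50 of your own tasks is far less certain.
low_small, high_small = wilson_interval(31, 50)
print(f"50 tasks:  {low_small:.3f} to {high_small:.3f}")
```

At 350 tasks the interval spans roughly 0.57 to 0.67; at 50 tasks it is much wider. If two configurations have overlapping intervals on your workload, you probably need more runs before choosing between them.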
One honest caveat
FAQ
Is Open-AutoGLM good enough to replace a commercial automation platform?
For prototypes and single-device, low-stakes tasks, often yes. For unattended, high-volume, or revenue-critical workflows, the model is necessary but not sufficient — the recovery, fleet, and observability layer is what determines whether it holds up.
Does it need a powerful machine?
To run the 9B model locally, the project points to a GPU with roughly 24 GB of VRAM, plus Python 3.10+. You can also pair the framework with a hosted model if you don’t want to self-host the weights.
Why does it use ADB instead of an SDK inside the app?
ADB lets the agent control any installed app from the outside without modifying it, which is why a single model generalizes across apps. The trade-off is that it depends on USB or Wi-Fi debugging being enabled and on the screen being readable: exactly the conditions AndroidDaily stresses.
What is the single most useful takeaway?
Open weights have commoditized the hardest-looking part of phone automation — the model. Your competitive edge now lives in the system around it. Plan accordingly.
Where this leaves us
Open-AutoGLM is good news. A capable, open, self-hostable phone agent lowers the barrier for everyone and pushes the whole field toward portability instead of lock-in. The honest framing is just to pair it with the AndroidDaily number: the best agents still fail two tasks in five on real apps. Treat the open model as a strong foundation and put your engineering where the reliability actually lives. If you’d rather not build that layer yourself, that is precisely the problem Auten exists to solve.