
Why AI agents aren’t replacing remote workers any time soon

Technology
11 December 2025

Agentic AI is exciting, but real benchmarks beat glossy promises, writes 'office whisperer' Dr Gleb Tsipursky.

The demos look slick, the promises even slicker. In slides and keynotes, agentic assistants plan, click, and ship your work while you sip coffee. Promoters like McKinsey call it the agentic AI advantage.

Then you put these systems on real client work, and the wheels come off. The newest empirical benchmark from researchers at the Center for AI Safety and Scale AI finds current AI agents completing only a tiny fraction of jobs at a professional standard.

Benchmarks, not buzzwords, describe reality

Headlines say “agents are here”. Data says otherwise. The new Remote Labor Index (RLI), a multi-domain benchmark built from 240 real freelance-type projects across 23 categories, reports an automation rate topping out at 2.5 per cent across leading agents, meaning almost all deliverables would be rejected by a reasonable client. The dataset spans design, operations, business intelligence, audio-video, game development, CAD, architecture, and more, reflecting the work that actually shows up in remote markets, not cherry-picked lab tasks.

The point is not that AI fails everywhere. RLI documents scattered wins in text-heavy data visualisation, audio editing, and simple image generation. But the failures are systematic. Reviewers cite empty or corrupt files, missing assets, low-grade visuals, and inconsistencies across deliverables, the kinds of misses that doom client work regardless of clever reasoning traces. These aren’t close calls. Inter-annotator agreement sits at 94.4 per cent for the accept-or-reject decision, so we are not talking about taste.

If you need a concrete sense of difficulty, the benchmark’s human reference projects averaged 28.9 hours to complete, with a median of 11.5 hours and an average price of $632.60. Those are realistic project sizes. They include work like a World Happiness Report dashboard, a 2D promo for a tree services firm, 3D animations for new earbuds, an IEEE-formatted paper, an architectural concept for a container home, and a “Watermelon Game”-style casual web game. This is the right yardstick for agent claims.

Other grounded evaluations tell a similar story: the WebArena benchmark finds agents completing realistic web tasks at rates well below human performance. And in software, SWE-bench shows that turning model skill into working patches across real repositories remains hard without tight scaffolding.

Tasks automate, but projects still require adults in the room

When I work with companies on AI adoption, I push a simple framing. Use AI to do well-scoped tasks inside a project, not to run the project. That rule aligns with the published evidence from benchmarks. The RLI team notes pockets of success in content drafting, audio clean-up, image assets, and basic data visualisation, which pair nicely with human review in marketing, product, and analytics teams. In my client work, this shows up as faster ad variants, cleaner query logic, quicker explainer scripts, and first-pass chart code that a developer can polish.

Contrast those gains with multi-hour, multi-file builds that require iterative verification. In METR’s HCAST findings, agents succeed 70–80 per cent of the time on tasks humans complete in under an hour, but under 20 per cent of the time on tasks that take humans more than four hours. That is the difference between automating a component and carrying a project across the finish line.

This gap explains why the RLI authors also track a relative “Elo” progress signal, which rises over time even as absolute project completion stays low. Improvement is real. Hype overstates what that improvement means for near-term automation of whole projects.

Plan for augmentation now, not mass replacement

Hype has a business model. The agentic AI advantage storyline promises proactive, goal-driven assistants that automate complex processes across the firm. Markets respond to bold claims, then teams inherit the risk. Gartner even warns that more than two out of five so-called agentic initiatives will be scrapped by 2027 due to unclear value and rising costs, part of a wave of “agent washing” in which conventional tooling is relabelled as autonomy.

The balanced plan is to redesign work so humans direct, verify, and integrate agent outputs, then let evidence guide scope increase. OpenAI’s GDPval report shows that with human oversight, frontier models are approaching expert quality on carefully defined, economically valuable tasks. That supports staffing models where you automate slices of jobs, not the jobs themselves. It also matches early labour data. A recent Stanford employment analysis reports wage gains in AI-exposed roles without broad, immediate job loss, consistent with a world where AI changes task mix before it wipes out occupations.

The near-term playbook is straightforward. Use AI to reduce cycle time on repeatable tasks. Assign owners to verify outputs. Track acceptance rates and defect types, the same way the RLI evaluators categorised corrupt files, missing components, inconsistent renders, and low-quality assets. Expect headcount to shift as pieces of marketing, writing, programming, and analysis take fewer people, while roles that specify goals, judge quality, and integrate outputs become more central. On current trend lines, more capable AI agents will arrive over the next few years, helped by scaffolded workflows and better tool use, yet the evidence says whole-project autonomy for general remote-capable work is not a short-term outcome, regardless of hype from McKinsey and others.
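For teams that want to operationalise this, the acceptance-and-defect tracking described above can be sketched in a few lines. The review records and defect labels here are hypothetical, loosely modelled on the failure categories the RLI evaluators used; it is an illustrative sketch, not a prescribed tool.

```python
from collections import Counter

# Hypothetical review log: each record is one agent deliverable judged
# by a human owner, with defect labels assigned on rejection.
reviews = [
    {"task": "ad variant",       "accepted": True,  "defects": []},
    {"task": "promo video",      "accepted": False, "defects": ["corrupt file"]},
    {"task": "chart code",       "accepted": True,  "defects": []},
    {"task": "3D render",        "accepted": False, "defects": ["inconsistent render", "low-quality asset"]},
    {"task": "explainer script", "accepted": False, "defects": ["missing component"]},
]

def acceptance_rate(records):
    """Fraction of agent deliverables a human reviewer accepted."""
    return sum(r["accepted"] for r in records) / len(records)

def defect_counts(records):
    """Tally defect types across rejected deliverables."""
    return Counter(d for r in records for d in r["defects"])

print(f"acceptance rate: {acceptance_rate(reviews):.0%}")
print(defect_counts(reviews).most_common())
```

Tracking these two numbers over time gives the same evidence base the benchmarks use: an acceptance rate to decide where to widen agent scope, and a defect breakdown to decide where human verification must stay.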

Conclusion

Agentic AI is exciting, but real benchmarks beat glossy promises. The Remote Labor Index shows tiny automation rates on the kinds of projects companies actually pay for, backed by strong evaluation methods and consistent with other grounded benchmarks on web and desktop tasks. Progress will continue, and the smart move is to treat agents as force multipliers inside projects while humans stay accountable for outcomes. Leaders who adopt with discipline will bank the gains today and be ready for tomorrow without buying into a bubble.

Dr Gleb Tsipursky, called the “Office Whisperer” by The New York Times, helps leaders transform AI hype into real-world results. He serves as the CEO of the future-of-work consultancy Disaster Avoidance Experts and wrote seven best-selling books, including The Psychology of Generative AI Adoption.