
AI Agents Are Going Mainstream. Here's What They Can Actually Do (and What's Still Hype)

Lorenz Kutschka · 9 min read

Last week I watched a demo where an AI agent booked a flight, reserved a hotel, and sent a calendar invite to three colleagues. All from a single prompt. The audience clapped. I wrote down a question nobody asked: what happens when the agent picks the wrong date and you're on the hook for a non-refundable $1,400 ticket?

This is the state of AI agents in early 2026. The demos are incredible. The reliability is not.

OpenClaw, the open-source agent framework created by Peter Steinberger, crossed 180,000 GitHub stars in January. Steinberger got hired by OpenAI two weeks later. Big Tech companies are collectively spending an estimated $650 billion on AI infrastructure this year, according to analyst reports from Goldman Sachs and Sequoia Capital. The money is real. The ambition is enormous.

But I've been testing these tools for months, and the gap between what agents can do in a controlled demo and what they can do in your actual life is still significant. Here's what I've found.

What "AI Agent" Actually Means

An AI agent is software that can take actions autonomously, not just answer questions. You give it a goal, and it figures out the steps. A chatbot tells you how to book a flight. An agent books the flight.

The key distinction is autonomy. A copilot suggests code and you accept or reject it. An agent writes the code, runs the tests, fixes the errors, and opens the pull request. That's a fundamentally different trust model.
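That loop, goal in and actions out, can be sketched in a few lines. This is a toy illustration of the control flow only, not any real framework's API; the planner here is scripted where a production agent would call a model:

```python
# Minimal sketch of the agent loop: plan, act, observe, repeat.
# Everything here is hypothetical scaffolding, not a real framework's API.

def run_agent(goal, tools, propose_step, max_steps=10):
    """Pursue `goal` by repeatedly choosing and executing a tool."""
    history = []
    for _ in range(max_steps):
        step = propose_step(goal, history)   # an LLM call, in practice
        if step["action"] == "done":
            return step["result"], history
        tool = tools[step["action"]]
        observation = tool(**step["args"])   # the agent acts, not just answers
        history.append((step, observation))
    raise RuntimeError("step budget exhausted without reaching the goal")

# Toy example: an "agent" that books the flight instead of describing how.
def book_flight(date, dest):
    return f"booked {dest} on {date}"

def scripted_planner(goal, history):
    if not history:
        return {"action": "book_flight", "args": {"date": "2026-03-01", "dest": "BER"}}
    return {"action": "done", "result": history[-1][1]}

result, trace = run_agent("fly to Berlin", {"book_flight": book_flight}, scripted_planner)
print(result)  # booked BER on 2026-03-01
```

The chatbot version of this program would return the planner's suggestion as text; the agent version calls the tool. That single difference is the trust model shift.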

The term has gotten loose. Lots of products are calling themselves "agents" when they're really just chatbots with API access. If it can't make decisions and take multi-step actions without you approving each one, it's not an agent. It's autocomplete with extra steps.

The Framework Landscape Right Now

If you want to build agents, four frameworks dominate the conversation: LangChain, CrewAI, AutoGen, and OpenClaw. They are not interchangeable.

LangChain has the largest ecosystem. Over 92,000 GitHub stars, hundreds of integrations, massive community. But it's also the one developers complain about most. The abstraction layers can be frustratingly opaque. When something breaks three layers deep in a chain, debugging feels like archaeology.

CrewAI takes a different approach: role-based agents that collaborate like a team. You define agents with specific roles (researcher, writer, editor) and they hand off tasks to each other. It works surprisingly well for content workflows. Less proven for anything involving real-world actions.
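The handoff pattern itself is simple enough to sketch in plain Python. To be clear, this is the pattern CrewAI popularized, not CrewAI's actual API; each role's `work` function stands in for an LLM call with a role-specific prompt:

```python
# Sketch of role-based handoff: each "agent" is a role with a transform,
# and the artifact flows down the pipeline. Pattern only; not the CrewAI API.

from dataclasses import dataclass
from typing import Callable

@dataclass
class RoleAgent:
    role: str
    work: Callable[[str], str]  # in practice, an LLM call with a role prompt

def run_crew(agents, brief):
    artifact = brief
    for agent in agents:
        artifact = agent.work(artifact)  # hand off to the next role
    return artifact

crew = [
    RoleAgent("researcher", lambda b: b + " | facts gathered"),
    RoleAgent("writer",     lambda b: b + " | draft written"),
    RoleAgent("editor",     lambda b: b + " | edited"),
]
print(run_crew(crew, "agent frameworks post"))
# agent frameworks post | facts gathered | draft written | edited
```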

AutoGen, from Microsoft, is built around multi-agent conversations. Agents talk to each other, debate, and refine outputs. It's powerful for complex reasoning tasks but heavyweight for simple automations.

OpenClaw is the newcomer that's eating the others' lunch. Steinberger designed it around a principle he calls "progressive autonomy," where agents start with minimal permissions and earn more based on track record. That 180K-star count didn't come from hype alone. The architecture genuinely solves problems the others don't.

What Actually Works Today

Some agent use cases are genuinely production-ready. I've been running several for weeks without babysitting.

Code review and testing. Agents that review pull requests, run test suites, and flag issues are the most mature category. Cursor, Copilot Workspace, and OpenClaw-based setups handle this well. They work because the feedback loop is tight: the code either passes tests or it doesn't.

Content monitoring and summarization. Agents that watch RSS feeds, track social media accounts, and summarize what's new are reliable because the stakes are low. A missed article isn't a financial loss. An inaccurate summary is annoying but not catastrophic.

Data extraction and transformation. Pulling structured data from unstructured sources, cleaning CSVs, converting formats. These are tedious tasks with clear success criteria, which makes them ideal for agents. If a task has a verifiable output, agents can probably handle it.
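The "verifiable output" point is the crux, and it is worth making concrete. A minimal sketch, with an invented input and a deliberately simple check: the agent's extraction is only trusted after it passes validation.

```python
# Sketch: extraction tasks suit agents because the output is checkable.
# Extract (name, price) rows from messy text, then *verify* the result
# against clear success criteria before trusting it. Input is made up.

import re

def extract_prices(text):
    rows = re.findall(r"(\w+)\s*[:\-]\s*\$(\d+(?:\.\d{2})?)", text)
    return [(name, float(price)) for name, price in rows]

def verify(rows):
    """Clear success criteria: non-empty names, positive prices."""
    return bool(rows) and all(name and price > 0 for name, price in rows)

raw = "Widget: $19.99 some noise Gadget - $5 trailing junk"
rows = extract_prices(raw)
assert verify(rows)  # only proceed if the output checks out
print(rows)  # [('Widget', 19.99), ('Gadget', 5.0)]
```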

Where Agents Still Fall Apart

Multi-step reasoning with real consequences is where things go sideways. And they go sideways in ways that are hard to predict.

I tested an agent that was supposed to manage my email. It replied to a client inquiry with pricing information that was six months out of date. The email was grammatically perfect, professionally worded, and confidently wrong. Good: the tone and formatting were flawless. Bad: the actual content would have cost me a client if I hadn't caught it.

Hallucination risk scales with autonomy. A chatbot that hallucinates gives you wrong information. An agent that hallucinates takes wrong actions. That's a qualitative difference most people haven't internalized yet.

The other failure mode is what researchers call "goal drift." You ask an agent to find the cheapest flight to Berlin. It finds one with two layovers and a 14-hour travel time because technically that's the cheapest. The agent optimized for exactly what you said, not what you meant. This is the alignment problem in miniature, playing out in your travel itinerary.
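The Berlin flight example reduces to a two-line bug. A toy illustration with invented flight data: the literal objective ignores the constraints you never stated.

```python
# Toy illustration of "goal drift": the agent optimizes the literal
# objective ("cheapest") and ignores unstated constraints. Data invented.

flights = [
    {"price": 89,  "stops": 2, "hours": 14.0},  # technically cheapest
    {"price": 120, "stops": 0, "hours": 2.0},
]

# What you said: minimize price.
literal = min(flights, key=lambda f: f["price"])

# What you meant: minimize price, subject to a tolerable itinerary.
intended = min((f for f in flights if f["stops"] <= 1 and f["hours"] <= 6),
               key=lambda f: f["price"])

print(literal["hours"], intended["hours"])  # 14.0 2.0
```

Encoding those constraints is exactly the work that agent prompts and guardrails have to do, and it is easy to get wrong because the constraints live in your head.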

The Real Money Problem

Agents that spend money on your behalf are the holy grail and the biggest liability. Klarna's AI shopping assistant reportedly processes millions of queries, but it operates within extremely tight guardrails. There's a reason for that.

Autonomous purchasing requires a level of judgment that current models don't reliably have. An agent that buys the wrong $15 item on Amazon is a nuisance. An agent that commits to the wrong $15,000 vendor contract is a lawsuit.

The companies making progress here, like Adept and Rabbit, are building in approval checkpoints for transactions above certain thresholds. That's the right approach but it also means the agent isn't truly autonomous. It's a smart assistant that drafts actions for your approval. Good: safety and control. Bad: you're still the bottleneck.
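The checkpoint pattern itself is a few lines of logic. This is a generic sketch with an invented threshold and queue, not any vendor's implementation:

```python
# Sketch of a transaction approval checkpoint: below the threshold the
# agent proceeds; above it, the action is queued for a human. The
# threshold and queue are illustrative, not any vendor's API.

APPROVAL_THRESHOLD = 100.00  # dollars; tune per risk tolerance
pending_approvals = []

def purchase(item, amount, execute):
    if amount > APPROVAL_THRESHOLD:
        pending_approvals.append({"item": item, "amount": amount})
        return "queued for human approval"
    return execute(item, amount)  # autonomous only for small spends

receipt = purchase("USB cable", 15.00, lambda i, a: f"bought {i} for ${a:.2f}")
status = purchase("vendor contract", 15000.00, lambda i, a: "should not run")
print(receipt, "|", status)  # bought USB cable for $15.00 | queued for human approval
```

The $15 nuisance goes through; the $15,000 lawsuit waits for a signature. The cost is exactly the bottleneck described above.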

Why the Demos Look So Good

Every agent demo you see on Twitter follows the same pattern: a carefully chosen task, a curated environment, and zero edge cases. It's the equivalent of a self-driving car demo on a closed course with no other vehicles.

In production, the world is messy. Websites change their layouts. APIs return unexpected errors. Users phrase requests ambiguously. The agent has to handle not just the happy path but the thousand ways things can go wrong.

Devin, the AI software engineer from Cognition, showed impressive demos of building entire applications from scratch. Independent benchmarks from SWE-bench told a more nuanced story. It solved about 14% of real-world GitHub issues in early tests. That's actually impressive for 2025 technology, but it's a long way from "replaces a developer."

The 6-12 Month Horizon

Some capabilities are close but not quite there. Based on what I'm seeing in research papers and early beta products, here's what I expect to work by late 2026.

Reliable email management. Not just drafting replies but actually sending them, with enough context awareness to get the tone and content right 95%+ of the time. Google and Microsoft are both racing here.

Autonomous scheduling. Booking meetings by negotiating with other people's agents or assistants. Cal.com and Reclaim.ai are already partway there. The last mile is handling the weird edge cases: time zones, "let me check with my spouse," rescheduling chains.

Multi-app workflows. Tell an agent to "prepare the quarterly report" and have it pull data from Salesforce, create charts in Sheets, and draft a summary in Docs. Zapier's agent platform and OpenClaw's plugin ecosystem are building the plumbing. The pieces exist. The reliability doesn't. Yet.

The $650 Billion Question

Big Tech is spending $650 billion on AI infrastructure in 2026. That number, reported across multiple analyst estimates, is staggering. It's more than the GDP of Sweden.

Most of that is going toward training and running the foundation models that power agents. Nvidia can't ship GPUs fast enough. Microsoft, Google, Amazon, and Meta are all building custom silicon. The investment signals that these companies believe agents are the primary interface for the next decade of computing.

But infrastructure spending doesn't guarantee product-market fit. The dot-com bubble taught us that. You can build the most powerful servers in history, and it doesn't matter if the applications running on them aren't reliable enough for people to trust.

How to Think About Agents in Your Workflow

The practical framework I use is simple: start with tasks where failure is cheap and feedback is fast.

Let an agent monitor your industry news. Let it draft (not send) email replies. Let it clean up your data. Let it write first drafts of reports. These are all low-risk, high-value applications where you can verify the output before it matters.

Don't let an agent manage your calendar, reply to clients, or make purchases until you've watched it operate in draft mode for at least a few weeks. Trust is earned, even from software.
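Draft mode is cheap to bolt on yourself. A minimal sketch, with invented names: wrap any consequential action so the agent proposes instead of executes, and audit the log before granting real autonomy.

```python
# Sketch of "draft mode": wrap an action so the agent records what it
# *would* have done instead of doing it. Names are illustrative only.

def draft_mode(action):
    """Return a wrapper that logs proposed calls instead of executing them."""
    drafts = []
    def propose(*args, **kwargs):
        drafts.append((action.__name__, args, kwargs))
        return f"DRAFT: would run {action.__name__}{args}"
    propose.drafts = drafts  # the audit log you review each week
    return propose

def send_email(to, body):
    return "sent"  # the real, consequential action

send_email = draft_mode(send_email)  # agent gets the wrapper, not the real thing
print(send_email("client@example.com", "Here is our pricing..."))
# DRAFT: would run send_email('client@example.com', 'Here is our pricing...')
```

Once the drafts stop needing corrections, swap the wrapper out. Trust is earned, even from software.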

Staying Current Without Losing Your Mind

The pace of change in this space is genuinely unprecedented. In the time it took me to write this post, three new agent frameworks were probably launched. The paper that defines best practices today will be outdated in six months.

Staying on top of AI agent developments is a full-time job. That's what twixb does -- it monitors the sources that matter and surfaces the breakthroughs worth your attention, whether that's a new OpenClaw release, a research paper on agent reliability, or a startup quietly solving a problem everyone else is still talking about.

However you track this space, have a system. It moves too fast for bookmarks and memory. Set up alerts, subscribe to the right newsletters, or use a monitoring tool. Just don't rely on the algorithm to surface what matters.

Quick Reference

Works now: Code review, content monitoring, data extraction, first-draft generation.

Works with guardrails: Email drafting, research synthesis, simple purchasing with approval steps.

6-12 months out: Autonomous email, multi-app workflows, reliable scheduling.

Still hype: Fully autonomous purchasing, replacing knowledge workers, "one prompt to run your business."

The honest take: Agents are real, useful, and improving fast. They're also unreliable, overhyped, and occasionally dangerous. The people who will benefit most aren't the ones betting everything on full autonomy. They're the ones who figure out the right ratio of human judgment to machine execution, and keep adjusting it as the technology matures.
