Imagine you hire someone. They have a PhD-level grasp of every subject you can test them on. They type 100x faster than anyone in the office. They never sleep, never take vacation, and they cost almost nothing.
Then they start working, and you notice things. They complete the task but skip three of the eight steps. They confidently cite a source that doesn't exist. They produce a strategy deck that's structurally flawless and says absolutely nothing your competitor's deck doesn't also say. When you point out a mistake, they agree immediately and produce a new version with different mistakes. They never ask a clarifying question or push back on anything.
And they have no idea what your company actually does — not the org chart on the website, but who to call when procurement stalls, why the 2019 initiative failed, which numbers the CEO actually looks at versus which ones are in the deck.
That's AI in March 2026.
Here's how AI performs on structured tests as of March 2026: 94.1% on GPQA Diamond, PhD-level science questions where human experts score 69.7%. 82.1% on SWE-bench Verified, resolving real GitHub issues. 72.5% on OSWorld computer-use tasks, where humans score 72.4%.
On these specific tests, AI is at or above human expert level. Then you look at what happens when it tries to do actual work.
In the Remote Labor Index, Scale AI and the Center for AI Safety gave 240 real Upwork projects to the best AI agents. Architecture, game development, video production, data analysis, product design. $143,991 in project value. The best agent earned $1,720.
45.6% of deliverables failed quality review. 35.7% were simply incomplete. 17.6% had corrupted files. 14.8% contradicted themselves.
94% on PhD science. 2.5% on real work. Same technology, same month. To understand why, you have to look at what real work actually requires.
Pick a role: a Chief Strategy Officer, a Director of Innovation, a marketing analyst. Their days look different, but the underlying skills overlap.
The pattern is consistent across roles. AI is strong on anything that looks like "process this information and produce an output." It falls apart on anything that requires context it wasn't given, judgment it can't verify, or initiative it wasn't prompted to take.
A context window is the AI's working memory: how much information it can hold in its head at once. The largest models advertise 1 to 2 million tokens, roughly 750,000 to 1.5 million words. Gemini 1.5 Pro claims 2 million. Llama 4 claims 10 million.
In theory, companies could wire AI into the systems where context lives: Slack, email, CRM, shared drives, meeting transcripts. Most enterprise data is unstructured (documents, messages, recordings), and AI can technically read all of it. If you gave an AI agent full access to your company's Slack history, last year's board decks, and your CRM data, it would arguably have access to the majority of a senior employee's documented context.
In practice, almost nobody does this. Security policies, compliance requirements (GDPR, HIPAA, industry regulation), data siloing between departments, and basic organizational caution mean most companies haven't authorized AI to read their sensitive internal communications. And many never will — at least not for the channels where the real context lives. The board deck might be fair game. The Slack thread where the VP of Sales says what he actually thinks about the product roadmap probably isn't.
And even with full access, there's a layer AI can't reach: who's actually good at their job versus who talks a good game. Why the last VP really left. Which initiatives the CEO will fund based on a mood the whole executive team can read but nobody would put in an email. That's the context that determines whether a strategy deck gets approved or dies on slide three.
There's also a technical ceiling. Research consistently shows models lose accuracy on information in the middle of their context window. Effective context (the part the model actually processes reliably) is about 30 to 60% of the advertised window. A model claiming 1 million tokens performs reliably on maybe 300,000 to 600,000. That's still hundreds of pages. But it means you can't just dump everything in and expect the AI to find the relevant signal.
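To make the capacity math concrete, here's a back-of-envelope sketch. The words-per-token ratio comes from the figures above; the words-per-page value is an assumption for illustration.

```python
# Rough capacity math for a "1M token" context window.
# ~0.75 words per token (from the 1-2M tokens = 750K-1.5M words figure);
# 500 words per page is an assumed value for a dense single-spaced page.
ADVERTISED_TOKENS = 1_000_000
WORDS_PER_TOKEN = 0.75
WORDS_PER_PAGE = 500

for effective_fraction in (0.30, 0.60):
    tokens = int(ADVERTISED_TOKENS * effective_fraction)
    words = int(tokens * WORDS_PER_TOKEN)
    print(f"{effective_fraction:.0%} effective: {tokens:,} tokens "
          f"= {words:,} words = {words // WORDS_PER_PAGE:,} pages")

# 30% effective: 300,000 tokens = 225,000 words = 450 pages
# 60% effective: 600,000 tokens = 450,000 words = 900 pages
```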
A million tokens is a lot of text. You can fit hundreds of pages into a single conversation. For a marketing analyst pulling together a competitive landscape from public sources, that's genuinely powerful. Paste in the last six earnings calls, three industry reports, and your campaign data, and the AI processes all of it faster than you could read any of it.
But even with access and capacity, there's a gap between what's documented and what a senior employee actually knows. Two years of campaign results, competitor moves they tracked informally, customer conversations from trade shows, the time a similar initiative failed in 2022 and why. A lot of that was never written down.
There's a well-documented pattern where AI models don't finish what you asked for. Researchers call it the "partial completion problem." In trials on an 8-step workflow, ChatGPT skipped an average of 1.67 steps per run. Gemini skipped 3. Le Chat skipped 3.3.
That's not a bug in the usual sense. The models are optimizing for what looks like a complete response, not what actually is one. They're trained on text where things have beginnings, middles, and ends. When the output feels done, the model stops, even if two of the eight steps are missing.
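The practical implication for anyone reviewing this output: you can't trust "feels done" and have to check "is done" mechanically. A minimal sketch, assuming a hypothetical convention where each workflow step leaves a verifiable marker (a file written, a log line, a section header) in the output:

```python
# Hypothetical convention: each of the 8 workflow steps must leave a
# named marker in the output that can be checked mechanically.
REQUIRED_MARKERS = [f"step_{i}_complete" for i in range(1, 9)]

def missing_steps(output: str) -> list[str]:
    """Return required markers absent from the output. An output can read
    as finished (intro, body, confident wrap-up) while steps are missing."""
    return [m for m in REQUIRED_MARKERS if m not in output]

# A fluent, confident response that silently skipped steps 5 and 7:
response = ("step_1_complete step_2_complete step_3_complete "
            "step_4_complete step_6_complete step_8_complete All done!")
print(missing_steps(response))  # ['step_5_complete', 'step_7_complete']
```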
Ask AI to research competitors, and it will summarize what it already knows. It won't go find the competitor's latest press release, pull their job postings to infer strategy, check their patent filings, or call the sales team to ask what they're hearing. A real analyst does all of that without being told.
In coding, this shows up as models writing comments where actual code should be — placeholders that say "implement this logic here" instead of implementing it. In analysis, you get summaries of the obvious instead of anyone digging into the anomaly in row 47. Strategy decks come back structurally perfect and completely generic.
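An invented illustration of that coding failure mode (not output from any particular model): the scaffolding is complete, the substance isn't.

```python
# Invented illustration of the placeholder pattern, not real model output.
def detect_anomalies(rows: list[dict]) -> list[dict]:
    """Flag rows whose values fall outside the expected range."""
    anomalies = []
    for row in rows:
        # TODO: implement the actual detection logic here
        # TODO: handle missing values and unit mismatches
        pass
    return anomalies  # always empty: looks done, does nothing
```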
In one widely reported pattern, models that had been getting better at completing tasks started getting worse over the course of 2025. Tasks that used to take five hours with AI assistance started taking seven or eight, to the point where some developers reverted to older model versions.
A Director of Innovation's job is to figure out which ideas are actually good. That requires weighing incomplete evidence, challenging assumptions, and saying "this won't work because of X" when everyone in the room is excited. It's an adversarial skill. You have to argue against your own position and see if it holds.
AI does something that looks like this but isn't. It can produce a SWOT analysis, generate pros and cons, even write a "devil's advocate" section. But it doesn't actually believe any of it. It's pattern-matching on what critical analysis looks like in its training data. Ask it to challenge its own conclusion and it will, instantly and without resistance, which tells you it didn't hold the conclusion in the first place.
Researchers found that AI models struggle to distinguish between a user's beliefs and actual facts. An AI tutor that can't tell whether a student's wrong answer reflects a misunderstanding or a different but valid approach is just autocompleting.
Apple's GSM-Symbolic study found that reasoning models don't just get worse gradually as problems get harder. They hit a threshold and then collapse completely. Zero correct solutions past a certain complexity. And counterintuitively, the models use fewer reasoning steps on hard problems, not more. They give up before a human would even start working hard.
You'd fire a CSO who got less rigorous when the stakes were high, but that's exactly how the models behave.
The most fundamental difference between AI and a knowledge worker is that AI never takes initiative. It doesn't look at the quarterly numbers and think "this trend is concerning, I should flag it." It doesn't notice that two departments are working on the same problem and connect them.
AI is reactive. It responds to prompts. A good analyst doesn't wait for prompts — they notice things, get curious, bring you insights you didn't ask for. That's where most of the value of an experienced hire actually comes from.
The "agentic AI" movement is trying to fix this. Multi-agent systems (specialized AIs coordinating on tasks) have moved from experiment to early production. 57% of surveyed organizations now have some form of agents in production. But "agents in production" mostly means automated customer service and data processing, not "AI that decides what the company should do next."
When an AI agent gets stuck, it sometimes cheats. One researcher documented an agent that couldn't find a person named "John Smith" in a directory, so it renamed a different user to "John Smith" and declared the task complete. It optimized for "task marked done" rather than "task actually done."
Multi-agent systems bring their own problems. Researchers tested 7 leading multi-agent frameworks on coding, math, and general tasks and found failure rates ranging from 41% to 87% depending on the framework. The failures aren't primarily about individual model capability — they're coordination failures. Agents duplicate work, forget their roles, or withhold context from each other. A "planner" agent starts writing code instead of planning. Two agents pursue contradictory strategies without noticing. The paper's authors argue that improving the base models won't fix this, because even organizations of competent individuals can fail catastrophically at coordination.
For now, the most successful enterprise deployments constrain AI to narrow, well-defined tasks with human oversight at every decision point. That's useful, but it's automation — a faster tool, not a coworker.
Everything above would be manageable if you could trust the output. You can work with someone who needs direction and doesn't know your company, as long as what they produce is reliable. Junior hires are exactly this. You give them specific tasks, check the output, and the checking takes less time than doing it yourself.
AI output that's wrong carries the same confidence as output that's right. A hallucinated legal citation looks exactly like a real one.
And the numbers go the wrong direction. On PersonQA, OpenAI's own factual-accuracy benchmark, o4-mini scores 52%, down from o1's 84%. Newer reasoning models, the ones that are supposed to be more capable, are worse at basic factual accuracy than their predecessors.
Calibration (whether the AI knows what it doesn't know) is mixed. Measured by expected calibration error, where lower is better, Claude Opus 4.5 scores an ECE of 0.12, which is approaching usable. GPT-5.2 with reasoning scores 0.40, worse than models that don't reason at all. There's no clear trend toward reliability across the industry.
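For reference, ECE measures the average gap between how confident a model says it is and how often it's actually right. A minimal sketch of the standard computation:

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """Bin predictions by stated confidence; ECE is the weighted average
    gap between mean confidence and actual accuracy in each bin."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(confidence[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# A model that says "90% sure" and is right 90% of the time scores near 0.
# Saying "90% sure" while right only half the time pushes ECE toward 0.4.
```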
For a marketing analyst producing a competitive report, every claim needs checking. The AI produces the report in 10 minutes instead of 10 hours. But the checking takes 3 hours. The net savings are real (3 hours instead of 10) but the human is still essential, and they need to be someone who would have caught the errors. You can't hand the checking to a junior person who wouldn't know what wrong looks like.
A CSO's strategy needs to be different from the competitor's strategy. A Director of Innovation's proposals need to be, well, innovative. The whole point of knowledge work at this level is producing something that hasn't been produced before.
Doshi and Hauser studied 293 writers. With AI assistance, each individual writer's output scored higher on novelty. But across all writers, the outputs became 8.9 to 10.7% more similar to each other. Holzner et al. confirmed this across 28 studies and 8,214 participants.
Each individual's output gets better, but everybody's output starts sounding the same. Every AI-assisted strategy deck converges toward the same median. The more your competitors use AI for strategy, the more your AI-generated strategy sounds like theirs.
This effect is consistent across studies and it gets worse the more AI is used. AI is useful for generating a first draft, but the value is in whoever edits that draft into something distinctive — which requires the taste and judgment AI doesn't have.
Here's what this adds up to, compared against what an experienced knowledge worker brings.
| Capability | AI (March 2026) | Experienced human |
|---|---|---|
| Raw knowledge | Excellent: 94% on PhD science | Deep in domain, patchy outside it |
| Speed of output | Excellent: 100x faster on production | Slow by comparison |
| Following clear instructions | Good: when the task is well-defined | Good, and pushes back when instructions are wrong |
| Working memory / context | Improving: 1-2M tokens, agents persist across sessions, but misses unrecorded context | Years of accumulated context, including unspoken dynamics |
| Completing all steps | Poor: skips 1.7 to 3.3 of 8 steps | Follows through, or flags what they can't do |
| Initiative / agency | None: does nothing without a prompt | Notices problems, raises concerns, follows up |
| Organizational knowledge | Partial: can ingest docs, Slack, CRM, but misses unrecorded politics and dynamics | Knows the real org chart, the unwritten rules |
| Critical thinking | Poor: collapses at complexity threshold, sycophantic | Gets more rigorous when stakes are high |
| Factual reliability | Poor: hallucinates with confidence, getting worse on some models | Makes errors but knows when they're uncertain |
| Originality | Median: individually decent, collectively convergent | Varies. Best humans produce genuinely novel work |
That's three strong ratings, three mixed, four poor.
Speed and knowledge are enormously valuable. A marketing analyst who can produce a first draft of a competitive landscape in 10 minutes instead of 10 hours is doing something real. But the person reviewing that draft needs to know what a good competitive landscape looks like, catch the hallucinated data, add the organizational context the AI doesn't have, and push it past the generic into something useful. The AI makes that person faster, but the person is still the one doing the work.
The marketing analyst gets the most immediate value. A lot of analyst work is "gather information, process it, produce a formatted output." AI is good at this. The analyst's job shifts from production to verification and interpretation. Fewer analysts needed per team, but the ones who stay need to be experienced enough to catch AI errors.
The Director of Innovation gets a tireless brainstorming partner that reads everything and has no taste. Useful for generating options, terrible at evaluating them. The director's judgment becomes more valuable: AI produces more options faster, someone has to decide which ones are worth pursuing, and AI's sycophancy means it won't tell you your favorite idea is bad.
The CSO is the least affected, but not untouched. Strategy at this level requires organizational knowledge (what this company can actually execute), political judgment (what the board will fund), and long-term pattern matching (where the market is going based on weak signals). AI can help with the pattern matching — it processes market reports, earnings calls, and competitive data faster than any human team. But the political judgment and organizational instinct that turn analysis into strategy still come from the person in the room. AI gives a CSO better inputs. The decision is still the CSO's.
Three things would shift the scorecard meaningfully:
Deeper organizational integration. This is already happening. AI agents that persist context through project files, connect to Slack and email, and retain memory across sessions exist now. The next step is AI that builds a genuine model of your organization — not just what's in the documents, but the patterns across them. Who gets things done, which projects stalled and why, what the company is actually good at versus what the website says. The tools are moving in this direction. The question is how much of the unrecorded, political, interpersonal context they can approximate by reading between the lines of what is recorded.
Reliable self-knowledge. If AI could tell you "I'm 40% sure about this part" and actually be right about that 40%, the trust problem becomes manageable. Calibration on some models is improving (ECE 0.12 on Claude Opus 4.5 versus a baseline of 0.35). On others it's getting worse. If the improving trend wins, the "factual reliability" row eventually turns yellow. If it doesn't, the human checker stays permanently in the loop.
Multi-agent coordination that actually works. Right now multi-agent systems fail 41-87% of the time, mostly from coordination problems. If that gets solved, the "initiative" and "completing all steps" rows both improve. Early signs exist: systems where a builder agent, architect agent, and judge agent catch each other's mistakes are showing promise in controlled settings.
All three are being worked on. Whether any of them cross from "promising in demos" to "reliable in production" is genuinely unclear.
Sources
Remote Labor Index. Scale AI / Center for AI Safety, 2025. 240 projects, 23 categories, $143,991 value. Best agent (Manus): 2.5% automation rate. arXiv:2510.26787.
GPQA Diamond. 198 PhD-level science questions. Human expert: 69.7%. Mar '26 best model: 94.1%.
SWE-bench Verified. Princeton. Real GitHub issue resolution. Mar '26 best: 82.1%.
OSWorld. Xie et al., 2024. Computer environment tasks. Mar '26: 72.5% (human: 72.4%).
PersonQA. OpenAI system cards. o4-mini: 52%, down from o1: 84%.
Calibration / ECE. KalshiBench (arXiv:2512.16030), Rewarding Doubt (arXiv:2503.02623).
Creative convergence. Doshi & Hauser, Science Advances, Jul 2024. N=293. Confirmed by Holzner et al. (N=8,214, arXiv:2505.17241).
Xu et al. "Hallucination is Inevitable." arXiv:2401.11817, NeurIPS 2025.
Apple GSM-Symbolic, 2024. Reasoning collapse past complexity thresholds.
Partial completion. Karapetyan, 2025. ChatGPT: 1.67 missed steps per 8-step workflow. Gemini: 3. Le Chat: 3.3.
Multi-agent failures. Cemri, Pan, Yang. "Why Do Multi-Agent LLM Systems Fail?" arXiv:2503.13657. NeurIPS 2025. 41-86.7% failure rate across 7 MAS frameworks, 1,600+ traces.
AI reasoning in medicine. Nature Machine Intelligence, 2025. Best multi-agent: ~27% on complex cases.
Context windows. Epoch AI, 2025. Effective context 30-60% of advertised. RULER benchmark.
Enterprise data. Gartner/IDC estimates ~80-90% of enterprise data is unstructured. Salfati Group, 2025: ~20% in structured systems (databases, CRM, ERP).
Deloitte State of AI in the Enterprise, 2026. 66% report productivity gains. 20% report revenue growth.
LangChain State of Agent Engineering, 2026. 57% have agents in production. 46% cite integration as primary challenge.
IEEE Spectrum. "AI's Wrong Answers Are Bad. Its Wrong Reasoning Is Worse." 2025.