Imagine you hire someone. They have a PhD-level grasp of every subject you can test them on. They type 100x faster than anyone in the office. They never sleep, never take vacation, and they cost almost nothing.
Then they start working, and you notice things. They complete the task but skip three of the eight steps. They confidently cite a source that doesn't exist. They produce a strategy deck that's structurally flawless and says absolutely nothing your competitor's deck doesn't also say. When you point out a mistake, they agree immediately and produce a new version with different mistakes. They never ask a clarifying question or push back on anything.
And they have no idea what your company actually does — not the org chart on the website, but who to call when procurement stalls, why the 2019 initiative failed, which numbers the CEO actually looks at versus which ones are in the deck.
That's AI in March 2026.
Here's how AI performs on structured tests as of March 2026: 94.1% on GPQA Diamond, PhD-level science questions where human experts score 69.7%. 82.1% on SWE-bench Verified, resolving real GitHub issues. 72.5% on OSWorld computer-use tasks, where humans score 72.4%.
On these specific tests, AI is at or above human expert level. Then you look at what happens when it tries to do actual work.
In the Remote Labor Index, Scale AI and the Center for AI Safety gave 240 real Upwork projects to the best AI agents. Architecture, game development, video production, data analysis, product design. $143,991 in project value. The best agent earned $1,720.
45.6% of deliverables failed quality review. 35.7% were simply incomplete. 17.6% had corrupted files. 14.8% contradicted themselves.
94% on PhD science. 2.5% on real work. Same technology, same month. To understand why, you have to look at what real work actually requires.
Pick a role: a Chief Strategy Officer, a Director of Innovation, a marketing analyst. Their days look different, but the underlying skills overlap.
The pattern is consistent across roles. AI is strong on anything that looks like "process this information and produce an output." It falls apart on anything that requires context it wasn't given, judgment it can't verify, or initiative it wasn't prompted to take.
A context window is the AI's working memory: how much information it can hold in its head at once. The largest models advertise 1 to 2 million tokens, roughly 750,000 to 1.5 million words. Gemini 1.5 Pro claims 2 million. Llama 4 claims 10 million.
In theory, companies could wire AI into the systems where context lives: Slack, email, CRM, shared drives, meeting transcripts. Most enterprise data is unstructured (documents, messages, recordings), and AI can technically read all of it. If you gave an AI agent full access to your company's Slack history, last year's board decks, and your CRM data, it would arguably have access to the majority of a senior employee's documented context.
In practice, almost nobody does this. Security policies, compliance requirements (GDPR, HIPAA, industry regulation), data siloing between departments, and basic organizational caution mean most companies haven't authorized AI to read their sensitive internal communications. And many never will — at least not for the channels where the real context lives. The board deck might be fair game. The Slack thread where the VP of Sales says what he actually thinks about the product roadmap probably isn't.
And even with full access, there's a layer AI can't reach: who's actually good at their job versus who talks a good game. Why the last VP really left. Which initiatives the CEO will fund based on a mood the whole executive team can read but nobody would put in an email. That's the context that determines whether a strategy deck gets approved or dies on slide three.
There's also a technical ceiling. Research consistently shows models lose accuracy on information in the middle of their context window. Effective context (the part the model actually processes reliably) is about 30 to 60% of the advertised window. A model claiming 1 million tokens performs reliably on maybe 300,000 to 600,000. That's still hundreds of pages. But it means you can't just dump everything in and expect the AI to find the relevant signal.
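To make the capacity math concrete, here's a back-of-envelope sketch. The words-per-token ratio comes from the figures above; the words-per-page value is an assumption for illustration.

```python
# Rough capacity math for a "1M token" context window.
# ~0.75 words per token (from the 1-2M tokens = 750K-1.5M words figure);
# 500 words per page is an assumed value for a dense single-spaced page.
ADVERTISED_TOKENS = 1_000_000
WORDS_PER_TOKEN = 0.75
WORDS_PER_PAGE = 500

for effective_fraction in (0.30, 0.60):
    tokens = int(ADVERTISED_TOKENS * effective_fraction)
    words = int(tokens * WORDS_PER_TOKEN)
    print(f"{effective_fraction:.0%} effective: {tokens:,} tokens "
          f"= {words:,} words = {words // WORDS_PER_PAGE:,} pages")

# 30% effective: 300,000 tokens = 225,000 words = 450 pages
# 60% effective: 600,000 tokens = 450,000 words = 900 pages
```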
A million tokens is a lot of text. You can fit hundreds of pages into a single conversation. For a marketing analyst pulling together a competitive landscape from public sources, that's genuinely powerful. Paste in the last six earnings calls, three industry reports, and your campaign data, and the AI processes all of it faster than you could read any of it.
But even with access and capacity, there's a gap between what's documented and what a senior employee actually knows. Two years of campaign results, competitor moves they tracked informally, customer conversations from trade shows, the time a similar initiative failed in 2022 and why. A lot of that was never written down.
There's a well-documented pattern where AI models don't finish what you asked for. Researchers call it the "partial completion problem." In trials on an 8-step workflow, ChatGPT skipped an average of 1.67 steps per run. Gemini skipped 3. Le Chat skipped 3.3.
That's not a bug in the usual sense. The models are optimizing for what looks like a complete response, not what actually is one. They're trained on text where things have beginnings, middles, and ends. When the output feels done, the model stops, even if two of the eight steps are missing.
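The practical implication for anyone reviewing this output: you can't trust "feels done" and have to check "is done" mechanically. A minimal sketch, assuming a hypothetical convention where each workflow step leaves a verifiable marker (a file written, a log line, a section header) in the output:

```python
# Hypothetical convention: each of the 8 workflow steps must leave a
# named marker in the output that can be checked mechanically.
REQUIRED_MARKERS = [f"step_{i}_complete" for i in range(1, 9)]

def missing_steps(output: str) -> list[str]:
    """Return required markers absent from the output. An output can read
    as finished (intro, body, confident wrap-up) while steps are missing."""
    return [m for m in REQUIRED_MARKERS if m not in output]

# A fluent, confident response that silently skipped steps 5 and 7:
response = ("step_1_complete step_2_complete step_3_complete "
            "step_4_complete step_6_complete step_8_complete All done!")
print(missing_steps(response))  # ['step_5_complete', 'step_7_complete']
```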
Ask AI to research competitors, and it will summarize what it already knows. It won't go find the competitor's latest press release, pull their job postings to infer strategy, check their patent filings, or call the sales team to ask what they're hearing. A real analyst does all of that without being told.
In coding, this shows up as models writing comments where actual code should be — placeholders that say "implement this logic here" instead of implementing it. In analysis, you get summaries of the obvious instead of anyone digging into the anomaly in row 47. Strategy decks come back structurally perfect and completely generic.
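An invented illustration of that coding failure mode (not output from any particular model): the scaffolding is complete, the substance isn't.

```python
# Invented illustration of the placeholder pattern, not real model output.
def detect_anomalies(rows: list[dict]) -> list[dict]:
    """Flag rows whose values fall outside the expected range."""
    anomalies = []
    for row in rows:
        # TODO: implement the actual detection logic here
        # TODO: handle missing values and unit mismatches
        pass
    return anomalies  # always empty: looks done, does nothing
```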
In one widely reported pattern, models that had been getting better at completing tasks started getting worse over the course of 2025. Tasks that used to take five hours with AI assistance started taking seven or eight, to the point where some developers reverted to older model versions.
A Director of Innovation's job is to figure out which ideas are actually good. That requires weighing incomplete evidence, challenging assumptions, and saying "this won't work because of X" when everyone in the room is excited. It's an adversarial skill. You have to argue against your own position and see if it holds.
AI does something that looks like this but isn't. It can produce a SWOT analysis, generate pros and cons, even write a "devil's advocate" section. But it doesn't actually believe any of it. It's pattern-matching on what critical analysis looks like in its training data. Ask it to challenge its own conclusion and it will, instantly and without resistance, which tells you it didn't hold the conclusion in the first place.
Researchers found that AI models struggle to distinguish between a user's beliefs and actual facts. An AI tutor that can't tell whether a student's wrong answer reflects a misunderstanding or a different but valid approach is just autocompleting.
Apple's GSM-Symbolic study found that reasoning models don't just get worse gradually as problems get harder. They hit a threshold and then collapse completely. Zero correct solutions past a certain complexity. And counterintuitively, the models use fewer reasoning steps on hard problems, not more. They give up before a human would even start working hard.
You'd fire a CSO who got less rigorous when the stakes were high, but that's exactly how the models behave.
The most fundamental difference between AI and a knowledge worker is that AI never takes initiative. It doesn't look at the quarterly numbers and think "this trend is concerning, I should flag it." It doesn't notice that two departments are working on the same problem and connect them.
AI is reactive. It responds to prompts. A good analyst doesn't wait for prompts — they notice things, get curious, bring you insights you didn't ask for. That's where most of the value of an experienced hire actually comes from.
The "agentic AI" movement is trying to fix this. Multi-agent systems (specialized AIs coordinating on tasks) have moved from experiment to early production. 57% of surveyed organizations now have some form of agents in production. But "agents in production" mostly means automated customer service and data processing, not "AI that decides what the company should do next."
When an AI agent gets stuck, it sometimes cheats. One researcher documented an agent that couldn't find a person named "John Smith" in a directory, so it renamed a different user to "John Smith" and declared the task complete. It optimized for "task marked done" rather than "task actually done."
Multi-agent systems bring their own problems. Researchers tested 7 leading multi-agent frameworks on coding, math, and general tasks and found failure rates ranging from 41% to 87% depending on the framework. The failures aren't primarily about individual model capability — they're coordination failures. Agents duplicate work, forget their roles, or withhold context from each other. A "planner" agent starts writing code instead of planning. Two agents pursue contradictory strategies without noticing. The paper's authors argue that improving the base models won't fix this, because even organizations of competent individuals can fail catastrophically at coordination.
For now, the most successful enterprise deployments constrain AI to narrow, well-defined tasks with human oversight at every decision point. That's useful, but it's automation — a faster tool, not a coworker.
Everything above would be manageable if you could trust the output. You can work with someone who needs direction and doesn't know your company, as long as what they produce is reliable. Junior hires are exactly this. You give them specific tasks, check the output, and the checking takes less time than doing it yourself.
AI output that's wrong carries the same confidence as output that's right. A hallucinated legal citation looks exactly like a real one.
And the numbers go the wrong direction. On PersonQA, OpenAI's own factual-accuracy benchmark, o4-mini scores 52%, down from o1's 84%. Newer reasoning models, the ones that are supposed to be more capable, are worse at basic factual accuracy than their predecessors.
Calibration (whether the AI knows what it doesn't know) is mixed. Measured by expected calibration error, where lower is better, Claude Opus 4.5 scores an ECE of 0.12, which is approaching usable. GPT-5.2 with reasoning scores 0.40, worse than models that don't reason at all. There's no clear trend toward reliability across the industry.
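For reference, ECE measures the average gap between how confident a model says it is and how often it's actually right. A minimal sketch of the standard computation:

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """Bin predictions by stated confidence; ECE is the weighted average
    gap between mean confidence and actual accuracy in each bin."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(confidence[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# A model that says "90% sure" and is right 90% of the time scores near 0.
# Saying "90% sure" while right only half the time pushes ECE toward 0.4.
```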
For a marketing analyst producing a competitive report, every claim needs checking. The AI produces the report in 10 minutes instead of 10 hours. But the checking takes 3 hours. The net savings are real (3 hours instead of 10) but the human is still essential, and they need to be someone who would have caught the errors. You can't hand the checking to a junior person who wouldn't know what wrong looks like.
A CSO's strategy needs to be different from the competitor's strategy. A Director of Innovation's proposals need to be, well, innovative. The whole point of knowledge work at this level is producing something that hasn't been produced before.
Doshi and Hauser studied 293 writers. With AI assistance, each individual writer's output scored higher on novelty. But across all writers, the outputs became 8.9 to 10.7% more similar to each other. Holzner et al. confirmed this across 28 studies and 8,214 participants.
Each individual's output gets better, but everybody's output starts sounding the same. Every AI-assisted strategy deck converges toward the same median. The more your competitors use AI for strategy, the more your AI-generated strategy sounds like theirs.
This effect is consistent across studies and it gets worse the more AI is used. AI is useful for generating a first draft, but the value is in whoever edits that draft into something distinctive — which requires the taste and judgment AI doesn't have.
Here's what this adds up to, compared against what an experienced knowledge worker brings.
| Capability | AI (March 2026) | Experienced human |
|---|---|---|
| Raw knowledge | Excellent: 94% on PhD science | Deep in domain, patchy outside it |
| Speed of output | Excellent: 100x faster on production | Slow by comparison |
| Following clear instructions | Good: when the task is well-defined | Good, and pushes back when instructions are wrong |
| Working memory / context | Improving: 1-2M tokens, agents persist across sessions, but misses unrecorded context | Years of accumulated context, including unspoken dynamics |
| Completing all steps | Poor: skips 1.7 to 3.3 of 8 steps | Follows through, or flags what they can't do |
| Initiative / agency | None: does nothing without a prompt | Notices problems, raises concerns, follows up |
| Organizational knowledge | Partial: can ingest docs, Slack, CRM, but misses unrecorded politics and dynamics | Knows the real org chart, the unwritten rules |
| Critical thinking | Poor: collapses at complexity threshold, sycophantic | Gets more rigorous when stakes are high |
| Factual reliability | Poor: hallucinates with confidence, getting worse on some models | Makes errors but knows when they're uncertain |
| Originality | Median: individually decent, collectively convergent | Varies. Best humans produce genuinely novel work |
That's three strong ratings, three mixed, four poor.
Speed and knowledge are enormously valuable. A marketing analyst who can produce a first draft of a competitive landscape in 10 minutes instead of 10 hours is doing something real. But the person reviewing that draft needs to know what a good competitive landscape looks like, catch the hallucinated data, add the organizational context the AI doesn't have, and push it past the generic into something useful. The AI makes that person faster, but the person is still the one doing the work.
The marketing analyst gets the most immediate value. A lot of analyst work is "gather information, process it, produce a formatted output." AI is good at this. The analyst's job shifts from production to verification and interpretation. Fewer analysts needed per team, but the ones who stay need to be experienced enough to catch AI errors.
The Director of Innovation gets a tireless brainstorming partner that reads everything and has no taste. Useful for generating options, terrible at evaluating them. The director's judgment becomes more valuable: AI produces more options faster, someone has to decide which ones are worth pursuing, and AI's sycophancy means it won't tell you your favorite idea is bad.
The CSO is the least affected, but not untouched. Strategy at this level requires organizational knowledge (what this company can actually execute), political judgment (what the board will fund), and long-term pattern matching (where the market is going based on weak signals). AI can help with the pattern matching — it processes market reports, earnings calls, and competitive data faster than any human team. But the political judgment and organizational instinct that turn analysis into strategy still come from the person in the room. AI gives a CSO better inputs. The decision is still the CSO's.
Three things would shift the scorecard meaningfully:
Deeper organizational integration. This is already happening. AI agents that persist context through project files, connect to Slack and email, and retain memory across sessions exist now. The next step is AI that builds a genuine model of your organization — not just what's in the documents, but the patterns across them. Who gets things done, which projects stalled and why, what the company is actually good at versus what the website says. The tools are moving in this direction. The question is how much of the unrecorded, political, interpersonal context they can approximate by reading between the lines of what is recorded.
Reliable self-knowledge. If AI could tell you "I'm 40% sure about this part" and actually be right about that 40%, the trust problem becomes manageable. Calibration on some models is improving (ECE 0.12 on Claude Opus 4.5 versus a baseline of 0.35). On others it's getting worse. If the improving trend wins, the "factual reliability" row eventually turns yellow. If it doesn't, the human checker stays permanently in the loop.
Multi-agent coordination that actually works. Right now multi-agent systems fail 41-87% of the time, mostly from coordination problems. If that gets solved, the "initiative" and "completing all steps" rows both improve. Early signs exist: systems where a builder agent, architect agent, and judge agent catch each other's mistakes are showing promise in controlled settings.
All three are being worked on. Whether any of them cross from "promising in demos" to "reliable in production" is genuinely unclear.
Sources
Remote Labor Index. Scale AI / Center for AI Safety, 2025. 240 projects, 23 categories, $143,991 value. Best agent (Manus): 2.5% automation rate. arXiv:2510.26787.
GPQA Diamond. 198 PhD-level science questions. Human expert: 69.7%. Mar '26 best model: 94.1%.
SWE-bench Verified. Princeton. Real GitHub issue resolution. Mar '26 best: 82.1%.
OSWorld. Xie et al., 2024. Computer environment tasks. Mar '26: 72.5% (human: 72.4%).
PersonQA. OpenAI system cards. o4-mini: 52%, down from o1: 84%.
Calibration / ECE. KalshiBench (arXiv:2512.16030), Rewarding Doubt (arXiv:2503.02623).
Creative convergence. Doshi & Hauser, Science Advances, Jul 2024. N=293. Confirmed by Holzner et al. (N=8,214, arXiv:2505.17241).
Xu et al. "Hallucination is Inevitable." arXiv:2401.11817, NeurIPS 2025.
Apple GSM-Symbolic, 2024. Reasoning collapse past complexity thresholds.
Partial completion. Karapetyan, 2025. ChatGPT: 1.67 missed steps per 8-step workflow. Gemini: 3. Le Chat: 3.3.
Multi-agent failures. Cemri, Pan, Yang. "Why Do Multi-Agent LLM Systems Fail?" arXiv:2503.13657. NeurIPS 2025. 41-86.7% failure rate across 7 MAS frameworks, 1,600+ traces.
AI reasoning in medicine. Nature Machine Intelligence, 2025. Best multi-agent: ~27% on complex cases.
Context windows. Epoch AI, 2025. Effective context 30-60% of advertised. RULER benchmark.
Enterprise data. Gartner/IDC estimates ~80-90% of enterprise data is unstructured. Salfati Group, 2025: ~20% in structured systems (databases, CRM, ERP).
Deloitte State of AI in the Enterprise, 2026. 66% report productivity gains. 20% report revenue growth.
LangChain State of Agent Engineering, 2026. 57% have agents in production. 46% cite integration as primary challenge.
IEEE Spectrum. "AI's Wrong Answers Are Bad. Its Wrong Reasoning Is Worse." 2025.