Table of Contents
- The Pattern Behind the Failures
- Test Coverage: The Illusion of Completeness
- Hallucinated APIs and Phantom Methods
- Dependency Drift and Version Assumptions
- Review Heuristics That Actually Work
- The Four-Question Approval Framework
- What This Means for Your Workflow
The Pattern Behind the Failures
A senior engineer at a fintech startup let Claude write production code for 30 days. Three incidents made it past code review. None were syntax errors. All three failures had the same root cause: Claude wrote code that looked right but made assumptions about the system that weren't true.
The first bug shipped on day 12. Claude generated a data validation function that checked for null values but missed empty strings. The tests passed because Claude also wrote the tests, and they checked the same cases. The production database started filling with blank entries. Cost: four hours of data cleanup and a minor customer-facing bug.
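The incident write-up doesn't show the original validator, so the following is a hypothetical reconstruction of the failure mode rather than the team's code. It contrasts a null-only check with one that also rejects empty and whitespace-only strings; the function names and sample inputs are assumptions made for illustration.

```typescript
// Hypothetical reconstruction: the real validator is not shown in the source.
// Naive version: rejects null/undefined but lets "" and "   " through.
function isPresentNaive(value: string | null | undefined): boolean {
  return value !== null && value !== undefined;
}

// Stricter version: also rejects empty and whitespace-only strings.
function isPresent(value: string | null | undefined): boolean {
  return typeof value === "string" && value.trim().length > 0;
}

// Inputs a human reviewer might add after looking at real user data.
const samples: Array<string | null | undefined> = [null, undefined, "", "   ", "alice"];

const naiveAccepted = samples.filter((s) => isPresentNaive(s));
const strictAccepted = samples.filter((s) => isPresent(s));

console.log(naiveAccepted.length);  // 3: "", "   ", and "alice" all pass
console.log(strictAccepted.length); // 1: only "alice" passes
```

A test suite that only feeds `null` and a valid string to the naive version passes with full coverage, which is exactly how this class of bug ships.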
The second incident happened on day 18. Claude refactored an authentication middleware and confidently used a method that didn't exist in the team's installed version of Express.js. The method appeared in the documentation Claude had been trained on, but the codebase ran on an older version. The error surfaced in staging, caught by a manual QA check rather than by automated tests.
The third failure was subtler. Claude optimized a batch processing job by introducing parallel execution. The logic was sound. The code was clean. But the underlying database connection pool couldn't handle the concurrent load. Production throughput dropped 40% until someone noticed the connection timeout errors buried in the logs.
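One common guard against this kind of failure is to cap how many tasks run at once so that parallel work never exceeds the connection pool. The sketch below is a generic concurrency limiter, not the team's actual fix; `mapWithLimit` and `fakeQuery` are hypothetical names, and the right limit depends on a pool size you have to look up in your own configuration.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight at once.
// Keep `limit` at or below your connection pool size (an assumption to
// verify against your own pool configuration).
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results: R[] = new Array<R>(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until none remain.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Instrumented stand-in for a database call; records peak concurrency.
let inFlight = 0;
let peakInFlight = 0;
async function fakeQuery(id: number): Promise<number> {
  inFlight++;
  peakInFlight = Math.max(peakInFlight, inFlight);
  await Promise.resolve(); // yield, as a real query would while waiting on I/O
  inFlight--;
  return id * 2;
}

// Usage (inside an async function):
//   const doubled = await mapWithLimit([1, 2, 3, 4, 5, 6], 2, fakeQuery);
```

Results come back in input order, and `peakInFlight` never exceeds the limit, which is the property a reviewer would want to assert before letting a parallelized job near production.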
Analogy: Claude Code is like a chef who learned to cook by reading every recipe ever published but has never actually tasted the food. The techniques are correct, the ingredients are real, but the dish might not work in your specific kitchen with your specific equipment.
Test Coverage: The Illusion of Completeness
When Claude writes tests for its own code, you get high coverage numbers that mean almost nothing. The tests validate the code's internal logic, not whether the code solves the actual problem.
Here's what that looks like in practice:
| What Claude Tests | What Actually Breaks |
|---|---|
| Function returns expected output for valid input | Function fails on edge cases from real user data |
| Error handling catches defined exception types | Production throws exceptions not in the test suite |
| Integration test mocks external API correctly | Real API has rate limits and different error formats |
| Database queries return expected schema | Query performance degrades with production data volume |
The solution isn't to distrust all AI-generated tests. It's to add a layer of integration tests written by humans who understand your production environment. Claude can write the unit tests. You write the tests that verify assumptions about external systems, data volumes, and runtime constraints.
One engineering team adopted a simple rule: any PR with AI-generated code must include at least one integration test written by the submitting engineer. Not a test of the code's logic, but a test of the code's interaction with real production dependencies. This caught 60% of issues before they reached staging.
Hallucinated APIs and Phantom Methods
Claude doesn't always check your actual codebase before it suggests a solution. When it hasn't read the relevant files or the installed package versions, it predicts which methods and APIs probably exist based on patterns it learned during training. Sometimes those predictions are wrong.
Common hallucinations include:
- Using newer framework methods that don't exist in your installed version
- Assuming helper functions exist that were common in open source but aren't in your private codebase
- Invoking API endpoints with parameter structures that changed between documentation and implementation
- Referencing configuration keys that follow naming conventions but don't actually exist in your config files
The dangerous part: in dynamically typed code, or anywhere the types are loose or missing, these hallucinations pass linters as long as the signatures look plausible. They fail only at runtime.
One team solved this by maintaining a "known good patterns" file that Claude references. Before Claude writes integration code, they paste in actual snippets from their codebase showing how to call their authentication service, database layer, and third-party APIs. Claude then mimics those patterns instead of inventing its own.
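Alongside a known-good patterns file, a cheap startup or CI check can confirm that the methods generated code relies on actually exist on the installed client. This is a minimal sketch under that assumption; `missingMethods`, `installedClient`, and `getOrSet` are invented for illustration and do not refer to any real library.

```typescript
// Returns the names in `required` that are not callable on `target`.
// Property lookup follows the prototype chain, so class instances work too.
function missingMethods(target: object, required: readonly string[]): string[] {
  const record = target as Record<string, unknown>;
  return required.filter((name) => typeof record[name] !== "function");
}

// Hypothetical client standing in for whatever your installed version exposes.
const installedClient = {
  get: (key: string): string | null => (key === "k" ? "v" : null),
  set: (_key: string, _value: string): void => {},
};

// `getOrSet` is an invented method name representing something that only a
// newer version (or no version) of the library provides.
const missingOnClient = missingMethods(installedClient, ["get", "set", "getOrSet"]);

console.log(missingOnClient); // ["getOrSet"]
```

Failing the build when the returned list is non-empty turns a runtime surprise into a CI error.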
Dependency Drift and Version Assumptions
Claude was trained on documentation and code examples spanning many versions of the libraries it knows. When it generates code, it often blends patterns from different versions. The result compiles but behaves unexpectedly.
A real example: Claude wrote a Redis caching layer using async/await patterns introduced in ioredis v5, but the project used v4. The code looked modern and correct. It threw runtime errors in production because the older version returned promises differently.
Another team hit this with PostgreSQL query syntax. Claude used syntax that Postgres 14 supports but Postgres 12 does not, and their production database ran Postgres 12. The query worked in the development Docker container (which had been updated) but failed in production (which hadn't).
| Dependency Risk | Detection Method |
|---|---|
| Version-specific syntax | Run tests against production-mirrored environment |
| Deprecated method usage | Enable strict deprecation warnings in CI |
| New features not in old versions | Lock dev and prod dependency versions |
| Breaking changes between minor versions | Automated integration tests on actual prod stack |
The fix: before merging AI-generated code that touches external dependencies, verify it against your exact production versions. Not the latest docs. Not the dev container. The actual versions running in prod.
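One way to make that check mechanical is to compare the minimum version the new code needs against what production actually runs. The sketch below assumes you collect the production versions yourself (from a lockfile, a health endpoint, or the database); the `Requirement` shape and the `node` entry are illustrative, and the comparison handles only plain dotted numbers, not pre-release tags or ranges.

```typescript
// Compares dotted numeric versions such as "12.4" or "4.28.5".
// Returns -1, 0, or 1. Pre-release tags and ranges are out of scope here.
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  const length = Math.max(pa.length, pb.length);
  for (let i = 0; i < length; i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff < 0 ? -1 : 1;
  }
  return 0;
}

interface Requirement {
  name: string;
  minimum: string;    // lowest version the new code needs
  production: string; // version actually running in prod
}

// Lists every dependency whose production version is below the minimum.
function unmetRequirements(reqs: readonly Requirement[]): string[] {
  return reqs
    .filter((r) => compareVersions(r.production, r.minimum) < 0)
    .map((r) => `${r.name}: needs >= ${r.minimum}, prod has ${r.production}`);
}

// Illustrative values echoing the two incidents above, plus one passing entry.
const unmet = unmetRequirements([
  { name: "postgres", minimum: "14", production: "12" },
  { name: "ioredis", minimum: "5", production: "4" },
  { name: "node", minimum: "18", production: "20" },
]);

console.log(unmet);
// ["postgres: needs >= 14, prod has 12", "ioredis: needs >= 5, prod has 4"]
```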
Review Heuristics That Actually Work
Engineering leads need practical rules for approving AI-generated PRs without reading every line like it's a security audit. These heuristics catch 80% of issues:
First pass: the assumption audit
Scan for any line that interacts with external systems. Does the code assume the API returns data in a specific format? Does it assume the database supports a particular query feature? Does it assume the network is reliable? Flag every assumption. Verify each one against production reality.
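A format assumption can often be turned into a runtime check instead of a comment. The sketch below uses a hand-written type guard to verify a response before the rest of the code trusts it; the `UserResponse` shape is hypothetical and stands in for whatever your real API returns.

```typescript
// Hypothetical response shape; replace with what your real API returns.
interface UserResponse {
  id: string;
  email: string;
  roles: string[];
}

// Type guard: verifies the assumption instead of trusting a cast.
function isUserResponse(value: unknown): value is UserResponse {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const roles = v["roles"];
  return (
    typeof v["id"] === "string" &&
    typeof v["email"] === "string" &&
    Array.isArray(roles) &&
    roles.every((r: unknown) => typeof r === "string")
  );
}

// Parses a JSON body and fails loudly when the shape is not what was assumed.
function parseUser(body: string): UserResponse {
  const parsed: unknown = JSON.parse(body);
  if (!isUserResponse(parsed)) {
    throw new Error("Unexpected user response shape");
  }
  return parsed;
}
```

Calling `parseUser` on a body where `id` arrives as a number throws immediately, rather than letting a wrongly typed value travel deeper into the system.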
Second pass: the error path check
Claude is optimistic. It writes happy path code extremely well. It writes error handling that looks complete but often isn't. Check every try/catch block, every error callback, every null check. Ask: what happens if this external call times out? What if it returns malformed data? What if it succeeds but returns an empty result?
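Those three questions map onto three distinct outcomes that error handling should name explicitly. The following is a sketch, not a prescribed pattern: `withTimeout`, `safeLookup`, and the `LookupResult` type are hypothetical, and `fetchItems` stands in for any external call.

```typescript
// Builds an Error tagged by name so callers can recognize timeouts.
function timeoutError(ms: number): Error {
  const error = new Error(`timed out after ${ms} ms`);
  error.name = "TimeoutError";
  return error;
}

// Rejects with a TimeoutError if `promise` does not settle within `ms`.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(timeoutError(ms)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

type LookupResult =
  | { kind: "ok"; items: string[] }
  | { kind: "empty" }
  | { kind: "timeout" }
  | { kind: "malformed" };

// `fetchItems` is a hypothetical stand-in for any external call.
async function safeLookup(
  fetchItems: () => Promise<unknown>,
  ms: number,
): Promise<LookupResult> {
  let raw: unknown;
  try {
    raw = await withTimeout(fetchItems(), ms);
  } catch (error) {
    if (error instanceof Error && error.name === "TimeoutError") {
      return { kind: "timeout" };
    }
    throw error; // unknown failures propagate instead of being swallowed
  }
  if (!Array.isArray(raw) || !raw.every((x: unknown) => typeof x === "string")) {
    return { kind: "malformed" };
  }
  return raw.length === 0 ? { kind: "empty" } : { kind: "ok", items: raw };
}
```

Returning a tagged result forces the caller to decide what an empty or malformed response means, instead of treating every non-exception as success.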
Third pass: the performance sniff test
Look for loops that make network calls, queries without limits, and unbounded recursive operations. Claude understands algorithmic complexity but doesn't always consider production data scale. A function that works fine with 10 items might crash with 10,000.
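A loop that makes one network call per item is the most common version of this problem, and batching is one standard way to bound it. This is a minimal sketch assuming a bulk fetch exists; `fetchMany` is a hypothetical stand-in for a bulk endpoint or a `WHERE id IN (...)` query, and the batch size is something to tune against your own limits.

```typescript
// Splits `items` into arrays of at most `size` elements.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Loads records in bounded batches instead of one call per id.
// `fetchMany` is a hypothetical bulk endpoint or `WHERE id IN (...)` query.
async function loadAll<T>(
  ids: readonly string[],
  batchSize: number,
  fetchMany: (batch: string[]) => Promise<T[]>,
): Promise<T[]> {
  const loaded: T[] = [];
  for (const batch of chunk(ids, batchSize)) {
    loaded.push(...(await fetchMany(batch)));
  }
  return loaded;
}
```

With 10,000 ids and a batch size of 500, this makes 20 calls rather than 10,000, and no single call carries an unbounded payload.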
Fourth pass: the integration surface area
Count how many different systems the code touches. Every integration point is a potential failure mode. If Claude wrote code that touches your auth service, payment provider, and email queue in the same function, that's three failure modes to test. If it only touches your local database, that's one.
The Four-Question Approval Framework
Before approving any AI-generated PR, answer these four questions:
- Does this code make assumptions about external systems? If yes, have those assumptions been verified against current production state?
- Are the tests testing the right thing? Do they validate business logic or just code logic? Do they test integration points or just isolated functions?
- What's the blast radius if this fails? Does it affect a background job or a user-facing API? Can it be rolled back easily? Is there monitoring to detect failure?
- Has a human traced the actual execution path? Not read the code, but mentally stepped through what happens when this code runs with real production data and real production dependencies?
If you can't answer all four with confidence, the PR needs more work.
What This Means for Your Workflow
AI-generated code isn't inherently less reliable than human-written code. It's differently reliable. The failure modes are predictable once you know what to look for.
The teams seeing success with Claude Code in production share three practices:
- They treat AI-generated code as a first draft that requires technical review focused on assumptions, not syntax.
- They invest in integration test infrastructure that validates behavior against real production dependencies.
- They maintain clear boundaries around which types of changes AI can make autonomously and which require pair programming with a human.
The goal isn't to eliminate AI-generated code from your production systems. It's to merge it with the same confidence you have in human-written code. That confidence comes from verification, not faith.
Start with low-risk changes. Utility functions, test scaffolding, data transformations that don't touch external systems. Build your review process around actual incidents, not theoretical risks. Track what breaks and why. Adjust your approval heuristics based on real data from your codebase.
Claude Code will write a lot of your production code in the next year whether you plan for it or not. Engineers are already using it. The question isn't whether to allow it. The question is whether you'll have a coherent process for reviewing it before it ships.