Matthew Aberham

AI Code Passes Tests. Then It Breaks Production.

April 14, 2026
AI · Security · Developer Tools


Dates covered: March 30 – April 2, 2026
Source: Wiz — Common Security Risks in Vibe-Coded Apps

Qodo just raised $70 million to solve a problem their CEO calls "AI slop": AI-generated code that passes tests but fails in production. A $70M Series B does not land on a category that did not exist 18 months ago unless the underlying pain is provably real.

In an analysis spanning 4.2 million developers, 27% of production code is now AI-generated. At that penetration, the quality of AI-written code is no longer a theoretical risk. It is operational reality, and the data has caught up.

The Wiz Study

Security researchers at Wiz analyzed 5,600 applications built primarily with vibe-coding tools (AI-generated code with minimal human structural review) and found:

  • 2,000+ security vulnerabilities.
  • 400+ exposed secrets.
  • 175 instances of exposed personally identifiable information.
  • One in five organizations building on vibe-coding platforms facing systemic security risk.

The failure patterns repeat across apps and languages:

  • Client-side authentication logic that any user can bypass by opening DevTools.
  • Hardcoded secrets, appearing in roughly 40% more AI-generated code than human-written code.
  • Incomplete access control, where the endpoint exists but the permission check never runs, or runs in the wrong order.
  • Row-level security policies never configured on new tables added via AI scaffolding.
  • XSS vulnerabilities in 86% of AI-generated code samples that handled user input in rendered output.
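The client-side auth pattern is worth seeing concretely. A minimal sketch, with hypothetical names (`renderAdminButton`, `handleDelete`) that stand in for any real codebase:

```typescript
// ANTI-PATTERN: the only "check" lives in the browser. Hiding the button
// does nothing; any user can call the endpoint directly from DevTools.
function renderAdminButton(user: { role: string }): string {
  return user.role === "admin" ? "<button>Delete</button>" : "";
}

// FIX: the server re-checks the role on every request, regardless of what
// the UI rendered. The browser check is UX; this check is security.
function handleDelete(req: { userRole: string }): { status: number } {
  if (req.userRole !== "admin") return { status: 403 }; // server-side gate
  return { status: 200 }; // the delete proceeds only after the check
}
```

The two functions look redundant, which is exactly why AI assistants (and juniors) stop after writing the first one.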

These are not exotic bugs. They are the same mistakes junior developers have made for twenty years. The difference is volume. AI agents reproduce them at scale because they optimize for working code, not secure code. That is not carelessness. It is exactly what they were trained to do.

Why Tests Pass Anyway

None of those failure modes show up in most CI pipelines. Unit tests do not catch client-side auth because the test calls the server directly. Integration tests do not catch hardcoded secrets because the secrets work in the test environment too. End-to-end tests do not catch missing RLS because the test user happens to be the owner of the row.

A passing test suite does not mean "this code is safe to ship." It means "this code does what the test said it should do." Those are not the same thing, and that gap is exactly where AI-generated code sits today.
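A tiny sketch makes the gap visible. The names are illustrative; the point is that the test asserts what the code does, never what it must refuse to do:

```typescript
// A handler with no ownership check: any caller can read any row.
type Row = { id: number; ownerId: number; body: string };
const db: Row[] = [{ id: 1, ownerId: 42, body: "private note" }];

function getRow(id: number): Row | undefined {
  return db.find((r) => r.id === id);
}

// The "test suite": it passes, because it only checks the happy path.
const result = getRow(1);
if (result?.body !== "private note") throw new Error("test failed");
// No assertion anywhere that a non-owner is rejected — that case
// was never written, so the missing access control is invisible to CI.
```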

Why Specialized Agents Beat General Review

Qodo's product runs six specialized agents in parallel on every pull request:

  1. Bug detection
  2. Security review
  3. Code quality
  4. Test coverage
  5. A judge agent that resolves conflicts between findings
  6. A recommendation agent that incorporates the team's PR history

On Martian's Code Review Bench (100 PRs, 580 known issues), the multi-agent system scored 64.3% F1, outperforming GitHub Copilot Code Review by 25 points. Specialized agents with narrow mandates catch more than one general agent asked to "review this code," because each stays on task.

Consider a Next.js API route generated by Claude Code: an endpoint that updates a content item. Four agents run in parallel on the PR.

  • The Security Agent flags that the permission check runs after the database fetch. The data is already loaded by the time the role check runs, so an unauthorized request gets partial processing before it is rejected.
  • The Bug Detection Agent flags that the catch block rethrows the raw API error without stripping it, leaking schema details to the client on failure.
  • The Recommendation Agent adds that this same auth-order pattern appeared in two earlier PRs from the same developer. This is probably a habit, not an isolated mistake.
  • The Judge Agent surfaces both findings and drops a duplicate style note that neither reviewer needed to flag twice.

One PR. Four agents. One minute of CI time. Three findings that a single "review this code" prompt would have missed.
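The fix the Security and Bug Detection agents are asking for looks something like this. A framework-agnostic sketch of the corrected handler logic — `updateContent`, `fetchItem`, and the role model are hypothetical stand-ins, not Qodo's or Anthropic's actual output:

```typescript
type Session = { userId: number; role: "admin" | "editor" | "viewer" };

async function updateContent(
  session: Session,
  itemId: number,
  patch: { title?: string },
  fetchItem: (id: number) => Promise<{ id: number }>,
): Promise<{ status: number; body: string }> {
  // Permission check runs BEFORE any data access, so an unauthorized
  // request never triggers a fetch or any partial processing.
  if (session.role !== "admin" && session.role !== "editor") {
    return { status: 403, body: "forbidden" };
  }
  try {
    const item = await fetchItem(itemId);
    return { status: 200, body: JSON.stringify({ id: item.id, ...patch }) };
  } catch {
    // Generic message only: the raw error, and any schema detail
    // embedded in it, stays server-side.
    return { status: 500, body: "internal error" };
  }
}
```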

The Pattern You Can Steal Without the Tool

You do not need to buy Qodo to apply the pattern. The underlying insight is that decomposition beats generalization, and you can replicate it today with any AI coding assistant.

Three separate focused Claude prompts on a PR will catch more than one general review prompt, because each prompt carries a single, narrow mandate:

  1. One prompt focused on auth and secret handling.
  2. One prompt focused on error paths and information leakage.
  3. One prompt focused on test coverage and edge cases.

Run them in parallel. Merge the findings. Deduplicate. Review as a human.
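The fan-out, merge, and dedupe steps can be sketched in a few lines. `callModel` is a hypothetical stand-in for whatever AI client you use; only the structure is the point:

```typescript
type Finding = { prompt: string; note: string };

async function reviewDiff(
  diff: string,
  callModel: (prompt: string, diff: string) => Promise<string[]>,
): Promise<Finding[]> {
  const prompts = [
    "Review ONLY auth and secret handling in this diff.",
    "Review ONLY error paths and information leakage in this diff.",
    "Review ONLY test coverage and edge cases in this diff.",
  ];
  // Fan out: three focused reviews in parallel, one narrow mandate each.
  const results = await Promise.all(prompts.map((p) => callModel(p, diff)));
  // Merge and deduplicate before a human ever sees the output.
  const seen = new Set<string>();
  const merged: Finding[] = [];
  results.forEach((notes, i) =>
    notes.forEach((note) => {
      if (!seen.has(note)) {
        seen.add(note);
        merged.push({ prompt: prompts[i], note });
      }
    }),
  );
  return merged;
}
```

The human review step stays outside the harness on purpose: the harness narrows and merges, it does not approve.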

In n8n or similar workflow tools, this is a parallel-branch pattern: each branch runs a specialized review prompt against the PR diff, and the branches merge into a summary node before a human reviewer sees the output. That is a harness, not a product.

A Concrete Checklist for AI-Heavy PRs

For any PR with more than roughly 30% AI-generated content, run these checks before merging:

  • Grep for hardcoded literals matching password, secret, key, token, api_key, bearer. For any literal string that matches, even in a test file, rotate the credential and move the value to a secrets manager before merging.
  • Trace every protected endpoint to verify the permission check is server-side and runs before any business logic or data access.
  • Verify RLS policies on new tables. If an AI scaffolding tool added a new database table, the row-level security policy almost certainly was not configured. Verify it before the table sees real traffic.
  • Test with a non-owner user. The most common access-control failure mode is a policy that only protects against complete strangers, not against other logged-in users.
  • Render user input through the full XSS surface. If the PR handles any user-provided content that ends up rendered in HTML, test with a known XSS payload in staging. The 86% number from Wiz suggests this is still the single highest-hit failure.
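The first checklist item is mechanical enough to script. A minimal sketch of a secret-literal scanner — the pattern list is illustrative and deliberately incomplete, not a substitute for a real secret scanner:

```typescript
// Flags lines where a credential-ish name is assigned a string literal.
const SECRET_PATTERN =
  /\b(password|secret|key|token|api_key|bearer)\b\s*[:=]\s*["'][^"']+["']/i;

// Returns the 1-based line numbers that match, for reviewer triage.
function findSecretLines(source: string): number[] {
  return source
    .split("\n")
    .map((line, i) => (SECRET_PATTERN.test(line) ? i + 1 : -1))
    .filter((n) => n !== -1);
}
```

Wire it into CI as a blocking check on AI-heavy PRs; a false positive costs a minute, a missed credential costs a rotation and an incident report.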

None of this is new. All of it is overlooked when AI-generated code arrives faster than reviewer capacity.

Where the Narrative Misses the Mechanism

AI writes code that passes the tests it was told to pass, in the same way a new contractor writes code that passes the tests they were told to pass. The difference is that the new contractor learns your security bar on PR number three. The AI does not learn; it produces the same shape of code on PR one and PR three hundred.

That puts the burden on the review layer. Either the harness around the AI encodes the security bar (parallel specialized reviews, guardrail checks, policy gates), or the security bar drifts down to what the AI produces by default.

$70M in funding says the market has recognized this. The teams that catch up early are the ones who treat AI-generated code the way we treat untrusted contractor code: useful, often good, and never merged without verification that covers more than "the tests passed."

Tests are necessary. They are not sufficient. They never were, and the teams that act on that in the next year (decomposed review, specialized agents, explicit security gates on AI-heavy PRs) are the ones whose codebases will not be on the next version of the Wiz list.

Matthew Aberham

Solutions Architect and Full-Stack Engineer at Perficient. Writing about AI developer tooling, infrastructure, and security.
