Manwe 19 Apr 2026

If AI-generated code passes its tests but nobody fully understands it, should it be allowed into production?

No. Shipping it solely because the tests pass is not acceptable. Green tests only prove the code satisfies the checks someone knew to write; the evidence repeatedly shows that AI-generated code can still hide vulnerabilities, exhibit unbounded resource behavior, carry unclear failure paths, and set maintenance traps. Release it only when an accountable engineer can explain its safety claims, bound its blast radius, observe its failures, and disable, roll back, or replace it under pressure.

Generated with GPT-5.4 · 62% overall confidence · 5 advisors · 5 rounds
By 31 Dec 2026, most security-mature software organizations that publish AI coding policies will require an accountable human code owner or reviewer for AI-generated code in production, rather than allowing deployment merely because tests pass. 78%
By 19 Apr 2027, at least one public engineering post-mortem will name AI-generated or AI-assisted code that passed automated tests but was not fully understood by its maintainers as a contributing factor in a production outage, security breach, or emergency rollback. 72%
By 19 Apr 2028, teams that introduce opaque AI-generated code into production only behind feature flags, rollback paths, or controlled internal tools will report fewer severe incidents than teams that push such code into core customer-facing paths on test success alone. 64%
  1. Today, halt any production deployment of AI-generated code unless it is isolated, rollback-capable, and owned. Say it plainly: "We are not shipping this code just because the tests pass. Before production, an accountable engineer must explain the failure modes, resource limits, observability, rollback path, and kill switch."
  2. Within 24 hours, classify the component into one of three buckets: core product logic, supporting logic that touches customers or data, or low-risk replaceable tooling. If it is core logic and nobody can predict its side effects, rewrite it or keep it out of production. If someone argues the tests are sufficient, respond: "Show me the load behavior, the retry behavior, the data-mutation paths, and the rollback command. If you can't, the tests are not sufficient."
  3. Before the end of the week, require a one-page operating contract ahead of any limited release: owner, purpose, inputs, outputs, data touched, maximum expected cardinality, database/API calls per request, timeout limits, retry limits, dashboards, alerts, kill switch, rollback command, and an expiry date. If the owner resists documenting it, reframe: "This isn't paperwork. This is the page the on-call engineer uses at 3 a.m."
  4. Before any production deployment, run a rehearsal incident using only production-like dashboards, logs, traces, and runbooks. Trigger high cardinality, dependency timeouts, duplicate messages, malformed input, and partial database failures. If the owner can't tell whether the code retried, dropped, duplicated, or corrupted work, block the release.
  5. If the code is low-risk and still worth shipping, release it this week only behind a feature flag or a narrow canary: cap traffic, cap resource usage, add alerts, record the decision, define automatic rollback thresholds, and plan to remove it or finish the comprehension work within 30 days. State it explicitly: "This is a controlled exception, not a precedent. It expires unless we either understand it or replace it."
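The one-page operating contract in step 3 can be sketched as a small structured record with a release gate that blocks on any missing field. This is a minimal sketch under assumptions: every field name here is illustrative, not a standard, and `is_release_ready` stands in for whatever review workflow your team actually uses.

```python
from dataclasses import dataclass, fields

@dataclass
class OperationalContract:
    """One-page operating contract for an AI-generated component.

    Field names mirror step 3 above; adapt them to your own runbook template.
    """
    owner: str                      # accountable engineer
    purpose: str
    inputs: str
    outputs: str
    data_touched: str
    max_expected_cardinality: int   # upper bound on items handled per run
    db_api_calls_per_request: int
    timeout_seconds: float
    retry_limit: int
    dashboard_url: str
    alert_rule: str
    kill_switch: str                # command or flag that disables the component
    rollback_command: str
    expiry_date: str                # ISO date; the exception lapses after this

def is_release_ready(contract: OperationalContract) -> list[str]:
    """Return the names of empty or unbounded fields; release only if empty."""
    missing = []
    for f in fields(contract):
        value = getattr(contract, f.name)
        if value in ("", None) or (isinstance(value, (int, float)) and value <= 0):
            missing.append(f.name)
    return missing
```

The point of the gate is that a blank kill switch or an unbounded cardinality is a release blocker, not a TODO: the contract is only useful if an incomplete one cannot ship.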
Round 1

The advisors agree that passing tests alone is not sufficient to treat AI-generated code as production-safe. For practitioners, the bar should include human understanding, clear ownership of failure, and risk controls such as blast-radius limits, auditability, and rollback paths, plus targeted review of the specific ways the code could harm the system.

The Contrarian REFRAME
The wrong part of this question is treating 'passes tests' like it means 'known safe.' It means the code survived the checks somebody thought to write, and then it gets wired into a live system where real users, data, permissions, failures, and future maintenance hit it. If no human can explain the behavior and risks, production isn't the test environment anymore; it's the discovery tool. I wouldn't ban AI-generated code, but I would ban deploying mystery code just because the dashboard turned green.
Jaya Thakur REFRAME
The Contrarian, I agree with the green dashboard point, but I'd shift the approval test from 'does it pass' to 'who can take custody of the failure.' In avionics verification, a module nobody could bound or explain was not treated as clever automation; it was treated as an unowned hazard. AI-generated code might deserve a controlled staging run, but production needs a human maintainer who can diagnose it under pressure without asking the generator to guess again.
Noriko Nakamura SHARE
Jaya, custody is necessary, but it's not enough when the code can damage state faster than a human can reason under incident pressure. In my experience, the scary AI-generated stuff isn't just wrong output; it's cleanup jobs, migrations, retry loops, and permission changes that pass tests and then quietly corrupt production data. Before I let that near live systems, I want a kill switch, audit trail, restore point, and a narrow blast radius, not just a named engineer holding the pager.
The Auditor FACTCHECK
Noriko, that control set is defensible, but it doesn't fully solve the premise if nobody understands the code. The evidence supports that AI-generated code can pass tests while still carrying hidden flaws, including vulnerabilities, and it also says oversight should match the system's risk and context. So the stronger rule is not 'green tests plus rollback,' it's 'green tests plus genuine human review for the specific failure modes this code can create.'
Round 2

The advisors argue that reviewing AI-generated code must go beyond verifying its current behavior before release. They stress maintainability, fault injection under realistic failure modes, explicit behavioral contracts, and treating opaque AI-generated components more like third-party dependencies, with provenance, versioning, and a replacement plan.

Jaya Thakur REFRAME
The Auditor, I'd widen 'review' beyond the release decision. In avionics work, code that nobody can safely modify six months later is already a hazard, even if today's behavior is boxed in. For AI-generated code, I'd require a maintainer to demonstrate a small, realistic change and explain the expected side effects before production approval, because mystery code becomes most dangerous during the next patch, not the first deploy.
Noriko Nakamura REFRAME
Jaya, the six-month patch test is useful, but I'd add a nastier gate: make the maintainer run it through a bad day. For AI code touching queues, retries, auth, caching, migrations, or money movement, I want a fault-injection rehearsal with duplicate events, delayed dependencies, partial writes, clock skew, and the kill switch exercised for real. In my experience, mystery code usually doesn't fail inside the tidy function; it fails at the boundary where two systems each think the other one behaved.
The Auditor REFRAME
Noriko, the fault-injection gate is useful, but it still treats the code as a black box that survives rehearsed chaos. The missing control is a written behavioral contract: invariants, forbidden side effects, data ownership, and the exact conditions under which the code may retry, mutate, or stop. Without that, passing tests and even passing a bad-day drill only proves it handled the scenarios someone imagined, while the evidence here already says AI-generated code can hide flaws and vulnerabilities outside test coverage.
The Contrarian REFRAME
The Auditor, the contract helps, but it still treats mystery code like it's normal source code with extra paperwork. In my experience, the closer comparison is a third-party black box: they need provenance, version pinning, an upgrade path, and a replacement plan. If this AI blob is buried inside core business logic and nobody can swap it out, then when a rule changes, a person opens the file, can't predict side effects, and starts rewriting production logic under pressure. That's the real failure path, not just the first deploy.
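Noriko's "bad day" rehearsal from this round can be approximated with a minimal fault-injection harness. This is a sketch under assumptions: `run_with_faults`, the fault probabilities, and the dedup handler are hypothetical illustrations of the technique, not a real chaos-testing API.

```python
import random

def run_with_faults(handler, messages, seed=0):
    """Deliver messages to `handler` while injecting two of the boundary
    faults named above: duplicate delivery and dependency timeouts.
    Returns (results, errors); a survivable handler must be idempotent."""
    rng = random.Random(seed)
    results, errors = [], []
    for msg in messages:
        deliveries = 2 if rng.random() < 0.3 else 1   # at-least-once delivery
        for _ in range(deliveries):
            if rng.random() < 0.2:                    # simulated dependency timeout
                errors.append((msg["id"], "dependency timeout"))
                continue
            results.append(handler(msg))
    return results, errors

# A handler that dedupes by message id survives duplicate delivery.
seen = set()
def idempotent_handler(msg):
    if msg["id"] in seen:
        return ("skipped", msg["id"])
    seen.add(msg["id"])
    return ("processed", msg["id"])
```

The useful output is not whether the handler ran, but whether the rehearsal shows each message processed at most once and every timeout surfaced rather than swallowed, which is exactly the boundary behavior a green unit-test suite rarely exercises.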
Round 3

The advisors agree that opaque AI-generated code should not be considered safe merely because it passes tests or review ceremonies. They argue it should ship only behind narrow, replaceable boundaries, with clear ownership, provenance, an operational contract, observability, and a concrete plan to understand or replace the code.

The Auditor WARN
Jaya, Noriko, and The Contrarian, the trap in this consensus is turning every concern into another gate and then pretending the code is understood because it survived the ceremony. The briefing distinguishes nominal review from genuine review, and mystery code that nobody can explain is exactly where nominal review becomes a liability. The right course is to allow it only behind a narrow, replaceable boundary with an accountable owner and a written operational contract; if it sits in core logic and nobody can predict its side effects, it should be rewritten or withheld from production. Bottom line: I believe the failure won't be the first green deploy, it'll be the first urgent incident where the team can't tell whether a patch fixes the bug or changes the product's rules.
Dr. James Kowalski REFRAME
The Auditor, I'd move the decision point upstream: if the team can't reconstruct where the code came from, what prompt or model produced it, and which human accepted each assumption, then production review has already failed. In safeguards work, we called this continuity of knowledge; once the camera feed or seal history breaks, a later clean inventory doesn't restore confidence. For AI code, green tests are the clean inventory, but provenance is the seal history. Bottom line: I believe teams will be burned less by mysterious syntax than by not knowing which hidden assumption they're now accountable for.
Jaya Thakur DOUBT
The Auditor, I agree with the narrow-boundary rule, but I don't like how easily that becomes a permanent exception with a polite label on it. In my experience, once a team ships one opaque component, the next one feels less shocking, review standards slide, and architectural knowledge starts living in tests instead of people. I'd require an expiry date and a funded replacement or comprehension plan, not just an owner and a rollback switch. Bottom line: the first production incident may be manageable; the slower failure is a codebase where nobody can judge the risk of the next change.
Noriko Nakamura WARN
Jaya, the expiry date is useful, but I'd add a pager test before the deploy. Put the accountable engineer through a staging incident with only the dashboards, logs, and traces they'd have at 3 a.m.; if they can't tell whether the component is retrying, dropping, duplicating, or corrupting work, it doesn't ship. I've watched teams produce beautiful documentation and still fail because there was no control point to stop the bad behavior fast. Bottom line: the consensus can become paperwork-heavy and still operationally blind.
Round 4

The advisors argue that AI-generated production code should be treated as a controlled artifact, not accepted because tests passed once. For technical practitioners, the core takeaway is operational assurance: track the generator's inputs and version, define what must never happen, collect evidence that would reveal failure, and prove that another engineer can diagnose, roll back, patch, or disable the system under pressure.

Jaya Thakur REFRAME
Noriko, the pager drill is necessary, but I'd add configuration control around the code's birth certificate. In avionics verification, generated artifacts weren't trusted just because the output behaved once; we cared whether the generator version, inputs, switches, and review record could recreate the same artifact later. The person should treat unexplained AI code as a controlled generated artifact, because the next risk is a harmless-looking regeneration that passes the old tests while quietly changing an assumption nobody knew existed.
Dr. James Kowalski REFRAME
Jaya, I’d move the argument from understanding the code to understanding the safety claim. In nuclear safeguards, a declaration isn’t trusted because the paperwork balances; it’s trusted when inspectors can connect the claim to instruments, access, seals, and an anomaly response. For production AI code, that means the team must be able to state exactly what must never happen, what evidence would reveal it, and who has authority to shut it down. Bottom line: if nobody can build that assurance case, the passing tests are just a polite guess wearing a badge.
Noriko Nakamura REFRAME
James, I don't want the assurance case to turn into a PDF nobody can operate. The missing test is a cold handoff: wake up an engineer who didn't write it, give them a weird production symptom, and see whether they can decide fast whether to roll back, patch forward, or disable the path. I've watched systems stay broken because the code was 'owned' but the owner could only say the tests passed and the dashboard looked normal. Bottom line: opaque AI code doesn't earn production by passing tests; it earns it by surviving an operational handoff drill.
The Contrarian REFRAME
James, I'd move one step before the assurance case: production approval is a human act, not a test result. The EU AI Act's high-risk oversight idea points at the same trap: a person in the loop who can't meaningfully intervene is theater, not oversight. If nobody understands the code enough to refuse it, explain it, or disable the right behavior, the organization is just laundering an unknown decision through a human name. Bottom line: the consensus fails if drills and documents create the appearance of control while the approval chain still has no real veto knowledge.
Round 5

The advisors agree that passing tests is insufficient when AI-generated code is opaque to the team responsible for deploying it. They focus on concrete production risks: unbounded resource behavior, unclear failure paths, attack surface, and the inability to trace behavior back to requirements during later maintenance.

Noriko Nakamura WARN
Jaya, provenance matters, but the ugly production failure I’d worry about is resource behavior. I once watched a perfectly green helper take down a control plane because it did a harmless-looking full scan inside a hot path; tests never exercised the real cardinality. If nobody understands the AI code, they don’t know its locks, retries, database fanout, cache churn, or memory growth. Bottom line: the person shouldn’t ship it until an accountable engineer can bound what happens under load and name the kill switch.
The Contrarian REFRAME
James, the uncomfortable part is not just the safety claim; it's whether the organization is pretending it has control. A review ticket that says approved while nobody can explain the failure path is nominal oversight, and the briefing calls that exact move a common compliance failure. Bottom line: allow opaque AI code only if leadership is willing to write in the incident report, 'we shipped an opaque component and accepted that risk.' If they won't own that sentence, they shouldn't ship it.
Dr. James Kowalski REFRAME
Noriko, the handoff drill is useful, but I'd add an adversary drill. Attackers don't care whether the team understands the AI-generated code; they'll fuzz it, feed it malformed inputs, and look for the one behavior nobody knew existed. In safeguards work, an unlabelled pipe was never just a documentation problem, because it could be a diversion path; opaque production code is the same kind of hidden pathway. Bottom line: if they can't map the abuse cases and enforce input, permission, and egress boundaries, passing tests should not matter.
Jaya Thakur REFRAME
James, I'd narrow the lens even more than abuse cases. In avionics, the uncomfortable question was always traceability: which requirement justifies this behavior, and which evidence proves that behavior still holds after the next change? If the AI code is clever but nobody can connect its internal choices to a requirement, the tests become a snapshot, not an argument. Bottom line: they may get through this release, but the first maintenance change will turn that opacity into uncontrolled engineering debt.

This report was generated by AI. AI can make mistakes. It is not financial, legal, or medical advice.