Quality Assurance
Validation & QA
NEXUS doesn't just generate code — it proves it works. Every delivery passes through automated testing, security scanning, and a composite quality gate before it can merge.
Four Pillars of QA
Automated Test Generation
Agents generate unit and integration tests alongside implementation code — not as an afterthought. Test pass rate is the single largest input to the Quality Score (30%).
E2E Smoke Tests
Before merge, a real end-to-end smoke test exercises the critical user flow in a live environment — proven on the Todo app case study (22/22 tasks, smoke test PASS).
Security Scanning
Every delivery is scanned for hard-coded secrets, injection flaws, XSS, insecure deserialization, and vulnerable dependencies. The scan result contributes 25% of the Quality Score.
MergeGate Scoring
The composite Quality Score Q gates every merge: tests (30%), security (25%), efficiency (20%), self-correction (15%), constitution (10%).
MergeGate Quality Score (Q)
Five weighted signals, one merge decision
Q = 0.30 × tests_pass_rate
+ 0.25 × security_score
+ 0.20 × token_efficiency
+ 0.15 × self_correction_rate
+ 0.10 × constitution_score
Achieved on the Todo app case study (22/22 tasks)
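The weighted sum above can be sketched in a few lines of Python. This is a minimal illustration, not the NEXUS implementation: the function names `quality_score` and `passes_gate` are hypothetical, each signal is assumed to be normalized to the range 0 to 1, and the merge threshold is a caller-supplied assumption because the page does not state one.

```python
# Weights taken from the published formula; everything else is illustrative.
WEIGHTS = {
    "tests_pass_rate": 0.30,
    "security_score": 0.25,
    "token_efficiency": 0.20,
    "self_correction_rate": 0.15,
    "constitution_score": 0.10,
}


def quality_score(signals: dict[str, float]) -> float:
    """Return the weighted sum Q, assuming every signal lies in [0, 1]."""
    missing = WEIGHTS.keys() - signals.keys()
    if missing:
        raise ValueError(f"missing signals: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= signals[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {signals[name]}")
    return sum(weight * signals[name] for name, weight in WEIGHTS.items())


def passes_gate(q: float, threshold: float) -> bool:
    """Hypothetical gate: the threshold is an assumption, not a published value."""
    return q >= threshold


if __name__ == "__main__":
    # Made-up signal values for illustration only, not case-study data.
    example = {
        "tests_pass_rate": 1.0,
        "security_score": 0.8,
        "token_efficiency": 0.5,
        "self_correction_rate": 0.6,
        "constitution_score": 1.0,
    }
    q = quality_score(example)
    print(f"Q = {q:.2f}, merge allowed: {passes_gate(q, threshold=0.75)}")
```

Because the weights sum to 1.0, Q stays on the same scale as its inputs. If scores are reported on a 0 to 100 scale, the caller would multiply the result by 100.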
Validation Campaign Methodology
The campaign follows the same paired-pilot methodology that produced the published results: -28% time, -37% tokens, and -50% cost (Phase C, 11 pilots / 22 runs / 8 codebases).
Select Paired Tasks
Identify a real feature or fix with clear, testable acceptance criteria — scoped so it can be delivered both traditionally and via NEXUS OS.
Run Before / After
Deliver the same scope twice: once with traditional human-only development, once through a NEXUS Forge run — same repo, same starting commit.
Measure Across Axes
Record wall-clock time, total tokens consumed, total cost, MergeGate Q score, and security findings before/after AutoFix for each run.
Aggregate Across Codebases
Repeat across multiple codebases and task types to avoid overfitting — the published numbers (-28% time, -37% tokens, -50% cost) are aggregated across 11 pilots / 22 runs / 8 codebases.
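The four steps above amount to a paired before/after comparison, which can be sketched as follows. The page does not say how per-pilot results are combined, so this sketch assumes a simple mean of per-pilot percentage changes. The `Run` and `Pilot` types and the `aggregate` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Run:
    """Measurements recorded for one delivery of the scoped task."""
    wall_clock_s: float
    tokens: int
    cost_usd: float


@dataclass(frozen=True)
class Pilot:
    """One paired task: the same scope delivered twice from the same commit."""
    codebase: str
    baseline: Run  # traditional delivery
    nexus: Run     # NEXUS Forge run


AXES = ("wall_clock_s", "tokens", "cost_usd")


def pct_change(before: float, after: float) -> float:
    """Relative change in percent; negative values mean a reduction."""
    if before == 0:
        raise ValueError("baseline measurement must be non-zero")
    return (after - before) / before * 100.0


def aggregate(pilots: list[Pilot]) -> dict[str, float]:
    """Mean per-pilot percentage change per axis (assumed aggregation rule)."""
    if not pilots:
        raise ValueError("at least one pilot is required")
    return {
        axis: mean(
            pct_change(getattr(p.baseline, axis), getattr(p.nexus, axis))
            for p in pilots
        )
        for axis in AXES
    }


if __name__ == "__main__":
    # Made-up numbers for illustration only, not Phase C pilot data.
    pilots = [
        Pilot("repo-a", Run(100.0, 1000, 10.0), Run(80.0, 500, 5.0)),
        Pilot("repo-b", Run(200.0, 2000, 20.0), Run(100.0, 2000, 10.0)),
    ]
    for axis, delta in aggregate(pilots).items():
        print(f"{axis}: {delta:+.1f}%")
```

Averaging per-pilot changes gives every pilot equal weight regardless of task size. Comparing summed totals instead would weight larger tasks more heavily, so the choice should be stated alongside any reported numbers.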
Live Validation Pilot Snapshot
Current in-flight calibration data
Universes in pilot: 4
Avg composite Q: 58.26
Avg run duration: 81.2s
Sessions calibrated: 219