Quality Assurance

Validation & QA

NEXUS doesn't just generate code — it proves it works. Every delivery passes through automated testing, security scanning, and a composite quality gate before it can merge.

Four Pillars of QA

Automated Test Generation

Agents generate unit and integration tests alongside implementation code — not as an afterthought. Test pass rate is the single largest input to the Quality Score (30%).

E2E Smoke Tests

Before merge, a real end-to-end smoke test exercises the critical user flow in a live environment. This was demonstrated on the Todo app case study (22/22 tasks, smoke test PASS).

Security Scanning

Every delivery is scanned for secrets, injection, XSS, insecure deserialization, and vulnerable dependencies — contributing 25% of the Quality Score.

MergeGate Scoring

The composite Quality Score Q gates every merge: tests (30%), security (25%), efficiency (20%), self-correction (15%), constitution (10%).

MergeGate Quality Score (Q)

Five weighted signals, one merge decision

Tests pass rate: 30%
Security score: 25%
Token efficiency: 20%
Self-correction rate: 15%
Constitution adherence: 10%

Q = 0.30 × tests_pass_rate
  + 0.25 × security_score
  + 0.20 × token_efficiency
  + 0.15 × self_correction_rate
  + 0.10 × constitution_score

Q ≥ 75: MergeGate PASS
60 ≤ Q < 75: AutoFix loop triggered
Q < 60: BLOCK + human escalation

Quality Score: 100 (Elite)

Achieved on the Todo app case study (22/22 tasks)
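
As a rough illustration of the scoring rule above, the following Python sketch computes Q from the five signals and maps it to a merge decision. It assumes each signal is reported on a 0-100 scale, which is inferred from the 75 and 60 thresholds rather than stated on this page. The function names `quality_score` and `merge_decision` and the decision labels are hypothetical and are not part of any NEXUS API.

```python
# Weights copied from the published Quality Score formula.
WEIGHTS = {
    "tests_pass_rate": 0.30,
    "security_score": 0.25,
    "token_efficiency": 0.20,
    "self_correction_rate": 0.15,
    "constitution_score": 0.10,
}

# Thresholds copied from the MergeGate decision bands.
PASS_THRESHOLD = 75.0
AUTOFIX_THRESHOLD = 60.0


def quality_score(signals: dict[str, float]) -> float:
    """Weighted sum of the five signals.

    Assumption: every signal is on a 0-100 scale, so Q is also 0-100.
    """
    missing = WEIGHTS.keys() - signals.keys()
    if missing:
        raise ValueError(f"missing signals: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= signals[name] <= 100.0:
            raise ValueError(f"{name} must be within 0-100, got {signals[name]}")
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)


def merge_decision(q: float) -> str:
    """Map Q onto the three published bands (labels are illustrative)."""
    if q >= PASS_THRESHOLD:
        return "PASS"
    if q >= AUTOFIX_THRESHOLD:
        return "AUTOFIX"
    return "BLOCK"
```

Under these assumptions, a run with every signal at 100 scores Q = 100 and lands in the PASS band, while a run that scores between 60 and 75 would be routed to the AutoFix loop instead of merging.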

Validation Campaign Methodology

This is the same paired-pilot methodology that produced the published results of -28% time, -37% tokens, and -50% cost (Phase C: 11 pilots, 22 runs, 8 codebases).

Step 01

Select Paired Tasks

Identify a real feature or fix with clear, testable acceptance criteria — scoped so it can be delivered both traditionally and via NEXUS OS.

Step 02

Run Before / After

Deliver the same scope twice: once with traditional human-only development, once through a NEXUS Forge run — same repo, same starting commit.

Step 03

Measure Across Axes

Record wall-clock time, total tokens consumed, total cost, MergeGate Q score, and security findings before/after AutoFix for each run.
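
One way to capture these measurements is a small record per run plus a helper that compares the two runs of a pair. The sketch below is illustrative only: the `RunRecord` field names, `relative_change`, and `compare_pair` are hypothetical and do not reflect an actual NEXUS schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """Measurements for one delivery run (field names are assumptions)."""
    wall_clock_seconds: float
    total_tokens: int
    total_cost: float
    mergegate_q: float
    security_findings_before_autofix: int
    security_findings_after_autofix: int


def relative_change(baseline: float, candidate: float) -> float:
    """Signed fractional change versus baseline; -0.25 means 25% lower."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline


def compare_pair(baseline: RunRecord, candidate: RunRecord) -> dict[str, float]:
    """Relative change in time, tokens, and cost for one before/after pair."""
    return {
        "time": relative_change(baseline.wall_clock_seconds,
                                candidate.wall_clock_seconds),
        "tokens": relative_change(baseline.total_tokens,
                                  candidate.total_tokens),
        "cost": relative_change(baseline.total_cost, candidate.total_cost),
    }
```

A zero baseline has no defined relative change, so the sketch raises an error rather than guessing a value.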

Step 04

Aggregate Across Codebases

Repeat across multiple codebases and task types to avoid overfitting — the published numbers (-28% time, -37% tokens, -50% cost) are aggregated across 11 pilots / 22 runs / 8 codebases.
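
The aggregation step could look roughly like the sketch below. It assumes an unweighted mean of per-pilot relative changes, which is only one plausible choice, since this page does not say how the published figures were combined. The `aggregate_pilots` function and its input shape are hypothetical.

```python
from statistics import mean


def aggregate_pilots(pilots: list[dict]) -> dict:
    """Summarize paired pilots.

    Each pilot is assumed to look like:
        {"codebase": "name", "deltas": {"time": -0.2, "tokens": -0.3}}
    where each delta is a signed fractional change versus the baseline run.
    """
    if not pilots:
        raise ValueError("no pilots to aggregate")
    metrics = sorted(pilots[0]["deltas"])
    return {
        "pilots": len(pilots),
        # Each pilot is one before/after pair, so two runs per pilot.
        "runs": 2 * len(pilots),
        "codebases": len({p["codebase"] for p in pilots}),
        # Assumption: a simple unweighted mean across pilots.
        "mean_delta": {
            m: mean(p["deltas"][m] for p in pilots) for m in metrics
        },
    }
```

Counting two runs per pilot is consistent with the 11 pilots / 22 runs reported above. A weighted mean or a median would be equally reasonable aggregations and could give different headline numbers.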

Live Validation Pilot Snapshot

Current in-flight calibration data

Universes in pilot: 4
Avg composite Q: 58.26
Avg run duration: 81.2s
Sessions calibrated: 219