What if generating a correct SQL query is not just a language task, but a software engineering problem? DeepEye-SQL reframes Text-to-SQL through the lens of the Software Development Life Cycle, achieving state-of-the-art performance with a small open-source model.

Boyan Li, Chong Chen, Zhujun Xue, Yinan Mei, Yuyu Luo*
Text-to-SQL is not merely a language generation task. Generating a correct SQL query demands structured orchestration and verifiable correctness — just like building software.

- **Understand** what to build by grounding intent in the database schema
- **Build** multiple independent SQL solutions in parallel for fault tolerance
- **Verify** each SQL with 8 deterministic checkers, not LLM self-refinement
- **Ship** the best SQL through a confidence-gated quality gate
| Method | Core Paradigm | Verification | SQL Selection | Efficiency |
|---|---|---|---|---|
| CHESS | Linear Inference | LLM Self-Refinement | Heuristic Majority Voting | High API Cost (Gemini 1.5 Pro) |
| Alpha-SQL | Search-Centric (MCTS) | Basic Syntax Check | Heuristic Majority Voting | High Inference Latency |
| OmniSQL | Data-Centric SFT | Basic Syntax Check | Heuristic Majority Voting | High Training Cost |
| DeepEye-SQL | SE-Lifecycle Centric | Deterministic Unit Testing | Confidence-Aware Selection | Efficient & Training-Free (~3B) |
Before writing any code, a software engineer must fully understand the requirements. Similarly, DeepEye-SQL first scopes the user's intent by grounding the natural language question in the database schema. This phase employs three complementary schema linking strategies to ensure no relevant table or column is missed.
"What is the average rating of restaurants in San Francisco that serve Italian food?"
Direct linking identifies 'rating', 'restaurants', and 'reviews' from the question keywords, but misses the 'locations' table, which also holds the city data.
Each strategy has blind spots. Direct linking may miss implicit tables. Reversed linking may hallucinate schema elements. Value-based linking only finds columns with matching values. Their union provides robust, fault-tolerant coverage.
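The union of the three strategies can be sketched as follows. This is a minimal illustration, not DeepEye-SQL's actual API: the function names (`direct_link`, `value_link`, `union_schema_links`) and the toy schema are assumptions, and reversed linking is represented only as a precomputed set of hits.

```python
def direct_link(question, schema):
    """Direct linking: match question keywords against table/column names."""
    words = {w.strip("?.,").lower() for w in question.split()}
    return {f"{t}.{c}" for t, cols in schema.items() for c in cols
            if c.lower() in words or t.lower() in words}

def value_link(question, db_values):
    """Value-based linking: find columns whose stored values appear in the question."""
    q = question.lower()
    return {col for col, values in db_values.items()
            if any(v.lower() in q for v in values)}

def union_schema_links(question, schema, db_values, reversed_hits):
    """The union covers each individual strategy's blind spots."""
    return direct_link(question, schema) | value_link(question, db_values) | reversed_hits

schema = {"restaurants": ["id", "rating", "cuisine"],
          "locations": ["restaurant_id", "city"]}
db_values = {"locations.city": ["San Francisco", "Oakland"],
             "restaurants.cuisine": ["Italian", "Thai"]}
question = "What is the average rating of restaurants in San Francisco that serve Italian food?"

# Direct linking alone misses `locations` (no keyword overlap);
# value-based linking recovers it via the stored value "San Francisco".
links = union_schema_links(question, schema, db_values, reversed_hits=set())
```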
In safety-critical software, N-version programming uses multiple independent teams to build the same system. DeepEye-SQL applies this principle: three fundamentally different SQL generators work in parallel on the same question, producing diverse candidate solutions that avoid correlated failures.

"Find the top 3 departments with the highest average salary"
Self-consistency randomly samples from the same generator, producing correlated errors. N-version programming uses fundamentally different reasoning paradigms, so failures are independent — a critical distinction for fault tolerance.
```sql
SELECT d.name,
       AVG(e.salary) AS avg_sal
FROM employees e
JOIN departments d
  ON e.dept_id = d.id
GROUP BY d.name
ORDER BY avg_sal DESC
LIMIT 3
```

Same semantics, different syntax — diversity by design.
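The parallel-generation step can be sketched as below. The generator names (`direct_gen`, `cot_gen`, `icl_gen`) are hypothetical stand-ins for three distinct reasoning paradigms; in the real system each would be an LLM call, while here they return fixed SQL strings.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_candidates(question, generators):
    """Run fundamentally different generators in parallel; collect every candidate."""
    with ThreadPoolExecutor(max_workers=len(generators)) as pool:
        futures = [pool.submit(g, question) for g in generators]
        return [f.result() for f in futures]

def direct_gen(q):
    # Paradigm 1: direct prompting
    return ("SELECT d.name, AVG(e.salary) AS avg_sal FROM employees e "
            "JOIN departments d ON e.dept_id = d.id "
            "GROUP BY d.name ORDER BY avg_sal DESC LIMIT 3")

def cot_gen(q):
    # Paradigm 2: chain-of-thought, yielding a syntactically different query
    return ("SELECT name FROM (SELECT d.name, AVG(e.salary) AS s "
            "FROM employees e JOIN departments d ON e.dept_id = d.id "
            "GROUP BY d.name) ORDER BY s DESC LIMIT 3")

def icl_gen(q):
    # Paradigm 3: few-shot ICL; independent paradigms may still converge
    return direct_gen(q)

candidates = generate_candidates(
    "Find the top 3 departments with the highest average salary",
    [direct_gen, cot_gen, icl_gen])
```

Because the three generators fail independently, a bug in one paradigm rarely corrupts the whole candidate pool.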
Just as software undergoes rigorous unit testing, each SQL candidate passes through a chain of 8 deterministic checkers. Unlike LLM-based self-refinement, these checkers use rule-based logic to catch specific bug patterns — producing actionable bug reports that guide targeted fixes.

Traditional approaches ask the LLM to "check its own work" — but LLMs often repeat the same mistakes. DeepEye-SQL's checkers use deterministic rules (schema metadata, execution results, SQL parsing) to produce precise bug reports. The LLM only handles the targeted fix, not the diagnosis.
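A minimal sketch of such rule-based checking, assuming an in-memory SQLite database: the two checkers shown (executability and empty-result) are illustrative stand-ins for the paper's chain of 8, and all names are hypothetical.

```python
import sqlite3

def check_executable(conn, sql):
    """Checker: the query must parse and execute (no LLM judgment involved)."""
    try:
        conn.execute(sql)
        return None
    except sqlite3.Error as e:
        return {"checker": "executability", "bug": str(e)}

def check_nonempty(conn, sql):
    """Checker: an empty result set often signals a wrong filter or join."""
    if not conn.execute(sql).fetchall():
        return {"checker": "empty_result", "bug": "query returned no rows"}
    return None

def run_checkers(conn, sql):
    """Run the checker chain; the first bug report found guides a targeted LLM fix."""
    for checker in (check_executable, check_nonempty):
        report = checker(conn, sql)
        if report:
            return report
    return {"checker": None, "bug": None}  # all checks passed

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary REAL)")
conn.execute("INSERT INTO employees VALUES ('Ada', 120000)")

report = run_checkers(conn, "SELECT salry FROM employees")  # typo: 'salry'
# → a precise, deterministic bug report naming the misspelled column
```

The point of the design is that diagnosis is deterministic: the LLM only receives the bug report and performs the fix, rather than being asked to find its own mistake.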
A software product is not released just because it compiles — it must pass a Quality Gate. DeepEye-SQL applies this principle through confidence-aware selection: execute all candidates, cluster by results, estimate confidence, and adaptively choose between a fast shortcut (high confidence) or rigorous peer review (low confidence).

Execute all SQL candidates against the database and group them by their execution results. Queries producing identical results belong to the same cluster.
Cluster 1: [(Engineering, 95000), (Sales, 82000), (Marketing, 78000)]
Cluster 2: [(Engineering, 95000), (Sales, 82000), (HR, 75000)]
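The gate can be sketched as below, again against SQLite. The 0.5 confidence threshold, the toy table, and the `peer_review` label are illustrative assumptions, not values from the paper.

```python
import sqlite3
from collections import defaultdict

def select_sql(conn, candidates, threshold=0.5):
    """Cluster candidates by execution result; take the shortcut on high agreement."""
    clusters = defaultdict(list)
    for sql in candidates:
        try:
            key = tuple(sorted(map(tuple, conn.execute(sql).fetchall())))
        except sqlite3.Error:
            continue  # non-executable candidates never join a cluster
        clusters[key].append(sql)
    best = max(clusters.values(), key=len)       # largest agreement cluster
    confidence = len(best) / len(candidates)     # fraction of candidates that agree
    mode = "shortcut" if confidence >= threshold else "peer_review"
    return best[0], confidence, mode

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

# Two of three candidates agree on the result, so the gate takes the fast path.
best_sql, confidence, mode = select_sql(conn, [
    "SELECT MAX(x) FROM t",
    "SELECT x FROM t ORDER BY x DESC LIMIT 1",
    "SELECT MIN(x) FROM t",
])
```

Low-confidence questions, where no cluster dominates, would instead be escalated to the more expensive peer-review path.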
DeepEye-SQL achieves top results on major Text-to-SQL benchmarks using only a ~3B active parameter open-source model, surpassing methods that rely on much larger proprietary models.
Uses a MoE model with ~30B total but only ~3B activated parameters. No fine-tuning required — purely inference-time techniques.
Outperforms methods using GPT-4o, Gemini 1.5 Pro, and other proprietary models with 200B+ parameters on both benchmarks.
Unlike data-centric approaches (OmniSQL) that require expensive SFT, DeepEye-SQL works out-of-the-box with any capable open-source LLM.
Ablation study on BIRD-Dev with Qwen3-Coder-30B-A3B (Table 10 from paper).
Removing ICL-based Generation (Phase 2) or Tool-Chain Testing & Revision (Phase 3) causes the largest single accuracy drops, of 2.5% and 2.1% respectively. Removing Semantic Value Retrieval (Phase 1) also costs a significant 2.1%. This confirms that the SDLC-inspired design creates complementary components, with each phase addressing different failure modes.