SIGMOD 2026 · ACM

DeepEye-SQLA Software-Engineering-InspiredText-to-SQL Framework

What if generating a correct SQL query is not just a language task, but a software engineering problem? DeepEye-SQL reframes Text-to-SQL through the lens of the Software Development Life Cycle, achieving state-of-the-art performance with a small open-source model.

75.07%
BIRD-Test (Official)
73.5%
BIRD-Dev EX
89.8%
Spider-Test EX
~3B
Active Params
DeepEye-SQL SDLC-inspired framework overview
SDLC-Inspired Four-Phase Pipeline

Boyan Li, Chong Chen, Zhujun Xue, Yinan Mei, Yuyu Luo*

DIAL Lab @ HKUST(GZ)

Scroll to explore
KEY INSIGHT

The Paradigm Shift

Text-to-SQL is not merely a language generation task. Generating a correct SQL query demands structured orchestration and verifiable correctness — just like building software.

SDLC to DeepEye-SQL pipeline mapping
SDLC-Inspired Four-Phase Pipeline: Each phase addresses a specific challenge in Text-to-SQL
01
Requirements

Schema Linking

Understand what to build by grounding intent in the database schema

C1: Ambiguous requirement inference
02
Implementation

N-Version SQL

Build multiple independent SQL solutions in parallel for fault tolerance

C2: Insufficient fault tolerance
03
Unit Testing

Tool-Chain Testing

Verify each SQL with 8 deterministic checkers, not LLM self-refinement

C3: Unreliable verification
04
Release

Confidence Selection

Ship the best SQL through a confidence-gated quality gate

C4: Uncalibrated confidence

How DeepEye-SQL Differs

MethodCore ParadigmVerificationSQL SelectionEfficiency
CHESSLinear InferenceLLM Self-RefinementHeuristic Majority VotingHigh API Cost (Gemini 1.5 Pro)
Alpha-SQLSearch-Centric (MCTS)Basic Syntax CheckHeuristic Majority VotingHigh Inference Latency
OmniSQLData-Centric SFTBasic Syntax CheckHeuristic Majority VotingHigh Training Cost
DeepEye-SQLSE-Lifecycle CentricDeterministic Unit TestingConfidence-Aware SelectionEfficient & Training-Free (~3B)
PHASE 1 — REQUIREMENTS
Phase 1Requirements Analysis

Intent Scoping & Semantic Grounding

Before writing any code, a software engineer must fully understand the requirements. Similarly, DeepEye-SQL first scopes the user's intent by grounding the natural language question in the database schema. This phase employs three complementary schema linking strategies to ensure no relevant table or column is missed.

Natural Language Question

"What is the average rating of restaurants in San Francisco that serve Italian food?"

ratingrestaurantsSan FranciscoItalian
Database Schema
50%
coverage
restaurants
idnamecitycuisine_typeprice_range
reviews
idrestaurant_idratingreview_textdate
locations
restaurant_idaddresscitystatezip_code
Strategy Analysis
Direct Schema Linking
LLM analyzes question keywords against schema

LLM identifies 'rating', 'restaurants', and 'reviews' from the question keywords. Misses the 'locations' table which also has city data.

Identified Tables:
restaurantsreviews
Misses: locations table (city column)
Why Three Strategies?

Each strategy has blind spots. Direct linking may miss implicit tables. Reversed linking may hallucinate schema elements. Value-based linking only finds columns with matching values. Their union provides robust, fault-tolerant coverage.

PHASE 2 — IMPLEMENTATION
Phase 2Implementation

N-Version Programming for SQL Generation

In safety-critical software, N-version programming uses multiple independent teams to build the same system. DeepEye-SQL applies this principle: three fundamentally different SQL generators work in parallel on the same question, producing diverse candidate solutions that avoid correlated failures.

N-version programming concept
Input Question

"Find the top 3 departments with the highest average salary"

Skeleton-based
Top-Down Design
Why N-Version, Not Self-Consistency?

Self-consistency randomly samples from the same generator, producing correlated errors. N-version programming uses fundamentally different reasoning paradigms, so failures are independent — a critical distinction for fault tolerance.

Generated SQL
SELECT d.name,
       AVG(e.salary) AS avg_sal
FROM employees e
JOIN departments d
  ON e.dept_id = d.id
GROUP BY d.name
ORDER BY avg_sal DESC
LIMIT 3
All Three Outputs
Skeleton-basedUses alias: avg_sal
ICL-basedUses full table names
Divide & ConquerUses alias: department, avg_salary

Same semantics, different syntax — diversity by design.

PHASE 3 — TESTING
Phase 3Unit Testing

SQL Unit Testing & Revision via Tool-Chain

Just as software undergoes rigorous unit testing, each SQL candidate passes through a chain of 8 deterministic checkers. Unlike LLM-based self-refinement, these checkers use rule-based logic to catch specific bug patterns — producing actionable bug reports that guide targeted fixes.

Tool-chain checker pipeline
Fail-Fast
Logic
Quality
Click any checker to see a real example
Deterministic vs. LLM Self-Refinement

Traditional approaches ask the LLM to "check its own work" — but LLMs often repeat the same mistakes. DeepEye-SQL's checkers use deterministic rules (schema metadata, execution results, SQL parsing) to produce precise bug reports. The LLM only handles the targeted fix, not the diagnosis.

PHASE 4 — RELEASE
Phase 4Release & Quality Gate

Confidence-Aware SQL Selection

A software product is not released just because it compiles — it must pass a Quality Gate. DeepEye-SQL applies this principle through confidence-aware selection: execute all candidates, cluster by results, estimate confidence, and adaptively choose between a fast shortcut (high confidence) or rigorous peer review (low confidence).

Confidence-aware selection concept

Execute all SQL candidates against the database and group them by their execution results. Queries producing identical results belong to the same cluster.

6 SQL Candidates
S1(Skeleton)C1
SELECT d.name, AVG(e.salary) FROM employees e JOIN departments d ON e.dept_id=d.id GROUP BY d.name ORDER BY AVG(e.salary) DESC LIMIT 3
S2(ICL)C1
SELECT departments.name, AVG(employees.salary) FROM employees INNER JOIN departments ON employees.dept_id=departments.id GROUP BY departments.name ORDER BY 2 DESC LIMIT 3
S3(D&C)C1
SELECT dep.name, AVG(emp.salary) AS avg_sal FROM employees emp JOIN departments dep ON emp.dept_id=dep.id GROUP BY dep.name ORDER BY avg_sal DESC LIMIT 3
S4(Skeleton-v2)C1
SELECT d.name, AVG(e.salary) avg_salary FROM employees e, departments d WHERE e.dept_id=d.id GROUP BY d.name ORDER BY avg_salary DESC LIMIT 3
S5(ICL-v2)C2
SELECT d.name, SUM(e.salary)/COUNT(*) FROM employees e JOIN departments d ON e.dept_id=d.id GROUP BY d.name ORDER BY 2 DESC LIMIT 3
S6(D&C-v2)C2
SELECT department_name, AVG(salary) FROM emp_view GROUP BY department_name ORDER BY AVG(salary) DESC LIMIT 3
Execution Result Clusters
Cluster 14 queries
[(Engineering, 95000), (Sales, 82000), (Marketing, 78000)]
67% of candidates
Cluster 22 queries
[(Engineering, 95000), (Sales, 82000), (HR, 75000)]
33% of candidates
RESULTS
EXPERIMENTAL RESULTS

State-of-the-Art Performance

DeepEye-SQL achieves top results on major Text-to-SQL benchmarks using only a ~3B active parameter open-source model, surpassing methods that rely on much larger proprietary models.

75.07%
BIRD-Test (Official Leaderboard)
#1 Open-Source
73.5%
BIRD-Dev Execution Accuracy
89.8%
Spider-Test Execution Accuracy

BIRD-Dev Execution Accuracy (%)

XiYan-SQL
73.3%
GPT-4o + Qwen2.5 (>200B)
CHASE-SQL
73%
Gemini-1.5-Pro (>200B)
Alpha-SQL
69.7%
Qwen2.5-Coder-32B (~32B)
CHESS
68.3%
Gemini-1.5-Pro (>200B)
DeepEye-SQL~3B active
73.5%

Spider-Test Execution Accuracy (%)

XiYan-SQL
89.7%
GPT-4o + Qwen2.5 (>200B)
CHASE-SQL
87.6%
Gemini-1.5-Pro (>200B)
DeepEye-SQL~3B active
89.8%

Efficient

Uses a MoE model with ~30B total but only ~3B activated parameters. No fine-tuning required — purely inference-time techniques.

Competitive

Outperforms methods using GPT-4o, Gemini 1.5 Pro, and other proprietary models with 200B+ parameters on both benchmarks.

Training-Free

Unlike data-centric approaches (OmniSQL) that require expensive SFT, DeepEye-SQL works out-of-the-box with any capable open-source LLM.

ABLATION STUDY

Every Component Matters

Ablation study on BIRD-Dev with Qwen3-Coder-30B-A3B (Table 10 from paper). Click each variant to see its impact on the final accuracy.

BIRD-Dev Execution Accuracy
73.5%
Variant
Phase
EX (%)
Δ (%)
DeepEye-SQL
Complete framework with all components
All
73.5
w/o Semantic Value Retrieval
Remove semantic value retrieval from Phase 1
Phase 1
71.4
-2.1
w/o Robust Schema Linking
Use direct linking only, remove reversed & value-based linking
Phase 1
71.8
-1.7
w/o Skeleton-based Generation
Remove skeleton-based generator from N-version programming
Phase 2
72.2
-1.3
w/o ICL-based Generation
Remove in-context learning generator from N-version programming
Phase 2
71
-2.5
w/o D&C-based Generation
Remove divide-and-conquer generator from N-version programming
Phase 2
72.3
-1.2
w/o Tool-Chain Testing & Revision
Remove deterministic checkers, rely on LLM self-refinement
Phase 3
71.4
-2.1
w/o Confidence-aware Selection
Replace with majority voting instead of confidence-aware selection
Phase 4
72.7
-0.8
Key Takeaway

ICL-based Generation (Phase 2) and Tool-Chain Testing & Revision (Phase 3) each contribute the largest single improvement of -2.5% and -2.1% respectively. Semantic Value Retrieval (Phase 1) also shows a significant -2.1% drop when removed. This confirms that the SDLC-inspired design creates complementary components where each phase addresses different failure modes.