feat: Text2SQL — guarded natural-language query (sqlglot AST + mandatory readonly) by dividduang · Pull Request #2 · fastapi-practices/ai

dividduang · 2026-06-25T15:52:01Z

What

Text2SQL: a user asks a natural-language question, an LLM generates SQL, the system executes it against a SELECT-only readonly account and returns rows. Exposed as a chat capability (text2sql_query function tool) plus a standalone /queries endpoint, with dataset/table/example management.

An earlier Text2SQL implementation was removed from this plugin. This PR reintroduces it with a security-first design that directly addresses the reasons it was pulled (weak guards, write-engine fallback, config bloat, parallel agent runtime).

Security model (fail-closed)

- sqlglot AST guard (text2sql/guardrails.py): single SELECT only; every Table node is checked against the dataset allowlist by full schema.table, with no namespace stripping, so mysql.user / information_schema.* exfiltration via name collision is blocked. Rejects tableless recon (@@hostname, USER(), VERSION(), SLEEP, LOAD_FILE …), dangerous functions, system/session variables, and write/DDL nodes anywhere in the tree (including PostgreSQL DELETE … RETURNING inside a subquery). Clamps LIMIT to max_rows.
- Mandatory readonly account (text2sql/readonly_db.py): if AI_TEXT2SQL_READONLY_* is unset, execution is refused; there is no fallback to the writable main DB.
- AI_TEXT2SQL_ENABLED gate (default false) on both the capability builder and /queries.
- The final SQL is re-guarded and re-executed by _execute_final (defense in depth); the agent itself never executes SQL.

Architecture

- Single-shot agent, no bespoke tool loop: table/column metadata is pre-fetched server-side and inlined into the system prompt; the agent emits {sql, summary} via structured output. This avoids a second pydantic-ai Agent with its own tool loop / model resolution / history (the prior duplication).
- Model resolution reuses AIDefaultModelScene.text2sql via ai_default_model_service (no parallel PROVIDER_ID/MODEL_ID config).
- Dataset → table (allowlist) → example (few-shot) data model; schema introspection reuses the code-generator pattern and excludes ai_* / gen_* plugin tables.
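The single-shot flow described above can be sketched with the standard library alone. The PR itself uses pydantic-ai structured output (output_type=Text2SqlResult); the version below is a stand-in that only shows the shape of the idea: metadata is inlined into the prompt up front, and the model's single reply is validated as {sql, summary}. The helper names `build_system_prompt` and `parse_result`, the prompt wording, and the metadata layout are assumptions for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Text2SqlResult:
    sql: str
    summary: str


def build_system_prompt(tables):
    """Inline pre-fetched metadata: {'schema.table': [(column, type), ...]}."""
    lines = [
        'Reply with JSON of the form {"sql": "...", "summary": "..."}.',
        "Use only these tables:",
    ]
    for name, columns in sorted(tables.items()):
        cols = ", ".join(f"{col} {typ}" for col, typ in columns)
        lines.append(f"- {name}({cols})")
    return "\n".join(lines)


def parse_result(raw):
    """Validate the model's single reply; there is no tool loop and no retry."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    sql, summary = data.get("sql"), data.get("summary")
    if not isinstance(sql, str) or not isinstance(summary, str):
        raise ValueError("both 'sql' and 'summary' must be strings")
    return Text2SqlResult(sql=sql, summary=summary)
```

The parsed `sql` is still untrusted at this point; it only becomes executable after the guard has accepted it.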

Tests

41 guardrail cases (pure-logic, no DB) covering: namespace-collision bypass, tableless recon, all DML/DDL, multi-statement, INTO OUTFILE, DELETE…RETURNING in subquery, UNION/CTE exfil, large-LIMIT clamp.

Config (`plugin.toml`)

All keys share the AI_TEXT2SQL_ prefix: AI_TEXT2SQL_ENABLED (default false), _SCHEMA, _MAX_ROWS, _TIMEOUT, _MAX_RETRIES, _READONLY_{HOST,PORT,USER,PASSWORD}.
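As a rough illustration of how these settings interact with the fail-closed rule, the sketch below loads the readonly account from a plain mapping and refuses to proceed when the feature is disabled or any readonly key is missing. How the plugin actually reads plugin.toml is not shown in this PR description, so `load_readonly_account`, `ReadonlyAccount` and `Text2SqlConfigError` are hypothetical names, and the mapping stands in for the real settings object.

```python
from dataclasses import dataclass

PREFIX = "AI_TEXT2SQL_"
READONLY_KEYS = ("HOST", "PORT", "USER", "PASSWORD")


class Text2SqlConfigError(RuntimeError):
    """Hypothetical error type: the feature is disabled or misconfigured."""


@dataclass(frozen=True)
class ReadonlyAccount:
    host: str
    port: int
    user: str
    password: str


def load_readonly_account(settings):
    """Return the readonly account or raise; never fall back to the main DB."""
    enabled = str(settings.get(PREFIX + "ENABLED", "false")).lower() == "true"
    if not enabled:
        raise Text2SqlConfigError("Text2SQL is disabled")

    missing = [k for k in READONLY_KEYS if not settings.get(f"{PREFIX}READONLY_{k}")]
    if missing:
        raise Text2SqlConfigError(f"readonly account not configured, missing: {missing}")

    return ReadonlyAccount(
        host=settings[PREFIX + "READONLY_HOST"],
        port=int(settings[PREFIX + "READONLY_PORT"]),
        user=settings[PREFIX + "READONLY_USER"],
        password=settings[PREFIX + "READONLY_PASSWORD"],
    )
```

The point of the design is that the error path is the only alternative to a fully configured readonly account; there is no branch that returns the writable connection.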

Follow-ups (not in this PR)

Optional bounded self-correction retry (single-shot today; the final SQL is re-guarded regardless).

- Add text2sql engine: guardrails, schema metadata, readonly DB access
- Add dataset/table/example CRUD, schema, service layer
- Wire text2sql capability and v1 router endpoint
- Add web search (Exa/Tavily) and Text2SQL settings in plugin.toml
- Add guardrails tests

…config mgmt
- Resolve conflicts in plugin.toml, .env.example, 8 SQL files
- Reassign Text2SQL snowflake menu IDs 840-843 -> 849-852 to avoid collisions with upstream's EditAIDefaultModel/AIConfigManage/QuickPhrase
- Preserve all Text2SQL python (crud/model/schema/service/text2sql/tests)
- Adopt upstream: default_model mgmt, AI config menu, chat refactor

Address upstream review blockers for the Text2SQL feature.

Security (fail-closed):
- Rewrite guardrail with sqlglot AST: full schema.table allowlist (fixes mysql.user->user namespace-collision bypass), reject tableless recon (@@hostname/USER()/SLEEP/LOAD_FILE...), deny dangerous funcs/vars, scan for write/DDL nodes (DELETE...RETURNING in subquery), LIMIT clamp.
- Make readonly DB mandatory (no main-DB fallback).
- Wire AI_TEXT2SQL_ENABLED (capability builder + run_query).
- _execute_final: add asyncio.wait_for timeout; remove dead Text2SqlTimeoutError.

Model consolidation (M4):
- Add AIDefaultModelScene.text2sql; resolve via ai_default_model_service.
- Remove AI_TEXT2SQL_PROVIDER_ID / AI_TEXT2SQL_MODEL_ID.

Tests: 41 guardrail cases incl. namespace-collision, tableless recon, INTO OUTFILE, DELETE...RETURNING, large-LIMIT clamp.
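The combination of re-guarding and an asyncio.wait_for timeout mentioned in this commit might look roughly like the sketch below. This is not the PR's _execute_final: the function signature is invented, and `guard` and `run_readonly` are hypothetical callables standing in for the guardrail and the readonly-account query runner.

```python
import asyncio


async def execute_final(sql, guard, run_readonly, timeout_seconds):
    """Re-guard the model's SQL, then run it under a hard timeout."""
    # guard() raises if the SQL violates the policy, so nothing unsafe is executed.
    safe_sql = guard(sql)
    # wait_for cancels the awaitable and raises asyncio.TimeoutError on expiry.
    return await asyncio.wait_for(run_readonly(safe_sql), timeout=timeout_seconds)
```

Cancelling the coroutine stops the client from waiting, but it does not by itself guarantee that the database server abandons the query, so a server-side statement timeout on the readonly account is a sensible complement.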

Replace the bespoke 3-tool pydantic-ai Agent (list_tables/describe_table/execute_sql) with a single structured-output call (output_type=Text2SqlResult, no tools). Table/column context is pre-fetched server-side and inlined into the system prompt.

Why: the tool loop duplicated the plugin's existing capability/builtin_toolset pipeline (reviewer 'major reject reason'). The execute_sql self-correction loop was redundant: the final SQL is re-guarded + re-executed by _execute_final regardless.

Net effect: smaller attack surface (the Agent can no longer execute SQL at all), one LLM round-trip instead of N. run_query return contract unchanged. _resolve_model / _execute_final / _write_history untouched. Drops now-unused imports (json, guardrail exceptions).

The chat capability tool text2sql_query was described as 'FBA 业务数据' ("FBA business data") with order/supplier examples, so the model did not invoke it for log/count questions (e.g. 'how many operation logs today'). Broaden the description to explicitly cue logs/counts/stats and to state that any question about data in the database should prefer this tool.

wu-clan · 2026-06-26T15:07:28Z

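I suggest making this a standalone plugin. When vibe coding (vb), send this to the AI: use the fba skills depends_on to split the text2sql plugin out as a standalone plugin.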

dengjingren added 5 commits June 22, 2026 22:32

dividduang wants to merge 5 commits into fastapi-practices:master from dividduang:feat/text2sql
