feat: Text2SQL — guarded natural-language query (sqlglot AST + mandatory readonly)#2

Open
dividduang wants to merge 5 commits into fastapi-practices:master from dividduang:feat/text2sql

Conversation

@dividduang


What

Text2SQL: a user asks a natural-language question, an LLM generates SQL, the system executes it against a SELECT-only readonly account and returns rows. Exposed as a chat capability (text2sql_query function tool) plus a standalone /queries endpoint, with dataset/table/example management.

An earlier Text2SQL was removed from this plugin. This re-introduces it security-first, directly addressing the reasons it was pulled (weak guards, write-engine fallback, config bloat, parallel agent runtime).

Security model (fail-closed)

  • sqlglot AST guard (text2sql/guardrails.py): single SELECT only; every Table node is checked against the dataset allowlist by full schema.table — no namespace stripping, so mysql.user / information_schema.* exfil via name collision is blocked. Rejects tableless recon (@@hostname, USER(), VERSION(), SLEEP, LOAD_FILE …), dangerous functions, system/session variables, and write/DDL nodes anywhere in the tree (incl. PostgreSQL DELETE … RETURNING inside a subquery). Clamps LIMIT to max_rows.
  • Mandatory readonly account (text2sql/readonly_db.py): if AI_TEXT2SQL_READONLY_* is unset, execution refuses — no fallback to the writable main DB.
  • AI_TEXT2SQL_ENABLED gate (default false) on both the capability builder and /queries.
  • The final SQL is re-guarded and re-executed by _execute_final (defense in depth); the agent itself never executes SQL.
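The two fail-closed gates above (feature flag off by default, no execution without a readonly account) can be sketched roughly as follows. This is a minimal illustration, not the code in text2sql/readonly_db.py: the exception names, the `ReadonlyTarget` type, the `resolve_readonly_target` helper, and the rule that only the literal string "true" enables the feature are all assumptions made for the example. Only the setting names come from the PR description.

```python
from dataclasses import dataclass
from typing import Mapping


class Text2SqlDisabledError(RuntimeError):
    """Hypothetical error: the AI_TEXT2SQL_ENABLED gate is off."""


class ReadonlyNotConfiguredError(RuntimeError):
    """Hypothetical error: a readonly-account setting is missing."""


@dataclass(frozen=True)
class ReadonlyTarget:
    host: str
    port: int
    user: str
    password: str


_REQUIRED = ("HOST", "PORT", "USER", "PASSWORD")


def resolve_readonly_target(env: Mapping[str, str]) -> ReadonlyTarget:
    """Return the readonly connection target or raise; never fall back."""
    # Gate 1: the feature is off unless explicitly enabled.
    # Assumption: only the literal "true" (case-insensitive) enables it.
    if env.get("AI_TEXT2SQL_ENABLED", "false").strip().lower() != "true":
        raise Text2SqlDisabledError("Text2SQL is disabled")

    # Gate 2: every readonly setting must be present and non-empty.
    values = {key: env.get(f"AI_TEXT2SQL_READONLY_{key}", "") for key in _REQUIRED}
    missing = [key for key, value in values.items() if not value.strip()]
    if missing:
        # Fail closed: refuse instead of using the writable main database.
        raise ReadonlyNotConfiguredError(
            "missing readonly settings: " + ", ".join(missing)
        )

    return ReadonlyTarget(
        host=values["HOST"],
        port=int(values["PORT"]),
        user=values["USER"],
        password=values["PASSWORD"],
    )
```

Because the function either returns a complete readonly target or raises, there is no code path on which a caller could end up holding a connection to the main database.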

Architecture

  • Single-shot agent, no bespoke tool loop: table/column metadata is pre-fetched server-side and inlined into the system prompt; the agent emits {sql, summary} via structured output. This avoids a second pydantic-ai Agent with its own tool loop / model resolution / history (the prior duplication).
  • Model resolution reuses AIDefaultModelScene.text2sql via ai_default_model_service (no parallel PROVIDER_ID/MODEL_ID config).
  • Dataset → table (allowlist) → example (few-shot) data model; schema introspection reuses the code-generator pattern and excludes ai_* / gen_* plugin tables.
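The single-shot flow described above (metadata inlined into the prompt, one structured {sql, summary} reply, no tools) can be illustrated with a standard-library sketch. The PR uses pydantic-ai structured output with `output_type=Text2SqlResult`; the version below swaps in plain `json` and a dataclass purely to show the shape of the contract. The `build_system_prompt` and `parse_result` helpers, the prompt wording, and the `app.orders` table are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Text2SqlResult:
    sql: str
    summary: str


def build_system_prompt(tables: Dict[str, List[str]]) -> str:
    """Inline pre-fetched table/column metadata so the model needs no tools."""
    lines = [
        "Answer with a JSON object containing exactly the keys 'sql' and 'summary'.",
        "Use only these tables and columns:",
    ]
    for name in sorted(tables):
        lines.append(f"- {name}({', '.join(tables[name])})")
    return "\n".join(lines)


def parse_result(raw: str) -> Text2SqlResult:
    """Validate the model reply; anything but the expected shape is rejected."""
    data = json.loads(raw)  # invalid JSON raises a ValueError subclass
    if not isinstance(data, dict) or set(data) != {"sql", "summary"}:
        raise ValueError("expected an object with exactly 'sql' and 'summary'")
    for key in ("sql", "summary"):
        if not isinstance(data[key], str) or not data[key].strip():
            raise ValueError(f"'{key}' must be a non-empty string")
    return Text2SqlResult(sql=data["sql"], summary=data["summary"])


if __name__ == "__main__":
    # Hypothetical dataset with one allowlisted table.
    print(build_system_prompt({"app.orders": ["id", "total", "created_at"]}))
    print(parse_result('{"sql": "SELECT COUNT(*) FROM app.orders", "summary": "Order count"}'))
```

The parsed `sql` is still untrusted text at this point; per the security model, it only reaches the database after the guard and the readonly executor have accepted it.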

Tests

41 guardrail cases (pure-logic, no DB) covering: namespace-collision bypass, tableless recon, all DML/DDL, multi-statement, INTO OUTFILE, DELETE…RETURNING in subquery, UNION/CTE exfil, large-LIMIT clamp.
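To make the namespace-collision case concrete, here is a pure-logic sketch of the allowlist comparison. It assumes table references have already been extracted from the parsed AST as (schema, name) pairs, so sqlglot itself is not used here. The helper names, the lower-casing of identifiers, and the rule that unqualified names resolve to a default schema are assumptions for illustration, not the behaviour of text2sql/guardrails.py.

```python
from typing import FrozenSet, Optional, Tuple

# (schema or None, table name), as a guard might read them off Table nodes.
TableRef = Tuple[Optional[str], str]


def qualify(ref: TableRef, default_schema: str) -> str:
    """Build the full schema.table name; unqualified names use the default schema."""
    schema, name = ref
    return f"{(schema or default_schema).lower()}.{name.lower()}"


def is_allowed(ref: TableRef, allowlist: FrozenSet[str], default_schema: str) -> bool:
    """Compare by full schema.table, never by bare table name."""
    return qualify(ref, default_schema) in allowlist


def is_allowed_stripped(ref: TableRef, allowlist: FrozenSet[str]) -> bool:
    """The flawed variant: strips the namespace before comparing."""
    bare_names = {entry.split(".", 1)[1] for entry in allowlist}
    return ref[1].lower() in bare_names


# Hypothetical dataset allowlist and test cases.
ALLOWLIST = frozenset({"app.user", "app.orders"})
CASES = [
    (("app", "user"), True),                    # allowlisted, fully qualified
    ((None, "user"), True),                     # resolves to app.user
    (("mysql", "user"), False),                 # name collision with app.user
    (("information_schema", "tables"), False),  # system catalog
]

if __name__ == "__main__":
    for ref, expected in CASES:
        assert is_allowed(ref, ALLOWLIST, "app") is expected
    # The stripped comparison wrongly admits mysql.user.
    assert is_allowed_stripped(("mysql", "user"), ALLOWLIST) is True
    print("all cases behave as described")
```

The last assertion shows why stripping the namespace is unsafe: `mysql.user` and `app.user` share the bare name `user`, so only the full-name comparison tells them apart.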

Config (plugin.toml)

AI_TEXT2SQL_ENABLED (default false), _SCHEMA, _MAX_ROWS, _TIMEOUT, _MAX_RETRIES, _READONLY_{HOST,PORT,USER,PASSWORD}.

Follow-ups (not in this PR)

  • Optional bounded self-correction retry (single-shot today; the final SQL is re-guarded regardless).

dengjingren added 5 commits June 22, 2026 22:32
- Add text2sql engine: guardrails, schema metadata, readonly DB access
- Add dataset/table/example CRUD, schema, service layer
- Wire text2sql capability and v1 router endpoint
- Add web search (Exa/Tavily) and Text2SQL settings in plugin.toml
- Add guardrails tests
…config mgmt

- Resolve conflicts in plugin.toml, .env.example, 8 SQL files
- Reassign Text2SQL snowflake menu IDs 840-843 -> 849-852 to avoid
  collisions with upstream's EditAIDefaultModel/AIConfigManage/QuickPhrase
- Preserve all Text2SQL python (crud/model/schema/service/text2sql/tests)
- Adopt upstream: default_model mgmt, AI config menu, chat refactor

Address upstream review blockers for the Text2SQL feature.

Security (fail-closed):
- Rewrite guardrail with sqlglot AST: full schema.table allowlist (fixes
  mysql.user->user namespace-collision bypass), reject tableless recon
  (@@hostname/USER()/SLEEP/LOAD_FILE...), deny dangerous funcs/vars, scan
  for write/DDL nodes (DELETE...RETURNING in subquery), LIMIT clamp.
- Make readonly DB mandatory (no main-DB fallback).
- Wire AI_TEXT2SQL_ENABLED (capability builder + run_query).
- _execute_final: add asyncio.wait_for timeout; remove dead Text2SqlTimeoutError.

Model consolidation (M4):
- Add AIDefaultModelScene.text2sql; resolve via ai_default_model_service.
- Remove AI_TEXT2SQL_PROVIDER_ID / AI_TEXT2SQL_MODEL_ID.

Tests: 41 guardrail cases incl. namespace-collision, tableless recon,
INTO OUTFILE, DELETE...RETURNING, large-LIMIT clamp.

Replace the bespoke 3-tool pydantic-ai Agent (list_tables/describe_table/
execute_sql) with a single structured-output call (output_type=Text2SqlResult,
no tools). Table/column context is pre-fetched server-side and inlined into
the system prompt.

Why: the tool loop duplicated the plugin's existing capability/builtin_toolset
pipeline (reviewer 'major reject reason'). The execute_sql self-correction loop
was redundant -- the final SQL is re-guarded + re-executed by _execute_final
regardless. Net effect: smaller attack surface (the Agent can no longer execute
SQL at all), one LLM round-trip instead of N.

run_query return contract unchanged. _resolve_model / _execute_final /
_write_history untouched. Drops now-unused imports (json, guardrail exceptions).

The chat capability tool text2sql_query was described as 'FBA 业务数据' ('FBA
business data') with order/supplier examples, so the model did not invoke it
for log/count questions (e.g. 'how many operation logs today'). Broaden the
description to explicitly cue logs/counts/stats and state that any
database-data question should prefer this tool.
@wu-clan

wu-clan commented Jun 26, 2026

Member

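I suggest turning this into a standalone plugin. When vibe coding, send this to the AI: use fba skills depends_on to split text2sql out into an independent plugin.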

2 participants