I run whatmedicaidpays.com, a site that surfaces what state Medicaid programs actually pay for medical procedures. There's a page on it called /ask. You type a question in plain English, like What does Medicaid pay for an office visit in California?, and you get back stats, a table, a US choropleth map.
Behind that page, an LLM writes a SQL query against my production database, on the spot, in your tab. Then a second LLM call turns the rows into the answer you read.
The first time I wired that up, I sat there for a minute with my finger off the deploy button.
Letting a model write SQL and then running that SQL against your real data is the kind of thing that sounds reckless if you've been around long enough to remember Bobby Tables. And it is reckless, by default. Most of what follows is about making the gap between the model's output and what touches my database as wide and as boring as possible.
I'm going to give away the lesson up front, because the rest of this post is the work I had to do to learn it. The AI part of an "AI on top of your database" feature is the easy part. The interesting work, and basically all the work that's load-bearing, is in the boring layer that surrounds the model.
The framing that finally clicked
For a long time I was thinking about this wrong. I was thinking about it as "an AI with access to my database," which is a phrase that immediately puts you in the wrong frame. It makes the AI sound like the hard part. It makes you reach for fancy answers like RAG, agents, function calls, fine-tuning. It makes you think the interesting work is on the model side.
The framing that finally clicked was: this is a junior engineer who's allowed to write queries, but not allowed to merge them.
The model writes the query. Something else, somewhere boring, decides whether the query gets to run. Once you have that picture, the design writes itself. You start asking, what does code review look like? What does the merge gate look like? What's the linter? What's the equivalent of "you don't have prod access"?
Those questions all have boring, deterministic answers. That's the entire reason this works.
What actually runs when you hit Ask
Here's the full pipeline, in order, so the rest of the post has somewhere to hang from:
1. The form action receives the question.
2. A schema validates it (non-empty, max 500 characters).
3. Cloudflare's per-IP rate limiter blocks if you're over ten requests in sixty seconds.
4. A regex pass rejects obvious prompt-injection attempts.
5. Cache check. If someone already asked this exact question, I redirect to the existing answer.
6. LLM #1 generates SQL from the question, with the database schema baked into the system prompt.
7. A standalone SQL validator inspects what the model produced.
8. The query runs against D1 through the Sessions API, so reads can be served by replicas.
9. LLM #2 turns the rows into a structured answer (text, stats, table, chart, or map).
10. The result is persisted with a slug, and you get redirected to a shareable URL.
LLMs only show up at steps 6 and 9. Everything else is normal, deterministic application code that I, a tired human, can read on a Tuesday and trust.
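Condensed into code, the orchestration is nothing exotic. A minimal sketch, with hypothetical helper names standing in for the real ones (the rate limiter sits in front of this at the edge, so it doesn't appear here):

declare function validateQuestion(q: string): string;                        // length + injection regex
declare function findCachedAnswer(q: string): Promise<{ url: string } | null>;
declare function generateSQL(q: string): Promise<string>;                    // LLM #1
declare function validateSQL(sql: string): string;                           // the merge gate, below
declare function runQuery(sql: string): Promise<Record<string, unknown>[]>;  // D1 via the Sessions API
declare function formatAnswer(q: string, rows: unknown[]): Promise<unknown>; // LLM #2
declare function persistResult(q: string, answer: unknown): Promise<string>; // returns the slug

export async function handleAsk(raw: string): Promise<{ redirect: string }> {
  const question = validateQuestion(raw);
  const cached = await findCachedAnswer(question);
  if (cached) return { redirect: cached.url };   // same question, same answer, no model spend

  const sql = await generateSQL(question);
  const safeSQL = validateSQL(sql);              // deterministic gate between the two LLM calls
  const rows = await runQuery(safeSQL);
  const answer = await formatAnswer(question, rows);

  const slug = await persistResult(question, answer);
  return { redirect: `/ask/r/${slug}` };
}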
The schema goes in the prompt, not in RAG
A lot of "give the AI your database" tutorials reach for RAG as the default move. Embed the schema, embed example queries, retrieve the relevant chunks at query time, hand the model whatever came back.
I didn't do that. The schema for whatmedicaidpays is small enough that I just paste the whole thing into the system prompt:
- state(id TEXT PK, code TEXT UNIQUE, name TEXT)
- hcpcs_code(id TEXT PK, code TEXT UNIQUE, description TEXT, category TEXT)
- reimbursement(id TEXT PK, state_id FK, hcpcs_code_id FK, year INT,
total_providers INT, total_beneficiaries INT, total_claims INT,
total_paid REAL, avg_paid_per_claim REAL, …)
- service_category(id TEXT PK, slug TEXT UNIQUE, name TEXT)
- service_category_code(category_id FK, hcpcs_code_id FK)
Plus the join keys. Plus a sentence saying state codes are uppercase two-letter abbreviations. Plus the years available. That's the whole context the model gets.
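To make "paste the whole thing" concrete: the prompt assembly is just string concatenation. A minimal sketch with illustrative wording and a hypothetical buildSystemPrompt, not the exact production string:

// Illustrative sketch: the schema is one hard-coded string, pasted whole into
// the system prompt. There is no retrieval step, so nothing can be missed.
const SCHEMA = `
state(id TEXT PK, code TEXT UNIQUE, name TEXT)
hcpcs_code(id TEXT PK, code TEXT UNIQUE, description TEXT, category TEXT)
reimbursement(id TEXT PK, state_id FK, hcpcs_code_id FK, year INT, total_paid REAL, avg_paid_per_claim REAL, ...)
service_category(id TEXT PK, slug TEXT UNIQUE, name TEXT)
service_category_code(category_id FK, hcpcs_code_id FK)
`;

const YEARS: number[] = []; // filled in from the data; elided here

const buildSystemPrompt = (rules: string): string =>
  [
    'You translate questions about Medicaid reimbursement into SQLite SELECT queries.',
    `Schema:\n${SCHEMA}`,
    'State codes are uppercase two-letter abbreviations.',
    `Years available: ${YEARS.join(', ')}.`,
    rules, // the rules block quoted below
  ].join('\n\n');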
This is one of those moments where the boring choice is the correct choice. RAG would let me pretend my schema is bigger than it is. It would also give the model a way to be wrong about which tables exist, because retrieval can miss. A schema pasted in full into the prompt can't miss. The model can refer to a table I told it about, or it can't.
This isn't just my opinion. Vercel published a piece in late 2025 called "We removed 80% of our agent's tools," about deleting 18 specialized retrieval and query-planning tools from their internal text-to-SQL agent. They replaced the whole thing with two tools, ExecuteCommand and ExecuteSQL, and let the model navigate their semantic layer with cat, grep, and ls. Success rate went from 80% to 100%. Average run time dropped 3.5x. Token cost dropped 37%. The line that stuck with me was "we were doing the model's thinking for it."
The shape of that lesson is the shape of mine. Don't wrap the model in scaffolding it doesn't need. Whatever your custom retrieval or query layer is doing, the model can probably do without it, if you let it.
Where I diverge from Vercel is what happens after the SQL gets written. Their agent runs against internal analytics they trust. They can hand it ExecuteSQL and let it run. I can't. My users are anonymous strangers on the open internet, which is why I still need a forty-line validator between the model and D1. "Skip the wrapper around the model" generalizes. "Skip everything between the model and your database" doesn't.
The bigger point: for natural-language-to-SQL, the prompt is the contract. Every constraint you care about has to be in there in plain English. Mine looks like this:
Rules:
- ONLY generate SELECT statements. Never INSERT, UPDATE, DELETE, DROP, or ALTER.
- Always join tables properly when crossing table boundaries.
- Use state.name for display, state.code for filtering.
- Use hcpcs_code.code and hcpcs_code.description for procedures.
- LIMIT results to 50 rows max unless the user explicitly asks for more.
- Format currency columns with 2 decimal places where appropriate.
- Return ONLY the SQL query, no explanation, no markdown, no backticks.
Search strategy:
- When the user asks about a broad service area, first check if it
matches a known service_category. Known categories (by slug):
behavioral-health, caregiver, dental, diabetes-supplies, doctor-visit,
eye-exam, eyeglasses, medical-transportation, mobility-aids,
nursing-home, physical-therapy, speech-therapy, therapy.
- If the topic matches a category, JOIN service_category_code and
service_category to filter by slug. This is more accurate than keyword
search because it uses curated code mappings.
- If the topic does not match any category, fall back to searching
hcpcs_code.description using LIKE '%keyword%' patterns with multiple
OR conditions. Use multiple synonyms. E.g. for "autism":
LIKE '%autism%' OR LIKE '%adaptive behav%' OR LIKE '%behavior%analys%'.
- When combining a category with a keyword refinement (e.g. "wheelchair"
within mobility-aids), use OR between the category filter and the LIKE,
not AND.
- For specific procedure lookups (e.g. "knee replacement", "MRI"), always
use hcpcs_code.description LIKE patterns.
- For "cleaning" in dental context, always search LIKE '%cleaning%'
OR LIKE '%prophylaxis%'. The medical term is prophylaxis.
- For year-over-year spending comparisons, aggregate total spending per
year first using SUM(total_paid) grouped by year, then compare earliest
to latest. Do NOT use MAX/MIN on individual row total_paid values.
A lot of those rules are domain weirdness I had to discover the hard way. The model couldn't figure out on its own that cleaning in a dental context means prophylaxis. It couldn't figure out that you don't compare year-over-year spending by taking the max of a single row's total_paid. Every line in there is a bug I shipped once and then patched in the prompt.
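For a sense of what the category path is steering toward, here's the shape of query those rules aim at for a question like "what does Medicaid pay for an eye exam in Texas." This is an illustrative example I wrote by hand against the schema above, not captured model output:

// Illustrative only: a category-matched question joins through the curated
// service_category mapping instead of guessing at LIKE patterns.
const exampleSQL = `
  SELECT state.name, hcpcs_code.code, hcpcs_code.description,
         reimbursement.year, reimbursement.avg_paid_per_claim
  FROM reimbursement
  JOIN state ON state.id = reimbursement.state_id
  JOIN hcpcs_code ON hcpcs_code.id = reimbursement.hcpcs_code_id
  JOIN service_category_code ON service_category_code.hcpcs_code_id = hcpcs_code.id
  JOIN service_category ON service_category.id = service_category_code.category_id
  WHERE service_category.slug = 'eye-exam'
    AND state.code = 'TX'
  ORDER BY reimbursement.year DESC
  LIMIT 50
`;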
The model is going to follow some of those rules and quietly break others. That's fine, because the prompt is not the only line of defense.
It's the first line. The validator is the second. And the validator is the one I trust.
The validator is the merge gate
After the model returns its query, before that query touches D1, it goes through a hand-rolled SQL validator. The whole thing is maybe forty lines.
const DANGEROUS = [
  'INSERT', 'UPDATE', 'DELETE', 'DROP', 'ALTER',
  'CREATE', 'ATTACH', 'DETACH', 'PRAGMA',
  'REPLACE', 'GRANT', 'REVOKE'
];

const ALLOWED_TABLES = [
  'state', 'hcpcs_code', 'reimbursement',
  'service_category', 'service_category_code'
];

class SQLValidationError extends Error {}

// Rejection is terminal: a rejected query never reaches D1.
const reject = (reason: string): never => {
  throw new SQLValidationError(`rejected SQL: ${reason}`);
};

const validateSQL = (sql: string): string => {
  const trimmed = sql.trim();
  if (!trimmed.toUpperCase().startsWith('SELECT')) reject('not a SELECT');
  if (trimmed.includes(';')) reject('semicolon');
  if (trimmed.includes('--') || /\/\*/.test(trimmed)) reject('comment');

  const bad = new RegExp(`\\b(${DANGEROUS.join('|')})\\b`, 'i').exec(trimmed);
  if (bad) reject(`dangerous keyword: ${bad[1]}`);

  // Allowlist check on every table named after FROM or JOIN.
  for (const m of trimmed.matchAll(/\b(?:FROM|JOIN)\s+(\w+)/gi)) {
    if (!ALLOWED_TABLES.includes(m[1].toLowerCase())) {
      reject(`disallowed table: ${m[1]}`);
    }
  }

  // Force a LIMIT if the model forgot one.
  return /\bLIMIT\b/i.test(trimmed) ? trimmed : `${trimmed} LIMIT 50`;
};
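In the handler, the gate is a plain try/catch. A sketch of how a rejection turns into a user-facing error instead of a query; the wording and wrapper name are illustrative:

// Sketch: anything the validator rejects never reaches D1, it just becomes
// a polite error for the visitor.
const gateSQL = (generatedSQL: string): string | { error: string } => {
  try {
    return validateSQL(generatedSQL);
  } catch (err) {
    if (err instanceof SQLValidationError) {
      return { error: 'Could not answer that safely. Try rephrasing the question.' };
    }
    throw err;
  }
};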
The biggest thing I want to call out is the table check. It's an allowlist, not a blocklist. Blocklists are how you find out about the table you forgot existed. Allowlists let me add data later without quietly widening the attack surface. The day I add a users table to this database is the day I have to think, on purpose, about whether /ask should be able to read from it. I want that decision to be a code change, not an oversight.
Semicolons and comments get rejected for the same reason. The model has zero legitimate reason to write either. Both are classic injection scaffolding. If they show up, the safer move is to bounce the query than to try to be clever about why they might be fine.
The forced LIMIT is in there for a different reason. A missing LIMIT on a multi-million-row table isn't a security issue. It's a billing issue, and on a row-billed database it's a serious one. I auto-append LIMIT 50 if the model didn't write one, because I don't want to find out at 3am that I shipped a billion-row scan.
The dangerous-keyword list is paranoid on purpose. Most of those verbs wouldn't legally execute inside a SELECT anyway. I list them anyway, because the cost of the extra check is one regex and the cost of being wrong is whatever a malicious DROP would do.
This validator catches the model's bad days. It catches the user's bad days, when their attempt to prompt-inject leaks past the regex layer above it. It doesn't need to be smart. It needs to be paranoid.
The single most useful thing about the validator is that I can read it in one sitting and convince myself it's right. The LLM is the part of the system I can't audit. The validator is the part I can. That's why the validator is the one carrying the weight.
The cheaper layers around it
Three smaller layers sit around the model and the validator.
The first is question validation. Before the user's text reaches the model at all, the question goes through a regex pass for obvious prompt-injection attempts. The actual list:
const INJECTION_PATTERNS = [
  /ignore\s+(all\s+)?previous\s+instructions/i,
  /\bdebugg?(ing|er|start|end)\b/i,
  /print\s+the\s+(prompt|response|system|instructions)/i,
  /ignore\s+(the\s+)?(above|system)\s+(prompt|instructions|message)/i,
  /you\s+are\s+now\s+(a|an)\s+/i,
  /act\s+as\s+(a|an)\s+(different|new)/i,
  /forget\s+(all|your|everything)/i,
  /\bcurl\b.*\b(localhost|127\.0\.0\.1)\b/i,
  /reveal\s+(your|the)\s+(system|secret|api|internal)/i,
  /what\s+(is|are)\s+your\s+(instructions|rules|system\s+prompt)/i,
  /tool\s*calls?\s+/i
];
This catches none of the genuinely creative attacks. It catches all of the lazy ones that someone would otherwise paste in to see what happens. The threat model for /ask isn't a sufficiently determined adversary. It's the casual visitor who saw a tweet about jailbreaking an LLM and wants to try the obvious copy-paste. Cheap to write. Pays for itself.
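Wiring it in is a few lines. A sketch, with a hypothetical validateQuestion that also covers the schema checks from earlier (non-empty, max 500 characters):

// Sketch of the pre-model question check: cheap, deterministic, and it runs
// before any LLM spend. Error wording is illustrative.
const validateQuestion = (raw: string): string => {
  const question = raw.trim();
  if (question.length === 0 || question.length > 500) {
    throw new Error('Question must be between 1 and 500 characters.');
  }
  if (INJECTION_PATTERNS.some((pattern) => pattern.test(question))) {
    throw new Error('That looks like a prompt, not a question.');
  }
  return question;
};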
The second is rate limiting. Cloudflare's per-IP limiter, ten requests per sixty seconds. If you start treating my form like a free LLM playground, you get a 429. This is mostly a cost defense, not a security one. But on a stack where every request is two LLM calls plus a database query, "cost defense" and "security" start to look like the same thing.
The third is caching. Before I bother generating SQL, I check whether someone already asked this exact question. If they did, I redirect to the existing answer. Same input, same output, no model spend. The cache also doubles as the result-sharing mechanism. Every answered question gets its own URL, like /ask/r/what-does-medicaid-pay-for-a-flu-shot-in-texas, which means the LLM-generated content compounds into pages that can rank in search. SEO as a side effect of caching. I'll take it.
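The cache check and the shareable URL are the same machinery: the question becomes a slug once, and that slug is both the cache key and the page path. A rough sketch, with an illustrative slugify and a hypothetical ask_result table standing in for the real storage:

// Sketch only: slugify the question, look for an existing persisted answer,
// and redirect to it if one exists.
const slugify = (question: string): string =>
  question
    .toLowerCase()
    .replace(/[^a-z0-9\s-]/g, '')
    .trim()
    .replace(/\s+/g, '-');

const findCachedAnswer = async (db: D1Database, question: string) => {
  const slug = slugify(question);
  const row = await db
    .prepare('SELECT slug FROM ask_result WHERE slug = ?')  // ask_result is illustrative
    .bind(slug)
    .first<{ slug: string }>();
  return row ? { url: `/ask/r/${row.slug}` } : null;
};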
The query goes through a read replica, on purpose
The actual D1 call uses the Sessions API:
const session = platform.env.whatmedicaidpays_db.withSession();
const stmt = session.prepare(sql);
const response = await stmt.all();
console.log(`served by region=${response.meta.served_by_region}`);
console.log(`primary=${response.meta.served_by_primary}`);
withSession() lets the query be served by whichever D1 replica is closest to the user. A request from Singapore can be answered by a replica in Singapore instead of round-tripping to the primary in North America. I log which region served the query and whether the primary handled it on every run, partly so I'd find out fast if replication ever stopped doing what I think it's doing.
This part has nothing to do with AI. All of it is what makes the AI feel usable. A natural-language interface that takes 20 seconds per query feels broken even when the answer is good. A second saved here is a second that buys me more patience for the model on the other end.
Why two LLM calls and not one
A small thing worth explaining, because it took me a while to see why the two-call shape is better than the one-call shape.
The naive version is one prompt: here's the question, here's the schema, write SQL and an answer. That works, and it's tempting because it's cheaper.
The reason I split it is that the two halves have completely different jobs. The first call is generating code that has to be safe to execute. The second is generating prose and structured presentation that has to be safe to read. Different rules, different failure modes, different validators on each side.
If I let one call do both, I lose the seam where the validator fits. There's no clean point to inspect "the SQL the model wrote" if the SQL is mixed in with the answer. The two-call design exists almost entirely so I can put the validator between the two calls. The cost of the extra round trip is small. The clarity it buys me is large.
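For what the second call is on the hook for, here's a sketch of the answer shape it's asked to return. The field names are illustrative, not the production schema, but the idea is that everything the page renders comes back structured:

// Illustrative shape of LLM #2's output: something the page can render as
// text, stats, a table, a chart, or a US choropleth map.
type AskAnswer = {
  summary: string;                                           // the prose the visitor reads
  stats?: { label: string; value: string }[];
  table?: { columns: string[]; rows: (string | number)[][] };
  chart?: { kind: 'bar' | 'line'; x: string; y: string };
  map?: { metric: string; byStateCode: Record<string, number> };
};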
The layer I haven't shipped
I'm going to be honest about a gap, because writing it down here is the only way I can't pretend later that I didn't see it.
The validator is doing the real security work. Below it, the D1 binding the worker holds is the same one every other route in the app uses: same credentials, full read-write. If the validator ever has a bug that lets a mutating statement through, there's nothing underneath it stopping the damage.
The cleaner answer is a separate binding, scoped to a read-only D1 user, used exclusively by /ask. D1 doesn't quite expose Postgres-style role grants yet, so the workaround today is a separate read-replica database with its own restricted binding. I haven't shipped that. I've thought about it more than I'd like to admit, weighed it against the validator, and decided the validator is enough until traffic or stakes change.
Defense in depth is a spectrum. I'm one layer short of where I'd like to be, and I know it.
Code Mode is the version I want next
The version of /ask I keep daydreaming about isn't a better validator. It's a different shape entirely, and Cloudflare has been quietly shipping the pieces for it.
In late 2025 they introduced Code Mode. The idea is simple: instead of giving an LLM a list of tools to call, give it a TypeScript API and ask it to write a small program. The program runs in a fresh isolate, calls the API, and only the final result returns to the model's context. It turns out LLMs are much better at writing code than at picking among tool calls, because TypeScript is everywhere in training data and tool-call tokens are not. Cloudflare followed it up in early 2026 with a version that compresses 2,500+ MCP endpoints into about 1,000 tokens of API surface.
The substrate underneath is Dynamic Workers. A Worker Loader binding lets one Worker spin up another at request time, from a string of code, in a fresh V8 isolate. Milliseconds to boot. No filesystem. No env vars. Outbound fetch off by default. You hand the new Worker exactly the bindings it needs and nothing else. The heavier container-backed cousin is the Sandbox SDK, for when an agent actually needs a Linux box. I used Durable Objects for naps.sh, and they're great for stateful sessions, but the loader-based isolate model is a different primitive, one I haven't had a project for yet.
What /ask looks like in that shape: I expose a tiny API (db.query(sql, params), db.schema(), maybe chart(rows)), generate the .d.ts, and ask the LLM to write a short program against it. Multi-step questions stop being two LLM calls and start being one program. The validator stops asking is this string a SELECT? and starts asking what bindings does this isolate get? That's a stronger boundary, because the answer is "a read-only D1 handle and nothing else." Outbound network is off. The filesystem doesn't exist. The blast radius collapses to whatever the bindings expose.
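A sketch of the surface I'd hand it, nothing Cloudflare-specific, just the contract the generated program would be written against. The interface and names are hypothetical:

// Illustrative API the generated program would see. The isolate gets these
// bindings and nothing else.
interface AskAPI {
  db: {
    query(sql: string, params?: unknown[]): Promise<Record<string, unknown>[]>; // read-only handle
    schema(): Promise<string>;                                                   // the same schema text, on demand
  };
  chart(rows: Record<string, unknown>[], kind: 'bar' | 'line' | 'map'): Promise<string>;
}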
I haven't built it. The current shape works, and "Code Mode for a five-table SQLite database" is a lot of machinery for a problem that's already solved. But the day /ask needs to do something genuinely multi-step, like compare these states across these years, chart the result, and explain the outliers, the right move isn't to make the prompt more elaborate. It's to let the model write code.
The lesson I actually came away with
The thing nobody tells you about hooking an LLM up to a real database is that the AI part is the easy part. The model writes correct SQL most of the time. Every interesting thing I learned building /ask was about the boring layer that surrounds it. The validator that bounces obvious bad output. The allowlist that beats the blocklist. The regex that filters the lazy attacks. The rate limiter that caps the bill. The cache that turns repeat questions into cached pages. The read replica that keeps the experience usable. The schema-in-the-prompt that makes the model's job small enough to do well.
That should sound familiar if you've been building anything for a while. The hard part is never the new shiny thing in the middle. It's always the plumbing that keeps the new shiny thing from setting your house on fire.
It's the same lesson I had to relearn last month when I wrote about a $134 Cloudflare D1 bill that came down to missing indexes. The thing that hurt me wasn't D1. It was the unindexed table underneath. The new shiny thing wasn't the problem. The boring layer was.
There's a version of this post that's just don't be reckless with LLMs and your database. Skip that one. The model is fine. The model writes correct SQL. The model is going to keep getting better while you sleep.
The thing that has to be careful is everything around it. Validators that bounce anything weird. Allowlists, not blocklists. Rate limits. Replicas. A LIMIT you didn't have to ask for. The boring layer is the only layer you actually own.
The model is the junior engineer. You're still on call.
Go break it if you want. I'll be reading the logs.