I run whatisthatbook.com. You type a fuzzy memory of a book. Something like "dystopian novel where firemen burn books." Or "blue cover, girl on a horse, 90s." It tries to name the book.

The pipeline is roughly one LLM call to Gemini 2.5 Flash Lite that returns five {title, author} candidates. Each one gets checked against OpenLibrary to confirm the book exists. Dedupe, render, done.
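
A minimal sketch of the verify-and-dedupe step in Python, under stated assumptions: `lookup_work_id` is a hypothetical stand-in for the OpenLibrary check, and deduping on the returned work ID is a guess at how the dedupe might work, not the site's actual code.

```python
from typing import Callable, Optional


def verify_and_dedupe(
    candidates: list[dict],
    lookup_work_id: Callable[[str, str], Optional[str]],
) -> list[dict]:
    """Keep candidates that resolve to a work ID, dropping repeats of the same ID.

    `lookup_work_id(title, author)` is a hypothetical callable that returns an
    OpenLibrary work ID, or None when the book cannot be confirmed.
    """
    seen: set[str] = set()
    results: list[dict] = []
    for cand in candidates:
        work_id = lookup_work_id(cand["title"], cand["author"])
        if work_id is None or work_id in seen:
            continue  # unverified candidate, or a repeat of an earlier one
        seen.add(work_id)
        results.append({**cand, "work_id": work_id})
    return results
```

Passing the lookup in as a function keeps the step testable without network access.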

For months I shipped prompt changes and prayed. I had no idea whether any of my "improvements" actually helped, or how much, or for whom. I'd tweak the system prompt, re-read the first three results for a query I made up in my head, decide it felt better, and ship it. Sometimes I'd swap the model. Same drill.

Eventually I wired up feedback buttons. "This is it!" and "Not the one" on each result card, "None of these match" at the bottom of the page. The site was running about 4,000 searches a week. The first week of feedback data came back: roughly one in twenty clicks was "this is it!" The other nineteen were "not the one" or "none of these match."

This is the part of the post where I'm supposed to say I had a moment of clarity. I didn't. I had a Sunday afternoon and a number I couldn't argue with.

Building the eval

I'd been putting off evals for the usual reasons. Evals feel like the kind of thing you build at a Big AI Company. The kind of thing that needs a platform and a process and a quarterly budget. They felt out of scale for one person and a side project.

That was wrong. The smaller the project, the worse your instinct is for whether things are improving, because you don't have a team to argue you out of your own taste. Evals are most useful exactly when you have no one else to disagree with you.

So I sat down and built one. The brief I gave myself:

  • 100 cases, stored as JSONL in the repo at evals/search/cases.jsonl. In the repo, so they can be reviewed in a PR and a CI gate can run them.
  • Each case has a user query, a correct book, and the book's OpenLibrary work ID. Work IDs are stable. Books get retitled, IDs don't.
  • Sources: about a third from r/whatsthatbook solved threads (the answer is in the comments, free real-world distribution), a quarter from saved production searches (verbatim user queries from my own DB), the rest hand-crafted around classics, foreign-language originals, and post-cutoff releases.
  • A validator that hits the OpenLibrary API at write-time, so a bad work ID can't sneak in. If I typo an ID, the file fails to validate, and the PR fails.
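
The brief above can be sketched as a small validator. This is an illustration, not the real script: the field names (`query`, `title`, `author`, `work_id`) are assumptions about the case schema, and `work_exists` is a hypothetical callable that would wrap the OpenLibrary lookup.

```python
import json
import re
from typing import Callable

# OpenLibrary work IDs look like "OL21177W": "OL", digits, then "W".
WORK_ID_RE = re.compile(r"^OL\d+W$")
REQUIRED_FIELDS = ("query", "title", "author", "work_id")  # assumed schema


def validate_cases(jsonl_text: str, work_exists: Callable[[str], bool]) -> list[str]:
    """Return a list of error messages; an empty list means the file is valid."""
    errors: list[str] = []
    for lineno, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            case = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(case, dict):
            errors.append(f"line {lineno}: expected a JSON object")
            continue
        missing = [key for key in REQUIRED_FIELDS if not case.get(key)]
        if missing:
            errors.append(f"line {lineno}: missing fields {missing}")
            continue
        work_id = case["work_id"]
        if not WORK_ID_RE.match(work_id):
            errors.append(f"line {lineno}: malformed work ID {work_id!r}")
        elif not work_exists(work_id):
            errors.append(f"line {lineno}: work ID {work_id!r} not found")
    return errors
```

In CI, a non-empty error list would fail the build, which is what stops a typo'd ID from merging.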

This took a weekend. Most of the time was reading Reddit, picking good cases, and resisting the urge to make every case a softball. Hard queries are the whole point.

I committed it, wired up a runner script that scores P@1 / P@5 / MRR with per-difficulty buckets, and ran the first baseline against the production model.
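
For reference, a minimal sketch of a scorer of this shape. It reads P@k as "the expected work appears in the top k results", which is an assumption about the runner, and the record keys (`difficulty`, `expected_work_id`, `returned_work_ids`) are hypothetical.

```python
from collections import defaultdict
from typing import Optional


def first_hit_rank(expected_id: str, returned_ids: list[str]) -> Optional[int]:
    """1-based rank of the expected work ID in the results, or None if absent."""
    for rank, work_id in enumerate(returned_ids, start=1):
        if work_id == expected_id:  # strict string equality on the ID
            return rank
    return None


def summarize(ranks: list[Optional[int]]) -> dict:
    """Aggregate per-case ranks into P@1, P@5 and MRR."""
    n = len(ranks)
    if n == 0:
        return {"n": 0, "p_at_1": 0.0, "p_at_5": 0.0, "mrr": 0.0}
    hits = [r for r in ranks if r is not None]
    return {
        "n": n,
        "p_at_1": sum(1 for r in hits if r == 1) / n,
        "p_at_5": sum(1 for r in hits if r <= 5) / n,
        "mrr": sum(1 / r for r in hits) / n,
    }


def score(runs: list[dict]) -> dict:
    """Score all runs overall and per difficulty bucket."""
    overall: list[Optional[int]] = []
    buckets: dict[str, list[Optional[int]]] = defaultdict(list)
    for run in runs:
        rank = first_hit_rank(run["expected_work_id"], run["returned_work_ids"])
        overall.append(rank)
        buckets[run["difficulty"]].append(rank)
    return {
        "overall": summarize(overall),
        "by_difficulty": {name: summarize(r) for name, r in buckets.items()},
    }
```

Note the strict equality check in `first_hit_rank`: the whole report is only as good as the ID it compares.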

The first baseline was 25 percent

P@1: 25 percent. P@5: 25 percent. MRR: 0.25.

Twenty-five percent. On a hundred cases, half of which I'd hand-picked as easy or medium. The model couldn't find Fahrenheit 451 from "dystopian novel where firemen burn books." It couldn't find Wuthering Heights. It couldn't find Harry Potter.

My first reaction was: of course. This is what shipping on vibes gets you. The model is bad. The pipeline is bad. The whole approach is bad. I had built an eval, the eval had spoken, and the eval was telling me the entire thing was broken.

I sat with that for about twenty minutes before I opened the failure list.

Three Wuthering Heights

Case 097: query about haunted moors and Heathcliff. Eval expects OpenLibrary work OL21177W (Wuthering Heights). Model returned OL45338173W. I looked it up.

OL45338173W is also Wuthering Heights. Same author. Same book.

I scrolled. Old Man and the Sea: eval wants OL63073W, model returned OL15110072W and OL27910351W. All three are Old Man and the Sea. Adventures of Huckleberry Finn: same story, two work entries for one book. Harry Potter and the Philosopher's Stone vs. Sorcerer's Stone: the US retitling has its own OpenLibrary work, separate from the UK original, and the eval pinned one while the model returned the other.

OpenLibrary has many duplicate work entries that have never been merged. Classics are the worst offenders, because every translation, every reissue, every academic study guide, every "annotated edition" gets a chance to spawn another work record. The model wasn't wrong. My scorer was. It was checking string equality on an opaque ID that turned out not to be unique.

The baseline wasn't measuring the model. It was measuring how often the model happened to return the same OpenLibrary work entry I'd pasted into the eval set. Many of the 75 failures looked like this.

How I fixed the eval, not the model

Two ways to fix it.

The cheap way: title+author fallback in the runner. If the work ID doesn't match but the title and author normalize-match the expected, count it as a hit. Five lines of code. Done by lunch.

I didn't do that. It blurs the metric in a way I'd regret later. A model returning some random "Wuthering Heights Study Guide" would score a hit. So would "Wuthering Heights: A Critical Edition with Essays." For a recognition task where the whole point is "did we name the right book," that's a metric you can't trust.
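
To make the blur concrete, here is a sketch of what such a fallback might look like. The normalization rules (accent folding, dropping a subtitle after a colon, stripping punctuation) are assumptions for illustration. They are typical choices, and they are where the leak comes from.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Fold accents, drop any subtitle after a colon, strip punctuation and case."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.split(":", 1)[0]  # "Title: Subtitle" -> "Title"
    text = re.sub(r"[^a-z0-9 ]", "", text.lower())
    return " ".join(text.split())


def fallback_hit(expected: dict, candidate: dict) -> bool:
    """Count a hit when normalized title and author both match."""
    return (
        normalize(expected["title"]) == normalize(candidate["title"])
        and normalize(expected["author"]) == normalize(candidate["author"])
    )
```

Under these rules, a record titled "Wuthering Heights: A Critical Edition with Essays" and credited to the same author matches the novel itself, so the scorer can no longer tell the two apart.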

The other way: make each case carry a list of acceptable_work_ids. The eval set encodes, explicitly, that these N work IDs are the same book. The scorer counts a hit if the model returns any of them. Strict, reproducible, debuggable.
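
A minimal sketch of that scoring rule. `acceptable_work_ids` is the field named above; the `work_id` key for the canonical ID is an assumed name.

```python
from typing import Optional


def rank_with_aliases(case: dict, returned_ids: list[str]) -> Optional[int]:
    """1-based rank of the first returned ID that is the canonical work or a listed alias."""
    accepted = {case["work_id"], *case.get("acceptable_work_ids", [])}
    for rank, work_id in enumerate(returned_ids, start=1):
        if work_id in accepted:
            return rank
    return None  # none of the accepted IDs were returned
```

With OL45338173W listed as an alias on the Wuthering Heights case, a run that returns OL45338173W first scores rank 1 instead of a miss, and a study guide with its own work ID still scores nothing unless someone deliberately adds it.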

The trick was finding the aliases without doing it by hand for a hundred classics.

I built a small pipeline. For each case, run the search and collect every work ID the model returned across a few model variants. For each candidate, fetch it from OpenLibrary, then cross-check against Wikidata's P648 property, which indexes the canonical OpenLibrary ID for a book. Wikidata only stores the canonical, not the duplicates. So if my eval's canonical and the candidate both resolved to the same Wikidata Q-number, they were definitely the same book. If they didn't, but the editions shared at least one ISBN through OpenLibrary's editions endpoint, that was also a strong signal. Everything else went into a manual review queue.
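
The decision rule in that pipeline can be sketched as a pure function. The fetching is left out, and `WorkEvidence` with its fields is a hypothetical container for whatever the OpenLibrary and Wikidata lookups returned, not the structure the real scripts use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WorkEvidence:
    """Evidence gathered for one OpenLibrary work (hypothetical structure)."""
    work_id: str
    wikidata_qid: Optional[str] = None  # Q-number resolved via P648, if any
    isbns: frozenset = frozenset()      # ISBNs collected from the work's editions


def classify_alias(canonical: WorkEvidence, candidate: WorkEvidence) -> str:
    """Decide whether a candidate work ID can be accepted as an alias."""
    if canonical.work_id == candidate.work_id:
        return "identical"
    if canonical.wikidata_qid and canonical.wikidata_qid == candidate.wikidata_qid:
        return "accept_wikidata"    # both resolve to the same Wikidata item
    if canonical.isbns & candidate.isbns:
        return "accept_shared_isbn" # at least one edition ISBN in common
    return "manual_review"          # no automatic signal, a human decides
```

Ordering the checks from strongest to weakest signal means the manual queue only receives the pairs that nothing automatic could settle.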

Four scripts: eval-suggest-ids, eval-research-ids, eval-review-ids, eval-apply-research. They live in apps/web/scripts/. I'd recommend them to no one. They exist because I needed them once.

The output: 58 aliases across 25 cases. I committed them, re-ran the baseline.

P@1: 25 percent → 38 percent. P@5: 25 percent → 41 percent. The easy bucket, mostly classics, went from 40 percent to 74 percent.

The model hadn't changed. Not one token of the prompt had changed. The 16-point absolute jump on P@5, and the 34-point jump on easy queries, was entirely the eval finally catching up to a piece of OpenLibrary data drift I hadn't planned for.

What I almost did with the 25 percent

If I'd shipped the eval and stopped there, if the regression gate had stayed in place and I'd just run with the 25 percent number, the gate would have kept working. Deltas are deltas. Every PR would have seen the same wrong denominator, and a real prompt improvement would have shown up as a real lift.
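
A sketch of why such a gate survives a biased scorer: it compares a PR's metrics to the stored baseline and only looks at the difference. The `max_drop` threshold and the metric names are hypothetical.

```python
def regression_gate(baseline: dict, candidate: dict, max_drop: float = 0.02) -> list[str]:
    """Return one message per metric that dropped more than `max_drop` below baseline."""
    failures: list[str] = []
    for metric, base_value in baseline.items():
        new_value = candidate.get(metric)
        if new_value is None:
            failures.append(f"{metric}: missing from candidate run")
            continue
        drop = base_value - new_value
        if drop > max_drop + 1e-9:  # small epsilon guards against float noise
            failures.append(f"{metric}: {base_value:.3f} -> {new_value:.3f}")
    return failures
```

Because both runs go through the same scorer, cases it wrongly marks as misses in both runs cancel out of the difference. The absolute numbers are still wrong, and that is the part that matters once they are quoted anywhere.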

But I would have quoted 25 percent. In the ticket. In the README. To myself, when deciding whether this side project was worth keeping going at all. I'd have used a wrong number to make real decisions for weeks before I noticed.

The eval that fixed the eval came out two days after the eval. That gap could easily have been two months.

The lesson

Evals don't measure the model. They measure the gap between what the model does and what your scorer says it should do. Both halves can be wrong. Usually the half you didn't think about is.

This is a shape of problem I keep hitting. When I let an LLM write SQL against my production database, the load-bearing work wasn't the model either. It was the validator and the allowlist sitting between the model and my data. Here, it's the scorer sitting between the model and the metric. Same boring layer, different job.

Two things I keep in mind now.

The first is to inspect the failures before you trust the score. Especially the first time you run a new scorer against real data. The score is a single number; the failures are evidence. The number lies easily and the evidence rarely does.

The second is that scorer keys are not free. An external ID drags every quirk of its source into your eval. OpenLibrary has duplicate works. ISBNs identify editions, not books, so one book carries many of them. Goodreads IDs get retired. A scorer key is a foreign dependency, and you need a story for what happens when the dependency gets messy.

Since then, the eval has done its actual job. It killed a plot-summary embedding re-ranker I thought was an obvious win: it turned out to regress P@1 by 2 to 12 points depending on the embedding model. It killed a multi-query expansion experiment I thought was the cheapest, cleanest upgrade on my list; it turned out that same-model paraphrases add no information the model didn't already have. It signed off on a prompt upgrade that ran +15 points and a model swap that ran +9. None of those are decisions I would have made well from a PostHog dashboard and a vibe.

But the very first thing the eval caught wasn't a regression, or an improvement, or a model that didn't generalize.

It was itself.