The two upgrades my eval killed

I wrote last time about building an eval set for whatisthatbook.com. The first thing it caught was a bug in itself. The next two things it caught were upgrades I was about to ship.

This is the post about those two. Both felt like obvious wins. Both regressed. Without the eval I would have shipped them, watched the PostHog dashboard get marginally worse, and spent the rest of the year unsure whether I'd actually helped or hurt the thing.

Upgrade 1: re-rank candidates by plot-summary similarity

The pipeline returns five candidates from one LLM call. The failure mode I lost the most sleep over was the candidate that exists but doesn't match: a real book, by a real author, with a real OpenLibrary entry, that has nothing to do with what the user described. Existence-checking catches typos. It doesn't catch hallucinated matches.

The obvious fix: for each candidate, fetch the plot summary from OpenLibrary, embed both the user's query and the summary, score them by cosine similarity, and re-rank or drop the candidates that score below some threshold. Catches the hallucinated-match case head on. It's the move you make. Every "production RAG" deck I'd ever skimmed mentioned it.

I built it. text-embedding-3-small for the embeddings, OpenLibrary's description field for the summary, Google Books description as fallback, threshold sweeps from 0.55 to 0.70 to find the cutoff.
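
For reference, a minimal sketch of what that re-rank step looks like. It is not the production code: it assumes the query and summary embeddings are already computed, and `Candidate`, `cosine`, and `rerank` are hypothetical names chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    title: str
    summary_vec: list  # embedding of the book's description (assumed precomputed)


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(query_vec, candidates, threshold=0.55):
    """Drop candidates below the threshold, then sort the rest by similarity.

    The sort is stable, so candidates with equal scores keep the LLM's order.
    """
    scored = [(cosine(query_vec, c.summary_vec), c) for c in candidates]
    kept = [(score, c) for score, c in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in kept]
```

The threshold default of 0.55 is just the low end of the sweep range; any value in the sweep plugs in the same way.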

Then I ran the eval.

P@1 dropped by 2 to 12 points depending on which embedding model I used. Every configuration was worse than just keeping the LLM's original ordering.
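
A sketch of the metric, under the assumption that P@k here means the fraction of eval cases whose expected book shows up in the top k returned candidates. The function name and the input shapes are hypothetical.

```python
def p_at_k(ranked_results, expected, k):
    """Fraction of cases where the expected book is among the top k candidates.

    ranked_results: one ranked list of book identifiers per eval case.
    expected: the correct book identifier for each case, in the same order.
    """
    if not expected:
        raise ValueError("eval set is empty")
    hits = sum(
        1 for ranked, gold in zip(ranked_results, expected) if gold in ranked[:k]
    )
    return hits / len(expected)
```

With this definition P@1 only rewards getting the top slot right, which is why a re-ranker that shuffles a correct first candidate down shows up there first.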

The post-mortem took an evening. The shape: the LLM's own ranking already encodes signal that cosine similarity can't approximate. When the model has high confidence, the top candidate is usually right, and the embedding distance is noise. When the model has low confidence, whatever candidate it returned tends to be a partial match, and the summary-vs-query similarity isn't sharp enough to confirm or reject it. The AUC of the similarity score as a right-or-wrong classifier came in around 0.75, too weak to use as a kill switch.
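
To make the AUC number concrete, here is a small sketch of how it can be computed from similarity scores and right/wrong labels. It uses the pairwise definition (the probability that a randomly chosen correct candidate scores higher than a randomly chosen wrong one); the function name and the toy data are assumptions, not the real eval output.

```python
def auc(scores, labels):
    """Pairwise AUC: chance that a correct candidate outscores a wrong one.

    scores: similarity score per candidate.
    labels: truthy if that candidate was the right book, falsy otherwise.
    Ties between a correct and a wrong candidate count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one wrong candidate")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 is a coin flip and 1.0 is perfect separation. At 0.75, one in four correct/wrong pairs is ordered the wrong way, so any hard cutoff throws out a meaningful share of good candidates.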

The technique works when there's a strong, independent signal to add. The embedding-on-summary lane wasn't a new signal. It was a fuzzy echo of what the LLM had already decided.

I kept the plot-summary fetcher and the embedding pipeline. They'll be useful when I build a real vector index for hybrid retrieval, which is a different problem with a different shape. But the "re-rank what the LLM gave you" version is dead.

Upgrade 2: multi-query expansion

Same shape of "obvious." The LLM returns five book candidates. What if it also returned three or four reformulations of the user's query? Each reformulation hits OpenLibrary's keyword search in parallel. Union the candidates. Dedupe. More recall, basically free.
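
A minimal sketch of the union-and-dedupe step, assuming a `search` callable that takes a query string and returns a list of dicts with `title` and `author` keys. That callable, the dict shape, and the function name are all hypothetical stand-ins for the real OpenLibrary client.

```python
from concurrent.futures import ThreadPoolExecutor


def expand_and_search(queries, search, max_workers=4):
    """Search every query phrasing in parallel, then union and dedupe.

    Books are deduped on lowercased (title, author); first-seen order is kept,
    so results from the original query come before results from reformulations.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input queries.
        result_lists = list(pool.map(search, queries))

    seen = set()
    merged = []
    for results in result_lists:
        for book in results:
            key = (book["title"].strip().lower(), book["author"].strip().lower())
            if key not in seen:
                seen.add(key)
                merged.append(book)
    return merged
```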

The cost: a few hundred extra completion tokens and three or four extra OpenLibrary searches per query. At my volume that's maybe ten dollars a month. Cheaper than coffee.

I put it on a branch and ran the eval. First run came back at 52% P@5, up from a baseline I remembered as 45-ish. Looked like a +7 lift.

Then I checked the baseline file. It had moved. A model promotion from a few days earlier had bumped the production baseline to 58% on the same eval set. The treatment wasn't beating baseline. It was losing by 6 points.

I went looking for confounds. The extra OpenLibrary calls were triggering rate limits: three or four extra searches per query meant the treatment ran into 429s the baseline never saw, dropping candidate quality on a chunk of cases. I added retries and a local disk cache for OL responses, then re-ran both arms cleanly.
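
The retry-plus-cache fix can be sketched roughly like this. The HTTP call is injected as a `fetch` callable that raises `RateLimited` on a 429; both of those, along with `cached_fetch` and its parameters, are hypothetical names rather than the actual implementation.

```python
import hashlib
import json
import time
from pathlib import Path


class RateLimited(Exception):
    """Raised by the fetch callable when the server answers 429 (hypothetical)."""


def cached_fetch(url, fetch, cache_dir, retries=3, base_delay=1.0, sleep=time.sleep):
    """Return the JSON-serializable response for url, using a disk cache.

    On a cache miss, call fetch(url) and retry with exponential backoff on
    RateLimited. The last failure is re-raised once retries are exhausted.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".json"
    path = cache_dir / name

    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))

    for attempt in range(retries + 1):
        try:
            data = fetch(url)
            break
        except RateLimited:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ... by default

    path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

The cache matters for the comparison as much as for the rate limits: once both arms read the same stored responses, differences between runs can't come from the search API returning different results on different days.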

The final numbers: baseline came in at 58% P@5 on both of its runs, on two different days. Treatment swung between 51% and 57% across two runs. Six points of run-to-run variance, and never above baseline.

The acceptance criterion in my own ticket was +5 points minimum to ship. Depending on the run, the treatment landed between 1 and 7 points below baseline instead.
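
The arithmetic, as a sketch of a ship gate. The conservative rule used here (the worst treatment run has to clear the best baseline run by the minimum lift) is an assumption for illustration, as are the function names; the ticket only specified the +5 threshold.

```python
def lift_range(baseline_runs, treatment_runs):
    """Return (worst_case_lift, best_case_lift) in points across all runs."""
    worst = min(treatment_runs) - max(baseline_runs)
    best = max(treatment_runs) - min(baseline_runs)
    return worst, best


def should_ship(baseline_runs, treatment_runs, min_lift=5.0):
    """Ship only if even the worst-case lift clears the minimum."""
    worst, _ = lift_range(baseline_runs, treatment_runs)
    return worst >= min_lift


# Numbers from the runs above, in P@5 percentage points.
print(lift_range([58, 58], [51, 57]))   # (-7, -1)
print(should_ship([58, 58], [51, 57]))  # False
```

Reading the baseline from the run files rather than from memory is what makes the gate useful; a remembered "45-ish" would have passed this check.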

Why? Same answer as upgrade 1, slightly rotated. Same-model paraphrase doesn't add information. The reformulations come from the same LLM that produced the candidates. They're paraphrases of its own interpretation of the query. They read different but they encode the same underlying guess. Running OpenLibrary's keyword search over four phrasings of "the same thing the model already thought" doesn't widen the candidate net in any useful way. It gives you more chances to retrieve the same handful of books, plus some noise.

If I wanted real recall, I needed a different source of evidence, not a different phrasing of the same evidence. The lane I'm building now skips the model on hard queries and pulls candidates straight from Reddit and DataForSEO. Independent signal beats restated signal.

What both failures had in common

Both upgrades added work without adding new information.

Plot-summary re-ranking dressed the model's own confidence in a different vocabulary. The LLM already knew which candidate it liked best. The embedding distance wasn't a new opinion. It was a fuzzy paraphrase of the same one.

Multi-query expansion dressed the model's own query interpretation in synonyms. The reformulations weren't independent witnesses. They were the model agreeing with itself four more times.

The test I run on every proposed upgrade now, before I bother building it: if the new component disagrees with the rest of the pipeline, do we get a better answer or a worse one? For re-ranking, the answer was worse: when the rerank disagreed with the LLM, it was usually wrong. For multi-query, the answer was neither: the union didn't surface books a single-query LLM run wouldn't have.
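
Once a prototype exists, the same question can be asked of eval output. A rough sketch, assuming each eval case records the baseline's top pick, the new component's top pick, and the correct book; the function name and bucket labels are hypothetical.

```python
def disagreement_report(baseline_top, component_top, gold):
    """Count who was right on the cases where the two top picks differ.

    All three arguments are parallel lists with one book identifier per case.
    """
    report = {"agree": 0, "baseline_right": 0, "component_right": 0, "both_wrong": 0}
    for base, comp, correct in zip(baseline_top, component_top, gold):
        if base == comp:
            report["agree"] += 1
        elif base == correct:
            report["baseline_right"] += 1
        elif comp == correct:
            report["component_right"] += 1
        else:
            report["both_wrong"] += 1
    return report
```

A component worth keeping shows `component_right` clearly above `baseline_right`. If almost everything lands in `agree`, the component is an echo; if `baseline_right` dominates, it is actively hurting.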

If a proposed component can only echo or noisily restate what the model already produced, it isn't an upgrade. It's a wrapper.

The lesson

Most of the value of an eval set isn't shipping wins. It's not shipping things that feel like wins.

I'd have shipped both of these. Both ideas were defensible. Both were in someone's recent retrieval post. Both would have left the product slower, more expensive, and less accurate, without me knowing, because I'd have gone on the same vibes as before: look at the first three results for a query I made up in my head, decide it felt better, ship it. I would have written, on my own ticket, "shipped plot-summary re-ranking, should help with the worst hallucinations." And the user-visible behavior would have gotten quietly worse.

Evals are mostly defense. They sit between you and the upgrade you'd ship because it looked like the right move. They aren't the part of the work that gets the dopamine. There's nothing to celebrate when an experiment closes red. But across a year, the experiments an eval kills probably matter more than the ones it blesses.

Two ideas dead. Substrate kept for later. A week of building saved on each one. The wins are the loud part of an eval set.

The kills are the load-bearing part.