Oxford pretends AI benchmarks are science, not marketing
Chatbot vendors routinely make up a new benchmark, then show how well their hot new chatbot does on it. Like that time OpenAI’s o3 model trounced the FrontierMath benchmark, and it’s just a coincidence that OpenAI paid for the benchmark and got access to the questions ahead of time.
Every new model will be trained hard against all the benchmarks. There is no such thing as real world performance — there’s only benchmark numbers.
There’s a new conference paper from Oxford University’s Reasoning With Machines Lab: “Measuring what Matters: Construct Validity in Large Language Model Benchmarks.” [press release; paper, PDF]
Reasoning With Machines doesn’t work on reasoning, really. It’s almost entirely large language models — chatbots. Because that’s where the money — sorry, the industry interest — is. But this paper got into the NeurIPS 2025 conference.
The researchers did a systematic review of 46,000 AI papers. Well. What they actually did is they ran the papers through GPT-4o Mini. Using a chatbot anywhere in your supposedly scientific process is a bad sign if you’re claiming to do serious research.
The chatbot pointed the researchers at 445 benchmarking tests. You’ll be 0% surprised that most of these benchmarks were rubbish:
vague definitions for target phenomena or an absence of statistical tests. We consider these challenges to the construct validity of LLM benchmarks: many benchmarks are not valid measurements of their intended targets.
Wow, that’s terrible! How did these benchmarks get that way? Well, the paper never asks that question.
But pretty obviously, science-shaped text to make a product look good is precisely the job of marketing material. The purpose is to generate something to put in the press release.
So what’s the Reasoning With Machines answer to this problem? What’s the action item?
We built a taxonomy of these failures and translated them into an operational checklist to help future benchmark authors demonstrate construct validity.
Now, that’s the right answer — if what the benchmark authors are doing is actually science. But chatbot benchmarks are not science. They were never science. They’re marketing.
This paper never addresses this. This is an 82-page paper, and it never talks about what the AI benchmarks were created for, and what they’re used for in the real world. The word “marketing” does not appear in the paper. The concept of marketing doesn’t appear in the paper. Not even as “we’re not addressing this right now,” it’s just not there.
It’s like when someone pretends you can talk about chatbots purely as technical artifacts — and somehow never mention what the chatbots are made for, who’s paying for them, why they’re paying all the money they have for chatbots, and the political programme they’re promoting the chatbots to advance. It’s glaringly dodging the issue.
That’s what this paper does — it artificially separates benchmarks from why the benchmarks are this bad. The researchers cannot have been this unaware.
What are the researchers envisioning here? The people creating the chatbot benchmarks — a lot of these are Ph.D. scientists. Are these just poor distracted lab workers who somehow forgot how to do science, so it’s a good thing Reasoning With Machines is here to help?
No. Their job was to create marketing materials shaped like science.
This paper treats chatbot benchmarks as defective science that can be fixed. And that was never what chatbot benchmarks were for.
The Oxford Reasoning With Machines Lab is pretending not to understand something that they absolutely should understand, given most of the lab’s work is chatbots.
That’s because this paper is also marketing — to sell Reasoning With Machines’ services to the chatbot vendors, so they can do their marketing better. And make the benchmark lies a bit less obvious.
neurologist
fic title alphabet meme
Rules: How many letters of the alphabet have you used for [starting] a fic title? One fic per line, ‘A’ and ‘The’ do not count for ‘a’ and ‘t’. Post your score out of 26 at the end, along with your total fic count.
Most of the letters I could fill had more than one option, so I went for a combination of variety of fandoms and personal favorite fics and fic titles.
A: Acted Over (Julius Caesar, Brutus/Cassius)
B: The Bridge-Keeper's Riddle (The Venture Bros, Dr. Girlfriend/Henchman 21/The Monarch)
C: Correcting an Oversight (Star Trek: Discovery, Jett Reno/Sylvia Tilly)
D: Down Where It's Wetter (The Little Mermaid, Ariel)
E: The Emperor's Favorite (Star Trek: Discovery, Michael/Mirror Philippa)
F: For a Thousand Summers (I Will Wait For You) (Star Trek: The Next Generation, Guinan/Picard)
G: Geese Resting (Always Coming Home, poem)
H: Her Person's Person (Star Trek: The Next Generation, Data/Geordi + Spot)
I: It's Just a Leap to the Left (Quantum Leap/Rocky Horror, Sam Beckett/Frank N. Furter)
J:
K:
L: Lamp for the Dead (Star Trek: The Next Generation, Ro Laren & Sito Jaxa)
M: A Memory of Warmth (She-Ra and the Princesses of Power, Light Hope/Mara)
N: No One Can Make It Alone (Kipo and the Age of Wonderbeasts, Kipo/Wolf)
O: The Origin of Love (Star Trek: Picard, Seven & Hugh)
P: Patch Notes (World of Warcraft, Thrall & Vol'jin)
Q: A Quiet Evening In (101 Dalmatians, Anita/Roger)
R: Remembrance (World of Warcraft, Koltira/Thassarian)
S: Soft and Supple When Alive (Hainish Cycle, Pao/Sutty)
T: To Heaven (Dogsbody, Kathleen & Sol)
U: Uncertain Provenance (The Little Mermaid, Ariel/Eric)
V: Void Sale (The X-Files, Marita/Krycek)
W: With Stars in Their Hair (A Little Princess, Becky/Sara)
X:
Y: The Year of the Two-Legged Table (The Guest, Choi Yoon/Kang Kil-Young/Yoon Hwa-Pyung)
Z:
Bonus: 2 months 2 days 12 hours 22 minutes till… (Quantum Leap 2022, Hannah/Ben/Addison)
22/26 if you don't count the bonus point I awarded myself for a title that begins with a number instead of a letter! I have 262 works on AO3.
Fic prompt request: Please suggest a word or phrase starting with J, K, X, or Z!
Thankful Thursday
Today I am thankful for...
- The Black Blood of the Earth - Funranium Labs. NO thanks for Bronx making it impossible to get back to sleep at 5am.
- My daughter (who I just got off a video call with).
- Video calls. And having alternatives to Zoom. (We used Discord.)
- MDN Web Docs
- Remembering stuff in time.
cashew
Thanks, WikiMedia!
One of those rare plants where (like strawberries) the seed is external to the fruit. The fruits are called cashew apples and are supposedly tasty. English got the name not via Portuguese but rather from French acajou, and in the process the a was taken to be an indefinite article and separated off as "a cajou" -- French apparently got akaîu directly from Old Tupi (though I'd expect a Portuguese intermediary?).
---L.
Last Paris pics
This is the garden:
( Here be pics!. )
Poet's Corner: two by John Curry
In autumn’s half-light
footfalls fade beneath dusk’s hush,
where pavements remember the weight of souls
long gone. No wailing banshees,
just pasts echo’s, like an exhale after breath
held too long. these ghosts, keep to the margins,
content with the company of forgetfulness,
moving only when the past stirs,
I do not fear them—
they are nothing more
than the memory of what was,
replaying in the quiet theatre of now.
Restless Dream by John Curry
Beneath a moon thin and wan,
shadows stretch with a ghastly yawn,
A whisper cuts through hallowed air,
voices long gone, still linger there.
A raven’s cry—a fractured scream—
splinters through my restless dream.
Eyes like coals at midnight’s hour,
trees breathe secrets, dark and dour.
In corridors of creaking gloom,
the clock beats out our doom.
Each tick a nail, each tock a breath,
hammering time’s dirge to death.
horror’s face, with spectral leer,
dwells not without—but festers here.
Saori WX60 floor loom assembly WIP


Loom assembly to continue...after...catten removes herself from possibly having screws DROPPED on her... /o\
Special thanks to Jill of Saori Santa Cruz,
From the World of Minor Threats - Welcome to Twilight #3

Minor Threats is a creator-owned series written by Patton Oswalt and Jordan Blum. This issue is a solo story starring one of the characters not from Minor Threats, but from a spinoff series, The Alternates, about superheroes dealing with depression from no longer being in an alternate dimension called The Ledge, which...
Okay, look, it's a Gail Simone comic about a lobster man at a comic convention. It's got jokes and sight gags. That's all you need.
Warning for mild gore and coarse language.
SNAP [curr ev, US]
I commend the following video to you. It's longish - 26 minutes – but worth your time.
2025 Nov 1: Hank Green [
Hank Green, of vlogbrothers fame, invites Jeannie Hunter, Tennessee regional director of the Society of St. Andrew (aka EndHunger.org), on to his personal channel to explain how the US's Supplemental Nutrition Assistance Program, aka SNAP, aka "Food Stamps", actually works.
Hunter turns out to be a great interview subject and the resultant conversation was fascinating. I highly recommend it - not just to understand what's at stake in the government shutdown, but for your own simple enjoyment of learning how things actually work, and also so you can more eloquently advocate for this system.
Forthcoming
Wolf Worm Hardcover – March 24, 2026
by T. Kingfisher
The Faraway Inn Paperback – March 31, 2026
by Sarah Beth Durst
Platform Decay (The Murderbot Diaries, 8) Hardcover – May 5, 2026
by Martha Wells
Sea of Charms (The Spellshop, 3) Hardcover – July 28, 2026
by Sarah Beth Durst
Daggerbound (Swordheart, 2) Hardcover – August 25, 2026
by T. Kingfisher
Drawing to a close
I've come to the end of the month, but once again have extra pages left over, so I'll be continuing until sometime in mid-November. Meanwhile, here's the rest of October.











Slow motion
For once, it felt like I was getting ahead of things. Early Clay Fest firing allowed me to get a head start on Clayfolk. Finishing the Clayfolk firing, and pricing all the pots as they came out of the kiln, meant that I could get the van sorted and loaded on one of the last sunny days of October. Starting to make pots for Holiday Market in early November meant I could possibly even do some glazing next week, so I optimistically signed up for a firing the first week of December. For once, I'd actually be well stocked for the beginning of Holiday Market!
Then the rains came.
Don't get me wrong, I love rain. It's Oregon's thing, after all. But when the humidity is 130% in my studio, things don't dry. What in summer is throw today--turn over tonight--trim or add handles in the morning becomes throw today--make something else tomorrow--maybe finish things off a day later? And meanwhile, the shelves fill up, stacked on every available flat surface, and nothing is dry enough to fire.


So my schedule gets scrambled and the studio gets full and even if we get a little sunshine, there's no point trying to dry pots outdoors--between the weak autumn sunlight and the still-high humidity, I'm just rearranging deck chairs on the Titanic. Finally managed to dry enough pots to load the kiln yesterday, but couldn't actually fire it until tonight, because I had two days' worth of casseroles, batter bowls, mixing crocks, honey jars, painted mugs and pasta bowls uncovered in the studio, all waiting their turn for trimming, handles and knobs. Finally finished everything about 4:30 this afternoon, so right now the kiln is warming up, and--hopefully--warming up the studio in turn.