Network

Entry tags:

MCYT: MCYTBLR Holiday Exchange

( You're about to view content that a community administrator has advised should be viewed with discretion. )

Oxford pretends AI benchmarks are science, not marketing

Posted by David Gerard

https://pivot-to-ai.com/2025/11/06/oxford-pretends-ai-benchmarks-are-science-not-marketing/

Chatbot vendors routinely make up a new benchmark, then show how well their hot new chatbot does on it. Like that time OpenAI’s o3 model trounced the FrontierMath benchmark, and it’s just a coincidence that OpenAI paid for the benchmark and got access to the questions ahead of time.

Every new model will be trained hard against all the benchmarks. There is no such thing as real world performance — there’s only benchmark numbers.

There’s a new conference paper from Oxford University’s Reasoning With Machines Lab: “Measuring what Matters: Construct Validity in Large Language Model Benchmarks.” [press release; paper, PDF]

Reasoning With Machines doesn’t work on reasoning, really. It’s almost entirely large language models — chatbots. Because that’s where the money — sorry, the industry interest — is. But this paper got into the NeurIPS 2025 conference.

The researchers did a systematic review of 46,000 AI papers. Well. What they actually did is they ran the papers through GPT-4o Mini. Using a chatbot anywhere in your supposedly scientific process is a bad sign if you’re claiming to do serious research.

The chatbot pointed the researchers at 445 benchmarking tests. You’ll be 0% surprised that most of these benchmarks were rubbish:

vague definitions for target phenomena or an absence of statistical tests. We consider these challenges to the construct validity of LLM benchmarks: many benchmarks are not valid measurements of their intended targets.

Wow, that’s terrible! How did these benchmarks get that way? Well, the paper never asks that question.

But pretty obviously, science-shaped text to make a product look good is precisely the job of marketing material. The purpose is to generate something to put in the press release.

So what’s the Reasoning With Machines answer to this problem? What’s the action item?

We built a taxonomy of these failures and translated them into an operational checklist to help future benchmark authors demonstrate construct validity.

Now, that’s the right answer — if what the benchmark authors are doing is actually science. But chatbot benchmarks are not science. They were never science. They’re marketing.

This paper never addresses this. This is an 82-page paper, and it never talks about what the AI benchmarks were created for, and what they’re used for in the real world. The word “marketing” does not appear in the paper. The concept of marketing doesn’t appear in the paper. Not even as “we’re not addressing this right now”; it’s just not there.

It’s like when someone pretends you can talk about chatbots purely as technical artifacts — and somehow never mention what the chatbots are made for, who’s paying for them, why they’re paying all the money they have for chatbots, and the political programme they’re promoting the chatbots to advance. It’s glaringly dodging the issue.

That’s what this paper does — it artificially separates benchmarks from why the benchmarks are this bad. The researchers cannot have been this unaware.

What are the researchers envisioning here? A lot of the people creating the chatbot benchmarks are Ph.D. scientists. Are these just poor distracted lab workers who somehow forgot how to do science, so it’s a good thing Reasoning With Machines is here to help?

No. Their job was to create marketing materials shaped like science.

This paper treats chatbot benchmarks as defective science that can be fixed. And that was never what chatbot benchmarks were for.

The Oxford Reasoning With Machines Lab is pretending not to understand something that they absolutely should understand, given most of the lab’s work is chatbots.

That’s because this paper is also marketing — to sell Reasoning With Machines’ services to the chatbot vendors, so they can do their marketing better. And make the benchmark lies a bit less obvious.

Video — Podcast

https://pivot-to-ai.com/?p=6310

Current Mood: unsympathetic

Our worst moments

( You're about to view content that the journal owner has advised should be viewed with discretion. )

Entry tags:

health,
neurologist

neurologist

I had my twice-a-year appointment with the neurologist. All the low-tech neurology stuff was fine, with little change from the previous exam. We are reducing my dose of gabapentin, which we talked about last time, and I told him I want to give that a try.

Entry tags:

fic title alphabet meme

Via octahedrite and everyone else.

Rules: How many letters of the alphabet have you used for [starting] a fic title? One fic per line, 'A' and 'The' do not count for 'a' and 't'. Post your score out of 26 at the end, along with your total fic count.

Most of the letters I could fill had more than one option, so I went for a combination of variety of fandoms and personal favorite fics and fic titles.

A: Acted Over (Julius Caesar, Brutus/Cassius)
B: The Bridge-Keeper's Riddle (The Venture Bros, Dr. Girlfriend/Henchman 21/The Monarch)
C: Correcting an Oversight (Star Trek: Discovery, Jett Reno/Sylvia Tilly)
D: Down Where It's Wetter (The Little Mermaid, Ariel)
E: The Emperor's Favorite (Star Trek: Discovery, Michael/Mirror Philippa)
F: For a Thousand Summers (I Will Wait For You) (Star Trek: The Next Generation, Guinan/Picard)
G: Geese Resting (Always Coming Home, poem)
H: Her Person's Person (Star Trek: The Next Generation, Data/Geordi + Spot)
I: It's Just a Leap to the Left (Quantum Leap/Rocky Horror, Sam Beckett/Frank N. Furter)
J:
K:
L: Lamp for the Dead (Star Trek: The Next Generation, Ro Laren & Sito Jaxa)
M: A Memory of Warmth (She-Ra and the Princesses of Power, Light Hope/Mara)
N: No One Can Make It Alone (Kipo and the Age of Wonderbeasts, Kipo/Wolf)
O: The Origin of Love (Star Trek: Picard, Seven & Hugh)
P: Patch Notes (World of Warcraft, Thrall & Vol'jin)
Q: A Quiet Evening In (101 Dalmatians, Anita/Roger)
R: Remembrance (World of Warcraft, Koltira/Thassarian)
S: Soft and Supple When Alive (Hainish Cycle, Pao/Sutty)
T: To Heaven (Dogsbody, Kathleen & Sol)
U: Uncertain Provenance (The Little Mermaid, Ariel/Eric)
V: Void Sale (The X-Files, Marita/Krycek)
W: With Stars in Their Hair (A Little Princess, Becky/Sara)
X:
Y: The Year of the Two-Legged Table (The Guest, Choi Yoon/Kang Kil-Young/Yoon Hwa-Pyung)
Z:
Bonus: 2 months 2 days 12 hours 22 minutes till… (Quantum Leap 2022, Hannah/Ben/Addison)

22/26 if you don't count the bonus point I awarded myself for a title that begins with a number instead of a letter! I have 262 works on AO3.

kitarella_imagines suggested that a fun extra challenge could be to fill in the letters that we're missing. It's been a minute since I wrote any fic, but I could give it a try, so...

Fic prompt request: Please suggest a word or phrase starting with J, K, X, or Z!

Current Mood: hopeful

A possible limit to the politics of disgust?

( You're about to view content that the journal owner has advised should be viewed with discretion. )

Entry tags:

Star Trek: The T'hy'la

( You're about to view content that a community administrator has advised should be viewed with discretion. )

Entry tags:

DC K.O.: Knightfight #1: Is This Real Or Just A Fantasy?

( Read more, )

Current Mood: grateful
Current Location: Schildhaven in Den Haag

Entry tags:

thanks

Thankful Thursday

Today I am thankful for...

The Black Blood of the Earth - Funranium Labs. NO thanks for Bronx making it impossible to get back to sleep at 5am.
My daughter (who I just got off a video call with).
Video calls. And having alternatives to Zoom. (We used Discord.)
MDN Web Docs
Remembering stuff in time.

Current Mood: depressed

brief political interlude

( You're about to view content that the journal owner has marked as possibly inappropriate for anyone under the age of 18. )

cashew

cashew (KASH-oo, kuh-SHOO) - n., a tree (Anacardium occidentale) native to northeastern Brazil widely cultivated in tropical climates for its edible nuts and fruit; the nuts of this tree.

Thanks, WikiMedia!

One of those rare plants where (like strawberries) the seed is external to the fruit. Said fruits are called cashew apples and supposedly are tasty. English got the name not via Portuguese but rather French acajou, and in the process the a was taken to be an indefinite article and separated off as "a cajou" -- French apparently got it directly from Old Tupi akaîu (though I'd expect a Portuguese intermediary?).

---L.

Current Mood: photographical
Current Location: 'ome from the market

Last Paris pics

The building which houses the Musée Jacquemart André is rather fine in its own right!

This is the garden:

( Here be pics!. )

Entry tags:

Poet's Corner: two by John Curry

Ghosts by John Curry

In autumn’s half-light
footfalls fade beneath dusk’s hush,
where pavements remember the weight of souls
long gone. No wailing banshees,

just pasts echo’s, like an exhale after breath
held too long. these ghosts, keep to the margins,
content with the company of forgetfulness,
moving only when the past stirs,

I do not fear them—
they are nothing more
than the memory of what was,
replaying in the quiet theatre of now.

Restless Dream by John Curry

Beneath a moon thin and wan,
shadows stretch with a ghastly yawn,
A whisper cuts through hallowed air,
voices long gone, still linger there.

A raven’s cry—a fractured scream—
splinters through my restless dream.
Eyes like coals at midnight’s hour,
trees breathe secrets, dark and dour.

In corridors of creaking gloom,
the clock beats out our doom.
Each tick a nail, each tock a breath,
hammering time’s dirge to death.

horror’s face, with spectral leer,
dwells not without—but festers here.

Entry tags:

crafts: weaving

Saori WX60 floor loom assembly WIP

Loom assembly to continue...after...catten removes herself from possibly having screws DROPPED on her... /o\

Special thanks to Jill of Saori Santa Cruz, merrileemakes, and my husband for helping me figure out which part of assembly I borked yesterday!

Entry tags:

From the World of Minor Threats - Welcome to Twilight #3

Minor Threats is a creator-owned series written by Patton Oswalt and Jordan Blum. This issue is a solo story starring one of the characters not from Minor Threats, but from a spinoff series, The Alternates, about superheroes dealing with depression from no longer being in an alternate dimension called The Ledge, which...

Okay, look, it's a Gail Simone comic about a lobster man at a comic convention. It's got jokes and sight gags. That's all you need.

Warning for mild gore and coarse language.
( Read more... )

Entry tags:

SNAP [curr ev, US]

Americans, as I hope you know, on Nov 1st, the Federal government, being shut down, did not transmit the money to the states to pay for the Supplemental Nutrition Assistance Program, aka SNAP, aka "Food Stamps". In many states, SNAP money is supposed to hit recipients' EBT cards on the first of the month. It didn't. There are funds in the SNAP budget to cover emergencies, but Trump said he would not release them; lawsuits ensued, and as of right now, partial payments are going to be or have been made.

I commend the following video to you. It's longish (26 minutes) but worth your time.

2025 Nov 1: Hank Green [hankschannel on YT]: "This Shutdown is Different"

Hank Green, of vlogbrothers fame, invites Jeannie Hunter, Tennessee regional director of the Society of St. Andrew (aka EndHunger.org), on to his personal channel to explain how the US's Supplemental Nutrition Assistance Program, aka SNAP, aka "Food Stamps", actually works.

Hunter turns out to be a great interview subject and the resultant conversation was fascinating. I highly recommend it - not just to understand what's at stake in the government shutdown, but for your own simple enjoyment of learning how things actually work, and also so you can more eloquently advocate for this system.

Watch on YouTube

Entry tags:

Forthcoming

Some titles I'm looking forward to reading in 2026:

Wolf Worm Hardcover – March 24, 2026
by T. Kingfisher

The Faraway Inn Paperback – March 31, 2026
by Sarah Beth Durst

Platform Decay (The Murderbot Diaries, 8) Hardcover – May 5, 2026
by Martha Wells

Sea of Charms (The Spellshop, 3) Hardcover – July 28, 2026
by Sarah Beth Durst

Daggerbound (Swordheart, 2) Hardcover – August 25, 2026
by T. Kingfisher

Current Mood: fannish

Defictionalized Artifacts

( You're about to view content that the journal owner has advised should be viewed with discretion. )

Entry tags:

Drawing to a close

One thing that doesn't rely on relative humidity, fortunately, is my Inktober drawing. I'm still setting aside 15 minutes after breakfast every morning to do a little drawing. It's a surprisingly refreshing way to start my creative day.

I've come to the end of the month, but once again have extra pages left over, so I'll be continuing until sometime in mid-November. Meanwhile, here's the rest of October.

Entry tags:

Slow motion

For once, it felt like I was getting ahead of things. Early Clay Fest firing allowed me to get a head start on Clayfolk. Finishing the Clayfolk firing, and pricing all the pots as they came out of the kiln, meant that I could get the van sorted and loaded on one of the last sunny days of October. Starting to make pots for Holiday Market in early November meant I could possibly even do some glazing next week, so I optimistically signed up for a firing the first week of December. For once, I'd actually be well stocked for the beginning of Holiday Market!

Then the rains came.

Don't get me wrong, I love rain. It's Oregon's thing, after all. But when the humidity is 130% in my studio, things don't dry. What in summer is throw today--turn over tonight--trim or add handles in the morning becomes throw today--make something else tomorrow--maybe finish things off a day later? And meanwhile, the shelves fill up, stacked on every available flat surface, and nothing is dry enough to fire.

So my schedule gets scrambled and the studio gets full and even if we get a little sunshine, there's no point trying to dry pots outdoors--between the weak autumn sunlight and the still-high humidity, I'm just rearranging deck chairs on the Titanic. Finally managed to dry enough pots to load the kiln yesterday, but couldn't actually fire it until tonight, because I had two days' worth of casseroles, batter bowls, mixing crocks, honey jars, painted mugs and pasta bowls uncovered in the studio, all waiting their turn for trimming, handles and knobs. Finally finished everything about 4:30 this afternoon, so right now the kiln is warming up, and--hopefully--warming up the studio in turn.