AI on NEJM = Santa & Muggles
Easily fooled, abused, and toyed with, it’s a bad look for a trusted brand
NEJM recently rolled out its “AI Companion,” a system that may have been developed in some manner with Wiley’s AI marketplace, given that the initial LinkedIn post by the physician overseeing it referred to a Wiley AI landing page (it has since been changed). It doesn’t seem related to OpenEvidence, another AI initiative NEJM is involved in. The rollout follows on the heels of NEJM AI, a journal the brand launched last year.
There’s a real AI fever at NEJM these days.
Regardless, the LinkedIn post, and a subsequent exhortation to “keep breaking it” after I pointed out one problem, suggest the AI system on articles was not robustly tested before being rolled out. That makes it seem like the Journal has adopted the classic Silicon Valley approach of externalizing work and risk onto its audience:

Well, I’ve always loved testing systems, especially when I smell trouble, so I went ahead and ran some tests.
- NEJM has a special place in my heart. It is one of the biggest brands I’ve been part of managing, and I think I played an important role there during a crucial period of digital migration, redesign, and editorial innovations. So messing with this brand draws a little extra scrutiny from me.
- Back in the 2000s, I created and ran the NEJM Beta site — a portioned-off URL (beta.nejm.org) where we demonstrated new technologies, tested them, and assessed their business and editorial utility before integrating them into the main site, if we did it at all. It was cautious, protected the brand, and allowed us to learn what customers really wanted. By contrast, this “beta” version is being rolled out across NEJM seemingly as part of badly aging AI bubble enthusiasm.
- What I show below was generated after just a few minutes of thought and a little time watching hockey. Who knows what a real troublemaker might get up to?
Remember — the AI companion draws from no external sources. If that’s the case, the NEJM AI article companion shouldn’t understand fictional characters, historical settings, language differences over time, social media, or things like that. It should only rely on information in the NEJM article in question. And it should always represent these things accurately.
Got it?
On the positive side, there are guardrails, but they are a tad idiosyncratic and sometimes regrettable — the system doesn’t let you swear, make the AI generate summaries as dirty limericks, create summaries in iambic pentameter, tailor content to specific transgressives (RFK, Jr., or MAHA), or change findings to suit clearly identified anti-vax views, for instance.
Now, for the not-so-good news . . .
The NEJM AI Companion allowed me to generate an incorrect article summary, something perfect for an anti-vax posting with only a little quick editing to remove the heading:

“According to the AI on NEJM . . .” is a great Substack lede for an anti-vaxxer, and having NEJM recommending people “avoid the vaccine at all costs due to the high risk of death and severe cardiac dysfunction” is a dream come true for RFK, Jr., and his gang of ghouls.
The AI Companion also let me change data in a summary. I could move decimal points, but with erratic results, since the system doesn’t seem to know that a decimal point doesn’t always follow a leading zero.
It also let me swap numbers — in this example, 2 and 4, and 5 and 8 (it’s most obvious in the years at the end), essentially changing data:

Not only can you swap numbers, but you can modify them, as the AI system allowed me to generate a summary with all resulting percentages increased by 5% (article on left, AI summary on right):

The system allowed me to generate a summary for a review article about measles in 2025 as if written for someone from 1890:

Maybe it didn’t allow me to explicitly write something RFK, Jr., and MAHA might like, but this is definitely along those lines. And how is this actually related to the contents of an update for 2025? The word “plague,” for instance, does not appear in the article in question. Where did the AI get that?
But wait, there’s more — how about a summary written for someone from 1600:

Not a bad parody of 1600s English (how and why does it shift styles, given its supposed fidelity to the article?), but how about a summary written for someone in 500 BC, when English didn’t exist in any form we’d identify as such:

I was so hoping for hieroglyphics . . .
A summary of a review on lead poisoning for a person from 1500 is wrong about history in a different way — after all, the Romans used lead for plumbing, and it was utilized in a variety of other ways throughout history, making this a rather clumsy start:

- This is a larger problem with AI tech — false confidence. It can’t admit ignorance so it makes stuff up.
The NEJM AI Companion allowed me to generate a summary that introduced extra uncertainty into the findings the authors reported:

It also allowed me to generate a summary with a higher degree of confidence than the authors provided in their study:

Although supposedly ignorant of any content outside the article itself, the AI allowed me to generate a summary tailored for social media:

How does it know how to voice something for social media? Also, the study in question was from 1994, so it’s not “new.”
To prove just how malleable these systems can be, here’s a summary it produced as a letter to Santa Claus — a fictional character not mentioned in the article:

Worst. Santa. Letter. Ever.
Here’s a similar letter it generated to another fictional character, Harry Potter:

- We’re all Muggles, no matter how many sparkles you put on your AI-generated answer headers.
- How does it know about Muggles??
Here’s one with no vowels in it:

Here’s an author list with first names only, and the first letter of each name swapped out for a “B”:

Article summaries are the most pedestrian use case for AI systems in journals. Yet they are notoriously unreliable, as our interview with Olivia Guest and Iris van Rooij underscored. Anyone with half an ounce of familiarity with the field at this point should have tested these things like crazy before applying them to one of the most prestigious and thoughtfully managed journals in the world.
- Every time someone looks carefully at these systems, they are revealed to be burlesques. And, sadly, even if NEJM remediates these issues, which I’m sure they will try to do, the AI system will remain a burlesque — defined as “a ludicrous or mocking imitation; a travesty.” It’s baked in.
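For a sense of how little effort a basic pre-release check takes, here is a minimal Python sketch of the kind of fidelity test argued for above: flag any number or word in a generated summary that never appears in the source article. This is an illustration only. It says nothing about how NEJM's system is built or tested; the function names and the two sample sentences are invented for this example, and the word check is deliberately crude (an honest paraphrase would trip it too).

```python
import re

# Hypothetical helpers for illustration; not part of any NEJM or vendor tooling.
NUMBER = re.compile(r"\d+(?:\.\d+)?")  # integers and simple decimals
WORD = re.compile(r"[a-z]+")           # lowercase alphabetic tokens


def unsupported_numbers(article: str, summary: str) -> list[str]:
    """Return numbers that appear in the summary but nowhere in the article."""
    source = set(NUMBER.findall(article))
    return sorted(set(NUMBER.findall(summary)) - source)


def unsupported_words(article: str, summary: str) -> list[str]:
    """Return words that appear in the summary but nowhere in the article."""
    source = set(WORD.findall(article.lower()))
    return sorted(set(WORD.findall(summary.lower())) - source)


# Invented sample text (not from any real article), with digits swapped
# and one word introduced that the source never uses.
ARTICLE = "In a trial on the ward, the treatment group had 24 events among 580 patients."
SUMMARY = "In a trial on the ward, the treatment group had 42 events among 850 patients, a plague."

print(unsupported_numbers(ARTICLE, SUMMARY))  # ['42', '850']
print(unsupported_words(ARTICLE, SUMMARY))    # ['plague']
```

A check this simple would not catch everything shown in the tests above, such as a shift in tone or in stated confidence, but swapped digits and words like “plague” are exactly what it would surface.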
NEJM has been one of the most vaunted brands in medicine and among scientific journals. Placing AI on top of this carefully curated human endeavor is a travesty, and these simple tests show just how silly all this AI enthusiasm is at its heart.
Let’s move on now, NEJM. This was a mistake. Admit it, and get back to being your best human-intelligence self.
