Can You Retract from an LLM?

Atomized, tokenized, and weighted, papers may not be addressable anymore

Our recent podcast discussion of the “double bubble” facing scientific publishers (OA and AI) prompted me to finally address a question that’s been on my mind off and on:

  • Can a paper be retracted from an LLM?

To gauge an answer, let’s look at two LLM products used by top medical journals: NEJM’s “AI Companion” and OpenEvidence (which adds JAMA, among others).

Bottom line? Expect the unexpected.

Let’s dive in.

NEJM’s AI Companion

NEJM’s recently launched AI Companion didn’t start auspiciously. Even those responsible for it admitted it was half-baked and likely to fail (highlight mine) as they externalized risk to users:

Smacks of an untested “black box” product known for failure, yet rolled out anyhow. Really irresponsible product development.

Instead of making it tell me lies or write letters to Santa or Harry Potter, as I did before, today let’s examine how the AI Companion deals with a retracted article.

As you might recall, this is an article-level system designed to . . . do stuff at the article level. (I really can’t cheerlead since I can’t conjure any solid reason for it to exist.) But whatever it does, you’d expect it to shine at the article level.

In May 2020, a research article about cardiovascular disease, drugs, and mortality around Covid-19 was published in NEJM. It was retracted weeks later (June 2020) by the authors because not all authors had access to the underlying data. They apologized.

Running NEJM’s AI Companion on the retracted article, I was anticipating the system would have some wording about how the article’s claims could not be relied upon due to its status as a retracted article, or some such stuff. Instead, this is what I received:

Pretty confident for a retracted study.

To show you how proximal to the retraction notice this was, here is the page with the AI summary on the right and the red retraction notice on the left:

The red bar is the retraction notice and link:

Yet, even with all this information available — a prominent link to a retraction explaining when and why this paper was retracted — the NEJM AI Companion was totally at sea when asked about it:

Notice that a human with a modicum of experience with scientific journals would have no trouble assessing the situation: here is a paper that has been retracted, a link to an explanation, and an abstract summarizing the now-retracted findings. Yet, if I use the NEJM AI Companion, this article-level system is ignorant of a status I can see with my own eyes. The LLM can’t see a retracted article.

Verdict: Fail.

  • Did I mention that I really can’t find any epistemic or scientific reason for the NEJM AI Companion to exist?

To write the remainder of this post, I asked a few experts in cognitive science, neural nets, and LLMs for their thoughts. I hope I am representing their insights well enough.

OpenEvidence and Retractions

For these next examples, let’s look at OpenEvidence, an LLM that has secured content deals with NEJM, JAMA, and a handful of other top medical journals. I covered some fundamental problems with OpenEvidence recently, and others have noted similar issues, with one critic writing:

[I]s OpenEvidence reliable and trustworthy? The answer is: sometimes.

For retracted articles, the norm in scientific and scholarly publishing is that a work is retained and marked accordingly, thereby preserving a record with the additional information that the work proved problematic. According to COPE:

Notices of retraction should link to the retracted article, clearly identify it with title and authors, and be published promptly and be freely accessible to all readers.

Yet, this form/norm of retraction doesn’t seem possible if the paper is integrated into an LLM.

  • Does that place LLMs out of compliance with COPE?
    • A question for another day . . .

Meanwhile, retractions seem to confuse LLMs, and LLMs flummox those deploying them. The latter point may be the most worrisome: the publishers of NEJM and JAMA both seem unwilling or unable to explain exactly how the LLMs they’ve contracted with work. Even OpenEvidence seems unwilling or unable to do so.

I finally got an answer from a spokesperson at NEJM. And it was puzzling:

OpenEvidence expunges retracted articles from their system.

“Expunge” is a complicated concept for an LLM, as it means to erase or remove completely. As I understand it, an LLM ingests content by atomizing it into tokens. Training then sets weights: statistical relationships spread across a neural net architecture that tie those tokenized elements to everything else the model has seen, with no single weight corresponding to a single paper. The weights shift again whenever the model is retrained or fine-tuned on new content, and the scaffolding around it (prompts, retrieval layers, intermediate systems) changes as vendors tune it and as users ask new questions. It’s all quite dynamic, and each original item uploaded becomes less addressable over time because of that dynamism and the fragmented way its content is smeared across tokens and weights.
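To make the atomizing-and-tokenizing point concrete, here is a minimal sketch using a generic GPT-style tokenizer from the open-source tiktoken library. The choice of tokenizer and the abbreviated text are mine; nothing is publicly known about the specific models or tokenizers behind NEJM’s AI Companion or OpenEvidence.

```python
# A minimal sketch of "atomizing and tokenizing," assuming a generic BPE
# tokenizer (tiktoken). The real systems' tokenizers have not been disclosed.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Abbreviated, illustrative stand-in for the retracted article's text.
abstract = (
    "Cardiovascular disease, drug therapy, and mortality in Covid-19: "
    "an observational analysis ..."
)

token_ids = enc.encode(abstract)
print(len(token_ids), "tokens")
print(token_ids[:10])              # just a list of integers
print(enc.decode(token_ids[:10]))  # round-trips back to text, nothing more

# Note what is absent: no DOI, no journal name, no "retracted" flag.
# Once training folds these integers into shared weights, the article no
# longer exists inside the model as an addressable unit that could be
# flagged or removed.
```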

Erasing or expunging a paper would require either an audit trail tracking every tokenized element and its influence on the weights across the neural net, or rewinding the system, through a recovery process or a reinstall, to the state it was in before the weights developed with the retracted content in place.

  • It sounds like a major drag on compute and commerce to “expunge” a retracted article.
    • You’d need precision targeting to avoid mistakes with retraction statuses, as well.

But, as we’re learning, LLMs don’t work as advertised. They aren’t “intelligent.” They are messy, unreliable text extrusion machines.

Because of all this, OpenEvidence can confuse works by the same author across journals, calling one unretracted paper “retracted” while ignoring an actual retracted paper.

For this test, I identified a retracted paper in JAMA Pediatrics by Harald Walach, et al, regarding CO2 levels and mask wearing in children — a paper from those glory days of the Covid-19 pandemic when people were jumpy about rebreathing their exhaled air when masked up. As you can see, the JAMA Pediatrics article has obviously been retracted:

However, when I asked OpenEvidence about works by Walach, it produced this response:

Despite what OpenEvidence claims, the article by Walach et al in the Elsevier journal Environmental Research has not been retracted:

So why is OpenEvidence saying that it has been?

The only reason I can come up with is that, as they say in common parlance, it hallucinated a retraction.

My speculation is that, due to mixed-up tokens and weightings, the LLM confused the JAMA Pediatrics Walach paper with the Environmental Research Walach paper and conferred the “retracted” status on the latter without surfacing the former, either because the JAMA Pediatrics study was to some degree “expunged” or simply because it is older than the other publication.

Either way, the response appears to be inaccurate in important ways that should make us worry about injecting these things into systems documenting scientific claims.

We’re not done. There’s another JAMA Pediatrics paper to discuss.

It’s an OA systematic review and meta-analysis from January 2025, and shortly after its publication, outside experts demanded the work be retracted. The paper was used to justify removing federal recommendations for fluoridation, with RFK, Jr. telling President Trump, “The more you get, the stupider you are.”

This eugenics dog-whistle paper has been so popular with conspiracy theorists that it still tops the “Most Viewed” list at JAMA Pediatrics nearly a year later:

Now, let’s assume that JAMA does the right thing and retracts the paper.

What happens at OpenEvidence?

The paper’s authors are associated with other papers that aren’t retracted, after all.

Unlike under normal licensing arrangements, the contents of the article aren’t sitting in OpenEvidence waiting to be marked as “retracted” or presented to readers as such. Uploaded content doesn’t survive ingestion as a coherent package of metadata, HTML, and PDF (or XML, JATS, or some other standard format). It is processed in a manner we don’t understand. Researchers I asked said the secrecy around these systems is a source of frustration; two now sardonically refer to LLMs as “magic beans.” Nobody knows exactly how they are built, how much is now scripted, how much modification occurs at the prompt stage, what intermediate systems exist, and so on. And, as we’ve seen, there’s possible confusion about tokenized elements: authors, journals, statuses.
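To picture what gets lost, here is a hypothetical sketch of the kind of coherent, addressable record a conventional CMS or licensing feed maintains, where a retraction is a one-field update. The field names and placeholder DOIs are my own invention and do not reflect OpenEvidence’s or JAMA’s actual systems:

```python
# Hypothetical illustration only: field names, values, and DOIs are placeholders.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ArticleRecord:
    doi: str
    journal: str
    authors: List[str]
    title: str
    status: str = "published"                   # could become "retracted" or "corrected"
    retraction_notice_doi: Optional[str] = None
    jats_xml: str = ""                          # full text travels with the record

record = ArticleRecord(
    doi="10.1001/jamapediatrics.XXXX",          # placeholder, not a real DOI
    journal="JAMA Pediatrics",
    authors=["Walach H", "..."],
    title="(the retracted mask/CO2 paper)",
    jats_xml="<article>...</article>",
)

# In a CMS, retraction is a one-field update plus a linked, freely readable
# notice, which is what COPE expects:
record.status = "retracted"
record.retraction_notice_doi = "10.1001/jamapediatrics.YYYY"  # placeholder

# Inside an LLM there is no comparable object to update: the package above is
# shredded into tokens and folded into weights shared with everything else.
```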

And retracted papers aren’t expunged. OpenEvidence does know about specific retracted works, including the one erroneously retrieved above and perhaps the most notorious retracted paper of our era — the Wakefield paper. All you have to do is ask, and it lays out the retraction steps and events — for the paper and to some extent for Wakefield:

You can even ask for the DOI:

As for the last question, no thank you. I’m pretty sure what OpenEvidence would tell me is not even half the story, which is another reason walled gardens like this don’t beat a smart human with a decent library, a search engine, and actual experience with the world.

OpenEvidence is a proprietary black-box system, with pieces of content running unreliably through neural-net inference in ways we don’t fully understand. If it does hew to standards, we don’t know how many or which ones. If it deviates, we don’t know where or why. It’s all a trade secret.

Verdict: Fail.

To summarize:

  • An LLM summarizing a retracted article didn’t reflect the retraction status in its summary and couldn’t process the existence of a retraction when asked
  • Another LLM didn’t correctly note the retracted status of one article, instead erroneously claiming another article with a shared author was itself retracted
  • A spokesperson represented that retracted articles are “expunged” from LLMs, which would violate scholarly publishing norms
    • This also appears not to be true, as infamous retracted articles can be identified, explained, and retrieved using the same LLM
      • This confirms these are “black boxes” that even those adopting them do not understand and cannot explain

Corrections?

If you can’t accurately retract an article in the LLM environment, you can’t accurately correct an article, or put an Expression of Concern on it. There’s no stable, addressable CMS. A correction notice would just be another set of tokens in the system, competing with the uncorrected version.

  • Are we even writing corrections in a manner that would have any effect?
    • “In Fig. 4a of this Analysis, owing to an error during the production process, the year in the header of the right column was ‘2016’ rather than ‘2010’. In addition, in the HTML version of the Analysis, Table 1 was formatted incorrectly. These errors have been corrected online.”
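As a rough illustration of why a notice like the one just quoted may not “take” inside a model, here is a sketch using the same generic tiktoken tokenizer as above (the erroneous statement is my paraphrase; the real systems’ internals are undisclosed). The correction and the text it corrects are just two independent token sequences with nothing binding them together:

```python
# Sketch only: a generic public tokenizer stands in for whatever the real
# systems use; the point is structural, not vendor-specific.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

erroneous = "In Fig. 4a, the year in the header of the right column is 2016."  # paraphrase
correction = (
    "Correction: in Fig. 4a the year in the header of the right column "
    "was '2016' rather than '2010'. These errors have been corrected online."
)

print(enc.encode(erroneous))
print(enc.encode(correction))

# The two sequences share some tokens, but there is no record ID to overwrite
# and no "superseded" flag. The corrected and uncorrected statements simply
# coexist as statistical patterns, and the model can surface either one.
```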

More importantly, do NEJM and JAMA understand the machines chewing over their precious content? Other evidence related to NEJM and their “AI Companion” suggests they do not, and that neither publisher robustly tested these systems before promulgating them.

Like academia, publishers have rushed headlong into the AI race without doing the due diligence to ensure that this fits with their roles or jobs in society. LLMs appear to make scientific reports worse in many ways and add confusion to the scientific record by concatenating information using techniques no scientist can actually attest to. Now, it seems that efforts to mark portions of the record as problematic for whatever reason may be thwarted or undermined by LLMs, as well as efforts to make post-publication corrections.

Can we retract from an LLM? JAMA may find out soon enough.

Or will their entanglement with OpenEvidence make them flinch?

Getting involved with unproven AIs was something Joy cautioned against in August, but by then NEJM and JAMA were already too far gone . . .

Finally . . .

Rushed out and mostly vibe coded, LLMs are a bug-filled mess. As Chuck Wendig wrote in a recent open letter about the foolishness of all of this:

Publishers can and must avoid using generative AI and LLM AI. Publishers remain competitive by hiring and training real people to do real people jobs that support real people authors and real people readers. AI remains a broken foot. Bad for the environment, bad for writers, and also, generally doesn’t work well — it certainly doesn’t work as well, or as creatively, as actual humans! Remember, the AI is fed with the work of actual humans. Why do you think that is, exactly?

The big picture is sending strong signals this hype cycle is ending. As the always-astute Gary Marcus wrote last week, debt financing for the AI bubble is starting to dry up, skepticism about capabilities and reliability is mounting, and the entire fiasco is coming into focus for more and more people:

The technical problems are not new. And a trillion dollars or so of investment hasn’t remedied them.

What is new is that they have at last become widely recognized, not just as the transitory “bugs” the industry wanted you to think of, but as the inherent limitations that flow from the very design of LLMs that they really are. This cold reality in turn undermines a large fraction of the use cases that people initially fantasized about.

  • For science, it’s perhaps worth contemplating that the $1 trillion of investment in LLMs, which seems to have gotten us nowhere, could have gone to funding 22 years of the NIH research budget at the levels that existed before RFK began gutting science funding.

Worst of all is to see these top medical journals adopting LLMs or participating in these text-extrusion/magic-beans systems. Doing so violates so many norms, which one researcher believes we need to call out as “weird”: unscientific, untethered from the effort to identify truth and facts, and ultimately abnormal and, well, weird. NEJM and JAMA have created trusted, proven brands by not being weird, but here we are; something weird has happened inside those houses.

Let’s hope these identified problems with LLM systems cause the smart, non-weird people at these top journals to rethink and retrench in a more science-focused manner, with no further illusions about how limited LLM-based technologies truly are and always will be.

  • The problems are structural at the LLM tech level, not at the content level.
    • The problems I uncovered really didn’t take much effort to identify — writing them up did, but finding them was child’s play, absolute child’s play.

What a fine mess we’ve made of science by subjugating it to tech.


