Is the Web Dying?

The open web is already under assault and damaged. Will new search tools deal it a death blow?

Casey Newton is one of the best-informed writers about all things digital, and a recent issue of his “Platformer” newsletter opened with this:

If the web is going to die, it won’t die all at once. 

Instead, it will die a little bit at a time: with a venerable old publication being folded into another; with a well-funded upstart sputtering into nothing in under a year; with a subscription business abandoned after it struggled to find enough good writers to keep it going.

The death of digital media has many causes, including the ineptitude of its funders and managers. But today I want to talk about another potential rifle on the firing squad: generative artificial intelligence, which in its capacity to strip-mine the web and repurpose it as an input for search engines threatens to remove one of the few pillars of revenue remaining for publishers. . . . the future of the web depends on what products get built for it. On what labor is funded, and what labor is not.

The Internet has a long history of companies interposing themselves between users and the open web — CompuServe and AOL spring to mind from the 1990s. Social networks morphed into social media in order to claim the same territory, and they did so far more successfully, and more subversively, providing the layer at no cost while extracting data and manipulating distribution in unseen ways.

Now, we’re entering a new era of intermediaries — this time, it looks like they’ll be based on LLMs.

If there were any doubt that the notion of LLMs interposed between experts and sources is coming to our space, here is some text from a sketchy email I received just this morning, from an individual claiming to have:

. . . solved a 200 year old problem in academic research and accessibility using AI and LLM’s. . . . by removing all interfaces between researchers and anybody in the world who can benefit from their work and enabling mass scale personalised chats using AI agents.

In other words: we will scrape the literature, interpret it ourselves, and give it away, extracting content from sources and data from users along the way. But the real problem is epistemic. As anyone who has compared what people claim a paper says with what it actually says when read carefully knows, relying on summaries is a chilling prospect.

Newton describes two LLM-based (I refuse to call them AI-based) search tools being developed — Perplexity and Arc Search — which are designed to provide answers rather than deliver audiences to sources.

Platforms that deliver “answers” while removing the incentive to visit the sources of information have real implications for the business models of publishers of all kinds — after all, if the only users are bots, what chance do publishers have to sell the ads, subscriptions, or site licenses that keep them in business?

Google has been driving down this road for years, but now it looks like others are taking the wheel.

The potential for misinformation is growing as humans are sidelined at every level, from the evaluation of inputs to the evaluation of outputs.

Newton also notes that Quora, once a content farm of human-written answers, has been degraded by its reliance on bots and algorithms into what one tech journalist describes as:

. . . a never-ending avalanche of meaningless, repetitive sludge, filled with bizarre, nonsensical, straight-up hateful, and [LLM]-generated entries along with a slurry of all-caps non-questions. . . . Whereas once you could Google a question about current events and find links to thoughtful Quora answers near the top of the results, you’re now more likely to come upon, say, a bunch of folks asking . . . whether the consistently racist Donald Trump is, in fact, racist. Or, maybe, the featured Google snippet will tell you that eggs can melt, thanks to a nonsense Quora answer caught in the search crawler.

Meanwhile, all is not well in the early days of LLM search. While Perplexity was given rave reviews recently in the New York Times, my first use of it showed how half-baked it currently is. I asked it to describe this newsletter, and it conflated it with one from actual geological scientists.

This kind of confusion is common in LLMs, and it often goes undetected because enthusiasts have too much faith in the technology. As Gary Marcus writes:

. . . [it’s surprising] how sensitive LLMs are to minor perturbations, like rearranging the order of answers in multiple choice, the insertion of special characters, or changes to the answer format. As it turns out, minor changes make a noticeable difference, enough to rearrange “leaderboard” rankings. If a system really had a deep understanding, trivial perturbations wouldn’t have such effects.
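Marcus’s observation suggests a simple sanity check anyone can run: ask the same multiple-choice question several times with the answer order shuffled, and see whether the model keeps choosing the same underlying option. Below is a minimal Python sketch of that kind of perturbation test; `ask_model` is a hypothetical placeholder for whatever LLM call you have access to, not a real API.

```python
import random

# Hypothetical stand-in for an LLM call; replace with a real API request.
def ask_model(prompt: str) -> str:
    # A real implementation would send `prompt` to a model and return its reply.
    return "A"  # placeholder answer

def perturbed_prompts(question: str, options: list[str], trials: int = 5):
    """Yield the same multiple-choice question with the options shuffled."""
    labels = "ABCD"
    for _ in range(trials):
        shuffled = random.sample(options, k=len(options))  # shuffled copy
        body = "\n".join(f"{labels[i]}. {opt}" for i, opt in enumerate(shuffled))
        yield f"{question}\n{body}\nAnswer with a single letter.", shuffled

question = "Which gas makes up most of Earth's atmosphere?"
options = ["Nitrogen", "Oxygen", "Carbon dioxide", "Argon"]

chosen = set()
for prompt, shuffled in perturbed_prompts(question, options):
    letter = ask_model(prompt).strip()[:1].upper()
    idx = "ABCD".find(letter)
    if 0 <= idx < len(shuffled):
        # Record the option text, not the letter, so shuffles are comparable.
        chosen.add(shuffled[idx])

# A model with stable understanding should converge on one option text.
print("Distinct answers across shuffles:", chosen)
```

If the set of distinct answers grows beyond one, the model’s “understanding” of the question is sensitive to mere presentation, which is exactly Marcus’s point.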

But let’s assume LLM-driven search improves and becomes reliable. What then? Where will Perplexity or its kin get information if all the incentives for conducting original reporting, vetting scientific claims for quality and community, writing insightful cultural overviews, and investigating social inequities go away?

Already, 2024 has seen layoffs across the media landscape, as cost-cutting proceeds in response to the threats, promises, and delusions that LLMs have created, amid the trouble they’ve already stirred up.

Worse, technologists aren’t the only ones thoughtlessly disrupting the media space. They’re doing it from below, via the infrastructure, while the assault on the superstructure — the gatekeepers, copyright, and scalable business models — has been relentless for more than two decades as self-appointed disruptors, propagandists, and unrestrained oligopolists have prevailed again and again over craven and flat-footed incumbents.

In our world, we have transgressives actively pushing to defund many elements of the human and expert infrastructure associated with making quality digital scientific and scholarly content people can trust.

From cOAlition S to publishers like Frontiers, the push for less scrutiny and more free doggerel for LLM search engines to scrape, ingest, and synthesize for their sole use — likely without compensation, given the predilection for CC-BY content licenses — has been radical, relentless, and thoughtless. With thousands of bad papers already known to exist, and many thousands more likely, what exactly is being synthesized?

Recently, the monotonous Jessica Polka pushed out yet another bit of cut-and-paste preprint propaganda, arguing essentially that OSTP and others should work to defund many roles important to quality filtration and community linkages, describing publishers as “siphoning” taxpayer dollars away, and attacking the idea of “reasonable costs” in bad faith. She seems unable to process that the laws in these matters are clear, and that publishers are within their constitutional and legal rights here.

Unintentionally, Polka is arguing that qualified professionals should shelve their expertise and lose their jobs so that major technology-driven corporations can further dominate and strip-mine the cultural landscape.

  • Note: Even Polka might be getting tired of hearing her own backwash, as ASAPBio is now advertising for her successor.
    • Or maybe she senses that when the LLM search engines come, OA publishers will be the first to be assimilated, unable to modulate their CC-BY shields fast enough to resist.

In the coming world of LLM-driven search solutions and apps, preprint platforms may simply feed free content of good-enough quality to LLM-driven search tools. Many authors have learned to use preprint servers not for actual preprinting but as a place to distribute their post-review, post-acceptance author accepted manuscripts (AAMs) just prior to actual journal publication. As a result, of the 45-65% of preprints that end up with a companion journal article, a growing proportion are effectively post-review, post-acceptance journal articles available for free, allowing a well-trained LLM search engine to bypass the publisher entirely. For publishers who don’t wake up to this fact and its implications, Newton’s words bear repeating:

The death of digital media has many causes, including the ineptitude of its funders and managers.

Playing this forward, if students, practitioners, and researchers begin to use a reliable LLM search to identify a good source, will OA content finally have an actual citation advantage? And will that advantage be due to economics, technology, or relevance and quality?

Which advantage should be privileged in a knowledge economy based on truth-seeking?

To me, this appears to be an acceleration and culmination of the algorithm economy, in which platforms demonetized creators, made a backward-oriented cultural space, sowed confusion without accountability, and caused excess deaths in various ways (amplifying anti-vax information, intensifying teen depression and suicidal ideation, and profiting off extremism and social divisions of various kinds).

Thanks to the power of these protected (by Section 230) content mediators and their inevitable move to disintermediate everyone, the open web may be dying as a vibrant, accessible, creative medium with shared commercial upside and the ability to bring new talent, good information, and valid sources to light.

The upcoming legal, technology, and cultural battles around LLMs aren’t just about jobs, revenues, copyright, and sustainability for content producers and creatives — although such stakes alone are massively important. They are about preserving a vibrant open web that isn’t dominated by a few algorithms, a small cadre of technology platforms, and a cultural strip-mining operation that seems destined to stifle science, commerce, and the useful arts.

These upcoming battles are about whether we are going to continue to be forward-oriented, truth-seeking, accountable, and rigorous about knowledge and facts.

And it all “depends on what products get built for it. On what labor is funded, and what labor is not.”