By Kent Anderson — Jun 28, 2023

LLMs’ Risks Begin to Manifest

James Vincent at The Verge has been documenting how much change LLMs are driving on the existing public Internet, providing a host of useful links in the process:

Google is trying to kill the 10 blue links. Twitter is being abandoned to bots and blue ticks. There’s the junkification of Amazon and the enshittification of TikTok. Layoffs are gutting online media. A job posting looking for an “AI editor” expects “output of 200 to 250 articles per week.” ChatGPT is being used to generate whole spam sites. Etsy is flooded with “AI-generated junk.” Chatbots cite one another in a misinformation ouroboros. LinkedIn is using AI to stimulate tired users. Snapchat and Instagram hope bots will talk to you when your friends don’t. Redditors are staging blackouts. Stack Overflow mods are on strike. The Internet Archive is fighting off data scrapers, and “AI is tearing Wikipedia apart.” The old web is dying, and the new web struggles to be born.

While Vincent portrays this as the death of the current Internet and the birth of a new Internet, it’s unclear whether the nascent version is benign, manageable, or even desirable. It may be the technology version of Rosemary’s Baby.

MIT Technology Review reports on the explosion in content farms generated using LLMs, and shows how these farms are drawing programmatic advertising from more than 140 major brands away from legitimate sites. (Each brand represents a company with more than $500 million in annual sales.) The suspect sites drawing off these ads publish dozens or hundreds of content pages every day, a strong indicator that an LLM is at work. One site averaged 1,200 articles per day during a one-week period. Other sites generated more than 5,000 articles per week.

The source study comes from NewsGuard, a site devoted to detecting and guarding against misinformation. NewsGuard also sells an exclusion list of content farm sites to help publishers avoid placing ads there, so it has an interest here, though one I’d consider legitimate. It doesn’t identify the brands involved because it regards them as the victims, but describes them as:

. . . a half-dozen major banks and financial-services firms, four luxury department stores, three leading brands in sports apparel, three appliance manufacturers, two of the world’s biggest consumer technology companies, two global e-commerce companies, two of the top U.S. broadband providers, three streaming services offered by American broadcast networks, a Silicon Valley digital platform, and a major European supermarket chain.

LLMs may be undermining scientific research in unexpected ways.

Academic researchers using crowd worker sites like Amazon’s Mechanical Turk may be receiving artificial results. A preprint on arXiv (yes, I know, but this seems plausible and poses an interesting hypothesis) suggests that 33-46% of workers asked to annotate data were using LLMs to increase their efficiency, i.e., the amount of work they got paid for. When workers use LLMs, the data annotations are made not by a human but by a machine, in what is probably a recursive fashion. Another arXiv preprint suggests that LLMs set up to feed recursively on their own output begin to “forget” as their reality narrows, and ultimately collapse.

The speed with which LLMs have taken over the consumer web, and even the practices of workers, makes prohibitions against the use of LLMs or generative AI in research studies harder to implement or enforce. After all, will humans asked to keep a health diary ask an LLM for help writing entries? Will data analysts under pressure to meet a publication deadline ask an LLM to describe a complex data set? Will everyone know or want to disclose their use of an LLM?

LLMs are taking over the public Internet and many of its manifestations with unprecedented speed and effect. Where will it lead? Casey Newton, who writes the “Platformer” newsletter, which covers developments in Silicon Valley, is worried about humans losing their bearings in a rising information fog, writing:

. . . the glut of AI text will leave us with a web where the signal is ever harder to find in the noise. Early results suggest that these fears are justified — and that soon everyone on the internet, no matter their job, may soon find themselves having to exert ever more effort seeking signs of intelligent life.
