Building meaning at scale

The way large language models leverage text from the internet has upended how we think about and access all kinds of data. But, used well, these models might also open new possibilities for maintaining rich, searchable archives that aren't feasible today. Take journalism as an example. LLMs, with their need for large corpora of high-quality text, have driven a wave of concern around access to news articles. News articles are not nearly as abundant as Reddit threads and fan fiction, but they are professionally written and edited. And as LLMs move toward serving as chat agents (especially with Google's push for AI-driven search), encoding current events and information in that high-quality text becomes an attractive proposition. This is a new paradigm for thinking about journalism: not as individual records, not as an aggregate representation of important events, but purely as a pile of text.

Contrast this view with a resource like the Internet Archive, which has also amassed an enormous corpus of news articles. The Internet Archive preserves the original shape of the record as much as possible: how the website where the article was published looked, how the story elements were laid out, the metadata (headline, byline, publication date, etc.), and even how the individual record changed over time. This mode of replication is not intended for large-scale ingestion. There is a computational cost to replicating web pages faithfully, and the idiosyncrasies of these records make them difficult to compile in large volume.

These two views, news as training data and news as historical record, sit at opposite ends of the spectrum of how news data might be persisted. Each has a clear-cut use case: LLMs fold news articles into their training data along with lots of other text from across the internet, while an archive ensures access on a smaller scale.
