Ever since the internet’s big-bang moment, archivists have struggled to keep up with its burgeoning content. It’s time to change the rules. By Mark Pesce.
Where does everything on the internet go?
Twenty-eight years ago this month, I got onto the world wide web. First among my friends – and among the first few thousand web users anywhere in the world – I explored a tiny universe of content.
But how did web surfers find anything before search engines such as Google – a tool that came along several years later?
In the beginning was the list: just a page of names and links to websites, sitting on the CERN website – the first website, and the birthplace of the web.
I worked my way down that list, methodically clicking on link after link, exploring the website that loaded into my “browser”, and going on to the next. By the time I’d finished that list, another list had appeared, this one hosted at the National Center for Supercomputing Applications or NCSA. Almost as famous as CERN, NCSA gave the world the first widely used browser, Mosaic. NCSA’s list had a lot of overlap with the list at CERN, but a few new sites would pop up on the bottom of the list every day, so I spent another day or two visiting all the sites on that list that I hadn’t already visited.
In seven days, I’d finished. I’d surfed the entire web.
For a few months, I managed to keep up with the list of new websites as they popped up on the NCSA list, priding myself on staying current with this amazing new technology. But by the end of February 1994, more sites went onto that list every day than I could find time to explore. Not long after that, the list’s maintainer threw in the towel – the web’s exponential growth meant that no archivist could hope to keep pace with it.
In early 1994, two enterprising Stanford University students set up Jerry and David’s Guide to the World Wide Web, a part-time project that rapidly grew into the first of the internet “unicorns”: Yahoo! Created by David Filo and Jerry Yang, Yahoo! took a librarian’s approach to the too-much-good-stuff of the early web, and asked you to choose your category, then your subcategory, and possibly even your subsub and subsubsubcategories, leaving you with a curated list of websites that you could examine at your leisure, each dedicated to your subsubsubtopic of interest.
It took 18 months for exponential growth to overwhelm Yahoo!’s category search; every subcategory produced a list of sites too long to explore. At this point, I started keeping lists of links – “bookmarks” – like a breadcrumb trail to guide me back to my favourite sites. When that list got long enough, I curated the best of the best, gathered them into a list named Stones, Stars and Gold, and put them on a page on my own website.
Visiting that list today and working methodically from top to bottom, I find that only about one-fifth of the links load the pages they pointed to back in 1995. Most of them go to nothing at all, or to something that has the same name but is completely different. In less than a generation, my snapshot of the early web – very personal, specific and meaningful – has nearly rotted away.
The term “link rot” may not be new – the concept dates back to the first decade of the web – but most people won’t know that the web had been designed to do its best to prevent the untimely death of links. The uniform resource locator, or URL, had been defined by web creator Sir Tim Berners-Lee as “immutable” – it must not change. A URL gets assigned once – a pointer to a page or a photo or a podcast – and that’s it. That URL always points to those bits. That’s the theory, at least. Unfortunately, immutable URLs immediately went into the too-hard basket. From that moment, the rot set in.
Brewster Kahle saw the problem almost immediately. In 1996, the co-inventor of WAIS (Wide Area Information Server) founded the Internet Archive, and began a methodical backing-up of the entire web. “How can you cite a document if the documents go away every 44 days?” he asked, using his back-up of the web to power something named the Wayback Machine – a technology intended to quell the rot. Pop a dead URL into the Wayback Machine and it will show you all of its back-ups of that web page, all the way back to the beginning of its first scan, 25 years ago.
Using the Wayback Machine on my 80 per cent dead list of favourite links from 1995, I find that many, likely most, of those websites can be recovered. The links themselves may be dead, but the pages and images once pointed to by those links continue to persist. If I wanted to, I could re-create my page with links that leveraged the Wayback Machine, breathing life back into the list. Yet that may not be enough to prevent a more pernicious form of link rot.
A recent paper from a group of United States-based researchers shows that even a good back-up of the web might miss the point. “Where did the Web Archive go?” details the fate of four web archives – the Internet Archive, fortunately, not among them – that changed their own URLs during 14 months from 2017 to 2019. Though well intentioned, those changes broke many of the URLs pointing to the content within those archives. An archive is great – certainly better than losing data. But an archive that doesn’t provide immutable URLs for its data, well, that’s peak link rot.
We all generate so much data all the time now – on smartphones and wearables and Zoom calls and so on – that archiving is no longer a luxury. Without an archive, we lose our connection to our digital past. I learnt this when I went looking for online resources about the First International Conference on the World-Wide Web, which I attended at CERN in May 1994. There’s very little documentation, and only a few photos, for one of the most important events in the history of computing: the web’s “big bang” moment. Why? The answer is almost too obvious: the conference took place before the web took off. The medium we now use to record, commemorate and share our experiences simply didn’t exist. It was only subsequently brought into being by the 300-plus researchers who attended the conference.
The shadow cast by that absence showed me that if we aren’t very careful, we could lose our connections to our past. The data may remain somewhere but could be so difficult to locate that most people would simply resign themselves to a kind of perpetual digital amnesia. In Nineteen Eighty-Four, George Orwell wrote: “Who controls the past controls the future.” I’d suggest that those who forget the past don’t have much of a future.
All of the archives we add to daily – in the photos we share on Facebook, movies uploaded to YouTube, diatribes posted to Twitter, and so on – mean this threat touches nearly all of us. What can we do? We can demand immutability in perpetuity. Any organisation that publishes to the web should guarantee that even when it revises its systems, existing data will remain available and accessible forever, via the same URLs. We cannot let our history rot away. It doesn’t need to happen, and it shouldn’t. Not if we want to be able to understand how we got here, and where we’re going.
This piece was produced in collaboration with cosmosmagazine.com.
This article was first published in the print edition of The Saturday Paper on Oct 30, 2021 as "Search history".