Finding New Narratives in Old Stories

For the last couple of years, I’ve been thinking a lot about the future of journalism. Not so much about “Journalism,” the field/profession/fourth estate, but rather the individual pieces of reporting and explanation that we all continuously chuck into the ever-expanding sea of “content.” What will become of them? Will this webpage still be interesting or useful for a reader a month from now? In a year? What about ten years from now?

This is not how most newsrooms think, and I can’t really blame them. Most articles/blog posts/videos/stories have a very narrow window for capturing audience attention, which is inevitably followed by an exponential decay in interest (the very rare exceptions are articles that become canonical, as a top reference for a popular Wikipedia article, for instance, or as a top result for a common Google search). There’s little money (or glory) to be made by optimizing for a handful of readers from the future who might happen to stumble across the link. Furthermore, in a rather twisted feedback loop, the ever-increasing supply of journalism has pushed publishers to increase their own production just to maintain market share (or mind share in a marketplace of ideas), in the hope that at least some of it will be noticed.

Yet, long before digital publishing incentivized this churn, journalism was already treated as a disposable commodity with an expiration date. Yesterday’s newspaper is meant to be thrown away. A nightly newscast goes into the station vault and is unlikely to be seen again. A magazine may linger longer, but still won’t earn a permanent spot on the bookshelf (unless it’s part of one of those National Geographic collections people used to keep).

But just because most people throw away old news artifacts doesn’t mean the information they contain is worthless. For historians, access to old news via libraries or commercial databases like LexisNexis can provide a treasure trove of research materials. When archived properly, journalism really can be the “first draft of history” (a useful cliché, regardless of who said it first).

Web publishing has greatly improved access to this type of archival material, because the journalism itself no longer needs to be disposable. The cost to store text is trivial, so there’s no need to delete yesterday’s news to make room for today’s story. And having access to modern full-text search makes it possible to actually find the stored information you need (although most search engines leave lots of room for improvement when it comes to historical research).

In fact, with web publishing tools, there’s no longer any technical reason for us to let news get stale in the first place. The fact that it’s trivial to update existing stories means that digital newsrooms, for the first time, have the ability to do journalistic maintenance, making sure existing articles are always canonical and up to date (Wikipedia has demonstrated that this is actually achievable). But most have not embraced this opportunity (even when it’s prominently mentioned in the founding manifesto); there are powerful cultural and business reasons that publishers and editors would rather create new material than maintain and curate an existing corpus.

Just as the web made it easier to create and save new information, it also made it easier to lose stuff. Entire sites can disappear due to turned-off servers or expiring domain registrations. The work done by folks like The Internet Archive is absolutely essential, but I also don’t think they should be solely responsible for archiving digital journalism. Too many publications don’t bother to migrate URLs properly or maintain servers for interactive and special content. As Steve Buttry and others have pointed out, news archives are a huge missed opportunity. I’d like to see news organizations commit real thought and effort to the stewardship of their own archives.

To that end, I’ve been working on a couple of projects that involve mining old articles for new insights. My goal is to identify and illuminate some of the meta-themes and narratives that connect the day-to-day beat coverage. I’m also interested in investigating ways to extract structure from unstructured articles and blog posts. And I hope that using beat coverage as a historical dataset will influence the way we create and structure future stories (I’m keeping a close eye on Structured Stories).
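To give a sense of what “extracting structure from unstructured articles” can mean in practice, here’s a minimal sketch. The record fields, the headline heuristic, and the date pattern are all illustrative assumptions on my part, not the actual pipeline for either project:

```python
import re
from dataclasses import dataclass, field

@dataclass
class ArticleRecord:
    """A minimal structured record pulled from raw article text."""
    headline: str
    dates: list = field(default_factory=list)

def extract_structure(raw_text: str) -> ArticleRecord:
    """Treat the first non-empty line as the headline and collect any
    ISO-style dates (YYYY-MM-DD) mentioned anywhere in the body."""
    lines = [ln.strip() for ln in raw_text.splitlines() if ln.strip()]
    headline = lines[0] if lines else ""
    dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", raw_text)
    return ArticleRecord(headline=headline, dates=dates)

sample = """Outage hits payroll system

On 2015-03-14, a software glitch halted payments for thousands of workers..."""
record = extract_structure(sample)
```

Even a crude extractor like this turns a pile of posts into a queryable dataset (e.g., incidents per year); the interesting work is in the messier fields, like affected systems and failure causes.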

I plan to keep track of the progress and hurdles we run into on this blog. But for now, here’s a brief description of the two datasets I’m working with:

What We’ve Learned From A Decade of Software Failure

  • I’ll be working with the entire archive of the Risk Factor blog, where Robert Charette has been cataloging software failures and glitches for several years.

Bangalore Traffic

  • I’ll be looking at local English-language newspaper articles, blog posts, and social media postings (collected by my wife, Erica Westly) to see how official data about road injuries and deaths in the city diverges from the qualitative experience that citizens are documenting.

If you’ve worked on similar projects I’d love to hear from you and learn from your experiences. In fact, I’d love to talk to anyone who shares an interest in this topic, so get in touch: @joshuarrrr or josh [at] joshromero [dot] com.


Right after I published this post, I saw Columbia Journalism Review’s new article about how the archival websites of two alt weeklies–the Boston Phoenix and the Providence Phoenix–are no longer usable. Both sites crashed just two years after the newspapers ceased operations, despite the publisher’s best, if not necessarily well-informed, efforts. (From the article: “We assumed that as long as we kept the servers working they’d be ok.”) It’s disappointing to learn that the Boston Phoenix actually rejected offers of help for archiving their digital material.

Groups like the Reynolds Journalism Institute are working to prevent these sorts of outcomes. RJI held the first Newspaper Archive Summit in 2011. Next came Dodging the Memory Hole 2014; RJI helpfully posted videos of all the talks from that meeting on YouTube. The follow-up conference took place just last week, but I’ve yet to see much publicly available material posted. An exception is Ben Welsh’s presentation of “Time Travel,” a new service for browsing web archives, and a nice companion to his previous work with “PastPages”.

Finally, few people show as much enthusiasm for digital archiving as Jason Scott, as evidenced by his latest effort to collect the once-ubiquitous AOL CD-ROMs.