Minding the Open Data Gap

30 Aug 2015

Mind the Open Data Gap

Over the past several years, it has been encouraging to see governments embrace and deploy open data sites. This shift towards open data is monumental and signals a change in how governmental organizations connect with those they serve. Sites like Data.gov, Oregon Open Data Portal, and Data Driven Detroit are becoming the norm rather than the exception.

Code for America, local advocacy groups, and innovative companies like Mapbox and Azavea have pushed this movement even further. There is even an industry forming around providing open data solutions for organizations. Newer companies like Socrata have stepped into this space, and long-standing industry behemoths like Esri now offer an open data product as well.

This is all great, and beyond my wildest dreams from when I first started working in the open data space. But since I’ve started working at Stanford University Libraries, I’ve come to believe we need to do even more to provide long-term access to this data. If an organization removes access to a dataset in an open data portal, does the general public even notice?

The lack of interest, the disdain for history is what makes computing not-quite-a-field.
— Alan Kay

People place inherent value on published or physical items. Much of the data published to open data portals, by contrast, is treated as ephemeral. One day you may be able to access a dataset and the next you may not. And just because a dataset is listed on Data.gov, that doesn’t necessarily mean you can download or use it. Loss of access to open data can happen for many reasons, including:

Web and file services are no longer maintained (due to cost, organizational challenges, loss of institutional knowledge, etc.)
Decisions are made to withdraw data that is not being frequently accessed
Historical data is overwritten to make room for current business needs only
Data is not adequately described or curated

Loss of access to this data may not cause any immediate negative ramifications for a government agency, or even for the users trying to access it. But this loss is perpetuating a growing data gap for digitally created data. To illustrate, someone who comes across a dusty old map in a storage closet might think to themselves, “Hey, this could be valuable.” They may even go so far as to check with a local library or museum to see if they would like the map rather than throwing it away. But with digital-only content, preservation is rarely considered. The dusty data represented in bytes is much more frequently created and deleted without a second thought. And even when they are contacted, many libraries and museums are not equipped to preserve such digital content.

Data that seems as if it should be available from an authoritative source often isn’t.

Anyone know an open dataset of 2012 US presidential election results? Values in http://t.co/2CDk32h9r5 seem off. Thanks! #followerpower
— Anita Graser (@underdarkGIS) August 23, 2015

So what is the solution? Government agencies, open data advocacy groups, and libraries all have a role to play and should be working together. If enough thought has gone into publishing the data in an online portal, that same data should be preserved in perpetuity. What we need is a distributed Internet Archive for data.

At Stanford we are already preserving all of the data that we serve out through our spatial data infrastructure and our discovery portal, EarthWorks. But one university, or even a handful of universities, doing this isn’t enough. A coordinated effort between organizations is needed to provide near- and long-term access to this huge amount of content.

Who’s up for this?