By Joshua Tauberer (http://razor.occams.info), who runs the congressional transparency website www.GovTrack.us. Thanks to Gunnar Hellekson (RedHat, Inc.) for help with the initial version, and to all others who provided feedback.
Originally published May 19, 2009. This is version 1.5 dated January 29, 2011. See the Document History at the end for a list of changes since the original.
Government data, also called "public sector information" internationally, is a valuable resource to society when it is public and open. While there is a theoretical side to making public government data widely available—creating a well-informed public is a core value of representative government—there are many practical applications for public government data in and out of civic life.
This is not a new idea, and many government entities have already begun to embrace it. A 2009 report for the U.S. Office of Science and Technology Policy described the importance of shared data from science and technology research:
"The power of digital information to catalyze progress is limited only by the power of the human mind. Data are not consumed by the ideas and innovations they spark, but are an endless fuel for creativity. A small bit of information, well found, can drive a giant leap of creativity. The power of a data set can be amplified by ingenuity through applications unimagined by the authors and distant from the original field." ("Harnessing the Power of Digital Data for Science and Society", a working group report to the Office of Science and Technology Policy, Executive Office of the President, U.S.A. government)
It is perhaps easier to imagine how government scientific data can improve lives than government records: government funded research in the life sciences helps us to cure disease, for instance. The case for sharing public government records is no different. From all types of public government data we can find many examples of where sharing the data with the public has spurred advances in everyday life, promoted our economy, and truly made our lives safer. The U.S. Office of Management and Budget has taken a similar position:
"Government information is a valuable national resource. It provides the public with knowledge of the government, society, and economy -- past, present, and future. It is a means to ensure the accountability of government, to manage the government's operations, to maintain the healthy performance of the economy, and is itself a commodity in the marketplace." (Circular A-130)
The European Union Directive 2003/98/EC, commonly called the Public Sector Information Directive, makes a similar case:
"(2) The evolution towards an information and knowledge society influences the life of every citizen in the Community . . . (3) Content production has given rise to rapid job creation in recent years and continues to do so . . . (5) Public sector information is an important primary material for digital content products and services and will become an even more important content resource with the development of wireless content services."
Open government data, on the other hand, is public government information such as government records that is shared with the public digitally, over the Internet, in a way that promotes analysis and reuse. (A more formal definition is given below.) W. David Stephenson calls the end game "democratizing data", where data goes through a life cycle of being collected, structured, syndicated to allow for timely distribution, shared with the public for oversight purposes, and shared with civic groups and entrepreneurs. Sharing digital information is, to be sure, no small matter. Legal and fiscal constraints guide the use of technology in government, and new questions arise. Does open data make government vulnerable? Is it responsible to share data without appropriate context? What data should be openly shared? These are policy questions mostly beyond the scope of this guide, which primarily addresses the question of how to make data open.
The rest of this document is split into three parts:
Open data can make the world a better place. This is no exaggeration, and the reason is that information is a crucial driving force in innovation. Information is a unique kind of resource. It is a so-called "public good": consumption of information by one individual does not reduce the availability of the information for others. This is why the benefit of information can extend far beyond its initial purpose. Governments, as a major producer of information, are therefore in a strong position to spur innovation by promoting open government data. (See Sean Gorman's "Information as a Public Good".) It is perhaps easier to imagine how government scientific data, such as for life sciences research, can improve lives than government records. But from all aspects of government data we can find many examples of where sharing the data with the public has spurred advances in everyday life, promoted our economy, and truly made our lives safer.
Environmental data has been one of the leading areas of open public data in the United States. The U.S. Environmental Protection Agency (EPA) shares a large number of raw data sets with the public which have gone on to new and unexpected uses. The EPA's initiative to share data goes beyond its core missions of research, regulation, and education. Because of the availability of this data, the private sector has been able to promote healthier living. For instance, the American Lung Association used the EPA's Air Quality System database to create a State of the Air report, an important assessment of environmental quality with ramifications especially for those with respiratory diseases such as asthma or emphysema. "Six out of ten people (61.7%) in the United States population lives in counties that have unhealthful levels of either ozone or particle pollution," the report found. This is a prime example of how government data became a valuable public resource with a use out of the scope of the government agency that created it.
Public safety is often monitored through public records. Records of mine safety inspections came into focus in 2010 after the country's largest coal-mining disaster in 40 years. Because there was an active mine-safety beat reporter covering the area, it stands to reason that the disaster might have been averted had the safety inspection documents been made thoroughly available to the public as they were produced (Froomkin 2010).
The U.S. National Weather Service (NWS) sets an example as the largest federal agency on the web in terms of data flow in most months. What is striking about NWS is not that they provide weather information to the public, which is of course a part of their mission, but that they do so in a way that promotes innovation. Edward Johnson, the director of strategic planning and policy for NWS, said (pers. comm.), "We make an enormous amount of data available on a real time immediate basis that flows out into the U.S. economy." Both no-cost distribution systems and specialized high-reliability and high-bandwidth services set up on a limited cost-recovery basis are a crucial foundation for daily weather programming and weather warnings in newspapers and on television. The "health of the environmental information enterprise", meaning the wider community involved in weather information including commercial providers, the media, and the academic community, is very important to the NWS, Johnson said, adding that the NWS can fulfill its mission more effectively by considering the public, and especially the media, partners in disseminating weather information in an open way. But beyond the convenience and safety provided by the NWS, Johnson said weather is a part of an important "information industry." He said he thinks that
"the U.S. government actually gains more from the stimulation of economic activity by making all this information available, widely, easy to use, and inexpensively than we would gain if we were to charge for it."
That open government data is a boon to a national economy is supported by an NWS report (Weiss 2002) which compared the relative openness of data in the United States to the more prominent cost-recovery and licensing model in Europe, in which the public must generally purchase government data at a price much greater than the marginal cost to distribute the information and then may not legally redistribute the data. The marginal cost of distribution is the cost of making one more copy. In the physical world, the marginal cost is the cost of processing, paper, and postage. In the online world, the marginal cost is the cost of transferring the bytes of the document over the Internet, which is so low as to make the marginal cost of distributing most government records essentially zero. The report noted that the weather risk management industry is 50 times larger in the United States than in Europe, attributed in part to the higher commercial value of weather data made available at marginal cost and without licensing restrictions in the United States compared to the weather data in European nations which is costly to buy and more tightly controlled. The report also indicates that open government data may even be more fiscally beneficial to government programs than cost recovery because more applications built on open data may translate into higher corporate taxes for the government. The report concludes that "charging marginal cost of dissemination for public sector information will lead to optimal economic growth in society and will far outweigh the immediate perceived benefits of aggressive cost recovery."
Open government data supports a wide variety of businesses. A survey of the state of public sector information in the United Kingdom (Mayo & Steinberg 2007) noted, "Geographic information underpins an impressive [part of] the UK economy. It is easy to see that without good-quality mapping, postcodes or land ownership information, large parts of the economy would be unable to function at all (i.e. anything that required delivery, or sale, rental or purchase of property)," and, "the Office of Fair Trading argued that there was some £500 million of untapped economic value in the whole UK public sector information market, on top of the £590 million currently generated." Mapping is a common use of open government data. In 2009, Google ended its long-time relationship with private-sector cartography company Tele Atlas in favor of public map data of the United States from the U.S. Census Bureau, U.S. Geological Survey, and the USDA Forest Service (Google 2009). In fact, most online maps in the United States have relied on open government data all along: Tele Atlas began as a value-add on top of some of the same information Google is now turning to directly. That a major company such as Google would replace private with public data is a strong argument for both the economic and societal value of public mapping data. Much more discussion on the economic value of open government data can be found in Uhlir (2009).
Though the economic benefit is a secondary consequence of an open data policy at the NWS, sometimes economic goals are the primary concern. The treatment of business information by the U.S. Securities and Exchange Commission (SEC) is a prime example of how data benefits the economy. The SEC requires publicly traded companies to disclose financial statements, and these statements are passed on to the public. "The result of this information flow," the SEC notes on its website, "is a far more active, efficient, and transparent capital market that facilitates the capital formation so important to our nation's economy." The information flow is supported by the use of technology. Since the 1990s many filings have been electronic, and today the SEC relies heavily on XML and a newer, also non-proprietary open standard called eXtensible Business Reporting Language (XBRL), distributing these documents to the public at no charge, with no license restricting use, and in bulk, to make the most out of the information flow.
There is a long tail to how open government data is used. Websites like www.iLive.at, a winner in the 2008 Apps for Democracy contest run by the Washington, D.C. municipal government, can transform city operational data into useful tools, in this case providing a peek into neighborhood life for someone who might be moving to the area. (See the Washington, DC Data Catalog.) Open government data can also be inspirational, such as NASA’s photos of Earth from space and digital versions of historical artifacts including the founding documents made available by the Library of Congress. An article in Politico (Lovley 2009) reviewed several entrepreneurial uses of government data: FlyOnTime.us helps travelers find the most on-time flights (note: I have since contributed to the development of the site), and Cornell University scientists are using recordings at the Library of Congress to vet sightings of the thought-to-be-extinct ivory-billed woodpecker. Lovley wrote, "every fact matters to someone."
Finally, better governance is perhaps the most important component of open government data. This comes about in two ways. The first is basic civic education, that is, keeping the public informed about the workings of government. The U.S. Congress makes its voting records available as a database to the public, and this database has been used by newspapers (The New York Times and The Washington Post) as well as start-ups (my own www.GovTrack.us, for instance) to aid the public in understanding what decisions were being made in the Congress. In this example, it would not be appropriate for there to be a single, government-run website that monopolizes voting record information. Similarly, OpenRegs.com is an independent window into the executive branch regulatory process. The public's ability to find the best ways to educate itself, using different forms suitable for different people and purposes, is a foundation of transparency and accountability, as well as public trust.
The second way open government data contributes to better governance is by making the public more self-reliant, which reduces the need for government regulation. This point, which I have only seen made by Mayo & Steinberg (2007, page 39), suggests that the more information the public has at its disposal to evaluate goods and services, the better functioning markets will be and the less need there will be for government interference. The cost of sharing or even collecting the information necessary to make a more self-reliant public may be less than the cost of regulation.
Numerous examples of “open PSI at work” are listed in Fioretti (2010), in the categories of geographical data, local transportation, demographics, election support, energy production and consumption, budgets and taxes, local economic activities, real estate, environmental data and pollution measurements, aggregate health or health-related data, security and legal, education, waste management, and water management.
Really just starting this year (2009), we've seen greater uptake of the idea of open government data within the United States Government. This year we've seen at the federal level the creation of data.gov, a directory of data sources from the executive branch, and a directive in Congress to explore bulk data access to legislative records (Zetter 2009). The New York State Senate has launched APIs for access to legislative records (http://open.nysenate.gov/) and the New York City Council held a hearing in June on open data standards. The City & County of San Francisco is now investigating creating a platform for open government data (http://apps.sfgov.org/opendata).
Of course, the idea of open government records as part of government transparency isn't itself new. Federal law has long codified some aspects of the government's responsibilities with respect to information openness. The Freedom of Information Act (1966; 5 USC 552) mandated public access to government records; the Paperwork Reduction Act (1980; 44 USC 35) “ensure[d] the greatest possible public benefit from and maximize the utility of information created, collected, maintained, used, shared and disseminated by or for the federal government”; and the E-Government Act (2002) expands on "using Internet-based information technology to enhance citizen access to Government information and services." (See CRS Report 97-71, 2009 for a summary.) But technology has changed fast and today's culture surrounding open government data did not exist when these laws were created.
Likewise, many government agencies embraced these concepts before they were well understood. The Census Bureau and the Federal Election Commission, for instance, have long been fantastic at sharing with the public as much as they can. New York City has had its NYCStat directory of city operational data (http://www.nyc.gov/html/ops/nycstat), which was expanded in 2008 with performance statistics. Baltimore launched CitiStat in 2001, which led to the state's StateStat in 2007 (http://statestat.maryland.gov).
Mayo & Steinberg (2007) list many developments in the United Kingdom both in terms of the availability of public sector information as well as how the public has used the information to innovate. Steinberg was himself an early leader in this field and today runs the nonprofit MySociety.org. A thorough and more recent review can be found in Davies (2010), whose timeline of UK OGD events notes as early as 2003 an Advisory Panel on Public Sector Information to encourage “opportunities in the information industry for greater re-use of public sector information”.
A report from Australia's Department of Broadband, Communications, and the Digital Economy (2009, page 12) lists recent developments in open government data in the country:
"The combination of public sector information with mapping technology can also increase our understanding of important public policy issues, such as health. . . . The Western Australian Data Linkage System links Western Australia’s core population health data sets and provides de-identified, trend data to research, planning and evaluation projects which aim to improve the health of Western Australians. . . . The [Australian Broadcasting Corporation] now releases most of the data on its website under a Creative Commons Attribution 2.5 Australia license. . . . Three Australian cultural institutions . . . have joined the Flickr Commons project, which consists of museums and libraries from around the world publishing their images under a 'no known copyright restrictions' tag to increase access to publicly-held collections and provide a way for the general public to contribute information and knowledge."
No doubt many more examples will be included in this section in the future.
Trends in Austria, Finland, France, Germany, Iceland, Italy, Norway, and Sweden are covered briefly in Fioretti (2010).
What information should governments prioritize when opening data? The Association of Government Accountants' 2009 principles succinctly capture two important guiding rules:
"4. Provide information that helps make decisions."
"8. Be conscious of the dollar cost of transparency, and invest wisely in it. Set priorities for disclosure, and strive for the best return on investment."
Governments have a special responsibility to provide the public with information that supports the public in their role in government oversight. Sunlight Foundation's Principles for Transparency in Government recommends several priorities for what information should be shared with the public with the most useful application of technology:
W3C, the standards body that has guided most of the development of the Web, published in the working draft Publishing Open Government Data (Bennett and Harvey 2009) the following recommendation for prioritization:
"First, publish all data that is already available in other formats (such as print), or already being collected and shared with the public in other ways. Publish both highly structured data and text documents. Publish laws, regulations and codes, and public documents related to government agency mission, vision, values and goals."
But new data on its own can have unintended consequences that should be kept in mind when prioritizing data release. Michael Gurstein wrote, “Newly available access to land ownership and title information in Bangalore was primarily being put to use by middle and upper income people and by corporations to gain ownership of land from the marginalized and the poor.” Gurstein points out that not all data yields an “effective use” of the data.
As we've seen above, shared knowledge and in particular open data is a valuable resource to society, but having a website or posting a report online may not be enough to meet the expectations of openness in the 21st century. In this section we explain that data's value depends not only on its subject, but also on the format in which the information is shared. Format determines the value of the resource and the extent to which the public can exploit it for analysis and reuse. This section makes two key points:
Although the benefit of information is to people, machines play a fundamental role in our own abilities to consume information. For instance, the information landscape was expanded in the past by machines including the printing press, radio, television, and computer with the Internet. These tools lowered the cost of distribution of information, which in turn allowed people to consume more of it. But distribution is just one side of how machines can help us. On the consumption side, machines like pens and pencils allow us to take notes, which allows us to summarize and to save information for later. When looking to capitalize on information, we must consider in what ways we can take advantage of tools to make the information more valuable.
The computer is also a tool on the consumption side of information. Its uses can be summarized as searching, sorting, and transforming information. Simple examples are a web search to find information, sorting a spreadsheet to understand expenses, and creating charts and graphs as visual representations of numerical information.
But a computer can't meaningfully search, sort, or transform just any file given to it. We might say that prose is poetry to a computer. Computers can't make heads or tails of English. Let's take an example. Give a computer a schedule of meetings written out in plain English:
Example: In two Thursdays, the Reform committee will pick up where we left off. The Commerce, Science, and Transportation committee will meet on the eighth in the Rayburn Senate Office Building to resume unfinished business from the November 3rd meeting...
The computer's task is to collect all of the meeting notices and list them in chronological order. Computers don't deal with human languages very well at all. In fact, no one knows how to program a computer to understand those meeting notices as well as you or I can. A computer will inevitably foul up "the eighth": did you mean December or January? Was November 3rd the date of that upcoming meeting (perhaps in 2007) or a reference to one that's already past? It wouldn't make sense to publish public information like voting records as a haiku, and likewise if we want computers to help us out, we have to give them a little bit of help by publishing information in a language they can work with.
Giving structure to information means putting the information in a precise format that one can instruct a computer to follow. A table with a column of dates in a consistent format, such as YYYY-MM-DD, is machine-processable. A computer can be told to maintain a list in chronological order once it knows where in the file to look for dates and how to interpret the date text as a year, followed by a month, followed by a day. In comparing a free-form report with a machine-processable table, the difference in value is immense. The former limits itself to being read. The latter opens up new possibilities for analysis, from helping the public keep up with scheduled hearings to allowing them to perform oversight or educate their peers.
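As a rough illustration of the point above, the following Python sketch puts a small schedule into chronological order. The CSV layout, the committee entries, the room numbers, and the helper names load_meetings and chronological are all assumptions made up for this example; they do not come from any real government calendar.

```python
import csv
import io
from datetime import date

# Hypothetical schedule data: the rows below are invented for illustration.
SCHEDULE_CSV = """date,committee,room
2009-12-08,"Commerce, Science, and Transportation",Room 253
2009-11-03,"Commerce, Science, and Transportation",Room 253
2009-12-17,Reform,Room 2154
"""


def load_meetings(text):
    """Parse the CSV text and turn each date string into a date object."""
    meetings = []
    for row in csv.DictReader(io.StringIO(text)):
        # date.fromisoformat expects exactly the YYYY-MM-DD layout.
        row["date"] = date.fromisoformat(row["date"])
        meetings.append(row)
    return meetings


def chronological(meetings):
    """Return the meetings sorted from earliest to latest."""
    return sorted(meetings, key=lambda m: m["date"])


if __name__ == "__main__":
    for m in chronological(load_meetings(SCHEDULE_CSV)):
        print(m["date"].isoformat(), m["committee"], m["room"])
```

Because the computer knows which column holds the date and how the date is laid out, the sort needs no guessing. The same call to date.fromisoformat would reject a phrase such as "the eighth" with a ValueError, which is the difficulty a prose schedule poses.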
Particular choices can be made, however, in how to encode information in a machine-processable way. These choices determine what aspects of the information are machine-processable, and there is usually a trade-off between specificity and cost. A calendar file might or might not encode the latitude and longitude coordinates of a meeting. The availability of geographic coordinates would certainly add the ability for machines to process location information in the schedule reliably (it would help users obtain driving directions, for instance), but the cost of adding such coordinates to the data may not be worthwhile. At a higher level, one often chooses the "file format" that is most appropriate for what needs to be represented in a machine-processable way. Word processing document files are machine-processable data files that instruct the computer where to put text on the page and in what format. Web pages, to a computer, are machine-processable "HTML" files that indicate how to present text on the screen, what should happen when a link is clicked, etc. News websites and blogs make use of a machine-processable file format called "RSS" which can be used to aggregate headlines in applications called feed readers. The XML and CSV file formats are most often used for tabular data or other types of records. There is no general machine-processable file format that is applicable for every application.
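The coordinate trade-off described above might look something like the following Python sketch. The column names, the place names, the coordinates, and the helper functions are all hypothetical. The sketch assumes a publisher that fills in latitude and longitude for some meetings and leaves them blank for others.

```python
import csv
import io
import math

# Hypothetical records: place names and coordinates are illustrative only.
MEETINGS_CSV = """date,committee,location,latitude,longitude
2009-12-08,Commerce,Main Hearing Room,38.8899,-77.0091
2009-12-17,Reform,Annex Building,,
"""


def parse_coordinate(text):
    """Return a float, or None when the publisher left the field empty."""
    return float(text) if text.strip() else None


def load_meetings(text):
    """Parse the CSV text, converting the coordinate columns to numbers."""
    meetings = []
    for row in csv.DictReader(io.StringIO(text)):
        row["latitude"] = parse_coordinate(row["latitude"])
        row["longitude"] = parse_coordinate(row["longitude"])
        meetings.append(row)
    return meetings


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance by the haversine formula (Earth radius ~6371 km)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def distance_to_meeting(meeting, user_lat, user_lon):
    """Return the distance in km, or None if the record has no coordinates."""
    if meeting["latitude"] is None or meeting["longitude"] is None:
        return None  # only the free-text location is available
    return distance_km(user_lat, user_lon, meeting["latitude"], meeting["longitude"])
```

A record with coordinates supports a reliable distance calculation. A record without them leaves a program with only the free-text location, which a person can read but a machine cannot place on a map with any certainty.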
Information can often be expressed in several data formats, and the choice has both technical and legal ramifications. For instance, with documents one has a choice between Microsoft Word .doc format, Adobe Portable Document Format (PDF), and the less-commonly known OpenDocument Format (ODF), among others. Word documents are easily modified and revised, while PDF documents are seen as more suitable for printing: these are some of the technical ramifications of the choice of document format. We elaborate on some of the ramifications in this section.
A primary technical concern in the choice of data format is obsolescence, which may prevent archival access to the information. Reliance on a single vendor's tools and formats increases the likelihood of obsolescence. Though all formats are at risk, some formats are more susceptible than others. The Cornell University Library's Digital Preservation Management guide notes that "proprietary, closed specifications" are the most at-risk for obsolescence. These formats are at risk because they are tied to a single product or company. The ability of files in these data formats to be used decades later hinges on the format's product or company still existing at that time, a risky bet given the rate at which technology changes. (Proprietary but open specifications reach a middle ground. These data formats can often be used with several software programs, rather than being tied to a single vendor, but remain at risk because a single vendor can "close" later revisions of the standard.)
The least at-risk formats are non-proprietary, open formats. These tend to promote a wide range of uses, backward and forward compatibility, and an independence from short-term commercial interests. "Open" here means that freely available, public documentation of the standard exists and that the standard may be legally re-implemented by archivists to access otherwise lost records. These data formats are generally implemented in open source software, software that can be freely distributed and modified. Before 2008, the Microsoft Word .doc format was a closed proprietary format. Because of consumer pressure, today both the Microsoft Word .doc format and the Adobe PDF format are proprietary but open formats. The OpenDocument Format (ODF) is a non-proprietary, open document format implemented in the open-source program OpenOffice.
Technical choices can also create unintended usage limitations. Data formats for the same type of information are often meant for different purposes, so it is important to choose data formats that support the intended uses of the information. In some cases, the format of disclosed data can prevent the data from being made useful to the public. For instance, because Adobe PDF documents are intended primarily for printing, the format trades off the ability of third parties to readily extract information from a PDF, such as with copy and paste, against their ability to print it as it was intended to appear. When supporting the public's ability to cite and extract information, Adobe PDF may not be the most appropriate format. The use of "cloud computing", i.e. third-party software services such as YouTube, poses a similar problem. While these services often provide free public access to resources for viewing, they generally do not allow for resources to be downloaded and reused by the public. The particular usage limitations that may arise vary from data format to data format.
These problems are exacerbated by the need to obtain software licenses in order to open media in some proprietary formats. Formats encumbered by licensing, royalty, and patent restrictions present additional problems. If archivists and the public must pay licensing fees to access public data, the data is no longer public in any practical sense. Though these issues may arise for any proprietary format, they have traditionally arisen most often with audio/visual media formats such as the MPEG formats. Each device that plays MPEG format audio or video (e.g. a music player or computer) and each transmission of an MPEG format title must be licensed by the patent holder. Though consumers are generally unaware of the royalties involved in the devices they purchase because they have been paid for by the manufacturer, these royalties make many uses of MPEG media prohibitively expensive. License terms are also subject to change, and often do change. Unfortunately, the companies that own the most popular operating systems, Microsoft and Apple, have used their position to promote patent-encumbered media formats. As a result, the patent-encumbered formats are also the most convenient for consumers. Nevertheless, this unfortunate market situation does not alleviate governments' responsibilities to promote the dissemination of information that does not require anyone's permission to access. The Ogg Vorbis and Ogg Theora media formats are, today, the primary non-patent-encumbered media formats; they are open and non-proprietary formats as well.
In summary, it is important that data is disclosed in formats that:
In our world of limited resources, an incremental plan for achieving best practices is needed. We recommend the following order of priorities:
From a policy side, there are two directions a government agency can take to ensure or encourage openness and compliance with technological best practices:
The most obvious but often impractical direction is to legislate openness and technological best practices by enumerating what information should be covered or establishing a presumption of openness, setting how it should be made available, and perhaps specifying a means of public redress to ensure compliance through the legal system.
The Association of Government Accountants surveyed CFOs, executives, and oversight leaders in U.S. and Canadian federal, state, and local government agencies for their Recovery and the Transparency Initiative 2009 publication. From the survey they developed eight principles of financial transparency, perhaps the first definition of transparency developed from the thoughts of government executives. Two of their principles support the approach of a presumption of openness:
"3. Be as open as possible without creating risk. The default setting for disclosure is anything that does not violate security or the law."
"5. Do not just react to requests—active outreach is important."
The Office of Management and Budget used the presumption of openness in Circular A-130, which set guidelines for executive branch agencies' dissemination of digital information, and in its Policies for Federal Agency Public Websites, which set requirements for all executive branch agencies, including:
"Provide all data in an open, industry standard format permitting users to aggregate, disaggregate, or otherwise manipulate and analyze the data to meet their needs." (#5D)
The second direction is to encourage openness and technological best practices through periodic review. A review requires fewer policy changes but, especially when conducted by an independent agency, adds pressure to pursue technological advancement. One recommendation for review, based on Robinson et al. 2009, is given in the USA's Webcontent.gov best practices ("Provide Appropriate Access to Data"):
“One benchmark for determining whether data is made sufficiently available is whether the public has all of the data needed to replicate any searching, sorting, and display functionality provided on the agency's own website.”
The recommendations in this section address how to make public government data open, starting not with what should be open but what it means for data to be open. The scope of these recommendations is public and digital government information, meaning reports, audio/visual media, databases, transcripts, and other government records and products in digital form that have no privacy, security, copyright, or other restriction on distribution governed by law. Government records that are not public, by law, are simply out of the scope of these recommendations.
Defining open government data is not an easy task and many have tried their hand at this and similar definitions. The Open Knowledge Foundation's Open Knowledge Definition (OKD) at opendefinition.org sets up a very minimal notion of openness for information (not necessarily government information). It requires access at no more than the marginal cost of reproduction and requires the data be able to be redistributed and reused. However, the OKD sets a standard far too low for government data by permitting other license terms that may restrict use of government data. The American Library Association (ALA)'s Key Principles of Open Government broadly addresses the government's most basic role in sharing information with the public. They note the importance of properly funding information dissemination and the role of depository library programs.
No one set of principles seems to have completely captured the spirit of open government data, so we have elected to form a synthesis here of ideas found elsewhere. The first principle we cite for open government data is adapted from Sunlight Foundation's Principles for Transparency in Government and the OKD's "access" requirement:
1. Information is not meaningfully public if it is not available on the Internet for free. Today, the Web is the first place many people turn for information, and they expect to find government information there. If information can be obtained only by request through postal mail, or, worse yet, in person, the information is essentially unavailable to the vast majority of citizens. Likewise, any fee for access greatly limits the availability of the information.
What constitutes an appropriate fee for reuse of government information varies from culture to culture, and we admit that this principle may be biased toward U.S. culture. In the United States, expectations are particularly high for the government. Fees beyond the marginal cost of reproducing a document are viewed with suspicion, as if the fee were designed to impinge on the public's ability to oversee its government. Fortunately there is essentially no marginal cost of online distribution for most government records, and so in the United States "public" means "online" and "free". But this varies from country to country. The EU PSI Directive sets a much lower standard: "(14) Where charges are made, the total income should not exceed the total costs of collecting, producing, reproducing and disseminating documents, together with a reasonable return on investment" (emphasis mine), although the directive does go on to recommend the marginal cost.
We continue our recommendations with those published by the Open Government Working Group (opengovdata.org), convened November 2007. Their Eight Principles of Open Government Data specify a working definition or check-list for what it means for public government data to be open. We list their recommendations 2-8 with elaboration (their "Complete" principle is omitted here). Data should be:
2. "Primary: Primary data is data as collected at the source, with the finest possible level of granularity, not in aggregate or modified forms." One often finds that the only open access to audio, video, and images is at low resolutions for the purpose of making them suitable for viewing on the Internet. While this is an important use case, publishers of open data have an obligation to make the full-resolution information available in bulk, possibly in addition to a low-resolution format, to support additional applications such as the creation of professional media and archiving. Likewise, numeric or tabular data should not be aggressively aggregated for one particular application at the cost of throwing away other public information that could be used.
3. "Timely: Data are made available as quickly as necessary to preserve the value of the data." In other words, data is not open if it is only shared after it is too late for it to be useful to the public. What is a reasonable level of timeliness depends on the nature of the data set. As we noted in the priorities above, data relevant to an ongoing policy debate require higher standards. Timeliness is not just that data is available, but that data users can find updates quickly. Syndication feeds can help notify users of new content, and data should explicitly include a list of recent changes to the format and content.
4. "Accessible: Data are available to the widest range of users for the widest range of purposes." Data must be made available in formats that support both intended and unintended uses of the data. Data must be published with current industry standard protocols and formats when available to ensure that unintended uses are not restricted, as well as alternative protocols and formats when necessary to promote wide availability for more narrow use cases. As we noted above, the choice of data format has wide implications for what applications can be built on top of the data, what usage restrictions may result from data format patents, and whether archived data is likely to be usable in the future. If the data is accessible through an interactive interface, it must be possible to download the complete data set in raw form through an automated process, i.e. a bulk data download. Data that is public should be provided with sufficient documentation so that the public understands the nature of the data. This principle is also related to the OKD's "access" and "absence of technological restriction" requirements.
5. "Machine processable: Data are reasonably structured to allow automated processing." In some cases, the format of disclosure can prevent the data from being made useful to the public. When choosing a data format, it is important to understand the technical consequences of the choice. Free-form text is not a substitute for tabular and normalized records. Images of text are not a substitute for the text itself. When publishing documents, it is important to avoid scanned images of printed documents which are not machine processable. Adobe PDF is a common data format for documents but may not enable a satisfactory level of machine processability if the documents contain systematic data that might be subject to analysis, or if the PDF contains only scanned images of documents.
For government records, use XML. XML is an industry standard generic data format. If not XML, then publish a raw database dump in the most convenient format. Text files are simple to consume, parse, and analyze. For scheduling information, use feed formats such as RSS, Atom, and iCal. For audio/video media, use a combination of widely deployed formats at low resolution suitable for streaming (e.g. Flash) and high-resolution open formats (e.g. Ogg-based formats).
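As a rough illustration of what "machine processable" can mean in practice, the following Python sketch serializes the same small set of records as both XML and CSV text using only the standard library. The records, field names, and element names are hypothetical and chosen only for illustration; a real agency would follow whatever schema suits its data.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical records; the field names are illustrative only.
RECORDS = [
    {"id": "1", "title": "Budget hearing", "date": "2009-05-19"},
    {"id": "2", "title": "Oversight report", "date": "2009-07-20"},
]


def records_to_xml(records):
    """Serialize a list of flat dicts as a simple XML document."""
    root = ET.Element("records")
    for rec in records:
        # The record id becomes an attribute; other fields become child elements.
        item = ET.SubElement(root, "record", id=rec["id"])
        for key, value in rec.items():
            if key == "id":
                continue
            ET.SubElement(item, key).text = value
    return ET.tostring(root, encoding="unicode")


def records_to_csv(records):
    """Serialize the same records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


xml_text = records_to_xml(RECORDS)
csv_text = records_to_csv(RECORDS)
```

Either output can be read back by a program without any guessing about layout, which is the property that free-form text and scanned images lack.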
6. "Non-discriminatory: Data are available to anyone, with no requirement of registration." Anonymous access to the data must be allowed for public data. This principle is also related to the OKD's "no discrimination" requirements.
7. "Non-proprietary: Data are available in a format over which no entity has exclusive control." Proprietary formats add unnecessary restrictions over who can use the data, how it can be used and shared, and whether the data will be usable in the future. Use of proprietary formats may also constitute a form of endorsement that may create a conflict of interest. While some proprietary formats are nearly ubiquitous, it is nevertheless not acceptable to use only proprietary formats. On the other hand, the relevant non-proprietary formats may not reach a wide audience. In these cases, it may be necessary to make the data available in multiple formats.
8. "License-free." Dissemination of the data is not limited by intellectual property law such as copyright, patents, or trademarks, contractual terms, or other arbitrary restrictions. While privacy and security restrictions as governed by existing law may reasonably limit the dissemination of some government data, any restrictions should be the exception rather than the rule. Every effort should be made to make non-restricted portions of otherwise restricted documents available under these principles. This principle is a stronger version of the OKD's "redistribution" and "reuse" requirements.
As with what constitutes appropriate fees, appropriate license terms vary from culture to culture. In the United States, the ideal of "free speech" places a considerable restriction on the government to not use the law to prevent the dissemination of information, especially information related to the government. For instance, by and large government-produced documents are exempt from copyright restrictions. This principle, too, may be biased toward U.S. culture. The EU PSI Directive notes that licenses for government data may consider "liability, the proper use of documents, guaranteeing non-alteration and the acknowledgment of source". Certainly the fewer restrictions on reuse, the more open a document is. In countries that impose a government copyright (such as crown copyright), open government data should be explicitly dedicated to the public domain (see Mayo & Steinberg 2007, p. 33).
To these eight principles we add a ninth that is nearly implicit but must be stated:
9. Permanence: Data should be made available at a stable Internet location indefinitely. Providing documents with permanent web addresses helps citizens share documents with others by allowing them to point others directly to the document, rather than having to provide instructions on how to find it, or distributing the document separately themselves. A common format for permalinks to documents, which is used at most newspaper websites, is "www.youragency.com/year/month/day/name.doc". Web addresses of this form give a clue about the date and nature of the document, which helps users verify that they have the right link. The League of Technical Voters proposes that web addresses be used to help distinguish document versions by having a different but related web address for each document version, as well as help identify paragraphs within documents (see http://citability.org).
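The date-based permalink pattern described here is simple to generate mechanically. The Python sketch below is one possible way to do it; the base address comes from the example in the text, while the function name and the optional version suffix are assumptions made for illustration and are not taken from citability.org or any agency's practice.

```python
from datetime import date

# Hypothetical base address, following the example in the text.
BASE = "http://www.youragency.com"


def permalink(published, name, version=None):
    """Build a year/month/day permalink, optionally with a version suffix."""
    path = f"{published.year:04d}/{published.month:02d}/{published.day:02d}/{name}"
    if version is not None:
        # The "/vN" suffix is an assumption of this sketch, not an established rule.
        path = f"{path}/v{version}"
    return f"{BASE}/{path}"


first = permalink(date(2009, 5, 19), "name.doc")
revised = permalink(date(2009, 5, 19), "name.doc", version=2)
```

Because the address is derived only from the publication date and the document name, it stays the same no matter how the rest of the website is reorganized.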
We also add several more recommendations from the Association for Computing Machinery's Recommendation on Open Government (February 2009):
10. Promote analysis: "Data published by the government should be in formats and approaches that promote analysis and reuse of that data." Although we have discussed this throughout, it is worth emphasizing that the most critical value of open government data comes from the public's ability to carry out its own analyses of raw data, rather than relying on a government's own analysis.
11. Safe file formats: "Government bodies publishing data online should always seek to publish using data formats that do not include executable content." Executable content within documents poses a security risk to users of the data.
12. Provenance and trust: "Published content should be digitally signed or include attestation of publication/creation date, authenticity, and integrity." Digital signatures help the public validate the source of the data they find.
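A full digital signature requires public-key cryptography, which is outside the Python standard library. As a weaker stand-in, the sketch below shows only the integrity half of the idea: publishing a SHA-256 digest for each file so that users can detect whether a copy has been altered. The file names and contents are hypothetical, and a digest manifest proves nothing about who published the data unless the manifest itself is signed or obtained over a trusted channel.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(files):
    """Map each file name to the SHA-256 digest of its contents."""
    return {name: sha256_hex(content) for name, content in files.items()}


def verify(name, content, manifest):
    """Check a downloaded file against the published manifest."""
    expected = manifest.get(name)
    if expected is None:
        return False
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected, sha256_hex(content))


# Hypothetical published files.
published = {"report.txt": b"Quarterly spending report"}
manifest = build_manifest(published)
```

A user who downloads report.txt from a mirror can call verify with the bytes received and the agency's manifest to confirm the copy matches what was published.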
When applying these principles to media such as documents and audio/visual recordings, one must consider the dual role of media and records: On the one hand, as part of the agency's website they are a component of the agency's communications with the public. Because of this, web media must be available in formats suitable for display in a web browser and should be easily locatable through search. But web media is often also a government record. Reports, for instance, are of interest not just to visitors to the agency's website but also to journalists and technologists who may want to analyze the information in the reports in ways not supported by the agency's website. In this role the web media must also be available in other formats suitable for purposes beyond viewing on the web.
The above 12 principles essentially define open government data in its ideal form, but more can be said about the process of opening up government data. How should government agencies decide what to open and how to do it?
13. Public input: The public is in the best position to determine what information technologies will be best suited for the applications the public intends to create for itself. Public input is therefore crucial to disseminating information in such a way that it has value. As the Association of Government Accountants' principles state, "Understand the information that people want, and deliver it. They may not be sure what they need, so help them define it."
14. Public review: The Association of Government Accountants' principles also note that not only should the data itself be open, but the process of creating the data should also be transparent: "Have a process for ensuring that data you disclose are accurate and reliable, and show that process to users."
15. Interagency coordination: Interoperability makes data more valuable by making it easier to derive new uses from combinations of data. To the extent two data sets refer to the same kinds of things, the creators of the data sets should strive to make them interoperable. This may mean developing a shared data standard, or adopting an existing standard, possibly through coordination within government across agencies. The use of open data formats often, but not always, entails interoperability. However, we recognize that interoperability can come at a cost. Governments must weigh the advantages of distributing non-interoperable data quickly against the net gain of investing in interoperability and delaying a release of the data.
16. Technological choices can be a type of endorsement. Endorsements of technology created or controlled by the private sector can create a conflict of interest when regulating that sector, and create an incentive for endorsed corporations to be involved in policymaking. Other things being equal, technological choices that essentially endorse a single profit-making entity should be avoided.
Also see Mayo & Steinberg 2007 for specific recommendations for the United Kingdom.
It is hard to take these recommendations down to a more concrete level. What formats to use or what is an acceptable level of timeliness will vary from case to case. Others, however, have published recommendations for how to publish web pages and more specific notes for databases. These recommendations address more narrow concerns and are no substitute for applying the principles listed above, but I list them here for the sake of completeness.
Google has made several recommendations from the point of view of web search (Marsh 2009). The ability for the public to find government information is a crucial part of government information being open. Their first recommendation is to use the Sitemaps protocol, which helps search engines crawl websites more deeply and efficiently. Their second recommendation is to review whether search engines are blocked from parts of the agency's website by a robots.txt file, which describes the agency's policy regarding automated access to its website. A robots.txt file should be used sparingly so as not to limit the public's ability to gather data from the agency or gather data about the agency. As noted by Webcontent.gov, restricting access with a robots.txt file may be contrary to an Office of Management and Budget memorandum in the United States.
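One way an agency might carry out the robots.txt review is to test a list of important paths against its own rules. The Python sketch below uses the standard library's urllib.robotparser for this. The robots.txt content, site address, and paths are hypothetical, and the standard library's matching rules may not be identical to those of any particular search engine, so the result is an approximation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an agency website.
ROBOTS_TXT = """\
User-agent: *
Disallow: /data/
"""


def blocked_paths(robots_txt, base_url, paths, user_agent="*"):
    """Return the paths that the given robots.txt rules forbid crawlers to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, base_url + p)]


blocked = blocked_paths(
    ROBOTS_TXT,
    "http://www.youragency.gov",
    ["/data/bulk.xml", "/reports/2009.html"],
)
```

In this hypothetical case the bulk data file is hidden from crawlers while the report page is not, which is the kind of unintended restriction the review is meant to catch.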
Permanent web addresses are part of a larger picture of using globally unique identifiers (GUIDs). This concept, important on the World Wide Web, is that any document, resource, data record, or entity mentioned in a database, or some might say every paragraph in a document, should have a unique identification that others can use to point to or cite it elsewhere. A web address is a globally unique identifier. Any web address, such as the address of this page "http://razor.occams.info/pubdocs/opendataciviccapital.html", refers to this document and nothing else, and this reliability promotes the dissemination of the document as it provides a means to refer to and direct people to it. GUIDs that persist across database versions allow users of the database to process the changes more easily. If two datasets use a common set of GUIDs to refer to entities, such as campaign donors, then the value of the two datasets becomes more than just the sum of their parts. The connections between the databases add great value to how they can be used. An easy (and accepted) way to choose GUIDs is to piggy-back off of your agency's web domain, which provides a space of IDs for you to choose from that won't clash with anyone else's IDs. For instance, you may coin verbose GUIDs for entities such as "www.youragency.gov/guids/john_smith", rather than a simple, opaque, and non-globally-unique numeric ID "12345".
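To make the point about shared GUIDs concrete, the following Python sketch combines two small datasets that refer to the same people by the same identifiers. Both datasets, their field names, and the GUIDs are hypothetical; the second dataset (lobbying registrations) is an assumption added for illustration and is not mentioned in the text.

```python
# Hypothetical campaign donation records keyed by donor GUID.
DONATIONS = [
    {"donor": "www.youragency.gov/guids/john_smith", "amount": 500},
    {"donor": "www.youragency.gov/guids/jane_doe", "amount": 250},
    {"donor": "www.youragency.gov/guids/john_smith", "amount": 100},
]

# Hypothetical second dataset that happens to use the same GUIDs.
LOBBYING = [
    {"registrant": "www.youragency.gov/guids/john_smith", "client": "Example Corp"},
]


def link_by_guid(donations, lobbying):
    """Total donations per GUID and attach lobbying clients for the same GUID."""
    combined = {}
    for row in donations:
        entry = combined.setdefault(row["donor"], {"total": 0, "clients": []})
        entry["total"] += row["amount"]
    for row in lobbying:
        entry = combined.setdefault(row["registrant"], {"total": 0, "clients": []})
        entry["clients"].append(row["client"])
    return combined


linked = link_by_guid(DONATIONS, LOBBYING)
```

No name matching or manual reconciliation is needed: because both datasets use the same identifier for the same person, the join is a plain dictionary lookup. With opaque local IDs such as "12345" the two datasets could not be connected this way.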
The use of GUIDs in the form of web addresses is a part of a technological movement called Linked Open Data (LOD, see linkeddata.org). Promoted by the creator of the World Wide Web, Tim Berners-Lee, the LOD method for publishing databases achieves data openness in a standard format and the potential for interconnectivity with other databases without the expense of wide agreement on unified inter-agency or global data standards. LOD is a practical implementation of Semantic Web ideas, and several tools exist to expose legacy databases and spreadsheets in the LOD method. I have promoted the use of LOD for government data (see my rdfabout.com and Tauberer (2009)), as has more recently Berners-Lee (2009).
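For readers unfamiliar with what LOD data looks like, the Python sketch below formats a few statements in N-Triples, one of the plain-text serializations of RDF. It is only a hand-rolled illustration, not one of the tools mentioned above. The subject addresses are hypothetical (the GUID from the earlier example with "http://" added, and an invented address at other.example.org); the FOAF "name" and OWL "sameAs" predicates are existing vocabulary terms.

```python
def ntriple(subject, predicate, obj, literal=False):
    """Format one N-Triples statement; obj is a web address unless literal=True."""
    if literal:
        # Escape the characters that N-Triples requires escaping in string literals.
        escaped = (
            obj.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        obj_term = f'"{escaped}"'
    else:
        obj_term = f"<{obj}>"
    return f"<{subject}> <{predicate}> {obj_term} ."


# Hypothetical GUID for a person in an agency database.
PERSON = "http://www.youragency.gov/guids/john_smith"

statements = [
    ntriple(PERSON, "http://xmlns.com/foaf/0.1/name", "John Smith", literal=True),
    # Hypothetical link to the same person in another publisher's database.
    ntriple(PERSON, "http://www.w3.org/2002/07/owl#sameAs",
            "http://other.example.org/people/4821"),
]
```

The second statement is where the interconnectivity comes from: it ties one publisher's identifier to another's without either party having to adopt a shared schema first.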
The W3C working draft Publishing Open Government Data (Bennett and Harvey 2009) provides a similar summary of these practical best practices.
Several other issues remain that are unique to the government's role in information technology:
Open government data is a valuable public resource for its ability to fuel innovation in areas far beyond the mandate or resources of government. Several examples were listed above that benefit public health, safety, business and the economy, and especially civic engagement, transparency, accountability, public trust, and digital inclusion. These benefits come from the ability for computers to sort, search, and transform data into new purposes that can't often be predicted before they are discovered.
But making government data open government data is no simple task. With the specific recommendations above for what makes government data open, plus a list of priorities and associated issues, it is clearly not possible to satisfy every constraint. Governments must weigh the advantages of open formats against the cost of providing data in multiple formats, and must weigh the concern of endorsement against popular demand and the state of a technological industry. The development of interoperability standards often comes at the cost of timeliness in the dissemination of the information. No government could make all data open in all of these respects.
Nevertheless, it is hoped that these principles are taken as a guide, offering a direction for higher standards at the intersection of technology, transparency, and government-fueled innovation.
American Library Association. Key Principles of Open Government. Accessed April 4, 2009.
American Lung Association. State of the Air: 2009. 2009.
Association for Computing Machinery. Recommendation on Open Government. February 2009.
Association of Government Accountants. Recovery and the Transparency Initiative (Annual CFO Survey). July 2009.
Bennett, Daniel and Adam Harvey. Publishing Open Government Data (W3C Working Draft). September 8, 2009.
Berners-Lee, Tim. Putting Government Data Online. June 24, 2009.
Davies, Tim. Open data, democracy and public sector reform: A look at open government data use from data.gov.uk. August 2010.
Department of Broadband, Communications, and the Digital Economy, Australian government. Australia's Digital Economy: Future Directions. July 14, 2009.
Congressional Research Service. Access to Government Information In the United States. August 31, 2009. Report 97-71.
Cornell University Library. Digital Preservation Management. Accessed April 4, 2009.
Froomkin, Dan. Was lack of transparency a factor in mine deaths? April 14, 2010.
Google, Inc. "Your world, your map," in Google Lat Long Blog. October 7, 2009.
Gorman, Sean. "Information as a Public Good" in Intergovernmental Solutions Newsletter: Transparency and Open Government (Spring 2009), p1. 2009. GSA Office of Citizen Services and Communications, U.S.A. government.
Gurstein, Michael. Open data: Empowering the Empowered or Effective Data Use for Everyone? September 2, 2010.
League of Technical Voters. Citability.org. Accessed June 24, 2009.
Lovley, Erika. The government has a database for most everything. Politico. June 24, 2009.
Marsh, Jennifer. Our recommendations for increasing citizens' access to government information. Google Public Policy Blog. June 22, 2009.
Mayo, Ed and Tom Steinberg. The Power of Information: An independent review. June 2007.
Office of Management and Budget, U.S.A. government. Circular A-130. February 8, 1996.
Office of Management and Budget, U.S.A. government. Policies for Federal Agency Public Websites (M-05-04). December 17, 2004.
Office of Science and Technology Policy, U.S.A. government. Harnessing the Power of Digital Data for Science and Society: Report of the Interagency Working Group on Digital Data to the Committee on Science of the National Science and Technology Council. January 2009.
Open Government Working Group convened in November 2007 in Sebastopol, California, USA. Eight Principles of Open Government Data.
Open Knowledge Foundation. Open Knowledge Definition. 2006.
Robinson, David G., Yu, Harlan, Zeller, William P. and Felten, Edward W. Government Data and the Invisible Hand. 2009. Yale Journal of Law & Technology, Vol. 11, p. 160.
Securities and Exchange Commission, U.S.A. government. http://www.sec.gov, accessed April 4, 2009.
Stephenson, W. David. "Unfettered Access to Data Can Transform Government" in Intergovernmental Solutions Newsletter: Transparency and Open Government (Spring 2009), p6. 2009. GSA Office of Citizen Services and Communications, U.S.A. government.
Sunlight Foundation. Principles for Transparency in Government. February 2009.
Tauberer, Joshua. Building a Civic Semantic Web. August 2009. Nodalities.
Uhlir, Paul F. The socioeconomic effects of public sector information on digital networks: Workshop Summary. U.S. National Committee for CODATA Board on Research Data and Information Policy and Global Affairs. 2009.
Washington, DC Office of the Chief Technology Officer. Washington, DC Data Catalog.
Webcontent.gov, U.S.A. government. Provide Appropriate Access to Data, accessed April 4, 2009.
Webcontent.gov, U.S.A. government. Search Engine Indexing and Robots.txt Files, accessed June 25, 2009.
Zetter, Kim. An API for federal legislation? Congress wants your opinion. Wired.com Thread Level. March 5, 2009.
2009-05-19. Initial Draft.
2009-07-20. Added references to . . .
And other changes: