Open Government Data Maturity Model

Some draft ideas. First posted on March 19, 2013, and tweaked a lot since then.

This document replaces the prioritization principle in my book. I believe there is strong consensus in the #opengovdata community on the vertical axis, so I am specifically seeking feedback on the horizontal axis.


Governments should proceed from top-left to bottom-right when prioritizing open government data programs. Roughly. There’s some explanatory text below the chart. Scroll down!

The Maturity Model

Law: Reading the laws that govern us, including statutes, regulations, and case law.
Services: Improving the citizen experience and government service delivery.
Structure: Understanding how government works and how it is structured.
Operations: Understanding how decisions were made and how money is being spent.
Public Data: Leveraging data citizens already paid to collect as a public good.

FOI
Establishes a legal right.
Freedom of Information laws create a legal right to government documents and a presumption of openness, but FOI is an antiquated and reactive system.

Online & Accessible
Lives on the web. Findable. Accessible.
Information is not meaningfully public unless it is 1) available on the Internet, 2) free (i.e. no charge), and 3) findable in a reasonable way. Information dissemination should be pro-active, not reactive as with basic FOI.
Adapted from Sunlight Foundation’s Principles for Transparency in Government and the “access” requirements of the 8 Principles of Open Government Data, written by the Open Government Working Group convened by Carl Malamud in November 2007, and the Open Knowledge Foundation’s Open Knowledge Definition (OKD) at opendefinition.org.
Law: Put the law online.
Services: Use the Internet for service delivery.
Structure: List Members of Parliament and agency heads.
Operations: Post meeting notices before they happen, and post minutes after.
Public Data: Share research results with the public.

Open
Meets 6 Principles/OKD.
Complete. Primary. Timely. Non-discriminatory. Non-proprietary. License-free.
For explanation, see my book, the 8 Principles of Open Government Data, and opendefinition.org.
Law: Law cannot be monopolized through copyright.
Services: Post information for everyone to register to vote.
Structure: Create a government organization chart.
Operations: Voting records and spending data are posted in a timely manner.
Public Data: The public’s data should not require proprietary tools.

Structured Data
Analyzable. Processable.
Data is structured to allow automated processing.
For explanation, see my book and the machine-processable principle of the 8 Principles of Open Government Data.
Law: Syntactic markup for resolving citations. Digital signatures.
Services: Capture the citizen experience in markup. Turn registration requirements into XML.
Structure: Database of MPs. Structured markup of relations between government bodies.
Operations: Voting records in XML, upcoming meetings in RSS, a spending database.
Public Data: Data files for census, public health, and transportation (for example).

Global IDs
Uses permanent URLs and URIs to name things.
Place documents at permanent URLs and assign globally-unique identifiers in the form of URIs to...
See my book for more explanation on permanence and identifiers.
Law: Titles, chapters, sections, paragraphs.
Services: Government services and processes.
Structure: Members of Parliament and government agencies.
Operations: Contractors.
Public Data: Geographic locations, transit stops.

APIs
Provides random or write access.
Create web/REST-based dynamic access points for large datasets that answer real-world questions.
Law: What is the text of 17 USC 1201?
Services: This is Government as a Platform. Let third-parties intermediate.
Structure: How do I email my representative?
Operations: What spending is near me?
Public Data: What is the weather forecast?

Linked Data
Lives on the semantic web. Uses SPARQL.
The semantic web turns the web into an interconnected database, revealing new insights that cross-cut information silos.
See LinkedData.org and Tim Berners-Lee’s 5 stars to open data, which provided some of the inspiration for the vertical axis of this chart.
Law: Semantic markup of what law means.
Services: Re-use API protocols.
Structure: Link agency organization to spending data.
Operations: Link spending data to corporate ownership records.
Public Data: Re-use existing schemas.

Explanation

As in any field, a shared vocabulary is crucial for intellectual development. Past explanations of open government data (OGD), including my own, often focused on listing core concepts and definitions (e.g. the 8 Principles, the Open Knowledge Definition). Between 2007 and 2009 there was a need for a common definition of open data. Now, in 2013, we need a common vocabulary not for what open data is but for how to implement it.

Map versus Maturity Model

The chart above could be interpreted as a map of the OGD field. Down the rows on the left side are the different technological strategies of OGD: freedom of information, using the Internet, principles of openness, structured data, global IDs, APIs, and linked data (the semantic web). Across the columns at the top are the different sorts of public information governments produce: laws, service-related data, data about the structure of government, operational data such as rulemaking dockets and spending records, and finally a catch-all column for other public data (sometimes produced incidentally to government functions).

There are distinct communities in most of the 35 cells in the table above (seven rows by five columns). Historically, starting around the 1950s, the open government movement was what we call today the right-to-know or the freedom-of-information movement. It was based on the idea, promoted by journalists, that there is a legal right to information held by the government. And it was a disruptive change in 1966 when FOIA (that’s the Freedom of Information Act) was enacted. It is a cornerstone of the modern OGD movement, and it covers the first row of the table. Modern government accountability focuses on the application of structured data to government operational processes (for my work, the legislative process), which is one cell toward the center of the table. There is a government linked data community. Some in the modern free-law movement want global IDs applied to sections of the law, as a framework for citations. Practitioners who work in one cell may have significantly different goals from those working in another cell, and a map of the landscape of OGD can help us navigate our community. (I won’t get into the goals in all 35 cells here. The maturity model takes for granted that the goals in each cell are worthwhile.)

As a map, this chart expands on previous work by others in mapping the data and people in the OGD community. Yu and Robinson (2012) proposed a map with a horizontal axis from service delivery to public accountability and a vertical axis from inert data to adaptable data. Their horizontal axis appears in the columns above disguised as “Services” and, for accountability, “Structure” and “Operations”. But I have added to that axis new columns on either side. And I have, in a sense, divided their vertical axis into discrete technologies.

But it is not my intention for the chart to be a map. Rather, the rows and the columns in this chart have an order. In a world of limited resources, which is the world we live in, some OGD projects should take precedence over others. Rows above should come before rows below. Columns to the left should come before columns to the right. (At least, roughly.) That makes this a maturity model in the sense that it outlines what proper growth looks like for government programs that implement OGD. Proper growth starts with freedom of information for laws and ends (if it ends) with public data on the semantic web. Don’t run before you can walk.
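
The ordering rule in this paragraph can be read as a partial order on the cells of the chart. The Python sketch below is one hedged interpretation, not a definitive algorithm: the row and column names come from the chart, while the function names and the tie-breaking rule (distance from the top-left corner) are assumptions made for illustration. Many other linear orders are equally consistent with "top-left to bottom-right, roughly".

```python
from itertools import product

# Row and column names as they appear in the chart, in order.
ROWS = ["FOI", "Online & Accessible", "Open", "Structured Data",
        "Global IDs", "APIs", "Linked Data"]
COLUMNS = ["Law", "Services", "Structure", "Operations", "Public Data"]


def position(cell):
    """Return the (row index, column index) of a (row, column) cell."""
    row, column = cell
    return ROWS.index(row), COLUMNS.index(column)


def should_precede(a, b):
    """True if cell a is no lower and no further right than cell b."""
    (ra, ca), (rb, cb) = position(a), position(b)
    return a != b and ra <= rb and ca <= cb


def one_possible_order():
    """One linear order consistent with should_precede.

    Assumption: ties are broken by distance from the top-left corner,
    then by row. The text itself only says "roughly".
    """
    cells = product(ROWS, COLUMNS)
    return sorted(cells, key=lambda c: (sum(position(c)), position(c)[0]))


order = one_possible_order()
print(order[0])   # the top-left cell
print(order[-1])  # the bottom-right cell
```

Note that `should_precede` leaves many pairs unordered (for example, APIs for law versus FOI for public data), which is where the "roughly" comes in.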

The Vertical Axis: Technologies

The rows of the maturity model are ordered according to their technological complexity. Each successive technology makes data more adaptable, in the sense of Yu and Robinson (2012). But the choice of technological complexity as the order is not based on desired outputs but instead on the fact that many of the rows cannot be accomplished without the previous rows having been completed first. That is, while the maturity model is intended to be normative, the ordering of the rows follows partially from logical necessity.

The order is also not from cheap to expensive. In fact, it may be just the opposite: technology helps us reduce costs in the long term. For instance, the total cost of all FOIA-related activities across the federal government in FY 2008 was $338 million, mostly for the 3,691 full-time-equivalent staff processing FOIA requests. FOI, the first row in the maturity model, does not come cheap. In the maturity model, it is the starting point: no technology at all.

The legal right of FOI creates a presumption of openness, but, as you know if you’re familiar with FOIA in the United States, the right is not pro-active, it’s reactive. If there’s data you want, and you can figure out which agency has it, you can petition for that information. And if you’re lucky, the agency won’t object and claim one of the exemptions, if you’re lucky the agency won’t make you pay much to have the data retrieved and copied, and if you’re lucky you’ll get it in about a year. Almost 50 years after FOIA was enacted, it’s pretty obvious we can do a lot better. The rows below build technology on top of the principle of freedom of information.

The first row after FOI is “Online and Accessible”. This principle says that while FOI provides a mechanism for making information public, information is not meaningfully public until it can be found on the Internet. Uploading is the first technological step in the maturity model.

Eight principles determine whether data can be considered “open”. The maturity model lumps accessibility principles in the previous row and leaves machine processability for a later row. The “Open” row refers to the remaining principles: open data is complete, primary, timely, non-discriminatory, non-proprietary, and license-free. (For a discussion on the meaning of these principles, see sections 5.1.1 and 5.1.3 of my book.) This row must follow the previous. The choice of file format is moot before data is made available. Only when data is made available can we refine the manner in which the data is shared. This row, as with FOI, is primarily a matter of policy. In this case, technology policies such as timeliness and license restrictions affect the usefulness of data.

The next row is “Structured Data”. This is the first row that is purely technical, and it refers to creating data in such a way as to make it searchable, sortable, transformable, or, to put it generally, machine-processable. Use spreadsheets instead of PDFs, use text instead of scanned images. Use XML. Break down fields into processable components. Most importantly, make the data adaptable. Applying structure to data requires an up-front technical investment but pays off by making the data more valuable. In this row, the open data that is published online is the original spreadsheet, an SQL database dump, or bulk XML data. (For more, see section 5.1.2 of my book.)
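
To make "machine-processable" concrete, here is a minimal Python sketch of what structure buys. The spending records, field names, and agency names are invented for illustration and are not real data. The point is only that once fields are broken into processable components, searching, sorting, and transforming each take a few lines.

```python
import csv
import io
import json

# Hypothetical spending records, as they might appear in a published CSV file.
CSV_TEXT = """agency,recipient,amount_usd,date
Department A,Acme Corp,125000,2013-01-15
Department B,Widget LLC,40000,2013-02-01
Department A,Widget LLC,310000,2013-02-20
"""


def load_records(text):
    """Parse CSV text into dictionaries with a numeric amount field."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        row["amount_usd"] = int(row["amount_usd"])  # a number, not a string
        records.append(row)
    return records


records = load_records(CSV_TEXT)

# Searchable: filter to one agency.
dept_a = [r for r in records if r["agency"] == "Department A"]

# Sortable: largest award first.
largest_first = sorted(records, key=lambda r: r["amount_usd"], reverse=True)

# Transformable: total per recipient, re-published as JSON.
totals = {}
for r in records:
    totals[r["recipient"]] = totals.get(r["recipient"], 0) + r["amount_usd"]

print(json.dumps(totals, sort_keys=True))
```

The same three operations on a scanned image of the table would first require re-keying the data or running it through OCR.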

This is as far as most government agencies have made it on the technology of open government data. The remaining three rows guide future directions.

Global IDs, the next row in the model, is a type of structure that can be added to data. There are two benefits of global IDs. First, an ID makes information findable. For instance, a citation to a paragraph in the law (such as “22 U.S.C. 3301(b)(6)”) is a sort of global ID. The identifier uniquely pinpoints a paragraph in the United States Code. Modern global IDs are URLs (such as http://www.law.cornell.edu/uscode/text/22/3301#b-6) that not only identify but also provide enough information to find the resource on the Internet. (Such IDs are called dereferencable.) Second, when Global IDs are shared across data silos, they create connections and make the data more adaptable. This is especially important for government spending data, where contract awardees might also be campaign contributors. A shared Global ID for each corporation facilitates a connection between these two typically separate databases.
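
Both benefits can be sketched in Python. The first function turns a citation into a URL shaped like the single example above; the URL pattern is inferred from that one example and is an assumption, not a documented scheme. The second shows how a shared ID connects two databases; the records and the `id.example.org` identifiers are hypothetical.

```python
import re

# Matches citations shaped like "22 U.S.C. 3301(b)(6)".
CITATION = re.compile(r"^(\d+) U\.S\.C\. (\d+[a-z]?)((?:\([0-9A-Za-z]+\))*)$")


def citation_to_url(citation):
    """Turn a U.S. Code citation into a URL shaped like the one in the text."""
    match = CITATION.match(citation)
    if match is None:
        raise ValueError("unrecognized citation: " + citation)
    title, section, subsections = match.groups()
    url = "http://www.law.cornell.edu/uscode/text/%s/%s" % (title, section)
    parts = re.findall(r"\(([0-9A-Za-z]+)\)", subsections)
    if parts:
        url += "#" + "-".join(parts)  # e.g. (b)(6) becomes #b-6
    return url


# Hypothetical records from two separate databases that share one ID scheme.
contracts = [
    {"company_id": "https://id.example.org/company/1001", "award_usd": 500000},
    {"company_id": "https://id.example.org/company/1002", "award_usd": 75000},
]
contributions = [
    {"company_id": "https://id.example.org/company/1001", "donation_usd": 2500},
]


def awardees_who_contributed(contracts, contributions):
    """Return the company IDs that appear in both datasets."""
    donors = {c["company_id"] for c in contributions}
    awardees = {c["company_id"] for c in contracts}
    return sorted(awardees & donors)


print(citation_to_url("22 U.S.C. 3301(b)(6)"))
print(awardees_who_contributed(contracts, contributions))
```

Without the shared identifier, the second step would fall back on matching company names as free text, which is far less reliable.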

The combination of structured data and (dereferencable) global IDs is a read-only, web-based API. (The next row in the maturity model is “APIs”.) An API is defined by an agreement between a provider and a consumer about where and how to access a service. A dereferencable ID is nothing other than the where (a web address), and structure (a schema) provides the how. While an API cannot exist without structured data and a URL, APIs often provide much more functionality beyond a simple read of a resource. They often provide live (or on-demand) services such as sorting and filtering lists, joining tables, and transforming outputs into multiple formats. APIs may also provide transactional services (such as voter registration).
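
As a rough illustration of the "where" and the "how", the Python sketch below answers a read-only request with on-demand filtering and sorting. The `/meetings` path, the query parameters, and the meeting records are all hypothetical, and a real service would sit behind an HTTP server rather than a bare function call.

```python
import json
from urllib.parse import urlparse, parse_qs

# Hypothetical dataset behind the API.
MEETINGS = [
    {"id": 1, "body": "Budget Committee", "date": "2013-04-02"},
    {"id": 2, "body": "Transit Board", "date": "2013-03-28"},
    {"id": 3, "body": "Budget Committee", "date": "2013-05-07"},
]


def handle_get(url):
    """Answer a read-only request such as /meetings?body=...&sort=date."""
    parsed = urlparse(url)
    if parsed.path != "/meetings":  # the "where"
        return 404, json.dumps({"error": "not found"})
    query = parse_qs(parsed.query)
    results = MEETINGS
    if "body" in query:  # on-demand filtering
        results = [m for m in results if m["body"] == query["body"][0]]
    sort_field = query.get("sort", ["id"])[0]  # on-demand sorting
    if sort_field not in ("id", "body", "date"):
        return 400, json.dumps({"error": "cannot sort by " + sort_field})
    results = sorted(results, key=lambda m: m[sort_field])
    return 200, json.dumps(results)  # the "how": a fixed JSON structure


status, body = handle_get("/meetings?body=Budget+Committee&sort=date")
print(status, body)
```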

Because APIs are live, it is considerably harder to implement a properly functioning API than it is to implement structured data or global IDs. Structured data can be as simple as a file uploaded once. It is static. APIs are dynamic, are expected to have low response times, and are expected to have “high availability”, which means the service is expected to be running, and running fast, at all times. High availability also makes changing the structure of data more difficult because the API must serve “version 1” and “version 2” API users simultaneously while the “version 1” API goes through a process of deprecation. All of this requires not only technical expertise from multiple sorts of technology professionals (now including systems administrators) and a large up-front cost in building the API, but also ongoing operational spending with no end date.
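
The versioning burden can be sketched as follows. This is a minimal Python illustration under invented assumptions: the record, the two response shapes, and the wording of the deprecation notice are all hypothetical. It shows only that, once the data structure changes, the provider has to keep rendering the old shape for as long as "version 1" users remain.

```python
# Hypothetical record as stored after the "version 2" restructuring.
MEMBER = {"id": 7, "first_name": "Jane", "last_name": "Doe"}


def render_v1(member):
    """Old shape: a single combined name field."""
    return {"id": member["id"],
            "name": member["first_name"] + " " + member["last_name"]}


def render_v2(member):
    """New shape: the name broken into components."""
    return {"id": member["id"],
            "first_name": member["first_name"],
            "last_name": member["last_name"]}


RENDERERS = {"v1": render_v1, "v2": render_v2}
DEPRECATED = {"v1"}


def respond(version, member):
    """Serve whichever version the client asked for; flag deprecated ones."""
    if version not in RENDERERS:
        raise ValueError("unknown API version: " + version)
    notices = []
    if version in DEPRECATED:
        notices.append(version + " is deprecated; please move to v2")
    return {"notices": notices, "data": RENDERERS[version](member)}


print(respond("v1", MEMBER))
print(respond("v2", MEMBER))
```

A static file has no equivalent obligation: the publisher can post a new file in the new structure and leave the old one where it is.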

The final row of the maturity model is “Linked Data”, which is little more than a thorough application of structure, global IDs, and APIs. Beyond this, linked data uses a particular data model called RDF and a particular query language and protocol called SPARQL. Linked data provides a high degree of interconnectedness across data silos, both in the objects mentioned in the data (e.g. government contractors) and in the concepts that relate the objects together (so-called predicates). As with structure, linked data requires careful work and an investment up-front, but it provides a basis, a unified framework, for answering complex questions that span data sources and even entire domains. This creates a level of adaptability far beyond what is possible in previous rows. But linked data is still an experimental technology.
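
The idea of objects joined by predicates can be illustrated without any linked data tooling. The Python sketch below stores statements as plain (subject, predicate, object) tuples and answers a question that spans two sources. It is not RDF or SPARQL, only a toy stand-in for them, and every URI and statement in it is hypothetical.

```python
# Hypothetical namespace for the illustrative identifiers below.
EX = "http://example.org/"

# Statements from a spending database.
spending_triples = [
    (EX + "contract/1", EX + "awardedTo", EX + "company/acme"),
    (EX + "contract/1", EX + "fundedBy", EX + "agency/transport"),
    (EX + "contract/2", EX + "awardedTo", EX + "company/widget"),
]
# Statements from a separate corporate ownership registry.
ownership_triples = [
    (EX + "company/acme", EX + "ownedBy", EX + "company/holdco"),
]

# Because both sources use the same IDs, merging them is a set union.
graph = set(spending_triples) | set(ownership_triples)


def match(graph, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def owners_of_awardees(graph, agency):
    """Which owners stand behind the companies an agency has paid?"""
    owners = set()
    for contract, _, _ in match(graph, p=EX + "fundedBy", o=agency):
        for _, _, company in match(graph, s=contract, p=EX + "awardedTo"):
            for _, _, owner in match(graph, s=company, p=EX + "ownedBy"):
                owners.add(owner)
    return sorted(owners)


print(owners_of_awardees(graph, EX + "agency/transport"))
```

In a real deployment the nested loops would be replaced by a single query against a SPARQL endpoint, and the predicates would ideally come from existing shared schemas rather than being invented per dataset.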

(See also Berners-Lee (2010) which proposed a maturity model similar to the vertical axis here.)

The Horizontal Axis: Domains

The columns at the top of the maturity model cover the different sorts of public information governments produce. The columns are in a particular order from left to right. Whereas the order of the rows is based on a logical technological progression, the order of the columns is based on a set of normative values relating to the purpose of government. (Reasonable people may disagree on this order.)

The columns start on the left, where there is a moral imperative for the government data to be made available to the public, and end on the right, where access to public data creates additional benefit to society but for which there is no moral imperative to make the data available.

The leftmost column is “Law”, and here the maturity model asserts that access to the law is the most important function of the many purposes OGD serves. A moral imperative to promulgate the law in all of the ways that increase access stems from the principle that ignorance of the law is never a defense. The principle is quite a conundrum when the law is hard to find, difficult to understand, and, at times, illegal to share. The moral imperative is only a starting point. Access to law has wider implications, as Carl Malamud writes on law.resource.org: improved civics and law education in schools, deeper research in universities, innovation in the legal information market, savings to the government, reduced costs of legal compliance for small business, and greater access to justice. Free public access to legal materials isn’t intended to necessarily replace the expensive subscription services for legal professionals, but instead to open up legal materials to a new audience.

“Services” are next. Services are data produced in the furtherance of a government program. Weather data is an example. The National Weather Service is, or at least was at one time, the largest producer of public data in the government. The Census Bureau was one of the first agencies to put data on the web. Its data is another example of service data. For services there is no moral imperative, but there is a legal imperative to further a public policy goal. If an agency’s mission is to produce information, publishing that information as open data can help it further its mission.

The middle columns are Structure and Operations. This sort of data is information about how government works, how it is being run and how money is being spent. This is where government accountability looks for corruption, for instance. There is a moral imperative here too, rooted in the idea that only an educated public can hold its government accountable. Examples of structure include who is serving in government, what agencies make up government, and what they are supposed to do. Examples of operational data include voting records, rulemaking dockets, and spending records.

Last is the catch-all column “Public Data.” This is, for instance, some sorts of Medicare and Medicaid claim statistics. Or geographic data about the location of every single road in the country. This is data for which there is no moral imperative to make public, at least not the sort of moral imperative that exists for law data, and there is no legal imperative to pro-actively make it available either. In a resource-limited world, this sort of data is not a high priority for open data. But making the data open, structured, and so on produces value to society. It is civic capital. Entrepreneurs can build businesses around this data. (Think Google Maps and its predecessors, built originally off of government data and government GPS signals.)