February 10th, 2014
There comes a time in every dataset’s life when it wants to become an API. That might be because of consumer demand or an executive order. How are you going to make a good one?
When is an API appropriate?
There are certain datasets that are so large or volatile that downloading the whole thing and/or keeping it up to date becomes burdensome. An API is one strategy to lower the barrier to entry. As Ben Balter wrote:
Go to any agency website, and chances are you’ll find at least one dataset sitting idly by because the barrier to consume it is too damn high. It doesn’t matter how hard stakeholders had to fight to get the data out the door or how valuable the dataset is, it’s never going to become the next GPS or weather data.
A web-based, read-only API is a tool that in some cases can make it easier for consumers to use your data.
To put this in context, I assume here the data is already available as a bulk data download. As I’ve written ad nauseam elsewhere (such as at http://opengovdata.io/maturity/), an API is almost never a starting point. Posting open data, bulk data, and structured data and using good identifiers all come first, and probably in that order, before an API becomes useful. You can’t make a good API without working through all of that first, and all of it addresses important and common use cases that APIs do not. So I assume from here on that bulk data is available and that the work of making the data good data has already been done. So…
The term “API” is vague. It’s often used as shorthand for a web-based method for programmatic access to a system. But “API” is just a way of saying “protocol”. There were APIs before there was an Internet. Merely having an “API” doesn’t mean an actual use case has been solved: you can make a protocol without it being useful to anyone.
What makes an API good?
Let’s take the common case where you have a relatively static, large dataset that you want to provide read-only access to. Here are 19 common attributes of good APIs for this situation. Thanks to Alan deLevie, Ben Balter, Eric Mill, Ed Summers, Joe Wicentowski, and Dave Caraway for some of these ideas.
Granular Access. If the user wanted the whole thing they’d download it in bulk, so an API must be good at providing access to the most granular level practical for data users (h/t Ben Balter for the wording on that). When the data comes from a table, this usually means the ability to read a small slice of it using filters, sorting, and paging (limit/offset), the ability to get a single row by identifying it with a persistent, unique identifier (usually a numeric ID), and the ability to select just which fields should be included in the result output (good for optimizing bandwidth in mobile apps, h/t Eric Mill). (But see “intents” below.)
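The filtering, paging, and field-selection ideas above can be sketched in a few lines. This is a toy, in-memory illustration, not any particular API; the parameter names (“sort”, “limit”, “offset”, “fields”) and the bill data are invented for the example.

```python
# A minimal sketch of granular read access to tabular data: equality filters,
# sorting, limit/offset paging, and field selection to trim the payload.

def apply_query(rows, params):
    """Apply filter/sort/page/field-selection parameters to a list of dicts."""
    # Any parameter that names a column becomes an equality filter.
    reserved = {"sort", "limit", "offset", "fields"}
    for key, value in params.items():
        if key not in reserved:
            rows = [r for r in rows if str(r.get(key)) == str(value)]
    # Sort: a leading "-" means descending.
    sort = params.get("sort")
    if sort:
        field = sort.lstrip("-")
        rows = sorted(rows, key=lambda r: r[field], reverse=sort.startswith("-"))
    # Paging with limit/offset.
    offset = int(params.get("offset", 0))
    limit = int(params.get("limit", 100))
    rows = rows[offset:offset + limit]
    # Field selection reduces bandwidth, e.g. for mobile apps.
    fields = params.get("fields")
    if fields:
        wanted = fields.split(",")
        rows = [{k: r[k] for k in wanted} for r in rows]
    return rows

bills = [
    {"id": 1, "chamber": "house", "title": "A"},
    {"id": 2, "chamber": "senate", "title": "B"},
    {"id": 3, "chamber": "house", "title": "C"},
]
print(apply_query(bills, {"chamber": "house", "sort": "-id", "limit": "1", "fields": "id"}))
# → [{'id': 3}]
```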
Deep Filtering. An API should be good at needle-in-haystack problems. Full text search is hard to do, so an API that can do it relieves a big burden for developers — if your API has any big text fields. Filters that can span relations or cross tables (i.e. joins) can be very helpful as well. But don’t go overboard. (Again, see “intents” below.)
Typed Values. Response data should be typed. That means that whether a field’s value is an integer, text, list, floating-point number, dictionary, null, or date should be encoded as a part of the value itself. JSON and XML with XSD are good at this. CSV and plain XML, on the other hand, are totally untyped. Types must be strictly enforced. Columns must choose a data type and stick with it, no exceptions. When encoding other sorts of data as text, the values must all absolutely be valid according to the most narrow regular expression that you can make. Provide that regular expression to the API users in documentation.
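Strict typing plus a documented regular expression for text-encoded values might look like the sketch below. The schema itself is hypothetical; the point is that every field declares a type, and text fields carrying structured data (a date here) must match the narrow pattern you publish in the documentation.

```python
# A sketch of strict type enforcement on API output rows.
import re

SCHEMA = {
    "id": int,
    "title": str,
    # ISO 8601 date; this exact regex would be given to API users in the docs.
    "introduced": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

def validate_row(row):
    """Reject any row whose values violate the declared types/patterns."""
    for field, rule in SCHEMA.items():
        value = row[field]
        if isinstance(rule, type):
            if not isinstance(value, rule):
                raise TypeError(f"{field}: expected {rule.__name__}, got {type(value).__name__}")
        elif not rule.fullmatch(value):
            raise ValueError(f"{field}: {value!r} does not match {rule.pattern}")
    return row

validate_row({"id": 1, "title": "A bill", "introduced": "2014-02-10"})  # passes
```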
Normalize Tables, Then Denormalize. Normalization is the process of removing redundancy from tables by making multiple tables. You should do that. Have lots of primary keys that link related tables together. But… then… denormalize. The bottleneck of most APIs isn’t disk space but speed. Queries over denormalized tables are much faster than writing queries with JOINs over multiple tables. It’s faster to get data if it’s all in one response than if the user has to issue multiple API calls (across multiple tables) to get it. You still have to normalize first, though. Denormalized data is hard to understand and hard to maintain.
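Here is one way to picture “normalize, then denormalize”, using Python’s built-in sqlite3 as a stand-in for whatever database backs the API. The normalized tables remain the source of truth, and a flat read-optimized table is rebuilt from them so that common queries need no JOINs. The table names and data are invented.

```python
# A sqlite3 sketch: normalized source tables, plus a denormalized read table.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE bills (id INTEGER PRIMARY KEY, title TEXT,
                        sponsor_id INTEGER REFERENCES people(id));
    INSERT INTO people VALUES (1, 'Smith');
    INSERT INTO bills VALUES (10, 'Data Act', 1);

    -- Denormalized read table: one row per bill, sponsor name inlined,
    -- so the API can answer in one query with no JOIN.
    CREATE TABLE bills_flat AS
        SELECT bills.id, bills.title, people.name AS sponsor_name
        FROM bills JOIN people ON bills.sponsor_id = people.id;
""")
print(db.execute("SELECT title, sponsor_name FROM bills_flat").fetchall())
# → [('Data Act', 'Smith')]
```

When the source tables change, `bills_flat` is simply dropped and rebuilt; maintenance stays on the normalized tables.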
Be RESTful, And More. “REST” is a set of practices. There are whole books on this. Here it is in short. Every object named in the data (often that’s the rows of the table) gets its own URL. Hierarchical relationships in the data are turned into nice URL paths with slashes. Put the URLs of related resources in output too (HATEOAS, h/t Ed Summers). Use HTTP GET and normal query string processing (a=x&b=y) for filtering, sorting, and paging. The idea of REST is that these are patterns already familiar to developers, and reusing existing patterns — rather than making up entirely new ones — makes the API more understandable and reusable. Also, use HTTPS for everything (h/t Eric Mill), and provide the API’s status as an API itself, possibly at the root URL of the API’s URL space (h/t Eric Mill again).
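To make the URL-per-object and related-links ideas concrete, here is a small sketch. The `/api/1/...` namespace, the domain, and the bill record are all hypothetical; a real implementation would sit behind a web framework, but the shape of the response is the point.

```python
# A sketch of a RESTful resource: each bill has its own hierarchical URL,
# and the response embeds links to related resources (HATEOAS).

BILLS = {"hr-1234": {"title": "Example Act", "congress": 113}}

def get_bill(congress, bill_id):
    bill = dict(BILLS[bill_id])
    # Link related resources so clients navigate by following URLs,
    # not by constructing them from scratch.
    bill["links"] = {
        "self": f"https://example.gov/api/1/congress/{congress}/bills/{bill_id}",
        "congress": f"https://example.gov/api/1/congress/{congress}",
    }
    return bill

print(get_bill(113, "hr-1234")["links"]["self"])
# → https://example.gov/api/1/congress/113/bills/hr-1234
```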
Multiple Output Formats. Provide alternatives for the output format, commonly JSON, XML, and CSV, because different formats are best for different use cases. This is to the extent that you actually have users that want these formats. CSV is nice for researchers but not great for other developers; developers lately are moving away from XML toward JSON. See what formats your users want. A RESTful API (see above) will let the caller choose the output format by simply tacking a file extension to the end of the URL, or you can use content negotiation (h/t Dave Caraway).
Nice Errors. Error messages, either because of an invalid request from the user or a problem on the server side, should be clear and provided in a structured data format (e.g. JSON). A RESTful API (see above) additionally uses HTTP status codes where they apply, especially 200, 400, 404, and 500.
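A structured error is just another response, as in this sketch. The field names in the JSON body are invented; what matters is that the machine-readable status code and a human-readable message travel together.

```python
# A sketch of a nice error: an HTTP status code paired with a JSON body.
import json

def error_response(status, message):
    body = json.dumps({"status": status, "error": message})
    return status, {"Content-Type": "application/json"}, body

print(error_response(404, "no bill with id 'hr-9999'"))
```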
Turn Intents into URLs. An API won’t satisfy everyone’s use case, so pick the most important ones and make them dead-simple for the user. These use cases are also called “verbs” and “intents.” If a common use case is to get the latest entry added to the dataset, make an API called “/api/1/most-recent-entry.” Don’t make users add filtering, sorting, and paging to do common operations. It’s tempting to build a kitchen-sink API that can do anything generically and nothing specifically, but it misses the point: As Ben Balter put it, “APIs should absorb the complexities of using the data, not simply expose it in a machine-readable format.” Intents are also good for hiding implementation details, which gives you flexibility to make back-end changes in the future.
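The most-recent-entry example might be implemented like this sketch. The entries and the endpoint name are illustrative; the caller never sees the sort-and-limit machinery behind it.

```python
# A sketch of an intent endpoint, "/api/1/most-recent-entry".

ENTRIES = [
    {"id": 1, "added": "2014-01-05"},
    {"id": 2, "added": "2014-02-10"},
]

def most_recent_entry():
    # The generic equivalent would be ?sort=-added&limit=1; the intent URL
    # hides that detail, so the backend can change without breaking callers.
    return max(ENTRIES, key=lambda e: e["added"])

print(most_recent_entry())
# → {'id': 2, 'added': '2014-02-10'}
```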
Documentation. This is incredibly important. An API without documentation is useless. Totally useless. Because no one will know how to use it. Documentation should cover why the dataset is important, what the data fields mean, how to use the API, and examples examples examples.
Client Libraries. Your users will be accessing your API through software. They’re going to have to write code. Provide re-usable, fully working, modular code for accessing the API in the most common languages that the developers will be using (usually Python, Ruby, and perhaps PHP). This code gives developers a head start, and since every developer will need to write the same basic API-accessing code you get a big win by taking care of writing it once for everyone. (h/t Alan deLevie)
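A client library can be very small and still save every developer the same boilerplate. Here is a stdlib-only sketch; the base URL is hypothetical, and note that it also follows two earlier guidelines: the API key is optional, and when present it goes in the Authorization header rather than the URL.

```python
# A sketch of a minimal reusable API client.
import json
import urllib.parse
import urllib.request

class DataClient:
    def __init__(self, base_url="https://example.gov/api/1", api_key=None):
        self.base_url = base_url
        self.api_key = api_key  # optional; anonymous access still works

    def url_for(self, path, params):
        return f"{self.base_url}/{path}?{urllib.parse.urlencode(params)}"

    def get(self, path, **params):
        request = urllib.request.Request(self.url_for(path, params))
        if self.api_key:
            # Send the key in the Authorization header, never in the URL.
            request.add_header("Authorization", f"Bearer {self.api_key}")
        with urllib.request.urlopen(request) as response:
            return json.load(response)

client = DataClient()
print(client.url_for("bills", {"chamber": "house", "limit": 10}))
# → https://example.gov/api/1/bills?chamber=house&limit=10
```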
Versioning. You will make changes to the API. Nothing is right the first time. Put a version into every API URL so that when it’s time for Version 2 you don’t disrupt the Version 1 users. The best version numbers are actually release dates. So your API URLs should look like: /api/2014-02-10/…. Using a date as a version can relieve anxiety around making updates.
High Performance. Your API should be fast. And while users will appreciate it, the most important reason is for you. Slow APIs create a risk that your server will be overloaded with use too quickly. Some users will inadvertently (if not maliciously) issue extremely slow and resource-intensive queries if you make such queries possible, and if they issue a lot of them, or if too many users make those queries, your API can come down hard. Try not to have that possibility. If you need long-running queries, make it hard for users to start them inadvertently. In addition, query results should be cached on the server side by URL (i.e. don’t put authentication in the URL!) and cacheable, in principle, on the client side if the user chooses, so that repeated accesses to exactly the same query are lightning-fast.
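The cache-by-URL idea can be sketched with a memoized query function. This is a toy stand-in for a real HTTP cache like Varnish or a CDN; the function names are invented. Because the cache key is the full query URL, authentication must not live in the URL, or the cache would hand one user’s responses to another.

```python
# A sketch of server-side caching keyed by query URL, plus a client-side
# caching hint in the response headers.
import functools

@functools.lru_cache(maxsize=1024)
def run_query(query_url):
    # Stand-in for the expensive database work behind an API call.
    return ("results for", query_url)

def handle_request(query_url):
    body = run_query(query_url)  # identical repeat queries hit the cache
    # Mark the response cacheable on the client side as well.
    headers = {"Cache-Control": "public, max-age=3600"}
    return headers, body

handle_request("/api/1/bills?chamber=house")
handle_request("/api/1/bills?chamber=house")
print(run_query.cache_info().hits)
# → 1
```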
High Availability. You don’t know when users will be using the API, so it needs to be available all the time. This is really hard. (It’s one reason bulk data is so much easier.) Basic precautions like rate limiting should be taken to reduce the risk that the API fails under high load. When updating the data behind the API, the API should never be left in a state where it provides incomplete answers. Maintenance windows should be short because they are incredibly disruptive to users, and notice should be posted ahead of time.
Know Your Users. Log what happens in your API and have some analytics so you can tell if anyone is using it and what they’re using it for, and whether the API is really addressing the use cases you want it to.
Know Your Committed Users More. Have a relationship with your committed users so you can alert them to upcoming maintenance and changes to the API, and so you can know who is making resource-intensive queries in case those queries get out of control. This is often done by having an API key (which is like a password for access — but it should be optional!! see the next section). Your system for issuing API keys should be automated and real-time so that developers don’t have to wait to get started. In the API, pass the API key in the HTTP authorization header (h/t Ed Summers). (Or consider another standard method of authorization like OAuth; h/t Ben Balter.)
Never Require Registration. Don’t have authentication on your API to keep people out! In fact, having a requirement of registration may contradict other guidelines (such as the 8 Principles of Open Government Data). If you do use an API key, make it optional. A non-authenticated tier lets developers quickly test the waters, and that is really important for getting developers in the door, and, again, it may be important for policy reasons as well. You can have a carrot to incentivize voluntary authentication: raise the rate limit for authenticated queries, for instance. (h/t Ben Balter)
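The carrot can be as simple as a tiered rate limit, as in this sketch. The limits and the key store are invented for illustration; the point is that a missing key lowers the ceiling rather than closing the door.

```python
# A sketch of an optional API key that only raises the rate limit.

KNOWN_KEYS = {"abc123"}   # issued automatically, in real time
ANON_LIMIT = 100          # requests per hour without a key
KEYED_LIMIT = 10000       # the carrot for registering

def rate_limit_for(api_key):
    return KEYED_LIMIT if api_key in KNOWN_KEYS else ANON_LIMIT

print(rate_limit_for(None), rate_limit_for("abc123"))
# → 100 10000
```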
Interactive Documentation. An API explorer is a web page that users can visit to learn how to build API queries and see results for test queries in real time. It’s an interactive browser tool, like interactive documentation. Relatedly, an “explain mode” in queries, which instead of returning results says what the query was and how it would be processed, can help developers understand how to use the API (h/t Eric Mill).
Developer Community. Life is hard. Coding is hard. The subject matter your data is about is probably very complex. Don’t make your API users wade into your API alone. Bring the users together, bring them to you, and sometimes go to them. Let them ask questions and report issues in a public place (such as github). You may find that users will answer other users’ questions. Wouldn’t that be great? Have a mailing list for longer questions and discussion about the future of the API. Gather case studies of how people are using the API and show them off to the other users. It’s not a requirement that the API owner participates heavily in the developer community — just having a hub is very helpful — but of course the more participation the better.
Create Virtuous Cycles. Create an environment around the API that makes the data and API stronger. For instance, other individuals within your organization who need the data should go through the public API to the greatest extent possible. Those users are experts and will help you make a better API, once they realize they benefit from it too. Create a feedback loop around the data, meaning find a way for API users to submit reports of data errors and have a process to carry out data updates, if applicable and possible. Do this in public as much as possible so that others see they can also join the virtuous cycle.
How do you build a good API?
Actually I don’t know yet, but here are some things that might be useful:
- API Umbrella, which is used at api.data.gov, provides API key management, rate limiting, and so on as a wrapper around an existing API. It was also some of the inspiration for starting this blog post.
- instant-api, by Waldo Jaquith, creates an API from static data.
- qu, by the CFPB, creates a platform for serving data.
December 23rd, 2013
Everyone likes a good choropleth map, that is, a map with regions colored according to some variable. But when the variable is a function of a population of highly unevenly distributed individuals — such as in maps of the United States — we know we can run into some problems:
Half of the population in the continental United States lives in just 1% of the continental U.S. land area. One-fifth of the population lives in just 0.28% of the land area. 95% of the population lives in just 27% of the land area.
There are at least two problems with veridical (regular geographic) choropleth maps. In a rasterized choropleth map (i.e. one with finite resolution), entire cities can get squashed into a single pixel, with the result that information is lost. A substantial proportion of the information that the map is trying to show probably doesn’t even appear if the variation occurs where the people are.
The more familiar problem is that veridical maps can misrepresent the aggregate. Individuals in low-density areas are given more space on the map than individuals in high-density areas, biasing aggregate inferences toward the values of individuals in low-density areas.
Coloring a map by district — like by county or congressional district — runs into the same problem. The smallest 50% of the 433 congressional districts in the continental U.S. occupy just 5% of the land area. Six congressional districts, all in New York City, are smaller than one pixel in a typically sized map! (Where “typically sized” is 650px by 410px.)
When map data falls below the resolution of the map itself one should be very concerned. It amounts to tossing out data points arbitrarily, because those data points don’t show up at all. That’s considered academic fraud when the data is shown in the form of a table. I’m not sure why we think it’s okay in map form.
It’s also mostly the urban population that gets squeezed into a small area. This is particularly concerning for politically themed maps since the urban population leans left. All six of those too-small-to-be-seen New York districts are currently represented by Democrats, for instance. Republican-held congressional districts are on average 2.7 times larger than Democrat-held districts despite having equal weight in Congress, and so take up disproportionate space in a veridical map. The same is likely true by county too if we were to look at presidential election results.
Considering how much space on a map is taken up by essentially unpopulated land, these maps are also inefficient representations of the data. They give space to meaningless geographies while skipping meaningful ones.
It’s really time we stop using veridical maps to show population data. I get that cartograms are hard to construct and hard to read, but I would rather have no map at all than a map that misrepresents the data it purports to show.
Here’s a table showing land area as a function of population:
% of Population    % of Land Area
20%                0.28%
50%                1%
60%                1.71%
95%                27%
For computing land area resided in by the population, I used the 72,246 Census tracts in the 2010 census that make up the continental United States, meaning I excluded tracts in Alaska, Hawaii, and the five island territories. For land area I used the ALAND10 value in the Census’s shapefiles. The total population and land area of the tracts used were 306,675,006 and 7,653,005 km^2, respectively.
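The computation described above can be sketched as follows: sort the tracts from densest to sparsest, then accumulate tracts until a given share of the population is reached. The three tracts below are made-up toy data, not census figures.

```python
# A sketch of "X% of the population lives in Y% of the land area":
# pack the densest census tracts first.

def min_area_share(tracts, population_share):
    """Fraction of total land area holding `population_share` of the
    population, filling in the densest tracts first."""
    total_pop = sum(p for p, a in tracts)
    total_area = sum(a for p, a in tracts)
    tracts = sorted(tracts, key=lambda t: t[0] / t[1], reverse=True)  # by density
    pop = area = 0.0
    for tract_pop, tract_area in tracts:
        if pop >= population_share * total_pop:
            break
        pop += tract_pop
        area += tract_area
    return area / total_area

# (population, land area in km^2): one dense city tract, two sparse rural ones
tracts = [(90, 1), (5, 49), (5, 50)]
print(min_area_share(tracts, 0.5))
# → 0.01
```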
For congressional districts, I used the 433 districts in the continental U.S. (that’s the states minus Alaska and Hawaii and including the DC district). Their “land area” is their 2D area after being projected into EPSG:2163, which is an equal-area projection, using this Census GIS data. The total “land area” computed this way came out to 8,064,815 km^2, the difference being areas of water. For which party holds the district, I filled in the two currently vacant districts (AL-01 & FL-13) with the party of their most recent congressman.
Thanks to Matt Moehr, Lisa Wolfisch, and Pat Grady for some tips on identifying census tracts via Twitter.
Updates: Keith Ivey pointed out that I included Hawaii in the definition of continental U.S. the first time around. Fortunately its land area is small enough that it only barely affected the numbers. Instead of 74,003 tracts and 435 congressional districts there are 72,246 and 433; 60% of the population lives in 1.71% and not 1.70% of the land area. Other numbers are unchanged.
I also changed the projection used to compute the land area of congressional districts from “web Mercator” to an equal-area projection, but the numbers (e.g. 50% cover 5% of land) didn’t change. While I was there, I also changed how the Republican/Democrat distortion was measured. I originally wrote “Republican-held congressional districts cover 3.2 times more land area than Democrat-held districts despite Republicans only having 1.2 times as many seats in Congress” but I think the way it’s phrased now is clearer.
December 12th, 2013
Eric Mill, Jonathan Gray, and I have updated Best-Practices Language for Making Data “License-Free” which also has a new home at http://theunitedstates.io/licensing/.
We’re also adding a slew of new endorsements, bringing the list to the Sunlight Foundation (read Eric’s blog post), the Open Knowledge Foundation (read their blog post), CDT, EFF (read their blog post), Public Knowledge (read their blog post), the Free Law Project, the OpenGov Foundation, Carl Malamud at Public.Resource.Org, Jim Harper at WashingtonWatch.com, CREW, and MuckRock. Thanks go out to our contacts at all of those organizations and especially to Eric who spearheaded the effort to issue this update.
Our guidance is for federal agencies and is related to the recent open data memo from the White House, M-13-13. That memo directed agencies to make data “open” but told agencies the wrong thing about what open data actually means. We’re correcting that with precise, actionable direction that, to summarize, says “Use CC0.” As we write in our guidance:
It is essential that U.S. federal government agencies have the tools to preserve the United States’ long legal tradition of ensuring that public information created by the federal government is exempt from U.S. copyright and remains free for everyone to use without restriction.
Why It Matters
Why does it matter? Imagine if after FOIA’ing agency deliberative documents The New York Times was legally required to provide attribution to a contractor, or, worse, to the government itself. If the government doesn’t like the article, maybe they take the Times to court on the grounds that the attribution wasn’t done correctly. There’s a reason we don’t let our government control access to information.
What We Recommend
In short what we say is “Use Creative Commons Zero” (CC0), which is a public domain dedication. We provide recommended language to put on government datasets and software to put the data and code into the world-wide public domain, which means anyone can use the information without any capricious restriction. In a way, it’s the opposite of a license.
I previously wrote about this in August when we issued the first version of our guidance. Since then, our document has been effective in guiding the use of “open” in three government projects:
- OSTP’s Project Open Data re-licensed its schema for federal data catalog inventory files. It had been licensed under CC-BY because of non-governmental contributors to the schema, but now it uses CC0.
- CFPB followed our guidance and applied CC0 to their qu project…
- …and their eRegs platform.
Our advice was also already followed by HHS for its ckanext-datajson project and the Council of the District of Columbia’s Unofficial Code (disclaimer: I was involved in both of those projects already). We’re glad to see that our guidance has already been useful and we hope it continues to be useful as agencies work on compliance with M-13-13.
In this updated version, we cleaned up the suggested legal language, we noted that our recommendations apply equally well to software code as well as to data, and we improved the introductory text, among other changes.
How We Wrote It
The process of updating the guidance was done mostly openly through github. Feel free to open an issue with questions or create a pull request with suggested edits at https://github.com/unitedstates/licensing/issues.
November 19th, 2013
I can remember when 640×480 was a typical screen resolution. Desktop monitors have gotten a bit better since then, but just how much? Every few years I look at the breaks in common screen resolutions to see what I should be designing for when building desktop-oriented websites.
TL;DR: The 1024×768 displays are finally disappearing. The fall-off after 1280 horizontal pixels will be around for a while though.
Using Google Analytics data from GovTrack.us (which has a primarily desktop audience and probably is skewed a little toward older machines), I compared screen resolutions in April 2011 with screen resolutions from the last 30 days (~November 2013).
Each bar shows the percentage of users with at least the horizontal screen size on the right side of the bar. So the first bar goes up to 100% because almost everyone has at least 400 horizontal pixels. Only about 50% of users have a horizontal resolution of at least 1366.
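The cumulative measure each bar shows can be computed like this. The session counts below are made-up, not GovTrack’s actual analytics.

```python
# A sketch of "share of users with at least this horizontal resolution".

def at_least(width_counts, width):
    total = sum(width_counts.values())
    return sum(n for w, n in width_counts.items() if w >= width) / total

# {horizontal pixels: number of sessions}
counts = {320: 10, 1024: 20, 1280: 30, 1366: 25, 1920: 15}
print(at_least(counts, 1024))  # everyone except the 320px users
# → 0.9
```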
No real surprise, but mobile is getting more popular, and that’s shown by the big 5-10% drop-off after about 450 horizontal pixels. All of the lost users in the 450-1024 pixel range are now using hand-held devices with less than about 450 horizontal pixels.
The 1024-divide is closing: In early 2011 there was a 20% drop-off after 1024 horizontal pixels. While 90% of users had at least 1024 pixels, only about 70% had a higher-resolution display. That drop-off has shrunk considerably, now at only a 10% drop (from 80% to 70%). Maybe we can finally start designing for 1280px!
And 1366px? The next big drop is between 1280 and 1366. Despite the shift to mobile, the resolutions starting at 1366px are gaining ground. That’s closing the 1280-1366 gap. But not by much. It’s still about 20%: 70% of users today have resolutions at least 1280px, but only 48% have resolutions at 1366px or above.
November 6th, 2013
The Freedom of Information Act was enacted twice, and the one that we know and celebrate is, technically, not the one that became law. This early history of FOIA provides an interesting case study in the complexities of the codification of our federal statutes.
What we commonly consider the Freedom of Information Act, S. 1160 in the 89th Congress, was signed by President Johnson on July 4, 1966. It became Pub.L. 89–487 / 80 Stat. 250. Its effective date was one year later on July 4, 1967, and in fact it never became law: it was repealed before its effective date. More on that below.
What came before?
During the period between the enactment of the act and its effective date, Title 5 of the United States Code happened to be enacted into positive law. Title 5 existed before that, but it was essentially an official compilation of laws and, importantly, not the law itself. The enactment of a title of the U.S. Code into positive law reorganizes existing laws to make them easier to follow.
The enactment of Title 5 created the section of the code that would soon be the home of FOIA (5 USC 552). The statute that would actually become FOIA would replace the text of 5 USC 552 with similar provisions plus modern FOIA as we know it.
Pub.L. 89–554, enacted on September 6, 1966, was the law that enacted Title 5 into positive law. The new Title 5’s 5 USC 552 contained a brief public records law (see 80 Stat. 383):
§552. Publication of information, rules, opinions, orders, and public records
(a) This section applies, according to the provisions thereof, except to the extent that there is involved—
(1) a function of the United States requiring secrecy in the public interest; or
(2) a matter relating solely to the internal management of an agency.
(b) Each agency shall separately state and currently publish in the Federal Register—
(1) descriptions of its central and field organizations, including delegations of final authority by the agency, and the established places at which, and methods whereby, the public may obtain information or make submittals or requests;
(2) statements of the general course and method by which its functions are channeled and determined, including the nature and requirements of the formal or informal procedures available and forms and instructions as to the scope and contents of all papers, reports, or examinations; and
(3) substantive rules adopted as authorized by law and statements of general policy or interpretations adopted by the agency for public guidance, except rules addressed to and served on named persons in accordance with law.
A person may not be required to resort to organization or procedure not so published.
(c) Each agency shall publish or, in accordance with published rule, make available to public inspection all final opinions or orders in the adjudication of cases (except those required for good cause to be held confidential and not cited as precedents) and all rules.
(d) Except as otherwise required by statute, matters of official record shall be made available, in accordance with published rule, to persons properly and directly concerned, except information held confidential for good cause found.
You can see in there some of the ideas that were precursors to the actual Freedom of Information Act. In (a) you’ll see what became the national security and personnel records exemptions.
Now, 5 USC 552 was actually not new when it was enacted in September 1966. But I’ve got to unpack that. “5 USC 552” existed before the enactment of Title 5, but it was about an unrelated subject, and the enactment of Title 5 in 1966 moved that to another part of the US Code. But that’s not what I mean by it not being new.
The text of 5 USC 552 predated the enactment of Title 5 as well. Acts enacting titles into positive law aren’t supposed to create new law. They just rearrange existing law in previous statutes. The text of 5 USC 552 dates back at least to 60 Stat. 238 (which Jon Roland of the Constitution Society had the foresight to digitize). That’s the original Administrative Procedure Act, Public Law 79-404 (60 Stat. 237, originally S. 7 in the 79th Congress, in 1946).
The enactment of FOIA
The enactment of Title 5 was actually not relevant to the creation of the substance of FOIA, but it did create a conflict with the form of FOIA. At this point Congress’s lawyers were probably not happy. S. 1160 in the 89th Congress — the FOIA we celebrate — was written to amend the Administrative Procedure Act, which by September of the year FOIA was enacted no longer existed. The enactment of Title 5 had repealed it.
When a title is enacted into positive law, the title repeals the previously existing laws that it is consolidating into a new title. Section 8 of Pub.L. 89–554 explicitly repealed the original Administrative Procedure Act:
SEC. 7. (a) The legislative purpose in enacting sections 1-6 of this Act is to restate, without substantive change, the laws replaced by those sections on the effective date of this Act. Laws effective after June 30, 1965, that are inconsistent with this Act are considered as superseding it to the extent of the inconsistency.
SEC. 8. (a) The laws specified in the following schedule are repealed except with respect to rights and duties that matured, penalties that were incurred, and proceedings that were begun, before the effective date of this Act and except as provided by section 7 of this Act.
The Administrative Procedure Act — 60 Stat. 237 — is included in a long list of repealed laws following the text above. (You may notice I switched from “60 Stat. 238” to “60 Stat. 237”. These “Stat.” citations indicate the volume and page number in the United States Statutes at Large on which something appears. The origin of 5 USC 552 was on page 238 of the 60th volume of the Statutes at Large, the second page of the Administrative Procedure Act, which began on page 237, and so has the citation 60 Stat. 237.)
To fix this problem, S. 1160 was scuttled before its effective date ever arrived.
On June 5 of the next year — now we’re in 1967 — Pub.L. 90–23 / 81 Stat. 54 (originally H.R. 5357 in the 90th Congress) was enacted. This was “An Act to amend section 552 of title 5, United States Code, to codify the provisions of Public Law 89-487.” This law literally repealed the original Freedom of Information Act before that act became effective and enacted a new one. And this time, rather than amending the Administrative Procedure Act, it replaced 5 USC 552.
Here is a comparison between the two FOIA statutes. Strike-out indicates text only in the 1966 statute and underline indicates text only in the 1967 statute. The most important difference is in the first line of each right at the top.
AN ACT To amend section 3 of the Administrative Procedure Act, chapter 324, of the Act of June 11, 1946 (60 Stat. 238), to clarify and protect the right of the public to information, and for other purposes.
AN ACT To amend section 552 of title 5, United States Code, to codify the provisions of Public Law 89-487.
Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled. That section 3, chapter 324, of the Act of June 11, 1946 (60 Stat. 238), 552 of title 5, United States Code, is amended to read as follows:
§552. Public information; agency rules, opinions, orders, records, and proceedings
“SEC. 3. Every (a) Each agency shall make available to the public the following information information as follows:
Every “(1) Each agency shall separately state and currently publish in the Federal Register for the guidance of the public —
(A) descriptions of its central and field organization and the established places at which, the officers employees (and in the case of a uniformed service, the members) from whom, and the methods whereby, the public may secure obtain information, make submittals or requests, or obtain decisions;
(B) statements of the general course and method by which its functions are channeled and determined, including the nature and requirements of all formal and informal procedures available;
(C) rules of procedure, descriptions of forms available or the places at which forms may be obtained, and instructions as to the scope and contents of all papers, reports, or examinations;
(D) substantive rules of general applicability adopted as authorized by law, and statements of general policy or interpretation of general applicability formulated and adopted by the agency; and
(E) every each amendment, revision, or repeal of the foregoing.
Except to the extent that a person has actual and timely notice of the terms thereof, no person shall a person may not in any manner be required to resort to, or be adversely affected by any a matter required to be published in the Federal Register and not so published. For purposes of this subsection paragraph, matter which is reasonably available to the class of persons affected thereby shall be is deemed published in the Federal Register when incorporated by reference therein with the approval of the Director of the Federal Register.
“(b) (2) AGENCY OPINIONS AND ORDERS.—Every agency shall Each agency, in accordance with published rules, shall make available for public inspection and copying (A) final opinions (, including concurrening and dissenting opinions ), and all as well as orders made in the adjudication of cases; (B) those statements of policy and interpretations which have been adopted by the agency and are not published in the Federal Register; and (C) administrative staff manuals and instructions to staff that affect any a member of the public, unless such the materials are promptly published and copies offered for sale. To the extent required to prevent a clearly unwarranted invasion of personal privacy, an agency may delete identifying details when it makes available or publishes an opinion, statement of policy, interpretation, or staff manual or instruction Provided that in every However, in each case the justification for the deletion must be fully explained shall be explained fully in writing. Every Each agency also shall maintain and make available for public inspection and copying a current index providing identifying information for the public as to any matter which is issued, adopted, or promulgated after the effective date of this Act and which is July 4, 1967, and required by this subsection paragraph to be made available or published. No A final order, opinion, statement of policy, interpretation, or staff manual or instruction that affects any a member of the public may be relied upon on, used, or cited as precedent by an agency against any private party unless a party other than an agency only if—
(i) it has been indexed and either made available or published as provided by this subsection or unless that private party shall have paragraph; or
(ii) the party has actual and timely notice of the terms thereof.
(c)(3) AGENCY RECORDS. Except with respect to the records made available pursuant to subsections (a) and ( b ) , every agency shall, upon under paragraphs (1) and (2) of this subsection, each agency, on request for identifiable records made in accordance with published rules stating the time, place, fees to the extent authorized by statute and procedure to be followed, make such shall make the records promptly available to any person. UponOn complaint, the district court of the United States in the district in which the complainant resides, or has his principal place of business, or in which the agency records are situated, has shall have jurisdiction to enjoin the agency from the withholding of agency records and to order the production of any agency records improperly withheld from the complainant. In such cases a case the court shall determine the matter de novo and the burden shall be upon is on the agency to sustain its action. In the event of noncompliance with the court’s order, order of the court, the district court may punish the responsible officers for contempt the responsible employee, and in the case of a uniformed service, the responsible member. Except as to those causes which the court deems considers of greater importance, proceedings before the district court as authorized by this subsectionparagraph shall take precedence on the docket over all other causes and shall be assigned for hearing and trial at the earliest practicable date and expedited in every way.
(d)(4) AGENCY PROCEEDINGS.—Every Each agency having more than one member shall keep maintain and make available for public inspection a record of the final votes of each member in every agency proceeding and such record shall be available for public inspection.
~~(e) EXEMPTIONS.—The provisions of this section shall not be applicable~~
(b) This section does not apply to matters that are
(1) specifically required by Executive order to be kept secret in the interest of the national defense or foreign policy;
(2) related solely to the internal personnel rules and practices of ~~any~~ an agency;
(3) specifically exempted from disclosure by statute;
(4) trade secrets and commercial or financial information obtained from ~~any~~ a person and privileged or confidential;
(5) inter-agency or intra-agency memorandums or letters which would not be available by law to a ~~private party~~ party other than an agency in litigation with the agency;
(6) personnel and medical files and similar files the disclosure of which would constitute a clearly unwarranted invasion of personal privacy;
(7) investigatory files compiled for law enforcement purposes except to the extent available by law to a ~~private party~~ party other than an agency;
(8) contained in or related to examination, operating, or condition reports prepared by, on behalf of, or for the use of ~~any~~ an agency responsible for the regulation or supervision of financial institutions; ~~and~~ or
(9) geological or geophysical information and data ~~(~~, including maps~~)~~, concerning wells.
~~(f) LIMITATION OF EXEMPTIONS.—Nothing in this section authorizes~~ (c) This section does not authorize withholding of information or ~~limiting~~ limit the availability of records to the public, except as specifically stated in this section~~, nor shall this section be~~. This section is not authority to withhold information from Congress.
~~(g) PRIVATE PARTY.—As used in this section, ‘private party’ means any party other than an agency.~~
~~(h) EFFECTIVE DATE.—This amendment shall become effective one year following the enactment of this Act.~~
~~Approved July 4, 1966.~~
SEC. 2. The analysis of chapter 5 of title 5, United States Code, is amended by striking out:
”552. Publication of information, rules, opinions, orders, and public records.”
and inserting in place thereof:
”552. Public information; agency rules, opinions, orders, records, and proceedings.”
SEC. 3. The Act of July 4, 1966 (Public Law 89-487, 80 Stat. 250), is repealed.
SEC. 4. This Act shall be effective July 4, 1967, or on the date of enactment, whichever is later.
Approved June 5, 1967.
There were differences between the FOIA we celebrate and the one that was enacted, but only minor ones. There usually are differences between bills that are drafted against non-codified laws (i.e. the Administrative Procedure Act) and bills that are drafted against positive law titles of the U.S. Code. This case exemplifies some of those differences.
Though there were no substantive differences, let us remember that the FOIA we actually have on the books was signed by President Johnson on June 5, 1967, not July 4, 1966.
Updated on Nov 8 to reflect that the enactment of Title 5 definitely repealed the previous Administrative Procedure Act and that that was almost assuredly the reason for enacting FOIA again.
August 19th, 2013
Today Eric Mill, Jonathan Gray, and I, with endorsements from the Sunlight Foundation and the Center for Democracy & Technology, are issuing guidance for federal agencies related to the recent open data memo from the White House (M-13-13).
— Read Our Guidance —
(We plan to open our guidance to community discussion.
Stay tuned for where that discussion will take place.)
The May 9 Memorandum On Open Data Policy instructs agencies as follows:
“[If] information as it is collected or created … are made public there [must be] no restrictions on copying, publishing, distributing, transmitting, adapting, or otherwise using the information”
Our guidance provides pre-vetted language that federal agencies can affix to data to meet these goals. For instance, for most federal government datasets copyright protection is not available (see 17 USC § 105). We recommend including the following in a README file for these datasets:
As a work of the United States Government, this package is in the public domain within the United States. Additionally, [Agency Name] waives copyright and related rights in the work worldwide through the CC0 1.0 Universal public domain dedication (which can be found at http://creativecommons.org/publicdomain/zero/1.0/).
We also offer language for four other common situations, including when the data is created by a contractor (copyright protections may apply). Our guidance also explains our rationale:
- “Open licensing” is not the same thing. (Licenses presume copyright protections.)
- Data is more valuable when its copyright status is clear through an explicit statement.
- Foreign copyright may apply to any government data.
- The Creative Commons CC0 public domain dedication is a widely adopted legal tool allowing a creator to dedicate his/her work to the world-wide public domain.
Our guidance also came out of a conversation with Timothy Vollmer and Puneet Kishor. We thank them, the Sunlight policy team, Joe Hall at CDT, and a lawyer friend of mine for their thoughts.
As I wrote previously, the memorandum was confusing, if not misleading, by asking agencies to implement the “no restrictions” policy using “open licensing.” The term “open licensing” was improper for two reasons:
First, you can’t license what you don’t own. Most federal government data is in the public domain, and this data is born “license-free.” The memorandum should have been clearer by explicitly directing the policy toward data created in the performance of a contract, which is often subject to copyright protection. (h/t Gavin Baker for the concise way to word that.)
Second, “open licensing” does not mean the same thing as “no restrictions.” Almost every standard open license is “some rights reserved,” meaning some privileges are granted but other rights in the work are retained. Copyright protection is precisely how the terms of these licenses are enforced. The GPL’s so-called virality clause and Creative Commons’s attribution licenses rely on the threat of a copyright infringement lawsuit to enforce their requirements for reuse.
If the memorandum really meant “no restrictions” the best way to accomplish that is not through an open license but rather the CC0 public domain dedication, as we’ve used in our guidance.
What’s the bottom line? Our guidance does not say that all government data must be open and license-free, nor do we attempt to say which government data should be open and license-free. We recognize the practicalities of contract work. However we strongly suggest that license-free become the norm, and we are providing tools for agencies to achieve that.
July 27th, 2013
This was updated twice since first posting, as indicated below.
In a Wired article yesterday Lawmakers Who Upheld NSA Phone Spying Received Double the Defense Industry Cash, the author said that based on an analysis by MAPLight “defense cash was a better predictor of a member’s vote on the Amash amendment than party affiliation.” That suggests there’s evidence defense cash had something to do with the vote. ~~There isn’t.~~ There isn’t much.
Everyone who’s been following the Amash vote already knows that the vote was not along party lines in the least. Take a look at the seating chart diagram on the GovTrack vote page:
Liberal Democrats and conservative Republicans happened to form a coalition in opposition to NSA data collection (an “Aye” vote), while moderates in both parties voted to reject the amendment. (The seating chart arranges representatives by their GovTrack ideology score.) So, first, the fact that defense cash was a better predictor than party is not very interesting.
A better question is whether defense cash is a better predictor than a legislator’s pre-existing personal convictions, as measured by our ideology score.
Defense cash’s prediction
To make this quantitative, let’s make the prediction like this. Since we know the vote was 205/217, let’s put the 217 legislators who received the most defense cash into one group and the bottom 205 legislators into another group. How well do those groups match the vote outcome? Here’s the breakdown by counts:
In other words, this prediction is right for 123+135 = 258 legislators, or just 61% of the time.
We can do a similar analysis based on the ideology score. The idea is that the further from the center a legislator is, the more likely he or she was to vote for the amendment. So let’s make groups for the 205 legislators with scores furthest from the median ideology score (“extreme”) and the 217 closest (“moderate”). Does that match the vote?
A little better.
This prediction is right for 131+143 = 274 legislators, or 65% of the time. That’s a little better than defense cash, but let’s call it a draw.
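The median-split comparison above is easy to reproduce. Here is an illustrative sketch in Python on made-up numbers (the real analysis, in R, is at the end of this post; the toy votes and dollar figures here are invented for demonstration):

```python
def split_accuracy(predictor, votes, n_top, top_label):
    """Predict top_label for the n_top legislators ranked highest on the
    predictor and the opposite vote for everyone else, then return the
    fraction of actual votes the prediction gets right."""
    other_label = 'Aye' if top_label == 'No' else 'No'
    ranked = sorted(range(len(predictor)), key=lambda i: -predictor[i])
    predicted = [other_label] * len(votes)
    for i in ranked[:n_top]:
        predicted[i] = top_label
    hits = sum(p == v for p, v in zip(predicted, votes))
    return hits / len(votes)

# Toy data: six legislators, three of whom voted Aye.
votes = ['Aye', 'No', 'No', 'Aye', 'No', 'Aye']
cash = [100, 900, 800, 750, 700, 50]  # hypothetical defense dollars

# The top-cash group is predicted to vote No (the No side won the vote).
print(split_accuracy(cash, votes, n_top=3, top_label='No'))  # 4 of 6 right
```

On the real table, running this once with defense dollars and once with distance-from-the-median ideology corresponds to the 61% and 65% figures above.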
[update: added 7/29/2013]
We have two predictors for the vote — personal conviction and campaign contributions — that are about equally good, and both are equally plausible. In the absence of other data, there’s no reason to prefer one explanation of the vote over the other.
Votes are often mostly along party lines. That is, vote and party are often extremely highly correlated. That also means that to the extent money is highly correlated with votes, it’s then necessarily highly correlated with party affiliation too. That makes it very difficult, or impossible, to try to separate the influences of party and money.
But the Amash vote presents a uniquely interesting case because ideology (distance from the center) and defense dollars are not really correlated at all (r=-.05). That means ideology is good at predicting 60ish% of the votes and defense dollars are good at predicting a slightly different 60ish%. Maybe we can put them together to predict more than either can predict alone?
Let’s start with the predictions from the ideology score. We know we got 35%, or 148, of the votes wrong. So let’s swap the 74 congressmen in the ‘extreme’ group with the highest defense cash (call them the A group) with the 74 representatives in the ‘moderate’ group with the least defense cash (call them the B group). If money has any effect, we’d predict these to be the representatives most likely to be affected. Here’s how those representatives voted:
Note that by ideology alone, we predicted the As to be Aye voters and the Bs to be No voters, which was right 35+36=71 times. After the swap, we make the reverse predictions, which are right 39+38=77 times. The swap improves our predictions for 6 votes, or 1.4% (6 out of 422 aye and no votes).
The predictors are better together. That means there is room for an influence of defense dollars on the vote, even for a skeptic like me that prefers an explanation in terms of ideology first. But it’s a small effect in absolute terms. And this effect goes both ways. The 6 votes extra are split between 4 additional no-votes due to defense-dollars and 2 additional aye-votes due to lack-of-defense-dollars.
So let’s boil this down to one number. Out of the 422 votes, maybe about 4 no-votes were due to the influence of defense contractor campaign contributions. Even in a tight vote like this, that wouldn’t have affected the outcome. And it’s still a big maybe. This is a minuscule correlation that is probably due more to random chance than any actual influence of money.
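For concreteness, the swap itself can be sketched the same way. The data below is again invented; only the mechanics match the description above (flip the predicted vote for the k ‘extreme’ members with the most cash and the k ‘moderate’ members with the least):

```python
def swap_predictions(extreme, cash, k):
    """Start by predicting Aye for extreme members and No for moderates,
    then reverse the prediction for the k extreme members with the most
    cash and the k moderate members with the least cash."""
    pred = ['Aye' if e else 'No' for e in extreme]
    ext = [i for i, e in enumerate(extreme) if e]
    mod = [i for i, e in enumerate(extreme) if not e]
    for i in sorted(ext, key=lambda i: -cash[i])[:k]:
        pred[i] = 'No'   # high-cash extremes now predicted No
    for i in sorted(mod, key=lambda i: cash[i])[:k]:
        pred[i] = 'Aye'  # low-cash moderates now predicted Aye
    return pred

extreme = [True, True, True, False, False, False]
cash = [900, 100, 200, 300, 50, 800]  # hypothetical defense dollars
print(swap_predictions(extreme, cash, k=1))
```

In the actual analysis k was 74, and scoring the swapped predictions against the recorded votes is what yields the 6-vote improvement.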
(In a linear regression model, the adjusted r-squared roughly doubles when we put the factors together.)
[end of update]
What does it mean?
Since we have two predictors that are about equally good, and one has nothing to do with either defense or money, there’s no reason to think that defense cash had anything directly to do with the outcome of this vote.
There’s obviously a role of campaign cash in our political system. In particular, only candidates who can raise cash can run for office. I’ve written about that in my book if you want to know what I think in more detail.
But if you want to relate industry cash to a particular vote, you’re going to have to at least beat other explanations that aren’t based on that industry’s cash.
So, here’s the thing: it’s important that we actually tell truthful stories, not just ones that we can spin to match our beliefs.
[update: added 8/19/2013] Ben Klemens, a statistician, has turned this data into an interesting logit model and quantifies in a better way the effect of money on the vote: post 1, post 2. [end of update]
After merging the vote and ideology data from GovTrack with the campaign contributions aggregated by MAPLight into a single table (download), I ran the following script in R:
data = read.table("table.csv", header=T, sep=",")
attach(data)  # exposes the vote, contribs, and ideology columns

# There were 205 Aye-votes.
num_ayes = sum(vote=='Aye')

# Group legislators by how much defense contractor money they received.
# Call the bottom 205 legislators the 'Less$' group, and the other half
# the 'More$' group.
defense_dollars = ifelse(rank(contribs) <= num_ayes, 'Less$', 'More$')
print(table(defense_dollars, vote))

# Group legislators by how far their GovTrack ideology score is from
# the House median. Call the most extreme 205 legislators the 'Extreme'
# group, and the other half the 'Moderate' group.
distance_from_center = abs(ideology - median(ideology))
is_extreme = ifelse(rank(-distance_from_center) <= num_ayes, 'Extreme', 'Moderate')
print(table(is_extreme, vote))

cat("cor(contribs, distance_from_center) =", cor(contribs, distance_from_center), "\n")

# Swap the 74 'Extreme' members with the most defense cash (group A) and
# the 74 'Moderate' members with the least (group B), then compare votes.
swap_size = 74
group = ifelse(is_extreme=='Extreme', '0', 'Z')
group[is_extreme=='Extreme'][rank(-contribs[is_extreme=='Extreme']) <= swap_size] = 'A'
group[is_extreme!='Extreme'][rank(contribs[is_extreme!='Extreme']) <= swap_size] = 'B'
print(table(group, vote))
July 17th, 2013
The House Appropriations committee passed up another chance to advance core transparency practices in Congress. In a draft report published this morning for FY2014 appropriations, the committee makes no mention of legislative data. And in the Bulk Data Task Force’s finally-released recommendations, the Library of Congress gets all worked up over something no one has been asking for.
Here’s the short of it. Can we get a spreadsheet simply listing all bills in Congress? Is that so hard? I guess so.
After last year’s legislative branch appropriations bill report said the committee was “concerned” that the public would misuse any bulk data downloads, The Washington Post covered how the public uses this sort of data for good, and House leadership formed a Bulk Data Task Force to consider if and how to make bulk legislative data available. That task force submitted recommendations to the House Appropriations committee last December, but they were only made available to the public last week (see this, page 679).
In the recommendations, the task force noted that it had begun several new transparency projects. One is the Bill Summaries project, in which the Library of Congress will begin to publish the summaries of House bills written by the Congressional Research Service (CRS) in some structured way. The Library of Congress’s report to the task force has some choice quotes:
“some groups may try to leverage this action to drive demand for public dissemination of CRS reports” (Note that “CRS reports” are different from “CRS summaries.” That’s a whole other can of worms.)
“CRS could find itself . . . needing to clarify misrepresentations made by non-congressional actors”
“if there is an obligation to inform the general public to the risks of non-authoritative versions of the information, it has not been included in the estimates”
These CRS summaries have already been widely distributed… on GovTrack… for nearly a decade. (And, I’m sorry, but what risks am I causing?) And while I wouldn’t mind having the summaries easier to get from the Library, I certainly am not gunning for them. I want data like the list of cosponsors, what activities bills have gone through, or just a simple list of bills. If the Library thought this wasn’t a great place to start with bulk data, well, I couldn’t agree more!
Some of the other projects mentioned in the recommendations are indeed very useful (some of which I wrote about here). Others, however, touted bulk data success without making any new data available. In the recommendations’ meeting minutes in the appendix, the task force wrote that it discussed “what data is available on GovTrack compared to what would be available through the proposed GPO project.” Quite a bit! That proposed GPO project turned into the one that made no new data available. In their next meeting they met with me and folks from other groups (Sunlight, Cornell LII, and so on), but, oddly, I don’t recall them asking me the question they posed the week before.
The other projects mentioned in the bulk data task force recommendations are:
- Congress.gov, THOMAS’s upgrade, which is explicitly not providing any bulk data (except perhaps through the new Bill Summaries Project)
- Member Data Update: The Clerk’s list of Members of the House now includes Bioguide IDs, which is fantastic and very helpful.
- A new House History website launched or will launch. See, I don’t even know. Again, not bulk data.
- Docs.House.Gov: Committee schedules and documents have been added. (Great! I’m using that data on GovTrack already.)
- New XML data for House floor activity. (This is pretty interesting but a little disorganized. I would rather scrape THOMAS than use this XML data.)
- The Clerk is launching a Twitter account. (No data here.)
- HouseLive speaker search. (Searching videos. Data? Who knows.)
- Stock Act public data disclosure.
- Legislative Data Dashboard (not quite sure what this is).
- Converting the United States Code to XML. (This is a big and commendable project.)
- A contest to get the public to convert bills to the Akoma Ntoso XML data format. (Does not count as open government data if the public has to do the work.)
- Replacing MicroComp (an old bill/report text drafting tool?).
- Positive Law Codification (when did that become in scope for this task force?).
- Editorial Updating System (no idea what this is).
So while the recommendations support the use of legislative data generally, they set no long-term goals for broad access to the legislative data on THOMAS. And as for the only data in motion now, the Library of Congress appears not to be happy about making it widely available.
The committee report for the annual legislative branch appropriations bill, which kicked off the task force last year, has been an important document for legislative transparency in the past. Besides last year’s step backwards, in 2009 the report indicated the House supported “bulk data downloads” for the bill status information on THOMAS.gov. Though nothing came of it. This year the committee said nothing, so, well, I guess nothing will come of it this time either.
May 9th, 2013
TL;DR: The new E.O. and memorandum are good for transparency and lock in almost all of the generally accepted notions of open government data. But the memorandum misses the mark on the requirement of “open licenses.”
With an executive order and a new Memorandum on Open Data Policy today, the focus on entrepreneurship remained at the forefront of federal data policy. This focus began with last year’s Digital Government Strategy, and these days weather data and GPS signals are the examples of choice. That said, the policies set in the new memorandum are quite good for the classic use of this data (transparency, accountability, and civic education) even if “transparency” is only barely mentioned in passing.
Defining Open Data: How well does it do?
This new Open Data Memorandum presents the most detailed definition to date of “open data” by the federal government. It includes many of the principles that our community has reached consensus on, but it gets one severely wrong.
As I wrote many years ago, the 2009 Open Government Directive itself already adopted some of the principles of open government data including: online, primary, timely, public input, and public review. It also added two principles of its own: being pro-active about data release and creating accountability by designating an official responsible for data quality.
Comparing to my list of open government data principles in my book, the new memorandum’s definition of open data covers:
- Principle 1: Information should be online (to quote the Memorandum: “retrieved, downloaded”)
- Principle 2: Primary (the Memorandum even uses language from the 8 Principles; interestingly the memorandum places this under the heading of “Complete,” which was a different principle from the original 8 Principles).
- Principle 3: Timely.
- Principle 4: Accessible (the Memorandum repeats the language from the 8 Principles, “available to the widest range of users for the widest range of purposes” and the use of “multiple formats” where necessary, and for documentation says the data should be “described”).
- Principles 5 and 10: Analyzable (“machine readable”).
- Principle 6: Non-discriminatory
- Principle 7: Non-proprietary (open) data formats
- Principle 14: Public review (“A point of contact must be designated to assist with data use and to respond to complaints about adherence to these open data requirements.”)
Its definition also states that open data has a presumption of openness. (Principles 2-7 and 14 are from the 8 Principles of Open Government Data. Principle 1 is from the Sunlight Foundation.)
Elsewhere in the memorandum it addresses:
- Principle 13: Public input (“engage with customers” for prioritizing what data should be made available and how to make it available)
- Principle 15. Interagency coordination (“interoperability”)
It also asks agencies to create data catalogs to include datasets “that can be made publicly available but have not yet been released” at agency.gov/data URLs. And it says agencies must consider the needs of open data at all stages of the information collection lifecycle. In other words, data should be collected in such a way as to promote public dissemination of open data later on.
The Memorandum misses the principle that data should be license-free, which is a core principle and a grave mistake. It also misses the peripheral principles of permanence, the use of safe file formats, and practices of provenance and trust (e.g. digital signatures). (These last two are ACM principles.)
“Open licenses” presume access is closed by default!
Rather than requiring open data to be license-free, which was a core part of the 8 Principles of Open Government Data, it instead promotes the use of “open licenses.” This is a subtle but important distinction. Licenses presume data rights. Open licenses, including open source licenses and Creative Commons licenses, create limited privileges in a world where the default is closed. These licenses create possibilities of use that do not exist in the absence of the license because copyright law, or other law, creates an initial state of closedness.
Most open licenses only grant some privileges but not others, and some privileges come along with new requirements. The GPL and Creative Commons Attribution License, for instance, rely on copyright law so that restrictions on data use intended by the open license (GPL’s virality clause, or the restriction that users must attribute the work to the author) are enforceable in court.
Federal government data is not typically subject to copyright law, and in this case a license is not needed for the data to be open. Thus the application of a license suggests a change from the open-by-default state of this data to a closed-by-default state where a license is required to open it up. While the memorandum requires “an open license that places no restrictions on their [the dataset's] use,” the term “open license” is typically understood to presume a default closed state. This policy opens the door (so to speak) to agencies applying licenses (i.e. new contractual agreements) to data that serve only to restrict use.
Federal government data not subject to copyright cannot be free if a license is applied. The license-free principle of the original 8 Principles says open government data cannot be limited in this way.
When data may be subject to copyright protection (copyright law is murky and there are many gray areas), or when copyright law definitely applies (such as to documents produced originally by federal government contractors), then a public domain dedication such as the Creative Commons CC0 statement or the Open Data Commons Public Domain Dedication and License (PDDL) (both of which combine a waiver and a license) is appropriate. A public domain dedication differs from an open license in that it disclaims copyright and other protections, whereas, again, an open license implies that such a limitation on use is already present. The CC0 statement was successfully used by the Council of the District of Columbia to disclaim copyright over data files containing the DC Code.
What’s the definition used for?
While the definition of open data is otherwise quite strong, the definition is used just once in the whole memorandum. The memorandum does not mandate that government data be open data under its definition, at least as far as I could see. The only use of the open data definition is in its request for agencies to create roles for staff to ensure data released to the public are open. That is, staff should promote open data, but open data itself is not required.
Although the definition itself is not used much, there are independent provisions that repeat some of the same principles. Agencies must use “machine-readable and open formats,” existing standards, and metadata. And information collection should be done in a way to support information dissemination: “[A]gencies must design new information collection and creation efforts so that the information collected or created supports downstream interoperability between information systems and dissemination of information to the public.”
It also requires the use of open licenses:
“Agencies must apply open licenses, in consultation with the best practices found in Project Open Data, to information as it is collected or created so that if data are made public there are no restrictions on copying, publishing, distributing, transmitting, adapting, or otherwise using the information for non-commercial or for commercial purposes.”
As I mentioned, federal-government-created data needs no license to be open, although the memorandum implies that all agency data should have an open license. (That’s either legally impossible or it means something unusual.) For other data, it appears that the memorandum intends to create a public-domain-like state. But it is qualified, for contracts may only use “existing clauses” (i.e. standard contract terms already approved by OMB) to implement terms of open licensing. Looking over those terms, I don’t see the necessary legal framework to do it. And a nearby footnote confusingly says that a data user who modifies the data “is responsible for” describing the change. Does that mean an “open license” can require users to describe modifications? The qualifications make it very difficult to know what an acceptable implementation of open licensing looks like.
While the goals of the Memorandum in defining open data and using open licenses are laudable, the implementation does not meet the 8 Principles’s requirements of open government data, at least under the usual understanding of “open license,” and the use of the definition to promote open data is very limited.
PS. As Derek Willis points out on Twitter, the “mosaic effect” paragraphs in the memorandum are also somewhat concerning. The mosaic effect is hard to quantify and therefore difficult to limit, and this creates a big hole for keeping government data out of public reach.
UPDATE 5/10/2013 #1:
Rufus Pollock points out that the Open Data Commons Public Domain Dedication and License (PDDL) is similar to CC0 and would also be appropriate. I agree.
Eric Mill notes that for data already in the public domain, the Creative Commons Public Domain Mark, which is basically an icon/badge, would be appropriate. Agencies should definitely mark public domain data as such.
UPDATE 5/10/2013 #2:
I added a few paragraphs to the section now called “What’s the definition used for?”.
April 4th, 2013
This morning DC’s legal code went online as open data. I’ve worked with government before on open data, but never have I worked with a government body that moved so deftly through the technical, policy, and legal issues as the DC Council’s Office of the General Counsel. So, before anything else, thanks to the general counsel V. David Zvenyach and his staff for their time and expertise on this.
The TL;DR version goes like this:
Tom MacWright wanted to build his own version of the DC Code website. The DC Council couldn’t share its electronic copy of the Code because it contained intellectual property owned by West. This became a small but very geeky controversy (spurred by Carl Malamud). But Zvenyach — the general counsel — recognized the value of making the law open and did it. He removed the West IP from their electronic copy of the Code (I helped), posted the file on the Council’s website, and even included a CC0 public domain dedication.
The last bit all happened within a matter of days, and it was one of the easiest open data success stories I’ve been a part of. Tom recapped the events here and began hacking the code immediately. He held a hackathon on April 14 which he wrote about here (and Eric Mill wrote about here).
Here’s the longer version:
This all began a few months ago when DC-based civic hacker Tom MacWright took an interest in making local law more accessible. Intending to import the DC Code into Waldo Jaquith’s State Decoded project, he ran into a small problem: he couldn’t get a complete copy of the law. Intellectual property issues prevented the DC Council from simply emailing over their copy of the Code.
Many states, like the District, contract out the codification and code-publishing work to a third party like West (owned by the Canadian-owned Thomson Reuters) or Lexis (owned by the Amsterdam-based Reed Elsevier). DC had previously contracted out to West, and last year switched to Lexis. Neither likes to share. DC’s official website for reading the Code — which has been run by West — is free to the public, but copying any part of the Code off of that website might violate West’s copyright or terms of service, or both. Sharing the law might have been illegal.
Here in DC, the DC Council had Word documents containing the Code, given to them by their contractor West, but the documents contained West’s logo. The DC Council could not share the documents with West’s logo intact. And it wasn’t easy to take those logos out (more on that later). Informally speaking, West owned the DC Code.
I had met Zvenyach, the general counsel, before. He is very technologically savvy and has been trying to modernize the office he took over only a few years ago. We had even talked about holding a hackathon to help him do it. (As a DC resident, I’m also interested in DC law.) But his office, like the rest of government, is constrained by limited resources and a long to-do list. When Tom brought the issue onto Zvenyach’s radar, I don’t believe there was any point at which Zvenyach didn’t want to make the files available. It was, as far as I’ve observed, merely a matter of time and resources.
Tom wrote more about the intellectual property issues here. Coincidentally, on Monday Ed Walters of Fastcase gave a great talk on the issue of who owns the law at Reinvent the Law — I highly recommend watching it. He’s also written extensively about it.
Tom asked Carl Malamud to get involved. Carl has been working on this issue in other states, like in Oregon, where the State of Oregon itself claimed copyright over its laws. Carl bought (for quite a lot of money) a physical copy of the DC Code, digitized it, and mailed thumb drives shaped like famous presidents, containing the digitized code, to various important people. This was a spin on a tactic that Carl began in the 1990s when he opened the SEC’s corporate filings data: get the data online, pressure the government to put the data online themselves, and then help the government take over that responsibility.
The media and bloggers caught on, beginning I think with Cory Doctorow on March 27, followed by DCist on March 28, The Washington Times on March 31, Steve Schultze on April 1, and Think Progress on April 3. The files themselves went up on April 4, a little more than a week after the first media blog post about it, and the decision to put the files up with a CC0 dedication had in any case been made some days earlier. It really did not take much pressure at all. (Tom also wrote a post on Greater Greater Washington on March 19.)
Carl had noticed early on that the DC Council asserted copyright over the Code. Some of the media reports focused on that. As Zvenyach explained in The Washington Times article, the rationale was to protect DC from West, by making sure West could not claim copyright over the same Code, not to limit access to the law. Whether or not state codes can be copyrighted was mostly beside the point, and the focus on this issue turned out to be a red herring. It was resolved quickly with the choice of the Creative Commons CC0, a public domain dedication.
I went in to Zvenyach’s office on April 3 to help them take West’s logo out of the Word documents. There was one document per title of the Code, or about 50 documents, many in the 50-megabyte size range. The West logo was in the header, but the header was specified independently for each section of the code, so in reality there were thousands of logos to take out. We also took out a DC copyright line from the documents, which was also repeated in each section. It took about 4 hours for Microsoft Word to process all of the files, and 1 hour for us to figure out how to do it so “quickly.”
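The logo-stripping step was essentially a batch edit of the XML inside each file: a .docx is a zip archive, and header content lives in parts named word/header1.xml, word/header2.xml, and so on. We actually did it in Microsoft Word itself, but here is a minimal Python sketch of the same idea; the function name, regex, and pattern argument are illustrative, not the actual process used:

```python
import re
import zipfile

def scrub_docx(src, dst, pattern):
    """Copy a .docx archive, dropping any header paragraph whose text
    matches `pattern` (e.g. a vendor copyright line)."""
    rx = re.compile(pattern)
    with zipfile.ZipFile(src) as zin, \
         zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            # Header parts are word/header1.xml, word/header2.xml, ...
            if item.filename.startswith("word/header"):
                xml = data.decode("utf-8")
                # Remove each whole <w:p> paragraph that contains the
                # offending text; keep all other paragraphs untouched.
                xml = re.sub(
                    r"<w:p\b.*?</w:p>",
                    lambda m: "" if rx.search(m.group(0)) else m.group(0),
                    xml,
                    flags=re.DOTALL,
                )
                data = xml.encode("utf-8")
            zout.writestr(item, data)
```

Because the header was repeated for every section of the Code, a script like this touches thousands of logo instances in one pass over each file, which is why doing it by hand was never an option.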
When I left Zvenyach’s office that evening, Zvenyach pointed out the presidential thumb drive still sitting on his desk that he received from Carl — unfortunately I forget if it was a little George Washington or a little Abraham Lincoln. I have a feeling that thumb drive will be around for a while.
Now, there is a bigger issue here. There’s no plan for updating the public files. DC’s contract with Lexis going forward doesn’t require Lexis to provide DC with an electronic copy of the code. Perhaps after this they’ll refuse to do so. But we’ll tackle this another time.