June 29th, 2014
In a Facebook study published this week, Facebook manipulated
all of their U.S. many of their users’ News Feeds by omitting 0-90% of posts containing either positive or negative content over the course of a week in 2012. They reported that those users wrote fewer positive and negative words (respectively) in their own posts, concluding that Facebook is a medium on which emotions spread, a case of “emotional contagion” using their technical term.
Here’s what you need to know:
On average, no emotion actually spread
The number of positive words in their average user’s posts decreased from 6 words to… 6 words.
The first major omission in the study is the lack of individual-level statistics. While they reported aggregate numbers such as having analyzed ”over 3 million posts” totaling “122 million words” made by their “N = 689,003″ users, and the study’s implication for “hundreds of thousands of emotion expressions,” they omitted any discussion of whether and how individuals were affected in any meaningful way.
From their numbers, the average user wrote 4.5-5 posts totaling 177 words during the experimental week. Only 3.6% of those words — so about 6 words — were “emotional,” and they found that by omitting about half of emotional posts from user’s News Feeds that percentage would go down by 0.1% or less. A 0.1% change is about 2/10ths of a word.
For most of their users, there was not even close to a measurable effect.
(The study did mention the Cohen’s d statistic of ’0.02′ which is another way to say that there was an aggregate effect but basically no individual-level effect.)
The study has no test for external validity (was it about emotions at all?)
An important part of every study is checking that what you’re measuring actually relates to the phenomenon you’re interested in. This is called external validity. The authors of the Facebook study boasted that they didn’t think of this.
The paper quixotically mentions that “no text was seen by the researchers” in order to comply with Facebook’s agreement with its users about how it will use their data.
They didn’t look at all?
That’s kind of a problem. How do you perform a study on 122 million words and not look at any of them?
Are the posts even original, expressive content? The users might be sharing posts less (sharing is sort of like retweeting) or referring less to the emotional states of friends (“John sounds sad!”). The words in a post may reflect the emotions of someone besides the poster!
To classify words as “positive” or “negative” the study consulted a pre-existing list of positive and negative words used throughout these sorts of social science research studies. This comes with some limitations: sarcasm, quotation, or even simple negation completely cut out the legs under this approach. I actually think in aggregate these problems tend to go away, but only when you have a large effect size.
The whole of Facebook’s reported effect on emotion could be due to one of the many limitations of using word lists as a proxy for emotion. They needed to demonstrate it wasn’t.
This study is not reproducible. While most research isn’t ever reproduced, that it could be provides a check against the fabrication of results (and sometimes that’s how fabricators are caught). Facebook provides the only access to a network of this size and shape. It is unlikely they would provide access to research that might discredit the study.
The study also uses a strange analysis. Their experimental design was 2 X 9-ish (control or experiment X 10-90% of posts hidden), but they plugged the two variables into their linear regression in two ways. The first became a binary (“dummy”) variable in the regression, which is right, but the second become a weight on the data points rather than a predictor. That’s an odd choice. Do the results come out differently if the percentage of posts hidden is properly included in the regression model? Did they choose the analysis that gave the results they wanted to see? (This is why I say “about half of emotional posts” above, since the analysis is over a weighted range.)
Finally, there’s the problem of informed consent. It is unethical to run experiments on people without it. The paper addresses legal consent, in the sense that the users agreed to various things as a pre-condition for using Facebook. Though being manipulated was probably not one of them (I don’t know what Facebook’s terms of service were in early 2012 unfortunately).
Certainly the consent didn’t reach the level of informed consent, in which participants have a cogent sense of what is at stake. There’s a great discussion of this at Slate by Katy Waldman.
Facebook’s users have a right to be outraged over this.
Keep in mind though that there are different ethical obligations for research versus developing a product. It could be ethical for Facebook to manipulate News Feeds to figure out how to increase engagement while at the same time being unethical for a research journal to publish a paper about it.
June 20th, 2014
Last week I noticed that the sunset aligned unusually well with my cross-street, Newton St NW, and it made me wonder if we have any Manhattanhenge-like events in DC. DC can one-up Manhattan — we’ve got a double-henge, if you’ll let me coin a phrase.
Here in Columbia Heights we have a unique street pattern. Two roads — Park Rd and Monroe St. — come to an apex on 14th St. They go north both to the east and west of 14th St. On a few days a year — centered on May 15 and July 29 — the roads point east toward sunrise and west toward sunset. Click the links to see on suncalc.net. (The alignment isn’t exact, so the effect spans a few days.)
All the henges
Like Manhattan, DC’s grid lines up with sunrise & sunset. It’s on the equinoxes, so we get a boring double-henge on those days too.
Some of the state avenues are kind of close to the solar azimuths on the solstices, but the peak days are a few days off. In the summer it is on the same days as the Columbia Heights Doublehenge. On those days the avenues parallel to New York Avenue line up with sunrise and the avenues parallel to Pennsylvania Avenue line up with sunset. Around the winter solstice — Nov 5 and Feb 6 — the avenues parallel to Pennsylvania Avenue line up with sunrise and the avenues parallel to New York Avenue line up with sunset.
I wondered for each day of the year, what was the DC road that best aligns with sunrise and sunset. If you’re driving these would also be the roads to avoid (h/t @knowtheory). Here’s a table for the next year. The links will show you where exactly it is:
June 8th, 2014
Last month the Administration posted perhaps the first github pull request to change federal policy. One WIRED writer was quick to call success, writing here, “By opening up the revisions and the discussions behind them, the White House is making its thinking clear.” But no.
It’s easy to be fooled into believing that a new medium also signals new substance.
No substantive policy change in this pull request
The pull request is a proposed change to a federal memorandum on github regarding open data. The change clarifies when agencies should openly license their data. The memorandum originally said that federal agencies should always use open licensing. But as I pointed out when the memorandum was issued a year ago, that’s not legally possible. Most federal data is not subject to copyright in the first place, and works that are in the public domain can’t be licensed.
The proposed update to the memorandum fixes the Administration’s mistake by adding at the top:
“In instances where government data . . . does not fall squarely within the public domain . . .”
clarifying that open licensing should only be used where copyright applies. Mainly that means when the data was produced by a government contractor. There is no substantive change made in this pull request though. It clarifies the only sensible meaning the original memorandum actually had.
Omits discussion of the substantive issues
If this were the only issue in the paragraph being edited, then I too would call it success. But late last year 14 organizations backed a statement supporting the public domain for government data — not open licensing — and several of us who wrote the letter met with the Administration about the issue. The absence of any mention of the substantive issue in that paragraph should be a red flag for thinking the pull request represents open dialog.
The substantive issue is that the policy condones the copyrighting of any government data, much of which might be used to create or enforce government policy. That’s a serious First Amendment concern. It means that even if journalists can get a hold of some data, they might only be able to share it on terms set by a government agency or even a government contractor. As a broad government policy, the notion of copyrighting government data is ridiculous and flies in the face of our country’s traditions and values. (Note: Forget national security, privacy, etc. This could be data about any mundane policy.)
The pull request omits discussion of this issue, as well as other issues that I and others have discussed with the Administration (as I noted in my reply to the pull request).
Where was the dialog?
There was dialog on these issues, but it wasn’t on github. It was in private in-person meetings, as these things usually are. I and others met with Administration staff in private meetings in August 2013, April 2014, and May 2014. Our discussions each time were thoughtful and productive.
There was plenty of good dialog, but it wasn’t online. I first raised the licensing issue on github a year ago in issues #5 and #64, to which the Administration replied only that they would look into it. The issue was picked up against in issue #257, but again there was no participation in the github issue by the Administration. (There is a lot of dialog in that github repository, but it is about data standards and not policy, and most of the participants in those discussions are government employees or contractors (including myself, in those conversations) — which is a good thing, but not the subject of the WIRED article.)
The pull request posted last month represents the end of a year-long process in which discussions were taking place off-line, and proof that even with github most dialog will still continue to take place off-line.
Lest journalists get confused let’s just be clear that there wasn’t any discussion of substance on github. It was elsewhere, off-line, like normal.
Now I’m just going to be a jerk and red-line the WIRED article because it got a lot of details wrong:
This White House GitHub Experiment Could Help Fix Government
BY ROBERT MCMILLAN
While many of our nation’s problems are quite clear, the way our government addresses them is too often a black box—opaque and closed to all but insiders and lobbyists.
But the White House has taken a remarkable–if small–step toward bringing greater transparency to the legislative process. (“legislative” refers to the legislative branch of government, i.e. Congress. This is an executive-branch memo and thus not related to the legislative process.) For the first time, it has used the GitHub social coding website as a forum for discussing and ultimately changing government policy. With one GitHub “pull request,” it modified (The document has not yet been modified.) theProject Open Data policy document, which spells out how government agencies are supposed to open up access to their data. This represents the fusion of open source software and government policy that open-government advocates have long predicted (#notalladvocates predict this). And it might be a sign of things to come as others—the city of San Francisco, and the New York state senate, to name a couple—bring collaborative government into the light.
‘We’re taking a well-known page from the open source playbook: that developing policy in an open and iterative way will create a stronger, more effective product.’
Late last week, Haley Van Dyck at the Office of Management and Budget submitted a pull request that suggested small changes to Project Open data that clarify how agencies think about open source and public domain software (The memo does not cover software. It is about data.). Pull requests are a Silicon Valley innovation. They’re typically used by software developers on GitHub to suggest and discuss changes to code. But they’re also a good tool for tracking changes to complex legal documents, even government regulations.
While Van Dyck’s changes weren’t big, it’s important that these issues were raised and addressed in a public forum where anyone can suggest language for the policy document.(Anyone can, but no one did. The pull request was submitted by the Administration to the Administration’s own document. Let’s wait until they accept a pull request submitted by the public to a policy document.) “We’re taking a well-known page from the open source playbook: that developing policy in an open and iterative way will create a stronger, more effective product. The more we can involve the community, the better that product will be,” said Van Dyck—a senior adviser to the U.S. Chief Information Officer—in an email to WIRED.
The White House will wait a few weeks to review comments to the pull requests, but then Van Dyck’s changes become official government policy with the push of a button. This is open source government: The tonic that could cure the back-room deal. (Most government policy-making involves public comments, review periods, and pushing a button to upload the final policy to the Internet. There is absolutely nothing more open-source about this than the usual agency rule-making process.)
By opening up the revisions (there is no policy-making in our government that doesn’t involve posting revisions) and the discussions behind them (as I mentioned, there was no discussion on github), the White House is making its thinking clear, and there’s an added bonus: The changes are easier to read and understand. Compare Van Dyck’s revisions here, to Rep. Lou Barletta’s proposed changes to existing law in his Emergency Unemployment Compensation Extension Act of 2014. In the GitHub document, you can see the old text struck-through in red and the new additions in green. Congressional bills like Barletta’s, on the other hand, read like uncompiled source code, detailing all the changes to be made but giving the reader no idea what the finished product will look like.(That’s not what uncompiled source code looks like. And ‘compiled’ source code certainly looks no better.)
That makes some bills unreadable, as far as the average citizen is concerned. (This isn’t an apples-to-apples comparison. Modifying 200-year-old statutory law is going to be harder for the “average citizen” to read than modifying a memo written last year.) “The thing that is actually voted on is the edits,” says Ben Balter, GitHub’s government evangelist. He has been working with the feds for years, convincing them to use more open-source software and adopt more of an open-source attitude. “The open government community has been talking about doing stuff like this, but it’s never reached fruition because there weren’t enough stakeholders in government.”
That’s begun to change, Balter says. He says he’s spending more time explaining to federal employees how they can use open source tools and methods. Two years ago, he was still convincing them to give open-source a shot. Now he’s watching the White House merge pull requests.
June 1st, 2014
The Department of Education has an RFI due tomorrow on the Use of APIs in Higher Education Data and Student Aid Processes. I submitted the following response:
The RFI asks how APIs for higher education data and programs can achieve policy goals of the Administration. As an expert on open government data, I am submitting this comment to address when APIs are useful.
Modern methods of information dissemination and service delivery recognize the long-standing role of mediators in facilitating citizen-government transactions. The media, educational institutions, and many others have long played a crucial role in helping citizens make use of information about higher education produced by the government and enroll in government services. The function of electronic standards for information dissemination and service delivery is to make mediation more efficient and therefore able to reach a wider audience. These new methods are a force multiplier for policy objectives.
Do Open Data First
An API is one of the two modern methods of information dissemination and service delivery specifically sought after by the Administration. Besides building APIs, creating open data — also called bulk, raw, and structured data — is also now an Administration goal as outlined in the White House’s Memorandum on Open Data (M-13-13).
It is important to understand when open data or an API is the right technology for a particular database or service.
Open data, when possible, is always both less costly to implement and more powerful than a “read API”. Here is a summary of why:
* Open data is static but APIs are dynamic. That means that APIs require long-term maintenance to ensure that the API remains continuously and indefinitely available. Open data is released once and updated periodically as needed.
* Open data provides a complete database but APIs provide only a small window into the data. That means that while open data can be used to build any application, an API can only be used to build applications that require a small amount of the data at a time.
* A *good* API requires that the agency do everything that good open data requires plus much more, including the creation of RESTful services, building around use cases, and creating “client libraries”.
A “read API” must do everything that open data does, plus much more. Therefore agencies should walk before they run. Build good open data first, validate that it meets the needs of users, learn how to do that well, and only after validation and learning invest in building an API to address additional use cases.
Open data should always be available before a “read API” is considered. The few cases were open data is not possible for information dissemination (when data changes in real time, like in the stock market, or the data is extremely large) are not likely to apply to data about higher education.
For an example of open data, the Census Bureau and the National Weather Service have been providing open data since the mid 1990s. The practices of open data have a 25-year history.
I advise against the implementation of any read APIs for a dataset before open data for that dataset is available and validated with stakeholders.
Not all open data is created equal. Well-designed open data will prove to be most useful for mediators — and thus the consumers. For more information on open data, please see:
* My book, Open Government Data: The Book, at http://opengovdata.io/
* Best practices for open data licensing, at http://theunitedstates.io/licensing/
When to build APIs
That said, the above advice applies only to information dissemination. Read/write APIs are an excellent strategy for the enrollment in or participation in government services. In a read/write API, unlike a read-only API, the external user is submitting information — such as form values — in a transactional process. A read/write API decouples the customer’s experience from the business logic so that mediators can create new experiences but still be in compliant with the agency’s business logic.
Just as with information dissemination, mediators can be valuable during transactions. Different audiences might respond best to different ways in which the transaction occurs (off-line, on an iPad, in large print, in plain language, or using jargon when targeting domain experts, etc.). Using a read/write API, mediators can create new and tailored methods of performing the same transaction and best reach audiences that the agency alone could not best reach.
Since transactions are by their nature dynamic, open data would not meet this need.
Not all APIs are created equal. Exceptional APIs lower the barrier to entry and the ongoing costs for mediators. Poorly designed APIs could result in helping no one.
A well-designed API provides granular access, deep filtering, typed values, normalized tables, RESTful interfaces, multiple output formats, useful validation messages, use-case or intent-oriented URLs, documentation, client libraries, versioning, fast results, high uptime, easy on-boarding, interactive documentation, and a developer community hub. The best APIs are used by the agencies themselves inside their own products and services.
For more information on what makes a good API, please see my blog post “What makes a good API?” at http://razor.occams.info/blog/2014/02/10/what-makes-a-good-api/.
I am the founder of GovTrack.us, a legislative transparency website that has been at the forefront of open government data since 2004, and I am the author of Open Government Data: The Book (opengovdata.io). I formerly consulted for the Department of Health and Human Services on HealthData.gov and currently consult for the Council of the District of Columbia. I can be reached at firstname.lastname@example.org.
April 29th, 2014
Here’s a little data exploration. Although there are far more small-money contributions to congressional candidates than large-money contributions, the large-money contributions more than make up for the difference in aggregate totals.
I’ve binned all reported individual contributions (types 15 and 15E) to congressional candidates (‘H’ and ‘S’ committees) in the 2011-2012 election cycle so that the bucket means are close to most common contribution amounts ($250, $500, $1000, and $2500), using FEC data. Amendments to filings were ignored. (Source Code)
Nothing new or exciting here. Just useful to have numbers.
Not included here are contributions to non-candidate PACs (issue PACs, leadership PACs, etc. , except when those contributions were earmarked) and unreported contributions (<$200), but we can extrapolate from this data that while there may be a lot of unreported contributions they likely don’t add up to much.
|up to $400
|$1,400 or more
(The bins overlap because I’ve apportioned some of those contributions to both sides in order to make the mean contribution of each bin close to a nice number.)
April 15th, 2014
Our family has gone through many haggadahs over the years at our Passover seders as my grandparents and relatives have tried to modernize (if not also shorten) our Passover experience. This year a reading conspicuously inserted into our haggadahs by my grandmother had relevance not only for the Passover story but also for civic hacking.
The traditional four children of the seder are a rhetorical device by which we remember the purpose of the Passover seder. The “wicked” child is said to ask, “Why do you do this seder?” The response, we read, is that God freed us from bondage: us, not you. The implication is that with that attitude, God would have left the wicked child behind.
Last night we had four new children on a printout from this page by the American Jewish World Service. If you’re not familiar with Passover, you’ll need to know that the story of Passover is of how God righted the grave injustice of the slavery of the Jews in Egypt, lead the Jews to Israel, and punished the Egyptians with plagues for committing the injustice. Here were the new four children:
The Activist Child: “The Torah tells me, ‘Justice, justice you shall pursue,’ but how can I pursue justice?”
Empower him always to seek pathways to advocate for the vulnerable. As Proverbs teaches, “Speak up for the mute, for
the rights of the unfortunate. Speak up, judge righteously, champion the poor and the needy.”
The Skeptical Child: “How can I solve problems of such enormity?”
Encourage her by explaining that she need not solve the problems, she must only do what she is capable of doing. As we read in Pirke Avot, “It is not your responsibility to complete the work, but neither are you free to desist from it.”
The Indifferent Child: “It’s not my responsibility.”
Persuade him that responsibility cannot be shirked. As Abraham Joshua Heschel writes, “The opposite of good is not
evil, the opposite of good is indifference. In a free society where terrible wrongs exist, some are guilty, but all are responsible.”
For the Uninformed Child who does not know how to ask a question:
Prompt her to see herself as an inheritor of our people’s legacy. As it says in Deuteronomy, “You must befriend the stranger, for you were strangers in the land of Egypt.”
These are the questions I hear often about the usefulness of hackathons and the purpose of meetups like Code for DC, and this provides some useful answers. I will be thinking about how to incorporate these thoughts into my own civic hacking.
April 3rd, 2014
I’ve been following Carl Malamud’s work on public access to state codes, and, if I may say, the response by the State of Georgia is absolutely infuriating. Citizens ought to have a right to read the law and teach others about it without the threat of a civil suit or criminal penalties, but Georgia’s top law lawyers don’t see it that way.
TL;DR: Did you read Georgia’s state code on an iPad? The feds could throw you in jail.
Background & the Copyright Claim
Georgia publishes its statutory law in something called the “Official Code of Georgia Annotated.” You can either get it online for “free” or in 40 physical books, at a cost of $378 when purchased through the legal publisher LexisNexis.
In May 2013, Carl bought the books, scanned them, and put them online to improve public access. The Georgia Code Revision Commission, which is the government body that publishes the code, replied, in big capital letters, “CEASE AND DESIST ALL COPYRIGHT INFRINGEMENT.” The commission’s claim is that while the law itself is not copyrighted, it is inextricably intertwined with other explanatory material called annotations which don’t have the force of law, and thus are copyrightable.
Carl disagrees with their claim (page 2), and he has a point.
But I’m not a lawyer, so let me put the copyright question aside and just look at this part ethically. (Update: This and the preceding paragraph were updated post-publication.)
Even, even, if we accept their position that the annotations are copyrighted, the commission is still in a major pickle: The Official Code of Georgia Annotated doesn’t delineate what parts are the actual law, what parts are copyrighted by the commission, and what parts are copyrighted by the publisher LexisNexis. A legal expert can guess that what they’re referring to is the part printed in a smaller font size. An average citizen who wants to read the law — because he is responsible for knowing all of it — does so at his own risk. Should he copy & paste a page into his blog, he risks infringing on the copyright of two possible authors (the commission and Lexis), and thus risks a costly lawsuit and statutory damages.
While the probability of a lawsuit is low, I admit, the commission has an ethical responsibility — if not a legal one — to provide the public with access to the law in a way that doesn’t put the public at risk of a lawsuit.
But it gets worse…
(Note that the discussion below about terms of service is not related to Carl’s situation, since Carl used the printed volumes. It’s a hypothetical for the rest of us. Update: This paragraph added shortly after publication.)
If you think, like any sane person, that you might find what you’re looking for in the code faster using a computer, the Georgia commission noted in their response to Carl that free, online access to Georgia’s statutory laws is already available at www.legis.ga.gov.
They don’t charge money for access, true. But they do require assent to a contract. It’s a trade. You get to see the law, and in return you give up the right to do certain things. It’s Free With An Asterisk
. Terms & conditions apply.
Here’s what happens. When you follow the link to www.legis.ga.gov and then click Georgia Code, you’re presented with contract #1:
The Official Code of Georgia Annotated (O.C.G.A.) is copyrighted by the State of Georgia. By using this website, the user acknowledges the State’s copyright interests in the O.C.G.A. Neither the O.C.G.A. nor any portions thereof shall be reproduced in any form without written permission from the Georgia Code Revision Commission, except for: (1) fair use under the copyright laws of the United States; or (2) those limited portions that are in the public domain (statute text and numbering).
Use of this website and the downloading or copying of any material there from shall be subject to the Terms and Conditions of LexisNexis®, which is the official publisher of the O.C.G.A. [ . . .]
Then you click “OK – Close”, and you are taken to a LexisNexis website. Now you click “Terms and Conditions” to read contract #2, from LexisNexis, which is a four page Word document. Here are some excerpts:
You are hereby granted … the rights to use the Research Service on one single-user personal computer.
Is an iPad a personal computer? I’m not sure. And if you share your iPad with your significant other, don’t go reading the law on it!
[Y]ou may not, nor may you permit others to … copy all or any portion of the Research Service
You may not, nor may you permit others to … make any portion of the Research Service available to third parties through … the Internet
Read the law you may, but if you want to tell anyone what you read — be careful!
The technology used … in the Website … is a trade secret, and you will … not disclose that information or permit that information to be disclosed to any person or entity.
Did you just agree to an NDA in order to read the law?
You hereby represent and warrant that all use of the Research Service will comply with this Agreement and all federal, state and local laws, statutes, rules and regulations.
Surely you don’t plan to break any laws by reading the law, but what if you do? Not only do we not withhold the law from those convicted of crimes, we actually provide the law to prisoners to assist in their own defense. This paragraph, even though innocuous sounding, is antithetical to public access.
You’re probably thinking: that’s civil, not criminal. But online the distinction between civil and criminal is wiped away. Violations of website terms of service agreements like these can lead to federal felony charges, meaning possibly jail, under the Computer Fraud and Abuse Act. It happens. People are trying to fix this. So it is criminal too.
Forcing your citizens to enter into a contract with a private (and foreign) company in order to read the law, with the threat of a civil suit, criminal charges, a felony conviction, and jail for violating that contract, is not free access.
Note that Carl scanned the print edition, not the website, so the discussion in this section is not relevant to Carl’s particular situation. But one way to resolve the iPad question is to resolve the copyright question that Carl raised. There needs to be at least one avenue where the commission accepts free and unencumbered copying of the code, and right now each avenue is a problem (copyright in print and terms of service online). (Update: This paragraph added a few minutes post publication and revised a few times.)
This situation is complicated, but it’s not intractable. I’ve worked with the Council of the District of Columbia on addressing some of these issues for DC’s laws. I also recognize many challenges in actually resolving the situation in Georgia that I haven’t mentioned here. But the State of Georgia is acting in a manner that is most disrespectful toward the public.
For more on why public access to the law is important, see my previous post about the DC Code.
Finally, if this aggravates you at all, please support Carl Malamud’s work.
In Delaware, making a copy of Title 8 of the Delaware Code can result in a $500 fine and 3-month imprisonment, per 8 Del. C. 1953, § 397, h/t Carl again.
March 31st, 2014
Though I’ve lived in DC for getting on 4 years, I still feel lost in local DC issues. So in preparation for tomorrow’s primary, and with the help of some Code for DC members, I collected some of the endorsements from around the web.
Tuesdays election is a closed primary, which means voters get different ballots depending on which party they are registered as. The list of candidates running is in DCBOEE’s Election Guide. There are no contested offices in either the Republican or Libertarian primaries. The D.C. Statehood Green Party has one contested office. But DC is basically a one-party Democratic state, so all of the action is in the Democratic primary.
For five of the six contested offices, the endorsements from Greater Greater Washington, The Washington Post, DC for Democracy, and Jews United for Justice were all in agreement:
- For Council Chair, Phil Mendelson, the incumbent.
- For Council At-Large Member, Nate Bennett-Fleming, a challenger.
- For Ward 1 Council Member, Brianne Nadeau, a challenger.
- For Ward 5 Council Member, Kenyan McDuffie, the incumbent.
- For Ward 6 Council Member, Charles Allen. (The incumbent, Tommy Wells, is running for mayor, so this is an open seat.)
(Eleanor Holmes Norton is running unopposed for Delegate to the U.S. House and Mary Cheh is running unopposed for Ward 3 Council Member.)
There was disagreement on the mayoral candidates. In fact, many of the organizations couldn’t decide on an endorsement. The two organizations that weighed in opted for a different challenger:
- The Washington Post: Muriel Bowser
- Greater Greater Washington: Tommy Wells
Code for DC member Greg Bloom forwarded the endorsements of Janelle Treibitz, who he called “one of the sharpest local activists around.” Treibitz endorsed either Andy Shallal or Tommy Wells. Perhaps that’s a tie-breaker.
Keith Ivey, another Code for DC member, is the chair of DC for Democracy, a local all-volunteer grassroots progressive group focusing lately on campaign finance and ethics reform, on improving wages and conditions for workers, and on progressive taxation.
The Washington Post’s endorsements (and rationale) are here. Thanks to Brian Brotsos who sent this to me.
Jews United for Justice were mentioned by several Code for DC members. From their endorsements, their goals are “economic and social justice, high ethical standards, and a real chance of winning.”
Greater Greater Washington’s endorsements are here.
March 24th, 2014
Over the last year I’ve had the opportunity to work with the DC Council on improving public access to DC’s laws. Today I join DC officials and the OpenGov Foundation on the Kojo Nnamdi radio show here in DC to talk about it, and in preparation for that I wrote this up as some notes for myself.
Civic hacking is a term for creative, often technological approaches to solving problems in our civic lives. It’s often about improving our interaction with government, and so building an app to get more people to register to vote would be an example of civic hacking. You might be surprised that that’s what it means. “Hacking” is a homonym, it is a word that has multiple meanings. We’re all familiar with how it can mean cyber crime. But just like how words like mouse, gay, fluke each have totally unrelated meanings, hacking is like that. The two meanings of hacking each have their own distinct communities. In my hacking community, we have organizations like Code for America and Code for DC trying to solve problems.
Codification is the process of compiling many statutes into an orderly, compact rendition of the law as it is today. Codification of laws began in 6th Century BC Athens. It wasn’t civic hacking. It was elites trying to protect their property. The Visigothic Code, written in Western Europe around 650 AC, directed “bishops and priests” to give a copy of the Code to the Jews to educate them of their heresy. So it goes. Actually it wasn’t all bad. The Visigothic Code also set a maximum price that the Code itself could be sold for (four hundred solidi, maybe $100,000 or more today), which perhaps was a form of ensuring wider access to it. Modern open records laws began in 18th Century China, where public announcements of promotions and government spending were common. Sweden enacted the first law creating a right to government records in 1766. And lay citizens have indeed long been users of the law. According to Olson (1992), “Pennsylvanians annoyed with what they thought to be unfair practices on the part of flour inspectors in the 1760s confronted the inspectors with copies of the laws.” (more history in my book)
The most important reason governments make the law available to the public is that ignorance of the law is not an excuse, and without access to the law one cannot properly defend oneself in court. Governments have an ethical obligation to promulgate the law.
But that is by no means the only reason why promulgating the law is important and useful. As the Law.Gov authors wrote, there are these other reasons: Broader use of legal materials in education (e.g. to train better lawyers and better citizens with respect to how they interact with government) and in research (e.g. to better understand how government works so that we, as elected officials and advocates, can make our government operate better); “Innovation in the legal information market by reducing barriers to entry.”; “Savings in the government’s own cost of providing these materials”; Reducing the cost of legal compliance for small businesses; “Increased foreign trade by making it easier for our foreign partners to understand our laws.”
There are many dimensions to access. Access isn’t meaningful without understanding. There are a lot of reasons why one might not understand the law even if we have access to read the words. And that’s a hard problem. But it is not a reason to not provide access to it in the first place. Users of the law can’t learn how to understand it if they can’t see it, and it would be mighty paternalistic to write off any citizen as unable to learn how to understand it. We should promote understanding, but in the meanwhile we must still provide access.
An aspect of understanding is whether we are able to be taught by others, or, inversely, if we may only teach ourselves. Surprisingly, there are many reasons why it might be illegal to share the law with others to teach them about it. The two most common causes of this are website terms of service and copyright:
The only electronic source of the DC Code in early 2013 was a website run by the (foreign-owned) company Westlaw. Westlaw was under contract with DC to help with the actual codification process as well as providing electronic public access. But through its website’s terms of service agreement, anyone reading the law on the public website was granted access in exchange for giving up rights. The terms of service included: “[Y]ou will not reproduce, duplicate, copy, download, store, further transmit, disseminate, transfer, or otherwise exploit this website, or any portion hereof . . . [Y]ou will not use any robot, spider, other automatic software or device, or manual process to monitor or copy our website or the content, information, or services on this website.” (accessed Apr. 26, 2013
Reproducing the law, copying and pasting it into an email, is a crucial tool for being able to understand the law. Terms of service are contracts. Violating a contract could normally result in a law suit and a civil penalty, typically in proportion to the harm done. Violations of website terms of service agreements in particular though can be a felony and lead to jail time under the Computer Fraud and Abuse Act. Copying DC’s laws could lead to jail time. That’s not a good thing. And that problem exists in many other jurisdictions.
DC has solved this problem by making the Code available to the public without terms of service.
Copyright is also a problem. Some states assert that they have copyright over their laws. Georgia, Idaho, and Mississippi have demanded that the nonprofit Public.Resource.Org take down its electronic copies of the official laws of those states. (There is some disagreement over whether so-called annotations to the law are law or are copyrighted.) Public.Resource.Org is fighting a similar argument with nonprofit standards-writing bodies — i.e. the bodies that write public safety codes and building construction standards — because they claim copyright over standards that have been incorporated into law. Violations of copyright law come with stiff fines. There should be no copyright over law, and court cases have addressed this, but some states have taken a particularly narrow and short-sighted view on this.
DC has historically claimed copyright over the DC Code as well, but apparently in a defensive posture to prevent its contractors (West, Lexis) from claiming copyright over the law themselves. DC now solved this problem by making the Code available to the public with a copyright waiver called Creative Commons Zero (CC0). DC no longer claims copyright over the code. (I’ll note again that there are a number of court cases that say that edicts of government, i.e. the law, cannot be copyrighted. But no one wants to have to go to court to fight over this.)
Understanding of the law is magnified if we use tools. The Code of the District of Columbia has almost 20,000 sections. Search is crucial. So is good typography: think about access to the visually impaired, the older people among us, and anyone who doesn’t want to get a headache from the way the law is printed. For companies concerned about legal compliance, the ability to be alerted to how the law has changed — with “track changes”-style highlighting — is incredibly useful. So not only is access important, but electronic access is even more important so that we can use tools to help us understand it.
Lawyers, citizens, students, and other users of the law have different needs when it comes to reading it. Government bodies should create a website to provide public access to the law, but it is a shame if they provide the only access to the law. The law should be provided to the public in data formats that promote reuse so that the public — companies, citizen hackers, and so on — can build on it and create new public access websites that are tuned for a wider range of access. These websites might provide functionality that a government website could not, such as analysis (e.g. written by lawyers), inline definitions, links to related resources (for instance related laws in other jurisdictions), translations into other languages, email alerts to changes, and a place where citizens can propose changes to the law.
For background about how the DC Code got opened, see DC opens its “code”, embracing principles of open laws, my post from last April.
February 10th, 2014
There comes a time in every dataset’s life when it wants to become an API. That might be because of consumer demand or an executive order. How are you going to make a good one?
When is an API appropriate?
There are certain datasets that are so large or volatile that downloading the whole thing and/or keeping it up to date becomes burdensome. An API is one strategy to lower the barrier to entry. As Ben Balter wrote:
Go to any agency website, and chances are you’ll find at least one dataset sitting idly by because the barrier to consume it is too damn high. It doesn’t matter how hard stakeholders had to fight to get the data out the door or how valuable the dataset is, it’s never going to become the next GPS or weather data.
A web-based, read-only API is a tool that in some cases can make it easier for consumers to use your data.
To put this in context, I assume here the data is already available as a bulk data download. As I’ve written ad nauseum elsewhere (such as at http://opengovdata.io/maturity/), an API is almost never a starting point. Posting open data, bulk data, and structured data and using good identifiers all come first, and probably in that order, before an API becomes useful. You can’t make a good API without working through all that first, and all of that addresses important and common use cases that APIs do not. So I assume from here on that bulk data is available and that the steps to make the data good data have already been done. So…
The term “API” is vague. It’s often used as short-hand to mean a web-based method for programmable access to a system. But “API” is just a way of saying “protocol”. There were APIs before there was an Internet. Merely having an “API” doesn’t mean an actual use case has been solved: you can make a protocol without it being any useful.
What makes an API good?
Let’s take the common case where you have a relatively static, large dataset that you want to provide read-only access to. Here are 19 common attributes of good APIs for this situation. Thanks to Alan deLevie, Ben Balter, Eric Mill, Ed Summers, Joe Wicentowski, and Dave Caraway for some of these ideas.
Granular Access. If the user wanted the whole thing they’d download it in bulk, so an API must be good at providing access to the most granular level practical for data users (h/t Ben Balter for the wording on that). When the data comes from a table, this usually means the ability to read a small slice of it using filters, sorting, and paging (limit/offset), the ability to get a single row by identifying it with a persistent, unique identifier (usually a numeric ID), and the ability to select just which fields should be included in the result output (good for optimizing bandwidth in mobile apps, h/t Eric Mill). (But see “intents” below.)
Deep Filtering. An API should be good at needle-in-haystack problems. Full text search is hard to do, so an API that can do it relieves a big burden for developers — if your API has any big text fields. Filters that can span relations or cross tables (i.e. joins) can be very helpful as well. But don’t go overboard. (Again, see “intents” below.)
Typed Values. Response data should be typed. That means that whether a field’s value is an integer, text, list, floating-point number, dictionary, null, or date should be encoded as a part of the value itself. JSON and XML with XSD are good at this. CSV and plain XML, on the other hand, are totally untyped. Types must be strictly enforced. Columns must choose a data type and stick with it, no exceptions. When encoding other sorts of data as text, the values must all absolutely be valid according to the most narrow regular expression that you can make. Provide that regular expression to the API users in documentation.
Normalize Tables, Then Denormalize. Normalization is the process of removing redundancy from tables by making multiple tables. You should do that. Have lots of primary keys that link related tables together. But… then… denormalize. The bottleneck of most APIs isn’t disk space but speed. Queries over denormalized tables are much faster than writing queries with JOINs over multiple tables. It’s faster to get data if it’s all in one response than if the user has to issue multiple API calls (across multiple tables) to get it. You still have to normalize first, though. Denormalized data is hard to understand and hard to maintain.
Be RESTful, And More. ”REST” is a set of practices. There are whole books on this. Here it is in short. Every object named in the data (often that’s the rows of the table) gets its own URL. Hierarchical relationships in the data are turned into nice URL paths with slashes. Put the URLs of related resources in output too (HATEOAS, h/t Ed Summers). Use HTTP GET and normal query string processing (a=x&b=y) for filtering, sorting, and paging. The idea of REST is that these are patterns already familiar to developers, and reusing existing patterns — rather than making up entirely new ones — makes the API more understandable and reusable. Also, use HTTPS for everything (h/t Eric Mill), and provide the API’s status as an API itself possibly at the root URL of the API’s URL space (h/t Eric Mill again). Some more tips about the use of JSON in requests and responses, URL structures, and more, are in the Heroku HTTP API Guide.
Multiple Output Formats. Provide alternatives for the output format, commonly JSON, XML, and CSV, because different formats are best for different use cases. This is to the extent that you actually have users that want these formats. CSV is nice for researchers but not great for other developers; developers lately are moving away from XML toward JSON. See what formats your users want. A RESTful API (see above) will let the caller choose the output format by simply tacking a file extension to the end of the URL, or you can use content negotiation (h/t Dave Caraway).
Nice Errors. Error messages, either because of an invalid request from the user or a problem on the server side, should be clear and provided in a structured data format (e.g. JSON). A RESTful API (see above) additionally uses HTTP status codes where they apply, especially 200, 400, 404, and 500.
Turn Intents into URLs. An API won’t satisfy everyone’s use case, so pick the most important and make them dead-simple for the user. These use cases are also called “verbs” and “intents.” If a common use case is to get the latest entry added to the dataset, make an API called “/api/1/most-recent-entry.” Don’t make users add filtering, sorting, and paging to do common operations. It’s temping to build a kitchen-sink API that can do anything generically and nothing specifically, but it misses the point: As Ben Balter put it, ”APIs should absorb the complexities of using the data, not simply expose it in a machine-readable format.” Intents are also good for hiding implementation details, which gives you flexibility to make back-end changes in the future.
Documentation. This is incredibly important. An API without documentation is useless. Totally useless. Because no one will know how to use it. Documentation should cover why the dataset is important, what the data fields mean, how to use the API, and examples examples examples.
Client Libraries. Your users will be accessing your API through software. They’re going to have to write code. Provide re-usable, fully working, modular code for accessing the API in the most common languages that the developers will be using (usually Python, Ruby, and perhaps PHP). This code gives developers a head start, and since every developer will need to write the same basic API-accessing code you get a big win by taking care of writing it once for everyone. (h/t Alan deLevie)
Versioning. You will make changes to the API. Nothing is right the first time. Put a version into every API URL so that when it’s time for Version 2 you don’t disrupt the Version 1 users. The best version numbers are actually release dates. So your API URLs should look like: /api/2014-02-10/…. Using a date as a version can relieve anxiety around making updates. You could also version with an Accept header.
High Performance. Your API should be fast. And while users will appreciate it, the most important reason is for you. Slow APIs create a risk that your server will be overloaded with use too quickly. Some users will inadvertently (if not maliciously) issue extremely slow and resource-intensive queries if you make such queries possible, and if they issue a lot or if too many users make those queries your API can come down hard. Try not to have that possibility. If you need long-running queries, make it hard for users to inadvertently start them. In addition, query results should be cached on the server side by URL (i.e. don’t put authentication in the URL!) and cachable, in principle, on the client side if the user chooses to so that repeated accesses to exactly the same query are lightning-fast (e.g. use ETags).
High Availability. You don’t know when users will be using the API, so it needs to be available all the time. This is really hard. (It’s one reason bulk data is so much easier.) Basic precautions like rate limiting should be taken to reduce the risk that the API fails under high load. When updating the data behind the API, the API should never be left in a state where it provides incomplete answers. Maintenance windows should be short because they are incredibly disruptive to users, and notice should be posted ahead of time.
Know Your Users. Log what happens in your API and have some analytics so you can tell if anyone is using it and what they’re using it for, and whether the API is really addressing the use cases you want it to.
Know Your Committed Users More. Have a relationship with your committed users so you can alert them to upcoming maintenance and changes to the API, and so you can know who is making resource-intensive queries in case those queries get out of control. This is often done by having an API key (which is like a password for access — but it should be optional!! see the next section). Your system for issuing API keys should be automated and real-time so that developers don’t have to wait to get started. In the API, pass the API key in the HTTP authorization header (h/t Ed Summers). (Or consider another standard method of authorization like OAuth; h/t Ben Balter.)
Never Require Registration. Don’t have authentication on your API to keep people out! In fact, having a requirement of registration may contradict other guidelines (such as the 8 Principles of Open Government Data). If you do use an API key, make it optional. A non-authenticated tier lets developers quickly test the waters, and that is really important for getting developers in the door, and, again, it may be important for policy reasons as well. You can have a carrot to incentivize voluntary authentication: raise the rate limit for authenticated queries, for instance. (h/t Ben Balter)
Interactive Documentation. An API explorer is a web page that users can visit to learn how to build API queries and see results for test queries in real time. It’s an interactive browser tool, like interactive documentation. Lacking that, executable examples is a great alternative. Relatedly, an “explain mode” in queries, which instead of returning results says what the query was and how it would be processed, can help developers understand how to use the API (h/t Eric Mill). An API endpoint that gives a user his/her rate-limiting status is helpful too.
Developer Community. Life is hard. Coding is hard. The subject matter your data is about is probably very complex. Don’t make your API users wade into your API alone. Bring the users together, bring them to you, and sometimes go to them. Let them ask questions and report issues in a public place (such as github). You may find that users will answer other users’ questions. Wouldn’t that be great? Have a mailing list for longer questions and discussion about the future of the API. Gather case studies of how people are using the API and show them off to the other users. It’s not a requirement that the API owner participates heavily in the developer community — just having a hub is very helpful — but of course the more participation the better.
Create Virtuous Cycles. Create an environment around the API that make the data and API stronger. For instance, other individuals within your organization who need the data should go through the public API to the greatest extent possible. Those users are experts and will help you make a better API, once they realize they benefit from it too. Create a feedback loop around the data, meaning find a way for API users to submit reports of data errors and have a process to carry out data updates, if applicable and possible. Do this in the public as much as possible so that others see they can also join the virtuous cycle.
How do you build a good API?
Actually I don’t know yet, but here are some things that might be useful:
- API Umbrella, which is used at api.data.gov, provides API key management, rate limiting, and so on as a wrapper around an existing API. It was also some of the inspiration for starting this blog post.
- instant-api, by Waldo Jaquith, creates an API from static data.
- qu, by the CFPB, creates a platform for serving data