August 9th, 2014
Over on Mail-in-a-Box issue #129, @llchen223 noted that I hadn’t changed Dovecot’s default setting of imap_idle_notify_interval. This setting controls the duration of IMAP IDLE pauses during which the mail client waits patiently for a new mail notification.
Here’s how it looks with K-9 Mail (client) talking to Dovecot (server):
CLIENT> 9 IDLE
SERVER> + idling
SERVER> * OK Still here
. . . server waits imap_idle_notify_interval minutes . . .
SERVER> * OK Still here
. . . server waits imap_idle_notify_interval minutes . . .
SERVER> * OK Still here
The dance ends when the server reports something other than OK, such as the presence of new mail, or when the client decides to go back into normal IMAP command mode. Several of these exchanges may be happening simultaneously on different connections if the mail client is monitoring for new mail in more than one folder.
This is pretty efficient. The * OK Still here message is just 444 bytes (over SSL!).
But the concern is that with the default 2-minute delay, a phone may have to do something as many as 720 times a day (24 hours × 60 minutes ÷ 2), multiplied by the number of monitored folders. Does waking the phone’s networking capabilities drain the battery? If multiple connections are open to monitor multiple folders, Dovecot seems to smartly group the OKs together so that the phone is woken up just once, so at least there’s that.
None of this appears to have actually been a problem for my phone, and @llchen223 reports that having K-9 monitoring a mailbox has negligible impact on battery usage on his phone.
Peter Kieser suggested (three years ago) increasing imap_idle_notify_interval to be so long that the client checks in first (h/t @jkaberg). The IMAP IDLE standard (RFC 2177) advises clients to end and re-issue the IDLE at least every 29 minutes so that the server doesn’t log them off for inactivity, and so K-9 checks in after at most 24 minutes from the start of the IDLE by ending the IDLE (with DONE) and starting a new one (IDLE again). If imap_idle_notify_interval is more than 24 minutes, * OK Still here will never occur (I think).
This sounds great, but the longer the connection remains totally silent the higher the risk that some intermediate point on the connection will decide the connection is broken and reset it. In recording these sessions, I ran into socat’s timeout of 15 minutes.
Not all mail clients are as efficient as K-9. Mozilla Thunderbird restarts the IDLE after each server OK:
CLIENT> 7 IDLE
SERVER> + idling
. . . both sides wait imap_idle_notify_interval minutes . . .
SERVER> * OK Still here
SERVER> 7 OK Idle completed.
CLIENT> 8 noop
SERVER> 8 OK NOOP completed.
Immediately after the NOOP at the end the client issues a new IDLE command and the pattern repeats. It only looks like 93 bytes, but inside an SSL connection it takes 1,292 bytes. (That’s both sides of the connection.) With the default Dovecot setting of a 2 minute delay, that’s a little less than 1 MB per day (× the number of monitored folders).
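As a rough check on that arithmetic, here is a small Python sketch. The byte counts are the ones I measured above; the function names are just for illustration, and the assumption is one exchange per interval per monitored folder.

```python
MINUTES_PER_DAY = 24 * 60

def exchanges_per_day(interval_minutes):
    """How many times per day the server's keep-alive fires for one folder."""
    return MINUTES_PER_DAY // interval_minutes

def bytes_per_day(bytes_per_exchange, interval_minutes, folders=1):
    """Encrypted bytes per day, assuming one exchange per interval per folder."""
    return bytes_per_exchange * exchanges_per_day(interval_minutes) * folders

print(exchanges_per_day(2))    # 720 wake-ups a day at the 2-minute default
print(bytes_per_day(1292, 2))  # 930240 bytes: Thunderbird, a little under 1 MB
print(bytes_per_day(1292, 4))  # 465120 bytes: Thunderbird at a 4-minute interval
print(bytes_per_day(444, 2))   # 319680 bytes: K-9's "* OK Still here" alone
```

Doubling the interval halves both the wake-ups and the traffic, which is all the proposed change from 2 to 4 minutes would buy.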
The default settings of Dovecot and K-9 seem to be just fine both in terms of bandwidth and battery usage, and there’s no indication that increasing the interval will benefit phones running K-9. Mozilla Thunderbird is not as good at minimizing bandwidth, but I don’t expect many people are running Mozilla Thunderbird on a mobile broadband connection. Maybe an increase of imap_idle_notify_interval from 2 to 4 minutes would be prudent.
I recorded the session with socat by having it act as an OpenSSL server (on a new port, here 992) to terminate the encryption, log the unencrypted IMAP stream, and proxy the IMAP traffic to a new SSL connection to the Dovecot IMAP server (port 993):
socat -x \
This outputs the IMAP stream in hex, which happens to be a little cleaner than outputting the ASCII stream. socat’s default socket timeout was about 15 minutes, so I’ve also extended it to be able to handle the 24-minute IDLE length.
To measure the size of the encrypted traffic (including link-level headers and so on), I used tcpdump to monitor port 992:
tcpdump -q -nn -e -ttttt -U "port 992"
I’m using Dovecot 2.2.9, K-9 4.804, and Mozilla Thunderbird 31.0. The exact sizes of the encrypted IDLE-related messages probably depend on which protocol and ciphers happen to be selected for the connection, so socat will be affecting those measurements.
(While recording these sessions, I noticed that K-9 would also check the Drafts folder every 90 seconds if no Drafts folder exists. As soon as the first draft was saved, causing K-9 to create the folder, this poll stopped. So I’ll have to revise Mail-in-a-Box to create the Drafts folder by default.)
July 21st, 2014
Earlier today DC’s mayor issued a Transparency, Open Government and Open Data Directive (readable thanks to Alex Howard here). Much of it was adapted from the White House’s open government memoranda, including those memoranda’s faults.
There are many things to like about the Directive, including the mention of a potential new Chief Data Officer position, the use of open formats, and the goal of promoting reuse. The framing in terms of transparency, participation, and collaboration (lifted from Obama’s 2009 open government memo and adopted in the Mayor’s 2011 memorandum on transparency and open government) is good. (Though not great. The White House never managed to actually execute the collaboration part.)
But much of it is also undercut by a new notion of conditional access to government data that is becoming the norm.
Having their cake and eating it too
What I mean is that while the directive explicitly and clearly states that there will be
no restrictions on copying, publishing, further distributing, modifying or using the data [in DC's data catalog]
it simultaneously explicitly describes a number of restrictions that there will or may be on use of the data. (It’s clear DC copied language, mistakes included, from the White House’s 2013 open data memo (“M-13-13”), which I’ve blogged about before here and here.)
“No restrictions” is what we want. It is, by community consensus, a core and defining quality of open government data.
If there are capricious rules around the reuse of it, it’s not open government data. Period. Restrictions serve only to create a legal lever by which the government can put pressure on things they don’t like. Imagine if the DC government took legal action against Greater Greater Washington to stop an unflattering story on the basis that GGW didn’t properly cite the DC government for the data used in a story. This is what the future of open data in DC looks like when there are restrictions on reuse.
Okay so specifically:
“Open license” does not mean “no restrictions”
So first it says that the data catalog will accomplish this goal of “no restrictions” by making the data available through an “open license.” The usual meaning of open license does not mean “no restrictions,” however. Most open licenses, including open source licenses and Creative Commons licenses, only grant some privileges but not others. Often privileges come along with new requirements, such as GPL’s virality clause, or the restriction that users must attribute the work to the author. Under the Open Definition, “open” means reusable but potentially subject to certain terms.
In guidance I co-wrote with Eric Mill, Jonathan Gray, and others called Best-Practices Language for Making Data “License-Free”, we addressed what governments should do if they really want to create “no restrictions.” They should use CC0, a copyright waiver. This is really the only way to achieve “no restrictions.”
(This was one of the confusions in M-13-13 as well. It’s clear the directive took the open licensing language from M-13-13.)
“Open license” presumes the work is copyrighted
Facts cannot be copyrighted. To the extent that DC’s data catalog contains facts about the District, about government operations, and so on, the data files in the catalog are likely not subject to copyright protections. (What is and isn’t copyrightable is murky.) Open licensing, as normally understood, presumes the work is copyrighted. If the work isn’t copyrighted, an open license simply doesn’t apply. You can’t license what you don’t own.
(This was another one of the confusions in M-13-13. But unlike the federal government, the DC government probably can copyright things it produces. But probably not data files.)
Data users must agree to a contract first
(This was not a problem in M-13-13. The License-Free Best Practices did address this though.)
Attribution and explanation requirements
The directive also gives us a clue about what else will be in the Terms of Service:
Nothing in this Order shall be deemed to prohibit OCTO or any agency … from adopting or implementing measures necessary or appropriate to . . . (v) require a third party providing the District’s public data (or applications based on public data) to the public to explicitly identify the source and version of the public dataset, and describe any modifications made to the public dataset.
This is an attribution requirement, plus a requirement for data users to explain themselves.
To be sure, and as Alex Howard called me out on Twitter, these are hypotheticals that the directive leaves open and not something the directive is mandating. But the fact that these are mentioned strongly suggests that OCTO or other agencies want to enforce these sorts of terms and will if they can.
And, as you might guess I would say, requirements to attribute the government for data and to explain what you did with data are restrictions on use, which like the others create a lever by which the DC government might put pressure on things it doesn’t like.
(This was also a problem in M-13-13, but in this case it doesn’t appear that the DC directive specifically copied the problem from M-13-13.)
There is a strong American tradition, or at least a core American value, that the government does not get in the way of the dissemination of ideas. We don’t always live up to that ideal, but we strive for it. Access to information about the government that comes with restrictions on what we can say when we use it (e.g. attribution & explanation), a waiver of rights or a commitment to indemnify, etc. are all anathema to accountability and transparency and respect for the public.
If and when these new terms go up, I will encourage users to FOIA for the same information rather than get it from the DC data catalog.
July 21st, 2014
ProPublica reported this morning that WhiteHouse.gov is, albeit accidentally, using a new method for tracking individual visitors to the website. This reminded me that for the last 6 months I’ve been tracking the White House.
On Jan 17 the President made his first major speech regarding reforms to the NSA’s massive surveillance programs revealed last year. I thought that morning that he would announce new mandatory data retention policies for internet and telephone service providers. He didn’t. But by the time the speech began I had already started tracking the White House.
About 5% of traffic to my website GovTrack.us comes from the government. Most IP addresses are tied to the major broadband providers like Verizon, Comcast, and so on. But some government IP addresses come from special IP address blocks labeled specifically for the office that reserved them.
Three blocks were of interest to me: the blocks for the Executive Office of the President (“EOP”, about 60 page views on GovTrack per day), the United States Senate (about 300 page views/day), and the House of Representatives (about 600 page views/day). I don’t know where the computers are that have these IP addresses, but I expect that EOP IP addresses would include the White House, the West Wing, and perhaps the Eisenhower Executive Office Building (more on what the EOP is). The House and Senate IP addresses are used in the Capitol and the seven congressional office buildings, including in non-political offices and the guest WiFi network, to the best of my knowledge.
On Jan 17 I began uniquely identifying the users of these IP blocks by placing a persistent cookie with a unique identifier in their web browsers when they visited GovTrack and logging each of those page views. Persistent cookies get lost when users clear their browser cookies, but it’s a useful first approximation to identifying users.
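The mechanics are simple. Here is a minimal Python sketch of the idea, not GovTrack’s actual code: the IP blocks below are reserved documentation ranges standing in for the real ones, and the cookie name, lifetime, and function names are all made up for illustration.

```python
import ipaddress
import uuid
from http.cookies import SimpleCookie

# Placeholder blocks (RFC 5737 documentation ranges), NOT the real government ranges.
TRACKED_BLOCKS = {
    "EOP": ipaddress.ip_network("192.0.2.0/24"),
    "Senate": ipaddress.ip_network("198.51.100.0/24"),
    "House": ipaddress.ip_network("203.0.113.0/24"),
}

def network_for(ip):
    """Return the name of the tracked block containing this IP, or None."""
    addr = ipaddress.ip_address(ip)
    for name, block in TRACKED_BLOCKS.items():
        if addr in block:
            return name
    return None

def track(ip, cookie_header, path, log):
    """Log a page view from a tracked block.

    Returns a Set-Cookie value when a new identifier had to be issued,
    otherwise None. Visitors outside the tracked blocks are ignored.
    """
    network = network_for(ip)
    if network is None:
        return None
    cookies = SimpleCookie(cookie_header or "")
    set_cookie = None
    if "visitor_id" in cookies:
        visitor_id = cookies["visitor_id"].value
    else:
        visitor_id = uuid.uuid4().hex  # random, unique per browser
        out = SimpleCookie()
        out["visitor_id"] = visitor_id
        out["visitor_id"]["max-age"] = 365 * 24 * 60 * 60  # persist for a year
        out["visitor_id"]["path"] = "/"
        set_cookie = out["visitor_id"].OutputString()
    log.append((network, visitor_id, path))
    return set_cookie

page_views = []
header = track("192.0.2.10", None, "/congress/bills", page_views)
print(header is not None, page_views[0][0])
```

Counting unique cookies is then just counting distinct identifiers in the log, and a cookie that shows up under more than one network name is a browser that moved between buildings.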
So far, 324,705 hits on GovTrack have been logged from 19,131 unique tracking cookies:
(Interestingly, 373 unique cookies appeared on more than one of the three networks — probably a laptop that moved from one building to another.)
The longest recorded session is for one tracking cookie on the House network that made 1,590 page views almost all on Feb 21, but also in January and March, to pages for various representatives. My guess is this was a lobbyist on the guest wifi doing research before a meeting.
From the EOP, the longest recorded session is 901 page views between March 18 and July 18. This user mostly looked at my congressional district maps and a few bills on a variety of subjects. There was no discernible pattern to it, except that this person is probably responsible for looking up the congressional districts of people. Maybe the person processes incoming mail to the President.
This is all I’ll look into right now, but I may post more about it if I find anything interesting.
I’d be glad to share the data on request.
July 4th, 2014
We the People, one of the White House’s cornerstone open government initiatives, is 10% a sham. The site promises to respond to petitions posted by users if the petition reaches a certain number of signatures within 30 days. Nextgov reported earlier in the year that the White House was not keeping its end of the pledge. It’s true. More than 10% of petitions that deserve an answer go unanswered.
The threshold for a White House response has gone up steadily as the popularity of We The People has increased. In 2012 the threshold was 25,000 signatures within 30 days, and now the threshold is 100,000. Except when it isn’t. Of the 217 petitions that gathered enough signatures for a response, 29 have gone unanswered.
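The numbers above work out as follows. This is only a quick Python check; which threshold applied to each example petition is my reading of the dates and thresholds mentioned in this post, and the function name is illustrative.

```python
eligible = 217    # petitions that gathered enough signatures for a response
unanswered = 29   # of those, still waiting

share = unanswered / eligible
print(f"{share:.1%} of eligible petitions are unanswered")

def met_threshold(signatures_in_30_days, threshold):
    """True if the petition earned a response under the given threshold."""
    return signatures_in_30_days >= threshold

# (petition, signatures in first 30 days, threshold assumed to apply at the time)
examples = [
    ("09/23/11 GMO labeling", 8_747, 5_000),
    ("04/12/12 GMO labeling", 30_740, 25_000),
    ("07/07/13 Muslim Brotherhood", 181_479, 100_000),
]
for name, signatures, threshold in examples:
    print(name, met_threshold(signatures, threshold))
```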
The full list of petitions that the White House owes a response to is below.
Two of those petitions hit close to home for the open government movement. In early 2013, Aaron Swartz, an early leader of the open government movement, committed suicide while under investigation for downloading research papers without permission. Two petitions (1, 2) were submitted shortly after his death calling for the firing of attorneys believed to be over-zealous in their prosecution of Swartz. The petitions each gathered more than the then-threshold of 25,000 signatures, but more than one year and 89,733 signatures later there has been no response.
The earliest of the petitions still “pending response,” according to the White House’s own data, is a petition which has now reached ten times the (early) 5,000 signature threshold. It is a petition about GMO food labeling. Another petition, at 211,925 signatures today (181,479 in its first 30 days), was created in July 2013 and asked the White House to declare the Muslim Brotherhood party in Egypt a terrorist organization. It, too, remains unanswered.
The White House has posted 156 responses to 225 petitions since the site launched in September 2011. 2,916 petitions have been created in all. The most successful petition, by number of signatures, was one created in December 2012 asking the White House to recognize the Westboro Baptist Church as a hate group. It gathered 367,180 signatures and in July 2013 received a response to the effect that the White House does not maintain a list of hate groups — demonstrating that, of course, getting a response does not mean getting the response the petitioners wanted.
h/t and thanks to @konklone for mentioning this to me a long long while ago.
Okay here’s the list, in order of number of signatures gathered. The date before each is the date the petition was created, so it met its threshold within the 30 days that followed.
07/07/13: Declare Muslim Brotherhood organization as a terrorist group (211,925 signatures; 181,479 in 30 days)
06/09/13: Pardon Edward Snowden (161,395 signatures; 129,312 in 30 days)
05/03/13: Invest and deport Jasmine Sun who was the main suspect of a famous Thallium poison murder case (victim:Zhu Lin) in China (151,169 signatures; 148,285 in 30 days)
05/13/14: put sanctions on China for invading Vietnam territory with the deployment of oil rig Haiyang 981. (139,216 signatures; 138,878 in 30 days)
06/05/13: allow Tesla Motors to sell directly to consumers in all 50 states. (138,379 signatures; 110,384 in 30 days)
12/11/13: Remove offensive state in Glendale, CA public park (129,170 signatures; 123,629 in 30 days)
05/01/14: Demand Release of U.S.M.C. Sgt. Tahmooressi Suffering with PTSD from Mexico Imprisonment (128,770 signatures; 115,889 in 30 days)
04/25/14: Urge S. Korean Government & Press to Stop the Attack Against Church in the Aftermath of Ferry Tragedy (119,369 signatures; 117,749 in 30 days)
08/22/13: Stop SOPA 2013 (118,905 signatures; 106,486 in 30 days)
05/15/13: Provide necessary assistance to prevent Taiwanese people from being murdered by Philippines and rebuild friendship. (115,676 signatures; 113,330 in 30 days)
02/17/14: Stop SOPA 2014. (112,293 signatures; 104,184 in 30 days)
11/12/13: Reform ECPA: Tell the Government to Get a Warrant (112,087 signatures; 105,236 in 30 days)
04/12/12: Support mandatory labeling of genetically engineered foods (GMOs). (110,784 signatures; 30,740 in 30 days)
02/26/14: Allow Ukrainian Citizens 90 day entrance into the USA on passport, without Visa. (107,909 signatures; 103,037 in 30 days)
01/04/14: Please Protect The Peace Monument in Glendale Central Library (106,751 signatures; 105,390 in 30 days)
02/27/14: Urge the FDA to Say YES to Accelerated Approval for safe, effective therapies for children with Duchenne. (106,734 signatures; 105,036 in 30 days)
04/23/14: Designate Russia as “State Sponsor of Terrorism” (104,914 signatures; 103,722 in 30 days)
03/21/14: Legally Recognize Non-Binary Genders (103,166 signatures; 101,494 in 30 days)
09/23/11: Require all Genetically Modified Foods to be labeled as such. (64,311 signatures; 8,747 in 30 days)
01/12/13: Remove United States District Attorney Carmen Ortiz from office for overreach in the case of Aaron Swartz. (60,881 signatures; 52,466 in 30 days)
05/10/12: Remove the monument and not to support any international harassment related to this issue against the people of Japan. (47,477 signatures; 31,473 in 30 days)
06/21/12: Repeal the House of Representatives Resolution 121 to stop aggravating int’l harassment by Korean propaganda & lies! (46,012 signatures; 27,623 in 30 days)
09/01/12: Persuade South Korea (the ROK) to accept Japan’s proposal on territorial dispute over islets. (42,015 signatures; 30,213 in 30 days)
12/28/12: To award the Medal of Freedom to the 4 Firefighters who were ambushed in West Webster New York on Christmas Eve 2012 (34,067 signatures; 29,322 in 30 days)
12/02/12: Investigate and publicly condemn organ harvesting from Falun Gong believers in China (33,733 signatures; 28,624 in 30 days)
01/08/13: Invite Neal Boortz, the author of The FairTax Book, to spend one hour talking with the President about tax reform. (32,155 signatures; 28,191 in 30 days)
12/11/12: oppose the petition created by “Hisa A” on Japan’s proposal to take Japan’s claim over Dokdo (or Takeshima) to the ICJ. (31,609 signatures; 28,959 in 30 days)
01/12/13: Fire Assistant U.S. Attorney Steve Heymann. (28,854 signatures; 25,717 in 30 days)
12/29/12: There are election rigging made by Progressive Program that have been used in the 18th Presidential Election of S. KOREA (26,797 signatures; 25,467 in 30 days)
June 29th, 2014
In a Facebook study published this week, Facebook manipulated many of their users’ News Feeds by omitting 0-90% of posts containing either positive or negative content over the course of a week in 2012. They reported that those users wrote fewer positive and negative words (respectively) in their own posts, concluding that Facebook is a medium on which emotions spread, a case of “emotional contagion,” to use their technical term.
Here’s what you need to know:
On average, no emotion actually spread
The number of positive words in their average user’s posts decreased from 6 words to… 6 words.
The first major omission in the study is the lack of individual-level statistics. While they reported aggregate numbers such as having analyzed “over 3 million posts” totaling “122 million words” made by their “N = 689,003” users, and the study’s implication for “hundreds of thousands of emotion expressions,” they omitted any discussion of whether and how individuals were affected in any meaningful way.
From their numbers, the average user wrote 4.5-5 posts totaling 177 words during the experimental week. Only 3.6% of those words (so about 6 words) were “emotional,” and they found that by omitting about half of emotional posts from users’ News Feeds that percentage would go down by 0.1% or less. A 0.1% change is about 2/10ths of a word.
For most of their users, there was not even close to a measurable effect.
(The study did mention the Cohen’s d statistic of 0.02, which is another way to say that there was an aggregate effect but basically no individual-level effect.)
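The per-user figures follow directly from the study’s own totals. Here is the back-of-the-envelope arithmetic in Python; it assumes the 0.1% means 0.1 percentage points of all words written, which is how I read it, and the variable names are mine.

```python
total_words = 122_000_000   # "122 million words"
users = 689_003             # "N = 689,003"

words_per_user = total_words / users   # words written in the experimental week
emotional_share = 0.036                # 3.6% of words were "emotional"
shift = 0.001                          # the reported change, 0.1 points or less

emotional_words = words_per_user * emotional_share
shift_in_words = words_per_user * shift

print(round(words_per_user))       # 177
print(round(emotional_words))      # 6
print(round(shift_in_words, 1))    # 0.2
```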
The study has no test for construct validity (was it about emotions at all?)
An important part of every study is checking that what you’re measuring actually relates to the phenomenon you’re interested in. This is called construct validity. The authors of the Facebook study boasted that they didn’t think of this.
The paper quixotically mentions that “no text was seen by the researchers” in order to comply with Facebook’s agreement with its users about how it will use their data.
They didn’t look at all?
That’s kind of a problem. How do you perform a study on 122 million words and not look at any of them?
Are the posts even original, expressive content? The users might be sharing posts less (sharing is sort of like retweeting) or referring less to the emotional states of friends (“John sounds sad!”). The words in a post may reflect the emotions of someone besides the poster!
To classify words as “positive” or “negative” the study consulted a pre-existing list of positive and negative words used throughout these sorts of social science research studies. This comes with some limitations: sarcasm, quotation, or even simple negation completely cut the legs out from under this approach. I actually think in aggregate these problems tend to go away, but only when you have a large effect size.
The whole of Facebook’s reported effect on emotion could be due to one of the many limitations of using word lists as a proxy for emotion. They needed to demonstrate it wasn’t.
This study is not reproducible. While most research isn’t ever reproduced, that it could be provides a check against the fabrication of results (and sometimes that’s how fabricators are caught). Facebook provides the only access to a network of this size and shape. It is unlikely they would provide access to research that might discredit the study.
The study also uses a strange analysis. Their experimental design was 2 X 9-ish (control or experiment X 10-90% of posts hidden), but they plugged the two variables into their linear regression in two ways. The first became a binary (“dummy”) variable in the regression, which is right, but the second became a weight on the data points rather than a predictor. That’s an odd choice. Do the results come out differently if the percentage of posts hidden is properly included in the regression model? Did they choose the analysis that gave the results they wanted to see? (This is why I say “about half of emotional posts” above, since the analysis is over a weighted range.)
Finally, there’s the problem of informed consent. It is unethical to run experiments on people without it. The paper addresses legal consent, in the sense that the users agreed to various things as a pre-condition for using Facebook. Though being manipulated was probably not one of them (I don’t know what Facebook’s terms of service were in early 2012 unfortunately).
Certainly the consent didn’t reach the level of informed consent, in which participants have a cogent sense of what is at stake. There’s a great discussion of this at Slate by Katy Waldman.
Facebook’s users have a right to be outraged over this.
Keep in mind though that there are different ethical obligations for research versus developing a product. It could be ethical for Facebook to manipulate News Feeds to figure out how to increase engagement while at the same time being unethical for a research journal to publish a paper about it.
June 20th, 2014
Last week I noticed that the sunset aligned unusually well with my cross-street, Newton St NW, and it made me wonder if we have any Manhattanhenge-like events in DC. DC can one-up Manhattan — we’ve got a double-henge, if you’ll let me coin a phrase.
Here in Columbia Heights we have a unique street pattern. Two roads — Park Rd and Monroe St. — come to an apex on 14th St. They go north both to the east and west of 14th St. On a few days a year — centered on May 15 and July 29 — the roads point east toward sunrise and west toward sunset. Click the links to see on suncalc.net. (The alignment isn’t exact, so the effect spans a few days.)
All the henges
Like Manhattan, DC’s grid lines up with sunrise & sunset. It’s on the equinoxes, so we get a boring double-henge on those days too.
Some of the state avenues are kind of close to the solar azimuths on the solstices, but the peak days are a few days off. In the summer it is on the same days as the Columbia Heights Doublehenge. On those days the avenues parallel to New York Avenue line up with sunrise and the avenues parallel to Pennsylvania Avenue line up with sunset. Around the winter solstice — Nov 5 and Feb 6 — the avenues parallel to Pennsylvania Avenue line up with sunrise and the avenues parallel to New York Avenue line up with sunset.
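Whether a road lines up with the sun comes down to comparing the road’s compass bearing with the solar azimuth at sunrise or sunset. Here is a minimal Python sketch of that comparison. The bearings are made-up placeholders rather than surveyed values for any real DC road, and the azimuths would have to come from something like suncalc.net.

```python
def misalignment(street_bearing, sun_azimuth):
    """Smallest angle, in degrees, between a street and a solar azimuth.

    Both are measured clockwise from north. A street runs both ways,
    so bearings 180 degrees apart count as the same line.
    """
    return abs((street_bearing - sun_azimuth + 90) % 180 - 90)

def best_aligned(streets, sun_azimuth):
    """Name of the street whose line is closest to the given azimuth."""
    return min(streets, key=lambda name: misalignment(streets[name], sun_azimuth))

# Hypothetical bearings for illustration only.
streets = {
    "east-west grid street": 90.0,
    "hypothetical avenue A": 66.0,
    "hypothetical avenue B": 113.0,
}

# On the equinoxes the sun rises due east (90) and sets due west (270).
print(best_aligned(streets, 90.0))
print(best_aligned(streets, 270.0))
```

Run over every day of the year and every road, the same comparison picks out the best-aligned road for each sunrise and sunset.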
I wondered which DC road best aligns with sunrise and sunset on each day of the year. If you’re driving, these would also be the roads to avoid (h/t @knowtheory). Here’s a table for the next year. The links will show you where exactly it is:
June 8th, 2014
Last month the Administration posted perhaps the first github pull request to change federal policy. One WIRED writer was quick to call success, writing here, “By opening up the revisions and the discussions behind them, the White House is making its thinking clear.” But no.
It’s easy to be fooled into believing that a new medium also signals new substance.
No substantive policy change in this pull request
The pull request is a proposed change to a federal memorandum on github regarding open data. The change clarifies when agencies should openly license their data. The memorandum originally said that federal agencies should always use open licensing. But as I pointed out when the memorandum was issued a year ago, that’s not legally possible. Most federal data is not subject to copyright in the first place, and works that are in the public domain can’t be licensed.
The proposed update to the memorandum fixes the Administration’s mistake by adding at the top:
“In instances where government data . . . does not fall squarely within the public domain . . .”
clarifying that open licensing should only be used where copyright applies. Mainly that means when the data was produced by a government contractor. There is no substantive change made in this pull request though. It clarifies the only sensible meaning the original memorandum actually had.
Omits discussion of the substantive issues
If this were the only issue in the paragraph being edited, then I too would call it success. But late last year 14 organizations backed a statement supporting the public domain for government data — not open licensing — and several of us who wrote the letter met with the Administration about the issue. The absence of any mention of the substantive issue in that paragraph should be a red flag for thinking the pull request represents open dialog.
The substantive issue is that the policy condones the copyrighting of any government data, much of which might be used to create or enforce government policy. That’s a serious First Amendment concern. It means that even if journalists can get a hold of some data, they might only be able to share it on terms set by a government agency or even a government contractor. As a broad government policy, the notion of copyrighting government data is ridiculous and flies in the face of our country’s traditions and values. (Note: Forget national security, privacy, etc. This could be data about any mundane policy.)
The pull request omits discussion of this issue, as well as other issues that I and others have discussed with the Administration (as I noted in my reply to the pull request).
Where was the dialog?
There was dialog on these issues, but it wasn’t on github. It was in private in-person meetings, as these things usually are. I and others met with Administration staff in private meetings in August 2013, April 2014, and May 2014. Our discussions each time were thoughtful and productive.
There was plenty of good dialog, but it wasn’t online. I first raised the licensing issue on github a year ago in issues #5 and #64, to which the Administration replied only that they would look into it. The issue was picked up again in issue #257, but again there was no participation in the github issue by the Administration. (There is a lot of dialog in that github repository, but it is about data standards and not policy, and most of the participants in those discussions are government employees or contractors (including myself, in those conversations), which is a good thing, but not the subject of the WIRED article.)
The pull request posted last month represents the end of a year-long process in which discussions were taking place off-line, and proof that even with github most dialog will still continue to take place off-line.
Lest journalists get confused let’s just be clear that there wasn’t any discussion of substance on github. It was elsewhere, off-line, like normal.
Now I’m just going to be a jerk and red-line the WIRED article because it got a lot of details wrong:
This White House GitHub Experiment Could Help Fix Government
BY ROBERT MCMILLAN
While many of our nation’s problems are quite clear, the way our government addresses them is too often a black box—opaque and closed to all but insiders and lobbyists.
But the White House has taken a remarkable–if small–step toward bringing greater transparency to the legislative process. (“legislative” refers to the legislative branch of government, i.e. Congress. This is an executive-branch memo and thus not related to the legislative process.) For the first time, it has used the GitHub social coding website as a forum for discussing and ultimately changing government policy. With one GitHub “pull request,” it modified (The document has not yet been modified.) the Project Open Data policy document, which spells out how government agencies are supposed to open up access to their data. This represents the fusion of open source software and government policy that open-government advocates have long predicted (#notalladvocates predict this). And it might be a sign of things to come as others—the city of San Francisco, and the New York state senate, to name a couple—bring collaborative government into the light.
‘We’re taking a well-known page from the open source playbook: that developing policy in an open and iterative way will create a stronger, more effective product.’
Late last week, Haley Van Dyck at the Office of Management and Budget submitted a pull request that suggested small changes to Project Open data that clarify how agencies think about open source and public domain software (The memo does not cover software. It is about data.). Pull requests are a Silicon Valley innovation. They’re typically used by software developers on GitHub to suggest and discuss changes to code. But they’re also a good tool for tracking changes to complex legal documents, even government regulations.
While Van Dyck’s changes weren’t big, it’s important that these issues were raised and addressed in a public forum where anyone can suggest language for the policy document. (Anyone can, but no one did. The pull request was submitted by the Administration to the Administration’s own document. Let’s wait until they accept a pull request submitted by the public to a policy document.) “We’re taking a well-known page from the open source playbook: that developing policy in an open and iterative way will create a stronger, more effective product. The more we can involve the community, the better that product will be,” said Van Dyck—a senior adviser to the U.S. Chief Information Officer—in an email to WIRED.
The White House will wait a few weeks to review comments to the pull requests, but then Van Dyck’s changes become official government policy with the push of a button. This is open source government: The tonic that could cure the back-room deal. (Most government policy-making involves public comments, review periods, and pushing a button to upload the final policy to the Internet. There is absolutely nothing more open-source about this than the usual agency rule-making process.)
By opening up the revisions (there is no policy-making in our government that doesn’t involve posting revisions) and the discussions behind them (as I mentioned, there was no discussion on GitHub), the White House is making its thinking clear, and there’s an added bonus: The changes are easier to read and understand. Compare Van Dyck’s revisions here, to Rep. Lou Barletta’s proposed changes to existing law in his Emergency Unemployment Compensation Extension Act of 2014. In the GitHub document, you can see the old text struck-through in red and the new additions in green. Congressional bills like Barletta’s, on the other hand, read like uncompiled source code, detailing all the changes to be made but giving the reader no idea what the finished product will look like. (That’s not what uncompiled source code looks like. And ‘compiled’ source code certainly looks no better.)
That makes some bills unreadable, as far as the average citizen is concerned. (This isn’t an apples-to-apples comparison. Modifying 200-year-old statutory law is going to be harder for the “average citizen” to read than modifying a memo written last year.) “The thing that is actually voted on is the edits,” says Ben Balter, GitHub’s government evangelist. He has been working with the feds for years, convincing them to use more open-source software and adopt more of an open-source attitude. “The open government community has been talking about doing stuff like this, but it’s never reached fruition because there weren’t enough stakeholders in government.”
That’s begun to change, Balter says. He says he’s spending more time explaining to federal employees how they can use open source tools and methods. Two years ago, he was still convincing them to give open-source a shot. Now he’s watching the White House merge pull requests.
June 1st, 2014
The Department of Education has an RFI due tomorrow on the Use of APIs in Higher Education Data and Student Aid Processes. I submitted the following response:
The RFI asks how APIs for higher education data and programs can achieve policy goals of the Administration. As an expert on open government data, I am submitting this comment to address when APIs are useful.
Modern methods of information dissemination and service delivery recognize the long-standing role of mediators in facilitating citizen-government transactions. The media, educational institutions, and many others have long played a crucial role in helping citizens make use of information about higher education produced by the government and enroll in government services. The function of electronic standards for information dissemination and service delivery is to make mediation more efficient and therefore able to reach a wider audience. These new methods are a force multiplier for policy objectives.
Do Open Data First
An API is one of the two modern methods of information dissemination and service delivery specifically sought after by the Administration. Besides building APIs, creating open data — also called bulk, raw, and structured data — is also now an Administration goal as outlined in the White House’s Memorandum on Open Data (M-13-13).
It is important to understand when open data or an API is the right technology for a particular database or service.
Open data, when possible, is always both less costly to implement and more powerful than a “read API”. Here is a summary of why:
* Open data is static but APIs are dynamic. That means that APIs require long-term maintenance to ensure that the API remains continuously and indefinitely available. Open data is released once and updated periodically as needed.
* Open data provides a complete database but APIs provide only a small window into the data. That means that while open data can be used to build any application, an API can only be used to build applications that require a small amount of the data at a time.
* A *good* API requires that the agency do everything that good open data requires plus much more, including the creation of RESTful services, building around use cases, and creating “client libraries”.
A “read API” must do everything that open data does, plus much more. Therefore agencies should walk before they run. Build good open data first, validate that it meets the needs of users, learn how to do that well, and only after validation and learning invest in building an API to address additional use cases.
Open data should always be available before a “read API” is considered. The few cases where open data is not possible for information dissemination (when data changes in real time, like in the stock market, or the data is extremely large) are not likely to apply to data about higher education.
For an example of open data, the Census Bureau and the National Weather Service have been providing open data since the mid 1990s. The practices of open data have a 25-year history.
I advise against the implementation of any read APIs for a dataset before open data for that dataset is available and validated with stakeholders.
Not all open data is created equal. Well-designed open data will prove to be most useful for mediators — and thus the consumers. For more information on open data, please see:
* My book, Open Government Data: The Book, at http://opengovdata.io/
* Best practices for open data licensing, at http://theunitedstates.io/licensing/
When to build APIs
That said, the above advice applies only to information dissemination. Read/write APIs are an excellent strategy for enrollment or participation in government services. In a read/write API, unlike a read-only API, the external user is submitting information — such as form values — in a transactional process. A read/write API decouples the customer’s experience from the business logic so that mediators can create new experiences but still be compliant with the agency’s business logic.
Just as with information dissemination, mediators can be valuable during transactions. Different audiences might respond best to different ways in which the transaction occurs (off-line, on an iPad, in large print, in plain language, or using jargon when targeting domain experts, etc.). Using a read/write API, mediators can create new and tailored methods of performing the same transaction and best reach audiences that the agency alone could not best reach.
Since transactions are by their nature dynamic, open data would not meet this need.
Not all APIs are created equal. Exceptional APIs lower the barrier to entry and the ongoing costs for mediators. Poorly designed APIs could result in helping no one.
A well-designed API provides granular access, deep filtering, typed values, normalized tables, RESTful interfaces, multiple output formats, useful validation messages, use-case or intent-oriented URLs, documentation, client libraries, versioning, fast results, high uptime, easy on-boarding, interactive documentation, and a developer community hub. The best APIs are used by the agencies themselves inside their own products and services.
For more information on what makes a good API, please see my blog post “What makes a good API?” at http://razor.occams.info/blog/2014/02/10/what-makes-a-good-api/.
I am the founder of GovTrack.us, a legislative transparency website that has been at the forefront of open government data since 2004, and I am the author of Open Government Data: The Book (opengovdata.io). I formerly consulted for the Department of Health and Human Services on HealthData.gov and currently consult for the Council of the District of Columbia. I can be reached at email@example.com.
April 29th, 2014
Here’s a little data exploration. Although there are far more small-money contributions to congressional candidates than large-money contributions, the large-money contributions more than make up for the difference in aggregate totals.
I’ve binned all reported individual contributions (types 15 and 15E) to congressional candidates (‘H’ and ‘S’ committees) in the 2011-2012 election cycle so that the bucket means are close to the most common contribution amounts ($250, $500, $1000, and $2500), using FEC data. Amendments to filings were ignored. (Source Code)
Nothing new or exciting here. Just useful to have numbers.
Not included here are contributions to non-candidate PACs (issue PACs, leadership PACs, etc., except when those contributions were earmarked) and unreported contributions (<$200), but we can extrapolate from this data that while there may be a lot of unreported contributions, they likely don’t add up to much.
[Table of contributions by bin; the only surviving row labels are “up to $400” and “$1,400 or more”.]
(The bins overlap because I’ve apportioned some of those contributions to both sides in order to make the mean contribution of each bin close to a nice number.)
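As a rough illustration of how overlapping bins like these can arise (this is not the linked source code, just a minimal sketch under my own assumptions), each contribution can be split between its two nearest "nice" amounts in proportion to how close it is to each. The `TARGETS` list uses the four amounts named above; the function names `apportion` and `summarize` and the linear weighting rule are hypothetical choices for the example.

```python
from bisect import bisect_right

# The "nice" contribution amounts named in the post.
TARGETS = [250, 500, 1000, 2500]


def apportion(amount, targets=TARGETS):
    """Split one contribution across its neighboring target bins.

    Assumption: linear weighting, so a contribution halfway between two
    targets counts half in each bin. Amounts at or beyond the outer
    targets go entirely to the nearest outer bin.
    """
    if amount <= targets[0]:
        return {targets[0]: 1.0}
    if amount >= targets[-1]:
        return {targets[-1]: 1.0}
    i = bisect_right(targets, amount)
    lo, hi = targets[i - 1], targets[i]
    w_hi = (amount - lo) / (hi - lo)
    weights = {lo: 1.0 - w_hi, hi: w_hi}
    # Drop zero weights so an exact match lands in a single bin.
    return {t: w for t, w in weights.items() if w > 0}


def summarize(amounts, targets=TARGETS):
    """Return the weighted count, dollar total, and mean for each bin."""
    count = {t: 0.0 for t in targets}
    total = {t: 0.0 for t in targets}
    for amount in amounts:
        for target, weight in apportion(amount, targets).items():
            count[target] += weight
            total[target] += weight * amount
    return {
        t: {
            "count": count[t],
            "total": total[t],
            "mean": total[t] / count[t] if count[t] else None,
        }
        for t in targets
    }


if __name__ == "__main__":
    # Made-up sample amounts, not FEC data.
    sample = [200, 250, 375, 500, 750, 1000, 2500, 2500]
    for target, stats in summarize(sample).items():
        print(target, stats)
```

Because every contribution's weights sum to one, the weighted counts and dollar totals across bins still add up to the original number of contributions and the original dollar total. Only the bin boundaries become fuzzy.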
April 15th, 2014
Our family has gone through many haggadahs over the years at our Passover seders as my grandparents and relatives have tried to modernize (if not also shorten) our Passover experience. This year a reading conspicuously inserted into our haggadahs by my grandmother had relevance not only for the Passover story but also for civic hacking.
The traditional four children of the seder are a rhetorical device by which we remember the purpose of the Passover seder. The “wicked” child is said to ask, “Why do you do this seder?” The response, we read, is that God freed us from bondage: us, not you. The implication is that with that attitude, God would have left the wicked child behind.
Last night we had four new children on a printout from this page by the American Jewish World Service. If you’re not familiar with Passover, you’ll need to know that the story of Passover is of how God righted the grave injustice of the slavery of the Jews in Egypt, led the Jews to Israel, and punished the Egyptians with plagues for committing the injustice. Here were the new four children:
The Activist Child: “The Torah tells me, ‘Justice, justice you shall pursue,’ but how can I pursue justice?”
Empower him always to seek pathways to advocate for the vulnerable. As Proverbs teaches, “Speak up for the mute, for the rights of the unfortunate. Speak up, judge righteously, champion the poor and the needy.”
The Skeptical Child: “How can I solve problems of such enormity?”
Encourage her by explaining that she need not solve the problems, she must only do what she is capable of doing. As we read in Pirke Avot, “It is not your responsibility to complete the work, but neither are you free to desist from it.”
The Indifferent Child: “It’s not my responsibility.”
Persuade him that responsibility cannot be shirked. As Abraham Joshua Heschel writes, “The opposite of good is not evil, the opposite of good is indifference. In a free society where terrible wrongs exist, some are guilty, but all are responsible.”
For the Uninformed Child who does not know how to ask a question:
Prompt her to see herself as an inheritor of our people’s legacy. As it says in Deuteronomy, “You must befriend the stranger, for you were strangers in the land of Egypt.”
These are the questions I hear often about the usefulness of hackathons and the purpose of meetups like Code for DC, and this provides some useful answers. I will be thinking about how to incorporate these thoughts into my own civic hacking.