Why the return-on-investment of open data is the wrong question

February 20th, 2015
Posted in Civic Hacking

Based on brief remarks I gave at Open Data Day DC 2015.

Open data is a set of practices. It is a community around those practices. And it is a set of values that we bring to problems that we’re tasked to solve.

Open data is a lot like voting. On election day, voting is a messy process, and not everyone wants to do it. It’s expensive to buy and maintain all of those voting machines, to take the day off from work, to do recounts when something goes wrong. It’s confusing. There are a lot of local positions I’m not familiar with, and I need the help of experts to participate effectively.

But if someone walked up to you at 10pm election night and asked you to demonstrate the return on investment of all of the day’s efforts, I think you’d say that that’s not the right question. You have to look back, first, at the history of how we got to vote. And then you have to be patient and look forward for change and evolution in government and the new policies that might be enacted years if not decades later, to know whether the vote was “successful.”

So it goes for open data. We should invest in learning and perfecting the methods of open data — how you publish it, get it, analyze it, and so on — and in learning about the values of open data. But always keep in mind that these skills and ideas are in the service of the problems that brought us to use open data in the first place: government corruption, consumer choice in the marketplace, more effectively telling a story, widening access to justice, and so on. Those are big problems, and when we bring open data to the table we must remember to evaluate it in the context of playing the long game for specific social change or other goals.

Campaign finance reading list

January 9th, 2015
Posted in Civic Hacking

Thomas Stratmann. 2005. Some talk: Money in politics. A (partial) review of the literature. In Public Choice, volume 124.
Decades of academic research, and copious amounts of data, have failed to find any widespread influence of campaign contributions on the outcomes of roll call votes.

Joshua L. Kalla and David E. Broockman. Forthcoming. Campaign Contributions Facilitate Access to Congressional Officials: A Randomized Field Experiment.
A field experiment showed that campaign contributors get greater access to policymakers. “[The first randomized field experiment on the effects of campaign contributions on access to policymakers. In the experiment, a political organization attempted to schedule meetings between 191 Congressional offices and active campaign donors in their districts. . . . When informed prospective attendees were political donors, senior policymakers made themselves available between three and four times more often.”

Caitlin Macneal. June 9, 2014. GOP Rep. Acknowledges That Members Expect Donations For Votes. In Talking Points Memo LiveWire.
It’s an open secret that large donors make tactical contributions. Macneal reports on an open admission of how this works. “McAllister told the crowd that an unnamed colleague told him on the House floor that if he voted ‘no’ on the bill, he would receive a contribution from Heritage, a conservative think tank. ‘I played dumb and asked him, “How would you vote?” ‘ McAllister said. ‘He told me, “Vote no and you will get a $1,200 check from the Heritage [Action]. If you vote yes, you will get a $1,000 check from some environmental impact group.” ‘ ”

Lee Jared Drutman. 2010. The Business of America is Lobbying: The Expansion of Corporate Political Activity and the Future of American Pluralism. Doctoral dissertation, U.C. Berkeley.
In a survey of lobbyists by Lee Drutman, the importance of fundraiser events was ranked near the bottom among 21 lobbying tactics. Drutman also reported that of businesses with a lobbying presence in Washington, D.C., just 24% maintain a PAC, the sort of organization they would need to make campaign contributions. (Of course, as Drutman pointed out, the sensitivity of admitting that fundraisers are a component of lobbying may have reduced their apparent importance.) (pages 11, 39)

Damon M. Cann. 2009. Sharing the Wealth: Member Contributions and the Exchange Theory of Party Influence in the U.S. House of Representatives.
Cann performed a thorough analysis of how transfers of money between congressional campaigns influenced committee chair assignments. Cann compared seniority, party unity, contributions to other candidates’ campaigns, and other factors against who won and who lost among the House members seeking chair positions. On the bright side, it hasn’t always been about money. In the 104th Congress, the Speaker (Newt Gingrich) relied primarily on committee seniority when choosing his new set of committee chairs, following long-standing precedent. Chair selection in the 105th and 106th Congresses (under Gingrich and then Dennis Hastert) began to be influenced by campaign contributions to the party. An extra $30,000 could catapult the second-most-senior Republican member into the chair. By 2001 and the 107th Congress, the seniority system had been abandoned. By the numbers, Hastert’s chair assignments from the 107th to the 109th Congress could be explained almost entirely by who had given the most to Hastert’s party and whether they had in the past voted in unity with the party. A similar but slightly less certain picture unfolded for the selection of the chairs of the Appropriations subcommittees.

Lynn Vavreck. Oct. 7, 2014. A Campaign Dollar’s Power Is More Valuable to a Challenger. In The New York Times / Upshot.
The value of a dollar spent may be worth more to challengers than to incumbents. “[T]o earn one additional vote, the incumbent member of Congress had to spend roughly $200, while the mayoral challenger had to spend only $30 . . . Caps on money probably hurt challengers in both parties more than they hurt either individual party. A large amount of money in campaigns, often deplored, may actually hurt incumbents by helping challengers compete effectively.”

Eleanor Neff Powell and Justin Grimmer. 2014. Money in Exile: Campaign Contributions and Committee Access.
Some contributions are shown to be tied to whether a member of Congress holds a particular committee position. That is, some contributors are trying to shape the make-up of committees. “[W]e exploit committee exile—the involuntary removal of committee members after a party loses a sizable number of seats . . . We use exile to show that . . . [i]ndustries overseen by the committee decrease contributions to exiled legislators, and instead direct their contributions to new committee members from the opposite party.”

Phil Mattingly. August 28, 2014. The Super PAC Workaround: How Candidates Quietly, Legally Communicate. In Bloomberg Businessweek.
Candidates cannot coordinate their expenditures with other PACs that support them. This article shows how candidates are skirting the rules to communicate with Super PACs.

Ray La Raja. January 7, 2015. Campaign finance laws that make small donations public may lead to fewer people contributing and to smaller donations. In the London School of Economics and Political Science blog.
Donors, at least small donors, are reluctant to divulge personal information and put their contribution in the public record. Disclosure of personal information can decrease small money donations by half and can lead to donors making smaller donations to stay beneath reporting requirements.

Author anonymous. February 5, 2015. Confessions of a congressman: 9 secrets from the inside. In Vox.
“Campaigns are so expensive that the average member needs a million-dollar war chest every two years and spends 50 percent to 75 percent of their term in office raising money. Think about that. You’re paying us to do a job and we’re spending that time you’re paying us asking rich people and corporations to give us money so we can run ads convincing you to keep paying us to do this job.” “If a member of Congress doesn’t vote with his or her party 99 percent of the time, he’s considered unreliable and excluded from party decision-making.”

My 13-year campaign for legislative data finally comes to a successful end

December 19th, 2014
Posted in Civic Hacking

Yesterday at a small meeting the Senate announced that it would be making its legislative data available to the public. This has been a long time coming.

The what & why

No legislative branch agency makes available a spreadsheet that lists every bill introduced in Congress. The issue is that simple. We’re finally going to get a list of bills in a useful data format, and, hopefully, a lot more information on top of it, some time next year.

I first asked the Library of Congress for access to its database of legislation in 2001 when I began building GovTrack. They said no, under orders from the House and Senate, and so I began “screen scraping”, or reverse engineering, their public THOMAS.gov website for the same information and making that data freely available to others. The data is what you need to create large-scale visualization, analysis, and tools, such as the ideology and leadership scores, bill prognosis, email updates, legislator report cards, bill text paragraph permalinks, maps of congressional districts, advanced search, and much much more that I built on GovTrack.
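Screen scraping, concretely, meant fetching THOMAS’s HTML pages on a schedule and pulling the bill data out of whatever markup they happened to use. Here is a minimal sketch of the idea; the URL pattern and markup matching below are illustrative placeholders, not THOMAS’s actual format:

    import re
    import urllib.request

    # Illustrative URL only; THOMAS's real bill-status URLs differed.
    url = "http://thomas.loc.gov/cgi-bin/bdquery/z?d113:HR00001:"
    html = urllib.request.urlopen(url).read().decode("latin-1")

    # Scraping means matching whatever markup the site happens to use,
    # which is why scrapers break whenever the page format changes.
    m = re.search(r"<title>(.*?)</title>", html, re.S)
    print(m.group(1) if m else "page format changed; scraper needs updating")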

And my data on GovTrack, rather than anything Congress produces, quickly became the authoritative source for legislative information. Endless apps have been built on top of the data I made available. Even Congress comes to me for data. Representatives embed the maps on GovTrack on their websites and ask me, from time to time, for their own voting statistics. The House Democrats use GovTrack’s data to keep their caucus informed, and many Senate offices load GovTrack data into their back-office systems.

The data is now collected in a community project on GitHub (which began in 2012 and was spearheaded by Eric Mill at the Sunlight Foundation, Derek Willis, and myself), but the right place for this data is Congress. I never wanted to be the linchpin of congressional information (except in so far as it provided me with a career, so… thank you Congress). Once the Senate begins actually making its data available, planned for next year some time, I hope to see Congress become the authoritative source for its own information.

The history

Advocacy around legislative data began in 2007. At the request of Speaker Nancy Pelosi, who was looking for ways to reform the House, a group of government transparency advocates issued The Open House Project report, co-written by myself and others and spearheaded by the new Sunlight Foundation. The report called for the House to make available the legislative data I had been asking for, among several other transparency recommendations. It was just seven years ago that “data” was something totally new to Congress. I surveyed the state of legislative data in 2008 – there was not much. At that time the Senate had not yet even started publishing its voting records as data (as XML). Following the report, many of us worked with Senate staff to explain why making vote data available to the public was a good thing, and only in 2009 did they start making that available (see also 2007). In 2009 we also secured favorable language in the FY 2009 omnibus appropriations bill (see also 2008), but Congress’s support agencies largely ignored the directive to make data available.

John Wonderlich at the Sunlight Foundation, who had started The Open House Project, kept the advocacy going over the next several years. But the House, under Pelosi, was not very responsive to requests for more transparency during this time. Some headway was made, but not in legislative data.

The Republican takeover of the House in 2011 marked a major shift toward transparency. They began making much more data available and promised data about bills. When one representative strangely tried to put the kibosh on data in 2012, The Washington Post ran a story about it (and about me, which was flattering), which lit a fire under House leadership and led to the formation of the House Bulk Data Task Force. Advocates formed a new Congressional Data Coalition in 2014, spearheaded by Daniel Schuman at CREW, and we secured favorable language in the FY2015 legislative branch appropriations bill to keep the pressure on. The House task force made some progress during this time, but without cooperation from the Senate it wasn’t able to actually do much.

That’s what changed yesterday: the Senate is on board. This closes out what has been, for me, a 13-year campaign.

Daniel wrote more about the news here.

DC updates its open data terms of use: Round 2

October 29th, 2014
Posted in Civic Hacking

Over the last few months DC has worked with the open data community to revise its outdated terms of use agreement. Here’s where we stand after DC’s second revision, posted earlier today.

Background: Do I need a lawyer to hack?

Back in September I asked Do I need a lawyer to hack in DC? on the Code for DC blog. I had discovered that in exchange for access to the District’s data, civic hackers (including myself) were agreeing to very odd terms, including a promise not to take any legal action against the District. Imagine if the data revealed actual injustice. We’d have given up the right to use the legal system to make things right! See the Code for DC post for more on why I think these terms were bad policy, but in short: data isn’t “open” if it can only be used on capricious terms. Open government data must be license-free.

What’s been revised since then

The District’s Office of the Chief Technology Officer (OCTO) immediately engaged with me, Code for DC, and others in the open government community to fix these problems. To their credit, several OCTO staff members spent several hours talking through these issues with me on multiple occasions. They have really been putting in the effort to get this all right.

Little more than a week after my blog post, DC posted its first update to the terms, which Alex Howard covered here. That update removed two of the clauses that I noted were problematic:

  • the agreement not to take legal action against the District
  • the indemnification clause

The removal of those two clauses was a major improvement. But the rest of the updated terms, in the parts I cared about, were incoherent. They had intended to retain a requirement to attribute the District in all uses of District data, they explained to me, but the legal language they used to say it made no sense.

In a new update to the terms posted today, which followed additional conversations with OCTO, there were two more great improvements. These terms were finally dropped:

  • agreeing to follow all “rules”, a very ambiguous term
  • the requirement to attribute the data to the District in all uses of the data (it’s now merely a suggestion)

The removal of these two requirements, in combination with the two removed in September, makes this a very important step forward.

One of my original concerns remains, however, and that is that the District has not granted anyone a copyright license to use District datasets. Data per se isn’t protected by copyright law, but the way a dataset is presented may be. The District has claimed copyright over its works before, and it remains risky to use District datasets without a copyright license. Both the September update and today’s update attempted to address this concern, but each created more confusion than there was before.

Although today’s update mentions the CC0 public domain dedication, which would be the correct way to make the District data available, it also explicitly says that the District retains copyright:

  • The terms say, at the top, that they “apply only to . . . non-copyrightable information.” The whole point is that we need a license to use the aspects of the datasets that are copyrighted by the District.
  • Later on, the terms read: “Any copyrighted or trademarked content included on these Sites retains that copyright or trademark protection.” Again, this says that the District retains copyright.
  • And: “You must secure permission for reuse of copyrighted … content,” which, as written (but probably not intended), seems to say that to the extent the District datasets are copyrighted, data users must seek permission to use it first. (Among other problems, like side-stepping “fair use” in copyright law.)

With respect to the copyright question, the new terms document is a step backward because it may confuse data users into thinking the datasets have been dedicated to the public domain when in fact they haven’t been.

A notice of opposition to the Legal Hackers trademark application

October 15th, 2014
Posted in Civic Hacking

Today Eric Mill and I filed a notice of opposition (text pasted below) against Legal Hackers LLC’s “Legal Hackers” trademark application. We wanted to take a moment to put our filing in context. The folks behind Legal Hackers LLC helped to found the legal hackers community, and this is something Eric and I are both thankful for. We’ve been participants in DC Legal Hackers events and truly appreciate the work that they and our chapter leaders Jameson Dempsey, Rebecca Williams, and Alan deLevie put into building the community.

Our filing with the USPTO shouldn’t be taken as criticism of what the folks behind Legal Hackers LLC are trying to do. Strengthening the community by creating branding and setting norms are natural and important next steps. We don’t oppose a trademark generally, but we do think this particular trademark that they applied for, “Legal Hackers,” is not subject to trademark protections under current law, and that this is for good reason.

Trademark law, by our understanding, does not permit the trademarking of a phrase that simply has the meaning of the services being offered. “Apple” would not be a permitted trademark for a company that sells apples (although it is permitted for a company in an unrelated industry). These are called “merely descriptive” or “generic” terms. We believe “legal hackers” is a generic term. That is, it means what it says: people who use computers to address legal issues. Further, this is an essential feature of the services proposed to be provided under the trademark. Those folks were possibly the first to put the two words together in that way, but that in and of itself is not sufficient to confer legal ownership over the term.

We also think that this is good policy. Eric and I are legal hackers. We have been legal hackers since before “legal hackers” was first spoken. And though we certainly don’t imagine a future of Philip filing trademark infringement lawsuits against us, we think the trademark would be stifling.

Please see the full filing below (and also linked above) for a more complete explanation of our thinking. (Also see John Grant’s earlier blog post about the trademark for more background.)

——

October 15, 2014

Notice of opposition to trademark application (serial number: 86-247,678; mark: Legal Hackers; published in TMOG: Sep 16, 2014)

Introduction

(1) On April 9, 2014, Philip Weiss on behalf of Legal Hackers LLC (legalhackers.org), a limited liability company formed in New York, filed an application to trademark “Legal Hackers” for services related to “law and technology,” including a portal, website, videos, community events, and competitive events. The application asserts first use of the term in 2012. This is a notice of opposition to the trademark application.

About the petitioners

(2) Mr. Tauberer is an entrepreneur who has worked in the field of law and technology for nearly fifteen years. He has launched two successful companies, Civic Impulse LLC [footnote: The company, formed in the District of Columbia, produces the website www.GovTrack.us.] and POPVOX Inc. [footnote: Mr. Tauberer is no longer actively involved in this company.], which gather information on bills before the U.S. Congress, statutes, and other legal information and use modern technologies, including creating websites, to make the information more accessible to the public at large. He also performs consulting services related to law and technology for clients that include the Council of the District of Columbia and the Office of the Law Revision Counsel in the U.S. House of Representatives. He has attended several meetings of the DC Legal Hackers, a chapter of the movement that Mr. Weiss helped to found. He holds a Ph.D. in linguistics from the University of Pennsylvania.

(3) Mr. Mill is a software developer who also works in the field of law and technology. Mr. Mill has built numerous applications, search engines, and data services related to this field, especially in the area of congressional and regulatory data. Mr. Mill also attends meetings of the DC Legal Hackers. Mr. Mill and Mr. Tauberer organize a yearly community event related to law and technology.

Our claim

(4) Our opposition rests on our belief that “Legal Hackers” is merely descriptive of the services offered by the applicant (TMEP 1209.01(b)) and, further, that the mark is a generic term that fails the test in TMEP 1209.01(c)(i).

Standing

(5) We stand to be harmed by the trademark of “Legal Hackers.” We believe that the relevant public may understand “Legal Hackers” to primarily refer to the genus of services we have been offering throughout our careers (including well before Mr. Weiss began to use the term) and that the trademark of “Legal Hackers” would impinge on our ability to describe the services we offer to potential clients and employers.

Background

(6) “Hacker” is a homograph. A homograph is a word that is spelled the same but has multiple unrelated meanings. For example, “mouse” is a homograph with two meanings: a computer peripheral and a rodent. Hack is the same way, but with even more distinct meanings. One of the oldest uses of the word “hack” is in the phrase “legal hack,” in the sense of a lawyer for hire, which dates at least to the 19th Century. To “hack” is to cough, or to chop. A “hacker” may be a cybercriminal. The Oxford English Dictionary lists at least 10 unrelated meanings of “hack”, “hacker”, and “hacking.” The definitions most relevant to this opposition include:

(7) “hacker. n. A person with an enthusiasm for programming or using computers as an end in itself”

(8) “hacking. n. The use of a computer for the satisfaction it gives; the activity of a hacker.”

(9) Note that these definitions are not the definitions related to computer crime. Although the definitions above and the definitions related to crime both involve computers, the definitions are distinct. In fact, historically, “hacker” was first used in the sense quoted above and was only later used to mean a criminal. It was the meaning of “hacking” in (8) above that the U.S. House Majority Leader invoked when he hosted the “Congressional Facebook Developer Hackathon” in 2011 in the U.S. Capitol.

(10) There is another meaning of hack which is to pervert something’s original purpose to creatively solve a problem, often through technology. Rube Goldberg machines are hacks. Putting folded-up newspapers under table legs to stop a table from shaking is a hack. This form of the word is often combined with other words to restrict the meaning to a particular subject area. There is a website called “IKEA Hackers” devoted to the repurposing of IKEA products. They suggest turning a pillow into a small child’s costume. The Home Depot used the Twitter hashtag “#HDHacks” in marketing messages to promote do-it-yourself projects. The petitioners, Mr. Tauberer and Mr. Mill, are widely known as “civic hackers”: we use technology to solve problems in the civic space (e.g. government).

(11) Among individuals in the software industry, the definitions of hack quoted and described above are near-universally known and extremely commonly used.

Merely descriptive

(12) Mr. Weiss’s application is to trademark “Legal Hackers” for a variety of services related to “law and technology.” “Legal Hackers” is merely descriptive of these services.

(13) A mark is considered merely descriptive if it describes or immediately conveys knowledge of a feature of the specified goods or services.  For the population interested in services related to law and technology, many will be aware of the definition of “hacker” quoted above, specifically that it relates to an enthusiasm for computers. Many will also understand the word “legal” to mean that which is related to the law. By combining these two words, the mark “Legal Hackers” describes and immediately conveys the feature of “law and computers,” which is a predominant component of the services described in the application (“law and technology”). (To the extent “computers” and “technology” are distinct, we note that a merely descriptive term need not describe all of the services and may describe a set of services of which the application’s services are a subset. (TMEP 1209.01(b)).)

(14) In fact, the official Twitter account for Legal Hackers NYC, an organization run (in part) by Mr. Weiss, describes the organization as “us[ing] technology and the creativity of the hacker ethos to address legal deficiencies.” The “hacker ethos” refers to the definition in paragraph (10) above, that is, a spirit of creatively solving problems. The mark “Legal Hackers” merely conjoins the two most important words from the organization’s own description. A term could hardly be more immediately descriptive than being an extract from the words used to actually describe the services.

Generic

(15) The mark is, further, a generic term for the services offered. We claim that the mark is generic, and not only merely descriptive, because the term not only invokes features of the services but is demonstrably understood to have that meaning by the intended population. For the population interested in services related to law and technology, many are aware of the definition of “hacker” quoted above, specifically that it relates to an enthusiasm for computers. They are further aware that this term is often, in common usage, combined with a second word to restrict its meaning. (See “IKEA Hackers” and “civic hackers” above.) A “legal hacker” is thus a natural phrase that would be understood to mean someone who creatively solves legal problems using technology. This is the genus of the services described in the application.

(16) Although Mr. Weiss may claim the first use of the mark, that he used it first is not relevant to whether the intended public would understand the mark to have this meaning. (It is not necessary to show that the relevant public uses the term to refer to the genus. The correct inquiry is whether the relevant public would understand the term to be generic. TMEP 1209.01(c)(i)) If “hackers” can be combined with “IKEA” and “civic,” it can also be combined with “legal” and be understood as related to “law and technology.”

(17) This can be seen in the use of the mark by individuals unaffiliated with the application to refer to individuals working on law and technology. ABA Journal, the journal of the American Bar Association, referred to four lawyer and policy professionals as “legal hackers” in an August 2014 article. The article additionally used the phrase “legal hacking” to refer to that which the four legal hackers do:

“The panelists agreed that the future for legal hacking was bright, and they were optimistic about the possibilities for technology to increase access to justice.”

To be sure, the panelists were not saying that the future of Legal Hackers LLC is bright but that the future of combining law and technology is bright.

(18) “Hackcess to Justice 2014,” a recent community event unaffiliated with Mr. Weiss or his company, is a play on the words “hack” and “access.” The event described itself as for those interested in “improving access to justice through technology.”

(19) That “legal hackers” has been used naturally in everyday speech by individuals unaffiliated with the application to refer to law and technology demonstrates that it is a generic term. That those individuals have modified “legal hackers” according to the rules of English, by adding “-ing”, or with poetic license, as in “Hackcess,” demonstrates that the term has a readily understood meaning relating to “law and technology” and independent of the services provided by Legal Hackers LLC.

Summary

(20) Our opposition is based on our belief that “Legal Hackers” is merely descriptive, and, further, a generic term, for the services offered by the applicant, based on the common usage of the terms “legal” and “hacker” among those interested in technology services, and the common usage of the compound phrase “WORD + hacker” to restrict the meaning to a subject area.

(21) We would not oppose an application in which “Legal Hackers” is combined with another word, such as “Society,” “Alliance,” or “NYC,” which would remove the merely-descriptive and generic nature of the term.

Joshua Tauberer, Ph.D.

Eric Mill

How bad is IMAP IDLE?

August 9th, 2014
Posted in Code

Over on Mail-in-a-Box issue #129, @llchen223 noted that I hadn’t changed Dovecot’s default setting of imap_idle_notify_interval. This setting controls the duration of IMAP IDLE pauses during which the mail client waits patiently for a new mail notification.

Here’s how it looks with K-9 mail (client) talking to Dovecot (server):

   CLIENT> 9 IDLE
   SERVER> + idling
   SERVER> * OK Still here
   . . . server waits imap_idle_notify_interval minutes . . .
   SERVER> * OK Still here
   . . . server waits imap_idle_notify_interval minutes . . .
   SERVER> * OK Still here

The dance ends when the server reports something other than OK, such as the presence of new mail, or when the client decides to go back into normal IMAP command mode. Several of these exchanges may be happening simultaneously on different connections if the mail client is monitoring more than one folder for new mail.
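If you want to drive this dance by hand, here is a minimal sketch using Python’s imaplib. imaplib has no built-in IDLE support, so the sketch writes the raw protocol lines itself; the host and credentials are placeholders, and _new_tag is an imaplib internal:

    import imaplib

    # Placeholders: point these at your own server and account.
    M = imaplib.IMAP4_SSL("mail.example.com", 993)
    M.login("user@example.com", "password")
    M.select("INBOX")

    tag = M._new_tag().decode()             # imaplib's internal command-tag counter
    M.send(("%s IDLE\r\n" % tag).encode())  # CLIENT> 9 IDLE
    print(M.readline())                     # SERVER> + idling

    # Blocks until the server speaks: an "* OK Still here" keepalive,
    # or an untagged response (e.g. EXISTS) when new mail arrives.
    print(M.readline())

    M.send(b"DONE\r\n")                     # leave IDLE mode
    print(M.readline())                     # SERVER> 9 OK Idle completed.
    M.logout()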

This is pretty efficient. The * OK Still here message is just 444 bytes (over SSL!).

But the concern is that with the default 2-minute delay, that’s up to 720 times a day, multiplied by the number of monitored folders, that a phone has to do something. Does waking the phone’s networking capabilities drain the battery? If multiple connections are open to monitor multiple folders, Dovecot seems to smartly group the OKs together so that the phone is woken up just once — so at least there’s that.

None of this appears to have actually been a problem for my phone, and @llchen223 reports that having K-9 monitoring a mailbox has negligible impact on battery usage on his phone.

Peter Kieser suggested (three years ago) increasing imap_idle_notify_interval to be so long that the client checks in first (h/t @jkaberg). The IMAP IDLE standard says the server can hang up after 29 minutes if it hasn’t heard from the client, and so K-9 checks in after at most 24 minutes from the start of the IDLE by ending the IDLE (with DONE) and starting a new one (IDLE again). If imap_idle_notify_interval is more than 24 minutes, * OK Still here will never occur (I think). 

This sounds great, but the longer the connection remains totally silent the higher the risk that some intermediate point on the connection will decide the connection is broken and reset it. In recording these sessions, I ran into socat’s timeout of 15 minutes.

Not all mail clients are as efficient as K-9. Mozilla Thunderbird restarts the IDLE after each server OK:

   CLIENT> 7 IDLE
   SERVER> + idling
   . . . both sides wait imap_idle_notify_interval minutes . . .
   SERVER> * OK Still here
   CLIENT> DONE
   SERVER> 7 OK Idle completed.
   CLIENT> 8 noop
   SERVER> 8 OK NOOP completed.

Immediately after the NOOP, the client issues a new IDLE command and the pattern repeats. It only looks like 93 bytes, but inside an SSL connection it takes 1,292 bytes. (That’s both sides of the connection.) With the default Dovecot setting of a 2-minute delay, that’s a little less than 1 MB per day (× the number of monitored folders).
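A quick back-of-the-envelope check of that figure:

    bytes_per_cycle = 1292         # one IDLE/DONE/NOOP round trip, both sides, over SSL
    cycles_per_day = 24 * 60 // 2  # one cycle per 2-minute interval: 720 per day
    total = bytes_per_cycle * cycles_per_day
    print(total)        # 930240 bytes, i.e. a little less than 1 MB
    print(total / 1e6)  # ~0.93 MB per day per monitored folder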

The default settings of Dovecot and K-9 seem to be just fine both in terms of bandwidth and battery usage, and there’s no indication that increasing the interval will benefit phones running K-9. Mozilla Thunderbird is not as good at minimizing bandwidth, but I don’t expect many people are running Mozilla Thunderbird on a mobile broadband connection. Maybe an increase of imap_idle_notify_interval from 2 to 4 minutes would be prudent.
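If you do want to try a longer interval, the setting lives in Dovecot’s configuration. A sketch (the file path varies by distribution):

    # e.g. in /etc/dovecot/conf.d/20-imap.conf
    imap_idle_notify_interval = 4 mins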

I recorded the session with socat by having it act as an OpenSSL server (on a new port, here 992) to terminate the encryption, log the unencrypted IMAP stream, and proxy the IMAP traffic to a new SSL connection to the Dovecot IMAP server (port 993):

    socat -x \
        OPENSSL-LISTEN:992,fork,reuseaddr,bind=0.0.0.0,cert=ssl_certificate.pem,key=ssl_priv_key.pem,verify=0,rcvtimeo=9999999 \
        OPENSSL:localhost:993,verify=0,rcvtimeo=9999999

This outputs the IMAP stream in hex, which happens to be a little cleaner than outputting the ASCII stream. socat’s default socket timeout was about 15 minutes, so I’ve also extended it to be able to handle the 24-minute IDLE length.

To measure the size of the encrypted traffic (including link-level headers and so on), I used tcpdump to monitor port 992:

    tcpdump -q -nn -e -ttttt -U "port 992"

I’m using Dovecot 2.2.9, K-9 4.804, and Mozilla Thunderbird 31.0. The exact sizes of the encrypted IDLE-related messages probably depend on which protocol and ciphers happen to be selected for the connection, so socat will be affecting those measurements.

(While recording these sessions, I noticed that K-9 would also check the Drafts folder every 90 seconds if no Drafts folder exists. As soon as the first draft was saved, causing K-9 to create the folder, this poll stopped. So I’ll have to revise Mail-in-a-Box to create the Drafts folder by default.)

DC’s open data directive adopts the mistakes made by the White House

July 21st, 2014
Posted in Civic Hacking

Earlier today DC’s mayor issued a Transparency, Open Government and Open Data Directive (readable thanks to Alex Howard here). Much of it was adapted from the White House’s open government memoranda, including those memoranda’s faults.

Overview

There are many things to like about the Directive, including the mention of a potential new Chief Data Officer position, the use of open formats, and the goal of promoting reuse. The framing in terms of transparency, participation, and collaboration — lifted from Obama’s 2009 open government memo and adopted in the Mayor’s 2011 memorandum on transparency and open government — is good. (Though not great. The White House never managed to actually execute the collaboration part.)

But much of it is also undercut by a new notion of conditional access to government data that is becoming the norm.

Having their cake and eating it too

What I mean is that while the directive explicitly and clearly states that there will be

no restrictions on copying, publishing, further distributing, modifying or using the data [in DC's data catalog]

it simultaneously explicitly describes a number of restrictions that there will or may be on use of the data. (It’s clear DC copied language from the White House’s 2013 open data memo (“M-13-13”), which I’ve blogged about before here and here, including their mistakes.)

“No restrictions” is what we want. It is, by community consensus, a core and defining quality of open government data.

If there are capricious rules around the reuse of it, it’s not open government data. Period. Restrictions serve only to create a legal lever by which the government can put pressure on things they don’t like. Imagine if the DC government took legal action against Greater Greater Washington to stop an unflattering story on the basis that GGW didn’t properly cite the DC government for the data used in a story. This is what the future of open data in DC looks like when there are restrictions on reuse.

Okay so specifically:

“Open license” does not mean “no restrictions”

So first it says that the data catalog will accomplish this goal of “no restrictions” by making the data available through an “open license.” An “open license,” as usually understood, does not mean “no restrictions,” however. Most open licenses, including open source licenses and Creative Commons licenses, grant only some privileges and not others. Often privileges come along with new requirements, such as GPL’s virality clause or the restriction that users must attribute the work to the author. Under the Open Definition, “open” means reusable but potentially subject to certain terms.

In guidance I co-wrote with Eric Mill, Jonathan Gray, and others called Best-Practices Language for Making Data “License-Free”, we addressed what governments should do if they really want to create “no restrictions.” They should use CC0, a copyright waiver. This is really the only way to achieve “no restrictions.”

(This was one of the confusions in M-13-13 as well. It’s clear the directive took the open licensing language from M-13-13.)

“Open license” presumes the work is copyrighted

Facts cannot be copyrighted. To the extent that DC’s data catalog contains facts about the District, about government operations, and so on, the data files in the catalog are likely not subject to copyright protections. (What is and isn’t copyrightable is murky.) Open licensing, as normally understood, presumes the work is copyrighted. If the work isn’t copyrighted, an open license simply doesn’t apply. You can’t license what you don’t own.

(This was another one of the confusions in M-13-13. But unlike the federal government, the DC government probably can copyright things it produces. But probably not data files.)

Data users must agree to a contract first

The data “shall be subject to Terms of Use developed by OCTO.” This means that DC residents will have to agree to a contract before getting the data. What will the contract say? More on that later. This is, by its nature, a restriction on use.

Imagine if data provided in response to a Freedom of Information Act request came with a contract. They’ll fulfill the FOIA request but only if — let’s say hypothetically — you agree to not sue the government using the information you get. Well, duh, that defeats the point. Just as a Terms of Use agreement undermines “no restrictions.”

The directive indicates that the Terms of Use will include a “disclaimer of liability or indemnification provision”. These are complex legal provisions that could involve waiving rights or compensating the DC government if there is a lawsuit. These are serious things to consider before using government data.

(This was not a problem in M-13-13. The License-Free Best Practices did address this though.)

Attribution and explanation requirements

The directive also gives us a clue about what else will be in the Terms of Service:

Nothing in this Order shall be deemed to prohibit OCTO or any agency … from adopting or implementing measures necessary or appropriate to . . . (v) require a third party providing the District’s public data (or applications based on public data) to the public to explicitly identify the source and version of the public dataset, and describe any modifications made to the public dataset.

This is an attribution requirement, plus a requirement for data users to explain themselves.

To be sure, and as Alex Howard called me out on Twitter, these are hypotheticals that the directive leaves open and not something the directive is mandating. But the fact that these are mentioned strongly suggests that OCTO or other agencies want to enforce these sorts of terms and will if they can.

And, as you might guess I would say, requirements to attribute the government for data and to explain what you did with data are restrictions on use, which like the others create a lever by which the DC government might put pressure on things it doesn’t like.

(This was also a problem in M-13-13, but in this case it doesn’t appear that the DC directive specifically copied the problem from M-13-13.)

Conclusion

There is a strong American tradition — or at least a core American value — that the government does not get in the way of the dissemination of ideas. We don’t always live up to that ideal, but we strive for it. Access to information about the government that comes with restrictions on what we can say when we use it (e.g. attribution & explanation), a waiver of rights, a commitment to indemnify, etc. is anathema to accountability, to transparency, and to respect for the public.

If and when these new terms go up, I will encourage users to FOIA for the same information rather than get it from the DC data catalog.

I’m tracking the White House with persistent cookies

July 21st, 2014
Posted in Civic Hacking

ProPublica reported this morning that WhiteHouse.gov is — albeit accidentally — using a new method for tracking individual visitors to the website. This reminded me that for the last 6 months I’ve been tracking the White House.

Methodology

On Jan 17 the President made his first major speech regarding reforms to the NSA’s massive surveillance programs revealed last year. I thought that morning that he would announce new mandatory data retention policies for internet and telephone service providers. He didn’t. But by the time the speech began I had already started tracking the White House.

About 5% of traffic to my website GovTrack.us comes from the government. Most IP addresses are tied to the major broadband providers like Verizon, Comcast, and so on. But some government IP addresses come from special IP address blocks labeled specifically for the office that reserved them.

Three blocks were of interest to me: the blocks for the Executive Office of the President (“EOP”, about 60 page views on GovTrack per day), the United States Senate (about 300 page views/day), and the House of Representatives (about 600 page views/day). I don’t know where the computers are that have these IP addresses, but I expect that EOP IP addresses would include the White House, the West Wing, and perhaps the Eisenhower Executive Office Building (more on what the EOP is). The House and Senate IP addresses are used in the Capitol and the seven congressional office buildings, including in non-political offices and on the guest WiFi network, to the best of my knowledge.

On Jan 17 I began uniquely identifying the users of these IP blocks by placing a persistent cookie with a unique identifier in their web browsers when they visited GovTrack and logging each of those page views. Persistent cookies get lost when users clear their browser cookies, but it’s a useful first approximation to identifying users.
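For the curious, the mechanics look roughly like the following WSGI middleware sketch. This is not GovTrack’s actual code, and the CIDR blocks below are placeholders, not the real EOP, Senate, or House ranges:

    import ipaddress
    import uuid
    from http import cookies

    # Placeholder networks, NOT the real EOP/Senate/House IP blocks.
    WATCHED = {
        "EOP": ipaddress.ip_network("198.51.100.0/24"),
        "Senate": ipaddress.ip_network("203.0.113.0/24"),
    }

    def tracking_middleware(app):
        def wrapped(environ, start_response):
            ip = ipaddress.ip_address(environ["REMOTE_ADDR"])
            network = next((name for name, net in WATCHED.items() if ip in net), None)
            if network is None:
                return app(environ, start_response)  # not a watched block; do nothing

            # Reuse the visitor's unique ID if the cookie came back, else mint one.
            jar = cookies.SimpleCookie(environ.get("HTTP_COOKIE", ""))
            uid = jar["track"].value if "track" in jar else uuid.uuid4().hex
            with open("tracking.log", "a") as log:
                log.write("%s\t%s\t%s\n" % (network, uid, environ.get("PATH_INFO", "")))

            def start_response_with_cookie(status, headers, exc_info=None):
                # Persistent cookie, good for one year.
                headers.append(("Set-Cookie", "track=%s; Max-Age=31536000; Path=/" % uid))
                return start_response(status, headers, exc_info)
            return app(environ, start_response_with_cookie)
        return wrapped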

Summary Results

So far, 324,705 hits on GovTrack have been logged from 19,131 unique tracking cookies:

Network    Hits      Uniques
EOP         12,512     1,161
Senate      92,917     7,572
House      219,276    10,771

(Interestingly, 373 unique cookies appeared on more than one of the three networks — probably laptops that moved from one building to another.)

The longest recorded session is for one tracking cookie on the House network that made 1,590 page views, almost all on Feb 21 but also in January and March, to pages for various representatives. My guess is this was a lobbyist on the guest wifi doing research before a meeting.

From the EOP, the longest recorded session is 901 page views between March 18 and July 18. This user mostly looked at my congressional district maps and a few bills on a variety of subjects. There was no discernible pattern to it, except that this person is probably responsible for looking up the congressional districts of people. Maybe the person processes incoming mail to the President.

This is all I’ll look into right now, but I may post more about it if I find anything interesting.

I’d be glad to share the data on request.

We the People is 10% a Sham

July 4th, 2014
Posted in Civic Hacking

We the People, one of the White House’s cornerstone open government initiatives, is 10% a sham. The site promises to respond to petitions posted by users if the petition reaches a certain number of signatures within 30 days. Nextgov reported earlier in the year that the White House was not keeping its end of the pledge. It’s true. More than 10% of petitions that deserve an answer go unanswered.

The threshold for a White House response has gone up steadily as the popularity of We the People has increased. In 2012 the threshold was 25,000 signatures within 30 days, and now the threshold is 100,000. Except when it isn’t. Of the 217 petitions that gathered enough signatures for a response, 29 have gone unanswered.

The full list of petitions that the White House owes a response to is below.

Two of those petitions hit close to home for the open government movement. In early 2013, Aaron Swartz, an early leader of the open government movement, committed suicide while under investigation for downloading research papers without permission. Two petitions (1, 2) were submitted shortly after his death calling for the firing of attorneys believed to be over-zealous in their prosecution of Swartz. The petitions each gathered more than the then-threshold of 25,000 signatures, but more than one year and 89,733 signatures later there has been no response.

The earliest of the petitions still “pending response,” according to the White House’s own data, is a petition about GMO food labeling, which has now reached ten times the (early) 5,000-signature threshold. Another petition, created in July 2013 and now at 211,925 signatures (181,479 in its first 30 days), asked the White House to declare the Muslim Brotherhood party in Egypt a terrorist organization. It, too, remains unanswered.

The White House has posted 156 responses to 225 petitions since the site launched in September 2011. 2,916 petitions have been created in all. The most successful petition, by number of signatures, was one created in December 2012 asking the White House to recognize the Westboro Baptist Church as a hate group. It gathered 367,180 signatures and in July 2013 received a response to the effect that the White House does not maintain a list of hate groups — demonstrating that, of course, getting a response does not mean getting the response the petitioners wanted.
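(We the People also exposed a public read API at the time, so a sketch like the following could reproduce these counts. The endpoint and field names here are recalled from the API of that era and may have changed, so treat them as assumptions:)

    import requests

    # Assumed endpoint and field names; verify against the current API docs.
    URL = "https://api.whitehouse.gov/v1/petitions.json"

    petitions, offset = [], 0
    while True:
        batch = requests.get(URL, params={"limit": 100, "offset": offset}).json()["results"]
        if not batch:
            break
        petitions += batch
        offset += len(batch)

    # Petitions that crossed their signature threshold but have no response yet.
    unanswered = [p for p in petitions
                  if int(p.get("signatureCount", 0)) >= int(p.get("signatureThreshold", 10**9))
                  and p.get("status") == "pending response"]
    print(len(unanswered), "petitions awaiting a response")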

h/t and thanks to @konklone for mentioning this to me a long long while ago.

Okay here’s the list, in order of number of signatures gathered. The date before each is the date the petition was created, and so it met its threshold 30 days later.

07/07/13: Declare Muslim Brotherhood organization as a terrorist group (211,925 signatures; 181,479 in 30 days)

06/09/13: Pardon Edward Snowden (161,395 signatures; 129,312 in 30 days)

05/03/13: Invest and deport Jasmine Sun who was the main suspect of a famous Thallium poison murder case (victim:Zhu Lin) in China (151,169 signatures; 148,285 in 30 days)

05/13/14: put sanctions on China for invading Vietnam territory with the deployment of oil rig Haiyang 981. (139,216 signatures; 138,878 in 30 days)

06/05/13: allow Tesla Motors to sell directly to consumers in all 50 states. (138,379 signatures; 110,384 in 30 days)

12/11/13: Remove offensive state in Glendale, CA public park (129,170 signatures; 123,629 in 30 days)

05/01/14: Demand Release of U.S.M.C. Sgt. Tahmooressi Suffering with PTSD from Mexico Imprisonment (128,770 signatures; 115,889 in 30 days)

04/25/14: Urge S. Korean Government & Press to Stop the Attack Against Church in the Aftermath of Ferry Tragedy (119,369 signatures; 117,749 in 30 days)

08/22/13: Stop SOPA 2013 (118,905 signatures; 106,486 in 30 days)

05/15/13: Provide necessary assistance to prevent Taiwanese people from being murdered by Philippines and rebuild friendship. (115,676 signatures; 113,330 in 30 days)

02/17/14: Stop SOPA 2014. (112,293 signatures; 104,184 in 30 days)

11/12/13: Reform ECPA: Tell the Government to Get a Warrant (112,087 signatures; 105,236 in 30 days)

04/12/12: Support mandatory labeling of genetically engineered foods (GMOs). (110,784 signatures; 30,740 in 30 days)

02/26/14: Allow Ukrainian Citizens 90 day entrance into the USA on passport, without Visa. (107,909 signatures; 103,037 in 30 days)

01/04/14: Please Protect The Peace Monument in Glendale Central Library (106,751 signatures; 105,390 in 30 days)

02/27/14: Urge the FDA to Say YES to Accelerated Approval for safe, effective therapies for children with Duchenne. (106,734 signatures; 105,036 in 30 days)

04/23/14: Designate Russia as “State Sponsor of Terrorism” (104,914 signatures; 103,722 in 30 days)

03/21/14: Legally Recognize Non-Binary Genders (103,166 signatures; 101,494 in 30 days)

09/23/11: Require all Genetically Modified Foods to be labeled as such. (64,311 signatures; 8,747 in 30 days)

01/12/13: Remove United States District Attorney Carmen Ortiz from office for overreach in the case of Aaron Swartz. (60,881 signatures; 52,466 in 30 days)

05/10/12: Remove the monument and not to support any international harassment related to this issue against the people of Japan. (47,477 signatures; 31,473 in 30 days)

06/21/12: Repeal the House of Representatives Resolution 121 to stop aggravating int’l harassment by Korean propaganda & lies! (46,012 signatures; 27,623 in 30 days)

09/01/12: Persuade South Korea (the ROK) to accept Japan’s proposal on territorial dispute over islets. (42,015 signatures; 30,213 in 30 days)

12/28/12: To award the Medal of Freedom to the 4 Firefighters who were ambushed in West Webster New York on Christmas Eve 2012 (34,067 signatures; 29,322 in 30 days)

12/02/12: Investigate and publicly condemn organ harvesting from Falun Gong believers in China (33,733 signatures; 28,624 in 30 days)

01/08/13: Invite Neal Boortz, the author of The FairTax Book, to spend one hour talking with the President about tax reform. (32,155 signatures; 28,191 in 30 days)

12/11/12: oppose the petition created by “Hisa A” on Japan’s proposal to take Japan’s claim over Dokdo (or Takeshima) to the ICJ. (31,609 signatures; 28,959 in 30 days)

01/12/13: Fire Assistant U.S. Attorney Steve Heymann. (28,854 signatures; 25,717 in 30 days)

12/29/12: There are election rigging made by Progressive Program that have been used in the 18th Presidential Election of S. KOREA (26,797 signatures; 25,467 in 30 days)

6 problems you should know before writing about Facebook’s emotional contagion study

June 29th, 2014
Posted in Uncategorized

In a Facebook study published this week, Facebook manipulated many of their users’ News Feeds by omitting 0-90% of posts containing either positive or negative content over the course of a week in 2012. They reported that those users wrote fewer positive and negative words (respectively) in their own posts, concluding that Facebook is a medium on which emotions spread, a case of “emotional contagion,” to use their technical term.

Here’s what you need to know:

On average, no emotion actually spread

The number of positive words in their average user’s posts decreased from 6 words to… 6 words.

The first major omission in the study is the lack of individual-level statistics. While they reported aggregate numbers such as having analyzed “over 3 million posts” totaling “122 million words” made by their “N = 689,003” users, and the study’s implication for “hundreds of thousands of emotion expressions,” they omitted any discussion of whether and how individuals were affected in any meaningful way.

From their numbers, the average user wrote 4.5-5 posts totaling 177 words during the experimental week. Only 3.6% of those words — so about 6 words — were “emotional,” and they found that by omitting about half of emotional posts from users’ News Feeds that percentage would go down by 0.1% or less. A 0.1% change is about 2/10ths of a word.
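Those per-user figures fall out of the paper’s aggregate numbers:

    users = 689003
    posts = 3e6 / users        # ~4.4 posts per user that week ("over 3 million" is a floor)
    words = 122e6 / users      # ~177 words per user
    emotional = words * 0.036  # ~6.4 "emotional" words
    shift = words * 0.001      # a 0.1% change is ~0.18 words, about 2/10ths of a word
    print(posts, words, emotional, shift)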

For most of their users, there was not even close to a measurable effect.

(The study did mention a Cohen’s d statistic of 0.02, which is another way to say that there was an aggregate effect but basically no individual-level effect.)

The study has no test for external validity (was it about emotions at all?)

An important part of every study is checking that what you’re measuring actually relates to the phenomenon you’re interested in. This is called external validity. The authors of the Facebook study boasted that they didn’t think of this.

The paper quixotically mentions that “no text was seen by the researchers” in order to comply with Facebook’s agreement with its users about how it will use their data.

They didn’t look at all?

That’s kind of a problem. How do you perform a study on 122 million words and not look at any of them?

Are the posts even original, expressive content? The users might be sharing posts less (sharing is sort of like retweeting) or referring less to the emotional states of friends (“John sounds sad!”). The words in a post may reflect the emotions of someone besides the poster!

To classify words as “positive” or “negative” the study consulted a pre-existing list of positive and negative words used throughout these sorts of social science research studies. This comes with some limitations: sarcasm, quotation, or even simple negation cuts the legs out from under this approach. I actually think in aggregate these problems tend to go away, but only when you have a large effect size.
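A toy example of why: a LIWC-style counter scores each word out of context, so negation and talk about other people’s feelings score just like first-person emotion. (The word lists below are tiny stand-ins for the real ones.)

    import string

    POSITIVE = {"happy", "great", "love"}  # tiny stand-ins for the real lists
    NEGATIVE = {"sad", "awful", "hate"}

    def score(post):
        table = str.maketrans("", "", string.punctuation)
        words = post.lower().translate(table).split()
        return (sum(w in POSITIVE for w in words),
                sum(w in NEGATIVE for w in words))

    print(score("I am not happy"))    # (1, 0): scored positive despite the negation
    print(score("John sounds sad!"))  # (0, 1): someone else's emotion, scored as the poster's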

The whole of Facebook’s reported effect on emotion could be due to one of the many limitations of using word lists as a proxy for emotion. They needed to demonstrate it wasn’t.

Methodological concerns

This study is not reproducible. While most research isn’t ever reproduced, that it could be provides a check against the fabrication of results (and sometimes that’s how fabricators are caught). Facebook provides the only access to a network of this size and shape. It is unlikely they would provide access to research that might discredit the study.

The study also uses a strange analysis. Their experimental design was 2 × 9-ish (control or experiment × 10-90% of posts hidden), but they plugged the two variables into their linear regression in two ways. The first became a binary (“dummy”) variable in the regression, which is right, but the second became a weight on the data points rather than a predictor. That’s an odd choice. Do the results come out differently if the percentage of posts hidden is properly included in the regression model? Did they choose the analysis that gave the results they wanted to see? (This is why I say “about half of emotional posts” above, since the analysis is over a weighted range.)
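To make the two modeling choices concrete, here is a sketch in statsmodels on synthetic stand-in data. The column names are hypothetical and this is not the paper’s code:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Synthetic stand-in data: outcome is the percent of positive words.
    rng = np.random.default_rng(0)
    n = 1000
    df = pd.DataFrame({
        "treated": rng.integers(0, 2, n),                   # control vs. experiment dummy
        "dosage": rng.choice(np.arange(0.1, 1.0, 0.1), n),  # fraction of posts hidden
    })
    df["pct_positive"] = 3.6 - 0.1 * df.treated * df.dosage + rng.normal(0, 1, n)

    # (a) Roughly the paper's choice: treatment enters as a dummy and the
    #     dosage enters only as a per-observation regression weight.
    wls = smf.wls("pct_positive ~ treated", data=df, weights=df["dosage"]).fit()

    # (b) The alternative: dosage as a predictor interacting with treatment,
    #     so the estimated effect can vary with how many posts were hidden.
    ols = smf.ols("pct_positive ~ treated * dosage", data=df).fit()

    print(wls.params)
    print(ols.params)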

Informed consent

Finally, there’s the problem of informed consent. It is unethical to run experiments on people without it. The paper addresses legal consent, in the sense that the users agreed to various things as a pre-condition for using Facebook. Though being manipulated was probably not one of them (I don’t know what Facebook’s terms of service were in early 2012 unfortunately).

Certainly the consent didn’t reach the level of informed consent, in which participants have a cogent sense of what is at stake. There’s a great discussion of this at Slate by Katy Waldman.

Facebook’s users have a right to be outraged over this.

Keep in mind though that there are different ethical obligations for research versus developing a product. It could be ethical for Facebook to manipulate News Feeds to figure out how to increase engagement while at the same time being unethical for a research journal to publish a paper about it.