JohnWonderlich: woot! Go Senate XML! (votes data now posted, in policy reversal) http://bit.ly/Il7hF
Senate votes now in XML (success story)
The new world of government transparency through technology
The big news lately is that the Center for Responsive Politics opened up their large database of normalized campaign contribution records under a Creative Commons license. I think this is more significant to the world of government transparency & technology than it might appear. Just around five years ago this world was quite different. Organizations like CRP were very much using technology to bring new insight to civics. That hasn’t changed. But organizations saw themselves as solitary entities whose primary mission was to provide a new direct-to-citizen service to the public. A web application, for instance. There’s no need for me to list off other examples — every advocacy and government transparency website was like that, to the best of my recollection. (Except maybe IMSP who seemed to be ahead of the pack.)
All that has changed, and I wish I could pinpoint exactly how that happened. The combination of “Web 2.0” as a buzz-word and grassroots digital campaigning in 2004 probably had a lot to do with it. The Howard Dean presidential campaign got a boost (at least in terms of publicity if not poll numbers) from developers coming together to specialize the Drupal open source CMS for political campaigning (“CivicSpace”). That sent a message, even if no one quite recognized it at the time, that developers have a role to play in the world of civics and that cooperation was a viable model for getting things done. Not to say that the CivicSpace project invented this: I had been working on GovTrack for a few years by that point and across the pond Tom Steinberg and the MySociety group had been thinking about open source civics for even longer. But I suspect, even in my own thinking, that CivicSpace crystallized some vague earlier notions of civic hacking.
The story isn’t over yet, though, because I don’t think any of this alone would have brought us to where we are today. Unfortunately, from this point forward I run the risk of giving too much credit to the things I know about and not enough credit elsewhere. Still, here’s how I see it. Four more things had to happen, independently. First, entrepreneur Mike Klein had to make a lot a lot a lot of money. Second, Dan Newman and David Moore had to build MAPLight.org and OpenCongress.org, respectively. These are, now, and especially were at the start, leading examples of how you can do really cool new things by mixing data sources (for MAPLight, mixing my GovTrack legislation data with campaign contribution data from CRP) or re-mixing data sources (for OpenCongress giving my legislation data a more social spin). Third, John Wonderlich had to start, quite by accident, the Open House Project — this was a crucial step in bridging the technology world with staffers for congressmen, especially with Speaker Pelosi’s office. The fourth bit was that Ellen Miller and Micah Sifry had to put it all together and form the Sunlight Foundation: funding from Mike going to two great technology projects (IMO these are Sunlight’s most important grantees) and a policy arm with teeth because of its pragmatic approach to connecting with policymakers.
That’s pretty much it, because from there things just make sense. Sunlight recruited great staff and steamrolled through the open government world stamping out the idea that each open government group should be in its own little world — by funding interaction, in a sense.
The expectations for government transparency advocacy changed. Groups had to walk the walk a bit more by sharing and collaborating. So now besides CRP’s data being opened up for anyone to remix we have the Taxpayers earmark data, the Sunlight Labs API, the MAPLight API, and probably several more databases. The New York Times API probably owes some of its inspiration to these changing expectations too. So it’s a whole new world now of not just open government, and not even open government data, but open government transparency advocacy data. (Is there a catchier name for that?)
Update on bulk data from Congress
One of the Open House Project’s recommendations was that Congress share its legislative data with the public in bulk and I’ve had a long history of posts on the subject. Over at the Free Gov info blog (link), Bob Tapella, Public Printer at the Government Printing Office, tells us that they are responding to this recommendation. He writes in a comment (presumably it is really him):
We have recently been called upon by Congress in the joint explanatory statement on the H.R. 1105, to work with the Library of Congress, including the Congressional Research Service, and the Law Library of Congress, to discuss access to bulk data. Specifically, the language is as follows:
[JT: omitted --- I've posted it before here]
To address this request, a Legislative branch task force has been assembled consisting of representatives from the offices of the Secretary of the Senate, the Clerk of the House, the Library of Congress, Congressional Research Service, the Law Library of Congress, and GPO. This task force has already met and is working to develop a position on access to bulk data. We will look to this work and the review by Congress to help guide our work on making bulk data accessible.
Grin.
Check me out: My talk at Berkeley’s Free Culture Conference last year
Watch a video of my talk at the Free Culture Conference last year on Civic Hacking. (Text and slides here.) It was my best talk yet. I’ve got another good one (if I do say so myself) coming up at CITP’s Studying Society in a Digital World conference in a few weeks at Princeton.
SPARQL OLE DB Provider
Andy Gueritz announced on the mail list for my SemWeb RDF library for .NET that he has created an OLE provider for a SPARQL endpoint that is usable in Microsoft Excel. He wrote,
In a moment of insanity (but a great learning experience), I gave myself the challenge of writing an OLE DB provider for SPARQL. It is built on top of the SemWeb library which has saved a substantial amount of effort and also brings some powerful functionality to the table very quickly (Thanks, Joshua!)
The provider as constructed implements a readonly OLE provider that supports all four SPARQL query types and interfaces to SemWeb through COM-Callable Wrapper. It is not extensively tested yet but seems to work with most of the queries I have now put through it, and of course being built on SemWeb it is able to read both local and remote SPARQL sources.
Moral of the story: populate Excel tables with SPARQL queries.
More here.
Try hacking for government transparency in GSoC
Does the thought of “hacking Congress” entice you? I don’t mean breaking in to U.S. Capitol servers, of course, but putting your l33t hacking skillz to use to improve government transparency and civic engagement. The Sunlight Foundation (I have no affiliation) is a mentoring organization in Google Summer of Code 2009. Check it out.
Shameless plug: HackingCongress.org
Announcing: HackingCongress.org
The intersection of civics & technology
http://www.hackingcongress.org/
Our community is growing rapidly these days. And while TransparencyCamp gave us a physical place to come together a few weeks ago, we’re still a little nomadic in the online world.
We’re also a very diverse group. The fact that we all often have to cross-post to the same set of lists indicates that we’ve got great inner communities that focus separately on coding, policy, social media, etc.
HackingCongress.org is meant to be a neutral-ground home for the coder community in the open government world. Really, it’s just a links page. But it’s a links page with a nice Drupal theme you can proudly point to and say “this is my movement”. “Hacking” is, of course, a word with several meanings. In the programming world it is very much a positive term meaning something like “creative programming”.
The site is running Drupal and anyone that creates an account can edit any content on the site. So, it’s basically a new wiki.
Right now you can find:
Community
———
Links to the primary convergence locations for this community, the Sunlight Labs and PoliParsers mail lists, the IRC channel #transparency, the Planet oGosh blog aggregator, the oGosh Facebook group, and the Upcoming Transparency Events page on the OpenCongress wiki.
Links to all of the other mail lists for our community (all of the ones I’ve mailed here and some others).
Data & APIs
———–
The beginnings of a list of the databases and APIs that are available for government transparency data. If you’re a data source, add yourself to the list or make sure I got your entry correct, please.
Projects
——–
Links to ongoing projects broken down by type:
– Open-source coding projects like OpenCongress and Sunlight’s Fifty States.
– Policy projects like Open House/Senate.
– Wiki projects like the new Wired gov data wiki.
It was everything I could come up with quickly. I’ll be adding more as I see them, but feel free to add your own project.
#transparency
————-
Using a Drupal module you can enter the community’s IRC channel #transparency through the website.
Blog Aggregator
—————
Recently I announced Planet oGosh, an aggregator bringing together a whole bunch of blogs in the open government tech community. I’m changing the URL to planet.hackingcongress.org.
Final Notes
———–
Thanks to Kendall Clark for donating the domain name.
Bulk data downloads approved in the omnibus spending bill (success!)
Two recommendations of the Open House Project report have been taken up in the FY09 omnibus appropriations bill (H.R. 1105). The first recommendation in our chapter on legislative databases was that the Library of Congress make its bill status database directly available to the public and that the GPO not sell legislative documents to the public. These have been the two issues I’ve had my sights on over the last three years (probably starting here). The second recommendation was about coordinating web standards across Congress. These recommendations are addressed in two paragraphs of the House statement accompanying the bill for Division G – Legislative Branch, which is almost like being law itself.
The two paragraphs were added by Congressman Mike Honda of California, one of our champions of the use of technology to further transparency and civic engagement. John Wonderlich of Sunlight Foundation, Rob Pierson in Honda’s office, and I collaborated on this over a long period of time. Honda got involved in 2007 asking the Library to look into this and then in 2008 getting the paragraphs added to the bill markup.
Congressional Transparency Limerick
There’s a wacky thread on the Open House List of poems for transparency, so I gave it a try. (In fairness, I should say I used an electronic pronunciation dictionary to find some of the rhymes.)
There once was a man named Mike Honda,
A congressman us geeks are quite fond ‘a,
In markup sessions takes on the chairman, a hulk,
so that we the people can get our data in bulk.
His friend maverick Joe likes transparency too,
Senate votes in XML he says long overdue,
At party politics he snorts,
Because the public should see those CRS reports.
And last we hear of the executive’s new plan,
For a CTO and CIO…
…perhaps YesWeScan?
Civic Hacking, the Semantic Web, and Visualization
Yesterday I held a session called Semantic Web II: Civic Hacking, the Semantic Web, and Visualization at Transparency Camp. In addition to posting my slides, here’s basically what I said during the talk (or, now on reflection, what I should have said):
Who I Am: I run the site GovTrack.us which collects information on the status of bills in the U.S. Congress. I don’t make use of the semantic web to run the site, but as an experiment I generate a large semantic web database out of the data I collect, and some additional related data that I find interesting.
Data Isolation: What the semantic web addresses is data isolation. For instance, the website MAPLight.org, which looks for correlations between campaign contributions to Members of Congress and how they voted on legislation, is essentially something that is too expensive to do for its own sake. Campaign data from the Federal Election Commission isn’t tied to roll call vote data from the House and Senate. It’s only because separate projects have, for independent reasons, massaged the existing data and made it more easily mashable that MAPLight is possible (that’s my site GovTrack and the site opensecrets.org). The semantic web wants to make this process cheaper by addressing mashability at the core. This is important for civic (i.e. political/government) data: machines help us sort, search, and transform information so we can learn something, which is good for civic education, journalism (government oversight), and research (health and economy). And it’s important for the data to be mashable by the public because uses of the data go beyond the resources, mission, and mandate of government agencies.
Beyond Metadata: We can think of the semantic web as going beyond metadata if we think of metadata as tabular, isolated data sets. The semantic web helps us encode non-tabular, non-hierarchical data. It lets us make a web of knowledge about the real world, connecting entities like bills in congress with members of congress, what districts they represent, etc. We establish relations like sponsorship, represents, voted.
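To make the "web of knowledge" idea concrete, here is a minimal sketch in plain Python rather than a real RDF toolkit. Facts are stored as (subject, predicate, object) triples, and a question is a pattern with wildcards. Every identifier in it (`person:alice`, `bill:example-1`, `district:xx-1`, and so on) is a made-up placeholder, not a GovTrack URI, and the helper names are my own for illustration.

```python
# Each fact is a (subject, predicate, object) triple. All names here are
# hypothetical placeholders, not real GovTrack identifiers.
TRIPLES = {
    ("person:alice", "sponsorship", "bill:example-1"),
    ("person:alice", "represents", "district:xx-1"),
    ("person:bob", "represents", "district:xx-2"),
    ("person:bob", "voted", "bill:example-1"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    )


def districts_touching(triples, bill):
    """Hop across two relations: find everyone linked to the bill
    (by any predicate), then the districts those people represent."""
    people = {s for (s, p, o) in triples if o == bill}
    return sorted(o for (s, p, o) in triples if s in people and p == "represents")
```

The second function is the part a flat table handles poorly: it walks from a bill to people to districts without the data having been laid out for that question in advance.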
Why I care: Machine processing of knowledge combined with machine processing of language is going to radically and fundamentally transform the way we learn, communicate, and live. But this is far off still. (This explains why I study linguistics…)
Then there are some slides on URIs and RDF.
My Cloud: When the data gets too big, it’s hard to remember the exact relations between the entities represented in the data set, so I start to think of my semantic web data as several clouds. One cloud is the data I generate from GovTrack, which is 13 million triples about legislation and politicians. Another cloud is data I generate about campaign contributions: 18 million triples. A third data set is census data: 1 billion triples. I’ve related the clouds together so we can take interesting slices through them and ask questions: how did politicians vote on bills, what are the census statistics of the districts represented by congressmen, are votes correlated with campaign contributions aggregated by zipcode, are campaign contributions by zipcode correlated with census statistics for the zipcode (ZCTA), etc. Once the semantic web framework is in place, the marginal cost of asking a new question is much lower. We don’t need to go through the work that MAPLight did each time we want a new correlation.
Linked Open Data (LOD): I showed my part of the greater LOD cloud/community.
Implementation: A website ties itself to the LOD or semantic web world by including <link/> elements to RDF URIs for the primary topic of a page. This URI can be plugged into a web browser to retrieve RDF about that resource: it’s self-describing. I showed excerpts from a URI for a bill in congress that I created. It has basic metadata, but goes beyond metadata. The pages are auto-generated from a SPARQL DESCRIBE query as I explained in my Census case study on my site rdfabout.com.
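As a rough sketch of the consumer side of this, the Python below scans a page for `<link/>` elements that advertise an RDF document. It assumes the common convention of `type="application/rdf+xml"` on the link; the talk does not say which attributes my pages actually use, so treat that as an assumption. The class name, function name, and `example.org` URL are hypothetical.

```python
from html.parser import HTMLParser


class RdfLinkFinder(HTMLParser):
    """Collect href values of <link> elements that advertise RDF/XML."""

    def __init__(self):
        super().__init__()
        self.rdf_links = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> tags are also routed here by the
        # default handle_startendtag implementation.
        if tag != "link":
            return
        attributes = dict(attrs)
        # Assumption: RDF is advertised with this MIME type.
        if attributes.get("type") == "application/rdf+xml" and attributes.get("href"):
            self.rdf_links.append(attributes["href"])


def find_rdf_links(html_text):
    """Return the RDF URIs a page points to, in document order."""
    finder = RdfLinkFinder()
    finder.feed(html_text)
    finder.close()
    return finder.rdf_links


# Hypothetical page; the URL is a placeholder, not a real resource.
SAMPLE_PAGE = """
<html><head>
  <link rel="stylesheet" href="style.css"/>
  <link rel="alternate" type="application/rdf+xml"
        href="http://example.org/data/bill-1"/>
</head><body>A bill page.</body></html>
"""
```

A crawler would then fetch each returned URI to get the RDF description of the page's primary topic.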
SPARQL: The query language, the SQL, for the semantic web. It is similar to SQL in metaphors and keywords like SELECT, FROM, and WHERE. It differs in every other way. Interestingly, there is a cultural difference: SPARQL servers (“endpoints”) are often made publicly accessible directly, whereas SQL servers are usually private. This might be because SPARQL is read-only.
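For a feel of what that looks like, here is a small sketch in Python: a toy SELECT query and the GET URL a client would send to a public endpoint, using the `query` parameter defined by the SPARQL protocol. The `ex:` vocabulary and the endpoint address are placeholders I made up for illustration, not my actual schema or server, and nothing here touches the network.

```python
from urllib.parse import urlencode

# A toy query. The ex: terms are hypothetical, not a real vocabulary.
QUERY = """
PREFIX ex: <http://example.org/terms/>
SELECT ?bill ?title
WHERE {
  ?bill ex:sponsor ?person .
  ?bill ex:title ?title .
}
LIMIT 10
"""


def build_request_url(endpoint, query):
    """Build the GET URL for a SPARQL endpoint.

    The SPARQL protocol passes the query text in a URL-encoded
    'query' parameter.
    """
    return endpoint + "?" + urlencode({"query": query.strip()})


# Placeholder endpoint address, for illustration only.
EXAMPLE_URL = build_request_url("http://example.org/sparql", QUERY)
```

The WHERE block is a graph pattern rather than a list of table joins, which is where the resemblance to SQL ends.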
Example 1: Did a state’s median income predict the votes of Senators on H.R. 1424, the October 2008 financial bailout bill? I show the partial RDF graph related to this question and how the graph relates to the SPARQL query. First I show a simplified example SPARQL query, then the real one. The real one is complicated not because RDF or SPARQL are complicated, but because the data model *I* chose to represent the information is complicated. That is, my data set is very detailed and precise, and it takes a precise query to access it properly. I showed how this data might be plugged into Many Eyes to visualize it.
My visualization dream: Visualization tools like Swivel (ehm: I had real problems getting it to work), Many Eyes, Ggobi, and mapping tools should go from SPARQL query to visualization in one step.
Example 2: Show me the campaign contributions to Rep. Steve Israel (NY-2) by zipcode on a map. I showed the actual SPARQL query I issue on my SPARQL server and a map that I want to generate. In fact, I made a prototype of a form where I can submit any arbitrary SPARQL query and it creates an interactive map showing the information.
Other notes: My SPARQL server uses my own .NET/C# RDF library. That creates a “triple store”, the equivalent of an RDBMS for the semantic web. Underlyingly, though, it stores the triples in a MySQL database with a table whose columns are “subject, predicate, object”, i.e. a table of triples. See also: D2R server for getting existing data online.
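The "table of triples" layout can be sketched in a few lines. This uses Python with an in-memory SQLite database as a stand-in for MySQL, and it is not the schema my library actually creates: the table name, index, helper functions, and sample identifiers are all hypothetical. The point is only that a graph question becomes a self-join on one three-column table.

```python
import sqlite3


def make_store():
    """Create an in-memory triple table (SQLite standing in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triples ("
        "subject TEXT NOT NULL, predicate TEXT NOT NULL, object TEXT NOT NULL)"
    )
    # An index on the leading columns keeps pattern lookups cheap.
    conn.execute("CREATE INDEX idx_subject_predicate ON triples (subject, predicate)")
    return conn


def add_triple(conn, subject, predicate, obj):
    conn.execute("INSERT INTO triples VALUES (?, ?, ?)", (subject, predicate, obj))


def bills_sponsored_from(conn, district):
    """Bills sponsored by whoever represents the given district.

    One row says who represents the district, another row says what that
    person sponsored, so the table is joined to itself on the subject.
    """
    rows = conn.execute(
        """
        SELECT b.object
        FROM triples AS r
        JOIN triples AS b ON b.subject = r.subject
        WHERE r.predicate = 'represents' AND r.object = ?
          AND b.predicate = 'sponsored'
        ORDER BY b.object
        """,
        (district,),
    ).fetchall()
    return [row[0] for row in rows]
```

Each extra hop in a question adds another self-join, which is roughly the work a SPARQL engine sitting on top of such a table has to do for you.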