Archive for the ‘Civic Hacking’ Category

Posts related to my site GovTrack.us, which tracks the U.S. Congress, and related issues in the world of civics, technology, open government.

Civic Hacking, the Semantic Web, and Visualization

Monday, March 2nd, 2009

Yesterday I held a session called Semantic Web II: Civic Hacking, the Semantic Web, and Visualization at Transparency Camp. In addition to posting my slides, here’s basically what I said during the talk (or, now on reflection, what I should have said):

Who I Am: I run the site GovTrack.us which collects information on the status of bills in the U.S. Congress. I don’t make use of the semantic web to run the site, but as an experiment I generate a large semantic web database out of the data I collect, and some additional related data that I find interesting.

Data Isolation: What the semantic web addresses is data isolation. For instance, the website MAPLight.org, which looks for correlations between campaign contributions to Members of Congress and how they voted on legislation, is essentially something that is too expensive to do for its own sake. Campaign data from the Federal Election Commission isn’t tied to roll call vote data from the House and Senate. It’s only because separate projects have, for independent reasons, massaged the existing data and made it more easily mashable that MAPLight is possible (that’s my site GovTrack and the site opensecrets.org). The semantic web wants to make this process cheaper by addressing mashability at the core. This is important for civic (i.e. political/government) data: machines help us sort, search, and transform information so we can learn something, which is good for civic education, journalism (government oversight), and research (health and economy). And it’s important for the data to be mashable by the public because uses of the data go beyond the resources, mission, and mandate of government agencies.

Beyond Metadata: We can think of the semantic web as going beyond metadata if we think of metadata as tabular, isolated data sets. The semantic web helps us encode non-tabular, non-hierarchical data. It lets us make a web of knowledge about the real world, connecting entities like bills in congress with members of congress, what districts they represent, etc. We establish relations like sponsorship, represents, voted.
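
To make that a little more concrete, here is a minimal Python sketch of a "web of knowledge" held as subject-predicate-object statements. Everything in it is an assumption made for illustration: the identifiers (person:rep_a, bill:example, district:XX-01), the relation names, and the helper functions are invented for this sketch and are not GovTrack's actual vocabulary.

```python
# Each fact is one (subject, predicate, object) statement.
# All identifiers are hypothetical placeholders, not real URIs.
TRIPLES = [
    ("person:rep_a", "sponsors", "bill:example"),
    ("person:rep_a", "represents", "district:XX-01"),
    ("person:rep_b", "voted_on", "bill:example"),
    ("person:rep_b", "represents", "district:XX-02"),
]

def objects(subject, predicate):
    """Everything that `subject` points to through `predicate`."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]

def subjects(predicate, obj):
    """Everything that points to `obj` through `predicate`."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]

def districts_of_sponsors(bill):
    """Follow two relations in a row: bill <- sponsors, then represents -> district."""
    return [
        district
        for person in subjects("sponsors", bill)
        for district in objects(person, "represents")
    ]
```

The point of the sketch is that nothing here is tabular: a new kind of relation is just another statement, and a question is a path through the statements.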

Why I care: Machine processing of knowledge combined with machine processing of language is going to radically and fundamentally transform the way we learn, communicate, and live. But this is far off still. (This explains why I study linguistics…)

Then there are some slides on URIs and RDF.

My Cloud: When the data gets too big, it’s hard to remember the exact relations between the entities represented in the data set, so I start to think of my semantic web data as several clouds. One cloud is the data I generate from GovTrack, which is 13 million triples about legislation and politicians. Another cloud is data I generate about campaign contributions: 18 million triples. A third data set is census data: 1 billion triples. I’ve related the clouds together so we can take interesting slices through them and ask questions: how did politicians vote on bills, what are the census statistics of the districts represented by congressmen, are votes correlated with campaign contributions aggregated by zipcode, are campaign contributions by zipcode correlated with census statistics for the zipcode (ZCTA), etc. Once the semantic web framework is in place, the marginal cost of asking a new question is much lower. We don’t need to go through the work that MAPLight did each time we want a new correlation.

Linked Open Data (LOD): I showed my part of the greater LOD cloud/community.

Implementation: A website ties itself to the LOD or semantic web world by including <link/> elements to RDF URIs for the primary topic of a page. This URI can be plugged into a web browser to retrieve RDF about that resource: it’s self-describing. I showed excerpts from a URI for a bill in congress that I created. It has basic metadata, but goes beyond metadata. The pages are auto-generated from a SPARQL DESCRIBE query as I explained in my Census case study on my site rdfabout.com.
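
As a rough sketch of how a program might discover those links, the Python below scans a page for <link/> elements that advertise RDF. The markup details are assumptions: the text above does not say which attributes are used, so the sketch guesses at the common rel="alternate" plus type="application/rdf+xml" convention, and the sample page and its example.org URL are invented.

```python
from html.parser import HTMLParser

class RdfLinkFinder(HTMLParser):
    """Collect href values of <link> elements whose type is RDF/XML."""

    def __init__(self):
        super().__init__()
        self.rdf_links = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <link/> tags also arrive here (via handle_startendtag).
        if tag != "link":
            return
        attributes = dict(attrs)
        # Assumed convention: the RDF version is marked by its MIME type.
        if attributes.get("type") == "application/rdf+xml" and "href" in attributes:
            self.rdf_links.append(attributes["href"])

def find_rdf_links(html_text):
    """Return the RDF URIs a page advertises, in document order."""
    finder = RdfLinkFinder()
    finder.feed(html_text)
    return finder.rdf_links

# Hypothetical page; the URL is a placeholder, not a real GovTrack address.
SAMPLE_PAGE = """
<html><head>
  <title>Example bill page</title>
  <link rel="stylesheet" type="text/css" href="/style.css"/>
  <link rel="alternate" type="application/rdf+xml" href="http://example.org/rdf/bill/1"/>
</head><body></body></html>
"""
```

Calling find_rdf_links(SAMPLE_PAGE) picks out the one RDF link and ignores the stylesheet; a crawler would then fetch that URI to get the self-describing data.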

SPARQL: The query language, the SQL, for the semantic web. It is similar to SQL in metaphors and keywords like SELECT, FROM, and WHERE. It differs in every other way. Interestingly, there is a cultural difference: SPARQL servers (“endpoints”) are often made publicly accessible directly, whereas SQL servers are usually private. This might be because SPARQL is read-only.

Example 1: Did a state’s median income predict the votes of Senators on H.R. 1424, the October 2008 bailout bill? I show the partial RDF graph related to this question and how the graph relates to the SPARQL query. First I show an example SPARQL query, then the real one. The real one is complicated not because RDF or SPARQL are complicated, but because the data model *I* chose to represent the information is complicated. That is, my data set is very detailed and precise, and it takes a precise query to access it properly. I showed how this data might be plugged into Many Eyes to visualize it.

My visualization dream: Visualization tools like Swivel (ehm: I had real problems getting it to work), Many Eyes, Ggobi, and mapping tools should go from SPARQL query to visualization in one step.

Example 2: Show me the campaign contributions to Rep. Steve Israel (NY-2) by zipcode on a map. I showed the actual SPARQL query I issue on my SPARQL server and a map that I want to generate. In fact, I made a prototype of a form where I can submit any arbitrary SPARQL query and it creates an interactive map showing the information.

Other notes: My SPARQL server uses my own .NET/C# RDF library. That creates a “triple store”, the equivalent of an RDBMS for the semantic web. Under the hood, though, it stores the triples in a MySQL database with a table whose columns are “subject, predicate, object”, i.e. a table of triples. See also: D2R server for getting existing data online.
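
To illustrate the "table of triples" idea, here is a minimal Python sketch. It is not my library's actual schema: it swaps in SQLite (from the Python standard library) for MySQL so that it runs anywhere, and the table name, sample identifiers, and function names are all hypothetical.

```python
import sqlite3

# Hypothetical placeholder facts; a real store holds URIs and literals.
SAMPLE_TRIPLES = [
    ("person:rep_a", "voted_on", "bill:example"),
    ("person:rep_a", "represents", "district:XX-01"),
    ("person:rep_b", "voted_on", "bill:example"),
    ("person:rep_b", "represents", "district:XX-02"),
    ("person:rep_c", "represents", "district:XX-03"),
]

def build_store(triples):
    """Create an in-memory database with a single three-column table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    return conn

def voters_with_districts(conn, bill):
    """Self-join the table: one alias per relation in the question."""
    rows = conn.execute(
        """
        SELECT vote.subject, rep.object
        FROM triples AS vote
        JOIN triples AS rep ON rep.subject = vote.subject
        WHERE vote.predicate = 'voted_on' AND vote.object = ?
          AND rep.predicate = 'represents'
        ORDER BY vote.subject
        """,
        (bill,),
    )
    return rows.fetchall()
```

Each relation in a question becomes one more self-join on the same table, which is roughly the kind of bookkeeping a SPARQL engine does on your behalf.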

More Data that Changed the World

Monday, February 23rd, 2009

Continuing from my last post on this subject, I found some more examples of influential data sets from a page on FlowingData.com. I’m expanding beyond government data in this post.

“Baseball Statistics: In 2003, Michael M. Lewis’ book, Moneyball: The Art of Winning an Unfair Game, was released. As a result, the way baseball teams were built changed completely. Before Moneyball, teams relied on insider information and the choice of players was highly subjective. However, in 2002, a year before the book was published, the Oakland A’s had $41 million in salary and had to figure out how to compete against teams like the New York Yankees and the Boston Red Sox who spent over $100 million in salaries.”

“Megan’s Law: Since 1994, those who have been convicted of sex crimes against children have been required to register with local law enforcement. That data is made public so that people know about sex offenders in their area. Mash that data with Google Maps. Lo and behold, parents became instantly aware of caution areas and some might never look at their neighbor the same way ever again, while sex offenders start declaring themselves homeless.”

Comparing stimulus bill text versions side-by-side

Tuesday, February 10th, 2009

One of the concrete benefits of open government data is that third parties can use the data to do something useful that no one in government has the mandate, resources, or insight to do. If you think what I am about to tell you below is cool, and helpful, then you are a supporter of open government data.

On my site GovTrack, you can now find comparisons of the text of H.R. 1, the stimulus bill, at different stages in its legislative life — including the House version (as passed) and the current Senate version (amendment 570).

The main page on GovTrack for HR 1 is: here

Here’s a direct link to the comparison:

Comparisons are possible between any two versions of the bill posted by GPO. Comparisons are available for any bill.

If you find this useful, please take a moment to consider that something like this is possible only when Congress takes data openness seriously. When GPO went online and THOMAS was created in the early 90s, they chose good data formats and access policies (mostly). But the work on open government data didn’t end 15 years ago. As “what’s hot” shifts to video and Twitter, the choices made today are going to impact whether or not these sources of data empower us in the future, whether or not we miss exciting opportunities such as having tools like the one above.

(Thanks to John Wonderlich and Peggy Garvin for some side discussion about this before my post. GovTrack wasn’t initially picking up the latest Senate versions because GPO seems to have gone out of its way to accommodate posting the latest versions before they were passed by the Senate, which is great, but caught GovTrack by surprise.)

Open Government Data that Changed the World

Thursday, February 5th, 2009

I want to make the case that open government data has value not just for geeks, but has the power to change lives in significant ways. I spend a lot of time convincing government managers and staffers that open government data is a good thing, but sometimes we get caught up in the technical details. It’s easy to say that legislative data is an important component of maintaining an educated public, or that open and reusable bits are important for the media to be able to make compelling cases, but it’s all very abstract. So I asked my Open House Project friends: what open government data has changed the world?

Here’s what I got:

Weather data from the NOAA plays an important role in the agricultural sector (hat tip: Clay Shirky, David Weller) and, for that matter, has a lot to do with the weather reports we all use to plan our daily lives. (I tried to get some info on this from NOAA but they ignored my email, ah well.)

Information on publicly traded companies reported to the SEC plays a vital role in the public’s ability to trade fairly. The fact that the SEC continues to break ground on even more comprehensive data requirements for reporting signals that the public availability of these files is extraordinarily important. (Hat tip to Clay for the pointer, and to Carl Malamud for spearheading the effort to get these files online in the first place.) Data from other agencies like BLS and USDA affect the trading of other commodities. (Hat tip: Philip Kromer)

The social security death index has been a tool for genealogy research (hat tip: Tom Bruce).

NASA’s photos of Earth from space are part of the bedrock of inspiration of the country. Can you imagine how different the world might be if NASA kept the photos to itself? The Library of Congress publishes digital versions of historical artifacts, like the founding documents; this too is a critical part of inspiring Americans to strive for an ideal. (Hat tip: Clay.)

Geospatial data from the USGS and the Census Bureau have made mapping applications like Google Maps and in-car GPS devices like TomTom possible or at least cheaper to make. (Hat tip: Philip Kromer. Francis Irving notes that the UK is a counterexample. OK.)

Census statistics, epidemiology data, and many state-funded survey projects have played crucial roles in public health and economic research. No doubt CDC data has saved lives, though I don’t know any specifics (hat tip: many).

If you have other examples, or can help me flesh out these examples, please send something my way. To reiterate: I’m looking for open data that changed lives — please tell me what the data is and how it changed lives.

Examples of XML for government legislative data

Saturday, December 20th, 2008

I was asked yesterday if I would collect some examples of the use of structured data, or more specifically XML, for government legislative information. Here’s what I can think of off the top of my head:

About structured data

Structured data in the U.S. Congress

  • The House drafts most of its legislation in XML now, and these files are shared with the public. Unfortunately, the Senate may be drafting legislation in the same format but does not share its files with the public, seriously undermining the usefulness of the House files to the public. These XML files are the text of legislation, so they help in creating a nice visual display of the text, though the markup is too complicated for me to want to work with it. The files were first systematically shared with the public in 2004, as far as I can see.
  • The House publishes its votes in XML (example). This is an interesting case because the XML is actually the primary way it is published to the public. When visitors view the page, they see a visual or HTML rendering of the underlying XML, but technical users can inspect the XML behind the page. It’s completely transparent. This started around 2004-2005, I believe.
  • The Senate makes its list of membership and contact information available in XML. They have much more XML than they share. The Library of Congress’s Legislative Information System, which is used internally in the capitol, has XML data for Senate committee membership, for instance, but the Senate web team was not permitted to publish it (and LIS does not have a public face itself).
  • The Senate also recently started publishing their committee hearing schedule in XML. This could have been done with RSS, and adding some custom tags. They chose a custom format to more precisely mark up information specific to their needs, which is great. (Unfortunately there will be no data in that file if there are no upcoming meetings.) This feed began in 2008 (afaik).
  • The Senate’s lobbying disclosure database is a collection of XML representing filed forms. It is made available to the public on a timely basis. The records go back to 1999, but were first published only in February 2008.
  • Various committees publish RSS feeds for their news and events. RSS is a flavor of XML.
  • Behind the scenes, the Library of Congress’s LIS unit maintains a rich database of legislative information in XML, but they do not share it with anyone (inside or outside of the capitol), as far as I am aware.
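
To show why XML like the House vote files is so much easier to work with than scraped HTML, here is a short Python sketch that tallies a roll call. The element names (rollcall, recorded-vote, legislator, vote) and the sample records are a simplified, hypothetical stand-in written for this example; they are not a copy of the House's actual schema or of any real vote.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample in an assumed, simplified layout.
SAMPLE_XML = """
<rollcall>
  <question>On Passage</question>
  <recorded-vote><legislator>Member A</legislator><vote>Yea</vote></recorded-vote>
  <recorded-vote><legislator>Member B</legislator><vote>Nay</vote></recorded-vote>
  <recorded-vote><legislator>Member C</legislator><vote>Yea</vote></recorded-vote>
</rollcall>
"""

def tally(xml_text):
    """Count how many members cast each kind of vote."""
    root = ET.fromstring(xml_text)
    return Counter(record.findtext("vote") for record in root.iter("recorded-vote"))

def votes_by_member(xml_text):
    """Map each legislator's name to the vote they cast."""
    root = ET.fromstring(xml_text)
    return {
        record.findtext("legislator"): record.findtext("vote")
        for record in root.iter("recorded-vote")
    }
```

No guessing about page layout is involved: each fact sits in its own labeled element, so the code stays a few lines long and keeps working when the visual design of the page changes.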

Structured data made independently

  • I, of course, try to fill in the gaps in what Congress provides in a structured format, using whatever I can find that Congress provides in a non-structured format. This process of screen-scraping is inexact and brittle, a short-term imperfect solution to a problem with an easy long-term remedy. My GovTrack.us Source Data covers the status of legislation (example), voting records (for both chambers in a common format; example), the text of the Congressional Record, Congressional membership, committee membership (example), etc. I’ve been doing this since 2004.
  • The Cornell Legal Information Institute produces an XML version of the U.S. Code, based on some structured but difficult-to-use data files made available by the House. I think they’ve been doing this since around 2004. (more info; example not easily available)
  • The Sunlight Labs API provides congressional membership and data-linking information.

State-level legislative information

  • In early 2007 I surveyed all of the state legislatures and found that four states provided legislative information in a structured data format, plus California with some semi-structured data. See the links therein for more.
  • Richmond Sunlight, an independent site like GovTrack for the Virginia state legislature, provides some structured data based on what it collects, since 2008 (afaik).

Federal non-legislative data

To quickly list some other sources of structured data at the federal level-

Other notable government structured data

  • Washington DC sets a real example with its Data Catalog. It covers data produced by many aspects of its local government.

oGosh! IRC Meeting Aug 16 4pm EDT

Wednesday, August 6th, 2008

Join me at an IRC chat to talk about open source civic technology projects, on Saturday, August 16 at 4pm Eastern time! The agenda will be a mix of seeing what various civic technology projects are up to, like GovTrack (my site, powered by Mono), OpenCongress, and any others run by people who show up, and getting new people involved in ongoing projects. “oGosh” is Open Government Open Source Hacking (wiki | Facebook), what I’m calling the loose community that binds these projects together.

The chat will be in the #transparency channel on Freenode. For more information on the meeting (and on how to get to the chat), see http://wiki.opengovdata.org/index.php/OGosh.

Suggestions for agenda topics are most welcome either to me directly or by revising the wiki page above. Hope to see you there.

The Open House Project

Thursday, February 8th, 2007

(Just here for archival purposes…)

On my GovTrack blog: It’s rare when Congress asks the people for help being transparent, and so I’m particularly pleased to announce the formation of The Open House Project, a Sunlight Foundation-sponsored project with the encouragement of Speaker Pelosi that will be making specific proposals about how The House can better use the Internet in the interests of transparency. Various people, including myself, will be blogging on that site over the next few weeks about some ideas on this point. Feel free to contribute your ideas by commenting on the TOHP website, joining the project’s mail list, or talking on GovTrack’s own mail list.

And on the TOHP blog:

Mash-ups for government transparency

January 25th, 2007 by Joshua Tauberer

A few years ago I launched GovTrack.us. I didn’t think of it this way at the time, but these days you might call it a mash-up of data about the U.S. Congress. At the time what I was thinking was just collecting information about Congress from various sources (THOMAS, the Senate website, and the House website) and cross-referencing and hyperlinking the data in a way that no one had done yet. In fact, it was the huge amount of public data on the status of legislation that was made available through THOMAS (as I understand it thanks to the Republican take-over in 1994) that inspired me to try to put the data to new uses. It started with updates by email of what your congressmen were up to each day, generated automatically by grabbing data from THOMAS and, effectively, transforming it into a customized email update for anyone who wanted it.

The trouble with building GovTrack is that one has to do a bit of friendly reverse-engineering. The information is all “out there”, meant for public consumption, but it’s not out there in a way that makes it easy to transform into other formats for other uses, like the email updates, RSS feeds, and cross-referenced pages. The trouble is this: While people have no trouble browsing and searching THOMAS (for instance) for the information they need, we can’t make computers do the same thing automatically without much difficulty. To take an example, if I want to have my computer automatically fetch for me a list of all bills that were acted on the previous day (and in fact this is something GovTrack does), I would write a program that fetches the Daily Digest in the Congressional Record from THOMAS, which has bullets like this:

“Eleven bills and one resolution were introduced, as follows: S. 360-370 and S. Res. 37.”

I have no trouble understanding that. But, well, let me say as someone studying linguistics and natural language processing, computers are a long way from being able to understand English prose as well as people, nay as well as three-year-olds. Was the bill S. 365 introduced yesterday? Yes, of course — even though it was not mentioned explicitly (it’s merely in the range 360-370), and that’s just the first problem for a computer trying to make heads or tails of this information. So what’s a programmer to do?
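
Here is a rough Python sketch of the kind of friendly reverse-engineering this takes, just for the range problem in that one bullet. It is not GovTrack's actual code: the regular expression and the function name are made up for illustration, and the sketch only knows four bill-type prefixes, whereas the real Daily Digest uses more forms than this.

```python
import re

# Longer prefixes come first so "S. Res." is not mistaken for "S.".
# The set of prefixes handled here is an assumption, not a complete list.
BILL_REF = re.compile(r"(S\. Res\.|H\. Res\.|H\.R\.|S\.)\s*(\d+)(?:-(\d+))?")

def expand_bill_references(text):
    """Turn prose like 'S. 360-370' into an explicit list of (type, number) pairs."""
    bills = []
    for kind, first, last in BILL_REF.findall(text):
        start = int(first)
        end = int(last) if last else start  # a lone number is a range of one
        bills.extend((kind, number) for number in range(start, end + 1))
    return bills

DIGEST_LINE = (
    "Eleven bills and one resolution were introduced, as follows: "
    "S. 360-370 and S. Res. 37."
)
```

On the line above this yields twelve entries, S. 360 through S. 370 plus S. Res. 37, so S. 365 is finally there explicitly. It also shows how brittle the approach is: any new phrasing in the prose needs another special case.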

Let’s go back to the goal of this. Certainly I don’t think it’s the government’s job to necessarily provide email updates, RSS feeds, Google Calendar integration of events, and whatever the latest technology hits are. There are a million and one things that one can do with information about the status of legislation, and someone will want each of them. So the question is this: How can the government, and Congress in particular, publish information about what it is doing in a way that makes it easy for others to put the information to new uses?

To be concrete again, because it’s always good to be concrete: How can THOMAS publish a list of bills that were acted on in a purpose-neutral way, a way that makes it easy for programmers to go and write applications to take the information and do anything with it that someone might want?

This is a question that I’ll probably blog more than once about on this site in the next few months. The answer is what’s called structured (or “machine-readable”) data, and it comes down to publishing information twice, once for humans clicking away at links, and once in boring, explicit tables meant for computer applications to transform into different formats. But more on that later.
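
As a small preview, here is a Python sketch of "publishing twice" from one set of records. The field names, the CSV layout, and the function names are assumptions made for this example, not a format that THOMAS or GovTrack uses; the three records are taken from the Daily Digest bullet quoted above.

```python
import csv
import io

# Hypothetical field names; the bills come from the digest line quoted above.
FIELDS = ["bill_type", "number", "action"]
ACTIONS = [
    {"bill_type": "S.", "number": 360, "action": "introduced"},
    {"bill_type": "S.", "number": 361, "action": "introduced"},
    {"bill_type": "S. Res.", "number": 37, "action": "introduced"},
]

def render_for_people(records):
    """Readable prose, one sentence per record."""
    return " ".join(
        f"{r['bill_type']} {r['number']} was {r['action']}." for r in records
    )

def render_for_programs(records):
    """A boring, explicit table: one row per bill, no ranges, no prose."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Both renderings come from the same records, so the second one costs the publisher almost nothing, while it saves every downstream programmer from having to reverse-engineer the first.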

Meaningful Reform

Thursday, January 11th, 2007

About a year ago following a few scandals, the House and Senate saw a flurry of Congressional reform legislation get introduced… and then promptly ignored. Finally, however, we may see meaningful reform. Senate majority leader Harry Reid has introduced S. 1: Commission to Strengthen Confidence in Congress Act of 2007. The bill would make two incredibly important advances:

(Sec 103) It shall not be in order to consider any Senate bill or Senate amendment or conference [without] a list of– (1) all earmarks in such measure; (2) an identification of the Member or Members who proposed the earmark; and (3) an explanation of the essential governmental purpose for the earmark is available … to all Members and made available on the Internet to the general public for at least 48 hours before its consideration.

(Sec 104) It shall not be in order to consider a conference report unless such report is available to all Members and made available to the general public by means of the Internet for at least 48 hours before its consideration.

Strangely, the bill does not require that bills (!) be available on the Internet for 48 hours before being voted on. Just conference reports. After a bill has been passed by both the House and the Senate, it’s often the case that the second chamber to get the bill has made amendments to the bill that the first chamber hasn’t yet gotten a chance to see. In that case, a conference committee is made to get the two chambers back in sync, and the final version of a bill comes out in a conference report.

Since it’s been introduced by Reid, I think it’s almost certainly going to get through the Senate. The House seems to be off in its own world, so I’m not sure whether we’ll see this bill ever become law, but it’s got a good shot.

Bounty on bringing state legislative info to the semantic web

Wednesday, June 1st, 2005

Taking the idea from the Gnome community, I’m placing a $150 bounty on the following programming project: Screen-scrape the website of a state legislature and put the information into XML or RDF so that it can be used on the semantic web. The code has to be released under an open-source-type license (and it must meet my personal satisfaction). It should parallel what I’ve done with federal legislative information. If you’re interested, join and post on the GovTrack mail list.

(The money comes from the advertising I have on GovTrack.)

My Trip to D.C.

Wednesday, March 9th, 2005

Last night I got back from a two-day trip to D.C. The point of the trip was to make a presentation about GovTrack and also to start some collaboration with others on expanding the political information that is freely and openly available online.

Monday afternoon I presented GovTrack and some ideas about the semantic web to the people who are responsible for getting some aspects of legislative information posted online in XML format. Right now GovTrack gets its information from screen-scraping, which is an inexact and fragile process of extracting information out of the same HTML pages that you see when you view web sites. Having data published also in XML format can greatly improve the accuracy of getting information. What the people at the clerk of the House have done to date, in terms of getting bills written in XML and roll call votes posted in XML, has been a great step forward, although it hasn’t been that useful for GovTrack. (One reason is the Senate hasn’t followed suit because, as I understand it, the clerk of the Senate isn’t authorized by the Senate itself to work on such things.)

I think I’ve now met almost all of the players in the arena of building a network of political information. Between everyone involved, we have enough data and enthusiasm to get something unique and useful started.

(For more details, see my posting on the GovTrack blog.)