March 18th, 2009
Announcing: HackingCongress.org
The intersection of civics & technology
http://www.hackingcongress.org/
Our community is growing rapidly these days. And while TransparencyCamp gave us a physical place to come together a few weeks ago, we’re still a little nomadic in the online world.
We’re also a very diverse group. The fact that we all often have to cross-post to the same set of lists indicates that we’ve got great inner communities that focus separately on coding, policy, social media, etc.
HackingCongress.org is meant to be a neutral-ground home for the coder community in the open government world. Really, it’s just a links page. But it’s a links page with a nice Drupal theme you can proudly point to and say “this is my movement”. “Hacking” is, of course, a word with several meanings. In the programming world it is very much a positive term meaning something like “creative programming”.
The site runs Drupal, and anyone who creates an account can edit any content on the site. So, it’s basically a new wiki.
Right now you can find:
Community
---------
Links to the primary convergence locations for this community, the Sunlight Labs and PoliParsers mail lists, the IRC channel #transparency, the Planet oGosh blog aggregator, the oGosh Facebook group, and the Upcoming Transparency Events page on the OpenCongress wiki.
Links to all of the other mail lists for our community (all of the ones I’ve mailed here and some others).
Data & APIs
-----------
The beginnings of a list of the databases and APIs that are available for government transparency data. If you’re a data source, add yourself to the list or make sure I got your entry correct, please.
Projects
--------
Links to ongoing projects broken down by type:
– Open-source coding projects like OpenCongress and Sunlight’s Fifty States.
– Policy projects like Open House/Senate.
– Wiki projects like the new Wired gov data wiki.
It was everything I could come up with quickly. I’ll be adding more as I see them, but feel free to add your own project.
#transparency
-------------
Using a Drupal module you can enter the community’s IRC channel #transparency through the website.
Blog Aggregator
---------------
Recently I announced Planet oGosh, an aggregator bringing together a whole bunch of blogs in the open government tech community. I’m changing the URL to planet.hackingcongress.org.
Final Notes
-----------
Thanks to Kendall Clark for donating the domain name.
March 11th, 2009
Two recommendations of the Open House Project report have been taken up in the FY09 omnibus appropriations bill (H.R. 1105). The first recommendation in our chapter on legislative databases was that the Library of Congress make its bill status database directly available to the public and that the GPO not sell legislative documents to the public. These have been the two issues I’ve had my sights on over the last three years (probably starting here). The second recommendation was about coordinating web standards across Congress. These recommendations are addressed in two paragraphs of the House statement accompanying the bill for Division G – Legislative Branch, which is almost like being law itself.
The two paragraphs were added by Congressman Mike Honda of California, one of our champions of the use of technology to further transparency and civic engagement. John Wonderlich of Sunlight Foundation, Rob Pierson in Honda’s office, and I collaborated on this over a long period of time. Honda got involved in 2007 asking the Library to look into this and then in 2008 getting the paragraphs added to the bill markup.
March 5th, 2009
There’s a wacky thread on the Open House List of poems for transparency, so I gave it a try. (In fairness, I should say I used an electronic pronunciation dictionary to find some of the rhymes.)
There once was a man named Mike Honda,
A congressman us geeks are quite fond ‘a,
In markup sessions takes on the chairman, a hulk,
so that we the people can get our data in bulk.
His friend maverick Joe likes transparency too,
Senate votes in XML he says long overdue,
At party politics he snorts,
Because the public should see those CRS reports.
And last we hear of the executive’s new plan,
For a CTO and CIO…
…perhaps YesWeScan?
March 2nd, 2009
Yesterday I held a session called Semantic Web II: Civic Hacking, the Semantic Web, and Visualization at Transparency Camp. In addition to posting my slides, here’s basically what I said during the talk (or, now on reflection, what I should have said):
Who I Am: I run the site GovTrack.us which collects information on the status of bills in the U.S. Congress. I don’t make use of the semantic web to run the site, but as an experiment I generate a large semantic web database out of the data I collect, and some additional related data that I find interesting.
Data Isolation: What the semantic web addresses is data isolation. For instance, the website MAPLight.org, which looks for correlations between campaign contributions to Members of Congress and how they voted on legislation, is essentially something that is too expensive to do for its own sake. Campaign data from the Federal Election Commission isn’t tied to roll call vote data from the House and Senate. It’s only because separate projects have, for independent reasons, massaged the existing data and made it more easily mashable that MAPLight is possible (that’s my site GovTrack and the site opensecrets.org). The semantic web wants to make this process cheaper by addressing mashability at the core. This is important for civic (i.e. political/government) data: machines help us sort, search, and transform information so we can learn something, which is good for civic education, journalism (government oversight), and research (health and economy). And it’s important for the data to be mashable by the public because uses of the data go beyond the resources, mission, and mandate of government agencies.
Beyond Metadata: We can think of the semantic web as going beyond metadata if we think of metadata as tabular, isolated data sets. The semantic web helps us encode non-tabular, non-hierarchical data. It lets us make a web of knowledge about the real world, connecting entities like bills in congress with members of congress, what districts they represent, etc. We establish relations like sponsorship, represents, voted.
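To make the "web of knowledge" idea a little more concrete, here is a minimal Python sketch that stores facts as (subject, predicate, object) triples and follows relations from one entity to another. The identifiers (`person:A`, `bill:1`, `district:X`) and the predicate names are made up for illustration; they are not GovTrack's actual URIs or vocabulary.

```python
# A tiny in-memory "web of knowledge": each fact is a
# (subject, predicate, object) triple.
# All identifiers are hypothetical placeholders, not real GovTrack URIs.
triples = [
    ("person:A", "sponsored", "bill:1"),
    ("person:A", "represents", "district:X"),
    ("person:B", "represents", "district:Y"),
    ("person:B", "votedFor", "bill:1"),
]


def objects(subject, predicate):
    """Return every object linked from `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(predicate, obj):
    """Return every subject linked to `obj` by `predicate`."""
    return [s for s, p, o in triples if p == predicate and o == obj]


def districts_of_sponsors(bill):
    """Follow two relations: bill <- sponsored - person - represents -> district."""
    result = []
    for person in subjects("sponsored", bill):
        result.extend(objects(person, "represents"))
    return result


print(districts_of_sponsors("bill:1"))
```

Nothing here is tabular or hierarchical: a new kind of relation is just another triple, and a question that crosses entity types is just a walk along the edges.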
Why I care: Machine processing of knowledge combined with machine processing of language is going to radically and fundamentally transform the way we learn, communicate, and live. But this is far off still. (This explains why I study linguistics…)
Then there are some slides on URIs and RDF.
My Cloud: When the data gets too big, it’s hard to remember the exact relations between the entities represented in the data set, so I start to think of my semantic web data as several clouds. One cloud is the data I generate from GovTrack, which is 13 million triples about legislation and politicians. Another cloud is data I generate about campaign contributions: 18 million triples. A third data set is census data: 1 billion triples. I’ve related the clouds together so we can take interesting slices through them and ask questions: how did politicians vote on bills, what are the census statistics of the districts represented by congressmen, are votes correlated with campaign contributions aggregated by zipcode, are campaign contributions by zipcode correlated with census statistics for the zipcode (ZCTA), etc. Once the semantic web framework is in place, the marginal cost of asking a new question is much lower. We don’t need to go through the work that MAPLight did each time we want a new correlation.
Linked Open Data (LOD): I showed my part of the greater LOD cloud/community.
Implementation: A website ties itself to the LOD or semantic web world by including <link/> elements to RDF URIs for the primary topic of a page. This URI can be plugged into a web browser to retrieve RDF about that resource: it’s self-describing. I showed excerpts from a URI for a bill in congress that I created. It has basic metadata, but goes beyond metadata. The pages are auto-generated from a SPARQL DESCRIBE query as I explained in my Census case study on my site rdfabout.com.
SPARQL: The query language, the SQL, for the semantic web. It is similar to SQL in metaphors and keywords like SELECT, FROM, and WHERE. It differs in every other way. Interestingly, there is a cultural difference: SPARQL servers (“endpoints”) are often made publicly accessible directly, whereas SQL servers are usually private. This might be because SPARQL is read-only.
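For readers who haven't seen SPARQL, here is a small Python sketch of what a query looks like and how it might be sent to a public endpoint as an HTTP GET with a `query` parameter. The endpoint URL and the `ex:` vocabulary are placeholders I made up for the example; a real endpoint would have its own address and predicates.

```python
from urllib.parse import urlencode

# Hypothetical endpoint address, for illustration only.
ENDPOINT = "http://example.org/sparql"

# SELECT and WHERE look familiar from SQL, but the WHERE clause
# holds triple patterns (subject predicate object) instead of
# conditions on table columns.
# The ex: vocabulary is invented for this sketch.
QUERY = """
PREFIX ex: <http://example.org/vocab#>
SELECT ?bill ?title
WHERE {
  ?bill ex:sponsor ?person .
  ?bill ex:title ?title .
}
LIMIT 10
"""


def build_request_url(endpoint, query):
    """Encode the query text into a GET URL for the endpoint."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

Because the whole request is just a URL, anyone can paste it into a browser or script, which fits the culture of publicly accessible endpoints.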
Example 1: Did a state’s median income predict the votes of Senators on H.R. 1424, the October 2008 bailout bill? I show the partial RDF graph related to this question and how the graph relates to the SPARQL query. First I show an example SPARQL query, then the real one. The real one is complicated not because RDF or SPARQL are complicated, but because the data model *I* chose to represent the information is complicated. That is, my data set is very detailed and precise, and it takes a precise query to access it properly. I showed how this data might be plugged into Many Eyes to visualize it.
My visualization dream: Visualization tools like Swivel (ehm: I had real problems getting it to work), Many Eyes, Ggobi, and mapping tools should go from SPARQL query to visualization in one step.
Example 2: Show me the campaign contributions to Rep. Steve Israel (NY-2) by zipcode on a map. I showed the actual SPARQL query I issue on my SPARQL server and a map that I want to generate. In fact, I made a prototype of a form where I can submit any arbitrary SPARQL query and it creates an interactive map showing the information.
Other notes: My SPARQL server uses my own .NET/C# RDF library. That creates a “triple store”, the equivalent of an RDBMS for the semantic web. Underlyingly, though, it stores the triples in a MySQL database with a table whose columns are “subject, predicate, object”, i.e. a table of triples. See also: D2R server for getting existing data online.
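The "table of triples" idea can be sketched in a few lines of Python. This is only an illustration: it uses the standard library's SQLite module in place of MySQL, the table layout is the bare three-column version, and the identifiers are invented. It is not how my library actually lays out its tables.

```python
import sqlite3

# SQLite (in memory) stands in for MySQL here; the point is the
# single three-column table, not the database engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        # Hypothetical identifiers, not real URIs.
        ("bill:1", "sponsor", "person:A"),
        ("person:A", "represents", "district:X"),
        ("bill:2", "sponsor", "person:B"),
        ("person:B", "represents", "district:Y"),
    ],
)

# A graph query with two triple patterns becomes a self-join:
# the object of the first pattern is the subject of the second.
rows = conn.execute(
    """
    SELECT t1.subject, t2.object
    FROM triples AS t1
    JOIN triples AS t2 ON t1.object = t2.subject
    WHERE t1.predicate = 'sponsor' AND t2.predicate = 'represents'
    ORDER BY t1.subject
    """
).fetchall()
print(rows)
```

Each additional triple pattern in a query adds another self-join, which is roughly why a dedicated triple store layer sits on top of the relational table.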
February 23rd, 2009
Continuing from my last post on this subject, I found some more examples of influential data sets from a page on FlowingData.com. I’m expanding beyond government data in this post.
“Baseball Statistics: In 2003, Michael M. Lewis’ book, Moneyball: The Art of Winning an Unfair Game, was released. As a result, the way baseball teams were built changed completely. Before Moneyball, teams relied on insider information and the choice of players was highly subjective. However, in 2002, a year before the book was published, the Oakland A’s had $41 million in salary and had to figure out how to compete against teams like the New York Yankees and the Boston Red Sox who spent over $100 million in salaries.”
“Megan’s Law: Since 1994, those who have been convicted of sex crimes against children have been required to register with local law enforcement. That data is made public so that people know about sex offenders in their area. Mash that data with Google Maps. Lo and behold, parents became instantly aware of caution areas and some might never look at their neighbor the same way ever again, while sex offenders start declaring themselves homeless.”
February 11th, 2009
Here is a command-line-only way to forward audio over a network in Ubuntu 8.10+. Ubuntu uses PulseAudio as a sound server. It can be configured (esp. using the padevchooser package) to do this “properly”, but I wanted something fast that didn’t require configuration (and, uhm, I didn’t know about padevchooser).
The first step is to establish a secure connection between the machines using SSH to forward a local TCP port (the one with the sound application) to a TCP port on the remote machine (the one with the speakers):
(on the computer with the sound application)
ssh -L4000:localhost:4000 remotehost
PulseAudio is a server that listens for connections on a Unix domain socket by default, which means 1) it can’t be accessed remotely, and 2) ssh can’t redirect a TCP port to it directly. One could configure PulseAudio to use a TCP port instead, but then you have to worry about security. So instead, use the socat tool (you might need to install the socat package) on the remote machine to forward the remote machine’s local port 4000 to its PulseAudio Unix socket:
(on the computer with the speakers, e.g. inside the SSH session)
socat TCP-LISTEN:4000,bind=127.0.0.1,fork UNIX-CONNECT:/tmp/pulse-$USER/native
With bind=127.0.0.1, socat accepts only local connections (without it, TCP-LISTEN listens on all interfaces), which is why we need to SSH to the remote machine to connect. That’s a good thing if you like security.
The “,fork” option has socat listen for multiple connections. Otherwise it’ll quit after the first connection (it’ll play one sound file and exit). You can run socat once you’ve logged in with SSH. You might think you can do it all on one line (because ssh takes a second command line argument for a command to run), but it doesn’t work well with this “,fork” option because when you CTRL+C SSH to end the session, socat keeps running (which might be fine, I guess, but it will prevent you from running it a second time since something will already be listening on port 4000).
Now we can have PulseAudio-enabled programs play sound remotely by specifying to put the sound on the local port 4000, rather than to where it normally puts it for local sound. The paplay program plays a wav file:
paplay soundfile.wav (plays it locally)
paplay -s localhost:4000 soundfile.wav (plays it remotely)
Or equivalently by setting an environment variable:
PULSE_SERVER=localhost:4000 paplay soundfile.wav
The environment variable method should work for any other program that plays sound with PulseAudio.
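If you launch the sound program from a script rather than a shell, the same environment-variable trick can be sketched in Python. The helper names `pulse_env` and `play_remotely` are made up for this example, and it assumes the SSH tunnel and the socat forwarder described above are already running on port 4000.

```python
import os
import subprocess


def pulse_env(port=4000, host="localhost"):
    """Copy the current environment and point PULSE_SERVER at the tunnel."""
    env = dict(os.environ)
    env["PULSE_SERVER"] = f"{host}:{port}"
    return env


def play_remotely(command, port=4000):
    """Run `command` (a list of arguments) with PULSE_SERVER set.

    Example: play_remotely(["paplay", "soundfile.wav"])
    """
    return subprocess.run(command, env=pulse_env(port), check=False)
```

Only the child process sees the modified environment, so other programs on the local machine keep playing sound locally.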
The tricks begin when programs don’t support PulseAudio and instead use OSS or ALSA to play sounds. In principle, the padsp command is able to redirect the use of /dev/dsp (i.e. OSS output) to PulseAudio. Likewise, for programs that output sound using ALSA, you can redirect the output to PulseAudio by setting the environment variable ALSA_PCM_NAME=pulse.
PULSE_SERVER=localhost:4000 padsp application_using_oss
PULSE_SERVER=localhost:4000 ALSA_PCM_NAME=pulse application_using_alsa
I wanted OSS support for running Festival, the speech synthesis program, but this method doesn’t work in the versions of everything I have – Festival segfaults if you use padsp on it. So instead there are more tricks. You can override how Festival plays sounds with some Festival commands. Here’s how to have it output with the PulseAudio sound player (and with the command to redirect the output to the SSH forwarded port):
(Parameter.set 'Audio_Method 'Audio_Command)
(Parameter.set 'Audio_Required_Format 'wav)
(Parameter.set 'Audio_Command "paplay -s localhost:4000 $FILE")
I also wanted this to work with the Praat phonetics program, which currently outputs using ALSA, but it did not recognize the ALSA environment variable setting as described above, so it may not be possible to redirect its sound output this way.
February 10th, 2009
One of the concrete benefits of open government data is that third parties can use the data to do something useful that no one in government has the mandate, resources, or insight to do. If you think what I am about to tell you below is cool, and helpful, then you are a supporter of open government data.
On my site GovTrack, you can now find comparisons of the text of H.R. 1, the stimulus bill, at different stages in its legislative life — including the House version (as passed) and the current Senate version (amendment 570).
The main page on GovTrack for HR 1 is: here
Here’s a direct link to the comparison:
Comparisons are possible between any two versions of the bill posted by GPO. Comparisons are available for any bill.
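To give a flavor of what a text comparison involves, here is a minimal Python sketch using the standard library's difflib. It is not GovTrack's actual implementation, and the two "versions" are invented sample text rather than real bill language.

```python
import difflib


def compare_versions(old_text, new_text, old_label="version A", new_label="version B"):
    """Return a line-by-line unified diff of two versions of a text."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )


# Invented sample text, standing in for two posted versions of a bill.
house = "SEC. 1. SHORT TITLE.\nThis Act may be cited as the Example Act.\n"
senate = "SEC. 1. SHORT TITLE.\nThis Act may be cited as the Revised Example Act.\n"

for line in compare_versions(house, senate, "House", "Senate"):
    print(line)
```

The diff itself is the easy part. What makes it possible at all is that every version of the text is published in a form a program can fetch and read.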
If you find this useful, please take a moment to consider that something like this is possible only when Congress takes data openness seriously. When GPO went online and THOMAS was created in the early 90s, they chose good data formats and access policies (mostly). But the work on open government data didn’t end 15 years ago. As “what’s hot” shifts to video and Twitter, the choices made today are going to impact whether or not these sources of data empower us in the future, whether or not we miss exciting opportunities such as having tools like the one above.
(Thanks to John Wonderlich and Peggy Garvin for some side discussion about this before my post. GovTrack wasn’t initially picking up the latest Senate versions because GPO seems to have gone out of its way to accommodate posting the latest versions before they were passed by the Senate, which is great, but caught GovTrack by surprise.)
February 5th, 2009
I want to make the case that open government data has value not just for geeks, but has the power to change lives in significant ways. I spend a lot of time convincing government managers and staffers that open government data is a good thing, but sometimes we get caught up in the technical details. It’s easy to say that legislative data is an important component of maintaining an educated public, or that open and reusable bits are important for the media to be able to make compelling cases, but it’s all very abstract. So I asked my Open House Project friends: what open government data has changed the world?
Here’s what I got:
Weather data from the NOAA plays an important role in the agricultural sector (hat tip: Clay Shirky, David Weller) and, for that matter, has a lot to do with the weather reports we all use to plan our daily lives. (I tried to get some info on this from NOAA but they ignored my email, ah well.)
Information on publicly traded companies reported to the SEC plays a vital role in the public’s ability to trade fairly. The fact that the SEC continues to break ground on even more comprehensive data requirements for reporting signals that the public availability of these files is extraordinarily important. (Hat tip to Clay for the pointer, and to Carl Malamud for spearheading getting these files online in the first place.) Data from other agencies like BLS and USDA affect the trading of other commodities. (Hat tip: Philip Kromer)
The social security death index has been a tool for genealogy research (hat tip: Tom Bruce).
NASA’s photos of Earth from space are part of the bedrock of inspiration of the country. Can you imagine how different the world might be if NASA kept the photos to itself? The Library of Congress publishes digital versions of historical artifacts, like the founding documents; this too is a critical part of inspiring Americans to strive for an ideal. (Hat tip: Clay.)
Geospatial data from the USGS and the Census Bureau have made mapping applications like Google Maps and in-car GPS devices like TomTom possible or at least cheaper to make. (Hat tip: Philip Kromer. Francis Irving notes that the UK is a counterexample. OK.)
Census statistics, epidemiology data, and many state-funded survey projects have played crucial roles in public health and economic research. No doubt CDC data has saved lives, though I don’t know any specifics (hat tip: many).
If you have other examples, or can help me flesh out these examples, please send something my way. To reiterate: I’m looking for open data that changed lives — please tell me what the data is and how it changed lives.
December 20th, 2008
I was asked yesterday if I would collect some examples of the use of structured data, or more specifically XML, for government legislative information. Here’s what I can think of off the top of my head:
About structured data
Structured data in the U.S. Congress
- The House drafts most of its legislation in XML now, and these files are shared with the public. Unfortunately, the Senate may be drafting legislation in the same format but does not share their files with the public, seriously undermining the usefulness of the House files to the public. These XML files are the text of legislation, so it aids in creating a nice visual display of the text, though the markup is too complicated for me to want to work with it. The files were first systematically shared with the public in 2004, as far as I can see.
- The House publishes its votes in XML (example). This is an interesting case because the XML is actually the primary way it is published to the public. When visitors view the page, they see a visual or HTML rendering of the underlying XML, but technical users can inspect the XML behind the page. It’s completely transparent. This started around 2004-2005, I believe.
- The Senate makes its list of membership and contact information available in XML. They have much more XML than they share. The Library of Congress’s Legislative Information System, which is used internally in the capitol, has XML data for Senate committee membership, for instance, but the Senate web team was not permitted to publish it (and LIS does not have a public face itself).
- The Senate also recently started publishing their committee hearing schedule in XML. This could have been done with RSS, and adding some custom tags. They chose a custom format to more precisely mark up information specific to their needs, which is great. (Unfortunately there will be no data in that file if there are no upcoming meetings.) This feed began in 2008 (afaik).
- The Senate’s lobbying disclosure database is a collection of XML representing filed forms. It is made available to the public on a timely basis. The records go back to 1999, but were first published only in February 2008.
- Various committees publish RSS feeds for their news and events. RSS is a flavor of XML.
- Behind the scenes, the Library of Congress’s LIS unit maintains a rich database of legislative information in XML, but they do not share it with anyone (inside or outside of the capitol), as far as I am aware.
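As a rough illustration of what "technical users can inspect the XML" buys you, here is a short Python sketch that tallies votes from an XML document. The element and attribute names (`vote-record`, `ballot`, `member`) are invented for the example; they are not the House's or Senate's actual schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical vote document; the real files use their own schema.
SAMPLE = """
<vote-record>
  <ballot member="Member A">Yea</ballot>
  <ballot member="Member B">Nay</ballot>
  <ballot member="Member C">Yea</ballot>
</vote-record>
"""


def tally(xml_text):
    """Count how many ballots were cast for each position."""
    root = ET.fromstring(xml_text)
    return Counter(ballot.text.strip() for ballot in root.iter("ballot"))


print(tally(SAMPLE))
```

With structured data, the tally is a few lines against named fields. Without it, the same count means scraping a rendered page and hoping the layout doesn't change.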
Structured data made independently
- I, of course, try to fill in the gaps in what Congress provides in a structured format, using whatever I can find that Congress provides in a non-structured format. This process of screen-scraping is inexact and brittle, a short-term imperfect solution to a problem with an easy long-term remedy. My GovTrack.us Source Data covers the status of legislation (example), voting records (for both chambers in a common format; example), the text of the Congressional Record, Congressional membership, committee membership (example), etc. I’ve been doing this since 2004.
- The Cornell Legal Information Institute produces an XML version of the U.S. Code, based on some structured but difficult-to-use data files made available by the House. I think they’ve been doing this since around 2004. (more info; example not easily available)
- The Sunlight Labs API provides congressional membership and data-linking information.
State-level legislative information
- In early 2007 I surveyed all of the state legislatures and found that four states provided legislative information in a structured data format, plus California with some semi-structured data. See the links therein for more.
- Richmond Sunlight, an independent site like GovTrack for the Virginia state legislature, provides some structured data based on what it collects, since 2008 (afaik).
Federal non-legislative data
To quickly list some other sources of structured data at the federal level-
Other notable government structured data
- Washington DC sets a real example with its Data Catalog. It covers data produced by many aspects of its local government.
August 22nd, 2008
In May, the Congress passed the 2008 Farm Bill, which regulates various food, nutrition, and apparently biofuel issues. Tufts food policy professor Parke Wilde writes on his blog today:
The 629-page text (.pdf) of the 2008 Farm Bill is so complex and unreadable that the U.S. food policy community has been on the edge of our seats waiting for the USDA/ERS side-by-side comparison unveiled today.
The ERS side-by-side tool compares the new Farm Bill with current law, title by title, so we can finally begin to understand what the law really means.
ERS is the USDA’s Economic Research Service. Their side-by-side webpage, which I think was just published this week, shows the provisions of the previous and the current bill side-by-side. (It’s not a comparison of the bill text, but of summaries of the provisions.)
This is interesting on a number of counts. First, the fact that it is the USDA making this comparison suggests that everyone agrees the bill itself is effectively incomprehensible, even to professionals and scholars, on account of its size, and that summarizing it is costly enough that only the government would do it, taking three months to prepare.
Second, if this is what was needed to understand the Farm Bill, was it passed without anyone understanding it?
Third, this comparison was made by and for professionals and scholars, not by tech geeks. Why aren’t we talking to them?
The ERS tool comes complete with a seemingly unintentionally hilarious intro video — overly dramatic with background music fit for the Miss Universe competition. (Wilde likened it to “a documentary by Kenneth Burns or an account of a manned mission to the moon”.)