February 23rd, 2009
Continuing from my last post on this subject, I found some more examples of influential data sets from a page on FlowingData.com. I’m expanding beyond government data in this post.
“Baseball Statistics: In 2003, Michael M. Lewis’ book, Moneyball: The Art of Winning an Unfair Game, was released. As a result, the way baseball teams were built changed completely. Before Moneyball, teams relied on insider information and the choice of players was highly subjective. However, in 2002, a year before the book was published, the Oakland A’s had $41 million in salary and had to figure out how to compete against teams like the New York Yankees and the Boston Red Sox who spent over $100 million in salaries.”
“Megan’s Law: Since 1994, those who have been convicted of sex crimes against children have been required to register with local law enforcement. That data is made public so that people know about sex offenders in their area. Mash that data with Google Maps. Lo and behold, parents became instantly aware of caution areas and some might never look at their neighbor the same way ever again, while sex offenders start declaring themselves homeless.”
February 11th, 2009
Here is a command-line-only way to forward audio over a network in Ubuntu 8.10+. Ubuntu uses PulseAudio as a sound server. It can be configured (esp. using the padevchooser package) to do this “properly”, but I wanted something fast that didn’t require configuration (and, uhm, I didn’t know about padevchooser).
The first step is to establish a secure connection between the machines, using SSH to forward a TCP port on the local machine (the one with the sound application) to a TCP port on the remote machine (the one with the speakers):
(on the computer with the sound application)
ssh -L4000:localhost:4000 remotehost
PulseAudio is a server that listens for connections on a Unix domain socket by default, which means 1) it can’t be accessed remotely, and 2) ssh can’t redirect a TCP port to it directly. One could configure PulseAudio to use a TCP port instead, but then you have to worry about security. So instead, use the socat tool (you might need to install the socat package) on the remote machine to forward the remote machine’s local port 4000 to its PulseAudio Unix socket:
(on the computer with the speakers, e.g. inside the SSH session)
socat TCP-LISTEN:4000,bind=127.0.0.1,fork UNIX-CONNECT:/tmp/pulse-$USER/native
The “bind=127.0.0.1” option makes socat accept only local connections (without it, socat listens on every network interface), which is why we need to SSH to the remote machine to connect. That’s a good thing if you like security.
The “,fork” option has socat listen for multiple connections. Otherwise it’ll quit after the first connection (it’ll play one sound file and exit). You can run socat once you’ve logged in with SSH. You might think you can do it all on one line (because ssh takes a second command line argument for a command to run), but it doesn’t work well with this “,fork” option because when you CTRL+C SSH to end the session, socat keeps running (which might be fine, I guess, but it will prevent you from running it a second time since something will already be listening on port 4000).
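The socat step can be wrapped in a small guard so that it fails early when the PulseAudio socket is not where the command expects it. This is only a sketch: start_pulse_forward is a hypothetical helper name, the default port and socket path are the ones used above, and bind=127.0.0.1 is there to restrict the listener to the loopback interface.

```shell
# Hypothetical helper for the machine with the speakers.
# Arguments (both optional): TCP port, path to the PulseAudio Unix socket.
start_pulse_forward() {
    port="${1:-4000}"
    sock="${2:-/tmp/pulse-$USER/native}"

    # Refuse to start if the path is not an existing Unix socket.
    if [ ! -S "$sock" ]; then
        echo "no PulseAudio socket at $sock" >&2
        return 1
    fi

    # Listen on loopback only; fork so more than one connection is served.
    socat "TCP-LISTEN:$port,bind=127.0.0.1,fork" "UNIX-CONNECT:$sock"
}
```

Calling start_pulse_forward with no arguments inside the SSH session behaves like the socat command above, and CTRL+C stops it.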
Now we can have PulseAudio-enabled programs play sound remotely by specifying to put the sound on the local port 4000, rather than to where it normally puts it for local sound. The paplay program plays a wav file:
paplay soundfile.wav (plays it locally)
paplay -s localhost:4000 soundfile.wav (plays it remotely)
Or equivalently by setting an environment variable:
PULSE_SERVER=localhost:4000 paplay soundfile.wav
The environment variable method should work for any other program that plays sound with PulseAudio.
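If you use the environment variable often, a tiny wrapper saves typing. This is a sketch under assumptions: with_remote_pulse and PULSE_FORWARD_PORT are names made up for illustration, and the port defaults to the 4000 used in the SSH command above.

```shell
# Hypothetical wrapper: run any command with PULSE_SERVER pointing at the
# SSH-forwarded port. PULSE_FORWARD_PORT is an assumed override variable.
with_remote_pulse() {
    PULSE_SERVER="localhost:${PULSE_FORWARD_PORT:-4000}" "$@"
}

# Usage (same effect as the PULSE_SERVER line above):
#   with_remote_pulse paplay soundfile.wav
```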
The tricks begin when programs don’t support PulseAudio and instead use OSS or ALSA to play sounds. In principle, the padsp command is able to redirect the use of /dev/dsp (i.e. OSS output) to PulseAudio. Likewise, for programs that output sound using ALSA, you can redirect the output to PulseAudio by setting the environment variable ALSA_PCM_NAME=pulse.
PULSE_SERVER=localhost:4000 padsp application_using_oss
PULSE_SERVER=localhost:4000 ALSA_PCM_NAME=pulse application_using_alsa
I wanted OSS support for running Festival, the speech synthesis program, but this method doesn’t work in the versions of everything I have: Festival segfaults if you use padsp on it. So instead there are more tricks. You can override how Festival plays sounds with some Festival commands. Here’s how to have it output with the PulseAudio sound player (and with the command to redirect the output to the SSH forwarded port):
(Parameter.set 'Audio_Method 'Audio_Command)
(Parameter.set 'Audio_Required_Format 'wav)
(Parameter.set 'Audio_Command "paplay -s localhost:4000 $FILE")
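To avoid retyping these at every Festival prompt, they can be appended to an init file. A minimal sketch, assuming Festival reads ~/.festivalrc at startup; write_festival_audio_config is a hypothetical helper name. Note the straight quotes and the literal $FILE, which Festival (not the shell) fills in.

```shell
# Hypothetical helper: append the audio settings to a Festival init file.
# Argument (optional): path of the init file, defaulting to ~/.festivalrc.
write_festival_audio_config() {
    rc="${1:-$HOME/.festivalrc}"
    # The quoted 'EOF' delimiter stops the shell from expanding $FILE.
    cat >> "$rc" <<'EOF'
(Parameter.set 'Audio_Method 'Audio_Command)
(Parameter.set 'Audio_Required_Format 'wav)
(Parameter.set 'Audio_Command "paplay -s localhost:4000 $FILE")
EOF
}
```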
I also wanted this to work with the Praat phonetics program, which currently outputs using ALSA, but it did not recognize the ALSA environment variable setting as described above, so it may not be possible to redirect its sound output this way.
February 10th, 2009
One of the concrete benefits of open government data is that third parties can use the data to do something useful that no one in government has the mandate, resources, or insight to do. If you think what I am about to tell you below is cool, and helpful, then you are a supporter of open government data.
On my site GovTrack, you can now find comparisons of the text of H.R. 1, the stimulus bill, at different stages in its legislative life — including the House version (as passed) and the current Senate version (amendment 570).
The main page on GovTrack for HR 1 is: here
Here’s a direct link to the comparison:
Comparisons are possible between any two versions of the bill posted by GPO. Comparisons are available for any bill.
If you find this useful, please take a moment to consider that something like this is possible only when Congress takes data openness seriously. When GPO went online and THOMAS was created in the early 90s, they chose good data formats and access policies (mostly). But the work on open government data didn’t end 15 years ago. As “what’s hot” shifts to video and Twitter, the choices made today are going to impact whether or not these sources of data empower us in the future, whether or not we miss exciting opportunities such as having tools like the one above.
(Thanks to John Wonderlich and Peggy Garvin for some side discussion about this before my post. GovTrack wasn’t initially picking up the latest Senate versions because GPO seems to have gone out of its way to accommodate posting the latest versions before they were passed by the Senate, which is great, but caught GovTrack by surprise.)
February 5th, 2009
I want to make the case that open government data has value not just for geeks, but has the power to change lives in significant ways. I spend a lot of time convincing government managers and staffers that open government data is a good thing, but sometimes we get caught up in the technical details. It’s easy to say that legislative data is an important component of maintaining an educated public, or that open and reusable bits are important for the media to be able to make compelling cases, but it’s all very abstract. So I asked my Open House Project friends: what open government data has changed the world?
Here’s what I got:
Weather data from the NOAA plays an important role in the agricultural sector (hat tip: Clay Shirky, David Weller) and, for that matter, has a lot to do with the weather reports we all use to plan our daily lives. (I tried to get some info on this from NOAA but they ignored my email, ah well.)
Information on publicly traded companies reported to the SEC plays a vital role in the public’s ability to trade fairly. The fact that the SEC continues to break ground on even more comprehensive data requirements for reporting signals that the public availability of these files is extraordinarily important. (Hat tip to Clay for the pointer, and to Carl Malamud for spearheading getting these files originally online in the first place.) Data from other agencies like BLS and USDA affect the trading of other commodities. (Hat tip: Philip Kromer)
The social security death index has been a tool for genealogy research (hat tip: Tom Bruce).
NASA’s photos of Earth from space are part of the bedrock of inspiration of the country. Can you imagine how different the world might be if NASA kept the photos to itself? The Library of Congress publishes digital versions of historical artifacts, like the founding documents. This too is a critical part of inspiring Americans to strive for an ideal. (Hat tip: Clay.)
Geospatial data from the USGS and the Census Bureau have made mapping applications like Google Maps and in-car GPS devices like TomTom possible or at least cheaper to make. (Hat tip: Philip Kromer. Francis Irving notes that the UK is a counterexample. OK.)
Census statistics, epidemiology data, and many state-funded survey projects have played crucial roles in public health and economic research. No doubt CDC data has saved lives, though I don’t know any specifics (hat tip: many).
If you have other examples, or can help me flesh out these examples, please send something my way. To reiterate: I’m looking for open data that changed lives — please tell me what the data is and how it changed lives.
December 20th, 2008
I was asked yesterday if I would collect some examples of the use of structured data, or more specifically XML, for government legislative information. Here’s what I can think of off the top of my head:
About structured data
Structured data in the U.S. Congress
- The House drafts most of its legislation in XML now, and these files are shared with the public. Unfortunately, the Senate may be drafting legislation in the same format but does not share its files with the public, seriously undermining the usefulness of the House files. These XML files are the text of legislation, so they aid in creating a nice visual display of the text, though the markup is too complicated for me to want to work with it. The files were first systematically shared with the public in 2004, as far as I can see.
- The House publishes its votes in XML (example). This is an interesting case because the XML is actually the primary way it is published to the public. When visitors view the page, they see a visual or HTML rendering of the underlying XML, but technical users can inspect the XML behind the page. It’s completely transparent. This started around 2004-2005, I believe.
- The Senate makes its list of membership and contact information available in XML. They have much more XML than they share. The Library of Congress’s Legislative Information System, which is used internally in the capitol, has XML data for Senate committee membership, for instance, but the Senate web team was not permitted to publish it (and LIS does not have a public face itself).
- The Senate also recently started publishing their committee hearing schedule in XML. This could have been done with RSS, and adding some custom tags. They chose a custom format to more precisely mark up information specific to their needs, which is great. (Unfortunately there will be no data in that file if there are no upcoming meetings.) This feed began in 2008 (afaik).
- The Senate’s lobbying disclosure database is a collection of XML representing filed forms. It is made available to the public on a timely basis. The records go back to 1999, but were first published only in February 2008.
- Various committees publish RSS feeds for their news and events. RSS is a flavor of XML.
- Behind the scenes, the Library of Congress’s LIS unit maintains a rich database of legislative information in XML, but they do not share it with anyone (inside or outside of the capitol), as far as I am aware.
Structured data made independently
- I, of course, try to fill in the gaps in what Congress provides in a structured format, using whatever I can find that Congress provides in a non-structured format. This process of screen-scraping is inexact and brittle, a short-term imperfect solution to a problem with an easy long-term remedy. My GovTrack.us Source Data covers the status of legislation (example), voting records (for both chambers in a common format; example), the text of the Congressional Record, Congressional membership, committee membership (example), etc. I’ve been doing this since 2004.
- The Cornell Legal Information Institute produces an XML version of the U.S. Code, based on some structured but difficult-to-use data files made available by the House. I think they’ve been doing this since around 2004. (more info; example not easily available)
- The Sunlight Labs API provides congressional membership and data-linking information.
State-level legislative information
- In early 2007 I surveyed all of the state legislatures and found that four states provided legislative information in a structured data format, plus California with some semi-structured data. See the links therein for more.
- Richmond Sunlight, an independent site like GovTrack for the Virginia state legislature, provides some structured data based on what it collects, since 2008 (afaik).
Federal non-legislative data
To quickly list some other sources of structured data at the federal level-
Other notable government structured data
- Washington DC sets a real example with its Data Catalog. It covers data produced by many aspects of its local government.
August 22nd, 2008
In May, the Congress passed the 2008 Farm Bill, which regulates various food, nutrition, and apparently biofuel issues. Tufts food policy professor Parke Wilde writes on his blog today:
The 629-page text (.pdf) of the 2008 Farm Bill is so complex and unreadable that the U.S. food policy community has been on the edge of our seats waiting for the USDA/ERS side-by-side comparison unveiled today.
The ERS side-by-side tool compares the new Farm Bill with current law, title by title, so we can finally begin to understand what the law really means.
ERS is the USDA’s Economic Research Service. Their side-by-side webpage, which I think was just published this week, shows the provisions of the previous and the current bill side-by-side. (It’s not a comparison of the bill text, but of summaries of the provisions.)
This is interesting on a number of counts. First, the fact that it is the USDA making this comparison suggests that everyone agrees that the bill itself is effectively incomprehensible even to professionals and scholars on account of its size, and that summarizing it is costly enough that only the government would do it, taking three months to prepare.
Second, if this is what was needed to understand the Farm Bill, was it passed without anyone understanding it?
Third, this comparison was made by and for professionals and scholars, not by tech geeks. Why aren’t we talking to them?
The ERS tool comes complete with a seemingly unintentionally hilarious intro video — overly dramatic with background music fit for the Miss Universe competition. (Wilde likened it to “a documentary by Kenneth Burns or an account of a manned mission to the moon”.)
August 12th, 2008
My Census RDF dataset is being used in a public health project:
On SemanticWeb.com: http://www.semanticweb.com/article.php/3764266
The Semantic Web’s Role in Dealing with Disasters
August 8, 2008
By Jennifer Zaino
The University of Southern California Information Sciences Institute and Childrens Hospital Los Angeles have been working together to build a software tool. Dubbed PEDSS (Pediatric Emergency Decision Support System), the tool is designed to help medical service providers more effectively plan for, train for, and respond to serious incidents and disasters affecting children.
The project, a part of the Pediatric Disaster Resource and Training Center (PDRTC), has been going on for about eight months.
Dr. Tatyana Ryutov, a research scientist at the USC Information Sciences Institute, is working on the system. Recently, the Institute contacted Joshua Tauberer, the creator of GovTrack.us and the man who maintains a large RDF (Resource Description Framework) data set of U.S. Census data, about making SPARQL queries to that data in conjunction with the PEDSS.
…
“Currently, demographic data (number of children in four age groups) is entered manually. We want the tool to calculate this information automatically based on a zip-code. Therefore, we extend the tool to query the RDF census data server to get this information,” Ryutov writes. Currently this is the only server the software queries, but Ryutov says they plan to add calls to other census data servers to improve reliability. Those servers do not have to be RDF databases.
(and it continues)
August 10th, 2008
Chris Bizer and team have posted a benchmark specification for SPARQL endpoints, the Berlin SPARQL Benchmark (BSBM). They have “run the initial version of the benchmark against Sesame, Virtuoso, Jena SDB and against D2R Server, a relational database-to-RDF wrapper. The stores were benchmarked with datasets ranging from 50,000 triples to 100,000,000 triples” (announcement email).
I ran the benchmark against my SemWeb .NET library. Instructions for setting up the benchmark are here and turned out to be a good example of how to set up a SPARQL endpoint very quickly using my library, backed with your SQL database of choice (in this case MySQL). I had some trouble the first time I ran the benchmark though:
- The first time I ran the tests I found the library had several bugs/limitations: a bug preventing ORDER BY with dateTime values, an error parsing function calls in FILTER expressions, and a glitch in the translation of the query to SQL. I corrected these problems.
- Query 10 must be modified to change the ordering to ORDER BY xsd:double(str(?price)), which adds the cast xsd:double(str(…)), since ordering by the custom USD datatype is not supported and not required to be supported by the SPARQL specification.
- In the same query, in FILTER (?date > "2008-06-20"^^<http://www.w3.org/2001/XMLSchema#date> ), xsd:date comparisons are not a part of the SPARQL spec (as I understand it; dateTime comparisons on the other hand are required by the spec). Such comparisons weren’t implemented in my library, but I went ahead and added it.
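The Query 10 change can be applied mechanically to a query file. A minimal sketch, assuming the query contains the literal clause ORDER BY ?price and already declares the xsd: prefix; cast_price_ordering is a hypothetical helper name, not part of the benchmark tools.

```shell
# Hypothetical filter: read a SPARQL query on stdin and replace a plain
# "ORDER BY ?price" with the cast form described above.
# In a basic regular expression the "?" is a literal character.
cast_price_ordering() {
    sed 's/ORDER BY ?price/ORDER BY xsd:double(str(?price))/'
}

# Usage: cast_price_ordering < query10.txt > query10-modified.txt
```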
Also I have some concerns. First, I am not 100% sure if the results of my library are actually correct. Query 4 seemed to always return no results. Second, queries are largely translated into SQL, and there is a good deal of caching going on at the level of MySQL. The benchmark results then are saying a lot about the best-case run time, and indicate something about the overhead of SPARQL processing, but may not indicate general use performance.
Benchmark results reported below are for my desktop: Intel Core2 Duo at 3.00GHz, 2 GB RAM, 32bit Ubuntu 8.04 on Linux 2.6.24-19-generic, Java 1.6.0_06 for the benchmark tools, and Mono 1.9.1. This seems roughly comparable to the machine used in the BSBM.
Load time (in seconds and triples/sec) is reported below for some of the different data set sizes.
| | 50K | 250K | 1M | 5M | 25M |
|---|---|---|---|---|---|
| Time (sec) | | | 224 | | 16129 |
| triples/sec | | | 4441 | | 1544 |
For comparison, the 224-second load time for the 1M data set is about 2 to 2.5 times as long as the times reported in the BSBM results for Jena SDB (Hash) with MySQL over Joseki3 (117s) and Virtuoso Open-Source Edition v5.0.6 and v5.0.7 (87s). For the larger 25M data set, the load time of about 4.5 hours was only 1.2 times slower than Jena SDB, but 1.7 times faster than Sesame over Tomcat and 3 times faster than Virtuoso. (But, again, the machines were different.)
Results for query execution are reported below. AQET (Average Query Execution Time, in seconds) is given for each of the queries at the different data set sizes. The results were again roughly comparable to Jena and Virtuoso. But the three caveats above are worth restating: the query results have not been validated as correct, there is significant caching, and the machine was different from the one used in BSBM.
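The load-time arithmetic is easy to check. A small sketch using only the numbers quoted in this post (224s, 117s, 87s, and 16129s); ratio is a hypothetical helper name.

```shell
# Hypothetical helper: print a / b rounded to two decimal places.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 224 117     # 1M load vs. Jena SDB: prints 1.91
ratio 224 87      # 1M load vs. Virtuoso: prints 2.57
ratio 16129 3600  # 25M load time in hours: prints 4.48
```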
August 6th, 2008
Actually a lie. I’ve just been blogging elsewhere. I’ve fetched all of my old blog archives and my Open House Project posts from over there and re-made this blog.
August 6th, 2008
Join me at an IRC chat to talk about open source civic technology projects, on Saturday, August 16 at 4pm Eastern time! The agenda will be a mix of seeing what various civic technology projects are up to, like GovTrack (my site, powered by Mono), OpenCongress, and any others run by people who show up, and getting new people involved in ongoing projects. “oGosh” is Open Government Open Source Hacking (wiki | Facebook), what I’m calling the loose community that binds these projects together.
The chat will be in the #transparency channel on Freenode. For more information on the meeting (and on how to get to the chat), see http://wiki.opengovdata.org/index.php/OGosh.
Suggestions for agenda topics are most welcome either to me directly or by revising the wiki page above. Hope to see you there.