Yesterday I held a session called Semantic Web II: Civic Hacking, the Semantic Web, and Visualization at Transparency Camp. In addition to posting my slides, here’s basically what I said during the talk (or, now on reflection, what I should have said):
Who I Am: I run the site GovTrack.us which collects information on the status of bills in the U.S. Congress. I don’t make use of the semantic web to run the site, but as an experiment I generate a large semantic web database out of the data I collect, and some additional related data that I find interesting.
Data Isolation: What the semantic web addresses is data isolation. For instance, the website MAPLight.org, which looks for correlations between campaign contributions to Members of Congress and how they voted on legislation, is essentially something that is too expensive to do for its own sake. Campaign data from the Federal Election Commission isn’t tied to roll call vote data from the House and Senate. It’s only because separate projects have, for independent reasons, massaged the existing data and made it more easily mashable that MAPLight is possible (that’s my site GovTrack and the site opensecrets.org). The semantic web wants to make this process cheaper by addressing mashability at the core. This is important for civic (i.e. political/government) data: machines help us sort, search, and transform information so we can learn something, which is good for civic education, journalism (government oversight), and research (health and economy). And it’s important for the data to be mashable by the public because uses of the data go beyond the resources, mission, and mandate of government agencies.
Beyond Metadata: We can think of the semantic web as going beyond metadata if we think of metadata as tabular, isolated data sets. The semantic web helps us encode non-tabular, non-hierarchical data. It lets us make a web of knowledge about the real world, connecting entities like bills in congress with members of congress, what districts they represent, etc. We establish relations like sponsorship, represents, voted.
Why I care: Machine processing of knowledge combined with machine processing of language is going to radically and fundamentally transform the way we learn, communicate, and live. But this is far off still. (This explains why I study linguistics…)
Then there are some slides on URIs and RDF.
My Cloud: When the data gets too big, it’s hard to remember the exact relations between the entities represented in the data set, so I start to think of my semantic web data as several clouds. One cloud is the data I generate from GovTrack, which is 13 million triples about legislation and politicians. Another cloud is data I generate about campaign contributions: 18 million triples. A third data set is census data: 1 billion triples. I’ve related the clouds together so we can take interesting slices through it and ask questions: how did politicians vote on bills, what are the census statistics of the districts represented by congressmen, are votes correlated with campaign contributions aggregted by zipcode, are campaign contributions by zipcode correlated with census statistics for the zipcode (ZCTA), etc. Once the semantic web framework is in place, the marginal cost of asking a new question is much lower. We don’t need to go through the work that MAPLight did each time we want a new correlation.
Linked Open Data (LOD): I showed my part of the greater LOD cloud/community.
Implementation: A website ties itself to the LOD or semantic web world by including <link/> elements to RDF URIs for the primary topic of a page. This URI can be plugged into a web browser to retrieve RDF about that resource: it’s self-describing. I showed excerpts from a URI for a bill in congress that I created. It has basic metadata, but goes beyond metadata. The pages are auto-generated from a SPARQL DESCRIBE query as I explained in my Census case study on my site rdfabout.com.
SPARQL: The query language, the SQL, for the semantic web. It is similar to SQL in metaphors and keywords like SELECT, FROM, and WHERE. It differs in every other way. Interestingly, there is a cultural difference: SPARQL servers (“endpoints”) are often made publicly acessible directly, whereas SQL servers are usually private. This might be because SPARQL is read-only.
Example 1: Did a state’s median income predict the votes of Senators on H.R. 1424, the October 2008 stimulus bill? I show the partial RDF graph related to this question and how the graph relates to the SPARQL query. First it is an example SPARQL query. Then the real one. The real one is complicated not because RDF or SPARQL are complicated, but because the data model *I* chose to represent the information is complicated. That is, my data set is very detailed and precise, and it takes a precise query to access it properly. I showed how this data might be plugged into Many Eyes to visualize it.
My visualization dream: Visualization tools like Swivel (ehm: I had real problems getting it to work), Many Eyes, Ggobi, and mapping tools should go from SPARQL query to visualization in one step.
Example 2: Show me the campaign contributions to Rep. Steve Israel (NY-2) by zipcode on a map. I showed the actual SPARQL query I issue on my SPARQL server and a map that I want to generate. In fact, I made a prototype of a form where I can submit any arbitrary SPARQL query and it creates an interactive map showing the information.
Other notes: My SPARQL server uses my own .NET/C# RDF library. That creates a “triple store”, the equivalent of a RDBMS for the semantic web. Underlyingly, though, it stores the triples in a MySQL database with a table whose columns are “subject, predicate, object”, i.e. a table of triples. See also: D2R server for getting existing data online.