Archive for the ‘Semantic Web’ Category

Posts about my projects involving the Semantic Web, RDF, and Linked Data. Posts in this category are syndicated on Planet Linked Data.

The Semantic Web’s Role in Dealing with Disasters

Tuesday, August 12th, 2008

My Census RDF dataset is being used in a public health project:

On SemanticWeb.com: http://www.semanticweb.com/article.php/3764266

The Semantic Web’s Role in Dealing with Disasters
August 8, 2008
By Jennifer Zaino

The University of Southern California Information Sciences Institute and Childrens Hospital Los Angeles have been working together to build a software tool. Dubbed PEDSS (Pediatric Emergency Decision Support System), the tool is designed to help medical service providers more effectively plan for, train for, and respond to serious incidents and disasters affecting children.

The project, a part of the Pediatric Disaster Resource and Training Center (PDRTC), has been going on for about eight months.

Dr. Tatyana Ryutov, a research scientist at the USC Information Sciences Institute, is working on the system. Recently, the Institute contacted Joshua Tauberer, the creator of GovTrack.us and the man who maintains a large RDF (Resource Description Framework) data set of U.S. Census data, about making SPARQL queries to that data in conjunction with the PEDSS.

“Currently, demographic data (number of children in four age groups) is entered manually. We want the tool to calculate this information automatically based on a zip-code. Therefore, we extend the tool to query the RDF census data server to get this information,” Ryutov writes. Currently this is the only server the software queries, but Ryutov says they plan to add calls to other census data servers to improve reliability. Those servers do not have to be RDF databases.

(and it continues)

Berlin SPARQL Benchmarks for my SemWeb .NET Library

Sunday, August 10th, 2008

Chris Bizer and team have posted a benchmark specification for SPARQL endpoints, the Berlin SPARQL Benchmark (BSBM). They have “run the initial version of the benchmark against Sesame, Virtuoso, Jena SDB and against D2R Server, a relational database-to-RDF wrapper. The stores were benchmarked with datasets ranging from 50,000 triples to 100,000,000 triples” (announcement email).

I ran the benchmark against my SemWeb .NET library. Instructions for setting up the benchmark are here, and they turned out to be a good example of how to very quickly set up a SPARQL endpoint using my library, backed by your SQL database of choice (in this case MySQL). I had some trouble the first time I ran the benchmark, though:

  • The first time I ran the tests I found the library had several bugs/limitations: a bug preventing ORDER BY with dateTime values, an error parsing function calls in FILTER expressions, and a glitch in the translation of the query to SQL. I corrected these problems.
  • Query 10 must be modified to change the ordering to ORDER BY xsd:double(str(?price)), which adds the cast xsd:double(str(…)), since ordering by the custom USD datatype is not supported and not required to be supported by the SPARQL specification.
  • In the same query, the filter FILTER (?date > "2008-06-20"^^<http://www.w3.org/2001/XMLSchema#date> ) relies on xsd:date comparisons, which are not a part of the SPARQL spec (as I understand it; dateTime comparisons, on the other hand, are required by the spec). Such comparisons weren’t implemented in my library, but I went ahead and added them.
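As a rough illustration of what the xsd:double(str(…)) cast in Query 10 buys, the sketch below compares ordering by lexical form with ordering by numeric value. The price strings are made up for the example, and the string-ordering fallback is only an assumption about what an engine might do with an unsupported datatype, not a description of how SemWeb behaves.

```python
# Hypothetical price literals, shown as the lexical forms that
# str(?price) would produce for values of a custom USD datatype.
prices = ["950.00", "1200.50", "87.25"]

# Assumed fallback: ordering by lexical form compares the strings
# character by character, so "1200.50" sorts before "87.25".
by_lexical_form = sorted(prices)

# Ordering after a numeric cast, analogous to
# ORDER BY xsd:double(str(?price)).
by_numeric_value = sorted(prices, key=float)

print(by_lexical_form)
print(by_numeric_value)
```

With the cast in place the ordering no longer depends on how a given engine treats the unknown datatype.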

Also I have some concerns. First, I am not 100% sure if the results of my library are actually correct. Query 4 seemed to always return no results. Second, queries are largely translated into SQL, and there is a good deal of caching going on at the level of MySQL. The benchmark results then are saying a lot about the best-case run time, and indicate something about the overhead of SPARQL processing, but may not indicate general use performance.

Benchmark results reported below are for my desktop: Intel Core2 Duo at 3.00GHz, 2 GB RAM, 32bit Ubuntu 8.04 on Linux 2.6.24-19-generic, Java 1.6.0_06 for the benchmark tools, and Mono 1.9.1. This seems roughly comparable to the machine used in the BSBM.

Load time (in seconds and triples/sec) is reported below for some of the different data set sizes.

              50K   250K   1M     5M   25M
Time (sec)    -     -      224    -    16129
triples/sec   -     -      4441   -    1544

For comparison, load time for the 1M data set was 224 seconds. This is about 2 to 2.5 times slower than Jena SDB (Hash) with MySQL over Joseki3 (117s) and Virtuoso Open-Source Edition v5.0.6 and v5.0.7 (87s), as reported in the BSBM results. For the larger 25M data set, the load time of about 4.5 hours was only 1.2 times slower than Jena SDB, and it was 1.7 times faster than Sesame over Tomcat and 3 times faster than Virtuoso. (But, again, the machines were different.)
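The ratios for the 1M data set can be checked with a few lines of arithmetic. This sketch uses only the load times quoted above; the variable names are made up for the example.

```python
# Load times in seconds, taken from the text above.
semweb_1m = 224
jena_sdb_1m = 117
virtuoso_1m = 87
semweb_25m = 16129

# How many times longer SemWeb took on the 1M data set.
ratio_vs_jena = semweb_1m / jena_sdb_1m
ratio_vs_virtuoso = semweb_1m / virtuoso_1m

# The 25M load time converted from seconds to hours.
hours_25m = semweb_25m / 3600

print(f"vs. Jena SDB: {ratio_vs_jena:.2f}x")
print(f"vs. Virtuoso: {ratio_vs_virtuoso:.2f}x")
print(f"25M load: {hours_25m:.2f} hours")
```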

Query execution results follow: the AQET (Average Query Execution Time, in seconds) for each query at each data set size. The results were again roughly comparable to Jena and Virtuoso. But the three caveats above are worth restating: the query results have not been validated as correct, there is significant caching, and the machine was different from the one used in BSBM.

           50K   250K   1M         5M   25M
Query 1    -     -      0.019184   -    0.049200
Query 2    -     -      0.051187   -    0.048590
Query 3    -     -      0.030508   -    0.079187
Query 4    -     -      0.032693   -    0.075603
Query 5    -     -      0.172283   -    0.342828
Query 6    -     -      0.102105   -    3.277656
Query 7    -     -      0.256491   -    1.108414
Query 8    -     -      0.175357   -    0.572258
Query 9    -     -      0.059674   -    0.088451
Query 10   -     -      0.089215   -    0.322246
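To make the AQET figure and the caching caveat concrete, here is a minimal sketch of how such an average could be measured. It is not the BSBM test driver: run_query is a hypothetical stand-in for sending one query to an endpoint, and the number of runs is arbitrary.

```python
import time
from statistics import mean


def time_query(run_query, runs=5):
    """Run the query several times and return each wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return timings


def run_query():
    # Hypothetical placeholder for one SPARQL request to the endpoint.
    return sum(range(100_000))


timings = time_query(run_query)
aqet = mean(timings)    # average over all runs
first_run = timings[0]  # the run least likely to benefit from caching

print(f"AQET: {aqet:.6f} s, first run: {first_run:.6f} s")
```

Reporting the first run next to the average is one simple way to see how much of the average reflects warm caches.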

SemWeb RDF Library for C#

Sunday, June 12th, 2005

Semantic Web/RDF Library for C#

I’ve just posted the first release of my SemWeb library, written in C# for Mono and .NET, at http://taubz.for.net/code/semweb.
Features:
* Simple API; easy to deploy; no platform-specific dependencies.
* Reading and writing RDF/XML, Turtle, NTriples, and most of Notation 3, at around 20,000 statements/second.
* All operations are streaming, so it should scale.
* Two built-in types of RDF stores: an in-memory hashtable-indexed store for small amounts of data and an SQL store with MySQL and SQLite backends for large amounts of data.
* Creating new SQL-based stores takes minutes, and implementing other types of stores is as simple as extending an abstract class.
* Statements are quads, rather than triples, with the fourth ‘meta’ field left for application-specific uses.
I’ve been using SemWeb to push around the 7 million triples created by GovTrack (shameless plug).
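For readers unfamiliar with the idea, the quad statements and the hashtable-indexed in-memory store mentioned in the feature list can be sketched roughly as follows. This is a Python illustration of the concept only; the class and method names are invented here and are not the SemWeb API.

```python
from collections import defaultdict
from typing import Iterator, NamedTuple, Optional


class Statement(NamedTuple):
    subject: str
    predicate: str
    obj: str
    meta: Optional[str] = None  # fourth field, free for application use


class MemoryStore:
    """Toy in-memory store with hashtable indexes on subject and object."""

    def __init__(self) -> None:
        self._by_subject = defaultdict(list)
        self._by_object = defaultdict(list)

    def add(self, statement: Statement) -> None:
        self._by_subject[statement.subject].append(statement)
        self._by_object[statement.obj].append(statement)

    def select(self, subject=None, predicate=None, obj=None,
               meta=None) -> Iterator[Statement]:
        # Use an index when the subject or object is given; otherwise scan.
        if subject is not None:
            candidates = self._by_subject.get(subject, [])
        elif obj is not None:
            candidates = self._by_object.get(obj, [])
        else:
            candidates = [s for bucket in self._by_subject.values() for s in bucket]
        pattern = (subject, predicate, obj, meta)
        for statement in candidates:
            # None in the pattern acts as a wildcard for that field.
            if all(want is None or want == have
                   for want, have in zip(pattern, statement)):
                yield statement


store = MemoryStore()
store.add(Statement("ex:bill1", "ex:sponsor", "ex:alice", meta="ex:source-a"))
store.add(Statement("ex:bill2", "ex:sponsor", "ex:alice"))
store.add(Statement("ex:bill1", "ex:title", "An example bill"))

sponsored = list(store.select(predicate="ex:sponsor", obj="ex:alice"))
print(len(sponsored))
```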

Directory of C# Libraries

I had hoped to time the release of SemWeb with the debut of a new website I’m working on that will be a really great directory of Mono/.NET reusable, open source libraries. But, I need to get some other people’s libraries listed, besides my own, before finishing the site. If you’ve written a C# library that you think others would find useful, please let me know.

A Programming Project

Friday, March 18th, 2005

A Programming Task for Someone Looking to Hack

The biggest thing that has helped me to program better is little programming projects. My first was a simple math tutoring program in GW-BASIC, written with the help of my dad back around third grade. I’ve almost always had a little project to keep me busy since then. Today, it’s creating an RDF library in C#.

I know that often people are looking for ideas for programs to write, so I thought I’d post a routine that someone might want to spend some time hacking. This is a mildly advanced routine, but anyway:

The goal is to parse an RDF/XML document using only XmlReader. That is, extract the RDF statements without loading the entire document into memory as an XmlDocument. As far as I know, this has never been programmed in C#, and it is really critical if semantic web applications are going to be built in .NET.
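To give a feel for the shape of the task, here is a small sketch in Python rather than C#, using ElementTree's iterparse as a stand-in for XmlReader. It is only an assumption-laden starting point: it handles nothing but rdf:Description elements with rdf:about directly under rdf:RDF, with property elements that carry either rdf:resource or plain text. Typed nodes, blank nodes, nested descriptions, datatypes, and the rest of the spec are ignored.

```python
import io
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def tag_to_uri(tag):
    """Turn ElementTree's '{namespace}local' form into a full URI."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace + local
    return tag


def iter_statements(source):
    """Yield (subject, predicate, value, kind) tuples as the document is read."""
    root = None
    subject = None
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if depth == 1:
                root = elem
            elif depth == 2:
                # Node element directly under rdf:RDF.
                subject = elem.get(f"{{{RDF_NS}}}about")
        else:
            if depth == 3 and subject is not None:
                # Property element: a resource reference or a plain literal.
                resource = elem.get(f"{{{RDF_NS}}}resource")
                if resource is not None:
                    yield (subject, tag_to_uri(elem.tag), resource, "uri")
                else:
                    yield (subject, tag_to_uri(elem.tag), elem.text or "", "literal")
            elif depth == 2:
                subject = None
                # Drop the finished element so memory use stays bounded.
                root.clear()
            depth -= 1


document = b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/">
  <rdf:Description rdf:about="http://example.org/a">
    <ex:name>Alice</ex:name>
    <ex:knows rdf:resource="http://example.org/b"/>
  </rdf:Description>
</rdf:RDF>"""

statements = list(iter_statements(io.BytesIO(document)))
for statement in statements:
    print(statement)
```

The point of the exercise is the same as with XmlReader: statements come out while the document is still being read, and finished elements are discarded instead of being kept in a full document tree.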

Getting the basics going isn’t too difficult a task. Getting the entire spec implemented is more of a challenge. But what’s life without challenges, eh? If you’re interested in taking a stab at this, drop me an email (tauberer@for.net).

A Design Suggestion

When I was riding the train back from D.C. to Philly last week, the speaker in the car I was in wasn’t working, so no one could hear the conductor’s announcements. Probably no Amtrak person noticed the problem.

It made me think that we often build things that don’t notice when they’re not working. Speakers should be built with microphones that detect when the speaker isn’t emitting the sound it should be and, when that happens, send a signal back to… somewhere. Software should do the same thing. Applications should realize when things aren’t working right and, more importantly, send back a useful message that a problem occurred.

Here’s a for instance. I plugged in a printer to my Linux desktop this week, but I couldn’t print a test page. The only message I got back was that I should increase the debugging level and inspect the output. Well, this is not a useful signal. Even with debugging on, the message I got was that the driver couldn’t be loaded. Pretty vague. It turned out the driver wasn’t even present on my system because I didn’t have the RPM installed. This is a condition that the printing system should have been able to detect and inform me of.

The failure here is that there was no mechanism built into the system for passing back useful error messages to the user. If there was a useful message at some point, it was discarded before it reached me. Don’t write software like this.
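One way to picture the alternative is a layer that detects the specific condition and passes it upward instead of a generic failure. The sketch below is purely illustrative: the function names, the PrintError class, and the driver command are all hypothetical and have nothing to do with any real printing system.

```python
import shutil


class PrintError(Exception):
    """Error whose message is meant to be shown to the user."""


def load_driver(driver_command):
    # Detect the specific condition: the driver program is not on the system.
    path = shutil.which(driver_command)
    if path is None:
        raise FileNotFoundError(
            f"driver program '{driver_command}' is not installed")
    return path


def print_test_page(driver_command):
    try:
        load_driver(driver_command)
    except FileNotFoundError as err:
        # Keep the low-level detail and add what the user can do about it,
        # chaining the original error rather than discarding it.
        raise PrintError(
            f"Cannot print: {err}. Install the package that provides it.") from err
    return "test page sent"


try:
    print_test_page("hypothetical-driver-that-does-not-exist")
except PrintError as err:
    print(err)
```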

Diffing and RDF

Saturday, March 5th, 2005

If you’re reading this, you’re probably reading this on Monologue, and that means I’ve successfully added myself to Monologue. :-)

Recently I got a helpful bug report for my Diff library for C# which pointed out that my port of Perl’s Algorithm::Diff wasn’t generating the same diffs as the original module. I fixed the bug and reposted a new version of the library.

In unrelated news, I’m working on building the semantic web for information about the U.S. government. This is a spin-off of my work on GovTrack (which is powered by Mono). To get this web built, I’m in the position of having to convince people that RDF is the right way to approach the problem of distributed information — over, for instance, XML, XML Schema, and XQuery. The problem is that RDF is complicated and often misunderstood, and I hadn’t found a good document explaining what RDF is and why it should be used for this. So, I wrote one. I’m not a master of RDF by any means, so any corrections and suggestions are welcome.

By the way, if you’re interested in building this political semantic web, join the GovTrack mailing list.

Lastly, with my new interest in RDF, I was looking for a good C# library for working with RDF data models. I didn’t find one that I particularly liked (there are a few out there, but for various reasons I just couldn’t see myself using them), so I’m working on my own. I’ll post the source in a few weeks, probably.