SemWeb Documentation

SemWeb: Berlin SPARQL Benchmark

Note: This page was written in 2008.

This page describes how to run the Berlin SPARQL Benchmark (BSBM) against the SPARQL server provided by the SemWeb .NET library. The data is loaded into a local MySQL database, and the SPARQL server runs as a local ASP.NET server on port 8080.

Prerequisites:

Notes

Instructions:

Create a test dataset using the BSBM dataset generator. In the command below, replace products with the data set scale factor (the number of products). Use 1915 to get a data set of about 1 million triples.

$ java -cp bin:lib/ssj.jar benchmark.generator.Generator -pc products \
	-s ttl -dir datadir -fn datadir/dataset

Download the current version of the SemWeb .NET library from the website http://razor.occams.info/code/semweb. Save and extract the library package into a directory within your BSBM directory. We'll call the directory semweb.

For compatibility with other benchmarks, the following MySQL configuration lines were set:

bulk_insert_buffer_size = 32M
query_cache_size = 512M

Start MySQL. Have a database available to load the data into. We'll assume the 'test' database is available and doesn't require any user permissions to access.

Load the dataset into the database. You may have to change the path to SemWeb and to the dataset.ttl Turtle file, depending on your setup. Additionally, you can change "Database=test" below to any connection string described in the MySQL Connector documentation.

$ export SEMWEB_MYSQL_IMPORT_MODE=DISABLEKEYS
$ time mono semweb/bin_generics/rdfstorage.exe -out "mysql:bsbm:Database=test" -clear datadir/dataset.ttl

There are three data import modes for MySQL, selected with the environment variable SEMWEB_MYSQL_IMPORT_MODE; the non-default modes can speed up the import a bit. The default mode wraps the import in a transaction and is the slowest method. The second mode ("LOCK") locks the tables for reading and writing while importing the data. The third mode ("DISABLEKEYS") delays generating indexes on the table until after all of the data is loaded. When DISABLEKEYS is used, the table should not be accessed by another program during the import.
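As a rough sketch of how a mode is selected, the Python helper below builds the command line and environment for one import run. The mode names and the rdfstorage.exe arguments are taken from this page; build_import_invocation is a hypothetical helper, and treating an unset variable as the default mode is an assumption.

```python
import os

# Mode names as given on this page. None stands for the default
# (transaction-wrapped) mode; leaving the variable unset for it is an assumption.
IMPORT_MODES = (None, "LOCK", "DISABLEKEYS")

def build_import_invocation(mode, dataset="datadir/dataset.ttl",
                            spec="mysql:bsbm:Database=test",
                            semweb_dir="semweb"):
    """Return (argv, env) for one rdfstorage.exe import run (hypothetical helper)."""
    if mode not in IMPORT_MODES:
        raise ValueError("unknown import mode: %r" % (mode,))
    env = dict(os.environ)
    # Drop any inherited setting so the default mode is not overridden by accident.
    env.pop("SEMWEB_MYSQL_IMPORT_MODE", None)
    if mode is not None:
        env["SEMWEB_MYSQL_IMPORT_MODE"] = mode
    argv = ["mono", semweb_dir + "/bin_generics/rdfstorage.exe",
            "-out", spec, "-clear", dataset]
    return argv, env

argv, env = build_import_invocation("DISABLEKEYS")
print(" ".join(argv))
print("SEMWEB_MYSQL_IMPORT_MODE=" + env["SEMWEB_MYSQL_IMPORT_MODE"])
```

To actually run the import, the pair could be passed to subprocess.run(argv, env=env, check=True); timing each mode this way would show how much the choice matters on a given machine.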

On the 2785-product 1M-triples data set, I get the following output:

Total Time: 1m48s, 991958 statements, 9145 st/sec
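As a quick sanity check on that summary line, the sketch below parses it and recomputes the rate. The parsing pattern is an assumption based only on the single output line shown here, and parse_total_line is a hypothetical helper, not part of SemWeb. Recomputing from the displayed 1m48s gives a rate slightly above the reported one, presumably because the displayed time is rounded.

```python
import re

# Pattern inferred from the one sample line on this page (an assumption);
# other rdfstorage.exe output formats are not handled.
TOTAL_LINE = re.compile(
    r"Total Time: (?:(\d+)m)?(\d+)s, (\d+) statements, (\d+) st/sec")

def parse_total_line(line):
    """Return (elapsed_seconds, statements, reported_rate) from a summary line."""
    match = TOTAL_LINE.search(line)
    if match is None:
        raise ValueError("unrecognized summary line: %r" % line)
    minutes, seconds, statements, rate = match.groups()
    elapsed = int(minutes or 0) * 60 + int(seconds)
    return elapsed, int(statements), int(rate)

elapsed, statements, reported = parse_total_line(
    "Total Time: 1m48s, 991958 statements, 9145 st/sec")
print("elapsed seconds:", elapsed)
print("recomputed st/sec:", round(statements / elapsed))
print("reported st/sec:", reported)
```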

Create a file web.config with the contents below. This configures the ASP.NET server that will run the SPARQL endpoint. If you changed the connection string above, replace "Database=test" with the same connection string here.

<configuration>
     <configSections>
          <section name="sparqlSources" type="System.Configuration.NameValueSectionHandler"/>
     </configSections>

     <system.web>
          <httpHandlers>
               <!-- This line associates the SPARQL Protocol implementation with a path on your
                    website. With this, you get a SPARQL server at http://yourdomain.com/sparql.  -->
               <add verb="*" path="sparql" type="SemWeb.Query.SparqlProtocolServerHandler, SemWeb.Sparql" />
          </httpHandlers>
     </system.web>

     <sparqlSources>
          <!-- This line configures the data source associated with each SPARQL server added above.
                  This sets the server to query the MySQL store named by the spec string.  You can use any
                  spec string described in SemWeb.Store.CreateForInput(). -->
          <add key="/sparql" value="mysql:bsbm:Database=test"/>
     </sparqlSources>
</configuration>
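Before starting the server, it can help to confirm that the file is well-formed and that every handler path has a data source. The sketch below does this with Python's standard XML parser on an abbreviated copy of the configuration above. endpoint_sources is a hypothetical helper, and the rule that the handler path "sparql" corresponds to the key "/sparql" is inferred from this sample only.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the web.config shown above (comments removed).
WEB_CONFIG = """<configuration>
  <configSections>
    <section name="sparqlSources" type="System.Configuration.NameValueSectionHandler"/>
  </configSections>
  <system.web>
    <httpHandlers>
      <add verb="*" path="sparql" type="SemWeb.Query.SparqlProtocolServerHandler, SemWeb.Sparql" />
    </httpHandlers>
  </system.web>
  <sparqlSources>
    <add key="/sparql" value="mysql:bsbm:Database=test"/>
  </sparqlSources>
</configuration>"""

def endpoint_sources(config_text):
    """Map each handler path to its spec string, or None if no source is configured."""
    root = ET.fromstring(config_text)  # raises ParseError if the XML is malformed
    paths = [add.get("path")
             for section in root.iter("httpHandlers")
             for add in section.findall("add")]
    sources = {add.get("key"): add.get("value")
               for section in root.iter("sparqlSources")
               for add in section.findall("add")}
    # Assumption: the source key is the handler path with a leading slash.
    return {"/" + path.lstrip("/"): sources.get("/" + path.lstrip("/"))
            for path in paths}

print(endpoint_sources(WEB_CONFIG))
```

A value of None for any path would mean the handler has no matching entry under sparqlSources.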

Start the SPARQL endpoint. xsp2 is the standalone ASP.NET server from Mono. The MONO_PATH environment variable tells Mono where to look for the binaries that implement the SPARQL Protocol.

$ MONO_PATH=semweb/bin_generics xsp2

See the documentation for SemWeb.Query.SparqlProtocolServerHandler for instructions on setting up an endpoint using Apache with mod_mono or mod_aspdotnet.

Verify the SPARQL endpoint is working by visiting the following URL in your browser:

http://localhost:8080/sparql?query=DESCRIBE+<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/ProductType1>

This will also force the SPARQL server to load up, which takes a brief moment.
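Browsers usually tolerate the raw angle brackets in that URL, but scripted clients generally need the query percent-encoded. A minimal sketch using only the Python standard library follows; the endpoint address is the one used on this page, and query_url and fetch are hypothetical helper names.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

ENDPOINT = "http://localhost:8080/sparql"  # endpoint address used on this page

def query_url(sparql, endpoint=ENDPOINT):
    """Return a GET URL carrying the query in a percent-encoded 'query' parameter."""
    return endpoint + "?" + urlencode({"query": sparql})

def fetch(sparql, endpoint=ENDPOINT, timeout=30):
    """Send the query and return the raw response body (needs a running endpoint)."""
    with urlopen(query_url(sparql, endpoint), timeout=timeout) as response:
        return response.read()

url = query_url(
    "DESCRIBE <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/ProductType1>")
print(url)
```

Calling fetch with the same query string, once xsp2 is running, should return the same description the browser shows.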

In another terminal (since the ASP.NET server is still running), start the tests against the SPARQL endpoint at http://localhost:8080/sparql:

$ java -cp bin:lib/* benchmark.testdriver.TestDriver http://localhost:8080/sparql -idir datadir -o datadir/benchmark_result.xml

Results

Benchmark results reported below are for my desktop: Intel Core2 Duo at 3.00GHz, 2 GB RAM, 32-bit Ubuntu 8.04 on Linux 2.6.24-19-generic, Java 1.6.0_06 for the benchmark tools, and Mono 1.9.1. This seems roughly comparable to the machine used for the published BSBM results.

Load time (in seconds and triples/sec) is reported below for each of the different data set sizes.

              50K   250K   1M     5M   25M
Time (s)      -     -      224    -    16129
triples/sec   -     -      4441   -    1544

The 25M dataset was loaded with the DISABLEKEYS MySQL import mode, rather than LOCK.

For comparison, the load time for the 1M data set was 224 seconds. This is about 1.9 times the load time of Jena SDB (Hash) with MySQL over Joseki3 (117s) and about 2.6 times that of Virtuoso Open-Source Edition v5.0.6 and v5.0.7 (87s), as reported in the BSBM results. For the larger 25M dataset, the load time of about 4.5 hours was only 1.2 times slower than Jena SDB, but 1.7 times faster than Sesame over Tomcat and 3 times faster than Virtuoso.
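The 1M ratios and the 25M duration can be checked with a few lines of arithmetic. The sketch below uses only the load times quoted on this page; the variable names are illustrative.

```python
# Load times in seconds, as quoted on this page.
semweb_1m = 224
jena_sdb_1m = 117
virtuoso_1m = 87
semweb_25m = 16129

ratio_jena = semweb_1m / jena_sdb_1m      # SemWeb vs Jena SDB, 1M data set
ratio_virtuoso = semweb_1m / virtuoso_1m  # SemWeb vs Virtuoso, 1M data set
hours_25m = semweb_25m / 3600             # 25M load time in hours

print(round(ratio_jena, 2))
print(round(ratio_virtuoso, 2))
print(round(hours_25m, 2))
```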

Results for query execution are reported below. Note that the correctness of the query results has not been verified. For instance, Query 4 on the 25M dataset always yielded no results (I am not sure about the other dataset sizes).

AQET (Average Query Execution Time, in seconds) is reported below for each of the queries for different data set sizes.


           50K   250K   1M         5M   25M
Query 1    -     -      0.019184   -    0.049200
Query 2    -     -      0.051187   -    0.048590
Query 3    -     -      0.030508   -    0.079187
Query 4    -     -      0.032693   -    0.075603
Query 5    -     -      0.172283   -    0.342828
Query 6    -     -      0.102105   -    3.277656
Query 7    -     -      0.256491   -    1.108414
Query 8    -     -      0.175357   -    0.572258
Query 9    -     -      0.059674   -    0.088451
Query 10   -     -      0.089215   -    0.322246