<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>Joshua Tauberer's Blog &#187; GovTrack</title>
	<atom:link href="http://razor.occams.info/blog/category/govtrack/feed/" rel="self" type="application/rss+xml" />
	<link>http://razor.occams.info/blog</link>
	<description></description>
	<pubDate>Sat, 03 Jan 2009 15:16:05 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7-bleeding</generator>
	<language>en</language>
			<item>
		<title>Examples of XML for government legislative data</title>
		<link>http://razor.occams.info/blog/2008/12/20/examples-of-xml-for-government-legislative-data/</link>
		<comments>http://razor.occams.info/blog/2008/12/20/examples-of-xml-for-government-legislative-data/#comments</comments>
		<pubDate>Sat, 20 Dec 2008 12:24:45 +0000</pubDate>
		<dc:creator>Joshua Tauberer</dc:creator>
		
		<category><![CDATA[GovTrack]]></category>

		<guid isPermaLink="false">http://razor.occams.info/blog/?p=346</guid>
		<description><![CDATA[<p>I was asked yesterday if I would collect some examples of the use of structured data, or more specifically XML, for government legislative information. Here&#8217;s what I can think of off the top of my head:</p>
<p><em>About structured data</em></p>
<ul>
<li>I wrote in late 2006 <a href="http://www.govtrack.us/articles/20061209data.xpd">Prose is Poetry to a Computer: What is structured data?</a> It could use some updating.</li>
</ul>
<p><em>Structured data in the U.S. Congress</em></p>
<ul>
<li>The House <a href="http://xml.house.gov">drafts most of its legislation in XML</a> now, and these files are shared with the public. Unfortunately, the Senate may be drafting legislation in the same format but does not share its files with the public, seriously undermining the usefulness of the House files. These XML files are the text of legislation, so they aid in creating a nice visual display of the text, though the markup is too complicated for me to want to work with it. The files were first systematically shared with the public in 2004, as far as I can see.</li>
<li>The House publishes its votes in XML (<a href="http://clerk.house.gov/evs/2008/roll690.xml">example</a>). This is an interesting case because the XML is actually the primary way it is published to the public. When visitors view the page, they see a visual or HTML rendering of the underlying XML, but technical users can inspect the XML behind the page. It&#8217;s completely transparent. This started around 2004-2005, I believe.</li>
<li>The Senate makes its list of <a href="http://www.senate.gov/general/contact_information/senators_cfm.xml">membership and contact information</a> available in XML. They have much more XML than they share. The Library of Congress&#8217;s Legislative Information System, which is used internally in the Capitol, has XML data for Senate committee membership, for instance, but the Senate web team was not permitted to publish it (and LIS does not have a public face itself).</li>
<li>The Senate also recently started publishing its <a href="http://www.senate.gov/pagelayout/committees/b_three_sections_with_teasers/committee_hearings.htm">committee hearing schedule</a> in XML. This could have been done with RSS plus some custom tags. They chose a custom format to more precisely mark up information specific to their needs, which is great. (Unfortunately there will be no data in that file if there are no upcoming meetings.) This feed began in 2008 (afaik).</li>
<li>The Senate&#8217;s <a href="http://www.senate.gov/legislative/Public_Disclosure/LDA_reports.htm">lobbying disclosure database</a> is a collection of XML files representing filed forms. It is made available to the public on a timely basis. The records go back to 1999, but were first published only in February 2008.</li>
<li>Various committees publish RSS feeds for their news and events. RSS is a flavor of XML.</li>
<li>Behind the scenes, the Library of Congress&#8217;s LIS unit maintains a rich database of legislative information in XML, but they do not share it with anyone (inside or outside of the Capitol), as far as I am aware.</li>
</ul>
<p><em>Structured data made independently</em></p>
<ul>
<li>I, of course, try to fill in the gaps in what Congress provides in a structured format, using whatever I can find that Congress provides in a non-structured format. This process of screen-scraping is inexact and brittle, a short-term imperfect solution to a problem with an easy long-term remedy. My <a href="http://www.govtrack.us/source.xpd">GovTrack.us Source Data</a> covers the <em>status</em> of legislation (<a href="http://www.govtrack.us/data/us/110/bills/h1.xml">example</a>), voting records (for both chambers in a common format; <a href="http://www.govtrack.us/data/us/110/rolls/h2007-493.xml">example</a>), the text of the Congressional Record, Congressional membership, committee membership (<a href="http://www.govtrack.us/data/us/110/committees.xml">example</a>), etc. I&#8217;ve been doing this since 2004.</li>
<li>The <a href="http://www.law.cornell.edu/uscode/">Cornell Legal Information Institute</a> produces an XML version of the U.S. Code, based on some structured but difficult-to-use <a href="http://uscode.house.gov/">data files made available by the House</a>. I think they&#8217;ve been doing this since around 2004. (<a href="http://xml.coverpages.org/ni2004-10-06-a.html">more info</a>; example not easily available)</li>
<li>The <a href="http://services.sunlightlabs.com/api/">Sunlight Labs API</a> provides congressional membership and data-linking information.</li>
</ul>
<p><em>State-level legislative information</em></p>
<ul>
<li>In early 2007 I <a href="http://www.theopenhouseproject.com/2007/02/22/states-are-leading-the-way-with-downloadable-legislative-databases/">surveyed all of the state legislatures</a> and found that four states provided legislative information in a structured data format, plus California with some semi-structured data. See the links therein for more.</li>
<li><a href="http://www.richmondsunlight.com/labs/">Richmond Sunlight</a>, an independent site like GovTrack for the Virginia state legislature, provides some structured data based on what it collects, since 2008 (afaik).</li>
</ul>
<p><em>Federal non-legislative data</em></p>
<p>To quickly list some other sources of structured data at the federal level:</p>
<ul>
<li>FEC&#8217;s electronic filings for <a href="http://www.fec.gov/finance/disclosure/ftp_download.shtml">campaign contributions and related data</a> (XML and flat fixed-width)</li>
<li>SEC&#8217;s <a href="http://www.sec.gov/edgar/searchedgar/ftpusers.htm">EDGAR</a> system for corporate public filings (XML/SGML), and their <a href="http://ftp.sec.gov/news/press/2008/2008-300.htm">recent decision</a> to require documents to be submitted to them in XBRL, a dialect of XML.</li>
<li>From the Census Bureau, essentially the whole census (flat fixed-width) and geographic data (various formats)</li>
<li>USDA&#8217;s <a href="http://www.nal.usda.gov/fnic/foodcomp/search/">nutrition database</a> (extremely comprehensive and crucially helpful for public health; XML if I recall right; it&#8217;s downloadable in bulk somewhere)</li>
<li><a href="http://earthquakes.usgs.gov/eqcenter/catalogs/">USGS&#8217;s Earthquake Hazards Program</a> (Atom, KML, XML, CSV)</li>
<li><a href="http://www.bls.gov/data/home.htm">The Bureau of Labor Statistics</a>&#8217;s datasets</li>
<li><a href="http://nces.ed.gov/ccd/pubschuniv.asp">The National Center for Education Statistics</a> provides survey data in both SAS and non-proprietary flat file formats. (description copied from elsewhere)</li>
<li><a href="http://www.epa.gov/enviro/">EPA</a>&#8217;s Envirofacts Data Warehouse</li>
</ul>
<p><em>Other notable government structured data</em></p>
<ul>
<li>Washington DC sets a real example with its <a href="http://data.octo.dc.gov/">Data Catalog</a>. It covers data produced by many aspects of its local government.</li>
</ul>
]]></description>
			<content:encoded><![CDATA[<p>I was asked yesterday if I would collect some examples of the use of structured data, or more specifically XML, for government legislative information. Here&#8217;s what I can think of off the top of my head:</p>
<p><em>About structured data</em></p>
<ul>
<li>I wrote in late 2006 <a href="http://www.govtrack.us/articles/20061209data.xpd">Prose is Poetry to a Computer: What is structured data?</a> It could use some updating.</li>
</ul>
<p><em>Structured data in the U.S. Congress</em></p>
<ul>
<li>The House <a href="http://xml.house.gov">drafts most of its legislation in XML</a> now, and these files are shared with the public. Unfortunately, the Senate may be drafting legislation in the same format but does not share its files with the public, seriously undermining the usefulness of the House files. These XML files are the text of legislation, so they aid in creating a nice visual display of the text, though the markup is too complicated for me to want to work with it. The files were first systematically shared with the public in 2004, as far as I can see.</li>
<li>The House publishes its votes in XML (<a href="http://clerk.house.gov/evs/2008/roll690.xml">example</a>). This is an interesting case because the XML is actually the primary way it is published to the public. When visitors view the page, they see a visual or HTML rendering of the underlying XML, but technical users can inspect the XML behind the page. It&#8217;s completely transparent. This started around 2004-2005, I believe.</li>
<li>The Senate makes its list of <a href="http://www.senate.gov/general/contact_information/senators_cfm.xml">membership and contact information</a> available in XML. They have much more XML than they share. The Library of Congress&#8217;s Legislative Information System, which is used internally in the Capitol, has XML data for Senate committee membership, for instance, but the Senate web team was not permitted to publish it (and LIS does not have a public face itself).</li>
<li>The Senate also recently started publishing its <a href="http://www.senate.gov/pagelayout/committees/b_three_sections_with_teasers/committee_hearings.htm">committee hearing schedule</a> in XML. This could have been done with RSS plus some custom tags. They chose a custom format to more precisely mark up information specific to their needs, which is great. (Unfortunately there will be no data in that file if there are no upcoming meetings.) This feed began in 2008 (afaik).</li>
<li>The Senate&#8217;s <a href="http://www.senate.gov/legislative/Public_Disclosure/LDA_reports.htm">lobbying disclosure database</a> is a collection of XML files representing filed forms. It is made available to the public on a timely basis. The records go back to 1999, but were first published only in February 2008.</li>
<li>Various committees publish RSS feeds for their news and events. RSS is a flavor of XML.</li>
<li>Behind the scenes, the Library of Congress&#8217;s LIS unit maintains a rich database of legislative information in XML, but they do not share it with anyone (inside or outside of the Capitol), as far as I am aware.</li>
</ul>
<p><em>Structured data made independently</em></p>
<ul>
<li>I, of course, try to fill in the gaps in what Congress provides in a structured format, using whatever I can find that Congress provides in a non-structured format. This process of screen-scraping is inexact and brittle, a short-term imperfect solution to a problem with an easy long-term remedy. My <a href="http://www.govtrack.us/source.xpd">GovTrack.us Source Data</a> covers the <em>status</em> of legislation (<a href="http://www.govtrack.us/data/us/110/bills/h1.xml">example</a>), voting records (for both chambers in a common format; <a href="http://www.govtrack.us/data/us/110/rolls/h2007-493.xml">example</a>), the text of the Congressional Record, Congressional membership, committee membership (<a href="http://www.govtrack.us/data/us/110/committees.xml">example</a>), etc. I&#8217;ve been doing this since 2004.</li>
<li>The <a href="http://www.law.cornell.edu/uscode/">Cornell Legal Information Institute</a> produces an XML version of the U.S. Code, based on some structured but difficult-to-use <a href="http://uscode.house.gov/">data files made available by the House</a>. I think they&#8217;ve been doing this since around 2004. (<a href="http://xml.coverpages.org/ni2004-10-06-a.html">more info</a>; example not easily available)</li>
<li>The <a href="http://services.sunlightlabs.com/api/">Sunlight Labs API</a> provides congressional membership and data-linking information.</li>
</ul>
<p><em>State-level legislative information</em></p>
<ul>
<li>In early 2007 I <a href="http://www.theopenhouseproject.com/2007/02/22/states-are-leading-the-way-with-downloadable-legislative-databases/">surveyed all of the state legislatures</a> and found that four states provided legislative information in a structured data format, plus California with some semi-structured data. See the links therein for more.</li>
<li><a href="http://www.richmondsunlight.com/labs/">Richmond Sunlight</a>, an independent site like GovTrack for the Virginia state legislature, provides some structured data based on what it collects, since 2008 (afaik).</li>
</ul>
<p><em>Federal non-legislative data</em></p>
<p>To quickly list some other sources of structured data at the federal level:</p>
<ul>
<li>FEC&#8217;s electronic filings for <a href="http://www.fec.gov/finance/disclosure/ftp_download.shtml">campaign contributions and related data</a> (XML and flat fixed-width)</li>
<li>SEC&#8217;s <a href="http://www.sec.gov/edgar/searchedgar/ftpusers.htm">EDGAR</a> system for corporate public filings (XML/SGML), and their <a href="http://ftp.sec.gov/news/press/2008/2008-300.htm">recent decision</a> to require documents to be submitted to them in XBRL, a dialect of XML.</li>
<li>From the Census Bureau, essentially the whole census (flat fixed-width) and geographic data (various formats)</li>
<li>USDA&#8217;s <a href="http://www.nal.usda.gov/fnic/foodcomp/search/">nutrition database</a> (extremely comprehensive and crucially helpful for public health; XML if I recall right; it&#8217;s downloadable in bulk somewhere)</li>
<li><a href="http://earthquakes.usgs.gov/eqcenter/catalogs/">USGS&#8217;s Earthquake Hazards Program</a> (Atom, KML, XML, CSV)</li>
<li><a href="http://www.bls.gov/data/home.htm">The Bureau of Labor Statistics</a>&#8217;s datasets</li>
<li><a href="http://nces.ed.gov/ccd/pubschuniv.asp">The National Center for Education Statistics</a> provides survey data in both SAS and non-proprietary flat file formats. (description copied from elsewhere)</li>
<li><a href="http://www.epa.gov/enviro/">EPA</a>&#8217;s Envirofacts Data Warehouse</li>
</ul>
<p><em>Other notable government structured data</em></p>
<ul>
<li>Washington DC sets a real example with its <a href="http://data.octo.dc.gov/">Data Catalog</a>. It covers data produced by many aspects of its local government.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://razor.occams.info/blog/2008/12/20/examples-of-xml-for-government-legislative-data/feed/</wfw:commentRss>
		</item>
		<item>
		<title>oGosh! IRC Meeting Aug 16 4pm EDT</title>
		<link>http://razor.occams.info/blog/2008/08/06/ogosh-irc-meeting-aug-16-4pm-edt/</link>
		<comments>http://razor.occams.info/blog/2008/08/06/ogosh-irc-meeting-aug-16-4pm-edt/#comments</comments>
		<pubDate>Wed, 06 Aug 2008 12:21:48 +0000</pubDate>
		<dc:creator>Joshua Tauberer</dc:creator>
		
		<category><![CDATA[GovTrack]]></category>

		<category><![CDATA[Mono]]></category>

		<guid isPermaLink="false">http://razor.occams.info/blog/?p=243</guid>
		<description><![CDATA[<p>Join me at an IRC chat to talk about open source civic technology projects on Saturday, August 16 at 4pm Eastern time! The agenda will be a mix of seeing what various civic technology projects are up to, like <a href="http://www.govtrack.us">GovTrack</a> (my site, powered by Mono), <a href="http://www.opencongress.org">OpenCongress</a>, and any others run by people who show up, and getting new people involved in ongoing projects. &#8220;oGosh&#8221; is Open Government Open Source Hacking (<a href="http://wiki.opengovdata.org/index.php/OGosh">wiki</a> | <a href="http://www.new.facebook.com/home.php#/group.php?gid=45606565313">Facebook</a>), what I&#8217;m calling the loose community that binds these projects together.</p>
<p>The chat will be in the #transparency channel on Freenode. For more  information on the meeting (and on how to get to the chat), see <a class="moz-txt-link-freetext" href="http://wiki.opengovdata.org/index.php/OGosh">http://wiki.opengovdata.org/index.php/OGosh</a>.</p>
<p>Suggestions for agenda topics are most welcome either to me directly or  by revising the wiki page above. Hope to see you there.</p>
]]></description>
			<content:encoded><![CDATA[<p>Join me at an IRC chat to talk about open source civic technology projects on Saturday, August 16 at 4pm Eastern time! The agenda will be a mix of seeing what various civic technology projects are up to, like <a href="http://www.govtrack.us">GovTrack</a> (my site, powered by Mono), <a href="http://www.opencongress.org">OpenCongress</a>, and any others run by people who show up, and getting new people involved in ongoing projects. &#8220;oGosh&#8221; is Open Government Open Source Hacking (<a href="http://wiki.opengovdata.org/index.php/OGosh">wiki</a> | <a href="http://www.new.facebook.com/home.php#/group.php?gid=45606565313">Facebook</a>), what I&#8217;m calling the loose community that binds these projects together.</p>
<p>The chat will be in the #transparency channel on Freenode. For more  information on the meeting (and on how to get to the chat), see <a class="moz-txt-link-freetext" href="http://wiki.opengovdata.org/index.php/OGosh">http://wiki.opengovdata.org/index.php/OGosh</a>.</p>
<p>Suggestions for agenda topics are most welcome either to me directly or  by revising the wiki page above. Hope to see you there.</p>
]]></content:encoded>
			<wfw:commentRss>http://razor.occams.info/blog/2008/08/06/ogosh-irc-meeting-aug-16-4pm-edt/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The Open House Project</title>
		<link>http://razor.occams.info/blog/2007/02/08/the-open-house-project/</link>
		<comments>http://razor.occams.info/blog/2007/02/08/the-open-house-project/#comments</comments>
		<pubDate>Fri, 09 Feb 2007 03:01:35 +0000</pubDate>
		<dc:creator>Joshua Tauberer</dc:creator>
		
		<category><![CDATA[GovTrack]]></category>

		<category><![CDATA[Politics]]></category>

		<guid isPermaLink="false">http://razor.occams.info/blog/2007/02/08/the-open-house-project/</guid>
		<description><![CDATA[<p>(Just here for archival purposes&#8230;)</p>
<p><em>On my GovTrack blog:</em> It’s rare when Congress asks <em>the people</em> for help being transparent, and so I’m particularly pleased to announce the formation of <a href="http://www.theopenhouseproject.com/">The Open House Project</a>, a Sunlight Foundation-sponsored project with the encouragement of Speaker Pelosi that will be making specific proposals about how The House can better use the Internet in the interests of transparency. Various people, including myself, will be blogging on that site over the next few weeks about some ideas on this point. Feel free to contribute <em>your ideas</em> by commenting on the TOHP website, joining the project’s mail list, or talking on <a target="_blank" href="http://groups.yahoo.com/group/govtrack/">GovTrack’s own mail list</a>.</p>
<p><em>And on the TOHP blog</em>:</p>
<h2><a title="Permanent Link to Mash-ups for government transparency" rel="bookmark" href="http://www.theopenhouseproject.com/2007/01/25/mash-ups-for-government-transparency/">Mash-ups for government transparency</a></h2>
<h4>January 25th, 2007 by <a href="http://theopenhouseproject.com/about/who">Joshua Tauberer</a></h4>
<div class="entry">A few years ago I launched <a title="GovTrack.us" target="_blank" href="http://www.govtrack.us/">GovTrack.us</a>. I didn’t think of it this way at the time, but these days you might call it a mash-up of data about the U.S. Congress. At the time what I was thinking was just collecting information about Congress from various sources (<a href="http://thomas.loc.gov/">THOMAS</a>, <a target="_blank" href="http://www.senate.gov/">the Senate website</a>, and <a target="_blank" href="http://www.house.gov/">the House website</a>) and cross-referencing and hyperlinking the data in a way that no one had done yet. In fact, it was the huge amount of public data on the status of legislation that was made available through THOMAS (as I understand it thanks to the Republican take-over in 1994) that inspired me to try to put the data to new uses. It started with updates by email of what your congressmen were up to each day, generated automatically by grabbing data from THOMAS and, effectively, transforming it into a customized email update for anyone who wanted it.</p>
<p>The trouble with building GovTrack is that one has to do a bit of friendly reverse-engineering. The information is all “out there”, meant for public consumption, but it’s not out there in a way that makes it easy to transform into other formats for other uses, like the email updates, RSS feeds, and cross-referenced pages. The trouble is this: While people have no trouble browsing and searching THOMAS (for instance) for the information they need, we can’t make computers do the same thing automatically without much difficulty. To take an example, if I want to have my computer automatically fetch for me a list of all bills that were acted on the previous day (and in fact this <em>is </em>something GovTrack does), I would write a program that fetches the Daily Digest in the Congressional Record from THOMAS, which has bullets like this:</p>
<blockquote><p>“Eleven bills and one resolution were introduced, as follows:  S. 360-370 and S. Res. 37.”</p>
</blockquote>
<p><em>I</em> have no trouble understanding that. But, well, let me say as someone studying linguistics and natural language processing, computers are a long way from being able to understand English prose as well as people, nay as well as three-year-olds. Was the bill <u>S. 365</u> introduced yesterday? Yes, of course — even though it was not mentioned explicitly (it’s merely in the range 360-370), and that’s just the first problem for a computer trying to make heads or tails of this information. So what’s a programmer to do?</p>
<p>Let’s go back to the goal of this. Certainly I don’t think it’s the government’s job to necessarily provide email updates, RSS feeds, Google Calendar integration of events, and whatever the latest technology hits are. There are a million and one things that one can do with information about the status of legislation, and <em>someone</em> will want each of them. So the question is this: <em>How can the government, and Congress in particular, publish information about what it is doing in a way that makes it easy for others to put the information to new uses?</em></p>
<p>To be concrete again, because it’s always good to be concrete: How can THOMAS publish a list of bills that were acted on in a purpose-neutral way, a way that makes it easy for programmers to go and write applications to take the information and do <em>anything</em> with it that someone might want?</p>
<p>This is a question that I’ll probably blog more than once about on this site in the next few months. The answer is what’s called <a target="_blank" href="http://www.govtrack.us/articles/20061209data.xpd"><em>structured (or “machine-readable”) data</em></a>, and it comes down to publishing information twice, once for humans clicking away at links, and once in boring, explicit tables meant for computer applications to transform into different formats. But more on that later.</div>
]]></description>
			<content:encoded><![CDATA[<p>(Just here for archival purposes&#8230;)</p>
<p><em>On my GovTrack blog:</em> It’s rare when Congress asks <em>the people</em> for help being transparent, and so I’m particularly pleased to announce the formation of <a href="http://www.theopenhouseproject.com/">The Open House Project</a>, a Sunlight Foundation-sponsored project with the encouragement of Speaker Pelosi that will be making specific proposals about how The House can better use the Internet in the interests of transparency. Various people, including myself, will be blogging on that site over the next few weeks about some ideas on this point. Feel free to contribute <em>your ideas</em> by commenting on the TOHP website, joining the project’s mail list, or talking on <a target="_blank" href="http://groups.yahoo.com/group/govtrack/">GovTrack’s own mail list</a>.</p>
<p><em>And on the TOHP blog</em>:</p>
<h2><a title="Permanent Link to Mash-ups for government transparency" rel="bookmark" href="http://www.theopenhouseproject.com/2007/01/25/mash-ups-for-government-transparency/">Mash-ups for government transparency</a></h2>
<h4>January 25th, 2007 by <a href="http://theopenhouseproject.com/about/who">Joshua Tauberer</a></h4>
<div class="entry">A few years ago I launched <a title="GovTrack.us" target="_blank" href="http://www.govtrack.us/">GovTrack.us</a>. I didn’t think of it this way at the time, but these days you might call it a mash-up of data about the U.S. Congress. At the time what I was thinking was just collecting information about Congress from various sources (<a href="http://thomas.loc.gov/">THOMAS</a>, <a target="_blank" href="http://www.senate.gov/">the Senate website</a>, and <a target="_blank" href="http://www.house.gov/">the House website</a>) and cross-referencing and hyperlinking the data in a way that no one had done yet. In fact, it was the huge amount of public data on the status of legislation that was made available through THOMAS (as I understand it thanks to the Republican take-over in 1994) that inspired me to try to put the data to new uses. It started with updates by email of what your congressmen were up to each day, generated automatically by grabbing data from THOMAS and, effectively, transforming it into a customized email update for anyone who wanted it.</p>
<p>The trouble with building GovTrack is that one has to do a bit of friendly reverse-engineering. The information is all “out there”, meant for public consumption, but it’s not out there in a way that makes it easy to transform into other formats for other uses, like the email updates, RSS feeds, and cross-referenced pages. The trouble is this: While people have no trouble browsing and searching THOMAS (for instance) for the information they need, we can’t make computers do the same thing automatically without much difficulty. To take an example, if I want to have my computer automatically fetch for me a list of all bills that were acted on the previous day (and in fact this <em>is </em>something GovTrack does), I would write a program that fetches the Daily Digest in the Congressional Record from THOMAS, which has bullets like this:</p>
<blockquote><p>“Eleven bills and one resolution were introduced, as follows:  S. 360-370 and S. Res. 37.”</p>
</blockquote>
<p><em>I</em> have no trouble understanding that. But, well, let me say as someone studying linguistics and natural language processing, computers are a long way from being able to understand English prose as well as people, nay as well as three-year-olds. Was the bill <u>S. 365</u> introduced yesterday? Yes, of course — even though it was not mentioned explicitly (it’s merely in the range 360-370), and that’s just the first problem for a computer trying to make heads or tails of this information. So what’s a programmer to do?</p>
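<p>To make the range problem concrete, here is a minimal Python sketch of the kind of expansion a scraper has to do on a line like the one quoted above. It is only an illustration and not GovTrack's actual code: the function name <code>expand_bill_list</code> and the regular expression are assumptions made up for this example, and real Daily Digest prose varies far more than this pattern allows for.</p>

```python
import re

# Hypothetical pattern: a bill type is one or more capitalized, dot-terminated
# tokens ("S.", "H.R.", "S. Res."), followed by a number and an optional range end.
BILL_PATTERN = re.compile(r"((?:[A-Z][A-Za-z]*\.\s*)+)(\d+)(?:\s*-\s*(\d+))?")


def expand_bill_list(text):
    """Return every bill mentioned in text as (bill_type, number) pairs,
    expanding ranges such as "S. 360-370" into their individual bills."""
    bills = []
    for match in BILL_PATTERN.finditer(text):
        # Normalize spacing inside the type, e.g. "S.  Res. " becomes "S. Res."
        bill_type = " ".join(match.group(1).split())
        first = int(match.group(2))
        # A mention without a range end stands for a single bill.
        last = int(match.group(3)) if match.group(3) else first
        for number in range(first, last + 1):
            bills.append((bill_type, number))
    return bills


digest_line = ("Eleven bills and one resolution were introduced, "
               "as follows:  S. 360-370 and S. Res. 37.")
introduced = expand_bill_list(digest_line)
print(len(introduced))
print(("S.", 365) in introduced)
```

<p>Even this small sketch has to guess at what counts as a bill type and how ranges are written, which is exactly the brittleness described here.</p>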
<p>Let’s go back to the goal of this. Certainly I don’t think it’s the government’s job to necessarily provide email updates, RSS feeds, Google Calendar integration of events, and whatever the latest technology hits are. There are a million and one things that one can do with information about the status of legislation, and <em>someone</em> will want each of them. So the question is this: <em>How can the government, and Congress in particular, publish information about what it is doing in a way that makes it easy for others to put the information to new uses?</em></p>
<p>To be concrete again, because it’s always good to be concrete: How can THOMAS publish a list of bills that were acted on in a purpose-neutral way, a way that makes it easy for programmers to go and write applications to take the information and do <em>anything</em> with it that someone might want?</p>
<p>This is a question that I’ll probably blog more than once about on this site in the next few months. The answer is what’s called <a target="_blank" href="http://www.govtrack.us/articles/20061209data.xpd"><em>structured (or “machine-readable”) data</em></a>, and it comes down to publishing information twice, once for humans clicking away at links, and once in boring, explicit tables meant for computer applications to transform into different formats. But more on that later.</div>
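<p>As a rough illustration of what a "boring, explicit table" might look like, the Python sketch below writes one row per bill to CSV. The column names (<code>bill_type</code>, <code>number</code>, <code>action</code>) and the function name <code>bills_to_table</code> are hypothetical choices for this example, not a format that THOMAS or GovTrack publishes.</p>

```python
import csv
import io


def bills_to_table(bills, action):
    """Render (bill_type, number) pairs as CSV text, one explicit row per bill."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Hypothetical column names chosen for this sketch.
    writer.writerow(["bill_type", "number", "action"])
    for bill_type, number in bills:
        writer.writerow([bill_type, number, action])
    return buffer.getvalue()


table = bills_to_table([("S.", 360), ("S.", 361), ("S. Res.", 37)], "introduced")
print(table)
```

<p>Every bill gets its own row, so a program reading the table never has to infer that a bill falls inside a range written for human readers.</p>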
]]></content:encoded>
			<wfw:commentRss>http://razor.occams.info/blog/2007/02/08/the-open-house-project/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Meaningful Reform</title>
		<link>http://razor.occams.info/blog/2007/01/11/meaningful-reform/</link>
		<comments>http://razor.occams.info/blog/2007/01/11/meaningful-reform/#comments</comments>
		<pubDate>Thu, 11 Jan 2007 20:25:08 +0000</pubDate>
		<dc:creator>Joshua Tauberer</dc:creator>
		
		<category><![CDATA[GovTrack]]></category>

		<category><![CDATA[Politics]]></category>

		<guid isPermaLink="false">http://razor.occams.info/blog/?p=49</guid>
		<description><![CDATA[<p>About a year ago following a few scandals, the House and Senate saw a flurry of Congressional reform legislation get introduced&#8230; and then promptly ignored. Finally, however, we may see meaningful reform. Senate majority leader Harry Reid has introduced <a href="http://www.govtrack.us/congress/bill.xpd?bill=s110-1">S. 1: Commission to Strengthen Confidence in Congress Act of 2007</a>.  The bill would make two incredibly important advances:</p>
<blockquote><p>(Sec 103) It shall not be in order to consider any Senate bill or Senate amendment or conference [without] a list of&#8211; (1) all earmarks in such measure; (2) an identification of the Member or Members who proposed the earmark; and (3) an explanation of the essential governmental purpose for the earmark is available &#8230; to all Members <strong>and made available on the Internet</strong> to the general public for <strong>at least 48 hours</strong> before its consideration.</p>
</blockquote>
<blockquote><p>(Sec 104) It shall not be in order to consider a conference report unless such report is available to all Members and made available to the general public by means of the Internet for at least 48 hours before its consideration.</p>
</blockquote>
<p>Strangely, the bill does not require that <b>bills</b> (!) be available on the Internet for 48 hours before being voted on.  Just conference reports.  After a bill has been passed by both the House and the Senate, it&#8217;s often the case that the second chamber to get the bill has made amendments to the bill that the first chamber hasn&#8217;t yet gotten a chance to see.  In that case, a conference committee is formed to get the two chambers back in sync, and the final version of a bill comes out in a conference report.</p>
<p>Since it&#8217;s been introduced by Reid, I think it&#8217;s almost certainly going to get through the Senate.  The House seems to be off in its own world, so I&#8217;m not sure whether we&#8217;ll see this bill ever become law, but it&#8217;s got a good shot.</p>
]]></description>
			<content:encoded><![CDATA[<p>About a year ago, following a few scandals, the House and Senate saw a flurry of Congressional reform legislation get introduced&#8230; and then promptly ignored. Finally, however, we may see meaningful reform. Senate Majority Leader Harry Reid has introduced <a href="http://www.govtrack.us/congress/bill.xpd?bill=s110-1">S. 1: Commission to Strengthen Confidence in Congress Act of 2007</a>. The bill would make two incredibly important advances:</p>
<blockquote><p>(Sec 103) It shall not be in order to consider any Senate bill or Senate amendment or conference [without] a list of&#8211; (1) all earmarks in such measure; (2) an identification of the Member or Members who proposed the earmark; and (3) an explanation of the essential governmental purpose for the earmark is available &#8230; to all Members <strong>and made available on the Internet</strong> to the general public for <strong>at least 48 hours</strong> before its consideration.</p>
</blockquote>
<blockquote><p>(Sec 104) It shall not be in order to consider a conference report unless such report is available to all Members and made available to the general public by means of the Internet for at least 48 hours before its consideration.</p>
</blockquote>
<p>Strangely, the bill does not require that <b>bills</b> (!) be available on the Internet for 48 hours before being voted on. Just conference reports. After a bill has been passed by both the House and the Senate, it&#8217;s often the case that the second chamber to get the bill has made amendments that the first chamber hasn&#8217;t yet had a chance to see. In that case, a conference committee is formed to get the two chambers back in sync, and the final version of the bill comes out in a conference report.</p>
<p>Since it&#8217;s been introduced by Reid, I think it&#8217;s almost certainly going to get through the Senate.  The House seems to be off in its own world, so I&#8217;m not sure whether we&#8217;ll see this bill ever become law, but it&#8217;s got a good shot.</p>
]]></content:encoded>
			<wfw:commentRss>http://razor.occams.info/blog/2007/01/11/meaningful-reform/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Bounty on bringing state legislative info to the semantic web</title>
		<link>http://razor.occams.info/blog/2005/06/01/bounty-on-bringing-state-legislative-info-to-the-semantic-web/</link>
		<comments>http://razor.occams.info/blog/2005/06/01/bounty-on-bringing-state-legislative-info-to-the-semantic-web/#comments</comments>
		<pubDate>Wed, 01 Jun 2005 20:29:12 +0000</pubDate>
		<dc:creator>Joshua Tauberer</dc:creator>
		
		<category><![CDATA[GovTrack]]></category>

		<guid isPermaLink="false">http://taubz.for.net/blog/blog-6956298/1a5d86e4-4178-4ff9-8649-6ea9dacd7de2</guid>
		<description><![CDATA[<p>Taking the idea from the Gnome community, I&#8217;m placing a $150 bounty on the following programming project: Screen-scrape the website of a state legislature and put the information into XML or RDF so that it can be used on the semantic web. The code has to be released under an open-source-type license (and it must meet my personal satisfaction). It should parallel <a href="">what I&#8217;ve done with federal legislative information</a>. If you&#8217;re interested, join and post on the <a href="">GovTrack mail list</a>.</p>
<p>(The money comes from the advertising I have on <a href="">GovTrack</a>.)</p>
]]></description>
			<content:encoded><![CDATA[<p>Taking the idea from the Gnome community, I&#8217;m placing a $150 bounty on the following programming project: Screen-scrape the website of a state legislature and put the information into XML or RDF so that it can be used on the semantic web. The code has to be released under an open-source-type license (and it must meet my personal satisfaction). It should parallel <a href="">what I&#8217;ve done with federal legislative information</a>. If you&#8217;re interested, join and post on the <a href="">GovTrack mail list</a>.</p>
<p>(The money comes from the advertising I have on <a href="">GovTrack</a>.)</p>
]]></content:encoded>
			<wfw:commentRss>http://razor.occams.info/blog/2005/06/01/bounty-on-bringing-state-legislative-info-to-the-semantic-web/feed/</wfw:commentRss>
		</item>
		<item>
		<title>My Trip to D.C.</title>
		<link>http://razor.occams.info/blog/2005/03/09/my-trip-to-dc/</link>
		<comments>http://razor.occams.info/blog/2005/03/09/my-trip-to-dc/#comments</comments>
		<pubDate>Wed, 09 Mar 2005 14:01:29 +0000</pubDate>
		<dc:creator>Joshua Tauberer</dc:creator>
		
		<category><![CDATA[GovTrack]]></category>

		<guid isPermaLink="false">http://taubz.for.net/blog/blog-6956298/6dea3568-0101-4f9b-bdfe-8ca748e71deb</guid>
		<description><![CDATA[<p>Last night I got back from a two-day trip to D.C. The point of the<br />
trip was to make a presentation about <a href="">GovTrack</a> and also to start some<br />
collaboration with others on expanding the political information that<br />
is freely and openly available online.</p>
<p>Monday afternoon I presented GovTrack and some ideas about the<br />
semantic web to the people who are responsible for getting some<br />
aspects of legislative information posted online in XML format. Right<br />
now GovTrack gets its information from screen-scraping, which is an<br />
inexact and fragile process of extracting information out of the same<br />
HTML pages that you see when you view web sites. Having data<br />
published also in XML format can greatly improve the accuracy of<br />
getting information. What the people at the clerk of the House have<br />
done to date, in terms of getting bills written in XML and roll call<br />
votes posted in XML, has been a great step forward, although it<br />
hasn&#8217;t been that useful for GovTrack. (One reason is the Senate<br />
hasn&#8217;t followed suit because, as I understand it, the clerk of the<br />
Senate isn&#8217;t authorized by the Senate itself to work on such things.)</p>
<p>I think I&#8217;ve now met almost all of the<br />
players in the arena of building a network of political<br />
information. Between everyone involved, we have enough data and enthusiasm to get<br />
something unique and useful started.</p>
<p>(For more details, see <a href="">my posting on the GovTrack blog</a>.)</p>
]]></description>
			<content:encoded><![CDATA[<p>Last night I got back from a two-day trip to D.C. The point of the<br />
trip was to make a presentation about <a href="">GovTrack</a> and also to start some<br />
collaboration with others on expanding the political information that<br />
is freely and openly available online.</p>
<p>Monday afternoon I presented GovTrack and some ideas about the<br />
semantic web to the people who are responsible for getting some<br />
aspects of legislative information posted online in XML format. Right<br />
now GovTrack gets its information from screen-scraping, which is an<br />
inexact and fragile process of extracting information out of the same<br />
HTML pages that you see when you view web sites. Having data<br />
published also in XML format can greatly improve the accuracy of<br />
getting information. What the people at the clerk of the House have<br />
done to date, in terms of getting bills written in XML and roll call<br />
votes posted in XML, has been a great step forward, although it<br />
hasn&#8217;t been that useful for GovTrack. (One reason is the Senate<br />
hasn&#8217;t followed suit because, as I understand it, the clerk of the<br />
Senate isn&#8217;t authorized by the Senate itself to work on such things.)</p>
<p>I think I&#8217;ve now met almost all of the<br />
players in the arena of building a network of political<br />
information. Between everyone involved, we have enough data and enthusiasm to get<br />
something unique and useful started.</p>
<p>(For more details, see <a href="">my posting on the GovTrack blog</a>.)</p>
]]></content:encoded>
			<wfw:commentRss>http://razor.occams.info/blog/2005/03/09/my-trip-to-dc/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Diffing and RDF</title>
		<link>http://razor.occams.info/blog/2005/03/05/diffing-and-rdf/</link>
		<comments>http://razor.occams.info/blog/2005/03/05/diffing-and-rdf/#comments</comments>
		<pubDate>Sat, 05 Mar 2005 12:52:08 +0000</pubDate>
		<dc:creator>Joshua Tauberer</dc:creator>
		
		<category><![CDATA[Code]]></category>

		<category><![CDATA[GovTrack]]></category>

		<category><![CDATA[Semantic Web]]></category>

		<guid isPermaLink="false">http://taubz.for.net/blog/blog-6956298/74a788a4-7a09-4678-a6de-0f78ce907e0f</guid>
		<description><![CDATA[<p><span style="">If you&#8217;re reading this, you&#8217;re probably reading this on Monologue, and that means I&#8217;ve successfully added myself to Monologue. :-)</span></p>
<p>Recently I got a helpful bug report for my <a href="">Diff library for C#</a> which pointed out that my port of Perl&#8217;s Algorithm::Diff wasn&#8217;t generating the same diffs as the original module. I fixed the bug and reposted a new version of the library.</p>
<p>In unrelated news, I&#8217;m working on building the semantic web for information about the U.S. government. This is a spin-off of my work on <a href="">GovTrack</a> (which is powered by Mono). To get this web built, I&#8217;m in the position of having to convince people that RDF is the right way to approach the problem of distributed information &#8212; over, for instance, XML, XML Schema, and XQuery. The problem is that RDF is complicated and often misunderstood, and I hadn&#8217;t found a good document explaining what RDF is and why it should be used for this. So, <a href="">I wrote one</a>. I&#8217;m not a master of RDF by any means, so any corrections and suggestions are welcome.</p>
<p>By the way, if you&#8217;re interested in building this political semantic web, join the <a href="">GovTrack mail list</a>.</p>
<p>Lastly, with my new interest in RDF, I was looking for a good C# library for working with RDF data models. I didn&#8217;t find one that I particularly liked (there are a few out there, but for various reasons I just couldn&#8217;t see myself using them), so I&#8217;m working on my own. I&#8217;ll post the source in a few weeks, probably.</p>
]]></description>
			<content:encoded><![CDATA[<p><span style="">If you&#8217;re reading this, you&#8217;re probably reading this on Monologue, and that means I&#8217;ve successfully added myself to Monologue. :-)</span></p>
<p>Recently I got a helpful bug report for my <a href="">Diff library for C#</a> which pointed out that my port of Perl&#8217;s Algorithm::Diff wasn&#8217;t generating the same diffs as the original module. I fixed the bug and reposted a new version of the library.</p>
<p>In unrelated news, I&#8217;m working on building the semantic web for information about the U.S. government. This is a spin-off of my work on <a href="">GovTrack</a> (which is powered by Mono). To get this web built, I&#8217;m in the position of having to convince people that RDF is the right way to approach the problem of distributed information &#8212; over, for instance, XML, XML Schema, and XQuery. The problem is that RDF is complicated and often misunderstood, and I hadn&#8217;t found a good document explaining what RDF is and why it should be used for this. So, <a href="">I wrote one</a>. I&#8217;m not a master of RDF by any means, so any corrections and suggestions are welcome.</p>
<p>By the way, if you&#8217;re interested in building this political semantic web, join the <a href="">GovTrack mail list</a>.</p>
<p>Lastly, with my new interest in RDF, I was looking for a good C# library for working with RDF data models. I didn&#8217;t find one that I particularly liked (there are a few out there, but for various reasons I just couldn&#8217;t see myself using them), so I&#8217;m working on my own. I&#8217;ll post the source in a few weeks, probably.</p>
]]></content:encoded>
			<wfw:commentRss>http://razor.occams.info/blog/2005/03/05/diffing-and-rdf/feed/</wfw:commentRss>
		</item>
		<item>
		<title>My Face Is There</title>
		<link>http://razor.occams.info/blog/2005/01/10/my-face-is-there/</link>
		<comments>http://razor.occams.info/blog/2005/01/10/my-face-is-there/#comments</comments>
		<pubDate>Mon, 10 Jan 2005 04:00:14 +0000</pubDate>
		<dc:creator>Joshua Tauberer</dc:creator>
		
		<category><![CDATA[GovTrack]]></category>

		<guid isPermaLink="false">http://taubz.for.net/blog/blog-6956298/0097a800-69e6-45aa-8f16-1c7daa358f59</guid>
		<description><![CDATA[<p>Woot!  <a href="">http://www.technorati.com/.</a></p>
]]></description>
			<content:encoded><![CDATA[<p>Woot!  <a href="">http://www.technorati.com/.</a></p>
]]></content:encoded>
			<wfw:commentRss>http://razor.occams.info/blog/2005/01/10/my-face-is-there/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Mail Lists, Diffs, XPD</title>
		<link>http://razor.occams.info/blog/2004/11/03/mail-lists-diffs-xpd/</link>
		<comments>http://razor.occams.info/blog/2004/11/03/mail-lists-diffs-xpd/#comments</comments>
		<pubDate>Thu, 04 Nov 2004 01:28:54 +0000</pubDate>
		<dc:creator>Joshua Tauberer</dc:creator>
		
		<category><![CDATA[Code]]></category>

		<category><![CDATA[GovTrack]]></category>

		<guid isPermaLink="false">http://taubz.for.net/blog/blog-6956298.post-109953179516441182</guid>
		<description><![CDATA[<div><p>For anyone potentially reading, I’ve set up new mail lists for <a href="">GovTrack</a> and my <a href="">Thunderbird SPF Extension</a>. If you have an interest in either thing, please visit the site and join the list.</p>
<p>I’ve also posted a library for <a href="">diffing/merging/patching</a> written in C#, based on the Perl module Algorithm::Diff. And I posted the source for <a href="">XPD</a>, the XML pipeline document generation engine that I wrote to power GovTrack. These things have helped me; I hope they help you.</p>
<p>GovTrack is now set up on a new server, and it’s much, much more responsive than it used to be. In fact, you can’t tell anymore that it’s doing lots of XSLT transformations on each request.</p></div>
]]></description>
			<content:encoded><![CDATA[<div><p>For anyone potentially reading, I’ve set up new mail lists for <a href="">GovTrack</a> and my <a href="">Thunderbird SPF Extension</a>. If you have an interest in either thing, please visit the site and join the list.</p>
<p>I’ve also posted a library for <a href="">diffing/merging/patching</a> written in C#, based on the Perl module Algorithm::Diff. And I posted the source for <a href="">XPD</a>, the XML pipeline document generation engine that I wrote to power GovTrack. These things have helped me; I hope they help you.</p>
<p>GovTrack is now set up on a new server, and it’s much, much more responsive than it used to be. In fact, you can’t tell anymore that it’s doing lots of XSLT transformations on each request.</p></div>
]]></content:encoded>
			<wfw:commentRss>http://razor.occams.info/blog/2004/11/03/mail-lists-diffs-xpd/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Track Congress on Your Site</title>
		<link>http://razor.occams.info/blog/2004/10/14/track-congress-on-your-site/</link>
		<comments>http://razor.occams.info/blog/2004/10/14/track-congress-on-your-site/#comments</comments>
		<pubDate>Thu, 14 Oct 2004 13:01:30 +0000</pubDate>
		<dc:creator>Joshua Tauberer</dc:creator>
		
		<category><![CDATA[GovTrack]]></category>

		<guid isPermaLink="false">http://taubz.for.net/blog/blog-6956298.post-109775907045667488</guid>
		<description><![CDATA[<div><p>Presumably no one is actually reading this blog, although server logs say otherwise. Anywho&#8230;</p>
<p>GovTrack has a nice new feature where you can embed GovTrack’s tracked events into your own website.  I’ve embedded the latest intellectual property-related events on this page, to the right and down.</p></div>
]]></description>
			<content:encoded><![CDATA[<div><p>Presumably no one is actually reading this blog, although server logs say otherwise. Anywho&#8230;</p>
<p>GovTrack has a nice new feature where you can embed GovTrack’s tracked events into your own website.  I’ve embedded the latest intellectual property-related events on this page, to the right and down.</p></div>
]]></content:encoded>
			<wfw:commentRss>http://razor.occams.info/blog/2004/10/14/track-congress-on-your-site/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
