uCSV: Unified Character Separated Values

A Draft Recommendation

By Joshua Tauberer

8/24/2009

Introduction

This is a proposal for a mostly backwards compatible specification for a flat-file representation of tabular data like CSV (comma separated values). CSV has never been one particular standard and even standards such as RFC 4180 leave much to be desired, namely removing the optionality of a header row, making the character encoding explicit, and being appropriate in an international context where the comma may be used as a decimal separator.

This specification for Unified Character Separated Values (uCSV) allows most nonalphanumeric character besides the quotation mark, space, and carriage return characters to be used as the delimiter. The header row, obligatory in uCSV, is used to determine the separator in use. The character encoding is required to be UTF-8. This is essentially a specification for a generic delimiter separated values (DSV) format.

When the comma separator is chosen and only ASCII characters are used, the format is compatible with RFC 4180 CSV. Accordingly, the .csv file extension and text/csv MIME type are recommended for uCSV files.

Without further ado, the format follows, based largely on RFC 4180.

Definition of uCSV

  1. uCSV files are comprised of character data encoded in UTF-8.
  2. A uCSV file consists of one or more records each on a separate line. Lines are terminated with CRLF characters (i.e. DOS format). Each record in the file must have the same number of fields.
  3. The last line need not have a final line break CRLF.
  4. Within each record, there is one or more fields seprated by the file’s delimiter. One delimiter must be chosen and used throughout the file as a field separator. The delimiter may be any single Unicode character excluding all letter and number characters (as defined by the Unicode specification), the space, the straight double quote, carriage return, and line feed. There cannot be a delimiter character after the last field.
  5. Each field must be enclosed in the straight double quote character if it contains leading or trailing spaces, the delimiter character, a line break (either CR or LF characters), or the straight double quote character. If the field contains a straight double quote character within it, then it must be escaped by a second straight double quote character preceding it.
  6. The first line is always a header record. The header record has names corresponding to the fields in the records throughout the file. The header line has an additional requirement: fields must also be enclosed in straight double quotes if the field contains any character that is permitted as a delimiter in this specification

Parsing uCSV

To parse uCSV, the parser must first determine the delimiter character. It can do this by reading the header line and looking for any character that might be a delimiter (see above) that is not enclosed in quotes. That is:

  1. Scan the header row characters from left to right.
  2. If the character is a straight double quote and we're not <inside a quote>.
  3. then we're now <inside a quote> until further notice and go on, else
  4. If the character is a straight double quote and the next character is also a straight double quote, skip these two characters and go on, else
  5. If the character is a straight double quote, we're no longer <inside a quote> and go on, else
  6. If we're <inside a quote>, go on, else
  7. If this is not a possible delimiter character (see above), go on, else
  8. This is the delimiter character for the file.
  9. If no delimiter is found, the file has one field per row and there is no delimiter needed.
  10. Re-scan the header row and read the rest of the file now that we know the delimiter character.

Examples

Here is an example of uCSV. It is also CSV. Note that the header row field "trips/year" is enclosed in quotes because it has a non-alphanumeric character (/) that could be the file delimiter. The / does not need to be enclosed in quotes later on because after the header field it is known to not be the delimiter.

ID,name,"trips/year",webpage
123,Joe,10,http://www.example.org
456,Ken,5,http://www.example.com

The same could be done with semicolons.

ID;name;"trips/year";webpage
123;Joe;10;http://www.example.org
456;Ken;5;http://www.example.com

A good way to ensure the header row is done right is to always wrap the fields in quotes so that only the delimiter is exposed outside of quotes.

"ID"|"name"|"trips/year"|"webpage"
123|Joe|10|http://www.example.org
456|Ken|5|http://www.example.com