uCSV: Unified Character Separated Values
A Draft Recommendation
By Joshua Tauberer
This is a proposal for a mostly backwards compatible
specification for a flat-file representation of tabular data
like CSV (comma separated values). CSV has never been one
particular standard and even standards such as
leave much to be desired, namely removing the optionality
of a header row, making the character encoding explicit,
and being appropriate in an international context where
the comma may be used as a decimal separator.
This specification for Unified Character Separated Values (uCSV)
allows most nonalphanumeric character besides the quotation mark, space,
and carriage return characters to be used as the delimiter. The header
row, obligatory in uCSV, is used to determine the separator in use. The
character encoding is required to be UTF-8. This is essentially a
specification for a generic delimiter separated values (DSV) format.
When the comma separator is chosen and only ASCII characters
are used, the format is compatible with RFC 4180 CSV. Accordingly,
the .csv file extension and text/csv MIME type are recommended
for uCSV files.
Without further ado, the format follows, based largely
on RFC 4180.
Definition of uCSV
- uCSV files are comprised of character data encoded
- A uCSV file consists of one or more records each on
a separate line. Lines are terminated with CRLF characters
(i.e. DOS format). Each record in the file must have the
same number of fields.
- The last line need not have a final line break CRLF.
- Within each record, there is one or more fields seprated
by the file’s delimiter. One delimiter must be chosen
and used throughout the file as a field separator. The
delimiter may be any single Unicode character excluding
all letter and number characters (as defined by the Unicode
specification), the space, the straight
double quote, carriage return, and line feed. There cannot be a
delimiter character after the last field.
- Each field must be enclosed in the straight double quote
character if it contains leading or trailing spaces, the
delimiter character, a line break (either CR or LF characters),
or the straight double quote character. If the field contains
a straight double quote character within it, then it must be
escaped by a second straight double quote character preceding it.
- The first line is always a header record.
The header record has names corresponding
to the fields in the records throughout the file. The header line
has an additional requirement: fields must also be enclosed in straight
double quotes if the field contains any character that is
permitted as a delimiter in this specification
To parse uCSV, the parser must first determine the
delimiter character. It can do this by reading the header line
and looking for any character that might be a delimiter (see above) that
is not enclosed in quotes. That is:
- Scan the header row characters from left to right.
- If the character is a straight double quote and we're not <inside a quote>. then we're now <inside a quote> until further notice and go on, else
- If the character is a straight double quote and the next character is also a straight double quote, skip these two characters and go on, else
- If the character is a straight double quote, we're no longer <inside a quote> and go on, else
- If we're <inside a quote>, go on, else
- If this is not a possible delimiter character (see above), go on, else
- This is the delimiter character for the file.
- If no delimiter is found, the file has one field per row and there is no delimiter needed.
- Re-scan the header row and read the rest of the file now that we know the delimiter character.
Here is an example of uCSV. It is also CSV. Note that the
header row field "trips/year" is enclosed in quotes because
it has a non-alphanumeric character (/) that could be the
file delimiter. The / does not need to be enclosed in quotes
later on because after the header field it is known to not
be the delimiter.
The same could be done with semicolons.
A good way to ensure the header row is done right
is to always wrap the fields in quotes so that only
the delimiter is exposed outside of quotes.