12 September 2010
I've recently had to do some work that required quickly sorting a very large CSV file containing fields with embedded newlines. As it turns out, the sort implementation that ships with Linux has a "--zero-terminated" option, which treats the null byte as the record delimiter instead of the default newline.
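For example, GNU sort can sort NUL-terminated records directly (a sketch; the file name and sort key here are made up for illustration):

```shell
# write two NUL-terminated rows, then sort on the first comma-separated field
printf 'banana,2\0apple,1\0' > /tmp/rows.csv
sort --zero-terminated --field-separator=',' --key=1,1 /tmp/rows.csv | tr '\0' '\n'
# prints:
#   apple,1
#   banana,2
```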
Writing null-terminated CSV files
Since I was writing the process that generates these CSV files, I figured I could just use Python's csv module, which has support for different dialects. By subclassing an existing dialect, we can write a simple one that terminates every row with a null byte.
import csv
import struct

class null_terminated(csv.excel):
    lineterminator = struct.pack('B', 0)

csv.register_dialect("null-terminated", null_terminated)
Essentially, we've registered a global csv dialect called "null-terminated" that inherits from the excel dialect, which has sensible standard defaults.
Here's a simple snippet that shows the usage of the new "null-terminated" dialect that I created above.
from csv import DictWriter

with open("/tmp/file.csv", "w") as f:
    dwriter = DictWriter(f, fieldnames=["id", "field"], dialect="null-terminated")
    for i, field in enumerate(("foo", "bar", "baz", "bif")):
        dwriter.writerow({"id": i, "field": field})
Now, /tmp/file.csv will contain four rows separated by null bytes instead of newlines. As you can see, writing a null-terminated CSV file is pretty easy; unfortunately, reading one back is a bit tricky due to some inflexible hardcoded defaults.
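To convince ourselves the writer did the right thing, we can inspect the raw bytes (a sketch; /tmp/demo.csv is just an illustrative path, and the rows are written by hand rather than through the dialect):

```python
# write a couple of NUL-terminated rows by hand, then read the raw bytes back
with open("/tmp/demo.csv", "w") as f:
    for row in ("0,foo", "1,bar"):
        f.write(row + "\0")

with open("/tmp/demo.csv", "rb") as f:
    raw = f.read()

# each row ends in a NUL byte, so splitting on it recovers the rows
print(raw.split(b"\0"))  # [b'0,foo', b'1,bar', b'']
```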
Reading null-terminated CSV files
Here we run into an unintuitive restriction in the csv module: the reader hard-codes '\r' and '\n' as end-of-line characters and ignores Dialect.lineterminator entirely, which unfortunately means we will need to handle null-termination and implement reading ourselves.
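A quick demonstration of that restriction: even when we pass a null-byte lineterminator, the reader still splits rows on the newlines in the input.

```python
import csv
import io

# lineterminator is silently ignored when reading; rows still split on '\n'
rows = list(csv.reader(io.StringIO(u"a,1\nb,2\n"), lineterminator="\0"))
print(rows)  # [['a', '1'], ['b', '2']]
```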
There are many ways of writing a procedure to read null-terminated strings, but I figured the simplest algorithm is to read character by character, appending everything to a buffer until we reach a null byte, then yield the buffered string. I figure an implementation might go something like this:
nullbyte = "\0"

def read(fobj):
    current_string = ""
    while True:
        char = fobj.read(1)
        if char and char != nullbyte:
            current_string += char
        elif char == nullbyte:
            yield current_string
            current_string = ""
        elif not char:
            if current_string:
                yield current_string
            return
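A quick sanity check of the generator, restated self-contained here (io.StringIO stands in for a real file object, and the sample data is made up):

```python
import io

nullbyte = "\0"

def read(fobj):
    current_string = ""
    while True:
        char = fobj.read(1)
        if char and char != nullbyte:
            current_string += char
        elif char == nullbyte:
            yield current_string
            current_string = ""
        elif not char:
            # EOF: flush whatever is left, even without a trailing NUL
            if current_string:
                yield current_string
            return

chunks = list(read(io.StringIO(u"0,foo\x001,bar\x00trailing")))
print(chunks)  # ['0,foo', '1,bar', 'trailing']
```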
Looks awesome, but how can we integrate this into the csv module? Ideally we'd plug straight into the existing machinery. A simple solution is to wrap the function above in a reader class that parses each extracted line, like so:
# we use StringIO since cStringIO has poor unicode support
from StringIO import StringIO
from csv import reader

null_byte = "\0"

class NullTerminatedDelimiterReader(object):
    """
    A CSV reader which will iterate over lines in the CSV file 'f',
    which are line terminated by a null byte
    """
    def __init__(self, f, dialect, *args, **kwds):
        # attributes a DictReader instance expects on its reader
        self._line_num = 0
        self.fobj = f
        self.dialect = dialect
        self.reader = self._read()
        self.string_io = StringIO()

    def _properly_parse_row(self, current_string):
        self.string_io.write(current_string)
        # seek back to the first byte
        self.string_io.seek(0)
        # we instantiate a reader here to properly parse the row,
        # taking into account escaping and various edge cases
        return next(reader(self.string_io, dialect=self.dialect))

    def _read(self):
        current_string = ""
        while True:
            char = self.fobj.read(1)  # read one byte
            if char and char != null_byte:
                # keep appending to the current string
                current_string += char
            elif char == null_byte:
                yield self._properly_parse_row(current_string)
                # increment instrumentation
                self._line_num += 1
                # clear the internal parsing buffer
                self.string_io.seek(0)
                self.string_io.truncate()
                # clear the row buffer
                current_string = ""
            elif not char:
                if current_string:
                    yield self._properly_parse_row(current_string)
                return

    @property
    def line_num(self):
        return self._line_num

    def next(self):
        return next(self.reader)

    def __iter__(self):
        return self
To keep the convenience of DictReader, we'll inherit from it and swap in our reader object. It's the cleanest and simplest way of doing it.
class NullByteDictReader(csv.DictReader):
    def __init__(self, f, *args, **kwds):
        csv.DictReader.__init__(self, f, *args, **kwds)
        self.reader = NullTerminatedDelimiterReader(f, *args, **kwds)

with open("/tmp/file.csv", "r") as f:
    for line in NullByteDictReader(f, dialect="null-terminated"):
        print line["id"], line["field"]
Voila :)
Conclusions and Future Work
Something that might be interesting to pursue further is writing, or wrapping a Python interface around, a C library as a substitute for the current csv module. It should support different line terminators and multi-byte delimiters, and handle unicode out of the box, which happen to be my three main gripes with the csv module.
For your convenience, I've put all the code in a gist. You should follow me on twitter.