Reading and Writing Null-Terminated CSV Files in Python

12 September 2010

I've recently had to do some work that required sorting a very large CSV file, containing fields with embedded newlines, quickly. As it turns out, Linux comes with a sort implementation that has a "--zero-terminated" option, which sorts on null-terminated delimited strings instead of the default newline separator.

Writing null-terminated CSV files

Since I was writing a process to generate these CSV files, I figured I can just use Python's CSV module, which has support for different types of dialects. Inheriting from csv.Dialect, we can write a simple dialect that will allow us to terminate all lines with a null byte.

import csv
import struct

class null_terminated(csv.excel):
    lineterminator = struct.pack('B', 0)

csv.register_dialect("null-terminated", null_terminated)

Essentially, we've registered a global csv dialect called "null-terminated" that inherits from the excel dialect, which has sensible standard defaults.

Here's a simple snippet that shows the usage of the new "null-terminated" dialect that I created above.

from csv import DictWriter

with open("/tmp/file.csv", "w") as f:
    dwriter = DictWriter(f, fieldnames=["id","field"], dialect="null-terminated")

    for i, field in enumerate(("foo", "bar", "baz", "bif")):
        dwriter.writerow({"id": i, "field": field})

Now, /tmp/file.csv will contain a file with four rows that are separated by a null-terminator. As you can see, it's pretty easy to write a null-terminated CSV file, but unfortunately, it's a bit tricky to read a null-terminated csv file due to some inflexible hardcoded defaults.

Reading null-terminated CSV files

The CSV module's unintuitive restriction for Dialect.lineterminator is hard-coded to recognize '\r' or '\n' as the end of line terminator, which unfortunately, means we will need to handle null-termination and implement reading ourselves.

There are many ways of writing a procedure to read null-terminated strings, but I figured the simplest algorithm is to read character-by-character, concatenating everything into a string until we reach a null byte, then we can just return the string. I'd figure an implementation might go something like this:

def read(fobj):
    current_string = ""
    while True:
        char = fobj.read(1)
        if char and char != nullbyte:
            current_string += char
        elif char == nullbyte:
            yield current_string
            current_string = ""
        elif not char:
            if current_string:
                yield current_string
            raise StopIteration

Looks awesome, but, how can we integrate this into the CSV module? We would want to just plug and play with the existing CSV module. A simple solution is to wrap the function above to iterate over each line, like so:

# we use StringIO since cStringIO has poor unicode support
from StringIO import StringIO
from csv import reader

class NullTerminatedDelimiterReader(object):
    """
    A CSV reader which will iterate over lines in the CSV file 'f',
    which are line terminated by a null byte

    """

    def __init__(self, f,  dialect, *args, **kwds):
        # satisfying DictReader instance
        self._line_num = 0
        self.fobj = f
        self.dialect = dialect
        self.reader = self._read()
        self.string_io = StringIO()

    def _properly_parse_row(self, current_string):
        self.string_io.write(current_string)
        # seek to the first byte
        self.string_io.seek(0)
        # we instantiate a reader here to properly parse the row
        # taking into account escaping, and various edge cases
        return next(reader(self.string_io, dialect=self.dialect))

    def _read(self):
        current_string = ""
        while True:
            char = self.fobj.read(1)  # read one byte
            if char and char != null_byte:
                # keep appending to the current string
                current_string += char
            elif char == null_byte:
                yield self._properly_parse_row(current_string)
                # increment instrumentation
                self._line_num += 1
                # clear internal reading buffer
                self.string_io.seek(0)
                self.string_io.truncate()
                # clear row
                current_string = ""
            elif not char:
                if current_string:
                    yield self._properly_parse_row(current_string)
                raise StopIteration

    @property
    def line_num(self):
        return self._line_num

    def next(self):
        return next(self.reader)

    def __iter__(self):
        return self

To use the DictReader class, we'll inherit from the DictReader class and override the reader object. It's the cleanest and simplest way of doing it.

class NullByteDictReader(csv.DictReader):
    def __init__(self, f, *args, **kwds):
        csv.DictReader.__init__(self, f, *args, **kwds)
        self.reader = NullTerminatedDelimiterReader(f, *args, **kwds)

with open("/tmp/file.csv", "r") as f:
    for line in NullByteDictReader(f, dialect="null-terminated"):
        print line["id"], line["field"]

Voila :)

Conclusions and Future Work

Something that might be interesting to pursue further is the possibility of writing, or wrapping a python interface around, a C library as a substitute for the current CSV module. It should be able to support different line terminators, multi-byte delimiters, and have unicode detection outside the box, which happen to be my main three gripes with the CSV module.

For your convenience, I've put all the code in a gist. You should follow me on twitter.

This entry was tagged as linux programming python

blog comments powered by Disqus