01 June 2010
Over the weekend, I was working on upgrading python-ngrams because I had discovered a bug where the tokenization was incorrect. I was reading a research paper that was describing q-grams and while following their examples, I realized I was getting incorrect results for a fundamental n-gram result.
The tokenization that's required here is quite simple, given some iterable consisting of values [x0, x1, x2, x3...,xi], produce an exhaustive iterator of n-tuples such that it satisfies (x0,...xn), (x1,...xn+1), (xm,...xi).
To make this easier to understand, if given a list of [0, 1, 2, 3, 4, 5, 6], I want to be able to return an iterator such that:
for first, second, third in n_wise([0, 1, 2, 3, 4, 5, 6], 3): print first, second, third
Will have a result of:
0, 1, 2 1, 2, 3 2, 3, 4 3, 4, 5 4, 5, 6
Fortunately, this wasn't too difficult as the trivial implementation of this is already done in itertools, under pairwise. Below I've implemented a n-wise iterator implementation that can take any iterator and return n-iterators where each iterator is advanced by a step ahead of the other.
from itertools import tee, izip def n_wise(iterable, n): """ Returns n iterators for an iterable that are sequentially n-wise """ n_iterators = tee(iterable, n) zippables = [n_iterators] for advance, iteratee in enumerate(n_iterators[1:]): advance += 1 # since enumerate is 0 indexed. while advance > 0: # we advance the iterator ``advance+1`` steps next(iteratee, None) advance -= 1 # append everything to the zippables zippables.append(iteratee) # return the izip expansion of each iterator return izip(*zippables)
I find that I sometimes need to open an iterator for a file and I need to read n-wise lines each step, this iterator will do just that. For sake of completeness, look at how concise, powerful, and easy to use this combination is:
# assuming that n_wise is imported into this namespace from functools import partial triple_wise_line_reader = partial(n_wise, n=3) for line1, line2, line3 in triple_wise_line_reader(open("some_file.log", "r")): # do some computation with line1, line2 and line3 do_something(line1, line2, line3) # next step: # line1 <- line2 # line2 <- line3 # line3 <- line4
Imagine doing this in C ;) It will be just ugly!
Please let me know if you find this useful or you have a better implementation to the problem!blog comments powered by Disqus