So, due to recommendations to use Markov chains for text generation after my last little script on creating fake words, I have finally gotten around to learning and implementing the Markov chain algorithm for text generation in Python. There were a few different implementations already available, but they all seemed a bit gimpy. The solution I wrote up aims to be able to support high volumes of text. Take a look at what I conjured up and tell me what you think:
import collections import random class DynamicDie(collections.defaultdict): def __init__(self): collections.defaultdict.__init__(self, int) def add_side(self, side): self[side] += 1 def total_sides(self): return sum(self.values()) def roll(self): random_num = random.randint(0, self.total_sides()) total_pos = 0 for side, qty in self.items(): total_pos += qty if random_num <= total_pos: return side class MarkovChain(collections.defaultdict): """ Yet another markov chain implementation. This one differs in the sense that it is able to better support huge amounts of data since the weighted randomization doesn't rely on duplicates. """ # Sentinals # Discussion here: http://stackoverflow.com/questions/1677726 class START(object):pass class END(object):pass def __init__(self): collections.defaultdict.__init__(self, DynamicDie) def add(self, iterable): """ Insert an iterable (pattern) item into the markov chain. The order of the pattern will define more of the chain. """ item1 = item2 = MarkovChain.START for item3 in iterable: self[(item1, item2)].add_side(item3) item1 = item2 item2 = item3 self[(item1, item2)].add_side(MarkovChain.END) def random_output(self, max=100): """ Generate a list of elements from the markov chain. The `max` value is in place in order to prevent excessive iteration. """ output =  item1 = item2 = MarkovChain.START for i in range(max-2): item3 = self[(item1, item2)].roll() if item3 is MarkovChain.END: break output.append(item3) item1 = item2 item2 = item3 return output
It’s not really complicated as long as you understand Markov chains. Let’s use this in an example to generate some fake words:
import markov chain = markov.MarkovChain() # Load the dictionary into the markov chain dictionary = "/usr/share/dict/words" for word in open(dictionary): word = word.strip() if word != "" and not word.endswith("'s"): chain.add(word) # Let's generate 10 random pseudowords for i in range(10): print "".join(chain.random_output())
The way I did it above isn’t actually that good since it loads every word in the English language into the Markov chain. Of course, you don’t need all of the words to get good looking output. This is just an example, though. I’ll leave it up to you to make a sufficient solution.
As always, if you see a way that the code in this post could be improved, please leave a comment!
Update: I updated it to use the dynamic die object. This solves the linear search problem that Marius mentions below.