Everybody celebrates November 5th for it's Fawkesian roots. But where's the love for it being the day that Doc Brown came up with the idea for the flux capacitor? Sure, only one of them actually has merit, but still… Great Scott!
So, due to recommendations to use Markov chains for text generation after my last little script on creating fake words, I have finally gotten around to learning and implementing the Markov chain algorithm for text generation in Python. There were a few different implementations already available, but they all seemed a bit gimpy. The solution I wrote up aims to be able to support high volumes of text. Take a look at what I conjured up and tell me what you think:
import collections
import random
class DynamicDie(collections.defaultdict):
def __init__(self):
collections.defaultdict.__init__(self, int)
def add_side(self, side):
self[side] += 1
def total_sides(self):
return sum(self.values())
def roll(self):
random_num = random.randint(0, self.total_sides())
total_pos = 0
for side, qty in self.items():
total_pos += qty
if random_num <= total_pos:
return side
class MarkovChain(collections.defaultdict):
""" Yet another markov chain implementation.
This one differs in the sense that it is able to better support
huge amounts of data since the weighted randomization doesn't rely
on duplicates.
"""
# Sentinals
# Discussion here: http://stackoverflow.com/questions/1677726
class START(object):pass
class END(object):pass
def __init__(self):
collections.defaultdict.__init__(self, DynamicDie)
def add(self, iterable):
""" Insert an iterable (pattern) item into the markov chain.
The order of the pattern will define more of the chain.
"""
item1 = item2 = MarkovChain.START
for item3 in iterable:
self[(item1, item2)].add_side(item3)
item1 = item2
item2 = item3
self[(item1, item2)].add_side(MarkovChain.END)
def random_output(self, max=100):
""" Generate a list of elements from the markov chain.
The `max` value is in place in order to prevent excessive iteration.
"""
output = []
item1 = item2 = MarkovChain.START
for i in range(max-2):
item3 = self[(item1, item2)].roll()
if item3 is MarkovChain.END:
break
output.append(item3)
item1 = item2
item2 = item3
return output
It's not really complicated as long as you understand Markov chains. Let's use this in an example to generate some fake words:
import markov
chain = markov.MarkovChain()
# Load the dictionary into the markov chain
dictionary = "/usr/share/dict/words"
for word in open(dictionary):
word = word.strip()
if word != "" and not word.endswith("'s"):
chain.add(word)
# Let's generate 10 random pseudowords
for i in range(10):
print "".join(chain.random_output())
The way I did it above isn't actually that good since it loads every word in the English language into the Markov chain. Of course, you don't need all of the words to get good looking output. This is just an example, though. I'll leave it up to you to make a sufficient solution.
As always, if you see a way that the code in this post could be improved, please leave a comment!
Update: I updated it to use the dynamic die object. This solves the linear search problem that Marius mentions below.