Archive for the ‘Python’ Category.

Python Markov Chains and how to use them

So, due to recommendations to use Markov chains for text generation after my last little script on creating fake words, I have finally gotten around to learning and implementing the Markov chain algorithm for text generation in Python. There were a few different implementations already available, but they all seemed a bit gimpy. The solution I wrote up aims to be able to support high volumes of text. Take a look at what I conjured up and tell me what you think:

import collections
import random
 
 
class DynamicDie(collections.defaultdict):
 
    def __init__(self):
        collections.defaultdict.__init__(self, int)
 
    def add_side(self, side):
        self[side] += 1
 
    def total_sides(self):
        return sum(self.values())
 
    def roll(self):
        random_num = random.randint(0, self.total_sides())
        total_pos = 0
        for side, qty in self.items():
            total_pos += qty
            if random_num <= total_pos:
                return side
 
 
class MarkovChain(collections.defaultdict):
    """ Yet another markov chain implementation.
        This one differs in the sense that it is able to better support
        huge amounts of data since the weighted randomization doesn't rely
        on duplicates.
    """
 
    # Sentinals 
    # Discussion here: http://stackoverflow.com/questions/1677726
    class START(object):pass
    class END(object):pass
 
    def __init__(self):
        collections.defaultdict.__init__(self, DynamicDie)
 
    def add(self, iterable):
        """ Insert an iterable (pattern) item into the markov chain.
            The order of the pattern will define more of the chain.
        """
        item1 = item2 = MarkovChain.START
        for item3 in iterable:
            self[(item1, item2)].add_side(item3)
            item1 = item2
            item2 = item3
        self[(item1, item2)].add_side(MarkovChain.END)
 
    def random_output(self, max=100):
        """ Generate a list of elements from the markov chain.
            The `max` value is in place in order to prevent excessive iteration.
        """
        output = []
        item1 = item2 = MarkovChain.START
        for i in range(max-2):
            item3 = self[(item1, item2)].roll()
            if item3 is MarkovChain.END:
                break
            output.append(item3)
            item1 = item2
            item2 = item3
        return output

It's not really complicated as long as you understand Markov chains. Let's use this in an example to generate some fake words:

import markov
chain = markov.MarkovChain()
 
# Load the dictionary into the markov chain
dictionary = "/usr/share/dict/words"
for word in open(dictionary):
    word = word.strip()
    if word != "" and not word.endswith("'s"):
        chain.add(word)
 
# Let's generate 10 random pseudowords
for i in range(10):
    print "".join(chain.random_output())

The way I did it above isn't actually that good since it loads every word in the English language into the Markov chain. Of course, you don't need all of the words to get good looking output. This is just an example, though. I'll leave it up to you to make a sufficient solution.

As always, if you see a way that the code in this post could be improved, please leave a comment!

Update: I updated it to use the dynamic die object. This solves the linear search problem that Marius mentions below.

Send a message THROUGH TIME in Python

Who says time travel isn't possible?

import time
 
def send_to_future(message, seconds):
    time.sleep(seconds)
    return message

I'll leave it up to someone else to write the send_to_past() function.

Creating fake words: a pseudoword generator

I was trying to figure out how to create a word that's not a word. What I ended up doing was creating a way of generating a random syllable, and then simply appending 2 or 3 of them together. It seems to work well enough. Here's what I got in Python:

import random
 
vowels = ["a", "e", "i", "o", "u"]
consonants = ['b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 
              'r', 's', 't', 'v', 'w', 'x', 'y', 'z']
 
def _vowel():
    return random.choice(vowels)
 
def _consonant():
    return random.choice(consonants)
 
def _cv():
    return _consonant() + _vowel()
 
def _cvc():
    return _cv() + _consonant()
 
def _syllable():
    return random.choice([_vowel, _cv, _cvc])()
 
def create_fake_word():
    """ This function generates a fake word by creating between two and three
        random syllables and then joining them together.
    """
    syllables = []
    for x in range(random.randint(2,3)):
        syllables.append(_syllable())
    return "".join(syllables)
 
if __name__ == "__main__":
    print create_fake_word()

The first four functions are for generating an individual type of syllable (V, CV, or CVC) and then _syllable() just chooses one of them at random. Finally, create_fake_word() calls _syllable() a few times and joins them together. Here is some example output:

hojocina
eliphaa
deaketyed
ciboa
tiuzi

I haven't a clue whether or not there is a better means of generating words that look somewhat real. If you know of a better method, I'd love to hear it!

A simple event-driven plugin system in Python

So I just wanted to share my solution for integrating a plugin system into Python applications. Overall it's separated into four parts: the plugin manager, the plugin(s), a config file listing the plugins, and the application that triggers the events. In this system, plugins are just functions that get registered through the use of a decorator.

Let's start with the code for the plugin manager:

# pluginmanager.py
from collections import defaultdict
 
 
plugins = defaultdict(list)
def register(*events):
    """ This decorator is to be used for registering a function as a plugin for
        a specific event or list of events. 
    """
    def registered_plugin(funct):
        for event in events:
            plugins[event].append(funct)
        return funct
    return registered_plugin
 
 
def trigger_event(event, *args, **kwargs):
    """ Call this function to trigger an event. It will run any plugins that
        have registered themselves to the event. Any additional arguments or
        keyword arguments you pass in will be passed to the plugins.
    """
    for plugin in plugins[event]:
        plugin(*args, **kwargs)
 
 
def load_plugins(config_file):
    """ This reads a config file of a list of plugins to load. It ignores
        empty lines or lines beginning with a hash mark (#). It is so plugin
        imports are more dynamic and you don't need to continue appending
        import statements to the top of a file.
    """
    with open(config_file, "r") as fh:
        for line in fh:
            line = line.strip()
            if line.startswith("#") or line == "":
                continue
            __import__(line,  globals(), locals(), [], -1)

The plugin system is separated into three functions. The first is register() which is the decorator for registering plugins. The function trigger_event() is used by your application notify the plugin system that an event has occured. It goes out and runs any plugins registered with the event. Lastly, load_plugins() accepts a file location that contains the plugins to be loaded. We'll investigate these more closely later on in the article.

An example plugin

Let's build a package called "plugins" and create a module inside that called "example.py" with the following data:

import pluginmanager
 
@pluginmanager.register("FOO_ACTIVE")
def print_data(*args, **kwargs):
    if "data" in kwargs:
        print "Received the following: %s" % kwargs["data"]
    else:
        print "Didn't receive any data."

Here, the print_data plugin is registered for the FOO_ACTIVE event, and will be run if that event is triggered.

How to trigger the plugin

First, the plugin needs to be loaded. Right now, the source file is just sitting out there and unless we import it we'll never be able to do anything with it. So, you can either do a manual import or you can call the load_plugins() function and have it parse a list of plugin locations. We'll use the latter method.

The config file (plugins.list)

Let's create a config file with data resembling the following:

# Blank lines and lines starting with a hash (#) are ignored.
# So be as verbose as you want.
plugins.example

As you can see from above, it specifies to import the plugins.example module (since plugins is a package and example is a module inside of it). I'm doing it this way so adding or removing additional plugins is simple.

Application use

Finally, let's look at how an application would utilize this system:

import pluginmanager
pluginmanager.load_plugins("plugins.list") # load any plugins in the list
 
# ... code ...
 
# The FOO_ACTIVE event occurs somewhere
pluginmanager.trigger_event("FOO_ACTIVE", 
                            data="this data is sent to the plugin", 
                            foo="so is this")
 
# ... code ...

What happenes here is pretty straightforward. The manger loads the plugins, the application eventually triggers the FOO_ACTIVE event and sends data along with it. Any arguments or keyword arguments after the event in trigger_event() are sent directly to the plugin. When the event is triggered, it finds any plugins that have registered with it and activates it.

So yeah, that's what I came up with. It does have the downside of not being able to easily extend a plugin like you can with class-based systems since you can't inherit. Still, it's pretty nice for smaller apps that want a taste of pluggability.

If you have your own system for dealing with this type of situation, I'd love to hear from you.

Sexy Lexing with Python

Lexical analysis, a daunting task right? Wrong! In the following document we'll walk through different methods of lexical scanning in Python. First, we'll look at a pre-built solution found in the library, and then at a custom-built solution.

Using re.Scanner

In the re module there is a class called Scanner that can do lexical sanning. It is completely undocumented other than for a small example code block found on the Python-Dev mailing list, but well worth mentioning. It works by feeding in a list of regular-expressions and callback functions linked to them. When it matches a token, it first runs its value through the appropriate callback and then appends it to the token list being returned. If the scanner reaches a spot where a token match cannot be found, it returns what matches (if any) it did have along with the rest of the document that couldn't be matched. Here is an example:

import re
 
def identifier(scanner, token): return "IDENT", token
def operator(scanner, token):   return "OPERATOR", token
def digit(scanner, token):      return "DIGIT", token
def end_stmnt(scanner, token):  return "END_STATEMENT"
 
scanner = re.Scanner([
    (r"[a-zA-Z_]\w*", identifier),
    (r"\+|\-|\\|\*|\=", operator),
    (r"[0-9]+(\.[0-9]+)?", digit),
    (r";", end_stmnt),
    (r"\s+", None),
    ])
 
tokens, remainder = scanner.scan("foo = 5 * 30; bar = bar - 60;")
for token in tokens:
    print token

Which provides the output:

('IDENT', 'foo')
('OPERATOR', '=')
('DIGIT', '5')
('OPERATOR', '*')
('DIGIT', '30')
END_STATEMENT
('IDENT', 'bar')
('OPERATOR', '=')
('IDENT', 'bar')
('OPERATOR', '-')
('DIGIT', '60')
END_STATEMENT

Truly easy, fast, and relatively simple to understand.
Using this is perfect for small projects, but it has some downsides such as not allowing simple error handling and not implicitly handling whitespace. Additionally, it suffers from having to tokenize the whole document before being able to provide anything, and that can get costly on larger projects.

Custom-Built Lexer

I had decided to build a custom lexer as a means to break away from the re.Scanner. Here is the code for the actual lexer. It is broken into three classes: UnknownTokenError which gets thrown when a non-recognized token is found, Lexer which holds the settings for scanning, and _InputScanner which is in charge of scanning specific input, as the name implies. A few benefits built into the Lexer include automatic whitespace handling (if desired) and the ability to easily make the scan case-insensitive. Additionally, you can optionally provide a callback with the rule to run the token through before returning it by making the rule a tuple of the rule and callback.

import re
 
 
class UnknownTokenError(Exception):
    """ This exception is for use to be thrown when an unknown token is
        encountered in the token stream. It hols the line number and the
        offending token.
    """
    def __init__(self, token, lineno):
        self.token = token
        self.lineno = lineno
 
    def __str__(self):
        return "Line #%s, Found token: %s" % (self.lineno, self.token)
 
 
class _InputScanner(object):
    """ This class manages the scanning of a specific input. An instance of it is
        returned when scan() is called. It is built to be great for iteration. This is
        mainly to be used by the Lexer and ideally not directly.
    """
 
    def __init__(self, lexer, input):
        """ Put the lexer into this instance so the callbacks can reference it 
            if needed.
        """
        self._position = 0
        self.lexer = lexer
        self.input = input
 
    def __iter__(self):
        """ All of the code for iteration is controlled by the class itself.
            This and next() (or __next__() in Python 3.0) are so syntax
            like `for token in Lexer(...):` is valid and works.
        """
        return self
 
    def next(self):
        """ Used for iteration. It returns token after token until there
            are no more tokens. (change this to __next__(self) if using Py3.0)
        """
        if not self.done_scanning():
            return self.scan_next()
        raise StopIteration
 
    def done_scanning(self):
        """ A simple boolean function that returns true if scanning is
            complete and false if it isn't.
        """
        return self._position >= len(self.input)
 
    def scan_next(self):
        """ Retreive the next token from the input. If the
            flag `omit_whitespace` is set to True, then it will
            skip over the whitespace characters present.
        """
        if self.done_scanning():
            return None
        if self.lexer.omit_whitespace:
            match = self.lexer.ws_regexc.match(self.input, self._position)
            if match:
                self._position = match.end()
        match = self.lexer.regexc.match(self.input, self._position)
        if match is None:
            lineno = self.input[:self._position].count("\n") + 1
            raise UnknownTokenError(self.input[self._position], lineno)
        self._position = match.end()
        value = match.group(match.lastgroup)
        if match.lastgroup in self.lexer._callbacks:
            value = self.lexer._callbacks[match.lastgroup](self, value)
        return match.lastgroup, value
 
 
class Lexer(object):
    """ A lexical scanner. It takes in an input and a set of rules based
        on reqular expressions. It then scans the input and returns the
        tokens one-by-one. It is meant to be used through iterating.
    """
 
    def __init__(self, rules, case_sensitive=True, omit_whitespace=True):
        """ Set up the lexical scanner. Build and compile the regular expression
            and prepare the whitespace searcher.
        """
        self._callbacks = {}
        self.omit_whitespace = omit_whitespace
        self.case_sensitive = case_sensitive
        parts = []
        for name, rule in rules:
            if not isinstance(rule, str):
                rule, callback = rule
                self._callbacks[name] = callback
            parts.append("(?P<%s>%s)" % (name, rule))
        if self.case_sensitive:
            flags = re.M
        else:
            flags = re.M|re.I
        self.regexc = re.compile("|".join(parts), flags)
        self.ws_regexc = re.compile("\s*", re.MULTILINE)
 
    def scan(self, input):
        """ Return a scanner built for matching through the `input` field. 
            The scanner that it returns is built well for iterating.
        """
        return _InputScanner(self, input)

This version does on-the-fly scanning through the use of building the class as an iterator. So, you can work with a token the moment it gets scanned, and before any other tokens get scanned. This can help reduce overhead in case you have a large document and may need to exit prematurely. And, of course, when you write your own lexer, it is much easier to modify it to your needs. Now let's test the above code and see what sort of token stream we arrive with.

def stmnt_callback(scanner, token):
    """ This is just an example of providing a function to run the
        token through.
    """
    return ""
 
rules = [
    ("IDENTIFIER", r"[a-zA-Z_]\w*"),
    ("OPERATOR",   r"\+|\-|\\|\*|\="),
    ("DIGIT",      r"[0-9]+(\.[0-9]+)?"),
    ("END_STMNT",  (";", stmnt_callback)), 
    ]
 
lex = Lexer(rules, case_sensitive=True)
for token in lex.scan("foo = 5 * 30; bar = bar - 60;"):
    print token

Outputs:

('IDENTIFIER', 'foo')
('OPERATOR', '=')
('DIGIT', '5')
('OPERATOR', '*')
('DIGIT', '30')
('END_STMNT', '')
('IDENTIFIER', 'bar')
('OPERATOR', '=')
('IDENTIFIER', 'bar')
('OPERATOR', '-')
('DIGIT', '60')
('END_STMNT', '')

Pretty easy to understand, right? A great thing about the `Lexer` is that it is easy to subclass. For instance, in a project that I'm doing for a complex template parser, I added in the ability to only do scanning inside specific tags while treating non-tag data as their own type of token. Maybe I'll cover that in more detail in a future post.

Update: The custom lexer has been updated to accept a list of tuples as the rules instead of the dict. This is so one can implement an order on the rules.

Random password generator in Python and Tkinter

This is always a fun project. The task? To create a random password of random length. The reason for a password generator obvious: you suck at choosing a password. Let's start with how to create the actual generator, and then we'll focus on the presentation.

from random import *
import string
 
# The characters to make up the random password
chars = string.ascii_letters + string.digits
 
def random_password():
    """ Create a password of random length between 8 and 16
        characters long, made up of numbers and letters.
    """
    return "".join(choice(chars) for x in range(randint(8, 16)))

The above function is pretty easy to follow. What it does is build a generator object that creates a random list of characters between 8 and 16 in length. It then just compresses the list into a string. You can do something as simple as print random_password() and it'll display a password in the terminal. This, of course, isn't really the best way of acheiving it. After all, you don't want to have to navigate the terminal each and every time. So, let's add in a graphical user interface using the Tkinter window builder:

from Tkinter import *
from random import *
import string
 
# The characters to make up the random password
chars = string.ascii_letters + string.digits
 
def random_password():
    """ Create a password of random length between 8 and 16
        characters long, made up of numbers and letters.
    """
    return "".join(choice(chars) for x in range(randint(8, 16)))
 
#
# BEGIN GUI CODE
#
 
root = Tk()
root.title("Password Generator")
root.resizable(0,0)
root.minsize(300,0)
 
frame = Frame(root)
frame.pack(pady=10, padx=5)
 
content = StringVar()
updater = lambda:content.set(random_password())
 
gen_btn = Button(frame, text="Generate", command=updater)
gen_btn.config(font=("sans-serif", 14),  bg="#92CC92")
gen_btn.pack(side=LEFT, padx=5)
 
field = Entry(frame, textvariable=content)
field.config(fg='blue', font=('courier',  16, "bold"), justify='center')
field.pack(fill=BOTH, side=RIGHT, padx=5)
 
root.mainloop()

The above should be pretty simple to follow. As you can see, pressing the gen_btn activates the updater lambda function which populates the entry field. Here is a sample output:

password generator

Simple, clean, and easy. This is just a quick project I made for myself and decided to share it.

UNIX Time Clock, in honor of "1234567890 Day"

On this Friday the 13th, we are to witness a very unique moment in history. That is, it is the day when UNIX time will roll over to 1234567890. Here's a simple Python/Tk app that'll show the timestamp.

import time
from Tkinter import *
 
class TimestampClock(Frame):
    def __init__(self, root):
        Frame.__init__(self, root)
        self.pack()
        self.time = Label(self, text=int(time.time()))
        self.time.config(fg='red', font=('Monospace', 20, 'bold'))
        self.time.pack(padx=10, pady=10)
        self.update()
 
    def update(self):
        self.time.config(text=int(time.time()))
        self.after(1000, self.update)
 
if __name__ == "__main__":
    root = Tk()
    root.title("Timestamp-Clock")
    root.wm_attributes("-topmost", 1)
    root.resizable(0,0)
    root.minsize(300,50)
    TimestampClock(root).mainloop()

I hope you did something special for the occasion. Me? I was sitting in a computer lab at Chemeketa waiting for the clock to hit that special number. Here is my screen grab of this great event:

1234567890