Sexy Lexing with Python

Lexical analysis: a daunting task, right? Wrong! In this post we’ll walk through two different methods of lexical scanning in Python. First, we’ll look at a pre-built solution hiding in the standard library, and then at a custom-built solution.

Using re.Scanner

In the re module there is a class called Scanner that can do lexical scanning. It is completely undocumented apart from a small example code block found on the Python-Dev mailing list, but it is well worth mentioning. It works by feeding it a list of regular expressions, each paired with a callback function. When it matches a token, it first runs the matched text through the appropriate callback and then appends the result to the token list being returned. If the scanner reaches a spot where no token match can be found, it returns whatever matches it did have along with the rest of the document that couldn’t be matched. Here is an example:

import re
 
def identifier(scanner, token): return "IDENT", token
def operator(scanner, token):   return "OPERATOR", token
def digit(scanner, token):      return "DIGIT", token
def end_stmnt(scanner, token):  return "END_STATEMENT"
 
scanner = re.Scanner([
    (r"[a-zA-Z_]\w*", identifier),
    (r"\+|\-|\\|\*|\=", operator),
    (r"[0-9]+(\.[0-9]+)?", digit),
    (r";", end_stmnt),
    (r"\s+", None),
    ])
 
tokens, remainder = scanner.scan("foo = 5 * 30; bar = bar - 60;")
for token in tokens:
    print token

Which provides the output:

('IDENT', 'foo')
('OPERATOR', '=')
('DIGIT', '5')
('OPERATOR', '*')
('DIGIT', '30')
END_STATEMENT
('IDENT', 'bar')
('OPERATOR', '=')
('IDENT', 'bar')
('OPERATOR', '-')
('DIGIT', '60')
END_STATEMENT

Truly easy, fast, and relatively simple to understand.
Using it is perfect for small projects, but it has some downsides: there is no simple error handling (an unrecognized character just ends the scan quietly), whitespace isn’t handled implicitly (you have to add a rule for it yourself, as above), and it has to tokenize the whole document before it can give you anything back, which can get costly on larger inputs.
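
For example, here is a quick sketch (not from the original post) of that error-handling limitation, reusing the scanner defined above. The @ character matches no rule, so the scan just stops there and hands back the rest of the string:

# No exception and no line number -- scanning silently stops at the first
# character that no rule can match, and everything from that point on
# comes back in `remainder`.
tokens, remainder = scanner.scan("foo = 5 @ 30;")
print tokens       # [('IDENT', 'foo'), ('OPERATOR', '='), ('DIGIT', '5')]
print remainder    # @ 30;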

Custom-Built Lexer

I decided to build a custom lexer as a way to break away from re.Scanner. Here is the code for the actual lexer. It is broken into three classes: UnknownTokenError, which gets raised when an unrecognized token is found; Lexer, which holds the settings for scanning; and _InputScanner, which is in charge of scanning a specific input, as the name implies. A few benefits built into the Lexer include automatic whitespace handling (if desired) and the ability to easily make the scan case-insensitive. Additionally, you can optionally provide a callback to run each token through before it is returned, by making the rule a tuple of the regular expression and the callback.

import re
 
 
class UnknownTokenError(Exception):
    """ This exception is for use to be thrown when an unknown token is
        encountered in the token stream. It hols the line number and the
        offending token.
    """
    def __init__(self, token, lineno):
        self.token = token
        self.lineno = lineno
 
    def __str__(self):
        return "Line #%s, Found token: %s" % (self.lineno, self.token)
 
 
class _InputScanner(object):
    """ This class manages the scanning of a specific input. An instance of it is
        returned when scan() is called. It is built to be great for iteration. This is
        mainly to be used by the Lexer and ideally not directly.
    """
 
    def __init__(self, lexer, input):
        """ Put the lexer into this instance so the callbacks can reference it 
            if needed.
        """
        self._position = 0
        self.lexer = lexer
        self.input = input
 
    def __iter__(self):
        """ All of the code for iteration is controlled by the class itself.
            This and next() (or __next__() in Python 3.0) are here so that
            syntax like `for token in lexer.scan(...):` is valid and works.
        """
        return self
 
    def next(self):
        """ Used for iteration. It returns token after token until there
            are no more tokens. (change this to __next__(self) if using Py3.0)
        """
        if not self.done_scanning():
            return self.scan_next()
        raise StopIteration
 
    def done_scanning(self):
        """ A simple boolean function that returns true if scanning is
            complete and false if it isn't.
        """
        return self._position >= len(self.input)
 
    def scan_next(self):
        """ Retreive the next token from the input. If the
            flag `omit_whitespace` is set to True, then it will
            skip over the whitespace characters present.
        """
        if self.done_scanning():
            return None
        if self.lexer.omit_whitespace:
            match = self.lexer.ws_regexc.match(self.input, self._position)
            if match:
                self._position = match.end()
        match = self.lexer.regexc.match(self.input, self._position)
        if match is None:
            lineno = self.input[:self._position].count("\n") + 1
            raise UnknownTokenError(self.input[self._position], lineno)
        self._position = match.end()
        value = match.group(match.lastgroup)
        if match.lastgroup in self.lexer._callbacks:
            value = self.lexer._callbacks[match.lastgroup](self, value)
        return match.lastgroup, value
 
 
class Lexer(object):
    """ A lexical scanner. It takes in an input and a set of rules based
        on regular expressions. It then scans the input and returns the
        tokens one-by-one. It is meant to be used through iterating.
    """
 
    def __init__(self, rules, case_sensitive=True, omit_whitespace=True):
        """ Set up the lexical scanner. Build and compile the regular expression
            and prepare the whitespace searcher.
        """
        self._callbacks = {}
        self.omit_whitespace = omit_whitespace
        self.case_sensitive = case_sensitive
        parts = []
        for name, rule in rules:
            if not isinstance(rule, str):
                rule, callback = rule
                self._callbacks[name] = callback
            parts.append("(?P<%s>%s)" % (name, rule))
        if self.case_sensitive:
            flags = re.M
        else:
            flags = re.M|re.I
        self.regexc = re.compile("|".join(parts), flags)
        self.ws_regexc = re.compile(r"\s*", re.MULTILINE)
 
    def scan(self, input):
        """ Return a scanner built for matching through the `input` field. 
            The scanner that it returns is built well for iterating.
        """
        return _InputScanner(self, input)

This version does on-the-fly scanning by building the scanner as an iterator. You can work with a token the moment it gets scanned, before any later tokens are scanned. This can help reduce overhead when you have a large document and may need to exit prematurely (there’s a small sketch of that right after the output below). And, of course, when you write your own lexer, it is much easier to modify it to your needs. Now let’s test the above code and see what sort of token stream we end up with.

def stmnt_callback(scanner, token):
    """ This is just an example of providing a function to run the
        token through.
    """
    return ""
 
rules = [
    ("IDENTIFIER", r"[a-zA-Z_]\w*"),
    ("OPERATOR",   r"\+|\-|\\|\*|\="),
    ("DIGIT",      r"[0-9]+(\.[0-9]+)?"),
    ("END_STMNT",  (";", stmnt_callback)), 
    ]
 
lex = Lexer(rules, case_sensitive=True)
for token in lex.scan("foo = 5 * 30; bar = bar - 60;"):
    print token

Outputs:

('IDENTIFIER', 'foo')
('OPERATOR', '=')
('DIGIT', '5')
('OPERATOR', '*')
('DIGIT', '30')
('END_STMNT', '')
('IDENTIFIER', 'bar')
('OPERATOR', '=')
('IDENTIFIER', 'bar')
('OPERATOR', '-')
('DIGIT', '60')
('END_STMNT', '')
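
And because the scanner is lazy, nothing beyond the point where you stop ever gets tokenized. Here is a minimal sketch of the early-exit idea mentioned above, reusing the `lex` object from the test:

# Bail out at the first END_STMNT; the rest of the input is never scanned
# at all, which is the whole point of the iterator approach.
for name, value in lex.scan("foo = 5 * 30; bar = bar - 60;"):
    if name == "END_STMNT":
        break
    print name, value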

Pretty easy to understand, right? A great thing about the `Lexer` is that it is easy to subclass. For instance, in a project I’m doing for a complex template parser, I added the ability to scan only inside specific tags while treating non-tag data as its own type of token. Maybe I’ll cover that in more detail in a future post.
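
As a taste of what subclassing can look like, here is a small hypothetical sketch (not the template-parser code mentioned above) that overrides scan() to attach line numbers to each token. The LineTrackingLexer name and its LINE_BREAK/SKIP rules are made up for illustration:

class LineTrackingLexer(Lexer):
    """ Hypothetical subclass that yields (name, value, lineno) triples. """

    def __init__(self, rules, **kwargs):
        # Take over whitespace handling with explicit rules so the newlines
        # survive long enough to be counted.
        kwargs["omit_whitespace"] = False
        extra = [("LINE_BREAK", r"\n"), ("SKIP", r"[ \t]+")]
        Lexer.__init__(self, extra + rules, **kwargs)

    def scan(self, input):
        lineno = 1
        for name, value in Lexer.scan(self, input):
            if name == "LINE_BREAK":
                lineno += 1
            elif name != "SKIP":
                yield name, value, lineno

Feeding it the rules from the test above and an input like "foo = 5 * 30;\nbar = bar - 60;" would yield triples such as ('IDENTIFIER', 'bar', 2).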

Update: The custom lexer has been updated to accept a list of tuples as the rules instead of a dict. This is so one can impose an order on the rules.
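
As a rough sketch of why the order matters: Python’s regex alternation takes the first alternative that matches at a given position, not the longest one, so a more specific rule has to be listed before a more general rule that would also match. The END keyword rule and the input below are made up for illustration:

ordered_rules = [
    ("END",        r"\bend\b"),
    ("IDENTIFIER", r"[a-zA-Z_]\w*"),
    ("DIGIT",      r"[0-9]+"),
    ]

# The \b word boundaries keep "endgame" from being split into "end" + "game".
for token in Lexer(ordered_rules).scan("end endgame 42"):
    print token
# ('END', 'end')
# ('IDENTIFIER', 'endgame')
# ('DIGIT', '42')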

21 Comments

  1. sudo rm -r / » lexical analysis for processing derivatives in python wrote,

    [...] This page has introduced me to lexical analysis, a method of identifying different parts/groups of [...]

  2. Bob wrote,

    Thanks!

    I moved the whitespace skipping into the done_scanning() method to make it not bug out at end of input, but other than that it works great.

  3. Lenny wrote,

    Thanks for the info and the iterator implementation.

    A bug I found: _callbacks is defined in the class, not for each instance.

    >>> lex1 = Lexer({‘ALPHA’: (‘[A-Za-z]‘, lambda scanner, token: token)})
    >>> lex2 = Lexer({‘DIGIT’: (‘[0-9]‘, lambda scanner, token: token)})
    >>> lex1._callbacks
    {‘ALPHA’: <function <lambda> at 0xa59adf4>, ‘DIGIT’: <function <lambda> at 0xa500144>}

    The fix is to define self._callbacks in Lexer.__init__.

  4. RJ Ryan wrote,

    Very nice. One comment: You’re using the Python regular expression library, which uses backtracking to process regular expressions. Traditionally for performance reasons you will want to construct a DFA/NFA. There’s a good construction/proof of the algorithm in the Dragon book.

    Cheers,
    RJ

  5. david wrote,

    It would be cool to connect this with PyParsing

  6. Jeremy Mandelson wrote,

    Interesting article. I am just wondering why you didn’t use something like Antlr which has a python target if I recall.

  7. Mark wrote,

    Jeremy: why didn’t he use Lex and swig to create a python interface to some C code? Why didn’t he use X to do Y? Antlr is a Java application and introduces an entirely different dependency and tool set. This is a toy anyway so who cares? I’m wondering why you think you know what this person wants to do?

  8. Nick Leaton wrote,

    http://www.devincook.com/goldparser/engine/python/index.htm

    Is a parser generator with lexical analysis that has a python back end

    Nick

  9. gguai wrote,

    nice job!

    For a job that is simpler than a serious language, making your own lexer/scanner (or even a parser) is a better choice than using a tool.

  10. Alysse wrote,

    You’re amazing! Absolutely brilliant!

  11. Reid K wrote,

    The reason that re.Scanner takes a list of tuples with names instead of a dict is that the order you apply the lexer rules matters. Consider the simple case A -> 'a' and AA -> 'aa'. Applying A before AA to the string 'aa' results in [A, A] instead of [AA].

  12. Luke McCarthy wrote,

    I have tried the old "|".join trick myself. The problem with this approach is that Python’s re does first-match with "|", not longest-match. For example this doesn’t work:

    rules = {
        "B" : r"food",
        "A" : r"foo",
    }

    lex = Lexer(rules, case_sensitive=True)
    for token in lex.scan("foo food"):
        print token

    If you swap the names A and B, it does work (because that changes the order they are stored in the dictionary due to hashing order).

    The only easy workaround I can see is iterating through every regexp and finding the longest match.

  13. Evan wrote,

    @Luke,

    That’s because the Lexer class I wrote for the article does it in a poor fashion (right now), since it isn’t a list of tuples or an OrderedDict. Order is important. I’ll update the code later today.

  14. rajashekhar wrote,

    Is there any way to detect the lines in a scanned document using Python?

  15. Evan wrote,

    @rajashekhar,
    You could always create a regex to match line breaks, and feed it to the scanner.

    rules = [
        ("LINE_BREAK", r"\n"),
        ]
     
    line_number = 1
    # Note: whitespace skipping has to be turned off here, otherwise the
    # newlines get skipped before the LINE_BREAK rule ever sees them.
    lex = Lexer(rules, case_sensitive=True, omit_whitespace=False)
    for token in lex.scan("\n\n\n\n\n\n"):
        if token[0] == "LINE_BREAK":
            line_number += 1

  16. Mike wrote,

    Is there a way to have tokens span lines, for example if I want to capture all the text in a quote as a single token? eg the following should be one token:

    “this is a test quote, the
    number 12 should be part of
    this string and not interpreted
    as the number 12 or its own token
    and all this should be one token”

    Sorry, new to regular expressions, but could not come up with a rule to allow multi-line quotes (that the Lexer wouldn’t choke on when a quote was encountered)

    Thanks,
    Mike

  17. Mike wrote,

    Nevermind, just realized I could make the following rule:

    ("quotation", (r"[\"][^\"]*[\"]", quotation)),

    And just need to make sure the Lexer is called with the omit_whitespace flag set to False

  18. Mike wrote,

    Is there a way to get the lexer to match the longest match?

    Say, for example, that the following rule from above:

    ("END_STMNT", (";", stmnt_callback)),

    Was instead defined as:

    ("END_STMNT", ("end", stmnt_callback)),

    How would I get the lexer to correctly identify "end" as an END_STMNT token, but also allow "endgame" to be identified as an "IDENTIFIER"? Currently, it will return "end" as END_STMNT, and "game" as an IDENTIFIER.

    This seems to be a limitation in the match function. I am looking to get behavior similar to flex.

  19. Jonathan Hartley wrote,

    @Mike

    I’m no expert. How about if you make the regex for END_STMNT into something like '\bend\b' (\b matches a 'word boundary'), then it would match the "end" in "end\n" but not in "endgame"?

  20. Anh Le wrote,

    Thanks for sharing! It helps me a lot. I found a bug in your script. If I have a rule set defined as follows:

    rules = [
        ("LINE_BREAK", r"\n"),
        ("NONTERMINAL", r"[a-zA-Z_]+"),
    ]

    Your script will fail with an input of "data\r\n\r\n\r\n\r\n\r\n". The exception I’ve got is "IndexError: string index out of range". After examining the code, I’ve found the bug in function scan_next(). I think it should be fixed as:

    def scan_next(self):
        ....
        if match is None:
            # I added code for checking the range
            if self.done_scanning():
                raise StopIteration
            lineno = self.input[:self._position].count("\n") + 1

  21. Writing lex-like parsing stuff in Python | Headerguards' Blog wrote,

    [...] http://www.evanfosmark.com/2009/02/sexy-lexing-with-python/ [...]
