Next generation of lexers

From: mitchell f <>
Date: Fri, 16 Apr 2010 18:41:13 -0400


Writing a lexer sucks.

* The "tutorial" (if you can call it that) is awful, consisting of a
bunch of scatterbrained thoughts without an end goal.
* The syntax is clunky. Seriously, why is there a 'word_list' function
when all it does is get passed to 'word_match'? LoadTokens? Why do I
have to 'add_token' for every pattern? It's annoying! LoadStyles?
Somewhat more sane, but why does it have to be in a separate function?
* Embedding languages is ridiculously complicated. First there is this
function called 'make_embeddable', but it's not always used in
conjunction with 'embed_language'. WTH? I don't want to have to call
'LoadTokens' and 'LoadStyles' either. It's too much work and what if I
do it too early? Whoops, I crashed ta. Then there are 'rebuild_token' and
'rebuild_tokens'. Um, what are these for? Why do I have to call them?
Come on...
* Counter-intuitive terms. The word 'token' is misused ALL the
time. A token is like an atom; it is the smallest part of a language.
Most of the time, if you replace 'token' with 'rule', everything in
the documentation and code makes much more sense.
* Rampant confusion. The lexer module makes all of its patterns and
styles available in the global lexer namespace. This means when you
see an identifier like 'word', you don't know whether it was defined in
the lexer itself (e.g. 'local word = ...') or whether the default word
pattern is being referenced. If you're using an existing lexer as a
reference for your own, see an identifier that looks like a default
pattern, and later get a pattern error, you pull your hair out trying
to figure out which pattern is to blame before realizing the mistake.

It's time for a radical change. I have rewritten the lexer
infrastructure, documentation, and lexers themselves.

* The tutorial now coaches a user through writing a simple Lua lexer.
Later sections have real-life examples of embedding CSS, JavaScript,
and PHP in HTML. No more scattered thoughts!
* Revised syntax. No more LoadTokens and LoadStyles. In fact there are
no separate functions at all. 'word_match' takes a table as a
parameter. There are only a half-dozen succinct, intuitive functions
to create complex patterns.
* Embedded languages use only ONE function: embed_lexer. That's it.
Everything else is magically taken care of.
* Terminology is accurate. Tokens, rules, and grammars are now used as
expected in the documentation and code.
* The 'lexer' module is now available in the global namespace and all
lexers use the 'lexer.' prefix to access functions, patterns,
styles, etc. (e.g. local identifier = token('identifier', lexer.word))
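
A rough sketch of the namespace point, in plain Lua. Everything here is a
stand-in for illustration only: the 'lexer' table and 'token' function below
are hypothetical, and the patterns are plain Lua string patterns rather than
the LPeg patterns the real module uses. It only shows why the 'lexer.' prefix
removes the ambiguity between a default pattern and a lexer-local one.

```lua
-- Hypothetical stand-in for the real 'lexer' module. The real module holds
-- LPeg patterns; plain Lua string patterns are used so this runs anywhere.
local lexer = {
  word = '[%a_][%w_]*', -- assumed stand-in for the default "word" pattern
}

-- Hypothetical token(): just pairs a token name with a pattern.
local function token(name, patt)
  return { name = name, patt = patt }
end

-- A lexer-local pattern that happens to share the name 'word'.
local word = '%a+'

-- With the prefix, it is obvious which pattern each rule refers to.
local identifier = token('identifier', lexer.word) -- the default pattern
local keyword = token('keyword', word)             -- the local pattern

print(identifier.name, identifier.patt)
print(keyword.name, keyword.patt)
```

Without the prefix, both rules would read 'word' and you would have to scan
the file for a 'local word' to know which one you were getting.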

New documentation can be found here: The
attached tgz contains the new lexers/ directory along with the
ScintillaBase.cxx, ScintillaBase.h, and LexLPeg.cxx files.

Linux users, please test! Clone the latest hg from, copy the archived C++ files into
your textadept/src/scintillua/src/ folder, and replace the
textadept/lexers/ directory. Compile ta and you're good to go.

Everyone else, please leave feedback on the new documentation. I think
it does a good job teaching you how to write lexers, but maybe I
missed something. Also, feel free to look at the new lexers and
comment on anything you find odd. Maybe the syntax, variable names,
etc. could be termed better. I want lexers to be as easy to write as possible.

I'm considering putting this in the final 2.2 release.

Sorry for the long post,
