Re: [code] [scintillua] More lexer improvements from the vis editor community

From: Mitchell <>
Date: Sat, 25 Feb 2017 10:00:19 -0500 (EST)

Hi Marc,

On Sat, 25 Feb 2017, Marc André Tanner wrote:

> On Wed, Feb 22, 2017 at 10:18:46AM -0500, Mitchell wrote:
>> On Wed, 22 Feb 2017, Marc André Tanner wrote:
> [snip]
>>> I noticed that you removed some lexers. Any particular reason for that?
>>> Did they have specific problems I should be aware of? I know that at
>>> least one of my users cares about APL, so I will most likely add that
>>> back in my repo.
>> No reason other than I figured it's not worth the effort in refactoring them
>> if they're not likely to be used. I'm surprised to hear of a user of APL.
> Ok, fair enough. I assume he would be willing to update the more obscure
> lexers himself (he contributed the APL, Faust, Man, Protobuf, Pure and Spin
> lexers).

Now that you mention it, deleting APL was a mistake. For some reason, I
thought I authored it... Like I said in a previous thread, it's been a
long week :(

> [snip]
>>> Also is there a place where I can read upon the motivation / goals of
>>> the refactoring? I'm not sure I agree with some of the changes (e.g.
>>> word_match taking a string rather than a table?). But then you have
>>> much more experience in Lua, I'm sure there is a good reason for it.
>> No, it's all in my head :) Feel free to ask questions. I'm not even 100%
>> sure this was/is a good idea. It's an experiment right now that I might end
>> up throwing away.
>> First I'd like to point out that one of my goals is to keep compatibility
>> with legacy lexers.
> I'm not sure I like that in the long term. If we manage to come up
> with a clear improvement we should spend a one time effort to convert
> all existing lexers to the new mechanism. Then deprecate the old one
> and eventually remove it. Unless the maintenance effort for both schemes
> is negligible. Though, I understand you probably provide backward
> compatibility guarantees for textadept?

Sure, in the long term that is reasonable. For the short term I'd want to
maintain backwards compatibility for people that are using old, custom
lexers they have written that are not in the repository.

> [snip]
>> Since compatibility is important, you can keep the table form of
>> `word_match()` if you want. I personally don't like the idea of creating
>> giant tables of keywords and having them stick around in memory. That's why
>> I've moved to a single string.
> I understand (and support) the motivation. However, does it actually
> achieve that?
> I meant to look at the actual implementation (can't seem to find your branch
> right now?) and what LPeg does under the hood, but didn't yet have the time
> to do so, meaning the following high-level argumentation might be wrong.
> You start out with one large string which has better memory characteristics
> than many little ones (mostly due to the associated meta data). However,
> then you split it, causing the creation of many tiny strings anyway. At this
> point you are consuming more memory (the same splitted strings plus the
> additional long one) than in the old scheme. Now it depends whether you
> keep references to them, thus preventing the GC from collecting them.
> At first glance this seems to be the case because `word_match()` captures
> the local variable `word_list` which uses the strings as table indices.
> I haven't analyzed how Lua's short string optimization (keywords will
> typically be shorter than 40 bytes) and string interning play into that.

I have not committed a branch yet, but I've attached my working draft for
your reference until I do (if I still continue with this endeavor). It is
a drop-in replacement if you want to play around with it.

The key is that the giant string passed to `word_match()` is an argument,
and not local. Thus, after the lexer finishes loading, the GC throws it
away, and only keeps the little strings in memory. You are correct that
Lua interns small strings, so the original implementation of giant tables
only has the table itself taking up memory. I suppose the gain is not as
large as I thought.
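
As a rough illustration of the two call styles, here is a minimal sketch in plain Lua. It is not the attached draft or Scintillua's actual `word_match()`; the `make_word_set` helper is hypothetical and leaves LPeg out entirely. It only shows that the returned closure holds the set of small strings, not the original argument.

```lua
-- Hypothetical helper for illustration only; not Scintillua's word_match().
local function make_word_set(words)
  local set = {}
  if type(words) == 'string' then
    -- Split on whitespace; each word becomes a small string key.
    for word in words:gmatch('%S+') do set[word] = true end
  else
    -- Legacy form: a table (array) of words.
    for _, word in ipairs(words) do set[word] = true end
  end
  -- The closure captures only `set`. The `words` argument is not referenced
  -- once this function returns.
  return function(word) return set[word] == true end
end

local is_keyword_from_string = make_word_set('if then else end')
local is_keyword_from_table = make_word_set{'if', 'then', 'else', 'end'}
```

Either way the per-word strings stay alive as keys of `set`, which is why the saving comes down to whether the original container (giant string or giant table) is still referenced somewhere else.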

>> Some other motivations are I don't like the idea of "magic fields" (e.g.
>> `_rules`, `_tokenstyles`). I think the object-style of `:add_rule()` and
>> `:add_style()` is better practice.
> I can see the benefits in that. Although the `_rules` approach had the
> nice effect that all rule references were grouped together making it easier
> to spot mistakes in rule ordering.

Yes, you are correct.

Anyway, I ran into some issues converting old lexers to the new format, so
I'm currently rethinking my approach (and whether or not it is still worth
doing). I also like the idea of retaining comments in tables of words
passed to `word_match()` for documentation purposes. Giant strings
cannot carry comments in the same way.
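
For example, a table of words can be annotated like this. This is a minimal sketch, and the groupings and the `keywords` name are only for illustration:

```lua
-- Illustrative word table; the comments document each group of words.
local keywords = {
  -- Control flow.
  'if', 'then', 'else', 'elseif', 'end',
  -- Loops.
  'for', 'while', 'repeat', 'until',
}
```

A single whitespace-separated string literal has no place for such annotations, since anything written inside the quotes becomes part of the string.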

Thanks very much for your feedback. It's quite helpful. I think I got a
bit overzealous and should take a more methodical approach moving forward.


Received on Sat 25 Feb 2017 - 10:00:19 EST
