Re: [code] [scintillua] More lexer improvements from the vis editor community

From: Marc André Tanner <mat.att.brain-dump.org>
Date: Sat, 25 Feb 2017 09:18:53 +0100

On Wed, Feb 22, 2017 at 10:18:46AM -0500, Mitchell wrote:
> On Wed, 22 Feb 2017, Marc André Tanner wrote:
> >I noticed you started the lexer refactorings without first integrating
> >the C and Scheme lexer changes, do you still plan to look at them or
> >have you concluded that they are unsuitable for inclusion?
>
> Thanks for the follow up, and I'm sorry for the lack of communication. It
> certainly may appear that I have silently rebuffed your e-mail. That is not
> true and I am sorry. Your lexers are still on my to-do list, but after I
> finish my current set of work. I simply have not looked at them yet.

No problem at all. I'm busy too (which among other things is why this
mail got delayed).

> >I noticed that you removed some lexers. Any particular reason for that?
> >Did they have specific problems I should be aware of? I know that at
> >least one of my users cares about APL, so I will most likely add that
> >back in my repo.
>
> No reason other than I figured it's not worth the effort in refactoring them
> if they're not likely to be used. I'm surprised to hear of a user of APL.

Ok, fair enough. I assume he would be willing to update the more obscure
lexers himself (he contributed the APL, Faust, Man, Protobuf, Pure and Spin
lexers).

> Regardless, you can re-add the legacy lexer to your repo. I've designed the
> refactoring to handle legacy lexers without issue.

I will probably do that for now.

> >Also is there a place where I can read up on the motivation / goals of
> >the refactoring? I'm not sure I agree with some of the changes (e.g.
> >word_match taking a string rather than a table?). But then you have
> >much more experience in Lua, I'm sure there is a good reason for it.
>
> No, it's all in my head :) Feel free to ask questions. I'm not even 100%
> sure this was/is a good idea. It's an experiment right now that I might end
> up throwing away.
>
> First I'd like to point out that one of my goals is to keep compatibility
> with legacy lexers.

I'm not sure I like that in the long term. If we manage to come up
with a clear improvement, we should spend a one-time effort to convert
all existing lexers to the new mechanism, then deprecate the old one
and eventually remove it, unless the maintenance effort for both schemes
is negligible. Then again, I understand you probably provide backward
compatibility guarantees for Textadept?

> Right now my repository is a mix of legacy lexers and
> refactored lexers and my unit tests still pass, so that's a good sign.

Yes, introducing unit tests is certainly desirable independent of the
actual implementation.

> Since compatibility is important, you can keep the table form of
> `word_match()` if you want. I personally don't like the idea of creating
> giant tables of keywords and having them stick around in memory. That's why
> I've moved to a single string.

I understand (and support) the motivation. However, does it actually
achieve that?
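For context, the two call forms under discussion can be sketched like this
(a toy `word_match`, not the actual Scintillua implementation, which
additionally builds an LPeg pattern from the words):

```lua
-- Toy sketch of the two call forms discussed: the legacy form takes a table
-- of keywords, the refactored form a single space-separated string that is
-- split internally.
local function word_match(words)
  local set = {}
  if type(words) == 'table' then
    for _, w in ipairs(words) do set[w] = true end    -- legacy table form
  else
    for w in words:gmatch('%S+') do set[w] = true end -- new string form
  end
  return set
end

assert(word_match({'and', 'break', 'do'})['do'])  -- legacy form
assert(word_match('and break do')['break'])       -- string form
```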

I meant to look at the actual implementation (can't seem to find your branch
right now?) and at what LPeg does under the hood, but I haven't yet had the
time, so the following high-level argument might be wrong.

You start out with one large string, which has better memory characteristics
than many little ones (mostly due to the per-string metadata). However,
you then split it, creating many tiny strings anyway. At that point you are
consuming more memory (the same split strings plus the additional long one)
than in the old scheme. What happens next depends on whether you keep
references to them, thus preventing the GC from collecting them.
At first glance this seems to be the case, because `word_match()` captures
the local variable `word_list`, which uses the strings as table indices.
I haven't analyzed how Lua's short-string optimization (keywords will
typically be shorter than 40 bytes) and string interning play into this.

> Some other motivations are I don't like the idea of "magic fields" (e.g.
> `_rules`, `_tokenstyles`). I think the object-style of `:add_rule()` and
> `:add_style()` is better practice.

I can see the benefits in that, although the `_rules` approach had the
nice effect that all rule references were grouped together, making it
easier to spot mistakes in rule ordering.
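For comparison, the two declaration styles look roughly like this (the
identifiers and the toy `Lexer` are illustrative, not the actual Scintillua
API; the strings stand in for real LPeg patterns):

```lua
-- Magic-field style: all rules live in one table, so their order is visible
-- at a glance.
local legacy = {
  _rules = {
    {'whitespace', '%s+'},
    {'keyword', 'and|break|do'},
  },
}

-- Object style: rules are registered one by one through a method.
local Lexer = {}
Lexer.__index = Lexer
function Lexer.new(name)
  return setmetatable({name = name, _rules = {}}, Lexer)
end
function Lexer:add_rule(id, rule)
  self._rules[#self._rules + 1] = {id, rule}
end

local lex = Lexer.new('lua')
lex:add_rule('whitespace', '%s+')
lex:add_rule('keyword', 'and|break|do')
assert(lex._rules[1][1] == 'whitespace')
```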

Marc

-- 
You are subscribed to code.att.foicica.com.
To change subscription settings, send an e-mail to code+help.att.foicica.com.
To unsubscribe, send an e-mail to code+unsubscribe.att.foicica.com.
Received on Sat 25 Feb 2017 - 03:18:53 EST

This archive was generated by hypermail 2.2.0 : Sat 25 Feb 2017 - 06:54:20 EST