Re: [code] [scintillua] More lexer improvements from the vis editor community

From: Mitchell <m.att.foicica.com>
Date: Sun, 19 Nov 2017 13:56:35 -0500 (EST)

Hi Marc,

On Sat, 25 Feb 2017, Mitchell wrote:

> Hi Marc,
>
> On Sat, 25 Feb 2017, Marc André Tanner wrote:
>
>> On Wed, Feb 22, 2017 at 10:18:46AM -0500, Mitchell wrote:
>>> On Wed, 22 Feb 2017, Marc André Tanner wrote:
>>
>> [snip]
>>
>>>> I noticed that you removed some lexers. Any particular reason for that?
>>>> Did they have specific problems I should be aware of? I know that at
>>>> least one of my users cares about APL, so I will most likely add that
>>>> back in my repo.
>>>
>>> No reason other than I figured it's not worth the effort of
>>> refactoring them if they're not likely to be used. I'm surprised to
>>> hear of a user of APL.
>>
>> Ok, fair enough. I assume he would be willing to update the more obscure
>> lexers himself (he contributed the APL, Faust, Man, Protobuf, Pure and Spin
>> lexers).
>
> Now that you mention it, deleting APL was a mistake. For some reason, I
> thought I authored it... Like I said in a previous thread, it's been a long
> week :(
>
>> [snip]
>>
>>>> Also, is there a place where I can read up on the motivation and
>>>> goals of the refactoring? I'm not sure I agree with some of the
>>>> changes (e.g. word_match taking a string rather than a table?). But
>>>> then you have much more experience in Lua, so I'm sure there is a
>>>> good reason for it.
>>>
>>> No, it's all in my head :) Feel free to ask questions. I'm not even
>>> 100% sure this was/is a good idea. It's an experiment right now that
>>> I might end up throwing away.
>>>
>>> First I'd like to point out that one of my goals is to keep compatibility
>>> with legacy lexers.
>>
>> I'm not sure I like that in the long term. If we manage to come up
>> with a clear improvement, we should spend a one-time effort to convert
>> all existing lexers to the new mechanism, then deprecate the old one
>> and eventually remove it, unless the maintenance effort for both
>> schemes is negligible. Though I understand you probably provide
>> backward compatibility guarantees for textadept?
>
> Sure, in the long term that is reasonable. In the short term, I'd want
> to maintain backwards compatibility for people who are using old,
> custom lexers of their own that are not in the repository.
>
>> [snip]
>>
>>> Since compatibility is important, you can keep the table form of
>>> `word_match()` if you want. I personally don't like the idea of
>>> creating giant tables of keywords and having them stick around in
>>> memory. That's why I've moved to a single string.
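
To make the contrast concrete, here is roughly what the two forms look
like. This is a sketch; the exact signatures in my draft may differ:

    -- Table form: a Lua table of keywords that sticks around in memory.
    local keyword = token(lexer.KEYWORD, word_match{
      'and', 'break', 'do', 'else', 'elseif', 'end'
    })

    -- String form: a single space-separated string that gets split
    -- internally into its component words.
    local keyword = token(lexer.KEYWORD, word_match[[
      and break do else elseif end
    ]])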
>>
>> I understand (and support) the motivation. However, does it actually
>> achieve that?
>>
>> I meant to look at the actual implementation (I can't seem to find
>> your branch right now?) and at what LPeg does under the hood, but I
>> haven't yet had the time to do so, meaning the following high-level
>> argument might be wrong.
>>
>> You start out with one large string, which has better memory
>> characteristics than many little ones (mostly due to the associated
>> metadata). However, you then split it, causing the creation of many
>> tiny strings anyway. At this point you are consuming more memory (the
>> split strings plus the original long one) than in the old scheme. Now
>> it depends on whether you keep references to them, thus preventing the
>> GC from collecting them. At first glance this seems to be the case,
>> because `word_match()` captures the local variable `word_list`, which
>> uses the strings as table indices. I haven't analyzed how Lua's short
>> string optimization (keywords will typically be shorter than 40 bytes)
>> and string interning play into this.
>
> I have not committed a branch yet, but I've attached my working draft
> for your reference until I do (if I still continue with this
> endeavor). It is a drop-in replacement if you want to play around with
> it.
>
> The key is that the giant string passed to `word_match()` is an
> argument, not a local. Thus, after the lexer finishes loading, the GC
> throws it away and keeps only the little strings in memory. You are
> correct that Lua interns short strings, so in the original
> implementation of giant tables, only the table itself takes up extra
> memory. I suppose the gain is not as large as I thought.
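
For the archives, the relevant part of the draft is essentially the
following. This is paraphrased, so treat it as a sketch of the idea
rather than the committed code:

    local lpeg = require('lpeg')

    -- 'words' is a function argument, so once word_match() returns,
    -- nothing references the big string and the GC can reclaim it.
    -- Only the short (interned) strings used as table keys survive.
    local function word_match(words)
      local word_list = {}
      for word in words:gmatch('%S+') do word_list[word] = true end
      local word_char = lpeg.R('az', 'AZ', '09') + lpeg.S('_')
      return lpeg.Cmt(lpeg.C(word_char^1), function(input, index, word)
        return word_list[word] and index or nil
      end)
    end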
>
>>> Another motivation is that I don't like the idea of "magic fields"
>>> (e.g. `_rules`, `_tokenstyles`). I think the object style of
>>> `:add_rule()` and `:add_style()` is better practice.
>>
>> I can see the benefits in that, although the `_rules` approach had the
>> nice effect that all rule references were grouped together, making it
>> easier to spot mistakes in rule ordering.
>
> Yes, you are correct.
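
For reference, the difference in question looks like this (the rule
names are just illustrative):

    -- Legacy "magic field" style: the ordering is visible in one place.
    M._rules = {
      {'whitespace', ws},
      {'keyword', keyword},
      {'identifier', identifier},
    }

    -- New object style: rules are appended in call order.
    lex:add_rule('whitespace', ws)
    lex:add_rule('keyword', keyword)
    lex:add_rule('identifier', identifier)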
>
> Anyway, I ran into some issues converting old lexers to the new
> format, so I'm currently rethinking my approach (and whether or not it
> is still worth doing). I also like the idea of retaining comments in
> tables of words passed to `word_match()` for documentation purposes.
> Giant strings cannot carry comments in the same way.
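
That is, a table constructor can carry inline comments, which a long
string has no way to express. A contrived example:

    local keywords = word_match{
      'and', 'or', 'not',   -- boolean operators
      'if', 'then', 'else', -- control flow
      'true', 'false',      -- literals
    }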
>
> Thanks very much for your feedback. It's quite helpful. I think I got a bit
> overzealous and should take a more methodical approach moving forward.

I wanted to follow up and say that I have gone ahead and converted all
but a half-dozen lexers to the new object-oriented lexer format, along
with documentation, and committed the result to hg. From here we can
start analyzing memory consumption, etc. for any further improvements.
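
For anyone converting their own lexers, a converted lexer boils down to
something like this minimal example (a sketch following the new
conventions; consult the committed lexers and documentation for the
authoritative form):

    local lexer = require('lexer')
    local token, word_match = lexer.token, lexer.word_match

    local lex = lexer.new('example')
    lex:add_rule('whitespace', token(lexer.WHITESPACE, lexer.space^1))
    lex:add_rule('keyword',
      token(lexer.KEYWORD, word_match[[if then else elseif end]]))
    lex:add_rule('identifier', token(lexer.IDENTIFIER, lexer.word))
    return lex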

Cheers,
Mitchell
