Re: Encodings

From: vais <vsalik....at.gmail.com>
Date: Sat, 28 Feb 2009 11:56:00 -0800 (PST)

Alex,

> I do not see the 80 bit in your proposal, Vais. :)

Like I said - it's a start. Also, Mitchell now has a strategy for what
to do once the encoding has been detected:

> On GTK, Scintilla's code_page can be set to ASCII, DBCS (Japanese), or
> UTF8. If say we detect UTF-16 (LE), there's no way to tell Scintilla
> that. The only thing I can see to do is convert it to UTF8 for display
> in Scintilla, and UTF-16 (LE) when writing back to file.

So, when a more comprehensive way of detecting encodings presents
itself, TA will be ready for it.
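
Just to make Mitchell's round-trip concrete, the UTF-16 (LE) to UTF-8
conversion could be done in plain Lua along these lines. This is only a
sketch (the name utf16le_to_utf8 is illustrative, not TA's API), and it
assumes the BOM has already been stripped off:

-- Sketch: decode UTF-16 (LE) bytes into a UTF-8 string.
local function utf16le_to_utf8(s)
  local out, i = {}, 1
  while i + 1 <= #s do
    local lo, hi = s:byte(i), s:byte(i + 1)
    local cp = hi * 256 + lo
    i = i + 2
    -- combine a surrogate pair into a single code point
    if cp >= 0xD800 and cp <= 0xDBFF and i + 1 <= #s then
      local lo2, hi2 = s:byte(i), s:byte(i + 1)
      local low = hi2 * 256 + lo2
      if low >= 0xDC00 and low <= 0xDFFF then
        cp = 0x10000 + (cp - 0xD800) * 0x400 + (low - 0xDC00)
        i = i + 2
      end
    end
    -- encode the code point as UTF-8
    if cp < 0x80 then
      out[#out + 1] = string.char(cp)
    elseif cp < 0x800 then
      out[#out + 1] = string.char(0xC0 + math.floor(cp / 0x40),
                                  0x80 + cp % 0x40)
    elseif cp < 0x10000 then
      out[#out + 1] = string.char(0xE0 + math.floor(cp / 0x1000),
                                  0x80 + math.floor(cp / 0x40) % 0x40,
                                  0x80 + cp % 0x40)
    else
      out[#out + 1] = string.char(0xF0 + math.floor(cp / 0x40000),
                                  0x80 + math.floor(cp / 0x1000) % 0x40,
                                  0x80 + math.floor(cp / 0x40) % 0x40,
                                  0x80 + cp % 0x40)
    end
  end
  return table.concat(out)
end

Writing back out would just be the reverse mapping, plus re-emitting
the original BOM.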

For now, at least the various Unicode encodings will be detected,
loaded into Scintilla as UTF-8 for viewing and editing, then saved
back into the file with the original encoding.
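
For the detection half, the BOM check itself is tiny - something along
these lines (again just a sketch; the detect_bom name and the encoding
labels are only for illustration):

-- Sketch: identify a Unicode encoding from a file's BOM, if any.
-- Returns an encoding label and the BOM length, or nil if none found.
local function detect_bom(bytes)
  if bytes:sub(1, 3) == '\239\187\191' then      -- EF BB BF
    return 'UTF-8', 3
  elseif bytes:sub(1, 4) == '\255\254\0\0' then  -- FF FE 00 00
    return 'UTF-32LE', 4
  elseif bytes:sub(1, 4) == '\0\0\254\255' then  -- 00 00 FE FF
    return 'UTF-32BE', 4
  elseif bytes:sub(1, 2) == '\255\254' then      -- FF FE
    return 'UTF-16LE', 2
  elseif bytes:sub(1, 2) == '\254\255' then      -- FE FF
    return 'UTF-16BE', 2
  end
  return nil, 0
end

open_helper could then strip the detected BOM, convert the remainder
to UTF-8 for Scintilla, and remember the label so the save path can
convert back and re-emit the BOM.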

SciTE and TextMate both have a File sub-menu for viewing and
controlling the encoding of the file (both for opening and saving).

Here is what SciTE has:

File -> Encoding -> [Code Page Property, UCS-2 Big Endian, UCS-2
Little Endian, UTF-8 with BOM, UTF-8]

Here is what TextMate has:

File -> Re-Open With Encoding -> [MacRoman, ISO-8859-1 (Latin 1),
ISO-8859-1 (Windows), UTF8, UTF16 (Big Endian), UTF16 (Little Endian)]

The idea is that as a result of adding such a menu you can SEE what
Textadept thinks the file encoding is, you can TELL Textadept what
encoding the file is in (if it cannot guess it or guesses it wrong),
and you can TELL Textadept what encoding you WANT the file to be SAVED
with. These three functions constitute the minimum support for
multiple encodings that TA should have. Detecting other encodings
automatically is just icing on the cake.
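
To sketch what that means in code terms (none of these names are TA's;
they just mirror the three functions above):

-- Sketch of the bookkeeping such a menu needs.
local encodings = {}  -- filename -> { read = ..., write = ... }

-- SEE: report what the editor thinks the file's encoding is.
local function get_encoding(filename)
  local e = encodings[filename]
  return e and e.read or 'unknown'
end

-- TELL (open): force an encoding and re-read the file with it.
local function reopen_with_encoding(filename, encoding)
  encodings[filename] = { read = encoding, write = encoding }
  -- a real implementation would reload the buffer here, converting
  -- from `encoding` to UTF-8 for display in Scintilla
end

-- TELL (save): choose a different encoding for saving.
local function save_with_encoding(filename, encoding)
  local e = encodings[filename] or {}
  e.write = encoding
  encodings[filename] = e
  -- the save path would convert the UTF-8 buffer text back to e.write
end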

These are my last two cents on the subject :)

Vais

On Feb 28, 8:59 am, Alex <alex.bep....at.gmail.com> wrote:
> > I think that at least reading the BOM is a good start and gets us
> > parity with Scite. It is the good ole 80/20 rule, right - reading the
> > BOM gets results with minimal effort.
>
> I do not see the 80 bit in your proposal, Vais. :) Yes, it offers a
> simple solution to detecting BOMs. But as I said, in my experience
> they are rare. Personally, I see the 80 bit in distinguishing between
> Unicode and the platform encoding.
>
> - Alex
>
> On Feb 27, 6:28 am, vais <vsalik....at.gmail.com> wrote:
>
> > Mitchell, this is just a proof of concept. I imagine you would call it
> > from the open_helper function, right after you check to see whether
> > the text is binary (contains null char) and set the buffer's encoding
> > accordingly.
>
> > I think that at least reading the BOM is a good start and gets us
> > parity with Scite. It is the good ole 80/20 rule, right - reading the
> > BOM gets results with minimal effort. Heuristic analysis of text to
> > determine encoding is a whole different beast that I personally have
> > no interest in - I never had any problems with Scite's encoding
> > detection, and if TA does the same, I am happy.
>
> > Vais
>
> > On Feb 26, 8:52 pm, mitchell <mforal.n....at.gmail.com> wrote:
>
> > > Vais,
>
> > > > Without further ado, here is a very lame Lua implementation of Unicode
> > > > encoding detection for textadept that actually gets the job done (I
> > > > put it into file_io.lua and call it from open_helper, right after the
> > > > null byte detection routine used to detect binary files):
> > > > <snip>
>
> > > Forgive me, but how do you use this function? Or is it just a proof of
> > > concept in detecting encoding?
>
Received on Sat 28 Feb 2009 - 14:56:00 EST
