Re: Encodings

From: Alex <alex.bep....at.gmail.com>
Date: Thu, 26 Feb 2009 11:09:07 -0800 (PST)

Unfortunately, detecting encodings is not exactly SciTE's strength. In
my experience, BOMs are rare. So yes, I was thinking of a heuristic,
at least one that covers the major cases, like distinguishing between
UTF-8 and the default system encoding. I know Groovy supports this
via
   new CharsetToolkit(file).guessEncoding()
So this might be a place to look.
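
To make the idea concrete, here is a minimal Lua sketch of the kind of
heuristic I mean: if the bytes form well-formed UTF-8, guess UTF-8,
otherwise fall back to the system encoding. The names is_valid_utf8 and
guess_encoding are made up for illustration (they are not part of
textadept, SciTE, or Groovy), and I am not claiming this is what
CharsetToolkit does internally.

```lua
-- Hypothetical helper: returns true if `text` is well-formed UTF-8.
-- Plain Lua 5.1, no external libraries.
local function is_valid_utf8(text)
  local i, n = 1, #text
  while i <= n do
    local b = text:byte(i)
    local extra, cp, min
    if b < 0x80 then
      extra, cp, min = 0, b, 0                 -- ASCII
    elseif b >= 0xC2 and b <= 0xDF then
      extra, cp, min = 1, b - 0xC0, 0x80       -- 2-byte sequence
    elseif b >= 0xE0 and b <= 0xEF then
      extra, cp, min = 2, b - 0xE0, 0x800      -- 3-byte sequence
    elseif b >= 0xF0 and b <= 0xF4 then
      extra, cp, min = 3, b - 0xF0, 0x10000    -- 4-byte sequence
    else
      return false  -- stray continuation byte or invalid lead byte
    end
    if i + extra > n then return false end     -- truncated sequence
    for j = i + 1, i + extra do
      local c = text:byte(j)
      if c < 0x80 or c > 0xBF then return false end
      cp = cp * 64 + (c - 0x80)
    end
    -- Reject overlong forms, UTF-16 surrogates, and values past U+10FFFF.
    if cp < min or cp > 0x10FFFF or (cp >= 0xD800 and cp <= 0xDFFF) then
      return false
    end
    i = i + extra + 1
  end
  return true
end

-- Hypothetical wrapper: 'system default' is only a placeholder label for
-- whatever the platform's legacy encoding happens to be.
local function guess_encoding(text)
  if is_valid_utf8(text) then return 'UTF-8' end
  return 'system default'
end
```

For example, guess_encoding('caf\195\169') gives 'UTF-8', while the
Latin-1 bytes 'caf\233' give 'system default'. Pure ASCII text passes as
UTF-8 too, which is harmless since the two agree there. It is only a
guess: text in a legacy encoding can occasionally happen to be valid
UTF-8.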

- Alex

On Feb 26, 2:53 am, vais <vsalik....at.gmail.com> wrote:
> Here is an interesting read on the subject - a discussion on SciTE's
> implementation for the historically inclined:
>
> http://www.mail-archive.com/scite-inter...@lyra.org/msg02629.html
>
> Google ate this url last time I posted it, so if it does again, just
> search google for "[scite] Re: UTF-8 Encoding and BOM handling"
>
> Vais
>
> On Feb 25, 8:35 pm, vais <vsalik....at.gmail.com> wrote:
>
> > This is a surprisingly simple problem to solve... well, for Unicode
> > text anyway :) Other than that, this is actually a theoretically
> > impossible problem to solve based on text alone. Allow me to
> > elaborate...
>
> > In Unicode Explained, Jukka K. Korpela writes: "It is in principle
> > impossible to determine the encoding of text from the text alone, but
> > in practice, you can often guess right even using automated tools."
> > The way guessing works is by using heuristic analysis. Jukka provides
> > a link to a Java implementation of the heuristic code used in Mozilla,
> > but the link is unfortunately dead. For anyone interested, the text of
> > > Unicode Explained is here: http://books.google.com/books?id=PcWU2yxc8WkC&pg=PA512&lpg=PA512&dq=t...
>
> > So, heuristic analysis used in web browsers sounds far out and
> > probably overkill for textadept. What (and how) is Scite doing about
> > this? Well, let's take a look. Scite's documentation states: "SciTE
> > will automatically detect the encoding scheme used for Unicode files
> > that start with a Byte Order Mark (BOM). The UTF-8 and UCS-2 encodings
> > are recognized including both Little Endian and Big Endian variants of
> > UCS-2." If that is not a clue, I don't know what is :) and, if it is
> > good enough for Scite, it should be at a minimum good enough for TA.
> > So, I started digging around on the web for BOM. It did not take long
> > > to find this document: http://en.wikipedia.org/wiki/Byte_order_mark
>
> > There are many other resources as well, but wikipedia is always a good
> > baseline. The page provides a table of "magic" BOM codes that can be
> > used to detect which Unicode encoding is in play in a piece of text.
>
> > Without further ado, here is a very lame Lua implementation of Unicode
> > encoding detection for textadept that actually gets the job done (I
> > put it into file_io.lua and call it from open_helper, right after the
> > null byte detection routine used to detect binary files):
>
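> > > function detect_encoding(text)
> > >   local b1, b2, b3, b4 = string.byte(text, 1, 4)
> > >   -- UTF-8 BOM: EF BB BF
> > >   if b1 == 239 and b2 == 187 and b3 == 191 then
> > >     return 'UTF-8 BOM'
> > >   end
> > >   -- Check UTF-32 before UTF-16: the UTF-32 (LE) BOM (FF FE 00 00)
> > >   -- begins with the UTF-16 (LE) BOM (FF FE), so testing UTF-16
> > >   -- first would make the UTF-32 (LE) branch unreachable.
> > >   if b1 == 0 and b2 == 0 and b3 == 254 and b4 == 255 then
> > >     return 'UTF-32 (BE)'
> > >   end
> > >   if b1 == 255 and b2 == 254 and b3 == 0 and b4 == 0 then
> > >     return 'UTF-32 (LE)'
> > >   end
> > >   if b1 == 254 and b2 == 255 then
> > >     return 'UTF-16 (BE)'
> > >   end
> > >   if b1 == 255 and b2 == 254 then
> > >     return 'UTF-16 (LE)'
> > >   end
> > >   -- No BOM found: fall back to UTF-8.
> > >   return 'UTF-8'
> > > end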
>
> > > UTF-8 should be the catch-all because it does not strictly require a
> > > BOM; depending on the source of the document, it may or may not have
> > > one. More encodings can (and perhaps should) be added to this function,
> > > but it is a start. I would not worry about SciTE's UCS-2 encoding
> > > because it is essentially the obsolete, fixed-width predecessor of
> > > UTF-16 and uses the same BOMs.
>
> > Hope this helps,
>
> > Vais
>
> > On Feb 24, 2:24 pm, Alex <alex.bep....at.gmail.com> wrote:
>
> > > I will put it in my backlog too then. :)
>
> > > Maybe someone else will come up with something. Like the Lanes thing.
> > > I had tried sockets before but got nowhere because of multithreading.
>
> > > - Alex
>
> > > On Feb 24, 3:44 am, mitchell <mforal.n....at.gmail.com> wrote:
>
> > > > Hi Alex,
>
> > > > > do you plan to support any other content encoding than UTF-8? I would
> > > > > like to see this in the long-term, even if it is a low priority.
>
> > > > I would love to, but don't have the slightest idea how. It's been on
> > > > TODO, and I even looked it up earlier today, but didn't get anywhere.
> > > > I think it should be a somewhat high priority though, I just don't
> > > > know how to go about it.
>
Received on Thu 26 Feb 2009 - 14:09:07 EST
