Re: Encodings

From: vais <vsalik....at.gmail.com>
Date: Wed, 25 Feb 2009 17:35:40 -0800 (PST)

This is a surprisingly simple problem to solve... well, for Unicode
text anyway :) Beyond that, it is a theoretically impossible problem
to solve from the text alone. Allow me to elaborate...

In Unicode Explained, Jukka K. Korpela writes: "It is in principle
impossible to determine the encoding of text from the text alone, but
in practice, you can often guess right even using automated tools."
The way guessing works is by using heuristic analysis. Jukka provides
a link to a Java implementation of the heuristic code used in Mozilla,
but the link is unfortunately dead. For anyone interested, the text of
Unicode Explained is here:
http://books.google.com/books?id=PcWU2yxc8WkC&pg=PA512&lpg=PA512&dq=text+encoding+heuristics&source=bl&ots=Typ1FddM1u&sig=IJLtDhUDZaWr_0pLoI5KR1jtLg0&hl=en&ei=beulSajDNZmRmQfxhujiDQ&sa=X&oi=book_result&resnum=7&ct=result

So, the heuristic analysis used in web browsers sounds far out and is
probably overkill for textadept. What is SciTE doing about this, and
how? Well, let's take a look. SciTE's documentation states: "SciTE
will automatically detect the encoding scheme used for Unicode files
that start with a Byte Order Mark (BOM). The UTF-8 and UCS-2 encodings
are recognized including both Little Endian and Big Endian variants of
UCS-2." If that is not a clue, I don't know what is :) And if it is
good enough for SciTE, it should at a minimum be good enough for TA.
So, I started digging around on the web for the BOM. It did not take
long to find this document: http://en.wikipedia.org/wiki/Byte_order_mark

There are many other resources as well, but Wikipedia is always a good
baseline. The page provides a table of "magic" BOM codes that can be
used to detect which Unicode encoding is in play in a piece of text.

Without further ado, here is a very lame Lua implementation of Unicode
encoding detection for textadept that actually gets the job done (I
put it into file_io.lua and call it from open_helper, right after the
null-byte check used to detect binary files):

function detect_encoding(text)
  local b1, b2, b3, b4 = string.byte(text, 1, 4)
  -- Check the 4-byte BOMs first: UTF-32 LE (FF FE 00 00) starts with
  -- the same two bytes as UTF-16 LE (FF FE), so order matters here.
  if b1 == 0 and b2 == 0 and b3 == 254 and b4 == 255 then
    return 'UTF-32 (BE)' -- 00 00 FE FF
  end
  if b1 == 255 and b2 == 254 and b3 == 0 and b4 == 0 then
    return 'UTF-32 (LE)' -- FF FE 00 00
  end
  if b1 == 239 and b2 == 187 and b3 == 191 then
    return 'UTF-8 BOM' -- EF BB BF
  end
  if b1 == 254 and b2 == 255 then
    return 'UTF-16 (BE)' -- FE FF
  end
  if b1 == 255 and b2 == 254 then
    return 'UTF-16 (LE)' -- FF FE
  end
  return 'UTF-8' -- no BOM; assume UTF-8
end
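The same check can also be written table-driven, which makes it easy to
add more BOMs later. Here is a minimal sketch of that variant (the name
detect_encoding2 is mine, not part of textadept; the BOM byte sequences
and encoding names mirror the function above, with the longer 4-byte
BOMs listed first so UTF-32 LE is not mistaken for UTF-16 LE):

```lua
-- Table-driven BOM detection sketch. Each entry pairs a BOM byte
-- sequence with the encoding name to report; longest prefixes first.
local boms = {
  { bytes = '\0\0\254\255', name = 'UTF-32 (BE)' }, -- 00 00 FE FF
  { bytes = '\255\254\0\0', name = 'UTF-32 (LE)' }, -- FF FE 00 00
  { bytes = '\239\187\191', name = 'UTF-8 BOM'   }, -- EF BB BF
  { bytes = '\254\255',     name = 'UTF-16 (BE)' }, -- FE FF
  { bytes = '\255\254',     name = 'UTF-16 (LE)' }, -- FF FE
}

function detect_encoding2(text)
  for _, bom in ipairs(boms) do
    if text:sub(1, #bom.bytes) == bom.bytes then return bom.name end
  end
  return 'UTF-8' -- no BOM; assume UTF-8
end

print(detect_encoding2('\239\187\191hello')) -- UTF-8 BOM
print(detect_encoding2('plain text'))        -- UTF-8
```

Either form works; the table just keeps the byte sequences in one place
instead of spread across if-statements.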

UTF-8 should be the catch-all because it does not strictly require a
BOM; depending on its source, a document may or may not include one.
More encodings can (and perhaps should) be added to this function, but
it is a start. I would not worry about SciTE's UCS-2 encoding, because
it is just an obsolete subset of UTF-16 (it predates surrogate pairs).
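One practical follow-up: once the encoding is detected, the BOM bytes
themselves usually should not end up in the buffer. Here is a small
sketch of stripping it (strip_bom is a hypothetical helper of mine, not
part of textadept; the lengths are just the byte counts of each BOM):

```lua
-- Map each detected encoding name to the length of its BOM in bytes,
-- then drop that many leading bytes before loading the text.
local bom_lengths = {
  ['UTF-8 BOM']   = 3, -- EF BB BF
  ['UTF-16 (BE)'] = 2, -- FE FF
  ['UTF-16 (LE)'] = 2, -- FF FE
  ['UTF-32 (BE)'] = 4, -- 00 00 FE FF
  ['UTF-32 (LE)'] = 4, -- FF FE 00 00
}

function strip_bom(text, encoding)
  local n = bom_lengths[encoding]
  if n then return text:sub(n + 1) end
  return text -- plain 'UTF-8' (no BOM): nothing to strip
end

print(strip_bom('\239\187\191hi', 'UTF-8 BOM')) -- hi
```

Whether TA should preserve the BOM when saving the file back out is a
separate question, but detection plus stripping covers the open path.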

Hope this helps,

Vais

On Feb 24, 2:24 pm, Alex <alex.bep....at.gmail.com> wrote:
> I will put it in my backlog too then. :)
>
> Maybe someone else will come up with something. Like the Lanes thing.
> I had tried sockets before but got nowhere because of multithreading.
>
> - Alex
>
> On Feb 24, 3:44 am, mitchell <mforal.n....at.gmail.com> wrote:
>
> > Hi Alex,
>
> > > do you plan to support any other content encoding than UTF-8? I would
> > > like to see this in the long-term, even if it is a low priority.
>
> > I would love to, but don't have the slightest idea how. It's been on
> > TODO, and I even looked it up earlier today, but didn't get anywhere.
> > I think it should be a somewhat high priority though, I just don't
> > know how to go about it.
>
Received on Wed 25 Feb 2009 - 20:35:40 EST
