Re: Encodings

From: vais <vsalik....at.gmail.com>
Date: Wed, 25 Feb 2009 17:53:32 -0800 (PST)

Here is an interesting read on the subject - a discussion on SciTE's
implementation for the historically inclined:

http://www.mail-archive.com/scite-inter...@lyra.org/msg02629.html

Google ate this URL the last time I posted it, so if it does so again,
just search Google for "[scite] Re: UTF-8 Encoding and BOM handling"

Vais

On Feb 25, 8:35 pm, vais <vsalik....at.gmail.com> wrote:
> This is a surprisingly simple problem to solve... well, for Unicode
> text anyway :) Other than that, this is actually a theoretically
> impossible problem to solve based on text alone. Allow me to
> elaborate...
>
> In Unicode Explained, Jukka K. Korpela writes: "It is in principle
> impossible to determine the encoding of text from the text alone, but
> in practice, you can often guess right even using automated tools."
> The way guessing works is by using heuristic analysis. Jukka provides
> a link to a Java implementation of the heuristic code used in Mozilla,
> but the link is unfortunately dead. For anyone interested, the text of
> Unicode Explained is here: http://books.google.com/books?id=PcWU2yxc8WkC&pg=PA512&lpg=PA512&dq=t...
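
To make the "guessing" idea concrete, here is a rough sketch of one very cheap heuristic: check whether the bytes are well-formed UTF-8. This is not the Mozilla detector Korpela refers to, and `looks_like_utf8` is a hypothetical name, not a textadept function. It is also simplified: it checks lead and continuation byte ranges only, so a few invalid sequences (some overlong forms, encoded surrogates) slip through.

```lua
-- Hypothetical helper (not part of textadept): returns true if every byte
-- sequence in `text` has the shape of a UTF-8 sequence.
-- Simplified: only lead-byte and continuation-byte ranges are checked.
function looks_like_utf8(text)
  local i, n = 1, #text
  while i <= n do
    local b = text:byte(i)
    local len
    if b < 0x80 then
      len = 1                           -- plain ASCII
    elseif b >= 0xC2 and b <= 0xDF then
      len = 2                           -- lead byte of a 2-byte sequence
    elseif b >= 0xE0 and b <= 0xEF then
      len = 3                           -- lead byte of a 3-byte sequence
    elseif b >= 0xF0 and b <= 0xF4 then
      len = 4                           -- lead byte of a 4-byte sequence
    else
      return false                      -- stray continuation or invalid lead
    end
    -- Every following byte of the sequence must be in 0x80..0xBF.
    for j = i + 1, i + len - 1 do
      local c = text:byte(j)
      if not c or c < 0x80 or c > 0xBF then
        return false
      end
    end
    i = i + len
  end
  return true
end
```

Text in a legacy 8-bit encoding such as Latin-1 with accented characters will usually fail this check, which is why it makes a reasonable first guess when there is no BOM. Pure ASCII passes too, so a `true` result does not rule out other ASCII-compatible encodings.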
>
> So, heuristic analysis used in web browsers sounds far out and
> probably overkill for textadept. What (and how) is SciTE doing about
> this? Well, let's take a look. SciTE's documentation states: "SciTE
> will automatically detect the encoding scheme used for Unicode files
> that start with a Byte Order Mark (BOM). The UTF-8 and UCS-2 encodings
> are recognized including both Little Endian and Big Endian variants of
> UCS-2." If that is not a clue, I don't know what is :) and, if it is
> good enough for SciTE, it should be at a minimum good enough for TA.
> So, I started digging around on the web for BOM. It did not take long
> to find this document: http://en.wikipedia.org/wiki/Byte_order_mark
>
> There are many other resources as well, but Wikipedia is always a good
> baseline. The page provides a table of "magic" BOM codes that can be
> used to detect which Unicode encoding is in play in a piece of text.
>
> Without further ado, here is a very lame Lua implementation of Unicode
> encoding detection for textadept that actually gets the job done (I
> put it into file_io.lua and call it from open_helper, right after the
> null byte detection routine used to detect binary files):
>
> function detect_encoding(text)
>   local b1, b2, b3, b4 = string.byte(text, 1, 4)
>   if b1 == 239 and b2 == 187 and b3 == 191 then
>     return 'UTF-8 BOM'
>   end
>   -- Check the 4-byte UTF-32 marks before the 2-byte UTF-16 marks: the
>   -- UTF-32 (LE) BOM begins with the same FF FE as the UTF-16 (LE) BOM.
>   if b1 == 0 and b2 == 0 and b3 == 254 and b4 == 255 then
>     return 'UTF-32 (BE)'
>   end
>   if b1 == 255 and b2 == 254 and b3 == 0 and b4 == 0 then
>     return 'UTF-32 (LE)'
>   end
>   if b1 == 254 and b2 == 255 then
>     return 'UTF-16 (BE)'
>   end
>   if b1 == 255 and b2 == 254 then
>     return 'UTF-16 (LE)'
>   end
>   return 'UTF-8'
> end
>
> UTF-8 should be the catch-all because it does not strictly require a
> BOM, and depending on the source of the document it may or may not
> have one. More encodings can (and perhaps should) be added to this
> function, but it is a start. I would not worry about SciTE's UCS-2
> encoding because it is just the obsolete, fixed-width predecessor of
> UTF-16 and uses the same BOMs.
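
Since the caller will normally want the BOM out of the buffer once it has been recognized, here is a minimal table-driven sketch of the same idea that also strips it. The names `BOMS` and `detect_and_strip_bom` are made up for illustration and are not part of textadept. The signatures are listed with the 4-byte marks first, because the UTF-32 (LE) BOM starts with the same two bytes as the UTF-16 (LE) BOM.

```lua
-- Hypothetical table (not part of textadept): BOM byte signatures.
-- 4-byte marks come first so UTF-32 (LE) is tested before UTF-16 (LE),
-- whose 2-byte mark is a prefix of it.
local BOMS = {
  { 'UTF-32 (BE)', '\0\0\254\255' },
  { 'UTF-32 (LE)', '\255\254\0\0' },
  { 'UTF-8 BOM',   '\239\187\191' },
  { 'UTF-16 (BE)', '\254\255' },
  { 'UTF-16 (LE)', '\255\254' },
}

-- Hypothetical helper: returns the encoding name and the text with any
-- leading BOM removed.
function detect_and_strip_bom(text)
  for _, entry in ipairs(BOMS) do
    local name, bom = entry[1], entry[2]
    if text:sub(1, #bom) == bom then
      return name, text:sub(#bom + 1)
    end
  end
  -- No BOM found: fall back to UTF-8, like detect_encoding above.
  return 'UTF-8', text
end

-- Example use with a UTF-8 file that starts with a BOM.
local encoding, body = detect_and_strip_bom('\239\187\191hello')
print(encoding, #body)
```

One caveat with any BOM-only scheme: a UTF-16 (LE) file whose first character is U+0000 begins with FF FE 00 00 and is indistinguishable from UTF-32 (LE) here. That should be rare enough in real text files to live with.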
>
> Hope this helps,
>
> Vais
>
> On Feb 24, 2:24 pm, Alex <alex.bep....at.gmail.com> wrote:
>
> > I will put it in my backlog too then. :)
>
> > Maybe someone else will come up with something. Like the Lanes thing.
> > I had tried sockets before but got nowhere because of multithreading.
>
> > - Alex
>
> > On Feb 24, 3:44 am, mitchell <mforal.n....at.gmail.com> wrote:
>
> > > Hi Alex,
>
> > > > do you plan to support any other content encoding than UTF-8? I would
> > > > like to see this in the long-term, even if it is a low priority.
>
> > > I would love to, but don't have the slightest idea how. It's been on
> > > TODO, and I even looked it up earlier today, but didn't get anywhere.
> > > I think it should be a somewhat high priority though, I just don't
> > > know how to go about it.
>
Received on Wed 25 Feb 2009 - 20:53:32 EST
