Re: Encodings

From: Vais Salikhov <vsalik....at.gmail.com>
Date: Thu, 26 Feb 2009 16:54:45 -0500

Good luck ;)

Sent from my iPhone

On Feb 26, 2009, at 2:09 PM, Alex <alex.bep....at.gmail.com> wrote:

>
> Unfortunately, detecting encodings is not exactly SciTE's strength.
> In my experience, BOMs are rare. So, yes, I was thinking of a
> heuristic, at least one that covers the major cases, like
> distinguishing between UTF-8 and the default system encoding. I know
> Groovy supports this via
> new CharsetToolkit(file).guessEncoding()
> so this might be a place to look.
>
> - Alex
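A common heuristic for the UTF-8 vs. system-default distinction mentioned above is strict UTF-8 validation: if the bytes decode cleanly as UTF-8, treat them as UTF-8; otherwise fall back to the system encoding. A minimal sketch in Python (the function name is illustrative, not part of any tool discussed here):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8.

    A strict decode rejects any invalid byte sequence, which is how
    most "UTF-8 or legacy encoding?" guesses work in practice.
    """
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False
```

Note the asymmetry: bytes that fail strict decoding are almost certainly a legacy single-byte encoding, but the reverse guess is weaker, since pure-ASCII text is valid in both.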
>
>
> On Feb 26, 2:53 am, vais <vsalik....at.gmail.com> wrote:
>> Here is an interesting read on the subject - a discussion on SciTE's
>> implementation for the historically inclined:
>>
>> http://www.mail-archive.com/scite-inter...@lyra.org/msg02629.html
>>
>> Google ate this url last time I posted it, so if it does again, just
>> search google for "[scite] Re: UTF-8 Encoding and BOM handling"
>>
>> Vais
>>
>> On Feb 25, 8:35 pm, vais <vsalik....at.gmail.com> wrote:
>>
>>> This is a surprisingly simple problem to solve... well, for Unicode
>>> text anyway :) Beyond that, it is actually a theoretically
>>> impossible problem to solve based on the text alone. Allow me to
>>> elaborate...
>>
>>> In Unicode Explained, Jukka K. Korpela writes: "It is in principle
>>> impossible to determine the encoding of text from the text alone,
>>> but in practice, you can often guess right even using automated
>>> tools." The way guessing works is by using heuristic analysis.
>>> Jukka provides a link to a Java implementation of the heuristic
>>> code used in Mozilla, but the link is unfortunately dead. For
>>> anyone interested, the text of Unicode Explained is here:
>>> http://books.google.com/books?id=PcWU2yxc8WkC&pg=PA512&lpg=PA512&dq=t
>>> ...
>>
>>> So, the heuristic analysis used in web browsers sounds far out and
>>> probably overkill for textadept. What (and how) is SciTE doing
>>> about this? Well, let's take a look. SciTE's documentation states:
>>> "SciTE will automatically detect the encoding scheme used for
>>> Unicode files that start with a Byte Order Mark (BOM). The UTF-8
>>> and UCS-2 encodings are recognized including both Little Endian and
>>> Big Endian variants of UCS-2." If that is not a clue, I don't know
>>> what is :) and, if it is good enough for SciTE, it should be at a
>>> minimum good enough for TA. So, I started digging around on the web
>>> for BOM. It did not take long to find this document:
>>> http://en.wikipedia.org/wiki/Byte_order_mark
>>
>>> There are many other resources as well, but Wikipedia is always a
>>> good baseline. The page provides a table of "magic" BOM codes that
>>> can be used to detect which Unicode encoding is in play in a piece
>>> of text.
>>
>>> Without further ado, here is a very lame Lua implementation of
>>> Unicode encoding detection for textadept that actually gets the job
>>> done (I put it into file_io.lua and call it from open_helper, right
>>> after the null byte detection routine used to detect binary files):
>>
>>> function detect_encoding(text)
>>>   local b1, b2, b3, b4 = string.byte(text, 1, 4)
>>>   if b1 == 239 and b2 == 187 and b3 == 191 then
>>>     return 'UTF-8 BOM'
>>>   end
>>>   -- Check the 4-byte BOMs before the 2-byte ones: the UTF-32 LE
>>>   -- BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE), so
>>>   -- testing UTF-16 first would misdetect UTF-32 LE files.
>>>   if b1 == 0 and b2 == 0 and b3 == 254 and b4 == 255 then
>>>     return 'UTF-32 (BE)'
>>>   end
>>>   if b1 == 255 and b2 == 254 and b3 == 0 and b4 == 0 then
>>>     return 'UTF-32 (LE)'
>>>   end
>>>   if b1 == 254 and b2 == 255 then
>>>     return 'UTF-16 (BE)'
>>>   end
>>>   if b1 == 255 and b2 == 254 then
>>>     return 'UTF-16 (LE)'
>>>   end
>>>   return 'UTF-8'
>>> end
>>
>>> UTF-8 should be the catch-all because it does not strictly require
>>> a BOM, and depending on its source, a document may or may not have
>>> one. More encodings can (and perhaps should) be added to this
>>> function, but it is a start. I would not worry about SciTE's UCS-2
>>> encoding because it is just an obsolete version of UTF-16.
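The same table-driven BOM check, re-expressed in Python for illustration (names are hypothetical, not from any of the tools discussed). The one subtlety is ordering: the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE), so longer BOMs must be tested first:

```python
# BOM signatures, longest first so UTF-32 LE is not misread as UTF-16 LE.
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32 (BE)"),
    (b"\xff\xfe\x00\x00", "UTF-32 (LE)"),
    (b"\xef\xbb\xbf", "UTF-8 BOM"),
    (b"\xfe\xff", "UTF-16 (BE)"),
    (b"\xff\xfe", "UTF-16 (LE)"),
]

def detect_encoding(data: bytes) -> str:
    """Guess a Unicode encoding from a leading BOM, if any."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return "UTF-8"  # catch-all: UTF-8 does not require a BOM
```

A table like this is also easier to extend with further signatures than a chain of if-statements.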
>>
>>> Hope this helps,
>>
>>> Vais
>>
>>> On Feb 24, 2:24 pm, Alex <alex.bep....at.gmail.com> wrote:
>>
>>>> I will put it in my backlog too then. :)
>>
>>>> Maybe someone else will come up with something. Like the Lanes
>>>> thing. I had tried sockets before but got nowhere because of
>>>> multithreading.
>>
>>>> - Alex
>>
>>>> On Feb 24, 3:44 am, mitchell <mforal.n....at.gmail.com> wrote:
>>
>>>>> Hi Alex,
>>
>>>>>> do you plan to support any other content encoding than UTF-8? I
>>>>>> would like to see this in the long-term, even if it is a low
>>>>>> priority.
>>
>>>>> I would love to, but don't have the slightest idea how. It's been
>>>>> on TODO, and I even looked it up earlier today, but didn't get
>>>>> anywhere. I think it should be a somewhat high priority though, I
>>>>> just don't know how to go about it.
>>
>>>>> - Mitchell
Received on Thu 26 Feb 2009 - 16:54:45 EST

This archive was generated by hypermail 2.2.0 : Thu 08 Mar 2012 - 11:37:35 EST