Re: Encodings

From: vais <vsalik....at.gmail.com>
Date: Thu, 12 Mar 2009 17:36:43 -0700 (PDT)

Mitchell,

I cannot find any way in TextMate to see what a file's original
encoding is. I created a file with TextMate and saved it with UTF16LE
encoding, then reopened it and tried a Save As, thinking that it
would offer the original encoding as the "recommended" encoding. Nope.
It still recommends UTF8.

The moral of the story is don't sweat the small stuff, I guess :) UTF8
is king. Keeping Textadept small and platform independent is much more
important than recognizing obsolete encodings - it would be a shame to
see a dependency added to TA in order to support this.

It is very important, though, NOT to treat something as UTF8 unless
you are very confident it is in fact UTF8 - remember TA crashing when
opening binary files? That's bad. So, the current solution is to look
for null bytes, and it works. Apparently it is "easy" to detect UTF8
without relying on a BOM - perhaps that technique could be used to
positively identify a file as UTF8, instead of only ruling UTF8 out
when null bytes are found? Here is some food for thought:
http://blog.macromates.com/2005/handling-encodings-utf-8/

I quote from that post:

<snip>
3. Generating a random 15 byte sequence containing characters in the
range 0-0xFF has a probability of 0.000081 to be valid UTF-8 (the
probability gets lower, the longer the sequence is, and is also lower
for actual text).
...
Property 3 turns out to be attractive because it means we can
heuristically recognize UTF-8 with a near 100% certainty by checking
if the file is valid. Some software think its a good idea to embed a
BOM (byte order mark) in the beginning of an UTF-8 file, but it is
not, because the file can already be recognized, and placing a BOM in
the beginning of a file means placing three bytes in the beginning of
the file which a program that use the file may not expect (e.g. the
shell interpreter looks for #! as the first two bytes of an
executable).
</snip>
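
For illustration, the validity check described in the quote can be
sketched in a few lines. This is Python, not anything TA actually
uses; the function name looks_like_utf8 and the NUL-byte guard are
assumptions made up for this example (a NUL byte is technically valid
UTF-8, but it is treated here as a sign of binary or UTF-16/32 data):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristically decide whether data is UTF-8 text, without a BOM."""
    # Assumption: any NUL byte means binary (or UTF-16/32) content.
    if b"\x00" in data:
        return False
    try:
        # Strict decoding rejects stray continuation bytes, truncated
        # sequences, overlong forms, and encoded surrogates.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

Note that pure ASCII passes this check too, since ASCII is a subset
of UTF-8, while Latin-1 text containing accented characters generally
fails because its high bytes do not form valid multi-byte sequences.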

So, the BOM was the low-hanging fruit I went after, but there is
apparently a smarter way if you are interested in digging deeper.
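
A rough sketch of what BOM-based detection amounts to, in Python for
illustration only (this is not TA's code; sniff_bom and the _BOMS
table are hypothetical names, and only the BOM constants from
Python's standard codecs module are real):

```python
import codecs

# Check the longer BOMs first: the UTF-32LE BOM begins with the same
# two bytes as the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None
```

A file with no BOM simply yields None, which is why a BOM check alone
cannot recognize BOM-less UTF8.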

At any rate, I am happy with the current state of affairs (so far,
anyway). You are not adding a BOM to UTF8 files created from scratch
with TA, right? That's a no-no.
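
To make the "no-no" concrete, here is a small Python sketch (again
just an illustration, with has_shebang and strip_utf8_bom as made-up
helper names) of how a UTF-8 BOM hides the #! from anything that
inspects only the first two bytes of a file:

```python
import codecs

script = "#!/bin/sh\necho hello\n"

plain = script.encode("utf-8")         # Python's "utf-8" codec writes no BOM
with_bom = script.encode("utf-8-sig")  # "utf-8-sig" prepends EF BB BF

def has_shebang(data: bytes) -> bool:
    """Mimic a loader that checks only the first two bytes for '#!'."""
    return data[:2] == b"#!"

def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM, if present; otherwise return data unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Here has_shebang(plain) holds while has_shebang(with_bom) does not,
and strip_utf8_bom(with_bom) gives back the BOM-less bytes.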

Thanks,

Vais

On Mar 12, 6:03pm, mitchell <mforal.n....at.gmail.com> wrote:
> Vais,
>
> > I have tested all available encodings - they seem to work (whatever
> > that means), and the status bar reflects the encoding with which the
> > file was saved last. However, just so there is no confusion: MacRoman
> > and the ISO-8859-1 are recognized as UTF8 in the status bar. Is this a
> > bug, or is there not supposed to be recognition for these? My
> > understanding is that using iconv you can save into potentially any of
> > the encodings provided by iconv, but that does not mean these
> > encodings will be automatically recognized as such when opening the
> > file. Is my understanding correct?
>
> Yeah, I don't know of a small, elegant way to do charset detection
> without using a big library of some sort. I think Textmate stores
> encoding in the file's attributes portion in the filesystem, but I
> doubt this would be portable in Linux and Win32.
>
Received on Thu 12 Mar 2009 - 20:36:43 EDT
