#24 ✓resolved
Andre Torrez

Dropping a UTF-8 encoded file onto the Dock icon

Reported by Andre Torrez | December 24th, 2010 @ 01:01 AM

What I did:
1. Opened Kod.app
2. Dropped a UTF-8 encoded file onto Kod's dock icon

What I expected to happen:
Text would be properly detected and displayed, available to edit.

What actually happened:
Mis-decoded text.
See screenshot attached.

Comments and changes to this ticket

  • Deleted User

    Deleted User December 24th, 2010 @ 01:43 AM

    Are you sure it was UTF-8? I only have UTF-8 files and this still hasn't happened to me. Can you provide a sample file to reproduce this bug?

  • Andre Torrez

    Andre Torrez December 24th, 2010 @ 01:50 AM

    I just tried with a different UTF-8 encoded file and everything was fine. Attached is the original file I dropped on it. It is generated by Querious, a MySQL GUI app. I think it might not be a UTF-8 problem now, but I can repeat the problem with the attached file so there seems to be a bug lurking here somewhere. (zipped to preserve any header info)

  • rsms

    rsms December 24th, 2010 @ 12:07 PM

    • State changed from “new” to “resolved”
    • Tag set to encoding, file, loadfile, text

    I've dug into this and this is what happens:

    1. Kod loads the data and tries to guess encoding (since there is no com.apple.TextEncoding xattr)
    2. Kod checks for a BOM in the following order: UTF-16 BE, UTF-16 LE, UTF-32 BE, UTF-32 LE, UTF-8 (unofficially introduced by Windows). In this case the file contains no BOM.
    3. Kod decodes the first 1024 bytes as ISO-8859-1 (an 8-bit encoding which never fails to decode) and then uses a regexp to look for an explicit encoding marker (such as those favored by Emacs, Vim, Python, etc.).
    4. In this case, Kod matches the second line of the file: "# Encoding: Unicode (UTF-8)"
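
    Steps 2 and 3 can be sketched roughly as follows. This is only an illustration in Python, not Kod's actual code: the names detect_bom and sniff_head are made up for the example, and the 1024-byte limit and the BOM order are taken from the description above.

```python
import codecs

# BOMs in the order listed in the ticket. Caveat: the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE), so a literal
# first-match scan in this order reports UTF-16 LE for such a file.
BOMS = [
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
]


def detect_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def sniff_head(data: bytes, limit: int = 1024) -> str:
    """Decode the first `limit` bytes as ISO-8859-1.

    ISO-8859-1 maps every byte value 0x00-0xFF to a code point,
    so this decode cannot raise for any input.
    """
    return data[:limit].decode("iso-8859-1")
```

    For a file like the attached one, detect_bom returns None, so the marker search runs on the text returned by sniff_head.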

    Now, the regexp for finding encoding markers looks like this:

    content=".charset=([^"]+)"|(?:charset|encoding)\s[=:]\s(?:"([^"]+)"|'([^']+)'|(\w+))|coding:\s(\w+)

    This means that the encoding line in the file does match, but the captured value is "Unicode". Next, Kod passes the value to CFStringConvertIANACharSetNameToEncoding (a CoreFoundation function), which interprets "Unicode" as UTF-16, so the file gets decoded as UTF-16 (little endian, the host byte order in my case).
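
    To make the failure concrete, here is a small Python sketch. The pattern is copied exactly as printed in this ticket (the pattern and flags Kod really uses may differ), find_marker is a hypothetical helper, and Python's utf-16-le codec only stands in for the CoreFoundation decoding step.

```python
import re

# Pattern copied verbatim from the ticket text; treat it as an approximation.
MARKER_RE = re.compile(
    r'''content=".charset=([^"]+)"|(?:charset|encoding)\s[=:]\s(?:"([^"]+)"|'([^']+)'|(\w+))|coding:\s(\w+)'''
)


def find_marker(head: str):
    """Return the first captured encoding name in `head`, or None."""
    m = MARKER_RE.search(head)
    if m is None:
        return None
    # Exactly one of the capture groups is set for any given match.
    return next(g for g in m.groups() if g is not None)


line = "# Encoding: Unicode (UTF-8)\n"

# \w+ stops at the space before "(UTF-8)", so only "Unicode" is captured.
name = find_marker(line)

# Reading the UTF-8 bytes as UTF-16 LE fuses each pair of bytes into one
# unrelated code unit, which is the kind of garbage seen in the screenshot.
garbled = line.encode("utf-8").decode("utf-16-le")
```

    Here name is "Unicode", and garbled has half as many characters as line, none of them the original ASCII text.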

    Summary: The encoding marker is in an illegal format. Change it to "encoding: utf8" or simply remove it (UTF-8 is the default fallback encoding).
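
    A quick way to check both suggested fixes, again as a Python sketch under the same assumptions (pattern copied as printed in this ticket; pick_encoding and DEFAULT_ENCODING are hypothetical names, not Kod's):

```python
import re

# Pattern copied verbatim from the ticket text; treat it as an approximation.
MARKER_RE = re.compile(
    r'''content=".charset=([^"]+)"|(?:charset|encoding)\s[=:]\s(?:"([^"]+)"|'([^']+)'|(\w+))|coding:\s(\w+)'''
)

# The ticket states that UTF-8 is the fallback when nothing else is found.
DEFAULT_ENCODING = "utf-8"


def pick_encoding(head: str) -> str:
    """Return the encoding name from a marker, or the default if none matches."""
    m = MARKER_RE.search(head)
    if m is None:
        return DEFAULT_ENCODING
    return next(g for g in m.groups() if g is not None)


with_fixed_marker = pick_encoding("# encoding: utf8\n")  # marker rewritten
without_marker = pick_encoding("SELECT 1;\n")            # marker removed
```

    With the rewritten marker the captured name is "utf8", and with the marker removed nothing matches, so the default applies.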
