#123 open
Encoding Behavior Could've Been Better

Reported by Wil | December 29th, 2010 @ 12:26 PM

What I did:

  1. Open a non-utf8-encoded file (in my case, Shift-JIS)
  2. Kod's encoding detection algorithm falls back to UTF-8 (which is wrong).
  3. Attempt to fix the file by adding an "encoding: shift-JIS" comment line at the top.
  4. Save the file. At this point, Kod saves the file with an encoding xattr of "utf-8".
  5. Reopen the file. Kod looks at the xattr first and resolves the encoding to "utf-8" instead of "Shift-JIS".

Note that steps 3 and 4 could be done with TextEdit (since it doesn't write the file's xattr), and reopening the file in Kod will then show the correct encoding, though this doesn't quite "solve the problem".
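
To make the mismatch in steps 2 and 5 concrete, here is a minimal sketch using Python's built-in codecs as a stand-in for Kod's decoder (the sample string and the `try_decode` helper are made up for illustration, not taken from Kod):

```python
# Illustrative only: Python's codecs stand in for Kod's decoding logic.
text = "日本語のテキスト"
raw = text.encode("shift_jis")  # the bytes as they would sit on disk


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_utf8 = try_decode(raw, "utf-8")      # what a stale "utf-8" xattr would force
as_sjis = try_decode(raw, "shift_jis")  # what the in-file comment asks for
```

Here `as_utf8` is `None` because the Shift-JIS bytes are not valid UTF-8, while `as_sjis` round-trips to the original text.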

What I expected to happen: The behavior is pretty much up for discussion, but here are my suggestions:

  1. Change the order of detection: honor the explicit "charset/encoding" line in the file over the xattr of the file. The rationale is that it's easier to modify the file's contents than to edit the xattr manually.
  2. If encoding detection utterly fails, instead of defaulting to UTF-8 or "NSISOLatin1StringEncoding", just ask the user what encoding to use.
  3. In the event the encoding detection utterly fails, don't write the xattr when saving.
  4. In the file open dialog, provide an option to select which encoding to use (defaulting to "auto-detect").
  5. The status bar could be more useful by showing the current encoding of the file plus the file type.
  6. Provide a menu for the user to select which encoding to use. Upon inspecting the code however, this is impossible without reloading the file (the NSData is released after assigning the converted text to the view), which is not a good solution because:
    • it discards the changes in the current buffer, and asking the user whether they want to save first is kinda funky (what if the file is over HTTP?)
    • it will throw away the undo history
    • if the file is over the network, it will take some time to load again (I don't think Kod caches files over networks)
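
Suggestions 1 to 3 could be sketched roughly as follows. This is Python rather than Kod's actual code, and the declaration pattern, the two-line search window, and the plain-name `xattr_value` parameter are all assumptions made for the example:

```python
import codecs
import re
from typing import Optional

# Hypothetical pattern for an in-file declaration such as
# "encoding: shift-JIS" or "charset=utf-8".
DECLARATION = re.compile(rb"(?:encoding|charset)\s*[:=]\s*([-\w.]+)", re.IGNORECASE)


def normalize(name: str) -> Optional[str]:
    """Return the canonical codec name, or None if the name is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def declared_encoding(data: bytes, max_lines: int = 2) -> Optional[str]:
    """Look for an explicit declaration in the first few lines of the file."""
    for line in data.splitlines()[:max_lines]:
        match = DECLARATION.search(line)
        if match:
            name = normalize(match.group(1).decode("ascii"))
            if name:
                return name
    return None


def resolve_encoding(data: bytes, xattr_value: Optional[str]) -> Optional[str]:
    """Prefer the in-file declaration, then the xattr; None means 'ask the user'."""
    declared = declared_encoding(data)
    if declared:
        return declared
    if xattr_value:
        return normalize(xattr_value)
    return None
```

A `None` result is the "detection utterly failed" case: the caller would prompt the user for an encoding and skip writing the xattr on save, instead of silently defaulting to UTF-8.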

I'm willing to work on this, but I'll need some input from the devs on the best course of action. I don't think implementing all of the above items would be good (6, for instance, is already a no-go).

A related question is "can the user select the encoding to use when saving?", because right now the encoding used when saving is always UTF-8 unless the opened file already has a marked encoding (xattr, file markers, etc.).

Comments and changes to this ticket

  • rsms

    rsms December 31st, 2010 @ 04:20 PM

    • State changed from “new” to “open”
    • Assigned user set to “rsms”
    • Tag set to encoding, file, text, writing

    First; thanks for a very good and "meaty" ticket.

    I've given this some thought and believe Kod should only be able to write Unicode data, but be able to read whatever encoding. The world of text encoding is a scary place whose streets are slowly being cleaned up by Unicode.

    This is what should happen:

    1. Read a file and interpret it in any encoding possible (there should be broad support for different encodings)

    2. Text is stored internally as UTF-16 (host byte order AFAIK)

    3. Upon writing to file, encode the text as UTF-8 or, if an explicit output encoding has been chosen, use that (the list should be fairly short).

    List of output encodings:

    • UTF-8
    • UTF-16
    • UTF-32
    • ISO-8859-1 (aka Latin-1)

    For the record, TextMate only allows writing files using MacRoman, UTF-8, UTF-16 or Latin-1 text encoding.
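
The read-anything, write-few policy described above might look something like this in outline. It is a sketch in Python, not Kod's implementation: Python's `str` stands in for the internal UTF-16 buffer, and the names `OUTPUT_ENCODINGS`, `read_text` and `write_bytes` are invented for the example:

```python
from typing import Optional

# The short list of allowed output encodings from the comment above.
OUTPUT_ENCODINGS = ("utf-8", "utf-16", "utf-32", "iso-8859-1")


def read_text(data: bytes, source_encoding: str) -> str:
    """Step 1: decode the file from whatever encoding it was detected as."""
    return data.decode(source_encoding)


def write_bytes(text: str, output_encoding: Optional[str] = None) -> bytes:
    """Step 3: encode as UTF-8 unless an allowed output encoding was chosen."""
    encoding = output_encoding or "utf-8"
    if encoding not in OUTPUT_ENCODINGS:
        raise ValueError(f"unsupported output encoding: {encoding}")
    # Note: encoding to iso-8859-1 raises UnicodeEncodeError for characters
    # outside Latin-1; a real editor would need to handle that case.
    return text.encode(encoding)
```

Under this policy a Shift-JIS file can be opened, but saving it produces UTF-8 by default, and asking for Shift-JIS output is rejected.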

  • Wil

    Wil January 2nd, 2011 @ 01:31 PM

    Hmm, when reading the file, what do you think about suggestions (1) and (2)? That we honor the source-declaration of encoding over the xattr settings and that we ask the user for the type of encoding of the file upon failure of detection? (gonna start working on it if you give it a go)

    In your third point, I presume that it only applies to saving a new file, right? If the file already has an encoding (i.e. you opened it from an existing writable location), just use that encoding, right?

    Some cases that I thought of that we should consider (in both cases, assume that the text is one of those bazillion non-UTF encodings):

    1. What if the user opened it from a read-only location, prompting a "save as"? Just save the text in one of the UTF formats, or follow the original encoding?
    2. What if the user opened a new document, placed an "encoding:Shift-JIS" and saved the file... do we save it as "Shift-JIS" or force it into one of the UTF encodings?
