20 comments

  • ryandrake 2 days ago

    Joel covered this[1] topic over 20 years ago (!!) and we still regularly see "senior" programmers who just casually think of text as a string and strings as text, and that's all there is to it. I still regularly see websites full of ????? and U+FFFD and apostrophes becoming ’ everywhere.

    1: https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...
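
    For anyone who has not seen how those three symptoms arise, here is a minimal sketch in Python. The sample string and the choice of Windows-1252 as the "wrong" codec are assumptions for illustration; real sites may garble text through other codec pairs.

```python
# Assumption: "it’s" is just a sample string; cp1252 is one common wrong codec.
original = "it\u2019s"  # contains U+2019 RIGHT SINGLE QUOTATION MARK

# UTF-8 bytes read back as Windows-1252: the apostrophe turns into three characters.
garbled = original.encode("utf-8").decode("cp1252")

# Windows-1252 bytes read back as UTF-8: the invalid byte becomes U+FFFD.
replaced = original.encode("cp1252").decode("utf-8", errors="replace")

# Forcing the text into ASCII: the apostrophe becomes a question mark.
questioned = original.encode("ascii", errors="replace").decode("ascii")

# The first kind of damage is reversible as long as nothing was lost on the way.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled, replaced, questioned, repaired)
```

    Only the first case can be undone after the fact; once a character has been replaced by U+FFFD or "?", the original is gone.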

  • danhau a day ago

    At my job I have to deal with an old system that invented its own encoding, named TSS. The idea was to unify multiple charsets and encodings into one, before Unicode was a thing. But instead of coming up with one big charset and assigning codepoints plus an encoding scheme, they thought it was wise to just repackage other encodings and charsets. Think Matroska, but for text. And yes, I do mean charsets AND encodings. Sometimes they repackage an encoding, sometimes just a charset where the codepoints are the encoding.

    TSS supports the ISO-8859 charsets and corresponding (but deviating) Windows codepages, traditional and simplified Chinese, half- and fullwidth Japanese, Korean via Wansung and Johab, and others I'm forgetting right now. And in newer versions of the software, they also support Unicode, but using a custom encoding.

    Thankfully a good chunk of all that is well documented, like the byte values introducing a fullwidth Japanese character, for example. But they don't describe what charset or encoding is actually used. EUC-JP? Shift-JIS? Turns out it's JIS X 0208. You'd think they would just use Shift-JIS, which gives them both full- and halfwidth Japanese in one shot, but no. They package those explicitly as JIS X 0208 and JIS X 0201. Similar questions arise for Chinese and the others. It took a lot of reverse engineering to figure that stuff out. But if you think that is hard, have fun finding tables to map those old encodings to Unicode and back. Java is a godsend in this case. Charset.availableCharsets has them all!
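
    To make the charset-versus-encoding distinction concrete, here is a rough sketch using Python's built-in codecs as a stand-in for the Java charsets. TSS itself is not modelled; the sample character is an assumption. The same JIS X 0208 character gets different bytes under EUC-JP, Shift-JIS and ISO-2022-JP, and the bare two-byte JIS X 0208 code can be recovered from the EUC-JP form by clearing the high bits.

```python
# Assumption: Python's codecs stand in for Java's Charset tables; TSS is not modelled.
ch = "\u65e5"  # the kanji for "day/sun", part of JIS X 0208

euc = ch.encode("euc_jp")          # JIS X 0208 code with the high bit set on both bytes
sjis = ch.encode("shift_jis")      # same charset, different byte arithmetic
iso2022 = ch.encode("iso2022_jp")  # 7-bit JIS X 0208 code wrapped in escape sequences

# Clearing the high bit of each EUC-JP byte yields the bare JIS X 0208 code.
jis_x_0208 = bytes(b & 0x7F for b in euc)

print(euc.hex(), sjis.hex(), jis_x_0208.hex(), iso2022)
```

    Three encodings, one charset: all three byte strings decode back to the same character, but only if you know which wrapper was used.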

    What's kinda charming is that TSS also contains text formatting commands. "Make all following text bold! Make it underlined! Now make it both bold and underlined!" Stuff like that.

    What's less charming is that TSS is actually a superset (an extension) of the ISO-8859 family, similar to how ISO-8859 is a superset of ASCII. In other words, all ISO-8859-1 (or any other variant) is perfectly valid TSS, but not all TSS is valid ISO-8859-1. This creates a lot of fun meetings with other departments when they query the database and are puzzled as to where those weird characters in their ISO-8859-1 text came from.

  • dang 2 days ago

    Related:

    What Every Programmer Absolutely, Positively Needs To Know About Encodings (2011) - https://news.ycombinator.com/item?id=30384223 - Feb 2022 (58 comments)

    What programmers need to know about encodings and charsets (2011) - https://news.ycombinator.com/item?id=24162499 - Aug 2020 (22 comments)

    What to know about encodings and character sets - https://news.ycombinator.com/item?id=9788253 - June 2015 (30 comments)

    What Every Programmer Needs To Know About Encodings And Character Sets - https://news.ycombinator.com/item?id=4771987 - Nov 2012 (5 comments)

  • fainpul 2 days ago

    Highly related recommendation: https://i18n-puzzles.com/

    It's a series of tasks ("puzzles") in the style of Advent of Code. Some deal with text handling, some with dates and times.

    In my opinion it's a fun way to really get this stuff in your brain (by doing, not just reading about it) and especially learn about what your programming language of choice has to offer in this department.

    I find the later puzzles have a bit of an artificial difficulty increase, which makes them seem a bit far fetched and unrealistic. But the first few are definitely reasonable and applicable to real-world scenarios. You also don't have to do them in order. Unlike with AoC, all the puzzles are available from the start.

  • ColinWright 4 days ago

    Full title:

    "What every programmer absolutely, positively needs to know about encodings and character sets to work with text"

  • jibal 2 days ago

    > Everybody is aware of this at some level, but somehow this knowledge seems to suddenly disappear in a discussion about text, so let's get it out first: A computer cannot store "letters", "numbers", "pictures" or anything else. The only thing it can store and work with are bits.

    This is wrong and it goes downhill from there. I don't want to take the time and effort to fisk it, but it's full of errors like mistaking characters for codepoints and saying things like "In other words, ASCII maps 1:1 unto UTF-8" -- a bizarre and wrong way to say what he said in the previous sentence: "All characters available in the ASCII encoding only take up a single byte in UTF-8 and they're the exact same bytes as are used in ASCII".

    • torstenvl 2 days ago

      It isn't wrong. Computers, broadly speaking, can only store binary digits.

      I'm not sure if you're thinking of the Mark II, or the term as meaning human arithmeticians, or what, but that seems pedantic to the point of sophistry.

  • random3 2 days ago

    The best things are those that get out of the way.

  • geocar 2 days ago

    > Say, your app must accept files uploaded in GB18030, but internally you are handling all data in UTF-32. A tool like iconv can cleanly convert the uploaded file with a one-liner like iconv('GB18030', 'UTF-32', $string). That is, it will preserve the characters while changing the underlying bits:

    Oh for goodness sake please please don't do this: Despite the appearance of the "representations" given, GB18030 is bigger than Unicode so this potentially destroys information. Almost any other `character (encoding) set' would have been a better example, but definitely not this one, and unless you already know why, it might work for a long time before you discover a problem.

    Actually, I do not generally recommend converting anything ever; I try to save the original customer/user submission and then any derivative use of it that needs some specific conversion can use that. If you save the bytes you were given, you can fix problems like this when they come up, but if you normalise everything before saving your golden record in your database, you might actually lose something important.

    Three other things to know about "encoding and character sets" that I feel are more important than code points:

    1. If you don't know the language, you can't sort/compare, so if you think this saves you keeping track of the 'character set', well you _should_ have been tracking 'character set+language' anyway, so even if UTF32 worked, you'd still need the field for language anyway. And yeah, this affects "latin" languages too.

    2. If you don't know the font, you can't figure out how big something is, draw it, wrap it, count the "characters", and so on. If you're beginning to wonder what you can do with text you can't read, you're starting to get the idea.

    3. Microsoft is a massive fucking company and can't get RTL right. Bananas, right? You have no hope if you do not talk to actual human beings that use the language. This guy https://www.notarabic.com gave a talk a few years ago which I recommend if that sounds incredible.

    tl;dr: text is hard, let's go to the beach.
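
    A small illustration of point 1, as a sketch only: sorting by code point gives an order that no language actually uses for these words. The word list and the `hypothetical_german_key` helper are assumptions made up for this example; a real application would use a proper collation library rather than a one-character substitution.

```python
# Assumption: sample words and the toy key function are invented for illustration.
words = ["zebra", "\u00e4hnlich", "apfel"]  # "\u00e4" is a precomposed "ä"

# Plain sorted() compares code points: U+00E4 is larger than "z" (U+007A).
by_code_point = sorted(words)


def hypothetical_german_key(word: str) -> str:
    """Toy dictionary-style key: treat "ä" like "a". Not a real collation."""
    return word.replace("\u00e4", "a")


# German dictionaries file "ä" with "a"; Swedish puts "ä" after "z".
toy_german = sorted(words, key=hypothetical_german_key)

print(by_code_point)
print(toy_german)
```

    The same three strings have a different correct order depending on who is reading them, which is why the language has to travel with the text.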

  • ____tom____ 2 days ago

    > Because Unicode is not an encoding.

    > Overall, Unicode is yet another encoding scheme.

    ?

    • btilly 2 days ago

      That's just somewhat sloppy.

      Unicode is not an encoding of text to bits. It is an encoding of text to numbers. There are a variety of encodings of text to bits based on how those numbers are to be encoded into bits.

      Though technically Unicode isn't even quite that. For example "é" can be encoded as U+00E9 or as U+0065,U+0301. Going the other way, "水", U+6C34, is drawn differently in simplified Chinese, Japanese, and traditional Chinese. Unicode calls this "language-sensitive glyph variation".

      Which means that the correspondence between text and Unicode is many to many both ways. And then the Unicode can show up in bits and bytes again in multiple ways.
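
      Both layers can be seen in a few lines of Python. This is a minimal sketch using only the standard library; the variable names are mine, and the glyph-variation side cannot be shown in code because it depends on the font and language tag rather than on the code points.

```python
import unicodedata

# Text -> numbers: the same visible "é" as one code point or as two.
composed = "\u00e9"     # U+00E9
decomposed = "e\u0301"  # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

equal_raw = composed == decomposed  # False: different code point sequences
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed  # True

# Numbers -> bits: one code point, three different byte sequences.
water = "\u6c34"
code_point = ord(water)            # 0x6C34
utf8 = water.encode("utf-8")       # e6 b0 b4
utf16 = water.encode("utf-16-be")  # 6c 34
utf32 = water.encode("utf-32-be")  # 00 00 6c 34

print(equal_raw, equal_nfc, hex(code_point), utf8.hex(), utf16.hex(), utf32.hex())
```

      Comparing strings without normalizing first is where the first kind of ambiguity usually bites.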

    • Terr_ 2 days ago

      Yeah, author seems to have made a mistake there.

      > Unicode is a large table mapping characters to numbers and the different UTF encodings specify how these numbers are encoded as bits. Overall, Unicode is yet another encoding scheme.

      I would guess this represents a confusion between the narrow abstract definition of Unicode versus the way it is casually used as an umbrella term which includes stuff like Transformation Formats.

      • jibal 2 days ago

        The author doesn't understand what a character is, despite the Unicode standard making it very clear that character != codepoint

  • TacticalCoder 2 days ago

    > Text is either encoded in UTF-8 or it's not. If it's not, it's encoded in ASCII, ISO-8859-1, UTF-16 or some other encoding.

    Nitpicking but if it's encoded in ASCII, it's by definition a validly encoded UTF-8 file.
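
    This is easy to check. A quick sketch in Python, assuming "ASCII" means the 128 byte values 0x00 to 0x7F:

```python
# Every 7-bit byte value, decoded once as ASCII and once as UTF-8.
all_ascii = bytes(range(128))
same_text = all_ascii.decode("ascii") == all_ascii.decode("utf-8")

# The reverse does not hold: UTF-8 bytes for a non-ASCII character are not valid ASCII.
try:
    "\u00e9".encode("utf-8").decode("ascii")
    utf8_is_ascii = True
except UnicodeDecodeError:
    utf8_is_ascii = False

print(same_text, utf8_is_ascii)
```

    So an ASCII file is also a UTF-8 file, but not the other way round.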

    • jibal 21 hours ago

      This accurate comment was previously dead. Glad that it got resurrected.

  • nick49488171 2 days ago

    Bitmaps. Anything outside of ASCII should be a bitmap.

    • Uehreka 2 days ago

      This is the encodings equivalent of the “there should just be one timezone” take.

    • bloomca 2 days ago

      How would that work? How many bytes per character? How would different fonts work?

      • nick49488171 2 days ago

        Sorry, misplaced humor.
