A popular but wrong way to convert a string to uppercase or lowercase

(devblogs.microsoft.com)

58 points | by ingve 9 hours ago

70 comments

  • blenderob 8 hours ago

    It is issues like this due to which I gave up on C++. There are so many ways to do something and every way is freaking wrong!

    An acceptable solution is given at the end of the article:

    > If you use the International Components for Unicode (ICU) library, you can use u_strToUpper and u_strToLower.

    Makes you wonder why this isn't part of the C++ standard library itself. Every revision of the C++ standard brings with it more syntax and more complexity in the language. But as a user of C++ I don't need more syntax and more complexity in the language. I do need more standard library functions that solve these ordinary real-world programming problems.
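    A rough illustration of the ICU route (a sketch only: it assumes ICU4C is installed and linked, e.g. via pkg-config icu-uc; the helper name and the choice of German locale are mine, not the article's):

    ```cpp
    #include <unicode/locid.h>
    #include <unicode/unistr.h>
    #include <string>

    // Locale-aware uppercasing via ICU's UnicodeString.
    std::string icu_upper_german(const std::string& utf8) {
        icu::UnicodeString s = icu::UnicodeString::fromUTF8(utf8);
        s.toUpper(icu::Locale::getGerman());  // full case mapping: ß -> SS
        std::string out;
        s.toUTF8String(out);                  // convert back to UTF-8
        return out;
    }
    ```

    icu_upper_german("Straße") yields "STRASSE" — the kind of one-character-to-two mapping that the per-character standard library functions cannot express.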

    • bayindirh 7 hours ago

      I don't think it's a C++ problem. You just can't transform anything developed in "ancient" times into something Unicode-aware in one fell swoop.

      On the other hand, libicu is 37MB by itself, so it's not something someone can write in a weekend and ship.

      Any tool which is old enough will have a thousand ways to do something. This is the inevitability of software and programming languages. In the domain of C++, which is mammoth in size by now, everyone expects this huge pony to learn new tricks, but everybody has a different idea of the "new tricks", so more features are added on top of its already impressive and very long list of features and capabilities.

      You want libICU built-in? There must be other folks who want that too. So you may need to find them and work with them to make your dream a reality.

      So, C++ is doing fine. It's not that they omitted Unicode during the design phase. Unicode arrived later, and it has to be integrated by other means. This is what libraries are for.

      • pornel 7 hours ago

        Being developed in, and having to stay compatible with, ancient times is a real problem of C++.

        The now-invalid assumptions couldn't have been avoided 50 years ago. Fixing them now in C++ is difficult or impossible, but still, the end result is a ton of brokenness baked into C++.

        Languages developed in the 21st century typically have some at least half-decent Unicode support built-in. Unicode is big and complex, but there's a lot that a language can do to at least not silently destroy the encoding.

        • cm2187 6 hours ago

          That explains why there are two functions, one for ascii and one for unicode. That doesn't explain why the unicode functions are hard to use (per the article).

          • BoringTimesGang 6 hours ago

            Because human language is hard to boil down to a simple computing model and the problem is underdefined, based on naive assumptions.

            Or perhaps I should say naïve.

      • relaxing 7 hours ago

        It’s been 30 years. Unicode predates C++98. Java saw the writing on the wall. There’s no excuse.

        • bayindirh 7 hours ago

          > There’s no excuse.

          I politely disagree. None of the programming languages that started out integrating Unicode were targeting everything from bare metal to GUI, including embedded and OS development, at the same time.

          C++ covers a far greater target area than most other programming languages. There are widely used libraries that compile correctly on PDP-11s, even though they are updated constantly.

          You can't just say "I'll be just making everything Unicode aware, backwards compatibility be damned, eh".

          • blenderob 7 hours ago

            But we don't have to make everything Unicode aware. Backward compatibility is indeed very important in C++. Like you rightly said, it still has to work for PDP-11 without breaking anything.

            But the C++ overlords could always add a new type that is Unicode-aware. Converting one Unicode string to another is a purely in-memory, in-CPU operation. It does not need any I/O and it does not need any interaction with peripherals. So one can dream that such a type along with its conversion routines could be added to an updated standard library without breaking existing code that compiles correctly on PDP-11s.

            • bayindirh 7 hours ago

              > Converting one Unicode string to another is a purely in-memory, in-CPU operation.

              ...but it's a complex operation. This is what libICU is mostly for. You can't just look up a single table and convert one string to another the way you can with the ASCII table or any other simple encoding.

              German has its ß to SS (or capital ẞ, depending on the year), Turkish has its ı/I and i/İ pairs, and tons of other languages have other rules.

              Especially these I/ı and İ/i pairs break tons of applications in very unexpected ways. I don't remember how many bugs I've reported, and how many workarounds I have implemented in my systems.
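              The silent failure mode is easy to demonstrate. A byte-wise fold (sketched below; the helper name is mine) neither applies the Turkish rule nor touches the non-ASCII bytes:

              ```cpp
              #include <cctype>
              #include <string>

              // Naive byte-wise "lowercasing" of UTF-8. std::tolower only
              // knows single bytes, so it cannot apply Turkish casing rules,
              // and it passes any byte outside ASCII through unchanged.
              std::string byte_lower(std::string s) {
                  for (char& c : s)
                      c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
                  return s;
              }
              ```

              byte_lower("I") gives "i", which is wrong in Turkish (it should be the dotless ı), and byte_lower("İ") — UTF-8 bytes C4 B0 — comes back completely unchanged.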

              Adding a type is nice, but the surrounding machinery is so big that it brings tons of work with it. Unicode is such a complicated system, that I read that even you need two UTF-16 characters (4 bytes in total) to encode a single character. This is insane (as in complexity, I guess they have their reasons).

              • SAI_Peregrinus 3 hours ago

                > Unicode is such a complicated system, that I read that even you need two UTF-16 characters (4 bytes in total) to encode a single character. This is insane (as in complexity, I guess they have their reasons).

                Because there are more than 65,535 characters. That's just writing systems, not Unicode's fault. Most of the unnecessary complexity of Unicode is legacy compatibility: UTF-16 & UTF-32 are bad ideas that increase complexity, but they predate UTF-8 which actually works decently well so they get kept around for backwards compatibility. Likewise with the need for multiple normalization forms.
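                The surrogate-pair point can be checked directly with the built-in literal types, no library needed:

                ```cpp
                #include <string>

                // U+1F600 (an emoji) lies beyond U+FFFF, outside the Basic
                // Multilingual Plane, so UTF-16 must spend two 16-bit code
                // units (a surrogate pair, 4 bytes) on it, while UTF-32
                // spends a single 32-bit unit.
                const std::u16string kEmojiUtf16 = u"\U0001F600";  // 2 code units
                const std::u32string kEmojiUtf32 = U"\U0001F600";  // 1 code point
                ```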

                • bayindirh 3 hours ago

                  I mean, I already know some Unicode internals and linguistics (since I developed a language-specific compression algorithm back in the day), but I have never seen a single character requiring four bytes (and I know Emoji chaining for skin color, etc.).

                  So, seeing this just moved the complexity of Unicode one notch up in my head, and I respect the guys who designed and made it work. It was not whining or complaining of any sort. :)

              • 6 hours ago
                [deleted]
              • blenderob 6 hours ago

                Thanks for the reply! Really appreciate the time you have taken to write down a thoughtful reply.

    • Muromec 7 hours ago

      Well, the only time you can do str lower where Unicode locale awareness will be a problem is when you do it on user input, like names.

      How about you just don't? If it's a constant in your code, you probably use ASCII anyway or can do a static mapping. If it's user input -- just don't str lower / str upper it.

    • pistoleer 7 hours ago

      > There are so many ways to do something and every way is freaking wrong!

      That's life! The perfect way does not exist. The best you can do is be aware of the tradeoffs, and languages like C++ absolutely throw them in your face at every single opportunity. It's fatiguing, and writing in javascript or python allows us to uphold the facade that everything is okay and that we don't have to worry about a thing.

      • pornel 7 hours ago

        JS and Python are still old enough to have been created when Unicode was in its infancy, so they have their own share of problems from using UCS-2 (such as indexing strings by what is now a UTF-16 code unit, rather than by a codepoint or a grapheme cluster).

        Swift was developed in modern times, and it's able to tackle Unicode properly, e.g. it makes a distinction between codepoints and grapheme clusters, and it steers users away from random-access indexing and from having a single (incorrect) notion of string length.

      • 7 hours ago
        [deleted]
    • pjmlp 3 hours ago

      Because it is a fight to put anything into an ISO-managed language, and only the strongest persevere long enough to make it happen.

      Regardless of what ISO language we are talking about.

    • 7 hours ago
      [deleted]
    • BoringTimesGang 7 hours ago

      >It is issues like this due to which I gave up on C++. There are so many ways to do something and every way is freaking wrong!

      These are mostly unicode or linguistics problems.

      • tralarpa 7 hours ago

        The fact that the standard library works against you doesn't help (std::tolower takes an int, but only kind of works (sometimes) correctly on unsigned char, and wchar_t is implicitly promoted to int).

        • BoringTimesGang 7 hours ago

          tolower is in the std namespace but is actually just part of the C89 standard, meaning it predates both UTF-8 and UTF-16. Is the alternative that it should be made unusable, and more existing code broken? A modern user has to include one of the c-prefix headers to use it, which already hints to them that 'here be dragons'.

          But there are always dragons. It's strings. The mere assumption that they can be transformed int-by-int, irrespective of encoding, is wrong. As is the assumption that a sensible transformation to lower case without error handling exists.
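          One concrete dragon, for the record: std::tolower's int parameter must hold a value representable as unsigned char (or EOF), so even the "correct" ASCII-only call needs a cast (a sketch; the wrapper name is mine):

          ```cpp
          #include <cctype>
          #include <string>

          // Passing a plain char that is negative (any byte >= 0x80 on a
          // platform where char is signed) to std::tolower is undefined
          // behavior -- hence the detour through unsigned char.
          char ascii_tolower(char c) {
              return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
          }
          ```

          Even done by the book, this is correct for ASCII only; in the default "C" locale, bytes outside ASCII come back unchanged.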

  • appointment 8 hours ago

    The key takeaway here is that you can't correctly process a string if you don't know what language it's in. That includes variants of the same language with different rules, e.g. en-US and en-GB, or es-MX and es-ES.

    If you are handling multilingual text the locale is mandatory metadata.

    • zarzavat 7 hours ago

      Different parts of a string can be in different languages too[1].

      The lowercase of "DON'T FUSS ABOUT FUSSBALL" is "don't fuss about fußball". Unless you're in Switzerland.

      [1] https://en.wikipedia.org/wiki/Code-switching

      • schoen 7 hours ago

        Probably "don't fuss about Fußball" for the same reasons, right?

      • thiht 2 hours ago

        I thought the German language deprecated the use of ß years ago, no? I learned German for a year and that's what the teacher told us, but maybe it's not the whole story

        • 47282847 4 minutes ago

          Incorrect. ẞ is still a thing.

  • vardump 8 hours ago

    As always, Raymond is right. (And as usual, I could guess it was him before even clicking the link.)

    That said, 99% of the time when doing an upper- or lowercase operation you're interested in just the 7-bit ASCII range of characters.

    For the remaining 1%, there's ICU library. Just like Raymond Chen mentioned.

    • sebstefan 7 hours ago

      Yes please, keep making software that mangles my actual last name at every step of the way. 99% of the world loves it when you only care about the USA.

      • Muromec 6 hours ago

        If it needs to uppercase names it probably interfaces with something forsaken like Sabre/Amadeus that only understands ASCII anyway.

        The real problem is accepting non-ASCII input from user where you later assume it's ASCII-only and safe to bitfuck around.

        • sebstefan 6 hours ago

          From experience anything banking adjacent will usually fuck it up as well

          For some reason they have a hard-on for putting last names in capital letters and they still have systems in place that use ASCII

          • Muromec 3 hours ago

            If it uses ASCII anyway, what's the problem then? Don't accept non-ASCII user input.

            • sebstefan 3 hours ago

              First off: And exclude 70% of the world?

              Usually they'll accept it, but some parts of the backend are still running code from the 60's.

              So you get your name rendered properly on the web interface and in most core features, but one day you wander off the beaten path by, like, requesting some insurance contract, and you'll see your name at the top with some characters mangled, depending on what your name's like. Mine is just accented Latin characters, so it usually drops the accents; not sure how it would work if your name were in an entirely different alphabet.

        • InfamousRece 4 hours ago

          Some systems are still using EBCDIC.

    • fhars 8 hours ago

      No, when you are doing string manipulation, you are almost never interested in just the seven bit ASCII range, as there is almost no language that can be written using just that.

      • vardump 7 hours ago

        > as there is almost no language that can be written using just that.

        99% of use cases I've seen have nothing to do with human language.

        The 1% human-language case needs to be handled properly using a proper Unicode library.

        Your mileage (percentages) may vary depending on your job.

        • kergonath 7 hours ago

          Right. That’s why I still get mail with my name mangled and my street name barely recognisable. Because I’m in the 1%. Too bad for me…

          In all seriousness, though, in the real world ASCII works only for a subset of a handful of languages. The vast majority of the population does not read or write any English in their day to day lives. As far as end users are concerned, you should probably swap your percentages.

          ASCII is mostly fine within your programs like the parser you mention in your other comment. But even then, it’s better if a Chinese user name does not break your reporting or logging systems or your parser, so it’s still a good idea to take Unicode seriously. Otherwise, anything that comes from a user or gets out of the program needs to behave.

          • vardump 6 hours ago

            I said use a Unicode library if input data is actual human language. Which names and addresses are.

            99% case being ASCII data generated by other software of unknown provenance. (Or sometimes by humans, but it's still data for machines, not for humans.)

            • kergonath 4 hours ago

              I am really not sure about this 99%. A lot of programs deal with quite a lot of user-provided data, which you don’t control.

          • Muromec 6 hours ago

            Who and why still tries to lowercase/uppercase names? Please tell them to stop.

            • kergonath 4 hours ago

              Hell if I know. I don’t know what kind of abomination e-commerce websites run on their backend, I just see the consequences.

        • 9dev 7 hours ago

          It's funny how much software developers live in bubbles. Whether you deal with human language a lot or almost not at all depends entirely on your specific domain. Anyone working on user interfaces of any kind must account for proper encoding, for example; that includes pretty much every line-of-business app out there, which is a lot of code.

        • inexcf 7 hours ago

          Why do you need upper- or lowercase conversion in cases that have nothing to do with human language?

          • vardump 7 hours ago

            Here's an example. Hypothetically say you want to build an HTML parser.

            You might encounter tags like <html>, <HTML>, <Html>, etc., but you want to perform a hash table lookup.

            So first you're going to normalize to either lower- or uppercase.
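            A sketch of that normalization (helper names are mine): tag names are ASCII by definition, so a byte-wise fold is actually correct in this niche:

            ```cpp
            #include <cctype>
            #include <string>
            #include <unordered_set>

            // Fold an ASCII tag name to lowercase before the hash lookup.
            std::string fold_tag(std::string tag) {
                for (char& c : tag)
                    c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
                return tag;
            }

            bool is_known_tag(const std::string& raw) {
                // Tiny stand-in for a real tag table.
                static const std::unordered_set<std::string> kTags = {"html", "head", "body"};
                return kTags.count(fold_tag(raw)) != 0;
            }
            ```

            With this, "html", "HTML", and "Html" all hit the same table entry.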

            • inexcf 4 hours ago

              Ah, I see, we disagree on what is "human language". An abbreviation like HTML and its different capitalisations sound to me a lot like a feature of human language.

            • Muromec 6 hours ago

              But but, I want to have a custom web component and register it under my own name, which can only be properly written in Ukrainian Cyrillic. How dare you not let me have it.

        • elpocko 6 hours ago

          Every search feature everywhere has to be case-insensitive or it's unusable. Search seems like a pretty ubiquitous feature in a lot of software, and has to work regardless of locale/encoding.

      • daemin 7 hours ago

        I would argue that for most programs when you're doing string manipulation you're doing it for internal programming reasons - logs, error messages, etc. In that case you are in nearly full control of the strings and therefore can declare that you're only working with ASCII.

        The other normal cases of string usage are file paths and user interface, and the needed operations can be done with simple string functions; even in UTF-8 encoding the characters you care about are in the ASCII range. With file paths, the manipulations you're most often doing are path-based, so you only care about the '/', '\', ':', and '.' ASCII characters. With user interface elements you're likely to be using them as just static data and only substituting values into placeholders when necessary.

        • pistoleer 7 hours ago

          > I would argue that for most programs when you're doing string manipulation you're doing it for internal programming reasons - logs, error messages, etc. In that case you are in nearly full control of the strings and therefore can declare that you're only working with ASCII.

          Why would you argue that? In my experience it's about formatting things that are addressed to the user, where the hardest and most annoying localization problems matter a lot. That includes sorting the last name "van den Berg" just after "Bakker", stylizing it as "Berg, van den", and making sure the capitalization is correct and not "Van Den Berg". There is no built-in standard library function in any language that does any of that. It's so much larger than ASCII and even larger than Unicode.

          Another user said that the main takeaway is that you can't process strings until you know their language (locale), and that is exactly correct.

          • daemin 6 hours ago

            I would maintain that your program has more string manipulation for error messages and logging than for generating localised formatted names.

            Further I do say that if you're creating text for presenting to the user then the most common operation would be replacement of some field in pre-defined text.

            In your case I would design it so that the correctly capitalised first name, surname, and variations of those for sorting would be generated at the data entry point (manually or automatically) and then just used when needed in user facing text generation. Therefore the only string operation needed would be replacement of placeholders like the fmt and standard library provide. This uses more memory and storage but these are cheaper now.
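            That design might be sketched like this (field and function names are mine, not from the comment): the display form and a precomputed sort key travel together, so no case mapping or locale logic runs at sort or render time:

            ```cpp
            #include <algorithm>
            #include <string>
            #include <vector>

            // Display form and collation key are produced once, at data
            // entry, with whatever locale rules apply; later code only
            // compares the precomputed keys.
            struct PersonName {
                std::string display;   // e.g. "van den Berg", shown verbatim
                std::string sort_key;  // e.g. "berg, van den", precomputed upstream
            };

            void sort_names(std::vector<PersonName>& people) {
                std::sort(people.begin(), people.end(),
                          [](const PersonName& a, const PersonName& b) {
                              return a.sort_key < b.sort_key;
                          });
            }
            ```

            With keys "bakker" and "berg, van den", sorting puts "Bakker" just before "van den Berg", as a Dutch phone book expects.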

            • pistoleer 5 hours ago

              I agree, but the logging formatters don't really do much beyond trivially pasting in placeholders.

              And as for data entry... Maybe in an ideal world. In the current world, marred by importing previously mangled datasets, a common solution in the few companies I've worked at is to just not do anything, which leaves ugly edges, yet is "good enough".

        • heisenzombie 7 hours ago

          File paths? I think filesystem paths are generally “bags of bytes” that the OS might interpret as UTF-16 (Windows) or UTF-8 (macOS, Linux).

          For example: https://en.m.wikipedia.org/wiki/Program_Files#Localization

          • vardump 7 hours ago

            File paths are scary. Last I checked (which was admittedly a while ago), Windows didn't, for example, care about correct UTF-16 surrogate pairs at all; it'd happily accept invalid UTF-16 strings.

            So use standard string processing libraries on path names at your own peril.

            It's a good idea to consider file paths as a bag of bytes.

            • netsharc 7 hours ago

              IIRC, the FAT filesystem (before Windows 95) allowed lowercase letters, but there's a layer in the filesystem driver that converted everything to uppercase, e.g. if you did the command "more readme.txt", the more command would ask the filesystem for "readme.txt" and it would search for "README.TXT" in the file allocation table.

              I think I once hex-edited the FA-table to change a filename to have a lowercase name (or maybe it was disk corruption), trying to delete that file didn't work because it would be trying to delete "FOO", and couldn't find it because the file was named "FOo".

            • Someone 7 hours ago

              > It's a good idea to consider file paths as a bag of bytes

              (Nitpick: sequence of bytes)

              Also very limiting. If you do that, you can’t, for example, show a file name to the user as a string or easily use a shell to process data in your file system (do you type “/bin” or “\x2F\x62\x69\x6E”?)

              Unix, from the start, claimed file names were byte sequences, yet assumed many of those to encode ASCII.

              That’s part of why Plan 9 made the choice “names may contain any printable character (that is, any character outside hexadecimal 00-1F and 80-9F)” (https://9fans.github.io/plan9port/man/man9/intro.html)

            • daemin 7 hours ago

              That's what I mean, you treat filesystem paths as bags of bytes separated by known ASCII characters, as the only path manipulation that you generally need to do is to append a path, remove a path, change extension, things that only care about those ASCII characters. You only modify the path strings at those known characters and leave everything in between as is (with some exceptions using OS API specific functions as needed).

        • BoringTimesGang 7 hours ago

          Now double all of that effort, so you can get it to work with Windows' UTF-16 wstrings.

  • PhilipRoman 7 hours ago

    Thought this was going to be about and-not-ing bytes with 0x20. Wrong for most inputs but sure as hell faster than anything else.
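    For reference, that trick exploits ASCII putting the upper- and lowercase letters exactly 0x20 apart, and it looks something like this, including how it goes wrong:

    ```cpp
    // Clearing bit 0x20 "uppercases" an ASCII byte in a single AND --
    // but only lowercase letters are safe inputs; everything else is
    // silently mangled.
    constexpr unsigned char bit_upper(unsigned char c) {
        return static_cast<unsigned char>(c & ~0x20);
    }
    ```

    bit_upper('a') is 'A' and bit_upper('z') is 'Z', but bit_upper('{') is '[' and bit_upper('1') is the control character 0x11 -- hence "wrong for most inputs".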

  • cyxxon 7 hours ago

    Small nitpick: the example "LATIN SMALL LETTER SHARP S (“ß” U+00DF) uppercases to the two-character sequence “SS”:³ Straße ⇒ STRASSE" is slightly wrong, it seems to me, as we now do actually have a uppercase version of that, so it should uppercase to "Latin Capital Letter Sharp S" (U+1E9E). The double-S thing is still widely used, though.

    • mkayokay 7 hours ago

      Duden mentions this (translated from the German): "When capital letters are used, SS traditionally stands for ß. In some typefaces there is also a corresponding capital letter; its use is optional ‹§ 25 E3›."

      But isn't it also dependent on the glyphs available in the font used? So, e.g., it needs to be ensured that U+1E9E exists?

    • pjmlp 3 hours ago

      Lowercasing is even better, because a Swiss user would expect the two-character sequence “SS“ to be converted into “ss“ and not “ß“.

      And thus we add country specific locale to the party.

    • Muromec 6 hours ago

      But what if you need to uppercase a historical record in a vital records registry from the 1950s that was OCRed last week? Now you need to be not just locale-aware, but your locale should be versioned.

  • serbuvlad 7 hours ago

    The real insights here are that strings in C++ suck and UTF-16 is extremely unintuitive.

  • high_na_euv 4 hours ago

    In C++ basic things are hard

    • onemoresoop 3 hours ago

      It's subjective but I find C++ extremely ugly.

  • ahartmetz 7 hours ago

    ...and that is why you use QString if you are using the Qt framework. QString is a string class that actually does what you want when used in the obvious way. It probably helps that it was mostly created by people with "ASCII+" native languages. Or with customers that expect not exceedingly dumb behavior. The methods are called QString::toUpper() and QString::toLower() and take only the implicit "this" argument, unlike Win32 LCMapStringEx() which takes 5-8 arguments...

    • cannam 6 hours ago

      QString::toUpper/toLower are not locale-aware (https://doc.qt.io/qt-6/qstring.html#toLower)

      Qt does have a locale-aware equivalent (QLocale::toUpper/toLower) which calls out to ICU if available. Otherwise it falls back to the QString functions, so you have to be confident about how your build is configured. Whether it works or not has very little to do with the design of QString.

      • ahartmetz 5 hours ago

        I don't see a problem with that. You can have it done locale-aware or not and "not" seems like a sane default. QString will uppercase 'ü' to 'Ü' just fine without locale-awareness whereas std::string doesn't handle non-ASCII according to the article. The cases where locale matters are probably very rare and the result will probably be reasonable anyway.

    • vardump 7 hours ago

      You just want a banana, but you also get the gorilla. And the jungle.