A popular but wrong way to convert a string to uppercase or lowercase

(devblogs.microsoft.com)

58 points | by ingve 9 hours ago

70 comments

  • blenderob 8 hours ago

    It is issues like this due to which I gave up on C++. There are so many ways to do something and every way is freaking wrong!

    An acceptable solution is given at the end of the article:

    > If you use the International Components for Unicode (ICU) library, you can use u_strToUpper and u_strToLower.

    Makes you wonder why this isn't part of the C++ standard library itself. Every revision of the C++ standard brings with it more syntax and more complexity in the language. But as a user of C++ I don't need more syntax and more complexity in the language. I do need more standard library functions that solve these ordinary real-world programming problems.
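    A rough illustration of the ICU route (a sketch only: it assumes ICU4C is installed and linked, e.g. via pkg-config icu-uc; the helper name and the choice of German locale are mine, not the article's):

    ```cpp
    #include <unicode/locid.h>
    #include <unicode/unistr.h>
    #include <string>

    // Locale-aware uppercasing via ICU's UnicodeString.
    std::string icu_upper_german(const std::string& utf8) {
        icu::UnicodeString s = icu::UnicodeString::fromUTF8(utf8);
        s.toUpper(icu::Locale::getGerman());  // full case mapping: ß -> SS
        std::string out;
        s.toUTF8String(out);                  // convert back to UTF-8
        return out;
    }
    ```

    icu_upper_german("Straße") yields "STRASSE" — the kind of one-character-to-two mapping that the per-character standard library functions cannot express.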

    • bayindirh 7 hours ago

      I don't think it's a C++ problem. You just can't transform anything developed in "ancient" times into something Unicode-aware in one fell swoop.

      On the other hand, libicu is 37MB by itself, so it's not something someone can write in a weekend and ship.

      Any tool which is old enough will have a thousand ways to do something. This is the inevitability of software and programming languages. In the domain of C++, which is mammoth in size by now, everyone expects this huge pony to learn new tricks, but everybody has a different idea of the "new tricks", so more features are added on top of its already impressive and very long list of features and capabilities.

      You want libICU built-in? There must be other folks who want that too. So you may need to find them and work with them to make your dream a reality.

      So, C++ is doing fine. It's not that they omitted Unicode during the design phase. Unicode arrived later, and it has to be integrated by other means. This is what libraries are for.

      • pornel 7 hours ago

        Being developed in, and having to stay compatible with, ancient times is a real problem of C++.

        The now-invalid assumptions couldn't have been avoided 50 years ago. Fixing them now in C++ is difficult or impossible, but still, the end result is a ton of brokenness baked into C++.

        Languages developed in the 21st century typically have some at least half-decent Unicode support built-in. Unicode is big and complex, but there's a lot that a language can do to at least not silently destroy the encoding.

        • cm2187 6 hours ago

          That explains why there are two functions, one for ascii and one for unicode. That doesn't explain why the unicode functions are hard to use (per the article).

          • BoringTimesGang 6 hours ago

            Because human language is hard to boil down to a simple computing model and the problem is underdefined, based on naive assumptions.

            Or perhaps I should say naïve.

      • relaxing 7 hours ago

        It’s been 30 years. Unicode predates C++98. Java saw the writing on the wall. There’s no excuse.

        • bayindirh 7 hours ago

          > There’s no excuse.

          I politely disagree. None of the programming languages that started out integrating Unicode were targeting everything from bare metal to GUI, including embedded and OS development, at the same time.

          C++ covers a far greater target area than most other programming languages. There are widely used libraries that compile correctly on PDP-11s, even though they are updated constantly.

          You can't just say "I'll be just making everything Unicode aware, backwards compatibility be damned, eh".

          • blenderob 7 hours ago

            But we don't have to make everything Unicode aware. Backward compatibility is indeed very important in C++. Like you rightly said, it still has to work for PDP-11 without breaking anything.

            But the C++ overlords could always add a new type that is Unicode-aware. Converting one Unicode string to another is a purely in-memory, in-CPU operation. It does not need any I/O and it does not need any interaction with peripherals. So one can dream that such a type along with its conversion routines could be added to an updated standard library without breaking existing code that compiles correctly on PDP-11s.

            • bayindirh 7 hours ago

              > Converting one Unicode string to another is a purely in-memory, in-CPU operation.

              ...but it's a complex operation. This is what libICU is mostly for. You can't just look up a single table and convert one string to another the way you can with the ASCII table or any other simple encoding.

              German has its ß to SS (or capital ẞ, depending on the year), Turkish has its ı/I and i/İ pairs, and tons of other languages have other rules.

              Especially these I/ı and İ/i pairs break tons of applications in very unexpected ways. I don't remember how many bugs I've reported, and how many workarounds I have implemented in my systems.
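              The silent failure mode is easy to demonstrate. A byte-wise fold (sketched below; the helper name is mine) neither applies the Turkish rule nor touches the non-ASCII bytes:

              ```cpp
              #include <cctype>
              #include <string>

              // Naive byte-wise "lowercasing" of UTF-8. std::tolower only
              // knows single bytes, so it cannot apply Turkish casing rules,
              // and it passes any byte outside ASCII through unchanged.
              std::string byte_lower(std::string s) {
                  for (char& c : s)
                      c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
                  return s;
              }
              ```

              byte_lower("I") gives "i", which is wrong in Turkish (it should be the dotless ı), and byte_lower("İ") — UTF-8 bytes C4 B0 — comes back completely unchanged.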

              Adding a type is nice, but the surrounding machinery is so big that it brings tons of work with it. Unicode is such a complicated system, that I read that even you need two UTF-16 characters (4 bytes in total) to encode a single character. This is insane (as in complexity, I guess they have their reasons).

              • SAI_Peregrinus 3 hours ago

                > Unicode is such a complicated system, that I read that even you need two UTF-16 characters (4 bytes in total) to encode a single character. This is insane (as in complexity, I guess they have their reasons).

                Because there are more than 65,535 characters. That's just writing systems, not Unicode's fault. Most of the unnecessary complexity of Unicode is legacy compatibility: UTF-16 & UTF-32 are bad ideas that increase complexity, but they predate UTF-8 which actually works decently well so they get kept around for backwards compatibility. Likewise with the need for multiple normalization forms.
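                The surrogate-pair point can be checked directly with the built-in literal types, no library needed:

                ```cpp
                #include <string>

                // U+1F600 (an emoji) lies beyond U+FFFF, outside the Basic
                // Multilingual Plane, so UTF-16 must spend two 16-bit code
                // units (a surrogate pair, 4 bytes) on it, while UTF-32
                // spends a single 32-bit unit.
                const std::u16string kEmojiUtf16 = u"\U0001F600";  // 2 code units
                const std::u32string kEmojiUtf32 = U"\U0001F600";  // 1 code point
                ```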

                • bayindirh 3 hours ago

                  I mean, I already know some Unicode internals and linguistics (since I developed a language-specific compression algorithm back in the day), but I have never seen a single character requiring four bytes (and I know Emoji chaining for skin color, etc.).

                  So, seeing this just moved the complexity of Unicode one notch up in my head, and I respect the guys who designed and made it work. It was not whining or complaining of any sort. :)

              • 6 hours ago
                [deleted]
              • blenderob 6 hours ago

                Thanks for the reply! Really appreciate the time you have taken to write down a thoughtful reply.

    • Muromec 7 hours ago

      Well, the only time you can do str lower where Unicode locale awareness will be a problem is when you do it on user input, like names.

      How about you just don't? If it's a constant in your code, you probably use ASCII anyway or can do a static mapping. If it's user input -- just don't str lower / str upper it.

    • pistoleer 7 hours ago

      > There are so many ways to do something and every way is freaking wrong!

      That's life! The perfect way does not exist. The best you can do is be aware of the tradeoffs, and languages like C++ absolutely throw them in your face at every single opportunity. It's fatiguing, and writing in javascript or python allows us to uphold the facade that everything is okay and that we don't have to worry about a thing.

      • pornel 7 hours ago

        JS and Python are still old enough to have been created when Unicode was in its infancy, so they have their own share of problems from using UCS-2 (such as indexing strings by what is now a UTF-16 code unit, rather than by a codepoint or a grapheme cluster).

        Swift was developed in modern times, and it's able to tackle Unicode properly, e.g. it makes a distinction between codepoints and grapheme clusters, and it steers users away from random-access indexing and from having a single (incorrect) notion of string length.

      • 7 hours ago
        [deleted]
    • pjmlp 3 hours ago

      Because it is a fight to put anything into an ISO-managed language, and only the strongest persevere long enough to make it happen.

      Regardless of what ISO language we are talking about.

    • 7 hours ago
      [deleted]
    • BoringTimesGang 7 hours ago

      >It is issues like this due to which I gave up on C++. There are so many ways to do something and every way is freaking wrong!

      These are mostly unicode or linguistics problems.

      • tralarpa 7 hours ago

        The fact that the standard library works against you doesn't help (std::tolower takes an int, but only kind of works (sometimes) correctly on unsigned char, and wchar_t is implicitly promoted to int).

        • BoringTimesGang 7 hours ago

          tolower is in the std namespace but is actually just part of the C89 standard, meaning it predates both UTF-8 and UTF-16. Is the alternative that it should be made unusable, and more existing code broken? A modern user has to include one of the c-prefix headers to use it, which already hints to them that 'here be dragons'.

          But there are always dragons. It's strings. The mere assumption that they can be transformed int-by-int, irrespective of encoding, is wrong. As is the assumption that a sensible transformation to lower case without error handling exists.
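          One concrete dragon, for the record: std::tolower's int parameter must hold a value representable as unsigned char (or EOF), so even the "correct" ASCII-only call needs a cast (a sketch; the wrapper name is mine):

          ```cpp
          #include <cctype>
          #include <string>

          // Passing a plain char that is negative (any byte >= 0x80 on a
          // platform where char is signed) to std::tolower is undefined
          // behavior -- hence the detour through unsigned char.
          char ascii_tolower(char c) {
              return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
          }
          ```

          Even done by the book, this is correct for ASCII only; in the default "C" locale, bytes outside ASCII come back unchanged.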

  • appointment 8 hours ago

    The key takeaway here is that you can't correctly process a string if you don't know what language it's in. That includes variants of the same language with different rules, e.g. en-US and en-GB, or es-MX and es-ES.

    If you are handling multilingual text the locale is mandatory metadata.

    • zarzavat 7 hours ago

      Different parts of a string can be in different languages too[1].

      The lowercase of "DON'T FUSS ABOUT FUSSBALL" is "don't fuss about fußball". Unless you're in Switzerland.

      [1] https://en.wikipedia.org/wiki/Code-switching

      • schoen 7 hours ago

        Probably "don't fuss about Fußball" for the same reasons, right?

      • thiht 2 hours ago

        I thought the German language deprecated the use of ß years ago, no? I learned German for a year and that's what the teacher told us, but maybe it's not the whole story

        • 47282847 4 minutes ago

          Incorrect. ẞ is still a thing.

  • vardump 8 hours ago

    As always, Raymond is right. (And as usual, I could guess it was him before even clicking the link.)

    That said, 99% of the time when doing an upper- or lowercase operation you're interested in just the 7-bit ASCII range of characters.

    For the remaining 1%, there's ICU library. Just like Raymond Chen mentioned.

    • sebstefan 7 hours ago

      Yes please, keep making software that mangles my actual last name at every step of the way. 99% of the world loves it when you only care about the USA.

      • Muromec 6 hours ago

        If it needs to uppercase names it probably interfaces with something forsaken like Sabre/Amadeus that only understands ASCII anyway.

        The real problem is accepting non-ASCII input from user where you later assume it's ASCII-only and safe to bitfuck around.

        • sebstefan 6 hours ago

          From experience anything banking adjacent will usually fuck it up as well

          For some reason they have a hard-on for putting last names in capital letters and they still have systems in place that use ASCII

          • Muromec 3 hours ago

            If it uses ASCII anyway, what's the problem then? Don't accept non-ASCII user input.

            • sebstefan 3 hours ago

              First off: And exclude 70% of the world?

              Usually they'll accept it, but some parts of the backend are still running code from the 60's.

              So you get your name rendered properly on the web interface and in most core features, but one day you wander off the beaten path by, like, requesting some insurance contract, and you'll see your name at the top with some characters mangled, depending on what your name's like. Mine is just accented Latin characters, so it usually drops the accents; not sure how it would work if your name were in an entirely different alphabet.

        • InfamousRece 4 hours ago

          Some systems are still using EBCDIC.

    • fhars 8 hours ago

      No, when you are doing string manipulation, you are almost never interested in just the seven bit ASCII range, as there is almost no language that can be written using just that.

      • vardump 7 hours ago

        > as there is almost no language that can be written using just that.

        99% of use cases I've seen have nothing to do with human language.

        The 1% human-language case needs to be handled properly using a proper Unicode library.

        Your mileage (percentages) may vary depending on your job.

        • kergonath 7 hours ago

          Right. That’s why I still get mail with my name mangled and my street name barely recognisable. Because I’m in the 1%. Too bad for me…

          In all seriousness, though, in the real world ASCII works only for a subset of a handful of languages. The vast majority of the population does not read or write any English in their day to day lives. As far as end users are concerned, you should probably swap your percentages.

          ASCII is mostly fine within your programs like the parser you mention in your other comment. But even then, it’s better if a Chinese user name does not break your reporting or logging systems or your parser, so it’s still a good idea to take Unicode seriously. Otherwise, anything that comes from a user or gets out of the program needs to behave.

          • vardump 6 hours ago

            I said use a Unicode library if input data is actual human language. Which names and addresses are.

            99% case being ASCII data generated by other software of unknown provenance. (Or sometimes by humans, but it's still data for machines, not for humans.)

            • kergonath 4 hours ago

              I am really not sure about this 99%. A lot of programs deal with quite a lot of user-provided data, which you don’t control.

          • Muromec 6 hours ago

            Who and why still tries to lowercase/uppercase names? Please tell them to stop.

            • kergonath 4 hours ago

              Hell if I know. I don’t know what kind of abomination e-commerce websites run on their backend, I just see the consequences.

        • 9dev 7 hours ago

          It's funny how much software developers live in bubbles. Whether you deal with human language a lot or almost not at all depends entirely on your specific domain. Anyone working on user interfaces of any kind must account for proper encoding, for example; that includes pretty much every line-of-business app out there, which is a lot of code.

        • inexcf 7 hours ago

          Why do you need upper- or lowercase conversion in cases that have nothing to do with human language?

          • vardump 7 hours ago

            Here's an example. Hypothetically say you want to build an HTML parser.

            You might encounter tags like <html>, <HTML>, <Html>, etc., but you want to perform a hash table lookup.

            So first you're going to normalize to either lower- or uppercase.
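            A sketch of that normalization (helper names are mine): tag names are ASCII by definition, so a byte-wise fold is actually correct in this niche:

            ```cpp
            #include <cctype>
            #include <string>
            #include <unordered_set>

            // Fold an ASCII tag name to lowercase before the hash lookup.
            std::string fold_tag(std::string tag) {
                for (char& c : tag)
                    c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
                return tag;
            }

            bool is_known_tag(const std::string& raw) {
                // Tiny stand-in for a real tag table.
                static const std::unordered_set<std::string> kTags = {"html", "head", "body"};
                return kTags.count(fold_tag(raw)) != 0;
            }
            ```

            With this, "html", "HTML", and "Html" all hit the same table entry.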

            • inexcf 4 hours ago

              Ah, I see, we disagree on what is "human language". An abbreviation like HTML and its different capitalisations sound to me a lot like a feature of human language.

            • Muromec 6 hours ago

              But but, I want to have a custom web component and register it under my own name, which can only be properly written in Ukrainian Cyrillic. How dare you not let me have it.

        • elpocko 6 hours ago

          Every search feature everywhere has to be case-insensitive or it's unusable. Search seems like a pretty ubiquitous feature in a lot of software, and has to work regardless of locale/encoding.

      • daemin 7 hours ago

        I would argue that for most programs when you're doing string manipulation you're doing it for internal programming reasons - logs, error messages, etc. In that case you are in nearly full control of the strings and therefore can declare that you're only working with ASCII.

        The other normal cases of string usage are file paths and user interface, and the needed operations can be done with simple string functions; even in UTF-8 encoding the characters you care about are in the ASCII range. With file paths, the manipulations you're most often doing are path-based, so you only care about the '/', '\', ':', and '.' ASCII characters. With user interface elements you're likely to be using them as just static data and only substituting values into placeholders when necessary.

        • pistoleer 7 hours ago

          > I would argue that for most programs when you're doing string manipulation you're doing it for internal programming reasons - logs, error messages, etc. In that case you are in nearly full control of the strings and therefore can declare that you're only working with ASCII.

          Why would you argue that? In my experience it's about formatting things that are addressed to the user, where the hardest and most annoying localization problems matter a lot. That includes sorting the last name "van den Berg" just after "Bakker", stylizing it as "Berg, van den", and making sure the capitalization is correct and not "Van Den Berg". There is no built-in standard library function in any language that does any of that. It's so much larger than ASCII and even larger than Unicode.

          Another user said that the main takeaway is that you can't process strings until you know their language (locale), and that is exactly correct.

          • daemin 6 hours ago

            I would maintain that your program has more string manipulation for error messages and logging than for generating localised formatted names.

            Further I do say that if you're creating text for presenting to the user then the most common operation would be replacement of some field in pre-defined text.

            In your case I would design it so that the correctly capitalised first name, surname, and variations of those for sorting would be generated at the data entry point (manually or automatically) and then just used when needed in user facing text generation. Therefore the only string operation needed would be replacement of placeholders like the fmt and standard library provide. This uses more memory and storage but these are cheaper now.
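            That design might be sketched like this (field and function names are mine, not from the comment): the display form and a precomputed sort key travel together, so no case mapping or locale logic runs at sort or render time:

            ```cpp
            #include <algorithm>
            #include <string>
            #include <vector>

            // Display form and collation key are produced once, at data
            // entry, with whatever locale rules apply; later code only
            // compares the precomputed keys.
            struct PersonName {
                std::string display;   // e.g. "van den Berg", shown verbatim
                std::string sort_key;  // e.g. "berg, van den", precomputed upstream
            };

            void sort_names(std::vector<PersonName>& people) {
                std::sort(people.begin(), people.end(),
                          [](const PersonName& a, const PersonName& b) {
                              return a.sort_key < b.sort_key;
                          });
            }
            ```

            With keys "bakker" and "berg, van den", sorting puts "Bakker" just before "van den Berg", as a Dutch phone book expects.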

            • pistoleer 5 hours ago

              I agree, but the logging formatters don't really do much beyond trivially pasting in placeholders.

              And as for data entry... Maybe in an ideal world. In the current world, marred by importing previously mangled datasets, a common solution in the few companies I've worked at is to just not do anything, which leaves ugly edges, yet is "good enough".

        • heisenzombie 7 hours ago

          File paths? I think filesystem paths are generally “bags of bytes” that the OS might interpret as UTF-16 (Windows) or UTF-8 (macOS, Linux).

          For example: https://en.m.wikipedia.org/wiki/Program_Files#Localization

          • vardump 7 hours ago

            File paths are scary. Last I checked (which was admittedly a while ago), Windows didn't, for example, care about correct UTF-16 surrogate pairs at all; it'd happily accept invalid UTF-16 strings.

            So use standard string processing libraries on path names at your own peril.

            It's a good idea to consider file paths as a bag of bytes.

            • netsharc 7 hours ago

              IIRC, the FAT filesystem (before Windows 95) allowed lowercase letters, but there's a layer in the filesystem driver that converted everything to uppercase, e.g. if you did the command "more readme.txt", the more command would ask the filesystem for "readme.txt" and it would search for "README.TXT" in the file allocation table.

              I think I once hex-edited the FA-table to change a filename to have a lowercase name (or maybe it was disk corruption), trying to delete that file didn't work because it would be trying to delete "FOO", and couldn't find it because the file was named "FOo".

            • Someone 7 hours ago

              > It's a good idea to consider file paths as a bag of bytes

              (Nitpick: sequence of bytes)

              Also very limiting. If you do that, you can’t, for example, show a file name to the user as a string or easily use a shell to process data in your file system (do you type “/bin” or “\x2F\x62\x69\x6E”?)

              Unix, from the start, claimed file names were byte sequences, yet assumed many of those to encode ASCII.

              That’s part of why Plan 9 made the choice “names may contain any printable character (that is, any character outside hexadecimal 00-1F and 80-9F)” (https://9fans.github.io/plan9port/man/man9/intro.html)

            • daemin 7 hours ago

              That's what I mean, you treat filesystem paths as bags of bytes separated by known ASCII characters, as the only path manipulation that you generally need to do is to append a path, remove a path, change extension, things that only care about those ASCII characters. You only modify the path strings at those known characters and leave everything in between as is (with some exceptions using OS API specific functions as needed).

        • BoringTimesGang 7 hours ago

          Now double all of that effort, so you can get it to work with Windows' UTF-16 wstrings.

  • PhilipRoman 7 hours ago

    Thought this was going to be about and-not-ing bytes with 0x20. Wrong for most inputs but sure as hell faster than anything else.
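    For reference, that trick exploits ASCII putting the upper- and lowercase letters exactly 0x20 apart, and it looks something like this, including how it goes wrong:

    ```cpp
    // Clearing bit 0x20 "uppercases" an ASCII byte in a single AND --
    // but only lowercase letters are safe inputs; everything else is
    // silently mangled.
    constexpr unsigned char bit_upper(unsigned char c) {
        return static_cast<unsigned char>(c & ~0x20);
    }
    ```

    bit_upper('a') is 'A' and bit_upper('z') is 'Z', but bit_upper('{') is '[' and bit_upper('1') is the control character 0x11 -- hence "wrong for most inputs".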

  • cyxxon 7 hours ago

    Small nitpick: the example "LATIN SMALL LETTER SHARP S (“ß” U+00DF) uppercases to the two-character sequence “SS”:³ Straße ⇒ STRASSE" is slightly wrong, it seems to me, as we now do actually have a uppercase version of that, so it should uppercase to "Latin Capital Letter Sharp S" (U+1E9E). The double-S thing is still widely used, though.

    • mkayokay 7 hours ago

      Duden mentions this (translated from the German): "When capital letters are used, SS traditionally stands for ß. In some typefaces there is also a corresponding capital letter; its use is optional ‹§ 25 E3›."

      But isn't it also dependent on the glyphs available in the font used? So, e.g., it needs to be ensured that U+1E9E exists?

    • pjmlp 3 hours ago

      Lowercasing is even better, because a Swiss user would expect the two-character sequence “SS“ to be converted into “ss“ and not “ß“.

      And thus we add country specific locale to the party.

    • Muromec 6 hours ago

      But what if you need to uppercase a historical record in a vital records registry from the 1950s that was OCRed last week? Now you need to be not just locale-aware, but your locale should be versioned.

  • serbuvlad 7 hours ago

    The real insights here are that strings in C++ suck and UTF-16 is extremely unintuitive.

  • high_na_euv 4 hours ago

    In C++ basic things are hard

    • onemoresoop 3 hours ago

      It's subjective but I find C++ extremely ugly.

  • ahartmetz 7 hours ago

    ...and that is why you use QString if you are using the Qt framework. QString is a string class that actually does what you want when used in the obvious way. It probably helps that it was mostly created by people with "ASCII+" native languages. Or with customers that expect not exceedingly dumb behavior. The methods are called QString::toUpper() and QString::toLower() and take only the implicit "this" argument, unlike Win32 LCMapStringEx() which takes 5-8 arguments...

    • cannam 6 hours ago

      QString::toUpper/toLower are not locale-aware (https://doc.qt.io/qt-6/qstring.html#toLower)

      Qt does have a locale-aware equivalent (QLocale::toUpper/toLower) which calls out to ICU if available. Otherwise it falls back to the QString functions, so you have to be confident about how your build is configured. Whether it works or not has very little to do with the design of QString.

      • ahartmetz 5 hours ago

        I don't see a problem with that. You can have it done locale-aware or not and "not" seems like a sane default. QString will uppercase 'ü' to 'Ü' just fine without locale-awareness whereas std::string doesn't handle non-ASCII according to the article. The cases where locale matters are probably very rare and the result will probably be reasonable anyway.

    • vardump 7 hours ago

      You just want a banana, but you also get the gorilla. And the jungle.