C# strings silently kill your SQL Server indexes in Dapper

(consultwithgriff.com)

124 points | by PretzelFisch 3 days ago ago

128 comments

  • wvenable 3 days ago ago

    This really doesn't have anything to do with C#. This is your classic nvarchar vs varchar issue (or unicode vs ASCII). The same thing happens if you mix collations.

    I'm not sure why anyone would choose varchar for a column in 2026 unless you have some sort of ancient backwards-compatibility situation.

    • dspillett 3 days ago ago

      > I'm not sure why anyone would choose varchar for a column in 2026

      The same string takes roughly half the storage space, meaning more rows per page, a smaller working set needed in memory for the same queries, and less IO. Any indexes on those columns will be similarly smaller. So if you are storing things that you know won't break out of the standard ASCII set⁰, stick with [VAR]CHARs¹, otherwise use N[VAR]CHARs.
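
      The size arithmetic can be sanity-checked outside the database. A minimal sketch in Python (illustrative only; latin-1 and utf-16-le stand in for SQL Server's single-byte VARCHAR and UTF-16 NVARCHAR storage):

```python
# VARCHAR stores one byte per character; NVARCHAR stores UTF-16,
# i.e. two bytes per BMP character -- so ASCII-only data, and any
# index built on it, roughly doubles in size.
code_value = "ORD-2024-00017"  # a typical ASCII-only business code

varchar_bytes = len(code_value.encode("latin-1"))     # ~VARCHAR payload
nvarchar_bytes = len(code_value.encode("utf-16-le"))  # ~NVARCHAR payload

print(varchar_bytes, nvarchar_bytes)  # 14 28
```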

      Of course if you can guarantee that your stuff will be used on recent enough SQL Server versions that are configured to support UTF8 collations, then default to that instead unless you expect data in a character set where that might increase the data size over UTF16. You'll get the same size benefit for pure ASCII without losing wider character set support.

      Furthermore, if you are using row or page compression it doesn't really matter: your wide-character strings will effectively be UTF8 encoded anyway. But be aware that there is a CPU hit for processing compressed rows and pages on every access, because they remain compressed in memory as well as on-disk.

      --------

      [0] Codes with fixed ranges, etc.

      [1] Some would put it the other way around, “use NVARCHAR if you think there might be any non-ASCII characters”, but defaulting to NVARCHAR and moving to VARCHAR only if you are confident is the safer approach IMO.

      • gfody 2 days ago ago

        utf16 is more efficient if you have non-english text, utf8 wastes space with longer multi-byte sequences. but the real reason to always use nvarchar is that it remains sargable when varchar parameters are implicitly cast to nvarchar.

        • tialaramex 2 days ago ago

          UTF-16 is maybe better if your text is mostly made of codepoints which need 3 UTF-8 code units but only one (thus 2 bytes) UTF-16 code unit. This is extremely rare for general text and so you definitely shouldn't begin by assuming UTF-16 is a good choice without having collected actual data.

        • downsplat 2 days ago ago

          The old defense of 16-bit chars, popping up in 2026 still! Utf8 is efficient enough for all general purpose uses.

          If you're storing gigabytes of non-latin-alphabet text, and your systems are constrained enough that it makes a difference, 16-bit is always there. But I'd still recommend anyone starting a system today to not worry and use utf8 for everything.

        • exceptione 2 days ago ago

          What do you mean by non-english text? I don't think "Ä" will be more efficient in utf16 than in utf8. Or do you mean utf16 wins for non-latin scripts with variable width? I always had the impression that utf8 wins on the vast majority of symbols, and that for very complex variable-width character sets it depends on the width whether utf16 can accommodate them. On a tangent, I wonder if emojis would fit that bill too..

          • Tuna-Fish 2 days ago ago

            Japanese, Chinese, Korean and Indic scripts are mostly 2 bytes per character on UTF-16 and mostly 3 bytes per character in UTF-8.
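
            The claim is easy to verify (Python, illustrative; every character in the sample is in the U+0800 to U+FFFF range):

```python
# A Japanese sample: 3 bytes per character in UTF-8,
# but a single 2-byte code unit per character in UTF-16.
s = "日本語のテキスト"  # 8 characters

print(len(s.encode("utf-8")))      # 24
print(len(s.encode("utf-16-le")))  # 16
```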

            • divingdragon 2 days ago ago

              Really, as an East Asian language user the rest of the comments here make me want to scream.

              • exceptione 2 days ago ago

                I am not sure if you mean me, as I just asked a question. I wonder what the best way is to handle this disparity for international software. It seems like either you punish the Latin alphabets, or the others.

                • gfody a day ago ago

                  > I wonder what the best way is to handle this disparity for international software. It seems like either you punish the Latin alphabets, or the others.

                  there are over a million codepoints in unicode, thousands for latin and other language-agnostic symbols, emojis, etc. utf-8 is designed to be backwards compatible with ascii, not to efficiently encode all of unicode. utf-16 is the reasonably efficient compromise for native unicode applications, hence it being the internal format of strings in C# and sql server and such.

                  the folks bleating about utf-8 being the best choice make the same mistake as the "utf-8 everywhere manifesto" guys: stats skewed by a web/american-centric bias. sure, utf-8 is more efficient when your text is 99% markup and generally devoid of non-latin scripts, but that's not my database and probably not most people's

                  • exceptione a day ago ago

                      > sure utf-8 is more efficient when your text is 99% markup and generally devoid of non-latin scripts, that's not my database and probably not most peoples
                    
                    I think this website's audience begs to differ. But if you develop for S. Asia, I can see the pendulum swinging to utf-16. But even then you have to account for this:

                      «UTF-16 is often claimed to be more space-efficient than UTF-8 for East Asian languages, since it uses two bytes for characters that take 3 bytes in UTF-8. Since real text contains many spaces, numbers, punctuation, markup (for e.g. web pages), and control characters, which take only one byte in UTF-8, this is only true for artificially constructed dense blocks of text. A more serious claim can be made for Devanagari and Bengali, which use multi-letter words and all the letters take 3 bytes in UTF-8 and only 2 in UTF-16.»¹
                    
                    In the same vein, with reference to³:

                      «The code points U+0800–U+FFFF take 3 bytes in UTF-8 but only 2 in UTF-16. This led to the idea that text in Chinese and other languages would take more space in UTF-8. However, text is only larger if there are more of these code points than 1-byte ASCII code points, and this rarely happens in real-world documents due to spaces, newlines, digits, punctuation, English words, and markup.»²
                    
                    
                    The .net ecosystem isn't happy with utf-16 being the default, but it is there in .net and Windows for historical reasons.

                      «Microsoft has stated that "UTF-16 [..] is a unique burden that Windows places on code that targets multiple platforms"»¹
                    
                    
                    ___

                    1. https://en.wikipedia.org/wiki/UTF-16#Efficiency

                    2. https://en.wikipedia.org/wiki/UTF-8#Comparison_to_UTF-16

                    3. https://kitugenz.com/

                    • gfody 18 hours ago ago

                      the talk page behind the utf-16 wiki is actually quite interesting. it seems the manifesto guys tried to push their agenda there, and the allusions to "real text" with missing citations are a remnant of that. obv there's no such thing as "real text" and the statements about it containing many spaces and punctuation are nonsense (many languages do not delimit words with spaces, plenty of text is not mostly markup, and so on..)

                      despite the frothing horde of web developers desperate to consider utf-16 harmful, it's still a fact that the consortium optimized unicode for 16 bits (https://www.unicode.org/notes/tn12) and their initial guidance, to use utf-8 for compatibility and portability (like on the web) and utf-16 for efficiency and processing (like in a database, or in memory), is still sound.

                      • exceptione 9 hours ago ago

                        Interesting link! It shows its age though (22 years), as it makes the point that utf-16 is already the "most dominant processing format". But if that were the deciding factor, then utf-8 would be today's recommendation, as utf-8 is now the default for online data exchange and storage; all my software assumes utf-8 as the default as well. But I can't speak for people living and trading in places like S. Asia, like you.

                        If one develops for clients requiring a varying set of textual scripts, one could sidestep an ideological discussion and just make an educated guess about the ratio of utf-8 vs utf-16 penalties. That should not be complicated; sometimes utf-8 would require one more byte than utf-16 would, sometimes it's the other way around.
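
                        One way to make that educated guess concrete (illustrative Python; the sample strings are made up):

```python
def utf8_to_utf16_ratio(sample: str) -> float:
    """UTF-8 size as a fraction of UTF-16 size for a text sample."""
    return len(sample.encode("utf-8")) / len(sample.encode("utf-16-le"))

# Dense CJK text: UTF-16 wins (ratio > 1).
print(utf8_to_utf16_ratio("日本語のテキスト"))  # 1.5

# The same text wrapped in ASCII markup: UTF-8 wins (ratio < 1).
print(utf8_to_utf16_ratio('<p class="note">日本語のテキスト</p>'))
```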

              • gfody 2 days ago ago

                hn often makes me want to scream

        • SigmundA 2 days ago ago

          The non-sargability is an optimizer deficiency IMO. It could attempt the cast just like this article does manually in code: if it succeeds, use the index; if it fails, fall back to scanning and casting a million times the other way.

          • gfody 2 days ago ago

            implicit casts should only widen, to avoid quiet information loss. if the optimizer behaved as you suggest, the query could return incorrect results, and potentially more rows than expected, with even worse consequences

            • SigmundA 2 days ago ago

              It should not return incorrect results: if the nvarchar only contains ascii it will cast perfectly, and if it doesn't, take the slow scan path. It's a simple check, and the same work it's doing for every row in the current behavior, except done once and more restricted. Can you give me an example of an incorrect result here?

              I am not talking about the default cast behavior from nvarchar to varchar, but a specific narrow check the optimizer can use to decide in the plan whether the parameter is ascii or not, with no information loss, because it will do the same thing as before if the parameter does not pass the one-time check.

              By far the most common cause of this situation is using ascii-only data in an nvarchar because, as in this example, the client language uses an nvarchar equivalent for all strings, which is pretty much universal nowadays, and that is the default conversion when using a sql client library. One must remember to explicitly cast, rather than the db doing it for you, which is the expected behavior and the source of much confusion.

              This would be purely an optimization: a fast-path check, otherwise fall back to the current slow path. Correct results always, with much faster results if only ascii is present in the string.
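
              The proposed check could be sketched like this (hypothetical pseudologic in Python; choose_plan is a made-up name, and this is not what SQL Server's optimizer actually does today):

```python
# Hypothetical sketch of the fast-path decision described above --
# NOT actual SQL Server optimizer behavior.
def choose_plan(nvarchar_param: str) -> str:
    if nvarchar_param.isascii():
        # Every ASCII string narrows to varchar losslessly, so the
        # parameter can be cast once and the varchar index seeked.
        return "seek"
    # Otherwise keep today's slow path: widen every row to
    # nvarchar and scan.
    return "scan"

print(choose_plan("NY"))     # seek
print(choose_plan("Ñandú"))  # scan
```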

    • jklowden 4 hours ago ago

      I’m not sure why the top-rated reply begins by presuming anything about the problem domain. Many domains have a specified language and implied if not explicit collation. Rejecting characters outside that domain is part of the job. There are no emojis listed on the NASDAQ.

    • beart 3 days ago ago

      I agree with your first point. I've seen this same issue crop up in several other ORMs.

      As to your second point: VARCHAR uses N + 2 bytes whereas NVARCHAR uses N*2 + 2 bytes for storage (at least on SQL Server). The vast majority of character fields in databases I've worked with do not need to store unicode values.

      • wvenable 3 days ago ago

        > The vast majority of character fields in databases I've worked with do not need to store unicode values.

        This has not been my experience at all. Exactly the opposite, in fact. ASCII is dead.

        • SigmundA 3 days ago ago

          Vast majority of text fields I see are coded values that are perfectly fine using ascii, but I deal mostly with English language systems.

          Text fields that users can type into directly especially multiline tend to need unicode but they are far fewer.

          • simonask 3 days ago ago

            English has plenty of Unicode — claiming otherwise is such a cliché…

            Unicode is a requirement everywhere human language is used, from Earth to the Boötes Void.

            • SigmundA 3 days ago ago

              I am talking about coded values, like Status = 'A', 'B' or 'C'

              Taking double the space for this stuff is a waste of resources, and nobody usually cares about extended characters here, in English-language systems at least; they just want something more readable than integers when querying and debugging the data. End users will see longer descriptions joined from code tables or from app caches, which can have unicode.

              • wvenable 2 days ago ago

                It's way better to just use a DBMS that supports enums. I know SQL server isn't one of those but I still don't store my coded values as strings.

                • SigmundA 2 days ago ago

                  How do you store them? Also, enums are not normally user-configurable. It would be a good feature to have them, but they don't work well in many cases.

                  Typical code tables with code, description and anything else needed for that value which the user can configure in the app.

                  Sure, you can use integers instead of codes, but now all your results look like 1, 2, 3, 4 for all your coded columns when trying to debug or write ad-hoc stuff. Also, ints are not variable length, so you're wasting space for short codes, and you have to know ahead of time whether it's going to be 1, 2, 4 or 8 bytes.

                  • wvenable 2 days ago ago

                    Enums are for non user-configurable values.

                    For configurable values, obviously you use a table. But those should have an auto-integer primary key and if you need the description, join for it.

                    Ints are by far the more efficient way to store and query these values -- the length of the string is stored as an int, and variable-length values really complicate storage and access. If you think strings save space or time, that is not right.

                    • SigmundA 2 days ago ago

                      >Enums are for non user-configurable values

                      In the systems I work with most coded values are user configurable.

                      >But those should have an auto-integer primary key and if you need the description, join for it.

                      Not ergonomic now when querying data or debugging things like postal state are 11 instead of 'NY'

                      select * from addresses where state = 11, no thanks.

                      Your whole result set becomes a bunch of ints that can be easily transposed, causing silly errors. Of course I have seen systems that use guids to avoid collision; boy is that fun. Just use varchar, or char if you're penny-pinching and OK with fixed sizes.

                      >the length of the string is stored as an int

                      No, it's stored as a smallint: 2 bytes. So a single-character code is 3 bytes rather than a 4-byte int, and 2 chars is the same as an int. They do not complicate storage or access in any meaningful way.

                      You could use smallint or tinyint for your primary key and I could use char(2) and char(1) and get readable codes if I wanted to really save space.

                      • wvenable 2 days ago ago

                        > They do not complicate storage access in any meaningful way.

                        Sure they do, because now your row / index is variable length rather than fixed length. Way more complicated. Even 3 bytes is way more complicated to deal with than 4 bytes.

                        > select * from addresses where state = 11, no thanks.

                        I will agree that isn't fun. Is it still the trade-off I make? Absolutely. And it's not really that big of a problem; I just do a join. It also helps prevent people from using codes instead of querying the database for the correct value -- what's the point of user-configuration if someone hard-codes 'NY' in a query or in the code?

                        • SigmundA 2 days ago ago

                          >Sure they do, because now your row / index is variable length rather than fixed length. Way more complicated.

                          Come on, it's literally a 2-byte-per-column header in the row, so it just sums the column lengths to get the offset; it does the same thing for fixed-length columns except it gets the column length from the schema.

                          It's not much more complicated than a fixed-length column; the only difference is that the column length is stored in the row vs the schema. I am not sure where you are getting the idea it's way more complicated, nor the 3-vs-4-byte thing. The whole row is a variable-length structure and designed as such; null values change the row length for fixed or variable data types and have to be accounted for, since a null takes up no space in the column data, only in the null bitmap.

                          > what's the point of user-configuration of someone hard-codes 'NY' in a query or in the code

                          Because it doesn't matter: 'NY' isn't changing, just like 11 the int wouldn't change, but 'NY' is way easier to understand, catch mistakes with, and search for in code without hitting a bunch of nonsense, and to distinguish when 10 columns are all coded next to each other in a result set.

                          I prefer my rows to be a little more readable than 1234, 1, 11, 2, 15, 1, 3, and the users do too.

                          I have had my fill of transposition bugs where someone accidentally uses the wrong int on a pk id from a different table and still gets a valid but random result that passes a foreign key check; almost enough for me to want to use guids for pks. At least with coded values it is easier to spot, because even with single-character codes people tend to pick things that make sense for the column's values, you know, 'P' for pending, 'C' for complete, etc., vs 1 2 3 4 used over and over across every different column with an auto increment.

                          • wvenable 2 days ago ago

                            > Come on its literally...

                            You're the one saying a 2-character string is somehow a space savings. If we're going to split hairs that finely, then you have to know that any row with a variable-length string makes the entire row/index variable length, and that is a net storage and performance loss. It's worse in every way than a simple integer. I will admit that it ultimately doesn't matter. But I'd also argue using an nvarchar in place of varchar for this also doesn't matter. It's not just premature optimization, it's practically useless optimization.

                            > Because it doesn't matter, 'NY' isn't changing just like 11 the int wouldn't change, but 'NY

                            That's not what happens; what happens is that somebody renames New York to New Eburacum, and now your code doesn't match the value and it just adds more confusion.

                            But I'll grant you that it's totally fine. It's even more fine if you don't use varchar and instead use char(x).

                            • SigmundA a day ago ago

                              >You're the one saying a 2 character string is somehow a space savings. If we're going to split hairs that finely then you have to know that any row with a variable length string makes the entire row/index variable length and that is a net storage and performance loss.

                              The row is always a variable-length structure: it has flags noting how many columns have values and whether there is a variable-length section or not. Only rows with no variable-length fields at all have no variable-length section, and that is a bit-flag check in the header.

                              You are making a non-argument: variable-length fields can be a space savings over an int with single-char codes, which is very common, and do not impact performance in any meaningful way. Besides, one could use fixed-length chars and still get the other benefits I mentioned while having exactly the same space usage and processing as fixed-length ints.

                              >That's not what happens but what happens is that somebody renames New York to New Eburacum

                              Changing the descriptive meaning of an entry causes all sorts of problems, even more so if it is an int, because it's completely opaque; it's much harder to see an issue in the system when everything is a bunch of ints that do not correlate in any way to their meaning.

                              Changing the description to something that has the same meaning worded differently is usually not an issue and still gives good debug visibility to the value. If you and your users consider New Eburacum synonymous with New York, then having the code stay 'NY' should not be an issue and still be obvious when querying the data.

                              Unless someone is using the short code in a user visible way and it has to be updated. State is a common one that does this and nobody is changing state names or codes because it is a common settled natural key.

                              In the rare situation this actually needed to be done, one can update the existing data; this is not an issue in practice. You have to be extremely cautious updating the description of a code, because much data was entered under the previous description and the meaning it carries. Having the code carry some human meaning makes it more obvious to maintainers that this should be done with care; many times it would involve deprecating the old one and making a new one with a different code, because they have different meanings. Having a table instead of an enum allows other columns to hold this metadata.

                              This is not the same issue as say using a SSN for a person ID.

                      • simonask 2 days ago ago

                        Please take literally one course.

                        Do NOT use mnemonics as primary keys. It WILL bite you.

                        • SigmundA 2 days ago ago

                          https://en.wikipedia.org/wiki/Natural_key you should have learned this in your courses.

                          Calm down, I am not suggesting using this for actual domain entity keys; these are used in place of enums and have their advantages. I have been doing this a long time and it has not bitten me, and I have seen many other systems designed this way working just fine.

                          Using an incrementing surrogate key for, say, postal state code serves no purpose other than making things harder to use and debug. Most systems have many coded values like this, and using surrogate keys would lead to a bunch of overlapping, hard-to-distinguish int data that causes all sorts of issues.

                • andy81 2 days ago ago

                  The way to do enums in SQL (generally, not just MSSQL) is another table. It's better that they don't offer several ways to do the same thing.

                  • SigmundA 2 days ago ago

                    Mostly agree; separate tables can have multiple attributes besides a text description and can easily be exposed to the application for modification, so users or administrators can add and modify codes.

                    A common extra attribute for a coded value is something for deprecation / soft delete, so that it can be marked as no longer valid for future data while existing data keeps that code; also date ranges it's valid for, parent/child code relationships, etc.

                    Enums would be a good feature, but they have a much more limited use case: static values you know ahead of time that will have no other attributes, where values cannot be removed even if never used, or old data migrated to new values.

                    Common real world codes like US postal state can take advantage of there being agreed upon codes such as 'NY' and 'New York'.

                  • sgarland 2 days ago ago

                    While I generally would prefer lookup tables, it's much easier to sell dev teams on "it looks and acts like a string - you don't have to change anything."

              • 2 days ago ago
                [deleted]
              • kstrauser 2 days ago ago

                Those are all single byte characters in UTF-8.

                • SigmundA 2 days ago ago

                  We are talking nvarchar here; yes, UTF-8 solves this issue completely, and MSSQL supports it nowadays with varchar.

                • croes 2 days ago ago

                  But nvarchar is UTF-16

                • simonask 2 days ago ago

                  No. Look closer.

            • Slothrop99 2 days ago ago

              Just to be pedantic, those characters are in 'ANSI'/CP1252 and would be fine in a varchar on many systems.

              Not that I disagree — Win32/C#/Java/etc have 16-bit characters, your entire system is already 'paying the price', so weird to get frugal here.

              • simonask 2 days ago ago

                My comment contains two glyphs that are not in CP1252.

            • zabzonk 2 days ago ago

              > Unicode is a requirement everywhere human language is used

              Strange then how it was not a requirement for many, many years.

              • Macha 2 days ago ago

                Oh, it was. It was fun being unable to type a euro sign or the name Seán without it being garbled. Neither were matched quotation marks, and arguably computer limitations killed off naïve and café too.

                Don’t confuse people groaning and putting up with limitations as justifying those limitations.

              • pjmlp a day ago ago

                In Portugal it always was, that is why we got to use eh for é, ah for á, he for è, c, for ç and many other tricks.

                Shared by other European languages, like oe for ö in German, kalimera for καλημέρα, and so on all around the world in non-English-speaking countries during the early days of computing.

              • procaryote 2 days ago ago

                It was a mess back then though. Unicode fixed that.

                • zabzonk 2 days ago ago

                  I'm not convinced that Unicode fixed anything. I was kind of hoping, way back when, that everyone would adopt ASCII, as a step to a more united world. But things seem to have got more differentiated, and made things much more difficult.

                  • procaryote 20 hours ago ago

                    The options were never ASCII or unicode though. Before unicode we had ASCII + lots of different incompatible encodings that relied on metadata to be properly rendered. That's what unicode fixed

                    Besides I like being able to put things like →, €, ∞ or ∆ into text. With ascii a lot of things that are nowadays trivial would need markup languages

                  • simonask 2 days ago ago

                    For whom? Certainly not any of the humans trying to use the computer.

              • paulddraper a day ago ago

                Or rather, computers had inadequate support.

            • NegativeLatency 3 days ago ago

              Also less awkward to make it right the first time, instead of explaining why someone can’t type their name or an emoji

              • SigmundA 3 days ago ago

                Specifically not talking about a name field

          • psidebot 3 days ago ago

            Some examples of coded fields that may be known to be ascii: order name, department code, business title, cost center, location id, preferred language, account type…

      • SigmundA 3 days ago ago

        To complicate matters SQL Server can do Nvarchar compression, but they should have just done UTF-8 long ago:

        https://learn.microsoft.com/en-us/sql/relational-databases/d...

        Also UTF-8 is actually just a varchar collation so you don't use nvarchar with that, lol?

      • _3u10 3 days ago ago

        Generally if it stores user input it needs to support Unicode. That said UTF-8 is probably a way better choice than UTF-16/UCS-2

        • SigmundA 3 days ago ago

          UTF-8 is a relatively new thing in MSSQL and had lots of issues initially, I agree it's better and should have been implemented in the product long ago.

          I have avoided it and have not followed whether the issues are fully resolved; I would hope they are.

          • kstrauser 3 days ago ago

            > UTF-8 is a relatively new thing in MSSQL and had lots of issues initially, I agree it's better and should have been implemented in the product long ago.

            Their insistence on making the rest of the world go along with their obsolete pet scheme would be annoying if I ever had to use their stuff for anything ever. UTF-8 was conceived in 1992, and here we are in 2026 with a reasonably popular database still considering it the new thing.

            • da_chicken 3 days ago ago

              I would be more critical of Microsoft choosing to support UCS-2/UTF-16 if Microsoft hadn't completed their implementation of Unicode support in the 90s and then been pretty consistent with it.

              Meanwhile Linux had a years long blowout in the early 2000s over switching to UTF-8 from Latin-1. And you can still encounter Linux programs that choke on UTF-8 text files or multi-byte characters 30 years later (`tr` being the one I can think of offhand). AFAIK, a shebang is still incompatible with a UTF-8 byte order mark. Yes, the UTF-8 BOM is both optional and unnecessary, but it's also explicitly allowed by the spec.

            • downsplat 2 days ago ago

              It's not really a Linux vs MS thing though. When Unicode first came out, it was 16-bit, so all the early adopters went with that. That includes Java, Windows, JavaScript, the ICU libraries, LibreOffice and its predecessors, .NET, the C language (remember wchar_t?), and probably a few more.

              Utf8 turned out to be the better approach, and it's slowly taking over, but it was not only Linux/Unix that pushed it ahead; the entire networking world did, especially http. Props also to early perl for jumping straight to utf8.

              Still... Utf8's superiority was clear enough by 2005 or so; MS could and should have seen it by then instead of waiting until 2019 to add utf8 collations to its database. Funny to see Sql Server falling behind good old Mysql on such a basic feature.

              • wvenable 2 days ago ago

                Database systems are inherently conservative -- once you add something you have to support it forever. Microsoft went hog wild on XML in the database and I haven't seen it used in over a decade now.

            • recursive 3 days ago ago

              In 92 it was a conference talk. In 98 it was adopted by the IETF. Point probably stands though.

            • swasheck 3 days ago ago

              the data types were introduced with SQL Server 7 (1998) so i’m not sure it’s accurate to state that it’s considered as the new thing.

        • Dwedit 2 days ago ago

          The one place UTF-16 massively wins is text that would be two bytes as UTF-16, but three bytes as UTF-8. That's mainly Chinese, Japanese, Korean, etc...

    • croes 2 days ago ago

      Since MS SQL Server 2019, varchar supports Unicode via UTF-8 collations, so now it's the opposite: you use nvarchar instead of varchar for backwards-compatibility reasons.

    • paulsutter 3 days ago ago

      Utf8 solved this completely. It handles every unicode code point and on average takes up almost as little storage as ascii.

      Utf16 is brain dead and an embarrassment

      • wvenable 3 days ago ago

        Blame the Unicode consortium for not coming up with UTF-8 first (or, really, at all). And for assuming that 65,536 code points would be enough for everyone.

        So many problems could be solved with a time machine.

        • kstrauser 3 days ago ago

          The first draft of Unicode was in 1988. Thompson and Pike came up with UTF-8 in 1992, made an RFC in 1998. UTF-16 came along in 1996, made an RFC in 2000.

          The time machine would've involved Microsoft saying "it's clear now that UCS-2 was a bad idea, so let's start migrating to something genuinely better".

          • wvenable 2 days ago ago

            I don't think it was clear at the time that UTF-8 would take off. UCS-2 and then UTF-16 was well established by 2000 in both Microsoft technologies and elsewhere (like Java). Linux, despite the existence of UTF-8, would still take years to get acceptable internationalization support. Developing good and secure internationalization is a hard problem -- it took a long time for everyone.

            It's now 2026, everything always looks different in hindsight.

            • gpvos a day ago ago

              At the time it was introduced it was understandable, and Microsoft also needed some time to implement it before that of course. But by about 2000 it was clear that UTF-8 was going to win, and Microsoft should have just properly implemented it in NT instead of dithering about for the next almost 20 years. Linux had quite good support of it by then.

            • kstrauser 2 days ago ago

              I don’t remember it quite that way. Localization was a giant question, sure. Are we using C or UTF-8 for the default locale? That had lots of screaming matches. But in the network service world, I don’t remember ever hearing more than a token resistance against choosing UTF-8 as the successor to ASCII. It was a huge win, especially since ASCII text is already valid UTF-8 text. Make your browser default to parsing docs with that encoding and you can still parse all existing ASCII docs with zero changes! That was a huge, enormous selling point.

              Windows is far from a niche player, to be sure. Yet it seems like literally every other OS but them was going with one encoding for everything, while they went in a totally different direction that got complaints even then. I truly believe they thought they’d win that battle and eventually everyone else would move to UTF-16 to join them. Meanwhile, every other OS vendor was like, nah, no way we’re rewriting everything from scratch to work with a not-backward compatible encoding.

              • wvenable 2 days ago ago

                Microsoft did the hard work of supporting Unicode when UTF-8 didn't exist (and mostly when UTF-16 didn't exist).

                Any system that continued with only ASCII well into the 2000s could mostly just jump into UTF-8 without issue. Doing nothing for non-English users for almost two decades turned out to be a solid plan long term. Microsoft certainly didn't have that option.

            • paulsutter a day ago ago

              Blame Java - their use of utf-16 is the sole reason that Microsoft chose it.

              Sun sued Microsoft in 1997 for making nonportable extensions to Java (a license violation). Microsoft lost, and created C# in 2000.

              At the time, “Starting Java” was the most feared message on the internet. People really thought that in-browser Java would take over over the world (yes Java, not Javascript)

              Sun chose UTF16 in 1995 believing that Unicode would never need more than 64k characters. In 1996 that changed. UTF16 got variable length encoding and became a white elephant

              So Microsoft chose UTF16 knowing full well that it had no advantages. But at least they can say code pages were far worse :)

          • gpvos 2 days ago ago

            MS could easily have added proper UTF-8 support in the early 2000s instead of the late 2010s.

            • kstrauser 2 days ago ago

              Yep. It would've been a better landing pad than UTF-16 since they had to migrate off UCS-2 anyway.

      • Dwedit 2 days ago ago

        It gets worse for UTF-16, Windows will let you name files using unpaired surrogates, now you have a filename that exists on your disk that cannot be represented in UTF-8 (nor compliant UTF-16 for that matter). Because of that, there's yet another encoding called WTF-8 that can represent the arbitrary invalid 16-bit values.

    • applfanboysbgon 3 days ago ago

      I think this is a rather pertinent showcase of the danger of outsourcing your thinking to LLMs. This article strongly indicates to me that it is LLM-written, and it's likely the LLM diagnosed the issue as being a C# issue. When you don't understand the systems you're building with, all you can do is take the plausible-sounding generated text about what went wrong for granted, and then I suppose regurgitate it on your LLM-generated portfolio website in an ostensible show of your profound architectural knowledge.

      • ziml77 3 days ago ago

        This is not at all just an LLM thing. I've been working with C# and MS SQL Server for many years and never even considered this could be happening when I use Dapper. There's likely code I have deployed running suboptimally because of this.

        And it's not like I don't care about performance. If I see a small query taking more than a fraction of a second when testing in SSMS or If I see a larger query taking more than a few seconds I will dig into the query plan and try to make changes to improve it. For code that I took from testing in SSMS and moved into a Dapper query, I wouldn't have noticed performance issues from that move if the slowdown was never particularly large.

      • cosmez 3 days ago ago

        This is a common issue, and most developers I worked with are not aware of it until they see the performance issues.

        Most people are not aware of how Dapper maps types under the hood; once you know, you start being careful about it.

        Nothing to do with LLMs, just plain old learning through mistakes.

      • keithnz 3 days ago ago

        actually, LLMs do way better, with dapper the LLM generates code to specify types for strings

    • SigmundA 3 days ago ago

      Yes I have run into this regardless of client language and I consider it a defect in the optimizer.

      • wvenable 3 days ago ago

        I wouldn't consider it a defect in the optimizer; it's doing exactly what it's told to do. It cannot convert an nvarchar to varchar -- that's a narrowing conversion. All it can do is convert the other way and lose the ability to use the index. If you think that there is no danger converting an nvarchar that contains only ASCII to varchar then I have about 70+ different collations that say otherwise.

        • SigmundA 3 days ago ago

          Can you give an example whats dangerous about converting a nvarchar with only ascii (0-127) then using the index otherwise fallback to a scan?

          If we simply went to UTF-8 collation using varchar then this wouldn't be an issue either, which is why you would use varchar in 2026, best of both worlds so to speak.

          • wvenable 2 days ago ago

            For a literal/parameter that happens to be ASCII, a person might know it would fit in varchar, but the optimizer has to choose a plan that stays correct in the general case, not just for that one runtime value. By telling SQL Server the parameter is an nvarchar value, you're the one telling it the value might not be ASCII.

            • munch117 2 days ago ago

              Making a plan that works for the general case, but is also efficient, is rather trivial. Here's pseudocode from spending two minutes on the problem:

                  # INPUT: lookfor: unicode
                  var lower, upper: ascii
                  lower = ascii_lower_bound(lookfor)
                  upper = ascii_upper_bound(lookfor)
                  for candidate:ascii in index_lookup(lower .. upper):
                      if expensive_correct_compare_equal(candidate.field, lookfor):
                          yield candidate
              
              The magic is to have functions ascii_lower_bound and ascii_upper_bound, that compute an ASCII string such that all ASCII strings that compare smaller (greater) cannot be equal to the input. Those functions are not hard to write. Although you might have to implement versions for each supported locale-dependent text comparison algorithm, but still, not a big deal.

              Worst case, 'lower' and 'upper' span the whole table - could happen if you have some really gnarly string comparison rules to deal with. But then you're no worse off than before. And most of the time you'll have lower==upper and excellent performance.

            • jstrong 2 days ago ago

              optimizer can't inspect the value? pretty dumb optimizer, then.

              • zabzonk 2 days ago ago

                It's not "the value", it's "the values".

              • wvenable 2 days ago ago

                Running the optimizer for every execution of the same query is... not very optimal.

                • SigmundA 2 days ago ago

                  It can run it for a range of values: https://learn.microsoft.com/en-us/sql/relational-databases/p...

                  Also, the simpler and maybe better approach is to just make the decision every time as an operation in the plan: attempt the cast on the parameter; if it fails, fall back to scanning and casting every row the other way; if it succeeds, use the index. This isn't hard, adds one extra cast attempt on the slow path, and otherwise does transparently what everyone has to do manually in their code, like in this article.

                  The adaptive join operator does something much more complex: https://learn.microsoft.com/en-us/sql/relational-databases/p...

                  • wvenable 2 days ago ago

                    I'm not sure it makes sense to add more checks and another operation to every single query just for the case where the user explicitly mislabels the types. You're going to slow down everything everywhere (slightly) for a pretty obscure case. I suspect, in the long term, this would be a bad choice.

                    • SigmundA 2 days ago ago

                      The check would be added when it sees a predicate comparing a varchar column to an nvarchar parameter.

                      It currently just does a scan in that situation, which is orders of magnitude more expensive -- a cast for every row -- versus a single extra cast check on the parameter value that can avoid all those per-row casts in the common case.

                      There is no planning overhead; it's already detecting the situation. The execution overhead is a single extra cast on top of the cast per row, so n+1 vs n, with the potential to eliminate n for a very common charset.

        • 2 days ago ago
          [deleted]
  • briHass 3 days ago ago

    I've found and fixed this bug before. There are 2 other ways to handle it

    Dapper has a static configuration for things like TypeMappers, and you can change the default mapping for string to use varchar with: Dapper.SqlMapper.AddTypeMap(typeof(string),System.Data.DbType.AnsiString). I typically set that in the app startup, because I avoid NVARCHAR almost entirely (to save the extra byte per character, since I rarely need anything outside of ANSI.)
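
    A minimal sketch of that setup (connection, User, and displayName here are illustrative, not from the article):

        // Run once at app startup, before any Dapper queries execute.
        // Every plain C# string parameter is then sent as DbType.AnsiString
        // (varchar) instead of the default DbType.String (nvarchar).
        Dapper.SqlMapper.AddTypeMap(typeof(string), System.Data.DbType.AnsiString);

        // For the odd column that really is nvarchar, opt back in per parameter:
        var user = connection.QuerySingleOrDefault<User>(
            "SELECT * FROM Users WHERE DisplayName = @name",
            new { name = new DbString { Value = displayName, IsAnsi = false } });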

    Or, one could use stored procedures. Assuming you take in a parameter that is the correct type for your indexed predicate, the conversion happens once when the SPROC is called, not done by the optimizer in the query.

    I still have mixed feelings about overuse of SQL stored procedures, but this is a classic example of where one of their benefits is revealed: they are a defined interface for the database, where DB-specific types can be handled instead of polluting your code with specifics about your DB.

    (This is also a problem for other type mismatches like DateTime/Date, numeric types, etc.)

    • ziml77 3 days ago ago

      Sprocs are how I handle complex queries rather than embedding them in our server applications. It's definitely saved me from running into problems like this. And it comes with another advantage of giving DBAs more control to manage performance (DBAs do not like hearing that they can't take care of a performance issue that's cropped up because the query is compiled into an application)

    • bonesss 2 days ago ago

      As a general issue of hygiene I tend to wrap any ORM and access it through an internal interface.

      1) The joy of writing and saying DapperWrapper can’t be overstated.

      2) in conjunction with meaningful domain types it lets you handle these issues across the app at a single point of control, and capture more domain semantics for testing.

  • jklowden 3 hours ago ago

    > SQL Server has to convert every single value in the column to nvarchar before it can compare.

    This of course is not true. It is a defect in Microsoft’s query planner. And the proof lies in the remedy.

    The recommended solution is to convert the search argument type to match that of the index. The user is forced to discover the problem and adjust manually. SQL Server could just as well have done that automatically.
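
    The manual adjustment, concretely (the Users/Email schema is illustrative):

        -- Email is varchar(200) and indexed.
        DECLARE @email_n nvarchar(200) = N'user@example.com';
        DECLARE @email   varchar(200)  = 'user@example.com';

        SELECT Id FROM Users WHERE Email = @email_n; -- column converted up to nvarchar: scan
        SELECT Id FROM Users WHERE Email = @email;   -- types match: index seek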

    No information is lost converting nvarchar to varchar if the index is varchar. If the search argument contains characters the varchar code page cannot represent, no value in the index can match it anyway (unless the index data is UTF8, which the server should know).

    This is a longstanding bug in SQL Server, and not the only one. Instead of patting ourselves on the back for avoiding what SQL Server "has to do", we should be insisting it not do it. Anymore.

  • diath 3 days ago ago

    It's weird that the article does not show any benchmarks but crappy descriptions like "milliseconds to microseconds" and "tens of thousands to single digits". This is the kind of vague performance description LLMs like to give when you ask them about performance differences between solutions and don't explicitly ask for a benchmark suite.

    • pllbnk 2 days ago ago

      I disagree. I think it's a nice discovery many might be unaware of and later spend a lot of time on tracking down the performance issue independently. I also disagree that a rigorous benchmark is needed for every single performance-related blog post because good benchmarks are difficult to write, you have to account for multiple variables. Here, the author just said - "trust me, it's much faster" and I trust them because they explained the reasoning behind the degradation.

    • nmeofthestate 2 days ago ago

      The writing style certainly screams LLM.

    • _vertigo 2 days ago ago

      > No schema changes. No new indexes. No query rewrites. Just telling Dapper the correct parameter type.

      • pllbnk 2 days ago ago

        Are we automatically discarding everything that might or might not have been written or assisted by an LLM? I get it when the articles are the type of meaningless self improvement or similar kind of word soup. However, if hypothetically an author uses LLM assistance to improve their styling to their liking, I see nothing wrong with that as long as the core message stands out.

        • rmunn 2 days ago ago

          I've seen so many LLM-generated articles by this point that obviously had no human editing done beforehand — just prompt and slap it onto the Web — that it makes me wonder every time. If I read this article, will I actually learn only truth? Or are there some key parts of this article that are actually false because the LLM hallucinated them, and the human involved didn't bother to double-check the article before publishing it?

          If someone was just using the LLM for style, that's fine. But if they were using it for content, I just can't trust that it's accurate. And the time cost for me to read the article just isn't worth it if there's a chance it's wrong in important ways, so when I see obvious signs of LLM use, I just skip and move on.

          Now, if someone acknowledged their LLM use up front and said "only used for style, facts have been verified by a human" or whatever, then I'd have enough confidence in the article to spend the time to read it. But unacknowledged LLM use? Too great a risk of uncorrected hallucinations, in my experience, so I'll skip it.

  • downsplat 2 days ago ago

    Did this post come out of a freezer from 1998? Who on earth creates databases in Latin1 in 2026?

    Nevermind, looks like Sql Server didn't add utf8 collations until 2019 (!) and for decades people had to choose column by column between the 16-bit overhead of "nvarchar" and latin1... And still do if they want a bit of backwards compatibility. Amazing.

    • rmunn 2 days ago ago

      "Just use Postgres" (which defaults to UTF-8 encoding unless specifically configured to use something else) is looking like better and better advice every day.

      Doesn't help those tied to legacy systems that would require a huge, expensive effort to upgrade, though. Sorry, folks. There's a better system, you know it's a better system, and you can't use it because switching is too expensive? I've been there (not databases, in my case) and it truly sucks.

  • elmigranto 2 days ago ago

    Third party dependencies are very easy: you just have to intimately know how it is implemented in addition to knowing your own code and stack, and then you are golden!

    Nothing to learn, just focus on making your app, it’s all taken care of by This One Simple Package ;)

    These things are so far from free, yet our tooling presents them as "just nuget it or whatever".

    • DeathMetal3000 2 days ago ago

      I’m sure writing their own ORM would have given them instantaneous insight into this issue and introduced no other challenges. Open source developers hate this one weird trick!

      • elmigranto 2 days ago ago

        Especially for things used directly, you need to understand both, own and third party code, roughly to the same level. With own code, you only care for your own use case; with third-party — you have to kind of get everyone else's.

        Depending on what you do and the dependency's scope, either way can make sense.

  • maciekkmrk 3 days ago ago

    Interesting problem, but the AI prose makes me not want to read to the end.

  • jiggawatts 3 days ago ago

    This feels like a bug in the SQL query optimizer rather than Dapper.

    It ought to be smart enough to convert a constant parameter to the target column type in a predicate constraint and then check for the availability of a covering index.

    • valiant55 3 days ago ago

      There's a data type precedence that it uses to determine which value gets cast[0]. Nvarchar is higher precedence, therefore the varchar value is "lifted" to an nvarchar value first. This wouldn't be an issue if the types were reversed.

      0: https://learn.microsoft.com/en-us/sql/t-sql/data-types/data-...

      • Sankozi a day ago ago

        So such issues can appear in more products and more datatypes (int and bigint have the same problem).

        This is a really bad rule for SQL's "equality" operator.

        Still, the optimizer should be able to handle it - if the result is the same, the optimizer should take the faster path.

    • wvenable 3 days ago ago

      It's the optimizer caching the query plan as a parameterized query. It's not re-planning the index lookup on every execution.

      • SigmundA 3 days ago ago

        The parameter type is part of the cache identity, nvarchar and varchar would have two cache entries with possibly different plans.

        • 3 days ago ago
          [deleted]
    • beart 3 days ago ago

      How do you safely convert a 2 byte character to a 1 byte character?

      • jiggawatts 3 days ago ago

        Easily! If it doesn't convert successfully because it includes characters outside of the range of the target codepage then the equality condition is necessarily false, and the engine should short-circuit and return an empty set.

  • smithkl42 3 days ago ago

    Been bit by that before: it's not just an issue with Dapper, it can also hit you with Entity Framework.

  • wronex a day ago ago

    This makes me sad. We have these type-safe languages, then a DB comes along and breaks the type barrier. What are we to do? Property attributes and an ORM? Is there a LINQ to SQL thing?

  • mvdtnz 3 days ago ago

    This is a really interesting blog post - the kind of old school stuff the web used to be riddled with. I must say - would it have been that hard to just write this by hand? The AI adds nothing here but the same annoying old AI-isms that distract from the piece.

  • andrelaszlo 3 days ago ago

    I thought, having just read the title, that maybe it's time to upgrade if you're still on Ubuntu 6.06.

  • pjmlp 2 days ago ago

    I never had this issue with Dapper; as others point out, it's a holding-it-wrong problem.

  • adzm 3 days ago ago

    even better is Entity Framework and how it handles null strings by creating some strange predicates in SQL that end up being unable to seek into string indexes

  • enord 3 days ago ago

    This is due to utf-16, an unforgivable abomination.

  • bunbun69 2 days ago ago

    AI slop article

    Also no meaningful benchmarking was done

  • ltbarcly3 3 days ago ago

    Life is too short to use SQL Server. I know people that use it will swear it's "not bad anymore" but yes it is.

    • bni 2 days ago ago

      yes it is