Not mentioned there, because it's discussing history that nearly all happened before Unicode was ever conceived, is the fact that the kernel expects the first two bytes to be #! and therefore a UTF-8 BOM will mess up this logic. If the first two bytes are 0xEF 0xBB (because the first five bytes were 0xEF 0xBB 0xBF followed by #!), you'll get errors like "./doit.sh: line 1: #!/bin/bash: No such file or directory" and be left scratching your head. /bin/bash is right there, I can see it with ls, why can't my script see it?
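If you want to reproduce it, here's a minimal sketch (assuming bash and that xxd is installed; doit.sh is just the name from the example above):

    # Write a script whose first three bytes are the UTF-8 BOM (EF BB BF),
    # followed by an otherwise normal hashbang line.
    printf '\xef\xbb\xbf#!/bin/bash\necho hello\n' > doit.sh
    chmod +x doit.sh

    ./doit.sh
    # ./doit.sh: line 1: #!/bin/bash: No such file or directory
    # (the BOM is actually in that error message, it's just invisible)

    # A hex dump makes the three extra bytes visible:
    head -c 16 doit.sh | xxd
    # 00000000: efbb bf23 212f 6269 6e2f 6261 7368 0a65  ...#!/bin/bash.e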
Do you see the invisible BOMb in the error message? Neither did I the first time. (And, in fact, Ghostty apparently stripped it out when I copied and pasted, so it's not actually there in this comment). But if I were to load that doit.sh script I created for this example into VS Code, I'd see the telltale "UTF-8 with BOM" file format.
Most people already know this, but maybe this will help someone out there. If you see a "No such file or directory" error and the program being executed apparently starts with #!, it probably actually starts with U+FEFF#! and you need to re-save the script in UTF-8 without a BOM(b) at the start.
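If re-saving from an editor is awkward, stripping the BOM from the command line works too. A sketch (the \xNN escapes in the pattern are GNU sed; on macOS/BSD the tail variant is safer):

    # Strip a leading UTF-8 BOM in place (GNU sed).
    sed -i '1s/^\xEF\xBB\xBF//' doit.sh

    # Or write a BOM-less copy by skipping the first three bytes
    # (only do this after confirming the BOM is actually there):
    tail -c +4 doit.sh > doit-clean.sh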
How are you ending up with a byte-order mark in your shell scripts, though? This has literally never happened to me. I don't know a single piece of software that writes byte-order marks; they are super niche.
A BOM is officially recommended against for UTF-8, but I've seen some tools include it when converting from UCS-2 or UTF-16 on Windows. A number of text editors support it, and may stick in that mode for subsequent files, which might be how a BOM could accidentally get into a new file.
Irritatingly, you'll find BOMs are not uncommon in CSV files because of Excel, which interprets files as CP1252 (a superset of the printable characters of ISO 8859-1, also known as Win1252 or Windows-1252) if the BOM is not present, causing anything beyond ASCII to be misinterpreted. (Accented characters are usually the first thing people in Europe notice getting garbled; currency symbols other than $ too.)
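That's also why you sometimes see the BOM added on purpose, which is exactly how it ends up in these files. A minimal sketch of the usual trick (report.csv is a made-up name) for getting Excel to treat a CSV as UTF-8:

    # Prepend the UTF-8 BOM (EF BB BF) so Excel reads the CSV as UTF-8
    # instead of falling back to CP1252.
    printf '\xef\xbb\xbf' > report-excel.csv
    cat report.csv >> report-excel.csv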
My most common source of unintentional BOMs is PowerShell. By default, 'echo 1 > test' writes the raw bytes 'FF FE 31 00 0D 00 0A 00'. Not too likely for that to end up in a shell script, though.
That's UTF-16, though, which is basically the only place where byte-order marks are actually useful. (It was arguably the thing they were designed for: being able to tell whether you have little-endian or big-endian UTF-16.) I've never encountered UTF-16BE personally, so I can't easily tell how widespread it is, but I suspect that while its use is rare, it's unfortunately not zero, so you still need to know what byte order you have when encountering UTF-16.
Though that would have been a far worse problem a decade ago. Thankfully, these days any random document you encounter on the Web is 99% likely to be UTF-8, so the correct heuristic is "if you don't know the encoding, try decoding with UTF-8 first. If that fails, try a different encoding." Decoding UTF-8 is practically guaranteed to fail on a document that wasn't actually UTF-8, so trying UTF-8 first is a very safe bet. (Though there's one failure mode: ASCII encoded as UTF-16. So you might also try a heuristic that says "if the document parses as UTF-8 but has a null byte every other character in the first 256 bytes, reparse it as UTF-16".)
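A rough command-line version of that heuristic, using iconv as a UTF-8 validator (just a sketch; mystery.txt is a made-up file name, and a real detection library would do quite a bit more):

    f=mystery.txt  # hypothetical input file

    # iconv exits non-zero if the input is not valid UTF-8.
    if iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1; then
        # Valid UTF-8 (plain ASCII also passes) -- unless it's really
        # UTF-16, in which case roughly every other byte is NUL.
        nuls=$(head -c 256 "$f" | tr -dc '\000' | wc -c)
        if [ "$nuls" -gt 0 ]; then
            echo "probably UTF-16, not UTF-8"
        else
            echo "treat as UTF-8"
        fi
    else
        echo "not UTF-8; try a legacy encoding such as WINDOWS-1252"
    fi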
I'm rambling and getting off my point. The point is, I don't find BOMs in UTF-16 documents to be a problem, because UTF-16 actually needs them. (Well, probably needs them). UTF-8 doesn't need them at all.
The coworker who created the script runs Windows. When I informed him that he'd gotten a BOM into the shell script, he checked his IDE settings (JetBrains Rider) and his encoding default was set to UTF-8 without BOM, so neither of us have any clue how that script ended up with a BOM in it. Perhaps he edited the script with a different tool at one point. But it was definitely because the script was created or edited on Windows. (I forgot to mention earlier that you'll only ever run into this when you work on projects where devs are using different OSes to check files into Git. Many people will therefore never see this issue).
Notepad.exe does "UTF-8 with BOM" (as well as "UTF-8"); you'd have to select it in the Save As dialogue, but it's right there with no indication that this is a Bad Thing. Just checked, it's still there to this day.
The answer is (also confirmed by other replies) Windows. It seems in the Unix world, everyone uses UTF-8 (without BOM of course) and text encoding mistakes don't exist.
When you involve Windows, which likes a random mix of UTF-16, UCS-2, CP1252, and I guess also UTF-8 with BOM, you're screwed.
Notepad++ (popular with some on Windows) offers optional byte order marks on text files (subtitles, bash scripts, anything UTF-8, etc.).
Not my editor of choice, but some swear by it and are prone to work cross-platform across NASes and SSH terminals, with either Windows or some *nix as the 'primary' workspace.
I'm sure other editors have this as an option; the time I ran into BOM issues, I traced it back to the use of Notepad++ by a third party.
Microsoft Excel writes them out when loading/saving CSV files.
The article actually mentions this in passing, but POSIX will default to running the file with the system shell if it's not an executable binary. So hashbang scripts can work even if the system doesn't support hashbangs (as long as the script is in shell).
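You can see that fallback from an interactive shell. A minimal sketch (the fallback lives in the invoking shell, which re-runs the file as a script when execve() fails, so it only applies when a shell is doing the launching):

    # A script with no #! line at all.
    printf 'echo "ran by: $0"\n' > noshebang.sh
    chmod +x noshebang.sh

    # The kernel's execve() refuses it (ENOEXEC), but the invoking shell
    # falls back to interpreting it as a shell script, so it still runs:
    ./noshebang.sh
    # ran by: ./noshebang.sh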
See also https://www.felesatra.moe/blog/2021/07/03/portable-bash-sheb...
Also relevant: https://news.ycombinator.com/item?id=45970885 (article about how #! allows relative paths)
People should just go and read https://en.wikipedia.org/wiki/Shebang_(Unix). Nothing wrong with reading the above too, though its title is somewhat click-bait-ish.
On Linux, the maximum shebang line length was doubled from 128 to 256 bytes in v5.1 (2019-05-05).
Windows 95 also doubled the DOS command-line limit from 128 to 256 characters in… 1995.
I got really excited about a #!magic invocation only to be disappointed. But this is still a fun romp.