"Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes."
If the DMS output isn’t quoting fields that contain commas, that’s technically invalid CSV.
A small normalization step before COPY (or ensuring the writer emits RFC-compliant CSV in the first place) would make the pipeline robust without renaming countries or changing delimiters.
That way, if/when the DMS output is fixed upstream, nothing downstream needs to change.
That's the real shame but also the lesson, a perfectly good and specified format, but the apparent simplicity makes everyone ignore the spec and yolo out broken stuff.
This is why SQL is "broken", it's powerful, simple and people will always do the wrong thing.
Was teaching a class on SQL, half my class was reminding them that examples with concatenating strings was bad and they should use prepared statements (JDBC).
Come practice time, half the class did string concatenations.
This is why I love Linq and the modern parametrized query-strings in JS, they make the right thing easier than the wrong thing.
I also really like the way Androidx's Room handles query parameters and the corresponding APIs.
@Dao
public interface UserDao {
@Query("SELECT * FROM user")
List<User> getAll();
@Query("SELECT * FROM user WHERE uid IN (:userIds)")
List<User> loadAllByIds(int[] userIds);
@Query("SELECT * FROM user WHERE first_name LIKE :first AND " +
"last_name LIKE :last LIMIT 1")
User findByName(String first, String last);
@Insert
void insertAll(User... users);
@Delete
void delete(User user);
}
It's one of the better abstractions given the lack of first class expressions in Java, having used EfCore/Linq a while I'd be hard pressed to like going back though.
The Linq code is native C# that can be strongly typed for ID's,etc but you can "think" in SQL terms by writing Where,Select,OrderBy and so on (I will admit that the C# world hasn't really gotten there in terms of promoting strongly typed db ID's yet but there support is there).
I'm not sure how "commas inside strings in CSVs can cause bugs" becomes newsworthy, but I guess even the vibecoding generation needs to learn the same old lessons.
I really don't understand why people think it's a good idea to use csv. In english settings, the comma can be used as 1000-delimiter in large numbers, e.g. 1,000,000 for on million, in German, the comma is used as decimal place, e.g. 1,50€ for 1 euro and 50 cents. And of course, commas can be used free text fields. Given all that, it is just logical to use tsv instead!
I learned to program at 33 or so (in bioinformatics), my first real lesson a couple of days in: "Never ever use csv". I've never used pd.read_csv() without sep="\t". Idk where csv came from, and who thought it was a good idea. It must have been pre-spreadsheet because a tab will put you in the next cell so tabs can simply never be entered into any table by our biologist colleagues.
I guess it's also why all our fancy (as in tsv++?) file types (like GTF and BED) are all tab (or spaces) based. Those fields often have commas in cells for nested lists etc.
I wish sep="\t" was default and one would have to do pd.read/to_tsv(sep=",") for csv. It would have saved me hours and hours of work and idk cross the 79 chars much less often ;)
Funny story: I once bought and started up Galactic Civilizations 3.
It looked horrible, the textures just wouldn't load no matter what I tried. Finally, on a forum, some other user, presumably also from Europe, noted that you have to use decimal point as a decimal separator (my locale uses a comma). And that solved the problem.
CSV can handle commas in fields just fine (quotes are required in that case). The root problem here is not the format, it's a bug in the CSV exporter used.
It's one of those things where people think, it's there, and it works.
The whole business of software engineering exists in the gap between "it works today on this input" and "it will also work tomorrow and the day after and after we've scaled 10x and rewrote the serialization abstraction and..."
See also: "Glorp 5.7 Turbo one-shot this for me and it works!"
For CSV, I don't know how this comes out. It depends on the library/programming language. It might be 73786976294838210000 or it might throw an exception, or whatever. I'm just saying JSON will not solve your problems neither.
If you need something unambiguously specified, then XML with XSD is still a valid option. All number types are specified exactly, and you can use extensions for custom number types.
I've embedded large JSON blobs in CSVs. The format is fine and quite robust, just never open it in Excel unless you are prepared for your data to be silently broken, but that's Excel being abhorrent, not CSV: Libreoffice and Google Sheets don't do that.
The JSON in CSVs does piss off the Jetbrains table viewer sometimes though, it will randomly decide to stop parsing a 50k line CSV at halfway through the JSON of line 300ish even though that JSON is no different from other JSON it parsed just fine.
But python reads and writes them fine, as does whatever SQL engine I'm touching, as does other tools.
> I really don't understand why people think it's a good idea to use csv.
Because it's easy to understand. Non-technical people understand it. There is tremendous value in that, and that it's underspecified with ugly edge cases doesn't change that.
And you get the under reporting of COVID information in the UK as they passed around CSV files with too many rows for the tools they used.
An interchnage format needs to include information showing that you have all the data - e.g. a hash or the number of rows - or JSON/XML/s-expressions having closing symbols to match the start.
If you snapped your fingers and removed CSVs from the world your lights would go out within the hour and you'd starve within the week. Trillions of dollars in business are done every day with separated values files and excel computations. The human relationships solve the data issues.
This very clearly seems like a bug either in their DMS script, or in the DMS job that they don't directly control, since CSV clearly allows for escaping commas (by just quoting them). Would love to see a bug report being submitted upstream as well as part of the "fix".
CSV quoting is dialect dependent. Honestly you should just never use CSV for anything if you can avoid it, it's inferior to TSV (or better yet JSON/JSONL) and has a tendency to appear like it's working but actually be hiding bugs like this one.
I'd go so far as to say any implementation that doesn't conform to RFC 4180[1] is broken and should be fixed. The vast majority of implementations get this right, it's just that some that don't are so high profile it causes people to throw up their hands and give up.
Unrelated to the fundamental issue (a part of your pipeline generates invalid CSV), I would never store the name of the country like this. The country's name is "The Republic of Moldova" and I would store it like this.
Sure, the most common collation scheme for country names is to sort ignoring certain prefixes like "The Republic of", "The", "People's Democratic...", etc. but this is purely a presentation layer issue (how to order a list of countries to a user) that should be independent of your underlying data.
Sure "hacking" the name of the country like this to make the traditional alphabetical ordering match a particular ordering desired to aid human navigation has a lot of history (encyclopedia indexes, library indexes, record stores, etc.) but that was in the age of paper and physical filing systems.
Store the country name correctly and then provide a custom sort or multiple custom sorts where such functionality belongs - in the presentation layer.
While true, the default and naive sort of anything is alphabetical, and you'd need to implement this more advanced nondefault sort in every possible client.
Personally I've never seen any sort where "The Republic of Moldova" would be sorted at "M".
It's reasonably common to sort by a either a short form of the name or the ISO code or similar, with the full name being displayed by the list.
This sometimes causes problems for the UK, which can be sorted as U, G, B, or even E (presumably for "England", making it especially annoying for people in the other countries of the UK).
Considering the scope, this could be more easily resolved by just stripping ", Republic of" from that specific string (assuming "Moldova" on its own is sufficient).
I personaly would shy away from binary formats whenever possible. For my column based files i use TSV or the pipe char as delimiter. even excel allowes this files if you include a "del=|" as first line
Sure, but why Moldova of all places? I've seen this form usually for places where there's a dispute for the short name, like Nice/Naughty Korea, Taiwan/West Taiwan, or Macedonia/entitled Greek government.
Come on man. What are we doing here. This is not even anything interesting like Norway being interpreted as False in YAML. This is just a straightforward escaping issue.
RFC 4180 [1] Section 2.6 says:
"Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes."
If the DMS output isn’t quoting fields that contain commas, that’s technically invalid CSV.
A small normalization step before COPY (or ensuring the writer emits RFC-compliant CSV in the first place) would make the pipeline robust without renaming countries or changing delimiters.
That way, if/when the DMS output is fixed upstream, nothing downstream needs to change.
[1] https://www.rfc-editor.org/rfc/rfc4180.html
That's the real shame but also the lesson, a perfectly good and specified format, but the apparent simplicity makes everyone ignore the spec and yolo out broken stuff.
This is why SQL is "broken", it's powerful, simple and people will always do the wrong thing.
Was teaching a class on SQL, half my class was reminding them that examples with concatenating strings was bad and they should use prepared statements (JDBC).
Come practice time, half the class did string concatenations.
This is why I love Linq and the modern parametrized query-strings in JS, they make the right thing easier than the wrong thing.
I also really like the way Androidx's Room handles query parameters and the corresponding APIs.
It's one of the better abstractions given the lack of first-class expressions in Java, though having used EF Core/Linq for a while I'd be hard pressed to go back.
The Linq code is native C# that can be strongly typed for IDs etc., but you can still "think" in SQL terms by writing Where, Select, OrderBy and so on (I'll admit the C# world hasn't really gotten there in terms of promoting strongly typed db IDs yet, but the support is there).
In that case, I'd recommend jooq, which is just linq in Java :D
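The prepared-statement point above can be sketched with Python's stdlib sqlite3 (the schema and values here are invented for illustration):

```python
import sqlite3

# A hostile-looking value is stored and queried safely via placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (uid INTEGER, first_name TEXT)")
name = "Robert'); DROP TABLE user;--"
conn.execute("INSERT INTO user VALUES (?, ?)", (1, name))

# Bad (the classic injection vector):
#   "SELECT uid FROM user WHERE first_name = '" + name + "'"
# Good: the driver treats the parameter as plain data, never as SQL.
rows = conn.execute(
    "SELECT uid FROM user WHERE first_name = ?", (name,)
).fetchall()
print(rows)  # [(1,)]
```

The point is exactly the one made above: the placeholder form is no harder to type than concatenation, so the safe path is also the easy path.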
Yeah, I don't understand: the article starts with "we want this ["value"] but got this [value]", but then doing exactly that wasn't the solution?
I'm not sure how "commas inside strings in CSVs can cause bugs" becomes newsworthy, but I guess even the vibecoding generation needs to learn the same old lessons.
When they inevitably move to JSON I sure hope nobody's shop includes a quote in the name.
JSON is only written one way; CSV parsers all have tweaks for delimiters, whether there's a header row of column names, and such.
A vibecoded article, about a vibecoded solution, to a trivial problem. Not entirely sure what to think of it, tbh.
Just talked with a colleague and we used this as example where vibe coding is already brain draining people.
I really don't understand why people think it's a good idea to use CSV. In English locales, the comma is the thousands separator in large numbers, e.g. 1,000,000 for one million; in German, the comma is the decimal separator, e.g. 1,50€ for 1 euro and 50 cents. And of course, commas can appear in free-text fields. Given all that, it's only logical to use TSV instead!
I learned to program at 33 or so (in bioinformatics), my first real lesson a couple of days in: "Never ever use csv". I've never used pd.read_csv() without sep="\t". Idk where csv came from, and who thought it was a good idea. It must have been pre-spreadsheet because a tab will put you in the next cell so tabs can simply never be entered into any table by our biologist colleagues.
I guess it's also why all our fancy (as in tsv++?) file types (like GTF and BED) are all tab (or spaces) based. Those fields often have commas in cells for nested lists etc.
I wish sep="\t" were the default and one would have to do pd.read/to_tsv(sep=",") for csv. It would have saved me hours and hours of work, and I'd cross the 79-character limit much less often ;)
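The habit above can be sketched with only the stdlib csv module and its built-in excel-tab dialect (the field values are invented; pandas with sep="\t" behaves equivalently):

```python
import csv
import io

# With a tab delimiter, commas in fields need no quoting at all.
buf = io.StringIO()
csv.writer(buf, dialect="excel-tab").writerow(
    ["gene1", "1,000,000", "exon1,exon2"]
)
print(repr(buf.getvalue()))  # 'gene1\t1,000,000\texon1,exon2\r\n'

# Reading back with the same dialect recovers the fields verbatim.
row = next(csv.reader(io.StringIO(buf.getvalue()), dialect="excel-tab"))
print(row)  # ['gene1', '1,000,000', 'exon1,exon2']
```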
Tabs can absolutely be entered into cells in multiple ways but the easiest is just copy paste.
And if it’s tab delimited usually people call them tsvs.
Funny story: I once bought and started up Galactic Civilizations 3.
It looked horrible, the textures just wouldn't load no matter what I tried. Finally, on a forum, some other user, presumably also from Europe, noted that you have to use decimal point as a decimal separator (my locale uses a comma). And that solved the problem.
CSV can handle commas in fields just fine (quotes are required in that case). The root problem here is not the format, it's a bug in the CSV exporter used.
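That behavior is easy to see with Python's stdlib csv module (the values here are invented):

```python
import csv
import io

# csv.writer defaults to QUOTE_MINIMAL: a field containing the delimiter
# is wrapped in double quotes, exactly as RFC 4180 requires.
buf = io.StringIO()
csv.writer(buf).writerow(["MD", "Moldova, Republic of"])
line = buf.getvalue().strip()
print(line)  # MD,"Moldova, Republic of"

# A conforming reader recovers the field intact, comma and all.
row = next(csv.reader(io.StringIO(buf.getvalue())))
print(row)  # ['MD', 'Moldova, Republic of']
```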
https://news.ycombinator.com/item?id=47229064
Clearly this is the issue. This article was 2000 words of trying to work around the actual problem
It's one of those things where people think, it's there, and it works.
The whole business of software engineering exists in the gap between "it works today on this input" and "it will also work tomorrow and the day after and after we've scaled 10x and rewrote the serialization abstraction and..."
See also: "Glorp 5.7 Turbo one-shot this for me and it works!"
Yes, but tabs can also appear in text fields. If you're free to pick something other than CSV, then perhaps consider Feather or Parquet?
JSON, just use JSON. Or XML, if you don't like JSON.
JSON brings its own set of problems. For example, take Python-generated JSON containing a large integer: what's the parsing result in JavaScript? What's the parsing result in Java? What's the difference to CSV?
For CSV, I don't know how this comes out; it depends on the library/programming language. It might be 73786976294838210000, or it might throw an exception, or whatever. I'm just saying JSON will not solve your problems either.
So it always depends on the implementation.
If you need something unambiguously specified, then XML with XSD is still a valid option. All number types are specified exactly, and you can use extensions for custom number types.
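A minimal demonstration of the large-number problem (2**66 is an arbitrary illustrative value):

```python
import json

# Python ints are arbitrary precision, so this round trip is exact.
doc = json.dumps({"n": 2**66})
print(doc)  # {"n": 73786976294838206464}
assert json.loads(doc)["n"] == 2**66

# JavaScript's Number is an IEEE-754 double, so integers above 2**53
# are not all representable and JSON.parse can silently round them.
print(2**66 > 2**53)  # True: outside JS's safe-integer range
```

Same bytes, different results per consumer: the format is unambiguous about syntax but says nothing about the numeric types of the parser on the other end.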
What's wrong with protobuf & friends?
Nothing. Not a very good data exchange format though.
I've embedded large JSON blobs in CSVs. The format is fine and quite robust, just never open it in Excel unless you are prepared for your data to be silently broken, but that's Excel being abhorrent, not CSV: Libreoffice and Google Sheets don't do that.
The JSON in CSVs does piss off the Jetbrains table viewer sometimes though, it will randomly decide to stop parsing a 50k line CSV at halfway through the JSON of line 300ish even though that JSON is no different from other JSON it parsed just fine.
But Python reads and writes them fine, as does whatever SQL engine I'm touching, as do other tools.
> I really don't understand why people think it's a good idea to use csv.
Because it's easy to understand. Non-technical people understand it. There is tremendous value in that, and that it's underspecified with ugly edge cases doesn't change that.
And you get the under-reporting of COVID information in the UK, when they passed around CSV files with too many rows for the tools they used.
An interchange format needs to include information showing that you have all the data - e.g. a hash or the number of rows - or, as with JSON/XML/s-expressions, closing symbols to match the opening ones.
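One sketch of such a self-describing file, with an invented "#rowcount" footer convention (nothing standard, purely illustrative):

```python
import csv
import io

rows = [["id", "country"], ["1", "Moldova, Republic of"], ["2", "Norway"]]

# Writer side: append a footer recording the number of data rows.
buf = io.StringIO()
w = csv.writer(buf)
w.writerows(rows)
w.writerow(["#rowcount", len(rows) - 1])

# Reader side: refuse the file unless the footer matches what arrived.
*data, footer = list(csv.reader(io.StringIO(buf.getvalue())))
assert footer[0] == "#rowcount" and int(footer[1]) == len(data) - 1
print(len(data) - 1)  # 2 data rows made it through, file is complete
```

A truncated transfer drops the footer (or leaves it disagreeing with the row count), so the receiver notices instead of silently importing a partial file.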
If you snapped your fingers and removed CSVs from the world your lights would go out within the hour and you'd starve within the week. Trillions of dollars in business are done every day with separated values files and excel computations. The human relationships solve the data issues.
This very clearly seems like a bug either in their DMS script, or in the DMS job that they don't directly control, since CSV clearly allows for escaping commas (by just quoting them). Would love to see a bug report being submitted upstream as well as part of the "fix".
CSV quoting is dialect dependent. Honestly you should just never use CSV for anything if you can avoid it, it's inferior to TSV (or better yet JSON/JSONL) and has a tendency to appear like it's working but actually be hiding bugs like this one.
Most CSV dialects have no problem with commas inside double-quoted fields.
The "dialect dependent" part is usually about escaping double quotes, newlines, and line continuations.
Not a portable format, but it is not too bad (for this use) either considering the country list is mostly static
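Concretely, here is the doubled-quote escaping in question, and why anything short of a conforming parser trips on it (the sample text is invented):

```python
import csv
import io

field = 'she said "hi", then left'

# An RFC 4180 writer wraps the field and doubles the embedded quotes.
out = io.StringIO()
csv.writer(out).writerow([field])
line = out.getvalue().strip()
print(line)  # "she said ""hi"", then left"

# A conforming reader round-trips it; a naive split on commas does not.
print(next(csv.reader(io.StringIO(line))))  # ['she said "hi", then left']
print(line.split(","))  # ['"she said ""hi""', ' then left"']
```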
I'd go so far as to say any implementation that doesn't conform to RFC 4180[1] is broken and should be fixed. The vast majority of implementations get this right, it's just that some that don't are so high profile it causes people to throw up their hands and give up.
[1]: https://datatracker.ietf.org/doc/html/rfc4180
Unrelated to the fundamental issue (a part of your pipeline generates invalid CSV), I would never store the name of the country like this. The country's name is "The Republic of Moldova" and I would store it like this.
Sure, the most common collation scheme for country names is to sort ignoring certain prefixes like "The Republic of", "The", "People's Democratic...", etc. but this is purely a presentation layer issue (how to order a list of countries to a user) that should be independent of your underlying data.
Sure "hacking" the name of the country like this to make the traditional alphabetical ordering match a particular ordering desired to aid human navigation has a lot of history (encyclopedia indexes, library indexes, record stores, etc.) but that was in the age of paper and physical filing systems.
Store the country name correctly and then provide a custom sort or multiple custom sorts where such functionality belongs - in the presentation layer.
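That separation can be sketched in a few lines of Python; the prefix list here is illustrative, not any standard collation:

```python
# The stored names stay intact; prefixes are ignored only in the sort key.
PREFIXES = ("The Republic of ", "Republic of ", "The ")  # illustrative list

def collation_key(name: str) -> str:
    for prefix in PREFIXES:
        if name.startswith(prefix):
            return name[len(prefix):]
    return name

countries = ["The Republic of Moldova", "Norway", "The Gambia", "France"]
print(sorted(countries, key=collation_key))
# ['France', 'The Gambia', 'The Republic of Moldova', 'Norway']
```

The data layer keeps the real name; only the presentation layer decides that "The Republic of Moldova" files under M.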
While true, the default and naive sort of anything is alphabetical, and you'd need to implement this more advanced, non-default sort in every possible client.
Personally I've never seen any sort where "The Republic of Moldova" would be sorted at "M".
It's reasonably common to sort by either a short form of the name or the ISO code or similar, with the full name being displayed in the list.
This sometimes causes problems for the UK, which can be sorted as U, G, B, or even E (presumably for "England", making it especially annoying for people in the other countries of the UK).
Considering the scope, this could be more easily resolved by just stripping ", Republic of" from that specific string (assuming "Moldova" on its own is sufficient).
I was expecting a Markdown-related .md issue. :)
"Sanitize at the boundary"
Ah, but what _is_ the boundary, asks Transnistria?
LoL, good one.
Did you really name your breakaway republic Sealand'); DROP TABLE Countries;--?
just use TSV instead of CSV by default
I don't understand people who don't validate their inputs and outputs - a count of expected values would have prevented such a basic mistake.
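The check described above, sketched with the stdlib csv module (the sample rows are invented):

```python
import csv
import io

# A file where one unquoted comma has silently split a field into two.
broken = "id,country\n1,Moldova, Republic of\n2,Norway\n"

# Any row whose field count disagrees with the header is flagged
# before import, instead of being loaded as mangled data.
reader = csv.reader(io.StringIO(broken))
header = next(reader)
bad = [row for row in reader if len(row) != len(header)]
print(bad)  # [['1', 'Moldova', ' Republic of']]
```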
I personally would shy away from binary formats whenever possible. For my column-based files I use TSV or the pipe character as delimiter; even Excel accepts these files if you include "sep=|" as the first line.
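A sketch of producing such a pipe-delimited file with the stdlib csv module; the leading "sep=|" line is the hint Excel recognizes when opening the file directly:

```python
import csv
import io

buf = io.StringIO()
buf.write("sep=|\n")  # hint line Excel uses to pick the delimiter
csv.writer(buf, delimiter="|").writerow(["MD", "Moldova, Republic of"])
print(buf.getvalue())
# sep=|
# MD|Moldova, Republic of
```

With the pipe as delimiter the comma needs no quoting at all, though a truly pipe-containing field would be quoted by the writer just as commas are in plain CSV.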
That's a cool map.
It's almost certainly the result of applying AI stylization to this map https://commons.wikimedia.org/wiki/File:Europe-Moldova.svg without attribution, messing up some of the smaller details in the process.
"EUROPA" kind of gives it away
Why use Shopify if they can't even get their CSVs working right? Worse bugs have to be lurking.
The majority of countries' official names are in this format; we just use the short forms. "Republic of ..." is the most common formal country name: https://en.wikipedia.org/wiki/List_of_sovereign_states
Sure, but why Moldova of all places? I've seen this form usually for places where there's a dispute for the short name, like Nice/Naughty Korea, Taiwan/West Taiwan, or Macedonia/entitled Greek government.
Typical vibe coding quality.
It's not the serialisation that's wrong, it must be the data.
Let's rename a country because we're not capable of handling commas in data.
Come on man. What are we doing here. This is not even anything interesting like Norway being interpreted as False in YAML. This is just a straightforward escaping issue.
Huge skill issue. Nothing to see here.