Wikidata as a Giant Crosswalk File

(dbreunig.com)

94 points | by dbreunig 5 days ago

9 comments

  • e12e 4 days ago

    Nice!

    > However, for reasons unknown to me, they wrap these neatly separated rows with brackets ([ and ]) and add a comma to each line

    Well, the reason (misguided or not) is as you say, I imagine:

    > so it’s a valid, JSON array containing 100+ million items.

    > We are not going to attempt to load this massive array. Instead, we’re running this command:

        zcat ../latest-all.json.gz | sed 's/,$//' | split -l 100000 - wd_items_cw --filter='gzip > $FILE.gz'
    
    That's one approach - I'm always a little wary of treating a rich format like JSON as <something>-delimited text - so I'd be curious whether using jq in streaming mode differs much in run-time. I believe this snippet (the core of which we lifted from Stack Overflow or somewhere) does the same thing: split a valid JSON array into ndjson, with tweaks to hopefully generate similar splits:

        gunzip -c ../latest-all.json.gz \
          | jq -cn --stream \
              'fromstream(inputs | (.[0] |= .[1:]) | select(. != [[]]))' \
          | split -l 100000 - wd_items_cw --filter='gzip > $FILE.gz'
    
    Note: on macOS, zcat might not be the same as gunzip -c, hence the change.
    • fiddlerwoaroof 4 days ago

      It's too bad there aren't more streaming JSON parsers like oboe.js[1]. It would be nice if parsing libraries always supplied an event-based approach like this in addition to parsers that build up the entire data structure in memory.

      [1]: https://github.com/jimhigson/oboe.js

      EDIT: looking around a bit, I found json-stream ( https://github.com/dgraham/json-stream ) for Ruby.
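
      For reference, the json-stream API looks roughly like this (a minimal sketch based on its README; the handler set shown here is not exhaustive):

          require "json/stream"

          # Register a callback per parse event; nothing is accumulated in
          # memory unless the handlers choose to do so.
          parser = JSON::Stream::Parser.new do
            start_object { puts "start object" }
            end_object   { puts "end object" }
            key          { |k| puts "key: #{k}" }
            value        { |v| puts "value: #{v}" }
          end

          # Feed the document in arbitrary chunks as they arrive.
          parser << '{"answer":'
          parser << ' 42}'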

      • e12e 4 days ago

        I recently looked a little at streaming large JSON files in Ruby - but ran into some problems trying to stream from and to gzipped files by layering Ruby IO objects. In theory it should just be stacking streams, but in practice it was convoluted, a little brittle, and quite slow.
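
        The shape I was going for was roughly this (a sketch, assuming json-stream as the parser; the chunk size and handlers are illustrative):

            require "zlib"
            require "json/stream"

            # "Stacking streams": wrap the gzipped file in a GzipReader and feed
            # decompressed chunks straight into a streaming JSON parser.
            parser = JSON::Stream::Parser.new do
              value { |v| } # handle events here instead of building the whole document
            end

            Zlib::GzipReader.open("../latest-all.json.gz") do |gz|
              while (chunk = gz.read(1 << 20)) # read ~1 MB at a time
                parser << chunk
              end
            end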

      • Alifatisk 4 days ago

        There are even more alternatives at the bottom: https://github.com/dgraham/json-stream?tab=readme-ov-file#al...

  • Alifatisk 4 days ago

    Interesting article - I think this is the first time I've seen someone pick Ractors over the Parallel gem, cool!

    I love seeing these quick and dirty Ruby scripts used for data processing / filtering or whatever - this is what it's good at!

    • dbreunig 4 days ago

      Thanks! This is a near-perfect use case for Ractors, since we chunked all the files and there's no need for the file-processing function to share any context.
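
      The shape of it is roughly this (a simplified sketch, not the exact script - the file glob and the per-chunk function are placeholders):

          # One Ractor per chunk file; no shared state between them.
          chunk_files = Dir.glob("wd_items_cw*.gz")

          ractors = chunk_files.map do |path|
            Ractor.new(path) do |p|
              process_chunk(p) # hypothetical per-chunk processing function
            end
          end

          # Collect each Ractor's return value.
          results = ractors.map(&:take)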

  • ZeroGravitas 2 days ago

    There's a cool tool for adding these ID connections to Wikidata:

    https://mix-n-match.toolforge.org/

    It lets you review existing potential matches, as well as upload CSVs or generate regex web scrapers to ingest database IDs to be linked against others.

    Here are some related to places, since the article is geographical:

    https://mix-n-match.toolforge.org/#/group/ig_authority_contr...

    But it's got all sorts of people, places, and things - anything that someone might have built a catalog or list of.

  • nighthawk454 4 days ago

    Hey, cool article, thanks! Might be time to finally dive into DuckDB.

    • gnulinux 3 days ago

      DuckDB is amazing. I've been using it over the last few weeks to analyze data I generate with Datalog/Soufflé, and I was completely blown away by the performance and QOL features. I seriously don't understand how it can be this fast...