A little comparison between R and Kap

(blog.dhsdevelopments.com)

28 points | by tosh 3 days ago

6 comments

  • RodgerTheGreat 21 hours ago

    In Lil, the readcsv[] function takes an optional string specifying a type for each column to decode:

        purchases:readcsv[read["purchases.csv"] "sii"]
    
    Summing a column:

        sum purchases.amount
    
    To create a summary, we need to reduce each group to a single row:

        select first country sum amount by country from purchases
    
    Discounting:

        select first country sum amount-discount by country from purchases
    
    Lil doesn't have a "median" primitive, so we import one from a deck. Decks can contain multiple modules, but we happen to know this one contains only one. Your path will vary:

        stats:first import["stats.deck"]
        select first country sum amount-discount by country where amount<stats.median[amount]*10 from purchases
    
    Calculating the median within each group is merely a matter of reordering clauses:

        select first country sum amount-discount where amount<stats.median[amount]*10 by country from purchases
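
    For readers who don't know Lil, here is a rough Python sketch of why the clause order matters. It is not a translation of the queries above: the sample rows and both function names are made up for illustration, and it assumes that filtering before grouping means one median over the whole table, while filtering after grouping means one median per country.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows standing in for purchases.csv: (country, amount, discount).
purchases = [
    ("USA", 100, 10),
    ("USA", 200, 20),
    ("USA", 5000, 50),
    ("Canada", 10, 1),
    ("Canada", 20, 2),
    ("Canada", 400, 4),
]

def totals_global_median(rows):
    """Filter against one median over all rows, then group and sum."""
    limit = median(amount for _, amount, _ in rows) * 10
    totals = defaultdict(int)
    for country, amount, discount in rows:
        if amount < limit:
            totals[country] += amount - discount
    return dict(totals)

def totals_group_median(rows):
    """Group first, then filter each group against its own median."""
    groups = defaultdict(list)
    for country, amount, discount in rows:
        groups[country].append((amount, discount))
    totals = {}
    for country, items in groups.items():
        limit = median(amount for amount, _ in items) * 10
        totals[country] = sum(a - d for a, d in items if a < limit)
    return totals

print(totals_global_median(purchases))  # {'USA': 270, 'Canada': 423}
print(totals_group_median(purchases))   # {'USA': 270, 'Canada': 27}
```

    With this sample data the two versions agree for USA but not for Canada: the 400 purchase is below ten times the overall median (150) yet above ten times Canada's own median (20).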
    • lokedhs 20 hours ago

      The string to specify the column types is not a terrible idea. Does it have other configuration options, like whether or not to assume the first row is the headers, or specifying the separator character?

      • RodgerTheGreat 19 hours ago

        Lil's readcsv[] takes three arguments: a data string, an optional typecode-string (which can also skip columns with "_"), and an optional delimiter character. First row is always assumed to be headers; I find it easy enough to concatenate on a header row before calling the function if I'm ever dealing with a headerless CSV file.

        The typecode-string approach in Lil is very similar to how Q handles it with dyadic 0:.

        In this specific example I could do without the typecode-string since arithmetic operators like sum, -, and * will coerce string columns into numbers, but I think this way is cleaner.
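
        To make the typecode-string idea concrete, here is a minimal Python sketch of a reader with the same three-argument shape. It is an illustration, not Lil's implementation: `read_typed_csv` and `CONVERTERS` are hypothetical names, and only the codes that appear in this thread ("s", "i", and "_" for skipping) are handled, with "s" assumed to mean string and "i" integer.

```python
import csv
import io

# Assumed typecode meanings for this sketch: "s" = string, "i" = integer,
# "_" = skip the column. Lil's real set of typecodes may be larger.
CONVERTERS = {"s": str, "i": int}

def read_typed_csv(text, typecodes=None, delimiter=","):
    """Parse CSV text into a dict of columns; the first row is the header."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, body = rows[0], rows[1:]
    if typecodes is None:
        typecodes = "s" * len(header)  # default: leave every column as text
    table = {}
    for index, (name, code) in enumerate(zip(header, typecodes)):
        if code == "_":
            continue  # skipped columns never reach the result
        convert = CONVERTERS[code]
        table[name] = [convert(row[index]) for row in body]
    return table

sample = "country,amount,discount\nUSA,100,10\nCanada,20,2\n"
print(read_typed_csv(sample, "si_"))
# {'country': ['USA', 'Canada'], 'amount': [100, 20]}
```

        One character per column keeps the call short, at the cost of the caller having to know the column order in advance.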

        • lokedhs 18 hours ago

          I see. Kap tries to be as generic as possible, so assuming that the table has headers doesn't feel right. If the table doesn't have headers and the reader assumes it does, you may silently lose the first row of data.

          • RodgerTheGreat 18 hours ago

            You have to make the decision somewhere in your code, unless you're willing to lean on a heuristic; all of the examples in R and Lil make assumptions about the names of columns in the file on-disk just as they make assumptions about the delimiter and the presence of headers.

            If I knew the CSV file didn't have built-in headers, I'd write the Lil script like this:

                purchases:readcsv["country,amount,discount\n",read["headerless.csv"] "sii"]
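
            The same trade-off shows up in any reader that assumes a header row. As a small Python sketch using the standard `csv.DictReader` (the sample data and the `read_with_header` helper are made up for illustration), the first call silently loses a row and the second applies the prepend-a-header fix:

```python
import csv
import io

def read_with_header(text):
    """Treat the first row as column names, as a header-assuming reader does."""
    return list(csv.DictReader(io.StringIO(text)))

# Hypothetical contents of a headerless CSV file.
headerless = "USA,100,10\nCanada,20,2\n"

# Misread: the first data row is consumed as the header and vanishes.
lost = read_with_header(headerless)

# Fix: concatenate a known header row in front before parsing.
fixed = read_with_header("country,amount,discount\n" + headerless)

print(len(lost), len(fixed))  # 1 2
print(fixed[0]["country"])    # USA
```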
            • lokedhs 17 hours ago

              Thanks, that makes sense. I guess most CSV data you see in the real world does have headers. Perhaps I was looking too closely at the default CSV export format from Excel, focusing on making sure it can always be parsed. And Excel doesn't have column headers.