19 comments

  • abeppu 7 hours ago

    Though these tools might be interesting, I wish they had called this something else. This isn't at all related to the concept of hyperparameters, which people commonly refer to as hyperparams. And in their copy, the only reference to hyperparameters seems to misuse the term.

    > This stems from an industry-wide realization that model performance is ultimately bounded by data quality, not just model architecture or hyperparameters.

    Generally we think of model architecture + weights (parameters) as making up the model itself, and hyperparam(s|eters) are more relevant to how one arrives at those weights -- and for this reason more relevant to the efficacy of training than to the performance of the resultant model.

    • platypii 7 hours ago

      That's fair criticism... to be honest, when I started the project it was more focused on hyperparameters, and it evolved into this javascript-for-ai mission. But by now I just kind of like the name.

  • barabbababoon an hour ago

    Very cool stuff. Is this some kind of lighter-weight duckdb-wasm? Did I get this right?

  • wbradmoore 8 hours ago

    Why not WASM? Seems like something like duckdb-wasm or datafusion-wasm can do the same thing?

    • platypii 8 hours ago

      Duckdb and datafusion are super cool! But they are VERY large wasm blobs (30-40mb each). This is often larger than the data you’re trying to load. And they add complexity with serving and deploying wasm files.

      Hyparquet is 10kb of pure js, so it's trivial to deploy in a modern webapp, and it wins hands down on the time-to-first-data metric.

      • abeppu 7 hours ago

        > Duckdb and datafusion are super cool! But they are VERY large wasm blobs (30-40mb each). This is often larger than the data you’re trying to load.

        I don't know how to reconcile this with the emphasis on the page on interacting with datasets relevant to AI, which are commonly several orders of magnitude larger than this. What's an AI problem where the data involved is less than 10s of MB? I think that only toy problems and datasets could plausibly be smaller (e.g. the training images for the classic MNIST dataset are 47MB, and the whole dataset is 55MB https://www.kaggle.com/datasets/hojjatk/mnist-dataset?select... ).

        • platypii 7 hours ago

          Yea, except with parquet you don't need to load the entire file; the parquet metadata lets you do HTTP range requests for just the data you need.

          For example, this parquet file is the entire English Wikipedia (400mb), but it loads less than 4mb, including html and all js, to display the first rows:

          https://hyperparam.app/files?key=https%3A%2F%2Fs3.hyperparam...

          This way you can have huge AI datasets in cloud storage, and still have a nice interface for looking at your data.

          In particular, a lot of modern AI datasets are huge walls of text (web scrapes, chains of thought, or agentic conversation histories), and most datasets on huggingface are in parquet. So you can look at your data much more quickly this way versus, say, Jupyter notebooks.

          Here's the glaive reasoning dataset on the Hyperparam hugging face space:

          https://huggingface.co/spaces/hyperparam/hyperparam?url=http...
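
          In code, that kind of partial read looks roughly like this (simplified sketch; the URL and column names are placeholders, see the hyparquet README for the full options):

              import { asyncBufferFromUrl, parquetReadObjects } from 'hyparquet'

              // wrap a remote URL so reads become HTTP range requests
              const file = await asyncBufferFromUrl({ url: 'https://example.com/wiki-en.parquet' })

              // fetch only the first 10 rows of two columns -- a few small
              // range requests instead of downloading the whole file
              const rows = await parquetReadObjects({
                file,
                columns: ['title', 'text'],
                rowStart: 0,
                rowEnd: 10,
              })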

          • xoofoog 6 hours ago

            Wow - that's super clever. How do you get away with loading part of the file? Which part do you load?

            • chatmasta 4 hours ago

              I’m not OP but as this is a common pattern…

              Parquet stores the metadata in the footer, so the first request is effectively a negative byte range (content length minus footer length). This metadata includes table statistics like “column ‘date_sold’ has minimum date 1-1-1970 and maximum date 12-31-2024,” and row group statistics like “the row group at byte offset X has minimum ‘date_sold’ value of 1-1-2023 and maximum ‘1-1-2024’.”
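
              Concretely, that first request is just a suffix range: the last 8 bytes of a parquet file are a 4-byte little-endian metadata length followed by the "PAR1" magic. Rough sketch (placeholder URL; a real reader would then parse the thrift-encoded metadata):

                  const url = 'https://example.com/data.parquet' // placeholder

                  // 1. suffix range: last 8 bytes = footer length (4 bytes LE) + "PAR1" magic
                  const tail = await fetch(url, { headers: { Range: 'bytes=-8' } })
                  const tailBytes = new Uint8Array(await tail.arrayBuffer())
                  const footerLength = new DataView(tailBytes.buffer).getUint32(0, true)
                  const magic = new TextDecoder().decode(tailBytes.slice(4)) // should be 'PAR1'

                  // 2. second request: just the thrift-encoded metadata (schema,
                  //    row group byte offsets, min/max statistics)
                  const metaRes = await fetch(url, { headers: { Range: `bytes=-${footerLength + 8}` } })
                  const metaBytes = new Uint8Array(await metaRes.arrayBuffer())

                  // 3. parse the metadata, then range-request only the row groups you need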

              So if your query tool gets a SQL query with a predicate like “WHERE date_sold > ‘3-1-2024’ AND date_sold < ‘3-30-2024’” then it can use “partition pruning” to fetch only the RowGroup of the parquet file that includes the March 2024 data.

              My colleague Artjoms (and co-founder of Splitgraph with me) gave a great presentation [0] on how we achieved this with DataFusion, including visualization of the pruning.

              [0] https://youtube.com/watch?v=D_phetiS-4w

  • yujian 5 hours ago

    It's super interesting to be able to see the data on the web.

  • klntsky 7 hours ago

    That's a lot of names for a bunch of tools that do a single task each.

    What I would really benefit from is a hypothetical LLM chat app that is focused on data migration or processing pipelines.

    • platypii 7 hours ago

      Funny you say that: I built these tools because I wanted to build something very much like what you're describing!

      I was trying to look at, filter, and transform large AI datasets, and I was frustrated with how bad the existing tooling was for working with datasets with huge amounts of text (web scrapes, github dumps, reasoning tokens, agent chat logs). Jupyter notebooks are woefully bad at helping you look at your data.

      So I wanted to build better browser tools for working with AI datasets. But to do that I had to build these tools (there was no working parquet implementation in JS when I started).

      Anyway, I'm still working on building an app for data processing, using an LLM chat assistant to help a single user curate entire datasets singlehandedly. But for now I'm releasing these components to the community as open source. And having them "do a single task each" was very much intentional. Thanks for the comment!

  • dmosites 8 hours ago

    The iceberg reader sounds cool but how does it handle auth? Most iceberg tables are not publicly accessible.

    • platypii 8 hours ago

      It does support using S3 presigned requests, but it's admittedly a little awkward to ask a server for a presigned request before every fetch. But it does still have the benefit that you can have a small, light server just handing out signed requests, and then the user's browser does the heavy lifting. This can save a lot on server scaling costs.
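
      For anyone curious, the "small server handing out signed requests" part is roughly this shape (Node sketch using the AWS SDK; the bucket name and auth check are placeholders):

          import express from 'express'
          import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3'
          import { getSignedUrl } from '@aws-sdk/s3-request-presigner'

          const s3 = new S3Client({ region: 'us-east-1' })
          const app = express()

          app.get('/sign', async (req, res) => {
            // real code would authenticate the user and validate the requested key here
            const command = new GetObjectCommand({ Bucket: 'my-data-bucket', Key: req.query.key })
            const url = await getSignedUrl(s3, command, { expiresIn: 3600 }) // valid for 1 hour
            res.json({ url })
          })

          app.listen(3000)

      The browser then fetches (and range-requests) the parquet file from the signed URL directly, so the server never touches the data itself.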

      That being said, I wish there was a better auth story. Open to suggestions if anyone has ideas!

  • doppenhe 8 hours ago

    Very cool, does `npx hyperparam dataset.parquet` phone home?

    • platypii 8 hours ago

      Zero telemetry, fully local. It spawns `http-server` on port 2048 and opens your browser at `localhost`. Similar pattern to Jupyter Notebooks. Feel free to audit the code... the server is <200 LOC.

  • cyrdax 7 hours ago

    Anyone benchmark this vs. duckdb-wasm?

    • platypii 7 hours ago

      I don’t have benchmarks specifically against duckdb. I’m sure native C++ will run faster than JavaScript.

      But what's important is that with Hyperparam you can do it in the browser, where the bottleneck will always be the network, not the CPU.

  • lorr1 7 hours ago

    You’re right. Python's the worst.