Show HN: Semantic Splitting with WordLlama

(github.com)

2 points | by deepsquirrelnet 15 hours ago

3 comments

  • deepsquirrelnet 15 hours ago

    Over the last few weeks, I've been working hard at adding semantic splitting/chunking to WordLlama, and I'm excited to share what I came up with. Because WordLlama is fast and lightweight, I felt this was a great application for the platform, and it aligns with our goal of creating a useful utility for LLM-related interfacing tasks.

    In this blog post, I demonstrate the methodology I arrived at. At the end, I show semantic splitting on a 1 million character text of "The Lord of the Rings". The new (Python) method `wl.split(text)` executes on a single CPU core of my ThinkPad T480 in 700 ms.
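    For anyone who wants to try it, usage looks roughly like the sketch below. The default `load()` call and the file handling are just illustrative; check the README for the exact options on your install.

      from wordllama import WordLlama

      wl = WordLlama.load()               # default model/config
      with open("fellowship.txt") as f:   # any long document
          text = f.read()
      chunks = wl.split(text)             # the new semantic splitting method
      print(len(chunks), chunks[0][:200])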

    You might use a feature like this when chunking for building knowledge bases (RAG), or when extracting and filtering text to send to an LLM in online applications (combining wl.split(...) and wl.filter(...)).
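    A rough sketch of that combination (the filter arguments, in particular the threshold value, are illustrative rather than recommended settings):

      query = "what happened at the Council of Elrond?"
      chunks = wl.split(text)
      relevant = wl.filter(query, chunks, threshold=0.3)  # keep only chunks related to the query
      context = "\n\n".join(relevant)                     # pass this as context to the LLM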

    I hope you enjoy the technical deep dive, and find this feature useful.

    • magicalhippo 13 hours ago

      Did indeed enjoy, thanks for sharing.

      Have you by any chance compared this to the rolling sentence method described briefly here[1], with discussion here[2]?

      I take it your method is more compute-optimized, given that you have to do far more embedding calculations with the rolling sentence method. However, from what I can gather, you sacrifice some "boundary quality" for speed by doing the chunking "blind", so it could perhaps be interesting to compare the quality and performance of the two approaches.

      [1]: https://gpt3experiments.substack.com/p/a-new-chunking-approa...

      [2]: https://news.ycombinator.com/item?id=41643388
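      For readers who haven't clicked through: as I understand it, the rolling sentence method works roughly like the sketch below. Embed each sentence, keep a running mean of the current chunk's embeddings, and start a new chunk when the next sentence's similarity to that mean drops below a threshold. The sentence splitting, mean-embedding comparison, and threshold here are my own reading of [1], not code from either project.

        import numpy as np

        def rolling_sentence_chunks(sentences, embed, threshold=0.55):
            # embed: any callable mapping a list of strings to a 2D array of vectors
            vecs = np.asarray(embed(sentences), dtype=float)
            vecs /= np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-9
            chunks, current = [], [0]
            for i in range(1, len(sentences)):
                centroid = vecs[current].mean(axis=0)
                centroid /= np.linalg.norm(centroid) + 1e-9
                if float(centroid @ vecs[i]) < threshold:  # topic shift: close the chunk
                    chunks.append(" ".join(sentences[j] for j in current))
                    current = [i]
                else:
                    current.append(i)
            chunks.append(" ".join(sentences[j] for j in current))
            return chunks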

      • deepsquirrelnet 4 hours ago

        I’ve taken ideas from blog posts like that one, but mostly I’ve found scattered ideas rather than a well-described approach. There are a lot of details to work out, all of which I feel are important to the process. I probably spent more time figuring out how to distribute the text into small segments to even perform similarity comparisons than anything else.

        That was one motivation I had for doing a full write-up. It has some aspects of a rolling sentence method (rolling window similarity) and some aspects of regular chunking (trying to retain paragraph structure).

        I wanted to demonstrate an approach that I worked on with all the gory details for people to follow if they want to understand what’s happening under the hood or take ideas for their own experiments.
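        To make the rolling window similarity idea a bit more concrete, here is a toy version of the general technique: embed small segments, compare each segment to the mean of the preceding window, and cut where the similarity dips. This is a simplification for illustration, not the actual WordLlama implementation; the window size, drop threshold, and `embed` callable are all placeholders.

          import numpy as np

          def windowed_split(segments, embed, window=3, min_drop=0.2):
              # segments: small pieces of text (sentences or paragraph fragments)
              # embed: callable mapping a list of strings to a 2D array of vectors
              if len(segments) < 2:
                  return [" ".join(segments)]
              vecs = np.asarray(embed(segments), dtype=float)
              vecs /= np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-9
              sims = []
              for i in range(1, len(segments)):
                  left = vecs[max(0, i - window):i].mean(axis=0)
                  left /= np.linalg.norm(left) + 1e-9
                  sims.append(float(left @ vecs[i]))  # cosine similarity to the preceding window
              # cut where similarity dips well below average (a crude stand-in for minima detection)
              cuts = [i + 1 for i, s in enumerate(sims) if s < np.mean(sims) - min_drop]
              chunks, start = [], 0
              for cut in cuts + [len(segments)]:
                  chunks.append(" ".join(segments[start:cut]))
                  start = cut
              return [c for c in chunks if c]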