SolidStart - Hacker News

AlbertoGP 2 years ago

> [...] self-hostable solution that leverages state-of-the-art (SOTA) vision models for segment extraction and OCR, unifying the output through a Rust Actix server. This setup allows you to process PDFs and extract segments at an impressive speed of approximately 5 pages per second on a single NVIDIA L4 instance, offering a cost-effective and scalable solution for high-accuracy bounding box segment extraction and OCR. This solution has models that accommodate for both GPU and CPU environments.

oliwarner 2 years ago

> To use Chunkr privately without complying to the AGPL-3.0 license terms you can contact us

AGPL has no bearing on how I use software, only how I can redistribute it. AGPL does not stop a person or company using Chunkr or its product in a commercial environment without further license.

Onavo 2 years ago

Yes, but most of the time tools handling PDFs are accessed over the network. It would be a minefield if legal says the AGPL affects every single microservice that interacts with your service. This is why there is a blanket AGPL ban at most tech companies. The AGPL is effectively an EULA.

oliwarner 2 years ago

First, the quote talks about what I do privately. AGPL explicitly encourages me to do whatever the hell I like with it. I don't need another license.

But more broadly, the view that AGPL microservices are viral is just one interpretation. If it really is just a swappable backend interface, why should it affect other subsystems? IANAL, but it seems pretty trivial to insulate a microservice with the same sort of GPL condom companies ship to avoid "linking" to (e.g.) the kernel.

https://medium.com/swlh/understanding-the-agpl-the-most-misu...

raffraffraff 2 years ago

I laughed at "GPL condom". Well done.

kybernetikos 2 years ago

It'd be great to see some examples on the web site.

creer 2 years ago

Probably deserves a paper too really.

chiccomagnus 2 years ago

It seems it always tries to extract tables even if the content is just text. It's not working, and at a similar price you can get high-performing solutions like preprocess.co and similar.

infecto 2 years ago

As with all of these startups in this space, there is never a comparison of output between them and the ($$) competition. I realize they are doing some segmentation in the workflow, but imo the valuable part is the actual document text and table extraction piece. Textract in its cheapest and simplest form is cheaper than this service. With tables turned on, Textract is more expensive, but I would be curious whether Textract is doing a better job.

mistrial9 2 years ago

What you want is work in itself. Who pays the reviewer? How do you discover the reviewer? Secondly, why must there be one "winner"? Maybe there are niches, local markets, business groups. They want something and someone provides it.

infecto 2 years ago

Huh? This is a company selling a product/service. I am saying they have done nothing to compare themselves to the competition beyond saying they are expensive, and I am arguing that the competition is not much more expensive and might offer superior quality.

ollivera 2 years ago

Initially, I didn’t like having the tables as images, but using GPT Vision might be a more accurate way to obtain the markdown. I was also considering using the Adobe Extraction API to extract markdown from the CSV file. So, I will try your API over the weekend and see the results.

saaaaaam 2 years ago

Although the docs say “get started by creating an account on chunkr.ai” there doesn’t seem to be any way to create an account.

abusedmedia 2 years ago

Click login, then register.

canterburry 2 years ago

Task Fails

Chunkr – Vision model based PDF chunking