Not sure where you are based, but if you were in the EU and had no commercial intentions, you might want to consider adding the crawls from OpenWebSearch.eu, an EU-funded research project to provide an open crawl of a substantial part of the Web (they also collaborate with Common Crawl), its plain text and an index:
https://openwebsearch.eu/
It would be fantastic if someone could provide a not-for-profit decent quality Web search engine.
"There was one surprise when I revisited costs: OpenAI charges an unusually low $0.0001 / 1M tokens for batch inference on their latest embedding model. Even conservatively assuming I had 1 billion crawled pages, each with 1K tokens (abnormally long), it would only cost $100 to generate embeddings for all of them. By comparison, running my own inference, even with cheap Runpod spot GPUs, would cost on the order of 100× more expensive, to say nothing of other APIs."
I wonder if OpenAI uses this as a honeypot to get domain-specific source data into its training corpus that it might otherwise not have access to.
Maybe I misunderstand, but I'm pretty sure they offer an option for cheaper API costs (or maybe its credits?) if you allow them to train on your API requests.
To your point, pretty sure it's off by default, though
"Turn on sharing with OpenAI for inputs and outputs from your organization to help us develop and improve our services, including for improving and training our models. Only traffic sent after turning this setting on will be shared. You can change your settings at any time to disable sharing inputs and outputs."
And I am 'enrolled for complimentary daily tokens.'
> Your data is your data. As of March 1, 2023, data sent to the OpenAI API is not used to train or improve OpenAI models (unless you explicitly opt in to share data with us).
i'd not rule out some approach like instead of training directly on the data, may be they would train on a very high dimensional embedding of such a data (or some other similarly "anonymized", yet still very semantically rich representation of the data)
I'm not sure what were the exact limits, but I definitely recall running into server errors with S3 and the OCI equivalent service — not technically 429s but enough to essentially limit throughput. SQS had 429s, I believe due to number of requests and not messages, but they only support batching at most 10.
I definitely wanted these to "just work" out of the box (and maybe I could've worked more with AWS/OCI given more time), as I wanted to focus on the actual search.
Just wow. My greatest respect! Also an incredible write up. I like the take-away that an essential ingredient to a search engine is curated and well filtered data (garbage in garbage out) I feel like this has been a big learning of the LLM training too, rather work with less much higher quality data. I'm curious how a search engine would perform where all content has been judged by an LLM.
I'm currently trying to get a friends small business website to rank. I have a decent understanding of SEO, doing more technically correct things and did a decent amount of hand written content specific to local areas and services provided.
Two months in, bing still hasn't crawled the fav icon. Google finally did after a month.
I'm still getting outranked by tangentially related services, garbage national lead collection sites, yelp top 10 blog spam, and even exact service providers from 300 miles away that definitely don't serve the area.
Something is definitely wrong with pagerank and crawling in general.
At the end, the author thinks about adding Common Crawl data. Our ranking information, generated from our web graph, would probably be a big help in picking which pages to crawl.
I love seeing the worked out example at scale -- I'm surprised at how cost effective the vector database was.
It's been clear to anyone familiar with encoder only LLMs that Google is effectively dead. The only reason why it still lives is that it takes a while to crawl the whole web and keep the index up to date.
If someone like common crawl, or even a paid service, solves the crawling of the web in real time then the moat Google had for the last 25 years is dead and search is commoditized.
The team that runs the Common Crawl Foundation is well aware of how to crawl and index the web in real time. It's expensive, and it's not our mission. There are multiple companies that are using our crawl data and our web graph metadata to build up-to-date indexes of the web.
Yes, I've used your data myself on a number of occasions.
But you are pretty much the only people who can save the web from AI bots right now.
The sites I administer are drowning in bots, and the applications I build which need web data are constantly blocked. We're in the worst of all possible worlds and the simplest way to solve it is to have a middleman that scrapes gently and has the bandwidth to provide an AI first API.
You can see their panic - in my country they are running TV ads for Google search, showing it answering LLM-prompt-like queries. They are desperately trying to win back that mind share, and if they lose traditional keyword search too they’re cooked
It's not dead but will take a huge hit. I still use DuckDuckGo since I get good answers, good discovery, taken right to the sources (whom I can cite), and the search indexes are legal vs all the copyright infringement in AI training.
If AI training becomes totally legal, I will definitely start using them more in place of or to supplement search. Right now, I don't even use the AI answers.
Just out of interest, I sent a query I've had difficulties getting good results for with major engines: "what are some good options for high-resolution ultrawide monitors?".
The response in this engine for this query at this point seems to have the same fallacy as I've seen in other engines. Meta-pages "specialising" in broad rankings are preferred above specialist data about the specific sought-after item. It seems that the desire for a ranking weighs the most.
If I were to manually try to answer this query, I would start by looking at hardware forums and geeky blogs, pick N candidates, then try to find the specifications and quirks for all products.
Of course, it is difficult to generically answer if a given website has performed this analysis. It can be favourable to rank sites citing specific data higher in these circumstances.
As a user, I would prefer to be presented with the initial sources used for assembling this analysis. Of course, this doesn't happen because engines don't perform this kind of bottom-to-top evaluation.
You could argue that it is not really a search query. There is not a particular page that answers the question “correctly”, it requires collating multiple sources and reasoning. That is not a search problem.
This argument almost feels disingenuous to me. Of course, there isn't going to be one resource that will completely answer the question. However, there are going to be resources that are much likelier to contain correct parts to the answer and there are resources that are much likely to contain just SEO fluff.
The whole premise of what makes a good search engine has been based on the idea of surfacing those results that most likely contain good information. If that was not the case Google would not have risen to such dominance in the first place.
Mad respect. This is an incredible project to pull together all these technologies. The crown jewel of a search engine is its ranking algorithm. I'm not sure how LLM is being used in this regard in here.
One effective old technique for ranking is to capture the search-to-click relationship by real users. It's basically the training data by human mapping the search terms they entered to the links they clicked. With just a few of clicks, the ranking relevance goes way up.
May be feeding the data into a neural net would help ranking. It becomes a classification problem - given these terms, which links have higher probabilities being clicked. More people clicking on a link for a term would strengthening the weights.
> One effective old technique for ranking is to capture the search-to-click relationship by real users. It's basically the training data by human mapping the search terms they entered to the links they clicked. With just a few of clicks, the ranking relevance goes way up.
That's not very effective. Ever heard of clickbait?
Like I've said uncountable times before, the only effective technique to clean out the search results of garbage is to use a point system that penalises each 3rd party advertisement placed on the page.
The more adverts, the lower the rank.
And the reason that will work is because you are directly addressing the incentive for producing garbage - money!
The result should be "when two sites have the same basic content, in the search results promote the one without ads over the ones with ads".
Until this is done, search engines will continue serving garbage, because they are rewarding those actors who are producing garbage.
This is often touted as the solution to remove SEO garbage, except that you'd also get rid of the news websites along with it which are fairly reliant on advertising.
This is really really cool. I had earlier wanted to entirely run my searches on it and though that seems possible, I feel like it would be sadly a little bit more waste of time in terms of searches but still I'll maybe try to run some of my searches against this too and give me thoughts on this after doing something like this if I could, like, it is a big hit or miss but it will almost land you to the right spot, like not exactly.
For example, I searched lemmy hoping to find the fediverse and it gave me their liberapay page though.
Please, actually follow up on that common crawl promise and maybe even archive.org or other websites too and I hope that people are spending billions in this AI industry, I just hope that you can whether even through funding or just community crowdwork, actually succeed in creating such an alternative. People are honestly fed up with the current search engine almost monopoly.
Wasn't Ecosia trying to roll out their own search engine, They should definitely take your help or have you in their team..
I just want a decentralized search engine man, I understand that you want to make it sustaianable and that's why you haven't open sourced but please, there is honestly so much money going into potholes doing nothing but make our society worse and this project almost works good enough and has insane potential...
Please open source it and lets hope that the community tries to figure out a way around some ways of monetization/crowd funding to actually make it sustainable
But still, I haven't read the blog post in its entirety since I was so excited that I just started using the search engine.., But I feel like the article feels super indepth and that this idea can definitely help others to create their own proof of concepts or actually create some open source search engine that's decent once and for all.
Not going to lie, But this feels like a little magic and I am all for it. I have never been this excited the more I think about it of such projects in actual months!
I know open source is tough and I come from a third country but this is actually so cool that I will donate ya as much as I can / have for my own right now. Not much around 50$ but this is coming from a guy who has not spent a single penny online and wanting to donate to ya, please I beg ya to open source and use that common crawl, but I just wish you all the best wishes in your life and career man.
The author claims "It should be far less susceptible to keyword spam and SEO tactics." however anyone with a cursory knowledge of the limitations of embeddings/LLM's knows the hardest part is that there is no seperation between the prompt and the content to be queried (e.g "ignore all previous instructions" etc...). It would not be hard to adversarially generate embeddings for SEO, in-fact it's almost easier since you know the maths underlying the algorithm to fit to.
The author is using SBERT embeddings, not an instruction-following model, so the "ignore all previous instructions" trick isn't going to work, unless you want to outrank https://en.wikipedia.org/wiki/Ignore_all_rules when people search for what to do after ignoring all previous instructions.
Of course a spammer could try to include one sentence with a very close embedding for each query they want to rank for, but this would require combinatorially more effort than keyword stuffing where including two keywords also covers queries including both together.
As I've become more advanced in my career I've grown more frustrated with search engines for the same problems you described in your write up. This is a fantastic solution and such a refreshing way to use LLMs. I hope this project goes far!
Thank you for sharing! This is one of the coolest articles I have seen in a while on HN. I did some searches and I think the search results looked very useful so far. I particularly loved about your article that most of the questions I had while reading got answered in a most structured way.
I still have questions:
* How long do you plan to keep the live demo up?
* Are you planning to make the source code public?
* How many hours in total did you invest into this "hobby project" in the two months you mentioned in your write-up?
A vector-only search engine will fail for a lot of common use cases where the keywords do matter. I tried searching for `garbanzo bean stew` and got totally irrelevant bean recipes.
> RocksDB and HNSW were sharded across 200 cores, 4 TB of RAM, and 82 TB of SSDs.
Was your experience with ivfpq not good? I’ve seen big recall drops compared to hnsw, but wow, takes some hardware to scale.
Also did you try sparse embeddings like SPLADE? I have no idea how they scale at this size, but seems like a good balance between keyword and semantic searches.
One of the most insightful posts I’ve read recently. I especially enjoy the rationale behind the options you chose to reduce costs and going into detail on where you find the most savings.
I know the post primarily focuses on neural search, but I’m wondering you tried integrating hybrid BM-25 + embeddings search and if this led to any improvements. Also, what reranking models did you find most useful and cost efficient?
This was super cool to read. I'm developing something somewhat similar, but for business search, and ran into a lot of the same challenges. Everyone thinks crawling, processing, and indexing data is easy, but doing it cost effectively at scale is a completely different beast.
Kudos wilsonzlin. I'd love to chat sometime if you see this. It's a small space of people that can build stuff like this e2e.
I been doing a smaller version of the same idea for just domain of job listings. Initially I looked at HNSW but couldn't reason on how to scale it with predictable compute time cost. I ended up using IVF because I am a bit memory starved. I will have to take at look at coreNN.
This then begs the question for me, without an LLM what is the approach to build a search engine? Google search used to be razor sharp, then it degraded in the late 2000s and early 2010s and now its meh. They filter out so much content for a billion different reasons and the results are just not what they used to be. I've found better results from some LLMs like Grok (surprisingly) but I can't seem to understand why what was once a razor exact search engine like Google, it cannot find verbatim or near verbatim quotes of content I remember seeing on the internet.
My understanding was that every few months Google was forced to adjust their algorithms because the search results would get flooded by people using black hat SEO techniques. At least that's the excuse I heard for why it got so much worse over time.
Not sure if that's related to it ignoring quotes and operators though. I'd imagine that to be a cost saving measure (and very rarely used, considering it keeps accusing me of being a robot when I do...)
From what I understand, that good old Google from the 2000s was built entirely without any kind of machine learning. Just a keyword index and PageRank. Everything they added since then seems to have made it worse (though it did also degrade "organically" from the SEO spam).
Google certainly had to update their algorithms to cope with SEO, but that's not why their results have become so poor in the last five years or so. They made a conscious decision to prioritize profit over search quality. This came out in internal emails that were published as part of discovery for one of the antitrust suits.
To reiterate: Google search results are shit because shit ad-laden results make them more money in the short term.
That's it. And it's sad that so many people continue to give them the benefit of the doubt when there is no doubt.
The majority of the public internet shifted to "SEO optimized" garbage while the real user-generated content shifted to walled gardens like Instagram, Facebook, and Reddit (somewhat open). More recently, even use generated content is poisoned by wannabe influencers shilling some snake oil or scam.
This is my take as well. When websites were few, directories were awesome. When websites multiplied, Google was awesome. When websites became SEO trash, social networks were awesome. When social networks are become trash, I'm hoping the Fediverse becomes the next awesome.
I don't see AI in any form becoming the next awesome.
I wish all the best wishes to fediverse too.
I'd like to take this one step too that communities have gone a similar transition too from forums to mostly now discord and I wish them to move to something like matrix which is federated (yes I know it has issues, but trust me sacrifices must be made)
What are your thoughts on things like bluesky/nostr and (matrix) too.
Bluesky does seem centralized in its current stage but its idea of (pds?) makes it fundamentally hack proof in the sense that if you are on a server which gets hacked, then your account is still safe or atleast that's the plan, not sure about its current implementation.
I also agree with AI not being the next awesome. Maybe for coding sure, but not in general yeah. But even in coding man, I feel like its good enough and its hard to catch more progress from now on and its just not worth it but honestly that's just me.
I think BlueSky still needs to prove itself. It is what Twitter/X was a decade ago, before the enshittification, and I enjoy the content a lot, with my reservations.
The weakness of Mastodon (and the Fediverse IMO), is that you can join one of many instances, and it becomes easier to form an echo chamber. Your feed will the the Fediverse hose (lots of irrelevant content), your local instance (an echo chamber), or your subscriptions (curating them takes effort). Nevertheless, that might be as well a strength I'm not truly appreciating.
There was a Neal Stephenson novel where curated feeds had become a big business because it was the only tolerable way to browse the Internet. Lately I've been thinking that's more likely to happen.
I mean both bluesky and fediverse are just decentralized technologies, so lets say that you are worried about bluesky "enshittening"
I doubt it to happen because of its decentralized-enough nature.
I also agree with the subscriptions curation part the last time I checked, but I didn't use mastodon as often as I used lemmy and it was a less of an issue on lemmy.
Still, I feel like bluesky as an technology is goated and doesn't feel like it can be enshittened.
Nostr on the other hand does seem to me as an echo chamber of crypto bros but honestly, that's the most decentralization as you can ever get. Shame that we are going to get mostly nothing meaningful out of it imo. Which in that case bluesky seems to me as good enough but things like search etc. / the current bluesky is definitely centralized but honestly the same problems kept coming up on fediverse too, lemmy.world was getting too bloated with too many members and even mastodon had only one really famous home server afaik iirc mastodon.social right?
Also I may be wrong, I usually am but iirc mastodon only allows you to comment/ interact with posts on your own server like, I wanted to comment on mastodon.social from some other server but I don't remember being able to do so, maybe skill issue from my side.
This is correct. Marketing and Advertising manipulated pages to gain higher rankings because they figured out the algorithm behind it. Forcing Google to change the algorithm. Originally, prior to the flood of <meta> garbage and hidden <div>’s it was very good at linking content together. Now, it’s a weighted database.
This has always been the explanation, but I've always wondered if it wasn't so much battling SEO as balancing the appearance of battling SEO while not killing some factor related to their revenue.
When I encounter the "cannot find verbatim quote I remember" problem and then later find what I was looking for in some other way, I usually discover that I misremembered and the actual quote was different. I do prefer getting zero results in that case, though.
The internet itself has changed over time, and a lot of content has just disappeared. It shouldn't appear in search because it's just not there anymore, it'd be a 404.
A search engine that kept dead entries but maybe put them in an “missing” tab or something would’ve been monstrously useful for me in so many situations. There’s been numerous times I’ve remembered looking at something N years ago only for all but the faintest traces of it to have disappeared from the internet. With a “missing” tab I’d at least have former URLs, page titles, etc to work with (archive.org, etc).
I wish there was an old fashioned n-gram + page rank search engine for those of us who don't mind the issues the older Google had. I've thought about making my own a few times.
I also love how often the author finds that traditionally glazed databases don't scale as you might think, and turn to stronger storage primitves that were designed exactly to do that
i knew i've seen this before, they clearly migrated their blog as all posts are uploaded on the 10th, and i already have multiple mentioned repos starred lol
Adding my kudos to the other commenters here - the polymath skills necessary to take on something like this is remarkable as a solo effort. I was hoping for more detail on the issues found during the request/parsing at a domain/page level.
I love this and think that your write-up is fantastic, thank you for sharing your work in such detail.
What are you thinking in terms of improving [and using] the knowledge graph beyond the knowledge panel on the side? If I'm reading this correctly, it seems like you only have knowledge panel results for those top results that exist in Wikipedia, is that correct?
Wow, looks like a tremendous commitment and depth of knowledge went into this one-man project. I couldn't even read the whole write up, I had to skim part of it. I'm super impressed.
This is awesome, and the low cost is especially impressive. I rarely have the motivation after working on a side project to actually document all the decisions made along the way, much less in such a thorough way. Regarding your CoreNN library, Clearview has a blog post [1] on how they index 30 billion face embeddings that you may find interesting. They combine RocksDB with faiss.
Not sure where you are based, but if you were in the EU and had no commercial intentions, you might want to consider adding the crawls from OpenWebSearch.eu, an EU-funded research project to provide an open crawl of a substantial part of the Web (they also collaborate with Common Crawl), its plain text and an index:
It would be fantastic if someone could provide a not-for-profit decent quality Web search engine."There was one surprise when I revisited costs: OpenAI charges an unusually low $0.0001 / 1M tokens for batch inference on their latest embedding model. Even conservatively assuming I had 1 billion crawled pages, each with 1K tokens (abnormally long), it would only cost $100 to generate embeddings for all of them. By comparison, running my own inference, even with cheap Runpod spot GPUs, would cost on the order of 100× more expensive, to say nothing of other APIs."
I wonder if OpenAI uses this as a honeypot to get domain-specific source data into its training corpus that it might otherwise not have access to.
> OpenAI charges an unusually low $0.0001 / 1M tokens for batch inference on their latest embedding model.
Is this the drug dealer scheme? Get you hooked later jack up prices? After all, the alternative would be regenerating all your embeddings no?
I don’t think OpenAI train on data processed via the API, unless there’s an exception specifically for this.
Maybe I misunderstand, but I'm pretty sure they offer an option for cheaper API costs (or maybe its credits?) if you allow them to train on your API requests.
To your point, pretty sure it's off by default, though
Edit: From https://platform.openai.com/settings/organization/data-contr...
Share inputs and outputs with OpenAI
"Turn on sharing with OpenAI for inputs and outputs from your organization to help us develop and improve our services, including for improving and training our models. Only traffic sent after turning this setting on will be shared. You can change your settings at any time to disable sharing inputs and outputs."
And I am 'enrolled for complimentary daily tokens.'
Can you truly trust them though?
Yes, it would be disastrous for OpenAI if it got out they are training on B2B data despite saying they don’t.
We're both talking about the company whose entire business model is built on top of large scale copyright infringement, right?
Have they said they don't? (actually curious)
Yes, they have. [1]
> Your data is your data. As of March 1, 2023, data sent to the OpenAI API is not used to train or improve OpenAI models (unless you explicitly opt in to share data with us).
[1]: https://platform.openai.com/docs/guides/your-data
Yeah, so many companies have been completely ruined after similar PR disasters /s
i'd not rule out some approach like instead of training directly on the data, may be they would train on a very high dimensional embedding of such a data (or some other similarly "anonymized", yet still very semantically rich representation of the data)
i am too lazy to ask openai.
It'd be a way to put crap or poisoned data into their training data if that is the case. I wouldn't.
The title should be “10x engineer creates Google in their spare time”
But seriously what an amazing write up, plus animations, analysis etc etc. Bravo.
It was also ironic to see AWS failing quite a few use cases here. Stuff to think about.
Also looking into the AWS limits;
> SQS had very low concurrent rate limits that could not keep up with the throughput of thousands of workers across the pipeline.
I could not find this perhaps the author meant Lambda limits?
> services like S3 have quite low rate limits — there are hard limits, but also dynamic per-account/bucket quotas
You have virtually unlimited throughput with prefix partitions
I'm not sure what were the exact limits, but I definitely recall running into server errors with S3 and the OCI equivalent service — not technically 429s but enough to essentially limit throughput. SQS had 429s, I believe due to number of requests and not messages, but they only support batching at most 10.
I definitely wanted these to "just work" out of the box (and maybe I could've worked more with AWS/OCI given more time), as I wanted to focus on the actual search.
Those are reasonable expectations. I’m very impressed with how it all worked out.
Just wow. My greatest respect! Also an incredible write up. I like the take-away that an essential ingredient to a search engine is curated and well filtered data (garbage in garbage out) I feel like this has been a big learning of the LLM training too, rather work with less much higher quality data. I'm curious how a search engine would perform where all content has been judged by an LLM.
I'm currently trying to get a friends small business website to rank. I have a decent understanding of SEO, doing more technically correct things and did a decent amount of hand written content specific to local areas and services provided.
Two months in, bing still hasn't crawled the fav icon. Google finally did after a month. I'm still getting outranked by tangentially related services, garbage national lead collection sites, yelp top 10 blog spam, and even exact service providers from 300 miles away that definitely don't serve the area.
Something is definitely wrong with pagerank and crawling in general.
Sadly, that ship has sailed. The web is dead. SEO should be called SEM (Search Engine Manipulation).
>something is wrong with pagerank
Do you have any backlinks? If not, it’s working as intended?
i recall from years ago a site index url could be submitted to google. creating that index took some work.
At the end, the author thinks about adding Common Crawl data. Our ranking information, generated from our web graph, would probably be a big help in picking which pages to crawl.
I love seeing the worked out example at scale -- I'm surprised at how cost effective the vector database was.
It's incredible. I can't believe it but it actually works quite nicely.
If 10K $5 subscriptions can cover its cost, maybe a community run search engine funded through donations isn't that insane?
It's been clear to anyone familiar with encoder only LLMs that Google is effectively dead. The only reason why it still lives is that it takes a while to crawl the whole web and keep the index up to date.
If someone like common crawl, or even a paid service, solves the crawling of the web in real time then the moat Google had for the last 25 years is dead and search is commoditized.
The team that runs the Common Crawl Foundation is well aware of how to crawl and index the web in real time. It's expensive, and it's not our mission. There are multiple companies that are using our crawl data and our web graph metadata to build up-to-date indexes of the web.
Yes, I've used your data myself on a number of occasions.
But you are pretty much the only people who can save the web from AI bots right now.
The sites I administer are drowning in bots, and the applications I build which need web data are constantly blocked. We're in the worst of all possible worlds and the simplest way to solve it is to have a middleman that scrapes gently and has the bandwidth to provide an AI first API.
I'm all for that.
You can see their panic - in my country they are running TV ads for Google search, showing it answering LLM-prompt-like queries. They are desperately trying to win back that mind share, and if they lose traditional keyword search too they’re cooked
It's not dead but will take a huge hit. I still use DuckDuckGo since I get good answers, good discovery, taken right to the sources (whom I can cite), and the search indexes are legal vs all the copyright infringement in AI training.
If AI training becomes totally legal, I will definitely start using them more in place of or to supplement search. Right now, I don't even use the AI answers.
Kagi seems to partially be that. Yes really corpo but way Better wibes than Google. Searxng is a bit diffrent but also a thing.
I think even more spectacularly, we may be witnessing the feature to feature obsolescence of big tech.
Models make it cheap to replicate and perform what tech companies do. Their insurmountable moats are lowering as we speak.
yep, seems the big guys running out of ideas, to some degree.
Very cool project!
Just out of interest, I sent a query I've had difficulties getting good results for with major engines: "what are some good options for high-resolution ultrawide monitors?".
The response in this engine for this query at this point seems to have the same fallacy as I've seen in other engines. Meta-pages "specialising" in broad rankings are preferred above specialist data about the specific sought-after item. It seems that the desire for a ranking weighs the most.
If I were to manually try to answer this query, I would start by looking at hardware forums and geeky blogs, pick N candidates, then try to find the specifications and quirks for all products.
Of course, it is difficult to generically answer if a given website has performed this analysis. It can be favourable to rank sites citing specific data higher in these circumstances.
As a user, I would prefer to be presented with the initial sources used for assembling this analysis. Of course, this doesn't happen because engines don't perform this kind of bottom-to-top evaluation.
You could argue that it is not really a search query. There is not a particular page that answers the question “correctly”, it requires collating multiple sources and reasoning. That is not a search problem.
This argument almost feels disingenuous to me. Of course, there isn't going to be one resource that will completely answer the question. However, there are going to be resources that are much likelier to contain correct parts to the answer and there are resources that are much likely to contain just SEO fluff.
The whole premise of what makes a good search engine has been based on the idea of surfacing those results that most likely contain good information. If that was not the case Google would not have risen to such dominance in the first place.
This wasn't even in the realm of what I thought is possible for a single person to do. Incredible work!
It doesn't seem that far in diatance from a commercial search engine? Maybe even Google?
50k to run is a comically small number. I'm tempted to just give you that money to seed.
Mad respect. This is an incredible project to pull together all these technologies. The crown jewel of a search engine is its ranking algorithm. I'm not sure how LLM is being used in this regard in here.
One effective old technique for ranking is to capture the search-to-click relationship by real users. It's basically the training data by human mapping the search terms they entered to the links they clicked. With just a few of clicks, the ranking relevance goes way up.
May be feeding the data into a neural net would help ranking. It becomes a classification problem - given these terms, which links have higher probabilities being clicked. More people clicking on a link for a term would strengthening the weights.
> One effective old technique for ranking is to capture the search-to-click relationship by real users. It's basically the training data by human mapping the search terms they entered to the links they clicked. With just a few of clicks, the ranking relevance goes way up.
That's not very effective. Ever heard of clickbait?
Like I've said uncountable times before, the only effective technique to clean out the search results of garbage is to use a point system that penalises each 3rd party advertisement placed on the page.
The more adverts, the lower the rank.
And the reason that will work is because you are directly addressing the incentive for producing garbage - money!
The result should be "when two sites have the same basic content, in the search results promote the one without ads over the ones with ads".
Until this is done, search engines will continue serving garbage, because they are rewarding those actors who are producing garbage.
This is often touted as the solution to remove SEO garbage, except that you'd also get rid of the news websites along with it which are fairly reliant on advertising.
This is really really cool. I had earlier wanted to entirely run my searches on it and though that seems possible, I feel like it would be sadly a little bit more waste of time in terms of searches but still I'll maybe try to run some of my searches against this too and give me thoughts on this after doing something like this if I could, like, it is a big hit or miss but it will almost land you to the right spot, like not exactly.
For example, I searched lemmy hoping to find the fediverse and it gave me their liberapay page though.
Please, actually follow up on that common crawl promise and maybe even archive.org or other websites too and I hope that people are spending billions in this AI industry, I just hope that you can whether even through funding or just community crowdwork, actually succeed in creating such an alternative. People are honestly fed up with the current search engine almost monopoly.
Wasn't Ecosia trying to roll out their own search engine, They should definitely take your help or have you in their team..
I just want a decentralized search engine man, I understand that you want to make it sustaianable and that's why you haven't open sourced but please, there is honestly so much money going into potholes doing nothing but make our society worse and this project almost works good enough and has insane potential...
Please open source it and lets hope that the community tries to figure out a way around some ways of monetization/crowd funding to actually make it sustainable
But still, I haven't read the blog post in its entirety since I was so excited that I just started using the search engine.., But I feel like the article feels super indepth and that this idea can definitely help others to create their own proof of concepts or actually create some open source search engine that's decent once and for all.
Not going to lie, But this feels like a little magic and I am all for it. I have never been this excited the more I think about it of such projects in actual months!
I know open source is tough and I come from a third country but this is actually so cool that I will donate ya as much as I can / have for my own right now. Not much around 50$ but this is coming from a guy who has not spent a single penny online and wanting to donate to ya, please I beg ya to open source and use that common crawl, but I just wish you all the best wishes in your life and career man.
The author claims "It should be far less susceptible to keyword spam and SEO tactics." however anyone with a cursory knowledge of the limitations of embeddings/LLM's knows the hardest part is that there is no seperation between the prompt and the content to be queried (e.g "ignore all previous instructions" etc...). It would not be hard to adversarially generate embeddings for SEO, in-fact it's almost easier since you know the maths underlying the algorithm to fit to.
The author is using SBERT embeddings, not an instruction-following model, so the "ignore all previous instructions" trick isn't going to work, unless you want to outrank https://en.wikipedia.org/wiki/Ignore_all_rules when people search for what to do after ignoring all previous instructions.
Of course a spammer could try to include one sentence with a very close embedding for each query they want to rank for, but this would require combinatorially more effort than keyword stuffing where including two keywords also covers queries including both together.
Classic HN dismissive comment.
The talent displayed here is immense. I challenge you to do better.
As I've become more advanced in my career I've grown more frustrated with search engines for the same problems you described in your write up. This is a fantastic solution and such a refreshing way to use LLMs. I hope this project goes far!
Thank you for sharing! This is one of the coolest articles I have seen in a while on HN. I did some searches and I think the search results looked very useful so far. I particularly loved about your article that most of the questions I had while reading got answered in a most structured way.
I still have questions:
* How long do you plan to keep the live demo up?
* Are you planning to make the source code public?
* How many hours in total did you invest into this "hobby project" in the two months you mentioned in your write-up?
A vector-only search engine will fail for a lot of common use cases where the keywords do matter. I tried searching for `garbanzo bean stew` and got totally irrelevant bean recipes.
Yes, indeed. I just tried search "Apple", and apple.com is not on the first page.
Agree. For best results both lexical and vector search results should be fed into a reranker. Slow and expensive but high quality.
What if you build a graph engine then encode those edges into its own embedding space?
Nerdsnipe?
Tried this search: What is an sbert embedding?
Google still gave me a better result: https://towardsdatascience.com/sbert-deb3d4aef8a4/
Nevertheless this project looks great and I'd love to see it continue to improve.
> RocksDB and HNSW were sharded across 200 cores, 4 TB of RAM, and 82 TB of SSDs.
Was your experience with ivfpq not good? I’ve seen big recall drops compared to hnsw, but wow, takes some hardware to scale.
Also did you try sparse embeddings like SPLADE? I have no idea how they scale at this size, but seems like a good balance between keyword and semantic searches.
One of the most insightful posts I’ve read recently. I especially enjoy the rationale behind the options you chose to reduce costs and going into detail on where you find the most savings.
I know the post primarily focuses on neural search, but I’m wondering you tried integrating hybrid BM-25 + embeddings search and if this led to any improvements. Also, what reranking models did you find most useful and cost efficient?
This is incredibly, incredibly cool. Creating a search engine that beats Google in quality in just 2 months and less than a thousand dollars.
Really great idea about the federated search index too! YaCy has it but it's really heavy and never really gave good results for me.
This was super cool to read. I'm developing something somewhat similar, but for business search, and ran into a lot of the same challenges. Everyone thinks crawling, processing, and indexing data is easy, but doing it cost effectively at scale is a completely different beast.
Kudos wilsonzlin. I'd love to chat sometime if you see this. It's a small space of people that can build stuff like this e2e.
This is so cool. A question on the service mesh - is building your own typically the best way to do things?
I'm new to networking..
I been doing a smaller version of the same idea for just domain of job listings. Initially I looked at HNSW but couldn't reason on how to scale it with predictable compute time cost. I ended up using IVF because I am a bit memory starved. I will have to take at look at coreNN.
This then begs the question for me, without an LLM what is the approach to build a search engine? Google search used to be razor sharp, then it degraded in the late 2000s and early 2010s and now its meh. They filter out so much content for a billion different reasons and the results are just not what they used to be. I've found better results from some LLMs like Grok (surprisingly) but I can't seem to understand why what was once a razor exact search engine like Google, it cannot find verbatim or near verbatim quotes of content I remember seeing on the internet.
My understanding was that every few months Google was forced to adjust their algorithms because the search results would get flooded by people using black hat SEO techniques. At least that's the excuse I heard for why it got so much worse over time.
Not sure if that's related to it ignoring quotes and operators though. I'd imagine that to be a cost saving measure (and very rarely used, considering it keeps accusing me of being a robot when I do...)
From what I understand, that good old Google from the 2000s was built entirely without any kind of machine learning. Just a keyword index and PageRank. Everything they added since then seems to have made it worse (though it did also degrade "organically" from the SEO spam).
Google certainly had to update their algorithms to cope with SEO, but that's not why their results have become so poor in the last five years or so. They made a conscious decision to prioritize profit over search quality. This came out in internal emails that were published as part of discovery for one of the antitrust suits.
To reiterate: Google search results are shit because shit ad-laden results make them more money in the short term.
That's it. And it's sad that so many people continue to give them the benefit of the doubt when there is no doubt.
The majority of the public internet shifted to "SEO optimized" garbage while the real user-generated content shifted to walled gardens like Instagram, Facebook, and Reddit (somewhat open). More recently, even use generated content is poisoned by wannabe influencers shilling some snake oil or scam.
This is my take as well. When websites were few, directories were awesome. When websites multiplied, Google was awesome. When websites became SEO trash, social networks were awesome. When social networks are become trash, I'm hoping the Fediverse becomes the next awesome.
I don't see AI in any form becoming the next awesome.
I wish all the best wishes to fediverse too. I'd like to take this one step too that communities have gone a similar transition too from forums to mostly now discord and I wish them to move to something like matrix which is federated (yes I know it has issues, but trust me sacrifices must be made)
What are your thoughts on things like bluesky/nostr and (matrix) too.
Bluesky does seem centralized in its current stage but its idea of (pds?) makes it fundamentally hack proof in the sense that if you are on a server which gets hacked, then your account is still safe or atleast that's the plan, not sure about its current implementation.
I also agree with AI not being the next awesome. Maybe for coding sure, but not in general yeah. But even in coding man, I feel like its good enough and its hard to catch more progress from now on and its just not worth it but honestly that's just me.
I think BlueSky still needs to prove itself. It is what Twitter/X was a decade ago, before the enshittification, and I enjoy the content a lot, with my reservations.
The weakness of Mastodon (and the Fediverse IMO), is that you can join one of many instances, and it becomes easier to form an echo chamber. Your feed will the the Fediverse hose (lots of irrelevant content), your local instance (an echo chamber), or your subscriptions (curating them takes effort). Nevertheless, that might be as well a strength I'm not truly appreciating.
There was a Neal Stephenson novel where curated feeds had become a big business because it was the only tolerable way to browse the Internet. Lately I've been thinking that's more likely to happen.
I mean both bluesky and fediverse are just decentralized technologies, so lets say that you are worried about bluesky "enshittening"
I doubt it to happen because of its decentralized-enough nature.
I also agree with the subscriptions curation part the last time I checked, but I didn't use mastodon as often as I used lemmy and it was a less of an issue on lemmy.
Still, I feel like bluesky as an technology is goated and doesn't feel like it can be enshittened.
Nostr on the other hand does seem to me as an echo chamber of crypto bros but honestly, that's the most decentralization as you can ever get. Shame that we are going to get mostly nothing meaningful out of it imo. Which in that case bluesky seems to me as good enough but things like search etc. / the current bluesky is definitely centralized but honestly the same problems kept coming up on fediverse too, lemmy.world was getting too bloated with too many members and even mastodon had only one really famous home server afaik iirc mastodon.social right?
Also I may be wrong, I usually am but iirc mastodon only allows you to comment/ interact with posts on your own server like, I wanted to comment on mastodon.social from some other server but I don't remember being able to do so, maybe skill issue from my side.
This is correct. Marketing and Advertising manipulated pages to gain higher rankings because they figured out the algorithm behind it. Forcing Google to change the algorithm. Originally, prior to the flood of <meta> garbage and hidden <div>’s it was very good at linking content together. Now, it’s a weighted database.
This has always been the explanation, but I've always wondered if it wasn't so much battling SEO as balancing the appearance of battling SEO while not killing some factor related to their revenue.
That begs the question, if you can recreate their engine from the 2000s with high quality search results, would investors even fund you? Lol
> if you can recreate their engine from the 2000s
Seriously, how? Iam pretty sure you have to have a very different approach than google had in its best times. The web is a very different place now
Sorry to be pedantic but you mean "raises the question." https://en.wikipedia.org/wiki/Begging_the_question
When I encounter the "cannot find verbatim quote I remember" problem and then later find what I was looking for in some other way, I usually discover that I misremembered and the actual quote was different. I do prefer getting zero results in that case, though.
The internet itself has changed over time, and a lot of content has just disappeared. It shouldn't appear in search because it's just not there anymore, it'd be a 404.
A search engine that kept dead entries but maybe put them in an “missing” tab or something would’ve been monstrously useful for me in so many situations. There’s been numerous times I’ve remembered looking at something N years ago only for all but the faintest traces of it to have disappeared from the internet. With a “missing” tab I’d at least have former URLs, page titles, etc to work with (archive.org, etc).
I see you’re also having trouble coping with this. Fact is, “that” internet is simply gone.
Nah, its a series of tubes, just gotta get the right tubes together.
I wish there was an old fashioned n-gram + page rank search engine for those of us who don't mind the issues the older Google had. I've thought about making my own a few times.
This is really well written, especially considering the complexity
I also love how often the author finds that traditionally glazed databases don't scale as you might think, and turn to stronger storage primitves that were designed exactly to do that
i knew i've seen this before, they clearly migrated their blog as all posts are uploaded on the 10th, and i already have multiple mentioned repos starred lol
Getting a CORS error from the API - is the demo at https://search.wilsonl.in/ working for anyone else?
cors error is due to the actual request failing (502 Bad Gateway). hug of death?
Yeah just saw the 502 - probably hug of death.
This must be the best technical article I read on HN in months!
Adding my kudos to the other commenters here - the polymath skills necessary to take on something like this is remarkable as a solo effort. I was hoping for more detail on the issues found during the request/parsing at a domain/page level.
I love this and think that your write-up is fantastic, thank you for sharing your work in such detail.
What are you thinking in terms of improving [and using] the knowledge graph beyond the knowledge panel on the side? If I'm reading this correctly, it seems like you only have knowledge panel results for those top results that exist in Wikipedia, is that correct?
Man! This incredible.. It gives me motivation to continue with my document search engine..
good post, thanks for sharing
Such a big inspiration! One of the few times where I genuinely read and liked the work - didn't even notice how the time flew by.
Feels like it's more and more about consuming data & outputting the desired result.
Wow, looks like a tremendous commitment and depth of knowledge went into this one-man project. I couldn't even read the whole write up, I had to skim part of it. I'm super impressed.
That stack element is amazing
I wish more people showed their whole exploded stack like that and in an elegant way
Really well done writeup!
This is awesome, and the low cost is especially impressive. I rarely have the motivation after working on a side project to actually document all the decisions made along the way, much less in such a thorough way. Regarding your CoreNN library, Clearview has a blog post [1] on how they index 30 billion face embeddings that you may find interesting. They combine RocksDB with faiss.
[1] https://www.clearview.ai/post/how-we-store-and-search-30-bil...
please, I want to pay for this. 10x better than Kagi which I stopped paying for
how much did it cost ?
I couldn't get the search working (there was some cors error) . But what a feat and writeup. Wonder Stuck!
Very nice project. Do you have plans to commercialize it next?
Incredibly cool. What a write-up. What an engineer.