Based solely upon the title and the first commit's date, I'm guessing it's this: https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99
Correct!
Congrats Viktor.
> The feedback cycle in web search engine development is very long....Overall the approach taken to improving search result quality is looking at a query that does not give good results, asking what needs to change for that to improve, and then making that change. Sometimes it’s a small improvement, sometimes it’s a huge game changer.
Yes, this resonates with our experience.
Amazing! "bicycle touring in France" as a search target produces a huge number of spot-on returns beautifully formatted.
> To make the most of phrase matching, stop words need to go.
Perhaps I am misunderstanding; does this mean occurrences of stop words like "the" are stored now instead of ignored? That seems like it would add a lot of bloat. Are there any optimizations in place?
Just a shot-in-the-dark suggestion, but if you are storing some bits with each keyword occurrence, can you add a few more bits to store whether the term is adjacent to a common stop word? So maybe if you have to=0 or=1, "to be or not to be" would be able to match the data `0be 1not 0be`, where only "be" and "not" are actual keywords. But the extra metadata bits can be ignored, so pages containing "The Clash" will match both the literal query (via the "the" bit), and just "clash" (without the "the" bit).
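A minimal Python sketch of the suggestion above, in case it helps make it concrete. Everything here is an assumption for illustration, not how Marginalia's index actually works: the `STOP_IDS` table (the ids from the example, to=0 and or=1, plus a made-up the=2), the `encode` and `phrase_match` helpers, and the choice to record only the stop word immediately before each keyword.

```python
# Hypothetical stop-word table; ids follow the example (to=0, or=1), the=2 is made up.
STOP_IDS = {"to": 0, "or": 1, "the": 2}


def encode(text):
    """Drop stop words, but tag each keyword with the id of the stop word
    directly before it (None if there was none)."""
    out = []
    prev_stop = None
    for tok in text.lower().split():
        if tok in STOP_IDS:
            prev_stop = STOP_IDS[tok]  # remember it for the next keyword
        else:
            out.append((prev_stop, tok))
            prev_stop = None
    return out


def phrase_match(doc, query):
    """True if the encoded query occurs as a contiguous run in the encoded doc.

    A query keyword without a stop-word tag ignores the doc's tag, so a plain
    keyword query still matches; a tagged query keyword requires the same tag.
    """
    n = len(query)
    if n == 0:
        return False
    for i in range(len(doc) - n + 1):
        window = doc[i:i + n]
        if all(
            d_kw == q_kw and (q_stop is None or d_stop == q_stop)
            for (d_stop, d_kw), (q_stop, q_kw) in zip(window, query)
        ):
            return True
    return False


print(encode("to be or not to be"))  # [(0, 'be'), (1, 'not'), (0, 'be')]

page = encode("London Calling by The Clash")
print(phrase_match(page, encode("the clash")))  # literal query, uses the "the" tag
print(phrase_match(page, encode("clash")))      # plain keyword, tag ignored
```

Only "be" and "not" end up as stored keywords here, as in the example. Runs of several stop words and trailing stop words are lost in this sketch, so it would need more bits (or a different scheme) to handle those.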
One of the problems with stop words is that they vary between languages. "The" is a good candidate in English, but in Danish it just means "tea", which should be a valid search term. And even in English, what looks like an obvious stop word can be an integral part of a phrase, as in "How to use The in English".
It's not as bad as you might think; we're talking dozens of GB across the entire index.
I don't think stop words as an optimization make sense once you go beyond BM25. The search engine behaves worse, and adding a bunch of optimizations makes an already incredibly complex piece of software more so.
So overall I don't think the juice is worth the squeeze.
Removing stop words is usually bad advice; it helps only in a limited set of circumstances. Google keeps all the “the”: https://www.google.com/search?q=the
Always nice to see updates on marginalia.
1. Is the index public? 2. Any chance of an RSS feed search?
1. I'm not sure what you mean. The code is open source[3], but the data is, for logistical reasons, not available. Common Crawl is far more comprehensive though.
2. I've got such plans in the pipe. Not sure when I'll have time to implement it, as I'm in the middle of moving in with my girlfriend this month. Soon-ish.
[3] At https://git.marginalia.nu/, though there are still some rough edges to sand down before it's easy to self-host (as easy as hosting a full-blown internet search engine gets).
Thanks. What you answered in 1. is what I meant. I was looking for a small web dataset, but Common Crawl is too big for me to process.
1. Do you know of any dataset of RSS feeds that isn't hundreds of GB?
2. How does your crawler handle malicious sites when crawling?