- i saw your other comment that talks about using an open source dataset but i had to ask
- how would you actually go about loading reviews if you really wanted to
- what kind of system would you need to work around the captcha and stuff
i would probably use Playwright with custom code, create chunks based on similar products, then run it on a large cluster in parallel (https://github.com/Burla-Cloud/burla).
if you have a single worker trying to scrape a shit ton of products back to back to back you're going to get rate limited or their bot detection will catch you.
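rough sketch of the chunk-then-fan-out idea, not a real scraper. `fetch_reviews` is a hypothetical stub (that's where the Playwright code would go), the category grouping and delay values are made up, and i'm using a local thread pool from the standard library as a stand-in for a cluster.

```python
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def chunk_by_category(products, max_chunk_size=50):
    """Group (product_id, category) pairs so similar products share a chunk."""
    by_category = defaultdict(list)
    for product_id, category in products:
        by_category[category].append(product_id)
    chunks = []
    for ids in by_category.values():
        # Split each category into pieces no larger than max_chunk_size.
        for start in range(0, len(ids), max_chunk_size):
            chunks.append(ids[start:start + max_chunk_size])
    return chunks


def fetch_reviews(product_id):
    """Hypothetical stub: a real version would load the page in a browser."""
    return {"product_id": product_id, "reviews": []}


def scrape_chunk(product_ids, delay_seconds=0.0):
    """One worker handles one chunk, pausing between requests."""
    results = []
    for product_id in product_ids:
        results.append(fetch_reviews(product_id))
        time.sleep(delay_seconds)  # spacing requests out; value is an assumption
    return results


def scrape_all(products, max_workers=8, max_chunk_size=50):
    """Fan the chunks out to workers and flatten the results in order."""
    chunks = chunk_by_category(products, max_chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_chunk = list(pool.map(scrape_chunk, chunks))
    return [row for chunk in per_chunk for row in chunk]
```

the point is just that each worker only ever sees a small chunk, so no single worker is hitting the site back to back for hours. swapping the thread pool for a cluster is the same shape: one function, a list of chunks.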
Well the website is kind of useless, but it does suck me in. I love reading crazy reviews. The only thing that would make it better is if they also included Airbnb reviews.
The second review I read was a customer complaining about profanity in a movie and then writing out all the examples. Who has time for that?
well well well... take a look at what I just built https://burla-cloud.github.io/airbnb-burla/
I must say the reviews you have are more on the horrifying side than the pretty funny side. My favorite funny (and bad) review was a host that accused his guest of flipping over all the furniture in the house and the guest was like "why and how would I do this". I still want to know what happened that day. How did all the furniture end up upside down?
I love it! Endless entertainment and 0 attempts to get me to stay at the Airbnb.
yeah now that I have the images I want to do some silly shit with them. maybe find all the Airbnbs with satanic decor or like red rooms haha
Find all of the ones with taxidermy in the southwest USA.... It's like all of them. Okay I did find one in Austin without, but it still had cowhide pillows.
Amazon doesn't even allow you to use slightly strong (non-profanity) wording in reviews these days. Are these old reviews?
from 2023
I love this. The reviews' word play tops Macbeth in my book.
i'm just happy they don't censor the comment section haha, makes for funny content.
i also love that people will complain about the vulgar language in a book or movie by writing a review that contains a quote with the vulgar language
It does taste exactly like formaldehyde and kerosene swirled around with a bit of gas station kimchi that's been warming down the front of a hobo's pants. How do I know? To give this comment validity, I took the liberty of mixing up that exact concoction, then I went down to the train yard and asked a hobo to warm it up for me right on his swampy, fetid hobo taint.
Loved this until I remembered that these reviews are what AI is trained on and influenced by.
But at least he's employing hobos.
how did you scrape all the reviews?
open source dataset from McAuley Lab at UCSD https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2....
I'm going to publish an Airbnb example tomorrow where I scraped 1,406,718 photo URLs from public listing pages. For that I used https://docs.burla.dev/, a high-performance parallel processing Python library I've been working on for a few years now.
Shit like this is why Amazon reviews are now behind a login wall for everyone.