Ask HN: I want to crawl every plain HTML website. Where do I begin?
You might be interested in Common Crawl. They crawl the web and make the full dataset available to download.
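To make that concrete, here's a rough sketch of how you might look pages up in Common Crawl's URL index and keep only the HTML ones. The crawl ID below is an assumption (pick a current one from the list on index.commoncrawl.org), and the helper names are made up for illustration. It only builds the query URL and filters the JSON-lines response; the actual HTTP request is left to you.

```python
import json
from urllib.parse import urlencode

# Assumed crawl ID for illustration; check index.commoncrawl.org for current ones.
CRAWL_ID = "CC-MAIN-2024-10"


def index_query_url(url_pattern, crawl_id=CRAWL_ID):
    """Build an index query URL for captures matching url_pattern (hypothetical helper)."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"


def html_records(index_lines):
    """Yield index records whose recorded MIME type is text/html.

    index_lines is an iterable of JSON strings, one record per line,
    which is the shape the index returns with output=json.
    """
    for line in index_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the response
        record = json.loads(line)
        if record.get("mime") == "text/html":
            yield record
```

Each record also carries a filename, offset, and length, which is what you'd use to pull just that page out of the archive files instead of downloading everything.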
I'd probably look into AWS Lambda, since it lets you run fetches in parallel from different IPs, cheaply and at scale.
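Something like this is roughly what a single Lambda worker could look like, as a minimal sketch. The event shape (`{"urls": [...]}`), the User-Agent string, and the result format are all assumptions of mine, not anything Lambda prescribes; the only Lambda convention used is the `handler(event, context)` entry point. You'd fan out by invoking many of these with different URL batches.

```python
import urllib.request

# Assumed User-Agent; a real crawler should identify itself and respect robots.txt.
USER_AGENT = "example-crawler/0.1"


def handler(event, context):
    """Fetch each URL in event["urls"] and report status and size per URL.

    Failures are recorded per URL rather than raised, so one bad URL
    does not fail the whole batch.
    """
    results = []
    for url in event.get("urls", []):
        try:
            req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
            with urllib.request.urlopen(req, timeout=10) as resp:
                body = resp.read()
                results.append({"url": url, "status": resp.status, "bytes": len(body)})
        except Exception as exc:  # broad on purpose: record the error and keep going
            results.append({"url": url, "error": str(exc)})
    return {"results": results}
```

Note that Lambda doesn't promise a distinct IP per invocation, so treat the "different IPs" part as something to measure rather than rely on.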