ratio11
You might be interested in Common Crawl. They crawl the internet and make the full dataset downloadable.

https://commoncrawl.org
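
For anyone who wants to check what Common Crawl holds for a site before downloading anything, here is a minimal sketch of a lookup against their public URL index. The crawl ID `CC-MAIN-2024-10` is only a placeholder I picked for illustration, and the helper names are mine; check index.commoncrawl.org for the crawl IDs that actually exist.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

INDEX_HOST = "https://index.commoncrawl.org"


def build_index_url(crawl_id, url_pattern):
    """Build a query URL for one crawl's index (crawl_id is an assumption)."""
    query = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{query}"


def parse_index_lines(body):
    """The index answers with one JSON object per line; skip blank lines."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def lookup(crawl_id, url_pattern):
    """Fetch and parse the index records for a URL pattern (needs network)."""
    with urlopen(build_index_url(crawl_id, url_pattern), timeout=30) as resp:
        return parse_index_lines(resp.read().decode("utf-8"))
```

Something like `lookup("CC-MAIN-2024-10", "example.com/*")` should return the capture records for that pattern, assuming that crawl ID is valid.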

heresjohnny
More and more sites rely on lazily loaded content, though, which requires JavaScript. Note that you’re excluding a significant number of sites this way.
deepsy
I'd probably look into AWS Lambda, as you can run queries in parallel from different IPs, at scale and cheaply.
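
A rough sketch of what such a function could look like, using only the standard library. The event shape (`{"urls": [...]}`), the user agent string, and the `fetch` helper are all assumptions for illustration. Which IP each invocation leaves from is up to AWS and not something this code controls.

```python
from urllib.request import Request, urlopen


def fetch(url, timeout=10):
    """Download one page and return (status, body)."""
    req = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read()


def handler(event, context, fetch=fetch):
    """Lambda entry point: fetch every URL listed in the event.

    Lambda passes only (event, context); the fetch parameter exists
    so the function can be exercised locally without network access.
    """
    results = []
    for url in event.get("urls", []):
        try:
            status, body = fetch(url)
            results.append({"url": url, "status": status, "bytes": len(body)})
        except Exception as exc:
            # Record the failure and keep going with the remaining URLs.
            results.append({"url": url, "error": str(exc)})
    return {"results": results}
```

You'd then invoke it many times concurrently, each call with its own small batch of URLs.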
sr.ht