More and more sites are driven by lazily loaded content, though, for which JavaScript is a prerequisite. Do note that you’re excluding a significant number of sites this way.
deepsy
I'd probably look into AWS Lambda, as you can query in parallel from different IPs at scale, cheaply.
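A rough sketch of what that fan-out could look like, not a tested crawler. The helper names (`chunk`, `fan_out`, `make_lambda_invoker`), the `{"urls": [...]}` payload shape, and the assumption that the Lambda function returns a JSON list are all made up for illustration; only `boto3.client("lambda").invoke` is the real API. Whether each invocation actually gets a different outbound IP depends on how AWS places the function, so treat that part as something to verify.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def chunk(urls, size):
    """Split a list of URLs into batches of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def fan_out(urls, invoke, batch_size=50, max_workers=16):
    """Send each batch to `invoke` in parallel and flatten the results.

    `invoke` is any callable taking a list of URLs and returning a list,
    so it can be a real Lambda call or a local stand-in.
    """
    batches = chunk(urls, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(invoke, batches))  # preserves batch order
    return [item for batch in results for item in batch]


def make_lambda_invoker(function_name):
    """Build an `invoke` callable backed by boto3 (needs AWS credentials).

    Assumes a hypothetical Lambda function that accepts {"urls": [...]}
    and responds with a JSON list, one entry per URL.
    """
    import boto3  # imported lazily so the rest runs without boto3 installed

    client = boto3.client("lambda")

    def invoke(batch):
        response = client.invoke(
            FunctionName=function_name,
            InvocationType="RequestResponse",  # synchronous call
            Payload=json.dumps({"urls": batch}).encode("utf-8"),
        )
        return json.loads(response["Payload"].read())

    return invoke
```

You'd call it as `fan_out(urls, make_lambda_invoker("my-fetcher"))`, where `my-fetcher` is a placeholder function name. Passing the invoker in as a parameter means you can swap in a plain local function while developing, before paying for any Lambda invocations.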
https://commoncrawl.org