The Problem
Lately I've been thinking about how to go about scraping the contents of a certain big, multi-national website, to get specific details about the products the company offers for sale. The website has no API, but there is some XML you can download for each product by sending a GET request with the product ID to a specific URL. So at least that's something.
The problem is that there are hundreds of millions of potential product ID's that could exist (between, say, 000000001 and 500000000), yet only a few hundred thousand products actually exist. And it's impossible to know which product ID's are valid.
Conveniently, sending a HEAD request to the product URL yields a different response depending on whether or not the product ID is valid (i.e. the product actually exists). And once we know that the product actually exists, we can download the full XML and scrape it for the bits of data needed.
Obviously sending hundreds of millions of HEAD requests will take an ungodly amount of time to finish if left to run on a single server, so I'd like to take the opportunity to learn how to develop some sort of distributed application (totally new territory for me). At this point, I should mention that this particular website can easily handle a massive amount of incoming requests per second without risk of DOS. I'd prefer not to name the website, but it easily gets millions of hits per day. This scraper will have a negligible impact on the performance of the website. However, I'll immediately put a stop to it if the company complains.
The Design
I have no idea if this is the right approach, but my current idea is to launch a single "coordination server", and some number of nodes to communicate with that server and perform the scraping, all running as EC2 instances.
Each node will launch some number of processes, and each process will be designated a job by the coordination server containing a distinct range of potential product ID's to be scraped (e.g. product ID 00001 to 10000). These jobs will be stored in a database table on the coordination server. Each job will contain info about:
- Product ID start number
- Product ID end number
- Job status (idle, in progress, complete, expired)
- Job expiry time
- Time started
- Time completed
When a node is launched, a query will be sent to the coordination server asking for some configuration data, and for a job to work on. When a node completes a job, a query will be sent updating the status of the job just completed, and another query requesting a new job to work on. Each job has an expiry time, so if a process crashes, or if a node fails for any reason, another node can take over an expired job to try it again.
To maximise the performance of the system, I'll need to work out how many nodes should be launched at once, how many processes per node, the rate of HTTP requests sent, and which EC2 instance type will deliver the most value for money (I'm guessing high network performance, high CPU performance, and high disk I/O would be the key factors?).
At the moment, the plan is to code the scraper in Python, running on Ubuntu EC2 instances, possibly launched within Docker containers, and some sort of key-value store database to hold the jobs on the coordination server (MongoDB?). A relational DB should also work, since the jobs table should be fairly low I/O.
I'm curious to know from more experienced engineers if this is the right approach, or if I'm completely overlooking a much better method for accomplishing this task?
Much appreciated, thanks!