Designing a distributed web scraper

Question

The Problem

Lately I've been thinking about how to go about scraping the contents of a certain big, multi-national website, to get specific details about the products the company offers for sale. The website has no API, but there is some XML you can download for each product by sending a GET request with the product ID to a specific URL. So at least that's something.

The problem is that there are hundreds of millions of potential product IDs that could exist (between, say, 000000001 and 500000000), yet only a few hundred thousand products actually exist. And it's impossible to know in advance which product IDs are valid.

Conveniently, sending a HEAD request to the product URL yields a different response depending on whether or not the product ID is valid (i.e. the product actually exists). And once we know that the product actually exists, we can download the full XML and scrape it for the bits of data needed.
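A minimal sketch of that validity check, using only the standard library. The URL pattern, the zero-padding to nine digits, and the rule "2xx means the product exists" are all assumptions for illustration; the real site may signal validity differently.

```python
import urllib.error
import urllib.request

# Hypothetical URL pattern; the real endpoint is not given in the question.
PRODUCT_URL = "https://example.com/products/{product_id}.xml"


def is_valid_status(status: int) -> bool:
    """Assumption: a 2xx status means the product exists, anything else means it does not."""
    return 200 <= status < 300


def product_exists(product_id: int, timeout: float = 10.0) -> bool:
    """Send a HEAD request for one product ID and classify the response."""
    url = PRODUCT_URL.format(product_id=f"{product_id:09d}")
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return is_valid_status(response.status)
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the status code is still available.
        return is_valid_status(err.code)
```

Only IDs for which `product_exists` returns `True` would then be followed up with the full GET for the XML.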

Obviously sending hundreds of millions of HEAD requests will take an ungodly amount of time to finish if left to run on a single server, so I'd like to take the opportunity to learn how to develop some sort of distributed application (totally new territory for me). At this point, I should mention that this particular website can easily handle a massive number of incoming requests per second without risk of DoS. I'd prefer not to name the website, but it easily gets millions of hits per day. This scraper will have a negligible impact on the performance of the website. However, I'll immediately put a stop to it if the company complains.

The Design

I have no idea if this is the right approach, but my current idea is to launch a single "coordination server", and some number of nodes to communicate with that server and perform the scraping, all running as EC2 instances.

Each node will launch some number of processes, and each process will be assigned a job by the coordination server containing a distinct range of potential product IDs to be scraped (e.g. product ID 00001 to 10000). These jobs will be stored in a database table on the coordination server. Each job will contain info about:

Product ID start number
Product ID end number
Job status (idle, in progress, complete, expired)
Job expiry time
Time started
Time completed
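As a rough sketch, the ID space could be cut into such jobs like this. The `Job` class and `make_jobs` helper are hypothetical names, and only the range and status fields are modelled here; the timestamps are left out for brevity.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Job:
    start_id: int          # first product ID in the range (inclusive)
    end_id: int            # last product ID in the range (inclusive)
    status: str = "idle"   # idle, in progress, complete, expired


def make_jobs(first_id: int, last_id: int, batch_size: int) -> Iterator[Job]:
    """Split [first_id, last_id] into consecutive, non-overlapping jobs."""
    for start in range(first_id, last_id + 1, batch_size):
        yield Job(start, min(start + batch_size - 1, last_id))
```

With 500,000,000 candidate IDs and 10,000 IDs per job, this yields 50,000 jobs, which is a small table for any database.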

When a node is launched, a query will be sent to the coordination server asking for some configuration data, and for a job to work on. When a node completes a job, a query will be sent updating the status of the job just completed, and another query requesting a new job to work on. Each job has an expiry time, so if a process crashes, or if a node fails for any reason, another node can take over an expired job to try it again.
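The claim-and-expire cycle described above could look roughly like the following, using SQLite purely for illustration. The table layout, column names, and the `claim_job` / `complete_job` helpers are assumptions, and the sketch assumes a single coordinator process handing out jobs one at a time; a database with several concurrent writers would need row locking around the select and update.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id           INTEGER PRIMARY KEY,
    start_id     INTEGER NOT NULL,
    end_id       INTEGER NOT NULL,
    status       TEXT NOT NULL DEFAULT 'idle',
    expires_at   REAL,
    started_at   REAL,
    completed_at REAL
)
"""


def claim_job(conn, now, lease_seconds=3600.0):
    """Hand out one idle job, or one whose lease has expired; None if nothing is left."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT id, start_id, end_id FROM jobs "
            "WHERE status = 'idle' "
            "OR (status = 'in progress' AND expires_at < ?) "
            "ORDER BY id LIMIT 1",
            (now,),
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE jobs SET status = 'in progress', started_at = ?, expires_at = ? "
            "WHERE id = ?",
            (now, now + lease_seconds, row[0]),
        )
        return row


def complete_job(conn, job_id, now):
    """Mark a job as finished so it is never handed out again."""
    with conn:
        conn.execute(
            "UPDATE jobs SET status = 'complete', completed_at = ? WHERE id = ?",
            (now, job_id),
        )
```

In real use `now` would come from something like `time.time()`; passing it in explicitly keeps the expiry logic easy to test. Note that a crashed worker's job is simply picked up again once its lease runs out, so no separate "expired" sweep is needed in this variant.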

To maximise the performance of the system, I'll need to work out how many nodes should be launched at once, how many processes per node, the rate of HTTP requests sent, and which EC2 instance type will deliver the most value for money (I'm guessing high network performance, high CPU performance, and high disk I/O would be the key factors?).
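A back-of-the-envelope way to compare configurations is to divide the total number of requests by the aggregate request rate. The helper below is a hypothetical sketch that assumes a steady rate per process and ignores retries, throttling, and the follow-up GET requests.

```python
def estimated_hours(total_ids: int, nodes: int, procs_per_node: int,
                    req_per_sec: float) -> float:
    """Wall-clock hours to send one request per ID at a steady aggregate rate."""
    aggregate_rate = nodes * procs_per_node * req_per_sec
    return total_ids / aggregate_rate / 3600
```

For example, `estimated_hours(500_000_000, 1, 1, 500)` comes to roughly 278 hours (about 11.6 days) for a single process at 500 requests/sec, while 10 nodes with 4 processes each at 50 requests/sec per process would take roughly 69 hours.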

At the moment, the plan is to code the scraper in Python, running on Ubuntu EC2 instances, possibly launched within Docker containers, and some sort of key-value store database to hold the jobs on the coordination server (MongoDB?). A relational DB should also work, since the jobs table should be fairly low I/O.

I'm curious to know from more experienced engineers if this is the right approach, or if I'm completely overlooking a much better method for accomplishing this task?

Much appreciated, thanks!

*"This scraper will have a negligible impact on the performance of the website."* You lack the insight to justify this statement. *"However, I'll immediately put a stop to it if the company complains."* Then you should ask their permission up front. *"Obviously sending hundreds of millions of HEAD requests will take an ungodly amount of time to finish if left to run on a single server"* Not obvious at all; this is a false premise. If you can't do 500 requests/sec on a single core, regardless of round trip time, you are doing it wrong. — Michael - sqlbot, Jan 31 '16 at 14:17
Fair point. I'll ask the web admin's permission before going ahead with this then. I'm not sure I understand your comment about the rate of HTTP requests. Even at 500 requests/sec, 500 million total requests would take about 278 hours (over 11 days) to complete... So surely it's obvious that this task is not suitable to be completed by a single server (by that I mean, a single process)? — snelson, Jan 31 '16 at 15:21
Yes, if you have 500M legitimate requests to send... but it seems unlikely that 500 req/sec sustained is going to fly under the radar for very long. Once, while adding some logic to *reduce* the number of requests in a system that polled a web site, I inadvertently left a loop too early, resulting in sending *more* requests -- 12 req/sec to be exact -- and the mega company behind the site in question took notice of my traffic and was kind enough to identify and contact me within a fairly short time frame with a notice that they (correctly) considered the traffic unreasonable. — Michael - sqlbot, Jan 31 '16 at 20:31

score 5 · Answer 1 · answered Jan 31 '16 at 13:44

You are trying to design a distributed workflow system, which is, in fact, a solved problem. Instead of reinventing the wheel, I suggest you look at AWS's Simple Workflow Service (SWF), which can handle all the state management for you, leaving you free to worry only about coding your business logic.

This is what a system designed using SWF would look like (here I'll use SWF's standard terminology; you might have to go through the documentation to understand the terms exactly):

Start one workflow per productID.
1st activity will check whether this productID is valid, by making a HEAD request as you mentioned.
If it isn't, terminate workflow. Otherwise, 2nd activity will fetch relevant XML content, by making the necessary GET request, and persist it, say, in S3.
3rd activity will fetch the S3 file, scrape the XML data and do whatever with it.

You can easily change the design above to have one workflow process a batch of product IDs.
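To make the three activities concrete, here is a sketch in plain Python. It deliberately does not use the SWF API: the `process_product` function, the injected `exists` and `fetch_xml` callables, and the dict standing in for S3 are all hypothetical, and the flat XML layout is assumed for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Optional


def process_product(
    product_id: int,
    exists: Callable[[int], bool],
    fetch_xml: Callable[[int], str],
    store: Dict[int, str],
) -> Optional[Dict[str, str]]:
    # Activity 1: validity check (a HEAD request in the real system).
    if not exists(product_id):
        return None  # terminate the workflow early
    # Activity 2: fetch the XML and persist it (a dict stands in for S3 here).
    store[product_id] = fetch_xml(product_id)
    # Activity 3: read the stored document back and extract its fields.
    root = ET.fromstring(store[product_id])
    return {child.tag: (child.text or "") for child in root}
```

In SWF each numbered step would be its own activity, with the service tracking which step a given product ID has reached; a batch variant would loop this function over a range of IDs.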

Some other points that I'd suggest you keep in mind:

Understand the difference between crawling and scraping: crawling means fetching relevant content from the website, scraping means extracting necessary data from it.
Ensure that what you are doing is strictly legal!
Don't hit the website too hard, or they might blacklist your IP ranges. You have two options:
- Add delay between two crawls. This too can be easily achieved in SWF.
- Use anonymous proxies.
Don't rely too much on XML results from some undocumented API, because that can change anytime.
You'll need high network performance EC2 instances. I don't think high CPU or memory performance would matter to you.
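For the "add delay between two crawls" option, a minimal per-process throttle might look like this. The `Throttle` class is a hypothetical helper, not part of SWF or any library; the clock and sleep functions are injectable only so the behaviour can be tested without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` before each request with `Throttle(0.5)` caps one process at about two requests per second; the total load on the site is still that rate multiplied by the number of processes running.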

Thanks for introducing me to SWF - looks interesting! I'm still tempted to develop everything from scratch to learn from the experience, but I'll be sure to try out SWF. Also RE crawling vs. scraping, I pedantically checked the difference between the two before deciding on scraping: http://stackoverflow.com/a/4327523/3924011 — snelson, Jan 31 '16 at 15:37
