
Currently, I'm working with Beautiful Soup to extract information from websites. I've managed to scrape data from multiple pages of a certain apartment rental website with it - let's call it “Website A”. Now, I'd like to obtain data from another rental website (“Website B”). I tried to follow a similar procedure to before, but it failed because Website B requires a login.

I did manage to scrape the first page of apartments on Website B by means of Adelin's answer. Their approach is based on Curl Trillworks (link), which converts a request copied from the browser into Python code. In principle, this approach could work for Website B as well. However, one would then need to manually repeat the procedure for the 800 or so pages on which the apartments are listed, and afterwards do the same for each of the 15 apartments per page.
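For reference, the converter spits out something along these lines (all values below are placeholders, not the real site's):

```python
import requests
from bs4 import BeautifulSoup

# Headers and cookies copied from the browser's DevTools request and
# converted to Python. Every value here is a placeholder.
headers = {
    'User-Agent': 'Mozilla/5.0 ...',
    'Referer': 'https://www.websiteB.com/1/',
}
cookies = {
    'sessionid': 'COPIED_FROM_BROWSER',
}

response = requests.get('https://www.websiteB.com/1/',
                        headers=headers, cookies=cookies)
soup = BeautifulSoup(response.text, 'html.parser')
```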

This is too much work for me, so I'm trying to automate the process. For instance, I tried adapting this to my situation, but I haven't succeeded so far: the dictionary I get is empty. I also tried making a new header for each page by putting a new referer into the original header each time, and then placing these referers in the header dictionary. However, this failed - probably because Website B recognized I was using the same cookie every time I sent a request (the one I used for the original apartment page of Website B).
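In rough outline, the referer attempt looked like this (placeholder values again); note that the same cookie is sent with every request:

```python
import requests
from bs4 import BeautifulSoup

base_headers = {'User-Agent': 'Mozilla/5.0 ...'}
cookies = {'sessionid': 'COPIED_FROM_BROWSER'}  # reused for every request

for page in range(2, 801):
    headers = dict(base_headers)
    # A fresh referer for each page, as described above
    headers['Referer'] = f'https://www.websiteB.com/{page - 1}/'
    response = requests.get(f'https://www.websiteB.com/{page}/',
                            headers=headers, cookies=cookies)
    soup = BeautifulSoup(response.text, 'html.parser')
    # ... extract the 15 apartments listed on this page ...
```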

So my question is:

Suppose one has a list of pages on Website B that all share the same format (www.websiteB.com/PageNumber/). How would one quickly/automatically obtain a header for each page using one's own login credentials for the website, with which one can get an appropriate response?

I could share the code I have so far, but I'm somewhat hesitant, as this is a large commercial website and I suspect they wouldn't be particularly happy with me sharing code that allows their website to be scraped, while also naming the website itself.

Max Muller
  • Here's the link to the Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ – Max Muller Apr 16 '21 at 13:40
  • You could use `selenium` to log into the website, then pass the page source to `bs4.BeautifulSoup` to scrape the website. – Jacob Lee Apr 16 '21 at 15:12
  • @JacobLee Right, I've seen selenium mentioned as well. Does it also help with this particular problem, where I have to scrape many pages of the form www.websiteB.com/PageNumber/? (PageNumber goes from 2 to 800 or so here.) – Max Muller Apr 16 '21 at 17:07
  • Most likely, you would first log into _www.websiteB.com/_, since you stated that the website requires a login to access the pages. The login step would be easiest to do with `selenium` or another browser automation tool, like `playwright` or `pyppeteer`. From there, you would navigate the remote browser to each numbered page and pass the page content to `bs4.BeautifulSoup` to parse (likely using some multiprocessing); a sketch of this follows below. – Jacob Lee Apr 16 '21 at 17:12
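A minimal sketch of the selenium route described in the comment above; the login URL, form field names, and credentials are assumptions, not the real site's:

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

# Log in once through the real browser. The URL and the form field
# names below are assumptions for illustration only.
driver.get('https://www.websiteB.com/login/')
driver.find_element(By.NAME, 'email').send_keys('you@example.com')
driver.find_element(By.NAME, 'password').send_keys('your_password')
driver.find_element(By.CSS_SELECTOR, 'button[type="submit"]').click()

# The session cookie now lives in the browser, so each page load
# is authenticated automatically.
for page in range(2, 801):
    driver.get(f'https://www.websiteB.com/{page}/')
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # ... parse the 15 apartments on this page ...

driver.quit()
```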

0 Answers