Currently, I'm working with Beautiful Soup to extract information from websites. I've managed to scrape data from multiple pages of a certain apartment rental website with it - let's call it “Website A”. Now, I'd like to obtain data from another rental website (“Website B”). I tried to follow the same procedure as before, but it failed because Website B requires a login.
I did manage to scrape the first page of apartments on Website B by means of Adelin's answer, which is based on Curl Trillworks (link). In principle, this approach would work for the rest of Website B as well. However, one would then have to repeat the procedure manually for the 800 or so pages on which the apartments are listed, and afterwards again for each of the 15 apartments per page.
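For context, the Curl Trillworks output boils down to something like the following (all header and cookie values are placeholders here; the real ones come from copying the request in the browser's network tab):

```python
import requests
from bs4 import BeautifulSoup

# Headers and cookies pasted from Curl Trillworks after copying the request
# from the browser's network tab. All values below are placeholders.
headers = {
    "User-Agent": "Mozilla/5.0 ...",
    "Referer": "https://www.websiteB.com/1/",
}
cookies = {
    "sessionid": "VALUE_COPIED_FROM_BROWSER",
}

response = requests.get("https://www.websiteB.com/1/",
                        headers=headers, cookies=cookies)
soup = BeautifulSoup(response.text, "html.parser")
```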
This is too much work for me, so I'm trying to automate the process. For instance, I tried adapting this to my situation, but I haven't succeeded so far: the dictionary I get is empty. I also tried creating a new header for each page by putting a new referer in the original header each time, and putting these referers in the header dictionary. However, this failed as well - probably because Website B recognized that I was using the same cookie every time I sent a request (the same one I used for the original apartment page of Website B).
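Roughly, that second attempt looked like this (again with placeholder values, reusing the `headers` and `cookies` dictionaries from the snippet above):

```python
import requests

# Failed attempt: reuse the copied cookie but swap in a new Referer per page.
pages = {}
for page_number in range(1, 801):
    page_headers = dict(headers)
    page_headers["Referer"] = f"https://www.websiteB.com/{page_number}/"
    resp = requests.get(f"https://www.websiteB.com/{page_number}/",
                        headers=page_headers, cookies=cookies)
    pages[page_number] = resp.text  # Website B presumably rejects the stale cookie here
```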
So my question is:
Suppose one has a list of pages of Website B that all have the same format (www.websiteB.com/PageNumber/). How would one quickly/automatically obtain a `header` for each page by means of one's own login credentials for the website, with which one can create an appropriate `response`?
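In other words, I imagine something like the following, where one logs in once with a `requests.Session` and then reuses it for every page. The login URL and the form field names below are guesses; the real ones would have to be read from the site's login form:

```python
import requests
from bs4 import BeautifulSoup

LOGIN_URL = "https://www.websiteB.com/login"  # hypothetical login endpoint

with requests.Session() as session:
    # Log in once; the session object stores the resulting cookies and sends
    # them automatically with every subsequent request.
    # "username" and "password" are assumed field names - the real ones would
    # come from the login form's HTML.
    session.post(LOGIN_URL, data={"username": "MY_USER", "password": "MY_PASS"})

    for page_number in range(1, 801):
        resp = session.get(f"https://www.websiteB.com/{page_number}/")
        soup = BeautifulSoup(resp.text, "html.parser")
        # ... extract the 15 apartment links on this page with Beautiful Soup ...
```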
I could share the code I have so far, but I'm somewhat hesitant, as this is a large commercial website and I suspect they wouldn't be happy with me sharing code that scrapes their site, let alone code that names the website as well.