I'm currently building a web scraper and have run into the issue of being IP blocked. To get around this issue I'm trying to use the requests_ip_rotator which use AWS API Gateway's large IP pool as a proxy to generate pseudo-infinite IPs for web scraping. Following this answer I've implemented it into my code which is below:
import requests
from bs4 import BeautifulSoup
from requests_ip_rotator import ApiGateway, EXTRA_REGIONS
url = "https://secure.runescape.com/m=hiscore_oldschool_ironman/a=13/group-ironman/?groupSize=5&page=1"
page1 = requests.get(url)
soup1 = BeautifulSoup(page1.content, "html.parser")
gateway = ApiGateway("https://secure.runescape.com/",access_key_id="****",access_key_secret="****")
gateway.start()
session = requests.Session()
session.mount("https://secure.runescape.com/", gateway)
page2 = session.get(url)
gateway.shutdown()
soup2 = BeautifulSoup(page2.content, "html.parser")
print("\n"+page1.url)
print(page2.url)
print(soup1.head.title==soup2.head.title)
input()
output:
Starting API gateways in 10 regions.
Using 10 endpoints with name 'https://secure.runescape.com/ - IP Rotate API' (10 new).
Deleting gateways for site 'https://secure.runescape.com'.
Deleted 10 endpoints with for site 'https://secure.runescape.com'.
https://secure.runescape.com/m=hiscore_oldschool_ironman/a=13/group-ironman/?groupSize=5&page=1
https://6kesqk9t6d.execute-api.eu-central-1.amazonaws.com/ProxyStage/m=hiscore_oldschool_ironman/a=13/overall
False
So both times I use the .get(url) method I am using the same url but receiving different pages. Request.get(url) is giving me the page I want but when I use the amazon gateway with session.get(url) it is not giving me the same page as before but a different page from the same site. I'm stumped for what the issue could be so any help would be greatly appreciated!