Trying to Scrape from website with login Python

Question

I am trying to scrape my data from a website that requires a login but I keep getting the following error:

<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>MethodNotAllowed</Code><Message>The specified method is not allowed against this resource.</Message><Method>POST</Method><ResourceType>OBJECT</ResourceType><RequestId>DCVJZ8D4R3PK45M1</RequestId><HostId>PIra5vNbfC5d1TfFZ3hABXk9eIsKwtJm5bYH4Bozu4nS4InkGEILNflPPzdvT9hUpQOPaW0AZBA=</HostId></Error>

Python Script

import requests


loginurl = ("https://cbscarrickonsuir.app.vsware.ie/")
secure_url = ("https://cbscarrickonsuir.app.vsware.ie/11571471/behaviour")
payload = {"username":"REMOVED","password":"REMOVED","source":"web"}
r = requests.post(loginurl, data=payload)
print(r.text)

I had to remove the username and password because this is a live website. I don't know how to do this. I followed a YouTube tutorial, but it used a much easier website to scrape. I hope you can help me.

Sometimes it is better to use `Session()` to handle `cookies`: first send a `GET` to collect all cookies from the server (especially the `Session ID` cookie), and only then send the `POST`. Some pages also require an extra value copied from the HTML returned by the `GET` (e.g. a unique session ID). – furas, Nov 28 '21 at 03:57
The main problem may be that this page uses `JavaScript` to add elements to the HTML, so it may also use JavaScript to detect scripts/bots, and `requests` can't run `JavaScript`. You may need [Selenium](https://selenium-python.readthedocs.io/) to control a real web browser, which can run `JavaScript`. – furas, Nov 28 '21 at 04:01
A login form doesn't have to send its data to the same URL as the page. This page sends it to `https://cbscarrickonsuir.vsware.ie/tokenapiV2/login`, as shown in the answer. – furas, Nov 28 '21 at 04:05

Shubham Dhingra · Answer 1 · 2021-11-27T10:55:53.553

0

Open the Network tab of your browser's developer tools, submit the login form with some username and password, and you will see which endpoint is used for login. In your case it is https://cbscarrickonsuir.vsware.ie/tokenapiV2/login

It is a good idea to click through the requests in the XHR section of the Network tab and inspect the headers, request and response of each one. That tells you exactly which API endpoint to use, which HTTP method it expects, which request body format it expects, and what kind of response you will receive.

Edit: You will also probably need persistent sessions to scrape any data that requires you to log in first. Go through these:

edited Nov 27 '21 at 10:55

answered Nov 27 '21 at 10:50

Shubham Dhingra

No error but no output. Can you help? – GCIreland Nov 27 '21 at 11:09
@GCIreland Can you tell me the result of `r.status_code`? If the login is successful, you can then use that session object to get other pages (the ones that require authentication). Go through the answer from `Anuj Gupta` in the first link I mentioned in the answer. – Shubham Dhingra Nov 27 '21 at 14:00
What is that @Shubham Dhingra – GCIreland Nov 27 '21 at 14:02
@GCIreland Please go through this https://realpython.com/python-requests/ it provides examples for everything you're trying to do. – Shubham Dhingra Nov 27 '21 at 14:04
Status code is 400. Does that mean I logged in successfully? – GCIreland Nov 27 '21 at 14:05
Status 400 is bad. I don't know what is happening. – GCIreland Nov 27 '21 at 14:15

score 0 · Answer 2 · answered Nov 28 '21 at 04:19

There are two mistakes in your code.

1. You send the data to the main page, but the browser sends it to https://cbscarrickonsuir.vsware.ie/tokenapiV2/login
2. You send the data as FORM data, but the browser sends it as JSON, so you need json=payload instead of data=payload
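
To see the difference between data=payload and json=payload without contacting the server, you can build both requests and inspect them before they are sent. This is only a minimal sketch: the credentials are placeholders, and it relies on `requests.Request(...).prepare()`, which constructs the request locally without sending it.

```python
import json

import requests

login_url = "https://cbscarrickonsuir.vsware.ie/tokenapiV2/login"

# Placeholder credentials, for illustration only.
payload = {"username": "user", "password": "secret", "source": "web"}

# prepare() builds the request in memory; nothing is sent over the network.
form_req = requests.Request("POST", login_url, data=payload).prepare()
json_req = requests.Request("POST", login_url, json=payload).prepare()

# data= encodes the dict like an HTML form submission.
print(form_req.headers["Content-Type"])
print(form_req.body)

# json= serializes the dict as a JSON document.
print(json_req.headers["Content-Type"])
print(json.loads(json_req.body))
```

The first request is labelled application/x-www-form-urlencoded and the second application/json, which is the format the browser uses for this login endpoint.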

Another possible problem is that you don't use Session(), which sends cookies automatically. Many servers use cookies to remember that you are already logged in. If you don't send the cookies back, the server doesn't know that you are logged in.

import requests

url = "https://cbscarrickonsuir.app.vsware.ie/"

login_url = 'https://cbscarrickonsuir.vsware.ie/tokenapiV2/login'

payload = {
    "username": "none",
    "password": "none@none.com",
    "source":"web"
}

s = requests.Session()

r = s.post(login_url, json=payload)

print('status:', r.status_code)
print('--- text ---')
print(r.text)
print('----------------')

I don't have an account to log in with, but the request now gets status 401 with the message invalid_username_password:

status: 401
--- text ---
{"fieldErrors":[],"genericErrors":[{"messageKey":"invalid_username_password","metadata":null}]}
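
Once the login succeeds, the same session can be reused for the protected page. The following is a rough sketch under several assumptions that I could not verify without an account: that a successful login returns status 200, that the session cookies alone are enough for authentication (the server may instead return a token that has to be sent in a header), and that the page content is not generated by JavaScript. The function name `login_and_fetch` is hypothetical.

```python
import requests

LOGIN_URL = "https://cbscarrickonsuir.vsware.ie/tokenapiV2/login"
# Protected page taken from the question.
SECURE_URL = "https://cbscarrickonsuir.app.vsware.ie/11571471/behaviour"


def login_and_fetch(session, username, password):
    """Log in with a JSON body, then request the protected page on the same session.

    Returns (status_code, text); text is None when the login fails.
    """
    payload = {"username": username, "password": password, "source": "web"}
    login = session.post(LOGIN_URL, json=payload, timeout=10)

    # Assumption: 200 means success; 400/401 mean the login was rejected.
    if login.status_code != 200:
        return login.status_code, None

    # The session sends back any cookies it received during the login.
    page = session.get(SECURE_URL, timeout=10)
    return page.status_code, page.text


def main():
    # Not called automatically; replace the placeholders before running.
    with requests.Session() as s:
        status, text = login_and_fetch(s, "REMOVED", "REMOVED")
        print("status:", status)
        print(text)
```

If the protected page still comes back empty after a successful login, check the login response in the Network tab for a token that the browser sends with later requests.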
