Trying to login to a website using Python3

Question

I'm new to Python and still getting used to its libraries. I'm trying to use urllib to fetch the HTML of a website so that I can eventually scrape data from a table in the account I want to log in to.

import urllib.request

link = "https://websiteurl.com"  # urlopen needs the scheme; a bare host raises ValueError
login = "email@address.com"
password = "password"

# Fetch the page at the given address and return its HTML as bytes
def access_website(address):
    return urllib.request.urlopen(address).read()

html = access_website(link)
print(html)

This prints:

b'<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    
<meta name="viewport" content="width=device-width, initial-scale=1">\n    <title>Festival Manager</title>\n   
 <link href="bundle.css" rel="stylesheet">\n    <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->\n   
 <!-- WARNING: Respond.js doesn\'t work if you view the page via file:// -->\n    <!--[if lt IE 9]>\n      <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>\n     
 <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>\n    <![endif]-->\n  </head>\n  <body>\n    
<script src="vendor.js"></script>\n    <script src="login.js"></script>\n  </body>\n</html>\n'

I'm not sure why it's giving me the part about the HTML5 shim and Respond.js. When I go to the actual website and inspect it, the page doesn't look like this, so urllib doesn't seem to be returning the HTML that I see when I visit the website in a browser.

I also tried to check what requests the site sends when I submit the login information, but the Network tab of the browser's developer tools isn't showing me a POST request. So I'm not sure how I would send the login information from Python in a POST request to log in.

Also if anyone needs the actual specific website I'm looking at I can link that as well! — kursdragon, Dec 29 '19 at 00:13
It would be nice to see the layout of the website you are trying to scrape information off of. I have some ideas that could help you using Selenium and BeautifulSoup to scrape information once you are logged in... Are you able to use a GUI for this? — jasper, Dec 29 '19 at 00:18
Look at BeautifulSoup for parsing HTML https://www.crummy.com/software/BeautifulSoup/bs4/doc/ — Iain Shelvington, Dec 29 '19 at 00:20
Thanks I actually looked at Selenium and I think it is working muuuuchhh better than the urllib was working so I think I'll continue with that, thanks a lot guys! If that doesn't work out I'll check BeautifulSoup — kursdragon, Dec 30 '19 at 01:50

score 0 · Answer 1 · answered Apr 14 '20 at 17:58

Here is my take on it for Python 3, done without any external libraries. After login you can use BeautifulSoup or any other kind of scraping; if you logged in without third-party libraries/modules, you can do the scraping without them as well.

The script is also on my GitHub; the whole script is replicated below, as per Stack Overflow guidelines:

# Login to website using just Python 3 Standard Library
import urllib.parse
import urllib.request
import http.cookiejar

def scraper_login():
    ####### change variables here, like URL, action URL, user, pass
    # your base URL here, will be used for headers and such, with and without https://
    base_url = 'www.example.com'
    https_base_url = 'https://' + base_url

    # here goes URL that's found inside form action='.....'
    #   adjust as needed, can be all kinds of weird stuff
    authentication_url = https_base_url + '/login'

    # username and password for login
    username = 'yourusername'
    password = 'SoMePassw0rd!'

    # we will use this string to confirm a login at end
    check_string = 'Logout'

    ####### rest of the script is logic
    # but you will need to tweak couple things maybe regarding "token" logic
    #   (can be _token or token or _token_ or secret ... etc)

    # big thing! you need a referer for most pages! and correct headers are the key
    headers={"Content-Type":"application/x-www-form-urlencoded",
    "User-agent":"Mozilla/5.0 Chrome/81.0.4044.92",    # Chrome 80+ as per web search
    "Host":base_url,
    "Origin":https_base_url,
    "Referer":https_base_url}

    # initiate the cookie jar (using : http.cookiejar and urllib.request)
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
    urllib.request.install_opener(opener)

    # first a simple request, just to get login page and parse out the token
    #       (using : urllib.request)
    request = urllib.request.Request(https_base_url)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # parse the page, we look for token eg. on my page it was something like this:
    #    <input type="hidden" name="_token" value="random1234567890qwertzstring">
    #       this can probably be done better with regex and similar
    #       but I'm newb, so bear with me
    html = contents.decode("utf-8")
    # text just before start and just after end of your token string
    mark_start = '<input type="hidden" name="_token" value="'
    mark_end = '">'
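    # index of those two points (find() returns -1 when the marker is missing)
    marker_index = html.find(mark_start)
    if marker_index == -1:
        raise ValueError("token field not found, check mark_start against the page source")
    start_index = marker_index + len(mark_start)
    end_index = html.find(mark_end, start_index)
    # and text between them is our token, store it for second step of actual login
    token = html[start_index:end_index]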

    # here we craft our payload, it's all the form fields, including HIDDEN fields!
    #   that includes token we scraped earlier, as that's usually in hidden fields
    #   make sure left side is from "name" attributes of the form,
    #       and right side is what you want to post as "value"
    #   and for hidden fields make sure you replicate the expected answer,
    #       eg. "token" or "yes I agree" checkboxes and such
    payload = {
        '_token':token,
    #    'name':'value',    # make sure this is the format of all additional fields !
        'login':username,
        'password':password
    }

    # now we prepare all we need for login
    #   data - with our payload (user/pass/token) urlencoded and encoded as bytes
    data = urllib.parse.urlencode(payload)
    binary_data = data.encode('UTF-8')
    # and put the URL + encoded data + correct headers into our POST request
    #   btw, urllib sends a POST whenever data is given (not None),
    #   so you don't need to say it explicitly like this:
    #       urllib.request.Request(authentication_url, binary_data, headers, method='POST')
    request = urllib.request.Request(authentication_url, binary_data, headers)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # just for kicks, we confirm some element in the page that's secure behind the login
    #   we use a particular string we know only occurs after login,
    #   like "logout" or "welcome" or "member", etc. I found "Logout" is pretty safe so far
    contents = contents.decode("utf-8")
    index = contents.find(check_string)
    # if we find it
    if index != -1:
        print(f"We found '{check_string}' at index position : {index}")
    else:
        print(f"String '{check_string}' was not found! Maybe we did not login ?!")

scraper_login()
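
The plain string search for the token in the script above can be made a little more robust with html.parser from the standard library. What follows is only a minimal sketch, not part of the original script: the names HiddenInputParser and extract_hidden_fields are made up for illustration, and the field name _token is taken from the example form in the comments above. It collects every hidden input, so attribute order and quoting style no longer matter.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of all <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; value is None for bare attributes
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict with every hidden form field found in the HTML text."""
    parser = HiddenInputParser()
    parser.feed(html)
    parser.close()
    return parser.fields


# Sample markup with the attributes in a different order than the script expects
sample = '<form><input value="random1234567890" name="_token" type="hidden"></form>'
fields = extract_hidden_fields(sample)
print(fields.get("_token"))  # random1234567890
```

The returned dict can be merged straight into the payload, which also covers pages that use more than one hidden field.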

Some additional info regarding your original code: it is usually good enough if you do NOT have a login page. But modern logins usually involve cookies, a check of the referring page, a check of the user agent, tokens, and sometimes more (like CAPTCHAs). Websites don't like to be scraped, and they fight it. It's also called good security.

So in addition to just doing the request as you initially did, you have to:
- take the cookie the page sets, and serve it back during login
- know the page's referer; usually you can just pass the login page as the referer to the login-action page
- fake the user agent; if you announce yourself with urllib's default agent ("Python-urllib/3.x") you may get blocked right away
- scrape the token (as in my case) and serve it back in login
- package your payload (user, pass, token, and other stuff), encode it properly, and submit it as data to trigger the POST method
- etc.
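
The points about the user agent and the POST method can be checked without touching the network. The snippet below is a small sketch under assumptions: the URL, the token value and the field names are placeholders in the style of the script above, not values from any real site. It shows the agent string urllib announces by default, and that a Request becomes a POST as soon as it carries data.

```python
import urllib.parse
import urllib.request

# Placeholder values, for illustration only
login_url = "https://www.example.com/login"
payload = {"_token": "abc123", "login": "yourusername", "password": "SoMePassw0rd!"}

# The agent string urllib sends when you don't set one yourself
default_agent = dict(urllib.request.build_opener().addheaders)["User-agent"]
print(default_agent)

# urlencode returns a str, but the request body has to be bytes
body = urllib.parse.urlencode(payload).encode("utf-8")

headers = {
    "Content-Type": "application/x-www-form-urlencoded",
    "User-Agent": "Mozilla/5.0",
    "Referer": login_url,
}

# Same URL and headers; only the presence of data differs
get_request = urllib.request.Request(login_url, headers=headers)
post_request = urllib.request.Request(login_url, data=body, headers=headers)

print(get_request.get_method())   # GET
print(post_request.get_method())  # POST
```

Nothing is sent here; the requests are only built and inspected, so this is safe to run while you work out the right headers and fields.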

So yeah, with built-in libraries, the code balloons a bit as soon as a login page is involved. With third-party libraries it's somewhat shorter, but as far as I could tell from my research, you still have to think about referers, agents, scraping tokens, etc. No library will do that automagically, as each page works slightly differently (some need a fake agent, some don't; some have tokens, some don't; some name them differently, etc.).

If you strip my code of comments and extras, and shorten it a bit, you can make it a function that takes in 5 arguments and has 15 lines or less.
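
As a rough idea of what such a condensed version could look like, here is a sketch with five arguments. It is an assumption-laden illustration rather than a drop-in solution: the function name login, the regex, and the form field names _token, login and password follow the example form from the script above and will need adjusting for a real site.

```python
import http.cookiejar
import re
import urllib.parse
import urllib.request


def login(login_page_url, action_url, username, password, check_string):
    """Log in and return True if check_string appears in the response page."""
    # An opener with a cookie jar keeps the session cookie across both requests
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar()))
    headers = {"User-Agent": "Mozilla/5.0", "Referer": login_page_url}
    # Step 1: fetch the login page and pull out the hidden token
    page = opener.open(urllib.request.Request(login_page_url, headers=headers))
    match = re.search(r'name="_token" value="([^"]*)"', page.read().decode("utf-8"))
    if match is None:
        raise ValueError("no _token field found on the login page")
    # Step 2: send the form fields; passing data makes this request a POST
    payload = {"_token": match.group(1), "login": username, "password": password}
    data = urllib.parse.urlencode(payload).encode("utf-8")
    result = opener.open(urllib.request.Request(action_url, data, headers))
    return check_string in result.read().decode("utf-8")
```

It would be called along the lines of login('https://www.example.com/login', 'https://www.example.com/login', 'yourusername', 'SoMePassw0rd!', 'Logout'). The opener is used directly instead of being installed globally, so other urllib calls in the same program are not affected.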

Cheers!
