I am trying to scrape a forum, but I can't get the login part to work

Context about the forum to scrape

The part I want to scrape is only available to logged-in users. The forum appears to run phpBB. I can't share the link because the forum is only reachable locally.

Attempt to log in using requests

I have tried to authenticate with the following code:

import requests
from requests.auth import HTTPBasicAuth

url = 'http://forum.com'  # placeholder, not the real URL

pet = requests.get(url, auth=HTTPBasicAuth('user', 'passw'), verify=False)

and also:

from requests.auth import HTTPDigestAuth

pet = requests.get(url, auth=HTTPDigestAuth('user', 'passw'), verify=False)

Parsing HTML to extract content

To extract the information I use BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(pet.content, 'html.parser')
print(soup.prettify())

Problem

When I run this code, the response contains the forum page as seen by a user who is not logged in, so the authentication apparently has no effect.

I pass verify=False because otherwise an SSL error is raised.

How can I log in and scrape the protected pages? I would prefer a solution using the requests module, but others are welcome too.

Authentication and login page

I don't know what kind of authentication the forum has (Basic, Digest, ...).

This is the part of the login page's HTML that asks for the username and password:

    <dl>
          <dt>
           <label for="username">
            Nombre de Usuario:
           </label>
          </dt>
          <dd>
           <input class="inputbox autowidth" id="username" name="username" size="25" tabindex="1" type="text" value=""/>
          </dd>
         </dl>
         <dl>
          <dt>
           <label for="password">
            Contraseña:
           </label>
          </dt>
          <dd>
           <input autocomplete="off" class="inputbox autowidth" id="password" name="password" size="25" tabindex="2" type="password"/>
          </dd>
          <dd>
           <a href="./ucp.php?mode=sendpassword">
            Olvidé mi contraseña
           </a>
          </dd>
         </dl>
         <dl>
          <dd>
           <label for="autologin">
            <input id="autologin" name="autologin" tabindex="4" type="checkbox"/>
            Recordar
           </label>
          </dd>
          <dd>
           <label for="viewonline">
            <input id="viewonline" name="viewonline" tabindex="5" type="checkbox"/>
            Ocultar mi estado de conexión en esta sesión
           </label>
          </dd>
         </dl>
         <input name="redirect" type="hidden" value="./search.php?search_id=newposts"/>
         <dl>
          <dt>
          </dt>
          <dd>
           <input name="sid" type="hidden" value="b48ad769e2eab979294621d07e3ef19d"/>
           <input class="button1" name="login" tabindex="6" type="submit" value="Identificarse"/>
          </dd>
         </dl>

Remark:

When I request the page I want to scrape (the one that is only shown to logged-in users), the response has HTTP status code 200, but the HTML it contains is that of the login page.
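
One way to make that check explicit, instead of relying on the status code, is to test whether the returned HTML still contains a password field. This is only a sketch: `looks_like_login_page` and `PasswordFieldFinder` are hypothetical names, and it assumes that pages behind the login do not themselves contain an `<input type="password">` element. It uses the standard-library `html.parser` so it runs without extra packages; BeautifulSoup would work just as well.

```python
from html.parser import HTMLParser


class PasswordFieldFinder(HTMLParser):
    """Remember whether any <input type="password"> was seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />
        if tag == "input":
            input_type = (dict(attrs).get("type") or "").lower()
            if input_type == "password":
                self.found = True


def looks_like_login_page(html):
    """Return True if the HTML contains a password input field."""
    finder = PasswordFieldFinder()
    finder.feed(html)
    return finder.found
```

With the response from the question, this could be called as `looks_like_login_page(pet.text)`. A result of `True` would suggest the server served the login form again rather than the protected page.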

hc_dev
luisiacc
    Possible duplicate of [Login to website using python](https://stackoverflow.com/questions/8316818/login-to-website-using-python) – Troy Wray Dec 06 '17 at 05:48

1 Answer


With HTTP authentication, curiously, you do not send the request to the login page; you send it directly to the page that you want to scrape (assuming the site uses Basic auth). Try it out.
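
To illustrate what Basic auth puts on the wire, here is a minimal sketch using only the standard library. `basic_auth_header` is a hypothetical helper, not part of requests; as far as I know, `HTTPBasicAuth` produces an equivalent `Authorization` header for ASCII credentials. A server that logs users in through an HTML form and a session cookie would be expected to ignore this header.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header used by HTTP Basic authentication."""
    # Credentials are joined with a colon and base64-encoded (not encrypted)
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

The header travels with every request to the protected URL itself, which is why no separate request to a login page is involved in this scheme.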

Shailyn Ortiz
  • So, not Basic auth then. Perhaps it's expecting some of the elements from the form, but I can't tell without at least seeing the HTML. – Shailyn Ortiz Dec 06 '17 at 15:27
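
Picking up the comment's suggestion that the server expects the form fields: the form in the question contains `username`, `password`, the hidden fields `redirect` and `sid`, and a submit button named `login` with the value `Identificarse`. Below is a minimal sketch of posting those fields through a session. Several things here are assumptions that the question does not confirm: that the form posts to `ucp.php?mode=login` (the usual phpBB location), that `session` is a `requests.Session()` so the login cookie is kept for later requests, and the helper names `HiddenInputCollector`, `build_login_payload` and `log_in`, which are made up for illustration. The hidden fields are read from a freshly fetched login page because `sid` looks like a per-session value.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of all <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        input_type = (attributes.get("type") or "").lower()
        if input_type == "hidden" and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def build_login_payload(login_html, username, password):
    """Combine the form's hidden fields with the credentials."""
    collector = HiddenInputCollector()
    collector.feed(login_html)
    payload = dict(collector.fields)  # e.g. sid and redirect
    payload.update({
        "username": username,
        "password": password,
        "login": "Identificarse",  # submit button value from the form
    })
    return payload


def log_in(session, base_url, username, password):
    """Fetch the login page, then post the form with the same session."""
    # Assumption: phpBB's usual login URL; adjust if the form's action differs
    login_url = urljoin(base_url, "ucp.php?mode=login")
    # verify=False mirrors the question; it disables certificate checks
    login_page = session.get(login_url, verify=False)
    payload = build_login_payload(login_page.text, username, password)
    return session.post(login_url, data=payload, verify=False)
```

After something like `log_in(session, url, 'user', 'passw')`, the same `session` object would be used to `get` the protected page. Because every hidden input is copied, any extra hidden fields the board adds to the form are sent along as well.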