How do you parse HTML with beautiful soup to ONLY get a specific javascript link, as well as a specific date from a HTML table?

Question

I am trying to parse an HTML document using Beautiful Soup and the findAll method, but I can't seem to isolate the information I need. I've looked at the documentation and a few tutorials. Perhaps it's because I'm a junior developer, but I can't isolate the numbers and the link.

Here's a dummy HTML table with basic information:


<tbody>                                    
                                            <tr class="results_row2">
                                                <td align="left">
                                                    Text is here ispssgjj sgdhjksgd jhsgd sgd
                                                </td>
                                                <td align="left">
                                                    GHJSFAGHJSFA GAFGSH AGSHSAGJH
                                                </td>
                                                <td align="left">
                                                    hdjk sgdhjk fdhjk sdhjk sdghjk
                                                </td>
                                                <td align="center">
                                                    11/10/1964
                                                </td>
                                                <td align="left">
                                                    
                                                </td>
                                                <td align="center">
                                                    5
                                                </td>


                                                <td align="center">
                                                    
                                                    <a href="javascript:confirm_delete('informatjon I need to ignore IS HERE')">Delete</a>
                                                    
                                                            <br>
                                                            <a href="javascript:PBC('information I need to grab via parse comes from here ')">LINK TITLE</a>
                                                            
                                                    <br>
                                        
                                                </td>
                                            </tr>
                                                                               
                                </tbody>

When I run my program I need it to pull the following for EACH ROW (the table above is just one row): the date, rearranged to YYMMDD (i.e., 11/10/1964 becomes 641110), and the string inside the javascript:PBC('...') link (which I then have to join with another string to make a valid link).

I don't want any of the other information, such as the confirm_delete link or the gibberish text (e.g. Hjkhjksgd).

EDIT: I also need to be able to log in to the webpage with the correct credentials (I have the username and password).

Hopefully my code is legible enough; I've got some prints to help me understand the variables. I'm also open to other approaches, but I can't seem to figure out pandas or selenium. So far I've got this:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

#label the file location
file_location = r"Destination goes here"

#open the file up
with open(file_location, 'r') as f:
    file = f.read() 

#create a soup
soup= BeautifulSoup(file, "html.parser")
#print(f"soup is {soup}")     

#find all the tags that match what we want
script = soup.findAll('td', id='center')

print('beginning loop')

#this is to find the date I am going to make a separate loop to find the print certificate 
#loop through the tags and check for what we want 
for i in range (0, len(script)):
        #these two variables are me trying to convert the tag to a variable to be used to check
    scriptString = str(script[i])
    scriptInt = int(script[i])
    
        #print(f'Starting loop i is: {i}')
        # Every 7th cell seems to be a number....
    if((i+4)%7 == 0):
        print(f'Starting IF i is: {i}')
        print(f'int test is {scriptInt}')
        #print(f'script is {script[i]} quote end')
                #this was to find out which part of the string was a number and it's 80% accurate 
        #for j in range (0, len(scriptString)):
            #print(f' j is {j} and string is {scriptString[j]}')
    #this printed the YYMMDD    
        print(f'Rewritten the string is: "{scriptString[41]}{scriptString[42]}{scriptString[33]}{scriptString[34]}{scriptString[36]}{scriptString[37]}" quote end')


print("end")

I tried pulling the string from the table, but it doesn't come across as an int and the string is very messy, so I can't compare it to what I want. Because there are multiple td tags, I can't isolate the cell by td alone.

For anyone trying to do something similar, here is my final code with placeholders written in plain English. THIS CODE WILL NOT RUN AS IS BECAUSE OF THOSE PLACEHOLDERS. BIG THANKS TO THE ANSWER FOR HELPING!!!


'''
To start this program, open Chrome Developer Tools
while you are on the website that you want to access.
Go to the Network tab, right click the request and choose Copy as cURL (bash).
Paste that into curlconverter.com to get the required Python code.
'''

print('begin program')

#Import libraries
import requests
import os
from bs4 import BeautifulSoup  
import pdfkit
import re
from datetime import datetime

'''
Get the cookies by opening the Chrome Developer Tools Network tab, then sign in (or sign out and back in if you are already signed in).
Click on the login request (probably the top one), right click it, choose Copy (as of this comment it has an arrow next to it) and then Copy as cURL (bash).
Convert the copied command using https://curlconverter.com/
It will generate some Python code that defines params, cookies, headers and data. Copy and paste it below, but move the generated response call further down.
'''


# normally you will use url, but we want the welcome page to check to see if we have made a successful connection
response = requests.post(
    'YOUR URL GOES HERE',
    params=params,
    cookies=cookies,
    headers=headers,
    data=data,
)

#store the file into soup and then parse it below to see if it goes to the login screen or welcome screen 
soup = BeautifulSoup(response.content, 'html.parser')

#we're putting a try here so that a failed login ends the program cleanly through the except blocks at the bottom
try:
    #Check if it is stuck on the login page, which means your cookies didn't work
    title = soup.find('title')
    if title is None or "LOGIN PAGE DESCRIMINATOR " in title.get_text():
        print("ERROR cookies not getting you past the login screen you will have problems finding the PDFS")
        #raise TypeError so the cookie-specific handler at the bottom catches it
        raise TypeError

   
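    PDFArray  = ["First", "second"]  #placeholder names, etc.; list your own entries here
    tempFileLocation = r"FILE LOCATION PATH GOES HERE"
    totalErrors = 0 
    #initialise the counters that the loops below add to
    grandTotalNumPDFs = 0
    totalNumPDFs = 0
    totalNumDelete = 0
    duplicateNum = 0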
    #these are the dates to search between. 
    lowerYear = 2020
    higherYear = 2023

    print('Connection & Login successful, looping through PDFs Array\n')

    #this loop goes through each of the pages in the array 
    for i in (range(0,len(PDFArray))):

        #print(f'begining {PDFArray[i]}\'s loop which is {i+1} in the Array')

        #declare the variables for this loop
        grandTotalNumPDFs = grandTotalNumPDFs + totalNumPDFs
        PDFFromArray = PDFArray[i]                               
        
        #reseting these variables back to 0 for this loop, as this is now a new PDF so reset the totals etc.
        totalNumPDFs = 0
        previousDateObj = 0
        
        #updating the URL and folder location to draw from and save to 
        url = f'Specific URL with your needed alterations to download PDFs'
        folder_location = os.path.join(r"PATH TO THE FOLDER",str(PDFFromArray))
        if not os.path.exists(folder_location):os.mkdir(folder_location)
       
       #We want to pull the HTML from the website to parse
        response = requests.post(
            url,
            params=params,
            cookies=cookies,
            headers=headers,
            data=data,
        )
        
        #store the HTML in the soup from the response 
        soup = BeautifulSoup(response.content, 'html.parser')
        
        #to ensure that we are logged in and not stuck at the login screen 
        title = soup.find('title')
        stuckAtLogin = title is None or "Your login specific info for the IF statement" in title.get_text()

        #Let the user know if the connection and the login were successful or not 
        print(f'For PDF:{PDFFromArray} it\'s {response.status_code==200} that the connection was successful and it\'s {not stuckAtLogin} that the login was successful')

        if stuckAtLogin:
            print("ERROR cookies not getting you past the login screen you will have problems finding the PDFS")
            #if it comes up with a timberlake login then you want to stop processing to prevent useless PDFs
            raise TypeError
   
        #parse the first part by finding all the td tags whose align attribute is center
        script = soup.findAll("td", align="center")
        
        #this is to show the Pages stuff is happening etc. but not super useful and kinda cumbersome so commented out but used for debugging
        #print(f"Begining the script loop for {PDFFromArray} it is {len(script)} tags long")

        #loop through everything in the script to parse it 
        for i in script:
        
            # We want to grab the date for later use.
            try:
                date_obj = datetime.strptime(i.text.strip(), "%m/%d/%Y")
                month = str(date_obj.month).zfill(2) # zero padding
                day = str(date_obj.day).zfill(2) # zero padding
          
          #Catch the value error so it doesn't crash 
            except ValueError:
                pass
                
            #get rid of everything that's NOT a "a" tag
            a_tags = i.findAll("a")

            #further parsing of the "a" tag
            if a_tags:

                # parsing JavaScript
                for a in a_tags:
                    pattern = r"\('(.*?)'\)"

                    #look for the particular stuff that has the javascript in it 
                    match = re.search(pattern, a["href"])

                    if match:
                        #match.group(1) returns the text captured by the first parenthesised group, i.e. the string between (' and ') in the href
                        content = match.group(1)

                        #now we want ONLY the javascript that has our text, which fortunately has this in it 
                        if "The str to find only your content" in content:

                            #next check to see if the certificate is inside the date range we want
                            if(date_obj.year >= lowerYear and date_obj.year <= higherYear):
                                
                                #Check to see if we need to add a suffix (e.g., -1 -2 -3 etc. to the filename and if so increase it otherwise reset to 0
                                if(date_obj == previousDateObj): duplicateNum = duplicateNum + 1
                                else: duplicateNum = 0
                                
                                #save the current date to check the next one
                                previousDateObj = date_obj
                                
                                #increase the print certificates by one
                                totalNumPDFs = totalNumPDFs + 1
                                
                                #set the correct URL to obtain the HTML version of the certificate, which is then converted to PDF 
                                url = f"your URL goes here "
                               
                                #if it's a duplicate date go ahead and add a dash and number otherwise just add the normal stuff 
                                #(duplicateNum was already increased above, so it is not increased again here, which would skip suffix numbers)
                                if duplicateNum > 0:
                                    filename = f"{PDFFromArray}{str(date_obj.year)[-2:]}{month}{day}-{duplicateNum}.pdf"
                                else:
                                    filename = f"{PDFFromArray}{str(date_obj.year)[-2:]}{month}{day}.pdf"
                                
                                #set the file path 
                                file_path = os.path.join(folder_location,os.path.basename(filename))
                                
                                #We've already done the cookies above so we just make a new request with the updated URL 
                                response = requests.post(
                                    url,
                                    params=params,
                                    cookies=cookies,
                                    headers=headers,
                                    data=data,
                                )
                                
                                #if the response was good go ahead otherwise print an error
                                if(response.status_code == 200):
                                  
                                    soup = response.content
                                    
                                    #write the content to the tempFileLocation to be converted to a PDF in the next step 
                                    with open(tempFileLocation, 'wb') as f:
                                            f.write(response.content)
                                            
                                    #in case a file is open or other problem 
                                    try:
                                        #write the file to a PDF    
                                        pdfkit.from_file(tempFileLocation, file_path)
                                    except Exception as err:
                                        print(f'ERROR: Couldn\'t save {filename} to {file_path}: {err}')
                                        totalErrors = totalErrors + 1
                                        #this isn't a critical error so just flag it and move on
                                    
                                #else if this was something besides 200 ( a valid one) print an error
                                else: print(f"ERROR: STATUS CODE WAS NOT 200 for {PDFFromArray} with the url {url} when trying to write out the temporary file")
                                    
                        #else if this does not have a print certificate in it, then add one to the totalNumDelete
                        else: totalNumDelete = totalNumDelete + 1
           
           #this is the end of the loop for the Pages, and loops to the next Pages in the array
        print(f'End of searching through PDF:{PDFFromArray}\'s page, {totalNumPDFs} valid PDFs were found, between the dates of {lowerYear} - {higherYear} \n')

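    #add the last page's count, which the running total at the top of the loop never picks up
    grandTotalNumPDFs = grandTotalNumPDFs + totalNumPDFs

    print(f'\nEnd of the program, program searched {len(PDFArray)} pages, and found and downloaded {grandTotalNumPDFs} PDFs between the dates of {lowerYear} - {higherYear} and had {totalErrors} error(s) saving documents')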

## These are to catch all the errors from the beginning of the program
except TypeError: 
    print("PROGRAM END WITH ERROR: TYPE ERROR, meaning the Cookie probably is NO GOOD! or expired. and you got stuck at the login screen, and to prevent a bunch of PDFs with the login screen the program has been aborted")
except:
    print('This is a general catch all ERROR no idea what went wrong here')
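
The duplicate-suffix naming used in the loop above can also be isolated into a small helper. This is only a minimal sketch, not the code I actually ran: make_filenames and its arguments are hypothetical names, and it counts repeats with a Counter, so unlike the loop above it also numbers duplicates that are not next to each other.

```python
from collections import Counter
from datetime import datetime


def make_filenames(prefix, dates):
    """Return one PDF filename per date: prefix + YYMMDD, plus -1, -2, ... for repeats."""
    seen = Counter()  # how many times each prefix+YYMMDD stem has been used so far
    names = []
    for date_obj in dates:
        # strftime zero-pads month and day, so no zfill is needed
        stem = f"{prefix}{date_obj.strftime('%y%m%d')}"
        count = seen[stem]
        names.append(f"{stem}.pdf" if count == 0 else f"{stem}-{count}.pdf")
        seen[stem] += 1
    return names


if __name__ == "__main__":
    sample = [datetime(2021, 8, 5), datetime(2021, 8, 5), datetime(2022, 1, 2)]
    print(make_filenames("First", sample))
```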

911 · Accepted Answer · 2023-05-06T01:00:18.550


I used the datetime and re modules to extract what you need. I hope this helps. Here is the code:

import re
from datetime import datetime
from bs4 import BeautifulSoup

file_location = r"yourhtml.html"
with open(file_location, "r") as f:
    file = f.read()
soup = BeautifulSoup(file, "html.parser")
script = soup.findAll("td", align="center")
print("beginning loop")
for i in script:
    a_tags = i.findAll("a")
    if a_tags:
        # parsing JavaScript
        for a in a_tags:
            pattern = r"\('(.*?)'\)"
            match = re.search(pattern, a["href"])
            if match:
                content = match.group(1)
                print(content)
    try:
        date_obj = datetime.strptime(i.text.strip(), "%m/%d/%Y")
        month = str(date_obj.month).zfill(2) # zero padding
        day = str(date_obj.day).zfill(2) # zero padding
        print(f"{str(date_obj.year)[-2:]}{month}{day}")
    except ValueError:
        continue
print("end")
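
If you need the date and the link paired up for each row, the same idea can be applied one tr at a time. The following is a rough sketch rather than a tested solution for your real page: parse_rows is a name I made up, it assumes every row has one date cell that comes before the link cell, and the regex is anchored on the PBC function name so the confirm_delete link is skipped.

```python
import re
from datetime import datetime

from bs4 import BeautifulSoup

# Only hrefs of the form javascript:PBC('...') match; confirm_delete(...) does not.
PBC_PATTERN = re.compile(r"^javascript:PBC\('(.*?)'\)")


def parse_rows(html):
    """Return a list of (yymmdd, payload) tuples, one per PBC link found."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.find_all("tr"):
        yymmdd = None  # date seen so far in this row
        for cell in row.find_all("td", align="center"):
            try:
                parsed = datetime.strptime(cell.get_text(strip=True), "%m/%d/%Y")
                yymmdd = parsed.strftime("%y%m%d")  # zero-padded YYMMDD
            except ValueError:
                pass  # not a date cell
            for a in cell.find_all("a", href=True):
                match = PBC_PATTERN.search(a["href"])
                if match and yymmdd is not None:
                    results.append((yymmdd, match.group(1)))
    return results
```

Each payload can then be joined with whatever base string your site needs to make the final link; I don't know that string, so it is left out here.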

edited May 06 '23 at 01:00

answered May 05 '23 at 03:50


Hey, so that helped me solve the problem!!!! Thank you! I had to edit a few things to get it to work, i.e., it was picking up an extra link, but I was able to solve that by checking if a word was in the string to isolate the text. You even had the right date format too, thank you! The only reason I haven't clicked solved is that I need to be able to log in to the website now (now that I have completed a successful parse on my local drive). Should I mark this as solved and create a new post for logging into the site? – William Charles May 05 '23 at 07:40
The big problem is the html that I’m accessing and downloading the PDF (which I now know what URL and how to save them) are behind a username and password (which I have) but I can’t seem to figure out how to log in to the site, a buddy said something about authentication header or something? – William Charles May 05 '23 at 07:43
Regarding login, you need to submit the username and password on the login page through the code, and save the returned cookie and other information, and carry them in subsequent requests. I think you should ask a new question about it, otherwise I may not know the specific situation – 911 May 05 '23 at 08:14
Ok! I’m going to have to learn about cookies! I just tried the basic post and such last time I tried logging in… I’ll check again and make a separate post for logging in if I can’t find it, thanks for the help!!! – William Charles May 05 '23 at 13:45
Also (I’m going to delete this comment when I figure this out, and this is partial note to future self) I just realized I only want the links from 2022-2020 I can just write a simple IF statement checking against the date right? – William Charles May 05 '23 at 13:53
Yes you can.Just check `date_obj.year` – 911 May 05 '23 at 16:09
Yeah, it was super simple! I am just now realizing, however, that one of the dates isn't processing correctly and I don't know why. I'm trying to debug it (it's saying it SHOULDN'T print it, when in reality it should). In other words, it's a javascript link that I want but the code thinks it's one to ignore, so I've got to fix that. I'll mark this as the correct answer later. I'm making a new post for the login question in a second. – William Charles May 05 '23 at 17:35
https://stackoverflow.com/questions/23102833/how-to-scrape-a-website-which-requires-login-using-python-and-beautifulsoup#:~:text=Right%20click%20the%20site%20request%20%28the%20top%20one%29%2C,cookies%20and%20headers%20to%20proceed%20with%20the%20scraping – William Charles May 05 '23 at 17:50
That looks very promising, it said "get the curl (obtained from the google chrome dev tools) and then put it into a site and the site generates python code)" however it didn't say what to do with the python code... – William Charles May 05 '23 at 17:51
Is there an easier way for me to scrape left and right instead of how I do it in the updated method?? Also that Stack Overflow answer solved my login question :D I think, I'm tinkering now... Also, in the dates, how do you get them to put a 0 in front if there isn't one (e.g. '08' vs '8')? – William Charles May 05 '23 at 18:32
About adding "0" before the date, I updated the answer. – 911 May 06 '23 at 01:01
Nice! Thanks. Do you know a better way to isolate the two links? Mine's giving me some glitches for some reason – William Charles May 06 '23 at 04:20
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/253505/discussion-between-911-and-william-charles). – 911 May 06 '23 at 05:14
ok! I finished up the code etc. for some reason it's working now. anyways I will try that, never used it before. Have a few projects to tackle and then I can chat(in the chat)/post my revised code – William Charles May 09 '23 at 19:39
OK, sent a message in chat, commenting here in case I did it wrong, it's my first time using chat, bear with me. – William Charles May 10 '23 at 17:29
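
For the login approach described in the comments (submit the username and password, keep the returned cookie, and send it with later requests), a requests.Session does the cookie handling for you. This is a minimal sketch under assumptions: the form field names "username" and "password", the example.com URLs and the helper names login and fetch_html are all placeholders, and a real site may also expect hidden form fields or tokens, in which case copying the cookies via curlconverter as discussed above may be simpler.

```python
import requests


def login(session, login_url, username, password):
    """POST the credentials once; the session stores any cookies the server sets."""
    response = session.post(
        login_url,
        data={"username": username, "password": password},  # assumed field names
        timeout=30,
    )
    response.raise_for_status()
    return response


def fetch_html(session, url):
    """GET a page with the same session, so the stored cookies are sent along."""
    response = session.get(url, timeout=30)
    response.raise_for_status()
    return response.text


# Usage (not run here, because it needs a real site and real credentials):
# with requests.Session() as session:
#     login(session, "https://example.com/login", "my_user", "my_password")
#     html = fetch_html(session, "https://example.com/results")
```

It is still worth checking the title of the returned page, as the question's final code does, to confirm the login actually worked.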
