I am currently trying to import PDF tables from a large number of files with tabulizer. Tabulizer works amazingly for PDFs; all I need to do is:
library(tabulizer)
table <- extract_tables("pdf_path_or_url")  # accepts a local file path or a URL
However, the issue I am having is that the website I am trying to extract these PDFs from requires you to log in (free) before it will show the PDFs. So I am trying to log in to the website using rvest and httr and then scrape the PDFs.
library(rvest)

url <- "https://www.krollbondratings.com/show_report/20265"
session <- html_session(url)

# follow the login redirect and grab the second form on that page
login <- jump_to(session, "https://www.krollbondratings.com/auth?uri=/show_report/20265")
form <- html_form(read_html(login))[[2]]

filled_form <- set_values(form,
                          email    = "my_email",
                          password = "password")

# submit the credentials; the cookies live on the session's shared handle
pdf <- submit_form(session, filled_form)
This is where I am stuck. I know I am headed in the right direction, as the output of "submit_form(session, filled_form)" is:
<session> https://www.krollbondratings.com/show_report/20265
Status: 200
Type: application/pdf
Size: 260625
Clearly, it is actually successfully logging in and seeing the PDF. However, I have no idea how to make it stay logged in and actually download/access the PDF with either download.file or tabulizer's extract_tables.
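My best guess so far is to pull the raw bytes out of the httr response attached to the returned session, write them to a temporary file, and then point tabulizer at that file. This is only a sketch, and I am assuming that pdf$response really holds the PDF body and that nothing more is needed to stay authenticated:

library(httr)
library(tabulizer)

# the session returned by submit_form carries the httr response;
# content(..., "raw") should give the PDF bytes if the login held
raw_pdf <- content(pdf$response, "raw")

tmp <- tempfile(fileext = ".pdf")
writeBin(raw_pdf, tmp)          # write the bytes to a local .pdf file

tables <- extract_tables(tmp)   # tabulizer reads the local copy

Is this the right approach, or is there a cleaner way to hand the authenticated response to extract_tables?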
Downloading a file after login using a https URL
This is the best tutorial I have found, but it does not actually download a PDF file specifically; it downloads an HTML file, which is useless to me.
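Based on that answer, I think the analogous move here might be to re-request the report URL inside the authenticated session and dump the body to disk myself, since download.file opens a fresh connection that carries no cookies. Again only a sketch, assuming the cookies set at login persist on the session's handle:

# re-request the PDF within the logged-in session (cookies ride on the
# shared handle), then write the raw response body straight to disk
report <- jump_to(pdf, "https://www.krollbondratings.com/show_report/20265")
writeBin(report$response$content, "report.pdf")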
Thank you for your time, everyone.