
I am trying to import PDF tables from a large number of files with tabulizer. Tabulizer works very well for PDFs; all I need to do is:

table <- extract_tables("pdf_path")  # a PDF URL also works in place of the path

However, the website I am trying to extract these PDFs from requires a (free) login to view them. So I am trying to log in to the website using rvest and httr and then scrape the PDFs.

url <- 'https://www.krollbondratings.com/show_report/20265'
session <- html_session(url)
url <- jump_to(session, "https://www.krollbondratings.com/auth?uri=/show_report/20265")
form <- html_form(read_html(url))[[2]]
filled_form <- set_values(form,
                          email = "my_email",
                          password = "password")

pdf <- submit_form(session, filled_form)

This is where I am stuck. I know I am headed in the right direction, because the output of "submit_form(session, filled_form)" is:

<session> https://www.krollbondratings.com/show_report/20265
  Status: 200
  Type:   application/pdf
  Size:   260625

Clearly, it is logging in successfully and seeing the PDF. However, I have no idea how to make it stay logged in and actually download or access the PDF with either download.file or tabulizer's extract_tables.

Downloading a file after login using a https URL

This is the best tutorial that I have found, but it downloads an HTML file rather than a PDF, which is useless to me.

Thank you for your time, everyone.

Marc

1 Answer


Solved: the session does download the PDF, but as raw bytes in the response rather than as a .pdf file on disk, so the bytes have to be written out manually.

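library(rvest)

# Open a session, then jump to the login page that redirects back to the report
url <- 'https://www.krollbondratings.com/show_report/20265'
session <- html_session(url)
login_page <- jump_to(session, "https://www.krollbondratings.com/auth?uri=/show_report/20265")
form <- html_form(read_html(login_page))[[2]]
filled_form <- set_values(form,
                          email = "my_email",
                          password = "password")

# Submitting the form returns a session whose response body is the PDF itself
pdf <- submit_form(session, filled_form)

# Write the raw bytes to disk; basename() gives "20265", so add the .pdf extension
download_url <- 'https://www.krollbondratings.com/show_report/20265'
writeBin(pdf$response$content, paste0(basename(download_url), ".pdf"))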
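Since the goal is to hand the file to tabulizer, it may help to check that the bytes really are a PDF before writing them, because a failed login would typically return an HTML page instead. The following is only a sketch: the helper name save_pdf_bytes is made up for illustration, and the demonstration uses stand-in bytes rather than a real response from the site.

```r
# Hypothetical helper: write raw bytes to disk only if they look like a PDF.
save_pdf_bytes <- function(bytes, path) {
  stopifnot(is.raw(bytes), length(bytes) >= 4)
  # PDF files start with the ASCII signature "%PDF".
  if (!identical(bytes[1:4], charToRaw("%PDF"))) {
    stop("Response body does not look like a PDF; the login may have failed.")
  }
  writeBin(bytes, path)
  invisible(path)
}

# Offline demonstration with stand-in bytes instead of a real response body.
fake_bytes <- charToRaw("%PDF-1.4 stand-in content")
out_path <- save_pdf_bytes(fake_bytes, tempfile(fileext = ".pdf"))
```

With the real session, the call would presumably be save_pdf_bytes(pdf$response$content, "20265.pdf"), after which extract_tables("20265.pdf") can read the saved file like any other local PDF.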
Marc