0

I'm trying to webscrap a website (actually I'm downloading a .xlsx file) with R, but I got stuck since I don't know too much about both R and HTML.

I found the code below here, but it is not doing all the job. On the website I need to select (on a dropdown menu) the department and the checklist type, and then click on a download button (which opens the "download url" from the code bellow). The problem is the website allows me to download data from one department at a time.

So I would like to find a way to do this with R, and download the files from all the departments just changing the ID (from both department and checklist).

install.packages("rvest")
library(rvest)

url <- "https://website.com/auth/login?redirectUrl="
download_url <- "https://website.com/departments/file.xlsx"
session <- html_session(url)
form <- html_form(session)[[1]]

filled_form <- set_values(form,
                          email = "myusername", # "email" is an element from website HTML code
                          password = "mypassword") # "password" is an element from website HTML code

## Save main page url
main_page <- submit_form(session, filled_form)

#after login and submit_form do this:
download <- jump_to(main_page, download_url)

# write file to current working directory
writeBin(download$response$content, basename(download_url))

The area ID has the format in the HTML code bellow. I guess the JS function gets this setIdPlantaAtual(123) value and filter the database when I select the download button. The other filter seems to be based on the setModuloAtual parameter (equivalent to checklist type).

// Department HTML code

<h5><a onclick="setIdPlantaAtual(123)" href="javascript:;">Department ABC <span class="badge badge-success">selecionar</span></a></h5>

// Checklist HTML code

<li class="new" onclick="setModuloAtual('Checklist1')" style="cursor:pointer;">

Bellow are the two functions I found on the website code.

        function setIdPlantaAtual(_idplantaatual) {
            $.blockUI({
                message: '<h1> Wait...</h1>'
            });
            $.post(baseUrl + '/index/idplantaatual', {
                id: _idplantaatual
            }, function(res) {
                if (res == 'OK') {
                    location.reload();
                } else {
                    $.unblockUI();
                }
            });

        }

        function setModuloAtual(modulo) {
            if (modulo == "Tratativas de NCs") {
                modulo = "Tratativas de NC\'s";
            } 
           
            $.post(baseUrl + '/index/moduloatual', {
                modulo: modulo
            }, function(res) {
                console.log('aaa',res)
                if (res == 'OK') {
                    // location.reload();
                    window.location.replace(baseUrl);
                } else if (res.split('$%$').length > 0) {
                    var new_modulo = res.split('$%$')[1];
                    if (new_modulo != undefined) {
                        window.location.replace(baseUrl + '/' + new_modulo);
                    } else {
                        window.location.replace(baseUrl + '/');
                    }
                } else {
                    console.log(res);
                }
            });
        }

I think the same approach from my first attempt would work to merge the department ID (setIdPlantaAtual()) and the checklist ID (setModuloAtual()) into the download url, but I don't know how to do that.

I appreciate any help!

Phil
  • 7,287
  • 3
  • 36
  • 66
Gustavo
  • 1
  • 2
  • if you don't know R then use langauge which you know. There should be modules for scraping in Python, Java, JavaScript, Perl, etc. – furas Mar 29 '22 at 21:05
  • Thank you for your suggestion. Actually I'm using R because this is the first time I get some good results. Already tried Power Query (PBI), Power Automate (Online and Desktop versions) and VBA, but all that got was an empty/corrupted .xslx file. – Gustavo Mar 29 '22 at 22:09
  • It would be useful if you could provide the link to the XLSX file or the website. It would be more easy to provide a solution. – Emmanuel Hamel Apr 18 '23 at 02:03

0 Answers0