
I want to create a dataframe that contains > 100 observations on ~20 variables. It will be based on a list of html files saved to my local folder. I would like to make sure that R matches the correct value of each variable to each observation. Assuming that R goes through the files in the same order when constructing each variable AND does not skip variables in case of errors or the like, this should happen automatically. But is there a "safe way" to do this, meaning assigning observation names to each variable value when retrieving the info?

Take my sample code for extracting some variables to make this clearer:

#Loading the rvest package
library(rvest)

#Specifying the url for the desired website to be scraped
url <- 'http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature'

#Reading the HTML code from the website
webpage <- read_html(url)

title_data_html <- html_text(html_nodes(webpage, '.lister-item-header a'))
rank_data_html <- html_text(html_nodes(webpage, '.text-primary'))
description_data_html <- html_text(html_nodes(webpage, '.ratings-bar+ .text-muted'))

df <- data.frame(title_data_html, rank_data_html, description_data_html)

This comes up with vectors of rank and description data, but with no reference to the observation name for each rank or description (before binding them in the df). Now, in my actual code one variable suddenly comes up with 1 value too many, i.e. 201 descriptions although there are only 200 movies. Without a reference to which movie a description belongs to, it is very tough to see why that happens.

A colleague suggested extracting all variables for 1 observation at a time and extending the dataframe row-wise (1 observation at a time), roughly as in the sketch below, instead of extending it column-wise (1 variable at a time). But spotting errors and clean-up needs per variable seems way more time consuming this way.
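To make that suggestion concrete, this is roughly how I understood it (an untested sketch; it assumes every movie on the results page sits inside a common wrapper element such as `.lister-item-content`, which I have not verified against the page):

library(rvest)

#One wrapper node per movie (assumed selector), using 'webpage' from the snippet above
movies <- html_nodes(webpage, '.lister-item-content')

df <- data.frame()
for (movie in movies) {
  #html_node() (singular) returns one match per wrapper, or a missing node
  #(NA after html_text), so each value stays attached to the movie it came from
  row <- data.frame(
    title       = html_text(html_node(movie, '.lister-item-header a')),
    rank        = html_text(html_node(movie, '.text-primary')),
    description = html_text(html_node(movie, '.ratings-bar+ .text-muted')),
    stringsAsFactors = FALSE
  )
  df <- rbind(df, row)
}

Growing a data frame with rbind in a loop is slow for large pages, but for a few hundred rows it should be fine; the point is that title, rank and description for one movie all come from the same wrapper node.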

Does anyone have a suggestion as to what the "best practice" is in such a case?

Thank you!

Carolin
  • This is a bit of a confusing question. What does " I would like to make sure that are matches the correct values per variable to each observation. " mean? It might be easier to figure out what you want if you include a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). As an aside, your sample code won't do what you think it's going to do. `for (i in ) ` will execute the loop once, with only one value for `i`. – De Novo Jun 11 '18 at 18:13
  • @DanHall: I have updated my description, I hope it is more clear now – Carolin Jun 11 '18 at 19:05

1 Answer


I know it's not a satisfying answer, but there is not a single strategy for solving this type of problem. This is the work of web scraping. There is no guarantee that the html is going to be structured in the way you'd expect it to be structured.

You haven't shown us a reproducible example (something we can run on our own machine that reproduces the problem you're having), so we can't help you troubleshoot why you ended up extracting 201 nodes during one call to html_nodes when you expected 200. Best practice here is the boring old advice to LOOK at the website you're scraping, LOOK at your data, and see where the extra or duplicate description is (or where the missing movie is). Perhaps there's an odd element with an attribute that also matches your selector text. Look at the website both as it appears in a browser and as source code. Right click, Ctrl + U (PC), or Opt + Cmd + U (Mac) are some ways to pull up the source code. Use the search function to see what matches the selector text.
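For example (a rough sketch only, reusing the selectors from your question and assuming each movie has a wrapper element like `.lister-item-content`, which you would have to confirm against the actual page), you can count how many description nodes each movie contributes:

library(rvest)

#How many matches does each selector produce on the whole page?
length(html_nodes(webpage, '.lister-item-header a'))       #titles
length(html_nodes(webpage, '.ratings-bar+ .text-muted'))   #descriptions: one too many?

#Count description matches inside each movie's wrapper; any count other
#than 1 points at the list item producing the extra (or missing) node
items <- html_nodes(webpage, '.lister-item-content')
desc_counts <- sapply(items, function(item)
  length(html_nodes(item, '.ratings-bar+ .text-muted')))
which(desc_counts != 1)

If every per-item count is 1 but the page-level count is still off by one, the extra match is coming from somewhere outside the movie list, and the browser's source view is the place to find it.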

If the html document you're working with is like the example you used, you won't be able to use the strategy you're looking for help with (extract the name of the movie together with the description). You're already extracting the names. The names are not in the same elements as the descriptions.

De Novo