I want to create a dataframe that contains > 100 observations on ~20 variables. Now, this will be based on a list of html files which are saved to my local folder. I would like to make sure that are matches the correct values per variable to each observation. Assuming that R would use the same order of going through the files for constructing each variable AND not skipping variables in case of errors or there like, this should happen automatically. But, is there a "save way" to this, meaning assigning observation names to each variable value when retrieving the info?
Take my sample code for extracting a variable to make it more clear:
#Specifying the url for desired website to be scrapped
url <- 'http://www.imdb.com/search/title?
count=100&release_date=2016,2016&title_type=feature'
#Reading the HTML code from the website
webpage <- read_html(url)
title_data_html <- html_text(html_nodes(webpage,'.lister-item-header a'))
rank_data_html <- html_text(html_nodes(webpage,'.text-primary'))
description_data_html <- html_text(html_nodes(webpage,'.ratings-bar+ .text-
muted'))
df <- data.frame(title_data_html, rank_data_html,description_data_html)
This would come up with a list of rank and description data, but no reference to the observation name for rank or description (before binding it in the df). Now, in my actual code one variable suddenly comes up with 1 value too much, so 201 descriptions but there are only 200 movies. Without having a reference to which movie the description belongs, it is very though to see why that happens.
A colleague suggested to extract all variables for 1 observation at a time and extend the dataframe row-wise (1 observation at a time), instead of extending column-wise (1 variable at a time), but spotting errors and clean up needs per variable seems way more time consuming this way.
Does anyone have a suggestion of what is the "best practice" in such a case?
Thank you!