
I would like to create a function that updates a data frame from a different environment. Specifically, I would like to update the labels of a data frame using the Hmisc::label() function.

assign_label <- function(df, col) {
  col <- rlang::as_name(rlang::ensym(col))
  Hmisc::label(df[,col]) <- fetch_label(col)
}

fetch_label <- function(col) {
  val <- c("mpg" = "MPG",
           "hp" = "HP") 
  unname(val[col])
}

The following code executes without issue: `assign_label(mtcars, hp)`

However, it does not actually alter the data frame in the calling environment. I can't figure out how to make it do what I have in mind.

Ideally, I would like to be able to pipe a data frame to this function like so:

mtcars %>% assign_label(mpg)

Dylan Russell
  • Up front, I don't know how ... but you're fighting an uphill battle here, as most of R operates under a paradigm where arguments are pass-by-value, not pass-by-reference. `data.table` does things by-ref, perhaps you can look there. If you don't need frames per se and can deal with `environment`s, they are (always) pass-by-reference, so depending on your use case, perhaps you can shift in that direction. – r2evans Aug 28 '20 at 22:03
  • I would strongly discourage you from doing that. Having functions that update variables outside their scope is not very functional and not how most functions in R work. It's better to return the object with the updated values and, if you want to save it, just reassign it to the original variable. – MrFlick Aug 28 '20 at 22:03
  • If your concern is that your tools work with larger datasets where R's propensity for frequent copying is a problem, then ... `data.table`. – r2evans Aug 28 '20 at 22:05
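
To illustrate the point about environments made in the comments, here is a minimal base-R sketch. The names `store` and `label_in_place` are hypothetical, and a plain `"label"` attribute is used as a stand-in for `Hmisc::label()` so the snippet runs without extra packages.

```r
# An environment is passed by reference, so a function can change its contents.
store <- new.env()
store$df <- head(mtcars)

# Hypothetical helper: labels a column of the data frame held in `env`.
label_in_place <- function(env, col, text) {
  attr(env$df[[col]], "label") <- text  # modifies the object stored in env
  invisible(NULL)                       # nothing needs to be returned
}

label_in_place(store, "hp", "HP")
hp_label <- attr(store$df$hp, "label")  # the change is visible to the caller
```

The cost of this design is that callers have to keep their data inside an environment rather than in an ordinary data frame variable.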

2 Answers


1) Return modified object. Modifying objects in place is discouraged in R. The usual approach is to return the data frame and then assign it to a new name, or back to the original name, clobbering or shadowing it.

assign_label <- function(df, col) {
  col <- deparse(substitute(col))
  Hmisc::label(df[[col]]) <- fetch_label(col)
  df
}

mtcars_labelled <- mtcars %>% assign_label(mpg)
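
A small base-R sketch of why the return value matters: the function works on its own copy, so the caller's object is untouched unless the result is assigned. `set_label` is a hypothetical helper that writes a plain `"label"` attribute as a stand-in for `Hmisc::label()`.

```r
# Hypothetical stand-in for the Hmisc-based function above.
set_label <- function(df, col, text) {
  attr(df[[col]], "label") <- text  # changes the function's local copy only
  df                                # the modified copy is the return value
}

cars <- head(mtcars)
labelled <- set_label(cars, "mpg", "MPG")

original_label <- attr(cars$mpg, "label")      # NULL: caller's object unchanged
new_label      <- attr(labelled$mpg, "label")  # "MPG"
```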

2) magrittr. Despite what we have said above, there are some facilities for modifying in place in R and in some R packages. The magrittr package provides a syntax for overwriting or shadowing the input. Using the definition in (1) we can write:

library(magrittr)
mtcars %<>% assign_label(mpg)

If mtcars were in the global environment, this would overwrite it with the new value. In this case mtcars lives in the datasets package, so a new mtcars is written to the caller's environment and the original in datasets is unchanged.
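
The shadowing behaviour can be checked in base R without magrittr. This is only a sketch: it sets a plain `"label"` attribute as a stand-in for `Hmisc::label()`, and the variable names are illustrative.

```r
# Assigning into mtcars reads the copy from the datasets package and
# writes a new binding called mtcars in the current environment.
attr(mtcars$mpg, "label") <- "MPG"

shadow_label  <- attr(mtcars$mpg, "label")            # label on the new copy
package_label <- attr(datasets::mtcars$mpg, "label")  # NULL: package copy untouched
```

Calling `rm(mtcars)` afterwards would remove the shadowing copy and make the name refer to the datasets version again.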

3) Replacement function. Although not widely used, R does provide replacement functions, which are defined and used like this. This also overwrites or shadows the input.

`assign_label<-` <- function(df, value) {
  Hmisc::label(df[[value]]) <- fetch_label(value)
  df
}

assign_label(mtcars) <- "mpg"
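
As a rough sketch of what happens behind the scenes, R rewrites a call of the form `f(x) <- v` as `x <- "f<-"(x, value = v)`. The example below uses a hypothetical `col_label<-` with an extra `col` argument and a plain `"label"` attribute standing in for `Hmisc::label()`.

```r
# Hypothetical replacement function; the last argument must be named `value`.
`col_label<-` <- function(df, col, value) {
  attr(df[[col]], "label") <- value
  df
}

cars_a <- head(mtcars)
col_label(cars_a, "hp") <- "HP"   # replacement syntax

# The explicit form that R evaluates for the line above:
cars_b <- head(mtcars)
cars_b <- `col_label<-`(cars_b, "hp", value = "HP")

same_result <- identical(cars_a, cars_b)
```

So a replacement function is still an ordinary function that returns a modified copy; only the assignment back to the name is done for you.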

Note

As an aside, if the aim is for an interface that is consistent with tidyverse then use tidyselect to retrieve the column name(s) so that examples like the following work:

library(dplyr)  # provides select(), starts_with() and the pipe

assign_labels <- function(df, col) {
  nms <- names(select(df, {{col}}))
  for(nm in nms) Hmisc::label(df[[nm]]) <- fetch_label(nm)
  df
}

mtcars_labelled <- mtcars %>% assign_labels(starts_with("mp"))
str(mtcars_labelled)

mtcars_labelled <- mtcars %>% assign_labels(mpg|hp)
str(mtcars_labelled)
G. Grothendieck
  • Is there any reason to use `deparse(substitute(col))` as opposed to `rlang::as_name(rlang::ensym(col))`? – Dylan Russell Aug 28 '20 at 23:12
  • Seems pointless to add a package when it can be done in less code in base R. – G. Grothendieck Aug 28 '20 at 23:14
  • Well, aside from the fact that I've used that package elsewhere in my script, that makes sense. I didn't know if there was anything "safer" or more robust about one or the other. – Dylan Russell Aug 28 '20 at 23:16
  • @DylanRussell: `substitute()` can have [undesirable behavior in nested functions](https://stackoverflow.com/a/53215820/300187). Otherwise the two are equivalent. – Artem Sokolov Aug 29 '20 at 02:50
  • What that shows is that NSE is not a good idea in the first place. In (1), if we had designed the function to take a character string for `col`, then all this complexity would be eliminated, the function would be shorter (the first line of the body could be eliminated), and we could pass expressions that compute the name just as easily as passing the name, e.g. `assign_label(mtcars, names(mtcars)[1])` to pass the first column. – G. Grothendieck Aug 29 '20 at 08:25
  • @G.Grothendieck In this particular case, I agree: since the labels are strings, it makes sense to keep the function interface consistent. However, I think it's important to keep the `substitute()` issue in mind, because it often arises when the function is nested inside a map-style iterator (e.g., [lm](https://stackoverflow.com/a/58530100/300187) and [glm](https://stackoverflow.com/questions/57319130/purrrmap-and-glm-issues-with-call/57528229)). In short: 1) prefer SE over NSE; 2) when using NSE, prefer `rlang` functions over `substitute()`, if you ever plan to iterate over the NSE arguments. – Artem Sokolov Aug 29 '20 at 16:06
  • I went with NSE partially just to practice NSE but also to keep everything consistent with the NSE seen in `dplyr` pipes, so users don't have to remember which functions call for characters and which call for symbols. – Dylan Russell Aug 30 '20 at 04:46
  • See Note added to end of answer. – G. Grothendieck Aug 30 '20 at 12:36
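
For readers following the `substitute()` discussion in the comments, here is a minimal base-R sketch of the nested-function problem. The function names are hypothetical, and the `eval.parent()` workaround is just one possible base-R approach, not something proposed in the thread.

```r
# Captures the name of its argument without evaluating it.
name_of <- function(col) deparse(substitute(col))

# Naive wrapper: substitute() inside name_of() now sees the symbol `col`.
wrapper <- function(col) name_of(col)

# One workaround: rebuild the call with the caller's expression and
# evaluate it in the caller's frame.
wrapper_fixed <- function(col) eval.parent(substitute(name_of(col)))

direct <- name_of(mpg)        # "mpg"
nested <- wrapper(mpg)        # "col", not "mpg"
fixed  <- wrapper_fixed(mpg)  # "mpg"
```

Taking the column name as a character string, as suggested above, avoids the issue altogether.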

Regarding the comments about not modifying objects outside the scope of a function, I created two functions that return new data frames with labels.

fetch_label <- function(col) {
  val <- c("mpg" = "MPG",
           "hp" = "HP") 
  unname(val[col])
}
 
assign_label <- function(df, col) {
  col <- rlang::as_name(rlang::ensym(col))
  Hmisc::label(df[[col]]) <- fetch_label(col)
  return(df)
}

assign_labels <- function(df) {
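  # iwalk() passes each column as .x and its name as .y;
  # columns without an entry in fetch_label() get an NA label
  purrr::iwalk(df, function(.x, .y) {
    lab <- fetch_label(.y)
    Hmisc::label(df[[.y]]) <<- lab
  })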
  return(df)
}

mtcars <- mtcars %>% assign_label(hp)
mtcars <- mtcars %>% assign_labels()
Dylan Russell
  • The `return(df)` in the first function is redundant (and in the second function `return` isn’t needed either, `df` is sufficient). – Konrad Rudolph Aug 28 '20 at 23:45