1

I have a dataset where caseness for a condition is stored across multiple variables, and needs to be collapsed into a single variable by assigning from each one sequentially. The below R code explains what I'm aiming for.

dataset$caseness <- NULL
dataset$caseness[dataset$a_case=="Y"] <- "Yes"
dataset$caseness[dataset$a_case=="N"] <- "No"
dataset$caseness[dataset$b_case=="Y" & is.na(dataset$caseness)] <- "Yes"
dataset$caseness[dataset$b_case=="N" & is.na(dataset$caseness)] <- "No"
dataset$caseness[dataset$c_case=="Y" & is.na(dataset$caseness)] <- "Yes"
dataset$caseness[dataset$c_case=="N" & is.na(dataset$caseness)] <- "No"

#etc.

Some example results of this would be:

a_case b_case c_case caseness
Y      NA     NA     Yes
NA     N      NA     No
N      N      Y      No
NA     NA     NA     NA

This code behaves exactly how I want it to (including the way it handles contradictions), but is long and clumsy, and extends over many lines because I have so many variables denoting caseness. Is there a more efficient way of doing this, for example by looping or similar?

Alice
  • 99
  • 1
  • 9

3 Answers3

3

You can use apply. Assuming your starting data looks like this:

df <- structure(list(a_case = c("Y", NA, "N", NA), b_case = c(NA, "N", 
"N", NA), c_case = c(NA, NA, "Y", NA)), row.names = c(NA, -4L
), class = "data.frame")

df
#>   a_case b_case c_case
#> 1      Y   <NA>   <NA>
#> 2   <NA>      N   <NA>
#> 3      N      N      Y
#> 4   <NA>   <NA>   <NA>

You would just do:

df$caseness <- apply(df, 1, function(x){
  if(length(x[!is.na(x)]) == 0) NA else x[!is.na(x)][1]})

df
#>   a_case b_case c_case caseness
#> 1      Y   <NA>   <NA>        Y
#> 2   <NA>      N   <NA>        N
#> 3      N      N      Y        N
#> 4   <NA>   <NA>   <NA>     <NA>

Created on 2020-07-06 by the reprex package (v0.3.0)

Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
1

If you want to make it shorter but retain some amount of human readability of the case rules you can use case_when. Since it works in the order you specify there's no need to constantly check to see whether an earlier rule is in effect

library(dplyr)


dataset %>%
  mutate(caseness = case_when(
    a_case == "N" ~ "No",
    a_case == "Y" ~ "Yes",
    b_case == "N"  ~ "No",
    b_case == "Y"  ~ "Yes",
    c_case == "N"  ~ "No",
    c_case == "Y"  ~ "Yes",
  ))

# A tibble: 4 x 4
  a_case b_case c_case caseness
  <chr>  <chr>  <chr>  <chr>    
1 Y      NA     NA     Yes      
2 NA     N      NA     No       
3 N      N      Y      No       
4 NA     NA     NA     NA       
Chuck P
  • 3,862
  • 3
  • 9
  • 20
  • Thanks! I like this solution in principle, but having a bit of trouble. When I execute the command, R attempts to print my df until `max.print` is reached, is there a way of suppressing this? Also, when I go to tabulate the resulting column using `table(df$caseness, exclude=NULL` it says `table of extent 0`. I've had trouble getting `mutate` to work before - sorry if these are obvious! – Alice Jul 06 '20 at 16:01
  • The easiest way to make the printing annoyance go away is by making it a tibble `dataset <- as_tibble(dataset)` – Chuck P Jul 06 '20 at 16:19
  • Would you please add a sample of your real data `dput(head(datset))` and what version of `dplyr` are you running? – Chuck P Jul 06 '20 at 16:21
0

If I understand correctly, you just need to get the first value in a row, that is not NA.

Then you can go through rows with apply() and find the first occurence of non-NA value, i.e. x[!is.na(x)][1]:

# your dataset
dataset <- read.table(header = TRUE, text = "
  a_case b_case c_case
  Y      NA     NA
  NA     N      NA
  N      N      Y
  NA     NA     NA
")

# new column
dataset$caseness <- apply(dataset, 1, function(x) x[!is.na(x)][1])

dataset
#>   a_case b_case c_case caseness
#> 1      Y   <NA>   <NA>        Y
#> 2   <NA>      N   <NA>        N
#> 3      N      N      Y        N
#> 4   <NA>   <NA>   <NA>     <NA>

Created on 2020-07-06 by the reprex package (v0.3.0)

Note it's NA by default, if the entire row is filled with NAs.

If you want only columns ending with _case, you just use endsWith() to select columns ending on a certain pattern from base R along with getting column names of your dataset by names() inside apply():

apply(dataset[endsWith(names(dataset), "_case")], 1, function(x) x[!is.na(x)][1])