2

I am trying to generate a new column based on grouping variables in R.

I have a data frame of test results where each student's end of year exam results are in one row.

Students are in two groups A and B. Group A were taught Chemistry in Semester 1 then English in Semester 2, and Group B were taught the other way round. The exam results are all from the end of the year.

ID Group English Chemistry
1    A      9     4
2    B      7     3
3    B      7     6
4    A      3     10
etc

I want to see if the order of teaching results in differences in exam scores, so I need a column called Sem1 that Includes Group A's Chemistry test and Group B's English, and another column called Sem2 that includes Group A's English and Group B's Chemistry.

so it would look like this:

ID Group English Chemistry   Sem1  Sem2
1    A      9     4          4       9
2    B      7     3          3       7
3    B      7     6          6       6
4    A      3     10         10      3
etc

I could then do statistics based on Semesters. I suspect this is not difficult but I am simple. All help appreciated!

Neill
  • 21
  • 2
  • For `ID=3` why is `Sem1=6` and `Sem2=6`? What happened to `English=7`? – Maurits Evers May 14 '18 at 12:10
  • I might not understand you correctly, but isn't the information you need already stored in the `Group` column? What is the purpose of adding these new columns when you can run, for example, `t.test(English~Group)` and `t.test(Chemistry~Group)`? – tifu May 14 '18 at 12:17
  • Thanks everyone, and for spotting my error Maurits. You're quite right. Tifu it may not be necessary to do this as you say, but we wanted to just compare all semester 1's subjects to all Semester 2's subjects at once so we can test the hypothesis that 'semester order does not affect overall performance in Chemistry/English'. – Neill May 14 '18 at 13:51

2 Answers2

2

You can use ifelse and mutate.

require(tidyverse)

#Sample data
df <- data.frame(ID = c(1:4), 
                 Group = c("A", "B", "B", "A"), 
                 English = c(9, 7, 7, 3), 
                 Chemistry = c(4, 3, 6, 10))

df %>% 
  mutate(Sem1 = ifelse(Group == "A", Chemistry, English), 
         Sem2 = ifelse(Group == "A", English, Chemistry))

Results:

  ID Group English Chemistry Sem1 Sem2
1  1     A       9         4    4    9
2  2     B       7         3    7    3
3  3     B       7         6    7    6
4  4     A       3        10   10    3

Edit - Suggesting case_when and doing benchmark for:ifelse, case_when, and transform

using the same sample data you can also use dplyr::case_when().

df %>% 
        mutate(Sem1 = case_when(Group == "A" ~ Chemistry, 
                                Group == "B" ~ English),
               Sem2 = case_when(Group == "A" ~ English,
                                Group == "B" ~ Chemistry))

However, including @Maurits Evers answer who uses base R transform, I was wondering which is the fastest.

New sample data

df <- data.frame(ID = c(1:100), 
                 Group = rep(sample(c("A", "B"), replace = TRUE), 100), 
                 English = rnorm(100, mean = 85, sd = 10), 
                 Chemistry = rnorm(100, mean = 85, sd = 10))

Benchmark:

require(rbenchmark) 

benchmark("ifelse" = {df %>% 
    mutate(Sem1 = ifelse(Group == "A", Chemistry, English), 
           Sem2 = ifelse(Group == "A", English, Chemistry))
},
"case_when" = {
  df %>% 
    mutate(Sem1 = case_when(Group == "A" ~ Chemistry, 
                            Group == "B" ~ English),
           Sem2 = case_when(Group == "A" ~ English,
                            Group == "B" ~ Chemistry))
},
"transform" = {
  transform(
    df, 
    Sem1 = ifelse(Group == "A", Chemistry, English), 
    Sem2 = ifelse(Group == "A", English, Chemistry))
},
replications = 1000,
columns = c("test", "replications", "elapsed",
            "relative", "user.self", "sys.self")) 

Results:

       test replications elapsed relative user.self sys.self
2 case_when         1000    2.18    4.449      2.11     0.01
1     ifelse         1000    1.58    3.224      1.57     0.00
3 transform         1000    0.49    1.000      0.48     0.00
DJV
  • 4,743
  • 3
  • 19
  • 34
  • This is great - the transform function seemed the easiest to me and worked well. Thanks! – Neill May 14 '18 at 13:55
  • You're welcome :) I think that my main issue with transform (which is @Maurits Evers) is that you cannot pipe (%>%) it. Which might ruin the flow. – DJV May 14 '18 at 13:57
1

Assuming you made a mistake in row 3 of your expected output, here is a base R solution using transform:

transform(
    df, 
    Sem1 = ifelse(Group == "A", Chemistry, English), 
    Sem2 = ifelse(Group == "A", English, Chemistry))
#  ID Group English Chemistry Sem1 Sem2
#1  1     A       9         4    4    9
#2  2     B       7         3    7    3
#3  3     B       7         6    7    6
#4  4     A       3        10   10    3

Sample data

df <- read.table(text =
    "ID Group English Chemistry
1    A      9     4
2    B      7     3
3    B      7     6
4    A      3     10", header = T)
Maurits Evers
  • 49,617
  • 4
  • 47
  • 68