
I have a data set like this:

ID  col1 col2 col3  col5  col6
1   AA   TC   GC    CC    TA
2   AG   TC   CC    CC    TT
3   AG   TC   GG    CC    AA
4   GG   TC   GC    CC    AT
5   AA   CC   CC    CC    AA
6   AG   TT   GC    CC    TT
7   GG   TC   CC    CC    TA

I want to make a new column based on conditions like:

If col1 == A|col2==T|col3==G|col4==C|col5==T assign 001 
If col1 == G|col2==C|col3==C|col4==C|col5==A assign 002
Otherwise assign other.

The position of the letters within a pair is not important. For example, it does not matter whether col5 has T or A first.

So the data should look like this:

ID  col1 col2 col3  col5  col6   new_column
1   AA   TC   GC    CC    TA     001+002 # because both of the conditions are found here 
2   AG   TC   CC    CC    TT     other+other
3   AG   TC   GG    CC    AA     other+other
4   GG   TC   GC    CC    AT     002+other     
5   AA   CC   CC    CC    AA     other+other
6   AG   TT   GC    CC    TT     001+other
7   GG   TC   CC    CC    TA     002+other

Can you please help me?

Marwah Al-kaabi
  • Hi! Can you please post a reproducible data frame? You can use dput for that. For example, have a look at https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Shibaprasadb Aug 16 '21 at 15:31
  • When you use the ```|``` symbol, do you mean 'or' or 'and'? In other words, do all conditions in ```col1 == A|col2==T|col3==G|col4==C|col5==T``` have to be fulfilled to assign 001? – elielink Aug 16 '21 at 15:42
  • @elielink ```|``` means 'and' here. Yes, all the conditions have to be fulfilled to assign 001. – Marwah Al-kaabi Aug 16 '21 at 15:45
  • Is it guaranteed that there are always two letters in each column? And can you add a bit of context around what you intend to do with the new column afterwards? I wonder if a different output format might be more suited to your needs. – coffeinjunky Aug 16 '21 at 16:03
  • @ stat_shib Hi I used `dput` and I got this : `structure(list(col1 = c(NA, "CC", "TC", "CC", "CC", "TC"), col2 = c(NA, "AA", "AA", "AA", "AA", "AA"), col3 = c("CC", "CC", "CC", "CC", "CC", NA), col4 = c("GG", "GG", "GG", "GG", "CG", "CG"), col5 = c("CC", "CC", "CC", "CC", "CC", "CC"), col6= c("TT", "TT", "TT", "TC", "TC", "TT")), class = "data.frame"` – Marwah Al-kaabi Aug 16 '21 at 16:04
  • @coffeinjunky, hi, it's either 2 letters or NA. I just need the new column format as I explained in the example; I'll use the output in a statistical analysis. Thanks ;) – Marwah Al-kaabi Aug 16 '21 at 16:08
  • @MarwahAl-kaabi then if all conditions need to be fulfilled, the result for the first line cannot be 001 and 002, right? There is no G in col1 of line 1. – elielink Aug 16 '21 at 16:42

1 Answer


OK, here is what I've done based on your instructions (I think there were some errors in the column names, so I corrected them).

df = read.table(text = 'ID  col1 col2 col3  col4  col5
1   AA   TC   GC    CC    TA
2   AG   TC   CC    CC    TT
3   AG   TC   GG    CC    AA
4   GG   TC   GC    CC    AT
5   AA   CC   CC    CC    AA
6   AG   TT   GC    CC    TT
7   GG   TC   CC    CC    TA', header = TRUE)

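# For each row, test both conditions and collect the two labels
a = apply(df, 1, function(i){
  n = c()
  # apply() passes the row as a plain vector; rebuild a one-row data frame
  i = as.data.frame(matrix(i, nrow = 1))
  colnames(i) = colnames(df)
  # Condition 001: grepl() ignores where the letter sits within the pair
  if(grepl('A',i$col1) & grepl('T',i$col2) & grepl('G',i$col3) & grepl('C',i$col4) & grepl('T',i$col5)){
    n = c(n, '001')
  }else{
    n = c(n, 'other')
  }
  # Condition 002
  if(grepl('G',i$col1) & grepl('C',i$col2) & grepl('C',i$col3) & grepl('C',i$col4) & grepl('A',i$col5)){
    n = c(n, '002')
  }else{
    n = c(n, 'other')
  }
  return(n)
})

# a has one column per row of df; join the two labels with '+'
df$new_column = apply(a, 2, function(i){
  paste(i, collapse = '+')
})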

df

The output looks like this:

  ID col1 col2 col3 col4 col5  new_column
1  1   AA   TC   GC   CC   TA   001+other
2  2   AG   TC   CC   CC   TT other+other
3  3   AG   TC   GG   CC   AA other+other
4  4   GG   TC   GC   CC   AT   other+002
5  5   AA   CC   CC   CC   AA other+other
6  6   AG   TT   GC   CC   TT   001+other
7  7   GG   TC   CC   CC   TA   other+002

Is this what you wanted?
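
Since `grepl()` is vectorised, the same labels can also be computed on whole columns without looping over rows. This is only a minimal sketch of that idea, assuming the corrected column names used above; the helper `has` and the vectors `is001` and `is002` are hypothetical names introduced for illustration.

```r
df <- read.table(text = 'ID  col1 col2 col3  col4  col5
1   AA   TC   GC    CC    TA
2   AG   TC   CC    CC    TT
3   AG   TC   GG    CC    AA
4   GG   TC   GC    CC    AT
5   AA   CC   CC    CC    AA
6   AG   TT   GC    CC    TT
7   GG   TC   CC    CC    TA', header = TRUE)

# Hypothetical helper: TRUE where the pair in `x` contains `letter` in
# either position ("TA" and "AT" both contain "T"); NA gives FALSE.
has <- function(x, letter) grepl(letter, x, fixed = TRUE)

# One logical vector per condition, with one element per row of df
is001 <- has(df$col1, "A") & has(df$col2, "T") & has(df$col3, "G") &
  has(df$col4, "C") & has(df$col5, "T")
is002 <- has(df$col1, "G") & has(df$col2, "C") & has(df$col3, "C") &
  has(df$col4, "C") & has(df$col5, "A")

# Join the two labels with '+', in the same order as the loop version
df$new_column <- paste(ifelse(is001, "001", "other"),
                       ifelse(is002, "002", "other"), sep = "+")
```

Because `grepl()` returns `FALSE` for `NA`, a row with a missing value in any tested column would simply get `other` for that condition.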

elielink
  • Hi, I really appreciate your time and help, thank you. What if I have 15 conditions that I want to apply? What should I change in this code? – Marwah Al-kaabi Aug 17 '21 at 01:57
  • Each condition is written as a call like ```grepl('C',i$col3)```; you can add more of them manually. The ```&``` stands for 'AND', so the full test looks like "condition1" & "condition2" & ... & "condition15" – elielink Aug 17 '21 at 06:19
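
With many conditions, chaining `grepl()` calls by hand gets long. One possible alternative is to store the conditions as data and loop over them. The sketch below assumes the same `df` as in the answer; `rules` and `match_rule` are hypothetical names, and the two rules listed are just the ones from the question, to be extended as needed.

```r
df <- read.table(text = 'ID  col1 col2 col3  col4  col5
1   AA   TC   GC    CC    TA
2   AG   TC   CC    CC    TT
3   AG   TC   GG    CC    AA
4   GG   TC   GC    CC    AT
5   AA   CC   CC    CC    AA
6   AG   TT   GC    CC    TT
7   GG   TC   CC    CC    TA', header = TRUE)

# Hypothetical rule table: one named vector per label, giving the letter
# that each column must contain. Add more entries for more conditions.
rules <- list(
  "001" = c(col1 = "A", col2 = "T", col3 = "G", col4 = "C", col5 = "T"),
  "002" = c(col1 = "G", col2 = "C", col3 = "C", col4 = "C", col5 = "A")
)

# Hypothetical helper: returns `label` for rows where every column named
# in `rule` contains its letter, and "other" for all remaining rows.
match_rule <- function(data, rule, label) {
  hits <- Map(function(col, letter) grepl(letter, data[[col]], fixed = TRUE),
              names(rule), rule)
  all_match <- Reduce(`&`, hits)  # AND across all columns of the rule
  ifelse(all_match, label, "other")
}

# One vector of labels per rule, then joined row-wise with '+'
labels <- Map(function(rule, label) match_rule(df, rule, label),
              rules, names(rules))
df$new_column <- do.call(paste, c(unname(labels), sep = "+"))
```

A rule does not have to mention every column, so conditions that test only some columns fit in the same list. The labels in `new_column` appear in the order the rules are listed.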