1

Edit: Here's an attempt to better communicate my question and provide a reproducible example. Would someone please (1) explain what's wrong with my approach, (2) offer any sensible solution?

My data look like this:

DF.PSoft <- structure(list(Last = structure(c(4L, 3L, 2L, 1L, 5L), .Label = c("Carruthers", 
"Fester", "Mauger", "Schofield", "Vanhoy"), class = "factor"), 
Salary = structure(c(5L, 3L, 1L, 2L, 4L), .Label = c("121991.0", 
"142403.0", "47305.0", "47740.0", "49172.0"), class = "factor"), 
Dept1 = structure(c(2L, 1L, 2L, 1L, 2L), .Label = c("215086", 
"221230"), class = "factor"), Distrib1 = structure(c(2L, 
1L, 4L, 1L, 3L), .Label = c("100.0", "50.0", "75.0", "90.0"
), class = "factor"), Dept2 = structure(c(2L, 1L, 3L, 1L, 
4L), .Label = c("", "026112", "215086", "221704"), class = "factor"), 
Distrib2 = structure(c(3L, 1L, 4L, 1L, 2L), .Label = c("0.0", 
"15.0", "40.0", "5.0"), class = "factor"), Dept3 = structure(c(3L, 
1L, 3L, 1L, 2L), .Label = c("", "215086", "221704"), class = "factor"), 
Distrib3 = structure(c(2L, 1L, 3L, 1L, 2L), .Label = c("0.0", 
"10.0", "5.0"), class = "factor")), .Names = c("Last", "Salary", 
"Dept1", "Distrib1", "Dept2", "Distrib2", "Dept3", "Distrib3"
), row.names = c(NA, -5L), class = "data.frame")  

>DF.PSoft
          Last   Salary  Dept1 Distrib1  Dept2 Distrib2  Dept3 Distrib3
# 1  Schofield  49172.0 221230     50.0 026112     40.0 221704     10.0
# 2     Mauger  47305.0 215086    100.0             0.0             0.0
# 3     Fester 121991.0 221230     90.0 215086      5.0 221704      5.0
# 4 Carruthers 142403.0 215086    100.0             0.0             0.0
# 5     Vanhoy  47740.0 221230     75.0 221704     15.0 215086     10.0

The data describe people working in various departments. Schofield spends 50% of his time in Dept. 221230, 40% in 026112, and 10% in 221704. The real data set has 10 total Dept and Distrib columns; in this example, I'm working with 3 Dept/Distrib columns.

I want to reshape the data into a new frame that shows Last, Salary, the "Dept" column containing 215086 (if any "Dept" column matches), and the corresponding "Distrib" column:

>DF.Desired
          Last   Salary    Dept  Distrib  
# 1     Mauger  47305.0  215086    100.0
# 2     Fester 121991.0  215086      5.0 
# 3 Carruthers 142403.0  215086    100.0
# 4     Vanhoy  47740.0  215086     10.0

How can I do this? I've been frustrated in trying to figure this out. Here's what I have so far.

  1. Show all indices where data == "215086".

    test <- which(DF.PSoft=="215086", arr.in=TRUE)
    
    >test
           row col
    # [1,]   2   3
    # [2,]   4   3
    # [3,]   3   5
    # [4,]   5   7
    
  2. Create an empty DF that will hold data I pull in next steps.

    DF.blank <- data.frame(Last=character(dim(test)[1]), Salary=character(dim(test)[1]), Dept=character(dim(test)[1]), Distrib=character(dim(test)[1]),  stringsAsFactors=FALSE)
    
    >DF.blank
         Last Salary Dept Distrib
    # 1                         
    # 2                         
    # 3                          
    # 4
    
  3. Fill the empty DF with the data I want. Subset the entire data set (DF.PSoft) using indices from 'test', which searches for all entries == "215086". Get DF.PSoft row numbers as listed in 'test', get DF.PSoft columns 1, 2, whichever columns contain "215086" (this pulls in "215086" from the appropriate "Dept" column of DF.PSoft), and the column to the right of that positive search result (pulls in appropriate Distrib). If I'm thinking about this correctly, this method will work no matter how many "Dept" or "Distrib" columns I have in my file. I'd like to preserve that capability.

    for(i in 1:dim(test)[1]){
      DF.blank[i,] <- DF.PSoft[test[i,1], c(1,2, test[i,2], test[i,2]+1)]
      }
    
  4. Infuriatingly, I get this result:

    >DF.blank
        Last Salary Dept Distrib
    # 1    3      3    1       1
    # 2    1      2    1       1
    # 3    2      1    3       4
    # 4    5      4    2       2
    
  5. Interestingly, printing the DF.PSoft subset appears to work as expected:

    for (i in 1:dim(test)[1]) {
        print(DF.PSoft[test[i, 1], c(1, 2, test[i, 2], test[i, 2] + 1)])
    }
    
    #     Last  Salary  Dept1 Distrib1
    # 2 Mauger 47305.0 215086    100.0
    #         Last   Salary  Dept1 Distrib1
    # 4 Carruthers 142403.0 215086    100.0
    #     Last   Salary  Dept2 Distrib2
    # 3 Fester 121991.0 215086      5.0
    #     Last  Salary  Dept3 Distrib3
    # 5 Vanhoy 47740.0 215086     10.0
    

Many thank for your advice, and sorry again that I started out with a muddled question.

user7661
  • 57
  • 6
  • Please check this [link](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). A good reproducible example will help others to tackle your question lot more easily. – CHP Nov 18 '13 at 03:45
  • Apologies for not having done this in the first place. I will read the link and post something a bit more helpful tomorrow. – user7661 Nov 18 '13 at 04:38

2 Answers2

1

If I understand your question correctly, you can just use reshape in it's most basic form:

reshape(DF.PSoft, idvar=c("Last", "Salary"), 
        varying = 3:ncol(DF.PSoft), sep = "", direction = "long")
#                           Last Salary time   Dept Distrib
# Schofield.49172.1    Schofield  49172    1 221230      50
# Mauger.47305.1          Mauger  47305    1 215086     100
# Fester.121991.1         Fester 121991    1 221230      90
# Carruthers.142403.1 Carruthers 142403    1 215086     100
# Vanhoy.47740.1          Vanhoy  47740    1 221230      75
# Schofield.49172.2    Schofield  49172    2  26112      40
# Mauger.47305.2          Mauger  47305    2     NA       0
# Fester.121991.2         Fester 121991    2 215086       5
# Carruthers.142403.2 Carruthers 142403    2     NA       0
# Vanhoy.47740.2          Vanhoy  47740    2 221704      15
# Schofield.49172.3    Schofield  49172    3 221704      10
# Mauger.47305.3          Mauger  47305    3     NA       0
# Fester.121991.3         Fester 121991    3 221704       5
# Carruthers.142403.3 Carruthers 142403    3     NA       0
# Vanhoy.47740.3          Vanhoy  47740    3 215086      10

You can drop the rownames later if desired.


I had not noticed that you only wanted one department. Since that is the case, try this:

out <- reshape(DF.PSoft, idvar=c("Last", "Salary"),
               varying = 3:ncol(DF.PSoft), sep = "", direction = "long")
rownames(out) <- NULL
out[out$Dept == "215086", ]
#          Last   Salary time   Dept Distrib
# 2      Mauger  47305.0    1 215086   100.0
# 4  Carruthers 142403.0    1 215086   100.0
# 8      Fester 121991.0    2 215086     5.0
# 15     Vanhoy  47740.0    3 215086    10.0
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • I still have no idea why my approach fails, but yours was the starting point I needed to solve my problem. Thanks for your help, I had no idea about reshape before you posted this answer. – user7661 Nov 19 '13 at 01:59
0

Simple rbindlist from data.table package should do I think. With reproducible data in question , I can give reproducible answer :)

res <- rbindlist(list(DF.PSoft[, c('Last', 'Salary', 'Dept1', 'Distrib1')],
    DF.PSoft[, c('Last', 'Salary', 'Dept2', 'Distrib2')],
    DF.PSoft[, c('Last', 'Salary', 'Dept3', 'Distrib3')]    
 ))
CHP
  • 16,981
  • 4
  • 38
  • 57
  • Thanks, but I'm looking for a general solution, i.e., code that would reshape the data no matter how many "Dept" columns contain the string I'm searching for. This is important, as some employees work in 5 or more Depts(and have corresponding "Distrib" data). – user7661 Nov 18 '13 at 04:40
  • @geektrader, You don't *really* expect this to work do you? `rbind` usually checks names, so unless you're changing the names of the manually subsetted data, you're going to get errors. – A5C1D2H2I1M1N2O1R2T1 Nov 18 '13 at 04:47