
I am trying to use the missForest package to impute missing data into a fairly large dataset.

missForest takes data in the form of a data matrix with missing values. The columns correspond to the variables and the rows to the observations. Therefore, I converted my data frame to a matrix, which inadvertently turned all of my categorical variables into numeric type.

Does anyone know how to assign a column of a matrix as a factor?

Thank you so much!!!

  • It would be useful if you provided an example of your data and code, so we can start helping you with a concrete example – storaged Nov 30 '17 at 14:52
  • You can't. The difference between a data frame and matrix is that data frames can have columns of different classes, but everything in a matrix must be one class. What you should do is use `model.matrix` to convert your data frame to a matrix with properly coded factors [as in this FAQ](https://stackoverflow.com/q/4560459/903061). (Also know that whatever model routine you are running will also convert your data frame to a matrix, probably using `model.matrix` based on a formula. It will just do it internally.) For more reading on this, look up "one hot encoding". – Gregor Thomas Nov 30 '17 at 14:55
  • Ok, I went ahead and made dummy variables and ran missForest using the dummy-coded matrix. I still get decimal values instead of 1's and 0's. Some values are even negative! What do I do? – Paula Lopez-Gamundi Nov 30 '17 at 21:54

1 Answer


Ok let me add more details.

I have a data frame that has the following columns.

homt_sub <- homt[c("CASEID", "REASON", "PSYPROB", "SUB2.2", "FREQ1", "FRSTUSE1", "FREQ2", "AGEcont", "GENDER", "RACE2", "ARRESTS")]

The only continuous variable is AGEcont. The rest are factors. I had to make a matrix to use the missForest function.

library(missForest)
# data.matrix() replaces each factor with its integer codes
homt_matrix <- data.matrix(homt_sub, rownames.force = NA)
homt_sub.imp <- missForest(homt_matrix, verbose = TRUE, maxiter = 3, ntree = 20)

I can extract the imputations from here but I get decimal values because they were treated as continuous variables.
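
To illustrate where the decimals come from, here is a minimal sketch on a made-up toy data frame (the object `toy` and its values are hypothetical, not the real data). `data.matrix` stores each factor as its internal integer codes, so everything downstream only sees plain numbers:

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  GENDER  = factor(c("F", "M", NA, "F")),
  AGEcont = c(23, 35, 41, NA)
)

toy_matrix <- data.matrix(toy)

# The whole matrix is numeric: the factor column now holds the
# integer codes of its levels (F = 1, M = 2) and the labels are gone.
is.numeric(toy_matrix)
toy_matrix[, "GENDER"]
```

Once the column is numeric, an imputed value such as 1.4 is a legitimate regression prediction, even though it does not correspond to any level.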

`model.matrix` could be a solution, but it seems burdensome to create so many extra variables, run the imputation, and then collapse the dummy columns back into one column per variable. I know there is a way to run randomForest with factor variables, but it is unclear to me how to do that here.
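
One possible way around this, offered as a sketch rather than a confirmed fix: as far as I can tell, `missForest` also accepts a plain data frame, and in that case factor columns are imputed as categorical variables. The example below assumes the missForest package is installed and uses a made-up toy data frame (`toy` and its values are hypothetical, only the column names are borrowed from above):

```r
library(missForest)

set.seed(1)  # the imputation is random; fix the seed for reproducibility

# Toy data frame (hypothetical values): two factors and one numeric column
toy <- data.frame(
  GENDER  = factor(sample(c("F", "M"), 50, replace = TRUE)),
  RACE2   = factor(sample(c("A", "B", "C"), 50, replace = TRUE)),
  AGEcont = round(runif(50, min = 18, max = 65))
)
toy$GENDER[c(3, 17)]  <- NA
toy$RACE2[c(5, 40)]   <- NA
toy$AGEcont[c(8, 29)] <- NA

# Pass the data frame itself instead of data.matrix(toy)
toy_imp <- missForest(toy, maxiter = 3, ntree = 20)

# The imputed data keep their column classes
str(toy_imp$ximp)
toy_imp$OOBerror  # NRMSE for numeric columns, PFC for factor columns
```

With the real data, an identifier column such as CASEID would probably have to be left out before imputing, since it carries no predictive information, and randomForest cannot handle factors with more than 53 levels.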

Thank you so much