Recoding data

Problem

You want to recode data or calculate new data columns from existing ones.

Solution

The examples below will use this data:

    # stringsAsFactors=TRUE reads sex as a factor (this was the default before R 4.0)
    data <- read.table(header=TRUE, stringsAsFactors=TRUE, text='
     subject sex control cond1 cond2
           1   M     7.9  12.3  10.7
           2   F     6.3  10.6  11.1
           3   F     9.5  13.1  13.8
           4   M    11.5  13.4  12.9
    ')

Recoding a categorical variable

The easiest way is to use revalue() or mapvalues() from the plyr package. The code below recodes M as 1 and F as 2, and puts the result in a new column. Note that these functions preserve the type: if the input is a factor, the output will be a factor; if the input is a character vector, the output will be a character vector.

    library(plyr)
    # The following two lines are equivalent:
    data$scode <- revalue(data$sex, c("M"="1", "F"="2"))
    data$scode <- mapvalues(data$sex, from = c("M", "F"), to = c("1", "2"))
    data
    #>   subject sex control cond1 cond2 scode
    #> 1       1   M     7.9  12.3  10.7     1
    #> 2       2   F     6.3  10.6  11.1     2
    #> 3       3   F     9.5  13.1  13.8     2
    #> 4       4   M    11.5  13.4  12.9     1
    # If data$sex is a factor, data$scode is also a factor;
    # if data$sex is a character vector, data$scode is a character vector

See Mapping vector values and Renaming levels of a factor for more information about these functions.

If you don’t want to rely on plyr, you can do the following with R’s built-in functions:

    data$scode[data$sex=="M"] <- "1"
    data$scode[data$sex=="F"] <- "2"
    # Convert the column to a factor
    data$scode <- factor(data$scode)
    data
    #>   subject sex control cond1 cond2 scode
    #> 1       1   M     7.9  12.3  10.7     1
    #> 2       2   F     6.3  10.6  11.1     2
    #> 3       3   F     9.5  13.1  13.8     2
    #> 4       4   M    11.5  13.4  12.9     1

Another way is to use the match() function:

    oldvalues <- c("M", "F")
    newvalues <- factor(c("g1","g2"))  # Make this a factor
    # match() gives the position of each sex value in oldvalues;
    # that position then indexes into newvalues
    data$scode <- newvalues[ match(data$sex, oldvalues) ]
    data
    #>   subject sex control cond1 cond2 scode
    #> 1       1   M     7.9  12.3  10.7    g1
    #> 2       2   F     6.3  10.6  11.1    g2
    #> 3       3   F     9.5  13.1  13.8    g2
    #> 4       4   M    11.5  13.4  12.9    g1
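
A closely related base-R idiom, not shown in the original recipe, is to index a named vector by the old values. The sketch below rebuilds the relevant columns of the example data so that it runs on its own; the name lookup is made up for illustration.

```r
# Relevant columns of the example data, rebuilt so this block is self-contained
data <- data.frame(subject = 1:4,
                   sex = c("M", "F", "F", "M"),
                   stringsAsFactors = FALSE)

# Hypothetical lookup table: names are the old values, elements are the new codes
lookup <- c(M = "g1", F = "g2")

# Index by the old values; as.character() makes this work whether sex is a
# factor or a character vector, and unname() drops the carried-over names
data$scode <- factor(unname(lookup[as.character(data$sex)]))
data$scode
```

Any value of sex that has no entry in lookup becomes NA, which is the same thing that happens with match() for unmatched values.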

Recoding a continuous variable into a categorical variable

Mark those whose control measurement is <7 as “low”, and those with >=7 as “high”:

    data$category[data$control< 7] <- "low"
    data$category[data$control>=7] <- "high"
    # Convert the column to a factor
    data$category <- factor(data$category)
    data
    #>   subject sex control cond1 cond2 scode category
    #> 1       1   M     7.9  12.3  10.7    g1     high
    #> 2       2   F     6.3  10.6  11.1    g2      low
    #> 3       3   F     9.5  13.1  13.8    g2     high
    #> 4       4   M    11.5  13.4  12.9    g1     high
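
For a two-way split like this one, base R's ifelse() can do the same recoding in a single expression. This is an alternative sketch rather than part of the original recipe; it rebuilds the control column so that it runs on its own.

```r
# Relevant columns of the example data, rebuilt so this block is self-contained
data <- data.frame(subject = 1:4,
                   control = c(7.9, 6.3, 9.5, 11.5))

# "low" where control < 7, "high" otherwise; then convert to a factor
data$category <- factor(ifelse(data$control < 7, "low", "high"))
data$category
```

With more than two categories, nested ifelse() calls quickly become hard to read, and cut() is usually the better fit.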

With the cut function, you specify boundaries and the resulting values:

    data$category <- cut(data$control,
                         breaks=c(-Inf, 7, 9, Inf),
                         labels=c("low","medium","high"))
    data
    #>   subject sex control cond1 cond2 scode category
    #> 1       1   M     7.9  12.3  10.7    g1   medium
    #> 2       2   F     6.3  10.6  11.1    g2      low
    #> 3       3   F     9.5  13.1  13.8    g2     high
    #> 4       4   M    11.5  13.4  12.9    g1     high

By default, the ranges are open on the left, and closed on the right, as in (7,9]. To set it so that ranges are closed on the left and open on the right, like [7,9), use right=FALSE.
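
The difference only shows up for values that fall exactly on a break. The sketch below takes the four control values from the example data and appends 7 and 9, which are not in the original data and are added purely to illustrate the boundary behavior.

```r
# Control values from the example data, plus 7 and 9 (added for illustration)
control <- c(7.9, 6.3, 9.5, 11.5, 7, 9)

breaks <- c(-Inf, 7, 9, Inf)
labels <- c("low", "medium", "high")

# Default (right=TRUE): intervals are (-Inf,7], (7,9], (9,Inf]
right_closed <- cut(control, breaks = breaks, labels = labels)

# right=FALSE: intervals are [-Inf,7), [7,9), [9,Inf)
left_closed <- cut(control, breaks = breaks, labels = labels, right = FALSE)

data.frame(control, right_closed, left_closed)
```

Here 7 is "low" and 9 is "medium" with the default, while with right=FALSE they become "medium" and "high" respectively. The other four values land in the same category either way.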

Calculating a new continuous variable

Suppose you want to add a new column with the sum of the three measurements.

    data$total <- data$control + data$cond1 + data$cond2
    data
    #>   subject sex control cond1 cond2 scode category total
    #> 1       1   M     7.9  12.3  10.7    g1   medium  30.9
    #> 2       2   F     6.3  10.6  11.1    g2      low  28.0
    #> 3       3   F     9.5  13.1  13.8    g2     high  36.4
    #> 4       4   M    11.5  13.4  12.9    g1     high  37.8
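
When there are many measurement columns, writing out each term gets tedious. One alternative, sketched here with base R's rowSums() and rowMeans(), is to select the columns by name; the avg column is an extra added for illustration and is not part of the original recipe.

```r
# Measurement columns of the example data, rebuilt so this block is self-contained
data <- data.frame(subject = 1:4,
                   control = c(7.9, 6.3, 9.5, 11.5),
                   cond1   = c(12.3, 10.6, 13.1, 13.4),
                   cond2   = c(10.7, 11.1, 13.8, 12.9))

# Columns to combine
measures <- c("control", "cond1", "cond2")

# Row-wise sum and mean across the selected columns
data$total <- rowSums(data[, measures])
data$avg   <- rowMeans(data[, measures])
data
```

Unlike the + operator, rowSums() and rowMeans() accept na.rm=TRUE, which can be useful if some measurements are missing.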