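9.2 Plotting means and error bars

9.2.1 Problem

You want to plot means and error bars for a dataset.

9.2.2 Solution

To make graphs with ggplot2, the data must be in a data frame, and in long (as opposed to wide) format. If your data needs to be restructured, see the page on converting between long and wide formats.

9.2.2.1 Helper functions

If you only have between-subjects variables, summarySE() is the only function you need in your code. If your data has within-subjects variables and you want to adjust the error bars so that inter-subject variability is removed, as in Loftus and Masson (1994), then the other two functions, normDataWithin() and summarySEwithin(), must also be added to your code; summarySEwithin() is then the function you call.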

    ## Gives count, mean, standard deviation, standard error of the mean,
    ## and confidence interval (default 95%).
    ##   data: a data frame.
    ##   measurevar: the name of a column that contains the variable to be summarized
    ##   groupvars: a vector containing names of columns that contain grouping variables
    ##   na.rm: a boolean that indicates whether to ignore NA's
    ##   conf.interval: the percent range of the confidence interval (default is 95%)
    summarySE <- function(data = NULL, measurevar, groupvars = NULL,
                          na.rm = FALSE, conf.interval = 0.95, .drop = TRUE) {
        library(plyr)

        # New version of length which can handle NA's: if na.rm==T, don't count them
        length2 <- function(x, na.rm = FALSE) {
            if (na.rm) sum(!is.na(x)) else length(x)
        }

        # This does the summary. For each group's data frame, return a vector
        # with N, mean, and sd
        datac <- ddply(data, groupvars, .drop = .drop,
            .fun = function(xx, col) {
                c(N    = length2(xx[[col]], na.rm = na.rm),
                  mean = mean(xx[[col]], na.rm = na.rm),
                  sd   = sd(xx[[col]], na.rm = na.rm))
            },
            measurevar
        )

        # Rename the 'mean' column
        datac <- rename(datac, c(mean = measurevar))

        datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean

        # Confidence interval multiplier for standard error
        # Calculate t-statistic for confidence interval:
        # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
        ciMult <- qt(conf.interval / 2 + 0.5, datac$N - 1)
        datac$ci <- datac$se * ciMult

        return(datac)
    }

    ## Norms the data within specified groups in a data frame; it normalizes each
    ## subject (identified by idvar) so that they have the same mean, within each
    ## group specified by betweenvars.
    ##   data: a data frame.
    ##   idvar: the name of a column that identifies each subject (or matched subjects)
    ##   measurevar: the name of a column that contains the variable to be summarized
    ##   betweenvars: a vector containing names of columns that are between-subjects variables
    ##   na.rm: a boolean that indicates whether to ignore NA's
    normDataWithin <- function(data = NULL, idvar, measurevar, betweenvars = NULL,
                               na.rm = FALSE, .drop = TRUE) {
        library(plyr)

        # Measure var on left, idvar + between vars on right of formula.
        data.subjMean <- ddply(data, c(idvar, betweenvars), .drop = .drop,
            .fun = function(xx, col, na.rm) {
                c(subjMean = mean(xx[, col], na.rm = na.rm))
            },
            measurevar,
            na.rm
        )

        # Put the subject means with original data
        data <- merge(data, data.subjMean)

        # Get the normalized data in a new column
        measureNormedVar <- paste(measurevar, "_norm", sep = "")
        data[, measureNormedVar] <- data[, measurevar] - data[, "subjMean"] +
            mean(data[, measurevar], na.rm = na.rm)

        # Remove this subject mean column
        data$subjMean <- NULL

        return(data)
    }

    ## Summarizes data, handling within-subjects variables by removing inter-subject
    ## variability. It will still work if there are no within-S variables.
    ## Gives count, un-normed mean, normed mean (with same between-group mean),
    ## standard deviation, standard error of the mean, and confidence interval.
    ## If there are within-subject variables, calculate adjusted values using the
    ## method from Morey (2008).
    ##   data: a data frame.
    ##   measurevar: the name of a column that contains the variable to be summarized
    ##   betweenvars: a vector containing names of columns that are between-subjects variables
    ##   withinvars: a vector containing names of columns that are within-subjects variables
    ##   idvar: the name of a column that identifies each subject (or matched subjects)
    ##   na.rm: a boolean that indicates whether to ignore NA's
    ##   conf.interval: the percent range of the confidence interval (default is 95%)
    summarySEwithin <- function(data = NULL, measurevar, betweenvars = NULL,
                                withinvars = NULL, idvar = NULL, na.rm = FALSE,
                                conf.interval = 0.95, .drop = TRUE) {

        # Ensure that the betweenvars and withinvars are factors
        factorvars <- vapply(data[, c(betweenvars, withinvars), drop = FALSE],
                             FUN = is.factor, FUN.VALUE = logical(1))

        if (!all(factorvars)) {
            nonfactorvars <- names(factorvars)[!factorvars]
            message("Automatically converting the following non-factors to factors: ",
                    paste(nonfactorvars, collapse = ", "))
            data[nonfactorvars] <- lapply(data[nonfactorvars], factor)
        }

        # Get the means from the un-normed data
        datac <- summarySE(data, measurevar, groupvars = c(betweenvars, withinvars),
                           na.rm = na.rm, conf.interval = conf.interval, .drop = .drop)

        # Drop all the unused columns (these will be calculated with normed data)
        datac$sd <- NULL
        datac$se <- NULL
        datac$ci <- NULL

        # Norm each subject's data
        ndata <- normDataWithin(data, idvar, measurevar, betweenvars, na.rm, .drop = .drop)

        # This is the name of the new column
        measurevar_n <- paste(measurevar, "_norm", sep = "")

        # Collapse the normed data - now we can treat between and within vars the same
        ndatac <- summarySE(ndata, measurevar_n, groupvars = c(betweenvars, withinvars),
                            na.rm = na.rm, conf.interval = conf.interval, .drop = .drop)

        # Apply correction from Morey (2008) to the standard error and confidence interval
        # Get the product of the number of conditions of within-S variables
        nWithinGroups <- prod(vapply(ndatac[, withinvars, drop = FALSE],
                                     FUN = nlevels, FUN.VALUE = numeric(1)))
        correctionFactor <- sqrt(nWithinGroups / (nWithinGroups - 1))

        # Apply the correction factor
        ndatac$sd <- ndatac$sd * correctionFactor
        ndatac$se <- ndatac$se * correctionFactor
        ndatac$ci <- ndatac$ci * correctionFactor

        # Combine the un-normed means with the normed results
        merge(datac, ndatac)
    }
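
To make the arithmetic inside summarySE() concrete, the following sketch repeats it in base R for a single group of the built-in ToothGrowth data (supp "OJ", dose 0.5). The variable names x, n, se, ci_mult and ci are illustrative only; they are not part of the helper functions above.

```r
# One group of the built-in ToothGrowth data: supp == "OJ", dose == 0.5
x <- with(ToothGrowth, len[supp == "OJ" & dose == 0.5])

n  <- length(x)              # group size (N)
se <- sd(x) / sqrt(n)        # standard error of the mean

# Two-sided 95% interval: use the 0.975 quantile of the t distribution, df = N - 1
conf_interval <- 0.95
ci_mult <- qt(conf_interval / 2 + 0.5, df = n - 1)
ci <- se * ci_mult           # half-width of the confidence interval

round(c(N = n, mean = mean(x), sd = sd(x), se = se, ci = ci), 4)
```

With only 10 observations the t multiplier is about 2.26 rather than the 1.96 of the normal approximation, which is why the ci value is more than twice the se value for small groups.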

9.2.2.2 Sample data

The examples below use the ToothGrowth dataset. Note that dose is a numeric column here; in some situations it is more useful to convert it to a factor.

    tg <- ToothGrowth
    head(tg)
    #>    len supp dose
    #> 1  4.2   VC  0.5
    #> 2 11.5   VC  0.5
    #> 3  7.3   VC  0.5
    #> 4  5.8   VC  0.5
    #> 5  6.4   VC  0.5
    #> 6 10.0   VC  0.5

    library(ggplot2)

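First, the data must be summarized. This can be done in a number of ways, as described on the summarizing data page. In this case, we will use the summarySE() function defined above (make sure the code for summarySE() has been entered before you call it).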

    # install.packages('Rmisc')
    library(Rmisc)
    #> Loading required package: lattice
    #> Loading required package: plyr
    #>
    #> Attaching package: 'Rmisc'
    #> The following objects are masked _by_ '.GlobalEnv':
    #>
    #>     normDataWithin, summarySE, summarySEwithin

    # summarySE provides the standard deviation, standard error of the mean,
    # and a (default 95%) confidence interval
    tgc <- summarySE(tg, measurevar = "len", groupvars = c("supp", "dose"))
    tgc
    #>   supp dose  N   len    sd     se    ci
    #> 1   OJ  0.5 10 13.23 4.460 1.4103 3.190
    #> 2   OJ  1.0 10 22.70 3.911 1.2368 2.798
    #> 3   OJ  2.0 10 26.06 2.655 0.8396 1.899
    #> 4   VC  0.5 10  7.98 2.747 0.8686 1.965
    #> 5   VC  1.0 10 16.77 2.515 0.7954 1.799
    #> 6   VC  2.0 10 26.14 4.798 1.5172 3.432
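
Since the summary can be produced in more than one way, here is a minimal base R sketch that builds a comparable table with aggregate() and no extra packages. It is an alternative for illustration, not the summarySE() implementation; the flattened column names (len.N, len.mean, len.sd) and the object name tgc_base are artifacts of this sketch.

```r
# Per-group N, mean and sd; aggregate() returns them as one matrix column
agg <- aggregate(len ~ supp + dose, data = ToothGrowth,
                 FUN = function(x) c(N = length(x), mean = mean(x), sd = sd(x)))

# Flatten the matrix column into ordinary columns: len.N, len.mean, len.sd
tgc_base <- do.call(data.frame, agg)

# Standard error and 95% confidence half-width, as in summarySE()
tgc_base$se <- tgc_base$len.sd / sqrt(tgc_base$len.N)
tgc_base$ci <- tgc_base$se * qt(0.975, df = tgc_base$len.N - 1)

tgc_base
```

The rows may come out in a different order than in tgc, but the per-group values should agree with the table above.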

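9.2.2.3 Line graphs

After the data is summarized, we can make the graph. These are basic line and point graphs with error bars representing either the standard deviation, the standard error of the mean, or the 95% confidence interval.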

    # Standard error of the mean
    ggplot(tgc, aes(x = dose, y = len, colour = supp)) +
        geom_errorbar(aes(ymin = len - se, ymax = len + se), width = 0.1) +
        geom_line() +
        geom_point()

[Figure 1]

    # The error bars overlap (at dose=2.0), so use position_dodge
    # to move them horizontally
    pd <- position_dodge(0.1)  # move them .05 to the left and right

    ggplot(tgc, aes(x = dose, y = len, colour = supp)) +
        geom_errorbar(aes(ymin = len - se, ymax = len + se), width = 0.1, position = pd) +
        geom_line(position = pd) +
        geom_point(position = pd)

[Figure 2]

    # Use the 95% confidence interval instead of the SEM
    ggplot(tgc, aes(x = dose, y = len, colour = supp)) +
        geom_errorbar(aes(ymin = len - ci, ymax = len + ci), width = 0.1, position = pd) +
        geom_line(position = pd) +
        geom_point(position = pd)

[Figure 3]

    # Black error bars - notice the mapping of 'group=supp' --
    # without it, the error bars will not be dodged (they will overlap).
    ggplot(tgc, aes(x = dose, y = len, colour = supp, group = supp)) +
        geom_errorbar(aes(ymin = len - ci, ymax = len + ci),
                      colour = "black", width = 0.1, position = pd) +
        geom_line(position = pd) +
        geom_point(position = pd, size = 3)

[Figure 4]

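A finished graph with error bars representing the standard error of the mean might look like this. The points are drawn last so that the white fill goes on top of the lines and error bars (ggplot2 draws layers in the order they are added, so a different order gives a different result).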

    ggplot(tgc, aes(x=dose, y=len, colour=supp, group=supp)) +
        geom_errorbar(aes(ymin=len-se, ymax=len+se), colour="black", width=.1, position=pd) +
        geom_line(position=pd) +
        geom_point(position=pd, size=3, shape=21, fill="white") + # 21 is a filled circle
        xlab("Dose (mg)") +
        ylab("Tooth length") +
        scale_colour_hue(name="Supplement type",    # Legend label, use darker colors
                         breaks=c("OJ", "VC"),
                         labels=c("Orange juice", "Ascorbic acid"),
                         l=40) +                    # Use darker colors, lightness=40
        ggtitle("The Effect of Vitamin C on\nTooth Growth in Guinea Pigs") +
        expand_limits(y=0) +                        # Expand y range
        scale_y_continuous(breaks=0:20*4) +         # Set tick every 4
        theme_bw() +
        # Position legend in bottom right. In ggplot2 >= 3.5.0, a numeric
        # legend.position is deprecated; use legend.position="inside" together
        # with legend.position.inside=c(1,0) instead.
        theme(legend.justification=c(1,0),
              legend.position=c(1,0))

[Figure 5]

9.2.2.4 Bar graphs

The procedure is similar for bar graphs. Note that tgc$dose must be a factor. If it is a numeric vector, this will not work.

    # Convert dose to a factor
    tgc2 <- tgc
    tgc2$dose <- factor(tgc2$dose)

    # Error bars represent the standard error of the mean
    ggplot(tgc2, aes(x=dose, y=len, fill=supp)) +
        geom_bar(position=position_dodge(), stat="identity") +
        geom_errorbar(aes(ymin=len-se, ymax=len+se),
                      width=.2,                    # Width of the error bars
                      position=position_dodge(.9))

[Figure 6]

    # Use the 95% confidence interval instead of the SEM
    ggplot(tgc2, aes(x=dose, y=len, fill=supp)) +
        geom_bar(position=position_dodge(), stat="identity") +
        geom_errorbar(aes(ymin=len-ci, ymax=len+ci),
                      width=.2,                    # Width of the error bars
                      position=position_dodge(.9))

[Figure 7]

A finished graph might look like this:

    # Note: in ggplot2 >= 3.4.0, line widths are set with linewidth instead of size.
    ggplot(tgc2, aes(x=dose, y=len, fill=supp)) +
        geom_bar(position=position_dodge(), stat="identity",
                 colour="black",      # Use black outlines,
                 size=.3) +           # Thinner lines
        geom_errorbar(aes(ymin=len-se, ymax=len+se),
                      size=.3,        # Thinner lines
                      width=.2,
                      position=position_dodge(.9)) +
        xlab("Dose (mg)") +
        ylab("Tooth length") +
        scale_fill_hue(name="Supplement type",   # Legend label
                       breaks=c("OJ", "VC"),
                       labels=c("Orange juice", "Ascorbic acid")) +
        ggtitle("The Effect of Vitamin C on\nTooth Growth in Guinea Pigs") +
        scale_y_continuous(breaks=0:20*4) +
        theme_bw()

[Figure 8]

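9.2.2.5 Error bars for within-subjects variables

When all variables are between-subjects, it is straightforward to plot the standard error or confidence intervals. However, when there are within-subjects variables (repeated measures), plotting the standard error or regular confidence intervals can be misleading for making inferences about differences between conditions.

The method below is from Morey (2008), which is a correction to Cousineau (2005), which in turn is meant to be a simpler method than that of Loftus and Masson (1994). See these papers for a more detailed treatment of the issues involved in error bars for within-subjects variables.

Here is a dataset (from Morey 2008) with one within-subjects variable: pre/post-test.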

    dfw <- read.table(header = TRUE, text = "
     subject pretest posttest
           1    59.4     64.5
           2    46.4     52.4
           3    46.0     49.7
           4    49.0     48.7
           5    32.5     37.4
           6    45.2     49.5
           7    60.3     59.9
           8    54.3     54.1
           9    45.4     49.6
          10    38.9     48.5
     ")

    # Treat subject ID as a factor
    dfw$subject <- factor(dfw$subject)

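The first step is to convert the dataset to long format. See the page on converting between long and wide formats for more information.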

    # Convert to long format
    library(reshape2)
    dfw_long <- melt(dfw, id.vars = "subject",
                     measure.vars = c("pretest", "posttest"),
                     variable.name = "condition")
    dfw_long
    #>    subject condition value
    #> 1        1   pretest  59.4
    #> 2        2   pretest  46.4
    #> 3        3   pretest  46.0
    #> 4        4   pretest  49.0
    #> 5        5   pretest  32.5
    #> 6        6   pretest  45.2
    #> 7        7   pretest  60.3
    #> 8        8   pretest  54.3
    #> 9        9   pretest  45.4
    #> 10      10   pretest  38.9
    #> 11       1  posttest  64.5
    #> 12       2  posttest  52.4
    #> 13       3  posttest  49.7
    #> 14       4  posttest  48.7
    #> 15       5  posttest  37.4
    #> 16       6  posttest  49.5
    #> 17       7  posttest  59.9
    #> 18       8  posttest  54.1
    #> 19       9  posttest  49.6
    #> 20      10  posttest  48.5
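
If reshape2 is not available, the same wide-to-long step can be sketched in base R. This is only an illustration on the first three subjects of the data above; dfw_small, measure_cols and dfw_small_long are names made up for this sketch.

```r
# First three subjects of the pre/post-test data, in wide format
dfw_small <- data.frame(
  subject  = factor(1:3),
  pretest  = c(59.4, 46.4, 46.0),
  posttest = c(64.5, 52.4, 49.7)
)

measure_cols <- c("pretest", "posttest")

# Stack the measure columns: one row per subject and condition
dfw_small_long <- data.frame(
  subject   = rep(dfw_small$subject, times = length(measure_cols)),
  condition = factor(rep(measure_cols, each = nrow(dfw_small)),
                     levels = measure_cols),
  value     = unlist(dfw_small[measure_cols], use.names = FALSE)
)

dfw_small_long
```

Fixing the factor levels to the original column order keeps pretest before posttest on a plot axis instead of falling back to alphabetical order.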

Collapse the data using summarySEwithin().

    dfwc <- summarySEwithin(dfw_long, measurevar = "value", withinvars = "condition",
                            idvar = "subject", na.rm = FALSE, conf.interval = 0.95)
    dfwc
    #>   condition  N value value_norm    sd     se    ci
    #> 1  posttest 10 51.43      51.43 2.262 0.7154 1.618
    #> 2   pretest 10 47.74      47.74 2.262 0.7154 1.618

    library(ggplot2)
    # Make the graph with the 95% confidence interval
    ggplot(dfwc, aes(x = condition, y = value, group = 1)) +
        geom_line() +
        geom_errorbar(width = 0.1, aes(ymin = value - ci, ymax = value + ci)) +
        geom_point(shape = 21, size = 3, fill = "white") +
        ylim(40, 60)

[Figure 9]

The value and value_norm columns contain the un-normed and normed means, respectively.

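9.2.2.6 Understanding within-subjects error bars

This section explains how the within-subjects error bar values are calculated. The steps here are for explanation only; they are not necessary for making the error bars.

The graph of individual data below shows a consistent trend for the within-subjects variable condition, but the regular standard errors (or confidence intervals) for each group would not necessarily reveal it. The method in Morey (2008) and Cousineau (2005) essentially normalizes the data to remove the between-subject variability, and then calculates the variance from the normalized data.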

    # Use a consistent y range
    ymax <- max(dfw_long$value)
    ymin <- min(dfw_long$value)

    # Plot the individuals
    ggplot(dfw_long, aes(x = condition, y = value, colour = subject, group = subject)) +
        geom_line() +
        geom_point(shape = 21, fill = "white") +
        ylim(ymin, ymax)

[Figure 10]

    # Create the normed version of the data
    dfwNorm.long <- normDataWithin(data = dfw_long, idvar = "subject",
                                   measurevar = "value")

    # Plot the normed individuals
    ggplot(dfwNorm.long, aes(x = condition, y = value_norm, colour = subject,
                             group = subject)) +
        geom_line() +
        geom_point(shape = 21, fill = "white") +
        ylim(ymin, ymax)

[Figure 11]

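The difference between the regular (between-subjects) error bars and the within-subjects error bars is shown below. The regular error bars are drawn in red, and the within-subjects error bars are drawn in black.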

    # Instead of summarySEwithin, use summarySE, which treats condition as though
    # it were a between-subjects variable
    dfwc_between <- summarySE(data = dfw_long, measurevar = "value",
                              groupvars = "condition", na.rm = FALSE,
                              conf.interval = 0.95)
    dfwc_between
    #>   condition  N value    sd    se    ci
    #> 1   pretest 10 47.74 8.599 2.719 6.151
    #> 2  posttest 10 51.43 7.254 2.294 5.189

    # Show the between-S CI's in red, and the within-S CI's in black
    ggplot(dfwc_between, aes(x = condition, y = value, group = 1)) +
        geom_line() +
        geom_errorbar(width = 0.1, aes(ymin = value - ci, ymax = value + ci),
                      colour = "red") +
        geom_errorbar(width = 0.1, aes(ymin = value - ci, ymax = value + ci),
                      data = dfwc) +
        geom_point(shape = 21, size = 3, fill = "white") +
        ylim(ymin, ymax)

[Figure 12]
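
The within-subjects values can also be reproduced by hand, which may make the normalization step easier to follow. The sketch below uses the same pre/post-test numbers and plain base R; the variable names are illustrative, and the correction factor sqrt(M / (M - 1)) with M = 2 conditions mirrors what summarySEwithin() does for this design.

```r
pre  <- c(59.4, 46.4, 46.0, 49.0, 32.5, 45.2, 60.3, 54.3, 45.4, 38.9)
post <- c(64.5, 52.4, 49.7, 48.7, 37.4, 49.5, 59.9, 54.1, 49.6, 48.5)

# Normalize: remove each subject's own mean, then add back the grand mean
subj_mean  <- (pre + post) / 2
grand_mean <- mean(c(pre, post))
pre_norm   <- pre  - subj_mean + grand_mean
post_norm  <- post - subj_mean + grand_mean

# Morey (2008) correction for M within-subjects conditions
n_conditions <- 2
correction   <- sqrt(n_conditions / (n_conditions - 1))

n         <- length(pre)
sd_within <- sd(pre_norm) * correction    # same value for post_norm in this design
se_within <- sd_within / sqrt(n)
ci_within <- se_within * qt(0.975, df = n - 1)

round(c(sd = sd_within, se = se_within, ci = ci_within), 4)
```

Because normalization removes each subject's overall level, only the variability of the pre/post difference remains, so the two conditions end up with identical normed standard deviations.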

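9.2.2.7 Two within-subjects variables

If there is more than one within-subjects variable, the same function, summarySEwithin(), can be used. The dataset below is taken from Hays (1994), and was used for making this type of within-subjects error bar in Rouder and Morey (2005).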

    data <- read.table(header = TRUE, text = "
     Subject RoundMono SquareMono RoundColor SquareColor
           1        41         40         41          37
           2        57         56         56          53
           3        52         53         53          50
           4        49         47         47          47
           5        47         48         48          47
           6        37         34         35          36
           7        47         50         47          46
           8        41         40         38          40
           9        48         47         49          45
          10        37         35         36          35
          11        32         31         31          33
          12        47         42         42          42
     ")

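The data must first be converted to long format. The column names encode two variables: shape (round/square) and color scheme (monochromatic/colored).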

    # Convert to long format
    library(reshape2)
    data_long <- melt(data = data, id.vars = "Subject",
                      measure.vars = c("RoundMono", "SquareMono",
                                       "RoundColor", "SquareColor"),
                      variable.name = "Condition")
    names(data_long)[names(data_long) == "value"] <- "Time"

    # Split Condition column into Shape and ColorScheme
    data_long$Shape <- NA
    data_long$Shape[grepl("^Round", data_long$Condition)] <- "Round"
    data_long$Shape[grepl("^Square", data_long$Condition)] <- "Square"
    data_long$Shape <- factor(data_long$Shape)

    data_long$ColorScheme <- NA
    data_long$ColorScheme[grepl("Mono$", data_long$Condition)] <- "Monochromatic"
    data_long$ColorScheme[grepl("Color$", data_long$Condition)] <- "Colored"
    data_long$ColorScheme <- factor(data_long$ColorScheme,
                                    levels = c("Monochromatic", "Colored"))

    # Remove the Condition column now
    data_long$Condition <- NULL

    # Look at the first few rows
    head(data_long)
    #>   Subject Time Shape   ColorScheme
    #> 1       1   41 Round Monochromatic
    #> 2       2   57 Round Monochromatic
    #> 3       3   52 Round Monochromatic
    #> 4       4   49 Round Monochromatic
    #> 5       5   47 Round Monochromatic
    #> 6       6   37 Round Monochromatic
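
Splitting a combined condition name into its two parts can also be sketched with sub() instead of repeated grepl() assignments. This is an alternative for illustration; the regular expressions assume that every name is a shape prefix (Round or Square) followed by a scheme suffix (Mono or Color), as in the column names above, and the lookup vector scheme_labels is a name made up for this sketch.

```r
condition <- c("RoundMono", "SquareMono", "RoundColor", "SquareColor")

# Drop the scheme suffix to get the shape, and the shape prefix to get the scheme
shape  <- sub("(Mono|Color)$", "", condition)
scheme <- sub("^(Round|Square)", "", condition)

# Map the short suffixes to the longer labels used in the text
scheme_labels <- c(Mono = "Monochromatic", Color = "Colored")
color_scheme  <- factor(unname(scheme_labels[scheme]),
                        levels = c("Monochromatic", "Colored"))

data.frame(condition, shape = factor(shape), color_scheme)
```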

Now the data can be summarized and graphed.

    datac <- summarySEwithin(data_long, measurevar = "Time",
                             withinvars = c("Shape", "ColorScheme"),
                             idvar = "Subject")
    datac
    #>    Shape   ColorScheme  N  Time Time_norm    sd     se     ci
    #> 1  Round       Colored 12 43.58     43.58 1.212 0.3500 0.7703
    #> 2  Round Monochromatic 12 44.58     44.58 1.331 0.3844 0.8460
    #> 3 Square       Colored 12 42.58     42.58 1.462 0.4219 0.9287
    #> 4 Square Monochromatic 12 43.58     43.58 1.261 0.3641 0.8014

    library(ggplot2)
    ggplot(datac, aes(x = Shape, y = Time, fill = ColorScheme)) +
        geom_bar(position = position_dodge(0.9), colour = "black", stat = "identity") +
        geom_errorbar(position = position_dodge(0.9), width = 0.25,
                      aes(ymin = Time - ci, ymax = Time + ci)) +
        coord_cartesian(ylim = c(40, 46)) +
        scale_fill_manual(values = c("#CCCCCC", "#FFFFFF")) +
        scale_y_continuous(breaks = 1:100) +
        theme_bw() +
        geom_hline(yintercept = 38)

[Figure 13]

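9.2.2.8 Note about normed means

The summarySEwithin() function returns both normed and un-normed means. The un-normed means are simply the mean of each group. The normed means are calculated so that the mean of each between-subjects group is the same. The two can diverge when there are between-subjects variables.

For example: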

    dat <- read.table(header = TRUE, text = "
    id trial gender dv
     A     0   male  2
     A     1   male  4
     B     0   male  6
     B     1   male  8
     C     0 female 22
     C     1 female 24
     D     0 female 26
     D     1 female 28
    ")

    # The normed and un-normed means are different
    summarySEwithin(dat, measurevar = "dv", withinvars = "trial",
                    betweenvars = "gender", idvar = "id")
    #> Automatically converting the following non-factors to factors: trial
    #>   gender trial N dv dv_norm sd se ci
    #> 1 female     0 2 24      14  0  0  0
    #> 2 female     1 2 26      16  0  0  0
    #> 3   male     0 2  4      14  0  0  0
    #> 4   male     1 2  6      16  0  0  0
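
To see where the dv_norm values of 14 and 16 come from, the normalization can be repeated by hand in base R. This sketch rebuilds the same eight rows and uses ave() for the per-subject means; it illustrates the calculation and is not a replacement for normDataWithin().

```r
dat <- data.frame(
  id     = rep(c("A", "B", "C", "D"), each = 2),
  trial  = rep(c(0, 1), times = 4),
  gender = rep(c("male", "female"), each = 4),
  dv     = c(2, 4, 6, 8, 22, 24, 26, 28)
)

# Each subject's own mean, repeated on that subject's rows
subj_mean <- ave(dat$dv, dat$id)

# Subtract the subject mean and add back the grand mean (15 for this data)
dat$dv_norm <- dat$dv - subj_mean + mean(dat$dv)

# Group means of the raw and the normed values
aggregate(cbind(dv, dv_norm) ~ gender + trial, data = dat, FUN = mean)
```

Every subject differs from their own mean by -1 on trial 0 and +1 on trial 1, so after the grand mean is added back both genders share the normed means 14 and 16, while the raw group means stay far apart.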

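9.2.3 Other notes

The approach above is not the only way to solve this problem. To understand how ggplot2 calculates and adds error bars, I posted a question on Stack Overflow asking whether ggplot2 uses the SE or the SD for error bars by default, and quickly received simple answers. The respondents agreed that the summary statistics must be computed first. I had previously seen other blogs draw error bars directly with stat_boxplot(geom="errorbar", width=.3), which can be problematic because it is hard to tell whether the bars show the SD or the SE (stat_boxplot() computes box-plot statistics, so those bars are whisker ends rather than either of the two).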
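
One way to keep the plotted statistic unambiguous without a separate summary table is to name the summary function explicitly in stat_summary(). The sketch below assumes ggplot2 3.3.0 or later (for the fun argument) and uses ggplot2's mean_se(), which returns the mean plus and minus one standard error by default; the object names are illustrative.

```r
library(ggplot2)

# mean_se() on one group: a data frame with y (mean), ymin and ymax (mean -/+ 1 SE)
x <- with(ToothGrowth, len[supp == "OJ" & dose == 0.5])
bars_se   <- mean_se(x)
se_manual <- sd(x) / sqrt(length(x))

# The same function drives the error bars, so the bars are known to be +/- 1 SE
p <- ggplot(ToothGrowth,
            aes(x = factor(dose), y = len, colour = supp, group = supp)) +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.1) +
  stat_summary(fun = mean, geom = "line") +
  stat_summary(fun = mean, geom = "point")
p
```

Computing the summary first, as in the rest of this section, remains the more transparent route when the table itself needs to be inspected or reported.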