How do I set the series labels in a multiline ggplot2 series?

Date: 2017-04-09 23:40:21

Tags: r ggplot2

I'm working on automating some basic experimental analysis in R. My script is currently set up as follows, and it generates the plot shown below.

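library(reshape2)  # melt() is assumed to come from reshape2
library(ggplot2)

# read one CSV per experiment
# (experiments, plot, ylabel and title are assumed to be defined elsewhere)
data <- list()
for (experiment in experiments) {
    path <- paste('../out/', experiment, '/', plot, '.csv', sep="")
    data[[experiment]] <- read.csv(path, header=FALSE)
}

# one column of per-year means for each series
df <- data.frame(Year=1:40,
                 'current'=colMeans(data[['current']]),
                 'vip'=colMeans(data[['vip']]),
                 'vipbonus'=colMeans(data[['vipbonus']]))

# reshape to long format so that each series becomes a value of Series
df <- melt(df, id.vars = 'Year', variable.name = 'Series')
plotted <- ggplot(df, aes(Year, value)) +
           geom_line(aes(colour = Series)) +
           labs(y = ylabel, title = title)

file <- paste(plot, '.png', sep="")
ggsave(filename = file, plot = plotted)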

(Image: the resulting line plot, one coloured line per series.)

While this is close to what we want the final product to look like, the series labels need to be updated. Ideally we want them to be something like "VIP, no bonus", "VIP, with bonus" and so forth, but labels like that are not syntactically valid column names, and data.frame() automatically replaces the invalid characters with . even when the names are wrapped in backticks. Since these experiments are a work in progress, we also know that we are going to need more series labels in the future, so we don't want to lose ggplot's ability to set the colors for us automatically.

How can I set the series labels to be appropriate for humans?

3 answers:

Answer 0 (score: 2)

While this may not be an ideal approach, what worked for us was to update the relevant series labels after the melt call:

df$Series <- as.character(df$Series)
df$Series[df$Series == "current"] <- "Current"
df$Series[df$Series == "vip"] <- "VIP, no bonus"
df$Series[df$Series == "vipbonus"] <- "VIP, with bonus"

This results in plots like the following:

(Image: the same line plot with the updated series labels in the legend.)
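If more series are added later, the same relabelling could be driven by one named lookup vector instead of one assignment per label. The following is only a minimal sketch: `series_labels` is a hypothetical name, and the small `df` is a toy stand-in for the melted data frame, not the real experiment data.

```r
# Toy stand-in for the melted data frame (hypothetical values)
df <- data.frame(Year = rep(1:2, times = 3),
                 Series = rep(c("current", "vip", "vipbonus"), each = 2),
                 value = c(1, 2, 3, 4, 5, 6))

# Hypothetical lookup: names are the raw series ids, values the display labels
series_labels <- c(current = "Current",
                   vip = "VIP, no bonus",
                   vipbonus = "VIP, with bonus")

# factor() maps each raw id to its display label and fixes the legend order
df$Series <- factor(as.character(df$Series),
                    levels = names(series_labels),
                    labels = unname(series_labels))

levels(df$Series)
```

Adding a series then only requires one more entry in the lookup vector. Any raw id missing from the vector becomes NA, which makes forgotten labels easy to spot.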

Answer 1 (score: 2)

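The OP has explained that they are currently working on automating some basic experimental analysis, part of which is relabelling the series. The OP has also shown some of the code used to prepare the data for plotting.

Based on the additional information provided in the comments, I believe the overall processing can be streamlined in a way that also solves the series-labelling issue.

Some preparations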

# used for creating file paths
experiments <- c("current", "vip", "vipbonus")
# used for labeling the series
exp_labels <- c("Current", "VIP, no bonus", "VIP, with bonus")
plot <- "dataset1"   # e.g.
paths <- paste0(file.path("../out", experiments, plot), ".csv") 
paths
#[1] "../out/current/dataset1.csv"  "../out/vip/dataset1.csv"      "../out/vipbonus/dataset1.csv"

Read data

library(data.table)   #version 1.10.4 used here
# read all files into one large data.table
# add running count in column "Series" to identify the source of each row
DT <- rbindlist(lapply(paths, fread, header = FALSE), idcol = "Series")
# rename file chunks = Series, use predefined labels
DT[, Series := factor(Series, labels = exp_labels)]

Reshape and aggregate by groups

# reshape from wide to long
molten <- melt(DT, id.vars = "Series")
# compute means by Series and Year = variable
aggregated <- molten[, .(value = mean(value)), by = .(Series, variable)]
# take factor level number of "variable" as Year
aggregated[, Year := as.integer(variable)]

Note that the aggregation is done in long format (after melt()), so the same command does not have to be entered once per column.
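To illustrate the point about long format, here is a small sketch on made-up data. The toy `DT` and its values are hypothetical and only stand in for the table read from the CSV files. In long format a single grouped mean() covers every column, whereas the wide-format equivalent has to go through .SD (or name each column).

```r
library(data.table)

# Toy stand-in for DT: two series, two rows each (hypothetical values)
DT <- data.table(Series = factor(rep(c("Current", "VIP, no bonus"), each = 2)),
                 V1 = c(1, 3, 10, 30),
                 V2 = c(2, 4, 20, 40),
                 V3 = c(5, 7, 50, 70))

# long format: one grouped mean() handles all columns at once
molten <- melt(DT, id.vars = "Series")
long_means <- molten[, .(value = mean(value)), by = .(Series, variable)]

# wide format: the same aggregation, expressed over all columns via .SD
wide_means <- DT[, lapply(.SD, mean), by = Series]
```

Both results hold the same means. The long one is already in the shape ggplot2 expects, with one row per series and column.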

Create chart & save to disk

library(ggplot2)
ggplot(aggregated, aes(Year, value)) +
  geom_line(aes(colour = Series)) +
  labs(y = "ylabel", title = "title")

file = paste(plot, '.png', sep="")
ggsave(filename = file)   # by default, the last plot is saved

Answer 2 (score: 1)

You can try this:

library(tidyverse)
df <- df %>% dplyr::mutate(Series = as.character(Series),
                           Series = fct_recode(Series,
                                              "Current" = "current",
                                              "VIP, no bonus" = "vip", 
                                              "VIP, with bonus" = "vipbonus"))
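A possible alternative, not taken from any of the answers above, is to leave the data untouched and change only the legend text through the colour scale. This is a minimal sketch assuming a hypothetical lookup vector `series_labels` and a toy `df` in place of the real melted data. It relies on discrete scales in ggplot2 accepting a named character vector for `labels`, which is matched to the series ids by name.

```r
library(ggplot2)

# Toy stand-in for the melted data frame (hypothetical values)
df <- data.frame(Year = rep(1:2, times = 3),
                 Series = rep(c("current", "vip", "vipbonus"), each = 2),
                 value = c(1, 2, 3, 4, 5, 6))

# Hypothetical lookup from raw series ids to display labels
series_labels <- c(current = "Current",
                   vip = "VIP, no bonus",
                   vipbonus = "VIP, with bonus")

# Only the legend text changes; colours are still assigned automatically
plotted <- ggplot(df, aes(Year, value)) +
  geom_line(aes(colour = Series)) +
  scale_colour_discrete(labels = series_labels)
```

Because the labels live in the scale rather than in the data, the raw ids stay available for filtering and joins elsewhere in the script.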