Question

I have a CSV file named "table_parameter". Please, download from here. The data looks like this:

           time        avg.PM10            sill       range         nugget
    1   2012030101  52.2692307692308    0.11054330  45574.072   0.0372612157
    2   2012030102  55.3142857142857    0.20250974  87306.391   0.0483153769
    3   2012030103  56.0380952380952    0.17711558  56806.827   0.0349567088
    4   2012030104  55.9047619047619    0.16466350  104767.669  0.0307528346
    .
    .
    .
    25  2012030201  67.1047619047619    0.14349774  72755.326   0.0300378129
    26  2012030202  71.6571428571429    0.11373430  72755.326   0.0320594776
    27  2012030203  73.352380952381     0.13893530  72755.326   0.0311135434
    28  2012030204  70.2095238095238    0.12642303  29594.037   0.0281416079
    .
    .

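In my data frame, a variable named time holds hourly timestamps from 1 March 2012 to 7 March 2012 in numeric form. For example, 1 a.m. on 1 March 2012 is written as 2012030101, and so on.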

From this data set I want to subset (24 * 11) data frames, as in the following table:

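For example, for 1 a.m. (2012030101, 2012030201, ..., 2012030701) and avg.PM10 < 10, I want one data frame. In this case, you may find that some of the data frames have no observations. That is fine, because I will be using a very large data set.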

I could do this subsetting manually by writing (24 * 11) 264 lines of code like this!

table_par<-read.csv("table_parameter.csv")
times<-as.numeric(substr(table_par$time,9,10))

par_1am_0to10 <-subset(table_par,times ==1 & avg.PM10<=10)
par_1am_10to20 <-subset(table_par,times ==1 & avg.PM10>10 & avg.PM10<=20)
par_1am_20to30 <-subset(table_par,times ==1 & avg.PM10>20 & avg.PM10<=30)
.
.
.
par_24pm_80to90 <-subset(table_par,times ==24 & avg.PM10>80 & avg.PM10<=90)
par_24pm_90to100 <-subset(table_par,times==24 & avg.PM10>90 & avg.PM10<=100)
par_24pm_100up <-subset(table_par,times  ==24 & avg.PM10>100)

But I understand that this code is very inefficient. Is there a way to do this efficiently with a loop?

FYI: later on, I actually want to draw some plots using these (24 * 11) data sets.

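Update: after this subsetting, I want to draw a box plot of range for each data set. But the problem is that I want to show all (24 * 11) box plots of range in a single figure, laid out like a matrix [as in the figure above]! If you have any questions, please let me know. Thank you very much.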

Answer 1

You can do this with some plyr, dplyr and tidyr magic:

library(tidyr)
library(dplyr)
# I am not loading plyr here because it interferes with dplyr; I only want it for the round_any function anyway

# Read data
dfData <- read.csv("table_parameter.csv")

dfData %>% 
  # Extract hour and compute the rounded Avg.PM10 using round_any
  mutate(hour = as.numeric(substr(time, 9, 10)),
         roundedPM.10 = plyr::round_any(Avg.PM10, 10, floor),
         roundedPM.10 = ifelse(roundedPM.10 > 100, 100,roundedPM.10)) %>% 
  # Keep only the relevant columns
  select(hour, roundedPM.10) %>% 
  # Count the number of occurrences per hour
  count(roundedPM.10, hour) %>% 
  # Use spread (from tidyr) to transform it into wide format
  spread(hour, n)

If you plan to use ggplot2, you can skip tidyr and the last line of the code, so that the data frame stays in long format, which is easier to plot.
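As a rough illustration of that long-format route, here is a minimal sketch that stops before `spread` and plots the counts as a heat map. The toy data frame is made up (it is not the real file), the `Avg.PM10` column name follows the code above, `pmin(floor(x / 10) * 10, 100)` is used as a stand-in for the `round_any` plus `ifelse` steps, and the choice of `geom_tile` is just one assumption about how you might want to display the counts.

```r
library(dplyr)
library(ggplot2)

# Made-up rows standing in for table_parameter.csv
dfData <- data.frame(
  time     = c(2012030101, 2012030201, 2012030102, 2012030202),
  Avg.PM10 = c(52.3, 57.1, 55.3, 120.5)
)

# Same steps as above, but without the final spread(): one row per
# (roundedPM.10, hour) combination, with the count in column n
dfCounts <- dfData %>%
  mutate(hour = as.numeric(substr(time, 9, 10)),
         roundedPM.10 = pmin(floor(Avg.PM10 / 10) * 10, 100)) %>%
  count(roundedPM.10, hour)

# Long format maps directly onto ggplot2 aesthetics
p <- ggplot(dfCounts, aes(factor(hour), factor(roundedPM.10), fill = n)) +
  geom_tile() +
  labs(x = "hour", y = "roundedPM.10 (lower edge of interval)")
print(p)
```

Combinations with no observations simply do not appear in `dfCounts`, so they show up as empty cells in the plot.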

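Edit: after reading the comments, I realized that I had misunderstood your question. This will give you box plots for each hour and each interval of Avg.PM10: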

library(tidyr)
library(dplyr)
library(ggplot2)
# I am not loading plyr here because it interferes with dplyr; I only want it
# for the round_any function anyway

# Read data
dfData <- read.csv("C:/Users/pformont/Desktop/table_parameter.csv")

dfDataPlot <- dfData %>% 
  # Extract hour and compute the rounded Avg.PM10 using round_any
  mutate(hour = as.numeric(substr(time, 9, 10)),
         roundedPM.10 = plyr::round_any(Avg.PM10, 10, floor),
         roundedPM.10 = ifelse(roundedPM.10 > 100, 100,roundedPM.10)) %>% 
  # Keep only the relevant columns
  select(roundedPM.10, hour, range)

# Plot range as a function of hour (as a factor to have separate plots)
# and facet it according to roundedPM.10 on the y axis
ggplot(dfDataPlot, aes(factor(hour), range)) + 
  geom_boxplot() + 
  facet_grid(roundedPM.10~.)
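One detail to be aware of: rounding down with `floor` puts a value of exactly 10 into the 10 interval, whereas the `<=` conditions in the question put it into the 0to10 interval. If the boundaries matter to you, base R's `cut()` with `right = TRUE` may be closer to what the question describes. This is only a sketch with made-up values; the label names mimic the variable names in the question and are an assumption.

```r
# Made-up PM10 values, including some that sit exactly on a boundary
avg_pm10 <- c(5, 10, 10.5, 52.3, 100, 100.1)

# Labels "0to10", "10to20", ..., "90to100", "100up" (11 intervals)
bin_labels <- c(paste0(seq(0, 90, by = 10), "to", seq(10, 100, by = 10)),
                "100up")

# Right-closed intervals: (-Inf, 10], (10, 20], ..., (90, 100], (100, Inf]
pm10_bin <- cut(avg_pm10,
                breaks = c(-Inf, seq(10, 100, by = 10), Inf),
                labels = bin_labels,
                right = TRUE)

# Floor-based binning capped at 100, for comparison
floor_bin <- pmin(floor(avg_pm10 / 10) * 10, 100)

print(data.frame(avg_pm10, pm10_bin, floor_bin))
```

The resulting factor could be used in place of `roundedPM.10` in `facet_grid`; because it keeps all 11 levels, empty intervals stay visible if you facet with `drop = FALSE`.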

Answer 2

How about a double loop like this:

"something"
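The sketch below is not the answerer's code. It is just one way a double loop over hours and PM10 intervals could look, assuming the `avg.PM10` column name and the `<=` boundaries from the question. The toy data, the list name `subsets` and the element naming scheme are all hypothetical.

```r
# Made-up rows standing in for table_parameter.csv
table_par <- data.frame(
  time     = c(2012030101, 2012030102, 2012030201, 2012030202, 2012030124),
  avg.PM10 = c(52.3, 8.0, 67.1, 120.5, 10.0),
  range    = c(45574.1, 87306.4, 72755.3, 72755.3, 29594.0)
)

# Hour of day from the last two digits of time
hours <- as.numeric(substr(table_par$time, 9, 10))

# Interval edges: (-Inf, 10], (10, 20], ..., (90, 100], (100, Inf]
lower <- c(-Inf, seq(10, 100, by = 10))
upper <- c(seq(10, 100, by = 10), Inf)
bin_names <- c(paste0(seq(0, 90, by = 10), "to", seq(10, 100, by = 10)),
               "100up")

# Outer loop over hours, inner loop over intervals; every subset is
# stored in one named list instead of 264 separate variables
subsets <- list()
for (h in 1:24) {
  for (b in seq_along(lower)) {
    keep <- hours == h &
      table_par$avg.PM10 > lower[b] &
      table_par$avg.PM10 <= upper[b]
    subsets[[paste0("par_", h, "h_", bin_names[b])]] <- table_par[keep, ]
  }
}

length(subsets)          # 24 * 11 list elements
subsets[["par_1h_50to60"]]
```

Keeping the subsets in a list means empty combinations are stored as zero-row data frames, and you can later loop over the list (or use `lapply`) for plotting.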

How can I subset efficiently using a loop in R?
