How to calculate the average across multiple CSV files

Time: 2021-05-04 12:31:55

Tags: python r

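In Python, the following approach computes the average across multiple CSV files.

If file1.csv through file100.csv are all in the same directory, this Python script can be used: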

#!/usr/bin/env python3

N = 100
mean_sum = 0
std_sum = 0
for i in range(1, N + 1):
    with open(f"file{i}.csv") as f:
        mean_sum += float(f.readline().split(",")[1])
        std_sum += float(f.readline().split(",")[1])

print(f"Mean of means: {mean_sum / N}")
print(f"Mean of stds: {std_sum / N}")
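
The script above hard-codes the number of files and reads the second field of the first two lines of each file. A minimal sketch of the same idea that discovers the files with `glob` instead is given here. It assumes the same layout (line 1, field 2 holds the mean; line 2, field 2 holds the standard deviation), which is inferred from the script rather than stated in the question. The function name `mean_of_stats` and the `file*.csv` pattern are illustrative.

```python
import csv
import glob
import os


def mean_of_stats(directory, pattern="file*.csv"):
    """Average the per-file mean and std stored in each matching CSV.

    Assumed layout (inferred from the script above): row 1, field 2 is the
    mean; row 2, field 2 is the standard deviation.
    """
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r} in {directory!r}")

    mean_sum = 0.0
    std_sum = 0.0
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.reader(f)
            mean_sum += float(next(reader)[1])  # first row, second field
            std_sum += float(next(reader)[1])   # second row, second field

    n = len(paths)
    return mean_sum / n, std_sum / n
```

Calling `mean_of_stats(".")` in the data directory would return the two averages as a tuple, without needing to know the number of files in advance.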

How can I implement this in R?

2 answers:

Answer 0: (score: 1)

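"Everything can be coded", Eric :)

It is hard to help if you do not provide a minimal reproducible example and describe what you have tried so far and where it goes wrong.

The following is based on the {tidyverse}, a set of packages that work well together. What I wrote is close to pseudocode, but it should get you going. Obviously, you will have to adapt and rename things to fit your project, variable names, etc.

Good luck: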

library(readr)     # package to read tabular data
library(dplyr)     # main workhorse to crunch data
library(purrr)     # functional programming for iterations/loops

pth <- "my-data-folder"    # provide path to your data

# create a list of file names in your folder
## you may need to fine-tune the regular expression to select the files you are looking for
## full.names gives you the path/name of your data files
## \\. is the way to "escape" the dot before the csv extension

fns <- list.files(path = pth, pattern = "file.*\\.csv$", full.names = TRUE)

# write a function that reads the file and calculates your stats
## you can "summarise" stats over a table

my_function <- function(.fn){
  df <- read_csv(.fn)     # read the file
  df %>%
    summarise(MEAN = mean(my_target_variable),   # calc mean of your file/data
              SD = sd(my_target_variable))       # calc sd of the data
}

# iterate with purrr::map: take the list of filenames and apply your function to each entry
## map_dfr() returns a data frame; use plain map() to get a list instead
## for testing purposes you can truncate the list of filenames, e.g. fns[1:3] for the
## first 3 files

ds <- fns %>% 
   purrr::map_dfr(.f = my_function)

ds

ds is a table with the columns MEAN and SD.
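
For comparison with the question's Python script, the same per-file summary can be sketched with the standard library alone. This assumes each CSV has a header row and that the target column is numeric. The names `file_stats` and `stats_table` are hypothetical helpers, not part of the answer above.

```python
import csv
import statistics


def file_stats(path, column):
    """Return (mean, sd) of one numeric column in a CSV file with a header row."""
    with open(path, newline="") as f:
        values = [float(row[column]) for row in csv.DictReader(f)]
    # statistics.stdev is the sample standard deviation (n - 1), like R's sd()
    return statistics.mean(values), statistics.stdev(values)


def stats_table(paths, column):
    """One dict per file, mirroring the MEAN and SD columns of `ds`."""
    table = []
    for path in paths:
        mean, sd = file_stats(path, column)
        table.append({"file": path, "MEAN": mean, "SD": sd})
    return table
```

Each file needs at least two data rows, because the sample standard deviation is undefined for a single value.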

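Answer 1: (score: 1)

It was fun to think about making this example reproducible, so here is some code that creates 100 CSVs, each containing five columns of random data, reads them back in, and then performs the calculation you want. As @Ray's answer suggests, map() and its friends are a good way to iterate tidily.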

library(readr)
library(dplyr)
library(tidyr)
library(purrr)

## Make a "tmpdat" folder in the working dir if one doesn't exist
ifelse(!dir.exists(file.path("tmpdat")), dir.create(file.path("tmpdat")), FALSE)

#> [1] TRUE

## Make 100 CSV files, each with 5 columns
## of random data.
set.seed(16)

nvars <- 5

paste0("csv_", 1:100) %>%
  set_names() %>%
  map(~ replicate(n = nvars, rnorm(100, 0, 1))) %>%
  map_dfr(as_tibble, .id = "id", .name_repair = ~ paste0("v", 1:nvars)) %>%
  group_by(id) %>%
  nest() %>%
  pwalk(~ write_csv(x = .y, file = paste0("tmpdat/", .x, ".csv")))

## Get their names
filenames <- dir(path = "tmpdat",
                 pattern = "\\.csv$",
                 full.names = TRUE)

## Read them in and then
## 1. Calculate the mean and sd of each column in each CSV
## 2. Get the overall mean of means and mean of sds for each column
filenames %>%
  map_dfr(read_csv, .id = "id", col_types = cols()) %>%
  group_by(id) %>%
  summarize(across(everything(),
                   list(mean = mean, sd = sd))) %>%
  pivot_longer(-id,
               names_to = c("col", ".value"), names_sep="_") %>%
  group_by(col) %>%
  summarize(avg_mean = mean(mean),
            avg_sd = mean(sd))


#> # A tibble: 5 x 3
#>   col   avg_mean avg_sd
#>   <chr>    <dbl>  <dbl>
#> 1 v1    -0.00433  1.01 
#> 2 v2     0.00124  0.989
#> 3 v3    -0.00185  0.997
#> 4 v4     0.00431  0.991
#> 5 v5    -0.00502  0.996

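If you only want a single overall mean and a single overall sd (rather than one per column across all CSVs), things get simpler: you can collapse the variables in each CSV into a single vector grouped by file id and take its mean and standard deviation.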
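
That simpler variant can be sketched in Python with the standard library, to match the question's original script. It assumes every file has a header row and only numeric columns, as in the generated example data. The function name `overall_stats` is illustrative.

```python
import csv
import statistics


def overall_stats(paths):
    """Mean of per-file means and mean of per-file sds, pooling all columns."""
    means, sds = [], []
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.reader(f)
            next(reader)  # skip the header row
            # flatten every column of this file into one vector
            values = [float(cell) for row in reader for cell in row]
        means.append(statistics.mean(values))
        sds.append(statistics.stdev(values))  # sample sd, like R's sd()
    return statistics.mean(means), statistics.mean(sds)
```

Note that the second value is the average of the per-file standard deviations, as in the question. It is not the standard deviation of all values pooled across files.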
