Question

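I need to create a function that parses and splits untidy Excel data and finally saves the result as csv files.

More specifically, for each Excel file and for each sheet, I need 2 data frames: one that contains the table and one that contains all the headers. The layout is identical on every sheet (same headers, same table).

I must not use row numbers to separate the headers from the table; I have to use the character names instead.

How can I split them?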

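The headers are: title, pages, copies, year, end. The first four are 3 rows above the table, and end is 2 rows below the table.
The table has 5 columns: id, glucose, insulin, crp, ffa.
The table has 4 rows (5 if you count the column names at the top).
In the final df the headers should look like this: title: one, pages: many, copies: 200, year: 2019, end: pm.
Currently, each header and its value are not in the same cell but in adjacent cells.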
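
To make the layout concrete, here is a minimal base-R sketch of the header half of the problem: look each header up by name and take the value from the cell next to it. The `grid` object is a hypothetical stand-in for what `readxl::read_excel(path, col_names = FALSE)` might return; the table values are invented, the arrangement of the headers (one per row, value in the cell immediately to the right) is an assumption, and `extract_headers` is an illustrative name, not a function from any package.

```r
# Hypothetical stand-in for a sheet read without column names:
# every cell is text, empty cells are NA. Table values are invented.
grid <- as.data.frame(rbind(
  c("title",  "one",     NA,        NA,    NA),
  c("pages",  "many",    NA,        NA,    NA),
  c("copies", "200",     NA,        NA,    NA),
  c("year",   "2019",    NA,        NA,    NA),
  c(NA,       NA,        NA,        NA,    NA),
  c(NA,       NA,        NA,        NA,    NA),
  c("id",     "glucose", "insulin", "crp", "ffa"),
  c("1",      "5.1",     "10",      "1.2", "0.4"),
  c("2",      "4.8",     "12",      "0.9", "0.5"),
  c("3",      "6.0",     "15",      "2.1", "0.6"),
  c("4",      "5.5",     "11",      "1.5", "0.3"),
  c(NA,       NA,        NA,        NA,    NA),
  c("end",    "pm",      NA,        NA,    NA)
), stringsAsFactors = FALSE)

# Find each header by its name (not by row number) and return the
# value from the cell directly to its right (assumed layout).
extract_headers <- function(grid, keys) {
  m <- as.matrix(grid)
  values <- vapply(keys, function(key) {
    pos <- which(m == key, arr.ind = TRUE)  # NA cells are ignored by which()
    # header missing, or sitting in the last column with no neighbour
    if (nrow(pos) == 0 || pos[1, "col"] == ncol(m)) return(NA_character_)
    m[pos[1, "row"], pos[1, "col"] + 1]
  }, character(1))
  data.frame(header = keys, value = unname(values), stringsAsFactors = FALSE)
}

headers <- extract_headers(grid, c("title", "pages", "copies", "year", "end"))
print(headers)
```

The result is a two-column data frame of header names and values that can be written to csv as is. If the same header text could occur more than once on a sheet, only the first match is used here.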

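I will try to provide further clarification if requested.

I already have code that writes the result as csv after the parsing.

I already have a function that splits the Excel file into sheets.

Now I only need the split into 2 data frames.

I tried using tidyxl, but for some reason it cannot open the file. It reports an error("zip file cannot be opened"), but the file is not a zip; it is an xlsx that I just created.

I tried filtering by character, but it didn't work (or I couldn't make it work).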

# file.choose()
library(tidyxl)    # xlsx_cells()
library(unpivotr)  # as_cells(), rectify(), partition()
library(dplyr)     # filter()

my_path <- "C:\\Users\\Βύρωνας\\Desktop\\BYRON\\Miscellaneous\\test.xlsx"
cel <- xlsx_cells(path = my_path, sheets = "Sheet1") # doesn't work
cells <- readxl::read_excel(my_path, col_names = FALSE)

cells <- as_cells(cells)
rectify(cells, character, numeric)
str(cells)

# idea: which row and which columns have the top_left and bottom_right?
corners <-
  filter(cells, !is.na(character),
         !(character %in% c("title", "pages", "copies", "year", "end")))

partition(cells, corners)

#if it doesn't work, use subset? or find the row that contains the end and 
#get the inbetween space?

In case it helps, here is the code that imports the file and splits it into sheets (this works):

library(readxl)

# read every sheet of a workbook into a named list
read_excel_allsheets <- function(filename, tibble = FALSE) {
  sheets <- readxl::excel_sheets(filename)
  x <- lapply(sheets, function(X) readxl::read_excel(filename, sheet = X))
  if (!tibble) {
    x <- lapply(x, as.data.frame)
  }
  names(x) <- sheets
  x
}

In the meantime, I have got this far:

sheet <- 1
read1 <- readxl::read_excel(my_path, sheet = sheet)

skip_rows <- NULL
col_skip <- 0
search_string1 <- "id"
search_string2 <- "end"
max_cols_to_search <- 6
max_rows_to_search <- 20

# scan the columns left to right until one contains the "id" column name
while (length(skip_rows) == 0) {
  col_skip <- col_skip + 1
  if (col_skip == max_cols_to_search) break
  skip_rows <- which(stringr::str_detect(
    read1[1:max_rows_to_search, col_skip][[1]],
    search_string1
  ))
}

# re-read the sheet, skipping everything above the table
read2 <- readxl::read_excel(
  my_path,
  sheet = sheet,
  skip = skip_rows
)

This gets rid of the top headers, which is good, but I still cannot get rid of the bottom header.
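
For illustration, this is roughly the result I am after for the table half, written as a minimal base-R sketch on mock data rather than on the real file. It keeps only the rows between the "id" row and the "end" row, both located by name. The `grid` object is a hypothetical stand-in for a sheet read with `col_names = FALSE`, the table values are invented, and `extract_table` is an illustrative name only.

```r
# Hypothetical stand-in for a sheet read without column names:
# every cell is text, empty cells are NA. Table values are invented.
grid <- as.data.frame(rbind(
  c("title",  "one",     NA,        NA,    NA),
  c("pages",  "many",    NA,        NA,    NA),
  c("copies", "200",     NA,        NA,    NA),
  c("year",   "2019",    NA,        NA,    NA),
  c(NA,       NA,        NA,        NA,    NA),
  c(NA,       NA,        NA,        NA,    NA),
  c("id",     "glucose", "insulin", "crp", "ffa"),
  c("1",      "5.1",     "10",      "1.2", "0.4"),
  c("2",      "4.8",     "12",      "0.9", "0.5"),
  c("3",      "6.0",     "15",      "2.1", "0.6"),
  c("4",      "5.5",     "11",      "1.5", "0.3"),
  c(NA,       NA,        NA,        NA,    NA),
  c("end",    "pm",      NA,        NA,    NA)
), stringsAsFactors = FALSE)

# Keep the rows strictly between the row holding the first column name
# and the row holding the bottom header, both found by name.
extract_table <- function(grid, first_col_name = "id", end_marker = "end") {
  m <- as.matrix(grid)
  top <- which(m == first_col_name, arr.ind = TRUE)
  bottom <- which(m == end_marker, arr.ind = TRUE)
  stopifnot(nrow(top) > 0, nrow(bottom) > 0)
  top_row <- top[1, "row"]
  bottom_row <- bottom[1, "row"]
  stopifnot(bottom_row > top_row + 1)

  # columns that carry a column name in the table's header row
  cols <- which(!is.na(m[top_row, ]))
  body <- m[(top_row + 1):(bottom_row - 1), cols, drop = FALSE]
  # drop rows that are entirely empty (the gap between table and "end")
  body <- body[rowSums(!is.na(body)) > 0, , drop = FALSE]

  tbl <- as.data.frame(body, stringsAsFactors = FALSE)
  names(tbl) <- unname(m[top_row, cols])
  # turn text columns into numbers where every value allows it
  tbl[] <- lapply(tbl, type.convert, as.is = TRUE)
  tbl
}

tbl <- extract_table(grid)
print(tbl)
```

On the mock data this gives a 4-row data frame with the columns id, glucose, insulin, crp and ffa, without relying on fixed row numbers. It uses exact cell matches, so a sheet where "id" or "end" also appears as a data value would need a stricter search.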

How can I parse and extract untidy Excel data?
