Question

I have a matrix with a title (column 1) and a text (column 2).

Most of the text cells contain strings longer than 10,000 words. (A few cells do not.)

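I want to split each text cell into chunks of 2,000 words that keep the same title. (So the original matrix[1,1] entry would become five or more entries with the same title, each no longer than 2,000 words.)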

I googled and found some code for this (maybe ;;), but it did not work.

How can I solve this problem?

library(XML)  # provides getNodeSet() and xmlValue()

makeflextextchunks <- function(doc.object, chunk.size=2000, percentage=TRUE){
  # doc.object must be a parsed TEI XML document, not a matrix
  paras <- getNodeSet(doc.object, "/d:TEI/d:text/d:body/d:p",
                      c(d="http://www.tei-c.org/ns/1.0"))
  words <- paste(sapply(paras, xmlValue), collapse=" ")
  words.lower <- tolower(words)
  words.l <- strsplit(words.lower, "\\s+")
  word.v <- unlist(words.l)
  x <- seq_along(word.v)
  if(percentage){
    # here chunk.size is the NUMBER of chunks, not the words per chunk
    max.length <- length(word.v)/chunk.size
    chunks.l <- split(word.v, ceiling(x/max.length))
  }else{
    # here chunk.size is the number of words per chunk
    chunks.l <- split(word.v, ceiling(x/chunk.size))
    # merge a short final chunk into the previous one
    # (assumed intent: "short" means at most half of chunk.size)
    if(length(chunks.l) > 1 &&
       length(chunks.l[[length(chunks.l)]]) <= chunk.size/2){
      chunks.l[[length(chunks.l)-1]] <-
        c(chunks.l[[length(chunks.l)-1]],
          chunks.l[[length(chunks.l)]])
      chunks.l[[length(chunks.l)]] <- NULL
    }
  }
  chunks.l <- lapply(chunks.l, paste, collapse=" ")
  chunks.df <- do.call(rbind, chunks.l)
  return(chunks.df)
}
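
The function above reads paragraphs from a TEI XML document, so it cannot be applied directly to a title/text matrix. As a rough sketch of the matrix case, the same split(words, ceiling(index / chunk.size)) idea can be applied to each text cell while repeating the title for every chunk. The helper name split_text_matrix and the toy data are hypothetical, and the sketch assumes the title is in column 1, the text is in column 2, and words are separated by whitespace.

```r
# Hypothetical helper (not from the original post): split each text cell of a
# two-column title/text matrix into chunks of at most `chunk.size` words.
split_text_matrix <- function(m, chunk.size = 2000) {
  rows <- lapply(seq_len(nrow(m)), function(i) {
    title <- m[[i, 1]]
    # Split the text cell on whitespace into individual words
    words <- unlist(strsplit(trimws(m[[i, 2]]), "\\s+"))
    if (length(words) == 0) {
      # Keep empty cells as a single row with empty text
      return(cbind(title = title, text = ""))
    }
    # Word k is assigned to chunk number ceiling(k / chunk.size)
    groups <- ceiling(seq_along(words) / chunk.size)
    chunks <- vapply(split(words, groups), paste, character(1), collapse = " ")
    # Repeat the title once per chunk
    cbind(title = rep(title, length(chunks)), text = unname(chunks))
  })
  do.call(rbind, rows)
}

# Toy data made up for illustration: 5 words and 2 words, chunked by 2 words
m <- cbind(title = c("A", "B"),
           text  = c("one two three four five", "six seven"))
out <- split_text_matrix(m, chunk.size = 2)
print(out)
```

With the toy data, title "A" ends up on three rows and title "B" on one. Cells shorter than chunk.size stay as a single row, and the last chunk of each cell may be shorter than the others; unlike the function above, this sketch does not merge a short final chunk into the previous one.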

By a set number of words

0 answers: