Question

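I am working with a large file: I read it in chunks, process each chunk, and save what I extract. Then, after clearing memory with rm(list=ls()) (sometimes I also have to use .rs.restartR(), but that is not the focus of this post), I run the same script again after adding 1 to two numbers in it.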

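This seems like an opportunity to try writing a loop, but between trying to initialize all the objects used in the loop and the fact that I am not very good at writing loops, it got really confusing.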

I am posting here for some suggestions, and I apologize in advance if my question is too vague. Thanks.

#######################         A:11
#######################         B:12

                # A    I change the multiple each time here.
text_tbl <- fread("tlm_s_words", skip = 166836*11, nrows = 166836, header = FALSE, col.names = "text")



bi_tkn_one <- tokens(text_tbl$text, what = "fastestword", ngrams = 4, concatenator =" ", verbose = TRUE)

dfm_1 <- dfm(bi_tkn_one)

## First use colSums(), which gives a numeric vector saved in `final_dfm_1`
## tib is the desired object I will save with a new name each time.

final_dfm_1 <- colSums(dfm_1)


tib <- tbl_df(final_dfm_1) %>% add_rownames()  
# This is what I wanted to extract 'the freq of each token'


            # B Here I change the name `tib` is saved under each time.
saveRDS(tib, file = "tiq12.Rda")

rm(list=ls(all=TRUE))
Sys.sleep(10)
gc()
Sys.sleep(10)

Next I run the same script, but change 11 to 12 in fread() and 12 to 13 in the saveRDS() command.

#######################         A:12
#######################         B:13

            # A    I change the multiple each time here.
text_tbl <- fread("tlm_s_words", skip = 166836*12, nrows = 166836, header = FALSE, col.names = "text")



bi_tkn_one <- tokens(text_tbl$text, what = "fastestword", ngrams = 4, concatenator =" ", verbose = TRUE)

dfm_1 <- dfm(bi_tkn_one)

## Using colSums() gives a numeric vector, `final_dfm_1`
## tib is the desired object I will save with a new name each time.

final_dfm_1 <- colSums(dfm_1)


tib <- tbl_df(final_dfm_1) %>% add_rownames()  
# This is what I wanted to extract 'the freq of each token'


            # B Here I change the name `tib` is saved under each time.
saveRDS(tib, file = "tiq13.Rda")

rm(list=ls(all=TRUE))
Sys.sleep(10)
gc()
Sys.sleep(10)

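Below is a list of all the objects in my working environment (thanks to this post). They are cleared from the working environment before I run the same chunk with A + 1 and B + 1.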

                  Type      Size    Rows Columns
dfm_1        dfmSparse 174708600  166836 1731410
bi_tkn_one      tokens 152494696  166836      NA
tib             tbl_df 148109248 1731410       2
final_dfm_1    numeric 148108544 1731410      NA
text_tbl    data.table  22485264  166836       1
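
A listing like the one above can be produced in base R. The following is only a sketch, not the function from the post referred to: `list_objects` is a hypothetical helper name, and the size column comes from `object.size()`, which reports an estimate in bytes.

```r
# Hypothetical helper (not from the original post): summarise the objects in
# an environment by class, estimated size in bytes, and dimensions.
list_objects <- function(env = globalenv()) {
  nms <- ls(envir = env)
  if (length(nms) == 0) {
    return(data.frame(Type = character(0), Size = numeric(0),
                      Rows = integer(0), Columns = integer(0)))
  }
  objs <- mget(nms, envir = env)
  out <- data.frame(
    Type = vapply(objs, function(x) class(x)[1], character(1)),
    Size = vapply(objs, function(x) as.numeric(object.size(x)), numeric(1)),
    # Objects without dimensions report their length as Rows and NA as Columns.
    Rows = vapply(objs, function(x) {
      if (is.null(dim(x))) as.integer(length(x)) else as.integer(dim(x)[1])
    }, integer(1)),
    Columns = vapply(objs, function(x) {
      if (is.null(dim(x))) NA_integer_ else as.integer(dim(x)[2])
    }, integer(1)),
    row.names = nms
  )
  # Largest objects first, as in the table above.
  out[order(out$Size, decreasing = TRUE), ]
}
```

Calling `list_objects()` with no arguments inspects the global environment; passing another environment inspects that one instead.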

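I spent some time trying to figure out how to write this loop, and found posts on SO about how to initialize a data.table with a character column, but I think there are other objects I would need to initialize as well. I am not sure how reasonable it is to write such a loop.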

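As shown above, I have been copying and pasting the same script and running it right away. This is silly, because I am only adding one in two places.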

Feel free to comment on my approach; I would like to learn something from this. Best.

Side note: I read about adding .rs.restartR() to a loop, and found posts suggesting batch files or scheduled tasks for R; I will have to keep learning about those.

Answer 1

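This turned out to be very simple: I did not initialize any objects, and it does what I wanted. The only things I had to load after starting R and before running the loop were the required packages.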

 ls()
    character(0)
From an empty environment, just a simple loop.

library(data.table)
library(quanteda)
library(dplyr)

    for (i in 4:19){
                    # A    I change the multiple each time here.
        text_tbl <- fread("tlm_s_words", skip = 166836*i, nrows = 166836, header = FALSE, col.names = "text")



        bi_tkn_one <- tokens(text_tbl$text, what = "fastestword", ngrams = 3, concatenator =" ", verbose = TRUE)

        dfm_1 <- dfm(bi_tkn_one)

        ## Using colSums() gives a numeric vector, `final_dfm_1`
        ## tib is the desired object I will save with a new name each time.

        final_dfm_1 <- colSums(dfm_1)
        print(setNames(length(final_dfm_1), "no. N-grams in this batch"))
            # no. N-grams


        tib <- tbl_df(final_dfm_1) %>% add_rownames()  
        # This is what I wanted to extract 'the freq of each token'


             # B Here I change the name `tib` is saved under each time.
        iplus = i+1
        saveRDS(tib, file = paste0("titr",iplus,".Rda"))

        rm(list=ls())
        Sys.sleep(10)
        gc()
        Sys.sleep(10)

    }

Without initializing any data.table or other objects, the loop above saved 16 files in my working directory.
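
One possible variation, offered as a sketch rather than as the method used above: put the per-chunk work in a function, so the intermediate objects are local and go out of scope when the function returns, instead of calling rm(list=ls()), which also deletes everything else in the workspace. The names `process_chunk`, `chunk_size`, and `out_dir` are hypothetical. To keep the example runnable without extra packages, base R stand-ins are assumed: `scan()` in place of `fread()`, and `table()` on split words in place of `tokens()` and `dfm()`.

```r
# Hypothetical helper: process one chunk and save its word counts.
# Base R stand-ins (assumptions for this sketch): scan() replaces fread(),
# and table() on split words replaces tokens()/dfm()/colSums().
process_chunk <- function(i, path, chunk_size, out_dir) {
  # Read chunk_size lines, starting after the first chunk_size * i lines.
  lines <- scan(path, what = character(), sep = "\n",
                skip = chunk_size * i, nlines = chunk_size, quiet = TRUE)
  words <- unlist(strsplit(lines, " ", fixed = TRUE))
  counts <- table(words)
  freq <- data.frame(token = names(counts), n = as.integer(counts),
                     stringsAsFactors = FALSE)
  # The file number is i + 1, mirroring the naming in the loop above.
  out_file <- file.path(out_dir, paste0("chunk_", i + 1, ".rds"))
  saveRDS(freq, out_file)
  invisible(out_file)  # only the file path leaves the function
}

# Toy input: 6 lines, processed 2 lines at a time (chunks 0, 1, 2).
path <- tempfile(fileext = ".txt")
writeLines(c("a b", "a c", "b b", "c c", "a a", "d d"), path)
out_dir <- tempdir()

for (i in 0:2) {
  process_chunk(i, path, chunk_size = 2, out_dir = out_dir)
}
```

After the loop, the only objects left in the calling environment are `path`, `out_dir`, `i`, and the function itself; whether this frees memory as thoroughly as a restart on the real data is not something the sketch demonstrates.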

This makes me wonder: when do we need to initialize the vectors, matrices, and other objects used in a loop?
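
A common rule of thumb, given here as a sketch and not as a definitive answer: an object needs to exist before the loop when the loop fills it in piece by piece (for example by index), and it does not when each pass simply recreates it by assignment, as in the loop above where every result is written to disk. The variable names below are illustrative only.

```r
n <- 5

# Case 1: results are collected across iterations, so the container is
# created before the loop and filled by index.
squares <- numeric(n)
for (i in seq_len(n)) {
  squares[i] <- i^2
}

# A list works the same way when each result has a different length or class.
pieces <- vector("list", n)
for (i in seq_len(n)) {
  pieces[[i]] <- seq_len(i)
}

# Case 2: nothing is collected; the object is recreated by plain assignment
# on every pass, so no initialization is needed.
for (i in seq_len(n)) {
  tmp <- i * 10
}
```

Creating the container at its final size up front also avoids growing it on every iteration, which matters more as `n` gets large.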

Writing a loop and initializing objects of various classes
