Question

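I have a list of 15 million strings and a dictionary of 8 million words. I want to replace every string in the database with the dictionary indexes of the words it contains. I tried using the hash package for faster indexing, but it still takes hours to replace all 15 million strings. What is an efficient way to implement this?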

Example [edited]:

# Database
[[1]]
[1]"a admit been c case" 
[[2]] 
[1]"co confirm d ebola ha hospit howard http lik"

# dictionary
 "t" 1
 "ker" 2
 "be" 3
  .
  .
  .
  .

# Output:
[[1]]123 3453 3453 567
[[2]]6786 3423 234123 1234 23423 6767 3423 124431 787889 111

The index of admit in the dictionary is 3453.

Any kind of help is appreciated.

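Updated example with code. This is what I am currently doing.

Example:

data =
    [1] "same crimea vivid east accelerate http political split split threaten ukraine via west xtcnwl youtub"
    [2] "by cia fund group nazy spent twenty eight year ukraine"
    [3] "all back energy grandpa home miss my assumption"
    [4] "ao bv chega co de ebola http kkmnxv pacy rio suspeito t"
    [5] "android androidgam co coin collect gameinsight gold http i jzdydkylwd t ve"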

library(hash)

# Assumed helper (not shown in the original post): strip leading whitespace
trim.leading = function(x) sub("^\\s+", "", x)

words.list = strsplit(data, "\\W+", perl=TRUE)
words.vector = unlist(words.list)
sorted.words = sort(table(words.vector), decreasing=TRUE)
# Map each word to its rank by frequency
h = hash(names(sorted.words), 1:length(names(sorted.words)))

# Look up every word of every row, one hash access at a time
index = lapply(data, function(row) {
    temp = trim.leading(row)
    word_list = unlist(strsplit(temp, "\\W+", perl=TRUE))
    index_list = lapply(word_list, function(x) h[[x]])
    unlist(index_list)
})
Output:
index_list
[[1]]
 [1]  6  1 19 21 22 23 31  2 40 44 46  3 48  5 51 52 53 54 55

[[2]]
 [1] 12 14 16 26 30 38 45  4 49  5

[[3]]
 [1]  7 11 25 29 32 36 37 41 42  4

[[4]]
 [1] 10 13 15  1 20 24  2 35 39 43 47  3

[[5]]
 [1]  8  9  1 17 18 27 28  2 33 34  3 50

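The output is the list of indexes. This runs fast when the data is small, but it is very slow when the length is 15 million. My task is nearest-neighbor search: I want to search 1000 queries that are in the same format as the database. I have also tried other things, such as parallel computing, but ran into memory problems.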

[Edit] How can I implement this using Rcpp?

Answer 1

I think you'd like to avoid the lapply() by splitting the data, unlisting, and then processing the vector of words

data.list = strsplit(data, "\\W+", perl=TRUE)
words = unlist(data.list)
## ... additional processing, e.g., strip white space, on the vector 'words'

Perform the match, then relist to the original structure

relist(match(words, word.vector), data.list)
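
To make the two steps concrete, here is a minimal sketch on toy data. The values of `data` and `word.vector` are invented for illustration (they are not the poster's data), and `word.vector` is assumed to be a character vector in which each word's position is its dictionary index.

```r
# Toy inputs (hypothetical, for illustration only)
data <- c("a admit been c case", "co confirm d ebola")
word.vector <- c("a", "admit", "been", "c", "case",
                 "co", "confirm", "d", "ebola")

# Split every string once, then flatten to a single character vector
data.list <- strsplit(data, "\\W+", perl = TRUE)
words <- unlist(data.list)

# One vectorized lookup for all words at once;
# words that are missing from the dictionary come back as NA
idx <- match(words, word.vector)

# Restore the original structure: one integer vector per input string
index <- relist(idx, data.list)
```

The point of this arrangement is that `match()` is called once on the whole word vector instead of once per word, so the per-call overhead of the nested `lapply()` disappears.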

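For downstream applications it might actually pay to retain the vector plus 'partitioning' information, partition = sapply(data.list, length), rather than relisting, since it will continue to be efficient to operate on the unlisted vector. The Bioconductor S4Vectors package provides a CharacterList class that takes this approach: one mostly works with a list-like object, but the data are stored in, and most operations act on, an underlying character vector.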
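
The "vector plus partitioning" idea might look roughly like the sketch below. The toy data and the names `idx`, `group`, and `max_index` are assumptions for illustration. Base R's `lengths()` is used here as an equivalent of `sapply(data.list, length)`, and the per-string maximum is just an arbitrary example of a downstream operation.

```r
# Toy inputs (hypothetical, for illustration only)
data <- c("a admit been c case", "co confirm d ebola")
word.vector <- c("a", "admit", "been", "c", "case",
                 "co", "confirm", "d", "ebola")

data.list <- strsplit(data, "\\W+", perl = TRUE)

# Flat integer vector of dictionary indexes for all words
idx <- match(unlist(data.list), word.vector)

# Partitioning information: how many words each input string contributed
partition <- lengths(data.list)

# Group id (row number of the source string) for every element of idx
group <- rep(seq_along(partition), partition)

# Example of working directly on the flat vector:
# the largest dictionary index found in each string
max_index <- tapply(idx, group, max)

# The list form can still be recovered on demand
index <- split(idx, group)
```

Keeping `idx` and `partition` side by side means that later vectorized operations never have to loop over 15 million list elements, and the list is materialized only when something actually needs it.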

Answer 2

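It sounds like you are doing NLP.

A fast non-R solution (which can be wrapped in R) is word2vec

The word2vec tool takes a text corpus as input and produces the word vectors as output. It first constructs a vocabulary from the training text data and then learns vector representations of words. The resulting word vector file can be used as features in many natural language processing and machine learning applications.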

Replace each word with its index in 15 million strings
