Generating n-grams with Julia

Date: 2017-02-21 07:20:05

Tags: nlp zip julia n-gram

To generate word bigrams in Julia, I can simply zip the original list with the same list minus its first element, e.g.:

julia> s = split("the lazy fox jumps over the brown dog")
8-element Array{SubString{String},1}:
 "the"  
 "lazy" 
 "fox"  
 "jumps"
 "over" 
 "the"  
 "brown"
 "dog"  

julia> collect(zip(s, drop(s,1)))
7-element Array{Tuple{SubString{String},SubString{String}},1}:
 ("the","lazy")  
 ("lazy","fox")  
 ("fox","jumps") 
 ("jumps","over")
 ("over","the")  
 ("the","brown") 
 ("brown","dog") 

To generate trigrams, I can use the same collect(zip(...)) idiom:

julia> collect(zip(s, drop(s,1), drop(s,2)))
6-element Array{Tuple{SubString{String},SubString{String},SubString{String}},1}:
 ("the","lazy","fox")  
 ("lazy","fox","jumps")
 ("fox","jumps","over")
 ("jumps","over","the")
 ("over","the","brown")
 ("the","brown","dog") 

But I have to add the third list to the zip by hand. Is there an idiomatic way to do this so that I can generate n-grams of any order?

E.g., I'd like to avoid doing this to extract 5-grams:

julia> collect(zip(s, drop(s,1), drop(s,2), drop(s,3), drop(s,4)))
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}:
 ("the","lazy","fox","jumps","over") 
 ("lazy","fox","jumps","over","the") 
 ("fox","jumps","over","the","brown")
 ("jumps","over","the","brown","dog")

3 answers:

Answer 0 (score: 5)

Another approach is to use partition() from Iterators.jl:

ngram(s,n) = collect(partition(s, n, 1))
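For readers who want the same step-1 sliding window without the Iterators.jl dependency, here is a minimal Base-only sketch. The name sliding is hypothetical and not part of any package; it assumes s supports integer indexing starting at 1. Note that Base's own Iterators.partition(s, n) takes no step argument, so it only yields non-overlapping chunks.

```julia
# Hypothetical helper: a Base-only stand-in for a step-1 sliding window.
# Builds one n-tuple per starting position i; returns an empty array
# when n is larger than the number of words.
sliding(s, n) = [ntuple(j -> s[i + j - 1], n) for i in 1:length(s) - n + 1]

words = split("the lazy fox jumps over the brown dog")
trigrams = sliding(words, 3)
```

Unlike the lazy partition iterator, this sketch materializes every tuple eagerly.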

Answer 1 (score: 4)

Here is a clean one-liner that works for n-grams of any length.

ngram(s, n) = collect(zip((drop(s, k) for k = 0:n-1)...))

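It uses a generator comprehension to iterate over k, the number of elements to drop. Then the splat (...) operator unpacks the resulting Drop iterators into zip, and finally collect turns the Zip into an Array.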

julia> ngram(s, 2)
7-element Array{Tuple{SubString{String},SubString{String}},1}:
 ("the","lazy")  
 ("lazy","fox")  
 ("fox","jumps") 
 ("jumps","over")
 ("over","the")  
 ("the","brown") 
 ("brown","dog") 

julia> ngram(s, 5)
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}:
 ("the","lazy","fox","jumps","over") 
 ("lazy","fox","jumps","over","the") 
 ("fox","jumps","over","the","brown")
 ("jumps","over","the","brown","dog")

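As you can see, this is very similar to your solution; it only adds a simple comprehension to iterate over the number of elements to drop, so the length can be dynamic.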
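To make the splat step concrete, the sketch below builds the shifted iterators explicitly and checks that the result matches the hand-written trigram zip from the question. One assumption to flag: on Julia 1.x, drop is not exported and has to be brought in from Base.Iterators, whereas the REPL sessions above call it unqualified.

```julia
using Base.Iterators: drop  # on Julia 1.x, drop lives in Base.Iterators

words = split("the lazy fox jumps over the brown dog")
n = 3

# One lazy iterator per offset: drop 0, 1, ..., n-1 leading words.
shifted = [drop(words, k) for k in 0:n-1]

# Splatting passes each element of `shifted` as a separate argument to zip.
trigrams = collect(zip(shifted...))

# The same thing written out by hand, as in the question.
by_hand = collect(zip(words, drop(words, 1), drop(words, 2)))
```

zip stops at the shortest iterator, which is why no explicit bounds arithmetic is needed.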

Answer 2 (score: 4)

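By changing the output slightly and using SubArrays instead of Tuples, little is lost, but allocation and memory copying can be avoided. If the underlying word list is static, this works fine and is faster (in my benchmarks). The code: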

ngram(s,n) = [view(s,i:i+n-1) for i=1:length(s)-n+1]

And the output:

julia> ngram(s,5)
 SubString{String}["the","lazy","fox","jumps","over"] 
 SubString{String}["lazy","fox","jumps","over","the"] 
 SubString{String}["fox","jumps","over","the","brown"]
 SubString{String}["jumps","over","the","brown","dog"]

julia> ngram(s,5)[1][3]
"fox"

For larger word lists, the memory requirements are also much smaller.
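The caveat about a static word list comes from aliasing: each view refers to the original array instead of copying from it. A minimal sketch of the effect (ngram_views is just an illustrative name for the function above):

```julia
# Same definition as above, under an illustrative name.
ngram_views(s, n) = [view(s, i:i+n-1) for i in 1:length(s)-n+1]

words = ["the", "lazy", "fox", "jumps"]
grams = ngram_views(words, 2)

first_before = collect(grams[1])  # copy taken before the mutation
words[2] = "quick"                # mutate the underlying list
first_after = collect(grams[1])   # the view now sees the new word
```

If the word list may change after the n-grams are built, the tuple-based versions are the safer choice, since they hold their own references to the words.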

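Also note that using a generator lets you process the n-grams faster and with less memory, and it may be all the processing code needs (counting something, or passing the n-grams through some hash). For example, use @Gnimuc's solution without the collect, i.e. just partition(s, n, 1).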
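As a rough sketch of that idea, the code below counts bigrams without ever materializing the full n-gram array. It uses the zip/drop iterator from the other answer in place of partition so that it runs on Base alone; lazy_ngrams and count_ngrams are hypothetical names, and drop is imported from Base.Iterators as Julia 1.x requires.

```julia
using Base.Iterators: drop

# Lazy n-gram iterator: no collect, so tuples are produced one at a time.
lazy_ngrams(s, n) = zip((drop(s, k) for k in 0:n-1)...)

# Count how often each n-gram occurs, consuming the iterator once.
function count_ngrams(s, n)
    counts = Dict{Any,Int}()
    for gram in lazy_ngrams(s, n)
        counts[gram] = get(counts, gram, 0) + 1
    end
    return counts
end

words = split("the lazy fox jumps over the lazy dog")
counts = count_ngrams(words, 2)
```

Only the dictionary of distinct n-grams is kept in memory, which is usually much smaller than the full list for long texts.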