How to calculate the term frequency of documents in Spark?

Date: 2015-08-04 22:04:05

Tags: scala apache-spark

I am working on a document classification algorithm in Spark. I want to build a dictionary from the terms in each document that is to be classified. Here is what I have so far:

import java.io.StringReader
import org.apache.lucene.analysis.en.EnglishAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import scala.collection.mutable

def tokenize(content: String): Seq[String] = {
  val tReader = new StringReader(content)
  // LuceneVersion is a Version constant defined elsewhere in my code
  val analyzer = new EnglishAnalyzer(LuceneVersion)
  val tStream = analyzer.tokenStream("contents", tReader)
  val term = tStream.addAttribute(classOf[CharTermAttribute])
  tStream.reset()

  // collect every stemmed token emitted by the analyzer
  val result = mutable.ArrayBuffer.empty[String]
  while (tStream.incrementToken()) {
    result += term.toString()
  }
  tStream.end()
  tStream.close()
  result
}

This function takes a String, tokenizes and stems it, and returns a Seq[String]. This is how I call it:

val testInstance = sc.textFile("to-be-classified.txt")
testInstance.flatMap(line1 => tokenize(line1)).map(line2 => (line2,1))    

That's as far as I've gotten. Can someone help me build a dictionary-like structure that holds each 'term' as the key and its frequency as the value?

EDIT: I thought of a better approach, but I can't quite write it out. Here are some pieces of it:

case class doc_terms(filename: String, terms: List[(String, Int)])

My idea is to create an object of class doc_terms for every document I read. It holds a list of all the terms in that document. Then a reduce-by-key should give me the frequency of each term per document. In the end I will have an RDD in which each entry is (file1, [('term1', 12), ('term2', 23), ...]). Can someone help me write this?
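
Roughly, what I am picturing is something like the sketch below. I am not sure it compiles; the directory path is made up and the use of wholeTextFiles is just a guess, reusing the tokenize function from above.

case class doc_terms(filename: String, terms: List[(String, Int)])

// one (filename, content) pair per input file; "/path/to/docs" is a placeholder
val docs = sc.wholeTextFiles("/path/to/docs")

val perDocCounts = docs.map { case (filename, content) =>
  // count how often each term occurs within this single document
  val counts = tokenize(content).groupBy(identity).mapValues(_.size).toList
  doc_terms(filename, counts)
}
// perDocCounts: RDD[doc_terms], e.g. (file1, List((term1, 12), (term2, 23), ...))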

1 Answer:

Answer (score: 1)

OK, so I figured out two ways of doing this.

I'll use a simplified tokenizer; you can substitute something more sophisticated for mine and everything should still run.

For the text data I'm using a text file of the novel War and Peace.

Note that I've changed some of the exact classes to keep the types compatible. The term-counting function is called study; it takes one argument (the input file) and returns an object of type DocTerms.

Approach 1

import scala.collection.immutable.WrappedString
import scala.collection.Map

// naive tokenizer: lowercase the line and split on spaces
def tokenize(line: String): Array[String] =
    new WrappedString(line).toLowerCase().split(' ')

case class DocTerms(filename: String, terms: Map[String, Int])

def study(filename: String): DocTerms = {
    // emit (term, 1) per occurrence, sum per term, and collect to the driver as a Map
    val counts = (sc
        .textFile(filename)
        .flatMap(tokenize)
        .map( (s: String) => (s, 1) )
        .reduceByKey( _ + _ )
        .collectAsMap()
        )
    DocTerms(filename, counts)
}

val book1 = study("/data/warandpeace.txt")

for (c <- book1.terms.slice(0, 20)) println(c)

Output:

(worried,12)
(matthew.,1)
(follow,32)
(lost--than,1)
(diseases,1)
(reports.,1)
(scoundrel?,1)
(but--i,1)
(road--is,2)
(well-garnished,1)
(napoleon;,2)
(passion,,2)
(nataly,2)
(entreating,2)
(sounding,1)
(any?,1)
("sila,1)
(can,",3)
(motionless,22)

Note that this output is not sorted, and Map types are not sortable in general, but they provide fast lookups and behave like a dictionary. Although only 20 elements are printed, all the terms are counted and stored in book1, an object of type DocTerms.
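
Because terms is a Map, individual counts can be looked up directly. A small usage example, assuming the sample counts printed above:

// dictionary-style lookup on the collected Map
val n = book1.terms.getOrElse("napoleon;", 0)
println("napoleon; occurs " + n + " times")  // prints 2 with the sample data above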

Approach 2

Alternatively, the terms part of DocTerms can be of type List[(String, Int)] and sorted (at some computational cost) before being returned, so that the most frequent terms appear first. That means it will no longer be a Map or a fast-lookup dictionary, but for some uses a list-like type may be preferable.

import scala.collection.immutable.WrappedString

// naive tokenizer: lowercase the line and split on spaces
def tokenize(line: String): Array[String] =
    new WrappedString(line).toLowerCase().split(' ')

case class DocTerms(filename: String, terms: List[(String, Int)])

def study(filename: String): DocTerms = {
    val counts = (sc
        .textFile(filename)
        .flatMap(tokenize)
        .map( (s: String) => (s, 1) )
        .reduceByKey( _ + _ )
        // sort by descending count so the most frequent terms come first
        .sortBy[Int]( (pair: (String, Int)) => -pair._2 )
        .collect()
        )
    DocTerms(filename, counts.toList)
}

val book1 = study("/data/warandpeace.txt")

for(c<-book1.terms.slice(1,100)) println(c)

Output:

(and,21403)
(to,16502)
(of,14903)
(,13598)
(a,10413)
(he,9296)
(in,8607)
(his,7932)
(that,7417)
(was,7202)
(with,5649)
(had,5334)
(at,4495)
(not,4481)
(her,3963)
(as,3913)
(it,3898)
(on,3666)
(but,3619)
(for,3390)
(i,3226)
(she,3225)
(is,3037)
(him,2733)
(you,2681)
(from,2661)
(all,2495)
(said,2406)
(were,2356)
(by,2354)
(be,2316)
(they,2062)
(who,1939)
(what,1935)
(which,1932)
(have,1931)
(one,1855)
(this,1836)
(prince,1705)
(an,1617)
(so,1577)
(or,1553)
(been,1460)
(their,1435)
(did,1428)
(when,1426)
(would,1339)
(up,1296)
(pierre,1261)
(only,1250)
(are,1183)
(if,1165)
(my,1135)
(could,1095)
(there,1094)
(no,1057)
(out,1048)
(into,998)
(now,957)
(will,954)
(them,942)
(more,939)
(about,919)
(went,846)
(how,841)
(we,838)
(some,826)
(him.,826)
(after,817)
(do,814)
(man,778)
(old,773)
(your,765)
(very,762)
("i,755)
(chapter,731)
(princess,726)
(him,,716)
(then,706)
(andrew,700)
(like,691)
(himself,687)
(natasha,683)
(has,677)
(french,671)
(without,665)
(came,662)
(before,658)
(me,657)
(began,654)
(looked,642)
(time,641)
(those,639)
(know,623)
(still,612)
(our,610)
(face,609)
(thought,608)
(see,605)

You might notice that the most common words are not very interesting, but we also get words like "prince", "princess", "andrew", "natasha", and "french", which are much more specific to War and Peace.

Once you have a collection of documents, people often use TF-IDF, or "term frequency-inverse document frequency", to down-weight common words: each term's count is essentially divided by the number of documents in the corpus in which it appears (or by some similar function involving a log). But that is a topic for another question.
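
If you do go that route, Spark's MLlib ships HashingTF and IDF utilities for this weighting. A minimal sketch, assuming a directory of plain-text files (the path is a placeholder) and the simple tokenize function from above:

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// one tokenized document per corpus file; "/data/corpus" is a placeholder path
val documents: RDD[Seq[String]] =
    sc.wholeTextFiles("/data/corpus").map { case (_, text) => tokenize(text).toSeq }

// hash each term into a fixed-size feature space and count occurrences (raw term frequency)
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()

// fit inverse document frequencies over the corpus, then rescale the term counts
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)

Note that HashingTF maps terms to hashed indices, so you trade the readable term strings for a fully distributed pipeline.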