Extract semi-structured text from Word documents

Date: 2013-04-19 13:37:32

Tags: r text-mining tm

I want to do text mining on a set of files based on the form below. I can create a corpus in which each file is a document (using tm), but I think it might be better to build a corpus in which each section of the second table on the form is a document with the following metadata:

  Author       : John Smith
  DateTimeStamp: 2013-04-18 16:53:31
  Description  : 
  Heading      : Current Focus
  ID           : Smith-John_e.doc Current Focus
  Language     : en_CA
  Origin       : Smith-John_e.doc
  Name         : John Smith
  Title        : Manager
  TeamMembers  : Joe Blow, John Doe
  GroupLeader  : She who must be obeyed 

where Name, Title, TeamMembers and GroupLeader are extracted from the first table on the form. This way, each chunk of text to be analysed retains some of its context.
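For concreteness, here is a rough sketch of the kind of structure I have in mind (illustrative only: the values are made up, and the exact meta() signature varies a little between tm versions):

library(tm)

# sketch: one section of the form becomes one document, with the fields from
# the first table attached as metadata (all values invented for illustration)
sectionDocs <- Corpus(VectorSource(
  "Lorem ipsum dolor sit amet, consectetur adipiscing elit."))

meta(sectionDocs[[1]], tag = "Heading")     <- "Current Focus"
meta(sectionDocs[[1]], tag = "Name")        <- "John Smith"
meta(sectionDocs[[1]], tag = "Title")       <- "Manager"
meta(sectionDocs[[1]], tag = "TeamMembers") <- "Joe Blow, John Doe"
meta(sectionDocs[[1]], tag = "GroupLeader") <- "She who must be obeyed"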

What is the best way to go about this? I can think of two approaches:

  • Somehow parse the corpus I already have into sub-corpora.
  • Somehow parse the documents into sub-documents and build a corpus from those.

Any pointers would be much appreciated.

Here is the form: HR form

Here is an RData file of a corpus containing the two documents. exc[[1]] comes from the .doc and exc[[2]] from the .docx. They both use the form above.
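It can be loaded along these lines (the file name below is only a placeholder for the attachment):

library(tm)
load("exc.RData")  # placeholder name for the attached RData file; restores the corpus object 'exc'
summary(exc)       # should show a corpus with 2 text documents
exc[[1]]           # the form read from the .doc
exc[[2]]           # the form read from the converted .docx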

1 Answer:

Answer 0 (score: 2):

Here's a quick sketch of a method, in the hope that it might provoke someone more talented to stop by and suggest something more efficient and robust... Using the RData file in your question, I found that the doc and docx files have slightly different structures and so need slightly different approaches (though I see in the metadata that your docx is named 'fake2.txt', so is it really a docx? I see in your other question that you used a converter outside of R, which must be why it's a txt).

library(tm)

First, get the custom metadata for the doc file. As you can see I'm no regex expert, but roughly it is "get rid of trailing and leading spaces", then "get rid of the label word", then get rid of the punctuation...

# create User-defined local meta data pairs
meta(exc[[1]], type = "corpus", tag = "Name1") <- gsub("^\\s+|\\s+$","", gsub("Name", "", gsub("[[:punct:]]", '', exc[[1]][3])))
meta(exc[[1]], type = "corpus", tag = "Title") <- gsub("^\\s+|\\s+$","", gsub("Title", "", gsub("[[:punct:]]", '', exc[[1]][4])))
meta(exc[[1]], type = "corpus", tag = "TeamMembers") <- gsub("^\\s+|\\s+$","", gsub("Team Members", "", gsub("[[:punct:]]", '', exc[[1]][5])))
meta(exc[[1]], type = "corpus", tag = "ManagerName") <- gsub("^\\s+|\\s+$","", gsub("Name of your", "", gsub("[[:punct:]]", '', exc[[1]][7])))

Now have a look at the result:

# inspect
meta(exc[[1]], type = "corpus")
Available meta data pairs are:
  Author       : 
  DateTimeStamp: 2013-04-22 13:59:28
  Description  : 
  Heading      : 
  ID           : fake1.doc
  Language     : en_CA
  Origin       : 
User-defined local meta data pairs are:
$Name1
[1] "John Doe"

$Title
[1] "Manager"

$TeamMembers
[1] "Elise Patton Jeffrey Barnabas"

$ManagerName
[1] "Selma Furtgenstein"

Do the same for the docx file:


# create User-defined local meta data pairs
meta(exc[[2]], type = "corpus", tag = "Name2") <- gsub("^\\s+|\\s+$","", gsub("Name", "", gsub("[[:punct:]]", '', exc[[2]][2])))
meta(exc[[2]], type = "corpus", tag = "Title") <- gsub("^\\s+|\\s+$","", gsub("Title", "", gsub("[[:punct:]]", '', exc[[2]][4])))
meta(exc[[2]], type = "corpus", tag = "TeamMembers") <- gsub("^\\s+|\\s+$","", gsub("Team Members", "", gsub("[[:punct:]]", '', exc[[2]][6])))
meta(exc[[2]], type = "corpus", tag = "ManagerName") <- gsub("^\\s+|\\s+$","", gsub("Name of your", "", gsub("[[:punct:]]", '', exc[[2]][8])))

And have a look:

# inspect
meta(exc[[2]], type = "corpus")
Available meta data pairs are:
  Author       : 
  DateTimeStamp: 2013-04-22 14:06:10
  Description  : 
  Heading      : 
  ID           : fake2.txt
  Language     : en
  Origin       : 
User-defined local meta data pairs are:
$Name2
[1] "Joe Blow"

$Title
[1] "Shift Lead"

$TeamMembers
[1] "Melanie Baumgartner Toby Morrison"

$ManagerName
[1] "Selma Furtgenstein"

If you have a whole bunch of documents, then an lapply function that wraps these meta() calls would be the way to go.
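For example, something along these lines (an untested sketch, assuming every form has its fields on the same lines as exc[[1]] above; the line numbers would need adjusting for other layouts):

# sketch: pull the four fields out of one document and return them as one row
getFields <- function(doc) {
  clean <- function(x, label)
    gsub("^\\s+|\\s+$", "", gsub(label, "", gsub("[[:punct:]]", "", x)))
  data.frame(Name        = clean(doc[3], "Name"),
             Title       = clean(doc[4], "Title"),
             TeamMembers = clean(doc[5], "Team Members"),
             ManagerName = clean(doc[7], "Name of your"),
             stringsAsFactors = FALSE)
}

# one row of extracted metadata per document
allMeta <- do.call(rbind, lapply(exc, getFields))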

Now that we've got the custom metadata, we can subset the documents to leave out the part of each document that is now stored in the metadata:


# create a new corpus that excludes the part of each doc that is now in the
# metadata; square-bracket indexing selects the lines that make up the second
# table of the form (slightly different for each doc type)
excBody <- Corpus(VectorSource(c(paste(exc[[1]][13:length(exc[[1]])], collapse = ","), 
                      paste(exc[[2]][9:length(exc[[2]])], collapse = ","))))
# get rid of all the white spaces
excBody <- tm_map(excBody, stripWhitespace)

Have a look:

inspect(excBody)
A corpus with 2 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator 
Available variables in the data frame are:
  MetaID 

[[1]]
|CURRENT RESEARCH FOCUS |,| |,|Lorem ipsum dolor sit amet, consectetur adipiscing elit. |,|Donec at ipsum est, vel ullamcorper enim. |,|In vel dui massa, eget egestas libero. |,|Phasellus facilisis cursus nisi, gravida convallis velit ornare a. |,|MAIN AREAS OF EXPERTISE |,|Vestibulum aliquet faucibus tortor, sed aliquet purus elementum vel. |,|In sit amet ante non turpis elementum porttitor. |,|TECHNOLOGY PLATFORMS, INSTRUMENTATION EMPLOYED |,| Vestibulum sed turpis id nulla eleifend fermentum. |,|Nunc sit amet elit eu neque tincidunt aliquet eu at risus. |,|Cras tempor ipsum justo, ut blandit lacus. |,|INDUSTRY PARTNERS (WITHIN THE PAST FIVE YEARS) |,| Pellentesque facilisis nisl in libero scelerisque mattis eu quis odio. |,|Etiam a justo vel sapien rhoncus interdum. |,|ANTICIPATED PARTICIPATION IN PROGRAMS, EITHER APPROVED OR UNDER DEVELOPMENT |,|(Please include anticipated percentages of your time.) |,| Proin vitae ligula quis enim vulputate sagittis vitae ut ante. |,|ADDITIONAL ROLES, DISTINCTIONS, ACADEMIC QUALIFICATIONS AND NOTES |,|e.g., First Aid Responder, Other languages spoken, Degrees, Charitable Campaign |,|Canvasser (GCWCC), OSH representative, Social Committee |,|Sed nec tellus nec massa accumsan faucibus non imperdiet nibh. |,,

[[2]]
CURRENT RESEARCH FOCUS,,* Lorem ipsum dolor sit amet, consectetur adipiscing elit.,* Donec at ipsum est, vel ullamcorper enim.,* In vel dui massa, eget egestas libero.,* Phasellus facilisis cursus nisi, gravida convallis velit ornare a.,MAIN AREAS OF EXPERTISE,* Vestibulum aliquet faucibus tortor, sed aliquet purus elementum vel.,* In sit amet ante non turpis elementum porttitor. ,TECHNOLOGY PLATFORMS, INSTRUMENTATION EMPLOYED,* Vestibulum sed turpis id nulla eleifend fermentum.,* Nunc sit amet elit eu neque tincidunt aliquet eu at risus.,* Cras tempor ipsum justo, ut blandit lacus.,INDUSTRY PARTNERS (WITHIN THE PAST FIVE YEARS),* Pellentesque facilisis nisl in libero scelerisque mattis eu quis odio.,* Etiam a justo vel sapien rhoncus interdum.,ANTICIPATED PARTICIPATION IN PROGRAMS, EITHER APPROVED OR UNDER DEVELOPMENT ,(Please include anticipated percentages of your time.),* Proin vitae ligula quis enim vulputate sagittis vitae ut ante.,ADDITIONAL ROLES, DISTINCTIONS, ACADEMIC QUALIFICATIONS AND NOTES,e.g., First Aid Responder, Other languages spoken, Degrees, Charitable Campaign Canvasser (GCWCC), OSH representative, Social Committee,* Sed nec tellus nec massa accumsan faucibus non imperdiet nibh.,,

Now the documents are ready for text mining, with the data from the upper table moved out of the documents and into the document metadata.

Of course, all of this depends on the documents being highly regular. If the number of lines in the first table differs from document to document, then the simple indexing method will probably fail (give it a try and see what happens) and something more robust will be needed.

UPDATE: a more robust method

Having read the question a little more carefully, and having got a bit more education about regex, here is a method that is more robust and doesn't depend on indexing specific lines of the documents. Instead, it uses regular expressions to extract the text between two words in order to build the metadata and to split the document.

Here's how to make the User-defined local metadata (a method that replaces the one above):

library(gdata) # for the trim function
txt <- paste0(as.character(exc[[1]]), collapse = ",")

# inspect the document to identify the words on either side of the string
# we want, so 'Name' and 'Title' are on either side of 'John Doe'
extract <- regmatches(txt, gregexpr("(?<=Name).*?(?=Title)", txt, perl=TRUE))
meta(exc[[1]], type = "corpus", tag = "Name1") <- trim(gsub("[[:punct:]]", "", extract))

extract <- regmatches(txt, gregexpr("(?<=Title).*?(?=Team)", txt, perl=TRUE))
meta(exc[[1]], type = "corpus", tag = "Title") <- trim(gsub("[[:punct:]]","", extract))

extract <- regmatches(txt, gregexpr("(?<=Members).*?(?=Supervised)", txt, perl=TRUE))
meta(exc[[1]], type = "corpus", tag = "TeamMembers") <- trim(gsub("[[:punct:]]","", extract))

extract <- regmatches(txt, gregexpr("(?<=your).*?(?=Supervisor)", txt,  perl=TRUE))
meta(exc[[1]], type = "corpus", tag = "ManagerName") <- trim(gsub("[[:punct:]]","", extract))

# inspect
meta(exc[[1]], type = "corpus")

Available meta data pairs are:
  Author       : 
  DateTimeStamp: 2013-04-22 13:59:28
  Description  : 
  Heading      : 
  ID           : fake1.doc
  Language     : en_CA
  Origin       : 
User-defined local meta data pairs are:
$Name1
[1] "John Doe"

$Title
[1] "Manager"

$TeamMembers
[1] "Elise Patton Jeffrey Barnabas"

$ManagerName
[1] "Selma Furtgenstein"

Similarly, we can get the sections of the second table into separate vectors (a rough sketch of that step follows below), and then you can turn them into documents and a corpus, or just work with them as vectors.

And so on. I hope that gets you a bit closer to what you're after. If not, it may be best to break your task down into a set of smaller, more focused questions and ask them separately (or wait for one of the gurus to stop by this question!).
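Roughly like this (an untested sketch, using the section headings visible in the inspect(excBody) output above as the boundary words):

txt <- paste0(as.character(exc[[1]]), collapse = ",")

# the CURRENT RESEARCH FOCUS text sits between the heading ending in 'FOCUS'
# and the one starting with 'MAIN' (i.e. MAIN AREAS OF EXPERTISE)
focus <- trim(regmatches(txt, gregexpr("(?<=FOCUS).*?(?=MAIN)", txt, perl = TRUE))[[1]])

# the MAIN AREAS OF EXPERTISE text sits between 'EXPERTISE' and 'TECHNOLOGY'
expertise <- trim(regmatches(txt, gregexpr("(?<=EXPERTISE).*?(?=TECHNOLOGY)", txt, perl = TRUE))[[1]])

# each extracted section could then become its own document in a corpus,
# with the custom metadata from above attached to it
sections <- Corpus(VectorSource(c(focus, expertise)))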