Question

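I am doing text analysis in R. I used the 'readtext' function to extract text from a PDF but, as you can imagine, the result is quite messy. I have been using 'gsub' to replace text for various purposes. The overall goal is to use one delimiter, ' %%%%% ', to split the records into rows, and another delimiter, '@', to split each record into columns. I have accomplished the first, but I do not know how to do the second. A sample of the data in the data frame follows: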

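895 "The ambulatory case-mix development project\n@Published:: June 6, 1994@Authors: Baker A, Honigfeld S, Lieberman R, Tucker AM, Weiner JP@Country: United States @Journal:Project final report. Baltimore, MD, USA: Johns Hopkins University and Aetna Health Plans. Johns Hopkins\nUniversity and Aetna Health Plans, USA As the US [..."

896 "Ambulatory Care Groups: an evaluation for military health care use@Published:: June 6, 1994@Authors: Bolling DR, Georgoulakis JM, Guillen AC@Country: United States @Journal:Fort Sam Houston, TX, USA: United States Army Center for Healthcare Education and Studies, publication #HR 94-\n004. United States Army Center for Healthcare Education and [...]@URL: http://oai.dtic.mil/oai/oai?verb=getRecord&metadataPrefix=html&identifier=ADA27804"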

I would like to take this data and split @Published, @Authors, @Journal, and @URL into columns - c("Published", "Authors", "Journal", "URL").

Any suggestions?

Thanks in advance!

Answer 1

This seems to work OK:

dfr <- data.frame(TEXT=c("The ambulatory case-mix development project\n@Published:: June 6, 1994@Authors: Baker A, Honigfeld S, Lieberman R, Tucker AM, Weiner JP@Country: United States @Journal:Project final report. Baltimore, MD, USA: Johns Hopkins University and Aetna Health Plans. Johns Hopkins\nUniversity and Aetna Health Plans, USA As the US […",
"Ambulatory Care Groups: an evaluation for military health care use@Published:: June 6, 1994@Authors: Bolling DR, Georgoulakis JM, Guillen AC@Country: United States @Journal:Fort Sam Houston, TX, USA: United States Army Center for Healthcare Education and Studies, publication #HR 94-\n004. United States Army Center for Healthcare Education and […]@URL: http://oai.dtic.mil/oai/oai?verb=getRecord&metadataPrefix=html&identifier=ADA27804"),
stringsAsFactors = FALSE)

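library(magrittr)  # provides the %>% pipe

# Split each record at the field markers, row-bind the pieces into a matrix,
# then convert it to a data frame and name the columns.
# Note: "@URL:" is not in the pattern, so a trailing URL stays in the Journal column.
do.call(rbind, strsplit(dfr$TEXT, "@Published::|@Authors:|@Country:|@Journal:")) %>%
  as.data.frame %>%
  setNames(nm = c("Preamble","Published","Authors","Country","Journal"))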

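Basically, split the text at any of the four field markers (note the double :: after Published!), row-bind the results, convert to a data frame, and set the column names.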
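One caveat with splitting and row-binding: it relies on every record having the same fields in the same order, and the sample shows that @URL appears in only some records. A possible alternative, sketched below in base R, is to extract each field by name so that a missing field becomes NA. The helper names `extract_field` and `parse_records` are hypothetical, the field list is taken from the two sample records (shown here in abbreviated form), and the spacing tolerance in the regular expressions is an assumption about how messy the real data is.

```r
# Field names taken from the sample records; adjust to match the real data.
fields <- c("Published", "Authors", "Country", "Journal", "URL")

# Regex that matches any field marker, e.g. "@Authors:" or "@Published::".
marker <- paste0("@\\s*(?:", paste(fields, collapse = "|"), ")\\s*:")

# Hypothetical helper: return the text that follows "@<field>:" up to the
# next field marker (or the end of the string), or NA if the field is absent.
extract_field <- function(text, field) {
  pattern <- paste0("(?s)@\\s*", field, "\\s*:+\\s*(.*?)\\s*(?=", marker, "|$)")
  hits <- regmatches(text, regexec(pattern, text, perl = TRUE))
  vapply(hits, function(h) if (length(h) >= 2) h[2] else NA_character_,
         character(1))
}

# Hypothetical helper: one row per record, one column per field.
parse_records <- function(text) {
  out <- data.frame(
    # Everything before the first field marker is treated as the preamble.
    Preamble = trimws(sub(paste0("(?s)", marker, ".*$"), "", text, perl = TRUE)),
    stringsAsFactors = FALSE
  )
  for (f in fields) out[[f]] <- extract_field(text, f)
  out
}

# Abbreviated versions of the two sample records.
txt <- c(
  "The ambulatory case-mix development project\n@Published:: June 6, 1994@Authors: Baker A, Honigfeld S@Country: United States @Journal:Project final report.",
  "Ambulatory Care Groups: an evaluation for military health care use@Published:: June 6, 1994@Authors: Bolling DR, Georgoulakis JM, Guillen AC@Country: United States @Journal:Fort Sam Houston, TX, USA@URL: http://oai.dtic.mil/oai/oai?verb=getRecord&metadataPrefix=html&identifier=ADA27804"
)
parsed <- parse_records(txt)
```

With this approach the first record gets NA in the URL column instead of shifting or recycling values, and leading or trailing whitespace around each field is dropped. It would be applied to the real data as `parse_records(dfr$TEXT)`.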
