Extracting titles from multiple lines

Time: 2019-11-07 15:07:09

Tags: r extraction

I have multiple files, each with a different title, and I want to extract the title from each file. Here is an example of one file:

//Basic
function x() {
  var promise = new Promise(function(resolve, reject) {
    setTimeout(function() {
      resolve("done!");
    });
  });
  return promise;
}
async function y() {
  var y = await x();
  console.log("y", y);
}
y();

//Implementation
var azure = require('azure-storage');
const fs = require('fs');
var fileService = azure.createFileService('microsoftdata');
var test = new Promise(function(resolve, reject) {
  fileService.getFileToStream('sharename', '', filename, fs.createWriteStream(filename), async function(error, result, response) {
    if (!error) {
      console.log('result ' + JSON.stringify(result, null, 4));
      var bitmap = await fs.readFileSync(filename);
      resolve(bitmap.toString('base64'));
    } else {
      console.log('error - ' + JSON.stringify(error, null, 4));
    }

  });
});
console.log('test - ' + test);

The expected extracted titles are

[1] "<START"                        "ID=\"CMP-001\""                  "NO=\"1\">"                         
[4] "<NAME>Plasma-derived"          "vaccine"                         "(PDV)"                             
[7] "versus"                        "placebo"                         "by"                                
[10] "intramuscular"                "route</NAME>"                    "<DIC"                     
[13] "CHI2=\"3.6385\""              "CI_END=\"0.6042\""               "CI_START=\"0.3425\""   
[16] "CI_STUDY=\"95\""                "CI_TOTAL=\"95\""               "DF=\"3.0\""                        
[19] "TOTAL_1=\"0.6648\""           "TOTAL_2=\"0.50487622\""           "BLE=\"YES\"" 
.
.
.
 [789] "TOTAL_2=\"39\""             "WEIGHT=\"300.0\""              "Z=\"1.5443\">"    
 [792] "<NAME>Local"                "adverse"                       "events" 
 [795] "after"                      "each"                          "injection"
 [798] "of"                         "vaccine</NAME>"               "<GROUP_LABEL_1>PDV</GROUP_LABEL_1>"
 [801] "</GROUP_LABEL_2>"           "<GRAPH_LABEL_1>"              "PDV</GRAPH_LABEL_1>"

Note that the title length differs from file to file.

1 answer:

Answer 0: (score: 0)

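Here is a solution using stringr. It first collapses the vector into one long string, then captures all words/characters that are not a newline character (\n) between each pair of "<NAME>" and "</NAME>". In the future, people will be able to help you more easily if you create a reproducible example (e.g., using dput()). Hope this helps!

Note: if you only need the first title, you can use str_match() instead of str_match_all().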

library(stringr)

str_match_all(paste0(string, collapse = " "), "<NAME>(.*?)</NAME>")[[1]][,2]
[1] "Plasma-derived vaccine (PDV) versus placebo by intramuscular route"
[2] "Local adverse events after each injection of vaccine" 
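
For the first-title-only case mentioned in the note, a minimal sketch could look like the following. The `tokens` vector is a shortened, made-up stand-in for the real data, and the names `tokens`, `collapsed`, and `first_title` are illustrative only.

```r
library(stringr)

# Shortened stand-in for one file's tokens (assumed, not the full data set)
tokens <- c("<START", "<NAME>Plasma-derived", "vaccine", "(PDV)", "route</NAME>",
            "<DIC", "<NAME>Local", "adverse", "events</NAME>")

# Collapse to one long string so a title spanning several elements can match
collapsed <- paste0(tokens, collapse = " ")

# str_match() stops at the first match; column 2 holds the captured group
first_title <- str_match(collapsed, "<NAME>(.*?)</NAME>")[, 2]
first_title
```

Because str_match() returns a single-row matrix here, no `[[1]]` step is needed, unlike with str_match_all().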

Data

string <- c("<START", "ID=\"CMP-001\"", "NO=\"1\">", "<NAME>Plasma-derived", "vaccine", "(PDV)", "versus", "placebo", "by", "intramuscular", "route</NAME>", "<DIC", "CHI2=\"3.6385\"", "CI_END=\"0.6042\"", "CI_START=\"0.3425\"", "CI_STUDY=\"95\"", "CI_TOTAL=\"95\"", "DF=\"3.0\"", "TOTAL_1=\"0.6648\"", "TOTAL_2=\"0.50487622\"", "BLE=\"YES\"",
            "TOTAL_2=\"39\"", "WEIGHT=\"300.0\"", "Z=\"1.5443\">", "<NAME>Local", "adverse", "events", "after", "each", "injection", "of", "vaccine</NAME>", "<GROUP_LABEL_1>PDV</GROUP_LABEL_1>", "</GROUP_LABEL_2>", "<GRAPH_LABEL_1>", "PDV</GRAPH_LABEL_1>")
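
Since the question involves several files, the same idea could be wrapped in a small function and applied to each one. This is only a sketch: `extract_titles` is a hypothetical helper, and the `files` list is made-up data that assumes each file has already been read into a character vector of tokens like `string` above.

```r
library(stringr)

# Hypothetical helper: collapse one file's tokens and return every title in it
extract_titles <- function(tokens) {
  collapsed <- paste(tokens, collapse = " ")
  str_match_all(collapsed, "<NAME>(.*?)</NAME>")[[1]][, 2]
}

# Made-up stand-ins for two files with titles of different lengths
files <- list(
  file_a = c("<START", "<NAME>First", "title</NAME>", "<DIC"),
  file_b = c("<NAME>Second", "much", "longer", "title</NAME>", "<NAME>Third</NAME>")
)

# One character vector of titles per file; the list names are preserved
titles <- lapply(files, extract_titles)
titles
```

A file without any <NAME> pair would simply yield an empty character vector, so the result can be checked per file with length().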