Question

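I am trying to remove repeated groups of words in SAS. Basically, I want to drop any phrase that occurs more than once in a string. The forward slash is the delimiter between phrases. I am using SAS 9.4 and have the following example: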

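I tried the regular expression shown in the code below. It works for 'Lymph node pain/Lymph node pain/Pain in extremity', where the result is 'Lymph node pain/Pain in extremity'. However, it does not work for 'Lymph node pain/Pain in extremity/Pain in extremity' or for 'Lymph node pain/Neuralgia/Neuralgia'. I am not sure why.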

data have;
  string = 'Lymph node pain/Pain in extremity/Pain in extremity';output;
  string = 'Lymph node pain/Lymph node pain/Pain in extremity'; output;
  string = 'Lymph node pain/Neuralgia/Neuralgia'; output;
run;

data test;
  set have;
  _1 = prxparse('s/([A-Za-z].+?\s.*?\/.*?)(.*?)(\1+)/\2\3/i');
  _2 = prxparse('/([A-Za-z].+?\s.*?\/.*?)(.*?)(\1+)/i');
  do i = 1 to 10;
    string = prxchange(_1, -1, strip(string));
    if not prxmatch(_2, strip(string)) then leave;
  end;
  drop i;
run;

Thanks for your help.

Answer 1

Here is a scan-based approach. I have assumed that each string contains at most 3 phrases, but it can easily be adjusted to handle any number of phrases.

data have;
  string = 'Lymph node pain/Pain in extremity/Pain in extremity';output;
  string = 'Lymph node pain/Lymph node pain/Pain in extremity'; output;
  string = 'Lymph node pain/Neuralgia/Neuralgia'; output;
  string = 'Neuralgia/Lymph node pain/Neuralgia'; output;  /*Added A/B/A example*/
run;

data test;
  set have;
  array phrases[3] $32;
  /*Separate string into an array of phrases delimited by / */
  do i = 1 to dim(phrases);
    phrases[i] = scan(string,i,'/');
  end;
  /*Sort the array so that duplicate phrases are next to each other*/
  call sortc(of phrases[*]);
  /*Iterate through the array and build up an output string of non-duplicates*/
  length outstring $255;
  do i = 1 to dim(phrases);
    if i = 1 then outstring = phrases[1];
    else if phrases[i] ne phrases[i-1] then outstring = catx('/',outstring,phrases[i]);
  end;
  keep string outstring;
run;

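This has the side effect of sorting the phrases into alphabetical order, rather than keeping the order in which they first appear in the string.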
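If the original order matters, one possible variation (a minimal sketch, not part of the answer above) is to skip the sort and instead compare each phrase against the phrases that come before it in the same string. The dataset name test_ordered and the variables phrase, is_dup and j are illustrative names introduced here; countw is used so that the number of phrases does not need to be fixed at 3.

```sas
/* Same sample data as above, repeated so the sketch runs on its own */
data have;
  string = 'Lymph node pain/Pain in extremity/Pain in extremity'; output;
  string = 'Lymph node pain/Lymph node pain/Pain in extremity'; output;
  string = 'Lymph node pain/Neuralgia/Neuralgia'; output;
  string = 'Neuralgia/Lymph node pain/Neuralgia'; output;
run;

/* Hypothetical order-preserving variant: no array, no sort */
data test_ordered;
  set have;
  length outstring phrase $255;
  outstring = '';
  /* countw with '/' as the only delimiter gives the number of phrases */
  do i = 1 to countw(string, '/');
    phrase = scan(string, i, '/');
    /* Flag the phrase if any earlier phrase in the same string equals it */
    is_dup = 0;
    do j = 1 to i - 1;
      if phrase = scan(string, j, '/') then is_dup = 1;
    end;
    /* Append only the first occurrence of each phrase */
    if not is_dup then outstring = catx('/', outstring, phrase);
  end;
  keep string outstring;
run;
```

With the four sample rows this should give 'Lymph node pain/Pain in extremity' for the first two, 'Lymph node pain/Neuralgia' for the third, and 'Neuralgia/Lymph node pain' for the A/B/A row. The comparison is case-sensitive and exact, so phrases that differ in case or internal spacing are treated as distinct.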

Using regex in SAS to match and remove repeated groups of words
