Question

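I have a table with 100k+ rows.

I am trying to remove all instances of several substrings from one of its fields.

The approaches I have found so far all boil down to calling tranwrd once for each offending substring.

There are three of them in the example below, but the real data set has many more.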

data mytable;
infile datalines delimiter=':' truncover;
informat myfield $50. someval 3.;
input myfield someval;
datalines;
some value xx abc:10
another values:15
random stuff ccc:1
more stuff xx:2
example abc:44
foo abc bar:55
sub xx string:11
;
run;

proc sql;
    update mytable set myfield = strip(tranwrd(myfield,'abc',''));

    update mytable set myfield = strip(tranwrd(myfield,'ccc',''));

    update mytable set myfield = strip(tranwrd(myfield,'xx',''));
quit;

Can the same thing be done in a single statement?

That is, given the full list of strings to remove, remove them all.

Something like:

update mytable
set myfield = somefunction(myfield,/'abc','ccc','xx'/,'')

Thanks

After further discussion, the following came up:

data mytable2;
set mytable;
n_myfield = myfield;
length word $50;
do word = 'abc','ccc','xx';
    n_myfield = tranwrd(n_myfield,trim(word),''); /* trim: word is blank-padded to 50 chars, and tranwrd does not trim its target */
end;
    n_myfield = compbl(n_myfield);
drop word;
run;

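That is, without nesting (I really don't want to nest 10-15 tranwrd calls) and without multiple nearly identical update statements.

Regular expressions are something I would be happy to use.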

Answer 1

This can be done with a regular expression by joining the strings with the pipe character |, which means "OR" in a regular expression:

myField = prxChange("s/abc|ccc|xx//",-1,trim(myField));
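
For the PROC SQL update in the question, this might look like the sketch below. It assumes that none of the substrings contains a regex metacharacter; compbl and strip are only there to tidy up the blanks left behind and are not part of the technique itself.

```sas
proc sql;
    /* -1 tells prxchange to replace every match, not just the first one */
    update mytable
        set myfield = strip(compbl(prxchange('s/abc|ccc|xx//', -1, myfield)));
quit;
```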

This is especially handy if you want to remove whole words only, which is not easy with tranwrd. In that case, just change the regular expression to:

myField = prxChange("s/\b(abc|ccc|xx)\b\s?//",-1,trim(myField));

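where \b marks a word boundary (the position next to any character that is not a letter, digit, or underscore) and the \s? part takes care of a possible extra space after the removed word. It changes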

"abcd abc ccc xx abccccxx"

to

"abcd abccccxx"

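However, as Nickolay pointed out, this only works until your strings contain metacharacters: {}[]()^$@.|*+?\ or the character used to delimit the regular expression, which is / in this example (you can change s/.../.../ to s#...#...# or s$...$...$ if you prefer). When they do, you can escape them manually with \. For example, the following line removes the strings "abc", "$c\c", and "x.":

myField = prxChange("s/abc|\$c\\c|x\.//",-1,trim(myField));

Or first apply another regular expression that escapes all the special characters: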

length wordsToRemove $200;
retain wordsToRemove 'abc|ccc|xx|$pec@lW()rd';

if _n_ eq 1 then do;
    * This does not change, so set it once;
    wordsToRemove=prxChange('s/([\Q{}[]()^.*+?\E\$\@\\\/])/\\$1/', -1,strip(wordsToRemove));
end;
myField=prxChange('s/'|| strip(wordsToRemove) || '//', -1, trim(myField));
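
With many more substrings, the alternation does not have to be typed by hand. The sketch below is one possible way to build it from a lookup table. The names words, word, pattern and mytable2 are made up for illustration, and it assumes the substrings contain no regex metacharacters and no & or % (which the macro processor would try to resolve inside double quotes).

```sas
/* Hypothetical lookup table: one substring to remove per row */
data words;
    length word $20;
    input word;
datalines;
abc
ccc
xx
;
run;

/* Join the substrings with | into a single macro variable */
proc sql noprint;
    select strip(word) into :pattern separated by '|' from words;
quit;

data mytable2;
    set mytable;
    length n_myfield $50;
    /* Double quotes so that &pattern resolves to abc|ccc|xx */
    n_myfield = strip(compbl(prxchange("s/&pattern//", -1, trim(myfield))));
run;
```

If the substrings may contain special characters, the escaping step shown above would have to be applied to each word before joining.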

Answer 2

You can read each word and decide whether to keep or discard it.

data mytable;
  input someval myfield $50. ;
datalines;
10 some value xx abc
15 another values
1 random stuff ccc
2 more stuff xx
44 example abc
55 foo abc bar
11 sub xx string
;

data want ;
  set mytable ;
  length i 8 word n_myfield $50 ;
  drop i word ;
  do i =1 to countw(myfield,' ');
    word=scan(myfield,i,' ');
    if not findw('abc ccc xx',trim(word),' ') then n_myfield=catx(' ',n_myfield,word);
  end;
run;

Result:

Obs    someval    myfield              n_myfield

 1        10      some value xx abc    some value
 2        15      another values       another values
 3         1      random stuff ccc     random stuff
 4         2      more stuff xx        more stuff
 5        44      example abc          example
 6        55      foo abc bar          foo bar
 7        11      sub xx string        sub string

Answer 3

You can do the same thing with a regular expression:

data mytable;
    modify mytable;
    myfield = strip(prxchange('s/abc|ccc|xx//',-1,myfield));
    replace;
run;

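For reasons that are not clear to me, this leaves 2 spaces when a matched word between two other words is removed, whereas the original code leaves 3. However, I suspect this doesn't matter for your purposes.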
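
A likely explanation for the spacing difference, offered as a reading of the TRANWRD documentation rather than something this answer states: in a DATA step or PROC SQL, TRANWRD cannot substitute a zero-length string, so '' is treated as a single blank, while the regex substitution really inserts nothing. TRANSTRN with trimn('') as the replacement is the documented way to get a true zero-length replacement. A small sketch to compare the three (variable names are arbitrary):

```sas
data _null_;
    length s $20;
    s = 'foo abc bar';
    t1 = tranwrd(s, 'abc', '');           /* replacement becomes a single blank */
    t2 = transtrn(s, 'abc', trimn(''));   /* replacement is zero-length */
    t3 = prxchange('s/abc//', -1, s);     /* the match is simply deleted */
    /* Write the three results to the log to compare the gaps */
    put t1= / t2= / t3=;
run;
```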

Answer 4

Regular expressions are a simple and effective approach, as described above. Looping is the other approach, here using an array.

data mytable2;
set mytable;
n_myfield = myfield;
array var (3) $10  _temporary_ ('abc','ccc','xx');
do i=1 to 3;
    n_myfield = tranwrd(n_myfield,strip(var(i)),'');
end;
n_myfield = compbl(n_myfield);
drop i;
run;

Answer 5

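A single pass over a set of removal targets is not necessarily enough. A robust reduction needs repeated passes until a complete pass results in no replacements.

Here is an example Proc DS2 step that defines a reusable cleaner method and uses it in a data program: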

data have;
  infile datalines delimiter=':' truncover;
  informat myfield $50. someval 3.;
  input myfield someval;
datalines;
some value xx abc:10
another values:15
random stuff ccc:1
more stuff xx:2
example abc:44
foo abc bar:55
sub xx string:11
cabxxccc what to do?:123
xacccbcx funky chicken:456
;
run;

proc DS2 libs=WORK;

  package cleaner / overwrite=yes;

    method _remove ( 
        varchar(200) haystack
      , varchar(200) needles[*]
    )
    returns varchar(200);

      declare int i L P ;
      declare int removal_count;
      declare varchar(200) needle;

      do while (length ( haystack ) > 0);
        removal_count = 0;
        do i = 1 to dim(needles);

          needle = needles[i];
          L = lengthc (needle);
          P = index (haystack, needle);

          if L > 0 and P > 0 then do;
            haystack = tranwrd(haystack,needle,'');
            removal_count + 1;
          end;
        end;

        if removal_count = 0 then leave;
      end;

      return haystack;
    end;
  endpackage;

  data want / overwrite=yes;
    declare package cleaner c();
    declare varchar(20) targets[3];
    method init ();
      targets := ('abc', 'ccc', 'xx');
      put 'NOTE: INIT:' targets[*]=;
    end;
    method run ();
      set have;
      put myfield= ;
      myfield = c._remove(myfield, targets);
      put myfield= ;
      put;
    end;
  run;
quit;

Log

NOTE: INIT: targets[1]=abc targets[2]=ccc targets[3]=xx
myfield=some value xx abc
myfield=some value

myfield=another values
myfield=another values

myfield=random stuff ccc
myfield=random stuff

myfield=more stuff xx
myfield=more stuff

myfield=example abc
myfield=example

myfield=foo abc bar
myfield=foo  bar

myfield=sub xx string
myfield=sub  string

myfield=cabxxccc what to do?
myfield=cab what to do?

myfield=xacccbcx funky chicken
myfield= funky chicken
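
If DS2 is not an option, the same repeat-until-nothing-changes idea can be sketched in an ordinary DATA step. This is an assumption-laden illustration, not the answer's own code: the names targets and before are made up, and TRANSTRN with trimn('') is used so that removed text leaves no blank behind, mirroring the DS2 behaviour shown in the log.

```sas
data want;
    set have;
    array targets (3) $10 _temporary_ ('abc', 'ccc', 'xx');
    length before $50;
    /* Keep making passes until a full pass leaves myfield unchanged */
    do until (myfield = before);
        before = myfield;
        do i = 1 to dim(targets);
            /* strip() drops the array element's padding; trimn('') is zero-length */
            myfield = transtrn(myfield, strip(targets(i)), trimn(''));
        end;
    end;
    drop i before;
run;
```

Each replacement shortens the string, so the loop is bound to stop.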
