sed: getting the string between two patterns

Time: 2016-05-04 14:49:11

Tags: bash sed

I am working on a LaTeX file, from which I need to pick out the references cited with \citep{}. This is what I am doing with sed:

    cat file.tex | grep citep | sed 's/.*citep{\(.*\)}.*/\1/g'

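This works fine if there is only one pattern on a line. If a line contains more than one pattern, i.e. more than one \citep, it fails. It also fails when there is only one pattern but more than one closing brace } on the line. What should I do so that it works for every pattern on a line, and only up to the closing brace that belongs to each one?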

I am working in bash. Part of the file looks like this:

of the Asian crust further north \citep{TapponnierM76, WangLiu2009}. This has led to widespread deformation both within and 
\citep{BilhamE01, Mitraetal2005} and by distributed seismicity across the region (Fig. \ref{fig1_2}). Recent GPS Geodetic 
across the Dawki fault and Naga Hills, increasing eastwards from $\sim$3~mm/yr to $\sim$13~mm/yr \citep{Vernantetal2014}. 
GPS velocity vectors \citep{TapponnierM76, WangLiu2009}. Sikkim Himalaya lies at the transition between this relatively simple 
this transition includes deviation of the Himalaya from a perfect arc beyond 89\deg\ longitude \citep{BendickB2001}, reduction 
\citep{BhattacharyaM2009, Mitraetal2010}. Rivers Tista, Rangit and Rangli run through Sikkim eroding the MCT and Ramgarh 
thrust to form a mushroom-shaped physiography \citep{Mukuletal2009,Mitraetal2010}. Within this sinuous physiography, 
\citep{Pauletal2015} and also in accordance with the findings of \citet{Mitraetal2005} for northeast India. In another study 
field results corroborate well with seismic studies in this region \citep{Actonetal2011, Arunetal2010}. From studies of 

For one line, I get output like this:

    BilhamE01, TapponnierM76} and by distributed seismicity across the region (Fig. \ref{fig1_2

whereas I am looking for

    BilhamE01, TapponnierM76

Another example, with multiple \citep patterns, gives output like this:

    Pauletal2015} and also in accordance with the findings of \citet{Mitraetal2005} for northeast India. In another study

whereas I am looking for

    Pauletal2015 Mitraetal2005

Can anyone help?

4 answers:

Answer 0 (score: 3)

The match is greedy. Change the regex so that it stops at the first closing brace:

.*citep{\([^}]*\)}

Note that it will only match one instance per line.
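
As a quick check of the difference, here is a minimal sketch that runs both expressions on one line from the question's excerpt. The variable names `line`, `greedy` and `fixed` are only illustrative.

```shell
# Sample line copied from the question's excerpt.
line='\citep{BilhamE01, Mitraetal2005} and by distributed seismicity across the region (Fig. \ref{fig1_2}). Recent GPS Geodetic'

# Greedy: \(.*\)} runs on to the last closing brace on the line.
greedy=$(printf '%s\n' "$line" | sed 's/.*citep{\(.*\)}.*/\1/')

# Fixed: [^}]* cannot cross a closing brace, so it stops at the first one.
fixed=$(printf '%s\n' "$line" | sed 's/.*citep{\([^}]*\)}.*/\1/')

printf 'greedy: %s\n' "$greedy"
printf 'fixed:  %s\n' "$fixed"
```

The greedy version keeps everything up to the brace that closes \ref{fig1_2}, while the fixed version prints only `BilhamE01, Mitraetal2005`.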

Answer 1 (score: 2)

If you are using grep anyway, you might as well stick with it (assuming GNU grep):

$ echo $str | grep -oP '(?<=\\citep{)[^}]+(?=})'
BilhamE01, TapponierM76
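
If the local grep lacks PCRE support (-P), a similar result can probably be had with plain -o plus a little sed. The sketch below is an assumption-laden variant rather than part of the answer: it accepts both \citep and \citet (as in the question's second example), splits comma-separated keys onto separate lines, and uses the illustrative variable names `text` and `keys`.

```shell
# Two sample lines from the question; line 1 contains both \citep and \citet.
text='\citep{Pauletal2015} and also in accordance with the findings of \citet{Mitraetal2005} for northeast India.
thrust to form a mushroom-shaped physiography \citep{Mukuletal2009,Mitraetal2010}. Within this sinuous physiography,'

# 1. grep -o prints each \citep{...} or \citet{...} match on its own line.
# 2. The first sed strips the command name and the surrounding braces.
# 3. tr puts each comma-separated key on its own line.
# 4. The last sed trims spaces around each key.
keys=$(printf '%s\n' "$text" |
  grep -o '\\cite[pt]{[^}]*}' |
  sed 's/^\\cite[pt]{//; s/}$//' |
  tr ',' '\n' |
  sed 's/^ *//; s/ *$//')

printf '%s\n' "$keys"
```

This prints the four keys Pauletal2015, Mitraetal2005, Mukuletal2009 and Mitraetal2010, one per line. Appending `| sort -u` would remove duplicates across a whole file.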

Answer 2 (score: 1)

For what it's worth, this can be done with sed:

echo "\citep{string} xyz {abc} \citep{string2},foo" | \
  sed 's/\\citep{\([^}]*\)}/\n\1\n\n/g; s/^[^\n]*\n//; s/\n\n[^\n]*\n/, /g; s/\n.*//g'

Output:

string, string2

But wow, that's ugly. The sed script is easier to follow in this form, which happens to be suitable for passing to sed via the -f option:

# change every \citep{string} to <newline>string<newline><newline>
s/\\citep{\([^}]*\)}/\n\1\n\n/g

# remove any leading text before the first wanted string
s/^[^\n]*\n//

# replace text between wanted strings with comma + space
s/\n\n[^\n]*\n/, /g

# remove any trailing unwanted text
s/\n.*//

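This takes advantage of the fact that sed can match and substitute newlines, even though reading a new line of input never puts a newline into the pattern space to begin with. A newline is the one character we can be sure appears in the pattern space (or the hold space) only if sed put it there on purpose.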
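
To see the newline markers for yourself, sed's `l` command prints the pattern space with escapes made visible. The sketch below applies only the first substitution to the answer's own input; the variable names `input` and `stage1` are illustrative, and GNU sed is assumed for `\n` in the replacement.

```shell
# Input string from the answer above.
input='\citep{string} xyz {abc} \citep{string2},foo'

# -n suppresses the normal output; l prints the pattern space unambiguously,
# showing embedded newlines as \n and marking the end of the line with $.
stage1=$(printf '%s\n' "$input" | sed -n 's/\\citep{\([^}]*\)}/\n\1\n\n/g; l')

printf '%s\n' "$stage1"
```

This prints `\nstring\n\n xyz {abc} \nstring2\n\n,foo$`: each wanted string now sits between a single newline and a double newline, which is what the later substitutions key on.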

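The initial substitution is there purely to make the problem manageable by simplifying the target delimiters. In principle, the remaining steps could be performed without that simplification, but the regular expressions involved would be horrendous.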

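This assumes that the string in each \citep{string} contains at least one character; if empty strings must be accommodated, the approach needs a bit more refinement.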

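Of course, I can't imagine why anyone would prefer this to @Lev's straightforward grep approach, but the question does ask specifically for a sed solution.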
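
One way to try the script form with -f is sketched below. Because lines without any \citep pass through the four substitutions unchanged, this sketch adds a leading `/\\citep{/!d` command to drop them. That line is an addition and not part of the answer. The names `script`, `input` and `result` are illustrative, and GNU sed is assumed.

```shell
# Write the script to a temporary file so it can be passed with -f.
script=$(mktemp)
cat > "$script" <<'EOF'
# added for this sketch: drop lines that contain no \citep{ at all
/\\citep{/!d
# change every \citep{string} to <newline>string<newline><newline>
s/\\citep{\([^}]*\)}/\n\1\n\n/g
# remove any leading text before the first wanted string
s/^[^\n]*\n//
# replace text between wanted strings with comma + space
s/\n\n[^\n]*\n/, /g
# remove any trailing unwanted text
s/\n.*//
EOF

# Two input lines: the answer's test string and a line with no citation.
input='\citep{string} xyz {abc} \citep{string2},foo
a line with no citation at all'

result=$(printf '%s\n' "$input" | sed -f "$script")
rm -f "$script"

printf '%s\n' "$result"
```

Only the first input line survives, reduced to `string, string2`.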

Answer 3 (score: 0)

f.awk

BEGIN {
    pat = "\\citep"
    latex_tok = "\\\\[A-Za-z_][A-Za-z_]*" # match \aBcD
}

{
    f = f $0 # store content of input file as a string
}

function store(args,   n, k, i) { # store `keys' in `d'
    gsub("[ \t]", "", args) # remove spaces
    n = split(args, keys, ",")
    for (i=1; i<=n; i++) {
      k = keys[i]
      d[k]
    }
}

function ntok() { # next token
    if (match(f, latex_tok)) {
      tok = substr(f, RSTART          ,RLENGTH)
      f   = substr(f, RSTART+RLENGTH          ) # continue right after the token
      return 1
    }
    return 0
}

function parse(    i, rc, args) {
    for (;;) { # infinite loop
      while ( (rc = ntok()) && tok != pat ) ;
      if (!rc) return

      i = index(f, "{")
      if (!i) return # see `pat' but no '{'
      f = substr(f, i+1)

      i = index(f, "}")
      if (!i) return # unmatched '{'

      # extract `args' from \citep{`args'}
      args = substr(f, 1, i-1)
      store(args)
    }
}

END {
    parse()
    for (k in d)
      print k
}

f.example

of the Asian crust further north \citep{TapponnierM76, WangLiu2009}. This has led to widespread deformation both within and 
\citep{BilhamE01, Mitraetal2005} and by distributed seismicity across the region (Fig. \ref{fig1_2}). Recent GPS Geodetic 
across the Dawki fault and Naga Hills, increasing eastwards from $\sim$3~mm/yr to $\sim$13~mm/yr \citep{Vernantetal2014}. 
GPS velocity vectors \citep{TapponnierM76, WangLiu2009}. Sikkim Himalaya lies at the transition between this relatively simple 
this transition includes deviation of the Himalaya from a perfect arc beyond 89\deg\ longitude \citep{BendickB2001}, reduction 
\citep{BhattacharyaM2009, Mitraetal2010}. Rivers Tista, Rangit and Rangli run through Sikkim eroding the MCT and Ramgarh 
thrust to form a mushroom-shaped physiography \citep{Mukuletal2009,Mitraetal2010}. Within this sinuous physiography, 
\citep{Pauletal2015} and also in accordance with the findings of \citet{Mitraetal2005} for northeast India. In another study 
field results corroborate well with seismic studies in this region \citep{Actonetal2011, Arunetal2010}. From studies of

Usage:

awk -f f.awk f.example

Expected output (the order may vary, since awk does not define the iteration order of `for (k in d)`):

BendickB2001
Arunetal2010
Pauletal2015
Mitraetal2005
BilhamE01
Mukuletal2009
TapponnierM76
WangLiu2009
BhattacharyaM2009
Mitraetal2010
Actonetal2011
Vernantetal2014