Question

In Perl, given this string:

gi|1339058241|ref|XP_023717639.1|zinc finger and BTB domain-containing protein 18-like [Cryptotermes secundus]

if my delimiter character is |, how can I get the substring up to and including its last occurrence:

gi|1339058241|ref|XP_023717639.1|

thanks.

Answer 1

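In Perl, quantifiers are "greedy" by default, so you can simply match all characters up to the character used as the delimiter; the greedy match extends to its last occurrence.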

my $foo = "gi|1339058241|ref|XP_023717639.1|zinc finger and BTB domain-containing protein 18-like [Cryptotermes secundus]";
$foo =~ /.*\|/;   # greedy: .* extends to the last |
print "$&\n";

$& holds the string matched by the last successful pattern match, which in this case is everything up to and including the last | character.
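
As a rough sketch of the same idea, a capture group can stand in for $&, which also makes it easy to check whether the match succeeded at all (after a failed match, $& still holds the result of an earlier successful match). The helper names prefix_greedy and prefix_non_greedy are made up for illustration; the non-greedy variant is only there to show the contrast.

```perl
use strict;
use warnings;

# Hypothetical helper: everything up to and including the last '|',
# or undef when the string contains no '|'.
sub prefix_greedy {
    my ($s) = @_;
    # Greedy .* takes as much as it can, then backtracks to the last '|'.
    return $s =~ /^(.*\|)/ ? $1 : undef;
}

# Hypothetical helper for contrast: non-greedy .*? stops at the first '|'.
sub prefix_non_greedy {
    my ($s) = @_;
    return $s =~ /^(.*?\|)/ ? $1 : undef;
}

my $foo = "gi|1339058241|ref|XP_023717639.1|zinc finger and BTB domain-containing protein 18-like [Cryptotermes secundus]";
print prefix_greedy($foo), "\n";      # gi|1339058241|ref|XP_023717639.1|
print prefix_non_greedy($foo), "\n";  # gi|
```

Note that . does not match a newline by default, so this sketch assumes a single-line string such as one header line.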

Answer 2

Here is another solution, which works by removing the trailing run of characters other than | from the end of the given string.

use strict;
use warnings;

my $str = "gi|1339058241|ref|XP_023717639.1|zinc finger and BTB domain-containing protein 18-like [Cryptotermes secundus]";

$str =~ s/[^|]*$//;
print "$str\n";

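Explanation:

[^|] is a character class: it matches any character except | (the "except" is expressed by the ^ character)
* is a quantifier meaning zero or more occurrences
$ stands for the end of the string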
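
If the original string needs to stay untouched, the same substitution can be written with the /r modifier, which returns the modified copy instead of changing the variable in place. This is only a sketch: it assumes Perl 5.14 or newer (where /r was introduced), and the helper name strip_tail is hypothetical.

```perl
use strict;
use warnings;
use 5.014;  # the /r modifier requires Perl 5.14 or newer

# Hypothetical helper: returns a copy of the string with the trailing run
# of non-'|' characters removed; the argument itself is not modified.
sub strip_tail {
    my ($s) = @_;
    return $s =~ s/[^|]*$//r;
}

my $str = "gi|1339058241|ref|XP_023717639.1|zinc finger and BTB domain-containing protein 18-like [Cryptotermes secundus]";
my $short = strip_tail($str);
print "$short\n";   # gi|1339058241|ref|XP_023717639.1|
```

One thing to keep in mind: when the string contains no | at all, [^|]* matches the whole string, so the result is an empty string.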

Answer 3

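You can use rindex, which works just like index except that it searches from the right-hand side of the string instead of the left, so it finds the last occurrence of the substring rather than the first: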

substr($str, 0, rindex ($str, '|') + 1);
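
rindex returns -1 when the substring is not found, in which case the one-liner above yields an empty string. A minimal sketch that makes this case explicit, using a hypothetical helper named prefix_rindex, might look like this:

```perl
use strict;
use warnings;

# Hypothetical helper: everything up to and including the last occurrence
# of $delim in $s, or undef when $delim does not occur.
sub prefix_rindex {
    my ($s, $delim) = @_;
    my $pos = rindex($s, $delim);   # -1 when $delim is not found
    return undef if $pos < 0;
    return substr($s, 0, $pos + length($delim));
}

my $str = "gi|1339058241|ref|XP_023717639.1|zinc finger and BTB domain-containing protein 18-like [Cryptotermes secundus]";
my $prefix = prefix_rindex($str, '|');
print defined $prefix ? "$prefix\n" : "no delimiter found\n";
```

Since no regular expression is involved, the delimiter needs no escaping here, and it may be longer than one character.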

Answer 4

I recognize these as NCBI sequence header lines, so I know they have a fixed number of fields.

Since these are fields/columns, you can split and join:

my @rec = split(/\|/, $id);            # one element per |-separated field
my $idShort = join("|", @rec[0..3]);   # first four fields; no trailing "|"
print $idShort, "\n";

Or you can use a regular expression:

if ($id =~ /^(gi\|\d+\|\w+\|[\w\_]+\.\d+\|)/) { print "$1\n" } else { die("Unparseable: $id\n") }

But I like Hambone's use of rindex.

perl: get a substring until the last occurrence of a character
