How do I loop with a regular expression?

Time: 2017-04-09 08:36:49

Tags: r regex loops

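My situation is that I have a list of files from a physical chemistry data set, which I created through many calculations. I want to run a foreach or while loop over the column named Files in my data frame, which is called CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES.

My file names look like this: "1AH7A_TRP-16-A_GLU-9-A.log:", "1AH7A_TRP-198-A_ASP-197-A.log:", "1BGFA_TRP-43-A_GLU-44-A.log:", "1CXQA_TRP-61-A_ASP-82-A.log:", and so on.

I want to run a while or foreach loop over my "Files" column that checks whether the word "GLU" or "ASP" is present and, whenever I find "GLU" or "ASP" in a file name, prints it to a list.

So for the files above, the printed order would be "GLU", "ASP", "GLU", "ASP". Again, my files are not sorted in any particular way, and this continues through all 1273 file entries. I could then save this list, put it into a column headed "Residues" in my data frame, and do some useful exploratory data analysis.

Note: ASP stands for the amino acid aspartic acid, and GLU for the amino acid glutamic acid.

I know I can use grep to regex-search the entries in the "Files" column like this.

Searching for "ASP":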

> grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files, value = TRUE)

[1] "1AH7A_TRP-198-A_ASP-197-A.log:"  
[2] "1CXQA_TRP-61-A_ASP-82-A.log:"    
[3] "1EJDA_TRP-279-A_ASP-278-A.log:"  
[4] "1EU1A_TRP-32-A_ASP-33-A.log:" 

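As you can see, I get some matches. In fact, I get 683 matches. But that isn't good enough. I need to know which residue matches in each row, not just that matches exist somewhere in the column.

Of course, I can do the same for "GLU":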

> grep("GLU", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files, value = TRUE)

[1] "1AH7A_TRP-16-A_GLU-9-A.log:"     
[2] "1BGFA_TRP-43-A_GLU-44-A.log:"    
[3] "1D8WA_TRP-17-A_GLU-14-A.log:"

And I get a whole bunch of matches!

I tried a for loop. Of course it failed!

  > for(i in 1:length(CD1_and_CH2_Distances$Distance_Files))
{if(grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files))

{print("ASP")} 

else if(grep("GLU", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files))

{print("GLU")}}

All this does is print:

[1] "ASP"

[1] "ASP"

[1] "ASP"

...

even when the file name contains "GLU"!

I mean, I can write basic algebra loops that don't matter to anyone:

> for(i in 1:10){print(i^2)}
[1] 1
[1] 4
[1] 9
[1] 16

Anyway, I checked the warnings to see what went wrong:

> warnings() 
Warning messages: 

1: In if (grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files)) { ... :
  the condition has length > 1 and only the first element will be used
2: In if (grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files)) { ... :
  the condition has length > 1 and only the first element will be used

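As you can see, I get the same warning over and over. I guess that makes sense, since it's a loop. But why does this happen, and why can't I grep inside a loop?

The data frame I want to parse looks like this: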

"","Files","Interaction_Energy_kcal_per_Mole","atom","Distance_Angstroms"
"1","1AH7A_TRP-16-A_GLU-9-A.log:",-8.49787784468197,"CD1",4.03269909613896
"2","1AH7A_TRP-198-A_ASP-197-A.log:",-7.92648167142146,"CD1",3.54307493570204
"3","1BGFA_TRP-43-A_GLU-44-A.log:",-6.73507800775909,"CD1",4.17179517713897
"4","1CXQA_TRP-61-A_ASP-82-A.log:",-9.39887176290279,"CD1",5.29897291934956
"5","1D8WA_TRP-17-A_GLU-14-A.log:",-9.74720319145055,"CD1",3.69398565238145
"6","1D8WA_TRP-17-A_GLU-18-A.log:",-11.3235196065977,"CD1",3.52345441293058
"7","1DJ0A_TRP-223-A_GLU-226-A.log:",-7.46891330209553,"CD1",5.41108436452436
"8","1E58A_TRP-15-A_GLU-18-A.log:",-6.59830781067777,"CD1",4.79790235415437

where commas separate the columns.

This is the result I want:

"","Files","Interaction_Energy_kcal_per_Mole","atom","Distance_Angstroms", "Residue",

    "1","1AH7A_TRP-16-A_GLU-9-A.log:",-8.49787784468197,"CD1",4.03269909613896, "GLU",

    "2","1AH7A_TRP-198-A_ASP-197-A.log:",-7.92648167142146,"CD1",3.54307493570204, "ASP",

    "3","1BGFA_TRP-43-A_GLU-44-A.log:",-6.73507800775909,"CD1",4.17179517713897, "GLU",

    "4","1CXQA_TRP-61-A_ASP-82-A.log:",-9.39887176290279,"CD1",5.29897291934956, "ASP",

    "5","1D8WA_TRP-17-A_GLU-14-A.log:",-9.74720319145055,"CD1",3.69398565238145, "GLU",

    "6","1D8WA_TRP-17-A_GLU-18-A.log:",-11.3235196065977,"CD1",3.52345441293058, "GLU",

    "7","1DJ0A_TRP-223-A_GLU-226-A.log:",-7.46891330209553,"CD1",5.41108436452436, "GLU",

    "8","1E58A_TRP-15-A_GLU-18-A.log:",-6.59830781067777,"CD1",4.79790235415437, "GLU",

...

Any help is appreciated! Thanks!

4 Answers:

Answer 0: (score: 2)

We can use split to divide the dataset into a list of data.frames, grouped by the substring extracted with sub:
lst <- split(df1, sub(".*_([A-Z]{3})-.*", "\\1", df1$Files))

Data

  df1 <- structure(list(X = 1:8, Files = c("1AH7A_TRP-16-A_GLU-9-A.log:", 
"1AH7A_TRP-198-A_ASP-197-A.log:", "1BGFA_TRP-43-A_GLU-44-A.log:", 
"1CXQA_TRP-61-A_ASP-82-A.log:", "1D8WA_TRP-17-A_GLU-14-A.log:", 
"1D8WA_TRP-17-A_GLU-18-A.log:", "1DJ0A_TRP-223-A_GLU-226-A.log:", 
"1E58A_TRP-15-A_GLU-18-A.log:"), Interaction_Energy_kcal_per_Mole = c(-8.49787784468197, 
-7.92648167142146, -6.73507800775909, -9.39887176290279, -9.74720319145055, 
-11.3235196065977, -7.46891330209553, -6.59830781067777), atom = c("CD1", 
"CD1", "CD1", "CD1", "CD1", "CD1", "CD1", "CD1"), Distance_Angstroms = c(4.03269909613896, 
3.54307493570204, 4.17179517713897, 5.29897291934956, 3.69398565238145, 
3.52345441293058, 5.41108436452436, 4.79790235415437)), .Names = c("X", 
"Files", "Interaction_Energy_kcal_per_Mole", "atom", "Distance_Angstroms"
), class = "data.frame", row.names = c(NA, -8L))
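
If the goal is the Residue column from the question rather than a list of data frames, the same sub() pattern can be assigned to a new column directly. A minimal sketch, assuming a small stand-in data frame (df_demo and lst_demo are hypothetical names, not from the original answer):

```r
# Hypothetical stand-in for the first rows of df1.
df_demo <- data.frame(
  Files = c("1AH7A_TRP-16-A_GLU-9-A.log:",
            "1AH7A_TRP-198-A_ASP-197-A.log:",
            "1BGFA_TRP-43-A_GLU-44-A.log:"),
  stringsAsFactors = FALSE
)

# Keep only the three capital letters that follow the last underscore.
df_demo$Residue <- sub(".*_([A-Z]{3})-.*", "\\1", df_demo$Files)

# One data frame per residue, grouped on the new column.
lst_demo <- split(df_demo, df_demo$Residue)
```

Storing the extracted value as a column first keeps the original row order, and split() can still be applied afterwards if separate data frames are needed.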

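Answer 1: (score: 1)

I'm not sure I fully understood your question, but suppose your data is in a data frame called "dat" (containing the rows with GLU and ASP). The code below builds a single field that holds "ASP" or "GLU" for each row.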

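library(stringr)
library(tidyr)

# Extract each residue name into its own element (NA where it is absent).
newvar <- list()
newvar$GLU <- str_extract(dat$Files, "GLU")
newvar$ASP <- str_extract(dat$Files, "ASP")

# Keep the columns as character so the NA values can be replaced with "".
newvar1 <- data.frame(newvar, stringsAsFactors = FALSE)
newvar1[is.na(newvar1)] <- ""

# Paste the two columns together; one of them is always empty.
new <- unite(newvar1, new, GLU:ASP, sep = "")

# unite() returns a data frame, so assign the column itself.
dat$new <- new$new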

Here the field named new will hold your GLU and ASP values.

Answer:

    dat
  X                          Files Interaction_Energy_kcal_per_Mole atom Distance_Angstroms new
1 1    1AH7A_TRP-16-A_GLU-9-A.log:                        -8.497878  CD1           4.032699 GLU
2 2 1AH7A_TRP-198-A_ASP-197-A.log:                        -7.926482  CD1           3.543075 ASP
3 3   1BGFA_TRP-43-A_GLU-44-A.log:                        -6.735078  CD1           4.171795 GLU
4 4   1CXQA_TRP-61-A_ASP-82-A.log:                        -9.398872  CD1           5.298973 ASP
5 5   1D8WA_TRP-17-A_GLU-14-A.log:                        -9.747203  CD1           3.693986 GLU
6 6   1D8WA_TRP-17-A_GLU-18-A.log:                       -11.323520  CD1           3.523454 GLU
7 7 1DJ0A_TRP-223-A_GLU-226-A.log:                        -7.468913  CD1           5.411084 GLU
8 8   1E58A_TRP-15-A_GLU-18-A.log:                        -6.598308  CD1           4.797902 GLU
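
The two extractions can also be collapsed into a single alternation pattern. A sketch in base R, assuming every file name contains exactly one of the two residues (regmatches() silently drops elements with no match, so the lengths would differ otherwise); files_demo and residue_demo are hypothetical names standing in for dat$Files and the new column:

```r
# Hypothetical stand-in for dat$Files.
files_demo <- c("1AH7A_TRP-16-A_GLU-9-A.log:",
                "1CXQA_TRP-61-A_ASP-82-A.log:",
                "1D8WA_TRP-17-A_GLU-14-A.log:")

# regexpr() locates the first "GLU" or "ASP" in each string;
# regmatches() returns the matched text.
residue_demo <- regmatches(files_demo, regexpr("GLU|ASP", files_demo))

# Guard against file names that matched neither residue.
stopifnot(length(residue_demo) == length(files_demo))
```

With the length check in place, residue_demo can be assigned as a column without the intermediate NA replacement and unite() steps.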

Answer 2: (score: 1)

After a long time I figured out a solution to my problem:

library(stringr)

# Save my column as a character vector, because factors are making the world burn:

Files <- as.vector(CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files)

# Split the Files into three parts along the two underscores, and save it back to my vector, preserving the third cut around the underscore.

Files <- str_split_fixed(Files, "_", 3)[,3]

Result:

[1] "GLU-9-A.log:"
"ASP-197-A.log:" etc ...

# Split those results along the hyphens, and take what's next to the first hyphen or the first cut:

Residues <- str_split_fixed(Files, "-", 3)[,1]

> Residues
   [1] "GLU" "ASP" "GLU", ... 

# Add the Residue column to my data.frame:

CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Residue <- Residues

I guess the grep function is overrated. I had to look hard for this function.

Answer 3: (score: 0)

Suppose you saved the data you are trying to parse in the file glu_vs_asp.csv.

Below is an example of how to create two data frames, one for GLU and one for ASP:

# Read .csv file.
dt <- read.table(file = "glu_vs_asp.csv", sep = ",", header = TRUE)

# Create two data frames, one for GLU and one for ASP.
dt_glu <- dt[grep("GLU", dt$Files),]

dt_asp <- dt[grep("ASP", dt$Files),]

To create a single data frame containing both GLU and ASP, you can try the following:

dt_glu_asp <- dt[grep("(ASP|GLU)", dt$Files),]

The commands grep("ASP", dt$Files) and grep("GLU", dt$Files) give you the indices of the rows that contain 'ASP' and 'GLU', respectively, in the Files column.
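
Because grep() returns index positions for the whole column, it is also what trips up the loop in the question: the condition passed to if() has more than one element, and the R version used there only looked at the first one (newer R versions raise an error instead). A sketch of the element-wise alternative with grepl() and ifelse(), assuming a small hypothetical vector files_demo in place of dt$Files:

```r
# Hypothetical stand-in for dt$Files.
files_demo <- c("1AH7A_TRP-16-A_GLU-9-A.log:",
                "1AH7A_TRP-198-A_ASP-197-A.log:",
                "1BGFA_TRP-43-A_GLU-44-A.log:")

# grep() returns positions in the whole vector, not one answer per element.
asp_idx <- grep("ASP", files_demo)

# grepl() returns one TRUE/FALSE per element, so it can drive ifelse().
residue_demo <- ifelse(grepl("ASP", files_demo), "ASP",
                ifelse(grepl("GLU", files_demo), "GLU", NA_character_))

# The same logic as an explicit loop, testing one file name at a time.
residue_loop <- character(length(files_demo))
for (i in seq_along(files_demo)) {
  if (grepl("ASP", files_demo[i])) {
    residue_loop[i] <- "ASP"
  } else if (grepl("GLU", files_demo[i])) {
    residue_loop[i] <- "GLU"
  } else {
    residue_loop[i] <- NA_character_
  }
}
```

The vectorised ifelse() form is usually preferable; the loop is shown only to make clear that the test has to be applied to files_demo[i], one element at a time, rather than to the whole column.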