How to use pandas

Time: 2017-07-28 14:02:35

Tags: python pandas

I am reading a file with the following contents:

label dataset sw sf
1H 1H_2
NOESY_F1eF2e.nv
4807.69238281 4803.07373047
600.402832031 600.402832031
1H.L 1H.P 1H.W 1H.B 1H.E 1H.J 1H.U 1H_2.L 1H_2.P 1H_2.W 1H_2.B 1H_2.E 1H_2.J 1H_2.U vol int stat comment flag0 flag8 flag9
0 {1.H1'} 5.82020 0.05000 0.10000 ++ {0.0} {} {2.H8} 7.61004 0.05000 0.10000 ++ {0.0} {} 0.0 100.0000 0 {} 0 0 0
1 {2.H8} 7.61004 0.05000 0.10000 ++ {0.0} {} {1.H1'} 5.82020 0.05000 0.10000 ++ {0.0} {} 0.0 100.0000 0 {} 0 0 0
2 {1.H8} 8.13712 0.05000 0.10000 ++ {0.0} {} {1.H1'} 5.82020 0.05000 0.10000 ++ {0.0} {} 0.0 100.0000 0 {} 0 0 0
3 {1.H1'} 5.82020 0.05000 0.10000 ++ {0.0} {} {1.H8} 8.13712 0.05000 0.10000 ++ {0.0} {} 0.0 100.0000 0 {} 0 0 0
4 {2.H8} 7.61004 0.05000 0.10000 ++ {0.0} {} {2.H1'} 5.90291 0.05000 0.10000 ++ {0.0} {} 0.0 100.0000 0 {} 0 0 0
5 {2.H1'} 5.90291 0.05000 0.10000 ++ {0.0} {} {2.H8} 7.61004 0.05000 0.10000 ++ {0.0} {} 0.0 100.0000 0 {} 0 0 0
6 {2.H8} 7.61004 0.05000 0.10000 ++ {0.0} {} {1.H1'} 5.82020 0.05000 0.10000 ++ {0.0} {} 0.0 100.0000 0 {} 0 0 0
7 {2.H8} 7.61004 0.05000 0.10000 ++ {0.0} {} {1.H8} 8.13712 0.05000 0.10000 ++ {0.0} {} 0.0 100.0000 0 {} 0 0 0
8 {1.H1'} 5.82020 0.05000 0.10000 ++ {0.0} {} {2.H8} 7.61004 0.05000 0.10000 ++ {0.0} {} 0.0 100.0000 0 {} 0 0 0
9 {1.H8} 8.13712 0.05000 0.10000 ++ {0.0} {} {2.H8} 7.61004 0.05000 0.10000 ++ {0.0} {} 0.0 100.0000 0 {} 0 0 0

I want to get the values from the 1H.L, 1H.P, 1H_2.L and 1H_2.P columns. This is my code:

import pandas as pd

result={}
df = pd.read_csv("peaks_ee.xpk", sep=" ", skiprows=5)

shift1 = df["1H.P"]
shift2 = df["1H_2.P"]

mask = ((shift1>5.1) & (shift1<6)) & ((shift2>7) & (shift2<8.25))

result = df[mask]
result = result[["1H.L","1H.P","1H_2.L","1H_2.P"]]

for col in result.columns:
    if col == ("1H.L") or col==( "1H_2.L"):
        result[col]=result[col].str.strip("{} ")
result.drop_duplicates(keep='first',inplace=True)
tclust_atom=open("tclust_ppm.txt","w+")
result.to_string(tclust_atom, header=False)

This is the output:

0     1.H1'  5.82020   2.H8  7.61004
3     1.H1'  5.82020   1.H8  8.13712
5     2.H1'  5.90291   2.H8  7.61004
11    4.H1'  5.74125   3.H6  7.53261
12    3.H1'  5.54935   4.H8  7.49932
15    3.H1'  5.54935   3.H6  7.53261
18    2.H1'  5.90291   3.H6  7.53261
21    4.H1'  5.74125   4.H8  7.49932
27    6.H1'  5.54297   5.H6  7.72158
32    4.H1'  5.74125   5.H6  7.72158

I would like my output to look like this:

1.H1'  5.82020 0.3
2.H8  7.61004 0.3  
1.H8  8.13712 0.3
2.H1'  5.90291 0.3   
4.H1'  5.74125 0.3   
3.H6  7.53261 0.3
3.H1'  5.54935 0.3   
4.H8  7.49932 0.3
3.H1'  5.54935 0.3  
3.H6  7.53261 0.3 
6.H1'  5.54297 0.3   
5.H6  7.72158 0.3

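I want everything in two columns, and I do not want any duplicates. How can I move all the values from the third and fourth columns of the current output into the first and second columns, without keeping any duplicates? And how can I add a constant value (0.3) as a third column?

Edit: updated code: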

import pandas as pd

result={}
df = pd.read_csv("peaks_ee.xpk", sep=" ", skiprows=5)

shift1 = df["1H.P"]
shift2 = df["1H_2.P"]

mask = ((shift1>5.1) & (shift1<6)) & ((shift2>7) & (shift2<8.25))

result = df[mask]
result = result[["1H.L","1H.P","1H_2.L","1H_2.P"]]

for col in result.columns:
    if col == ("1H.L") or col==( "1H_2.L"):
        result[col]=result[col].str.strip("{} ")

res = pd.lreshape(df, {'atom_name':['1H.L','1H_2.L'], 'ppm':
['1H.P','1H_2.P']}).drop_duplicates()
res['new']=0.3
result.drop_duplicates(keep='first',inplace=True)

tclust_atom=open("tclust_ppm.txt","w+")

result.to_string(tclust_atom,header = False)

res.to_string(tclust_atom, header = False) 

This is the output:

0    0.1  ++  {0.0}  {}  0.05  0.1  ++  {0.0}  {}  0.05  {}  0  0  0  100.0  0  0.0   {1.H1'}  5.82020  0.3
1    0.1  ++  {0.0}  {}  0.05  0.1  ++  {0.0}  {}  0.05  {}  0  0  0  100.0  0  0.0    {2.H8}  7.61004  0.3
2    0.1  ++  {0.0}  {}  0.05  0.1  ++  {0.0}  {}  0.05  {}  0  0  0  100.0  0  0.0    {1.H8}  8.13712  0.3
5    0.1  ++  {0.0}  {}  0.05  0.1  ++  {0.0}  {}  0.05  {}  0  0  0  100.0  0  0.0   {2.H1'}  5.90291  0.3
10   0.1  ++  {0.0}  {}  0.05  0.1  ++  {0.0}  {}  0.05  {}  0  0  0  100.0  0  0.0    {3.H6}  7.53261  0.3
11   0.1  ++  {0.0}  {}  0.05  0.1  ++  {0.0}  {}  0.05  {}  0  0  0  100.0  0  0.0   {4.H1'}  5.74125  0.3
12   0.1  ++  {0.0}  {}  0.05  0.1  ++  {0.0}  {}  0.05  {}  0  0  0  100.0  0  0.0   {3.H1'}  5.54935  0.3
13   0.1  ++  {0.0}  {}  0.05  0.1  ++  {0.0}  {}  0.05  {}  0  0  0  100.0  0  0.0    {4.H8}  7.49932  0.3
26   0.1  ++  {0.0}  {}  0.05  0.1  ++  {0.0}  {}  0.05  {}  0  0  0  100.0  0  0.0    {5.H6}  7.72158  0.3
27   0.1  ++  {0.0}  {}  0.05  0.1  ++  {0.0}  {}  0.05  {}  0  0  0  100.0  0  0.0   {6.H1'}  5.54297  0.3
29   0.1  ++  {0.0}  {}  0.05  0.1  ++  {0.0}  {}  0.05  {}  0  0  0  100.0  0  0.0   {5.H2'}  4.26210  0.3
35   0.1  ++  {0.0}  {}  0.05  0.1  ++  {0.0}  {}  0.05  {}  0  0  0  100.0  0  0.0    {7.H8}  8.16859  0.3

2 answers:

Answer 0 (score: 4)

IIUC, we can use pd.lreshape to stack the two column pairs into a single pair. A third column holding the constant 0.3 can then be added to the result (the question's updated code does this with res['new']=0.3).

In [41]: df
Out[41]:
       c1       c2    c3       c4
0   1.H1'  5.82020  2.H8  7.61004
3   1.H1'  5.82020  1.H8  8.13712
5   2.H1'  5.90291  2.H8  7.61004
11  4.H1'  5.74125  3.H6  7.53261
12  3.H1'  5.54935  4.H8  7.49932
15  3.H1'  5.54935  3.H6  7.53261
18  2.H1'  5.90291  3.H6  7.53261
21  4.H1'  5.74125  4.H8  7.49932
27  6.H1'  5.54297  5.H6  7.72158
32  4.H1'  5.74125  5.H6  7.72158

In [43]: res = pd.lreshape(df, {'key':['c1','c3'], 'val':['c2','c4']}).drop_duplicates()

In [44]: res
Out[44]:
      key      val
0   1.H1'  5.82020
2   2.H1'  5.90291
3   4.H1'  5.74125
4   3.H1'  5.54935
8   6.H1'  5.54297
10   2.H8  7.61004
11   1.H8  8.13712
13   3.H6  7.53261
14   4.H8  7.49932
18   5.H6  7.72158
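
The reshaping above can also be sketched with pd.concat, which is a swapped-in technique here rather than the answerer's method. This is a minimal sketch under stated assumptions: the sample rows are copied from the question's filtered output, the column names c1 to c4 follow the answer, and the names key, val and const are hypothetical.

```python
import pandas as pd

# Three sample rows copied from the filtered output shown in the question.
df = pd.DataFrame({
    "c1": ["1.H1'", "1.H1'", "2.H1'"],
    "c2": [5.82020, 5.82020, 5.90291],
    "c3": ["2.H8", "1.H8", "2.H8"],
    "c4": [7.61004, 8.13712, 7.61004],
})

# Give both pairs the same column names so they line up when stacked.
first = df[["c1", "c2"]].rename(columns={"c1": "key", "c2": "val"})
second = df[["c3", "c4"]].rename(columns={"c3": "key", "c4": "val"})

# Stack the second pair under the first, then keep one copy of each row.
res = (
    pd.concat([first, second], ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
)

# Constant third column ("const" is a hypothetical name).
res["const"] = 0.3

# Render without the index or header, as the desired output has neither.
print(res.to_string(index=False, header=False))
```

As with pd.lreshape, the rows come out grouped by source column (all c1/c2 pairs first, then c3/c4), not interleaved row by row as in the question's desired output.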

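Answer 1 (score: 0)

For starters, consider changing your line result.drop_duplicates(keep='first',inplace=True) to result.drop_duplicates(keep=False,inplace=True) if you do not want any copy of a duplicated row to remain: keep='first' retains the first occurrence of each duplicate, whereas keep=False drops every row that has a duplicate. I made the same mistake when I first read the documentation for pandas.DataFrame.drop_duplicates(). Then, to take the last two columns and append them to the first two, you can do the following. It may not be the most efficient way, but it worked for me. Also, do the column operations before dropping duplicates, in case you end up with columns of mismatched lengths.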

result.columns = ["1HL", "1HP", "1H_2L", "1H_2P"]
# remove the '.' in the names to make things easier
# (names starting with a digit still need bracket access; result.1HL is a syntax error)
new_L = result["1HL"].values.tolist()
new_P = result["1HP"].values.tolist()
# append the second label/shift pair below the first pair
new_L.extend(result["1H_2L"].values.tolist())
new_P.extend(result["1H_2P"].values.tolist())
# constant third column: one 0.3 per row
new_03col = [0.3] * len(new_L)
new_results = pd.DataFrame({'1HL': new_L, '1HP': new_P, '03col': new_03col})
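
To make the difference between the two keep settings concrete, here is a small sketch on made-up data. The column names atom and ppm are hypothetical; the values are taken from the question.

```python
import pandas as pd

# Rows 0 and 2 are identical; row 1 is unique.
df = pd.DataFrame({
    "atom": ["1.H1'", "2.H8", "1.H1'"],
    "ppm": [5.82020, 7.61004, 5.82020],
})

# keep="first": one copy of each duplicated row survives (rows 0 and 1).
kept_first = df.drop_duplicates(keep="first")

# keep=False: every row that has a duplicate is dropped (only row 1 remains).
kept_none = df.drop_duplicates(keep=False)

print(kept_first)
print(kept_none)
```

If the goal is a list in which each atom appears exactly once, as in the question's desired output, keep="first" seems to be the setting that matches; keep=False would remove those atoms entirely.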