How can I take other rows of a DataFrame into account when filtering?

Time: 2018-07-08 11:35:24

Tags: python pandas dataframe nlp

I am trying to filter (and subsequently change) certain rows in pandas depending on values in other columns. Say my DataFrame looks like this:

SENT    ID    WORD        POS        HEAD
1       1     I           NOUN        2
1       2     like        VERB        0
1       3     incredibly  ADV         4
1       4     brown       ADJ         5
1       5     sugar       NOUN        2
2       1     Here        ADV         2
2       2     appears     VERB        0
2       3     my          PRON        5
2       4     next        ADJ         5
2       5     sentence    NOUN        0

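The structure is such that the HEAD column points to the index (the ID) of the word that the row depends on. For example, "brown" depends on "sugar", so the head of "brown" is 5, because the index of "sugar" is 5.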
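To make the HEAD convention concrete, here is a minimal sketch that resolves the head of a single row by looking up its (SENT, HEAD) pair. It assumes the column names from the sample table. The helper name `head_of` is made up for illustration, and treating HEAD 0 as "no head" is an assumption based on the sample rather than something stated above.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, "I", "NOUN", 2],
     [1, 2, "like", "VERB", 0],
     [1, 3, "incredibly", "ADV", 4],
     [1, 4, "brown", "ADJ", 5],
     [1, 5, "sugar", "NOUN", 2]],
    columns=["SENT", "ID", "WORD", "POS", "HEAD"],
)

# Index by (SENT, ID) so a head can be found by its sentence and position.
by_key = df.set_index(["SENT", "ID"])

def head_of(row):
    """Hypothetical helper: return the head row of `row`, or None if HEAD is 0."""
    if row["HEAD"] == 0:
        return None  # assumption: 0 marks a word without a head
    return by_key.loc[(row["SENT"], row["HEAD"])]

brown = df[df["WORD"] == "brown"].iloc[0]
print(head_of(brown)["WORD"])  # sugar
```

This is still a row-at-a-time lookup, so it only illustrates what HEAD means; it is not a replacement for a vectorized solution.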

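I need to extract a df of all rows whose POS is ADV and whose head has POS VERB, so "Here" would be in the new df but "incredibly" would not (and I may then change the WORD entry of those rows). Currently I do this in a loop, but I don't think that is the pandas way, and it also causes problems later on. This is my current code (the split("-") comes from a separate issue; ignore it):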

def get_head(df, dependent):
    # Walk row by row from `dependent` until the row whose INDEX equals its HEAD
    head = dependent
    target_index = int(dependent['HEAD'])
    if target_index == 0:
        # HEAD 0: return the row itself
        return dependent
    else:
        if target_index < int(dependent['INDEX']):
            # head is earlier in the sentence; compare against the 1st int in the cell
            while int(head['INDEX'].split("-")[0]) > target_index:
                head = df.iloc[int(head.name) - 1]
        elif target_index > int(dependent['INDEX']):
            # head is later in the sentence
            while int(head['INDEX'].split("-")[0]) < target_index:
                head = df.iloc[int(head.name) + 1]
    return head

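One difficulty I had when writing this function was that (at the time) I did not have a "SENTENCE" column (shown as SENT in the sample above), so I had to find the nearest head manually. I hope that adding the SENTENCE column makes things somewhat easier. Note, however, that the df contains hundreds of such sentences, so simply searching for index "5" will not do: there are hundreds of rows where df['INDEX']=='5'.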
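The ambiguity described here can be checked directly. The sketch below uses the SENT and ID column names from the sample table (the text above calls them SENTENCE and INDEX) and a cut-down four-row frame. It shows that ID alone matches one row per sentence, while the (SENT, ID) pair identifies a single row.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 4, "brown", "ADJ", 5],
     [1, 5, "sugar", "NOUN", 2],
     [2, 4, "next", "ADJ", 5],
     [2, 5, "sentence", "NOUN", 0]],
    columns=["SENT", "ID", "WORD", "POS", "HEAD"],
)

# ID alone matches one row per sentence, so it cannot identify a head.
matches_by_id = df[df["ID"] == 5]

# Adding the sentence number narrows the match to a single row.
matches_by_key = df[(df["SENT"] == 2) & (df["ID"] == 5)]

# The pair (SENT, ID) must be unique if it is to serve as a lookup key.
key_is_unique = not df.duplicated(["SENT", "ID"]).any()

print(len(matches_by_id), len(matches_by_key), key_is_unique)
```

If `key_is_unique` holds on the real data, the pair can be used as a join key or as a MultiIndex.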

Here is an example of how I use get_head():

def change_dependent(extract_col, extract_value, new_dependent_pos, head_pos):
    # rows that satisfy another condition on the (global) df
    sub_df = df[df[extract_col] == extract_value]
    for i, v in sub_df.iterrows():
        # relabel the row if its head has the requested POS
        if get_head(df, v)['POS'] == head_pos:
            df.at[v.name, 'POS'] = new_dependent_pos
    return df

change_dependent('POS', 'ADV', 'ADV:VERB', 'VERB')

Can anyone here think of a more elegant, efficient, pandas-like way to get all instances of ADV whose head is a VERB?

1 Answer:

Answer 0: (score: 0)

import pandas as pd

# Sample data from the question
df = pd.DataFrame([[1,1,'I','NOUN',2],
                  [1,2,'like','VERB',0],
                  [1,3,'incredibly','ADV',4],
                  [1,4,'brown','ADJ',5],
                  [1,5,'sugar','NOUN',2],
                  [2,1,'Here','ADV',2],
                  [2,2,'appears','VERB',0],
                  [2,3,'my','PRON',5],
                  [2,4,'next','ADJ',5],
                  [2,5,'sentence','NOUN',0]],
                  columns=['SENT','ID','WORD','POS','HEAD'])

# All adverbs
adv = df[df['POS'] == 'ADV']
# Join each verb (SENT, ID) to the adverbs whose (SENT, HEAD) points at it
temp = df[df['POS'] == 'VERB'][['SENT','ID','POS']].merge(
    adv, left_on=['SENT','ID'], right_on=['SENT','HEAD'])
# WORD exists only on the adverb side, so it keeps its name after the merge
print(temp['WORD'])
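The merge above returns the matching adverbs as a new frame. To write a new tag back into the original frame, which is what change_dependent() does in the question, one possible approach is a left self-merge that attaches each row's head POS, followed by a boolean mask. This is a sketch under assumptions: the column name `HEAD_POS` is introduced here for illustration, and (SENT, ID) is assumed to be unique per word.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, "I", "NOUN", 2],
     [1, 2, "like", "VERB", 0],
     [1, 3, "incredibly", "ADV", 4],
     [1, 4, "brown", "ADJ", 5],
     [1, 5, "sugar", "NOUN", 2],
     [2, 1, "Here", "ADV", 2],
     [2, 2, "appears", "VERB", 0],
     [2, 3, "my", "PRON", 5],
     [2, 4, "next", "ADJ", 5],
     [2, 5, "sentence", "NOUN", 0]],
    columns=["SENT", "ID", "WORD", "POS", "HEAD"],
)

# Lookup table: rename ID to HEAD so that (SENT, HEAD) in df can be matched
# against the word it points to; HEAD_POS is a made-up name for that word's POS.
heads = df[["SENT", "ID", "POS"]].rename(columns={"ID": "HEAD", "POS": "HEAD_POS"})

# A left merge keeps every row of df in its original order; rows whose HEAD is 0
# find no match and get NaN in HEAD_POS.
with_head = df.merge(heads, on=["SENT", "HEAD"], how="left")

# Boolean mask: adverbs whose head is a verb.
mask = ((with_head["POS"] == "ADV") & (with_head["HEAD_POS"] == "VERB")).to_numpy()

# Relabel in place, mirroring change_dependent('POS', 'ADV', 'ADV:VERB', 'VERB').
df.loc[mask, "POS"] = "ADV:VERB"
print(df.loc[mask, "WORD"].tolist())  # ['Here']
```

Converting the mask to a NumPy array sidesteps index alignment, so the assignment also works when df does not have a default integer index.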