Pandas: get the pattern that matches a URL between two DataFrames

Time: 2019-03-05 09:51:43

Tags: python pandas python-2.7

I have two DataFrames of the following form:

import pandas as pd

d1 = {'Domain': ['amazon.com', 'apple.com', 'amazon.com','xyz.com'], 'Pattern': ['kindle','music','subscribe-and-save',''],'Other Important Info':['a','b','c','d']}
df1 = pd.DataFrame(d1)

d2 = {'Domain': ['google.com','google.com','amazon.com','amazon.com', 'youtube.com', 'amazon.com'], 'Url': ['https://google.com/kindle','https://google.com/','https://amazon.com/subscribe-and-save','https://amazon.com/abc','https://youtube.com/music','https:amazon.com/kindle']}
df2 = pd.DataFrame(d2)

The main goal is to merge the two DataFrames on "Domain", keeping only the rows where "Pattern" occurs in "Url".

So the result should be the following DataFrame:

{'Domain':['amazon.com','amazon.com'],'Url':['https://amazon.com/subscribe-and-save','https:amazon.com/kindle'],'Other Important Info':['c','a']}

My current approach:

def lookup_table(value, df):
    out = None
    list_items = df['Pattern'].tolist()
    for item in list_items:
        if item in value:
            out = item
            break
    return out

df2['Pattern'] = df2['Url'].apply(lambda x: lookup_table(x, df1[df1['Pattern']!='']))

merged = pd.merge(df2[df2['Pattern'].notnull()], df1[df1['Pattern']!=''],on=['Domain','Pattern'],how='left')

However, the lookup_table function takes too long because of the for loop.

How can I do this faster? I am using Python 2 on Windows.

1 Answer:

Answer 0: (score: 4)

df1

       Domain             Pattern Other Important Info
0  amazon.com              kindle                    a
1   apple.com               music                    b
2  amazon.com  subscribe-and-save                    c
3     xyz.com                                         

df2

        Domain                                    Url
0   google.com              https://google.com/kindle
1   google.com                    https://google.com/
2   amazon.com  https://amazon.com/subscribe-and-save
3   amazon.com                 https://amazon.com/abc
4  youtube.com              https://youtube.com/music
5   amazon.com                https:amazon.com/kindle
  

"The main goal is to merge the two DataFrames on 'Domain', keeping only the rows where 'Pattern' occurs in 'Url'."

df = df1.merge(df2, on='Domain')
df.loc[df.apply(lambda x: x.Pattern in x.Url, axis=1)]

Output

       Domain             Pattern Other Important Info  \
2  amazon.com              kindle                    a   
3  amazon.com  subscribe-and-save                    c   

                                     Url  
2                https:amazon.com/kindle  
3  https://amazon.com/subscribe-and-save
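
One caveat with the substring test: an empty "Pattern" (such as the xyz.com row) is contained in every string, so it would match any URL for that domain if the domain ever appeared in df2. The following is a minimal sketch of the same merge-then-filter idea that drops empty patterns first and builds the mask with a list comprehension rather than a row-wise apply. The function name match_patterns is hypothetical, and whether this is noticeably faster on your data is an assumption you would need to measure.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Domain': ['amazon.com', 'apple.com', 'amazon.com', 'xyz.com'],
    'Pattern': ['kindle', 'music', 'subscribe-and-save', ''],
    'Other Important Info': ['a', 'b', 'c', 'd'],
})
df2 = pd.DataFrame({
    'Domain': ['google.com', 'google.com', 'amazon.com',
               'amazon.com', 'youtube.com', 'amazon.com'],
    'Url': ['https://google.com/kindle', 'https://google.com/',
            'https://amazon.com/subscribe-and-save', 'https://amazon.com/abc',
            'https://youtube.com/music', 'https:amazon.com/kindle'],
})


def match_patterns(patterns_df, urls_df):
    """Hypothetical helper: merge on Domain, keep rows whose Pattern is in Url."""
    # Drop empty patterns: '' is a substring of every string,
    # so it would otherwise match every URL of its domain.
    non_empty = patterns_df[patterns_df['Pattern'] != '']
    # Inner merge pairs every pattern with every URL of the same domain.
    merged = non_empty.merge(urls_df, on='Domain')
    # Plain Python substring test per (Pattern, Url) pair.
    mask = pd.Series(
        [p in u for p, u in zip(merged['Pattern'], merged['Url'])],
        index=merged.index,
        dtype=bool,
    )
    return merged[mask].reset_index(drop=True)


result = match_patterns(df1, df2)
print(result)
```

Note that the merge still creates one row per (pattern, URL) pair within each domain before filtering, so memory use grows with the number of patterns per domain.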