Approximate name matching to merge two DataFrames in Python

Time: 2017-01-12 14:52:11

Tags: python pandas merge fuzzywuzzy sequencematcher

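I am working with two DataFrames (df1 and df2), and I want to merge df2 into df1 based on name matches. The names do not match exactly between the two (for example, "JS Smith" might appear as "JS Smith (Jr)"), and the names in df1 are stored as a list separated by "|" to cover the different name variants.

In addition, df2 has a second column containing slightly different names, which I would like to fall back on if there is no match in the original column.

Finally, I only want to bring in data from df2 if there is a unique match in df1, and I do not want to overwrite entries that were already brought in.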
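To make the "approximate match against a list of variants" idea concrete, here is a minimal sketch using difflib.SequenceMatcher from the standard library (one of the tools named in the tags). The helper names normalize and best_variant_score and the sample names are assumptions made for illustration, not something from my real data.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not affect the score."""
    return " ".join(name.lower().split())


def best_variant_score(name, variants, sep="|"):
    """Return the highest similarity ratio between name and any variant in a separated string."""
    target = normalize(name)
    # Skip empty entries produced by a trailing separator such as "N1|N2|"
    candidates = [normalize(v) for v in variants.split(sep) if v.strip()]
    if not candidates:
        return 0.0
    return max(SequenceMatcher(None, target, c).ratio() for c in candidates)


# Hypothetical variant list: the score is the best ratio over all variants
print(best_variant_score("JS Smith", "John Smith|JS Smith (Jr)|J. Smith"))
```

A score of 1.0 means some variant is identical after normalization. Where to put the cut-off below that is a judgment call that would have to be tuned on real names.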

Here are examples of the dfs:

df1 (where N1 stands for the first name in the list of name variants)

    Name Variants
0   N1|N2|N3|N4
1   N1|N2|
2   N1|N2|N3

df2

    Name Type 1        Name Type 2        Data1     Data2     Data3
0   Name 0             Name 0.1           X         Y         Z
1   Name 1             Name 1.1           A         B         C
2   Name 2             Name 2.1           D         E         F
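Given tables shaped like the two above, one possible preparation step is to reshape df1 so that every name variant sits on its own row while remembering which df1 row it came from. The sketch below assumes pandas 0.25 or later (for Series.explode). The placeholder values and the column names variant and df1_row are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for df1; note the trailing "|" in the second row
df1 = pd.DataFrame({"Name Variants": ["A1|A2|A3|A4", "B1|B2|", "C1|C2|C3"]})

long_df1 = (
    df1["Name Variants"]
    .str.split("|")          # list of variants per row
    .explode()               # one variant per row, original index repeated
    .str.strip()
    .loc[lambda s: s != ""]  # drop empty entries left by trailing separators
    .rename("variant")
    .rename_axis("df1_row")
    .reset_index()
)
print(long_df1)
```

Each name from df2 can then be compared against long_df1["variant"], and df1_row points back to the row of df1 that should receive the data.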

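I want to match on "Name Type 2" first. Suppose the matches are:

  1. Name 0.1 -> one of the names in N1|N2 (row 1 of df1)
  2. Name 2.1 -> one of the names in N1|N2|N3|N4 (row 0 of df1)
  3. Name 1.1 -> matches no name in df1, so I would then check Name 1, which matches N1|N2|N3 (row 2 of df1)

The resulting new df would look like this: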

        Name Variants    Matched Name     Data1     Data2     Data3    Matched
    0   N1|N2|N3|N4      Name2.1          D         E         F        True
    1   N1|N2|           Name0.1          X         Y         Z        True
    2   N1|N2|N3|        Name1            A         B         C        True
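
The rules described so far (try "Name Type 2" first, fall back to "Name Type 1" only when nothing matches, accept only a unique match, never overwrite a row that is already matched) can be sketched as below. This is only an illustration under assumptions: the functions is_match and merge_unique, the 0.85 threshold, and the company names are all hypothetical, and the similarity test uses difflib.SequenceMatcher rather than anything from my existing code.

```python
from difflib import SequenceMatcher

import pandas as pd


def is_match(name, variants, threshold=0.85):
    """True if name is similar enough to any '|'-separated variant."""
    name = name.lower().strip()
    return any(
        SequenceMatcher(None, name, v.lower().strip()).ratio() >= threshold
        for v in variants.split("|") if v.strip()
    )


def merge_unique(df1, df2, name_cols, data_cols, threshold=0.85):
    """Copy data_cols from df2 into df1 rows that are matched uniquely."""
    out = df1.copy()
    out["Matched Name"] = None
    out["Matched"] = False
    for col in data_cols:
        out[col] = None
    for _, row in df2.iterrows():
        for name_col in name_cols:  # preferred column first, fallback second
            hits = [i for i, v in out["Name Variants"].items()
                    if is_match(row[name_col], v, threshold)]
            if not hits:
                continue  # no match at all: try the fallback column
            if len(hits) == 1 and not out.at[hits[0], "Matched"]:
                i = hits[0]
                out.at[i, "Matched Name"] = row[name_col]
                out.at[i, "Matched"] = True
                for col in data_cols:
                    out.at[i, col] = row[col]
            break  # something matched (unique or ambiguous): no fallback
    return out


# Hypothetical data mirroring the example tables
df1 = pd.DataFrame({"Name Variants": [
    "Acme Corp|Acme Corporation|ACME", "Globex|Globex Inc|", "Initech|Initech LLC"]})
df2 = pd.DataFrame({
    "Name Type 1": ["Globex", "Initech", "Acme"],
    "Name Type 2": ["Globex Inc.", "Intech Software", "Acme Corporatoin"],
    "Data1": ["X", "A", "D"],
})
result = merge_unique(df1, df2, ["Name Type 2", "Name Type 1"], ["Data1"])
print(result)
```

This still loops over every pair of rows, so it shows the matching logic only and does nothing for run time.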
    

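My current approach is:

  1. Loop through each row in df2
  2. Search df1 using df1[df1['Name Variants'].str.contains('Name0.1')]
  3. If there is a unique match (1 row found in df1) and "Matched" is not already marked "True", I pull in the data
  4. If there are multiple matches, I do not pull in the data
  5. If there are no matches, I search for "Name 0" using the same method and run the same logic again (1 match, no data currently merged, etc.)

My problems are:

  1. This is very time-consuming
  2. I cannot allow for slight spelling differences like the ones I described at the start

Here is the code for my current approach: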

        global_brands = set(ep["Global Brand"].dropna().str.replace("&", "").str.lower())
        products = set(ep["Product"].dropna().str.replace("&", "").str.lower())
        gx_name = set(ep["Generic Name"].dropna().str.replace(";","").str.lower())
        #%%
        
        print(len(global_brands))
        print(len(products))
        print(len(gx_name))
        #%%
        """
        add transformed names to ep and db
        
        """
        
        ep["alt_global_brands"] = ep["Global Brand"].fillna("").str.replace("&", "").str.lower()
        ep["alt_product"] = ep["Product"].fillna("").str.replace("&", "").str.lower()
        ep["alt_gx_name"] = ep["Generic Name"].fillna("").str.replace(";","").str.lower()
        
        
        db["alt_drug_names"] = db["Trans Drug Name"].str.lower()
        
        #%%
        print(db.loc[1805,"alt_drug_names"].split("|")[0] == "buprenorphine  naloxone")
        #%%
        print(ep.loc[166,"alt_product"] == "vx-661  ivacaftor")
        
        #%%
        
        ep['Match in db'] = ""
        db['EP match'] = ""
        
        num_product_nonmatches = 0
        num_product_exact_matches = 0
        double_matches = 0
        for product in products:
            product_matches = len(db.loc[db["alt_drug_names"].str.contains(product)])
        
            if product_matches == 1:
                matched_row = db.loc[db["alt_drug_names"].str.contains(product)].index[0]
        
            if product_matches > 1:
                #print(db.ix[db["alt_drug_names"].str.contains(global_brand)]["alt_drug_names"].str.split("|"))
                num_matched_rows = 0
                for row, value in db.loc[db["alt_drug_names"].str.contains(product)]["alt_drug_names"].items():
                    names = value.split("|")
                    for name in names:
                        if product == name:
                            matched_row = row
                            num_matched_rows += 1
        
                if num_matched_rows == 1:
                    product_matches = 1
        
        
                #elif num_matched_rows > 1: - At no point was there still a double match after looping through each rows name variants and looking for an exact match
                if num_matched_rows == 0:
                    """
                    Here after looping through the name variants there was no exact match
                    This seems to be for assets that are too generic (ex: clonidine hydrochloride, rotavirus vaccine, etc.)
        
                    Approach: 
                    1. Check if name has / to split and create combo
                    2. If no / or still no match => leverage generic name
                    """
                    product_copy = product
                    if "("  in product:
                        product = product.split("(")[0].strip()
        
                    if "/" in product:
                        # Strip each fragment, then join with the double space used in names like "buprenorphine  naloxone"
                        product = "  ".join(fragment.strip() for fragment in product.split("/"))
        
                    if len(db.loc[db["alt_drug_names"].str.contains(product)]) == 1: # this instance does not occur
                        product_matches = 1
                        matched_row = db.loc[db["alt_drug_names"].str.contains(product)].index[0]
        
                    elif len(db.loc[db["alt_drug_names"].str.contains(product)]) > 1:
                        num_matched_rows = 0
                        for row, value in db.loc[db["alt_drug_names"].str.contains(product)]["alt_drug_names"].items():
                            names = value.split("|")
                            for name in names:
                                if product == name:
                                    matched_row = row
                                    num_matched_rows += 1
        
        
                        if num_matched_rows == 1:
                            product_matches = 1
        
                    product = product_copy
        
            if product_matches == 0:
                num_product_nonmatches += 1
                """
                Check if name has / to split and create combo
        
                LEVERAGE GENERIC NAME
        
                """
        
                #product_name = ep[ep["Global Brand"].str.replace("&", "+")]
                #product_matches = len(db.ix[db["Drug Name"].str.contains(global_brand) and db.ix[db["Drug Name"].str.contains(global_brand)])
            if product_matches == 1:
                num_product_exact_matches += 1
        #        print(product)
        #        print(matched_row)
                #print(product)
                ep_row = ep[ep['alt_product'] == product].index[0]
                if ep.loc[ep_row,'Match in db'] == "":
                    ep.loc[ep_row,'Match in db'] = "TRUE"
                if db.loc[matched_row,'EP match'] == "":
                    db.loc[matched_row, 'EP match'] = "TRUE"
                    db.loc[matched_row, 'EP Global Name'] = ep.loc[ep_row, 'Global Brand']
                    db.loc[matched_row, 'EP Product'] = ep.loc[ep_row, 'Product']
                    db.loc[matched_row, 'EP Generic Name'] = ep.loc[ep_row, 'Generic Name']
                    db.loc[matched_row, 'EP Company'] = ep.loc[ep_row, 'Company']
                    db.loc[matched_row, 'EP Rx or OTC'] = ep.loc[ep_row, 'Prescription']
                    db.loc[matched_row, 'EP markets'] = ep.loc[ep_row, 'Markets']
        
                    columns = ['2015 Actual/ Est. (Sales)','WW sales - 2008','WW sales - 2009','WW sales - 2010','WW sales - 2011','WW sales - 2012','WW sales - 2013','WW sales - 2014','WW sales - 2015',
                               'WW sales - 2016','WW sales - 2017','WW sales - 2018','WW sales - 2019','WW sales - 2020','WW sales - 2021','WW sales - 2022','WW sales - 2023','WW sales - 2024','WW sales - 2025',
                               'WW CAGR (2018 or Launch - 2025)','WW Est. Launch','U.S. sales - 2008','U.S. sales - 2009','U.S. sales - 2010','U.S. sales - 2011','U.S. sales - 2012','U.S. sales - 2013',
                               'U.S. sales - 2014','U.S. sales - 2015','U.S. sales - 2016','U.S. sales - 2017','U.S. sales - 2018','U.S. sales - 2019','U.S. sales - 2020','U.S. sales - 2021','U.S. sales - 2022',
                               'U.S. sales - 2023','U.S. sales - 2024','U.S. sales - 2025','U.S. CAGR (2018 or Launch - 2025)','Forecasters','Forecast Statistics']
        
                    for col in columns:
                        db.loc[matched_row, col] = ep.loc[ep_row, col]
        
                    db.loc[matched_row, 'U.S. Est. Launch'] = ep.loc[ep_row,'U.S. Est. Lauch']
        
        
        
        #%%
        
        print("EP non matches: " + str(num_product_nonmatches))
        print("EP matches: " + str(num_product_exact_matches))
        print("EP total: " + str(num_product_nonmatches + num_product_exact_matches))
        print("EP total products: " + str(len(ep)))
        print("EP length of product set: " + str(len(products)))
        print("EP double_matches: " + str(double_matches))
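
On the speed problem: the loop above re-scans db["alt_drug_names"] with str.contains several times for every product. For the exact-variant comparison only (the `product == name` check), one alternative might be to build a lookup from each variant to the rows that contain it once, then do a dictionary lookup per product. The sketch below is an assumption-laden illustration: build_variant_index, unique_row, and the sample drug names are hypothetical, and it does not reproduce the substring matching that str.contains performs.

```python
from collections import defaultdict


def build_variant_index(rows, sep="|"):
    """Map each exact name variant to the set of row labels that contain it."""
    index = defaultdict(set)
    for row_label, variants in rows:
        for name in variants.split(sep):
            name = name.strip()
            if name:
                index[name].add(row_label)
    return index


def unique_row(index, product):
    """Return the single row label whose variants include product, else None."""
    rows = index.get(product, set())
    return next(iter(rows)) if len(rows) == 1 else None


# Hypothetical stand-in for db["alt_drug_names"].items()
alt_drug_names = {
    0: "buprenorphine  naloxone|suboxone",
    1: "clonidine hydrochloride|clonidine",
    2: "clonidine|catapres",
}
index = build_variant_index(alt_drug_names.items())
print(unique_row(index, "suboxone"))   # found in row 0 only
print(unique_row(index, "clonidine"))  # ambiguous: rows 1 and 2
print(unique_row(index, "aspirin"))    # not present
```

The index is built in one pass over db, and each product lookup after that avoids scanning the column again. Names that are ambiguous or missing come back as None and could be handed to a slower fuzzy comparison.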
        

0 answers:

No answers