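Matching strings within the same column of a large dataset?

Asked: 2018-12-22 20:35:33

Tags: python pandas dataframe

I'm trying to work out why my code returns only the first letter of each row rather than the longest matching string. The dataset is large: 1 column and 15,500 rows.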

 import pandas as pd
 from os.path import commonprefix

 # on_bad_lines='skip' replaces error_bad_lines=False (pandas >= 1.3)
 df = pd.read_csv('newproducts.csv', on_bad_lines='skip')

 # Cross join: pair every name with every name (including itself)
 df['onkey'] = 1
 df1 = pd.merge(df[['name', 'onkey']], df[['name', 'onkey']], on='onkey')
 df1['list'] = df1.apply(lambda x: [x.name_x, x.name_y], axis=1)

 # Common prefix of each pair and its length
 df1['COL1'] = df1['list'].apply(commonprefix)
 df1['COL1_num'] = df1['COL1'].apply(len)
 df1 = df1[df1['COL1_num'] != 0]
 # idxmin keeps, per name, the pair with the shortest common prefix
 df1 = df1.loc[df1.groupby('name_x')['COL1_num'].idxmin()]
 df = df.rename(columns={'name': 'name_x'})
 df = pd.merge(df, df1[['name_x', 'COL1']], on='name_x', how='left')

 # Split each name into the matched prefix and the remainder
 df['len'] = df['COL1'].apply(len)
 df['other'] = df.apply(lambda x: x.name_x[x.len:], axis=1)
 df['COL1'] = df['COL1'].apply(lambda x: x.strip())
 # endswith avoids an IndexError when the prefix is an empty string
 df['COL1'] = df['COL1'].apply(lambda x: x[:-1] if x.endswith('-') else x)
 df['other'] = df['other'].apply(lambda x: x.split('-'))
 df = df[['COL1', 'other']]
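
A likely cause, offered as a guess rather than a confirmed diagnosis: `groupby('name_x')['COL1_num'].idxmin()` keeps the shortest non-empty common prefix for each name, and across a whole catalogue that can shrink to a single character. Switching to `idxmax()` alone would probably not help either, because the self-merge pairs every name with itself and that pair's prefix is the full name. The sketch below uses four names from the sample and plain Python (no pandas) to compare the two choices. The helper `longest_shared_prefix` is a hypothetical name; it relies on the fact that, in a sorted list, a string's longest common prefix with any other string is reached at one of its immediate neighbours, which avoids building all 15,500 x 15,500 pairs.

```python
from os.path import commonprefix

# Four names taken from the sample input
names = [
    "10 funny elephant Hummer - Pink",
    "10 funny elephant Hummer - Purple",
    "10-funniest Risque Slim - Black",
    "10-funniest Risque Slim - Blue",
]


def longest_shared_prefix(names):
    """Map each name to its longest common prefix with any *other* name.

    Hypothetical helper: after sorting, only the previous and next
    entries need to be compared, so the work is O(n log n), not O(n^2).
    """
    ordered = sorted(set(names))
    result = {}
    for i, name in enumerate(ordered):
        neighbours = ordered[max(i - 1, 0):i] + ordered[i + 1:i + 2]
        result[name] = max(
            (commonprefix([name, other]) for other in neighbours),
            key=len,
            default="",
        )
    return result


# What idxmin effectively keeps: the shortest non-empty prefix over all pairs
shortest = {
    a: min(
        (commonprefix([a, b]) for b in names if commonprefix([a, b])),
        key=len,
    )
    for a in names
}
longest = longest_shared_prefix(names)

for name in names:
    print(repr(shortest[name]), "|", repr(longest[name]))
```

On these four names the shortest prefix is `'10'` for every row, while the longest prefix shared with another name is `'10 funny elephant Hummer - P'` or `'10-funniest Risque Slim - Bl'`.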

Input. This is the column I start with. I want to find the longest common string and put the non-matching part in a separate column.

product name
10 funniest Silicone Emperor - Ivory
10 funniest Stud 7 Inches - Hot Pink
10 funny elephant Hummer - Pink
10 funny elephant Hummer - Purple
10 Inch Realistic Dual Density Squirting snake
10 Inch Silicone Comfort Nozzle Attachment
10" comforter snake & comforter Bit Set - Black
10" comforter Jelly & comforter Bit Set - Pink
10" comforter Jelly & comforter Bit Set - Purple
10" Thick ladder W/balls & Suction - Black
100 insect magnets
1000 cloud Games
10-funniest Adonis Conqueror - Black
10-funniest Adonis Explorer - Red
10-funniest Adonis Vibrating Probe - Red
10-funniest Adonis Vibrating Strokers - Red
10-funniest Charisma Bliss - Black
10-funniest Charisma Bliss - Pink
10-funniest Charisma Kiss - Pink
10-funniest Charisma Tryst - Black
10-funniest Risque G-Vibe - Black
10-funniest Risque G-Vibe - Blue
10-funniest Risque G-Vibe - Purple
10-funniest Risque Slim - Black
10-funniest Risque Slim - Blue
10-funniest Risque Slim - Purple
10-funniest Risque Tulip - Black
10-funniest Risque Tulip - Blue
10-funniest Risque Tulip - Purple

Output. The matching part goes in the first column and the non-matching part in a second column.

new product name    
10 funniest Silicone Emperor     Ivory
10 funniest Stud 7 Inches    Hot Pink
10 funny elephant Hummer     Pink
10 funny elephant Hummer     Purple
10 Inch Realistic Dual Density Squirting snake  
10 Inch Silicone Comfort Nozzle Attachment  
10" comforter snake & comforter Bit Set      Black
10" comforter Jelly & comforter Bit Set      Pink
10" comforter Jelly & comforter Bit Set      Purple
10" Thick ladder W/balls & Suction   Black
100 insect magnets  
1000 cloud Games    
10-funniest Adonis Conqueror     Black
10-funniest Adonis Explorer      Red
10-funniest Adonis Vibrating Probe   Red
10-funniest Adonis Vibrating Strokers    Red
10-funniest Charisma Bliss   Black
10-funniest Charisma Bliss   Pink
10-funniest Charisma Kiss    Pink
10-funniest Charisma Tryst   Black
10-funniest Risque G-vibe    Black
10-funniest Risque G-vibe    Blue
10-funniest Risque G-vibe    Purple
10-funniest Risque Slim      Black
10-funniest Risque Slim      Blue
10-funniest Risque Slim      Purple
10-funniest Risque Tulip     Black
10-funniest Risque Tulip     Blue
10-funniest Risque Tulip     Purple
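
Judging only from the sample above, the expected output looks the same as splitting each name at its last " - " (space, hyphen, space). A character-level common prefix can run past that point, for example into the "P" shared by Pink and Purple. A minimal sketch under that assumption follows; `split_name` and its `sep` parameter are hypothetical names, and the rule may not hold for the rest of the 15,500 rows.

```python
def split_name(name, sep=" - "):
    """Split a product name at the last occurrence of sep.

    Returns (product, variant). Names without the separator are
    returned unchanged with an empty variant.
    """
    head, found, tail = name.rpartition(sep)
    if not found:
        # rpartition puts the whole string in `tail` when sep is absent
        return name.strip(), ""
    return head.strip(), tail.strip()


# Names taken from the sample input
samples = [
    "10 funniest Stud 7 Inches - Hot Pink",
    "10-funniest Risque G-Vibe - Black",
    "100 insect magnets",
]
for s in samples:
    print(split_name(s))
```

Hyphens without surrounding spaces, as in "10-funniest" or "G-Vibe", are left alone because the separator includes the spaces. With pandas, the same function could be applied row by row, for instance through `df['name'].apply(split_name)`, assuming the column is called `name` as in the code above.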

0 answers:

No answers yet