Question

Below is a sample dataframe. For each Bus # DESCRIPTION, I want to find all other Bus #s whose descriptions contain at least one of the same words.

Bus #                  DESCRIPTION

Bus1                   RICE MILLS MANUFACTURER 
Bus2                   LICORICE CANDY RETAIL
Bus3                   LICORICE CANDY WHOLESALE
Bus4                   RICE RETAIL

For example, the output:

RICE MILLS MANUFACTURER would be "RICE RETAIL"
LICORICE CANDY RETAIL would be "RICE RETAIL" "LICORICE CANDY WHOLESALE"
LICORICE CANDY WHOLESALE would be "LICORICE CANDY RETAIL"
RICE RETAIL would be: "RICE MILLS MANUFACTURER" "LICORICE CANDY RETAIL"

The following code almost does this correctly.

df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][0].split()[0])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][0].split()[1])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][0].split()[2])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][1].split()[0])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][1].split()[1])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][1].split()[2])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][2].split()[0])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][2].split()[1])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][2].split()[2])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][3].split()[0])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][3].split()[1])]

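The problem is that the word "RICE" is contained in "LICORICE", so the output for RICE MILLS MANUFACTURER also includes the LICORICE descriptions. I don't want that; only whole-word matches should count.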
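To make the substring problem concrete, here is a minimal sketch in plain Python. The helper name `contains_word` is made up for illustration. It wraps the word in `\b` word boundaries so that "RICE" no longer matches inside "LICORICE".

```python
import re

descriptions = [
    "RICE MILLS MANUFACTURER",
    "LICORICE CANDY RETAIL",
    "LICORICE CANDY WHOLESALE",
    "RICE RETAIL",
]

def contains_word(text, word):
    # \b anchors the match at word boundaries; re.escape protects against
    # regex metacharacters that may appear in the word itself.
    return re.search(r"\b" + re.escape(word) + r"\b", text) is not None

# Plain substring test: matches all four rows, because LICORICE contains RICE.
substring_hits = [d for d in descriptions if "RICE" in d]

# Whole-word test: matches only the two rows that really contain the word RICE.
whole_word_hits = [d for d in descriptions if contains_word(d, "RICE")]
```

Since `Series.str.contains` interprets its pattern as a regular expression by default, the same `\b...\b` pattern should be usable there too. Note that `\b` only behaves as expected when the word starts and ends with a letter, digit, or underscore.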

Answer 1

This is still O(n^2), but it is highly vectorized.

import numpy as np
import pandas as pd

# get values of DESCRIPTION
s = df.DESCRIPTION.values.astype(str)

# split each string on whitespace and turn the word lists into sets
sets = np.array([set(l) for l in np.char.split(s).tolist()])

# get upper triangle indices for all combinations of DESCRIPTION
r, c = np.triu_indices(len(sets), 1)

# use set operations to replicate intersection
i = sets[r] - sets[c] < sets[r]

# grab indices where intersections happen
r, c = r[i], c[i]
r, c = np.append(r, c), np.append(c, r)
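The comparison `sets[r] - sets[c] < sets[r]` may look cryptic, so here is a small plain-Python sketch of the idea behind it. The function name `shares_word` is only for illustration. Removing the words of `y` from `x` yields a proper subset of `x` exactly when the two sets have at least one word in common.

```python
a = {"RICE", "MILLS", "MANUFACTURER"}
b = {"RICE", "RETAIL"}
c = {"LICORICE", "CANDY", "WHOLESALE"}

def shares_word(x, y):
    # x - y drops every word of x that also occurs in y. The result is a
    # proper subset of x (the < operator) only if at least one word was dropped.
    return (x - y) < x

# The trick agrees with a direct non-empty-intersection test.
same_as_intersection = all(
    shares_word(x, y) == bool(x & y)
    for x in (a, b, c, set())
    for y in (a, b, c, set())
)
```

Because the sets hold whole words, "RICE" and "LICORICE" are different elements and never count as a match.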

Result

df.DESCRIPTION.iloc[c].groupby(r).apply(list)

0                                       [RICE RETAIL]
1             [LICORICE CANDY WHOLESALE, RICE RETAIL]
2                             [LICORICE CANDY RETAIL]
3    [RICE MILLS MANUFACTURER, LICORICE CANDY RETAIL]
Name: DESCRIPTION, dtype: object

Comparison

# build truth matrix (np.bool is not available in some NumPy releases; plain bool always works)
t = np.zeros((s.size, s.size), dtype=bool)

t[r, c] = True

pd.DataFrame(t, df.index, df.index)

       0      1      2      3
0  False  False  False   True
1  False  False   True   True
2  False   True  False  False
3   True   True  False  False

Answer 2

def match_word(ref_row,series):
    """
    --inputs
    ref_row (str): this is the string of reference
    series (pandas.series): this a series containing all other strings you want to cross-check
    --outputs:
    series (pandas.series): this will be a series of booleans
    """
    #convert ref_row into a set of strings. Use strip to remove whitespaces before and after the initial string
    ref_row = set(ref_row.strip().split(' '))
    #convert strings into set of strings 
    series = series.apply(lambda x:set(x.strip().split(' ')))
    #now cross check each row with the reference row.
    #find the size (number of words) of the intersection
    series = series.apply(lambda x:len(list(x.intersection(ref_row))))
    #if the size of the intersection set is greater than zero. Then there is a common word between ref_row and all the series
    series = series>0
    return series

Now you can call the function above as follows:

df['DESCRIPTION'].apply(lambda x:match_word(x,df['DESCRIPTION']))

Note that this is not the most optimized algorithm, but it is a quick and dirty approach. It is O(n^2).
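If the pairwise comparison becomes too slow, one alternative (not taken from either answer above, just a sketch under my own assumptions) is an inverted index that maps each word to the rows containing it. The function name `related_rows` is made up, and the rows are identified by position rather than by Bus #.

```python
from collections import defaultdict

descriptions = [
    "RICE MILLS MANUFACTURER",
    "LICORICE CANDY RETAIL",
    "LICORICE CANDY WHOLESALE",
    "RICE RETAIL",
]

def related_rows(texts):
    # Map every whole word to the set of row positions whose text contains it.
    index = defaultdict(set)
    for pos, text in enumerate(texts):
        for word in text.split():
            index[word].add(pos)

    # For each row, collect the rows listed under any of its words,
    # then drop the row itself.
    related = []
    for pos, text in enumerate(texts):
        hits = set()
        for word in text.split():
            hits |= index[word]
        hits.discard(pos)
        related.append(sorted(hits))
    return related

matches = related_rows(descriptions)
```

For the sample data this gives the same pairing as the Result in Answer 1. It only pays off when words are shared by few rows; if one word appears in every description, the work is still quadratic.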

Find intersection of words in dataframe strings - whole words only
