Question

I have a dataframe

member_id,event_type,event_path,event_time,event_date,event_duration
20077,2016-11-20,"2016-11-20 09:17:07",url,e.mail.ru/message/14794236680000000730/,0
20077,2016-11-20,"2016-11-20 09:17:07",url,e.mail.ru/message/14794236680000000730/,2
20077,2016-11-20,"2016-11-20 09:17:09",url,avito.ru/profile/messenger/channel/u2i-558928587-101700461?utm_source=avito_mail&utm_medium=email&utm_campaign=messenger_single&utm_content=test,1
20077,2016-11-20,"2016-11-20 09:17:37",url,avito.ru/auto/messenger/channel/u2i-558928587-101700461?utm_source=avito_mail&utm_medium=email&utm_campaign=messenger_single&utm_content=test,135
20077,2016-11-20,"2016-11-20 09:19:53",url,e.mail.ru/message/14794236680000000730/,0
20077,2016-11-20,"2016-11-20 09:19:53",url,e.mail.ru/message/14794236680000000730/,37

and another one, df2

domain  category    subcategory unique id   count_sec   Main category   Subcategory
avito.ru/auto   Автомобили Авто 1600    83112396    Auto  Avito
youtube.com Видеопортал Видеохостинг    1317    42710996    Video   Youtube
ok.ru   Развлечения     Социальные сети 694 13394605    Social network  OK
kinogo.club Развлечения     Кино    497 8438800 Video   Illegal
e.mail.ru   Почтовый сервис None    1124    8428984 Mail.ru Email
vk.com/audio    Видеопортал Видеохостинг    1020    7409440 Music   VK

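Usually I use:

df['category'] = df.event_date.map(df2.set_index('domain')['Main category'])

But this compares the values for exact equality: where they are equal, it takes the value and puts it in the new column. How can I do the same thing when the domain only needs to be a substring of the string?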

Answer 1

I'm not really sure what exactly you want to do, but my suggestion is this:

Test it on a subset of df first, as it may take a while depending on the amount of data you have.
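One way the substring lookup from the question could be done is sketched below. It is a minimal illustration under stated assumptions, not a definitive implementation: the two frames are small stand-ins built from the question's sample, `find_category` and `lookup` are hypothetical names, and `example.org/unknown` is an invented URL added to show the no-match case. Every URL is checked against every domain, so the cost grows with rows × domains, which is why trying it on a subset first is sensible.

```python
import pandas as pd

# Toy stand-ins for the question's frames. The URLs sit in the column the
# question calls `event_date`, so the same name is kept here.
df = pd.DataFrame({
    "event_date": [
        "e.mail.ru/message/14794236680000000730/",
        "avito.ru/auto/messenger/channel/u2i-558928587-101700461",
        "example.org/unknown",  # hypothetical URL with no entry in df2
    ]
})
df2 = pd.DataFrame({
    "domain": ["avito.ru/auto", "e.mail.ru", "vk.com/audio"],
    "Main category": ["Auto", "Mail.ru", "Music"],
})

# Longest domains first, so a more specific entry wins over a shorter one.
lookup = sorted(
    zip(df2["domain"], df2["Main category"]),
    key=lambda pair: len(pair[0]),
    reverse=True,
)

def find_category(url):
    """Return the category of the first domain contained in `url`, else None."""
    for domain, category in lookup:
        if domain in url:
            return category
    return None

# Row-wise apply: simple, but slow on large frames.
df["category"] = df["event_date"].apply(find_category)
print(df)
```

Rows whose URL contains none of the domains end up with a missing value in the new column.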

Answer 2

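Without some heuristic for finding the fuzzy matches to join on, you won't get a scalable solution, because you would need to generate O(N²) comparisons.

For your specific use case, I'd suggest extracting the part of the URL that you actually want to compare. Perhaps something like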

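# Python 3: urlparse lives in urllib.parse (it was the urlparse module in Python 2)
from urllib.parse import urlparse

def netloc(s):
    # The values have no scheme, so prepend one to make urlparse fill in netloc
    return urlparse('http://' + s).netloc

# In the question's df the URLs are in the event_date column
df['netloc'] = df['event_date'].apply(netloc)
df2['netloc'] = df2['domain'].apply(netloc)

# Left join keeps every row of df, matched or not
result = df.merge(df2, how='left', on='netloc')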
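Note that netloc drops the path, so an entry such as avito.ru/auto is reduced to avito.ru and vk.com/audio to vk.com. If the path part of the domain matters, one possible alternative (a sketch under assumptions, not part of the answer above) is to pull the matching domain out of each URL with a single regular expression and then reuse the map call from the question. The frames below are toy stand-ins, and `example.org/unknown` is an invented URL for the no-match case.

```python
import re
import pandas as pd

# Toy stand-ins for the question's frames (URLs are in `event_date`).
df = pd.DataFrame({
    "event_date": [
        "e.mail.ru/message/14794236680000000730/",
        "avito.ru/auto/messenger/channel/u2i-558928587-101700461",
        "example.org/unknown",  # hypothetical URL with no entry in df2
    ]
})
df2 = pd.DataFrame({
    "domain": ["avito.ru/auto", "e.mail.ru", "vk.com/audio"],
    "Main category": ["Auto", "Mail.ru", "Music"],
})

# One alternation pattern, longest domains first so the regex engine tries
# the more specific entries before shorter ones. re.escape keeps the dots
# in the domains from acting as wildcards.
domains = sorted(df2["domain"], key=len, reverse=True)
pattern = "(" + "|".join(re.escape(d) for d in domains) + ")"

# str.extract returns the first match per row, or a missing value if none.
df["domain"] = df["event_date"].str.extract(pattern, expand=False)

# With the domain isolated, the exact-match map from the question works again.
df["category"] = df["domain"].map(df2.set_index("domain")["Main category"])
print(df)
```

This keeps the work inside pandas string methods instead of a Python-level loop over rows, though the pattern still grows with the number of domains in df2.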

Pandas: if the value in one column contains a substring
