Question

I have a DataFrame in Python with a string column, which I want to split into more columns.

Some rows of the DataFrame look like this:

COLUMN

ORDP//NAME/iwantthispart/REMI/MORE TEXT
/REMI/SOMEMORETEXT
/ORDP//NAME/iwantthispart/ADDR/SOMEADRESS
/BENM//NAME/iwantthispart/REMI/SOMEMORETEXT

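So basically, I want everything after '/NAME/' up to the next '/'. However, not every row has the '/NAME/iwantthispart/' field, as the second row shows.

I tried using the split function, but it gives the wrong result.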

mt['COLUMN'].apply(lambda x: x.split('/NAME/')[-1])

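This just gives me everything after the /NAME/ part, and when there is no /NAME/ it returns the complete string.

Does anyone have tips or a solution? Help is much appreciated! (The bullets are only there for readability and are not actually in the data.)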
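The behaviour described above can be reproduced in plain Python without pandas. This is a minimal sketch; the variable names `with_name` and `without_name` are made up for illustration, and the strings are taken from the sample rows.

```python
# Two sample rows from the question (variable names are hypothetical).
with_name = "ORDP//NAME/iwantthispart/REMI/MORE TEXT"
without_name = "/REMI/SOMEMORETEXT"

# split() cuts at '/NAME/' and [-1] keeps the last piece, which still
# contains everything after the wanted part.
tail = with_name.split("/NAME/")[-1]
print(tail)

# Without the separator, split() returns a one-element list holding the
# whole string, so [-1] hands the original string back unchanged.
unchanged = without_name.split("/NAME/")[-1]
print(unchanged)
```

So the split alone neither stops at the next '/' nor signals that '/NAME/' was missing.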

Answer 1

You can use str.extract with a regular expression to pull out the pattern you want:

# Generally, to match all word characters:
df.COLUMN.str.extract(r'NAME/(\w+)')

OR

# More specifically, to match everything up to the next slash:
df.COLUMN.str.extract('NAME/([^/]*)')

Both return:

0    iwantthispart
1              NaN
2    iwantthispart
3    iwantthispart
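For a self-contained version of this, here is a minimal sketch that rebuilds the sample rows from the question. The names `df`, `names`, and `NAME_PART` are assumptions for illustration. Passing `expand=False` asks pandas for a Series rather than a one-column DataFrame, which matches the output shown above.

```python
import pandas as pd

# Sample rows copied from the question; `df` is an assumed name.
df = pd.DataFrame({
    "COLUMN": [
        "ORDP//NAME/iwantthispart/REMI/MORE TEXT",
        "/REMI/SOMEMORETEXT",
        "/ORDP//NAME/iwantthispart/ADDR/SOMEADRESS",
        "/BENM//NAME/iwantthispart/REMI/SOMEMORETEXT",
    ]
})

# Capture everything after 'NAME/' up to (not including) the next slash.
# Rows without a match become NaN instead of echoing the whole string.
names = df["COLUMN"].str.extract(r"NAME/([^/]*)", expand=False)

# Store the result in a new column (hypothetical name) so the original
# text is kept.
df["NAME_PART"] = names
print(df["NAME_PART"])
```

Writing to a new column rather than overwriting `COLUMN` keeps the raw text available if further fields need to be extracted later.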

Answer 2

These two lines will give you the second word, regardless of whether the first word is NAME:

mt["column"]=mt["column"].str.extract(r"(\w+/\w+/)")
mt["column"].str.extract(r"(\/\w+)")

This gives the following result in the pandas DataFrame:

/iwantthispart
/SOMEMORETEXT
/iwantthispart
/iwantthispart

If you are only interested in rows that contain NAME, this will work for you:

mt["column"]=mt["column"].str.extract(r"(NAME/\w+/)")
mt["column"].str.extract(r"(\/\w+)")

This produces the following result:

/iwantthispart
NaN
/iwantthispart
/iwantthispart
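The two-step extraction for rows containing NAME can also be collapsed into a single pattern that leaves the original column untouched. This is a sketch under assumptions: `mt` and `column` follow the names used in this answer, the rows are the question's sample data, and `result` is a made-up name.

```python
import pandas as pd

# Sample rows from the question, under the column name used in this answer.
mt = pd.DataFrame({
    "column": [
        "ORDP//NAME/iwantthispart/REMI/MORE TEXT",
        "/REMI/SOMEMORETEXT",
        "/ORDP//NAME/iwantthispart/ADDR/SOMEADRESS",
        "/BENM//NAME/iwantthispart/REMI/SOMEMORETEXT",
    ]
})

# The group captures the slash plus the word that follows 'NAME', so the
# leading '/' is kept as in the output above. Rows without NAME give NaN.
result = mt["column"].str.extract(r"NAME(/\w+)", expand=False)
print(result)
```

Because nothing is assigned back to `mt["column"]`, the source strings stay available for other extractions.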

Extracting a substring between two strings in Python
