Question

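I was hesitant to ask a question here, but here goes:

I have a DataFrame with a column named "id". I want to get rid of all rows whose value in that column does not start with a letter. Below is a sample of the DataFrame I am working with.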

import pandas as pd

df = pd.DataFrame({"level": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   "personCode": [23, 5, 3, 234, 6567, 232, 67667, 56, 998, 2456],
                   "id": ["Z71.89", "J06.9", "018.9", "F41.1", "M72.2", "440.0", "L85.1", "000.00", "000.00", "I48.91"]})

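I am working with a large dataset, and I recently found that a for loop over a DataFrame of this size is not feasible. I do not know of any vectorized string method that would do what I need. Essentially, I am looking for a boolean test like isalpha() applied to the first character of each string in the "id" column. Once I have that, I want to drop the entire row.

I have been working on this for the past two days without any progress... any feedback would be great! Thanks.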

Answer 1

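Another option is to check whether the first character is a letter and negate the result, which selects the rows that do not start with one: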

df[-df.id.str[0].str.isalpha()]
#       id  level  personCode
#2   018.9      3           3
#5   440.0      6         232
#7  000.00      8          56
#8  000.00      9         998

(Or df[~df.id.str[0].str.isalpha()], if you prefer the tilde.)
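Since the question asks to drop the rows that do not start with a letter, the same mask can also be used without the negation. The following is a minimal sketch on the sample frame from the question; the names starts_with_letter, kept and dropped are introduced here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "level": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "personCode": [23, 5, 3, 234, 6567, 232, 67667, 56, 998, 2456],
    "id": ["Z71.89", "J06.9", "018.9", "F41.1", "M72.2",
           "440.0", "L85.1", "000.00", "000.00", "I48.91"],
})

# True where the first character of "id" is alphabetic
starts_with_letter = df["id"].str[0].str.isalpha()

kept = df[starts_with_letter]       # rows whose id starts with a letter
dropped = df[~starts_with_letter]   # rows whose id starts with anything else
```

Note that isalpha() is Unicode-aware, so it also accepts letters outside a-z and A-Z, whereas a character class such as [a-zA-Z] covers ASCII letters only.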

Answer 2

One option is to use str.match; here the regular expression [^a-zA-Z] matches a non-letter character at the start of the string:

df[df.id.str.match('[^a-zA-Z]')]

#       id  level   personCode
#2   018.9      3   3
#5   440.0      6   232
#7  000.00      8   56
#8  000.00      9   998

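str.match appeared to be deprecated at the time of writing (current pandas releases still support it), so an alternative is str.contains with the ^ anchor to pin the match to the start of the string: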

df[df.id.str.contains('^[^a-zA-Z]')]

#       id  level   personCode
#2   018.9      3   3
#5   440.0      6   232
#7  000.00      8   56
#8  000.00      9   998
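The two regex variants differ only in where they look: str.match tests at the beginning of each string (like re.match), while str.contains searches anywhere (like re.search) and therefore needs the ^ anchor. The sketch below illustrates this on a small hypothetical Series, ids, which includes a missing value to show the na=False argument; it is an illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample, including one missing id
ids = pd.Series(["Z71.89", "018.9", None, "440.0", "L85.1"])

pattern = "[^a-zA-Z]"

# str.match only tests at the beginning of each string
by_match = ids.str.match(pattern, na=False)

# str.contains searches anywhere, so ^ pins it to the first character
by_contains = ids.str.contains("^" + pattern, na=False)

# Without the anchor, any id containing a digit or "." anywhere matches
unanchored = ids.str.contains(pattern, na=False)
```

Passing na=False makes missing values count as "no match", so the result is a clean boolean mask that can be used directly for indexing.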

Naive timing of the str.contains and isalpha approaches on the sample data:

%timeit df[df.id.str.contains('^[^a-zA-Z]')]
#1000 loops, best of 3: 418 µs per loop

%timeit df[-df.id.str[0].str.isalpha()]
#1000 loops, best of 3: 576 µs per loop
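The figures above were measured on the ten-row sample, so they say little about a large dataset. A rough way to repeat the comparison outside IPython is sketched below with the standard timeit module; the frame big, its 100,000 rows and the helper names are assumptions made for illustration, and the results will vary by machine and pandas version.

```python
import timeit
import pandas as pd

sample_ids = ["Z71.89", "J06.9", "018.9", "F41.1", "M72.2",
              "440.0", "L85.1", "000.00", "000.00", "I48.91"]

# Hypothetical larger frame: the sample ids repeated to 100,000 rows
big = pd.DataFrame({"id": sample_ids * 10_000})

def with_contains():
    # regex anchored to the first character
    return big[big["id"].str.contains("^[^a-zA-Z]")]

def with_isalpha():
    # first character extracted, then tested and negated
    return big[~big["id"].str[0].str.isalpha()]

for fn in (with_contains, with_isalpha):
    seconds = timeit.timeit(fn, number=5) / 5
    print(f"{fn.__name__}: {seconds * 1e3:.1f} ms per call")
```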

pandas: identify if the first character in an entry is a letter or a number
