Question

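I have a DataFrame in which the website is one of the columns. I'm trying to create a clean string column that drops everything after .com / .net / .org / .edu, etc. My approach is to find the position of each suffix and add the appropriate number of characters, so that anything after .com / .net is excluded.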

**string**  
https:/amazon.com  
google.com/  
http:/onlinelearning.edu/home  
walmart.net/  
https:/target.onlinesales.org/home/goods  
https:/target.onlinesales.de/home/goods  

**new string**  
https:/amazon.com    
google.com  
http:/onlinelearning.edu   
walmart.net  
https:/target.onlinesales.org  
https:/target.onlinesales.de

For the rows that contain .com:

df['length'] = np.where(df['string'].str.contains('.com', regex=False), df['string'].str.find('.com') + 4, df['string'].str.len())
df['new_string'] = [y[:x] for (x, y) in zip(df['length'], df['string'])]

Answer 1

This is a job for regex. You can use pd.Series.str.replace with a negative lookbehind:

print(df["col"].str.replace(r"(?<!:)/.*", "", regex=True))

Or list all of your required domains with positive lookbehinds:

print(df["col"].str.replace(r"(?:(?<=com)|(?<=edu)|(?<=org)|(?<=de)|(?<=net))/.*", "", regex=True))

0                -https:/amazon.com
1                       -google.com
2         -http:/onlinelearning.edu
3                      -walmart.net
4    -https:/target.onlinesales.org
5     -https:/target.onlinesales.de
Name: col, dtype: object
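
For reference, here is a minimal self-contained sketch that applies both patterns to the sample strings from the question. The column name `col` follows the answer, and the names `by_slash` and `by_suffix` are only illustrative. `regex=True` is passed explicitly because pandas 2.0 and later treat the pattern as a literal string by default.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"col": [
    "https:/amazon.com",
    "google.com/",
    "http:/onlinelearning.edu/home",
    "walmart.net/",
    "https:/target.onlinesales.org/home/goods",
    "https:/target.onlinesales.de/home/goods",
]})

# Negative lookbehind: remove everything from the first "/" that is
# not directly preceded by ":".
by_slash = df["col"].str.replace(r"(?<!:)/.*", "", regex=True)

# Positive lookbehinds: remove everything from a "/" that directly
# follows one of the listed suffixes.
by_suffix = df["col"].str.replace(
    r"(?:(?<=com)|(?<=edu)|(?<=org)|(?<=de)|(?<=net))/.*", "", regex=True
)

print(by_slash.tolist())
print(by_suffix.tolist())
```

Both patterns give the same result on this data. The first one is shorter but relies on the scheme being written with a single slash, as in the sample; a string such as `https://amazon.com` would be cut at the second slash.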

You can refine the pattern further to cover more cases.
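
One possible refinement, sketched here with the standard `re` module, is to build the pattern from a list of suffixes and keep everything up to the first `.suffix` that is followed by `/`, `?`, `#`, `:` or the end of the string. The `SUFFIXES` list and the `clean` helper are assumptions made for illustration, not part of the original answer.

```python
import re

# Assumed list of suffixes to recognize; extend it as needed.
SUFFIXES = ["com", "net", "org", "edu", "de"]

# Group 1 lazily captures the text up to and including the first ".<suffix>"
# that is followed by "/", "?", "#", ":" or the end of the string.
PATTERN = re.compile(
    r"^(.*?\.(?:" + "|".join(map(re.escape, SUFFIXES)) + r"))(?=[/?#:]|$).*$"
)

def clean(url: str) -> str:
    """Return url truncated after its suffix, or unchanged if no suffix matches."""
    match = PATTERN.match(url)
    return match.group(1) if match else url

for url in ["google.com/", "www.company.com/x", "example.com?q=1", "example.io/path"]:
    print(url, "->", clean(url))
```

The lookahead keeps a suffix from matching inside a longer label, so `www.company.com/x` is cut after the final `.com` rather than after `www.com`. Strings with no listed suffix are returned unchanged. On a DataFrame column, the helper could be applied with `df["col"].map(clean)`.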
