Question

是否有更快的方法来查找Pandas DataFrame中最长字符串的长度，而不是下面示例中显示的字符串？

import numpy as np
import pandas as pd

x = ['ab', 'bcd', 'dfe', 'efghik']
x = np.repeat(x, 1e7)
df = pd.DataFrame(x, columns=['col1'])

print df.col1.map(lambda x: len(x)).max()
# result --> 6

使用IPython的df.col1.map(lambda x: len(x)).max()计时时，运行%timeit大约需要10秒。

Answer 1

帝斯曼的建议似乎是关于没有做一些手动微优化的最佳方法：

%timeit -n 100 df.col1.str.len().max()
100 loops, best of 3: 11.7 ms per loop

%timeit -n 100 df.col1.map(lambda x: len(x)).max()
100 loops, best of 3: 16.4 ms per loop

%timeit -n 100 df.col1.map(len).max()
100 loops, best of 3: 10.1 ms per loop

请注意，明确使用str.len()方法似乎没有太大改进。如果您不熟悉IPython，这是非常方便的%timeit语法的来源，我肯定建议快速测试一下这样的事情。

更新添加了屏幕截图：

Answer 2

作为次要添加，您可能希望遍历数据框中的所有对象列：

for c in df:
    if df[c].dtype == 'object':
        print('Max length of column %s: %s\n' %  (c, df[c].map(len).max()))

这将防止bool，int types等抛出错误。

可以针对其他非数字类型进行扩展，例如＆＃39; string _＆＃39;，＆＃39; unicode _＆＃39;即。

if df[c].dtype in ('object', 'string_', 'unicode_'):

Answer 3

有时您需要最长字符串的字节长度。这与使用花哨的Unicode字符的字符串相关，在这种情况下，字节长度大于常规长度。这在特定情况下非常相关，例如用于数据库写入。

df_col_len = int(df[df_col_name].str.encode(encoding='utf-8').str.len().max())

以上行有额外的str.encode(encoding='utf-8')。输出包含在int()中，因为它是一个numpy对象。

Answer 4

很好的答案，特别是Marius和Ricky很有帮助。

鉴于我们大多数人都在优化编码时间，以下是对这些答案的快速扩展，可以按列返回所有列的最大项目长度，并按每列的最大项目长度进行排序：

mx_dct = {c: df[c].map(lambda x: len(str(x))).max() for c in df.columns}
pd.Series(mx_dct).sort_values(ascending =False)

或作为一个班轮：

pd.Series({c: df[c].map(lambda x: len(str(x))).max() for c in df).sort_values(ascending =False)

适应原始样本，可以将其演示为：

import pandas as pd

x = [['ab', 'bcd'], ['dfe', 'efghik']]
df = pd.DataFrame(x, columns=['col1','col2'])

print(pd.Series({c: df[c].map(lambda x: len(str(x))).max() for c in df}).sort_values(ascending =False))

输出：

col2    6
col1    3
dtype: int64

Answer 5

您应该尝试使用 numpy 。这也可以帮助您提高效率。

下面的代码将为您提供excel电子表格中每列的最大长度（使用熊猫读入数据框）

import pandas as pd
import numpy as np

xl = pd.ExcelFile('sample.xlsx')
df = xl.parse('Sheet1')

columnLenghts = np.vectorize(len)
maxColumnLenghts = columnLenghts(df.values.astype(str)).max(axis=0)
print('Max Column Lengths ', maxColumnLenghts)

在Pandas数据帧列中查找最长字符串的长度

5 个答案: