Filter DataFrame rows based on the length of column values

Time: 2017-07-13 19:40:56

Tags: pandas

I have a pandas DataFrame as follows:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [np.nan, 1], ['test string1', 5]], columns=['A', 'B'])

df
              A  B
0             1  2
1           NaN  1
2  test string1  5

I am using pandas 0.20. What is the most efficient way to remove the rows in which any column's value has a length greater than 10?

len('test string1')  # 12

So for the example above, I expect the following output:

df
              A  B
0             1  2
1           NaN  1

4 answers:

Answer 0 (score: 8)

If filtering on column A only:

In [865]: df[~(df.A.str.len() > 10)]
Out[865]:
     A  B
0    1  2
1  NaN  1

If filtering on all columns:

In [866]: df[~df.applymap(lambda x: len(str(x)) > 10).any(axis=1)]
Out[866]:
     A  B
0    1  2
1  NaN  1
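
For reference, here is a minimal, self-contained sketch of both variants as a plain script. It rebuilds the question's frame and, as an assumption on my part, swaps `applymap` for a per-column `Series.map`, since `applymap` is deprecated in recent pandas releases. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [np.nan, 1], ["test string1", 5]], columns=["A", "B"])

# Column A only: .str.len() yields NaN for non-string entries,
# and NaN > 10 evaluates to False, so those rows are kept.
too_long_in_a = df["A"].str.len() > 10
filtered_a = df[~too_long_in_a]

# All columns: measure the length of each value's string form, column by column,
# then flag a row if any of its values is longer than 10 characters.
too_long_anywhere = df.apply(
    lambda col: col.map(lambda x: len(str(x)) > 10)
).any(axis=1)
filtered_all = df[~too_long_anywhere]

print(filtered_a)
print(filtered_all)
```

Note that the all-columns variant measures `str(x)`, so a long number or timestamp would also count as "too long"; whether that is desired depends on the data.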

Answer 1 (score: 3)

In [42]: df
Out[42]:
              A  B                         C          D
0             1  2                         2 2017-01-01
1           NaN  1                       NaN 2017-01-02
2  test string1  5  test string1test string1 2017-01-03

In [43]: df.dtypes
Out[43]:
A            object
B             int64
C            object
D    datetime64[ns]
dtype: object

In [44]: df.loc[~df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(axis=1)]
Out[44]:
     A  B    C          D
0    1  2    2 2017-01-01
1  NaN  1  NaN 2017-01-02

Explanation

df.select_dtypes(['object']) selects only the columns of object (str) dtype:

In [45]: df.select_dtypes(['object'])
Out[45]:
              A                         C
0             1                         2
1           NaN                       NaN
2  test string1  test string1test string1

In [46]: df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10))
Out[46]:
       A      C
0  False  False
1  False  False
2   True   True

Now we can "aggregate" it row-wise as follows:

In [47]: df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(axis=1)
Out[47]:
0    False
1    False
2     True
dtype: bool

Finally, we select only the rows where the value is False:

In [48]: df.loc[~df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(axis=1)]
Out[48]:
     A  B    C          D
0    1  2    2 2017-01-01
1  NaN  1  NaN 2017-01-02
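
The steps above could be wrapped in a small helper, roughly like this. The function name `drop_rows_with_long_strings` and its `max_len` parameter are hypothetical, introduced only for illustration; the frame is rebuilt to match the four-column example in this answer.

```python
import numpy as np
import pandas as pd


def drop_rows_with_long_strings(frame, max_len=10):
    """Return the rows where no object-dtype column holds a string longer than max_len."""
    # Step 1: restrict the check to object (string-like) columns.
    text_cols = frame.select_dtypes(include=["object"])
    # Step 2: per-column length test; non-strings give NaN, and NaN.gt(...) is False.
    # Step 3: collapse to one boolean per row.
    too_long = text_cols.apply(lambda col: col.str.len().gt(max_len)).any(axis=1)
    # Step 4: keep only the rows that were not flagged.
    return frame.loc[~too_long]


df = pd.DataFrame({
    "A": [1, np.nan, "test string1"],
    "B": [2, 1, 5],
    "C": [2, np.nan, "test string1test string1"],
    "D": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03"]),
})

result = drop_rows_with_long_strings(df)
print(result)
```

Because only object columns are inspected, the integer column B and the datetime column D never need to be converted to strings.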

Answer 2 (score: 3)

I had to cast to a string for Diego's answer to work:

df = df[df['A'].apply(lambda x: len(str(x)) <= 10)]

Answer 3 (score: 1)

Use the Series apply function to keep the rows you want:

df = df[df['A'].apply(lambda x: len(x) <= 10)]
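
One caveat: on the question's data, column A also contains an integer and a NaN, and `len()` raises a `TypeError` for those. A minimal sketch of a guarded variant follows; the helper name `is_short_enough` is hypothetical, and treating non-string values as "short enough" is an assumption about the desired behaviour.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [np.nan, 1], ["test string1", 5]], columns=["A", "B"])


def is_short_enough(value, max_len=10):
    """Keep non-strings unconditionally; keep strings only if they have at most max_len characters."""
    return not (isinstance(value, str) and len(value) > max_len)


# The unguarded lambda fails on the integer in row 0.
try:
    df["A"].apply(lambda x: len(x) <= 10)
    unguarded_failed = False
except TypeError:
    unguarded_failed = True

kept = df[df["A"].apply(is_short_enough)]
print(kept)
```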