Split (explode) ranges in a DataFrame into multiple rows

Time: 2018-02-13 21:49:44

Tags: python pandas numpy dataframe

This question is similar to "Split (explode) pandas dataframe string entry to separate rows", but it also asks how to expand ranges.

I have a DataFrame:

+------+---------+----------------+
| Name | Options | Email          |
+------+---------+----------------+
| Bob  | 1,2,4-6 | bob@email.com  |
+------+---------+----------------+
| John |   NaN   | john@email.com |
+------+---------+----------------+
| Mary |   1,2   | mary@email.com |
+------+---------+----------------+
| Jane | 1,3-5   | jane@email.com |
+------+---------+----------------+

I want to split the Options column on commas and also add one row for each value in a range.

+------+---------+----------------+
| Name | Options | Email          |
+------+---------+----------------+
| Bob  | 1       | bob@email.com  |
+------+---------+----------------+
| Bob  | 2       | bob@email.com  |
+------+---------+----------------+
| Bob  | 4       | bob@email.com  |
+------+---------+----------------+
| Bob  | 5       | bob@email.com  |
+------+---------+----------------+
| Bob  | 6       | bob@email.com  |
+------+---------+----------------+
| John | NaN     | john@email.com |
+------+---------+----------------+
| Mary | 1       | mary@email.com |
+------+---------+----------------+
| Mary | 2       | mary@email.com |
+------+---------+----------------+
| Jane | 1       | jane@email.com |
+------+---------+----------------+
| Jane | 3       | jane@email.com |
+------+---------+----------------+
| Jane | 4       | jane@email.com |
+------+---------+----------------+
| Jane | 5       | jane@email.com |
+------+---------+----------------+

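How can I go beyond the concat/split approach described in the referenced SO post to achieve this? I need a way to expand the ranges.

That post uses the following code to split comma-delimited values (1,2,3):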

In [7]: a
Out[7]: 
    var1  var2
0  a,b,c     1
1  d,e,f     2

In [55]: pd.concat([Series(row['var2'], row['var1'].split(','))              
                    for _, row in a.iterrows()]).reset_index()
Out[55]: 
  index  0
0     a  1
1     b  1
2     c  1
3     d  2
4     e  2
5     f  2

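Thanks in advance for your suggestions!

Update 2/14: the sample data has been updated to match my current situation.

4 answers:

Answer 0: (score: 6)

If I understand what you need: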

def yourfunc(s):
    ranges = (x.split("-") for x in s.split(","))

    return [i for r in ranges for i in range(int(r[0]), int(r[-1]) + 1)]


df.Options=df.Options.apply(yourfunc)

df
Out[114]: 
   Name          Options           Email
0   Bob  [1, 2, 4, 5, 6]   bob@email.com
1  Jane     [1, 3, 4, 5]  jane@email.com


df.set_index(['Name','Email']).Options.apply(pd.Series).stack().reset_index().drop(columns='level_2')
Out[116]: 
   Name           Email    0
0   Bob   bob@email.com  1.0
1   Bob   bob@email.com  2.0
2   Bob   bob@email.com  4.0
3   Bob   bob@email.com  5.0
4   Bob   bob@email.com  6.0
5  Jane  jane@email.com  1.0
6  Jane  jane@email.com  3.0
7  Jane  jane@email.com  4.0
8  Jane  jane@email.com  5.0
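
On pandas 0.25 or later, the list column built by a function like yourfunc can probably be expanded with DataFrame.explode instead of apply(pd.Series).stack(). The following is only a sketch under that assumption: expand_options is a hypothetical variant of yourfunc that passes missing values through unchanged, and the sample frame mirrors the question's updated data.

```python
import numpy as np
import pandas as pd

def expand_options(s):
    """Turn '1,2,4-6' into [1, 2, 4, 5, 6]; leave missing values as they are."""
    if pd.isna(s):
        return s  # explode keeps a scalar NaN as a single row
    values = []
    for part in s.split(","):
        bounds = part.strip().split("-")
        # A single number such as '2' gives bounds[0] == bounds[-1]
        values.extend(range(int(bounds[0]), int(bounds[-1]) + 1))
    return values

df = pd.DataFrame({
    "Name": ["Bob", "John", "Mary", "Jane"],
    "Options": ["1,2,4-6", np.nan, "1,2", "1,3-5"],
    "Email": ["bob@email.com", "john@email.com",
              "mary@email.com", "jane@email.com"],
})

result = (
    df.assign(Options=df["Options"].apply(expand_options))
      .explode("Options")
      .reset_index(drop=True)
)
print(result)
```

The exploded Options column has object dtype here because it mixes integers with NaN, so a later cast may be needed depending on how the values are used.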

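Answer 1: (score: 5)

Start with a custom replacement function:

def replace(x):
    # x is a regex match object; expand "i-j" into "i,i+1,...,j"
    i, j = map(int, x.groups())
    return ','.join(map(str, range(i, j + 1)))

Store the column names somewhere; we will use them later:

c = df.columns

Next, replace the ranges in df.Options, then split on commas:

# regex=True is required on pandas 2.0+ when passing a callable
v = df.Options.str.replace(r'(\d+)-(\d+)', replace, regex=True).str.split(',')

Next, reshape your data and finally load it into a new DataFrame:

df = pd.DataFrame(
       df.drop(columns='Options').values.repeat(v.str.len(), axis=0)
)
df.insert(c.get_loc('Options'), len(c) - 1, np.concatenate(v))
df.columns = c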
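
The range-expansion step can be tried in isolation with the standard library's re.sub, which also accepts a callable as the replacement. This is just a small sketch on one of the question's sample strings, not part of the answer itself.

```python
import re

def replace(match):
    # match.groups() holds the two captured endpoints, e.g. ('4', '6')
    i, j = map(int, match.groups())
    return ",".join(map(str, range(i, j + 1)))

expanded = re.sub(r"(\d+)-(\d+)", replace, "1,2,4-6")
print(expanded)             # 1,2,4,5,6
print(expanded.split(","))  # ['1', '2', '4', '5', '6']
```

Note that the pieces are still strings after the split, so with this approach the resulting Options column holds strings unless it is converted afterwards.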

Answer 2: (score: 5)

I like using np.r_ and slice.
I know it looks like a mess, but beauty is in the eye of the beholder.

def parse(o):
    mm = lambda i: slice(min(i), max(i) + 1)
    return np.r_.__getitem__(tuple(
        mm(list(map(int, s.strip().split('-')))) for s in o.split(',')
    ))

r = df.Options.apply(parse)
new = np.concatenate(r.values)
lens = r.str.len()

df.loc[df.index.repeat(lens)].assign(Options=new)

   Name  Options           Email
0   Bob        1   bob@email.com
0   Bob        2   bob@email.com
0   Bob        4   bob@email.com
0   Bob        5   bob@email.com
0   Bob        6   bob@email.com
2  Mary        1  mary@email.com
2  Mary        2  mary@email.com
3  Jane        1  jane@email.com
3  Jane        3  jane@email.com
3  Jane        4  jane@email.com
3  Jane        5  jane@email.com

Explanation

  • np.r_ takes different slicers and indexers and returns a combined array.

    np.r_[1, 4:7]
    array([1, 4, 5, 6])
    

    np.r_[slice(1, 2), slice(4, 7)]
    array([1, 4, 5, 6])
    

    But if I need to pass an arbitrary number of them, I need to pass a tuple to the __getitem__ method of np.r_.

    np.r_.__getitem__((slice(1, 2), slice(4, 7), slice(10, 14)))
    array([ 1,  4,  5,  6, 10, 11, 12, 13])
    

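    So I iterate, parse, make slices, and pass them to np.r_.__getitem__

  • After applying my cool parser, use a combination of loc, pd.Index.repeat, and pd.Series.str.len

  • Use pd.DataFrame.assign to overwrite the existing column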
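
The steps above can be put together in a self-contained sketch. The two-row frame is assumed sample data taken from the question, parse is written as a plain loop for readability, and map(len) stands in for .str.len() (both give the per-row lengths here).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Bob", "Jane"],
    "Options": ["1,2,4-6", "1,3-5"],
    "Email": ["bob@email.com", "jane@email.com"],
})

def parse(o):
    # Build one slice per comma-separated piece: '2' -> slice(2, 3), '4-6' -> slice(4, 7)
    pieces = []
    for s in o.split(","):
        bounds = [int(b) for b in s.strip().split("-")]
        pieces.append(slice(min(bounds), max(bounds) + 1))
    # Indexing np.r_ with a tuple is the same as calling np.r_.__getitem__(tuple)
    return np.r_[tuple(pieces)]

parsed = df["Options"].apply(parse)   # one integer array per row
lens = parsed.map(len)                # how many rows each original row becomes
new_values = np.concatenate(parsed.values)

# Repeat each row by its length, then overwrite Options with the flat values
result = df.loc[df.index.repeat(lens)].assign(Options=new_values)
print(result)
```

The original index labels are repeated in the result; calling reset_index(drop=True) afterwards would renumber the rows if that is preferred.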

NOTE
If your Options column contains bad characters, I would try filtering it like this.

df = df.dropna(subset=['Options']).astype(dict(Options=str)) \
       .replace(dict(Options={r'[^0-9,\-]': ''}), regex=True) \
       .query('Options != ""')
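
To see what this kind of filtering does, here is a hedged sketch on made-up messy data. It uses Series.str.replace and a boolean mask rather than the exact chain above, and the rows for Mary and Ann are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Bob", "John", "Mary", "Ann"],
    "Options": ["1, 2, 4-6", np.nan, "1,2 (new)", "n/a"],
})

# Drop rows with missing Options, then strip everything except digits, commas and hyphens
cleaned = df.dropna(subset=["Options"]).copy()
cleaned["Options"] = (
    cleaned["Options"].astype(str).str.replace(r"[^0-9,\-]", "", regex=True)
)

# Rows that end up empty (such as 'n/a') carry no usable options
cleaned = cleaned[cleaned["Options"] != ""]
print(cleaned)
```

Rows with a missing Options value are dropped by this filter, which would explain why John does not appear in the output above even though the question's desired result keeps him.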

Answer 3: (score: 4)

Here is one solution. It is not pretty (it makes minimal use of pandas), but it is efficient.

import itertools, pandas as pd, numpy as np; concat = itertools.chain.from_iterable

def ranger(mystr):
    return list(concat([int(i)] if '-' not in i else \
                list(range(int(i.split('-')[0]), int(i.split('-')[-1])+1)) \
                for i in mystr.split(',')))

df = pd.DataFrame([['Bob', '1,2,4-6', 'bob@email.com'],
                   ['Jane', '1,3-5', 'jane@email.com']],
                  columns=['Name', 'Options', 'Email'])

df['Options'] = df['Options'].map(ranger)

lens = list(map(len, df['Options']))

df_out = pd.DataFrame({'Name': np.repeat(df['Name'].values, lens),
                       'Email': np.repeat(df['Email'].values, lens),
                       'Option': np.hstack(df['Options'].values)})

#             Email  Name  Option
# 0   bob@email.com   Bob       1
# 1   bob@email.com   Bob       2
# 2   bob@email.com   Bob       4
# 3   bob@email.com   Bob       5
# 4   bob@email.com   Bob       6
# 5  jane@email.com  Jane       1
# 6  jane@email.com  Jane       3
# 7  jane@email.com  Jane       4
# 8  jane@email.com  Jane       5

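Benchmarks of the 4 solutions follow (for interest only).

As a general rule, the repeat-based variants perform well. In addition, solutions that build a new DataFrame from scratch (rather than using apply) do better. Dropping down to numpy gives the best results.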

import itertools, pandas as pd, numpy as np; concat = itertools.chain.from_iterable

def ranger(mystr):
    return list(concat([int(i)] if '-' not in i else \
                list(range(int(i.split('-')[0]), int(i.split('-')[-1])+1)) \
                for i in mystr.split(',')))

def replace(x):
    i, j = map(int, x.groups())
    return ','.join(map(str, range(i, j + 1)))

def yourfunc(s):
    ranges = (x.split("-") for x in s.split(","))
    return [i for r in ranges for i in range(int(r[0]), int(r[-1]) + 1)]

def parse(o):
    mm = lambda i: slice(min(i), max(i) + 1)
    return np.r_.__getitem__(tuple(mm(list(map(int, s.strip().split('-')))) for s in o.split(',')))

df = pd.DataFrame([['Bob', '1,2,4-6', 'bob@email.com'],
                   ['Jane', '1,3-5', 'jane@email.com']],
                  columns=['Name', 'Options', 'Email'])

df = pd.concat([df]*1000, ignore_index=True)

def explode_jp(df):
    df['Options'] = df['Options'].map(ranger)
    lens = list(map(len, df['Options']))
    df_out = pd.DataFrame({'Name': np.repeat(df['Name'].values, lens),
                           'Email': np.repeat(df['Email'].values, lens),
                           'Option': np.hstack(df['Options'].values)})
    return df_out

def explode_cs(df):
    c = df.columns
    v = df.Options.str.replace(r'(\d+)-(\d+)', replace, regex=True).str.split(',')
    df_out = pd.DataFrame(df.drop(columns='Options').values.repeat(v.str.len(), axis=0))
    df_out.insert(c.get_loc('Options'), len(c) - 1, np.concatenate(v))
    df_out.columns = c
    return df_out

def explode_wen(df):
    df.Options=df.Options.apply(yourfunc)
    df_out = df.set_index(['Name','Email']).Options.apply(pd.Series).stack().reset_index().drop(columns='level_2')
    return df_out

def explode_pir(df):
    r = df.Options.apply(parse)
    df_out = df.loc[df.index.repeat(r.str.len())].assign(Options=np.concatenate(r))
    return df_out

%timeit explode_jp(df.copy())   # 32.7 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit explode_cs(df.copy())   # 90.6 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit explode_wen(df.copy())  # 675 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit explode_pir(df.copy())  # 163 ms ± 1.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)