Split/explode a cell into multiple rows based on conditions in a pandas DataFrame

Time: 2018-09-26 11:42:09

Tags: python pandas dataframe split explode

Code for the input dataframe:-

import pandas as pd
df = pd.DataFrame([{'Column1': '((CC ) + (A11/ABC/ZZ) + (!AAA))','Column2': 'XYZ + XXX/YYY'}])

Input dataframe:-

+---------------------------------+---------------------------------+
|              Column1            |              Column2            |
+---------------------------------+---------------------------------+
| ((CC ) + (A11/ABC/ZZ) + (!AAA)) |           XYZ + XXX/YYY         |
+---------------------------------+---------------------------------+

Input list:-

list = ['AAA', 'BBB', 'CCC']

Conditions:-

'+' should remain as such (similar to AND condition)
'/' means split the data into multiple cells (similar to OR condition)
'!' means replace with other elements in the corresponding list (similar to NOT condition)
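
The '!' rule can be tried on its own, outside pandas. The following is only a minimal sketch under stated assumptions: `negate` and `CANDIDATES` are hypothetical names introduced for illustration, and the name after '!' is assumed to be a member of the list from the question.

```python
import re

CANDIDATES = ['AAA', 'BBB', 'CCC']  # the list from the question


def negate(expr, candidates=CANDIDATES):
    """Replace each '!X' with the other candidates joined by '/'.

    Hypothetical helper; assumes X is one of the candidates.
    """
    def repl(match):
        name = match.group(1)
        return '/'.join(c for c in candidates if c != name)

    # '!' followed by one of the candidate names as a whole word
    pattern = r'!(' + '|'.join(map(re.escape, candidates)) + r')\b'
    return re.sub(pattern, repl, expr)


print(negate('((CC ) + (A11/ABC/ZZ) + (!AAA))'))
# prints ((CC ) + (A11/ABC/ZZ) + (BBB/CCC))
```

The '+' parts are left untouched, and the resulting '/' group can then be expanded like any other OR group.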

Because of the '!' symbol, the row becomes

+------------------------------------+---------------------------------+
|              Column1               |              Column2            |
+------------------------------------+---------------------------------+
| ((CC ) + (A11/ABC/ZZ) + (BBB/CCC)) |           XYZ + XXX/YYY         |
+------------------------------------+---------------------------------+

Please help me split this single row into multiple rows using pandas.

1 Answer:

Answer 0 (score: 0)

See whether this meets your requirements. The comments explain how it works.

#!/usr/bin/env python
import re
import pandas as pd # original answer tested with pd.__version__ 0.19.2
df = pd.DataFrame([{'Column1': '((CC ) + (A11/ABC/ZZ) + (!AAA))',
                    'Column2': 'XYZ + XXX/YYY'}])   # your input dataframe
names = ['AAA', 'BBB', 'CCC']   # your input list (renamed so it does not shadow the built-in list)
to_replace = dict()
for item in names:  # prepare the dictionary for the '!' replacements
    to_replace["!"+item+'\\b'] = '/'.join([i for i in names if i != item])
df = df.replace(to_replace, regex=True) # do all the '!' replacements
def expanded(s):    # expand series s to multiple string list around '/'
    l = [re.sub(r'[()]', '', v) for v in s] # drop the parentheses
    while True:     # in each loop cycle, handle one A/B/C... expression
        xl = []     # expanded list for this cycle
        for v in l: # for each string in the list so far
            m = re.search(r'\w+(/\w+)+', v) # look for a A/B/C... expression
            if m:   # if there is, add the individual expansions to the list
                xl.extend([m.string[:m.start()]+i+m.string[m.end():]
                                            for i in m.group().split('/')])
            else:   # if not, we're done (assumes every string in l has
                    # the same number of A/B/C... expressions)
                return l
        l = xl      # expanded list for this cycle is now the current list
def expand(c):      # expands the column named c to multiple rows
    new = expanded(df[c])                       # get the new contents
    xdf = pd.concat(len(new)//len(df[c])*[df])  # create required rows (integer division)
    xdf[c] = sorted(new)                        # set the new contents
    return xdf                                  # return new dataframe
df = expand('Column1')
df = expand('Column2')
print(df)

Output:

           Column1    Column2
0  CC  + A11 + BBB  XYZ + XXX
0  CC  + A11 + CCC  XYZ + XXX
0  CC  + ABC + BBB  XYZ + XXX
0  CC  + ABC + CCC  XYZ + XXX
0   CC  + ZZ + BBB  XYZ + XXX
0   CC  + ZZ + CCC  XYZ + XXX
0  CC  + A11 + BBB  XYZ + YYY
0  CC  + A11 + CCC  XYZ + YYY
0  CC  + ABC + BBB  XYZ + YYY
0  CC  + ABC + CCC  XYZ + YYY
0   CC  + ZZ + BBB  XYZ + YYY
0   CC  + ZZ + CCC  XYZ + YYY
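
As a possible alternative (not part of the answer above), the '/' expansion can also be done string by string with `itertools.product` and then spread over rows with `DataFrame.explode`. This is a hedged sketch: it assumes pandas 0.25 or later (where `DataFrame.explode` is available), starts from the row after the '!' replacement, and uses a hypothetical helper named `alternatives`. The row order differs from the output above, but the same 12 combinations are produced.

```python
import itertools
import re

import pandas as pd


def alternatives(expr):
    """Return every '/'-free variant of expr, one per combination of choices."""
    expr = re.sub(r'[()]', '', expr)  # drop the parentheses
    # The capturing group makes re.split keep each A/B/C... run at odd positions.
    parts = re.split(r'(\w+(?:/\w+)+)', expr)
    choices = [p.split('/') if i % 2 else [p] for i, p in enumerate(parts)]
    return [''.join(combo) for combo in itertools.product(*choices)]


# Row as it looks after the '!' replacement
df = pd.DataFrame([{'Column1': '((CC ) + (A11/ABC/ZZ) + (BBB/CCC))',
                    'Column2': 'XYZ + XXX/YYY'}])

result = df.copy()
for col in ['Column1', 'Column2']:
    result[col] = result[col].map(alternatives)   # each cell becomes a list
    result = result.explode(col).reset_index(drop=True)  # one row per element

print(result)
```

Because each cell is expanded independently, this variant does not rely on every row containing the same number of '/' groups.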