Split a comma-separated list and explode it conditionally

Time: 2018-01-24 09:56:17

Tags: python sql-server pandas tsql

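I'm having trouble exploding a DataFrame into separate rows after splitting a comma-separated list into a fixed number of columns. I'm trying to do this in pandas, but if it is possible in raw SQL (which I tried and gave up on), that would be the ideal solution.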

Sample data

Reference   Surname   Forename   CurrentPostCode   PreviousPostCodes
1           Smith     John       WA1 2LA           WA2 HG5, LN4 6XS
2           Jones     Jack       NA1 2NE           None
3           Potter    Harry      LI8 0NX           None
4           Wane      Bruce      HE27 4PR          HE5 9PR
5           Finn      Grahame    B26 7UP           B15 6UR, B22 9JK, B13 3YT

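I want to split the PreviousPostCodes column into two columns, PPC1 and PPC2. If the comma-separated list contains more than 2 items (as for Reference 5), the first two should go into PPC1 and PPC2, and a new row should be added below with PPC1 populated with B13 3YT.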

Desired output

Reference   Surname   Forename   CurrentPostCode   PPC1       PPC2
1           Smith     John       WA1 2LA           WA2 HG5    LN4 6XS
2           Jones     Jack       NA1 2NE           None       None
3           Potter    Harry      LI8 0NX           None       None
4           Wane      Bruce      HE27 4PR          HE5 9PR    None
5           Finn      Grahame    B26 7UP           B15 6UR    B22 9JK
5           Finn      Grahame    B26 7UP           B13 3YT    None

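I hope this makes sense. I can split the list, but I get n columns; I want to cap the result at 2 columns and overflow into a new row whenever there are more than 2 items. There is no limit on the number of previous post codes in the data: if the comma-separated list holds 5, the row needs to be exploded into 3 rows.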

Thanks

2 answers:

Answer 0 (score: 0)

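# Split off the first post code into PPC1; everything after it stays in PPC2
df[['PPC1','PPC2']] = df.pop('PreviousPostCodes').str.split(r',\s*', n=1, expand=True)
# Split the remainder into a list of post codes
df['PPC2'] = df['PPC2'].fillna('').str.split(r',\s*', expand=False)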

Yields:

In [173]: df
Out[173]:
   Reference Surname Forename CurrentPostCode     PPC1                PPC2
0          1   Smith     John         WA1 2LA  WA2 HG5           [LN4 6XS]
1          2   Jones     Jack         NA1 2NE      NaN                  []
2          3  Potter    Harry         LI8 0NX      NaN                  []
3          4    Wane    Bruce        HE27 4PR  HE5 9PR                  []
4          5    Finn  Grahame         B26 7UP  B15 6UR  [B22 9JK, B13 3YT]

Now we can use the explode() function (a helper whose definition is not included here):

In [174]: explode(df, lst_cols='PPC2')
Out[174]:
   Reference Surname Forename CurrentPostCode     PPC1     PPC2
0          1   Smith     John         WA1 2LA  WA2 HG5  LN4 6XS
1          2   Jones     Jack         NA1 2NE      NaN
2          3  Potter    Harry         LI8 0NX      NaN
3          4    Wane    Bruce        HE27 4PR  HE5 9PR
4          5    Finn  Grahame         B26 7UP  B15 6UR  B22 9JK
5          5    Finn  Grahame         B26 7UP  B15 6UR  B13 3YT
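
Note that the output above repeats B15 6UR in PPC1 on the overflow row, whereas the desired output in the question starts a fresh pair there (B13 3YT, None). A minimal sketch of one way to get that layout follows, assuming the missing values are stored as None/NaN and that pandas 0.25 or later is available for the built-in DataFrame.explode (which postdates this post). The helper name to_pairs and the temporary column name pair are made up for illustration.

```python
import pandas as pd

# Sample data from the question; None marks "no previous post codes".
df = pd.DataFrame({
    "Reference": [1, 2, 3, 4, 5],
    "Surname": ["Smith", "Jones", "Potter", "Wane", "Finn"],
    "Forename": ["John", "Jack", "Harry", "Bruce", "Grahame"],
    "CurrentPostCode": ["WA1 2LA", "NA1 2NE", "LI8 0NX", "HE27 4PR", "B26 7UP"],
    "PreviousPostCodes": ["WA2 HG5, LN4 6XS", None, None, "HE5 9PR",
                          "B15 6UR, B22 9JK, B13 3YT"],
})


def to_pairs(value, size=2):
    """Hypothetical helper: split a comma-separated string into chunks of
    `size` items, padding the last chunk with None."""
    codes = []
    if isinstance(value, str):
        codes = [c.strip() for c in value.split(",") if c.strip()]
    if not codes:
        return [[None] * size]  # keep exactly one row for empty entries
    chunks = [codes[i:i + size] for i in range(0, len(codes), size)]
    chunks[-1] = chunks[-1] + [None] * (size - len(chunks[-1]))
    return chunks


out = df.drop(columns="PreviousPostCodes")
out["pair"] = df["PreviousPostCodes"].map(to_pairs)
# explode() unpacks one level only, so each chunk becomes its own row
out = out.explode("pair").reset_index(drop=True)
pairs = pd.DataFrame(out.pop("pair").tolist(),
                     columns=["PPC1", "PPC2"], index=out.index)
out = out.join(pairs)
print(out)
```

Because the chunk size is a parameter, the same helper would cover the "5 codes become 3 rows" case from the question without further changes.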

Answer 1 (score: 0)

Try the following SQL script; it may do what you need. Below is the sample data:

IF OBJECT_ID('tempdb..#temp') IS NOT NULL
DROP TABLE #temp
;With cte(Reference ,  Surname,   Forename ,  CurrentPostCode,   PreviousPostCodes)
AS
(
SELECT 1,'Smith' ,'John'   , 'WA1 2LA'  ,'WA2 HG5, LN4 6XS,B13 3YT,AA18 3YT,YT783 3YT'              UNION ALL
SELECT 2,'Jones' ,'Jack'   , 'NA1 2NE'  ,'None'                         UNION ALL
SELECT 3,'Potter','Harry'  , 'LI8 0NX'  ,'None'                         UNION ALL
SELECT 4,'Wane'  ,'Bruce'  , 'HE27 4PR' ,'HE5 9PR,B13 3YT,RT4 YT5'                      UNION ALL
SELECT 5,'Finn'  ,'Grahame', 'B26 7UP'  ,'B15 6UR, B22 9JK, B13 3YT'
)
SELECT * INTO #temp  FROM cte
SELECT * FROM #temp 

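Using dynamic SQL we get n columns, where n depends on the PreviousPostCodes column: the data there is comma-separated, so the number of columns created follows from the number of commas in the longest list of old post codes.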

  --To get the number of columns to be divided dynamically
    DECLARE @ColumnsDivideCnt INT
        ,@Dyncol nvarchar(max)
        ,@Sql nvarchar(max)
;WITH cte
AS
(
SELECT 0 As Rn, CHARINDEX(',',PreviousPostCodes+',') AS Pos ,PreviousPostCodes FROM #temp
UNION ALL
SELECT Pos+1,CHARINDEX(',',PreviousPostCodes+',',Pos+1) ,PreviousPostCodes
FROM cte

WHERE Pos >0
)
SELECT @ColumnsDivideCnt=MAX(ColumnToGet) FROm
(
SELECT PreviousPostCodes, Pos,ROW_NUMBER()OVER(Partition by PreviousPostCodes Order by PreviousPostCodes) AS ColumnToGet FROM cte
WHERE Pos >0
GROUP BY PreviousPostCodes,Pos
)dt

--Get the column names dynamically
;WIth cte2
AS
(
SELECT 1 AS Rn 
UNION ALL
SELECT Rn+1
From cte2
WHERE Rn<@ColumnsDivideCnt
)
SELECT @Dyncol=STUFF((SELECT ', ' + ReqCol FROM
(
SELECT 'ISNULL(Split.a.value('+'''/S['+CAST(Rn AS VARCHAR(2))+']'+''''+','+'''NVARCHAR(1000)'''+'),''None'') As [PPC'+CAST(Rn AS VARCHAR(2))+']'  AS ReqCol FROM cte2
)Dt
FOR XML PATH ('')),1,1,'')


SET @Sql='SELECT DISTINCT
                Reference
                ,Surname
                ,Forename
                ,CurrentPostCode
                ,'+@Dyncol+'
        FROM (
            SELECT Reference,Surname,Forename,CurrentPostCode,
                CAST(''<S>''+REPLACE(PreviousPostCodes,'','',''</S><S>'')+''</S>'' AS XML)AS  PreviousPostCodes
            FROM #temp
) AS A
CROSS APPLY PreviousPostCodes.nodes(''S'') AS Split(a)
'
PRINT @Sql
EXEC (@Sql)

Result before running the dynamic SQL script

Reference   Surname Forename    CurrentPostCode     PreviousPostCodes
-----------------------------------------------------------------------------------------------
1           Smith   John        WA1 2LA             WA2 HG5, LN4 6XS,B13 3YT,AA18 3YT,YT783 3YT
2           Jones   Jack        NA1 2NE             None
3           Potter  Harry       LI8 0NX             None
4           Wane    Bruce       HE27 4PR            HE5 9PR,B13 3YT,RT4 YT5
5           Finn    Grahame     B26 7UP             B15 6UR, B22 9JK, B13 3YT

Result after running the dynamic SQL script

Reference   Surname Forename    CurrentPostCode  PPC1        PPC2       PPC3        PPC4
--------------------------------------------------------------------------------------------------
1           Smith   John         WA1 2LA        WA2 HG5     LN4 6XS     B13 3YT     AA18 3YT
2           Jones   Jack         NA1 2NE        None        None        None        None
3           Potter  Harry        LI8 0NX        None        None        None        None
4           Wane    Bruce        HE27 4PR       HE5 9PR     B13 3YT     RT4 YT5     None
5           Finn    Grahame      B26 7UP        B15 6UR     B22 9JK     B13 3YT     None
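
For comparison with the pandas side of the question, the same "one column per post code" layout can be sketched in a few lines. This is only an illustration of the wide split on this answer's sample strings, not a translation of the SQL above; the variable names are made up, and the PPC column names simply follow the naming used in the SQL.

```python
import pandas as pd

# PreviousPostCodes values from this answer's sample data
s = pd.Series([
    "WA2 HG5, LN4 6XS,B13 3YT,AA18 3YT,YT783 3YT",
    "None",
    "None",
    "HE5 9PR,B13 3YT,RT4 YT5",
    "B15 6UR, B22 9JK, B13 3YT",
], name="PreviousPostCodes")

# Split on commas (ignoring surrounding spaces) into as many columns as needed
wide = s.str.split(r"\s*,\s*", expand=True)
wide.columns = [f"PPC{i + 1}" for i in range(wide.shape[1])]
# Pad the shorter lists with the same 'None' placeholder the SQL uses
wide = wide.fillna("None")
print(wide)
```

With this sample the longest list holds five post codes, so the sketch produces five columns; this is the n-column shape the question wanted to cap at two.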