I have a DataFrame:
Date Open High Low Close Struct Trend
2000-12-31 1477.87 1553.10 1254.19 1320.28 ohlc D
2001-12-31 1321.62 1383.37 944.07 1148.08 ohlc D
2002-12-31 1148.08 1176.97 768.58 879.82 ohlc D
2003-12-31 881.69 1112.52 788.90 1111.92 olhc U
2004-12-31 1112.61 1217.33 1060.74 1211.92 olhc U
2005-12-31 1213.43 1275.80 1136.22 1248.29 olhc U
2006-12-31 1252.03 1431.81 1219.29 1418.30 olhc U
2007-12-31 1418.03 1576.09 1364.14 1468.36 olhc U/D
2008-12-31 1468.36 1471.77 741.02 903.25 ohlc D
2009-12-31 903.25 1130.38 666.79 1115.10 olhc U/D
2010-12-31 1115.10 1262.60 1010.91 1257.64 olhc U
2011-12-31 1257.62 1370.58 1074.77 1257.60 ohlc U
2012-12-31 1258.86 1474.51 1258.86 1426.19 olhc U
2013-12-31 1426.19 1849.44 1426.19 1848.36 olhc U
2014-12-31 1845.86 2093.55 1737.92 2058.90 olhc U
2015-12-31 2058.90 2134.72 1867.01 2043.94 ohlc U
2016-12-31 2038.20 2277.53 1810.10 2238.83 olhc U
2017-12-31 2251.57 2694.97 2245.13 2673.61 olhc U
2018-12-31 2683.73 2940.91 2346.58 2506.85 ohlc U
The data has two categorical columns, Struct and Trend.
I want to group the data by these two columns.
When I do this:
groups = data.groupby(['Struct', 'Trend'])
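pandas can produce up to 6 different combinations of Struct and Trend: [('ohlc', 'D'), ('ohlc', 'U'), ('ohlc', 'U/D'), ('olhc', 'D'), ('olhc', 'U'), ('olhc', 'U/D')]
How can I merge the groups whose Trend value contains the substring 'D'?
I would like to end up with only 4 groups.
Put simply, each 'D' group must include all the 'D' and 'U/D' rows, and each 'U' group must include all the 'U' and 'U/D' rows.
Edit:
Expected result for the example above: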
Date Open High Low Close Struct Trend
2003-12-31 881.69 1112.52 788.90 1111.92 olhc U
2004-12-31 1112.61 1217.33 1060.74 1211.92 olhc U
2005-12-31 1213.43 1275.80 1136.22 1248.29 olhc U
2006-12-31 1252.03 1431.81 1219.29 1418.30 olhc U
2007-12-31 1418.03 1576.09 1364.14 1468.36 olhc U/D
2009-12-31 903.25 1130.38 666.79 1115.10 olhc U/D
2010-12-31 1115.10 1262.60 1010.91 1257.64 olhc U
2011-12-31 1257.62 1370.58 1074.77 1257.60 ohlc U
2012-12-31 1258.86 1474.51 1258.86 1426.19 olhc U
2013-12-31 1426.19 1849.44 1426.19 1848.36 olhc U
2014-12-31 1845.86 2093.55 1737.92 2058.90 olhc U
2015-12-31 2058.90 2134.72 1867.01 2043.94 ohlc U
2016-12-31 2038.20 2277.53 1810.10 2238.83 olhc U
2017-12-31 2251.57 2694.97 2245.13 2673.61 olhc U
2018-12-31 2683.73 2940.91 2346.58 2506.85 ohlc U
Date Open High Low Close Struct Trend
2000-12-31 1477.87 1553.10 1254.19 1320.28 ohlc D
2001-12-31 1321.62 1383.37 944.07 1148.08 ohlc D
2002-12-31 1148.08 1176.97 768.58 879.82 ohlc D
2007-12-31 1418.03 1576.09 1364.14 1468.36 olhc U/D
2008-12-31 1468.36 1471.77 741.02 903.25 ohlc D
2009-12-31 903.25 1130.38 666.79 1115.10 olhc U/D
I tried the following, but it only gives me a DataFrame, and I want groups:
trend_dtype = pd.api.types.CategoricalDtype(categories=['D', 'U/D'], ordered=False)
data['Trend'] = data['Trend'].astype(trend_dtype)
print(data.dropna())
Answer 0 (score: 1)
You can use boolean indexing.
df.loc[['U' in key for key in df['Trend']]]
Date Open High Low Close Struct Trend
3 2003-12-31 881.69 1112.52 788.90 1111.92 olhc U
4 2004-12-31 1112.61 1217.33 1060.74 1211.92 olhc U
5 2005-12-31 1213.43 1275.80 1136.22 1248.29 olhc U
6 2006-12-31 1252.03 1431.81 1219.29 1418.30 olhc U
7 2007-12-31 1418.03 1576.09 1364.14 1468.36 olhc U/D
9 2009-12-31 903.25 1130.38 666.79 1115.10 olhc U/D
10 2010-12-31 1115.10 1262.60 1010.91 1257.64 olhc U
11 2011-12-31 1257.62 1370.58 1074.77 1257.60 ohlc U
12 2012-12-31 1258.86 1474.51 1258.86 1426.19 olhc U
13 2013-12-31 1426.19 1849.44 1426.19 1848.36 olhc U
14 2014-12-31 1845.86 2093.55 1737.92 2058.90 olhc U
15 2015-12-31 2058.90 2134.72 1867.01 2043.94 ohlc U
16 2016-12-31 2038.20 2277.53 1810.10 2238.83 olhc U
17 2017-12-31 2251.57 2694.97 2245.13 2673.61 olhc U
18 2018-12-31 2683.73 2940.91 2346.58 2506.85 ohlc U
df.loc[['D' in key for key in df['Trend']]]
Date Open High Low Close Struct Trend
0 2000-12-31 1477.87 1553.10 1254.19 1320.28 ohlc D
1 2001-12-31 1321.62 1383.37 944.07 1148.08 ohlc D
2 2002-12-31 1148.08 1176.97 768.58 879.82 ohlc D
7 2007-12-31 1418.03 1576.09 1364.14 1468.36 olhc U/D
8 2008-12-31 1468.36 1471.77 741.02 903.25 ohlc D
9 2009-12-31 903.25 1130.38 666.79 1115.10 olhc U/D
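To get actual groups rather than two separate DataFrames, the same kind of mask can be combined with the Struct column. The following is only a minimal sketch: it assumes a small made-up sample with a subset of the question's rows and columns, the groups dict is a hypothetical container, and Series.str.contains stands in for the list comprehension used above.

```python
import pandas as pd

# Small illustrative sample (a subset of the rows and columns from the question).
df = pd.DataFrame({
    "Date": ["2000-12-31", "2003-12-31", "2007-12-31", "2008-12-31", "2011-12-31"],
    "Close": [1320.28, 1111.92, 1468.36, 903.25, 1257.60],
    "Struct": ["ohlc", "olhc", "olhc", "ohlc", "ohlc"],
    "Trend": ["D", "U", "U/D", "D", "U"],
})

# Hypothetical container: one (possibly overlapping) sub-frame per (Struct, trend) pair.
groups = {}
for struct in df["Struct"].unique():
    for trend in ("D", "U"):
        # A "U/D" row matches both "U" and "D", so it lands in two groups.
        mask = (df["Struct"] == struct) & df["Trend"].str.contains(trend, regex=False)
        if mask.any():
            groups[(struct, trend)] = df[mask]

for key, frame in groups.items():
    print(key, frame["Date"].tolist())
```

Because the groups overlap, they cannot come from a single groupby on the original columns; a dict of filtered frames is one simple way to keep them side by side.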
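Answer 1 (score: 1)
You can treat this as a duplication problem: each row whose Trend is U/D is duplicated, once as U and once as D. Here is one approach:
df = (df.iloc[:, :-1]
        .join(df.Trend.str.split('/', expand=True))
        .melt(id_vars=df.columns[:-1], value_name='Trend')
        .dropna()
        .drop('variable', axis=1)
     )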
Date Open High Low Close Struct Trend
0 2000-12-31 1477.87 1553.10 1254.19 1320.28 ohlc D
1 2001-12-31 1321.62 1383.37 944.07 1148.08 ohlc D
2 2002-12-31 1148.08 1176.97 768.58 879.82 ohlc D
3 2003-12-31 881.69 1112.52 788.90 1111.92 olhc U
4 2004-12-31 1112.61 1217.33 1060.74 1211.92 olhc U
5 2005-12-31 1213.43 1275.80 1136.22 1248.29 olhc U
6 2006-12-31 1252.03 1431.81 1219.29 1418.30 olhc U
7 2007-12-31 1418.03 1576.09 1364.14 1468.36 olhc U
8 2008-12-31 1468.36 1471.77 741.02 903.25 ohlc D
9 2009-12-31 903.25 1130.38 666.79 1115.10 olhc U
10 2010-12-31 1115.10 1262.60 1010.91 1257.64 olhc U
11 2011-12-31 1257.62 1370.58 1074.77 1257.60 ohlc U
12 2012-12-31 1258.86 1474.51 1258.86 1426.19 olhc U
13 2013-12-31 1426.19 1849.44 1426.19 1848.36 olhc U
14 2014-12-31 1845.86 2093.55 1737.92 2058.90 olhc U
15 2015-12-31 2058.90 2134.72 1867.01 2043.94 ohlc U
16 2016-12-31 2038.20 2277.53 1810.10 2238.83 olhc U
17 2017-12-31 2251.57 2694.97 2245.13 2673.61 olhc U
18 2018-12-31 2683.73 2940.91 2346.58 2506.85 ohlc U
26 2007-12-31 1418.03 1576.09 1364.14 1468.36 olhc D
28 2009-12-31 903.25 1130.38 666.79 1115.10 olhc D
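The table above is the resulting df. Note the index pairs (7, 26) and (9, 28): each original U/D row now appears twice, once as U and once as D.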
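Once the U/D rows are duplicated, an ordinary groupby on Struct and Trend yields the four groups the question asks for. As a hedged alternative to the join/melt chain, the sketch below uses str.split followed by DataFrame.explode, which assumes pandas 0.25 or later; the sample data and the names expanded and grouped are illustrative only.

```python
import pandas as pd

# Small illustrative sample (a subset of the rows and columns from the question).
df = pd.DataFrame({
    "Date": ["2000-12-31", "2003-12-31", "2007-12-31", "2008-12-31", "2011-12-31"],
    "Close": [1320.28, 1111.92, 1468.36, 903.25, 1257.60],
    "Struct": ["ohlc", "olhc", "olhc", "ohlc", "ohlc"],
    "Trend": ["D", "U", "U/D", "D", "U"],
})

# Turn "U/D" into ["U", "D"], then give every list element its own row.
expanded = (
    df.assign(Trend=df["Trend"].str.split("/"))
      .explode("Trend")
)

# A "U/D" row now exists once as "U" and once as "D", so a plain groupby works.
grouped = expanded.groupby(["Struct", "Trend"])
for key, frame in grouped:
    print(key, frame["Date"].tolist())
```

Note that explode keeps the original index, so the duplicated rows share an index label; call reset_index(drop=True) on expanded if unique labels are needed.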