I am trying to create a DataFrame that tracks the number of public schools open in each year from 2010 to 2016.
StatusType County 2010 ...2016 OpenYear ClosedYear
1 Closed Alameda 0 0 2005 2015.0
2 Active Alameda 0 0 2006 NaN
3 Closed Alameda 0 0 2008 2015.0
4 Active Alameda 0 0 2011 NaN
5 Active Alameda 0 0 2011 NaN
6 Active Alameda 0 0 2012 NaN
7 Closed Alameda 0 0 1980 1989.0
8 Active Alameda 0 0 1980 NaN
9 Active Alameda 0 0 1980 NaN
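I want to update the 2010-2016 columns to track which schools were open in each year. For example, the first school in the DataFrame opened in 2005 and closed in 2015. The iterator should check the "ClosedYear" column and add 1 to every year column before 2015 (2010, 2011, ..., 2014). If "ClosedYear" is NaN, it should start from the "OpenYear" column and add 1 to every year column >= "OpenYear" (for example, school #4: columns [2011, 2012, ..., 2016] get +1 and column [2010] is unchanged).
I am considering using "apply" to apply a function to the DataFrame, but that may not be the most efficient way to solve this. I need help figuring out how to make this work. Thanks!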
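For reference, a minimal sketch of what the "apply" route could look like. The helper name open_flags and the three sample rows (taken from the table above) are only illustrative, and the closing year is treated as not open, as described in the rule above.

```python
import numpy as np
import pandas as pd

YEARS = range(2010, 2017)  # 2010..2016 inclusive


def open_flags(row):
    """Return 1 for every year the school was open, else 0 (hypothetical helper)."""
    closed = row["ClosedYear"]
    return pd.Series(
        {
            # Open if it had already opened and is either still active (NaN)
            # or had not yet reached its closing year.
            str(year): int(row["OpenYear"] <= year and (pd.isna(closed) or year < closed))
            for year in YEARS
        }
    )


# Three rows from the sample data above
df = pd.DataFrame(
    {
        "County": ["Alameda", "Alameda", "Alameda"],
        "OpenYear": [2005, 2011, 1980],
        "ClosedYear": [2015.0, np.nan, 1989.0],
    }
)

flags = df.apply(open_flags, axis=1)  # one row of flags per school
df = pd.concat([df, flags], axis=1)
print(df)
```

Row-wise apply calls the function once per school, so it is likely to be slower than a column-wise approach on a large table.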
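Extra step: once the counts are done, I would like to group the year columns by county. I am leaning toward "groupby" with a sum to total the number of open schools per county per year. It would be very helpful if someone could include that in an answer to the question above.
Expected output: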
StatusType County 2010 ...2016 OpenYear ClosedYear
1 Closed Alameda 1 0 2005 2015.0
2 Active Alameda 1 1 2006 NaN
3 Closed Alameda 1 0 2008 2015.0
4 Active Alameda 0 1 2011 NaN
5 Active Alameda 0 1 2011 NaN
6 Active Alameda 0 1 2012 NaN
7 Closed Alameda 0 0 1980 1989.0
8 Active Alameda 1 1 1980 NaN
9 Active Alameda 1 1 1980 NaN
Answer 0 (score: 2)
I feel like there should be a way to do this without a for loop, but I couldn't think of one, so here is my solution:
import pandas as pd
from io import StringIO  # Python 3 only

# Read the example data (columns aligned so read_fwf can infer the widths)
df = pd.read_fwf(StringIO(
"""StatusType  County   OpenYear  ClosedYear
Closed      Alameda  2005      2015.0
Active      Alameda  2006      NaN
Closed      Alameda  2008      2015.0
Active      Alameda  2011      NaN
Active      Alameda  2011      NaN
Active      Alameda  2012      NaN
Closed      Alameda  1980      1989.0
Active      Alameda  1980      NaN
Active      Alameda  1980      NaN"""))

# For each year
for year in range(2010, 2016 + 1):
    # Create a column of 0s
    df[str(year)] = 0
    # Where the year is between OpenYear and ClosedYear (or ClosedYear is NaN), set it to 1
    is_open = (df['OpenYear'] <= year) & (pd.isna(df['ClosedYear']) | (df['ClosedYear'] >= year))
    df.loc[is_open, str(year)] = 1

print(df.to_string())
Output:
  StatusType   County  OpenYear  ClosedYear  2010  2011  2012  2013  2014  2015  2016
0     Closed  Alameda      2005      2015.0     1     1     1     1     1     1     0
1     Active  Alameda      2006         NaN     1     1     1     1     1     1     1
2     Closed  Alameda      2008      2015.0     1     1     1     1     1     1     0
3     Active  Alameda      2011         NaN     0     1     1     1     1     1     1
4     Active  Alameda      2011         NaN     0     1     1     1     1     1     1
5     Active  Alameda      2012         NaN     0     0     1     1     1     1     1
6     Closed  Alameda      1980      1989.0     0     0     0     0     0     0     0
7     Active  Alameda      1980         NaN     1     1     1     1     1     1     1
8     Active  Alameda      1980         NaN     1     1     1     1     1     1     1
(PS: I'm not quite sure what you are trying to do with [...].)
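For what it's worth, a loop-free variant seems possible with NumPy broadcasting. The following is only a sketch under a few assumptions: a four-row sample taken from the question, an active school treated as closing at infinity, and the closing year still counted as open (the same rule as the loop above). It also adds the per-county sum from the question's extra step.

```python
import numpy as np
import pandas as pd

# Four rows taken from the question's sample data
df = pd.DataFrame(
    {
        "County": ["Alameda"] * 4,
        "OpenYear": [2005, 2011, 1980, 1980],
        "ClosedYear": [2015.0, np.nan, 1989.0, np.nan],
    }
)

years = np.arange(2010, 2017)  # 2010..2016 inclusive

# Column vectors of shape (n_schools, 1) broadcast against years of shape (n_years,)
opened = df["OpenYear"].to_numpy()[:, None]
# Assumption: "never closed" is modelled as closing at +infinity
closed = df["ClosedYear"].fillna(np.inf).to_numpy()[:, None]

# Boolean matrix of shape (n_schools, n_years); the closing year counts as open
is_open = (opened <= years) & (closed >= years)

flags = pd.DataFrame(
    is_open.astype(int), index=df.index, columns=[str(y) for y in years]
)
df = df.join(flags)

# Extra step from the question: open schools per county per year
per_county = df.groupby("County")[list(flags.columns)].sum()
print(per_county)
```

Changing `closed >= years` to `closed > years` would stop counting the closing year, which matches the rule described in the question.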
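Answer 1 (score: 1)
Unless you really need to create those intermediate columns, you can use groupby and .size to get the counts directly. Depending on whether you want to include the closing year, switch the inequality between <= and <. If you want to group by county, you can do that in the same step as well.
Here is the starting df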
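A minimal sketch of the groupby and .size idea follows. It is not the answerer's original code: it assumes the nine sample rows from the question and counts the closing year as open (the <= variant).

```python
import numpy as np
import pandas as pd

# The nine sample rows from the question
df = pd.DataFrame(
    {
        "County": ["Alameda"] * 9,
        "OpenYear": [2005, 2006, 2008, 2011, 2011, 2012, 1980, 1980, 1980],
        "ClosedYear": [2015.0, np.nan, 2015.0, np.nan, np.nan, np.nan,
                       1989.0, np.nan, np.nan],
    }
)

counts = {}
for year in range(2010, 2017):
    # Use `year < df["ClosedYear"]` instead to exclude the closing year
    open_mask = (df["OpenYear"] <= year) & (
        df["ClosedYear"].isna() | (year <= df["ClosedYear"])
    )
    # Number of open schools per county for this year
    counts[year] = df[open_mask].groupby("County").size()

# Counties with no open school in a given year come out as NaN, so fill with 0
result = pd.DataFrame(counts).fillna(0).astype(int)
print(result)
```

This skips the per-school 0/1 columns entirely and goes straight to one row per county and one column per year.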