How to update columns using another column in pandas

Time: 2018-04-26 16:00:19

Tags: python pandas

I am trying to build a dataframe that tracks the number of public schools open in each year from 2010 to 2016.

   StatusType  County   2010 ... 2016  OpenYear  ClosedYear
1  Closed      Alameda  0        0     2005      2015.0
2  Active      Alameda  0        0     2006      NaN
3  Closed      Alameda  0        0     2008      2015.0
4  Active      Alameda  0        0     2011      NaN
5  Active      Alameda  0        0     2011      NaN
6  Active      Alameda  0        0     2012      NaN
7  Closed      Alameda  0        0     1980      1989.0
8  Active      Alameda  0        0     1980      NaN
9  Active      Alameda  0        0     1980      NaN

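I want to update the 2010-2016 columns so that they track which schools were open in each year. For example, the first school in the dataframe opened in 2005 and closed in 2015. The iterator should check the "ClosedYear" column and add 1 to that row's value in every year column < 2015 (2010, 2011, ..., 2014). If "ClosedYear" is "NaN", it should start from the "OpenYear" column instead and add 1 to that row's value in every year column >= "OpenYear" (for example, for school #4, columns [2011, 2012, ..., 2016] get +1 and column [2010] is unchanged).

I was thinking of using "apply" to apply a function to the dataframe, but that may not be the most efficient way to solve the problem. I need help figuring out how to make this work. Thanks!

Extra step: once the counts are done, I want to group the year columns by county. I am leaning toward "groupby" with the sum function to total the number of open schools per county per year. If anyone could add that to an answer to the question above, it would be very helpful.

Expected output: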

   StatusType  County   2010 ... 2016  OpenYear  ClosedYear
1  Closed      Alameda  1        0     2005      2015.0
2  Active      Alameda  1        1     2006      NaN
3  Closed      Alameda  1        0     2008      2015.0
4  Active      Alameda  0        1     2011      NaN
5  Active      Alameda  0        1     2011      NaN
6  Active      Alameda  0        1     2012      NaN
7  Closed      Alameda  0        0     1980      1989.0
8  Active      Alameda  1        1     1980      NaN
9  Active      Alameda  1        1     1980      NaN

2 answers:

Answer 0 (score: 2)

I feel there should be a way to do this without a for loop, but I couldn't think of one, so here is my solution:

import pandas as pd
from io import StringIO  # This only works on Python 3+

# Read the example data
df = pd.read_fwf(StringIO(
"""StatusType  County    OpenYear    ClosedYear
Closed      Alameda   2005        2015.0
Active      Alameda   2006         NaN
Closed      Alameda   2008        2015.0
Active      Alameda   2011         NaN
Active      Alameda   2011         NaN
Active      Alameda   2012         NaN
Closed      Alameda   1980        1989.0
Active      Alameda   1980         NaN
Active      Alameda   1980         NaN"""))

# For each year
for year in range(2010, 2016 + 1):
    # Create a column of 0s
    df[str(year)] = 0
    # Set it to 1 where the year is between OpenYear and ClosedYear (or ClosedYear is NaN)
    df.loc[(df['OpenYear'] <= year) & (pd.isna(df['ClosedYear']) | (df['ClosedYear'] >= year)), str(year)] = 1

# to_string is a method, so it has to be called
print(df.to_string())

Output:

  StatusType   County  OpenYear  ClosedYear  2010  2011  2012  2013  2014  2015  2016
0     Closed  Alameda      2005      2015.0     1     1     1     1     1     1     0
1     Active  Alameda      2006         NaN     1     1     1     1     1     1     1
2     Closed  Alameda      2008      2015.0     1     1     1     1     1     1     0
3     Active  Alameda      2011         NaN     0     1     1     1     1     1     1
4     Active  Alameda      2011         NaN     0     1     1     1     1     1     1
5     Active  Alameda      2012         NaN     0     0     1     1     1     1     1
6     Closed  Alameda      1980      1989.0     0     0     0     0     0     0     0
7     Active  Alameda      1980         NaN     1     1     1     1     1     1     1
8     Active  Alameda      1980         NaN     1     1     1     1     1     1     1

(P.S. I'm not quite sure what you are trying to do with the groupby.)
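For what it's worth, the per-year loop can probably be replaced by a single NumPy broadcast comparison. The following is only a sketch under a few assumptions: the dataframe is rebuilt by hand from the example rows, a missing ClosedYear is treated as "never closed" by filling it with infinity, and the closing year is counted as open (the same `>=` rule as the loop). The names `is_open`, `flags`, `result` and `per_county` are made up for illustration. The last step shows one way the asker's extra step (summing per county) might be done.

```python
import numpy as np
import pandas as pd

# Example rows rebuilt by hand (assumption: same values as the question's table)
df = pd.DataFrame({
    "County": ["Alameda"] * 9,
    "OpenYear": [2005, 2006, 2008, 2011, 2011, 2012, 1980, 1980, 1980],
    "ClosedYear": [2015.0, np.nan, 2015.0, np.nan, np.nan, np.nan,
                   1989.0, np.nan, np.nan],
})

years = np.arange(2010, 2017)  # 2010..2016 inclusive

# Column vectors of shape (n_schools, 1) broadcast against years of shape (n_years,)
opened = df["OpenYear"].to_numpy()[:, None]
# Assumption: a missing ClosedYear means the school never closed
closed = df["ClosedYear"].fillna(np.inf).to_numpy()[:, None]

# Same rule as the loop: open when OpenYear <= year <= ClosedYear
is_open = (opened <= years) & (closed >= years)

flags = pd.DataFrame(is_open.astype(int), index=df.index,
                     columns=[str(y) for y in years])
result = pd.concat([df, flags], axis=1)

# Extra step: number of open schools per county per year
per_county = result.groupby("County")[list(flags.columns)].sum()
print(per_county)
```

To exclude the closing year instead, as the question describes, `closed >= years` would become `closed > years`.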

Answer 1 (score: 1)

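Unless you really need to create those intermediate columns, you can get the counts directly with groupby.size. Depending on whether you want to include the closing year, change the inequality from < to <=. If you want to group by county, you can do that in the same step as well.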
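As a rough illustration of that idea, and not necessarily the code this answer had in mind, one option is to pair every school with every year, keep only the pairs where the school was open, and count them with groupby.size. The sketch below assumes the dataframe is rebuilt by hand from the example rows; the `Year` column and the names `pairs`, `still_open` and `counts` are hypothetical.

```python
import numpy as np
import pandas as pd

# Example rows rebuilt by hand (assumption: same values as the question's table)
df = pd.DataFrame({
    "County": ["Alameda"] * 9,
    "OpenYear": [2005, 2006, 2008, 2011, 2011, 2012, 1980, 1980, 1980],
    "ClosedYear": [2015.0, np.nan, 2015.0, np.nan, np.nan, np.nan,
                   1989.0, np.nan, np.nan],
})
years = pd.DataFrame({"Year": list(range(2010, 2017))})

# Cross join via a constant helper key: one row per (school, year) pair
pairs = (df.assign(_key=1)
           .merge(years.assign(_key=1), on="_key")
           .drop(columns="_key"))

# Strict "<" excludes the closing year; use "<=" to include it
still_open = pairs["ClosedYear"].isna() | (pairs["Year"] < pairs["ClosedYear"])
open_pairs = pairs[(pairs["OpenYear"] <= pairs["Year"]) & still_open]

# Count open schools per county and year in one step
counts = open_pairs.groupby(["County", "Year"]).size()
print(counts)
```

Note that a county/year combination with no open schools simply does not appear in `counts`, rather than showing up as 0.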

Here is the starting df