Python - Calculating the Gini coefficient with NumPy

Date: 2015-07-14 20:28:30

Tags: python numpy economics

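I'm a newbie: I have only just started learning Python, and I'm trying to write some code to calculate the Gini index for a fake country. I came up with the following: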

GDP = (653200000000)
A = (0.49 * GDP) / 100 # Poorest 10%
B = (0.59 * GDP) / 100
C = (0.69 * GDP) / 100
D = (0.79 * GDP) / 100
E = (1.89 * GDP) / 100
F = (2.55 * GDP) / 100
G = (5.0 * GDP) / 100
H = (10.0 * GDP) / 100
I = (18.0 * GDP) / 100
J = (60.0 * GDP) / 100 # Richest 10%

# Divide into quintiles and total income within each quintile
Q1 = float(A + B) # lowest quintile
Q2 = float(C + D) # second quintile
Q3 = float(E + F) # third quintile
Q4 = float(G + H) # fourth quintile
Q5 = float(I + J) # fifth quintile

# Calculate the percent of total income in each quintile
T1 = float((100 * Q1) / GDP) / 100
T2 = float((100 * Q2) / GDP) / 100
T3 = float((100 * Q3) / GDP) / 100
T4 = float((100 * Q4) / GDP) / 100
T5 = float((100 * Q5) / GDP) / 100

TR = float(T1 + T2 + T3 + T4 + T5)

# Calculate the cumulative percentage of household income
H1 = float(T1)
H2 = float(T1+T2)
H3 = float(T1+T2+T3)
H4 = float(T1+T2+T3+T4)
H5 = float(T1+T2+T3+T4+T5)

# Magic! Using numpy to calculate area under Lorenz curve.
# Problem might be here?
import numpy as np
from numpy import trapz

# The y values. Cumulative percentage of incomes
y = np.array([Q1,Q2,Q3,Q4,Q5])

# Compute the area using the composite trapezoidal rule.
area_lorenz = trapz(y, dx=5)

# Calculate the area below the perfect equality line.
area_perfect = (Q5 * H5) / 2

# Seems to work fine until here.
# Manually calculated Gini using the values given for the areas above
# turns out at .58 which seems reasonable?

Gini = area_perfect - area_lorenz

# Prints utter nonsense.
print Gini

The printed result makes no sense. I took the values given by the area variables and did the math by hand, and it came out fairly well, but when I let the program do it, it gives me a completely ??? value (-1.7198...). What am I missing? Can someone point me in the right direction?

Thanks!

2 Answers:

Answer 0 (score: 2)

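The first problem is that the Gini coefficient equation is not applied correctly:

Gini = (area between the Lorenz curve and perfect equality) / (area under perfect equality)

The denominator is not included in the calculation, and the equation used for the area under the equality line is also incorrect (see the code for an approach that uses np.linspace and np.trapz).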
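To make the role of the denominator concrete, here is a minimal sketch in plain Python (no NumPy, so the arithmetic stays visible). The helper name trapezoid_area is made up for this illustration, and the Lorenz points are the cumulative quintile shares from the question with the origin prepended.

```python
def trapezoid_area(ys, dx):
    """Composite trapezoidal rule for equally spaced points."""
    return sum((a + b) / 2.0 for a, b in zip(ys, ys[1:])) * dx

# Cumulative quintile shares from the question, starting at the origin.
lorenz = [0.0, 0.0108, 0.0256, 0.07, 0.22, 1.0]
n_intervals = len(lorenz) - 1
dx = 1.0 / n_intervals

# Perfect equality: the cumulative share equals the population share.
equality = [i / float(n_intervals) for i in range(len(lorenz))]

area_equality = trapezoid_area(equality, dx)  # 0.5 on the unit square
area_lorenz = trapezoid_area(lorenz, dx)

numerator_only = area_equality - area_lorenz  # area between the two curves
gini = numerator_only / area_equality         # normalized by the denominator

print("area between curves:", round(numerator_only, 4))
print("Gini coefficient:", round(gini, 4))
```

Because the area under the equality line is 0.5 here, leaving out the division halves the result (about 0.33 instead of about 0.67).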

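Another problem is that the first point of the Lorenz curve is missing (it needs to start at 0, not at the share of the first quintile). Although the area under the Lorenz curve between 0 and the first quintile is small, its ratio to the area under the equality line, once that line is extended to 0, is quite large.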

The following gives a result equivalent to the methods given in the answers to this question:

import numpy as np
    
GDP = 653200000000 # this value isn't actually needed
    
# Decile percents of global GDP
gdp_decile_percents = [0.49, 0.59, 0.69, 0.79, 1.89, 2.55, 5.0, 10.0, 18.0, 60.0]
# Compare with a tolerance: an exact == on summed floats can fail
print('Percents sum to 100:', np.isclose(sum(gdp_decile_percents), 100))
    
gdp_decile_shares = [i/100 for i in gdp_decile_percents]
    
# Convert to quintile shares of total GDP
gdp_quintile_shares = [(gdp_decile_shares[i] + gdp_decile_shares[i+1]) for i in range(0, len(gdp_decile_shares), 2)]
    
# Insert 0 for the first value in the Lorenz curve
gdp_quintile_shares.insert(0, 0)
    
# Cumulative sum of shares (Lorenz curve values)
shares_cumsum = np.cumsum(a=gdp_quintile_shares, axis=None)
    
# Perfect equality line
pe_line = np.linspace(start=0.0, stop=1.0, num=len(shares_cumsum))

# N points span N-1 intervals, so the spacing is 1/(N-1); the ratio below is unaffected
dx = 1.0 / (len(shares_cumsum) - 1)
area_under_lorenz = np.trapz(y=shares_cumsum, dx=dx)
area_under_pe = np.trapz(y=pe_line, dx=dx)
    
gini = (area_under_pe - area_under_lorenz) / area_under_pe
    
print('Gini coefficient:', gini)

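With the areas computed by np.trapz, the coefficient is 0.67. The value computed without the first point of the Lorenz curve, again using trapz, was 0.59. Our calculation of global inequality is now roughly equal to what the methods in the question linked above give (you do not need to add 0 to the lists/arrays for those methods). Note that using scipy.integrate.simps gives 0.69, which means the methods in the other question agree more closely with trapezoidal integration than with Simpson integration.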
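The link to the other question is not preserved here, so the following is only an assumption about what those methods look like: a commonly used rank-based formula that works on the sorted values directly, with no Lorenz curve and no leading 0. The function name gini_from_values is made up for this sketch.

```python
def gini_from_values(values):
    """Rank-based Gini: sum((2i - n - 1) * x_i) / (n * sum(x)), x sorted ascending."""
    xs = sorted(values)
    n = len(xs)
    total = float(sum(xs))
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)

# Quintile shares from the question (not cumulative, no leading 0).
quintile_shares = [0.0108, 0.0148, 0.0444, 0.15, 0.78]

gini_rank = gini_from_values(quintile_shares)
print("Gini (rank formula):", round(gini_rank, 4))
```

On the quintile shares this agrees with the trapezoidal result that includes the origin (about 0.67), which is consistent with the remark that those methods do not need the extra 0.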

Below is the plotting code, which uses plt.fill_between to shade the area under the Lorenz curve:

from matplotlib import pyplot as plt

plt.plot(pe_line, shares_cumsum, label='lorenz_curve')
plt.plot(pe_line, pe_line, label='perfect_equality')
plt.fill_between(pe_line, shares_cumsum)
plt.title('Gini: {}'.format(gini), fontsize=20)
plt.ylabel('Cumulative Share of Global GDP', fontsize=15)
plt.xlabel('Income Quintiles (Lowest to Highest)', fontsize=15)
plt.legend()
plt.tight_layout()
plt.show()

The resulting gini curve.

Answer 1 (score: 1)

Stardust,

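Your problem is not with numpy.trapz; it is with 1) your definition of the perfect equality distribution, and 2) the normalization of the Gini coefficient.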

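First, you have defined the area under the perfect equality distribution as Q5*H5/2, which is half the product of the fifth quintile's income and the cumulative percentage (1.0). I'm not sure what this number is meant to represent.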
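For what it's worth, the odd value in the question can be reproduced by hand. The sketch below redoes the question's arithmetic in plain Python; the helper trapezoid_area is a stand-in written for this illustration, not the NumPy function, and H5 is taken as exactly 1.0.

```python
GDP = 653200000000.0
decile_percents = [0.49, 0.59, 0.69, 0.79, 1.89, 2.55, 5.0, 10.0, 18.0, 60.0]

# Decile incomes in currency units, then summed pairwise into quintiles.
deciles = [p * GDP / 100 for p in decile_percents]
quintiles = [deciles[i] + deciles[i + 1] for i in range(0, len(deciles), 2)]

def trapezoid_area(ys, dx):
    """Composite trapezoidal rule for equally spaced points."""
    return sum((a + b) / 2.0 for a, b in zip(ys, ys[1:])) * dx

# The question integrates raw quintile incomes (not cumulative shares) ...
area_lorenz = trapezoid_area(quintiles, dx=5)
# ... and uses Q5 * H5 / 2 as the "perfect equality" area, with H5 == 1.0.
area_perfect = quintiles[-1] * 1.0 / 2

result = area_perfect - area_lorenz
print(result)
```

The result is on the order of -1.7198e12, matching the digits quoted in the question: the two "areas" are in currency units and are never normalized.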

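Second, you have to normalize by the area under the perfect equality distribution; that is:

gini = (area under perfect equality - area under Lorenz) / (area under perfect equality)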

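If you define the perfect equality curve so that its area is 1, you don't have to worry about this, but the normalization is a good safeguard in case there is an error in your definition of the perfect equality curve.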

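To address both issues, I defined the perfect equality curve with numpy.linspace. The first advantage is that you can define it in the same way as the real distribution, using that distribution's properties. In other words, whether you use quartiles, quintiles, or deciles, the perfect equality CDF (y_pe, below) will have the correct shape. The second advantage is that its area is also computed with numpy.trapz; this bit of parallelism makes the code easier to read and guards against erroneous calculations.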
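As a sketch of the "same code for any grouping" idea, here is a plain-Python version with a small stand-in for linspace. All function names are made up for the illustration, and unlike the code in this answer it also prepends the origin to the Lorenz points, which is an assumption on my part.

```python
def linspace(start, stop, num):
    """Minimal stand-in for numpy.linspace with the endpoint included."""
    return [start + (stop - start) * i / float(num - 1) for i in range(num)]

def trapezoid_area(ys, dx):
    """Composite trapezoidal rule for equally spaced points."""
    return sum((a + b) / 2.0 for a, b in zip(ys, ys[1:])) * dx

def gini_from_shares(shares):
    """Gini from equal-sized group shares (quartiles, quintiles, deciles, ...)."""
    lorenz = [0.0]
    for share in shares:
        lorenz.append(lorenz[-1] + share)  # running total = Lorenz curve
    equality = linspace(0.0, 1.0, len(lorenz))
    dx = 1.0 / (len(lorenz) - 1)
    area_pe = trapezoid_area(equality, dx)
    return (area_pe - trapezoid_area(lorenz, dx)) / area_pe

decile_percents = [0.49, 0.59, 0.69, 0.79, 1.89, 2.55, 5.0, 10.0, 18.0, 60.0]
decile_shares = [p / 100 for p in decile_percents]
quintile_shares = [decile_shares[i] + decile_shares[i + 1]
                   for i in range(0, len(decile_shares), 2)]

gini_quintiles = gini_from_shares(quintile_shares)
gini_deciles = gini_from_shares(decile_shares)
print(round(gini_quintiles, 4), round(gini_deciles, 4))
```

The finer decile grouping gives a somewhat higher value (about 0.72) than the quintiles (about 0.67), since merging groups hides the inequality inside each group.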

import numpy as np
from matplotlib import pyplot as plt
from numpy import trapz

GDP = (653200000000)
A = (0.49 * GDP) / 100 # Poorest 10%
B = (0.59 * GDP) / 100
C = (0.69 * GDP) / 100
D = (0.79 * GDP) / 100
E = (1.89 * GDP) / 100
F = (2.55 * GDP) / 100
G = (5.0 * GDP) / 100
H = (10.0 * GDP) / 100
I = (18.0 * GDP) / 100
J = (60.0 * GDP) / 100 # Richest 10%

# Divide into quintiles and total income within each quintile
Q1 = float(A + B) # lowest quintile
Q2 = float(C + D) # second quintile
Q3 = float(E + F) # third quintile
Q4 = float(G + H) # fourth quintile
Q5 = float(I + J) # fifth quintile

# Calculate the percent of total income in each quintile
T1 = float((100 * Q1) / GDP) / 100
T2 = float((100 * Q2) / GDP) / 100
T3 = float((100 * Q3) / GDP) / 100
T4 = float((100 * Q4) / GDP) / 100
T5 = float((100 * Q5) / GDP) / 100

TR = float(T1 + T2 + T3 + T4 + T5)

# Calculate the cumulative percentage of household income
H1 = float(T1)
H2 = float(T1+T2)
H3 = float(T1+T2+T3)
H4 = float(T1+T2+T3+T4)
H5 = float(T1+T2+T3+T4+T5)

# The y values. Cumulative percentage of incomes
y = np.array([H1,H2,H3,H4,H5])

# The perfect equality y values. Cumulative percentage of incomes.
y_pe = np.linspace(0.0,1.0,len(y))

# Compute the area using the composite trapezoidal rule.
area_lorenz = np.trapz(y, dx=5)

# Calculate the area below the perfect equality line.
area_perfect = np.trapz(y_pe, dx=5)

# Seems to work fine until here. 
# Manually calculated Gini using the values given for the areas above 
# turns out at .58 which seems reasonable?

Gini = (area_perfect - area_lorenz)/area_perfect

print(Gini)

plt.plot(y,label='lorenz')
plt.plot(y_pe,label='perfect_equality')
plt.legend()
plt.show()