
Time: 2019-07-19 00:29:49

Tags: python date dataframe training-data test-data


    issue_d int_rate    installment dti revol_bal   revol_util  inq_last_6mths  delinq_2yrs pub_rec loan_status purpose_credit_card purpose_debt_consolidation  purpose_home_improvement    purpose_house   purpose_major_purchase  purpose_medical purpose_moving  purpose_other   purpose_renewable_energy    purpose_small_business  purpose_vacation    purpose_wedding
11  Mar-2018    14.07%  233.05  24.69   707 15.7%   0   0   0   1   0   0   0   0   1   0   0   0   0   0   0   0
16  Mar-2018    11.98%  232.44  20.25   5004    36% 0   0   0   1   0   0   1   0   0   0   0   0   0   0   0   0
17  Mar-2018    26.77%  607.97  24.40   7364    46% 1   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0
20  Mar-2018    20.39%  560.94  15.76   14591   34.2%   0   1   0   1   0   0   0   1   0   0   0   0   0   0   0   0
23  Mar-2018    7.34%   930.99  16.18   755 0%  0   1   0   1   0   0   0   1   0   0   0   0   0   0   0   0
130741  Apr-2018    6.07%   309.85  14.64   17380   24.5%   1   0   0   1   0   1   0   0   0   0   0   0   0   0   0   0
130742  Apr-2018    11.98%  555.86  21.05   19591   20.5%   2   0   0   1   0   1   0   0   0   0   0   0   0   0   0   0
130744  Apr-2018    11.98%  215.84  14.68   4707    37.7%   1   0   0   1   0   1   0   0   0   0   0   0   0   0   0   0



X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=123, stratify=y)


You can download the CSVs here (bank loans from 2018, split into four quarters). They can be loaded with Python 3 as follows:

import pandas as pd 
# Control delimiters, rows, column names with read_csv (see later) 
data_Q1 = pd.read_csv("LoanStats_2018Q1.csv", skiprows=1, skipfooter=2, engine='python')
data_Q2 = pd.read_csv("LoanStats_2018Q2.csv", skiprows=1, skipfooter=2, engine='python')
data_Q3 = pd.read_csv("LoanStats_2018Q3.csv", skiprows=1, skipfooter=2, engine='python')
data_Q4 = pd.read_csv("LoanStats_2018Q4.csv", skiprows=1, skipfooter=2, engine='python')
frames = [data_Q1,data_Q2,data_Q3,data_Q4]

result = pd.concat(frames)
subset = result.loc[result["loan_status"].isin(['Charged Off','Fully Paid'])]

2 Answers:

Answer 0 (score: 0)




df['issue_d'] = df['issue_d'].astype('datetime64[ns]')
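Because the issue_d values look like 'Mar-2018', passing an explicit format to pd.to_datetime may be more predictable than relying on inference. A minimal sketch, assuming an English locale for the month abbreviations and using a toy DataFrame in place of the real loan data:

```python
import pandas as pd

# Toy frame standing in for the loan data (values copied from the question's sample).
df = pd.DataFrame({"issue_d": ["Mar-2018", "Apr-2018", "Mar-2018"]})

# %b is the abbreviated month name, %Y the four-digit year.
# Each value is parsed as the first day of that month.
df["issue_d"] = pd.to_datetime(df["issue_d"], format="%b-%Y")
```

Once the column holds real datetimes, ordinary comparisons such as df["issue_d"] < "2018-04-01" can be used to build a mask.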


Or use strptime to parse a custom time format:


df['d_object'] = df.d_object.apply(my_convert_function)
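The answer does not show my_convert_function, so its body here is an assumption. One possible version, assuming the strings follow the 'Mar-2018' pattern from the question and an English locale:

```python
from datetime import datetime

import pandas as pd


def my_convert_function(value):
    """Parse a string such as 'Mar-2018' into a datetime (first day of the month).

    Hypothetical body: the original answer only names this function.
    """
    return datetime.strptime(value, "%b-%Y")


# Toy column standing in for the real data.
df = pd.DataFrame({"d_object": ["Mar-2018", "Apr-2018"]})
df["d_object"] = df["d_object"].apply(my_convert_function)
```

For large frames a vectorized pd.to_datetime call is usually faster than apply, but the per-row function is easier to adapt to irregular formats.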


Answer 1 (score: 0)


['Mar-2018', 'Feb-2018', 'Jan-2018', 'Jun-2018', 'May-2018', 'Apr-2018']


In [545]: periods = pd.PeriodIndex(['Mar-2018', 'Feb-2018', 'Jan-2018', 'Jun-2018', 'May-2018', 'Apr-2018'], freq='M'); periods
Out[545]: PeriodIndex(['2018-03', '2018-02', '2018-01', '2018-06', '2018-05', '2018-04'], dtype='period[M]', freq='M')

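We can then use an expression such as periods <= '2018-09' (yes, a PeriodIndex understands comparisons with strings) to create a boolean mask that selects which rows go into the training and test DataFrames.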

In [558]: pd.PeriodIndex(['Mar-2018', 'Feb-2018', 'Jan-2018', 'Jun-2018', 'May-2018', 'Apr-2018'], freq='M') < '2018-04'
Out[558]: array([ True,  True,  True, False, False, False])

import pandas as pd 
# Control delimiters, rows, column names with read_csv (see later) 
data_Q1 = pd.read_csv("LoanStats_2018Q1.csv", skiprows=1, skipfooter=2, engine='python')
data_Q2 = pd.read_csv("LoanStats_2018Q2.csv", skiprows=1, skipfooter=2, engine='python')
data_Q3 = pd.read_csv("LoanStats_2018Q3.csv", skiprows=1, skipfooter=2, engine='python')
data_Q4 = pd.read_csv("LoanStats_2018Q4.csv", skiprows=1, skipfooter=2, engine='python')
frames = [data_Q1,data_Q2,data_Q3,data_Q4]
result = pd.concat(frames)
# .copy() so that adding a column below does not trigger SettingWithCopyWarning
subset = result.loc[result["loan_status"].isin(['Charged Off','Fully Paid'])].copy()

subset['issue_period'] = pd.PeriodIndex(subset['issue_d'].values, freq='M')
mask = (subset['issue_period'] <= '2018-09')
train = subset.loc[mask]
test = subset.loc[~mask]
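To try the split without downloading the CSVs, here is a small sketch on made-up rows. It is a variation on the code above: the dates are parsed with an explicit format and converted with .dt.to_period('M') instead of being passed straight to PeriodIndex, and the cutoff is an explicit pd.Period rather than a string. The toy values and the cutoff name are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for `subset`; the real rows come from the LoanStats CSVs.
subset = pd.DataFrame({
    "issue_d": ["Mar-2018", "Sep-2018", "Oct-2018", "Dec-2018"],
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid", "Fully Paid"],
})

# Parse 'Mon-YYYY' strings, then reduce them to monthly periods.
issue_dates = pd.to_datetime(subset["issue_d"], format="%b-%Y")
subset["issue_period"] = issue_dates.dt.to_period("M")

# Everything issued up to and including September 2018 is training data.
cutoff = pd.Period("2018-09", freq="M")
mask = subset["issue_period"] <= cutoff
train = subset.loc[mask]
test = subset.loc[~mask]
```

Unlike train_test_split with shuffle=True, this keeps every test row strictly later in time than every training row.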