How to check for NaN in a pandas column whose values are lists

Time: 2018-08-05 04:47:17

Tags: python pandas dataframe vectorization

I have a pandas DataFrame in which one column holds lists as values. I am feeding one of its columns into KFold.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

filtered_labels = filtered_df['labels']
filtered_sentences = filtered_df.drop('labels', axis=1)

kf = KFold(n_splits=5)  # Define the split: 5 folds
kf.get_n_splits(filtered_sentences)

# split() takes the data itself, not the number of rows
for train_index, test_index in kf.split(filtered_sentences):
    X_train, X_test = filtered_sentences.loc[train_index, filtered_sentences.columns], filtered_sentences.loc[test_index, filtered_sentences.columns]
    y_train, y_test = filtered_labels[train_index], filtered_labels[test_index]

    tdif_vectorizer = TfidfVectorizer(max_df=5, norm='l2', smooth_idf=True, use_idf=True, ngram_range=(1, 1))

    # TfidfVectorizer expects strings, so join each token list first
    train_corpus_as_string = [get_string_representation_from_tokens(sentence_tokens)
                              for sentence_tokens in X_train['setenceTokens']]
    test_corpus_as_string = [get_string_representation_from_tokens(sentence_tokens)
                             for sentence_tokens in X_test['setenceTokens']]

    tdif_train_features = tdif_vectorizer.fit_transform(train_corpus_as_string)
    tdif_test_features = tdif_vectorizer.transform(test_corpus_as_string)

    vModel = LogisticRegression()
    vModel.fit(tdif_train_features, y_train)
    tdif_predicted_data_set = vModel.predict(tdif_test_features)

When I print the following,

X_train, X_test = filtered_sentences.loc[train_index,filtered_sentences.columns], filtered_sentences.loc[test_index,filtered_sentences.columns]

X_train['setenceTokens']

Out[642]: 
2171     [catastrophic, effect, hiroshima, nagasaki, at...
2172     [iraq, catastrophic, need, replace, constant, ...
2173          [learn, legacy, catastrophic, eruption, via]
2174     [catastrophic, effect, hiroshima, nagasaki, at...
2175              [wish, go, custom, werent, catastrophic]
2176     [best, part, old, baseball, manager, wear, uni...
2177               [learn, event, u, history, year, later]
2178     [catastrophic, effect, hiroshima, nagasaki, at...
2179     [catastrophic, effect, hiroshima, nagasaki, at...
2180              [society, respond, crisis, catastrophic]
2181     [british, upper, class, cause, catastrophic, s...
2182                   [dear, anyone, family, alive, 2040]
2183     [scientist, believe, catastrophic, manmade, gl...
2184     [everything, seem, catastrophic, feel, bad, hi...
2185     [jim, blog, catastrophic, outcome, may, come, ...
2186     [u, want, lead, united, state, catastrophic, w...
2187                  [stop, extreme, hurt, middle, class]
2188     [learn, legacy, catastrophic, eruption, new, y...
2189          [learn, legacy, catastrophic, eruption, via]
2190     [catastrophic, effect, hiroshima, nagasaki, at...
2191            [good, look, catastrophic, rain, flooding]
...

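Since these values are a list of lists, I want to convert them into an array in the following format: ["society, respond, crisis, catastrophic", "something, catastrophic, coming, adjust", ...], so that I can feed it to my tdif_vectorizer.fit_transform(array_of_strings).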

When I iterate over the tokens with the following,

train_corpus_as_string = [get_string_representation_from_tokens(sentence_tokens)
                        for sentence_tokens in X_train['setenceTokens']]

inside the function I print the list I receive, and one of the values comes out as nan. See below:

....
['escape', 'place', 'hide', 'time', 'space', 'collide']
['niggra', 'first', 'time', 'hear', 'song', 'sky', 'collide']
['even', 'star', 'moon', 'collide', 'oh', 'oh', 'never', 'want', 'back', 'life', 'take', 'word']
nan

and the error: TypeError: 'float' object is not iterable

Below is my get_string_representation_from_tokens method:

def get_string_representation_from_tokens(tokens):
    string_tokens = ""
    print(tokens)
    for token in tokens:
        string_tokens += str(token) + " "
    return string_tokens

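My end goal is to run 5-fold cross-validation, take the training data from each fold, vectorize it with TfidfVectorizer, feed the vectors to a logistic regression model, and predict values. TfidfVectorizer expects its input as an array of strings, which is why I iterate over the lists above to build the array described earlier.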

How can I check whether a value is nan and assign an empty string instead? I have tried many approaches without success.
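One way to do this check, as a minimal sketch: a missing cell in a list-valued column shows up as a float NaN, so it can be detected before iterating. The helper name tokens_to_string and the sample rows are hypothetical and only illustrate the idea; unlike the method above, it joins with " ".join, so there is no trailing space.

```python
import math


def tokens_to_string(tokens):
    """Join a list of tokens with spaces; return "" for None or NaN."""
    if tokens is None:
        return ""
    # A missing cell in an object column is a float NaN, not a list.
    if isinstance(tokens, float) and math.isnan(tokens):
        return ""
    return " ".join(str(token) for token in tokens)


# Hypothetical rows standing in for X_train['setenceTokens'].
rows = [
    ["society", "respond", "crisis", "catastrophic"],
    float("nan"),
    ["dear", "anyone"],
]
corpus = [tokens_to_string(row) for row in rows]
```

The check uses isinstance plus math.isnan rather than pd.isna(tokens) because pd.isna applied to a list returns an array of booleans, which cannot be used directly in an if statement.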

Second question

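I am trying to put together an example so that the idea can be run easily, but I have a separate problem (apologies for raising it at the end). The problem is that NaN values are introduced when I split the data.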

The column in my original DataFrame has no null/NaN values, as shown below:

filtered_sentences.isnull().sum()
Out[652]: 
setenceTokens    0
dtype: int64

But when I split it using the following line,

X_train, X_test = filtered_sentences.loc[train_index,filtered_sentences.columns], filtered_sentences.loc[test_index,filtered_sentences.columns]

X_train ends up containing null/NaN values, see below:

X_train.isnull().sum()
Out[653]: 
setenceTokens    21
dtype: int64

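There are 21 such values. I saw a similar question, "NaNs suddenly appearing for sklearn KFolds", and applied the same approach, but I still get NaN values. If I could get past this, I would not need to check for NaN at all. Sorry for the long post.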
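A likely explanation for the NaN values, assuming filtered_df was produced by filtering rows so that its index labels have gaps: KFold.split yields row positions (0 to n-1), while .loc looks rows up by index label. Positions with no matching label come back as NaN rows in older pandas versions (recent versions raise a KeyError instead). The sketch below uses hypothetical toy data, and a hand-written positions array stands in for one fold's indices; it shows the two usual remedies, selecting by position with .iloc or renumbering the index with reset_index(drop=True).

```python
import numpy as np
import pandas as pd

# Hypothetical data: six rows, then a filter that leaves gaps in the index.
df = pd.DataFrame({"setenceTokens": [["a"], ["b"], ["c"], ["d"], ["e"], ["f"]]})
filtered = df[df.index % 2 == 0]  # index labels are now 0, 2, 4

# Stand-in for the positions KFold.split would yield for one fold.
positions = np.array([0, 1])

# Remedy 1: select by position, which ignores the index labels.
by_position = filtered.iloc[positions]

# Remedy 2: renumber the index so labels equal positions, then .loc is safe.
renumbered = filtered.reset_index(drop=True)
by_label = renumbered.loc[positions]
```

Both selections return the first two remaining rows (original labels 0 and 2) with no missing values.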

1 Answer:

Answer 0: (score: 0)

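I found the problem. With the following fix I no longer get NaN values. The problem was the way I created the DataFrame. Previously, the column values in my DataFrame were arrays, like this: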

['feel','bad','literally','feel']
['feeling','heart','sinking']

but the values should be plain strings:

feel bad literally feel
feeling heart sinking

After that, splitting with KFold no longer gave me NaN values. Hope this saves someone some time.
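For illustration, the whole workflow with sentences stored as plain strings might look like the sketch below. The toy data, the added sentence column, and the default TfidfVectorizer and LogisticRegression settings are assumptions chosen so the example runs on ten rows; they are not the original poster's data or parameters. Rows are selected with .iloc because KFold yields positions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Hypothetical toy data; the column names mirror the question.
df = pd.DataFrame({
    "setenceTokens": [
        ["catastrophic", "effect", "hiroshima"],
        ["learn", "legacy", "catastrophic", "eruption"],
        ["society", "respond", "crisis"],
        ["stop", "extreme", "hurt", "middle", "class"],
        ["good", "look", "rain", "flooding"],
        ["dear", "anyone", "family", "alive"],
        ["wish", "go", "custom"],
        ["learn", "event", "history", "year", "later"],
        ["feel", "bad", "literally", "feel"],
        ["feeling", "heart", "sinking"],
    ],
    "labels": [1, 1, 0, 0, 1, 0, 1, 0, 1, 0],
})

# Store each sentence as one space-separated string.
df["sentence"] = df["setenceTokens"].apply(" ".join)

kf = KFold(n_splits=5)
fold_predictions = []
for train_pos, test_pos in kf.split(df):
    # KFold yields row positions, so select with .iloc rather than .loc.
    train, test = df.iloc[train_pos], df.iloc[test_pos]

    # Fit the vocabulary on the training fold only, then reuse it for the test fold.
    vectorizer = TfidfVectorizer()
    train_features = vectorizer.fit_transform(train["sentence"])
    test_features = vectorizer.transform(test["sentence"])

    model = LogisticRegression()
    model.fit(train_features, train["labels"])
    fold_predictions.append(model.predict(test_features))
```

After the loop, fold_predictions holds one array of predicted labels per fold.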