I have a pandas DataFrame in which one column holds lists as values. I am feeding one of its columns into KFold.
filtered_labels = filtered_df['labels']
filtered_sentences = filtered_df.drop('labels', axis=1)
kf = KFold(n_splits=5) # Define the split - into 5 folds
kf.get_n_splits(filtered_sentences)
for train_index, test_index in kf.split(filtered_sentences.shape[0]):
    X_train, X_test = filtered_sentences.loc[train_index,filtered_sentences.columns], filtered_sentences.loc[test_index,filtered_sentences.columns]
    y_train, y_test = filtered_labels[train_index], filtered_labels[test_index]
    tdif_vectorizer = TfidfVectorizer(max_df=5,norm='l2',smooth_idf=True,use_idf=True,ngram_range=(1,1))
    train_corpus_as_string = [get_string_representation_from_tokens(sentence_tokens)
                              for sentence_tokens in X_train['setenceTokens']]
    tdif_train_features = tdif_vectorizer.fit_transform(train_corpus_as_string)
    tdif_test_features = tdif_vectorizer.transform(X_test)
    vModel = LogisticRegression()
    vModel.fit(tdif_train_features,y_train)
    tdif_predicted_data_set = vModel.predict(tdif_test_features)
When I print the following,
X_train, X_test = filtered_sentences.loc[train_index,filtered_sentences.columns], filtered_sentences.loc[test_index,filtered_sentences.columns]
X_train['setenceTokens']
Out[642]:
2171 [catastrophic, effect, hiroshima, nagasaki, at...
2172 [iraq, catastrophic, need, replace, constant, ...
2173 [learn, legacy, catastrophic, eruption, via]
2174 [catastrophic, effect, hiroshima, nagasaki, at...
2175 [wish, go, custom, werent, catastrophic]
2176 [best, part, old, baseball, manager, wear, uni...
2177 [learn, event, u, history, year, later]
2178 [catastrophic, effect, hiroshima, nagasaki, at...
2179 [catastrophic, effect, hiroshima, nagasaki, at...
2180 [society, respond, crisis, catastrophic]
2181 [british, upper, class, cause, catastrophic, s...
2182 [dear, anyone, family, alive, 2040]
2183 [scientist, believe, catastrophic, manmade, gl...
2184 [everything, seem, catastrophic, feel, bad, hi...
2185 [jim, blog, catastrophic, outcome, may, come, ...
2186 [u, want, lead, united, state, catastrophic, w...
2187 [stop, extreme, hurt, middle, class]
2188 [learn, legacy, catastrophic, eruption, new, y...
2189 [learn, legacy, catastrophic, eruption, via]
2190 [catastrophic, effect, hiroshima, nagasaki, at...
2191 [good, look, catastrophic, rain, flooding]
...
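Since these values are a list of lists, I want to convert them into an array of strings of the form ["society, respond, crisis, catastrophic", "something, catastrophic, come, adjust", ...] so that I can pass it to my tdif_vectorizer.fit_transform(array_of_strings).
When I iterate over the tokens with the following,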
train_corpus_as_string = [get_string_representation_from_tokens(sentence_tokens)
for sentence_tokens in X_train['setenceTokens']]
inside the function I print the list being processed, and one of the values comes out as nan. See below:
....
['escape', 'place', 'hide', 'time', 'space', 'collide']
['niggra', 'first', 'time', 'hear', 'song', 'sky', 'collide']
['even', 'star', 'moon', 'collide', 'oh', 'oh', 'never', 'want', 'back', 'life', 'take', 'word']
nan
and the error: TypeError: 'float' object is not iterable
Below is my get_string_representation_from_tokens method:
def get_string_representation_from_tokens(tokens):
    string_tokens = ""
    print(tokens)
    for token in tokens:
        string_tokens += str(token) + " "
    return string_tokens
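My end goal is to run 5-fold cross-validation: take the training data from each fold, vectorize it with TfidfVectorizer, fit a logistic regression model on the vectors, and predict values. TfidfVectorizer expects its input as an array of strings, which is why I iterate over the lists above to build the array described earlier.
How can I check whether a value is nan and assign an empty string instead? I have tried many approaches without success.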
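One possible way to guard the conversion, as a minimal sketch: treat anything that is not a list or tuple (a float nan included) as an empty sentence. The helper name tokens_to_string is hypothetical and not part of the original code, and unlike the original method it does not leave a trailing space.

```python
def tokens_to_string(tokens):
    """Join a list of tokens into one space-separated string.

    Hypothetical helper: any value that is not a list or tuple
    (for example a float nan) is mapped to an empty string.
    """
    if not isinstance(tokens, (list, tuple)):
        return ""
    return " ".join(str(token) for token in tokens)


# Toy rows standing in for X_train['setenceTokens']; one row is nan.
rows = [["society", "respond", "crisis"], float("nan"), ["learn", "legacy"]]
corpus = [tokens_to_string(row) for row in rows]
print(corpus)  # ['society respond crisis', '', 'learn legacy']
```

Checking the type avoids calling a NaN test on a list. Note that this only hides the nan rows; it does not explain where they come from.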
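Second question
I was trying to build a small example so the idea could be run easily, but I ran into a separate problem (apologies for raising it at the end). The problem is that nan values are introduced when I split the data.
The column in my original DataFrame has no null/nan values, as shown below: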
filtered_sentences.isnull().sum()
Out[652]:
setenceTokens 0
dtype: int64
But when I split it using the following line,
X_train, X_test = filtered_sentences.loc[train_index,filtered_sentences.columns], filtered_sentences.loc[test_index,filtered_sentences.columns]
X_train contains null/nan values, see below:
X_train.isnull().sum()
Out[653]:
setenceTokens 21
dtype: int64
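There are 21 such values. I saw a similar question, "NaNs suddenly appearing for sklearn KFolds", and applied the same approach, but I still get nan values. If I can get past this, I will not need to check for nan at all. Sorry for the long post.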
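Answer 0 (score: 0)
I found the problem. With this fix I no longer get nan values. The problem was the way I created the DataFrame. Previously, the column values in my DataFrame were arrays, like the following: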
['feel','bad','literally','feel']
['feeling','heart','sinking']
but the values should be plain strings:
feel bad literally feel
feeling heart sinking
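The conversion from token lists to plain strings could be sketched as follows. This assumes the column is named setenceTokens as in the question; the new column name sentence is made up for illustration.

```python
import pandas as pd

# Toy frame using the column name from the question (spelling included).
df = pd.DataFrame({
    "setenceTokens": [
        ["feel", "bad", "literally", "feel"],
        ["feeling", "heart", "sinking"],
    ]
})

# Join each token list into a single space-separated string per row.
df["sentence"] = df["setenceTokens"].apply(lambda tokens: " ".join(tokens))
print(df["sentence"].tolist())
# ['feel bad literally feel', 'feeling heart sinking']
```

A column of strings like this can be passed directly to TfidfVectorizer.fit_transform.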
After that, splitting with KFold no longer gave me nan values. Hope this saves someone some time.
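A separate thing that may be worth checking, and this is an assumption rather than something the post confirms: KFold.split yields integer positions, while .loc selects by index label. If the DataFrame was filtered and its index no longer runs from 0 to n-1, label lookups can miss rows (older pandas versions filled those with nan, newer ones raise a KeyError). A minimal sketch with a made-up toy frame, using .iloc for positional selection:

```python
import pandas as pd
from sklearn.model_selection import KFold

# Toy frame whose index labels start at 100, as can happen after filtering.
df = pd.DataFrame(
    {"sentence": ["a b", "c d", "e f", "g h", "i j"]},
    index=[100, 101, 102, 103, 104],
)

kf = KFold(n_splits=5)
for train_pos, test_pos in kf.split(df):
    # train_pos / test_pos are positions 0..4, so select rows with .iloc.
    X_train, X_test = df.iloc[train_pos], df.iloc[test_pos]
    assert not X_train.isnull().values.any()
    print(len(X_train), len(X_test))  # 4 1 on every fold
```

Calling df.reset_index(drop=True) before the split is another option, since labels and positions then coincide.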