Python Traceback - gradientboosting.py: how do I fix this type of error?

Date: 2018-10-18 15:35:15

Tags: python scikit-learn

I am trying to use GradientBoostingClassifier to detect SQL injection:

# select the feature columns as a NumPy array
X = dataframe.as_matrix(['token_length','entropy','sqli_g_means','plain_g_means'])

# encode the categorical label
from sklearn.preprocessing import LabelEncoder
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(dataframe['type'].tolist())

# split into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# fit the gradient boosting classifier
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=7, random_state=0).fit(X_train, y_train)
print "Gradient Boosting Tree Accuracy: %f" % clf.score(X_test, y_test)

An error occurs while training the model:

Traceback (most recent call last):
  File "ml_sql_injection.py", line 136, in <module>
    clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=7, random_state=0).fit(X_train, y_train)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/ensemble/gradient_boosting.py", line 1404, in fit
    y = self._validate_y(y, sample_weight)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/ensemble/gradient_boosting.py", line 1968, in _validate_y
    % n_trim_classes)
ValueError: y contains 1 class after sample_weight trimmed classes with zero weights, while a minimum of 2 classes are required.

How can I fix this kind of error?

2 answers:

Answer 0 (score: 1)

I realize this is probably too late, but I still want to answer the question for anyone else who runs into it.

The error means that y_train contains only one value, i.e. only one class is left to classify, while at least two are required. When the dataset was split into train and test sets, y_train ended up with a single class.
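
A quick way to confirm the diagnosis and avoid it is to check the class counts and pass stratify=y to train_test_split so that both classes end up in y_train. This is a minimal sketch, assuming the X and y built in the question:

# Sketch: confirm that y has two classes and keep both in the training split
# (assumes the X and y from the question above)
import numpy as np
from sklearn.model_selection import train_test_split

print(np.unique(y, return_counts=True))        # labels and their counts in the full dataset

# stratify=y preserves the class proportions in both splits,
# so y_train cannot be left with a single class
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

print(np.unique(y_train, return_counts=True))  # both classes should still appear here

If the full y already contains only one class (for example, the labels did not load correctly), stratification will not help and the labels themselves need to be fixed.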

Answer 1 (score: 0)

I had this problem too.

It is actually a well-described error: all of the data in y carried the same label. I found this very surprising, since there was plenty of data. My guess is that it came from a copy-paste or data-aggregation issue, an artifact from whoever aggregated my data. Perhaps large contiguous chunks all carried the same label?

I fixed it by shuffling the data, which reliably made the boosting error go away.

For my particular data in Pandas, it looked like this:

import pandas as pd

x = pd.read_csv(filename, delimiter=",", header=None)
x = x.sample(frac=1)      # shuffle the rows -> fixes the boosting error
y = x.iloc[:, [0]]        # extract the label from the first column
x = x.drop([0], axis=1)   # drop the label column from X
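
For reference, the same row shuffling can also be done with scikit-learn's shuffle utility instead of pandas; a small sketch under the same assumptions (x and y as built above):

# Alternative sketch: shuffle features and labels together,
# keeping the rows of x and y aligned
from sklearn.utils import shuffle

x, y = shuffle(x, y, random_state=0)

Either way, the point is that the rows of x and y stay aligned while the label ordering is randomized before the split.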

This also gave me a large and reliable improvement in the estimator's results.
