Question

I am trying to create dummy variables for a categorical variable in a pandas DataFrame. When I attempt this with scikit-learn's OneHotEncoder, I get an error.

Here is a sample of my data:

{'Country': {0: 'France', 1: 'Spain', 2: 'Germany', 3: 'Spain', 4: 'Germany'},
 'Age': {0: 44.0, 1: 27.0, 2: 30.0, 3: 38.0, 4: 40.0},
 'Salary': {0: 72000.0, 1: 48000.0, 2: 54000.0, 3: 61000.0, 4: nan},
 'Purchased': {0: 'No', 1: 'Yes', 2: 'No', 3: 'No', 4: 'Yes'}}

I isolated the explanatory variables as follows:

x = data.iloc[:, :-1].values

and handled the missing values:

import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])

I am trying to create dummy variables from the Country variable, but I keep getting an error. This is the code I used:

from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder(handle_unknown='ignore')
x[:, 0] = onehotencoder.fit_transform(x[:, 0])

Here is the traceback:

ValueError                                Traceback (most recent call last)
<ipython-input-64-692755a5a465> in <module>
      1 onehotencoder = OneHotEncoder(handle_unknown='ignore')
----> 2 x[:, 0] = onehotencoder.fit_transform(x[:, 0])

~\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py in fit_transform(self, X, y)
    408         """
    409         self._validate_keywords()
--> 410         return super().fit_transform(X, y)
    411 
    412     def transform(self, X):

~\Anaconda3\lib\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
    688         if y is None:
    689             # fit method of arity 1 (unsupervised transformation)
--> 690             return self.fit(X, **fit_params).transform(X)
    691         else:
    692             # fit method of arity 2 (supervised transformation)

~\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py in fit(self, X, y)
    383         """
    384         self._validate_keywords()
--> 385         self._fit(X, handle_unknown=self.handle_unknown)
    386         self.drop_idx_ = self._compute_drop_idx()
    387         return self

~\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py in _fit(self, X, handle_unknown)
     72 
     73     def _fit(self, X, handle_unknown='error'):
---> 74         X_list, n_samples, n_features = self._check_X(X)
     75 
     76         if self.categories != 'auto':

~\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py in _check_X(self, X)
     41         if not (hasattr(X, 'iloc') and getattr(X, 'ndim', 0) == 2):
     42             # if not a dataframe, do normal check_array validation
---> 43             X_temp = check_array(X, dtype=None)
     44             if (not hasattr(X, 'dtype')
     45                     and np.issubdtype(X_temp.dtype, np.str_)):

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     71                           FutureWarning)
     72         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 73         return f(**kwargs)
     74     return inner_f
     75 

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    618             # If input is 1D raise error
    619             if array.ndim == 1:
--> 620                 raise ValueError(
    621                     "Expected 2D array, got 1D array instead:\narray={}.\n"
    622                     "Reshape your data either using array.reshape(-1, 1) if "

ValueError: Expected 2D array, got 1D array instead:
array=['France' 'Spain' 'Germany' 'Spain' 'Germany' 'France' 'Spain' 'France'
 'Germany' 'France'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Answer 1

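The error message tells you most of the fix: reshape your data. OneHotEncoder expects a 2D array of shape (n_samples, n_features), but x[:, 0] is 1D. Because x was built with .values, it is already a NumPy array, so you can call reshape on it directly (if the column were a pandas.Series, which does not support reshape, you would go through .values first):

encoded = onehotencoder.fit_transform(x[:, 0].reshape(-1, 1))

Of course, you will now have multiple columns rather than just one, so assigning the result back to x[:, 0] will not work.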
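One possible way to deal with the extra columns is to build a new array that puts the dummy columns next to the remaining features. The following is only a sketch under some assumptions: the DataFrame is rebuilt from the sample in the question, the missing Salary is filled in by hand with the mean of the four known values (standing in for the imputation step), and the name x_encoded is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample data from the question. 58750.0 is the mean of the four known
# salaries and stands in for the SimpleImputer step (an assumption made
# to keep this example short).
data = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0, 38.0, 40.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, 58750.0],
    "Purchased": ["No", "Yes", "No", "No", "Yes"],
})
x = data.iloc[:, :-1].values  # object-dtype NumPy array, shape (5, 3)

onehotencoder = OneHotEncoder(handle_unknown="ignore")

# reshape(-1, 1) turns the 1D country column into a (5, 1) 2D array.
# fit_transform returns a sparse matrix by default, so convert it to dense.
encoded = onehotencoder.fit_transform(x[:, 0].reshape(-1, 1)).toarray()

# Build a new array instead of assigning back into x[:, 0]:
# dummy columns first, then the numeric columns cast to float.
x_encoded = np.hstack([encoded, x[:, 1:].astype(float)])

print(onehotencoder.categories_[0])  # category order of the dummy columns
print(x_encoded.shape)
```

If the preprocessing grows beyond a single column, sklearn.compose.ColumnTransformer or pandas.get_dummies may be more convenient alternatives to stacking arrays by hand.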
