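I am trying to create dummy variables for a categorical variable in a pandas DataFrame. When I try to do this with scikit-learn's OneHotEncoder, I get an error.
Here is a sample of my data: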
{'Country': {0: 'France', 1: 'Spain', 2: 'Germany', 3: 'Spain', 4: 'Germany'},
'Age': {0: 44.0, 1: 27.0, 2: 30.0, 3: 38.0, 4: 40.0},
'Salary': {0: 72000.0, 1: 48000.0, 2: 54000.0, 3: 61000.0, 4: nan},
'Purchased': {0: 'No', 1: 'Yes', 2: 'No', 3: 'No', 4: 'Yes'}}
I isolate the explanatory variables as follows:
x = data.iloc[:, :-1].values
and handle the missing values:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])
I am trying to create dummy variables from the Country variable, but I keep getting an error. This is the code I am using:
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder(handle_unknown='ignore')
x[:, 0] = onehotencoder.fit_transform(x[:, 0])
Here is the traceback:
ValueError Traceback (most recent call last)
<ipython-input-64-692755a5a465> in <module>
1 onehotencoder = OneHotEncoder(handle_unknown='ignore')
----> 2 x[:, 0] = onehotencoder.fit_transform(x[:, 0])
~\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py in fit_transform(self, X, y)
408 """
409 self._validate_keywords()
--> 410 return super().fit_transform(X, y)
411
412 def transform(self, X):
~\Anaconda3\lib\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
688 if y is None:
689 # fit method of arity 1 (unsupervised transformation)
--> 690 return self.fit(X, **fit_params).transform(X)
691 else:
692 # fit method of arity 2 (supervised transformation)
~\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py in fit(self, X, y)
383 """
384 self._validate_keywords()
--> 385 self._fit(X, handle_unknown=self.handle_unknown)
386 self.drop_idx_ = self._compute_drop_idx()
387 return self
~\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py in _fit(self, X, handle_unknown)
72
73 def _fit(self, X, handle_unknown='error'):
---> 74 X_list, n_samples, n_features = self._check_X(X)
75
76 if self.categories != 'auto':
~\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py in _check_X(self, X)
41 if not (hasattr(X, 'iloc') and getattr(X, 'ndim', 0) == 2):
42 # if not a dataframe, do normal check_array validation
---> 43 X_temp = check_array(X, dtype=None)
44 if (not hasattr(X, 'dtype')
45 and np.issubdtype(X_temp.dtype, np.str_)):
~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
71 FutureWarning)
72 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 73 return f(**kwargs)
74 return inner_f
75
~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
618 # If input is 1D raise error
619 if array.ndim == 1:
--> 620 raise ValueError(
621 "Expected 2D array, got 1D array instead:\narray={}.\n"
622 "Reshape your data either using array.reshape(-1, 1) if "
ValueError: Expected 2D array, got 1D array instead:
array=['France' 'Spain' 'Germany' 'Spain' 'Germany' 'France' 'Spain' 'France'
'Germany' 'France'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Answer 0 (score: 2)
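The error message already tells you most of the fix: reshape your data. OneHotEncoder expects a 2D array of shape (n_samples, n_features), but x[:, 0] is a 1D array. Because x was created with .values, it is already a NumPy array, so you can call reshape on it directly:
encoded = onehotencoder.fit_transform(x[:, 0].reshape(-1, 1))
If you start from a pandas.Series instead, convert it to a NumPy array first (for example with .values), because a Series does not support reshape.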
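To see what the reshape changes, here is a minimal sketch using only the five sample rows from the question. The names country and country_2d are illustrative and do not appear in the original code.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Country column from the question's sample, as a 1-D object array
# (this is what x[:, 0] looks like). The variable names are illustrative.
country = np.array(['France', 'Spain', 'Germany', 'Spain', 'Germany'], dtype=object)
print(country.shape)      # (5,)

# reshape(-1, 1) produces a 2-D array with a single feature column.
country_2d = country.reshape(-1, 1)
print(country_2d.shape)   # (5, 1)

onehotencoder = OneHotEncoder(handle_unknown='ignore')
encoded = onehotencoder.fit_transform(country_2d)

# The encoder emits one output column per distinct category.
print(encoded.shape)                 # (5, 3)
print(onehotencoder.categories_[0])  # ['France' 'Germany' 'Spain']
```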
Of course, you will now have several columns instead of just one, so assigning the result back to x[:, 0] will not work.
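One way to keep the encoded columns, sketched here under the assumption that the data matches the sample in the question, is to build a new array by stacking the dummy columns next to the imputed numeric columns. The names numeric, dummies and x_encoded are illustrative. By default fit_transform returns a SciPy sparse matrix, so the sketch converts it with toarray() before stacking.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Sample data from the question.
data = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany'],
    'Age': [44.0, 27.0, 30.0, 38.0, 40.0],
    'Salary': [72000.0, 48000.0, 54000.0, 61000.0, np.nan],
    'Purchased': ['No', 'Yes', 'No', 'No', 'Yes'],
})
x = data.iloc[:, :-1].values  # object array: Country, Age, Salary

# Impute the missing salary with the column mean (same strategy as the question).
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
numeric = imputer.fit_transform(x[:, 1:3].astype(float))

# One-hot encode the country column; the default output is sparse,
# so convert it to a dense array before stacking.
onehotencoder = OneHotEncoder(handle_unknown='ignore')
dummies = onehotencoder.fit_transform(x[:, 0].reshape(-1, 1)).toarray()

# New feature matrix: three dummy columns followed by Age and Salary.
x_encoded = np.hstack([dummies, numeric])
print(x_encoded.shape)  # (5, 5)
```

If the data stays in a DataFrame, pandas.get_dummies(data['Country']) is another option that produces the same kind of dummy columns without going through NumPy arrays.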