Normalization methods suitable for clustering

Time: 2021-03-30 07:38:06

Tags: python cluster-analysis normalization data-mining

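My task is to first compute a distance matrix from the data and then use that distance matrix as the input to a clustering algorithm. I need to normalize the distance matrix to the range 0~1 before using it, but I am having trouble choosing a suitable method. As far as I know, Z-score and Min-Max are two popular normalization methods. Which one would you recommend for a clustering task?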

1 answer:

Answer 0: (score: 0)

You can certainly apply some form of feature scaling to your data.

# Normalization

from sklearn.model_selection import train_test_split

X = df
y = target

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=27)

# data normalization with sklearn
from sklearn.preprocessing import MinMaxScaler

# fit scaler on training data
norm = MinMaxScaler().fit(X_train)

# transform training data
X_train_norm = norm.transform(X_train)

# transform testing data
X_test_norm = norm.transform(X_test)

Or...

# data standardization with sklearn
from sklearn.preprocessing import StandardScaler

# copy of datasets
X_train_stand = X_train.copy()
X_test_stand = X_test.copy()

# numerical features
num_cols = ['Item_Weight','Item_Visibility','Item_MRP','Outlet_Establishment_Year']

# apply standardization on numerical features
for i in num_cols:
    
    # fit on training data column
    scale = StandardScaler().fit(X_train_stand[[i]])
    
    # transform the training data column
    X_train_stand[i] = scale.transform(X_train_stand[[i]])
    
    # transform the testing data column
    X_test_stand[i] = scale.transform(X_test_stand[[i]])

For more details, see the link below.

https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/
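Since the question is about a distance matrix rather than raw features, here is a minimal sketch of one way to connect the two steps. It is only an illustration under stated assumptions: the toy data, the Euclidean metric, and the use of SciPy's average-linkage hierarchical clustering are all my own choices, not something from the question. The idea is to scale the features first, compute the pairwise distances, and then divide the matrix by its maximum. Because distances are non-negative and the diagonal is 0, dividing by the maximum is equivalent to Min-Max scaling and lands exactly in 0~1, whereas a Z-score would produce negative values and would not meet the 0~1 requirement.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (assumption): two well-separated groups whose
# two features live on very different scales.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=[1.0, 100.0], size=(20, 2))
group_b = rng.normal(loc=[10.0, 1000.0], scale=[1.0, 100.0], size=(20, 2))
X = np.vstack([group_a, group_b])

# Scale the features first so that no single feature dominates the distances.
X_scaled = StandardScaler().fit_transform(X)

# Full symmetric distance matrix (Euclidean is an assumption here).
D = squareform(pdist(X_scaled, metric="euclidean"))

# Min-Max normalization of the distance matrix: the minimum is the
# zero diagonal, so dividing by the maximum maps the values into [0, 1].
D_norm = D / D.max()

# Cluster from the normalized distances. linkage() expects the condensed
# (upper-triangle) form, so convert the square matrix back.
Z = linkage(squareform(D_norm, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")

print(D_norm.min(), D_norm.max())
print(labels)
```

Note that dividing every distance by the same constant does not change their ordering, so for this kind of hierarchical clustering the grouping is the same with or without the final normalization. The feature scaling done before computing the distances is the step that actually affects the result.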