Question

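I am exploring the 'mle' option for choosing the number of PCs in the sklearn.decomposition.PCA implementation. To that end, I want to compute the average log density of the calibration dataset as a function of the number of retained principal components. "Method 1" in the code below does exactly that. However, that implementation retrains the PCA (SVD) for every candidate number of components. As I understand it, this retraining should not be necessary. I tried to avoid it with "Method 2", but that failed. Any suggestions on how to compute the log density for each number of PCs without repeated recalibration would be helpful. I have considered implementing my own version of Probabilistic PCA, but before doing so I want to make sure I am not missing an existing routine that does this.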

%matplotlib inline
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

#============================================================
# DATA SIMULATION
#============================================================
m =1000 # number of samples
n = 11 # number of variables/features
k=3 # number of simulated components

# create random vector spanning the k-dimensional subspace
U = np.random.rand(n,k)

# simulate latent variables
T = np.random.randn(m,k)

# simulate variation in features due to latent source of variation
X= T.dot(U.T)

# add random measurement error with spherical Gaussian distribution
E = 0.1*np.random.randn(m,n)
Y = X+E

#============================================================
# PCA DECOMPOSITION
#============================================================
maxpc=min(m,n)
pca0 = PCA(n_components=maxpc)
pca0.fit(Y)

cumsum = np.cumsum(pca0.explained_variance_ratio_)

plt.plot(cumsum)
print("log predictive score full model (maximum #PCs):")
print(pca0.score(X))


#============================================================
# COMPUTE LOG DENSITY SCORE AS A FUNCTION OF k
#============================================================

kk = np.arange(1, maxpc+1, 1)

# METHOD 1: retrain PCA for every candidate number of components (works as intended but expensive)
print("log predictive score reduced models - method 1 (works as intended but expensive):")
for k in kk:
    pca_local1 = PCA(n_components=k)
    pca_local1.fit(Y)
    print('#PCs: ',k,' score: ', pca_local1.score(X))

# METHOD 2: try to simply take the existing full model and adjust it (doesn't work)
print("log predictive score reduced models - method 2 (does not work as intended):")
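for k in kk:
    pca_local2 = pca0  # note: this is the same object as pca0, not a copy
    # note: 'n_component' is not an attribute that PCA reads, so score() still uses the full fitted model
    pca_local2.n_component = k
    print('#PCs: ',k,' score: ', pca_local2.score(X))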
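
One possible way to avoid the refits, sketched here under stated assumptions rather than as an existing scikit-learn routine: as far as I can tell, PCA.score evaluates a Gaussian whose covariance has the first k eigenvalues along the retained axes and a single noise variance (the mean of the discarded eigenvalues) along all other directions. If that holds, one full fit provides everything needed for every k. The helper name ppca_scores_all_k and the demo variables are hypothetical, and the sketch assumes n_samples > n_features, whiten=False, and a PCA fitted with all features as components.

```python
import numpy as np
from sklearn.decomposition import PCA


def ppca_scores_all_k(pca_full, data):
    """Average log-likelihood of `data` for k = 1..n retained components.

    Hypothetical helper: assumes `pca_full` was fitted with all n features
    as components (n_samples > n_features) and whiten=False.
    """
    eigvals = pca_full.explained_variance_  # sorted descending, shape (n,)
    n = eigvals.shape[0]
    # Mean squared coordinate of the centred data along each principal axis.
    z = (data - pca_full.mean_) @ pca_full.components_.T
    mean_z2 = (z ** 2).mean(axis=0)

    scores = np.empty(n)
    for k in range(1, n + 1):
        # Retained directions keep their own eigenvalue as model variance.
        log_det = np.log(eigvals[:k]).sum()
        quad = (mean_z2[:k] / eigvals[:k]).sum()
        if k < n:
            # Discarded directions share one isotropic noise variance.
            sigma2 = eigvals[k:].mean()
            log_det += (n - k) * np.log(sigma2)
            quad += mean_z2[k:].sum() / sigma2
        scores[k - 1] = -0.5 * (n * np.log(2.0 * np.pi) + log_det + quad)
    return scores


# Demo on simulated data shaped like the example in the question.
rng = np.random.default_rng(0)
m, n, k_true = 1000, 11, 3
signal = rng.standard_normal((m, k_true)) @ rng.random((n, k_true)).T
Y_demo = signal + 0.1 * rng.standard_normal((m, n))

pca_full = PCA(n_components=n, svd_solver="full").fit(Y_demo)
fast_scores = ppca_scores_all_k(pca_full, Y_demo)

# Spot check against refitting ("method 1") for a few values of k.
for k in (1, 3, n):
    refit = PCA(n_components=k, svd_solver="full").fit(Y_demo)
    print(k, fast_scores[k - 1], refit.score(Y_demo))
```

The loop touches only the eigenvalues and the per-axis mean squared projections, so the SVD is computed once. The data passed to the helper does not have to be the data the model was fitted on, which mirrors how score is called in the question.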

sklearn.decomposition.PCA: is there an efficient way to compute the log density score as a function of the number of principal components?

0 answers: