Divide a column in a Pandas DataFrame by the column's sum

Asked: 2016-12-02 20:35:23

Tags: python pandas dataframe sum

I have a DataFrame, and I would like to divide each row in Col A by the sum of Col A and store the result in a new column of the DataFrame.

Example:

        Col A   New Col
        2       .22
        3       .33
        4       .44
Total = 9       1.00

I tried summing Col A and then dividing by 'Total', but because Total is a row rather than a column, it does not work. I just get NaN for every row in the new column.

df['New Col']= (df['ColA']/df.loc['Total']) 

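I know you can also fold the sum calculation into a single line of code instead of creating a total row, but I am not sure how to do that and could not find anything online.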

df['New Col']= (df['ColA']/df.sum()) 

Any ideas?

3 Answers:

Answer 0 (score: 4)

df['new'] = df['ColA'] /  df['ColA'].sum()

should work.

Answer 1 (score: 2)

Another approach is to use transform:
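The code originally attached to this answer did not survive, so the following is only a sketch of what a transform-based version might look like, not the answerer's exact code. It assumes a small sample frame with a column named "Col A" and applies DataFrame.transform column by column; the names `df` and `shares` are illustrative.

```python
import pandas as pd

# Sample data matching the question's example (assumed, not from the answer).
df = pd.DataFrame({"Col A": [2, 3, 4]})

# DataFrame.transform passes each selected column to the function as a Series
# and expects a result of the same length, so each value is divided by the
# total of its own column.
shares = df[["Col A"]].transform(lambda col: col / col.sum())

df["New Col"] = shares["Col A"]
print(df)
```

For a single column this is equivalent to dividing by `df["Col A"].sum()` directly; transform mainly pays off when several columns should be normalized in one call.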



Answer 2 (score: 1)

You were very close. You want to call sum() on the Col A Series:

df['New Col'] = df['Col A']/df['Col A'].sum()

The result is a DataFrame like the one shown below:

>>> df
   Col A   New Col
0      2  0.222222
1      3  0.333333
2      4  0.444444

Now, if you call df.sum(), you get a Series containing the total of each column:

>>> df.sum()
Col A      9.0
New Col    1.0
dtype: float64
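
Because df.sum() returns a Series labelled by column name, dividing a column by it aligns on labels rather than broadcasting, which is most likely why the attempt in the question produced only NaN. A minimal sketch of the difference, assuming the same three-row sample data (the names `misaligned` and `shares` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Col A": [2, 3, 4]})

# df.sum() is a Series indexed by column name ("Col A"), while df["Col A"]
# is indexed by row label (0, 1, 2). pandas aligns the two on their labels,
# finds no matches, and fills every position with NaN.
misaligned = df["Col A"] / df.sum()

# df["Col A"].sum() is a plain scalar, so it is applied to every row.
shares = df["Col A"] / df["Col A"].sum()

print(misaligned)
print(shares)
```

The same reasoning applies to dividing by a 'Total' row selected with df.loc['Total']: that selection is a Series too, so a single scalar value is needed instead.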