Explaining what a piece of R code does

Time: 2014-11-24 13:39:48

Tags: r

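I want to run a pathway enrichment analysis. I have a list of 21 significant genes and several kinds of gene sets I want to test for enrichment (KEGG pathways, GO terms, complexes, and so on).

I found this code example in an old BioC post, but I have not been able to adapt it to my own data.

First, 1 - what does this mean? I am not familiar with this multi-colon syntax.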

hyperg <- Category:::.doHyperGInternal

2 - I don't understand how this line works. hyperg.test is a function that needs three variables passed to it, right? Does this line somehow pass genes.by.pathway, significant.genes and all.geneIDs to hyperg.test?

pVals.by.pathway<-t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs))

The code I want to adapt:

library(KEGGREST)
library(org.Hs.eg.db)

     # created named list, length 449, eg:
     # path:hsa00010: "Glycolysis / Gluconeogenesis"

pathways <- keggList("pathway", "hsa")

     # make them into KEGG-style human pathway identifiers
human.pathways <- sub("path:", "", names(pathways))

   # for demonstration, just use the first ten pathways

demo.pathway.ids <- head(human.pathways, 10)
demo.pathways <- setNames(keggGet(demo.pathway.ids), demo.pathway.ids)

genes.by.pathway <- lapply(demo.pathways, function(demo.pathway) {
     demo.pathway$GENE[c(TRUE, FALSE)]
      })

all.geneIDs <- keys(org.Hs.eg.db)

   # choose one of these for demonstration.  the first (a whole genome random
   # set of 100 genes)  has very little enrichment, the second, a random set
   # from the pathways themselves,  has very good enrichment in some pathways

set.seed(123)
significant.genes <- sample(all.geneIDs, size=100)
#significant.genes <- sample(unique(unlist(genes.by.pathway)), size=10)

   # the hypergeometric distribution is traditionally explained in terms of
   # drawing a sample of balls from an urn containing black and white balls.
   # to keep the arguments straight (in my mind at least), I use these terms
   # here also

hyperg <- Category:::.doHyperGInternal
hyperg.test <-
    function(pathway.genes, significant.genes, all.genes, over=TRUE)
{
    white.balls.drawn <- length(intersect(significant.genes, pathway.genes))
    white.balls.in.urn <- length(pathway.genes)
    total.balls.in.urn <- length(all.genes)
    black.balls.in.urn <- total.balls.in.urn - white.balls.in.urn
    balls.pulled.from.urn <- length(significant.genes)
    hyperg(white.balls.in.urn, black.balls.in.urn,
           balls.pulled.from.urn, white.balls.drawn, over)
}

pVals.by.pathway <-
    t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs))

print(pVals.by.pathway)

1 answer:

Answer 0 (score: 0)

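The reason you are getting an error is most likely that you do not have the Category package from Bioconductor installed. I suspect this because of the triple-colon operator :::. This operator is very similar to the double-colon operator ::. With :: you can access an exported object from a package without loading the package, while ::: gives access to objects that are not exported (in this case, the .doHyperGInternal function from Category). If you install the Category package, the code runs fine.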
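To make the difference between the two operators concrete, here is a minimal sketch that uses only the stats package shipped with R, so it runs without Category. The choice of t.test.default as the non-exported object is just an illustration; the variable names are made up for this example.

```r
# `::` reaches an exported object without attaching the package.
exported.value <- stats::median(c(1, 5, 9))

# `:::` also reaches objects that the package keeps internal.
# t.test.default is an S3 method that stats does not export.
internal.fn <- stats:::t.test.default

# The export list of the namespace confirms which is which.
is.exported <- c(
    median         = "median" %in% getNamespaceExports("stats"),
    t.test.default = "t.test.default" %in% getNamespaceExports("stats")
)
print(is.exported)

# Asking for the internal object with `::` fails, so trap the error.
double.colon.result <- tryCatch(
    stats::t.test.default,
    error = function(e) "not exported"
)
print(double.colon.result)
```

Category:::.doHyperGInternal works the same way, but only once Category is installed. With a current Bioconductor setup that is typically done with BiocManager::install("Category").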

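Regarding the sapply statement:

pVals.by.pathway<-t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs))

You can break it into separate pieces to understand it. First, sapply iterates over the elements of genes.by.pathway and passes each one as the first argument of hyperg.test. The arguments that follow (significant.genes and all.geneIDs) are two additional arguments, handed on to hyperg.test by position on every call. That is a little opaque, and I personally recommend naming the arguments explicitly: it avoids surprises and removes the need to keep them in exactly the right order. In this case it is a bit repetitive, but it is a good way to avoid silly mistakes (such as putting all.geneIDs before significant.genes).

Rewritten as:

pVals.by.pathway <- t(sapply(genes.by.pathway, hyperg.test, significant.genes=significant.genes, all.genes=all.geneIDs))

Once this loop has finished, sapply simplifies the output to a matrix. Transposing it with t makes the output more user-friendly: one row per pathway.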
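The way sapply forwards its extra arguments can be seen on a toy example. This is only a sketch: count.overlap is a hypothetical stand-in for hyperg.test, and the gene names are invented.

```r
# Hypothetical stand-in for hyperg.test: same three leading arguments,
# returns a named list just as the real function returns a list.
count.overlap <- function(pathway.genes, significant.genes, all.genes) {
    list(overlap       = length(intersect(significant.genes, pathway.genes)),
         pathway.size  = length(pathway.genes),
         universe.size = length(all.genes))
}

toy.pathways    <- list(pathA = c("g1", "g2", "g3"), pathB = c("g3", "g4"))
toy.significant <- c("g2", "g3")
toy.universe    <- paste0("g", 1:10)

# Each element of toy.pathways becomes the first argument of count.overlap;
# the named arguments are passed along unchanged on every call.
res <- sapply(toy.pathways, count.overlap,
              significant.genes = toy.significant,
              all.genes = toy.universe)
print(dim(res))        # one column per pathway

# Transposing gives one row per pathway.
per.pathway <- t(res)
print(per.pathway)

# The same thing written as explicit calls, one per pathway.
by.hand <- list(
    pathA = count.overlap(toy.pathways$pathA, toy.significant, toy.universe),
    pathB = count.overlap(toy.pathways$pathB, toy.significant, toy.universe)
)
```

Because each call returns a list of the same length, sapply builds a matrix whose cells are list elements, so single values are best extracted with double brackets, for example per.pathway[["pathA", "overlap"]].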

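In general, when trying to understand complex apply-style statements, I find it best to split them into smaller pieces and look at what the objects themselves look like.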
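One way to look at a single piece is to run the urn calculation for one pathway by hand. The sketch below swaps in base R's phyper for Category's internal function, so it returns only an over-representation p-value and makes no claim to reproduce everything .doHyperGInternal returns. urn.test and the toy gene names are hypothetical.

```r
# Over-representation p-value for one pathway, in the question's urn terms.
# Assumption: phyper from base R replaces Category:::.doHyperGInternal here.
urn.test <- function(pathway.genes, significant.genes, all.genes) {
    white.balls.drawn     <- length(intersect(significant.genes, pathway.genes))
    white.balls.in.urn    <- length(pathway.genes)
    black.balls.in.urn    <- length(all.genes) - white.balls.in.urn
    balls.pulled.from.urn <- length(significant.genes)
    # P(X >= white.balls.drawn): upper tail starting at the observed count,
    # hence the "- 1" together with lower.tail = FALSE.
    phyper(white.balls.drawn - 1, white.balls.in.urn, black.balls.in.urn,
           balls.pulled.from.urn, lower.tail = FALSE)
}

toy.universe    <- paste0("g", 1:20)           # 20 genes in total
toy.pathway     <- paste0("g", 1:5)            # 5 of them in the pathway
toy.significant <- c("g1", "g2", "g3", "g20")  # 4 drawn, 3 in the pathway

p.single <- urn.test(toy.pathway, toy.significant, toy.universe)
print(p.single)
```

Once a single call behaves as expected, wrapping it in sapply over the full list of pathways is a small step.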