Question

I'm currently working on a scientific project where I need to process an initial dataset: filtering, merging, and calculating things. The pipeline consists of a series of steps (~10), each of which runs in a different Python module. The modules often create intermediate files and eventually run external bash commands that call external programs. Ultimately, my question is how to handle the fairly large (and still growing) number of variables (paths to files generated in the pipeline) that I need to keep track of from step to step. This is an oversimplified summary of what I really have:

main.py

import argparse
from os.path import join

import ld
import kinship
import PCA

# file_exists and pretty_print are helpers from the real project (not shown here)

def main(args):
    # LD pruning & build new plink file
    args.ld_path = join(args.oPath, 'ld/')
    ld.pruning(args)
    args.plink_path = join(args.oPath, 'plink_files/')
    ld.build_plink_file(args)

    # build new plink file and calculate kinship
    pretty_print('KINSHIP')
    args.kinPath = join(args.oPath, 'kinship/')
    kinship.download_king()
    kinship.kinship(args)

    # RUN PCA
    args.pca_path = join(args.oPath, 'pca/')
    PCA.build_inliers(args)
    PCA.fast_pca_inliers(args)
    PCA.project_outliers(args)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Returning final list of variants after info_score filter and ld pruning")
    parser.add_argument('-b', "--bed", type=file_exists, help="Folder in which the merged plink file is stored", required=True)
    parser.add_argument('-o', "--oPath", type=str, help="folder in which to save the results", required=True)
    # LD PRUNING
    parser.add_argument('--ld', nargs=3, metavar=('SIZE', 'STEP', 'THRESHOLD'), help='size,step,threshold', required=True)
    # KINSHIP
    parser.add_argument('--degree', type=float, help='Degree for Kinship', default=2)
    # PCA
    parser.add_argument('--pca-components', type=int, help='Components needed for pca', default=20)

    args = parser.parse_args()
    main(args)

As you can see, in my pipeline I ended up using the argparse class, "extending" the parser by creating new variables on it so that I can pass them from one module to the next. I also thought about using argparse for these, but on top of some "official" outputs there are a bunch of intermediate files that are not passed directly to the next step of the pipeline but might be needed a few steps down the road, so I'd rather not have to define them all at once.

Is there a better/cleaner solution?

Thanks

Answer 1

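I would probably have a Config class in the first module and then pass the instance along by pickling it. Any value can simply be stored in the class dictionary. It's a bit hacky, but for a simple approach it might do the trick...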

import pickle

class Configuration:
    def __init__(self, file_path):
        self.path = file_path

    def dump(self):
        # pickle writes bytes, so the file must be opened in binary mode
        with open(self.path, 'wb') as config_file:
            pickle.dump(self, config_file)

config = Configuration(r"C:\temp\config.file")
# any value can be attached as a plain attribute
config.kinship_degree = 3

config.dump()

# a later module can restore the instance from the same file
with open(r"C:\temp\config.file", 'rb') as cf:
    restored_config = pickle.load(cf)

print(restored_config.kinship_degree)

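Otherwise, I would probably wrap the different modules in a pipeline module that instantiates the module classes with the required arguments and runs them.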
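A rough sketch of what that wrapper could look like, in case it helps. All names here (PipelineContext, LDStep, KinshipStep, run_pipeline) are made up for illustration and are not part of your code, and the step bodies only register paths instead of calling plink or KING. The point is that each step gets only the arguments it needs, while paths are registered in one shared object as they come up, so nothing has to be declared up front.

```python
import os
from dataclasses import dataclass, field


@dataclass
class PipelineContext:
    """Shared state: the output root plus every path registered so far."""
    out_root: str
    paths: dict = field(default_factory=dict)

    def register(self, name, *parts):
        # Build a path under the output root and remember it by name.
        path = os.path.join(self.out_root, *parts)
        self.paths[name] = path
        return path


class LDStep:
    """Hypothetical stand-in for the ld module."""

    def __init__(self, size, step, threshold):
        self.params = (size, step, threshold)

    def run(self, ctx):
        ctx.register("ld_path", "ld")
        ctx.register("plink_path", "plink_files")


class KinshipStep:
    """Hypothetical stand-in for the kinship module."""

    def __init__(self, degree):
        self.degree = degree
        self.plink_input = None

    def run(self, ctx):
        # A later step can look up any path an earlier step registered.
        self.plink_input = ctx.paths["plink_path"]
        ctx.register("kin_path", "kinship")


def run_pipeline(out_root, steps):
    """Run the steps in order, threading one context through all of them."""
    ctx = PipelineContext(out_root)
    for step in steps:
        step.run(ctx)
    return ctx


# Example values only; the real ones would come from the argparse arguments.
kinship_step = KinshipStep(degree=2)
ctx = run_pipeline("results", [LDStep(50, 5, 0.2), kinship_step])
print(sorted(ctx.paths))  # ['kin_path', 'ld_path', 'plink_path']
```

If you also want to resume a half-finished run, the context object could be pickled between steps in the same way as the Configuration class above.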

Share variables across modules in a pipeline
