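Conditional math operations between rows

Time: 2018-06-15 21:38:54

Tags: python python-2.7

Reposting after being voted down. I did go back and try a few things, but I guess it is still not quite there.

I have a file with data like this: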

name    count   count1  count3  add1    add2
jack    70  55  31  100174766   100170715
jack    45  656 48  100174766   100174052
john    41  22  89  102268764   102267805
john    47  31  63  102268764   102267908
david   10  56  78  103361093   103368592

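There are two conditions I need to check, followed by a math operation:
A) Which rows have a duplicated value in add1 (a duplicated value always occurs exactly 2 times)?
B) If it occurs 2 times, which of the two rows has the larger value in add2?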

Taking jack as an example:

jack    70  55  31  100174766   100170715
jack    45  656 48  100174766   100174052

jack's add1 value occurs twice, and 100174052 is the larger add2, so:

row1 = jack 45  656 48  100174766   100174052
row2 = jack 70  55  31  100174766   100170715
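
Conditions A and B can be sketched in plain Python as follows. This is only an illustration under assumptions: the rows are hard-coded from the sample data, and `ROWS` and `ordered_pairs` are hypothetical names that do not appear in the original post.

```python
from collections import defaultdict

# Rows copied from the sample data in the question:
# (name, count, count1, count3, add1, add2)
ROWS = [
    ("jack", 70, 55, 31, 100174766, 100170715),
    ("jack", 45, 656, 48, 100174766, 100174052),
    ("john", 41, 22, 89, 102268764, 102267805),
    ("john", 47, 31, 63, 102268764, 102267908),
    ("david", 10, 56, 78, 103361093, 103368592),
]


def ordered_pairs(rows):
    """Return {add1: (row1, row2)} for add1 values seen exactly twice.

    row1 is the row with the larger add2, row2 is the other one.
    Hypothetical helper, not part of the original post.
    """
    by_add1 = defaultdict(list)
    for row in rows:
        by_add1[row[4]].append(row)  # condition A: group rows on add1

    pairs = {}
    for add1, group in by_add1.items():
        if len(group) == 2:  # add1 occurs exactly twice
            # condition B: the larger add2 comes first
            row1, row2 = sorted(group, key=lambda r: r[5], reverse=True)
            pairs[add1] = (row1, row2)
    return pairs


pairs = ordered_pairs(ROWS)
```

Rows whose add1 occurs only once (david in the sample) are simply left out of `pairs`.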

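Math:

For each count cell (count, count1, count3) of the two rows, compute row1 / (row1 + row2); add1 and add2 are carried over from row1.

Output for jack: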

jack    0.391304348 0.922644163 0.607594937 100174766   100174052
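
The arithmetic for jack can be checked with a short sketch. `ratio_row` is a hypothetical helper, and it assumes that only the three count columns are divided while name, add1 and add2 are copied from row1, as the desired output suggests.

```python
def ratio_row(row1, row2):
    """Apply row1 / (row1 + row2) to the three count columns only.

    name, add1 and add2 are copied from row1 unchanged.
    Hypothetical helper, not part of the original post.
    """
    name, add1, add2 = row1[0], row1[4], row1[5]
    # float() keeps the division exact under Python 2.7 as well
    ratios = [float(a) / (a + b) for a, b in zip(row1[1:4], row2[1:4])]
    return [name] + ratios + [add1, add2]


row1 = ("jack", 45, 656, 48, 100174766, 100174052)  # larger add2
row2 = ("jack", 70, 55, 31, 100174766, 100170715)
out = ratio_row(row1, row2)
# out[1] is 45 / (45 + 70), out[2] is 656 / (656 + 55), out[3] is 48 / (48 + 31)
```

Note that a column in which both rows are 0 would raise ZeroDivisionError; the sample data has no such case.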

Final desired output:

name    count   count1  count3  add1    add2
jack    0.391304348 0.922644163 0.607594937 100174766   100174052
john    0.534090909 0.58490566  0.414473684 102268764   102267908

I know this does not yet take into account which add2 is larger; I am not sure where or how to do that.

My code so far:

info = []
with open('file.tsv', 'r') as j:
    for i,line in enumerate(j):
        lines = line.strip().split('\t')
        info.append(lines)

uniq = {}
for index,row in enumerate(info, start =1):
    if row.count(row[4]) == 2:
       key = row[4] + ':' + row[5]
    if key not in uniq:
        uniq[key] = row[1:3]

for k, v in sorted(uniq.iteritems()):
    row1 = k,v
    row2 = k,v
    print 'row1: ', row1[0], '\n', 'row2: ',row2[0]

All I see is:

row1:  100174766:100170715 
row2:  100174766:100170715
row1:  100174766:100174052 
row2:  100174766:100174052

instead of

row1:  100174766:100170715
row2:  100174766:100174052

2 answers:

Answer 0: (score: 1)

# assumes pandas is imported and dat is a DataFrame holding the table from the question
(dat.sort_values('add2',ascending=[False]).groupby(['name','add1']).aggregate(lambda x: (x.iloc[0]/sum(x))))

                    count    count1    count3      add2
name  add1                                             
david 103361093  1.000000  1.000000  1.000000  1.000000
jack  100174766  0.391304  0.922644  0.607595  0.500008
john  102268764  0.534091  0.584906  0.414474  0.500000
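
The one-liner above also divides add2 (hence 0.500008 for jack) and keeps david, whose add1 occurs only once. A possible variation that stays closer to the desired output in the question is sketched below. It assumes pandas is installed and reads the sample data from an in-memory string; `RAW`, `count_cols`, `pairs` and `result` are hypothetical names.

```python
import io

import pandas as pd

# Sample data from the question, read from a string instead of a file.
RAW = """name count count1 count3 add1 add2
jack 70 55 31 100174766 100170715
jack 45 656 48 100174766 100174052
john 41 22 89 102268764 102267805
john 47 31 63 102268764 102267908
david 10 56 78 103361093 103368592
"""

dat = pd.read_csv(io.StringIO(RAW), sep=r"\s+")
count_cols = ["count", "count1", "count3"]

# Condition A: keep only rows whose (name, add1) group has exactly two rows.
group_size = dat.groupby(["name", "add1"])["add2"].transform("count")
pairs = dat[group_size == 2]

# Condition B: put the row with the larger add2 first inside each group.
pairs = pairs.sort_values("add2", ascending=False)
grouped = pairs.groupby(["name", "add1"])

# row1 / (row1 + row2) on the count columns only.
ratios = grouped[count_cols].agg(lambda s: s.iloc[0] / s.sum())
# add2 is not divided; the larger value is kept.
ratios["add2"] = grouped["add2"].max()

result = ratios.reset_index()[["name"] + count_cols + ["add1", "add2"]]
print(result)
```

Filtering before grouping is a design choice here: it drops single-row groups instead of letting them evaluate to 1.0 as in the output above.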

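Answer 1: (score: 1)

Anything pandas can do can also be done in pure Python; it just takes more code.

To make this a minimal, complete, verifiable example that runs, for example, inside https://pyfiddle.io, you first need to create the file: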

# create file
with open("d.txt","w") as f:
    f.write("""name    count   count1  count3  add1    add2
jack    70  55  31  100174766   100170715
jack    45  656 48  100174766   100174052
john    41  22  89  102268764   102267805
john    47  31  63  102268764   102267908
david   10  56  78  103361093   103368592""")

Apart from that, I define some helpers:

def printMe(gg):
    """Pretty prints a dictionary"""
    print ""
    for k in gg:
        print k, "\t:  ", gg[k]

def spaceEm(s):
    """Returns a string of input s with 2 spaces prepended"""
    return "  {}".format(s)

Then start reading the file and computing your values:

data = {}
with open("d.txt","r") as f:
    headers = f.readline().split() # store header line for later
    for line in f:
        if line.strip(): # just a guard against empty lines
            # name, *splitted = line.split() # python 3.x, you specced 2.7
            tmp = line.split()
            name = tmp[0]
            splitted = tmp[1:]
            nums = list(map(float,splitted))
            data.setdefault((name,nums[3]),[]).append(nums)
printMe(data)

# sort data
for nameAdd1 in data:
    # name     :  count   count1  count3  add1    add2 
    data[nameAdd1].sort(key = lambda x: -x[4]) # - "trick" to sort descending, you 
                                               # could use reverse=True instead 
printMe(data)


# calculate stuff and store in result
result = {}
for nameAdd1 in data:
    try:
        values = zip(*data[nameAdd1])

        # this results in value error if you can not decompose in r1,r2
        result[nameAdd1] = [r1 / (r1+r2) for r1,r2 in values]

    except ValueError:
        # this catches the case of only 1 value for a person 
        result[nameAdd1] = data[nameAdd1][0]
printMe(result)


# store as resultfile (will be overwritten each time)
with open("d2.txt","w") as f:
    # header
    f.write(headers[0])
    for h in headers[1:]:
        f.write(spaceEm(h))
    f.write("\n")

    # data
    for key in result:
        f.write(key[0]) # name
        for t in map(spaceEm,result[key]):
            f.write(t) # numbers
        f.write("\n")

Output:

# read from file
('jack', 100174766.0)   :   [[70.0, 55.0, 31.0, 100174766.0, 100170715.0], [45.0, 656.0, 48.0, 100174766.0, 100174052.0]]
('david', 103361093.0)  :   [[10.0, 56.0, 78.0, 103361093.0, 103368592.0]]
('john', 102268764.0)   :   [[41.0, 22.0, 89.0, 102268764.0, 102267805.0], [47.0, 31.0, 63.0, 102268764.0, 102267908.0]]

# sorted by add2 (descending)
('jack', 100174766.0)   :   [[45.0, 656.0, 48.0, 100174766.0, 100174052.0], [70.0, 55.0, 31.0, 100174766.0, 100170715.0]]
('david', 103361093.0)  :   [[10.0, 56.0, 78.0, 103361093.0, 103368592.0]]
('john', 102268764.0)   :   [[47.0, 31.0, 63.0, 102268764.0, 102267908.0], [41.0, 22.0, 89.0, 102268764.0, 102267805.0]]

# result of calculations
('jack', 100174766.0)   :   [0.391304347826087, 0.9226441631504922, 0.6075949367088608, 0.5, 0.5000083281436545]
('david', 103361093.0)  :   [10.0, 56.0, 78.0, 103361093.0, 103368592.0]
('john', 102268764.0)   :   [0.5340909090909091, 0.5849056603773585, 0.4144736842105263, 0.5, 0.5000002517897694]

Input file:

name    count   count1  count3  add1    add2
jack    70  55  31  100174766   100170715
jack    45  656 48  100174766   100174052
john    41  22  89  102268764   102267805
john    47  31  63  102268764   102267908
david   10  56  78  103361093   103368592

Output file:

name  count  count1  count3  add1  add2
jack  0.391304347826087  0.9226441631504922  0.6075949367088608  0.5  0.5000083281436545
john  0.5340909090909091  0.5849056603773585  0.4144736842105263  0.5  0.5000002517897694
david  10.0  56.0  78.0  103361093.0  103368592.0
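
The calculation step in the code above relies on `zip(*rows)` to turn two rows into per-column pairs, and on a ValueError to detect groups that do not have exactly two rows. The sketch below illustrates the transposition and shows an explicit length check as one possible alternative; `ratios` is a hypothetical helper, and only the three count columns are used to keep it short.

```python
pair = [
    [45.0, 656.0, 48.0],  # row with the larger add2 (first after sorting)
    [70.0, 55.0, 31.0],
]
single = [[10.0, 56.0, 78.0]]  # a group with only one row, like david

# zip(*rows) transposes rows into columns:
# [(45.0, 70.0), (656.0, 55.0), (48.0, 31.0)]
columns = list(zip(*pair))


def ratios(rows):
    """Return r1 / (r1 + r2) per column, or None unless there are exactly two rows.

    Hypothetical helper, not part of the original answer.
    """
    if len(rows) != 2:
        return None  # explicit check instead of catching ValueError
    return [r1 / (r1 + r2) for r1, r2 in zip(*rows)]


pair_result = ratios(pair)
single_result = ratios(single)
```

With the explicit check, the caller decides what to do with single-row groups (skip them, or copy them through unchanged as the answer does) rather than having that decided by an exception handler.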

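Disclaimer: I coded this in 3.x and adapted it to 2.7 inside http://pyfiddle.io afterwards, so there may be some "unneeded" intermediate variables left over from making it work...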
