Custom format ID mapping

Time: 2016-07-25 14:40:58

Tags: python r bash

I have two databases (txt files). One has two tab-separated columns containing names and IDs.

name1 \t ID1
name1 \t ID2
name2 \t ID9
name2 \t ID40
name3 \t ID3

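The other database has, in its first column, the same IDs as the first database, while its second column lists comma-separated IDs of the same type (these are children of the ID in the first column, since the second database is hierarchical).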

ID1 \t ID1,ID2,ID3
ID2 \t ID2, ID9

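What I want is a third database in the same format as the second, but with the child IDs in the second column replaced by the corresponding names from the first database. For example: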

ID1 \t name1,name2,name3
ID2 \t name1,name2

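Is there a way to do this? I'm a beginner and have had to map IDs before using web services, but this is a custom format needed for further analysis, and I don't know where to start.

Thanks in advance!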

3 answers:

Answer 0: (score: 0)

import csv

# Reading the first db is simple since there's only a fixed delimiter
# Use csv module to split the lines and create a dictionary that maps id to name

id_dictionary = {}
with open('db_1.txt', 'r') as infile:
    reader = csv.reader(infile, delimiter='\t')
    for line in reader:
        id_dictionary[line[1]] = line[0]

# We can again split on tab but that will return 'ID1,ID2,ID3' etc as a single
# string that we call split() on later.

row_data = []
with open('db_2.txt', 'r') as infile:
    reader = csv.reader(infile, delimiter='\t')
    for line in reader:
        # ID remains unchanged, so keep the first value
        row = [line[0]]

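        # Split the string into individual IDs, dropping stray spaces
        # (the sample data contains 'ID2, ID9')
        id_codes = [item.strip() for item in line[1].split(',')]

        # List comprehension to look for ID in the dictionary and return the
        # name stored against it; an ID with no name is kept as-is (assumption)
        translated = [id_dictionary.get(item, item) for item in id_codes]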

        # Add translated to the list that we are using to represent a row
        row.extend(translated)

        # Append the row to our collection of rows
        row_data.append(row)

with open('db_3.txt', 'w') as outfile:
    for row in row_data:
        outfile.write(row[0])
        outfile.write('\t')
        outfile.write(','.join(map(str,row[1:]))) # Join values by a comma
        outfile.write('\n')

Answer 1: (score: 0)

You can try this one-line awk script:

awk -v FS="\t|," -v OFS="," 'FILENAME=="file_name.txt" {str[$2]=$1;next;} {for(i=2;i<=NF;i++) {sub($i,str[$i],$i)};a=$1;$1="";print a"\t"$0}' file_name.txt fileID.txt|sed -e 's/,//' -e 's/,$//'

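For awk, "file_name.txt" is the txt file whose first column holds "name1, name2, ...", while "fileID.txt" is the one whose first column holds "ID1, ID2, ...".

sed is used to trim the unneeded commas at the beginning and end of the list.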
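For readers less comfortable with awk, the same two-pass idea (first build an ID-to-name table, then rewrite the second column) can be sketched in Python. This is only an illustration under a few assumptions: the helper names `build_id_to_name` and `translate_children` are made up for this example, stray spaces around IDs are stripped, and an ID with no known name is left unchanged.

```python
import io


def build_id_to_name(names_file):
    """First pass: map each ID (column 2) to its name (column 1)."""
    id_to_name = {}
    for line in names_file:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name, id_ = line.split("\t", 1)
        id_to_name[id_.strip()] = name.strip()
    return id_to_name


def translate_children(hierarchy_file, id_to_name):
    """Second pass: replace each child ID with its name, keep the parent ID."""
    rows = []
    for line in hierarchy_file:
        line = line.rstrip("\n")
        if not line:
            continue
        parent, children = line.split("\t", 1)
        # Assumption: an ID missing from the table is written back unchanged.
        names = [id_to_name.get(c.strip(), c.strip()) for c in children.split(",")]
        rows.append(parent.strip() + "\t" + ",".join(names))
    return rows


# In-memory stand-ins for the two txt files from the question.
names = io.StringIO("name1\tID1\nname1\tID2\nname2\tID9\nname2\tID40\nname3\tID3\n")
hierarchy = io.StringIO("ID1\tID1,ID2,ID3\nID2\tID2, ID9\n")

for row in translate_children(hierarchy, build_id_to_name(names)):
    print(row)
```

Like the awk script, this sketch does not remove duplicate names, so a row whose children share a name will list that name more than once.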

Answer 2: (score: 0)

# Suppose the database files are f1.txt and f2.txt; the result goes to f3.txt
# Use a dict to hold the ID -> name pairs
def getArr(f):
    # Read a tab-separated file into a list of [col1, col2] lists
    arr = []
    for line in f:
        line = line.rstrip('\n')
        if line:
            arr.append(line.split('\t'))
    return arr

if __name__ == "__main__":
    with open("f1.txt") as f1, open("f2.txt") as f2:
        arr1 = getArr(f1)
        arr2 = getArr(f2)
    dic = {}
    for array in arr1:
        dic[array[1].strip()] = array[0].strip()
    with open("f3.txt", "w") as f3:
        for i in arr2:
            # Strip stray spaces such as the one in 'ID2, ID9'
            keys = [key.strip() for key in i[1].split(',')]
            # Fall back to the raw ID when it has no name in f1.txt
            names = [dic.get(key, key) for key in keys]
            f3.write(i[0] + '\t' + ','.join(names) + '\n')