Retrieving a dict value with a hard-coded key works, but not with a computed key. Why?

Time: 2018-07-13 15:51:38

Tags: string python-2.7 list dictionary format

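I generate a list of common IDs by comparing two sets of IDs (the IDs come from dictionaries of the form {ID: XML "RECORD" element}). Once I have the common list, I want to iterate over it and retrieve the value for each ID from the dictionaries (in order to write it to disk).

When I compute the common ID list with my diff_comm_checker function, I cannot retrieve the dictionary values for those IDs. It does not fail with a KeyError, though, and I can still print the IDs.

When I hard-code the ID as the value of common_ids, I can retrieve the dict values.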

common_ids = diff_comm_checker( list_1, list_2, "text")
# does nothing - no failures

common_ids = ['0603599998140032MB']
#gives me:

0603599998140032MB {'R': '0603599998140032MB'} <Element 'RECORD' at 0x04ACE788>
0603599998140032MB {'R': '0603599998140032MB'} <Element 'RECORD' at 0x04ACE3E0>

So I suspected some difference between the strings. I checked both the function's output and the hard-coded value with the following command:

print [(_id, type(_id), repr(_id)) for _id in common_ids][0]

I get exactly the same result for both:

>>> ('0603599998140032MB', <type 'str'>, "'0603599998140032MB'")

I also followed the advice from another question and used difflib.ndiff:

common_ids1 = diff_comm_checker( [x.keys() for x in to_write[0]][0], [x.keys() for x in to_write[1]][0], "text")
common_ids = ['0603599998140032MB']
print "\n".join(difflib.ndiff(common_ids1, common_ids))
>>>  0603599998140032MB

Again, there seems to be no difference between the two.

Here is the full code example:

from StringIO import StringIO
import xml.etree.cElementTree as ET
from itertools import chain, islice

def diff_comm_checker(list_1, list_2, text):
    """Checks 2 lists. If no difference, pass. Else return common set between two lists"""

    symm_diff = set(list_1).symmetric_difference(list_2)
    if not symm_diff:
        pass
    else:
        mismatches_in1_not2 = set(list_1).difference( set(list_2) )
        mismatches_in2_not1 = set(list_2).difference( set(list_1) )

        if mismatches_in1_not2:
            mismatch_logger(
                mismatches_in1_not2,"{}\n1: {}\n2: {}".format(text, list_1, list_2), 1, 2)
        if mismatches_in2_not1:
            # Labels and arguments now agree: "2:" shows list_2, "1:" shows list_1
            mismatch_logger(
                mismatches_in2_not1,"{}\n2: {}\n1: {}".format(text, list_2, list_1), 2, 1)

    set_common = set(list_1).intersection( set(list_2) )
    if set_common:
        return sorted(set_common)
    else:
        return "no common set: {}\n".format(text)


def chunks(iterable, size=10):
    iterator = iter(iterable)
    for first in iterator:
        yield chain([first], islice(iterator, size - 1))

def get_elements_iteratively(file):
    """Create unique ID out of image number and case number, return it along with corresponding xml element"""

    tag = "RECORD"

    tree = ET.iterparse(StringIO(file), events=("start","end"))
    context = iter(tree)
    _, root = next(context)

    for event, record in context:
        if event == 'end' and record.tag == tag:
            xml_element_2 = ''
            xml_element_1 = ''
            for child in record.getchildren():
                if child.tag == "IMAGE_NUMBER":
                    xml_element_1 = child.text
                if child.tag == "CASE_NUM":
                    xml_element_2 = child.text
            r_id = "{}{}".format(xml_element_1, xml_element_2)
            record.set("R", r_id)
            yield (r_id, record)
            root.clear()

def get_chunks(file, chunk_size):
    """Breaks XML into chunks, yields dict containing unique IDs and corresponding xml elements"""

    iterable = get_elements_iteratively(file)

    for chunk in chunks(iterable, chunk_size):
        ids_records = {}
        for k in chunk:
            ids_records[k[0]]=k[1]

        yield ids_records

def create_new_xml(xml_list):

    chunk = 5000

    chunk_rec_ids_1 = get_chunks(xml_list[0], chunk)
    chunk_rec_ids_2 = get_chunks(xml_list[1], chunk)
    to_write = [chunk_rec_ids_1, chunk_rec_ids_2]

    ######################################################################################
    ### WHAT'S GOING HERE ??? WHAT'S THE DIFFERENCE BETWEEN THE OUTPUTS OF THESE TWO ? ###

    common_ids = diff_comm_checker( [x.keys() for x in to_write[0]][0], [x.keys() for x in to_write[1]][0], "create_new_xml - large - common_ids")
    #common_ids = ['0603599998140032MB']

    ######################################################################################

    for _id in common_ids:
        print _id
        for gen_obj in to_write:
            for kv_pair in gen_obj:
                if kv_pair[_id]:
                    print _id, kv_pair[_id].attrib, kv_pair[_id]


if __name__ == '__main__':

    xml_1 = """<?xml version="1.0"?><RECORDSET><RECORD><CASE_NUM>140032MB</CASE_NUM><IMAGE_NUMBER>0603599998</IMAGE_NUMBER></RECORD></RECORDSET>"""
    xml_2 = """<?xml version="1.0"?><RECORDSET><RECORD><CASE_NUM>140032MB</CASE_NUM><IMAGE_NUMBER>0603599998</IMAGE_NUMBER></RECORD></RECORDSET>"""
    create_new_xml([xml_1, xml_2])

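1 answer:

Answer 0 (score: 0)

The problem is not the type or value of the common_ids returned by diff_comm_checker. The problem is that calling diff_comm_checker, or more precisely constructing its arguments, destroys the contents of to_write.

If you try this, you will see what I mean: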

common_ids = ['0603599998140032MB']
diff_comm_checker( [x.keys() for x in to_write[0]][0], [x.keys() for x in to_write[1]][0], "create_new_xml - large - common_ids")

This produces the faulty behavior even though the return value of diff_comm_checker() is not used.
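The effect can be reproduced in isolation. The following is a minimal sketch, assuming a hypothetical stand-in generator (fake_chunks is not part of the original code): a generator can only be iterated once, so a second pass over the same object yields nothing and raises no error.

```python
def fake_chunks():
    """Hypothetical stand-in for get_chunks: yields one dict per chunk."""
    yield {"0603599998140032MB": "record-1"}

gen = fake_chunks()

# First pass: the list comprehension walks the whole generator.
first_pass = [sorted(d.keys()) for d in gen]

# Second pass over the SAME generator object: it is already exhausted,
# so the comprehension body never runs and no exception is raised.
second_pass = [sorted(d.keys()) for d in gen]

print(first_pass)
print(second_pass)
```

This matches the symptom in the question: the inner loop over the generators silently runs zero times, so neither a KeyError nor any output appears.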

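That is because to_write holds generators, and building the arguments for diff_comm_checker exhausts them. By the time they are used again in the loop that contains the if statement, the generators are already finished/empty, so the loop body never runs. You can use list to create lists from the generators: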
chunk_rec_ids_1 = list(get_chunks(xml_list[0], chunk))
chunk_rec_ids_2 = list(get_chunks(xml_list[1], chunk))

But this may have other implications (memory usage, ...).
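If holding every chunk in memory is a concern, one possible alternative is to create fresh generators for each pass instead of reusing exhausted ones. The sketch below assumes a hypothetical make_chunks helper and toy data in place of get_chunks and the XML input; it is an illustration of the idea, not the original code.

```python
def make_chunks(records, size):
    """Hypothetical stand-in for get_chunks: yields dicts of at most `size` items."""
    chunk = {}
    for key, value in records:
        chunk[key] = value
        if len(chunk) == size:
            yield chunk
            chunk = {}
    if chunk:
        yield chunk

# Toy data standing in for the two XML inputs (assumption for illustration).
records_1 = [("a", 1), ("b", 2), ("c", 3)]
records_2 = [("b", 20), ("c", 30), ("d", 40)]

# Pass 1: consume fresh generators only to collect the keys.
keys_1 = set(k for chunk in make_chunks(records_1, 2) for k in chunk)
keys_2 = set(k for chunk in make_chunks(records_2, 2) for k in chunk)
common_ids = sorted(keys_1 & keys_2)

# Pass 2: build NEW generators rather than reusing the exhausted ones.
found = []
for source in (records_1, records_2):
    for chunk in make_chunks(source, 2):
        for _id in common_ids:
            if _id in chunk:  # membership test avoids KeyError on other chunks
                found.append((_id, chunk[_id]))

print(common_ids)
print(found)
```

The trade-off is that the input is parsed twice, in exchange for keeping only one chunk plus the key sets in memory at a time.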

Also, what is the purpose of this construct in diff_comm_checker?

    if not symm_diff:
       pass

I think nothing happens in this branch whether or not symm_diff is empty (it is a set, so it is never None).
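For comparison, here is one way the check could be written without the empty branch. This is only a sketch: common_and_mismatches is a hypothetical name, and it returns the mismatches instead of logging them. It also always returns lists, whereas the original returns a string when there are no common IDs, which a later `for _id in common_ids` loop would iterate character by character.

```python
def common_and_mismatches(list_1, list_2):
    """Hypothetical simplification of diff_comm_checker.

    Returns (sorted common IDs, IDs only in list_1, IDs only in list_2).
    """
    set_1, set_2 = set(list_1), set(list_2)
    only_in_1 = sorted(set_1 - set_2)
    only_in_2 = sorted(set_2 - set_1)
    common = sorted(set_1 & set_2)
    return common, only_in_1, only_in_2

common, only_1, only_2 = common_and_mismatches(["a", "b"], ["b", "c"])
print(common)
print(only_1)
print(only_2)
```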