Python: Compare first n characters of item in list to first n characters of all other items in same list

Asked: 2019-04-16 23:54:26

Tags: python python-2.7 duplicates comparison list-comprehension

I need to compare the first n characters of items in a list to the first n characters of other items in the same list, then remove or keep one of those items.

In the example list below, “AB2222_100” and “AB2222_P100” would be considered duplicates (even though they're technically unique) because the first 6 characters match. When comparing the two values, if x[-4:] == "P100", then that value would be kept in the list and the value without the “P” would be removed. The other items in the list would be kept since they have no duplicate, regardless of whether the suffix at the end of the string is “P100” or “100”. For this case, there will never be more than one duplicate per prefix (one with a “P” and one without).

  • AB1111_100
  • AB2222_100
  • AB2222_P100
  • AB3333_P100
  • AB4444_100
  • AB5555_P100

I understand slicing and comparing, but everything I've found assumes unique values. I was hoping to use a list comprehension instead of a long for loop, but I also want to understand what I'm seeing. I've gotten lost trying to figure out collections, sets, zip, etc. for this non-unique scenario.

Slicing and comparing alone won't work, because slicing discards the suffix that has to be preserved in the final list.

newList = [x[:6] for x in myList]

This is how it should start and end.

myList = ['ABC1111_P100', 'ABC2222_100', 'ABC2222_P100', 'ABC3333_P100', 'ABC4444_100', 'ABC5555_P100']

newList = ['ABC1111_P100', 'ABC2222_P100', 'ABC3333_P100', 'ABC4444_100', 'ABC5555_P100']

1 answer:

Answer 0 (score: 0)

As mentioned in the comments, you can't do this in one go. You can do it in O(n) time, but it takes some extra space:

from collections import OrderedDict

myList = ['ABC1111_P100', 'ABC2222_100', 'ABC2222_P100', 'ABC3333_P100', 'ABC4444_100', 'ABC5555_P100']
# Maps prefix -> suffix. OrderedDict keeps first-seen order,
# which a plain dict does not guarantee in Python 2.7.
seen = OrderedDict()

print(myList)
for x in myList:
    # Split into the prefix (before the first '_') and the suffix (after it)
    start, end = x.split('_', 1)
    if start in seen:  # We have already seen this prefix
        if seen[start] != 'P100':  # The stored suffix is not the P version
            seen[start] = end  # so replace it with the current suffix
    else:
        # First time we see this prefix: record its suffix
        seen[start] = end

# Rebuild the full strings from the prefix/suffix pairs
final_list = ["{}_{}".format(key, value) for key, value in seen.items()]
print(final_list)
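
Since the question asked about a list comprehension, here is a minimal sketch of a two-pass variant that keeps the original list order. It assumes every item contains an underscore and that the suffix to keep is exactly "P100"; the function name keep_preferred and its preferred_suffix parameter are made up for illustration and are not part of the original answer.

```python
def keep_preferred(items, preferred_suffix="P100"):
    """Drop an item when the same prefix also occurs with the preferred suffix."""
    # Pass 1: collect every prefix that has a preferred-suffix variant.
    preferred = set(
        x.split("_", 1)[0]
        for x in items
        if x.split("_", 1)[1] == preferred_suffix
    )
    # Pass 2: keep preferred variants, plus items whose prefix has no such variant.
    return [
        x
        for x in items
        if x.split("_", 1)[1] == preferred_suffix
        or x.split("_", 1)[0] not in preferred
    ]


my_list = ['ABC1111_P100', 'ABC2222_100', 'ABC2222_P100',
           'ABC3333_P100', 'ABC4444_100', 'ABC5555_P100']
print(keep_preferred(my_list))
```

This is still O(n) overall, since the set lookup is constant time on average, and it does not rely on dictionary ordering, so the result comes out in the order of the input list.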