Finding near-duplicates in a list

Asked: 2021-06-15 08:15:23

Tags: c# list linq duplicates

I have a list of 300,000 people that contains some exact duplicates. More importantly, it also contains near-duplicates.

For example:

  Id  LastName  FirstName        BirthDate
  1   Kennedy   John             01/01/2000
  2   Kennedy   John Fitzgerald  01/01/2000

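I want to find these duplicates and process them separately. I found some examples using LINQ's GroupBy, but none of them handles these two subtleties:

  1. Matching first names with StartsWith
  2. Keeping the whole object intact (not just the last name projected with Select new)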

For now I have the code below. It does the job, but it is very slow, and I am fairly sure it could be written more cleanly:

var dictionary = new Dictionary<int, List<Person>>();
int key = 1; // the Key could be a string built with LastName, first letters of FirstName... but finally this integer is enough
foreach (var c in ListPersons)
{
    List<Person> doubles = ListPersons
        .Where(x => x.Id != c.Id
        && x.LastName == c.LastName
        && (x.FirstName.StartsWith(c.FirstName) || c.FirstName.StartsWith(x.FirstName)) // cause dupe A could be "John" and B "John F". Or... dupe A could be "John F" and B "John"
        && x.BirthDate == c.BirthDate 
        ).ToList();

    if (doubles.Any())
    {
       doubles.Add(c); // add the current guy
       dictionary.Add(key++, doubles);
    }

    // Ugly hack to remove the doubles already found
    ListPersons = ListPersons.Except(doubles).ToList();
}

// Later I will read my dictionary and treat Value by Value, Person by Person (duplicate by duplicate)

Finally:

With the help of the answer below and an IEqualityComparer:

// About 1000x faster!
var listDuplicates = ListPersons
.GroupBy(x => x, new PersonComparer())
.Where(g => g.Count() > 1) // I want to keep the duplicates
.ToList();

// Then, I treat the duplicates in my own way using all properties of the Person I need
foreach (var listC in listDuplicates)
{
 foreach (Person c in listC)
 {
   // Some treatment
 }
}

1 answer:

Answer 0 (score: 3)

You can always build your own IEqualityComparer<T>:

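using System;
using System.Collections.Generic;

public class PersonComparer : IEqualityComparer<Person>
{
    public bool Equals(Person x, Person y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        if (x.LastName != y.LastName || x.BirthDate != y.BirthDate) return false;

        // string.StartsWith throws on a null argument, so handle missing first names first.
        if (x.FirstName is null || y.FirstName is null) return x.FirstName == y.FirstName;

        // Ordinal comparison avoids culture-dependent prefix matching.
        return x.FirstName.StartsWith(y.FirstName, StringComparison.Ordinal)
            || y.FirstName.StartsWith(x.FirstName, StringComparison.Ordinal);
    }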

    public int GetHashCode(Person obj)
    {
        unchecked 
        {
            int hash = 17;
            hash = hash * 23 + (obj?.LastName?.GetHashCode() ?? 0);
            hash = hash * 23 + (obj?.BirthDate.GetHashCode() ?? 0);
            return hash;
        }
    }
}
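
Note that GetHashCode leaves FirstName out on purpose: two persons that Equals treats as equal must return the same hash code, and "John" and "John Fitzgerald" would hash differently if the first name were included. The following is a minimal, self-contained sketch of the whole round trip. The Person class, the Demo helper and the sample data are assumptions made for illustration (the question never shows the real class), and the comparer is a condensed copy that assumes no null persons or names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical Person shape; the question does not show the real class.
public class Person
{
    public int Id { get; set; }
    public string LastName { get; set; }
    public string FirstName { get; set; }
    public DateTime BirthDate { get; set; }
}

// Condensed copy of the comparer above; assumes no null persons or null names.
public class PersonComparer : IEqualityComparer<Person>
{
    public bool Equals(Person x, Person y) =>
        x.LastName == y.LastName && x.BirthDate == y.BirthDate
        && (x.FirstName.StartsWith(y.FirstName, StringComparison.Ordinal)
            || y.FirstName.StartsWith(x.FirstName, StringComparison.Ordinal));

    // FirstName is excluded so that "John" and "John Fitzgerald" hash alike.
    public int GetHashCode(Person obj) => (obj.LastName, obj.BirthDate).GetHashCode();
}

public static class Demo
{
    // Made-up sample data based on the example in the question.
    public static List<Person> Sample() => new List<Person>
    {
        new Person { Id = 1, LastName = "Kennedy", FirstName = "John", BirthDate = new DateTime(2000, 1, 1) },
        new Person { Id = 2, LastName = "Kennedy", FirstName = "John Fitzgerald", BirthDate = new DateTime(2000, 1, 1) },
        new Person { Id = 3, LastName = "Kennedy", FirstName = "Robert", BirthDate = new DateTime(2000, 1, 1) },
    };

    // Groups the sample with the comparer and keeps only groups of two or more.
    public static List<List<Person>> DuplicateGroups() =>
        Sample().GroupBy(p => p, new PersonComparer())
                .Where(g => g.Count() > 1)
                .Select(g => g.ToList())
                .ToList();
}
```

With this sample, Demo.DuplicateGroups() yields a single group holding the persons with Id 1 and 2; the third person shares the hash bucket but fails Equals, so it ends up alone and is filtered out.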

If you only want to keep the first one and drop the other duplicates:

ListPersons = ListPersons
    .GroupBy(x => x, new PersonComparer())
    .Select(g => g.First())
    .ToList();

You can use this comparer with many other LINQ methods, and even with a Dictionary or a HashSet<T>. For example, you can also remove the duplicates this way:

HashSet<Person> persons = new HashSet<Person>(ListPersons, new PersonComparer());
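
Since the question asks to process the duplicates separately rather than just discard them, the return value of HashSet<T>.Add can be used to split the list in one pass: it returns false when an equal element is already in the set. The sketch below is one possible way to do that, not part of the original answer. The Person class and the Splitter helper are hypothetical, and the comparer is a condensed copy that assumes no null persons or names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical Person shape; the question does not show the real class.
public class Person
{
    public int Id { get; set; }
    public string LastName { get; set; }
    public string FirstName { get; set; }
    public DateTime BirthDate { get; set; }
}

// Condensed copy of the comparer above; assumes no null persons or null names.
public class PersonComparer : IEqualityComparer<Person>
{
    public bool Equals(Person x, Person y) =>
        x.LastName == y.LastName && x.BirthDate == y.BirthDate
        && (x.FirstName.StartsWith(y.FirstName, StringComparison.Ordinal)
            || y.FirstName.StartsWith(x.FirstName, StringComparison.Ordinal));

    public int GetHashCode(Person obj) => (obj.LastName, obj.BirthDate).GetHashCode();
}

public static class Splitter
{
    // Splits persons into the first occurrence of each near-duplicate set
    // (Kept) and every later occurrence (Rejected), preserving input order.
    public static (List<Person> Kept, List<Person> Rejected) Split(IEnumerable<Person> persons)
    {
        var seen = new HashSet<Person>(new PersonComparer());
        var kept = new List<Person>();
        var rejected = new List<Person>();
        foreach (var p in persons)
        {
            // Add returns false when an equal element is already in the set.
            if (seen.Add(p)) kept.Add(p);
            else rejected.Add(p);
        }
        return (kept, rejected);
    }
}
```

Which person counts as "first" depends on the order of the input list, so sort it beforehand if a particular record should win.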

Another way, in pure LINQ:

ListPersons = ListPersons.Distinct(new PersonComparer()).ToList();
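
One caveat worth keeping in mind: a StartsWith-based Equals is not transitive. "John" matches both "John F" and "John G", but those two do not match each other, so the groups produced by GroupBy, Distinct or HashSet<T> can depend on the order of the input. If that matters, one alternative is to bucket on the exact fields and then cluster by prefix inside each bucket. The sketch below is an assumption-laden illustration, not the answer's method: the Person class and the NearDuplicates helper are hypothetical, and FirstName is assumed to be non-null.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical Person shape; the question does not show the real class.
public class Person
{
    public int Id { get; set; }
    public string LastName { get; set; }
    public string FirstName { get; set; }
    public DateTime BirthDate { get; set; }
}

public static class NearDuplicates
{
    // Returns clusters of two or more persons that share LastName and BirthDate
    // and whose first names all start with the shortest first name in the cluster.
    // Assumes FirstName is never null.
    public static List<List<Person>> Find(IEnumerable<Person> persons)
    {
        var clusters = new List<List<Person>>();

        // Exact-match fields form the bucket key, so this step is a plain hash grouping.
        foreach (var bucket in persons.GroupBy(p => (p.LastName, p.BirthDate)))
        {
            List<Person> current = null;

            // In ordinal order, a name comes first and every name it prefixes
            // follows it contiguously, so one pass per bucket is enough.
            foreach (var p in bucket.OrderBy(p => p.FirstName, StringComparer.Ordinal))
            {
                if (current != null
                    && p.FirstName.StartsWith(current[0].FirstName, StringComparison.Ordinal))
                {
                    current.Add(p);
                    continue;
                }

                if (current != null && current.Count > 1) clusters.Add(current);
                current = new List<Person> { p };
            }

            if (current != null && current.Count > 1) clusters.Add(current);
        }

        return clusters;
    }
}
```

Here "John", "John F" and "John G" with the same last name and birth date always land in one cluster, regardless of input order. As with the comparer, plain prefix matching also pairs "John" with "Johnny", so stricter matching rules may be needed for real data.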