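I have a list of 300,000 people, some of whom are duplicates. More importantly, some are near-duplicates.
For example: Id LastName FirstName BirthDate
I want to find these duplicates and handle them separately. I found some examples using LINQ's GroupBy, but I can't find a solution that covers these two subtleties:
Here is what I have for now. It does the job, but it is very slow, and I am fairly sure it could be cleaner: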
    var dictionary = new Dictionary<int, List<Person>>();
    int key = 1; // the Key could be a string built with LastName, first letters of FirstName... but finally this integer is enough
    foreach (var c in ListPersons)
    {
        List<Person> doubles = ListPersons
            .Where(x => x.Id != c.Id
                && x.LastName == c.LastName
                // cause dupe A could be "John" and B "John F". Or... dupe A could be "John F" and B "John"
                && (x.FirstName.StartsWith(c.FirstName) || c.FirstName.StartsWith(x.FirstName))
                && x.BirthDate == c.BirthDate
            ).ToList();
        if (doubles.Any())
        {
            doubles.Add(c); // add the current guy
            dictionary.Add(key++, doubles);
        }
        // Ugly hack to remove the doubles already found
        ListPersons = ListPersons.Except(doubles).ToList();
    }
    // Later I will read my dictionary and treat Value by Value, Person by Person (duplicate by duplicate)
Finally:
With the help of the answer below and an IEqualityComparer:
    // Speedo x1000 !
    var listDuplicates = ListPersons
        .GroupBy(x => x, new PersonComparer())
        .Where(g => g.Count() > 1) // I want to keep the duplicates
        .ToList();
    // Then, I treat the duplicates in my own way using all properties of the Person I need
    foreach (var listC in listDuplicates)
    {
        foreach (Person c in listC)
        {
            // Some treatment
        }
    }
Answer 0 (score: 3)
You can always build your own IEqualityComparer<T>:
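    using System;
    using System.Collections.Generic;

    public class PersonComparer : IEqualityComparer<Person>
    {
        public bool Equals(Person x, Person y)
        {
            if (ReferenceEquals(x, y)) return true;
            if (x is null || y is null) return false;
            if (x.LastName != y.LastName || x.BirthDate != y.BirthDate) return false;
            // string.StartsWith(null) throws, so a missing first name only matches another missing one
            if (x.FirstName is null || y.FirstName is null) return x.FirstName == y.FirstName;
            // "John" matches "John F" and vice versa
            return x.FirstName.StartsWith(y.FirstName, StringComparison.Ordinal)
                || y.FirstName.StartsWith(x.FirstName, StringComparison.Ordinal);
        }

        // FirstName is deliberately left out: persons that Equals treats as equal must share a hash code
        public int GetHashCode(Person obj)
        {
            unchecked
            {
                int hash = 17;
                hash = hash * 23 + (obj?.LastName?.GetHashCode() ?? 0);
                hash = hash * 23 + (obj?.BirthDate.GetHashCode() ?? 0);
                return hash;
            }
        }
    }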
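To see how the comparer drives GroupBy, here is a minimal self-contained sketch. The `Person` class, the sample rows, and the `DuplicateDemo` helper are assumptions made up for illustration (the question only lists the four column names), and the comparer is a null-tolerant variant of the one above rather than an exact copy.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical Person type: the question only names its four columns.
public class Person
{
    public int Id { get; set; }
    public string LastName { get; set; }
    public string FirstName { get; set; }
    public DateTime BirthDate { get; set; }
}

// Same matching rule as in the answer, with explicit null handling added for this sketch.
public class PersonComparer : IEqualityComparer<Person>
{
    public bool Equals(Person x, Person y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        if (x.LastName != y.LastName || x.BirthDate != y.BirthDate) return false;
        if (x.FirstName is null || y.FirstName is null) return x.FirstName == y.FirstName;
        return x.FirstName.StartsWith(y.FirstName, StringComparison.Ordinal)
            || y.FirstName.StartsWith(x.FirstName, StringComparison.Ordinal);
    }

    // Only LastName and BirthDate are hashed, so "John" and "John F" land in the same bucket.
    public int GetHashCode(Person obj)
    {
        if (obj is null) return 0;
        unchecked
        {
            int hash = 17;
            hash = hash * 23 + (obj.LastName?.GetHashCode() ?? 0);
            hash = hash * 23 + obj.BirthDate.GetHashCode();
            return hash;
        }
    }
}

public static class DuplicateDemo
{
    // Made-up sample data: Ids 1 and 2 are near-duplicates, 3 and 4 are distinct people.
    public static List<Person> SamplePeople() => new List<Person>
    {
        new Person { Id = 1, LastName = "Smith", FirstName = "John",   BirthDate = new DateTime(1980, 1, 1) },
        new Person { Id = 2, LastName = "Smith", FirstName = "John F", BirthDate = new DateTime(1980, 1, 1) },
        new Person { Id = 3, LastName = "Smith", FirstName = "Jane",   BirthDate = new DateTime(1980, 1, 1) },
        new Person { Id = 4, LastName = "Doe",   FirstName = "John",   BirthDate = new DateTime(1975, 5, 5) },
    };

    // Keeps only the groups with more than one member; each group holds the full Person objects.
    public static List<List<Person>> FindDuplicateGroups(IEnumerable<Person> people) =>
        people.GroupBy(p => p, new PersonComparer())
              .Where(g => g.Count() > 1)
              .Select(g => g.ToList())
              .ToList();
}
```

Calling `DuplicateDemo.FindDuplicateGroups(DuplicateDemo.SamplePeople())` should yield a single group containing the persons with Id 1 and Id 2. The speed-up over the nested `Where` loop comes from hashing: GroupBy only calls `Equals` on persons whose hash codes already agree, instead of comparing every person with every other one.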
If you want to keep only the first one and remove the other duplicates:
    ListPersons = ListPersons
        .GroupBy(x => x, new PersonComparer())
        .Select(g => g.First())
        .ToList();
You can use this comparer with many other LINQ methods, and even with a dictionary or a HashSet<T>. For example, you can also remove duplicates this way:
HashSet<Person> persons = new HashSet<Person>(ListPersons, new PersonComparer());
Another way, with pure LINQ:
ListPersons = ListPersons.Distinct(new PersonComparer()).ToList();
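One thing to keep in mind with the "keep the first one" variants: the StartsWith rule is not transitive. "John F" and "John G" each match "John", but they do not match each other, so which elements end up merged can depend on the order of the input. A possible mitigation, sketched below under assumptions, is to sort by first-name length so the shortest variant is seen first. The `Person` class, the `KeepFirstDemo` helper, and the sample rows are hypothetical, the comparer is a null-tolerant variant of the one in the answer, and the sketch relies on Distinct keeping the first element it encounters, which is how current .NET implementations behave as far as I know.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical Person type: the question only names its four columns.
public class Person
{
    public int Id { get; set; }
    public string LastName { get; set; }
    public string FirstName { get; set; }
    public DateTime BirthDate { get; set; }
}

// Null-tolerant variant of the answer's comparer (an assumption of this sketch).
public class PersonComparer : IEqualityComparer<Person>
{
    public bool Equals(Person x, Person y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        if (x.LastName != y.LastName || x.BirthDate != y.BirthDate) return false;
        if (x.FirstName is null || y.FirstName is null) return x.FirstName == y.FirstName;
        return x.FirstName.StartsWith(y.FirstName, StringComparison.Ordinal)
            || y.FirstName.StartsWith(x.FirstName, StringComparison.Ordinal);
    }

    public int GetHashCode(Person obj)
    {
        if (obj is null) return 0;
        unchecked
        {
            int hash = 17;
            hash = hash * 23 + (obj.LastName?.GetHashCode() ?? 0);
            hash = hash * 23 + obj.BirthDate.GetHashCode();
            return hash;
        }
    }
}

public static class KeepFirstDemo
{
    // Made-up sample: three spellings of the same person, longest names first.
    public static List<Person> SamplePeople() => new List<Person>
    {
        new Person { Id = 1, LastName = "Smith", FirstName = "John F", BirthDate = new DateTime(1980, 1, 1) },
        new Person { Id = 2, LastName = "Smith", FirstName = "John G", BirthDate = new DateTime(1980, 1, 1) },
        new Person { Id = 3, LastName = "Smith", FirstName = "John",   BirthDate = new DateTime(1980, 1, 1) },
    };

    // Sort so the shortest first name is seen first, then deduplicate with the comparer.
    public static List<Person> KeepShortestName(IEnumerable<Person> people) =>
        people.OrderBy(p => p.FirstName?.Length ?? 0)
              .Distinct(new PersonComparer())
              .ToList();
}
```

With this sample, `KeepFirstDemo.KeepShortestName(KeepFirstDemo.SamplePeople())` should return only the person with Id 3 ("John"), because both longer spellings match it. Whether the shortest spelling is the right record to keep is a business decision, so treat the sort key as a placeholder.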