Question

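I'm trying to implement a duplicate-detection method on a List object. The goal is to traverse the List and find duplicate objects using multiple threads. So far I've used an ExecutorService as follows.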

ExecutorService executor = Executors.newFixedThreadPool(5);
    for (int i = 0; i < jobs; i++) {
        Runnable worker = new TaskToDo(jobs);
        executor.execute(worker);
    }
executor.shutdown();
while (!executor.isTerminated()) {
}
System.out.println("Finished all threads");

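In the TaskToDo class I iterate over the list in a loop. When a duplicate is detected, one of the copies is removed from the list. These are the problems I'm facing: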

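When the executor uses multiple threads, the result is not as expected: some duplicate values remain in the list. With a single thread in the executor it works perfectly. I also tried List<String> list = Collections.synchronizedList(new LinkedList<String>()), with the same problem.
Which data structure is best for removing duplicates with better performance?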

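Google turned up some results that use concurrent structures, but it is hard to work out the right way to achieve this. I'd appreciate your help. Thanks in advance... :)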

Below is the code that iterates over the given list. Here the actual contents of the files are compared.

for(int i = currentTemp; i < list.size() - 1; i++){
        if(isEqual(list.get(currentTemp), list.get(i+1))){
            synchronized (list) {
                list.remove(i + 1);
                i--;
}}}

Answer 1

With your current logic, you have to synchronize at a coarser granularity, otherwise you risk removing the wrong element.

for (int i = currentTemp; i < list.size() - 1; i++) {
  synchronized (list) {
    // re-check the bound inside the lock: another thread may have shrunk the list
    if (i + 1 < list.size() && isEqual(list.get(currentTemp), list.get(i+1))) {
      list.remove(i + 1);
      i--;
    }
  }
}

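As you can see, the isEqual() check has to be inside the synchronized block to make the equality check atomic with the element removal. Assuming most of the benefit of concurrent processing would come from comparing list elements with isEqual() in parallel, this change cancels out whatever gain you were after.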

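Also, checking list.size() outside the synchronized block is not enough, because other threads can remove list elements in the meantime. And unless you have a way to adjust your list index as other threads remove elements, your code will unknowingly skip some elements of the list: the other threads are shifting elements out from under the current thread's for loop.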

This task is much better implemented by using an additional collection to keep track of the indexes that should be removed:

private volatile Set<Integer> indexesToRemove =
  Collections.synchronizedSet(new TreeSet<Integer>(
    new Comparator<Integer>() {
      @Override public int compare(Integer i1, Integer i2) {
        return i2.compareTo(i1); // sort descending for later element removal
      }
    }
  ));

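The above should be declared at the same shared level as list. The code that iterates over the list should then look like this, with no synchronization required: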

int size = list.size();
for (int i = currentTemp; i < size - 1; i++) {
  if (!indexesToRemove.contains(i + 1)) {
    if (isEqual(list.get(currentTemp), list.get(i+1))) {
      indexesToRemove.add(i + 1);
    }
  }
}

Finally, after you have join()ed the worker threads back into a single thread, do this to de-duplicate the list:

for (Integer i: indexesToRemove) {
  list.remove(i.intValue());
}

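Because we used a descending-sorted TreeSet for indexesToRemove, we can simply iterate over its indexes and remove each one from the list: removing the highest index first never shifts the position of an index that is still waiting to be removed.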
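The three fragments above can be put together roughly as follows. This is only a sketch under assumptions that are not in the original: the class name IndexTrackingDedup and the method dedup are made up for illustration, equals() stands in for the asker's isEqual(), the list is assumed to be a random-access list such as ArrayList with no null elements, and one task is submitted per start index. The list is only read while the workers run and is modified after they have all finished.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class IndexTrackingDedup {

    // Hypothetical helper: removes later duplicates in place, keeping first occurrences.
    static <T> void dedup(List<T> list, int threads) {
        // Descending order, so removing an index never shifts one still to be removed.
        Set<Integer> indexesToRemove = Collections.synchronizedSet(
                new TreeSet<Integer>(Collections.<Integer>reverseOrder()));
        int size = list.size();
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        for (int start = 0; start < size - 1; start++) {
            final int current = start;
            executor.execute(() -> {
                // Workers only read the list; they record indexes instead of removing.
                for (int i = current + 1; i < size; i++) {
                    if (!indexesToRemove.contains(i)
                            && list.get(current).equals(list.get(i))) {
                        indexesToRemove.add(i);
                    }
                }
            });
        }
        executor.shutdown();
        try {
            // Wait for all workers before touching the list (timeout is an arbitrary choice).
            if (!executor.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        // Single-threaded from here on: remove from the highest index down.
        for (Integer i : indexesToRemove) {
            list.remove(i.intValue());
        }
    }

    public static void main(String[] args) {
        List<String> list = new ArrayList<>(Arrays.asList("a", "b", "a", "c", "b", "a"));
        dedup(list, 4);
        System.out.println(list); // prints [a, b, c]
    }
}
```

Note that this still performs O(N^2) comparisons in total; it only avoids the locking and the index shifting.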

Answer 2

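If your algorithm works on enough data that it might really benefit from multiple threads, you run into another issue that will erode any performance advantage. Each thread has to scan the entire list to see whether the element it is working on is a duplicate, which causes constant CPU cache misses as the threads compete for access to different parts of the list.

This is known as False Sharing.

Even if False Sharing doesn't get you, you are de-duplicating the list in O(N^2), because for each element of the list you re-iterate the entire list.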

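Instead, consider using a Set to collect the data in the first place. If you can't do that, test the performance of adding the list elements to a Set. That should be a very efficient way to approach this problem.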
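As a minimal sketch of the "add the list elements to a Set" suggestion: the class name SetDedup and the method dedup are hypothetical, and the elements are assumed to have consistent equals() and hashCode() implementations. A LinkedHashSet is used here so that the first occurrence of each element keeps its position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

class SetDedup {

    // Hypothetical helper: returns a new list without duplicates, in first-seen order.
    static <T> List<T> dedup(List<T> input) {
        // Each insertion is an expected O(1) hash lookup, so the whole pass is expected O(N).
        return new ArrayList<>(new LinkedHashSet<>(input));
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a", "b", "a", "c", "b");
        System.out.println(dedup(files)); // prints [a, b, c]
    }
}
```

This runs on a single thread; for many workloads that may already be fast enough that no thread pool is needed, which is worth measuring before adding concurrency.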

Answer 3

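If you're trying to de-duplicate a large number of files, you really should be using a hash-based structure. Modifying a list concurrently is dangerous, not least because the indexes into the list keep changing out from under you, which is bad.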

If you can use Java 8, my approach would look something like this. Let's assume you have a List<String> fileList.

 Collection<String> deduplicatedFiles = fileList.parallelStream()
    .map(FileSystems.getDefault()::getPath) // convert strings to Paths
    .collect(Collectors.toConcurrentMap(
       path -> {
          try {
             return ByteBuffer.wrap(Files.readAllBytes(path));
             // read out the file contents and wrap in a ByteBuffer
             // which is a suitable key for a hash map
          } catch (IOException e) {
            throw new RuntimeException(e);
          }
        },
       path -> path.toString(), // in the values, convert back to string
       (first, second) -> first)) // resolve duplicates by choosing arbitrarily
    .values();

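That's the whole thing: it reads all the files concurrently, hashes them (albeit with an unspecified hash algorithm that may not be great), de-duplicates them, and spits out a list of files with distinct contents.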
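The reason a ByteBuffer works as the map key, where a raw byte[] would not, is that ByteBuffer.equals() and hashCode() are based on the buffer's remaining contents, while arrays use identity. A small sketch of that behaviour follows; the class name ByteBufferKeyDemo and the method countDistinct are made up for illustration, and in-memory byte arrays stand in for file contents.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class ByteBufferKeyDemo {

    // Hypothetical helper: counts how many distinct contents were passed in.
    static int countDistinct(byte[]... contents) {
        Map<ByteBuffer, Integer> firstIndexByContent = new HashMap<>();
        for (int i = 0; i < contents.length; i++) {
            // Wrapped buffers with equal bytes are equal keys with equal hash codes.
            ByteBuffer key = ByteBuffer.wrap(contents[i]);
            if (!firstIndexByContent.containsKey(key)) {
                firstIndexByContent.put(key, i);
            }
        }
        return firstIndexByContent.size();
    }

    public static void main(String[] args) {
        byte[] a = "same".getBytes(StandardCharsets.UTF_8);
        byte[] b = "same".getBytes(StandardCharsets.UTF_8);
        byte[] c = "other".getBytes(StandardCharsets.UTF_8);
        System.out.println(a.equals(b));                                   // prints false
        System.out.println(ByteBuffer.wrap(a).equals(ByteBuffer.wrap(b))); // prints true
        System.out.println(countDistinct(a, b, c));                        // prints 2
    }
}
```

Because the hash code depends on the contents, the wrapped arrays should not be modified while the buffers are in use as keys.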

If you're on Java 7, then this is what I'd do.

 CompletionService<Void> service = new ExecutorCompletionService<>(
     Executors.newFixedThreadPool(4));
 final ConcurrentMap<ByteBuffer, String> unique = new ConcurrentHashMap<>();
 for (final String file : fileList) {
    service.submit(new Runnable() {
      @Override public void run() {
        try {
          ByteBuffer buffer = ByteBuffer.wrap(Files.readAllBytes(
              FileSystems.getDefault().getPath(file)));
          unique.putIfAbsent(buffer, file);
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }
    }, null);
 }
 for (int i = 0; i < fileList.size(); i++) {
   service.take();
 }
 Collection<String> result = unique.values();

Multithreading: identifying duplicate objects
