Question

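First of all, thanks again to everyone who has already answered my question. I am not a very experienced programmer, and this is my first time working with multithreading.

I have put together an example that behaves much like my problem. I hope it clarifies our case.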

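import java.util.ArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

public class ThreadMeasuring {
private static final int TASK_TIME = 1; // milliseconds (converted to nanoseconds via *1e6 below)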
private static class Batch implements Runnable {
    CountDownLatch countDown;
    public Batch(CountDownLatch countDown) {
        this.countDown = countDown;
    }

    @Override
    public void run() {         
        long t0 =System.nanoTime();
        long t = 0;
        while(t<TASK_TIME*1e6){ t = System.nanoTime() - t0; }

        if(countDown!=null) countDown.countDown();
    }
}

public static void main(String[] args) {
    ThreadFactory threadFactory = new ThreadFactory() {
        int counter = 1;
        @Override
        public Thread newThread(Runnable r) {
            Thread t = new Thread(r, "Executor thread " + (counter++));
            return t;
        }
    };

  // The total work to be divided into tasks is fixed (problem dependent).
  // Increasing ntasks means decreasing the task time proportionally.
  // 4 is an arbitrary example.
  // These tasks will be executed thousands of times, inside a loop alternating
  // with serial processing that needs their results and prepares the next ones.
    int ntasks = 4; 
    int nthreads = 2;
    int ncores = Runtime.getRuntime().availableProcessors();
    if (nthreads<ncores) ncores = nthreads;     

    Batch serial = new Batch(null);
    long serialTime = System.nanoTime();
    serial.run();
    serialTime = System.nanoTime() - serialTime;

    ExecutorService executor = Executors.newFixedThreadPool( nthreads, threadFactory );
    CountDownLatch countDown = new CountDownLatch(ntasks);

    ArrayList<Batch> batches = new ArrayList<Batch>();
    for (int i = 0; i < ntasks; i++) {
        batches.add(new Batch(countDown));
    }

    long start = System.nanoTime();
    for (Batch r : batches){
        executor.execute(r);
    }

    // wait for all threads to finish their task
    try {
        countDown.await();
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // restore the interrupt status
        e.printStackTrace();
    }
    long tmeasured = (System.nanoTime() - start);

    System.out.println("Task time= " + TASK_TIME + " ms");
    System.out.println("Number of tasks= " + ntasks);
    System.out.println("Number of threads= " + nthreads);
    System.out.println("Number of cores= " + ncores);
    System.out.println("Measured time= " + tmeasured);
    System.out.println("Theoretical serial time= " + TASK_TIME*1000000*ntasks);
    System.out.println("Theoretical parallel time= " + (TASK_TIME*1000000*ntasks)/ncores);
    System.out.println("Speedup= " + (serialTime*ntasks)/(double)tmeasured);

    executor.shutdown();
}
 }

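Instead of doing real calculations, each batch just busy-waits for a fixed time. The program computes the speedup, which in theory should always be 2 but can drop below 1 (an actual slowdown) if 'TASK_TIME' is small.

My calculations take 1 ms at most, and are usually faster. For 1 ms I see a small speedup of around 30%, but in my real program I observe a slowdown.

The structure of this code is very similar to my program, so if you could help me optimise the thread handling I would be very grateful.

Kind regards.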

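Below, the original question:

Hi.

I would like to use multithreading in my program, since I believe it could considerably increase its efficiency. Most of its running time is spent on independent calculations.

My program has thousands of independent calculations (several linear systems to solve), but they only occur at the same time in small groups of a few dozen or so. Each group takes a few milliseconds to run. After one of these groups of calculations, the program has to run sequentially for a little while, and then I have to solve linear systems again.

In effect, these independent linear systems sit inside a loop that iterates thousands of times, alternating with sequential calculations that depend on the previous results. My idea for speeding up the program is to run the independent calculations in parallel threads, dividing each group into as many batches as I have processors available. So, in principle, there is no queuing at all.

I tried using FixedThreadPool and CachedThreadPool, and it got even slower than serial processing. It seems to take too much time creating new threads each time I need to solve the batches.

Is there a better way to handle this problem? The pools I have used seem suited to cases where each thread takes more time, rather than to thousands of smaller threads...

Thanks! Best regards!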

Answer 1

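Thread pools don't create new threads over and over. That's why they're pools.

How many threads were you using, and how many CPUs/cores do you have? What is the system load (normally, when you execute the tasks serially, and when you execute them with the pool)? Is synchronization or any kind of locking involved?

Is the algorithm for parallel execution exactly the same as for serial execution? (Your description seems to suggest that the serial version reuses some results from the previous iteration.)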
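One way to check the first point, that a pool reuses its threads, is to record which threads actually run the tasks. The following is a minimal sketch rather than anything from the original post; the class name `PoolReuseDemo` and the method `distinctWorkerThreads` are made up for illustration. It submits many short tasks to a fixed pool and counts the distinct worker thread names.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical demo class (not from the original post).
class PoolReuseDemo {

    // Runs `tasks` trivial tasks on a fixed pool of `poolSize` threads and
    // returns how many distinct worker threads executed them.
    static int distinctWorkerThreads(int poolSize, int tasks) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        Set<String> names = ConcurrentHashMap.newKeySet();
        CountDownLatch done = new CountDownLatch(tasks);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    names.add(Thread.currentThread().getName());
                    done.countDown();
                });
            }
            done.await(); // wait until every task has run
        } finally {
            pool.shutdown();
        }
        return names.size();
    }

    public static void main(String[] args) throws InterruptedException {
        int used = distinctWorkerThreads(2, 1000);
        System.out.println("1000 tasks ran on " + used + " distinct thread(s)");
    }
}
```

With a pool of size 2, the count can never exceed 2 regardless of how many tasks are submitted, which suggests that thread creation is unlikely to be the source of the slowdown.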

Answer 2

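I'm not sure how you perform the calculations, but if you're breaking them up into small groups, then your application might be a good fit for the producer/consumer pattern.

Additionally, you might be interested in using a BlockingQueue. The calculation consumers will block until there is something in the queue; the blocking happens in the take() call.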

private static class Batch implements Runnable {
    CountDownLatch countDown;
    public Batch(CountDownLatch countDown) {
        this.countDown = countDown;
    }

    CountDownLatch getLatch(){
        return countDown;
    }

    @Override
    public void run() {         
        long t0 =System.nanoTime();
        long t = 0;
        while(t<TASK_TIME*1e6){ t = System.nanoTime() - t0; }

        if(countDown!=null) countDown.countDown();
    }
}

class CalcProducer implements Runnable {
    private final BlockingQueue<Batch> queue;
    private final int ntasks = 4; // assumed value, matching the question's example
    CalcProducer(BlockingQueue<Batch> q) { queue = q; }
    public void run() {
        try {
            while(true) { 
                CountDownLatch latch = new CountDownLatch(ntasks);
                for(int i = 0; i < ntasks; i++) {
                    queue.put(produce(latch)); 
                }
                // don't need to wait for the latch, only consumers wait
            }
        } catch (InterruptedException ex) { /* ... handle ... */ }
    }

    // Returns a Batch (the original sketch declared an undefined CalcGroup type)
    Batch produce(CountDownLatch latch) {
        return new Batch(latch);
    }
}

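class CalcConsumer implements Runnable {
    private final BlockingQueue<Batch> queue;

    CalcConsumer(BlockingQueue<Batch> q) { queue = q; }

    public void run() {
        try {
            while(true) { consume(queue.take()); } // take() blocks until a batch is available
        } catch (InterruptedException ex) { /* ... handle ... */ }
    }

    void consume(Batch batch) throws InterruptedException { 
        batch.run(); // Java method names are case-sensitive: run(), not Run()
        batch.getLatch().await(); // wait until the whole group has finished
    }
}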

class Setup {
    void main() {
        BlockingQueue<Batch> q = new LinkedBlockingQueue<Batch>();
        int numConsumers = 4;

        CalcProducer p = new CalcProducer(q);
        Thread producerThread = new Thread(p);
        producerThread.start();

        Thread[] consumerThreads = new Thread[numConsumers];

        for(int i = 0; i < numConsumers; i++)
        {
            consumerThreads[i] = new Thread(new CalcConsumer(q));
            consumerThreads[i].start();
        }
    }
}

Sorry if there are any syntax errors; I've been chomping away at C# code and sometimes forget the proper Java syntax, but the general idea is there.

Answer 3

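If you have a problem that does not scale to multiple cores, you either need to change your program or your problem is not as parallel as you think. I suspect you have some other type of bug, but I cannot say based on the information given.

This test code might help.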

Time per million tasks 765 ms

Code

ExecutorService es = Executors.newFixedThreadPool(4);
Runnable task = new Runnable() {
    @Override
    public void run() {
        // do nothing.
    }
};
long start = System.nanoTime();
for(int i=0;i<1000*1000;i++) {
    es.submit(task);
}
es.shutdown();
es.awaitTermination(10, TimeUnit.SECONDS);
long time = System.nanoTime() - start;
System.out.println("Time per million tasks "+time/1000/1000+" ms");

EDIT: Say you have a loop which does this serially.

for(int i=0;i<1000*1000;i++)
    doWork(i);

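You might assume that changing the loop like this would be faster, but the problem is that the overhead could be greater than the gain.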

for(int i=0;i<1000*1000;i++) {
    final int i2 = i;
    ex.execute(new Runnable() {
        public void run() {
            doWork(i2);
        }
    });
}

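So you need to create batches of work (at least one per thread), so that there are enough tasks to keep all the threads busy, but not so many tasks that your threads spend their time on overhead.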

final int batchSize = 10*1000;
for(int i=0;i<1000*1000;i+=batchSize) {
    final int i2 = i;
    ex.execute(new Runnable() {
        public void run() {
            for(int i3=i2;i3<i2+batchSize;i3++)
               doWork(i3);
        }
    });
}
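Applied to the loop described in the question (parallel groups alternating with serial steps), the batching idea might look like the sketch below. It is only an illustration under assumptions: `doWork` is a stand-in that squares its index, the class and method names are invented, and the serial step is a dummy. The pool is created once outside the loop and reused through `invokeAll`, which blocks until all batches of the current group are done.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch (names are illustrative, not from the original post).
class BatchedLoopDemo {

    // Stand-in for one independent calculation.
    static long doWork(int i) {
        return (long) i * i;
    }

    // Splits [0, n) into at most `batches` contiguous ranges, runs them on
    // `pool`, and returns the sum of all results.
    static long runGroup(ExecutorService pool, int n, int batches)
            throws InterruptedException, ExecutionException {
        List<Callable<Long>> tasks = new ArrayList<>();
        int batchSize = (n + batches - 1) / batches; // ceiling division
        for (int start = 0; start < n; start += batchSize) {
            final int from = start;
            final int to = Math.min(n, start + batchSize);
            tasks.add(() -> {
                long partial = 0;
                for (int i = from; i < to; i++) partial += doWork(i);
                return partial;
            });
        }
        long total = 0;
        // invokeAll returns only after every batch has completed.
        for (Future<Long> f : pool.invokeAll(tasks)) total += f.get();
        return total;
    }

    public static void main(String[] args) throws Exception {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads); // created once
        try {
            long state = 0;
            for (int iter = 0; iter < 1000; iter++) {
                long groupResult = runGroup(pool, 64, threads); // parallel part
                state += groupResult % 7;                       // dummy serial part
            }
            System.out.println("final state = " + state);
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this beats the serial version still depends on how much work each batch contains relative to the cost of handing it to another thread, so it is worth measuring with realistic batch sizes.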

EDIT2: Running a test which copies data between threads.

for (int i = 0; i < 20; i++) {
    ExecutorService es = Executors.newFixedThreadPool(1);
    final double[] d = new double[4 * 1024];
    Arrays.fill(d, 1);
    final double[] d2 = new double[4 * 1024];
    es.submit(new Runnable() {
        @Override
        public void run() {
            // nothing.
        }
    }).get();
    long start = System.nanoTime();
    es.submit(new Runnable() {
        @Override
        public void run() {
            synchronized (d) {
                System.arraycopy(d, 0, d2, 0, d.length);
            }
        }
    });
    es.shutdown();
    es.awaitTermination(10, TimeUnit.SECONDS);
    // read all the values in d2.
    for (double x : d2) ;
    long time = System.nanoTime() - start;
    System.out.printf("Time to pass %,d doubles to another thread and back was %,d ns.%n", d.length, time);
}

It starts badly but warms up to about 50 µs.

Time to pass 4,096 doubles to another thread and back was 1,098,045 ns.
Time to pass 4,096 doubles to another thread and back was 171,949 ns.
 ... deleted ...
Time to pass 4,096 doubles to another thread and back was 50,566 ns.
Time to pass 4,096 doubles to another thread and back was 49,937 ns.

Answer 4

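From what I've read ("thousands of independent calculations... occur at the same time... take a few milliseconds to run"), it seems to me that your problem is a perfect fit for GPU programming.

I think that answers your question. GPU programming is becoming more and more popular. There are Java bindings for CUDA and OpenCL. If it is possible for you to use it, I'd say go for it.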

Answer 5

Here is a pseudocode outline of what I'm thinking:

class WorkerThread extends Thread {

    Queue<Calculation> calcs;
    MainCalculator mainCalc;

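    public void run() {
        try {
            while (!isInterrupted()) {
                // Polling with sleep. Busy waiting? Context switching probably won't be so bad.
                while (calcs.isEmpty()) sleep(500);
                Calculation calc = calcs.poll(); // Queue.poll() retrieves and removes the head
                CalculationResult result = calc.calc();
                mainCalc.returnResultFor(calc, result);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // sleep() was interrupted: stop the worker
        }
    }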


}

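Another option, if you're calling external programs: don't launch them in a loop that runs them one at a time, or they won't run in parallel. You can put them in a loop that processes their results one at a time, but not in one that executes them one at a time.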

// getRuntime() is a method call; exec() starts each process without waiting for it
Process calc1 = Runtime.getRuntime().exec("myCalc paramA1 paramA2 paramA3");
Process calc2 = Runtime.getRuntime().exec("myCalc paramB1 paramB2 paramB3");
Process calc3 = Runtime.getRuntime().exec("myCalc paramC1 paramC2 paramC3");
Process calc4 = Runtime.getRuntime().exec("myCalc paramD1 paramD2 paramD3");

calc1.waitFor();
calc2.waitFor();
calc3.waitFor();
calc4.waitFor();

// BufferedReader reads a line with readLine() (nextLine() belongs to Scanner)
InputStream is1 = calc1.getInputStream();
InputStreamReader isr1 = new InputStreamReader(is1);
BufferedReader br1 = new BufferedReader(isr1);
String resultStr1 = br1.readLine();

InputStream is2 = calc2.getInputStream();
InputStreamReader isr2 = new InputStreamReader(is2);
BufferedReader br2 = new BufferedReader(isr2);
String resultStr2 = br2.readLine();

InputStream is3 = calc3.getInputStream();
InputStreamReader isr3 = new InputStreamReader(is3);
BufferedReader br3 = new BufferedReader(isr3);
String resultStr3 = br3.readLine();

InputStream is4 = calc4.getInputStream();
InputStreamReader isr4 = new InputStreamReader(is4);
BufferedReader br4 = new BufferedReader(isr4);
String resultStr4 = br4.readLine();

Answer 6

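Hmm, CachedThreadPool seems to have been created just for your case. It does not recreate threads if you reuse them soon enough, and if you spend a whole minute before needing a new thread, the overhead of thread creation is comparatively negligible.

But you can't expect parallel execution to speed up your calculations unless you can also access the data in parallel. If you use extensive locking, many synchronized methods, and so on, you will spend more on overhead than you gain from parallel processing. Check that your data can be processed efficiently in parallel and that there is no non-obvious synchronization lurking in the code.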
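To illustrate the point about synchronization overhead, the sketch below contrasts two ways of counting in parallel: one where every increment takes a shared lock, and one where each task works on a local variable and only the partial results are combined. The class and method names are invented for this example, and the timing in `main` is a crude measurement rather than a proper benchmark, so no specific numbers are claimed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch (names are illustrative, not from the original post).
class LockContentionDemo {
    private static final Object LOCK = new Object();
    private static long shared;

    // Every increment takes the same lock, so the worker threads serialize on it.
    static long sumWithSharedLock(ExecutorService pool, int tasks, int perTask) throws Exception {
        synchronized (LOCK) { shared = 0; }
        List<Callable<Void>> jobs = new ArrayList<>();
        for (int t = 0; t < tasks; t++) {
            jobs.add(() -> {
                for (int i = 0; i < perTask; i++) {
                    synchronized (LOCK) { shared++; }
                }
                return null;
            });
        }
        for (Future<Void> f : pool.invokeAll(jobs)) f.get();
        synchronized (LOCK) { return shared; }
    }

    // Each task counts in a local variable; partial results are merged once at the end.
    static long sumWithLocalState(ExecutorService pool, int tasks, int perTask) throws Exception {
        List<Callable<Long>> jobs = new ArrayList<>();
        for (int t = 0; t < tasks; t++) {
            jobs.add(() -> {
                long local = 0;
                for (int i = 0; i < perTask; i++) local++;
                return local;
            });
        }
        long total = 0;
        for (Future<Long> f : pool.invokeAll(jobs)) total += f.get();
        return total;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            long t0 = System.nanoTime();
            long a = sumWithSharedLock(pool, 4, 1_000_000);
            long t1 = System.nanoTime();
            long b = sumWithLocalState(pool, 4, 1_000_000);
            long t2 = System.nanoTime();
            System.out.println("shared lock: " + a + " in " + (t1 - t0) / 1000 + " us");
            System.out.println("local state: " + b + " in " + (t2 - t1) / 1000 + " us");
        } finally {
            pool.shutdown();
        }
    }
}
```

Both variants return the same total; the difference lies only in how often the threads have to coordinate, which is the kind of hidden cost worth looking for in the real program.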

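Also, CPUs process data efficiently if the data fits fully into cache. If the data set of each thread is bigger than half the cache, two threads will compete for cache and issue many RAM reads, whereas a single thread, if it only uses one core, may perform better because it avoids RAM reads in the tight loop it executes. Check this, too.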

Is it possible to use multithreading without creating threads over and over again?
