A more efficient way to call GetStringAsync many times?

Asked: 2017-03-08 02:45:14

Tags: c# multithreading task-parallel-library tpl-dataflow

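I have the code below (my URL list is about 1000 URLs), and I am wondering whether there is a more efficient way to request multiple URLs from the same site. I have already changed ServicePointManager.DefaultConnectionLimit.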

Also, is it better to reuse the same HttpClient or to create a new one for every call? The code below uses a single instance rather than several.

using (var client = new HttpClient { Timeout = new TimeSpan(0, 5, 0) })
{
    var tasks = urls.Select(async url =>
    {
        await client.GetStringAsync(url).ContinueWith(response =>
        {
           var resultHtml = response.Result;
           //process the html

        });
    }).ToList();

    Task.WaitAll(tasks.ToArray());
}

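As @cory suggested, here is the modified code using TPL Dataflow. However, I had to set MaxDegreeOfParallelism = 100 to get roughly the same speed as the Task-based version. Can the code below be improved?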

var downloader = new ActionBlock<string>(async url =>
{
    var client = new WebClient();
    var resultHtml = await client.DownloadStringTaskAsync(new Uri(url));


}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 100 });

foreach(var url in urls)
{
    downloader.Post(url);
}
downloader.Complete();
downloader.Completion.Wait();

FINAL

public void DownloadUrlContents(List<string> urls)
{
    var watch = Stopwatch.StartNew();

    var httpClient = new HttpClient();
    var downloader = new ActionBlock<string>(async url =>
    {
        var data = await httpClient.GetStringAsync(url);
    }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 100 });

    Parallel.ForEach(urls, (url) =>
    {
        downloader.SendAsync(url);
    });
    downloader.Complete();
    downloader.Completion.Wait();

    Console.WriteLine($"{MethodBase.GetCurrentMethod().Name} {watch.Elapsed}");    
}

2 Answers:

Answer 0 (score: 2)

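While your code works, it is common practice to put a buffer block in front of the ActionBlock. Why? The first reason is the size of the task queue: with a buffer you can easily balance the number of messages waiting in the queue. The second reason is that adding a message to the buffer is almost instant, and after that it is TPL Dataflow's responsibility to process all of your items: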

// async method here
public async Task DownloadUrlContents(List<string> urls)
{
    var watch = Stopwatch.StartNew();

    var httpClient = new HttpClient();

    // you may limit the buffer size here
    var buffer = new BufferBlock<string>();
    var downloader = new ActionBlock<string>(async url =>
        {
            var data = await httpClient.GetStringAsync(url);
            // handle data here
        },
        // note the use of the processor count here
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount });
    // notify TPL Dataflow to send messages from buffer to loader
    buffer.LinkTo(downloader, new DataflowLinkOptions {PropagateCompletion = true});

    foreach (var url in urls)
    {
        // do await here
        await buffer.SendAsync(url);
    }
    // queue is done
    buffer.Complete();

    // now it's safe to wait for completion of the downloader
    await downloader.Completion;

    // nameof is used because MethodBase.GetCurrentMethod().Name returns "MoveNext" inside an async method
    Console.WriteLine($"{nameof(DownloadUrlContents)} {watch.Elapsed}");
}
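
The comment "you may limit the buffer size here" can be made concrete. The following is only a sketch under a few assumptions: the class name BoundedDownloader, the method RunAsync, and the parameters fetch, maxParallelism and bufferCapacity are all hypothetical names introduced for illustration, and fetch stands in for httpClient.GetStringAsync so the pipeline can be tried without network access. It relies on the BoundedCapacity option of TPL Dataflow blocks.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BoundedDownloader
{
    // fetch is a hypothetical delegate standing in for httpClient.GetStringAsync.
    // Returns the number of URLs that were processed.
    public static async Task<int> RunAsync(
        IEnumerable<string> urls,
        Func<string, Task<string>> fetch,
        int maxParallelism,
        int bufferCapacity)
    {
        var processed = 0;

        // BoundedCapacity caps how many URLs may wait in the buffer at once.
        var buffer = new BufferBlock<string>(
            new DataflowBlockOptions { BoundedCapacity = bufferCapacity });

        var downloader = new ActionBlock<string>(async url =>
            {
                var data = await fetch(url);
                // handle data here
                Interlocked.Increment(ref processed);
            },
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = maxParallelism,
                // Bound the downloader as well; an unbounded ActionBlock would
                // drain the buffer immediately and the buffer limit would have no effect.
                BoundedCapacity = maxParallelism
            });

        buffer.LinkTo(downloader, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var url in urls)
        {
            // Completes only once the buffer has room, which provides back-pressure.
            await buffer.SendAsync(url);
        }

        buffer.Complete();
        await downloader.Completion;
        return processed;
    }
}
```

A call might look like `await BoundedDownloader.RunAsync(urls, httpClient.GetStringAsync, Environment.ProcessorCount, 200);`, where 200 is an arbitrary capacity chosen for the example. With about 1000 URLs an unbounded buffer is cheap, so bounding mainly pays off when the input is much larger or is itself produced lazily.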

Answer 1 (score: 0)

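Basically, reusing the HttpClient is better, because you do not have to authenticate every time you send a request and you can keep session state in cookies, unless you initialize it with the token/cookies each time you create it. Beyond that, it all comes down to the ServicePoint, where you can set the maximum number of concurrent connections allowed.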

To execute the calls in parallel in a more maintainable way, I suggest using the AsyncEnumerator NuGet package, which lets you write code like this:

using System.Collections.Async;

await uris.ParallelForEachAsync(
    async uri =>
    {
        var html = await httpClient.GetStringAsync(uri, cancellationToken);
        // process HTML
    },
    maxDegreeOfParallelism: 5,
    breakLoopOnException: false,
    cancellationToken: cancellationToken);
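
If adding a package is not an option, the same idea (one shared HttpClient plus a cap on concurrent requests) can be approximated with SemaphoreSlim and Task.WhenAll. This is a minimal sketch, not the package's implementation: ThrottledFetcher, FetchAllAsync, FetchWithSharedClient and the fetch and maxConcurrency parameters are hypothetical names, and the five-minute timeout is simply carried over from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledFetcher
{
    // One HttpClient shared by every request, so connections are pooled and reused.
    private static readonly HttpClient SharedClient =
        new HttpClient { Timeout = TimeSpan.FromMinutes(5) };

    // Default fetch function that goes through the shared client.
    public static Task<string> FetchWithSharedClient(string url) =>
        SharedClient.GetStringAsync(url);

    // Runs fetch for every URL with at most maxConcurrency calls in flight.
    // Results come back in the same order as the input URLs.
    public static async Task<string[]> FetchAllAsync(
        IEnumerable<string> urls,
        Func<string, Task<string>> fetch,
        int maxConcurrency)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = urls.Select(async url =>
            {
                // Wait for a free slot before starting the request.
                await gate.WaitAsync();
                try { return await fetch(url); }
                finally { gate.Release(); }
            }).ToList();

            return await Task.WhenAll(tasks);
        }
    }
}
```

A call might look like `var pages = await ThrottledFetcher.FetchAllAsync(urls, ThrottledFetcher.FetchWithSharedClient, 20);`, where 20 is an arbitrary limit for illustration. Unlike breakLoopOnException: false above, this sketch lets the first failed request surface as an exception from Task.WhenAll, so per-URL error handling would have to go inside the fetch delegate.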