Question

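I am working on a community detection project with Twitter data, in which I need to build a network based on relationships. I have collected and filtered 200,000 UIDs. My next step is to create a friend/follower network among them.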

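I use Ruby scripts and the Twitter gem to collect, process, and store the data. To get around the API call limit, I use the Apigee proxy, so rate limiting is not a problem for now.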

The call for getting the relationship status between two UIDs is documented at: https://dev.twitter.com/docs/api/1/get/friendships/show

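I need to speed up the data collection process. Currently I have many scripts running at the same time in my terminal. I find this approach hard to manage and scale. Is there a faster, more efficient, and more manageable way to do the same thing? Or is there a completely different and better approach that I am missing?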

Answer 1

You could try using Nokogiri to parse the HTML page at https://twitter.com/#!/USERNAME/followers

Answer 2

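One thing I can think of is to use an EC2 instance and deploy the scripts there; you could get the largest instance and use it for a few hours. One benefit is that you get a more powerful machine and a faster internet connection.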

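Also, if you are only collecting public data, which means you do not have to authenticate via OAuth (correct me if I am wrong), I would use a Perl or Python script, which is faster than Ruby with a gem.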

Answer 3

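Why not use Logstash to collect the data? Logstash gives you plenty of options for where to send the data, so you can easily filter it. You can even filter all of the data through Logstash before sending it to an output. The available outputs include Elasticsearch (for real-time search, analytics, and visualization), databases (MySQL, MSSQL, and so on), and many more.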

Logstash - https://www.elastic.co/products/logstash

Twitter Logstash plugin - https://www.elastic.co/guide/en/logstash/current/plugins-inputs-twitter.html

Answer 4

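Wrap the script using threads

You may only need a threaded bash or Python wrapper script. The wrapper splits up the work and invokes your script automatically. The advantage is that you can use it without rewriting much. Under the assumptions below, this could cut the running time from about 111 hours to about 1.1 hours.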

Say your current solution looks like this:

file_of_200k_uids.txt
ruby ruby_script.rb "file_of_200k_uids.txt"

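So ruby_script.rb iterates over all 200K UIDs and performs a network task for each one, taking, say, 2 seconds each, which adds up to 400,000 seconds (about 111 hours).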
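The arithmetic behind these figures can be checked with a few lines of Ruby. This is only a rough estimate: the 2 seconds per UID and the 100 parallel workers are the assumptions used in this answer, and the variable names are purely illustrative.

```ruby
# Back-of-the-envelope runtime estimate using the figures from this answer.
uids            = 200_000 # number of UIDs to process
seconds_per_uid = 2       # assumed cost of one network task
workers         = 100     # assumed number of parallel script instances

serial_seconds = uids * seconds_per_uid
serial_hours   = serial_seconds / 3600.0
parallel_hours = serial_hours / workers # ideal case, ignores overhead

puts format("serial:   %d seconds (%.1f hours)", serial_seconds, serial_hours)
puts format("parallel: %.1f hours with %d workers", parallel_hours, workers)
```

The ideal speedup is linear in the number of workers; in practice, network bandwidth and the API or proxy on the other end will probably cap it somewhere below that.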

Proposed solution (a wrapper thread manager written in Bash 4+):

file_of_200k_uids.txt
ruby ruby_script.rb "file_of_200k_uids.txt"
bash_thread_manager.sh

The contents of bash_thread_manager.sh would be as follows:

#!/usr/bin/env bash
# -- Step one: have the bash script break down the large file --
# and place the results in /path/to/folder.
# -C keeps whole lines together, so no UID is cut in half. If the UID list
# is smaller than 10 MB, split by line count instead (e.g. -l 2000).
split -d -C 10M file_of_200k_uids.txt /path/to/folder/uids_list_

# -- Now run through the folder and launch the script you need to do the work --
# -- it will create instances of your script up to a max number (e.g. 100) --
max_children=100
for filename in /path/to/folder/uids_list_*; do

    # Wait until fewer than max_children background jobs are running,
    # so that no chunk is skipped while the pool is full.
    while [[ $(jobs -rp | wc -l) -ge $max_children ]]; do
        sleep 60
    done

    # Name the output after the chunk so that results never collide.
    ruby ruby_script.rb "$filename" > "/output/result-$(basename "$filename").txt" &

done
wait
# -- Final step: combine all of the result files --
cat /output/result-*.txt >> all.txt

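The bash script manages feeding the UIDs from the files to your script and collects the data in separate background processes (the "threads" here), up to the number you define. In the example above, we split file_of_200k_uids.txt into smaller files of at most 10 MB each, and the bash script then runs up to 100 of those files at a time. Whenever the number of running instances drops below 100, it brings it back up to 100. That way you can speed things up by as much as 100x.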
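The same bounded-worker idea can also be sketched inside a single Ruby process with a thread pool, which avoids juggling many terminals. This is not the approach from this answer but an alternative built on Ruby's core Queue and Thread classes. The names fetch_relationship, process_uids, and worker_count are hypothetical, and fetch_relationship is only a stand-in for the real network call. Threads tend to help here because the work is mostly waiting on the network.

```ruby
# Hypothetical stand-in for the real per-UID network call
# (for example, a lookup through the Twitter gem).
def fetch_relationship(uid)
  "#{uid}:ok"
end

# Process all UIDs with at most worker_count threads running at once.
def process_uids(uids, worker_count: 100)
  queue = Queue.new
  uids.each { |uid| queue << uid }
  queue.close # once closed and drained, pop returns nil (Ruby 2.3+)

  results = Queue.new # Queue is thread-safe, so no extra locking is needed
  workers = Array.new(worker_count) do
    Thread.new do
      while (uid = queue.pop)
        results << fetch_relationship(uid)
      end
    end
  end
  workers.each(&:join)

  # Drain the results queue into a plain array.
  Array.new(results.size) { results.pop }
end

sample = %w[101 102 103 104 105]
output = process_uids(sample, worker_count: 3)
puts output.sort
```

Results arrive in completion order rather than input order, which is why the usage example sorts them before printing.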

Further reading: https://linoxide.com/linux-how-to/split-large-text-file-smaller-files-linux/ and "Multithreading in Bash"

Is there a better way to collect Twitter data?
