Multiprocessing RuntimeError in a Python script

Asked: 2021-03-31 14:55:37

Tags: python web-scraping multiprocessing

When I run my code I get the following error:

...
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
...

I am running the script like this:

if __name__ == '__main__':
    object_ = NameToLinkedInScraper(csv_file_name='people_small.csv', person_name_column='company', organization_name_column='email')
    object_.multiproc_job()

And the class:

class NameToLinkedInScraper:
    pool = Pool()

    proxy_list = PROXY_LIST
    csv_file_name = None
    person_name_column = None
    organization_name_column = None
    find_facebook = False
    find_twitter = False

    def __init__(self, csv_file_name, person_name_column, organization_name_column, find_facebook=False,
                 find_twitter=False):
        self.csv_file_name: str
        self.person_name_column: str
        self.organization_name_column: str
        self.df = pd.read_csv(csv_file_name)

    def internal_linkedin_job(self, _df):
        _df['linkedin_profile'] = np.nan
        _df['linkedin_profile'] = _df.apply(
            lambda row: term_scraper(
                str(row[self.person_name_column]) + " " + str(row[self.organization_name_column]), self.proxy_list,
                'link', output_generic=False), axis=1)

    def internal_generic_linkedin_job(self, _df):
        _df['linkedin_generic'] = np.nan
        _df['linkedin_generic'] = _df.apply(
            lambda row: term_scraper(
                str(row[self.person_name_column]) + " " + str(row[self.organization_name_column]), self.proxy_list,
                'link', output_generic=True), axis=1)

    def internal_facebook_twitter_job(self, _df):
        _df['title'] = np.nan
        _df['title'] = _df.apply(
            lambda row: term_scraper(
                str(row[self.person_name_column]) + " " + str(row[self.organization_name_column]), self.proxy_list,
                'title'), axis=1)
        if self.find_facebook:
            _df['facebook_profile'] = np.nan
            _df['facebook_profile'] = _df.apply(
                lambda row: term_scraper(
                    str(row[self.person_name_column]) + " " + str(row[self.organization_name_column]), self.proxy_list,
                    'link', output_generic=False, social_network='facebook'), axis=1)
        if self.find_twitter:
            _df['twitter_profile'] = np.nan
            _df['twitter_profile'] = _df.apply(
                lambda row: term_scraper(
                    str(row[self.person_name_column]) + " " + str(row[self.organization_name_column]), self.proxy_list,
                    'link', output_generic=False, social_network='twitter'), axis=1)

    def multiproc_job(self):
        linkedin_profile_proc = Process(target=self.internal_linkedin_job, args=self.df)
        linkedin_generic_profile_proc = Process(target=self.internal_generic_linkedin_job, args=self.df)
        internal_facebook_twitter_job = Process(target=self.internal_facebook_twitter_job, args=self.df)
        jobs = [linkedin_profile_proc, linkedin_generic_profile_proc, internal_facebook_twitter_job]
        for j in jobs:
            j.start()

        for j in jobs:
            j.join()
        self.df.to_csv(sys.path[0] + "\\" + self.csv_file_name + "_" + ".csv")

I can't figure out what's wrong. The script is running on Windows and I couldn't find an answer. I tried adding freeze_support() to main, with no success, and I also tried moving the process creation and job assignment from the class into main.

1 Answer:

Answer 0 (score: 2):

By creating pool as a class attribute, it gets executed while NameToLinkedInScraper is being defined during import (the "main" file is imported by the children so they have access to all the same classes and functions). If this were allowed, it would recursively keep creating more children, which would then import the same file and create more children of their own. That is why spawning child processes is disabled while a module is being imported rather than run as __main__. You should only call Pool() inside __init__, so that new child processes are created only when you create an instance of the class. In general, class attributes (as opposed to instance attributes) should be avoided unless the data is static or needs to be shared among all instances of the class.
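A minimal sketch of that fix, reusing the question's own names (it assumes the internal_* methods, PROXY_LIST, and pandas as pd from the question are available). Note that args=(self.df,) is an additional correction, not part of the original code: Process expects args to be a tuple of arguments, not a bare DataFrame.

from multiprocessing import Pool, Process
import pandas as pd

class NameToLinkedInScraper:
    # keep only truly static/shared data at class level
    proxy_list = PROXY_LIST

    def __init__(self, csv_file_name, person_name_column, organization_name_column,
                 find_facebook=False, find_twitter=False):
        # actually assign the parameters (the original bare annotations never stored them)
        self.csv_file_name = csv_file_name
        self.person_name_column = person_name_column
        self.organization_name_column = organization_name_column
        self.find_facebook = find_facebook
        self.find_twitter = find_twitter
        self.df = pd.read_csv(csv_file_name)
        # the Pool is created per instance, so workers are spawned only when the
        # class is instantiated in the parent, inside the __main__ guard below
        self.pool = Pool()

    def multiproc_job(self):
        # args must be a tuple, i.e. (self.df,) rather than self.df
        jobs = [
            Process(target=self.internal_linkedin_job, args=(self.df,)),
            Process(target=self.internal_generic_linkedin_job, args=(self.df,)),
            Process(target=self.internal_facebook_twitter_job, args=(self.df,)),
        ]
        for j in jobs:
            j.start()
        for j in jobs:
            j.join()

if __name__ == '__main__':
    object_ = NameToLinkedInScraper(csv_file_name='people_small.csv',
                                    person_name_column='company',
                                    organization_name_column='email')
    object_.multiproc_job()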
