Don't remove extra newlines when preprocessing crawled text

Time: 2020-09-02 06:54:49

Tags: nutch

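When I crawl with Nutch, it removes all the extra newlines from the crawled text. I want to keep the text together with every newline that appears on the website. For example, when crawling the page https://www.modernfamilydental.net/, the expected output is: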

\n\n\n\nSan Francisco, CA Dentist\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWould you like to switch to the accessible version of this site?\nGo to accessible site\n\nClose modal window\n\n\n\n\n\nDon\'t need the accessible version of this site?\nHide the accessibility button\n\nClose modal window\n\n\n\n\n\n\nAccessibility View\n\n\nClose toolbar\n\n\n\n\nJavascript must be enabled for the correct page display\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nModern Family Dental Hao Tran, DMD\nDentist located in Laurel Heights, San Francisco, CA\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n(415) 752-5244\n\n\n \n\n\n\n\n\n\n\n\n\nMenu\n\n\n\n\nHome\n\n\nServices\n \nLatest Equipment\n\n\nInsurance\n\n\nTeeth Whitening\n\n\nCrowns & Bridges\n\n\nSmile Makeovers\n\n\nResin Composite Bonding\n\n\nVeneers\n\n\nImplant Retained Dentures\n\n\nNight Guards\n\n\nMetal-Free Restoration\n\n\nInvisalign\n\n\nDental Examination

But the output from Nutch is:

San Francisco, CA Dentist\nWould you like to switch to the accessible version of this site?\nGo to accessible site\nClose modal window\nDon\'t need the accessible version of this site?\nHide the accessibility button\n\nClose modal window\nAccessibility View\n\n\nClose toolbar\n\n\n\n\nJavascript must be enabled for the correct page display\nModern Family Dental Hao Tran, DMD\nDentist located in Laurel Heights, San Francisco, CA\n(415) 752-5244\nMenu\nHome\nServices\nLatest Equipment\nInsurance\nTeeth Whitening\nCrowns & Bridges\nSmile Makeovers\n\n\nResin Composite Bonding\nVeneers\nImplant Retained Dentures\nNight Guards\nMetal-Free Restoration\nInvisalign\nDental Examination

What changes do I need to make in the configuration files?

P.S. Please forgive me if my question is silly; I am new to this.

0 answers:

No answers