BeautifulSoup freaks out when extracting a movie script

Date: 2019-02-08 01:23:03

Tags: python beautifulsoup

I'm trying to get a movie script as text from this website. It works fine up to a certain point, until the text turns into this:

5   .   

   /   b   >   



                   T   H   E       W   A   L   L   S       C   O   M   E       A   L   I   V   E   !       A       s   e   e   m   i   n   g   l   y       i   n   f   i   n   i   t   e       s   w   a   r   m       o   f       F   I   R   E   

                   D   E   M   O   N   S       r   a   l   l   y       t   o       S   u   r   t   u   r   '   s       a   i   d   .   

Here is my code:

import requests
from bs4 import BeautifulSoup

website_url = requests.get("https://www.imsdb.com/scripts/Thor-Ragnarok.html").text
soup = BeautifulSoup(website_url, "lxml")
text = soup.pre

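When I print text, it shows the expected output up to section 5. After that, I get the annoying spaced-out text shown above...

Any ideas about why this happens and how to fix it?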

2 answers:

Answer 0 (score: 0)

I used 'html.parser' instead of 'lxml' and was able to display the whole script in the correct format:

import requests
from bs4 import BeautifulSoup

website_url = requests.get("https://www.imsdb.com/scripts/Thor-Ragnarok.html").text
soup = BeautifulSoup(website_url, 'html.parser')
text = soup.pre

That is, the start of section 5 is displayed as:

<b>                           BLUE DRAFT 05/20/16                   5.
</b>

    THE WALLS COME ALIVE! A seemingly infinite swarm of FIRE
    DEMONS rally to Surtur's aid.

<b>                         THOR
</b>               I make grave mistakes all the time.
               Everything seems to work out.

    In the shadows, a massive FIRE DRAGON ROARS.

    The fire demons SURGE FORWARD. Thor backs up, HAMMERING
    AWAY. He then leaps back, SPRINGBOARDS off the wall, and-
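
If you want the plain script text rather than the tag itself, something like the following might help. It is only a minimal sketch: extract_pre_text is a hypothetical helper name, and the sample string is a made-up stand-in for the real page, so it runs without a network call. It assumes the script lives in the first <pre> element, as in the code above.

```python
from bs4 import BeautifulSoup


def extract_pre_text(html, parser="html.parser"):
    """Return the text of the first <pre> element, or None if there is none.

    Hypothetical helper: the parser defaults to the built-in 'html.parser',
    which is the one that produced clean output in this answer.
    """
    soup = BeautifulSoup(html, parser)
    pre = soup.pre
    if pre is None:
        return None
    # get_text() drops the <b> tags and keeps only the text content.
    return pre.get_text()


# Made-up sample standing in for the downloaded page.
sample = (
    "<html><body><pre><b>THOR</b>\n"
    "I make grave mistakes all the time.</pre></body></html>"
)
print(extract_pre_text(sample))
```

For the real page you would pass requests.get(...).text as html. Keeping the parser as an argument makes it easy to compare 'html.parser' and 'lxml' on the same input.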

Answer 1 (score: 0)

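Strange... I tried the original code on my computer and could not reproduce the spacing problem you describe. I have lxml 4.3.0, bs4 4.7.1, and Python 3.7.1. Which versions do you have?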
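
One way to collect those version numbers is sketched below. It assumes Python 3.8 or newer, since it relies on the standard-library importlib.metadata module, and it assumes the packages were installed under the distribution names beautifulsoup4, lxml, and requests. report_versions is a hypothetical helper name.

```python
import sys
from importlib import metadata


def report_versions(packages=("beautifulsoup4", "lxml", "requests")):
    """Map 'python' and each distribution name to its installed version.

    A package that is not installed maps to None instead of raising.
    """
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Posting this output alongside the question makes it easier to tell whether the difference comes from the parser version or from something else.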