I'm trying to get a movie script as text from this website. It works fine up to a point, after which the text turns into this:
5 .
/ b >
T H E W A L L S C O M E A L I V E ! A s e e m i n g l y i n f i n i t e s w a r m o f F I R E
D E M O N S r a l l y t o S u r t u r ' s a i d .
Here is my code:
import requests
from bs4 import BeautifulSoup
website_url = requests.get("https://www.imsdb.com/scripts/Thor-Ragnarok.html").text
soup = BeautifulSoup(website_url, "lxml")
text = soup.pre
When I print text, it shows the expected output up until section 5. After that, I get the annoying spaced-out text shown above...
Any ideas on why this is happening and how to fix it?
Answer 0: (score: 0)
I used 'html.parser' instead of 'lxml' and was able to display the whole script in the correct format:
import requests
from bs4 import BeautifulSoup
website_url = requests.get("https://www.imsdb.com/scripts/Thor-Ragnarok.html").text
soup = BeautifulSoup(website_url, 'html.parser')
text = soup.pre
That is, the beginning of section 5 displays as:
<b> BLUE DRAFT 05/20/16 5.
</b>
THE WALLS COME ALIVE! A seemingly infinite swarm of FIRE
DEMONS rally to Surtur's aid.
<b> THOR
</b> I make grave mistakes all the time.
Everything seems to work out.
In the shadows, a massive FIRE DRAGON ROARS.
The fire demons SURGE FORWARD. Thor backs up, HAMMERING
AWAY. He then leaps back, SPRINGBOARDS off the wall, and-
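If you want plain text rather than the tag object (note the <b> tags still present in the output above), here is a minimal sketch of one way to do it. The helper name extract_pre_text and the inline sample string are hypothetical, made up for illustration so the snippet runs without a network request; in practice you would pass in the HTML returned by requests.

```python
from bs4 import BeautifulSoup


def extract_pre_text(html, parser="html.parser"):
    """Return the text of the first <pre> element, with tags stripped.

    `parser` defaults to the built-in 'html.parser', as suggested above.
    Returns an empty string if the document has no <pre> element.
    """
    soup = BeautifulSoup(html, parser)
    pre = soup.pre
    if pre is None:
        return ""
    # get_text() drops the <b>...</b> markup but keeps the text inside it
    return pre.get_text()


# Hypothetical stand-in for the downloaded page
sample = (
    "<html><body><pre><b>  THOR\n"
    "</b>  I make grave mistakes all the time.\n"
    "</pre></body></html>"
)

print(extract_pre_text(sample))
```

Passing parser="lxml" to the same helper should make it easy to compare the two parsers on the real page, assuming lxml is installed.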
Answer 1: (score: 0)
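Strange... I tried the original code on my machine but could not reproduce the spacing issue you describe. I have lxml 4.3.0, bs4 4.7.1, and Python 3.7.1. What versions do you have?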
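In case it helps with comparing setups, here is a small sketch for printing the relevant versions. It assumes Python 3.8 or newer (importlib.metadata is not available in 3.7), and the helper name package_version is just something made up for this example.

```python
import sys
from importlib import metadata


def package_version(name):
    """Return the installed version of a distribution, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


print("python", sys.version.split()[0])
# Distribution names as published on PyPI (bs4 is distributed as 'beautifulsoup4')
for name in ("beautifulsoup4", "lxml", "requests"):
    print(name, package_version(name))
```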