Question

I'm having trouble extracting Chinese text and writing it to a file.

str = "全球紧张致富豪财富缩水 贝索斯丁磊分列跌幅前两位";
f=open('test.txt','w');
f.write(str);

The code above works fine. The code below, however, writes garbled text to the file.

import requests;
from bs4 import BeautifulSoup

f=open('data.txt','w');

def techSinaCrawler():
    url="http://tech.sina.com.cn/"
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for li in soup.findAll('li',{'data-sudaclick': 'yaowenlist-1'}):
        for link in li.findAll('a'):
            href = link.get('href')
            techSinaInsideLinkCrawler(href);            

def techSinaInsideLinkCrawler(url):

    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for data in soup.findAll('h1',{'id': 'main_title'}):
        str='main_title'+':'+ data.string
        f.write(str);
        f.write('\n');

techSinaCrawler();

Thanks for your help.

Answer 1

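In Python 2, if you're dealing with encodings other than ASCII, it's a good idea to use codecs.open(). That way you don't need to manually encode everything you write. Also, if you expect non-ASCII characters in file names, you should pass os.walk() a Unicode string: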

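import codecs
import os

# codecs.open() encodes every unicode string written to the file as UTF-8.
with codecs.open("c:/Users/me/filename.txt", "a", encoding="utf-8") as d:
    # A unicode start path makes os.walk() yield unicode names (Python 2).
    for dirpath, subdirs, files in os.walk(u"c:/temp"):
        for f in files:
            fname = os.path.join(dirpath, f)
            print fname
            d.write(fname + "\n")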

There's no need to call d.close(); the with block already takes care of that.
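
For readers on Python 3, a rough equivalent is to pass an explicit encoding to the built-in open(). This is only a sketch: write_titles is a hypothetical helper, not part of the question's code, and the "main_title:" prefix simply mirrors what the question writes.

```python
import os
import tempfile


def write_titles(path, titles):
    """Write one 'main_title:<title>' line per title, encoded as UTF-8."""
    # An explicit encoding avoids depending on the platform default,
    # which is not UTF-8 on every system.
    with open(path, "w", encoding="utf-8") as out:
        for title in titles:
            out.write("main_title:" + title + "\n")


if __name__ == "__main__":
    # Write to a throwaway directory so the example has no side effects.
    path = os.path.join(tempfile.mkdtemp(), "data.txt")
    write_titles(path, ["全球紧张致富豪财富缩水 贝索斯丁磊分列跌幅前两位"])
    with open(path, encoding="utf-8") as f:
        print(f.read())
```

As in the codecs.open() version, the with block closes the file, so no explicit close() call is needed.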

Answer 2

Solved.

I just changed .text to .content:

plain_text = source_code.text to plain_text = source_code.content

The output is now written as Chinese text, which is the result I wanted.
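
A likely explanation for why this works: .text is decoded by requests using the charset from the HTTP headers, and when a text response declares no charset, requests falls back to ISO-8859-1. .content is the raw bytes, so BeautifulSoup can detect the encoding itself. The sketch below reproduces the effect without any network access. It assumes the page is UTF-8, and simulate_text is a hypothetical stand-in for what .text does, not a requests function.

```python
def simulate_text(raw, header_charset=None):
    """Decode raw bytes roughly the way a response's text attribute would.

    Assumption: with no charset in the headers, the fallback is ISO-8859-1.
    """
    return raw.decode(header_charset or "iso-8859-1")


title = "全球紧张致富豪财富缩水"
raw = title.encode("utf-8")  # the bytes a UTF-8 page would send

garbled = simulate_text(raw)            # wrong charset: mojibake
restored = simulate_text(raw, "utf-8")  # right charset: original text

print(garbled == title)
print(restored == title)
```

An alternative to switching to .content is to set source_code.encoding = "utf-8" before reading .text, provided you know the page's real encoding.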
