Reading Russian text from a .dta file in Python

Asked: 2018-04-14 21:37:49

Tags: python character-encoding

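I am trying to read Russian characters from a .dta file, and when I print them to the terminal I see the following: Яномамо

This does not look like Russian, so I am wondering what I have to do to get the actual Russian characters. I tried the following and it did not work: ex = row['name_rus'].encode("cp1251")

I got: UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>

Any guidance would be greatly appreciated! In case it helps, I have attached my script so far.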

import sys

import pandas as pd


def load_large_dta(fname):
    """Read a Stata .dta file in chunks of 100,000 rows into one DataFrame."""
    reader = pd.read_stata(fname, iterator=True)
    chunks = []

    try:
        chunk = reader.get_chunk(100 * 1000)
        while len(chunk) > 0:
            # Collect the chunks and concatenate once at the end;
            # DataFrame.append was removed in pandas 2.0.
            chunks.append(chunk)
            chunk = reader.get_chunk(100 * 1000)
            print('.')
            sys.stdout.flush()
    except (StopIteration, KeyboardInterrupt):
        pass

    df = pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()
    print('\nloaded {} rows'.format(len(df)))

    return df


def store_oid(df1, df2, df3):
    mapping = {}
    to_write = {}
    # Map each id in df1 to its row index.
    for index, row in df1.iterrows():
        russian = row['id']
        mapping[russian] = index
    # Split each comma-separated name list and record the oid of known words.
    for index, row in df2.iterrows():
        russian_words = row['name_rus']
        russian_words = [x.strip() for x in russian_words.split(',')]
        for word in russian_words:
            if word in mapping:
                oid = int(row['oid'])
                to_write[word] = oid
    for index, row in df3.iterrows():
        # This is the line that raises the UnicodeEncodeError.
        ex = row['name_rus'].encode("cp1251")
        print(ex)


def main():
    long_words = load_large_dta('russian_english_names.dta')
    single_words = load_large_dta('Ruthenia2_duplicates.dta')
    tradition_english = load_large_dta('tradition_english_russian_EA_Augu16th.dta')
    store_oid(single_words, long_words, tradition_english)


if __name__ == '__main__':
    main()

1 Answer:

Answer 0: (score: 0)

The problem is with the terminal, not Python. Change your terminal to use UTF-8 encoding.
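The garbled output in the question has the shape that UTF-8 bytes take when they are decoded as Latin-1, and the UnicodeEncodeError suggests that the string held in Python may already contain those Latin-1 characters. If switching the terminal to UTF-8 is not enough, one thing to try is reversing that decoding. The following is only a minimal sketch: the helper name `repair_mojibake` is hypothetical, and the assumption that the column was decoded as Latin-1 is not confirmed anywhere in the question.

```python
import sys


def repair_mojibake(text, wrong="latin-1", right="utf-8"):
    """Undo a wrong decode: turn text back into bytes, then decode as UTF-8.

    Assumption: the text was produced by decoding UTF-8 bytes as Latin-1.
    If either step fails, the text is returned unchanged.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Shows which encoding Python uses when printing to this terminal.
print(sys.stdout.encoding)
```

If the assumption holds for this file, the helper could be applied to the whole column with something like `df['name_rus'].map(repair_mojibake)` before printing. Strings that are already correct Cyrillic cannot be encoded as Latin-1, so they pass through unchanged.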