Decoding special characters

Time: 2017-06-13 16:02:01

Tags: python csv pandas dataframe character-encoding

I can't get special characters to display correctly, either with print or in a histogram.

import pandas as pd

def class_data():
    # Load the word list and cast every column to str
    df = pd.read_csv('words.csv', sep=',')
    df = df.astype(str)
    # Replace accented letters with their unaccented equivalents
    df = df.replace(['é', 'è', 'È', 'É'], 'e', regex=True)
    df = df.replace(['à', 'â', 'À'], 'a', regex=True)
    df.manual_raw_value = df.manual_raw_value.str.lower()
    return df

df = class_data()

# Set of distinct characters across all values
classes = set(df.manual_raw_value.apply(list).sum())
print("number of classes is ", len(classes))
print("classes are ", classes)

# histogram
pd.Series(list(df.manual_raw_value.str.cat())).value_counts().plot(kind="bar")

I get:

('number of classes is ', 73)

and the classes:

('classes are ', set(['\x82', '\x87', '*', '\xac', '\xaf', '\xae', '>', '!', ' ', '"', '%', "'", '\xb0', ')', '(', '+', '\xaa', '-', ',', '/', '.', '1', '0', '3', '2', '5', '4', '7', '6', '9', '8', '\xbb', ':', '=', '?', '\xb4', '@', '\xc3', '\xc2', '\xa7', '\xa1', '\xb9', '\xe2', '_', 'a', '&', 'c', 'b', 'e', 'd', 'g', 'f', 'i', 'h', 'k', 'j', 'm', 'l', 'o', 'n', 'q', 'p', 's', 'r', 'u', 't', 'w', 'v', 'y', 'x', 'z', '\xab', '\x94']))
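The escapes in this set look like individual bytes of multi-byte UTF-8 sequences rather than whole characters: '\xc3', '\xc2' and '\xe2' are typical lead bytes, and the rest are continuation bytes. A minimal sketch (Python 3, standard library only) that pairs a few of them back up, assuming words.csv is UTF-8 encoded, which the question does not confirm:

```python
# Assumption: the file is UTF-8, so each accented character spans 2-3 bytes.
lead = b"\xc3"
# A few of the single bytes that appear in the printed set.
continuations = [b"\xa7", b"\xa1", b"\xb4", b"\xaa", b"\xab", b"\x94", b"\x87"]

# Joining the lead byte with each continuation byte and decoding
# recovers one real character per pair.
decoded = [(lead + c).decode("utf-8") for c in continuations]
print(decoded)

# Longer or differently prefixed sequences work the same way.
euro = b"\xe2\x82\xac".decode("utf-8")
degree = b"\xc2\xb0".decode("utf-8")
print(euro, degree)
```

If that assumption holds, the count of 73 classes is inflated, because each accented letter contributes its separate bytes instead of one class.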

Here is the histogram. Why do I get `?` in the histogram in place of the special characters?

Likewise, with print("classes are ", classes) I get '\xab', '\x94' for the special characters. How can I display the appropriate characters? Is it related to encoding?
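This does look encoding-related. The tuple-style print output suggests Python 2, where str holds bytes, so splitting a value into a list yields one item per byte. A rough sketch of the usual remedy, written for Python 3 with the standard library only: decode to text first, then split into characters. The sample bytes are a hypothetical stand-in for one CSV cell, and UTF-8 is an assumption about the file.

```python
import unicodedata
from collections import Counter

# Hypothetical sample: the UTF-8 bytes of "café élève à Noël".
raw = b"caf\xc3\xa9 \xc3\xa9l\xc3\xa8ve \xc3\xa0 No\xc3\xabl"

# Splitting undecoded bytes gives one "class" per byte, including
# fragments such as b'\xc3' and b'\xa9'.
byte_classes = {raw[i:i + 1] for i in range(len(raw))}

# Decoding first gives one class per real character.
text = raw.decode("utf-8")
char_counts = Counter(text.lower())

def strip_accents(s):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

plain = strip_accents(text).lower()
print(sorted(char_counts))
print(plain)
```

With pandas, the equivalent step would be passing encoding='utf-8' to pd.read_csv so the column holds decoded text before any per-character processing. Stripping accents through Unicode normalization is a swapped-in alternative to listing each accented letter in df.replace by hand, not the approach used in the question.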

0 answers:

No answers yet