How to count the number of Chinese, Korean, and English words

Time: 2018-03-08 02:39:37

Tags: python word

My sentences mix Chinese, Korean, and English words. I used the len() function in Python, but it gives me the wrong answer. For example, take the string

a = '여보세요,我是Jason. Nice to meet you☺❤'

The correct word count (excluding punctuation) is 13, but len(a) = 32.

How can I count the words correctly?

Many thanks.

2 answers:

Answer 0: (score: 2)

You can take a look here. I strip the Chinese punctuation and also count the emoji.

import re
import emoji
IDEOGRAPHIC_SPACE = 0x3000

def is_asian(char):
    """Is the character Asian?"""
    return ord(char) > IDEOGRAPHIC_SPACE

def filter_jchars(c):
    """Filters Asian characters to spaces"""
    if is_asian(c):
        return ' '
    return c

def nonj_len(word):
    u"""Returns number of non-Asian words in {word}
    – 日本語AアジアンB -> 2
    – hello -> 1
    @param word: A word, possibly containing Asian characters
    """
    # Here are the steps:
    # 日spam本eggs
    # -> [' ', 's', 'p', 'a', 'm', ' ', 'e', 'g', 'g', 's']
    # -> ' spam eggs'
    # -> ['spam', 'eggs']
    # The length of which is 2!
    chars = [filter_jchars(c) for c in word]
    return len(''.join(chars).split())

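def emoji_count(text):
    """Count the emoji characters in text"""
    # Note: emoji.UNICODE_EMOJI was removed in newer releases of the emoji
    # package (2.0 and later), which provide emoji.EMOJI_DATA instead.
    return len([i for i in text if i in emoji.UNICODE_EMOJI])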

def get_wordcount(text):
    """Get the word/character count for text

    @param text: The text of the segment
    """

    characters = len(text)
    chars_no_spaces = sum([not x.isspace() for x in text])
    asian_chars = sum([is_asian(x) for x in text])
    non_asian_words = nonj_len(text)
    emoji_chars = emoji_count(text)
    words = non_asian_words + asian_chars + emoji_chars

    return dict(characters=characters,
                chars_no_spaces=chars_no_spaces,
                asian_chars=asian_chars,
                non_asian_words=non_asian_words,
                emoji_chars=emoji_chars,
                words=words)

def dict2obj(dictionary):
    """Transform a dictionary into an object"""
    class Obj(object):
        def __init__(self, dictionary):
            self.__dict__.update(dictionary)
    return Obj(dictionary)

def get_wordcount_obj(text):
    """Get the wordcount as an object rather than a dictionary"""
    return dict2obj(get_wordcount(text))

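if __name__ == '__main__':
    a = '여보세요,我是Jason. Nice to meet you☺❤'
    # Strip ASCII and full-width punctuation first; otherwise is_asian()
    # would count full-width marks such as ',' as Asian characters.
    a = re.sub(r'[\.\!\/_,$%^*(+\"\']+|[+——!,。?、~@#¥%……&*():;《)《》“”()»〔〕-]+', "", a)
    b = get_wordcount_obj(a)
    print(b.words)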

Answer 1: (score: 0)

The built-in len function in Python, when applied to a string, gives you the number of characters in that string, not the number of words.
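
As a small illustration using the string from the question, every letter, space, punctuation mark, and symbol adds one to the result:

```python
a = '여보세요,我是Jason. Nice to meet you☺❤'

# len() counts characters (code points), not words.
print(len(a))        # 32, as reported in the question
print(len('Nice'))   # 4 characters, although it is a single word
```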

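If you want the number of words in a string, you first need to decide how a word is defined. For plain English, for example, you can split on whitespace. For mixed-language strings containing Unicode characters, you need to define custom rules that separate the cases where every character counts as a word from the cases where words are delimited by whitespace. In your example, you need to count the English words, the Chinese characters, the Korean characters, and the emoji separately.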
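
A minimal sketch of such rules, using only the standard library, might look like the following. The character ranges, the ASCII-only definition of a Latin word, and the use of the Unicode category "So" as a rough stand-in for emoji are all assumptions made for illustration, and count_words is a hypothetical name. Adjust them to the text you actually have.

```python
import re
import unicodedata

# Assumed ranges where every character counts as one word:
# CJK Unified Ideographs, Hiragana/Katakana, Hangul syllables.
PER_CHAR = re.compile(r'[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7a3]')
# A run of ASCII letters or digits counts as one word.
LATIN_WORD = re.compile(r'[A-Za-z0-9]+')


def count_words(text):
    """Count words in mixed CJK/Latin text (hypothetical helper)."""
    per_char = len(PER_CHAR.findall(text))
    latin = len(LATIN_WORD.findall(text))
    # Rough emoji proxy: Unicode general category "So" (Symbol, other).
    symbols = sum(1 for ch in text if unicodedata.category(ch) == 'So')
    return per_char + latin + symbols


# Plain English: splitting on whitespace is enough.
print(len('Nice to meet you'.split()))   # 4

a = '여보세요,我是Jason. Nice to meet you☺❤'
print(count_words(a))                    # 13
```

Punctuation is ignored here simply because it matches none of the three rules, so no separate stripping step is needed. Emoji built from several code points would be counted more than once by this proxy.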