Koichi Hori (堀 浩一)

While using MeCab, I get the error "'utf-8' codec can't decode byte 0xfa in position 0: invalid start byte"

(Programming diary of a doddering old AI researcher)

May 14, 2019

I use mecab-python3 for morphological analysis.

For reasons I do not understand, the error "'utf-8' codec can't decode byte 0xfa in position 0: invalid start byte" occurs intermittently, and I cannot reproduce it reliably.
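
As a side note on what the message itself means: the sketch below reproduces the same text with plain Python and no MeCab at all. The byte 0xfa can never start a UTF-8 sequence, so any attempt to decode it fails in exactly this way. This only illustrates the error message; it makes no claim about where MeCab gets such a byte from, and the variable names are my own.

```python
# 0xfa is 0b11111010; no UTF-8 sequence may begin with a byte in 0xf8-0xff.
bad_bytes = b"\xfa"

try:
    bad_bytes.decode("utf-8")
    message = None
except UnicodeDecodeError as e:
    # str(e) is the same text that appears in the error above.
    message = str(e)

print(message)
```

So the message suggests that, at the moment of the failure, the bytes being decoded are not valid UTF-8 text.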

When I searched, it seemed that nobody knew the cause, but several people had written that parsing an empty string before the actual parse avoids the error.
I tried it, and it did indeed work.

The code is as follows.


import MeCab

def extractNouns(text):
    """Return the surface forms of the nouns that MeCab finds in text."""
    tagger = MeCab.Tagger()

    tagger.parse("")
    # No one seems to know why this works,
    # but this tagger.parse("") can avoid the unicode decoding error
    # in the following parsing.

    node = tagger.parseToNode(text)
    keywords = []
    while node:
        try:
            word = node.surface
        except Exception as e:
            # Skip only the node whose surface could not be read.
            print(str(e))
            print('parsing error occurred but ignored')
            node = node.next
            continue
        # isalpha() drops numbers, symbols, and empty surfaces.
        if word.isalpha():
            meta = node.feature.split(",")
            # '名詞' is the part-of-speech tag for "noun".
            if meta[0] == '名詞':
                keywords.append(word)
        node = node.next
    return keywords
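
To check the skip-on-error logic of the loop without needing MeCab or the intermittent error, here is a minimal sketch that walks a chain of stand-in nodes. FakeNode, link, and collect_nouns are hypothetical names made up for this illustration and are not part of mecab-python3; the failure is simulated by decoding an invalid byte, and the feature strings are shortened examples.

```python
class FakeNode:
    """Hypothetical stand-in for a MeCab node (not part of mecab-python3)."""

    def __init__(self, surface, feature, broken=False):
        self._surface = surface
        self.feature = feature
        self.next = None
        self._broken = broken

    @property
    def surface(self):
        if self._broken:
            # Simulate the decoding failure by decoding an invalid byte.
            return b"\xfa".decode("utf-8")
        return self._surface


def link(nodes):
    """Chain nodes through .next and return the head (None when empty)."""
    for current, following in zip(nodes, nodes[1:]):
        current.next = following
    return nodes[0] if nodes else None


def collect_nouns(node):
    """Walk the chain, skipping any node whose surface cannot be read."""
    keywords = []
    while node:
        try:
            word = node.surface
        except UnicodeDecodeError:
            node = node.next
            continue
        if word.isalpha() and node.feature.split(",")[0] == "名詞":
            keywords.append(word)
        node = node.next
    return keywords


head = link([
    FakeNode("東京", "名詞,固有名詞"),
    FakeNode("", "名詞,一般", broken=True),  # reading .surface raises
    FakeNode("行く", "動詞,自立"),
    FakeNode("大学", "名詞,一般"),
])
nouns = collect_nouns(head)
print(nouns)
```

The broken node and the verb are both left out, and the nodes after the broken one are still processed, which is the behavior I want from the real loop.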

Addendum: Later, a former student of mine told me that the cause of this problem is garbage collection. The explanation is posted here.

To the extent possible under law, the person who associated CC0 with this work has waived all copyright and related or neighboring rights to this work.