Koichi Hori


Unicode decode error "'utf-8' codec can't decode byte 0xfa in position 0: invalid start byte" when using MeCab

(Diary of an Old Professor who is still Programming)

14 May 2019

I am using mecab-python3, a morphological analyzer for Japanese text.

I do not know why, but mecab-python3 sometimes raises the error "'utf-8' codec can't decode byte 0xfa in position 0: invalid start byte".
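
To make the message itself concrete, here is a small sketch that uses only the Python standard library and no MeCab at all. The helper name decode_error_message is made up for this illustration. It shows what the message means: 0xfa can never be the first byte of a UTF-8 character, so decoding any byte string that begins with it fails at position 0. It does not explain why mecab-python3 hands back such bytes in the first place.

```python
def decode_error_message(raw: bytes):
    """Return the UnicodeDecodeError message for raw, or None if raw is valid UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as e:
        return str(e)
    return None


# A byte string starting with 0xfa reproduces the message quoted above.
print(decode_error_message(b"\xfa"))
# → 'utf-8' codec can't decode byte 0xfa in position 0: invalid start byte

# Properly encoded Japanese text decodes without any error.
print(decode_error_message("日本語".encode("utf-8")))
# → None
```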

Searching the internet, I found no explanation of the cause of the error, but some people say that it can be avoided by parsing an empty string before carrying out the actual parsing.
I tried this workaround and found that it does work.

Here is an example:


import MeCab

def extractNouns(text):
    """Return the list of nouns found in text."""
    tagger = MeCab.Tagger()

    tagger.parse("")
    # No one seems to know why this works,
    # but calling tagger.parse("") first avoids the unicode decoding error
    # in the parsing that follows.

    node = tagger.parseToNode(text)
    keywords = []
    while node:
        try:
            word = node.surface
        except Exception as e:
            # Report the error, then skip this node and carry on.
            print(str(e))
            print('parsing error occurred but ignored')
            word = None
        if word is not None and word.isalpha():
            # The first field of the feature string is the part of speech.
            meta = node.feature.split(",")
            if meta[0] == '名詞':
                keywords.append(word)
        node = node.next
    return keywords
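
When many texts are processed, it may be preferable not to build a new tagger for every call. The following is a minimal sketch of that variation, not a tested recipe. The name extract_nouns and the idea of passing the tagger in as an argument are my own additions for illustration. I do not know whether the empty parse is needed once per tagger or before every parse, so the sketch simply repeats it before every parse.

```python
def extract_nouns(tagger, text):
    """Collect noun surfaces from text using an already constructed tagger.

    tagger is assumed to behave like MeCab.Tagger: it offers parse() and
    parseToNode(), and nodes expose surface, feature and next.
    """
    tagger.parse("")  # the workaround described above, repeated on every call
    node = tagger.parseToNode(text)
    nouns = []
    while node:
        try:
            word = node.surface
        except UnicodeDecodeError:
            word = None  # skip a node whose surface cannot be decoded
        # The first field of the feature string is the part of speech.
        if word and word.isalpha() and node.feature.split(",")[0] == "名詞":
            nouns.append(word)
        node = node.next
    return nouns
```

With mecab-python3 and a dictionary installed, this could be used by creating tagger = MeCab.Tagger() once and then calling extract_nouns(tagger, text) for each text.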

CC0
To the extent possible under law, the person who associated CC0 with this work has waived all copyright and related or neighboring rights to this work.



