Koichi Hori


Unicode decode error "'utf-8' codec can't decode byte 0xfa in position 0: invalid start byte" when using MeCab

(Diary of an Old AI Researcher who is still Programming)

14 May 2019

I am using mecab-python3, a morphological analyzer for Japanese text.

I do not know why, but mecab-python3 sometimes raises the error "'utf-8' codec can't decode byte 0xfa in position 0: invalid start byte".
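The message itself is Python's standard UnicodeDecodeError text, and it can be reproduced without MeCab. The following is a minimal sketch; the helper name describe_decode_failure is made up for this illustration. It only shows that the byte 0xfa cannot begin a valid UTF-8 sequence. It does not explain why mecab-python3 ends up handing such bytes to the decoder.

```python
def describe_decode_failure(raw):
    """Return the UnicodeDecodeError message for raw, or None if it decodes.

    Hypothetical helper for illustration; not part of MeCab.
    """
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as e:
        return str(e)
    return None


# A lone 0xfa byte is not a valid UTF-8 start byte.
print(describe_decode_failure(b"\xfa"))

# Properly encoded Japanese text decodes without any error.
print(describe_decode_failure("名詞".encode("utf-8")))
```

So whatever is being decoded when the error appears is probably not the UTF-8 text that was passed in.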

Searching the internet, I found no explanation of the cause of the error, but some people say that it can be avoided by parsing an empty string before carrying out the actual parsing tasks.
I have tried this workaround and found that it does work.

Here is an example:


import MeCab

def extractNouns(text):
    tagger = MeCab.Tagger()

    # Workaround: parse an empty string first.
    # No one seems to know why this works,
    # but this tagger.parse("") can avoid the unicode decoding error
    # in the following parsing.
    tagger.parse("")

    node = tagger.parseToNode(text)
    keywords = []
    while node:
        try:
            word = node.surface
        except Exception as e:
            # Reading the surface failed: report it and skip this node.
            print(str(e))
            print('parsing error occurred but ignored')
            node = node.next
            continue
        if word.isalpha():
            # The first field of the feature string is the part of speech.
            meta = node.feature.split(",")
            if meta[0] == '名詞':  # '名詞' means "noun"
                keywords.append(word)
        node = node.next
    return keywords
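
The function above creates a new tagger on every call. If you prefer to create the tagger once and reuse it, the workaround can be wrapped in a small helper. This is only a sketch: the name make_primed_tagger is made up here, and I am assuming that one tagger.parse("") right after creation is enough for a reused tagger, which I have not verified beyond the per-call case above.

```python
def make_primed_tagger(tagger_factory):
    """Create a tagger and apply the empty-string workaround once.

    tagger_factory is any callable returning an object with a
    parse(str) method, for example MeCab.Tagger.
    Hypothetical helper; not part of mecab-python3.
    """
    tagger = tagger_factory()
    # Same workaround as above: parse an empty string before real use.
    tagger.parse("")
    return tagger


# Intended usage (requires mecab-python3 and a dictionary to be installed):
#
#   import MeCab
#   tagger = make_primed_tagger(MeCab.Tagger)
#   node = tagger.parseToNode("some Japanese text")
```

Passing the factory instead of importing MeCab inside the helper keeps it easy to try with a stand-in object when MeCab is not installed.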

CC0
To the extent possible under law, the person who associated CC0 with this work has waived all copyright and related or neighboring rights to this work.



