Simple intent recognition and question answering with DeepPavlov - UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 37: invalid continuation byte

Titus · May 19, 2020, 6:44am

Good morning.

I have come across this error:
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()
pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 37: invalid continuation byte

The error comes up when running this part of the script:

from deeppavlov.core.common.file import read_json
from deeppavlov import configs, train_model

model_config = read_json(configs.faq.tfidf_logreg_en_faq)
model_config["dataset_reader"]["data_path"] = "./data/faq.csv"
model_config["dataset_reader"]["data_url"] = None
faq = train_model(model_config)
a = faq(["some question"])
a

It is happening because I have utf-16 characters in the .csv file such as ‘Luleå’, a city in Sweden.
How can I fix this problem?

Thank you for your time.

Vasily · May 19, 2020, 8:21am

Hey @Titus ,

The dataset_reader uses pandas.read_csv to load the data. You can try to change the encoding option of read_csv in /deeppavlov/dataset_readers/faq_reader.py.

Let me know if it’s helpful.

Titus · May 19, 2020, 8:50am

Thank you for the quick reply, @Vasily!

I’ll change the encoding later this afternoon, and I’ll get back with the results.

Have a great day!

Titus · May 19, 2020, 10:15am

Hey @Vasily!

I looked, but I am not sure which file to change or what the right way to do it would be.
Could you please give me an example of how to set the .csv reader to read utf-16?

Thank you.

Vasily · May 19, 2020, 12:01pm

Try following these steps (assuming the problem relates to the encoding):

1. Clone the repo https://github.com/deepmipt/DeepPavlov/
2. Locate the file deeppavlov/dataset_readers/faq_reader.py
3. Change read_csv(data_url) to read_csv(data_url, encoding='utf-16')

Then run your code inside the DeepPavlov folder.

Hopefully that helps.

Titus · May 19, 2020, 12:54pm

Thank you @Vasily,

Unfortunately it doesn’t work.

I made the following changes to deeppavlov/dataset_readers/faq_reader.py, but I get the same error.

if data_url is not None:
    # data = read_csv(data_url)
    data = read_csv(data_url, encoding='utf-16')
elif data_path is not None:
    # data = read_csv(data_path)
    data = read_csv(data_path, encoding='utf-16')
else:
    raise ValueError("Please specify data_path or data_url parameter")

Is there anything else you think might help?

Thank you again.
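
One possible reason an edit to faq_reader.py has no effect is that Python is importing a different copy of the package (for example one installed in site-packages) instead of the cloned one. That is only a guess about the setup described here. A minimal sketch for checking which file would actually be loaded; the helper name module_origin is made up for illustration:

```python
import importlib.util
from typing import Optional


def module_origin(name: str) -> Optional[str]:
    """Return the file a module would be imported from, or None if not found."""
    try:
        spec = importlib.util.find_spec(name)
    except ImportError:
        # Raised when a parent package of a dotted name cannot be imported.
        return None
    return spec.origin if spec is not None else None


# Prints the path of the faq_reader.py that Python would use, or None if
# DeepPavlov is not importable from the current environment.
print(module_origin("deeppavlov.dataset_readers.faq_reader"))
```

If the printed path is not inside the cloned DeepPavlov folder, the edited file is probably not the one being executed.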

Vasily · May 19, 2020, 2:57pm

I would suggest trying to read the data in a separate Python script. Once you find the appropriate parameters, you can change the dataset_reader.

Make sure the data is saved as UTF-8. Otherwise, assuming you are on Windows, try pd.read_csv("filename", encoding="ISO-8859-1", engine="python").
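
The trial-and-error step described above could look roughly like this. It is a heuristic sketch, not a reliable detector: the function name guess_encoding, the candidate list, and the sample rows are all made up for illustration, and it uses only the standard library.

```python
import codecs


def guess_encoding(raw: bytes) -> str:
    """Return a plausible encoding name for raw CSV bytes (heuristic only)."""
    # A UTF-16 file normally starts with a byte-order mark.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Try strict decoders first; latin-1 accepts any byte, so it goes last.
    for name in ("utf-8", "cp1252", "latin-1"):
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return "latin-1"  # not reached in practice, since latin-1 never fails


# Hypothetical sample: a CSV saved in a single-byte Western European encoding.
sample = "question,answer\nWhere is the office?,Luleå\n".encode("latin-1")
print(guess_encoding(sample))
```

For a real file, the bytes would come from open(path, "rb").read(), and the returned name can then be passed as the encoding argument of pd.read_csv.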

Titus · May 19, 2020, 3:13pm

Thank you @Vasily!

I’ll give it a try.
Have a great evening!

warrenfelsh · January 11, 2021, 9:41am

Python's str type lets you decode encoded bytes once and then forget about the encoding until you need to write or transmit the data. Here Python tries to convert a byte sequence (a bytes object, which it assumes is UTF-8 encoded) to a Unicode string (str), following UTF-8 decoding rules. In doing so, it encounters a byte that is not allowed at that point in UTF-8 (in this thread, 0xe5 at position 37). If a is a str containing non-ASCII characters, you can encode it explicitly with encode():

a.encode('utf-8').strip()

Or

Use the ISO-8859-1 encoding when reading the file.
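
As a small illustration of the decoding failure described above, assuming the CSV was saved in a single-byte encoding such as ISO-8859-1 (the sample text is made up): 'å' is stored as the single byte 0xe5, which UTF-8 treats as the start of a multi-byte sequence, so the byte that follows it is rejected.

```python
# Hypothetical sample row; in ISO-8859-1 (Latin-1) the letter 'å' is byte 0xe5.
raw = "Luleå,Sweden".encode("latin-1")

failure = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    # UTF-8 expects continuation bytes after 0xe5, but a comma follows.
    failure = err
    print(err)

# Decoding with the encoding the bytes were written in recovers the text.
text = raw.decode("latin-1")
print(text)
```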
