Simple intent recognition and question answering with DeepPavlov - UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 37: invalid continuation byte

Good morning.

I have come across this error:
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()
pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 37: invalid continuation byte

The error comes up when running this part of the script:

from deeppavlov.core.common.file import read_json
from deeppavlov import configs, train_model

model_config = read_json(configs.faq.tfidf_logreg_en_faq)
model_config["dataset_reader"]["data_path"] = "./data/faq.csv"
model_config["dataset_reader"]["data_url"] = None
faq = train_model(model_config)
a = faq(["some question"])
a

It is happening because I have utf-16 characters in the .csv file such as ‘Luleå’, a city in Sweden.
How can I fix this problem?

Thank you for your time.

Hey @Titus ,

The dataset_reader uses pandas.read_csv to load the data. You can try to change the encoding option of read_csv in /deeppavlov/dataset_readers/faq_reader.py.

Let me know if it’s helpful.

Thank you for the quick reply, @Vasily!

I’ll change the encoding later this afternoon, and I’ll get back with the results.

Have a great day!

Hey @Vasily!

I looked, but I am not sure which file to change, and what would be the right way to do it.
Could you please give me an example of how to set the .csv reader to read utf-16?

Thank you.

Try following these steps (assuming the problem relates to the encoding):

  1. Clone the repo https://github.com/deepmipt/DeepPavlov/
  2. Locate the file deeppavlov/dataset_readers/faq_reader.py
  3. Change read_csv(data_url) to read_csv(data_url, encoding='utf-16')

Then run your code inside the DeepPavlov folder.

Hopefully it’s helpful

Thank you @Vasily,

Unfortunately it doesn’t work.

I made the following changes to deeppavlov/dataset_readers/faq_reader.py, but I get the same error.

if data_url is not None:
    #data = read_csv(data_url)
    data = read_csv(data_url, encoding='utf-16')
elif data_path is not None:
    #data = read_csv(data_path)
    data = read_csv(data_path, encoding='utf-16')
else:
    raise ValueError("Please specify data_path or data_url parameter")

Is there anything else you think might help?

Thank you again.

I would suggest trying to read the data in a separate Python script first. Once you find parameters that work, you can change the dataset_reader to match.

Make sure that you save the data in utf-8. Otherwise, assuming you are using Windows, try pd.read_csv("filename", encoding="ISO-8859-1", engine='python').
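
A minimal sketch of such a separate script, using only the standard library. The helper names find_encoding and to_utf8 and the fallback order are assumptions made for illustration, not part of DeepPavlov or pandas. The guess is only a heuristic: cp1252 and ISO-8859-1 overlap for most bytes, so the result should be checked by eye.

```python
import codecs

# Hypothetical fallback order for files without a byte-order mark (BOM).
# ISO-8859-1 comes last because it accepts every byte and so never fails.
FALLBACKS = ("utf-8", "cp1252", "iso-8859-1")


def find_encoding(raw: bytes) -> str:
    """Guess an encoding name that decodes `raw` without raising."""
    # A BOM is the only reliable sign of UTF-16 here: arbitrary bytes often
    # "decode" as UTF-16 into garbage without raising an error.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    for name in FALLBACKS:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    # Not reachable while ISO-8859-1 is in FALLBACKS; kept as a safeguard.
    raise ValueError("no candidate encoding fits")


def to_utf8(raw: bytes) -> bytes:
    """Re-encode `raw` as UTF-8 using the guessed source encoding."""
    return raw.decode(find_encoding(raw)).encode("utf-8")
```

The bytes can come from open("./data/faq.csv", "rb").read(). The name returned by find_encoding can then be passed as the encoding argument of pd.read_csv, or the output of to_utf8 can be written back to disk so that the unmodified dataset_reader can load the file.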

Thank you @Vasily!

I’ll give it a try.
Have a great evening!

Python's Unicode string type (str) is a handy feature: once data is decoded, you can forget about the encoding until you need to write or transmit it. Here Python tries to convert a byte sequence (bytes that it assumes hold utf-8-encoded text) to a Unicode string (str), which means decoding according to utf-8 rules. While doing so, it encounters a byte sequence that is not allowed in utf-8-encoded text (namely the 0xe5 at position 37, which is not followed by a valid continuation byte). If the value is already a str (say a is a string with a non-ASCII character), you can encode it explicitly as follows, although this alone does not help when the failure happens while the file is being read:

a.encode('utf-8').strip()

Or

Read the file with the ISO-8859-1 encoding to solve the issue.
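
To see what the traceback is reporting, here is a small sketch that reproduces the failure on a short byte string. The helper name describe_decode and the sample text are made up for illustration. The sample is assumed to have been saved as ISO-8859-1, which matches the byte 0xe5 in the error message.

```python
def describe_decode(raw: bytes, encoding: str) -> str:
    """Decode `raw`, or describe why decoding failed."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as err:
        return f"byte {raw[err.start]:#x} at position {err.start}: {err.reason}"


# In ISO-8859-1, 'å' is the single byte 0xe5; in utf-8 it is 0xc3 0xa5.
sample = "Lule\u00e5, Sweden".encode("iso-8859-1")

print(describe_decode(sample, "utf-8"))       # reports byte 0xe5 at position 4
print(describe_decode(sample, "iso-8859-1"))  # decodes to the original text
```

The same bytes decode cleanly as ISO-8859-1, which is why passing that encoding to the reader makes the error go away for files saved this way.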