I have come across this error:
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()
pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 37: invalid continuation byte
The error comes up when running this part of the script:
from deeppavlov.core.common.file import read_json
from deeppavlov import configs, train_model
model_config = read_json(configs.faq.tfidf_logreg_en_faq)
model_config["dataset_reader"]["data_path"] = "./data/faq.csv"
model_config["dataset_reader"]["data_url"] = None
faq = train_model(model_config)
a = faq(["some question"])
a
It is happening because the .csv file contains non-ASCII characters such as 'Luleå', a city in Sweden, so the file was likely saved in utf-16 or some other non-utf-8 encoding.
How can I fix this problem?
The dataset_reader uses pandas.read_csv to load the data. You can try changing the encoding option of read_csv in deeppavlov/dataset_readers/faq_reader.py.
I looked, but I am not sure which file to change or what the right way to do it would be.
Could you please give me an example of how to set the .csv reader's encoding to read utf-16?
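For reference, a minimal self-contained sketch of passing an encoding to pandas.read_csv (the file name faq_utf16.csv and its contents are just an illustration, not part of DeepPavlov):

```python
import pandas as pd

# Create a tiny CSV saved in UTF-16, standing in for the problem file.
with open("faq_utf16.csv", "w", encoding="utf-16") as f:
    f.write("Question,Answer\n")
    f.write("Where is Luleå?,Sweden\n")

# Telling read_csv the actual encoding of the file lets pandas
# decode the bytes correctly instead of assuming utf-8.
df = pd.read_csv("faq_utf16.csv", encoding="utf-16")
print(df.loc[0, "Question"])
```

The same encoding= keyword is what the read_csv calls inside faq_reader.py would need to receive.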
I made the following changes to the deeppavlov/dataset_readers/faq_reader.py, but I get the same error.
if data_url is not None:
    # data = read_csv(data_url)
    data = read_csv(data_url, encoding='utf-16')
elif data_path is not None:
    # data = read_csv(data_path)
    data = read_csv(data_path, encoding='utf-16')
else:
    raise ValueError("Please specify data_path or data_url parameter")
I would suggest trying to read the data in a separate Python script first. Once you find parameters that work, you can change the dataset_reader accordingly.
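As a sketch of that approach (the candidate list and file name are assumptions; adjust them to your setup), a separate script can simply try encodings until one parses. Here the sample file is written as Latin-1, where 'å' is the single byte 0xE5 from the traceback:

```python
import pandas as pd

# Simulate a file whose encoding is unknown by writing it as Latin-1.
with open("faq_sample.csv", "w", encoding="latin-1") as f:
    f.write("Question,Answer\n")
    f.write("Where is Luleå?,Sweden\n")

# Try a few likely encodings until one decodes without errors.
candidates = ["utf-8", "utf-16", "latin-1", "cp1252"]
found = None
for enc in candidates:
    try:
        df = pd.read_csv("faq_sample.csv", encoding=enc)
        found = enc
        break
    except (UnicodeError, ValueError):
        continue

print(found)
```

Note that a decode that merely succeeds can still be wrong (e.g. cp1252 bytes decoded as latin-1), so inspect the resulting DataFrame before trusting the encoding.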
Make sure that you save the data in utf-8; otherwise, try pd.read_csv("filename", encoding="ISO-8859-1", engine="python"), assuming you are using Windows.
Unicode string types are a handy Python feature that lets you decode encoded strings and then forget about the encoding until you need to write or transmit the data. Here, Python tries to convert a byte array (a bytes object it assumes to be a utf-8-encoded string) into a unicode string (str). This process is, of course, a decoding according to utf-8 rules. While doing so, it encounters a byte sequence that is not allowed in utf-8-encoded strings (namely the 0xe5 at position 37). One simple way to avoid this error is to encode such strings with the encode() method, as follows (if a is the string with the non-ASCII character):
a.encode('utf-8').strip()
Or
use the ISO-8859-1 encoding to solve the issue.
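To see why the right codec matters, here is the byte-level view of the failing character (assuming the file was actually saved as Latin-1/ISO-8859-1 rather than utf-16):

```python
raw = "Luleå".encode("latin-1")   # b'Lule\xe5'

# 0xE5 is the Latin-1 byte for 'å' -- the same byte value the
# traceback complains about.
assert raw[4] == 0xE5

# Decoding it as utf-8 fails, because 0xE5 opens a multi-byte
# sequence that the following bytes do not complete.
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Decoding with the encoding the bytes were written in works.
print(decoded_ok, raw.decode("latin-1"))
```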