UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0x92 in position

# UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0x92 in position

UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0x92 in position

The Python “UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0x92 in
position: invalid start byte” occurs when we specify an incorrect encoding when
decoding a bytes object.

To solve the error, specify the correct encoding, e.g. cp1252.

unicodedecodeerror utf 8 codec cant decode byte 0x92

Here is an example of how the error occurs.

main.py
my_bytes = 'abc ’ def'.encode('cp1252') # ⛔️ UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 4: invalid start byte my_str = my_bytes.decode('utf-8')

encoding does not match

Notice that the string was encoded using the cp1252 encoding and then we tried
to decode the bytes object using the utf-8 encoding.

The two encodings don’t match.

The 0x92 byte can be decoded using the cp1252 encoding.

Encoding is the process of converting a string to a bytes object and decoding is the process of converting a bytes object to a string.

# Use the same encoding that was used to encode the string to bytes

When decoding a bytes object, we have to use the same encoding that was used to
encode the string to a bytes object.

In the example, we can set the encoding to cp1252.

main.py
my_bytes = 'abc ’ def'.encode('cp1252') my_str = my_bytes.decode('cp1252') print(my_str) # 👉️ "abc ’ def"

encodings match

The encoding used to encode the string to bytes matches the encoding used to decode the bytes object, so everything works as expected.

Windows-1252 or cp1252 is a
single-byte character encoding of the Latin alphabet.

If that doesn’t work, try using the latin-1 encoding and see if the data is
legible.

main.py
my_bytes = 'abc ’ def'.encode('cp1252') my_str = my_bytes.decode('latin-1') print(my_str) # 👉️ "abc def"

Notice that the character that cannot be decoded got stripped when using the
latin-1 encoding.

You can also try using the mac_roman encoding if the file or string was
created on a macOS machine.

main.py
my_bytes = 'abc ’ def'.encode('cp1252') my_str = my_bytes.decode('mac_roman') print(my_str) # 👉️ "abc í def"

# Specify the correct encoding when reading from a file using pandas

If you got the error when reading from a file using
pandas, try setting the encoding to
cp1252 or latin-1.

main.py
import pandas as pd # 👇️ set encoding to cp1252 df = pd.read_csv('employees.csv', sep='|', encoding='cp1252') print(df)

set encoding when reading file with pandas

The code sample assumes that you have an employees.csv file in the same
directory as your Python script.

employees.csv
first_name,last_name Alice,Smith Bobby,Hadz

You can try doing the same if using the native
open() function.

main.py
import csv with open('employees.csv', newline='', encoding='cp1252') as csvfile: csv_reader = list(csv.reader(csvfile, delimiter='|')) print(csv_reader)

# Setting the errors keyword argument to ignore

如果错误仍然存​​在,您可以将
errors 的关键字参数设置
ignore以忽略无法解码的字符。

请注意,忽略无法解码的字符会导致数据丢失。

主程序
import csv # 👇️ set errors to ignore with open('employees.csv', newline='', encoding='utf-8', errors='ignore') as csvfile: csv_reader = list(csv.reader(csvfile, delimiter='|')) print(csv_reader)
errors使用设置为的不正确编码打开文件不会引发. ignoreUnicodeDecodeError

rb如果您必须从中读取文件,请确保您没有以(读取二进制)模式打开文件。

如果不需要与之交互,则以二进制模式打开文件

如果您不需要与文件内容进行交互,您可以在不解码的情况下以二进制模式打开它。

主程序
with open('example.txt', 'rb') as f: data = f.read() print(data)

我们以二进制模式(使用模式)打开文件rb,因此该data变量包含一个字节对象。


以二进制模式打开文件时
不应
指定编码。

如果您需要将文件上传到远程服务器并且不需要对其进行解码,则可以使用这种方法。

错误是如何发生的

string编码是将 a 转换为对象的过程,解码是将对象转换为 a 的bytes过程bytesstring

解码字节对象时,我们必须使用与将字符串编码为字节对象相同的编码。

下面的示例展示了如何使用与解码 bytes 对象不同的编码将字符串编码为字节,从而导致错误。

主程序
my_text = 'a ’ b' my_binary_data = my_text.encode('cp1252') # ⛔️ UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 2: invalid start byte my_text_again = my_binary_data.decode('utf-8')

cp1252我们可以通过使用编码来解码字节对象来解决错误。

主程序
my_text = 'a ’ b' my_binary_data = my_text.encode('cp1252') my_text_again = my_binary_data.decode('cp1252') print(my_text_again) # 👉️ "a ’ b"