Remove the non utf-8 characters from a String in Python

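To remove the non-utf-8 characters from a string:

1. Use the str.encode() method to encode the string to a bytes object.
2. Set the errors keyword argument to ignore to drop any non-utf-8 characters.
3. Use the bytes.decode() method to decode the bytes object to a string.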

main.py


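# '\udc86' is a lone surrogate, which cannot be encoded as utf-8
# (a plain '\x86' in a str is U+0086, which is valid and would be kept)
my_str = '\udc86bobbyhadz.com\udc86'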

result = my_str.encode(
    'utf-8', errors='ignore'
).decode('utf-8')

print(result)  # 👉️ 'bobbyhadz.com'

If you are starting with a bytes object, use the following code sample instead.

main.py


# b'\x86' cannot start a utf-8 sequence, so it is an invalid byte here
my_bytes = b'\x86bobbyhadz.com\x86'

result = my_bytes.decode(
  'utf-8', errors='ignore'
).encode('utf-8')

print(result)  # 👉️ b'bobbyhadz.com'
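Invalid bytes often show up when data has been cut off in the middle of a multi-byte character. The following is a minimal sketch of that case; the sample word is only an illustration and is not taken from the examples above.

```python
# 'café' encodes to b'caf\xc3\xa9'; dropping the last byte leaves
# an incomplete two-byte sequence at the end.
complete = 'café'.encode('utf-8')
truncated = complete[:-1]

# A complete sequence is valid utf-8 and is kept.
print(complete.decode('utf-8', errors='ignore'))   # 👉️ 'café'

# The dangling b'\xc3' is invalid on its own and is dropped.
print(truncated.decode('utf-8', errors='ignore'))  # 👉️ 'caf'
```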

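The example removes the non-utf-8 characters from a string.

The str.encode method returns an encoded version of the string as a bytes object. The default encoding is utf-8.

When the errors keyword argument is set to ignore, characters that cannot be encoded are dropped.

In a Python string, the only characters that cannot be encoded with utf-8 are lone surrogates (code points U+D800 to U+DFFF), so those are the ones that get removed from the string.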
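To make the effect of the errors argument more concrete, here is a small sketch that runs three of the built-in error handlers on the same string. The sample string is made up for illustration; it contains a lone surrogate ('\udc86'), which utf-8 refuses to encode.

```python
# Illustrative sample: '\udc86' is a lone surrogate.
sample = 'a\udc86b'

# errors='ignore' silently drops the offending character.
ignored = sample.encode('utf-8', errors='ignore')
print(ignored)  # 👉️ b'ab'

# errors='replace' substitutes a question mark instead of dropping it.
replaced = sample.encode('utf-8', errors='replace')
print(replaced)  # 👉️ b'a?b'

# The default, errors='strict', raises an exception.
try:
    sample.encode('utf-8')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)  # 👉️ True

# Characters such as '\x86' (U+0086), accented letters or emoji
# are valid and survive the round trip unchanged.
kept = '\x86é👍'.encode('utf-8', errors='ignore').decode('utf-8')
print(kept == '\x86é👍')  # 👉️ True
```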

The next step is to decode the bytes object using the utf-8 encoding.

main.py


my_str = 'abc'

result = my_str.encode('utf-8', errors='ignore').decode('utf-8')
print(result)  # 👉️ 'abc'

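The bytes.decode method returns a string decoded from the given bytes. The default encoding is utf-8.

The result is a string that doesn't contain any non-utf-8 characters.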
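If the round trip is needed in several places, it could be wrapped in a small helper. This is only a sketch: the function name remove_non_utf8 is hypothetical and the sample string is made up.

```python
def remove_non_utf8(text: str) -> str:
    """Return text without the characters that utf-8 cannot encode."""
    # Encode with errors='ignore' to drop unencodable characters,
    # then decode the resulting bytes back to a string.
    return text.encode('utf-8', errors='ignore').decode('utf-8')


# '\udce2' is a lone surrogate, so it is removed.
print(remove_non_utf8('bobby\udce2hadz.com'))  # 👉️ 'bobbyhadz.com'

# A string that is already clean is returned unchanged.
print(remove_non_utf8('abc'))  # 👉️ 'abc'
```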

Remove the non utf-8 characters from a File

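If you need to remove the non-utf-8 characters when reading a file, set the errors keyword argument to ignore in the call to open() and use a for loop to iterate over the lines in the file. Without it, reading the file in text mode raises a UnicodeDecodeError as soon as an invalid byte is encountered.

main.py


with open('example.txt', 'r', encoding='utf-8', errors='ignore') as f:
    lines = f.readlines()

    for line in lines:
        # invalid bytes were already dropped while the file was decoded
        print(line)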

The code sample assumes that you have an example.txt file located in the same directory as your Python script.

Encoding is the process of converting a string to a bytes object and decoding is the process of converting a bytes object to a string.
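The following is a self-contained sketch of cleaning a file that really does contain an invalid byte. It assumes a throwaway file created in a temporary directory (the file contents are made up), reads it in binary mode and decodes each line with errors set to ignore.

```python
import os
import tempfile

cleaned_lines = []

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical setup: write a file that holds one invalid byte (0x86).
    path = os.path.join(tmp_dir, 'example.txt')
    with open(path, 'wb') as f:
        f.write(b'first \x86line\nsecond line\n')

    # Binary mode yields bytes, so each line can be decoded leniently.
    with open(path, 'rb') as f:
        for raw_line in f:
            cleaned_lines.append(raw_line.decode('utf-8', errors='ignore'))

print(cleaned_lines)  # 👉️ ['first line\n', 'second line\n']
```

Reading in binary mode keeps the decoding step explicit, which can be handy if you also want to inspect the raw bytes before discarding anything.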

Remove the non utf-8 characters when starting with a Bytes object

If you are starting with a bytes object, you have to use the decode() method to decode the bytes object to a string first.

main.py


# b'\x86' cannot start a utf-8 sequence, so it is an invalid byte here
my_bytes = b'\x86bobbyhadz.com\x86'

result = my_bytes.decode('utf-8', errors='ignore').encode('utf-8')

print(result)  # 👉️ b'bobbyhadz.com'

Make sure to set the errors keyword argument to ignore in the call to the decode() method to drop any non-utf-8 bytes when converting to a string.
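As a rough illustration of why the errors argument matters when decoding, the sketch below runs the same bytes through three error handlers. The input is made up; the byte 0x86 is invalid because it cannot start a utf-8 sequence.

```python
# Illustrative input with an invalid byte at each end.
raw = b'\x86bobbyhadz.com\x86'

# errors='ignore' drops the invalid bytes.
print(raw.decode('utf-8', errors='ignore'))  # 👉️ 'bobbyhadz.com'

# errors='replace' marks each invalid byte with U+FFFD instead.
marked = raw.decode('utf-8', errors='replace')
print(marked == '\ufffdbobbyhadz.com\ufffd')  # 👉️ True

# Without an errors argument, decoding fails on the first invalid byte.
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc.start)  # 👉️ 0 (index of the first invalid byte)
```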
