Remove non-ASCII characters from a string in Python

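To remove the non-ASCII characters from a string:

  1. Check if each character in the string has a Unicode code point of less than 128.
  2. The first 128 Unicode code points represent the ASCII characters.
  3. Use the str.join() method to join the matching characters into a string.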
main.py
# 👇️ check if each character in string is ASCII
def remove_non_ascii(string):
    return ''.join(char for char in string if ord(char) < 128)


print(remove_non_ascii('a€bñcá'))  # 👉️ abc
print(remove_non_ascii('a_b^0'))  # 👉️ a_b^0

# ----------------------------------------------------

# 👇️ encode using ASCII encoding and drop all non-ASCII characters
def remove_non_ascii_2(string):
    return string.encode('ascii', errors='ignore').decode()


print(remove_non_ascii_2('a€bñcá'))  # 👉️ abc
print(remove_non_ascii_2('a_b^0'))  # 👉️ a_b^0

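The first function uses a generator expression to iterate over the string.

Generator expressions are used to perform some operation for every element or to select a subset of elements that meet a condition.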
main.py
def remove_non_ascii(string):
    return ''.join(char for char in string if ord(char) < 128)


print(remove_non_ascii('a€bñcá'))  # 👉️ abc
print(remove_non_ascii('a_b^0'))  # 👉️ a_b^0

On each iteration, we check if the Unicode code point of the character is less than 128.
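The two uses of a generator expression mentioned earlier (transforming every element and filtering by a condition) can be sketched as follows. The sample list and variable names are made up for illustration.

```python
# Hypothetical sample data, used only for illustration.
words = ['apple', 'kiwi', 'banana']

# Transform: perform an operation for every element.
lengths = list(len(word) for word in words)
print(lengths)  # [5, 4, 6]

# Filter: keep only the elements that meet a condition.
short_words = list(word for word in words if len(word) < 6)
print(short_words)  # ['apple', 'kiwi']

# The same filtering idea applied to the characters of a string.
ascii_chars = list(char for char in 'a_b^0' if ord(char) < 128)
print(ascii_chars)  # ['a', '_', 'b', '^', '0']
```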

The first 128 Unicode code points represent the ASCII characters.

The ord function takes a
string that represents 1 Unicode character and returns an integer representing
the Unicode code point of the given character.
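As a quick illustration, the snippet below prints the code points of a few characters from the examples. The helper name `is_ascii_char` is hypothetical and is not part of the original functions.

```python
# ASCII characters have code points below 128.
print(ord('a'))  # 97
print(ord('^'))  # 94

# Non-ASCII characters have code points of 128 or higher.
print(ord('ñ'))  # 241
print(ord('€'))  # 8364


# Hypothetical helper that wraps the check used in remove_non_ascii.
def is_ascii_char(char):
    return ord(char) < 128


print(is_ascii_char('a'))  # True
print(is_ascii_char('€'))  # False

# ord() only accepts a string of length 1.
try:
    ord('ab')
except TypeError as error:
    print(type(error).__name__)  # TypeError
```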

The last step is to join the characters that satisfy the condition.

The str.join method
takes an iterable as an argument and returns a string which is the concatenation
of the strings in the iterable.

The string the method is called on is used as the separator between the
elements.
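A short sketch of how the separator affects the result. The sample lists are made up for illustration.

```python
letters = ['a', 'b', 'c']

# An empty string as the separator joins the items without a delimiter.
print(''.join(letters))  # abc

# Any other string is placed between the items.
print('-'.join(letters))  # a-b-c

# join() expects strings, so other values have to be converted first.
numbers = [1, 2, 3]
joined_numbers = ', '.join(str(number) for number in numbers)
print(joined_numbers)  # 1, 2, 3
```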

Alternatively, you can use the str.encode() and bytes.decode() methods.

To remove the non-ASCII characters from a string:

  1. Use the str.encode() method to encode the string using the ASCII encoding.
  2. Set the errors argument to ignore, so all non-ASCII characters are
    dropped.
  3. Use the bytes.decode() method to convert the bytes object to a string.
main.py
def remove_non_ascii_2(string):
    return string.encode('ascii', errors='ignore').decode()


print(remove_non_ascii_2('a€bñcá'))  # 👉️ abc
print(remove_non_ascii_2('a_b^0'))  # 👉️ a_b^0

print(ord('a'))  # 👉️ 97
print(ord('b'))  # 👉️ 98

The str.encode
method returns an encoded version of the string as a bytes object. The default
encoding is utf-8.

We set the encoding to ascii and the errors argument to ignore.

When the errors keyword argument is set to ignore, all characters that cannot
be encoded using the ASCII encoding are dropped from the result instead of
raising a UnicodeEncodeError.
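To make the effect of the errors argument concrete, here is a small comparison of three standard error handlers. The 'replace' handler is shown only for contrast; it is not used by the function in this article.

```python
text = 'a€bñcá'

# 'ignore' silently drops characters that cannot be encoded.
ignored = text.encode('ascii', errors='ignore')
print(ignored)  # b'abc'

# 'replace' substitutes a question mark for each such character.
replaced = text.encode('ascii', errors='replace')
print(replaced)  # b'a?b?c?'

# The default handler is 'strict', which raises an exception.
try:
    text.encode('ascii')
except UnicodeEncodeError as error:
    print(type(error).__name__)  # UnicodeEncodeError
```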

The last step is to use the bytes.decode() method to decode the bytes object
to a string.

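Encoding is the process of converting a string to a bytes object and decoding
is the process of converting a bytes object to a string.

The bytes.decode
method returns a string decoded from the given bytes. The default encoding is
utf-8.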
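The round trip between the two types can be sketched like this. Decoding with the default utf-8 works here because ASCII bytes are also valid UTF-8.

```python
text = 'abc'

# Encoding converts a str to a bytes object.
encoded = text.encode('ascii')
print(encoded)  # b'abc'
print(type(encoded).__name__)  # bytes

# Decoding converts the bytes object back to a str (utf-8 by default).
decoded = encoded.decode()
print(decoded)  # abc
print(type(decoded).__name__)  # str
print(decoded == text)  # True
```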

The result is a string that doesn't contain any non-ASCII characters.
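As a sanity check, the sketch below runs both functions from this article on a few sample strings and confirms that they agree and that the output contains only ASCII characters. The sample list is an assumption chosen for illustration.

```python
def remove_non_ascii(string):
    return ''.join(char for char in string if ord(char) < 128)


def remove_non_ascii_2(string):
    return string.encode('ascii', errors='ignore').decode()


# Hypothetical sample inputs, including an empty string.
samples = ['a€bñcá', 'a_b^0', '']

for sample in samples:
    first = remove_non_ascii(sample)
    second = remove_non_ascii_2(sample)

    # Both approaches should produce the same string.
    assert first == second

    # Every remaining character should have a code point below 128.
    assert all(ord(char) < 128 for char in first)

    print(repr(first))
```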