Remove non-ASCII characters from a string in Python
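To remove the non-ASCII characters from a string:
- Check if each character in the string has a Unicode code point of less than 128.
- The first 128 Unicode code points represent the ASCII characters.
- Use the join() method to join the matching characters into a string.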
# 👇️ check if each character in string is ASCII
def remove_non_ascii(string):
    return ''.join(char for char in string if ord(char) < 128)


print(remove_non_ascii('a€bñcá'))  # 👉️ abc
print(remove_non_ascii('a_b^0'))  # 👉️ a_b^0

# ----------------------------------------------------

# 👇️ encode using ASCII encoding and drop all non-ASCII characters
def remove_non_ascii_2(string):
    return string.encode('ascii', errors='ignore').decode()


print(remove_non_ascii_2('a€bñcá'))  # 👉️ abc
print(remove_non_ascii_2('a_b^0'))  # 👉️ a_b^0
The first function uses a generator expression to iterate over the string.
def remove_non_ascii(string):
    return ''.join(char for char in string if ord(char) < 128)


print(remove_non_ascii('a€bñcá'))  # 👉️ abc
print(remove_non_ascii('a_b^0'))  # 👉️ a_b^0
On each iteration, we check if the Unicode code point of the character is less than 128.
The ord function takes a
string that represents 1 Unicode character and returns an integer representing
the Unicode code point of the given character.
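As a small sketch of what ord returns, the snippet below looks up the code points of three sample characters and compares each one against 128. The sample characters and the code_points name are chosen here for illustration only.

```python
# Illustrative characters: one ASCII letter and two non-ASCII characters.
code_points = {char: ord(char) for char in 'a€ñ'}

for char, code_point in code_points.items():
    # A code point below 128 means the character is ASCII.
    print(char, code_point, code_point < 128)

# Prints:
# a 97 True
# € 8364 False
# ñ 241 False
```

Only the character whose code point is below 128 would be kept by the generator expression.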
The last step is to join the characters that satisfy the condition.
The str.join method
takes an iterable as an argument and returns a string which is the concatenation
of the strings in the iterable.
The string the method is called on is used as the separator between the
elements.
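To make the separator behavior concrete, here is a minimal sketch with a hypothetical list of letters. Calling join on an empty string concatenates the items directly, which is what the function above relies on.

```python
letters = ['a', 'b', 'c']  # sample data, chosen for illustration

# An empty string as the separator concatenates the items directly.
no_separator = ''.join(letters)

# Any other string is placed between the items.
dash_separator = '-'.join(letters)

print(no_separator)    # abc
print(dash_separator)  # a-b-c
```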
Alternatively, you can use the str.encode() and bytes.decode() methods.
To remove the non-ASCII characters from a string:
- Use the str.encode() method to encode the string using the ASCII encoding.
- Set the errors argument to ignore, so all non-ASCII characters are dropped.
- Use the bytes.decode() method to convert the bytes object to a string.
def remove_non_ascii_2(string):
    return string.encode('ascii', errors='ignore').decode()


print(remove_non_ascii_2('a€bñcá'))  # 👉️ abc
print(remove_non_ascii_2('a_b^0'))  # 👉️ a_b^0

print(ord('a'))  # 👉️ 97
print(ord('b'))  # 👉️ 98
The str.encode method returns an encoded version of the string as a bytes object. The default encoding is utf-8.
We set the encoding to ascii and the errors argument to ignore.
When the errors keyword argument is set to ignore, characters that cannot be encoded are dropped. With the ascii encoding, this means every non-ASCII character is removed from the string.
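For comparison, the sketch below encodes the same sample string with three values of the errors argument. The variable names are illustrative. The replace handler and the default strict handler are shown only to contrast them with ignore.

```python
text = 'a€bñcá'

# 'ignore' silently drops characters that cannot be encoded.
ignored = text.encode('ascii', errors='ignore')

# 'replace' substitutes a question mark for each such character.
replaced = text.encode('ascii', errors='replace')

print(ignored)   # b'abc'
print(replaced)  # b'a?b?c?'

# The default handler is 'strict', which raises an exception instead.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(strict_failed)  # True
```

Without errors='ignore', the function would raise on the first non-ASCII character rather than remove it.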
The last step is to use the bytes.decode()
method to decode the bytes object
to a string.
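Encoding is the process of converting a string to a bytes object and decoding is the process of converting a bytes object to a string.
The bytes.decode method returns a string decoded from the given bytes. The default encoding is utf-8, which works here because ASCII bytes are also valid UTF-8.
The result is a string that doesn't contain any non-ASCII characters.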
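As a quick sanity check, the sketch below runs both approaches on a few sample strings and confirms they agree. The sample strings are made up for illustration, and the final check uses str.isascii(), which assumes Python 3.7 or newer.

```python
def remove_non_ascii(string):
    # Keep only characters whose code point is below 128.
    return ''.join(char for char in string if ord(char) < 128)


def remove_non_ascii_2(string):
    # Encode to ASCII, dropping unencodable characters, then decode back.
    return string.encode('ascii', errors='ignore').decode()


# Hypothetical inputs, including an empty string and mixed scripts.
samples = ['a€bñcá', 'a_b^0', '', 'naïve café', '日本語 text']

results = [remove_non_ascii(sample) for sample in samples]
results_2 = [remove_non_ascii_2(sample) for sample in samples]

print(results == results_2)  # True
print(results[3])            # nave caf

# str.isascii() (Python 3.7+) confirms nothing non-ASCII is left.
print(all(result.isascii() for result in results))  # True
```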