我需要一个正则表达式模式来创建一个包含三个独立列的 Pandas DataFrame:日期、用户和消息。我相信正则表达式是这种情况的最佳选择,但如果有其他方法可以更快地实现相同的结果,我想知道。
该模式必须:
-
捕获整个消息,包括多行消息。
-
检测何时没有用户,并将其归类为用户列中的组通知。
我已经创建了一个,但我觉得它并没有真正优化,在大型数据集上运行需要几秒钟。这是我目前拥有的模式,以及我需要提取的数据和格式的示例。
data = """
9/6/22, 11:28 - Group creator created group "Example"
9/6/22, 11:39 - User1: This is some text
9/6/22, 11:58 - Group creator changed group name to "Example2"
9/6/22, 12:13 - User2: This is
some text
with multiple lines
9/6/22, 13:13 - Admin changed group profile photo
9/6/22, 14:45 - User3: Hi StackOverflow
"""
pattern = r'\d{1,2}/\d{1,2}/\d{2,4},\s\d{1,2}:\d{2}\s-\s'
#pass the pattern and data to split it to get the list of messages with users
messages = re.split(pattern, data)[1:]
#extract all dates
dates = re.findall(pattern, data)
#create dataframe
df = pd.DataFrame({'user_message': messages, 'date': dates})
#separate Users and Message
users = []
messages = []
for message in df['user_message']:
entry = re.split(r'([\w\W]+?):\s', message)
if entry[1:]: # user name
users.append(entry[1])
messages.append(" ".join(entry[2:]))
else:
users.append('group_notification')
messages.append(entry[0])
df['user'] = users
df['message'] = messages
df.drop(columns=['user_message'], inplace=True)
13
最佳答案
4
假设用户不能弄乱并构建模仿新条目的多行消息,我可能会手动解析它而不是使用正则表达式,然后将数据加载到 pandas 中(如果您确实需要的话)。
这里的技巧是维护“先前”记录的句柄,以便在我们有后续消息的情况下我们可以扩展先前的消息。
您可以尝试以下概念:
import io
import datetime
import pandas
FILE_TEXT = """
9/6/22, 11:28 - Group creator created group "Example"
9/6/22, 11:39 - User1: This is some text
9/6/22, 11:58 - Group creator changed group name to "Example2"
9/6/22, 12:13 - User2: This is
some text
with multiple lines
9/6/22, 13:13 - Admin changed group profile photo
9/6/22, 14:45 - User 3: Hi : StackOverflow
"""
def try_parse_date(date_string):
potential_formats = [
"%m/%d/%y, %H:%M",
"%d/%m/%y, %H:%M"
]
final_error = None
for potential_format in potential_formats:
try:
return datetime.datetime.strptime(date_string, potential_format)
except ValueError as e:
final_error = e
raise final_error
results = []
prior = {}
with io.StringIO(FILE_TEXT) as file_in:
for index, row in enumerate(file_in):
row = row.strip()
if not row:
continue
## ------------------------
## We have a row, break it in to potential parts
## ------------------------
potential_date, *potential_user_and_message = row.rstrip("\n").split(" - ")
potential_user_and_message = " - ".join(potential_user_and_message)
## ------------------------
## ------------------------
## If this row does not start with a date
## assume it is part of the prior message
## ------------------------
try:
best_date_guess = try_parse_date(potential_date)
except ValueError:
if not index:
raise ValueError(f"The first row must start with a valid date")
print(f"**Warning** row {index} does not begin with valid date")
prior["message"] += row
continue
## ------------------------
## ------------------------
## Break the remaining text into user and message
## ------------------------
potential_user, *potential_message = potential_user_and_message.split(":")
if potential_message:
best_user_guess = potential_user
best_message_guess = ":".join(potential_message)
else:
best_user_guess = "group_notification"
best_message_guess = potential_user_and_message
## ------------------------
## ------------------------
## The prior record is complete
## ------------------------
if prior:
results.append(prior)
## ------------------------
## ------------------------
## Reset the prior record to this record
## ------------------------
prior = {"date": best_date_guess, "user": best_user_guess, "message": best_message_guess}
## ------------------------
## ------------------------
## The prior record is complete
## ------------------------
if prior:
results.append(prior)
## ------------------------
print(pandas.DataFrame(results))
这应该给你:
date user message
0 2022-09-06 11:28:00 group_notification Group creator created group "Example"
1 2022-09-06 11:39:00 User1 This is some text
2 2022-09-06 11:58:00 group_notification Group creator changed group name to "Example2"
3 2022-09-06 12:13:00 User2 This issome textwith multiple lines
4 2022-09-06 13:13:00 group_notification Admin changed group profile photo
5 2022-09-06 14:45:00 User 3 Hi : StackOverflow
4
-
这个方法很好,但是如果用户名包含任何空格(我没有说清楚;很抱歉),比如这里的“用户 3”,那么它就会被归类为群组通知:data = “”” 9/6/22, 11:28 – 群组创建者创建群组“示例” 9/6/22, 11:28 – 群组创建者创建群组“示例” 9/6/22, 11:39 – 用户 1:这是一些文本 9/6/22, 11:58 – 群组创建者将群组名称更改为“示例 2” 9/6/22, 12:13 – 用户 2:这是一些包含多行的文本 9/6/22, 13:13 – 管理员更改了群组个人资料照片 9/6/22, 14:45 – 用户 3:Hi StackOverflow“””.strip()
–
-
1如果用户名可以包含空格,那么您必须按 进行拆分
:
,但这会引入一个问题,即群组消息绝不能包含这样的字符。在我看来,此时也许应该回到数据源并确定它是否被正确拆分和转义。
– -
我注意到另一个问题,这将日期格式限制为只有一种,如果我想尝试三种不同的日期格式怎么办?我无法捕获 ValueError 并尝试格式列表中的另一种格式,因为这会干扰此行“prior[“message”] += row”中的多行消息。我尝试了 dateutil.parser.parse(parts[0]),但速度太慢了
– -
1从技术上讲,你可以嵌套 try 块,并且只有最后、最内层的 except 执行前一个/继续
–
|
这与 JonSG 的方法类似,但从一开始就使用了 pandas。
import pandas as pd
def pandas_solution(data: str) -> pd.DataFrame:
df = pd.DataFrame({"line": data.splitlines(keepends=True)})
# Parse date.
# Assuming that multi-line messages have no line with a date prefix.
df[["date", "body"]] = df["line"].str.split(" - ", n=1, expand=True)
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%y, %H:%M", errors="coerce")
# If the line has no date, it must be a multi-line message.
# If it is a multi-line message, merge it into the last line with a date prefix.
is_multiline_message = df["date"].isna()
df["message_index"] = (~is_multiline_message).cumsum()
df.loc[is_multiline_message, "body"] = df.loc[is_multiline_message, "line"]
df = df.groupby("message_index").agg({"date": "first", "body": "sum"})
# Parse user and message.
# Assuming that group notifications have no colon.
df[["user", "message"]] = df["body"].str.split(": ", n=1, expand=True)
is_group_notification = df["message"].isna()
df.loc[is_group_notification, "message"] = df.loc[is_group_notification, "user"]
df.loc[is_group_notification, "user"] = "group_notification"
df.drop("body", axis=1, inplace=True)
return df
基准:
import datetime
import io
import re
import timeit
import pandas as pd
pd.set_option("display.width", 1000)
pd.set_option("display.max_rows", 300)
pd.set_option("display.max_columns", 100)
def baseline(data):
pattern = r'\d{1,2}/\d{1,2}/\d{2,4},\s\d{1,2}:\d{2}\s-\s'
# pass the pattern and data to split it to get the list of messages with users
messages = re.split(pattern, data)[1:]
# extract all dates
dates = re.findall(pattern, data)
# create dataframe
df = pd.DataFrame({'user_message': messages, 'date': dates})
# separate Users and Message
users = []
messages = []
for message in df['user_message']:
entry = re.split(r'([\w\W]+?):\s', message)
if entry[1:]: # user name
users.append(entry[1])
messages.append(" ".join(entry[2:]))
else:
users.append('group_notification')
messages.append(entry[0])
df['user'] = users
df['message'] = messages
df.drop(columns=['user_message'], inplace=True)
return df
def jon_solution(data: str) -> pd.DataFrame:
results = []
prior = {}
with io.StringIO(data) as file_in:
for row in file_in:
parts = row.split(" - ")
## ------------------------
## If this row does not start with a date
## assume it is part of the prior message
## ------------------------
try:
current = {
"date": datetime.datetime.strptime(parts[0], "%m/%d/%y, %H:%M"),
"user": "group_notification",
"message": ""
}
except ValueError:
prior["message"] += row
continue
## ------------------------
## ------------------------
## The prior record is complete
## ------------------------
if prior:
results.append(prior)
## ------------------------
## ------------------------
## Break the remaining text into user and message
## ------------------------
sub_parts = parts[1].split(":")
if len(sub_parts) > 1:
current["user"] = sub_parts[0]
current["message"] = ":".join(sub_parts[1:])
else:
current["message"] = parts[1]
## ------------------------
prior = current
## ------------------------
## The prior record is complete
## ------------------------
results.append(prior)
## ------------------------
return pd.DataFrame(results)
def pandas_solution(data: str) -> pd.DataFrame:
df = pd.DataFrame({"line": data.splitlines(keepends=True)})
# Parse date.
# Assuming that multi-line messages have no line with a date prefix.
df[["date", "body"]] = df["line"].str.split(" - ", n=1, expand=True)
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%y, %H:%M", errors="coerce")
# If the line has no date, it must be a multi-line message.
# If it is a multi-line message, merge it into the last line with a date prefix.
is_multiline_message = df["date"].isna()
df["message_index"] = (~is_multiline_message).cumsum()
df.loc[is_multiline_message, "body"] = df.loc[is_multiline_message, "line"]
df = df.groupby("message_index").agg({"date": "first", "body": "sum"}).reset_index(drop=True)
# Parse user and message.
# Assuming that group notifications have no colon.
df[["user", "message"]] = df["body"].str.split(": ", n=1, expand=True)
is_group_notification = df["message"].isna()
df.loc[is_group_notification, "message"] = df.loc[is_group_notification, "user"]
df.loc[is_group_notification, "user"] = "group_notification"
df.drop("body", axis=1, inplace=True)
return df
def benchmark():
data = """\
9/6/22, 11:28 - Group creator created group "Example"
9/6/22, 11:39 - User1: This is some text
9/6/22, 11:58 - Group creator changed group name to "Example2"
9/6/22, 12:13 - User2: This is
some text
with - multiple: lines
9/6/22, 13:13 - Admin changed group profile photo
9/6/22, 14:45 - User3: Hi StackOverflow
"""
data *= 100000
for f in [baseline, jon_solution, pandas_solution]:
t = timeit.repeat(lambda: f(data), number=1, repeat=3)
print(f"{f.__name__}: {min(t)}")
benchmark()
结果:
baseline: 7.355876599998737
jon_solution: 5.135336500003177
pandas_solution: 1.501740499999869
我还没有测试过,但可能更快,如果我们可以使用话,它可能会更快。
1
-
这非常优雅,尽管我发现如果使用正则表达式,JonSG 的解决方案会更快。如果我设法实现 Polars 或 Numba,也许你的解决方案会更快。
–
|
我注意到 JonSG 解决方案的瓶颈在于 strptime 部分,因此我改用 regex,结果发现它甚至比 pandas 解决方案还要快。
import io
import pandas as pd
import re
def jon_regex(data:str) -> pd.DataFrame:
results = []
prior = {}
date_pattern = re.compile(r'^\d{1,2}/\d{1,2}/\d{2}, \d{1,2}:\d{2}$')
with io.StringIO(data.strip()) as file_in:
for row in file_in:
parts = row.split(" - ")
## ------------------------
## If this row does not start with a date
## assume it is part of the prior message
## ------------------------
found_date = False
# Check if date
check = bool(date_pattern.match(parts[0]))
if check:
current = {
"date": parts[0],
"user": "group_notification",
"message": ""
}
found_date = True
if not found_date:
if prior:
prior["message"] += " " + row.strip()
continue # Pasar a la siguiente fila
## ------------------------
## The prior record is complete
## ------------------------
if prior:
results.append(prior)
## ------------------------
## Break the remaining text into user and message
## ------------------------
sub_parts = parts[1].split(":")
if len(sub_parts) > 1:
current["user"] = sub_parts[0].strip()
current["message"] = ":".join(sub_parts[1:]).strip()
else:
current["message"] = parts[1].strip()
prior = current
# ------------------------
# The prior record is complete
# ------------------------
if prior:
results.append(prior)
# Create Dataframe
df = pd.DataFrame(results)
#Change to datetime if needed
df['date'] = pd.to_datetime(df['date'],format='%m/%d/%y, %H:%M')
return df
基准:
import io
import pandas as pd
import re
import datetime
import timeit
def jon_solution(data: str) -> pd.DataFrame:
results = []
prior = {}
with io.StringIO(data) as file_in:
for row in file_in:
parts = row.split(" - ")
## ------------------------
## If this row does not start with a date
## assume it is part of the prior message
## ------------------------
try:
current = {
"date": datetime.datetime.strptime(
parts[0], "%m/%d/%y, %H:%M"),
"user": "group_notification",
"message": ""
}
except ValueError:
prior["message"] += row
continue
## ------------------------
## ------------------------
## The prior record is complete
## ------------------------
if prior:
results.append(prior)
## ------------------------
## ------------------------
## Break the remaining text into user and message
## ------------------------
sub_parts = parts[1].split(":")
if len(sub_parts) > 1:
current["user"] = sub_parts[0]
current["message"] = ":".join(sub_parts[1:])
else:
current["message"] = parts[1]
## ------------------------
prior = current
## ------------------------
## The prior record is complete
## ------------------------
results.append(prior)
## ------------------------
return pd.DataFrame(results)
def pandas_solution(data: str) -> pd.DataFrame:
df = pd.DataFrame({"line": data.splitlines(keepends=True)})
# Parse date.
# Assuming that multi-line messages have no line with a date prefix.
df[["date", "body"]] = df["line"].str.split(" - ", n=1, expand=True)
df["date"] = pd.to_datetime(
df["date"], format="%m/%d/%y, %H:%M", errors="coerce")
# If the line has no date, it must be a multi-line message.
# If it is a multi-line message, merge it into the last line with a date prefix.
is_multiline_message = df["date"].isna()
df["message_index"] = (~is_multiline_message).cumsum()
df.loc[is_multiline_message, "body"] = df.loc[is_multiline_message, "line"]
df = df.groupby("message_index").agg(
{"date": "first", "body": "sum"}).reset_index(drop=True)
# Parse user and message.
# Assuming that group notifications have no colon.
df[["user", "message"]] = df["body"].str.split(": ", n=1, expand=True)
is_group_notification = df["message"].isna()
df.loc[is_group_notification, "message"] = df.loc[
is_group_notification, "user"]
df.loc[is_group_notification, "user"] = "group_notification"
df.drop("body", axis=1, inplace=True)
return df
def jon_regex(data:str) -> pd.DataFrame:
results = []
prior = {}
date_pattern = re.compile(r'^\d{1,2}/\d{1,2}/\d{2}, \d{1,2}:\d{2}$')
with io.StringIO(data.strip()) as file_in:
for row in file_in:
parts = row.split(" - ")
## ------------------------
## If this row does not start with a date
## assume it is part of the prior message
## ------------------------
found_date = False
# Check if date
check = bool(date_pattern.match(parts[0]))
if check:
current = {
"date": parts[0],
"user": "group_notification",
"message": ""
}
found_date = True
if not found_date:
if prior:
prior["message"] += " " + row.strip()
continue # Pasar a la siguiente fila
## ------------------------
## The prior record is complete
## ------------------------
if prior:
results.append(prior)
## ------------------------
## Break the remaining text into user and message
## ------------------------
sub_parts = parts[1].split(":")
if len(sub_parts) > 1:
current["user"] = sub_parts[0].strip()
current["message"] = ":".join(sub_parts[1:]).strip()
else:
current["message"] = parts[1].strip()
prior = current
# ------------------------
# The prior record is complete
# ------------------------
if prior:
results.append(prior)
# Create Dataframe
df = pd.DataFrame(results)
#Change to datetime if needed
df['date'] = pd.to_datetime(df['date'],format='%m/%d/%y, %H:%M')
return df
def benchmark():
data = """\
9/6/22, 11:28 - Group creator created group "Example"
9/6/22, 11:28 - Group creator created group "Example"
9/6/22, 11:39 - User1: This is some text
9/6/22, 11:58 - Group creator changed group name to "Example2"
9/6/22, 12:13 - User2: This is
some text
with multiple lines
9/6/22, 13:13 - Admin changed group profile photo
9/6/22, 14:45 - User 3: Hi StackOverflow
"""
data *= 100000
for f in [jon_solution, pandas_solution,jon_regex]:
t = timeit.repeat(lambda: f(data), number=1, repeat=3)
print(f"{f.__name__}: {min(t)}")
benchmark()
结果:
jon_solution: 4.189222200075164
pandas_solution: 1.2554905998986214
jon_regex: 0.977813399862498
2
-
我想知道“我的”开销中有多少是解析,又有多少是创建日期时间对象。无论如何,这都做得很好。
– -
1@JonSG 好吧,使用 pyinstrument 作为分析器,我们可以看到它大约为 75%:100.0%wrapper legend.py:26└─99.7%jon_solution legend.py:61 ═─76.7%datetime._strptime_datetime _strptime.py:551│═─70.3%_strptime _strptime.py:293││═─35.3%[self] _strptime.py││═─20.4%_getlang _strptime.py:26││││═─18.3%getlocale locale.py:582│││││═─8.7%_parse_localename locale.py:464(它只是分析器输出的一部分;它不适合全部在这里)
–
|
使用更简单的正则表达式来捕获date
并将用户和消息user_message
分开post-processes
会减慢大型数据集的执行速度。
pattern = r'(\d{1,2}/\d{1,2}/\d{2,4},\s\d{1,2}:\d{2})\s-\s(.*)'
matches = re.findall(pattern, data)
df = pd.DataFrame(matches, columns=['date', 'user_message'])
users = []
messages = []
for message in df['user_message']:
entry = re.split(r'([^:]+):\s', message, maxsplit=1)
if len(entry) > 2: # user and message found
users.append(entry[1])
messages.append(entry[2].strip())
else:
users.append('group_notification') messages.append(entry[0].strip())
df['user'] = users
df['message'] = messages
df.drop(columns=['user_message'], inplace=True)
这将输出为:
date user message
0 9/6/22, 11:28 group_notification Group creator created group "Example"
1 9/6/22, 11:39 User1 This is some text
2 9/6/22, 11:58 group_notification Group creator changed group name to "Example2"
3 9/6/22, 12:13 User2 This is\nsome text\nwith multiple lines
4 9/6/22, 13:13 group_notification Admin changed group profile photo
5 9/6/22, 14:45 User3 Hi StackOverflow
1
-
这似乎运行良好且快速。但是,如果有多行组通知消息,例如以下数据:数据 =“”9/6/22,11:28 – 组创建者创建了组“Example”9/6/22,11:39 – User1:这是一些文本9/6/22,11:58 – 组创建者将组名称更改为“Example2”9/6/22,12:13 – User2:这是一些多行文本9/6/22,13:13 – 管理员更改了组个人资料照片9/6/22,14:45 – User3:Hi StackOverflow“”所有消息都会被转移。
–
|
/:-
”–
–
–
–
–
|