我需要一个正则表达式模式来创建一个包含三个独立列的 Pandas DataFrame:日期、用户和消息。我相信正则表达式是这种情况的最佳选择,但如果有其他方法可以更快地实现相同的结果,我想知道。

该模式必须:

  • 捕获整个消息,包括多行消息。

  • 检测何时没有用户,并将其归类为用户列中的组通知。

我已经创建了一个,但我觉得它并没有真正优化,在大型数据集上运行需要几秒钟。这是我目前拥有的模式,以及我需要提取的数据和格式的示例。

data = """
9/6/22, 11:28 - Group creator created group "Example"
9/6/22, 11:39 - User1: This is some text
9/6/22, 11:58 - Group creator changed group name to "Example2"
9/6/22, 12:13 - User2: This is
some text
with multiple lines
9/6/22, 13:13 - Admin changed group profile photo
9/6/22, 14:45 - User3: Hi StackOverflow
"""


pattern = r'\d{1,2}/\d{1,2}/\d{2,4},\s\d{1,2}:\d{2}\s-\s'

#pass the pattern and data to split it to get the list of messages with users
messages = re.split(pattern, data)[1:]

#extract all dates
dates = re.findall(pattern, data)

#create dataframe
df = pd.DataFrame({'user_message': messages, 'date': dates})

#separate Users and Message
users = []
messages = []
for message in df['user_message']:
    entry = re.split(r'([\w\W]+?):\s', message)
    if entry[1:]:  # user name
        users.append(entry[1])
        messages.append(" ".join(entry[2:]))
    else:
        users.append('group_notification')
        messages.append(entry[0])

df['user'] = users
df['message'] = messages
df.drop(columns=['user_message'], inplace=True)

13

  • 1
    用户名或消息是否可以包含以下每个字符“ /:-


    – 


  • 2
    正则表达式很灵活,但通常很慢。如果你处理 ASCII 字符串,那么你可以用本机编译语言进行自己的解析,这可能会快 10 倍到 100 倍(取决于本机代码的优化级别)。


    – 

  • 1
    如果您确实想使用正则表达式来实现灵活性,那么您可以稍微使用一下引擎来减少回溯。这需要一些平均解析速度较慢的示例字符串(未提供)和实际的正则表达式。另一种方法是使用保证花费线性时间的正则表达式引擎,但我认为这不是问题。


    – 

  • 1
    一种简单的替代方法是使用多个进程来执行正则表达式计算,因为正则表达式代码通常受计算限制(否则会使其变慢)。


    – 

  • 3
    如果用户可以编写多行消息,他们可能会插入看起来像是来自其他人的行。您如何处理这种情况?


    – 


最佳答案
4

假设用户不能弄乱并构建模仿新条目的多行消息,我可能会手动解析它而不是使用正则表达式,然后将数据加载到 pandas 中(如果您确实需要的话)。

这里的技巧是维护“先前”记录的句柄,以便在我们有后续消息的情况下我们可以扩展先前的消息。

您可以尝试以下概念:

import io
import datetime
import pandas

FILE_TEXT = """
9/6/22, 11:28 - Group creator created group "Example"
9/6/22, 11:39 - User1: This is some text
9/6/22, 11:58 - Group creator changed group name to "Example2"
9/6/22, 12:13 - User2: This is
some text
with multiple lines
9/6/22, 13:13 - Admin changed group profile photo
9/6/22, 14:45 - User 3: Hi : StackOverflow
"""

def try_parse_date(date_string):
    potential_formats = [
        "%m/%d/%y, %H:%M",
        "%d/%m/%y, %H:%M"
    ]

    final_error = None
    for potential_format in potential_formats:
        try:
            return datetime.datetime.strptime(date_string, potential_format)
        except ValueError as e:
            final_error = e
    raise final_error

results = []
prior = {}
with io.StringIO(FILE_TEXT) as file_in:
    for index, row in enumerate(file_in):
        row = row.strip()
        if not row:
            continue

        ## ------------------------
        ## We have a row, break it in to potential parts
        ## ------------------------
        potential_date, *potential_user_and_message = row.rstrip("\n").split(" - ")
        potential_user_and_message = " - ".join(potential_user_and_message)
        ## ------------------------

        ## ------------------------
        ## If this row does not start with a date
        ## assume it is part of the prior message
        ## ------------------------
        try:
            best_date_guess = try_parse_date(potential_date)
        except ValueError:
            if not index:
                raise ValueError(f"The first row must start with a valid date")
            print(f"**Warning** row {index} does not begin with valid date")
            prior["message"] += row
            continue
        ## ------------------------

        ## ------------------------
        ## Break the remaining text into user and message
        ## ------------------------
        potential_user, *potential_message = potential_user_and_message.split(":")
        if potential_message:
            best_user_guess = potential_user
            best_message_guess = ":".join(potential_message)
        else:
            best_user_guess = "group_notification"
            best_message_guess = potential_user_and_message
        ## ------------------------

        ## ------------------------
        ## The prior record is complete
        ## ------------------------
        if prior:
            results.append(prior)
        ## ------------------------

        ## ------------------------
        ## Reset the prior record to this record
        ## ------------------------
        prior = {"date": best_date_guess, "user": best_user_guess, "message": best_message_guess}
        ## ------------------------

    ## ------------------------
    ## The prior record is complete
    ## ------------------------
    if prior:
        results.append(prior)
    ## ------------------------

print(pandas.DataFrame(results))

这应该给你:

                 date                user                                         message
0 2022-09-06 11:28:00  group_notification           Group creator created group "Example"
1 2022-09-06 11:39:00               User1                               This is some text
2 2022-09-06 11:58:00  group_notification  Group creator changed group name to "Example2"
3 2022-09-06 12:13:00               User2             This issome textwith multiple lines
4 2022-09-06 13:13:00  group_notification               Admin changed group profile photo
5 2022-09-06 14:45:00              User 3                              Hi : StackOverflow

4

  • 这个方法很好,但是如果用户名包含任何空格(我没有说清楚;很抱歉),比如这里的“用户 3”,那么它就会被归类为群组通知:data = “”” 9/6/22, 11:28 – 群组创建者创建群组“示例” 9/6/22, 11:28 – 群组创建者创建群组“示例” 9/6/22, 11:39 – 用户 1:这是一些文本 9/6/22, 11:58 – 群组创建者将群组名称更改为“示例 2” 9/6/22, 12:13 – 用户 2:这是一些包含多行的文本 9/6/22, 13:13 – 管理员更改了群组个人资料照片 9/6/22, 14:45 – 用户 3:Hi StackOverflow“””.strip()


    – 


  • 1
    如果用户名可以包含空格,那么您必须按 进行拆分:,但这会引入一个问题,即群组消息绝不能包含这样的字符。在我看来,此时也许应该回到数据源并确定它是否被正确拆分和转义。


    – 

  • 我注意到另一个问题,这将日期格式限制为只有一种,如果我想尝试三种不同的日期格式怎么办?我无法捕获 ValueError 并尝试格式列表中的另一种格式,因为这会干扰此行“prior[“message”] += row”中的多行消息。我尝试了 dateutil.parser.parse(parts[0]),但速度太慢了


    – 

  • 1
    从技术上讲,你可以嵌套 try 块,并且只有最后、最内层的 except 执行前一个/继续


    – 

这与 JonSG 的方法类似,但从一开始就使用了 pandas。

import pandas as pd

def pandas_solution(data: str) -> pd.DataFrame:
    df = pd.DataFrame({"line": data.splitlines(keepends=True)})

    # Parse date.
    # Assuming that multi-line messages have no line with a date prefix.
    df[["date", "body"]] = df["line"].str.split(" - ", n=1, expand=True)
    df["date"] = pd.to_datetime(df["date"], format="%m/%d/%y, %H:%M", errors="coerce")

    # If the line has no date, it must be a multi-line message.
    # If it is a multi-line message, merge it into the last line with a date prefix.
    is_multiline_message = df["date"].isna()
    df["message_index"] = (~is_multiline_message).cumsum()
    df.loc[is_multiline_message, "body"] = df.loc[is_multiline_message, "line"]
    df = df.groupby("message_index").agg({"date": "first", "body": "sum"})

    # Parse user and message.
    # Assuming that group notifications have no colon.
    df[["user", "message"]] = df["body"].str.split(": ", n=1, expand=True)
    is_group_notification = df["message"].isna()
    df.loc[is_group_notification, "message"] = df.loc[is_group_notification, "user"]
    df.loc[is_group_notification, "user"] = "group_notification"
    df.drop("body", axis=1, inplace=True)
    return df

基准:

import datetime
import io
import re
import timeit

import pandas as pd

pd.set_option("display.width", 1000)
pd.set_option("display.max_rows", 300)
pd.set_option("display.max_columns", 100)


def baseline(data):
    pattern = r'\d{1,2}/\d{1,2}/\d{2,4},\s\d{1,2}:\d{2}\s-\s'

    # pass the pattern and data to split it to get the list of messages with users
    messages = re.split(pattern, data)[1:]

    # extract all dates
    dates = re.findall(pattern, data)

    # create dataframe
    df = pd.DataFrame({'user_message': messages, 'date': dates})

    # separate Users and Message
    users = []
    messages = []
    for message in df['user_message']:
        entry = re.split(r'([\w\W]+?):\s', message)
        if entry[1:]:  # user name
            users.append(entry[1])
            messages.append(" ".join(entry[2:]))
        else:
            users.append('group_notification')
            messages.append(entry[0])

    df['user'] = users
    df['message'] = messages
    df.drop(columns=['user_message'], inplace=True)
    return df


def jon_solution(data: str) -> pd.DataFrame:

    results = []
    prior = {}
    with io.StringIO(data) as file_in:
        for row in file_in:
            parts = row.split(" - ")

            ## ------------------------
            ## If this row does not start with a date
            ## assume it is part of the prior message
            ## ------------------------
            try:
                current = {
                    "date": datetime.datetime.strptime(parts[0], "%m/%d/%y, %H:%M"),
                    "user": "group_notification",
                    "message": ""
                }
            except ValueError:
                prior["message"] += row
                continue
            ## ------------------------

            ## ------------------------
            ## The prior record is complete
            ## ------------------------
            if prior:
                results.append(prior)
            ## ------------------------

            ## ------------------------
            ## Break the remaining text into user and message
            ## ------------------------
            sub_parts = parts[1].split(":")
            if len(sub_parts) > 1:
                current["user"] = sub_parts[0]
                current["message"] = ":".join(sub_parts[1:])
            else:
                current["message"] = parts[1]
            ## ------------------------

            prior = current

        ## ------------------------
        ## The prior record is complete
        ## ------------------------
        results.append(prior)
        ## ------------------------

    return pd.DataFrame(results)


def pandas_solution(data: str) -> pd.DataFrame:
    df = pd.DataFrame({"line": data.splitlines(keepends=True)})

    # Parse date.
    # Assuming that multi-line messages have no line with a date prefix.
    df[["date", "body"]] = df["line"].str.split(" - ", n=1, expand=True)
    df["date"] = pd.to_datetime(df["date"], format="%m/%d/%y, %H:%M", errors="coerce")

    # If the line has no date, it must be a multi-line message.
    # If it is a multi-line message, merge it into the last line with a date prefix.
    is_multiline_message = df["date"].isna()
    df["message_index"] = (~is_multiline_message).cumsum()
    df.loc[is_multiline_message, "body"] = df.loc[is_multiline_message, "line"]
    df = df.groupby("message_index").agg({"date": "first", "body": "sum"}).reset_index(drop=True)

    # Parse user and message.
    # Assuming that group notifications have no colon.
    df[["user", "message"]] = df["body"].str.split(": ", n=1, expand=True)
    is_group_notification = df["message"].isna()
    df.loc[is_group_notification, "message"] = df.loc[is_group_notification, "user"]
    df.loc[is_group_notification, "user"] = "group_notification"
    df.drop("body", axis=1, inplace=True)
    return df


def benchmark():
    data = """\
9/6/22, 11:28 - Group creator created group "Example"
9/6/22, 11:39 - User1: This is some text
9/6/22, 11:58 - Group creator changed group name to "Example2"
9/6/22, 12:13 - User2: This is
some text
with - multiple: lines
9/6/22, 13:13 - Admin changed group profile photo
9/6/22, 14:45 - User3: Hi StackOverflow
"""
    data *= 100000

    for f in [baseline, jon_solution, pandas_solution]:
        t = timeit.repeat(lambda: f(data), number=1, repeat=3)
        print(f"{f.__name__}: {min(t)}")


benchmark()

结果:

baseline: 7.355876599998737
jon_solution: 5.135336500003177
pandas_solution: 1.501740499999869

我还没有测试过,但可能更快,如果我们可以使用话,它可能会更快。

1

  • 这非常优雅,尽管我发现如果使用正则表达式,JonSG 的解决方案会更快。如果我设法实现 Polars 或 Numba,也许你的解决方案会更快。


    – 

我注意到 JonSG 解决方案的瓶颈在于 strptime 部分,因此我改用 regex,结果发现它甚至比 pandas 解决方案还要快。

import io
import pandas as pd
import re

def jon_regex(data:str) -> pd.DataFrame:
results = []
prior = {}
date_pattern = re.compile(r'^\d{1,2}/\d{1,2}/\d{2}, \d{1,2}:\d{2}$')

with io.StringIO(data.strip()) as file_in:
    for row in file_in:
        parts = row.split(" - ")

        ## ------------------------
        ## If this row does not start with a date
        ## assume it is part of the prior message
        ## ------------------------
        found_date = False
        # Check if date
        check = bool(date_pattern.match(parts[0]))
        if check:
            current = {
                "date": parts[0],
                "user": "group_notification",
                "message": ""
            }
            found_date = True

        if not found_date:
            if prior:
                prior["message"] += " " + row.strip()
            continue  # Pasar a la siguiente fila

        ## ------------------------
        ## The prior record is complete
        ## ------------------------
        if prior:
            results.append(prior)

        ## ------------------------
        ## Break the remaining text into user and message
        ## ------------------------
        sub_parts = parts[1].split(":")
        if len(sub_parts) > 1:
            current["user"] = sub_parts[0].strip()
            current["message"] = ":".join(sub_parts[1:]).strip()
        else:
            current["message"] = parts[1].strip()

        prior = current

    # ------------------------
    # The prior record is complete
    # ------------------------
    if prior:
        results.append(prior)

    # Create Dataframe
    df = pd.DataFrame(results)

    #Change to datetime if needed
    df['date'] = pd.to_datetime(df['date'],format='%m/%d/%y, %H:%M')
    
    return df

基准:

import io
import pandas as pd
import re
import datetime
import timeit
def jon_solution(data: str) -> pd.DataFrame:

    results = []
    prior = {}
    with io.StringIO(data) as file_in:
        for row in file_in:
            parts = row.split(" - ")

            ## ------------------------
            ## If this row does not start with a date
            ## assume it is part of the prior message
            ## ------------------------
            try:
                current = {
                    "date": datetime.datetime.strptime(
                        parts[0], "%m/%d/%y, %H:%M"),
                    "user": "group_notification",
                    "message": ""
                }
            except ValueError:
                prior["message"] += row
                continue
            ## ------------------------

            ## ------------------------
            ## The prior record is complete
            ## ------------------------
            if prior:
                results.append(prior)
            ## ------------------------

            ## ------------------------
            ## Break the remaining text into user and message
            ## ------------------------
            sub_parts = parts[1].split(":")
            if len(sub_parts) > 1:
                current["user"] = sub_parts[0]
                current["message"] = ":".join(sub_parts[1:])
            else:
                current["message"] = parts[1]
            ## ------------------------

            prior = current

        ## ------------------------
        ## The prior record is complete
        ## ------------------------
        results.append(prior)
        ## ------------------------

    return pd.DataFrame(results)


def pandas_solution(data: str) -> pd.DataFrame:
    df = pd.DataFrame({"line": data.splitlines(keepends=True)})

    # Parse date.
    # Assuming that multi-line messages have no line with a date prefix.
    df[["date", "body"]] = df["line"].str.split(" - ", n=1, expand=True)
    df["date"] = pd.to_datetime(
        df["date"], format="%m/%d/%y, %H:%M", errors="coerce")

    # If the line has no date, it must be a multi-line message.
    # If it is a multi-line message, merge it into the last line with a date prefix.
    is_multiline_message = df["date"].isna()
    df["message_index"] = (~is_multiline_message).cumsum()
    df.loc[is_multiline_message, "body"] = df.loc[is_multiline_message, "line"]
    df = df.groupby("message_index").agg(
        {"date": "first", "body": "sum"}).reset_index(drop=True)

    # Parse user and message.
    # Assuming that group notifications have no colon.
    df[["user", "message"]] = df["body"].str.split(": ", n=1, expand=True)
    is_group_notification = df["message"].isna()
    df.loc[is_group_notification, "message"] = df.loc[
        is_group_notification, "user"]
    df.loc[is_group_notification, "user"] = "group_notification"
    df.drop("body", axis=1, inplace=True)
    return df

def jon_regex(data:str) -> pd.DataFrame:
    results = []
    prior = {}
    date_pattern = re.compile(r'^\d{1,2}/\d{1,2}/\d{2}, \d{1,2}:\d{2}$')

    with io.StringIO(data.strip()) as file_in:
        for row in file_in:
            parts = row.split(" - ")

            ## ------------------------
            ## If this row does not start with a date
            ## assume it is part of the prior message
            ## ------------------------
            found_date = False
            # Check if date
            check = bool(date_pattern.match(parts[0]))
            if check:
                current = {
                    "date": parts[0],
                    "user": "group_notification",
                    "message": ""
                }
                found_date = True

            if not found_date:
                if prior:
                    prior["message"] += " " + row.strip()
                continue  # Pasar a la siguiente fila

            ## ------------------------
            ## The prior record is complete
            ## ------------------------
            if prior:
                results.append(prior)

            ## ------------------------
            ## Break the remaining text into user and message
            ## ------------------------
            sub_parts = parts[1].split(":")
            if len(sub_parts) > 1:
                current["user"] = sub_parts[0].strip()
                current["message"] = ":".join(sub_parts[1:]).strip()
            else:
                current["message"] = parts[1].strip()

            prior = current

        # ------------------------
        # The prior record is complete
        # ------------------------
        if prior:
            results.append(prior)

        # Create Dataframe
        df = pd.DataFrame(results)

        #Change to datetime if needed
        df['date'] = pd.to_datetime(df['date'],format='%m/%d/%y, %H:%M')
        
        return df

def benchmark():
    data = """\
9/6/22, 11:28 - Group creator created group "Example"
9/6/22, 11:28 - Group creator created group "Example"
9/6/22, 11:39 - User1: This is some text
9/6/22, 11:58 - Group creator changed group name to "Example2"
9/6/22, 12:13 - User2: This is
some text
with multiple lines
9/6/22, 13:13 - Admin changed group profile photo
9/6/22, 14:45 - User 3: Hi StackOverflow
"""
    data *= 100000

    for f in [jon_solution, pandas_solution,jon_regex]:
        t = timeit.repeat(lambda: f(data), number=1, repeat=3)
        print(f"{f.__name__}: {min(t)}")


benchmark()

结果:

jon_solution: 4.189222200075164
pandas_solution: 1.2554905998986214
jon_regex: 0.977813399862498

2

  • 我想知道“我的”开销中有多少是解析,又有多少是创建日期时间对象。无论如何,这都做得很好。


    – 

  • 1
    @JonSG 好吧,使用 pyinstrument 作为分析器,我们可以看到它大约为 75%:100.0%wrapper legend.py:26└─99.7%jon_solution legend.py:61 ═─76.7%datetime._strptime_datetime _strptime.py:551│═─70.3%_strptime _strptime.py:293││═─35.3%[self] _strptime.py││═─20.4%_getlang _strptime.py:26││││═─18.3%getlocale locale.py:582│││││═─8.7%_parse_localename locale.py:464(它只是分析器输出的一部分;它不适合全部在这里)


    – 


使用更简单的正则表达式来捕获date并将用户和消息user_message分开post-processes会减慢大型数据集的执行速度。


pattern = r'(\d{1,2}/\d{1,2}/\d{2,4},\s\d{1,2}:\d{2})\s-\s(.*)'

matches = re.findall(pattern, data)
df = pd.DataFrame(matches, columns=['date', 'user_message'])
users = []
messages = []
for message in df['user_message']:
    entry = re.split(r'([^:]+):\s', message, maxsplit=1)
    if len(entry) > 2:  # user and message found
        users.append(entry[1])
        messages.append(entry[2].strip())
    else:
        users.append('group_notification')      messages.append(entry[0].strip())

df['user'] = users
df['message'] = messages
df.drop(columns=['user_message'], inplace=True)

这将输出为:

              date                user                                            message
0  9/6/22, 11:28  group_notification               Group creator created group "Example"
1  9/6/22, 11:39               User1                                 This is some text
2  9/6/22, 11:58  group_notification  Group creator changed group name to "Example2"
3  9/6/22, 12:13               User2  This is\nsome text\nwith multiple lines
4  9/6/22, 13:13  group_notification               Admin changed group profile photo
5  9/6/22, 14:45               User3                              Hi StackOverflow

1

  • 这似乎运行良好且快速。但是,如果有多行组通知消息,例如以下数据:数据 =“”9/6/22,11:28 – 组创建者创建了组“Example”9/6/22,11:39 – User1:这是一些文本9/6/22,11:58 – 组创建者将组名称更改为“Example2”9/6/22,12:13 – User2:这是一些多行文本9/6/22,13:13 – 管理员更改了组个人资料照片9/6/22,14:45 – User3:Hi StackOverflow“”所有消息都会被转移。


    –