我有一个包含交易的 pandas 数据框。交易要么记为付款,要么记为 ledger_account_booking。一笔交易可以有多笔付款和/或多个分类账帐户记账。因此,我的列paymentsledger_account_bookings包含一个字典列表,其中字典中的列表数量可能有所不同。一个小的示例数据框如下所示:

交易 ID 总金额 日期 付款 ledger_account_bookings
4308 645,83 2024 年 8 月 30 日 [] []
4254 291,67 2024 年 2 月 7 日 [] [{‘ledger_id’:’4265’,’金额’:’291,67’}]
4128 847 2024 年 2 月 14 日 [{‘payment_id’:’4128’,’金额’:’847.0’}] []
4248 4286,98 2024 年 6 月 25 日 [{‘payment_id’:’4261’,’金额’:’400.0’},
 {‘payment_id’:’4262’,’金额’:’11.0’},
 {‘payment_id’:’4263’,’金额’:’1668.51’},
 {‘payment_id’:’4264’,’金额’:’1868.54’},
 {‘payment_id’:’4265’,’金额’:’20.91’},
 {‘payment_id’:’4266’,’金额’:’2.21’},
 {‘payment_id’:’4267’,’金额’:’309.62’}]
[{‘ledger_id’:’4265’,’金额’:’6,19’}]
4192 6130,22 2024 年 4 月 24 日 [{‘payment_id’:’4193’,’金额’:’9.68’}] [{‘ledger_id’:’4222’,’amount’:’2106.0’},
 {‘ledger_id’:’4222’,’amount’:’4014.54’}]
4090 1158,98 2024 年 1 月 25 日 [{‘id’:’4110′,’amount’:’16.22’},
 {‘id’:’4111’, ‘amount’:’84.0’},
 {‘id’:’4112’, ‘amount’:’41.99’},
 {‘id’:’4113, ‘amount’:’9.11’,}
 {‘id’:’4114’, ‘amount’:’10.0’},
 {‘id’:’4115’, ‘amount’:’997.16’}]
[{‘ledger_id’:’4231’,’金额’:’-0.32′},
 {‘ledger_id’:’4231’,’金额’:’-0.18′}]

我想要的是,其中一列中的每个字典paymentsledger_account_bookings数据框都成为一行。预期结果如下所示:

交易 ID 总金额 日期 付款编号 付款金额 分类帐 ID ledger_amount
4308 645,83 2024 年 8 月 30 日
4254 291,67 2024 年 2 月 7 日 4265 291,67
4128 847 2024 年 2 月 14 日 4128 847.0
4248 4286,98 2024 年 6 月 25 日 4261 400.0
4248 4286,98 2024 年 6 月 25 日 4262 11.0
4248 4286,98 2024 年 6 月 25 日 4263 1668.51
4248 4286,98 2024 年 6 月 25 日 4264 1868.4
4248 4286,98 2024 年 6 月 25 日 4265 20.91
4248 4286,98 2024 年 6 月 25 日 4266 2.21
4248 4286,98 2024 年 6 月 25 日 4267 309.62
4248 4286,98 2024 年 6 月 25 日 4265 6,19
4192 6130,22 2024 年 4 月 24 日 4193 9.68
4192 6130,22 2024 年 4 月 24 日 4222 2106
4192 6130,22 2024 年 4 月 24 日 4222 4014.54
4090 1158,98 2024 年 1 月 25 日 4110 16.22
4090 1158,98 2024 年 1 月 25 日 4111 84.0
4090 1158,98 2024 年 1 月 25 日 4112 41.99
4090 1158,98 2024 年 1 月 25 日 4113 9.11
4090 1158,98 2024 年 1 月 25 日 4114 10.0
4090 1158,98 2024 年 1 月 25 日 4115 997.16
4090 1158,98 2024 年 1 月 25 日 4231 0.32
4090 1158,98 2024 年 1 月 25 日 4231 0.18

例如,交易 4248 有 7 笔付款和 1 笔分类账账户记账。因此,生成的数据框将有 8 行。交易 4192 有 2 笔付款和 1 笔分类账账户记账,因此生成的数据框应有 3 行。

我知道如何针对一列实现这一点,例如使用以下代码:

df_explode = df_financial_mutations.explode(['payments'])
#Normalize the json column into separate columns
df_normalized = json_normalize(df_explode['payments'])
#Add prefix to the columns that were 'exploded'
df_normalized = df_normalized.add_prefix('payments_')

问题是,我不知道如何对两列执行此操作。如果我ledger_account_bookings再次调用explode on,结果会变得模糊,因为我已经对付款列进行了explode,因此我的数据框中引入了“重复”行。因此,在对付款进行explode的地方,我现在有两行在列中具有完全相同的值ledger_account_bookings。当我再次explode时,这次是在另一列上,那些“重复”也会被explode,所以我的数据框现在包含没有意义的数据行。

我该如何解决需要同时分解两列的问题?我已经看到了payments但不幸的是,和的列表ledger_account_bookings大小可能不同,也可能是动态的(例如,可能有 0-5 个付款和 0-5 个 ledger_account_bookings,没有固定值)

任何帮助都将不胜感激。

5

  • 4
    来编辑该问题


    – 

  • 这很困难,因为我的数据很大而且嵌套很深,我不想让我的问题变得混乱。我会尝试


    – 

  • ledger_account 列与上面提供的示例非常相似


    – 

  • 结果应该是什么?如果payments有 4 个字典,并且ledger_account_bookings有 2 个,那么数据框中的所有组合应该是 8 个最终行还是您期望什么?


    – 

  • 我已经使用示例输入和输出编辑了该问题。


    – 


最佳答案
2

按元组处理数据的通用解决方案:

#in tuple set original and new columns names prefixes
cols = [('payments', 'payments'),('ledger_account_bookings', 'ledger')]
L = []
for col, prefix in cols:
    df_explode = df_financial_mutations.pop(col).explode()
    #Normalize the json column into separate columns
    df_normalized = pd.json_normalize(df_explode).set_index(df_explode.index)
    #Add prefix to the columns that were 'exploded'
    #Remove missing values if all NaNs per rows
    df_normalized = df_normalized.add_prefix(f'{prefix}_').dropna(how='all')

    L.append(df_normalized)

#join original columns to concanecated list of DataFrames
out = df_financial_mutations.join(pd.concat(L)).reset_index(drop=True)

#clean data - replacement missing values by another column
out['payments_id'] = out['payments_id'].fillna(out.pop('payments_payment_id'))
#renaming columns names
out = out.rename(columns={'ledger_ledger_id':'ledger_id'})

print (out)
    transaction_id total_amount        date payments_amount payments_id  \
0             4308      645,83   30-8-2024              NaN         NaN   
1             4254      291,67    2-7-2024              NaN         NaN   
2             4128         847   14-2-2024            847.0        4128   
3             4248     4286,98   25-6-2024            400.0        4261   
4             4248     4286,98   25-6-2024             11.0        4262   
5             4248     4286,98   25-6-2024          1668.51        4263   
6             4248     4286,98   25-6-2024          1868.54        4264   
7             4248     4286,98   25-6-2024            20.91        4265   
8             4248     4286,98   25-6-2024             2.21        4266   
9             4248     4286,98   25-6-2024           309.62        4267   
10            4248     4286,98   25-6-2024              NaN         NaN   
11            4192     6130,22   24-4-2024             9.68        4193   
12            4192     6130,22   24-4-2024              NaN         NaN   
13            4192     6130,22   24-4-2024              NaN         NaN   
14            4090     1158,98   25-1-2024            16.22        4110   
15            4090     1158,98   25-1-2024             84.0        4111   
16            4090     1158,98   25-1-2024            41.99        4112   
17            4090     1158,98   25-1-2024             9.11        4113   
18            4090     1158,98   25-1-2024             10.0        4114   
19            4090     1158,98   25-1-2024           997.16        4115   
20            4090     1158,98   25-1-2024              NaN         NaN   
21            4090     1158,98   25-1-2024              NaN         NaN   

   ledger_id ledger_amount  
0        NaN           NaN  
1       4265        291,67  
2        NaN           NaN  
3        NaN           NaN  
4        NaN           NaN  
5        NaN           NaN  
6        NaN           NaN  
7        NaN           NaN  
8        NaN           NaN  
9        NaN           NaN  
10      4265          6,19  
11       NaN           NaN  
12      4222        2106.0  
13      4222       4014.54  
14       NaN           NaN  
15       NaN           NaN  
16       NaN           NaN  
17       NaN           NaN  
18       NaN           NaN  
19       NaN           NaN  
20      4231         -0.32  
21      4231         -0.18  

我建议分别处理每一列并加入原始数据 – 解决方案分别处理每一列:

#extract column payments by pop and expoding
df_explode = df_financial_mutations.pop('payments').explode()
#Normalize the json column into separate columns 
#Rewrite new index by original values from exploded DataFrame
df_normalized = pd.json_normalize(df_explode).set_index(df_explode.index)
#Add prefix to the columns that were 'exploded'
df_normalized = df_normalized.add_prefix('payments_')

#Rewrite missing values from  payments_id by payments_payment_id and remove column
df_normalized['payments_id'] = (df_normalized['payments_id']
                                 .fillna(df_normalized.pop('payments_payment_id')))

#Remove missing values if all NaNs per rows
df_normalized = df_normalized.dropna(how='all')

print (df_normalized)
  payments_amount payments_id
2           847.0        4128
3           400.0        4261
3            11.0        4262
3         1668.51        4263
3         1868.54        4264
3           20.91        4265
3            2.21        4266
3          309.62        4267
4            9.68        4193
5           16.22        4110
5            84.0        4111
5           41.99        4112
5            9.11        4113
5            10.0        4114
5          997.16        4115

df_explode1 = df_financial_mutations.pop('ledger_account_bookings').explode()
#Normalize the json column into separate columns
#Rewrite new index by original values from exploded DataFrame
df_normalized1 = pd.json_normalize(df_explode1).set_index(df_explode1.index)
#Add prefix to the columns that were 'exploded'
df_normalized1 = df_normalized1.add_prefix('ledger_')
#Remove missing values if all NaNs per rows
df_normalized1 = df_normalized1.dropna(how='all')

print (df_normalized1)
  ledger_ledger_id ledger_amount
1             4265        291,67
3             4265          6,19
4             4222        2106.0
4             4222       4014.54
5             4231         -0.32
5             4231         -0.18

out = df_financial_mutations.join(pd.concat([df_normalized, df_normalized1]))
print (out)

   transaction_id total_amount        date payments_amount payments_id  \
0            4308      645,83   30-8-2024              NaN         NaN   
1            4254      291,67    2-7-2024              NaN         NaN   
2            4128         847   14-2-2024            847.0        4128   
3            4248     4286,98   25-6-2024            400.0        4261   
3            4248     4286,98   25-6-2024             11.0        4262   
3            4248     4286,98   25-6-2024          1668.51        4263   
3            4248     4286,98   25-6-2024          1868.54        4264   
3            4248     4286,98   25-6-2024            20.91        4265   
3            4248     4286,98   25-6-2024             2.21        4266   
3            4248     4286,98   25-6-2024           309.62        4267   
3            4248     4286,98   25-6-2024              NaN         NaN   
4            4192     6130,22   24-4-2024             9.68        4193   
4            4192     6130,22   24-4-2024              NaN         NaN   
4            4192     6130,22   24-4-2024              NaN         NaN   
5            4090     1158,98   25-1-2024            16.22        4110   
5            4090     1158,98   25-1-2024             84.0        4111   
5            4090     1158,98   25-1-2024            41.99        4112   
5            4090     1158,98   25-1-2024             9.11        4113   
5            4090     1158,98   25-1-2024             10.0        4114   
5            4090     1158,98   25-1-2024           997.16        4115   
5            4090     1158,98   25-1-2024              NaN         NaN   
5            4090     1158,98   25-1-2024              NaN         NaN   

  ledger_ledger_id ledger_amount  
0              NaN           NaN  
1             4265        291,67  
2              NaN           NaN  
3              NaN           NaN  
3              NaN           NaN  
3              NaN           NaN  
3              NaN           NaN  
3              NaN           NaN  
3              NaN           NaN  
3              NaN           NaN  
3             4265          6,19  
4              NaN           NaN  
4             4222        2106.0  
4             4222       4014.54  
5              NaN           NaN  
5              NaN           NaN  
5              NaN           NaN  
5              NaN           NaN  
5              NaN           NaN  
5              NaN           NaN  
5             4231         -0.32  
5             4231         -0.18  

#Create default index if necessary
out = out.reset_index(drop=True)

5

  • 非常感谢花时间回答我的问题。这几乎是完美的。是否也可以在结果数据框中为支付或分类帐帐户预订中的每个条目分配一行?例如,您的解决方案为交易 4248 生成 7 行,其中第一行包含付款和分类帐帐户预订。理想情况下,我会有一个解决方案,其中每个付款/分类帐帐户预订都有一个单独的行,因此结果数据框将有 8 行交易 4248


    – 

  • 这很管用 🙂 很棒


    – 

  • 你能进一步解释一下这个步骤吗,我不明白你在那里做什么:#用 payment_payment_id 重写 payment_id 中的缺失值并删除列df_normalized['payments_id'] = (df_normalized['payments_id'].fillna(df_normalized.pop('payments_id')))为什么我们要在这里使用 fillna 函数?


    – 


  • @Psychotechnopath – 原因是有两种类型的 idin jsonpayment_id并且id只需要最终 DataFrame 中的列。因此解决方案payment_id列替换 NaN 。换句话说 – 原因是列中的数据payments_idpayments_idpayments


    – 


  • @Psychotechnopath – 我创建了更通用的解决方案并添加到答案中。


    – 

为了便于概括,我提出以下解决方案和示例。该解决方案也适用于两行以上的情况。

import pandas as pd
import numpy as np

def extract_nested(df: pd.DataFrame,
                   column: str, 
                   prefix: (bool, None, str) = None, 
                   *args, **kwargs):
    df_i = df[column].explode(ignore_index=False)  # Merge the extracted data back with the non-nested columns
    df_i = pd.json_normalize(df_i, *args, **kwargs).set_index(df_i.index)  # Normalize the JSON-like structure into tabular format by containing the index
    
    if prefix != False:  # Add prefix to the resulting columns if set
        df_i = df_i.add_prefix(f'{column}_' if prefix is None else prefix)
    return df_i

def vstack_extract_nested(df: pd.DataFrame, 
                          columns: list, 
                          prefix: (bool, None, str, list) = None):
    df_ = df[df.columns.difference(columns)] # Select all columns except the ones containing nested data for the later merge
    prefix = prefix if isinstance(prefix, list) else [prefix]*len(columns)  # get prefix in the right shape
    df_list = [df_[~df[columns].astype(bool).any(axis=1)]]  # DataFrame rows that don't contain nested data
    
    for i, p_i in zip(columns, prefix):
        df_i = extract_nested(df, i, prefix=p_i).dropna(how='all')
        df_list.append(pd.merge(df_, df_i, left_index=True, right_index=True, how='right'))  # Merge the extracted data back with the non-nested columns

    return pd.concat(df_list).sort_index() # concat DataFrame rows that don't contain nested data with the those who does

例子

具有 3 个嵌套列的示例

df = pd.DataFrame({'id': 100+np.arange(5), 'col1': [[]]*5, 'col2': [[]]*5, 'col3': [[]]*5})
df.at[1, 'col1'] = [{'a': 1, 'b': 2}]
df.at[3, 'col1'] = [{'a': 3, 'b': 4}, {'a': 5, 'b': 6}]
df.at[2, 'col2'] = [{'A': 10, 'B': 20}]
df.at[3, 'col2'] = [{'A': 30, 'B': 40}, {'A': 50, 'B': 60}]
df.at[2, 'col3'] = [{'X': 100, 'Y': 200}]
df.at[3, 'col3'] = [{'X': 300, 'Y': 400}, {'X': 500, 'Y': 600}]

vstack_extract_nested(df, ['col1', 'col2', 'col3'], prefix=[None, '','c3_'])
#     id  col1_a  col1_b       A       B   c3_X   c3_Y
# 0  100     NaN     NaN     NaN     NaN    NaN    NaN
# 1  101     1.0     2.0     NaN     NaN    NaN    NaN
# 2  102     NaN     NaN    10.0    20.0    NaN    NaN
# 2  102     NaN     NaN     NaN     NaN  100.0  200.0
# 3  103     3.0     4.0     NaN     NaN    NaN    NaN
# 3  103     5.0     6.0     NaN     NaN    NaN    NaN
# 3  103     NaN     NaN    30.0    40.0    NaN    NaN
# 3  103     NaN     NaN    50.0    60.0    NaN    NaN
# 3  103     NaN     NaN     NaN     NaN  300.0  400.0
# 3  103     NaN     NaN     NaN     NaN  500.0  600.0
# 4  104     NaN     NaN     NaN     NaN    NaN    NaN

OP 用例示例

df = pd.DataFrame({'transaction_id': [4308, 4254, 4128, 4090], 'payments': [[]]*4, 'ledger_account_bookings': [[]]*4})
df.at[1, 'ledger_account_bookings'] = [{'ledger_id': '4265', 'amount': '291,67'}]
df.at[2, 'payments'] = [{'payment_id': '4128', 'amount': '847.0'}]
df.at[3, 'ledger_account_bookings'] = [{'ledger_id': '4231', 'amount': '-0.32'},
                                       {'ledger_id': '4231', 'amount': '-0.18'}]
df.at[3, 'payments'] = [{'id': '4110','amount': '16.22'},
                        {'id': '4111', 'amount': '84.0'},
                        {'id': '4112', 'amount': '41.99'}]

# extract the nested columns
df_exp = vstack_extract_nested(df, ['ledger_account_bookings', 'payments'], prefix=['ledger_', 'payments_'])

# correct naming
df_exp.fillna(df_exp.pop('payments_payment_id'), inplace=True)  # correct naming of payments column (payment_id, payments_payment_id)
df_exp.rename(columns={'ledger_ledger_id':'ledger_id'}, inplace=True)  # correct naming of ledger column
df_exp
#    transaction_id ledger_id ledger_amount payments_amount payments_id
# 0            4308       NaN           NaN             NaN         NaN
# 1            4254      4265        291,67             NaN         NaN
# 2            4128       NaN           NaN           847.0         NaN
# 3            4090      4231         -0.32             NaN         NaN
# 3            4090      4231         -0.18             NaN         NaN
# 3            4090       NaN           NaN           16.22        4110
# 3            4090       NaN           NaN            84.0        4111
# 3            4090       NaN           NaN           41.99        4112

2

  • 1
    这种方法对我不起作用 – 给出KeyError: "None of [Index(['col1', 'col2'], dtype='object')] are in the [columns]"。也许这条线上出了什么问题?mask_all_empty = ~df[['col1', 'col2']].astype(bool).any(axis=1)


    – 


  • 1
    好主意。应该是~df[columns].astype(bool).any(axis=1)。改了。


    –