我有一个包含交易的 pandas 数据框。交易要么记为付款,要么记为 ledger_account_booking。一笔交易可以有多笔付款和/或多个分类账帐户记账。因此,我的列payments
和ledger_account_bookings
包含一个字典列表,其中字典中的列表数量可能有所不同。一个小的示例数据框如下所示:
交易 ID | 总金额 | 日期 | 付款 | ledger_account_bookings |
---|---|---|---|---|
4308 | 645,83 | 2024 年 8 月 30 日 | [] | [] |
4254 | 291,67 | 2024 年 2 月 7 日 | [] | [{‘ledger_id’:’4265’,’金额’:’291,67’}] |
4128 | 847 | 2024 年 2 月 14 日 | [{‘payment_id’:’4128’,’金额’:’847.0’}] | [] |
4248 | 4286,98 | 2024 年 6 月 25 日 | [{‘payment_id’:’4261’,’金额’:’400.0’}, {‘payment_id’:’4262’,’金额’:’11.0’}, {‘payment_id’:’4263’,’金额’:’1668.51’}, {‘payment_id’:’4264’,’金额’:’1868.54’}, {‘payment_id’:’4265’,’金额’:’20.91’}, {‘payment_id’:’4266’,’金额’:’2.21’}, {‘payment_id’:’4267’,’金额’:’309.62’}] |
[{‘ledger_id’:’4265’,’金额’:’6,19’}] |
4192 | 6130,22 | 2024 年 4 月 24 日 | [{‘payment_id’:’4193’,’金额’:’9.68’}] | [{‘ledger_id’:’4222’,’amount’:’2106.0’}, {‘ledger_id’:’4222’,’amount’:’4014.54’}] |
4090 | 1158,98 | 2024 年 1 月 25 日 | [{‘id’:’4110′,’amount’:’16.22’}, {‘id’:’4111’, ‘amount’:’84.0’}, {‘id’:’4112’, ‘amount’:’41.99’}, {‘id’:’4113, ‘amount’:’9.11’,} {‘id’:’4114’, ‘amount’:’10.0’}, {‘id’:’4115’, ‘amount’:’997.16’}] |
[{‘ledger_id’:’4231’,’金额’:’-0.32′}, {‘ledger_id’:’4231’,’金额’:’-0.18′}] |
我想要的是,其中一列中的每个字典payments
或ledger_account_bookings
数据框都成为一行。预期结果如下所示:
交易 ID | 总金额 | 日期 | 付款编号 | 付款金额 | 分类帐 ID | ledger_amount |
---|---|---|---|---|---|---|
4308 | 645,83 | 2024 年 8 月 30 日 | 钠 | 钠 | 钠 | 钠 |
4254 | 291,67 | 2024 年 2 月 7 日 | 南 | 钠 | 4265 | 291,67 |
4128 | 847 | 2024 年 2 月 14 日 | 4128 | 847.0 | 钠 | 钠 |
4248 | 4286,98 | 2024 年 6 月 25 日 | 4261 | 400.0 | 钠 | 钠 |
4248 | 4286,98 | 2024 年 6 月 25 日 | 4262 | 11.0 | 钠 | 南 |
4248 | 4286,98 | 2024 年 6 月 25 日 | 4263 | 1668.51 | 钠 | 南 |
4248 | 4286,98 | 2024 年 6 月 25 日 | 4264 | 1868.4 | 钠 | 南 |
4248 | 4286,98 | 2024 年 6 月 25 日 | 4265 | 20.91 | 钠 | 南 |
4248 | 4286,98 | 2024 年 6 月 25 日 | 4266 | 2.21 | 钠 | 南 |
4248 | 4286,98 | 2024 年 6 月 25 日 | 4267 | 309.62 | 钠 | 南 |
4248 | 4286,98 | 2024 年 6 月 25 日 | 钠 | 钠 | 4265 | 6,19 |
4192 | 6130,22 | 2024 年 4 月 24 日 | 4193 | 9.68 | 钠 | 钠 |
4192 | 6130,22 | 2024 年 4 月 24 日 | 钠 | 钠 | 4222 | 2106 |
4192 | 6130,22 | 2024 年 4 月 24 日 | 钠 | 钠 | 4222 | 4014.54 |
4090 | 1158,98 | 2024 年 1 月 25 日 | 4110 | 16.22 | 钠 | 钠 |
4090 | 1158,98 | 2024 年 1 月 25 日 | 4111 | 84.0 | 钠 | 钠 |
4090 | 1158,98 | 2024 年 1 月 25 日 | 4112 | 41.99 | 钠 | 钠 |
4090 | 1158,98 | 2024 年 1 月 25 日 | 4113 | 9.11 | 钠 | 钠 |
4090 | 1158,98 | 2024 年 1 月 25 日 | 4114 | 10.0 | 钠 | 钠 |
4090 | 1158,98 | 2024 年 1 月 25 日 | 4115 | 997.16 | 钠 | 钠 |
4090 | 1158,98 | 2024 年 1 月 25 日 | 钠 | 钠 | 4231 | 0.32 |
4090 | 1158,98 | 2024 年 1 月 25 日 | 钠 | 钠 | 4231 | 0.18 |
例如,交易 4248 有 7 笔付款和 1 笔分类账账户记账。因此,生成的数据框将有 8 行。交易 4192 有 2 笔付款和 1 笔分类账账户记账,因此生成的数据框应有 3 行。
我知道如何针对一列实现这一点,例如使用以下代码:
df_explode = df_financial_mutations.explode(['payments'])
#Normalize the json column into separate columns
df_normalized = json_normalize(df_explode['payments'])
#Add prefix to the columns that were 'exploded'
df_normalized = df_normalized.add_prefix('payments_')
问题是,我不知道如何对两列执行此操作。如果我ledger_account_bookings
再次调用explode on,结果会变得模糊,因为我已经对付款列进行了explode,因此我的数据框中引入了“重复”行。因此,在对付款进行explode的地方,我现在有两行在列中具有完全相同的值ledger_account_bookings
。当我再次explode时,这次是在另一列上,那些“重复”也会被explode,所以我的数据框现在包含没有意义的数据行。
我该如何解决需要同时分解两列的问题?我已经看到了payments
但不幸的是,和的列表ledger_account_bookings
大小可能不同,也可能是动态的(例如,可能有 0-5 个付款和 0-5 个 ledger_account_bookings,没有固定值)
任何帮助都将不胜感激。
5
最佳答案
2
按元组处理数据的通用解决方案:
#in tuple set original and new columns names prefixes
cols = [('payments', 'payments'),('ledger_account_bookings', 'ledger')]
L = []
for col, prefix in cols:
df_explode = df_financial_mutations.pop(col).explode()
#Normalize the json column into separate columns
df_normalized = pd.json_normalize(df_explode).set_index(df_explode.index)
#Add prefix to the columns that were 'exploded'
#Remove missing values if all NaNs per rows
df_normalized = df_normalized.add_prefix(f'{prefix}_').dropna(how='all')
L.append(df_normalized)
#join original columns to concanecated list of DataFrames
out = df_financial_mutations.join(pd.concat(L)).reset_index(drop=True)
#clean data - replacement missing values by another column
out['payments_id'] = out['payments_id'].fillna(out.pop('payments_payment_id'))
#renaming columns names
out = out.rename(columns={'ledger_ledger_id':'ledger_id'})
print (out)
transaction_id total_amount date payments_amount payments_id \
0 4308 645,83 30-8-2024 NaN NaN
1 4254 291,67 2-7-2024 NaN NaN
2 4128 847 14-2-2024 847.0 4128
3 4248 4286,98 25-6-2024 400.0 4261
4 4248 4286,98 25-6-2024 11.0 4262
5 4248 4286,98 25-6-2024 1668.51 4263
6 4248 4286,98 25-6-2024 1868.54 4264
7 4248 4286,98 25-6-2024 20.91 4265
8 4248 4286,98 25-6-2024 2.21 4266
9 4248 4286,98 25-6-2024 309.62 4267
10 4248 4286,98 25-6-2024 NaN NaN
11 4192 6130,22 24-4-2024 9.68 4193
12 4192 6130,22 24-4-2024 NaN NaN
13 4192 6130,22 24-4-2024 NaN NaN
14 4090 1158,98 25-1-2024 16.22 4110
15 4090 1158,98 25-1-2024 84.0 4111
16 4090 1158,98 25-1-2024 41.99 4112
17 4090 1158,98 25-1-2024 9.11 4113
18 4090 1158,98 25-1-2024 10.0 4114
19 4090 1158,98 25-1-2024 997.16 4115
20 4090 1158,98 25-1-2024 NaN NaN
21 4090 1158,98 25-1-2024 NaN NaN
ledger_id ledger_amount
0 NaN NaN
1 4265 291,67
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 4265 6,19
11 NaN NaN
12 4222 2106.0
13 4222 4014.54
14 NaN NaN
15 NaN NaN
16 NaN NaN
17 NaN NaN
18 NaN NaN
19 NaN NaN
20 4231 -0.32
21 4231 -0.18
我建议分别处理每一列并加入原始数据 – 解决方案分别处理每一列:
#extract column payments by pop and expoding
df_explode = df_financial_mutations.pop('payments').explode()
#Normalize the json column into separate columns
#Rewrite new index by original values from exploded DataFrame
df_normalized = pd.json_normalize(df_explode).set_index(df_explode.index)
#Add prefix to the columns that were 'exploded'
df_normalized = df_normalized.add_prefix('payments_')
#Rewrite missing values from payments_id by payments_payment_id and remove column
df_normalized['payments_id'] = (df_normalized['payments_id']
.fillna(df_normalized.pop('payments_payment_id')))
#Remove missing values if all NaNs per rows
df_normalized = df_normalized.dropna(how='all')
print (df_normalized)
payments_amount payments_id
2 847.0 4128
3 400.0 4261
3 11.0 4262
3 1668.51 4263
3 1868.54 4264
3 20.91 4265
3 2.21 4266
3 309.62 4267
4 9.68 4193
5 16.22 4110
5 84.0 4111
5 41.99 4112
5 9.11 4113
5 10.0 4114
5 997.16 4115
df_explode1 = df_financial_mutations.pop('ledger_account_bookings').explode()
#Normalize the json column into separate columns
#Rewrite new index by original values from exploded DataFrame
df_normalized1 = pd.json_normalize(df_explode1).set_index(df_explode1.index)
#Add prefix to the columns that were 'exploded'
df_normalized1 = df_normalized1.add_prefix('ledger_')
#Remove missing values if all NaNs per rows
df_normalized1 = df_normalized1.dropna(how='all')
print (df_normalized1)
ledger_ledger_id ledger_amount
1 4265 291,67
3 4265 6,19
4 4222 2106.0
4 4222 4014.54
5 4231 -0.32
5 4231 -0.18
out = df_financial_mutations.join(pd.concat([df_normalized, df_normalized1]))
print (out)
transaction_id total_amount date payments_amount payments_id \
0 4308 645,83 30-8-2024 NaN NaN
1 4254 291,67 2-7-2024 NaN NaN
2 4128 847 14-2-2024 847.0 4128
3 4248 4286,98 25-6-2024 400.0 4261
3 4248 4286,98 25-6-2024 11.0 4262
3 4248 4286,98 25-6-2024 1668.51 4263
3 4248 4286,98 25-6-2024 1868.54 4264
3 4248 4286,98 25-6-2024 20.91 4265
3 4248 4286,98 25-6-2024 2.21 4266
3 4248 4286,98 25-6-2024 309.62 4267
3 4248 4286,98 25-6-2024 NaN NaN
4 4192 6130,22 24-4-2024 9.68 4193
4 4192 6130,22 24-4-2024 NaN NaN
4 4192 6130,22 24-4-2024 NaN NaN
5 4090 1158,98 25-1-2024 16.22 4110
5 4090 1158,98 25-1-2024 84.0 4111
5 4090 1158,98 25-1-2024 41.99 4112
5 4090 1158,98 25-1-2024 9.11 4113
5 4090 1158,98 25-1-2024 10.0 4114
5 4090 1158,98 25-1-2024 997.16 4115
5 4090 1158,98 25-1-2024 NaN NaN
5 4090 1158,98 25-1-2024 NaN NaN
ledger_ledger_id ledger_amount
0 NaN NaN
1 4265 291,67
2 NaN NaN
3 NaN NaN
3 NaN NaN
3 NaN NaN
3 NaN NaN
3 NaN NaN
3 NaN NaN
3 NaN NaN
3 4265 6,19
4 NaN NaN
4 4222 2106.0
4 4222 4014.54
5 NaN NaN
5 NaN NaN
5 NaN NaN
5 NaN NaN
5 NaN NaN
5 NaN NaN
5 4231 -0.32
5 4231 -0.18
#Create default index if necessary
out = out.reset_index(drop=True)
5
-
非常感谢花时间回答我的问题。这几乎是完美的。是否也可以在结果数据框中为支付或分类帐帐户预订中的每个条目分配一行?例如,您的解决方案为交易 4248 生成 7 行,其中第一行包含付款和分类帐帐户预订。理想情况下,我会有一个解决方案,其中每个付款/分类帐帐户预订都有一个单独的行,因此结果数据框将有 8 行交易 4248
– -
这很管用 🙂 很棒
– -
你能进一步解释一下这个步骤吗,我不明白你在那里做什么:#用 payment_payment_id 重写 payment_id 中的缺失值并删除列
df_normalized['payments_id'] = (df_normalized['payments_id'].fillna(df_normalized.pop('payments_id')))
为什么我们要在这里使用 fillna 函数?
–
-
@Psychotechnopath – 原因是有两种类型的
id
injson
–payment_id
并且id
只需要最终 DataFrame 中的列。因此解决方案逐payment_id
列替换 NaN 。换句话说 – 原因是列中的数据payments_id
payments_id
payments
–
-
@Psychotechnopath – 我创建了更通用的解决方案并添加到答案中。
–
|
为了便于概括,我提出以下解决方案和示例。该解决方案也适用于两行以上的情况。
import pandas as pd
import numpy as np
def extract_nested(df: pd.DataFrame,
column: str,
prefix: (bool, None, str) = None,
*args, **kwargs):
df_i = df[column].explode(ignore_index=False) # Merge the extracted data back with the non-nested columns
df_i = pd.json_normalize(df_i, *args, **kwargs).set_index(df_i.index) # Normalize the JSON-like structure into tabular format by containing the index
if prefix != False: # Add prefix to the resulting columns if set
df_i = df_i.add_prefix(f'{column}_' if prefix is None else prefix)
return df_i
def vstack_extract_nested(df: pd.DataFrame,
columns: list,
prefix: (bool, None, str, list) = None):
df_ = df[df.columns.difference(columns)] # Select all columns except the ones containing nested data for the later merge
prefix = prefix if isinstance(prefix, list) else [prefix]*len(columns) # get prefix in the right shape
df_list = [df_[~df[columns].astype(bool).any(axis=1)]] # DataFrame rows that don't contain nested data
for i, p_i in zip(columns, prefix):
df_i = extract_nested(df, i, prefix=p_i).dropna(how='all')
df_list.append(pd.merge(df_, df_i, left_index=True, right_index=True, how='right')) # Merge the extracted data back with the non-nested columns
return pd.concat(df_list).sort_index() # concat DataFrame rows that don't contain nested data with the those who does
例子
具有 3 个嵌套列的示例
df = pd.DataFrame({'id': 100+np.arange(5), 'col1': [[]]*5, 'col2': [[]]*5, 'col3': [[]]*5})
df.at[1, 'col1'] = [{'a': 1, 'b': 2}]
df.at[3, 'col1'] = [{'a': 3, 'b': 4}, {'a': 5, 'b': 6}]
df.at[2, 'col2'] = [{'A': 10, 'B': 20}]
df.at[3, 'col2'] = [{'A': 30, 'B': 40}, {'A': 50, 'B': 60}]
df.at[2, 'col3'] = [{'X': 100, 'Y': 200}]
df.at[3, 'col3'] = [{'X': 300, 'Y': 400}, {'X': 500, 'Y': 600}]
vstack_extract_nested(df, ['col1', 'col2', 'col3'], prefix=[None, '','c3_'])
# id col1_a col1_b A B c3_X c3_Y
# 0 100 NaN NaN NaN NaN NaN NaN
# 1 101 1.0 2.0 NaN NaN NaN NaN
# 2 102 NaN NaN 10.0 20.0 NaN NaN
# 2 102 NaN NaN NaN NaN 100.0 200.0
# 3 103 3.0 4.0 NaN NaN NaN NaN
# 3 103 5.0 6.0 NaN NaN NaN NaN
# 3 103 NaN NaN 30.0 40.0 NaN NaN
# 3 103 NaN NaN 50.0 60.0 NaN NaN
# 3 103 NaN NaN NaN NaN 300.0 400.0
# 3 103 NaN NaN NaN NaN 500.0 600.0
# 4 104 NaN NaN NaN NaN NaN NaN
OP 用例示例
df = pd.DataFrame({'transaction_id': [4308, 4254, 4128, 4090], 'payments': [[]]*4, 'ledger_account_bookings': [[]]*4})
df.at[1, 'ledger_account_bookings'] = [{'ledger_id': '4265', 'amount': '291,67'}]
df.at[2, 'payments'] = [{'payment_id': '4128', 'amount': '847.0'}]
df.at[3, 'ledger_account_bookings'] = [{'ledger_id': '4231', 'amount': '-0.32'},
{'ledger_id': '4231', 'amount': '-0.18'}]
df.at[3, 'payments'] = [{'id': '4110','amount': '16.22'},
{'id': '4111', 'amount': '84.0'},
{'id': '4112', 'amount': '41.99'}]
# extract the nested columns
df_exp = vstack_extract_nested(df, ['ledger_account_bookings', 'payments'], prefix=['ledger_', 'payments_'])
# correct naming
df_exp.fillna(df_exp.pop('payments_payment_id'), inplace=True) # correct naming of payments column (payment_id, payments_payment_id)
df_exp.rename(columns={'ledger_ledger_id':'ledger_id'}, inplace=True) # correct naming of ledger column
df_exp
# transaction_id ledger_id ledger_amount payments_amount payments_id
# 0 4308 NaN NaN NaN NaN
# 1 4254 4265 291,67 NaN NaN
# 2 4128 NaN NaN 847.0 NaN
# 3 4090 4231 -0.32 NaN NaN
# 3 4090 4231 -0.18 NaN NaN
# 3 4090 NaN NaN 16.22 4110
# 3 4090 NaN NaN 84.0 4111
# 3 4090 NaN NaN 41.99 4112
2
-
1这种方法对我不起作用 – 给出
KeyError: "None of [Index(['col1', 'col2'], dtype='object')] are in the [columns]"
。也许这条线上出了什么问题?mask_all_empty = ~df[['col1', 'col2']].astype(bool).any(axis=1)
–
-
1好主意。应该是
~df[columns].astype(bool).any(axis=1)
。改了。
–
|
–
–
–
payments
有 4 个字典,并且ledger_account_bookings
有 2 个,那么数据框中的所有组合应该是 8 个最终行还是您期望什么?–
–
|