Question

I want to sort the following DataFrame first by column b and then by column a.

a	b
0	1.2
2	0.07076863960397785
1	0.07076863960397783
4	0.02
3	0.07076863960397784

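The floats in column b should be compared with math.isclose(). I therefore wrote a custom comparison function and wrapped it with cmp_to_key from functools. However, sorting the DataFrame this way raises an error.

Here is my complete code: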

import pandas as pd
from functools import cmp_to_key
from math import isclose
import numpy as np

my_list = [
[0, 1.2],
[2, 0.07076863960397785],
[1, 0.07076863960397783],
[4, 0.02],
[3, 0.07076863960397784]
]

df = pd.DataFrame(my_list,columns=['a','b'])

def compare(a,b):
  if isclose(a,b):
    return 0
  elif a-b<0:
    return -1
  else:
    return 1

df.sort_values(by=['b','a'],key= cmp_to_key(compare))

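I understand that the key argument of sort_values receives a whole Series, so the key function has to be vectorized. But I don't know how to achieve that.

The final result should look like this:

a	b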
4	0.02
1	0.07076863960397783
2	0.07076863960397785
3	0.07076863960397784
0	1.2

Answer 1

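To find runs of values that are close to one another without truncating them, you can presort the array, take the diff, and apply a threshold. This has a compounding side effect when several adjacent floats each differ by just under the threshold.

To illustrate, how many groups should the array below contain?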

threshold = 0.05
array = [1.0, 1.02, 1.04, 1.08, 1.5, 1.7]

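Clearly 1.5 and 1.7 are outside our threshold, so each belongs to its own sort group, and 1.0, 1.02, 1.04 should be grouped together because their differences are all below the threshold. But where does 1.08 belong? The adjacent-difference search puts it in the same group as 1.0, 1.02, and 1.04 (it is only 0.04 away from 1.04), which may be counterintuitive because the difference between 1.0 and 1.08 exceeds the threshold.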
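A minimal sketch of that chaining effect, using the same diff-and-cumsum idea on the example array (the variable names here are illustrative, not part of the answer):

```python
import pandas as pd

threshold = 0.05
values = pd.Series([1.0, 1.02, 1.04, 1.08, 1.5, 1.7])

# A new group starts wherever the gap to the previous (sorted) value
# exceeds the threshold; cumsum turns those breaks into group labels.
groups = values.sort_values().diff().gt(threshold).cumsum()
print(groups.tolist())  # [0, 0, 0, 0, 1, 2]

# 1.08 lands in group 0 even though it is more than `threshold` away from 1.0.
first_to_last_gap = values[3] - values[0]
print(first_to_last_gap > threshold)  # True
```

So the groups are chains of neighbours, not sets of values that are all within the threshold of each other.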

If that constraint is acceptable, the following code works:

import pandas as pd
import numpy as np

my_list = [
    [0, 1.2],
    [2, 0.07076863960397785],
    [1, 0.07076863960397783],
    [4, 0.02],
    [3, 0.07076863960397784]
]

df = pd.DataFrame(my_list,columns=['a','b'])

threshold = 1e-6 # neighbouring floats that differ by less than threshold are grouped together
grouped_floats = df.apply(lambda s: s.sort_values().diff().gt(threshold).cumsum())
print(
    df.reindex(grouped_floats.sort_values(by=['b', 'a']).index)
)
#    a         b
# 3  4  0.020000
# 2  1  0.070769
# 1  2  0.070769
# 4  3  0.070769
# 0  0  1.200000
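Since the question asks for a key function, here is a sketch of the same grouping passed through the key argument of sort_values. The helper name close_groups_key and the THRESHOLD constant are assumptions made for illustration; the sketch relies on sort_values calling the key once per column in `by` with a Series named after that column, and it inherits the chaining caveat described above.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1.2], [2, 0.07076863960397785], [1, 0.07076863960397783],
     [4, 0.02], [3, 0.07076863960397784]],
    columns=["a", "b"],
)

THRESHOLD = 1e-6  # assumed tolerance, same value as in the answer above


def close_groups_key(col: pd.Series) -> pd.Series:
    """Hypothetical helper: map column 'b' to group labels, pass others through."""
    if col.name != "b":
        return col
    # Label runs of neighbouring values, then restore the original row order.
    labels = col.sort_values().diff().gt(THRESHOLD).cumsum()
    return labels.reindex(col.index)


result = df.sort_values(by=["b", "a"], key=close_groups_key)
print(result["a"].tolist())  # [4, 1, 2, 3, 0]
```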

Answer 2

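Edit: see the comments by @mozway and @Eric Postpischil for caveats that apply to both solutions.

You can round the values in column 'b' on a copy of the DataFrame, sort the copy, and then reindex the original DataFrame with the copy's index: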

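# Work on a copy so the original values in 'b' stay unrounded.
# (Named sorted_df to avoid shadowing the built-in sorted().)
sorted_df = df.copy()
sorted_df["b"] = sorted_df["b"].round(2)
sorted_df = sorted_df.sort_values(["b", "a"])

df = df.reindex(sorted_df.index)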

   a         b
3  4  0.020000
2  1  0.070769
1  2  0.070769
4  3  0.070769
0  0  1.200000

If you want to use a key function with sort_values, it should return a pd.Series:

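def custom_key(series: pd.Series) -> pd.Series:
    values = series.to_numpy()
    # Group label per row; 0 means "not assigned yet".
    groups = np.zeros(len(values), dtype=int)
    current_group = 0

    # Visit rows in ascending order of value (positional indices).
    for i in np.argsort(values):
        if groups[i] == 0:
            current_group += 1
            # Label this row and every row close to it (includes row i itself).
            groups[np.isclose(values[i], values)] = current_group

    # Return a Series aligned with the input, as sort_values expects.
    return pd.Series(groups, index=series.index)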


df = df.sort_values(by=["b", "a"], key=custom_key)

   a         b
3  4  0.020000
2  1  0.070769
1  2  0.070769
4  3  0.070769
0  0  1.200000

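Comments:

Yes, I'm aware of that. I'm not sure rounding fully solves the problem, though. Alternatively, I could truncate the floats (multiply by some number, say 1 million, then convert to integer). But isn't it possible with a key function?
@Yannick It is possible, but I don't think it needs to be that complicated. See my updated answer.
OK, thanks! It does look rather complicated. I'm just surprised that sorting floats correctly is this hard :-).
I would be careful with using a comparison. With values a, b, c you can have a == b (close) and b == c, but c > a. This can break the sorting algorithm and lead to unexpected results. I also think rounding may be a better approach (though not without flaws).
Rounding is not a function for detecting values that are close to each other. It brings some close values together but pushes other close values apart. For example, rounding to two decimal places brings .7999 and .8001 together but separates .7949 and .7951.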
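Both caveats from the comments can be checked directly. This is a small sketch; the values 1.0, 1.04, 1.08 and the absolute tolerance of 0.05 are chosen only for illustration.

```python
from math import isclose

a, b, c = 1.0, 1.04, 1.08
tol = 0.05  # assumed absolute tolerance for this illustration

# Closeness is not transitive: a ~ b and b ~ c, yet a and c are not close.
ab = isclose(a, b, rel_tol=0, abs_tol=tol)
bc = isclose(b, c, rel_tol=0, abs_tol=tol)
ac = isclose(a, c, rel_tol=0, abs_tol=tol)
print(ab, bc, ac)  # True True False

# Rounding merges one nearby pair and splits another that is just as close.
merged = (round(0.7999, 2), round(0.8001, 2))
split = (round(0.7949, 2), round(0.7951, 2))
print(merged)  # (0.8, 0.8)
print(split)   # (0.79, 0.8)
```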

Answer 3

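Maybe try groupby together with apply and pd.Series.sort_values to sort within each group.

One solution uses groupby to group column b, where a new group starts whenever the gap reaches the threshold of 0.05. You can use cumsum to label the groups of close float values.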

threshold = 0.05

df = df.sort_values('b').reset_index(drop=True)
groups = df['b'].diff().ge(threshold).cumsum()

df.iloc[df.groupby(groups)['a'].apply(pd.Series.sort_values).reset_index(level=0).index]

A workaround is to round the floats. Here I use round(6) to round to six decimal places.

Code:

df['b'] = df['b'].round(6)

df.groupby('b')['a'].apply(pd.Series.sort_values).reset_index(level=0)[['a', 'b']]

Output:

   a         b
3  4  0.020000
2  1  0.070769
1  2  0.070769
4  3  0.070769
0  0  1.200000

Answer 4

You can use sort_values() together with np.argsort():

import pandas as pd
import numpy as np
from math import isclose
from functools import cmp_to_key


compare = lambda a, b: 0 if isclose(a, b) else (-1 if a < b else 1)


def _sort(df):
    sorted_indices = sorted(range(len(df)), key=cmp_to_key(lambda i, j: compare(df.at[i, 'b'], df.at[j, 'b'])))
    df = df.iloc[sorted_indices].reset_index(drop=True)
    df = df.sort_values(by=['b', 'a'], key=lambda col: col if col.name == 'a' else np.argsort(col))
    return df


my_list = [
    [0, 1.2],
    [2, 0.07076863960397785],
    [1, 0.07076863960397783],
    [4, 0.02],
    [3, 0.07076863960397784]
]

df = pd.DataFrame(my_list, columns=['a', 'b'])

print(_sort(df))

This prints:

   a         b
0  4  0.020000
3  3  0.070769
1  2  0.070769
2  1  0.070769
4  0  1.200000

Comment: the order of column a should be 4, 1, 2, 3, 0, not 4, 3, 2, 1, 0.
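One possible explanation for the unexpected order (an observation, not something stated in the answer): np.argsort returns the positions that would sort the column, not a per-row sort key. Those positions are a permutation with no ties, so column a never gets a chance to break ties. The sketch below, with made-up values, shows how argsort positions differ from ranks.

```python
import numpy as np

values = np.array([0.3, 0.1, 0.2])

# Positions that would sort the array: values[positions] is ascending.
positions = np.argsort(values)
# Rank of each element in sorted order (argsort of the argsort).
ranks = np.argsort(positions)

print(positions.tolist())  # [1, 2, 0]
print(ranks.tolist())      # [2, 0, 1]
```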

python – Sort pandas dataframe float column with custom key – Stack Overflow
