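I implemented an algorithm in Java 17. Besides the implementation that collects execution statistics, I also wanted to provide a version without statistics to reduce execution time. To my surprise, the version without statistics runs noticeably slower than the version with statistics (8% slower on the test machine).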

I ran many tests and narrowed the problem down to two versions of a single method in MatrixEntry:

Fast version (the first line is line 77 in MatrixEntry.java):

int coverColumn() {
    int updates = 1;
    columnHead.right.left = columnHead.left;
    columnHead.left.right = columnHead.right;
    MatrixEntry<T> i = columnHead.lower;
    while (i != columnHead) {
        MatrixEntry<T> j = i.right;
        while (j != i) {
            updates++;
            j.lower.upper = j.upper;
            j.upper.lower = j.lower;
            j.columnHead.rowCount--;
            j = j.right;
        }
        i = i.lower;
    }
    return updates;
}

Slow version (the first line is line 77 in MatrixEntry.java):

void coverColumn() {
    //int updates = 1;
    columnHead.right.left = columnHead.left;
    columnHead.left.right = columnHead.right;
    MatrixEntry<T> i = columnHead.lower;
    while (i != columnHead) {
        MatrixEntry<T> j = i.right;
        while (j != i) {
            //updates++;
            j.lower.upper = j.upper;
            j.upper.lower = j.lower;
            j.columnHead.rowCount--;
            j = j.right;
        }
        i = i.lower;
    }
    //return updates;
}

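I also checked the bytecode with javap -c MatrixEntry.class. Apart from the expected extra instructions that create the local variable, increment it and return it, there are no differences.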

To give some context for the code: every field access reads an object reference to another instance of MatrixEntry, except for the access to rowCount, which is an int field.
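To make the data structure concrete, here is a minimal, hypothetical reconstruction of MatrixEntry. It is just complete enough to compile and run the fast coverColumn() variant. Only the field names and coverColumn() come from the question. The helpers linkColumnBefore, addToColumn and linkRight are invented for illustration, and the real class likely contains more.

```java
// Hypothetical minimal reconstruction of MatrixEntry. Only the field names and
// coverColumn() come from the question; the helper methods are invented.
final class MatrixEntry<T> {
    // Circular doubly linked neighbours; a fresh node links to itself.
    MatrixEntry<T> left = this, right = this, upper = this, lower = this;
    // Header of the column this entry belongs to (a header points to itself).
    MatrixEntry<T> columnHead = this;
    // Number of entries currently in the column; only used on column headers.
    int rowCount;

    // Hypothetical helper: insert this column header just left of root,
    // i.e. at the end of the circular header list.
    void linkColumnBefore(MatrixEntry<T> root) {
        left = root.left;
        right = root;
        root.left.right = this;
        root.left = this;
    }

    // Hypothetical helper: append a new entry at the bottom of column head.
    static <T> MatrixEntry<T> addToColumn(MatrixEntry<T> head) {
        MatrixEntry<T> e = new MatrixEntry<>();
        e.columnHead = head;
        e.upper = head.upper;
        e.lower = head;
        head.upper.lower = e;
        head.upper = e;
        head.rowCount++;
        return e;
    }

    // Hypothetical helper: link e directly to the right of this entry in a row.
    void linkRight(MatrixEntry<T> e) {
        e.left = this;
        e.right = right;
        right.left = e;
        right = e;
    }

    // The "fast" variant from the question: unlink the column header
    // horizontally, then unlink the remaining entries of every row in this
    // column vertically, counting the pointer updates.
    int coverColumn() {
        int updates = 1;
        columnHead.right.left = columnHead.left;
        columnHead.left.right = columnHead.right;
        MatrixEntry<T> i = columnHead.lower;
        while (i != columnHead) {
            MatrixEntry<T> j = i.right;
            while (j != i) {
                updates++;
                j.lower.upper = j.upper;
                j.upper.lower = j.lower;
                j.columnHead.rowCount--;
                j = j.right;
            }
            i = i.lower;
        }
        return updates;
    }
}
```

Consider two columns A and B and a single row with entries in both. Covering A unlinks A from the header list and removes the row's entry from column B. So coverColumn() returns 2, and B's rowCount drops to 0.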

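I ran the tests on an otherwise idle Linux server with a Xeon E3-1220 v6 (on my desktop the results were less stable). In each run of the application, the Java method is called more than 309,357,294 times and the inner loop executes 100,722,885,573 times. I ran the application 4 times each with and without statistics in the method. The standard deviation between runs was about 4 seconds, while every run with statistics took 22:19 minutes and every run without statistics took 24:08 minutes.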
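The timing figures above translate into a few rough numbers. This sketch assumes that 22:19 and 24:08 are minutes:seconds. It also attributes the whole difference to the 100,722,885,573 inner-loop iterations, which is a simplification.

```java
// Back-of-the-envelope arithmetic on the quoted run times.
// Assumptions: "mm:ss" format, and the whole difference is spread evenly over
// the inner-loop iterations (other work in the application is ignored).
final class RunTimeMath {
    static final long INNER_ITERATIONS = 100_722_885_573L;

    // Converts "mm:ss" to seconds.
    static int toSeconds(String mmss) {
        String[] parts = mmss.split(":");
        return Integer.parseInt(parts[0]) * 60 + Integer.parseInt(parts[1]);
    }

    // Relative slowdown of the slow run compared with the fast run, in percent.
    static double slowdownPercent(int fastSeconds, int slowSeconds) {
        return 100.0 * (slowSeconds - fastSeconds) / fastSeconds;
    }

    // Extra nanoseconds per inner-loop iteration if the difference were spread evenly.
    static double extraNanosPerIteration(int fastSeconds, int slowSeconds) {
        return (slowSeconds - fastSeconds) * 1e9 / INNER_ITERATIONS;
    }

    public static void main(String[] args) {
        int fast = toSeconds("22:19"); // with statistics
        int slow = toSeconds("24:08"); // without statistics
        System.out.printf("fast=%d s, slow=%d s%n", fast, slow);
        System.out.printf("slowdown: %.1f%%%n", slowdownPercent(fast, slow));
        System.out.printf("extra per inner iteration: %.2f ns%n", extraNanosPerIteration(fast, slow));
    }
}
```

This prints fast=1339 s and slow=1448 s, a slowdown of about 8.1%, and roughly 1.08 ns extra per inner iteration. The 109 s difference is far larger than the roughly 4 s standard deviation between runs.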

The JVM is OpenJDK:

openjdk version "17.0.13" 2024-10-15
OpenJDK Runtime Environment Temurin-17.0.13+11 (build 17.0.13+11)
OpenJDK 64-Bit Server VM Temurin-17.0.13+11 (build 17.0.13+11, mixed mode, sharing)

The full code is available as well, although I suspect this context is not necessary.

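I was quite puzzled by the results and tried returning 0 instead of making the method return void. I even created a version in which the update counter is incremented but its value is not stored. This version was the fastest (but I did not run it multiple times).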

Can anyone explain why the code with the integrated counter performs better?


Bytecode of the fast version:

  int coverColumn();
    Code:
       0: iconst_1
       1: istore_1
       2: aload_0
       3: getfield      #26                 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
       6: getfield      #17                 // Field right:Lde/famiru/dlx/MatrixEntry;
       9: aload_0
      10: getfield      #26                 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
      13: getfield      #13                 // Field left:Lde/famiru/dlx/MatrixEntry;
      16: putfield      #13                 // Field left:Lde/famiru/dlx/MatrixEntry;
      19: aload_0
      20: getfield      #26                 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
      23: getfield      #13                 // Field left:Lde/famiru/dlx/MatrixEntry;
      26: aload_0
      27: getfield      #26                 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
      30: getfield      #17                 // Field right:Lde/famiru/dlx/MatrixEntry;
      33: putfield      #17                 // Field right:Lde/famiru/dlx/MatrixEntry;
      36: aload_0
      37: getfield      #26                 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
      40: getfield      #23                 // Field lower:Lde/famiru/dlx/MatrixEntry;
      43: astore_2
      44: aload_2
      45: aload_0
      46: getfield      #26                 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
      49: if_acmpeq     116
      52: aload_2
      53: getfield      #17                 // Field right:Lde/famiru/dlx/MatrixEntry;
      56: astore_3
      57: aload_3
      58: aload_2
      59: if_acmpeq     108
      62: iinc          1, 1
      65: aload_3
      66: getfield      #23                 // Field lower:Lde/famiru/dlx/MatrixEntry;
      69: aload_3
      70: getfield      #20                 // Field upper:Lde/famiru/dlx/MatrixEntry;
      73: putfield      #20                 // Field upper:Lde/famiru/dlx/MatrixEntry;
      76: aload_3
      77: getfield      #20                 // Field upper:Lde/famiru/dlx/MatrixEntry;
      80: aload_3
      81: getfield      #23                 // Field lower:Lde/famiru/dlx/MatrixEntry;
      84: putfield      #23                 // Field lower:Lde/famiru/dlx/MatrixEntry;
      87: aload_3
      88: getfield      #26                 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
      91: dup
      92: getfield      #7                  // Field rowCount:I
      95: iconst_1
      96: isub
      97: putfield      #7                  // Field rowCount:I
     100: aload_3
     101: getfield      #17                 // Field right:Lde/famiru/dlx/MatrixEntry;
     104: astore_3
     105: goto          57
     108: aload_2
     109: getfield      #23                 // Field lower:Lde/famiru/dlx/MatrixEntry;
     112: astore_2
     113: goto          44
     116: iload_1
     117: ireturn

Bytecode of the slower version:

  void coverColumn();
    Code:
       0: aload_0
       1: getfield      #26                 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
       4: getfield      #17                 // Field right:Lde/famiru/dlx/MatrixEntry;
       7: aload_0
       8: getfield      #26                 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
      11: getfield      #13                 // Field left:Lde/famiru/dlx/MatrixEntry;
      14: putfield      #13                 // Field left:Lde/famiru/dlx/MatrixEntry;
      17: aload_0
      18: getfield      #26                 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
      21: getfield      #13                 // Field left:Lde/famiru/dlx/MatrixEntry;
      24: aload_0
      25: getfield      #26                 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
      28: getfield      #17                 // Field right:Lde/famiru/dlx/MatrixEntry;
      31: putfield      #17                 // Field right:Lde/famiru/dlx/MatrixEntry;
      34: aload_0
      35: getfield      #26                 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
      38: getfield      #23                 // Field lower:Lde/famiru/dlx/MatrixEntry;
      41: astore_1
      42: aload_1
      43: aload_0
      44: getfield      #26                 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
      47: if_acmpeq     111
      50: aload_1
      51: getfield      #17                 // Field right:Lde/famiru/dlx/MatrixEntry;
      54: astore_2
      55: aload_2
      56: aload_1
      57: if_acmpeq     103
      60: aload_2
      61: getfield      #23                 // Field lower:Lde/famiru/dlx/MatrixEntry;
      64: aload_2
      65: getfield      #20                 // Field upper:Lde/famiru/dlx/MatrixEntry;
      68: putfield      #20                 // Field upper:Lde/famiru/dlx/MatrixEntry;
      71: aload_2
      72: getfield      #20                 // Field upper:Lde/famiru/dlx/MatrixEntry;
      75: aload_2
      76: getfield      #23                 // Field lower:Lde/famiru/dlx/MatrixEntry;
      79: putfield      #23                 // Field lower:Lde/famiru/dlx/MatrixEntry;
      82: aload_2
      83: getfield      #26                 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
      86: dup
      87: getfield      #7                  // Field rowCount:I
      90: iconst_1
      91: isub
      92: putfield      #7                  // Field rowCount:I
      95: aload_2
      96: getfield      #17                 // Field right:Lde/famiru/dlx/MatrixEntry;
      99: astore_2
     100: goto          55
     103: aload_1
     104: getfield      #23                 // Field lower:Lde/famiru/dlx/MatrixEntry;
     107: astore_1
     108: goto          42
     111: return

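As suggested in the comments, I had the JVM print the assembly code after C2 compilation. I used the hsdis plugin and the following JVM command-line flags: -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -XX:PrintAssemblyOptions=intel -XX:CompileCommand=print,*MatrixEntry.coverColumn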

I tried to extract only the inner-loop part of the assembly, because the full listing is quite long.

Fast code, including the counter:

0x00007faf08f46705:   mov    r11,QWORD PTR [r15+0x350]
0x00007faf08f4670c:   mov    r8,QWORD PTR [rsp]
0x00007faf08f46710:   mov    r9d,DWORD PTR [r8+0x14]      ;*getfield columnHead {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@46 (line 82)
0x00007faf08f46714:   mov    r13d,DWORD PTR [r13+0x24]    ; ImmutableOopMap {r9=NarrowOop r13=NarrowOop [0]=Oop }
                                                          ;*goto {reexecute=1 rethrow=0 return_oop=0}
                                                          ; - (reexecute) de.famiru.dlx.MatrixEntry::coverColumn@113 (line 92)
0x00007faf08f46718:   test   DWORD PTR [r11],eax          ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@113 (line 92)
                                                          ;   {poll}
0x00007faf08f4671b:   cmp    r13d,r9d
0x00007faf08f4671e:   je     0x00007faf08f46806           ;*aload_2 {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@52 (line 83)
0x00007faf08f46724:   mov    ebp,DWORD PTR [r13+0x1c]     ; implicit exception: dispatches to 0x00007faf08f46b04
                                                          ;*getfield right {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@53 (line 83)
0x00007faf08f46728:   cmp    ebp,r13d
0x00007faf08f4672b:   je     0x00007faf08f46705           ;*if_acmpeq {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@59 (line 84)
0x00007faf08f4672d:   mov    r11,r13                      ;*aload_2 {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@52 (line 83)
0x00007faf08f46730:   mov    QWORD PTR [rsp+0x8],r11
0x00007faf08f46735:   data16 data16 nop WORD PTR [rax+rax*1+0x0]
                                                          ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@62 (line 85)
0x00007faf08f46740:   mov    ebx,DWORD PTR [rbp+0x20]     ; implicit exception: dispatches to 0x00007faf08f46aee
                                                          ;*getfield upper {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@70 (line 86)
0x00007faf08f46743:   mov    r14d,DWORD PTR [rbp+0x24]    ;*getfield lower {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@66 (line 86)
0x00007faf08f46747:   test   r14d,r14d
0x00007faf08f4674a:   je     0x00007faf08f46a2c           ;*putfield left {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@16 (line 79)
0x00007faf08f46750:   cmp    BYTE PTR [r15+0x38],0x0
0x00007faf08f46755:   jne    0x00007faf08f4681c
0x00007faf08f4675b:   mov    DWORD PTR [r14+0x20],ebx
0x00007faf08f4675f:   mov    r11,r14
0x00007faf08f46762:   mov    r8,rbx
0x00007faf08f46765:   xor    r8,r11
0x00007faf08f46768:   shr    r8,0x14
0x00007faf08f4676c:   test   r8,r8
0x00007faf08f4676f:   je     0x00007faf08f4678f
0x00007faf08f46771:   test   ebx,ebx
0x00007faf08f46773:   je     0x00007faf08f4678f
0x00007faf08f46775:   shr    r11,0x9
0x00007faf08f46779:   movabs rdi,0x7faf1bba1000
0x00007faf08f46783:   add    rdi,r11
0x00007faf08f46786:   cmp    BYTE PTR [rdi],0x4
0x00007faf08f46789:   jne    0x00007faf08f46882           ;*putfield upper {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@73 (line 86)
0x00007faf08f4678f:   mov    ebx,DWORD PTR [rbp+0x20]     ;*getfield upper {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@77 (line 87)
0x00007faf08f46792:   test   ebx,ebx
0x00007faf08f46794:   je     0x00007faf08f46a38           ;*putfield left {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@16 (line 79)
0x00007faf08f4679a:   cmp    BYTE PTR [r15+0x38],0x0
0x00007faf08f4679f:   jne    0x00007faf08f4684f
0x00007faf08f467a5:   mov    DWORD PTR [rbx+0x24],r14d
0x00007faf08f467a9:   mov    r11,r14
0x00007faf08f467ac:   mov    r8,rbx
0x00007faf08f467af:   xor    r11,r8
0x00007faf08f467b2:   shr    r11,0x14
0x00007faf08f467b6:   test   r11,r11
0x00007faf08f467b9:   je     0x00007faf08f467da
0x00007faf08f467bb:   test   r14d,r14d
0x00007faf08f467be:   je     0x00007faf08f467da
0x00007faf08f467c0:   shr    r8,0x9
0x00007faf08f467c4:   movabs rdi,0x7faf1bba1000
0x00007faf08f467ce:   add    rdi,r8
0x00007faf08f467d1:   cmp    BYTE PTR [rdi],0x4
0x00007faf08f467d4:   jne    0x00007faf08f468cb           ;*putfield lower {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@84 (line 87)
0x00007faf08f467da:   mov    r11d,DWORD PTR [rbp+0x14]    ;*getfield columnHead {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@88 (line 88)
0x00007faf08f467de:   dec    DWORD PTR [r11+0xc]          ; implicit exception: dispatches to 0x00007faf08f46af8
                                                          ;*putfield left {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@16 (line 79)
0x00007faf08f467e2:   mov    r11,QWORD PTR [r15+0x350]
0x00007faf08f467e9:   mov    ebp,DWORD PTR [rbp+0x1c]     ;*getfield right {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@101 (line 89)
0x00007faf08f467ec:   inc    r10d                         ; ImmutableOopMap {rbp=NarrowOop r13=NarrowOop [0]=Oop [8]=Oop }
                                                          ;*goto {reexecute=1 rethrow=0 return_oop=0}
                                                          ; - (reexecute) de.famiru.dlx.MatrixEntry::coverColumn@105 (line 89)
0x00007faf08f467ef:   test   DWORD PTR [r11],eax          ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@105 (line 89)
                                                          ;   {poll}
0x00007faf08f467f2:   cmp    ebp,r13d
0x00007faf08f467f5:   je     0x00007faf08f46705           ;*if_acmpeq {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@59 (line 84)

Slow code, returning void:

0x00007f0628f5c10c:   mov    r10,QWORD PTR [r15+0x350]
0x00007f0628f5c113:   mov    r11,QWORD PTR [rsp]
0x00007f0628f5c117:   mov    r11d,DWORD PTR [r11+0x14]    ;*getfield columnHead {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@44 (line 82)
0x00007f0628f5c11b:   mov    r13d,DWORD PTR [r13+0x24]    ; ImmutableOopMap {r11=NarrowOop r13=NarrowOop [0]=Oop }
                                                          ;*goto {reexecute=1 rethrow=0 return_oop=0}
                                                          ; - (reexecute) de.famiru.dlx.MatrixEntry::coverColumn@108 (line 92)
0x00007f0628f5c11f:   test   DWORD PTR [r10],eax          ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@108 (line 92)
                                                          ;   {poll}
0x00007f0628f5c122:   cmp    r13d,r11d
0x00007f0628f5c125:   je     0x00007f0628f5c0f9           ;*aload_1 {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@50 (line 83)
0x00007f0628f5c127:   mov    ebp,DWORD PTR [r13+0x1c]     ; implicit exception: dispatches to 0x00007f0628f5c49c
                                                          ;*getfield right {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@51 (line 83)
0x00007f0628f5c12b:   cmp    ebp,r13d
0x00007f0628f5c12e:   je     0x00007f0628f5c10c           ;*if_acmpeq {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@57 (line 84)
0x00007f0628f5c130:   mov    r10,r13                      ;*aload_1 {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@50 (line 83)
0x00007f0628f5c133:   mov    QWORD PTR [rsp+0x8],r10
0x00007f0628f5c138:   nop    DWORD PTR [rax+rax*1+0x0]    ;*aload_2 {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@60 (line 86)
0x00007f0628f5c140:   mov    ebx,DWORD PTR [rbp+0x24]     ; implicit exception: dispatches to 0x00007f0628f5c486
                                                          ;*getfield lower {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@61 (line 86)
0x00007f0628f5c143:   mov    r14d,DWORD PTR [rbp+0x20]    ;*getfield upper {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@65 (line 86)
0x00007f0628f5c147:   test   ebx,ebx
0x00007f0628f5c149:   je     0x00007f0628f5c400           ;*putfield left {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@14 (line 79)
0x00007f0628f5c14f:   cmp    BYTE PTR [r15+0x38],0x0
0x00007f0628f5c154:   jne    0x00007f0628f5c1fe
0x00007f0628f5c15a:   mov    DWORD PTR [rbx+0x20],r14d
0x00007f0628f5c15e:   mov    r10,rbx
0x00007f0628f5c161:   mov    r11,r14
0x00007f0628f5c164:   xor    r11,r10
0x00007f0628f5c167:   shr    r11,0x14
0x00007f0628f5c16b:   test   r11,r11
0x00007f0628f5c16e:   je     0x00007f0628f5c18f
0x00007f0628f5c170:   test   r14d,r14d
0x00007f0628f5c173:   je     0x00007f0628f5c18f
0x00007f0628f5c175:   shr    r10,0x9
0x00007f0628f5c179:   movabs rdi,0x7f063c735000
0x00007f0628f5c183:   add    rdi,r10
0x00007f0628f5c186:   cmp    BYTE PTR [rdi],0x4
0x00007f0628f5c189:   jne    0x00007f0628f5c264           ;*putfield upper {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@68 (line 86)
0x00007f0628f5c18f:   mov    r14d,DWORD PTR [rbp+0x20]    ;*getfield upper {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@72 (line 87)
0x00007f0628f5c193:   test   r14d,r14d
0x00007f0628f5c196:   je     0x00007f0628f5c410           ;*putfield left {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@14 (line 79)
0x00007f0628f5c19c:   cmp    BYTE PTR [r15+0x38],0x0
0x00007f0628f5c1a1:   jne    0x00007f0628f5c231
0x00007f0628f5c1a7:   mov    DWORD PTR [r14+0x24],ebx
0x00007f0628f5c1ab:   mov    r10,rbx
0x00007f0628f5c1ae:   mov    r11,r14
0x00007f0628f5c1b1:   xor    r10,r11
0x00007f0628f5c1b4:   shr    r10,0x14
0x00007f0628f5c1b8:   test   r10,r10
0x00007f0628f5c1bb:   je     0x00007f0628f5c1db
0x00007f0628f5c1bd:   test   ebx,ebx
0x00007f0628f5c1bf:   je     0x00007f0628f5c1db
0x00007f0628f5c1c1:   shr    r11,0x9
0x00007f0628f5c1c5:   movabs rdi,0x7f063c735000
0x00007f0628f5c1cf:   add    rdi,r11
0x00007f0628f5c1d2:   cmp    BYTE PTR [rdi],0x4
0x00007f0628f5c1d5:   jne    0x00007f0628f5c2a6           ;*putfield lower {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@79 (line 87)
0x00007f0628f5c1db:   mov    r10d,DWORD PTR [rbp+0x14]    ;*getfield columnHead {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@83 (line 88)
0x00007f0628f5c1df:   dec    DWORD PTR [r10+0xc]          ; implicit exception: dispatches to 0x00007f0628f5c490
                                                          ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@100 (line 89)
0x00007f0628f5c1e3:   mov    ebp,DWORD PTR [rbp+0x1c]     ;*putfield left {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@14 (line 79)
0x00007f0628f5c1e6:   mov    r10,QWORD PTR [r15+0x350]    ; ImmutableOopMap {rbp=NarrowOop r13=NarrowOop [0]=Oop [8]=Oop }
                                                          ;*goto {reexecute=1 rethrow=0 return_oop=0}
                                                          ; - (reexecute) de.famiru.dlx.MatrixEntry::coverColumn@100 (line 89)
0x00007f0628f5c1ed:   test   DWORD PTR [r10],eax          ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@100 (line 89)
                                                          ;   {poll}
0x00007f0628f5c1f0:   cmp    ebp,r13d
0x00007f0628f5c1f3:   je     0x00007f0628f5c10c           ;*if_acmpeq {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - de.famiru.dlx.MatrixEntry::coverColumn@57 (line 84)
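One comment points to code alignment and the Intel JCC erratum. On Skylake-derived cores (the Xeon E3-1220 v6 is a Kaby Lake part), the microcode mitigation keeps jumps that cross or end on a 32-byte boundary out of the decoded uop cache. The sketch below applies that rule to the jump addresses in the two listings above. Instruction lengths are read off the listed addresses. Whether the mitigation is actually active on the test machine is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Rough check of the JCC erratum condition on the two C2 listings above.
// A jump, or a macro-fused cmp/test + jcc pair, is affected if it crosses a
// 32-byte boundary or ends exactly on one. Each entry is {start, end}: start is
// the first byte of the jcc (or of the fusing cmp/test), end is the address of
// the next instruction; only the low 12 address bits are kept.
// Assumptions: a reg-reg cmp/test directly before a jcc is treated as fused,
// cmp with a memory operand and an immediate is not, and the last jcc of each
// listing is taken to be 6 bytes (rel32) since its target is >128 bytes back.
final class JccBoundaryCheck {
    static final int[][] FAST = {
        {0x71b, 0x724}, {0x728, 0x72d}, {0x747, 0x750}, {0x755, 0x75b},
        {0x76c, 0x771}, {0x771, 0x775}, {0x789, 0x78f}, {0x792, 0x79a},
        {0x79f, 0x7a5}, {0x7b6, 0x7bb}, {0x7bb, 0x7c0}, {0x7d4, 0x7da},
        {0x7f2, 0x7fb}};
    static final int[][] SLOW = {
        {0x122, 0x127}, {0x12b, 0x130}, {0x147, 0x14f}, {0x154, 0x15a},
        {0x16b, 0x170}, {0x170, 0x175}, {0x189, 0x18f}, {0x193, 0x19c},
        {0x1a1, 0x1a7}, {0x1b8, 0x1bd}, {0x1bd, 0x1c1}, {0x1d5, 0x1db},
        {0x1f0, 0x1f9}};

    // First byte and end address in different 32-byte blocks means the
    // instruction crosses a boundary or ends exactly on one.
    static boolean affected(int start, int end) {
        return (start >>> 5) != (end >>> 5);
    }

    // Returns the start addresses (hex) of all affected jumps.
    static List<String> affectedJumps(int[][] jumps) {
        List<String> result = new ArrayList<>();
        for (int[] jump : jumps) {
            if (affected(jump[0], jump[1])) {
                result.add(String.format("0x%x", jump[0]));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("fast: " + affectedJumps(FAST)); // [0x71b, 0x79f, 0x7bb]
        System.out.println("slow: " + affectedJumps(SLOW)); // [0x1bd]
    }
}
```

By this rough check, the fast listing has three affected jumps (one of them, at 0x71b, in the outer loop) and the slow listing has one. So the erratum alone does not obviously favor the fast version. The check ignores which paths are actually hot and how the rest of each 32-byte block is decoded.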

Output of perf stat -d -d -d bin/app for the fast version:

         944020.43 msec task-clock                       #    1.002 CPUs utilized                                                                                                                                  
             34131      context-switches                 #   36.155 /sec                                                                                                                                           
              3772      cpu-migrations                   #    3.996 /sec                                                                                                                                           
             26375      page-faults                      #   27.939 /sec                                                                                                                                           
     3073656028927      cycles                           #    3.256 GHz                      (61.54%)                                                                                                              
     8314592091084      instructions                     #    2.71  insn per cycle           (69.23%)                                                                                                              
     1617667532435      branches                         #    1.714 G/sec                    (69.23%)                                                                                                              
        7376429842      branch-misses                    #    0.46% of all branches          (69.23%)                                                                                                              
     2518602003849      L1-dcache-loads                  #    2.668 G/sec                    (69.23%)                                                                                                              
      217124852119      L1-dcache-load-misses            #    8.62% of all L1-dcache accesses  (69.23%)                                                                                                            
        8360116079      LLC-loads                        #    8.856 M/sec                    (69.23%)                                                                                                              
          36950592      LLC-load-misses                  #    0.44% of all LL-cache accesses  (69.24%)                                                                                                             
   <not supported>      L1-icache-loads                                                                                                                                                                            
         613369051      L1-icache-load-misses                                                (69.23%)                                                                                                              
     2518014639585      dTLB-loads                       #    2.667 G/sec                    (69.24%)                                                                                                              
           2964075      dTLB-load-misses                 #    0.00% of all dTLB cache accesses  (61.54%)
           4335821      iTLB-loads                       #    4.593 K/sec                    (61.54%)    
           1578713      iTLB-load-misses                 #   36.41% of all iTLB cache accesses  (61.54%)

For the slow version:

        1011635.24 msec task-clock                       #    1.001 CPUs utilized          
             37371      context-switches                 #   36.941 /sec                   
              1949      cpu-migrations                   #    1.927 /sec                   
             26087      page-faults                      #   25.787 /sec                   
     3277438029589      cycles                           #    3.240 GHz                      (61.54%)
     9156368782029      instructions                     #    2.79  insn per cycle           (69.23%)
     1614775762524      branches                         #    1.596 G/sec                    (69.23%)
        7360324323      branch-misses                    #    0.46% of all branches          (69.23%)
     2914029260930      L1-dcache-loads                  #    2.881 G/sec                    (69.23%)
      216220083510      L1-dcache-load-misses            #    7.42% of all L1-dcache accesses  (69.23%)
        8079066223      LLC-loads                        #    7.986 M/sec                    (69.23%)
          47018889      LLC-load-misses                  #    0.58% of all LL-cache accesses  (69.23%)
   <not supported>      L1-icache-loads                                             
         650941335      L1-icache-load-misses                                                (69.23%)
     2913664329244      dTLB-loads                       #    2.880 G/sec                    (69.23%)
           2913507      dTLB-load-misses                 #    0.00% of all dTLB cache accesses  (61.54%)
           4294995      iTLB-loads                       #    4.246 K/sec                    (61.54%)
           1796205      iTLB-load-misses                 #   41.82% of all iTLB cache accesses  (61.54%)

The branch-miss rate is only 0.46% in both cases, but perf stat highlights the iTLB-load-misses, at 36% and 42% respectively.
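The perf stat totals can be normalized per inner-loop iteration, which makes the differences easier to compare. This assumes each perf run executed the same 100,722,885,573 inner iterations as the timed runs. Treat the result as rough for three reasons:

- The task-clock totals (about 944 s and 1012 s) do not match the wall-clock times quoted earlier.
- The counters were multiplexed.
- The counters cover the whole process.

```java
// Per-inner-iteration differences (slow minus fast) between the two perf stat runs.
// Assumption: both perf runs executed 100,722,885,573 inner-loop iterations;
// counters are multiplexed and process-wide, so this is only a rough normalization.
final class PerfDelta {
    static final long INNER_ITERATIONS = 100_722_885_573L;

    // Difference between the two counter totals, divided by the iteration count.
    static double perIteration(long fast, long slow) {
        return (double) (slow - fast) / INNER_ITERATIONS;
    }

    public static void main(String[] args) {
        System.out.printf("cycles:          %+.2f%n",
                perIteration(3_073_656_028_927L, 3_277_438_029_589L));
        System.out.printf("instructions:    %+.2f%n",
                perIteration(8_314_592_091_084L, 9_156_368_782_029L));
        System.out.printf("L1-dcache-loads: %+.2f%n",
                perIteration(2_518_602_003_849L, 2_914_029_260_930L));
    }
}
```

Taken at face value, the slow run uses about 2.0 more cycles, 8.4 more instructions and 3.9 more L1-dcache loads per inner iteration. Yet its extracted inner loop has one instruction fewer (no inc r10d). If those numbers hold, the extra work would have to come from code outside the extracted loop, or from the caveats listed above.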


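  • Code alignment can affect performance. For example, misaligned jumps or loop bodies can slow code down. On Intel CPUs, the JCC erratum (and its microcode mitigation) also needs to be considered. Can you extract the compiled code (after JIT compilation) to get this information?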

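  • The counter seems to compile to just an extra inc r10d inside the loop. That should easily run in parallel with the other work the CPU is doing, so I certainly wouldn't expect it to slow down a memory-bound loop. There is also no obvious direct reason for a speedup, so it is either a side effect of code alignment or some quirk of uop scheduling and out-of-order execution resources. I don't think there is a simple explanation. And I wouldn't expect to see the same effect on different uarches (like Zen), maybe not even on Ice Lake or Alder Lake.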

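  • data16 data16 nop WORD PTR [rax+rax*1+0x0] is just a long NOP with three 66 prefixes (two of them redundant). felixcloutier.com/x86/nop documents 0F 1F /0 as taking a ModR/M byte, which lets you use any addressing mode you like. Here it uses SIB + disp32 to make an 11-byte instruction (0x40 - 0x35 = 11): prefixes (3) + opcode (2) + modrm (1) + SIB (1) + disp32 (4). It is used to align the branch target that follows it. GNU assembler's .p2align emits the same thing when 11 bytes of padding are needed; using at most 3 prefixes follows Intel's guidelines.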

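  • I did some measurements on an AMD Ryzen 5 5500. On this machine, the version without statistics is even 12% slower! I did 6 runs of each version. I'll see whether I can get access to a newer Intel machine.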
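The padding arithmetic from the NOP comment above can be reproduced in a few lines. The sketch assumes C2 aligns the inner loop to 16 bytes. HotSpot controls this with the OptoLoopAlignment option, but treating 16 as the x86-64 value is an assumption, since the listings do not show it. The sketch uses the long-NOP layout described in that comment.

```java
// Reproduces the alignment padding seen before the inner loop in both listings.
// Assumption: a 16-byte loop alignment; the NOP layout follows the comment above.
final class LoopPadding {
    // Bytes needed to move `address` up to the next multiple of `alignment`
    // (0 if it is already aligned).
    static int padding(long address, int alignment) {
        return (int) ((alignment - address % alignment) % alignment);
    }

    // Length of a 0F 1F /0 long NOP with a SIB byte and disp32:
    // prefixes + opcode (2) + ModR/M (1) + SIB (1) + disp32 (4).
    static int longNopLength(int prefixBytes) {
        return prefixBytes + 2 + 1 + 1 + 4;
    }

    public static void main(String[] args) {
        // Fast listing: data16 data16 nop WORD PTR [...] at ...735, loop starts at ...740.
        System.out.println(padding(0x7faf08f46735L, 16) + " == " + longNopLength(3));
        // Slow listing: nop DWORD PTR [...] at ...138, loop starts at ...140.
        System.out.println(padding(0x7f0628f5c138L, 16) + " == " + longNopLength(0));
    }
}
```

This prints "11 == 11" and "8 == 8". Both inner loops therefore start at an address ending in 0x40, which is also 64-byte aligned. Any alignment difference between the versions would have to come from where the later instructions and jumps fall.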
