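I have implemented an algorithm in Java 17. Besides the implementation that collects execution statistics, I also wanted to offer a version without statistics to reduce execution time. To my surprise, the version without statistics runs noticeably slower than the one with statistics (8% slower on the test machine).
I ran many tests and reduced the problem to two different versions of a single method in MatrixEntry.
Fast version (its first line is line 77 of MatrixEntry.java):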
int coverColumn() {
    int updates = 1;
    columnHead.right.left = columnHead.left;
    columnHead.left.right = columnHead.right;
    MatrixEntry<T> i = columnHead.lower;
    while (i != columnHead) {
        MatrixEntry<T> j = i.right;
        while (j != i) {
            updates++;
            j.lower.upper = j.upper;
            j.upper.lower = j.lower;
            j.columnHead.rowCount--;
            j = j.right;
        }
        i = i.lower;
    }
    return updates;
}
Slow version (its first line is line 77 of MatrixEntry.java):
void coverColumn() {
    //int updates = 1;
    columnHead.right.left = columnHead.left;
    columnHead.left.right = columnHead.right;
    MatrixEntry<T> i = columnHead.lower;
    while (i != columnHead) {
        MatrixEntry<T> j = i.right;
        while (j != i) {
            //updates++;
            j.lower.upper = j.upper;
            j.upper.lower = j.lower;
            j.columnHead.rowCount--;
            j = j.right;
        }
        i = i.lower;
    }
    //return updates;
}
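I also checked the bytecode with javap -c MatrixEntry.class. Apart from the expected additional instructions that create the variable on the stack, increment it, and return it, there are no differences.
To give the code some context: every field access is an access to an object reference to another instance of MatrixEntry, except for the access to rowCount. rowCount is an int field.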
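To make the field layout described above concrete, here is a minimal, self-contained sketch of such a linked structure. The `Node` class and the helpers `insertRight`, `addToColumn`, and `buildExample` are hypothetical stand-ins (the real `MatrixEntry<T>` is not shown in full); only the field names and the body of `coverColumn()` are taken from the question.

```java
public class CoverColumnSketch {

    // Hypothetical stand-in for MatrixEntry<T>; field names follow the question.
    static final class Node {
        Node left = this, right = this, upper = this, lower = this;
        Node columnHead = this;
        int rowCount; // the only non-reference field touched in the loop

        // Same statements as the counting variant of coverColumn() in the question.
        int coverColumn() {
            int updates = 1;
            columnHead.right.left = columnHead.left;
            columnHead.left.right = columnHead.right;
            Node i = columnHead.lower;
            while (i != columnHead) {
                Node j = i.right;
                while (j != i) {
                    updates++;
                    j.lower.upper = j.upper;
                    j.upper.lower = j.lower;
                    j.columnHead.rowCount--;
                    j = j.right;
                }
                i = i.lower;
            }
            return updates;
        }
    }

    // Inserts b directly to the right of a in a circular horizontal list.
    static void insertRight(Node a, Node b) {
        b.right = a.right;
        b.left = a;
        a.right.left = b;
        a.right = b;
    }

    // Appends a new entry at the bottom of the column owned by head.
    static Node addToColumn(Node head) {
        Node n = new Node();
        n.columnHead = head;
        n.upper = head.upper;
        n.lower = head;
        head.upper.lower = n;
        head.upper = n;
        head.rowCount++;
        return n;
    }

    // Builds two columns (c1, c2) with rows {c1,c2}, {c1,c2}, {c2}.
    // Returns {root, c1, c2, last entry of c2}.
    static Node[] buildExample() {
        Node root = new Node(), c1 = new Node(), c2 = new Node();
        insertRight(root, c1);
        insertRight(c1, c2);
        insertRight(addToColumn(c1), addToColumn(c2)); // row 1
        insertRight(addToColumn(c1), addToColumn(c2)); // row 2
        Node d2 = addToColumn(c2);                     // row 3, only in c2
        return new Node[] {root, c1, c2, d2};
    }

    public static void main(String[] args) {
        Node[] n = buildExample();
        int updates = n[1].coverColumn(); // cover column c1
        System.out.println("updates = " + updates + ", c2.rowCount = " + n[2].rowCount);
    }
}
```

Covering `c1` here unlinks the two rows that intersect it from `c2`, so the loop body performs pointer chasing through four reference fields plus one `int` decrement per visited entry, which is the access pattern the question describes.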
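I ran the tests on a Linux machine, an otherwise idle server with a Xeon E3-1220 v6 (on my desktop the results were less stable). In each run of the application, the Java method is called more than 309,357,294 times and the inner loop executes 100,722,885,573 times. I ran the application 4 times each with and without the statistics in the method. The standard deviation between runs was about 4 seconds, while each run with statistics took 22:19 minutes and each run without statistics took 24:08 minutes.
The JVM is OpenJDK: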
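As a quick sanity check of the numbers above, the following sketch converts the two run times to seconds and relates the gap to the inner-loop iteration count. The class and method names are made up for illustration, and spreading the whole difference over the inner loop is an assumption, since the measured time covers the entire application.

```java
public class RunTimeArithmetic {

    static long toSeconds(int minutes, int seconds) {
        return minutes * 60L + seconds;
    }

    // Relative slowdown of the slow run versus the fast run, in percent.
    static double slowdownPercent(long fastSeconds, long slowSeconds) {
        return 100.0 * (slowSeconds - fastSeconds) / fastSeconds;
    }

    // Extra wall-clock time per inner-loop iteration, in nanoseconds,
    // assuming (for illustration) the whole gap is spent in that loop.
    static double extraNanosPerIteration(long fastSeconds, long slowSeconds, long iterations) {
        return (slowSeconds - fastSeconds) * 1e9 / iterations;
    }

    public static void main(String[] args) {
        long fast = toSeconds(22, 19);           // with statistics: 1339 s
        long slow = toSeconds(24, 8);            // without statistics: 1448 s
        long innerIterations = 100_722_885_573L; // from the question

        System.out.println("slowdown %: " + slowdownPercent(fast, slow));
        System.out.println("extra ns per inner iteration: "
                + extraNanosPerIteration(fast, slow, innerIterations));
    }
}
```

The 109-second gap is roughly 27 times the reported 4-second standard deviation, and it works out to about one nanosecond per inner-loop iteration, i.e. a few clock cycles at the roughly 3.2 GHz reported by perf.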
openjdk version "17.0.13" 2024-10-15
OpenJDK Runtime Environment Temurin-17.0.13+11 (build 17.0.13+11)
OpenJDK 64-Bit Server VM Temurin-17.0.13+11 (build 17.0.13+11, mixed mode, sharing)
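The full source code is available, though I suspect that context is not needed.
I was very puzzled by the results and tried returning 0 instead of making the method return void. I even created a version in which the update counter is incremented but the value is never stored; that version was the fastest (but I did not run it multiple times).
Can anyone explain why the code with the integrated counter performs better?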
Bytecode of the fast version:
int coverColumn();
Code:
0: iconst_1
1: istore_1
2: aload_0
3: getfield #26 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
6: getfield #17 // Field right:Lde/famiru/dlx/MatrixEntry;
9: aload_0
10: getfield #26 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
13: getfield #13 // Field left:Lde/famiru/dlx/MatrixEntry;
16: putfield #13 // Field left:Lde/famiru/dlx/MatrixEntry;
19: aload_0
20: getfield #26 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
23: getfield #13 // Field left:Lde/famiru/dlx/MatrixEntry;
26: aload_0
27: getfield #26 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
30: getfield #17 // Field right:Lde/famiru/dlx/MatrixEntry;
33: putfield #17 // Field right:Lde/famiru/dlx/MatrixEntry;
36: aload_0
37: getfield #26 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
40: getfield #23 // Field lower:Lde/famiru/dlx/MatrixEntry;
43: astore_2
44: aload_2
45: aload_0
46: getfield #26 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
49: if_acmpeq 116
52: aload_2
53: getfield #17 // Field right:Lde/famiru/dlx/MatrixEntry;
56: astore_3
57: aload_3
58: aload_2
59: if_acmpeq 108
62: iinc 1, 1
65: aload_3
66: getfield #23 // Field lower:Lde/famiru/dlx/MatrixEntry;
69: aload_3
70: getfield #20 // Field upper:Lde/famiru/dlx/MatrixEntry;
73: putfield #20 // Field upper:Lde/famiru/dlx/MatrixEntry;
76: aload_3
77: getfield #20 // Field upper:Lde/famiru/dlx/MatrixEntry;
80: aload_3
81: getfield #23 // Field lower:Lde/famiru/dlx/MatrixEntry;
84: putfield #23 // Field lower:Lde/famiru/dlx/MatrixEntry;
87: aload_3
88: getfield #26 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
91: dup
92: getfield #7 // Field rowCount:I
95: iconst_1
96: isub
97: putfield #7 // Field rowCount:I
100: aload_3
101: getfield #17 // Field right:Lde/famiru/dlx/MatrixEntry;
104: astore_3
105: goto 57
108: aload_2
109: getfield #23 // Field lower:Lde/famiru/dlx/MatrixEntry;
112: astore_2
113: goto 44
116: iload_1
117: ireturn
Bytecode of the slow version:
void coverColumn();
Code:
0: aload_0
1: getfield #26 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
4: getfield #17 // Field right:Lde/famiru/dlx/MatrixEntry;
7: aload_0
8: getfield #26 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
11: getfield #13 // Field left:Lde/famiru/dlx/MatrixEntry;
14: putfield #13 // Field left:Lde/famiru/dlx/MatrixEntry;
17: aload_0
18: getfield #26 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
21: getfield #13 // Field left:Lde/famiru/dlx/MatrixEntry;
24: aload_0
25: getfield #26 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
28: getfield #17 // Field right:Lde/famiru/dlx/MatrixEntry;
31: putfield #17 // Field right:Lde/famiru/dlx/MatrixEntry;
34: aload_0
35: getfield #26 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
38: getfield #23 // Field lower:Lde/famiru/dlx/MatrixEntry;
41: astore_1
42: aload_1
43: aload_0
44: getfield #26 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
47: if_acmpeq 111
50: aload_1
51: getfield #17 // Field right:Lde/famiru/dlx/MatrixEntry;
54: astore_2
55: aload_2
56: aload_1
57: if_acmpeq 103
60: aload_2
61: getfield #23 // Field lower:Lde/famiru/dlx/MatrixEntry;
64: aload_2
65: getfield #20 // Field upper:Lde/famiru/dlx/MatrixEntry;
68: putfield #20 // Field upper:Lde/famiru/dlx/MatrixEntry;
71: aload_2
72: getfield #20 // Field upper:Lde/famiru/dlx/MatrixEntry;
75: aload_2
76: getfield #23 // Field lower:Lde/famiru/dlx/MatrixEntry;
79: putfield #23 // Field lower:Lde/famiru/dlx/MatrixEntry;
82: aload_2
83: getfield #26 // Field columnHead:Lde/famiru/dlx/MatrixEntry;
86: dup
87: getfield #7 // Field rowCount:I
90: iconst_1
91: isub
92: putfield #7 // Field rowCount:I
95: aload_2
96: getfield #17 // Field right:Lde/famiru/dlx/MatrixEntry;
99: astore_2
100: goto 55
103: aload_1
104: getfield #23 // Field lower:Lde/famiru/dlx/MatrixEntry;
107: astore_1
108: goto 42
111: return
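Following a suggestion in the comments, I had the JVM print the assembly code after C2 compilation, using the hsdis plugin and the following JVM command-line flags: -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -XX:PrintAssemblyOptions=intel -XX:CompileCommand=print,*MatrixEntry.coverColumn
I tried to extract only the inner-loop part of the assembly, because the complete listing is quite long.
Fast code, including the counter: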
0x00007faf08f46705: mov r11,QWORD PTR [r15+0x350]
0x00007faf08f4670c: mov r8,QWORD PTR [rsp]
0x00007faf08f46710: mov r9d,DWORD PTR [r8+0x14] ;*getfield columnHead {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@46 (line 82)
0x00007faf08f46714: mov r13d,DWORD PTR [r13+0x24] ; ImmutableOopMap {r9=NarrowOop r13=NarrowOop [0]=Oop }
;*goto {reexecute=1 rethrow=0 return_oop=0}
; - (reexecute) de.famiru.dlx.MatrixEntry::coverColumn@113 (line 92)
0x00007faf08f46718: test DWORD PTR [r11],eax ;*goto {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@113 (line 92)
; {poll}
0x00007faf08f4671b: cmp r13d,r9d
0x00007faf08f4671e: je 0x00007faf08f46806 ;*aload_2 {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@52 (line 83)
0x00007faf08f46724: mov ebp,DWORD PTR [r13+0x1c] ; implicit exception: dispatches to 0x00007faf08f46b04
;*getfield right {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@53 (line 83)
0x00007faf08f46728: cmp ebp,r13d
0x00007faf08f4672b: je 0x00007faf08f46705 ;*if_acmpeq {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@59 (line 84)
0x00007faf08f4672d: mov r11,r13 ;*aload_2 {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@52 (line 83)
0x00007faf08f46730: mov QWORD PTR [rsp+0x8],r11
0x00007faf08f46735: data16 data16 nop WORD PTR [rax+rax*1+0x0]
;*iinc {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@62 (line 85)
0x00007faf08f46740: mov ebx,DWORD PTR [rbp+0x20] ; implicit exception: dispatches to 0x00007faf08f46aee
;*getfield upper {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@70 (line 86)
0x00007faf08f46743: mov r14d,DWORD PTR [rbp+0x24] ;*getfield lower {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@66 (line 86)
0x00007faf08f46747: test r14d,r14d
0x00007faf08f4674a: je 0x00007faf08f46a2c ;*putfield left {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@16 (line 79)
0x00007faf08f46750: cmp BYTE PTR [r15+0x38],0x0
0x00007faf08f46755: jne 0x00007faf08f4681c
0x00007faf08f4675b: mov DWORD PTR [r14+0x20],ebx
0x00007faf08f4675f: mov r11,r14
0x00007faf08f46762: mov r8,rbx
0x00007faf08f46765: xor r8,r11
0x00007faf08f46768: shr r8,0x14
0x00007faf08f4676c: test r8,r8
0x00007faf08f4676f: je 0x00007faf08f4678f
0x00007faf08f46771: test ebx,ebx
0x00007faf08f46773: je 0x00007faf08f4678f
0x00007faf08f46775: shr r11,0x9
0x00007faf08f46779: movabs rdi,0x7faf1bba1000
0x00007faf08f46783: add rdi,r11
0x00007faf08f46786: cmp BYTE PTR [rdi],0x4
0x00007faf08f46789: jne 0x00007faf08f46882 ;*putfield upper {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@73 (line 86)
0x00007faf08f4678f: mov ebx,DWORD PTR [rbp+0x20] ;*getfield upper {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@77 (line 87)
0x00007faf08f46792: test ebx,ebx
0x00007faf08f46794: je 0x00007faf08f46a38 ;*putfield left {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@16 (line 79)
0x00007faf08f4679a: cmp BYTE PTR [r15+0x38],0x0
0x00007faf08f4679f: jne 0x00007faf08f4684f
0x00007faf08f467a5: mov DWORD PTR [rbx+0x24],r14d
0x00007faf08f467a9: mov r11,r14
0x00007faf08f467ac: mov r8,rbx
0x00007faf08f467af: xor r11,r8
0x00007faf08f467b2: shr r11,0x14
0x00007faf08f467b6: test r11,r11
0x00007faf08f467b9: je 0x00007faf08f467da
0x00007faf08f467bb: test r14d,r14d
0x00007faf08f467be: je 0x00007faf08f467da
0x00007faf08f467c0: shr r8,0x9
0x00007faf08f467c4: movabs rdi,0x7faf1bba1000
0x00007faf08f467ce: add rdi,r8
0x00007faf08f467d1: cmp BYTE PTR [rdi],0x4
0x00007faf08f467d4: jne 0x00007faf08f468cb ;*putfield lower {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@84 (line 87)
0x00007faf08f467da: mov r11d,DWORD PTR [rbp+0x14] ;*getfield columnHead {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@88 (line 88)
0x00007faf08f467de: dec DWORD PTR [r11+0xc] ; implicit exception: dispatches to 0x00007faf08f46af8
;*putfield left {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@16 (line 79)
0x00007faf08f467e2: mov r11,QWORD PTR [r15+0x350]
0x00007faf08f467e9: mov ebp,DWORD PTR [rbp+0x1c] ;*getfield right {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@101 (line 89)
0x00007faf08f467ec: inc r10d ; ImmutableOopMap {rbp=NarrowOop r13=NarrowOop [0]=Oop [8]=Oop }
;*goto {reexecute=1 rethrow=0 return_oop=0}
; - (reexecute) de.famiru.dlx.MatrixEntry::coverColumn@105 (line 89)
0x00007faf08f467ef: test DWORD PTR [r11],eax ;*goto {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@105 (line 89)
; {poll}
0x00007faf08f467f2: cmp ebp,r13d
0x00007faf08f467f5: je 0x00007faf08f46705 ;*if_acmpeq {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@59 (line 84)
Slow code, returning void:
0x00007f0628f5c10c: mov r10,QWORD PTR [r15+0x350]
0x00007f0628f5c113: mov r11,QWORD PTR [rsp]
0x00007f0628f5c117: mov r11d,DWORD PTR [r11+0x14] ;*getfield columnHead {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@44 (line 82)
0x00007f0628f5c11b: mov r13d,DWORD PTR [r13+0x24] ; ImmutableOopMap {r11=NarrowOop r13=NarrowOop [0]=Oop }
;*goto {reexecute=1 rethrow=0 return_oop=0}
; - (reexecute) de.famiru.dlx.MatrixEntry::coverColumn@108 (line 92)
0x00007f0628f5c11f: test DWORD PTR [r10],eax ;*goto {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@108 (line 92)
; {poll}
0x00007f0628f5c122: cmp r13d,r11d
0x00007f0628f5c125: je 0x00007f0628f5c0f9 ;*aload_1 {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@50 (line 83)
0x00007f0628f5c127: mov ebp,DWORD PTR [r13+0x1c] ; implicit exception: dispatches to 0x00007f0628f5c49c
;*getfield right {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@51 (line 83)
0x00007f0628f5c12b: cmp ebp,r13d
0x00007f0628f5c12e: je 0x00007f0628f5c10c ;*if_acmpeq {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@57 (line 84)
0x00007f0628f5c130: mov r10,r13 ;*aload_1 {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@50 (line 83)
0x00007f0628f5c133: mov QWORD PTR [rsp+0x8],r10
0x00007f0628f5c138: nop DWORD PTR [rax+rax*1+0x0] ;*aload_2 {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@60 (line 86)
0x00007f0628f5c140: mov ebx,DWORD PTR [rbp+0x24] ; implicit exception: dispatches to 0x00007f0628f5c486
;*getfield lower {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@61 (line 86)
0x00007f0628f5c143: mov r14d,DWORD PTR [rbp+0x20] ;*getfield upper {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@65 (line 86)
0x00007f0628f5c147: test ebx,ebx
0x00007f0628f5c149: je 0x00007f0628f5c400 ;*putfield left {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@14 (line 79)
0x00007f0628f5c14f: cmp BYTE PTR [r15+0x38],0x0
0x00007f0628f5c154: jne 0x00007f0628f5c1fe
0x00007f0628f5c15a: mov DWORD PTR [rbx+0x20],r14d
0x00007f0628f5c15e: mov r10,rbx
0x00007f0628f5c161: mov r11,r14
0x00007f0628f5c164: xor r11,r10
0x00007f0628f5c167: shr r11,0x14
0x00007f0628f5c16b: test r11,r11
0x00007f0628f5c16e: je 0x00007f0628f5c18f
0x00007f0628f5c170: test r14d,r14d
0x00007f0628f5c173: je 0x00007f0628f5c18f
0x00007f0628f5c175: shr r10,0x9
0x00007f0628f5c179: movabs rdi,0x7f063c735000
0x00007f0628f5c183: add rdi,r10
0x00007f0628f5c186: cmp BYTE PTR [rdi],0x4
0x00007f0628f5c189: jne 0x00007f0628f5c264 ;*putfield upper {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@68 (line 86)
0x00007f0628f5c18f: mov r14d,DWORD PTR [rbp+0x20] ;*getfield upper {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@72 (line 87)
0x00007f0628f5c193: test r14d,r14d
0x00007f0628f5c196: je 0x00007f0628f5c410 ;*putfield left {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@14 (line 79)
0x00007f0628f5c19c: cmp BYTE PTR [r15+0x38],0x0
0x00007f0628f5c1a1: jne 0x00007f0628f5c231
0x00007f0628f5c1a7: mov DWORD PTR [r14+0x24],ebx
0x00007f0628f5c1ab: mov r10,rbx
0x00007f0628f5c1ae: mov r11,r14
0x00007f0628f5c1b1: xor r10,r11
0x00007f0628f5c1b4: shr r10,0x14
0x00007f0628f5c1b8: test r10,r10
0x00007f0628f5c1bb: je 0x00007f0628f5c1db
0x00007f0628f5c1bd: test ebx,ebx
0x00007f0628f5c1bf: je 0x00007f0628f5c1db
0x00007f0628f5c1c1: shr r11,0x9
0x00007f0628f5c1c5: movabs rdi,0x7f063c735000
0x00007f0628f5c1cf: add rdi,r11
0x00007f0628f5c1d2: cmp BYTE PTR [rdi],0x4
0x00007f0628f5c1d5: jne 0x00007f0628f5c2a6 ;*putfield lower {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@79 (line 87)
0x00007f0628f5c1db: mov r10d,DWORD PTR [rbp+0x14] ;*getfield columnHead {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@83 (line 88)
0x00007f0628f5c1df: dec DWORD PTR [r10+0xc] ; implicit exception: dispatches to 0x00007f0628f5c490
;*goto {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@100 (line 89)
0x00007f0628f5c1e3: mov ebp,DWORD PTR [rbp+0x1c] ;*putfield left {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@14 (line 79)
0x00007f0628f5c1e6: mov r10,QWORD PTR [r15+0x350] ; ImmutableOopMap {rbp=NarrowOop r13=NarrowOop [0]=Oop [8]=Oop }
;*goto {reexecute=1 rethrow=0 return_oop=0}
; - (reexecute) de.famiru.dlx.MatrixEntry::coverColumn@100 (line 89)
0x00007f0628f5c1ed: test DWORD PTR [r10],eax ;*goto {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@100 (line 89)
; {poll}
0x00007f0628f5c1f0: cmp ebp,r13d
0x00007f0628f5c1f3: je 0x00007f0628f5c10c ;*if_acmpeq {reexecute=0 rethrow=0 return_oop=0}
; - de.famiru.dlx.MatrixEntry::coverColumn@57 (line 84)
Report from perf stat -d -d -d bin/app for the fast version:
944020.43 msec task-clock # 1.002 CPUs utilized
34131 context-switches # 36.155 /sec
3772 cpu-migrations # 3.996 /sec
26375 page-faults # 27.939 /sec
3073656028927 cycles # 3.256 GHz (61.54%)
8314592091084 instructions # 2.71 insn per cycle (69.23%)
1617667532435 branches # 1.714 G/sec (69.23%)
7376429842 branch-misses # 0.46% of all branches (69.23%)
2518602003849 L1-dcache-loads # 2.668 G/sec (69.23%)
217124852119 L1-dcache-load-misses # 8.62% of all L1-dcache accesses (69.23%)
8360116079 LLC-loads # 8.856 M/sec (69.23%)
36950592 LLC-load-misses # 0.44% of all LL-cache accesses (69.24%)
<not supported> L1-icache-loads
613369051 L1-icache-load-misses (69.23%)
2518014639585 dTLB-loads # 2.667 G/sec (69.24%)
2964075 dTLB-load-misses # 0.00% of all dTLB cache accesses (61.54%)
4335821 iTLB-loads # 4.593 K/sec (61.54%)
1578713 iTLB-load-misses # 36.41% of all iTLB cache accesses (61.54%)
For the slow version:
1011635.24 msec task-clock # 1.001 CPUs utilized
37371 context-switches # 36.941 /sec
1949 cpu-migrations # 1.927 /sec
26087 page-faults # 25.787 /sec
3277438029589 cycles # 3.240 GHz (61.54%)
9156368782029 instructions # 2.79 insn per cycle (69.23%)
1614775762524 branches # 1.596 G/sec (69.23%)
7360324323 branch-misses # 0.46% of all branches (69.23%)
2914029260930 L1-dcache-loads # 2.881 G/sec (69.23%)
216220083510 L1-dcache-load-misses # 7.42% of all L1-dcache accesses (69.23%)
8079066223 LLC-loads # 7.986 M/sec (69.23%)
47018889 LLC-load-misses # 0.58% of all LL-cache accesses (69.23%)
<not supported> L1-icache-loads
650941335 L1-icache-load-misses (69.23%)
2913664329244 dTLB-loads # 2.880 G/sec (69.23%)
2913507 dTLB-load-misses # 0.00% of all dTLB cache accesses (61.54%)
4294995 iTLB-loads # 4.246 K/sec (61.54%)
1796205 iTLB-load-misses # 41.82% of all iTLB cache accesses (61.54%)
The branch-miss rate is only 0.46% (in both cases), but perf stat highlights iTLB-load-misses, at 36% and 42% respectively.
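To compare the two perf reports more directly, the sketch below computes slow-to-fast ratios for a few counters and the extra events per inner-loop iteration. The class and method names are hypothetical. It assumes that both perf runs performed the same 100,722,885,573 inner iterations quoted earlier, and the counters were multiplexed (61% to 69% coverage), so the values are scaled estimates.

```java
public class PerfCounterRatios {

    // Counter values copied from the two perf stat reports.
    static final long FAST_CYCLES = 3_073_656_028_927L, SLOW_CYCLES = 3_277_438_029_589L;
    static final long FAST_INSNS  = 8_314_592_091_084L, SLOW_INSNS  = 9_156_368_782_029L;
    static final long FAST_LOADS  = 2_518_602_003_849L, SLOW_LOADS  = 2_914_029_260_930L;
    static final long FAST_BRANCHES = 1_617_667_532_435L, SLOW_BRANCHES = 1_614_775_762_524L;

    // Inner-loop iteration count from the question (assumed equal for both runs).
    static final long INNER_ITERATIONS = 100_722_885_573L;

    static double ratio(long slow, long fast) {
        return (double) slow / fast;
    }

    static double extraPerIteration(long slow, long fast) {
        return (double) (slow - fast) / INNER_ITERATIONS;
    }

    public static void main(String[] args) {
        System.out.println("cycles ratio:       " + ratio(SLOW_CYCLES, FAST_CYCLES));
        System.out.println("instructions ratio: " + ratio(SLOW_INSNS, FAST_INSNS));
        System.out.println("L1d loads ratio:    " + ratio(SLOW_LOADS, FAST_LOADS));
        System.out.println("branches ratio:     " + ratio(SLOW_BRANCHES, FAST_BRANCHES));
        System.out.println("extra instructions per iteration: "
                + extraPerIteration(SLOW_INSNS, FAST_INSNS));
        System.out.println("extra L1d loads per iteration:    "
                + extraPerIteration(SLOW_LOADS, FAST_LOADS));
    }
}
```

With these inputs the slow run retires roughly 10% more instructions and issues roughly 16% more L1 data-cache loads, while the branch count is nearly identical. Since the excerpted inner loops differ only by the `inc`, this may indicate that the extra work comes from code outside the excerpt, although the counters alone do not show where.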
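Comments:
The extra inc r10d inside the loop should easily run in parallel with the other work the CPU is doing, so I certainly wouldn't expect it to cause any slowdown in a memory-bound loop. There is also no obvious direct reason for it to cause a speedup, so this is either a secondary effect of code alignment or some quirk of uop scheduling and out-of-order execution resources. I don't think there is a simple explanation. I also wouldn't expect to see the same effect on different uarches (such as Zen), and maybe not even on Ice Lake or Alder Lake.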
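data16 data16 nop WORD PTR [rax+rax*1+0x0] is just a long NOP with three 66 prefixes (two of them redundant). felixcloutier.com/x86/nop documents 0F 1F /0 as taking a ModR/M byte, which lets you use any addressing mode you like; in this case SIB + disp32 is used to produce an 11-byte instruction (0x40 - 0x35 = 11): prefixes (3) + opcode (2) + ModR/M (1) + SIB (1) + disp32 (4). It is there to align the branch target that follows it. The GNU assembler emits the same thing for .p2align when 11 bytes of padding are needed; using at most 3 prefixes follows Intel's guidelines.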
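The byte count in the comment above can be checked with a small sketch. The byte sequence below is the standard encoding of this long NOP as far as I know; the 16-byte alignment is an assumption (the listings only show that the loop heads land on addresses ending in 0x40), and the class and method names are invented for illustration.

```java
public class LongNopLength {

    // Encoding of: data16 data16 nop WORD PTR [rax+rax*1+0x0]
    static final int[] NOP11 = {
        0x66, 0x66, 0x66,       // three operand-size prefixes
        0x0F, 0x1F,             // two-byte NOP opcode (0F 1F /0)
        0x84,                   // ModR/M: mod=10, reg=000, rm=100 (SIB and disp32 follow)
        0x00,                   // SIB: scale=1, index=rax, base=rax
        0x00, 0x00, 0x00, 0x00  // disp32 = 0
    };

    // Number of padding bytes needed to reach the next multiple of alignment.
    static int paddingTo(long address, int alignment) {
        return (int) ((alignment - address % alignment) % alignment);
    }

    public static void main(String[] args) {
        long fastNopAddress = 0x00007faf08f46735L; // NOP in the fast listing
        long slowNopAddress = 0x00007f0628f5c138L; // NOP in the slow listing
        System.out.println("NOP length: " + NOP11.length);
        System.out.println("fast padding: " + paddingTo(fastNopAddress, 16));
        System.out.println("slow padding: " + paddingTo(slowNopAddress, 16));
    }
}
```

Under that assumption the fast listing needs 11 bytes of padding and the slow listing needs 8, which matches the shorter nop DWORD PTR [rax+rax*1+0x0] seen there; in both cases the inner-loop head ends up at an address ending in 0x40.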