Posts: Add AoCO 2025 Day 10 Study Notes #58
Changes from all commits
59a6950
42bf732
dfab629
62e48d4
b4800aa
7e53321
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,317 @@ | ||||||
| --- | ||||||
| tags: AoCO2025, Compiler, x86 | ||||||
| --- | ||||||
|
|
||||||
| ## Study Notes: Unrolling loops, Advent of Compiler Optimisations 2025 | ||||||
|
|
||||||
| These notes are based on the post [**Unrolling loops**](https://xania.org/202512/10-loop-unrolling) and the YouTube video [**[AoCO 10/25] Unrolling Loops**](https://www.youtube.com/watch?v=HvF3tF2efEA&list=PL2HVqYf7If8cY4wLk7JUQ2f0JXY_xMQm2&index=11) which are Day 10 of the [Advent of Compiler Optimisations 2025](https://xania.org/AoCO2025-archive) Series by [Matt Godbolt](https://xania.org/MattGodbolt). | ||||||
|
|
||||||
| My notes focus on reproducing and verifying [Matt Godbolt](https://xania.org/MattGodbolt)'s teaching in a local development environment, using the LLVM toolchain on Ubuntu. | ||||||
|
|
||||||
| Written by me and assisted by AI, proofread by me and assisted by AI. | ||||||
|
|
||||||
| ## Development Environment | ||||||
| ```bash | ||||||
| $ lsb_release -d | ||||||
| Description: Ubuntu 24.04.3 LTS | ||||||
|
|
||||||
| $ clang++ --version | ||||||
| Ubuntu clang version 18.1.8 | ||||||
|
|
||||||
| $ llvm-objdump -v | ||||||
| Ubuntu LLVM version 18.1.8 | ||||||
|
|
||||||
| $ radare2 -v | ||||||
| radare2 5.5.0 0 @ linux-x86-64 git.5.5.0 | ||||||
| ``` | ||||||
|
|
||||||
| ## What is `std::span` | ||||||
|
|
||||||
| Let's start with a quick introduction to `std::span`. | ||||||
| It is a non-owning view that provides a uniform interface over contiguous sequences of objects such as `std::vector`, `std::array`, and raw C-style arrays. | ||||||
|
|
||||||
| ```bash | ||||||
| $ cat main.cpp | ||||||
| ``` | ||||||
|
|
||||||
| ```cpp | ||||||
| #include <span> | ||||||
| #include <vector> | ||||||
| #include <array> | ||||||
| #include <iostream> | ||||||
|
|
||||||
| template<typename T> | ||||||
| auto sum(T&& dataset) { | ||||||
|
The sum function template uses a forwarding reference (T&&), but std::span cannot be safely constructed from rvalue containers (like a temporary std::vector) because they are not 'borrowed ranges'. This will cause a compilation error if sum is called with a temporary container. Using const T& is safer for a generic sum function that uses std::span internally.
|
||||||
| std::span s{dataset}; | ||||||
| using U = typename decltype(s)::value_type; | ||||||
| U total{}; | ||||||
| for (const auto& val : s) { | ||||||
| total += val; | ||||||
| } | ||||||
| return total; | ||||||
| } | ||||||
|
|
||||||
| int main() { | ||||||
| std::vector<int> xs = {1, 2, 3, 4, 5}; | ||||||
| std::array<float, 3> ys = {4.5f, 5.6f, 6.7f}; | ||||||
| double zs[] = {7.8, 8.9, 9.10, 10.11, 11.12}; | ||||||
|
|
||||||
| std::cout << sum(xs) << "\n"; | ||||||
| std::cout << sum(ys) << "\n"; | ||||||
| std::cout << sum(zs) << "\n"; | ||||||
| return 0; | ||||||
| } | ||||||
| ``` | ||||||
|
|
||||||
| ```bash | ||||||
| $ rm -f *.out; clang++ -std=c++20 -o app.out main.cpp; ./app.out | ||||||
| 15 | ||||||
| 16.8 | ||||||
| 47.03 | ||||||
| ``` | ||||||
|
|
||||||
| ## What is Loop unrolling | ||||||
|
|
||||||
| Loop unrolling reduces loop overhead: the same work is done in fewer iterations, so fewer counter updates, comparisons, and branch instructions are executed. | ||||||
|
|
||||||
| To focus on loop unrolling, we disable SIMD with `-fno-vectorize -mno-sse -mno-avx` in the following examples. | ||||||
|
The phrase "To force on loop unrolling" is slightly confusing. If the intention is to isolate the effect of loop unrolling by disabling vectorization, "To focus on loop unrolling" or "To observe loop unrolling" would be clearer. Alternatively, if you meant to force the optimization, "To force loop unrolling" (without "on") is more idiomatic.
|
||||||
|
|
||||||
| #### Part 01 : Standard Loop | ||||||
|
|
||||||
| This is the standard for-loop and its corresponding assembly, compiled with `-fno-unroll-loops` so that the loop is kept. | ||||||
|
|
||||||
| ```bash | ||||||
| $ cat sum.cpp | ||||||
| ``` | ||||||
|
|
||||||
| ```cpp | ||||||
| int sum(int data[8]) { | ||||||
| int total = 0; | ||||||
| for (int i = 0; i < 8; i++) { | ||||||
| total += data[i]; | ||||||
| } | ||||||
| return total; | ||||||
| } | ||||||
| ``` | ||||||
|
|
||||||
| ```bash | ||||||
| $ clang++ -std=c++20 -O2 -fno-unroll-loops -fno-vectorize -mno-sse -mno-avx -c sum.cpp | ||||||
| ``` | ||||||
|
|
||||||
| ```bash | ||||||
| $ llvm-objdump -d --disassemble-symbols=$(nm sum.o | awk '/sum/ {print $3}') --x86-asm-syntax=intel sum.o | ||||||
| ``` | ||||||
|
|
||||||
| ```txt | ||||||
| sum.o: file format elf64-x86-64 | ||||||
|
|
||||||
| Disassembly of section .text: | ||||||
|
|
||||||
| 0000000000000000 <_Z3sumPi>: | ||||||
| 0: 31 c9 xor ecx, ecx | ||||||
| 2: 31 c0 xor eax, eax | ||||||
| 4: 66 66 66 2e 0f 1f 84 00 00 00 00 00 nop word ptr cs:[rax + rax] | ||||||
| 10: 03 04 8f add eax, dword ptr [rdi + 4*rcx] | ||||||
| 13: 48 ff c1 inc rcx | ||||||
| 16: 48 83 f9 08 cmp rcx, 0x8 | ||||||
| 1a: 75 f4 jne 0x10 <_Z3sumPi+0x10> | ||||||
| 1c: c3 ret | ||||||
| ``` | ||||||
|
|
||||||
| #### Part 02 : Manual Unrolling | ||||||
|
|
||||||
| We can manually unroll the loop as follows. | ||||||
|
|
||||||
| ```bash | ||||||
| $ cat sum.cpp | ||||||
| ``` | ||||||
|
|
||||||
| ```cpp | ||||||
| int sum(int data[8]) { | ||||||
| int total = 0; | ||||||
| total += data[0]; | ||||||
| total += data[1]; | ||||||
| total += data[2]; | ||||||
| total += data[3]; | ||||||
| total += data[4]; | ||||||
| total += data[5]; | ||||||
| total += data[6]; | ||||||
| total += data[7]; | ||||||
| return total; | ||||||
| } | ||||||
| ``` | ||||||
|
|
||||||
| ```bash | ||||||
| $ clang++ -std=c++20 -O2 -fno-unroll-loops -fno-vectorize -mno-sse -mno-avx -c sum.cpp | ||||||
| ``` | ||||||
|
|
||||||
| ```bash | ||||||
| $ llvm-objdump -d --disassemble-symbols=$(nm sum.o | awk '/sum/ {print $3}') --x86-asm-syntax=intel sum.o | ||||||
| ``` | ||||||
|
|
||||||
| ```text | ||||||
| sum.o: file format elf64-x86-64 | ||||||
|
|
||||||
| Disassembly of section .text: | ||||||
|
|
||||||
| 0000000000000000 <_Z3sumPi>: | ||||||
| 0: 8b 47 04 mov eax, dword ptr [rdi + 0x4] | ||||||
| 3: 03 07 add eax, dword ptr [rdi] | ||||||
| 5: 03 47 08 add eax, dword ptr [rdi + 0x8] | ||||||
| 8: 03 47 0c add eax, dword ptr [rdi + 0xc] | ||||||
| b: 03 47 10 add eax, dword ptr [rdi + 0x10] | ||||||
| e: 03 47 14 add eax, dword ptr [rdi + 0x14] | ||||||
| 11: 03 47 18 add eax, dword ptr [rdi + 0x18] | ||||||
| 14: 03 47 1c add eax, dword ptr [rdi + 0x1c] | ||||||
| 17: c3 ret | ||||||
| ``` | ||||||
|
|
||||||
| #### Part 03 : Use Compiler to do the Loop Unrolling | ||||||
|
|
||||||
| In the previous examples, we used `-fno-unroll-loops` to prevent the compiler from unrolling loops. | ||||||
|
|
||||||
| Now we drop that flag and see that the output assembly is the same as in Part 02, which was manually unrolled in the C++ code. | ||||||
|
There are a couple of grammatical issues in this sentence: "is as same as" should be "is the same as", and "which manually unrolled" should be "which was manually unrolled". Also, "part02" should be capitalized to match the section header.
|
||||||
|
|
||||||
| ```bash | ||||||
| $ cat sum.cpp | ||||||
| ``` | ||||||
|
|
||||||
| ```cpp | ||||||
| int sum(int data[8]) { | ||||||
| int total = 0; | ||||||
| for (int i = 0; i < 8; i++) { | ||||||
| total += data[i]; | ||||||
| } | ||||||
| return total; | ||||||
| } | ||||||
| ``` | ||||||
|
|
||||||
| ```bash | ||||||
| $ clang++ -std=c++20 -O2 -fno-vectorize -mno-sse -mno-avx -c sum.cpp | ||||||
| ``` | ||||||
|
|
||||||
| ```bash | ||||||
| $ llvm-objdump -d --disassemble-symbols=$(nm sum.o | awk '/sum/ {print $3}') --x86-asm-syntax=intel sum.o | ||||||
| ``` | ||||||
|
|
||||||
| ```text | ||||||
| sum.o: file format elf64-x86-64 | ||||||
|
|
||||||
| Disassembly of section .text: | ||||||
|
|
||||||
| 0000000000000000 <_Z3sumPi>: | ||||||
| 0: 8b 47 04 mov eax, dword ptr [rdi + 0x4] | ||||||
| 3: 03 07 add eax, dword ptr [rdi] | ||||||
| 5: 03 47 08 add eax, dword ptr [rdi + 0x8] | ||||||
| 8: 03 47 0c add eax, dword ptr [rdi + 0xc] | ||||||
| b: 03 47 10 add eax, dword ptr [rdi + 0x10] | ||||||
| e: 03 47 14 add eax, dword ptr [rdi + 0x14] | ||||||
| 11: 03 47 18 add eax, dword ptr [rdi + 0x18] | ||||||
| 14: 03 47 1c add eax, dword ptr [rdi + 0x1c] | ||||||
| 17: c3 ret | ||||||
| ``` | ||||||
|
|
||||||
| ## Case Study | ||||||
|
|
||||||
| We compare `std::span<int>` and `std::span<int, 8>` as parameter types to see how the compiler | ||||||
| performs loop unrolling when the size is fixed at compile time. | ||||||
|
|
||||||
| #### Case01 : `std::span<int>` | ||||||
|
|
||||||
| In this case, the span size is unknown at compile time, so the compiler generates a standard loop. | ||||||
|
|
||||||
| ```bash | ||||||
| $ cat sum.cpp | ||||||
| ``` | ||||||
|
|
||||||
| ```cpp | ||||||
| #include <span> | ||||||
|
|
||||||
| int sum(std::span<int> dataset) { | ||||||
| int total = 0; | ||||||
| for(const auto& data : dataset) { | ||||||
| total += data; | ||||||
| } | ||||||
| return total; | ||||||
| } | ||||||
| ``` | ||||||
|
|
||||||
| ```bash | ||||||
| $ clang++ -std=c++20 -O2 -fno-vectorize -mno-sse -mno-avx -c sum.cpp | ||||||
| ``` | ||||||
|
|
||||||
| ```bash | ||||||
| $ radare2 -q -e bin.cache=true -c "aa; pdf" sum.o | ||||||
| ``` | ||||||
|
|
||||||
| ```text | ||||||
| ;-- section..text: | ||||||
| ;-- .text: | ||||||
| ;-- reloc..text: | ||||||
| ┌ 32: sym.sum_std::span_int__18446744073709551615ul__ (int64_t arg1, int64_t arg2); | ||||||
| │ ; arg int64_t arg1 @ rdi | ||||||
| │ ; arg int64_t arg2 @ rsi | ||||||
| │ 0x08000040 4885f6 test rsi, rsi ; RELOC 32 .text @ 0x08000040 - 0x80000d8 ; arg2 ; [02] -r-x section size 32 named .text | ||||||
| │ ┌─< 0x08000043 7418 je 0x800005d | ||||||
| │ │ 0x08000045 48c1e602 shl rsi, 2 ; arg2 | ||||||
| │ │ 0x08000049 31c9 xor ecx, ecx | ||||||
| │ │ 0x0800004b 31c0 xor eax, eax | ||||||
| │ │ 0x0800004d 0f1f00 nop dword [rax] | ||||||
| │ ┌──> 0x08000050 03040f add eax, dword [rdi + rcx] ; arg1 | ||||||
| │ ╎│ 0x08000053 4883c104 add rcx, 4 | ||||||
| │ ╎│ 0x08000057 4839ce cmp rsi, rcx ; arg2 | ||||||
| │ └──< 0x0800005a 75f4 jne 0x8000050 | ||||||
| │ │ 0x0800005c c3 ret | ||||||
| │ └─> 0x0800005d 31c0 xor eax, eax | ||||||
| └ 0x0800005f c3 ret | ||||||
| ``` | ||||||
|
|
||||||
| #### Case02 : `std::span<int, 8>` | ||||||
|
|
||||||
| In this case, the span size is fixed at compile time, so the compiler can fully unroll the loop. | ||||||
|
|
||||||
| ```bash | ||||||
| $ cat sum.cpp | ||||||
| ``` | ||||||
|
|
||||||
| ```cpp | ||||||
| #include <span> | ||||||
|
|
||||||
| int sum(std::span<int, 8> dataset) { | ||||||
| int total = 0; | ||||||
| for(const auto& data : dataset) { | ||||||
| total += data; | ||||||
| } | ||||||
| return total; | ||||||
| } | ||||||
| ``` | ||||||
|
|
||||||
| ```bash | ||||||
| $ clang++ -std=c++20 -O2 -fno-vectorize -mno-sse -mno-avx -c sum.cpp | ||||||
| ``` | ||||||
|
|
||||||
| ```bash | ||||||
| $ radare2 -q -e bin.cache=true -c "aa; pdf" sum.o | ||||||
| ``` | ||||||
|
|
||||||
| ```text | ||||||
| ;-- section..text: | ||||||
| ;-- .text: | ||||||
| ;-- reloc..text: | ||||||
| ┌ 24: sym.sum_std::span_int__8ul__ (int64_t arg1); | ||||||
| │ ; arg int64_t arg1 @ rdi | ||||||
| │ 0x08000040 8b4704 mov eax, dword [rdi + 4] ; RELOC 32 .text @ 0x08000040 - 0x80000d0 ; arg1 ; [02] -r-x section size 24 named .text | ||||||
| │ 0x08000043 0307 add eax, dword [rdi] ; arg1 | ||||||
| │ 0x08000045 034708 add eax, dword [rdi + 8] ; arg1 | ||||||
| │ 0x08000048 03470c add eax, dword [rdi + 0xc] ; arg1 | ||||||
| │ 0x0800004b 034710 add eax, dword [rdi + 0x10] ; arg1 | ||||||
| │ 0x0800004e 034714 add eax, dword [rdi + 0x14] ; arg1 | ||||||
| │ 0x08000051 034718 add eax, dword [rdi + 0x18] ; arg1 | ||||||
| │ 0x08000054 03471c add eax, dword [rdi + 0x1c] ; arg1 | ||||||
| └ 0x08000057 c3 ret | ||||||
| ``` | ||||||
|
|
||||||
| ## Conclusion | ||||||
| Radare2 visualizes the assembly together with its control flow. | ||||||
| With it, we can see that when the span size is known at compile time, the compiler fully unrolls the loop, | ||||||
| removing the loop counter, the comparison, and the branch instruction. | ||||||
This sentence is quite repetitive. A more concise version would improve the flow of the introduction.