peepopt 🐣

peepopt recompiles x86-64 binaries using peephole optimization to take advantage of instructions available in newer processors. This improves performance and reduces power consumption in some situations.

Background

When compiling a program, one must decide which processor family to target, e.g., x86-64 or ARMv8. One may further specialize to a subset of processors, e.g., Intel Alder Lake or newer. Most Linux distributions compile binaries for a lowest-common-denominator profile, e.g., x86-64-v1 in Fedora and x86-64-v3 in RHEL 10. Some distributions like Gentoo compile from source to target a more specific processor and unlock additional performance. peepopt applies inexpensive peephole optimizations that reclaim some of this performance without expensive full-program recompilation.

Shift left example

Consider a C function:

uint32_t shift(uint32_t x, uint32_t y)
{
    return x << y;
}

Shifting left with x86-64-v1

The sall instruction takes only two operands and requires a variable shift count to be in %cl, so movl instructions are needed to set up the input registers:

89F8           movl %edi,%eax
89F1           movl %esi,%ecx
D3E0           sall %cl,%eax
C3             ret

Shifting left with x86-64-v3 (BMI2)

The shlx instruction takes three operands, which allows more flexibility and does not require the movl instructions:

C4E249F7C7     shlx %esi,%edi,%eax
C3             ret

Note that this is not strictly equivalent to the former example: the original sequence writes %ecx (via the second movl) and sall implicitly writes EFLAGS, while shlx leaves both untouched. When rewriting instructions, peepopt examines subsequent instructions to ensure that they would not observe the difference.
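
The check on subsequent instructions amounts to a small liveness scan. The following is a minimal sketch in C, not peepopt's actual code: the struct insn_effects summary, the RES_* bits, and the function name are hypothetical, and a real implementation would derive the read/write sets from a decoder such as XED. It treats EFLAGS as a single unit, which is coarser than the hardware, and it gives up at any control transfer it cannot follow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical resource bits; not peepopt's real representation. */
enum {
    RES_ECX   = 1u << 0,
    RES_FLAGS = 1u << 1,
};

/* Hypothetical summary of one decoded instruction. */
struct insn_effects {
    uint32_t reads;      /* resources whose current value is read */
    uint32_t writes;     /* resources that are fully overwritten */
    bool     ends_block; /* branch, call, ret: the scan cannot continue */
};

/*
 * Returns true if every resource in `changed` is overwritten before it is
 * read by the following instructions, so the code cannot tell whether the
 * original or the replacement sequence ran. Returns false when unsure.
 */
bool replacement_is_unobservable(const struct insn_effects *next,
                                 size_t count, uint32_t changed)
{
    assert(next != NULL || count == 0);

    uint32_t live = changed;
    for (size_t i = 0; i < count && live != 0; i++) {
        if (next[i].reads & live)
            return false;        /* the differing value would be observed */
        live &= ~next[i].writes; /* overwritten, so it no longer matters */
        if (next[i].ends_block)
            break;               /* cannot follow control flow further */
    }
    return live == 0;
}
```

For the example above, the instruction after the shift is ret. If one assumes the System V calling convention, where %ecx and the status flags carry no value back to the caller, ret can be summarized as overwriting both, and the scan accepts the replacement. Without that assumption the sketch conservatively rejects it.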

Optimizing existing binaries

Currently peepopt only does simple replacements, e.g., shifts, that can be done without changing the number of instruction bytes, so no code needs to be relocated. Unused bytes are padded with no-ops, which may seem wasteful, but processors discard them early in the pipeline without executing them. Further, the replacement instructions decode to fewer and simpler micro-operations, which can improve micro-op cache hit rates and reduce execution overhead.
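
As an illustration of a size-preserving replacement, the sketch below rewrites the exact six-byte sequence from the shift example into the five-byte shlx followed by a one-byte nop. It is a simplification under stated assumptions, not peepopt's implementation: the function name is hypothetical, and it matches raw bytes for brevity, which could match in the middle of a longer instruction. A real pass would decode instructions first (peepopt builds against XED), check that nothing observes %ecx or EFLAGS afterwards, and only target processors with BMI2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* movl %edi,%eax; movl %esi,%ecx; sall %cl,%eax (6 bytes) */
static const uint8_t OLD_SEQ[] = { 0x89, 0xF8, 0x89, 0xF1, 0xD3, 0xE0 };
/* shlx %esi,%edi,%eax (5 bytes) */
static const uint8_t NEW_SEQ[] = { 0xC4, 0xE2, 0x49, 0xF7, 0xC7 };

/*
 * Replaces each occurrence of OLD_SEQ in code[0..len) with NEW_SEQ and pads
 * the leftover byte with a nop (0x90), so every later instruction keeps its
 * address. Returns the number of replacements made.
 */
size_t rewrite_shift_sequences(uint8_t *code, size_t len)
{
    assert(code != NULL || len == 0);

    size_t replaced = 0;
    size_t i = 0;
    while (i + sizeof(OLD_SEQ) <= len) {
        if (memcmp(code + i, OLD_SEQ, sizeof(OLD_SEQ)) == 0) {
            memcpy(code + i, NEW_SEQ, sizeof(NEW_SEQ));
            memset(code + i + sizeof(NEW_SEQ), 0x90,
                   sizeof(OLD_SEQ) - sizeof(NEW_SEQ));
            i += sizeof(OLD_SEQ);
            replaced++;
        } else {
            i++;
        }
    }
    return replaced;
}
```

Applied to the bytes of the x86-64-v1 function above, this leaves the trailing ret (C3) at the same offset, which is why no jump targets or relocations need to be adjusted.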

Benchmarks

Anecdotally, using the x86-64-v3 profile improves performance by a few percent:

TODO: run benchmarks for Firefox and GCC

Compilation

First install the Intel X86 Encoder Decoder (XED):

git clone https://github.com/intelxed/xed.git xed
git clone https://github.com/intelxed/mbuild.git mbuild
cd xed
./mfile.py install --install-dir=kits/xed-install

Next build peepopt:

git clone https://github.com/gaul/peepopt.git peepopt
cd peepopt
XED_PATH=/path/to/xed make all

Usage

  • peepopt --dry-run program_file
    • Show which replacements peepopt would do
  • peepopt [--verbose] program_file
    • Optimize the input binary with replacement instructions

Future directions

  • 10-15 byte no-ops - optimal only on Sandy Bridge and newer; Atom and Zen perform poorly with them
  • APX - expanded three-operand and no flag instructions, supported by Panther Lake and newer processors
  • BMI - more flexible bit manipulations
  • FSRM - improve memory copies on Ice Lake and newer processors
    • difficult replacement due to more complicated register usage
  • inline compiler builtins, e.g., popcount
  • inline indirect functions
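
Replacements that save more than one byte leave wider gaps to pad. As a sketch of how such padding might work, the C function below fills a gap with the 1 to 9 byte no-op encodings recommended in Intel's Software Developer's Manual, using as few instructions as possible. The function name is hypothetical and this is not taken from peepopt. The 10-15 byte forms listed above extend the longest encoding with additional 0x66 prefixes and are deliberately left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Recommended multi-byte NOP encodings, indexed by (length - 1). */
static const uint8_t NOPS[9][9] = {
    { 0x90 },
    { 0x66, 0x90 },
    { 0x0F, 0x1F, 0x00 },
    { 0x0F, 0x1F, 0x40, 0x00 },
    { 0x0F, 0x1F, 0x44, 0x00, 0x00 },
    { 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00 },
    { 0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00 },
    { 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 },
};

/*
 * Fills dst[0..len) with NOP instructions, longest encodings first.
 * Returns the number of NOP instructions written.
 */
size_t fill_nops(uint8_t *dst, size_t len)
{
    assert(dst != NULL || len == 0);

    size_t count = 0;
    while (len > 0) {
        size_t n = len < 9 ? len : 9; /* longest encoding that still fits */
        memcpy(dst, NOPS[n - 1], n);
        dst += n;
        len -= n;
        count++;
    }
    return count;
}
```

An 11-byte gap, for instance, is filled with one 9-byte and one 2-byte no-op instead of eleven single-byte ones, so the processor decodes two instructions rather than eleven.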

Distributions

peepopt could run automatically during distribution package installs. This would require plugins for package managers like apt and dnf.

License

Copyright (C) 2026 Andrew Gaul

Licensed under the Apache License, Version 2.0
