Furthermore, libdivide allows you to divide SIMD vectors by runtime constants, which is especially nice because SIMD typically lacks integer division.
libdivide is free and open source with a permissive license. It is packed as a single header-only library, with both a C and C++ API.
Use cases: libdivide helps optimize the case of dividing many integer numerators by one integer denominator, where the denominator may only be known at runtime. There are many potential use cases: hash tables, scaling algorithms, a quick-'n-dirty speed boost for an interpreter, etc.
Supported types: libdivide supports 32 bit and 64 bit division of signed and unsigned integers (so four types total). It supports scalars, SSE2, AVX2, AVX512, and ARM NEON.
License: libdivide uses the zlib license. This permits usage of libdivide in any product - commercial, GPL, or otherwise. No attribution is required. However, you must not claim to have written libdivide, or distribute an altered version without plainly marking it as modified.
libdivide is separately made available under the Boost license, for users who prefer the Boost license or who are already using code under this license.
Documentation: See the user documentation
Performance: libdivide's scalar code is up to 16 times faster for powers of 2, 10 times faster for non-powers of 2, compared to naive hardware division. Optimized vector code is a further 2-3 times faster. The performance you will see depends on many factors. Click here to see a partial list. Of course, the best way to determine how libdivide affects your performance is to try it and measure!
For 64 bit numerators and denominators, libdivide provides much more benefit on x86-64 than on i386. This is because i386 chips have some hardware support for dividing 64 bit values, but not for multiplying 64 bit values. Even so, in my test, libdivide is nearly twice as fast as hardware division when dividing 64 bit integers in i386.
Overhead: libdivide requires a data structure that is one byte larger than the size of the divisor (so 32 bit division requires storing a 40 bit struct). It does not perform dynamic memory allocation.
In terms of processor time, pre-computing the proper magic number and shift is on the order of one to three hardware divides, for non-powers of 2. Powers of 2 (and their negatives) are much faster.
Why so little assembly? In addition to the portability issues, inline assembly defeats many compiler optimizations such as CSE, unswitching, register allocation, and instruction scheduling. libdivide's functions are designed to be inlined into and optimized along with the callers, so the compiler needs full visibility into what those functions are doing. That said, small amounts of assembly are used when the performance win is large and the codegen is otherwise unattainable (for example, the function libdivide_128_div_64_to_64).
Development: Contributions are very welcome at the github page.
Questions, patches, bug reports? Send them to