Furthermore, libdivide allows you to divide an SSE2 vector by a runtime constant, which is especially nice because SSE2 has no integer division instructions!
libdivide is free and open source with a permissive license. The name "libdivide" is a bit of a joke, as there is no library per se: the code is packaged entirely as a single header file, with both a C and a C++ API.
What is it good for? libdivide helps optimize the case of dividing many integer numerators by one integer denominator, where the denominator may only be known at runtime. There are many potential use cases: hash tables, scaling algorithms, a quick-'n-dirty speed boost for an interpreter, etc.
What types does it support? libdivide supports signed and unsigned division with 32 bit and 64 bit types (so four types in total). libdivide also supports dividing a packed SSE vector by these types, in a way designed to integrate with the SSE intrinsics supported by gcc, clang, and Visual C++.
Where does it run? libdivide is valid ANSI C and ANSI C++. It is packaged as a single header file for easy integration, and has been tested with clang, gcc, and Visual C++.
What is the license? libdivide uses the zlib license. This permits usage of libdivide in any product - commercial, GPL, or otherwise. No attribution is required. However, you must not claim to have written libdivide, or distribute an altered version without plainly marking it as modified.
Where can I get it?
How do I use it? See the user documentation
How fast? libdivide's scalar code is up to 16 times faster for powers of 2, 10 times faster for non-powers of 2, compared to naive hardware division. Optimized vector code is a further 2-3 times faster. The performance you will see depends on many factors. Click here to see a partial list. Of course, the best way to determine how libdivide affects your performance is to try it and measure!
For 64 bit numerators and denominators, libdivide provides much more benefit on x86-64 than on i386. This is because i386 chips have some hardware support for dividing 64 bit values, but not for multiplying 64 bit values. Even so, in my test, libdivide is nearly twice as fast as hardware division when dividing 64 bit integers in i386.
What is the overhead? libdivide requires a data structure that is one byte larger than the size of the divisor (so 32 bit division requires storing a 40 bit struct). It does not perform dynamic memory allocation and is in general very lean on memory.
In terms of processor time, pre-computing the proper magic number and shift is on the order of one to three hardware divides, for non-powers of 2. Powers of 2 (and their negatives) are much faster.
Does it work with .NET? libdivide works well with C++/CLI. It is about twice as slow compared to native code, which means it is 2 to 5 times faster than "native" (managed) division.
Why so little assembly? In addition to the portability issues, inline assembly defeats many compiler optimizations such as CSE, unswitching, register allocation, and instruction scheduling. libdivide's functions are designed to be inlined into and optimized along with the callers, so the compiler needs full visibility into what those functions are doing. That said, small amounts of assembly are used when the performance win is large and the codegen is otherwise unattainable (for example, the function libdivide_128_div_64_to_64).
How can I help? You can contribute at the github page. Here is the wishlist. Help with ARM and on Windows is especially appreciated.
Questions, patches, bug reports? Send them to