The C++ API leverages templates and operator overloading. If you don't like these things, feel free to use the C API, which works fine in C++.
An example is worth a thousand words, so jump in. Here is a function that divides many integers by a fixed integer, and returns their sum:

    int sum_of_quotients(const int *numers, int count, int d) {
        int result = 0;
        for (int i=0; i < count; i++)
            result += numers[i] / d; //this division is slow!
        return result;
    }

Here is how you would optimize it with libdivide in C++:

    #include "libdivide.h"

    int sum_of_quotients(const int *numers, int count, int d) {
        int result = 0;
        libdivide::divider<int> fast_d(d); //constructs an instance of libdivide::divider
        for (int i=0; i < count; i++)
            result += numers[i] / fast_d; //uses faster libdivide division
        return result;
    }

Despite the division operator, no division instructions are issued in the second version. The division operator is overloaded to use a multiply and a shift instead, with the multiplier and shift amount precomputed in the constructor for libdivide::divider.
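To make the "multiply and shift" idea concrete, here is a minimal sketch for unsigned 32-bit values. It uses the round-up method described by Granlund and Montgomery in "Division by Invariant Integers using Multiplication". The class name toy_divider and its members are hypothetical, and this is not libdivide's actual implementation, which covers more types and special cases.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for libdivide::divider<uint32_t>; NOT libdivide's code.
class toy_divider {
public:
    // Precompute the multiplier and shift amounts once. The divisor must be nonzero.
    explicit toy_divider(uint32_t d) {
        assert(d != 0);
        uint32_t l = 0;  // l = ceil(log2(d))
        while ((uint64_t{1} << l) < d) ++l;
        // magic = floor(2^32 * (2^l - d) / d) + 1, which always fits in 32 bits.
        magic_ = static_cast<uint32_t>(
            ((uint64_t{1} << 32) * ((uint64_t{1} << l) - d)) / d + 1);
        shift1_ = l < 1 ? l : 1;      // min(l, 1)
        shift2_ = l > 1 ? l - 1 : 0;  // max(l - 1, 0)
    }

    // Compute n / d without a division instruction.
    uint32_t divide(uint32_t n) const {
        // t = high 32 bits of the 64-bit product magic_ * n.
        uint32_t t = static_cast<uint32_t>((uint64_t{magic_} * n) >> 32);
        return (t + ((n - t) >> shift1_)) >> shift2_;
    }

private:
    uint32_t magic_;
    uint32_t shift1_;
    uint32_t shift2_;
};

// Overload '/' so call sites keep ordinary division syntax.
inline uint32_t operator/(uint32_t n, const toy_divider& d) {
    return d.divide(n);
}
```

With this sketch, `100u / toy_divider(7)` evaluates to 14 using one multiplication, one subtraction, one addition, and two shifts. The cost of constructing the divider (which does use a real division) only pays off when the same divisor is reused many times, as in the loop above.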
All of libdivide is contained in a single header file, and the C++ API lives in the libdivide namespace. The sole public class in this namespace is 'divider'. This class is a template, parameterized by the type you want to divide. Four types are supported: int32_t, int64_t, uint32_t, and uint64_t, with other types producing an error.
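As an illustration of how a template can accept only these four types and reject everything else at compile time, here is a minimal sketch. The names is_supported_divisor and checked_divider are hypothetical, and the mechanism libdivide itself uses to produce the error may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical trait: true only for the four integer types listed above.
template <typename T>
struct is_supported_divisor
    : std::integral_constant<bool,
          std::is_same<T, int32_t>::value || std::is_same<T, int64_t>::value ||
          std::is_same<T, uint32_t>::value || std::is_same<T, uint64_t>::value> {};

// Hypothetical stand-in for a divider template; only the type check is shown.
template <typename T>
struct checked_divider {
    static_assert(is_supported_divisor<T>::value,
                  "only int32_t, int64_t, uint32_t and uint64_t are supported");

    explicit checked_divider(T d) : denom(d) {
        assert(d != 0);  // a zero divisor is never meaningful
    }

    T denom;
};
```

In this sketch, `checked_divider<uint32_t> d(7);` compiles, while `checked_divider<float> d(7.0f);` fails with the message given to static_assert.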
When dividing, the numerator may be the same type as the denominator. If vector support is enabled, then it may also be a vector type. Supported vector types are the corresponding NEON types (uint32x4_t, int32x4_t, uint64x2_t, int64x2_t) and the x86 family (__m128i, __m256i, __m512i).
The C API takes the form of a family of regularly named functions. You use libdivide in C by passing the divisor to a generating function, which returns a struct. You then pass a dividend (numerator) and a pointer to the struct to a do function, which returns the resulting quotient.
Here is that normal C function that divides many integers by a fixed integer, and returns their sum:

    int sum_of_quotients(const int *numers, int count, int d) {
        int result = 0;
        for (int i=0; i < count; i++)
            result += numers[i] / d; //this division is slow!
        return result;
    }

Here is how you would optimize it with libdivide in C:

    #include <stddef.h>
    #include "libdivide.h"

    int sum_of_quotients(const int *numers, size_t count, int d) {
        int result = 0;
        struct libdivide_s32_t fast_d = libdivide_s32_gen(d);
        for (size_t i=0; i < count; i++)
            result += libdivide_s32_do(numers[i], &fast_d); // performs faster libdivide division
        return result;
    }

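The four supported types are int32_t, uint32_t, int64_t, and uint64_t. The four generating functions are libdivide_s32_gen, libdivide_u32_gen, libdivide_s64_gen, and libdivide_u64_gen. Each takes a divisor of the matching type and returns the corresponding struct (libdivide_s32_t, libdivide_u32_t, libdivide_s64_t, or libdivide_u64_t).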
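Similarly, there are four do functions: libdivide_s32_do, libdivide_u32_do, libdivide_s64_do, and libdivide_u64_do. Each accepts a numerator and a pointer to the struct, and returns the result of dividing the numerator by the denominator passed to the gen function.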
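The gen/do split can be sketched in a few lines. The example below is a toy for 16-bit unsigned values, chosen only because the arithmetic then fits comfortably in a 64-bit integer. The names toy_u16_t, toy_u16_gen, and toy_u16_do are hypothetical and merely mirror the naming pattern; they are not part of libdivide, and libdivide's own algorithms differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical precomputed state, playing the role of a struct such as libdivide_s32_t.
struct toy_u16_t {
    uint64_t magic;  // floor(2^32 / d) + 1
};

// "gen" step: run once per divisor. Assumes d != 0.
inline toy_u16_t toy_u16_gen(uint16_t d) {
    assert(d != 0);
    toy_u16_t denom;
    denom.magic = (uint64_t{1} << 32) / d + 1;
    return denom;
}

// "do" step: one multiply and one shift per numerator, no division.
// Exact for every 16-bit numerator because the rounding error in magic
// is too small to change the truncated quotient.
inline uint16_t toy_u16_do(uint16_t numer, const toy_u16_t* denom) {
    return static_cast<uint16_t>((denom->magic * numer) >> 32);
}
```

Usage follows the same shape as the libdivide example: call `toy_u16_gen(7)` once, keep the returned struct, and pass its address to `toy_u16_do` for each numerator. Passing the struct by pointer lets many calls share a single precomputed value.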
There are also do_vec functions, designed to integrate with vector intrinsics. The functions carry the vector width as a suffix: do_vec128 for SSE2 and NEON, do_vec256 for AVX2, and do_vec512 for AVX512. Each accepts a vector of packed numerators (two 64-bit or four 32-bit values in the 128-bit case) and returns a vector containing the result of dividing each by the denominator passed to the gen function.
libdivide's behavior can be tweaked by a few preprocessor macros:
These enable the corresponding x86 vector support.
This enables NEON vector support:
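A typical way to use such macros is to define the one matching your target before including the header. The macro names below (LIBDIVIDE_SSE2, LIBDIVIDE_AVX2, LIBDIVIDE_AVX512, LIBDIVIDE_NEON) are the ones found in recent libdivide releases; treat them as an assumption and confirm them against your copy of libdivide.h.

```cpp
// Assumption: macro names as used by recent libdivide releases.
// Define at most the ones your target CPU and compiler flags support.
#define LIBDIVIDE_SSE2        // x86: 128-bit vectors (__m128i)
// #define LIBDIVIDE_AVX2     // x86: 256-bit vectors (__m256i)
// #define LIBDIVIDE_AVX512   // x86: 512-bit vectors (__m512i)
// #define LIBDIVIDE_NEON     // ARM: NEON vector types

#include "libdivide.h"
```

The same effect can be had by passing the definition on the compiler command line (for example `-DLIBDIVIDE_AVX2`). In either case the compiler must also be told to target the instruction set, for example with `-mavx2` on GCC or Clang.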