This is the documentation for libdivide, a library for optimizing integer division. libdivide has both a C and a C++ interface. Pick one:

libdivide in C++

The C++ API leverages templates and operator overloading. If you don't like these things, feel free to use the C API, which works fine in C++.

An example is worth a thousand words, so jump in. Here is a function that divides many integers by a fixed integer, and returns the sum of the quotients:

int sum_of_quotients(const int *numers, int count, int d) {
    int result = 0;
    for (int i=0; i < count; i++)
        result += numers[i] / d; //this division is slow!
    return result;
}

Here is how you would optimize it with libdivide in C++:

int sum_of_quotients(const int *numers, int count, int d) {
    int result = 0;
    libdivide::divider<int> fast_d(d); //constructs an instance of libdivide::divider
    for (int i=0; i < count; i++)
        result += numers[i] / fast_d; //uses faster libdivide division
    return result;
}

Despite the division operator, no division instructions are issued in the second version. The division operator is overloaded to perform a multiply and a shift instead, using values that were precomputed in the constructor of libdivide::divider.
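To make the multiply-and-shift idea concrete, here is a minimal sketch of a toy divider. It is not libdivide's implementation: the class name toy_divider and its divide method are hypothetical, it handles only uint32_t with divisors of 2 or more, and it uses one well-known scheme (precompute ceil(2^64 / d), then take the top 64 bits of the product with the numerator).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical toy class for illustration only; NOT libdivide's implementation.
// Supports uint32_t numerators and divisors d >= 2.
class toy_divider {
public:
    explicit toy_divider(uint32_t d) {
        assert(d >= 2);  // d == 1 would make the magic number wrap around to 0
        // The one real division happens here, once: magic = ceil(2^64 / d).
        magic_ = UINT64_MAX / d + 1;
    }

    // Returns floor(n / d) using two multiplies, an add, and shifts.
    uint32_t divide(uint32_t n) const {
        uint64_t hi = (magic_ >> 32) * n;          // high half of magic times n
        uint64_t lo = (magic_ & 0xFFFFFFFFu) * n;  // low half of magic times n
        // Same value as (magic * n) >> 64 computed on a 128-bit product.
        return static_cast<uint32_t>((hi + (lo >> 32)) >> 32);
    }

private:
    uint64_t magic_;
};

// Overloaded operator so that `n / fast_d` reads like ordinary division.
inline uint32_t operator/(uint32_t n, const toy_divider &d) {
    return d.divide(n);
}

// Mirrors the example above, using the toy class instead of libdivide.
inline uint32_t toy_sum_of_quotients(const uint32_t *numers, int count, uint32_t d) {
    uint32_t result = 0;
    toy_divider fast_d(d);
    for (int i = 0; i < count; i++)
        result += numers[i] / fast_d;  // multiply and shift, no divide instruction needed
    return result;
}
```

The design point is the same as in libdivide: the expensive work is paid once in the constructor, and each later division costs only multiplies and shifts. The real library covers signed types, 64-bit types, and divisors this sketch rejects.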

All of libdivide is contained in a single header file, inside the libdivide namespace. The sole public class in this namespace is 'divider'. This class is a template, parameterized by the type you want to divide. Four types are supported: int32_t, int64_t, uint32_t, and uint64_t. Any other type produces an error.

When dividing, the numerator may be the same type as the denominator. If vector support is enabled, it may also be a vector type. Supported vector types are the corresponding NEON types (uint32x4_t, int32x4_t, uint64x2_t, int64x2_t) and the x86 types (__m128i, __m256i, __m512i).

libdivide in C

The C API takes the form of a family of regularly named functions. You use libdivide in C by passing the divisor to a generating function, which returns a struct. You then pass a dividend (numerator) and a pointer to the struct to a do function, which returns the resulting quotient.

Here is the same plain C function, which divides many integers by a fixed integer and returns the sum of the quotients:

int sum_of_quotients(const int *numers, int count, int d) {
    int result = 0;
    for (int i=0; i < count; i++)
        result += numers[i] / d; //this division is slow!
    return result;
}

Here is how you would optimize it with libdivide in C:

int sum_of_quotients(const int *numers, size_t count, int d) {
    int result = 0;
    struct libdivide_s32_t fast_d = libdivide_s32_gen(d);
    for (size_t i=0; i < count; i++)
        result += libdivide_s32_do(numers[i], &fast_d); // performs faster libdivide division
    return result;
}

The four supported types are int32_t, uint32_t, int64_t, and uint64_t. The four generating functions are:

struct libdivide_s32_t libdivide_s32_gen(int32_t y)
struct libdivide_u32_t libdivide_u32_gen(uint32_t y)
struct libdivide_s64_t libdivide_s64_gen(int64_t y)
struct libdivide_u64_t libdivide_u64_gen(uint64_t y)

Similarly, there are four do functions. Each accepts a numerator and returns the result of dividing it by the denominator passed to the gen function:

int32_t libdivide_s32_do(int32_t, const struct libdivide_s32_t *)
uint32_t libdivide_u32_do(uint32_t, const struct libdivide_u32_t *)
int64_t libdivide_s64_do(int64_t, const struct libdivide_s64_t *)
uint64_t libdivide_u64_do(uint64_t, const struct libdivide_u64_t *)

There are also do_vec functions, designed to integrate with vector intrinsics. The functions are named according to the vector width: 128 for SSE2 and NEON, 256 for AVX2, 512 for AVX512. Each accepts a vector of packed numerators (for example, two 64-bit or four 32-bit values in a 128-bit vector), and returns a vector containing the result of dividing each by the denominator passed to the gen function:

__m128i libdivide_s32_do_vec128(__m128i, const struct libdivide_s32_t *)
__m128i libdivide_u32_do_vec128(__m128i, const struct libdivide_u32_t *)
__m128i libdivide_s64_do_vec128(__m128i, const struct libdivide_s64_t *)
__m128i libdivide_u64_do_vec128(__m128i, const struct libdivide_u64_t *)

__m256i libdivide_s32_do_vec256(__m256i, const struct libdivide_s32_t *)
__m256i libdivide_u32_do_vec256(__m256i, const struct libdivide_u32_t *)
__m256i libdivide_s64_do_vec256(__m256i, const struct libdivide_s64_t *)
__m256i libdivide_u64_do_vec256(__m256i, const struct libdivide_u64_t *)

__m512i libdivide_s32_do_vec512(__m512i, const struct libdivide_s32_t *)
__m512i libdivide_u32_do_vec512(__m512i, const struct libdivide_u32_t *)
__m512i libdivide_s64_do_vec512(__m512i, const struct libdivide_s64_t *)
__m512i libdivide_u64_do_vec512(__m512i, const struct libdivide_u64_t *)

int32x4_t libdivide_s32_do_vec128(int32x4_t, const struct libdivide_s32_t *)
uint32x4_t libdivide_u32_do_vec128(uint32x4_t, const struct libdivide_u32_t *)
int64x2_t libdivide_s64_do_vec128(int64x2_t, const struct libdivide_s64_t *)
uint64x2_t libdivide_u64_do_vec128(uint64x2_t, const struct libdivide_u64_t *)
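If the lane-wise behavior is unclear, the following scalar model may help. It is only a sketch of what a 128-bit do_vec call computes for four packed uint32_t numerators: the name model_u32_do_vec128 and the use of std::array are assumptions made for illustration, and the model uses ordinary division, so it shows the result but not the fast path.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of a 128-bit do_vec call on four uint32_t lanes.
// NOT part of libdivide: the real functions take SIMD registers and a
// precomputed struct, and avoid division instructions.
inline std::array<uint32_t, 4> model_u32_do_vec128(
        const std::array<uint32_t, 4> &numers, uint32_t d) {
    assert(d != 0);  // division by zero is undefined
    std::array<uint32_t, 4> quotients{};
    for (std::size_t lane = 0; lane < numers.size(); lane++)
        quotients[lane] = numers[lane] / d;  // the same divisor applies to every lane
    return quotients;
}
```

Every lane is divided by the same denominator, which is why a single struct from the gen function serves both the scalar do functions and the do_vec functions.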

Preprocessor defines

libdivide's behavior can be tweaked by a few preprocessor macros:

LIBDIVIDE_SSE2
LIBDIVIDE_AVX2
LIBDIVIDE_AVX512

These enable the corresponding x86 vector support.

LIBDIVIDE_NEON

This enables NEON vector support: