This is the documentation for libdivide, a library for optimizing integer division. libdivide has both a C and a C++ interface. Pick one:

libdivide in C++

The C++ API leverages templates and operator overloading. If you don't like these things, feel free to use the C API, which works fine in C++.

An example is worth a thousand words, so jump in. Here is a function that divides many integers by a fixed integer, and returns their sum:

int sum_of_quotients(const int *numers, int count, int d) {
    int result = 0;
    for (int i=0; i < count; i++)
        result += numers[i] / d; //this division is slow!
    return result;
}
Here is how you would optimize it with libdivide in C++:
int sum_of_quotients(const int *numers, int count, int d) {
    int result = 0;
    libdivide::divider<int> fast_d(d); //constructs an instance of libdivide::divider
    for (int i=0; i < count; i++)
        result += numers[i] / fast_d; //uses faster libdivide division
    return result;
}
Despite the division operator, no division instructions are issued in the second code. The division operator is overloaded to instead use a multiply and shift, which were precomputed in the constructor for libdivide::divider.

All of libdivide is contained in a single header file, with the libdivide namespace. The sole public class in this namespace is 'divider'. This class is a template, parameterized by the type you want to divide. Four types are supported: int32_t, int64_t, uint32_t, and uint64_t, with other types producing an error.

Unswitching in C++

libdivide also supports a concept called "unswitching." Unswitching is a compiler optimization in which a conditional but invariant branch is hoisted out of a loop, and then the loop's contents are duplicated for each arm of the branch. That is, if the compiler can deduce that an if statement will always evaluate the same way in a loop, it may evaluate the if statement outside of the loop, and branch to one of two loops depending on the result. If your compiler is unable or unwilling to perform this optimization in libdivide, you may do it manually, like so:

int sum_of_quotients(const int *numers, int count, int d) {
    using namespace libdivide;
    int i, result = 0;
    divider<int> fast_d(d);
    switch (fast_d.get_algorithm()) {
        case 0: for (i=0; i < count; i++) result += numers[i] / unswitch<0>(fast_d); break;
        case 1: for (i=0; i < count; i++) result += numers[i] / unswitch<1>(fast_d); break;
        case 2: for (i=0; i < count; i++) result += numers[i] / unswitch<2>(fast_d); break;
        case 3: for (i=0; i < count; i++) result += numers[i] / unswitch<3>(fast_d); break;
        case 4: for (i=0; i < count; i++) result += numers[i] / unswitch<4>(fast_d); break;
    }
    return result;
}

There are five cases. Don't worry about using too many: the unswitch template will give you an error for five or more. This optimization is only useful for loop-like structures: don't try to cache the algorithm index, or store a function pointer to the right algorithm, or any fancy tricks like that, since they will introduce more overhead.

In my tests, unswitching gains are modest or nonexistent on clang and gcc, but substantial on Visual C++. (Presumably the two former compilers are more aggressive about unswitching than the latter.)

The formal class interface is:

namespace libdivide {
   template<typename INT_TYPE>
   class divider {
   
      /* Ordinary constructor, that takes the divisor as a parameter. */
      divider(INT_TYPE n);
      
      /* Divides the parameter by the divisor, returning the quotient. */
      T perform_divide(T val) const;
      
      /* Treats the vector as either two or four packed values (depending on the size), and divides each of them by the divisor, returning the packed quotients. */
      __m128i perform_divide_vector(__m128i val) const; 

      
      /* Returns the index of algorithm, for use in the unswitch function. */
      int get_algorithm() const;
      
      /* Standard operator overloads.  operator= is defined implicitly. */
      bool operator==(const divider<INT_TYPE> & other) const;
      bool operator!=(const divider<INT_TYPE> & other) const;
      
   };
   
   /* Overload of the / operator for scalar division. */
   template<typename INT_TYPE>
   INT_TYPE operator/(INT_TYPE numer, const divider<INT_TYPE> & denom);
   
   /* Overload of the / operator for vector division. */
   template<typename INT_TYPE>
   __m128i operator/(__m128i numer, const divider<INT_TYPE> & denom);
   
   /* Returns a divider specialized for the given algorithm. */
   template<int ALGO, typename INT_TYPE>
   divider<INT_TYPE> unswitch(const divider<INT_TYPE> & d)
}

libdivide in C

The C API takes the form of a family of regularly named functions. You use libdivide in C by passing the divisor to a generating function, which returns a struct. You then pass a dividend (numerator) and a pointer to the struct to a do function, which returns the resulting quotient.

Here is that normal C function that divides many integers by a fixed integer, and returns their sum:

int sum_of_quotients(const int *numers, int count, int d) {
    int result = 0;
    for (int i=0; i < count; i++)
        result += numers[i] / d; //this division is slow!
    return result;
}
Here is how you would optimize it with libdivide in C:
int sum_of_quotients(const int *numers, size_t count, int d) {
    int result = 0;
    struct libdivide_s32_t fast_d = libdivide_s32_gen(d);
    for (size_t i=0; i < count; i++)
        result += libdivide_s32_do(numers[i], &fast_d); // performs faster libdivide division
    return result;
}

The four supported types are int32_t, uint32_t, int64_t, and uint64_t. The four generating functions are:

Similarly, there are four do functions. Each accepts a numerator and returns the result of dividing it by the denominator passed to the gen function:

There are also four do_vector functions, designed to integrate with SSE intrisics. Each accepts a vector containing either two or four packed numerators, and returns a vector containing the result of dividing each by the denominator passed to the gen function:

Unswitching in C

C also supports an unswitching API. (Read about unswitching up above.) You start by calling a get_algorithm function, and then (within the loop) one of the do_algorithm functions depending on the result. There are three do_algorithm functions for unsigned types, and five for signed types.

To write the above function with unswitching:

int sum_of_quotients5(const int *numers, size_t count, int d) {
    int result = 0;
    struct libdivide_s32_t fast_d = libdivide_s32_gen(d);
    int algo = libdivide_s32_get_algorithm(&fast_d);
    switch (algo) {
        case 0:
            for (size_t i=0; i < count; i++)
                result += libdivide_s32_do_alg0(numers[i], &fast_d);
            break;
        case 1:
            for (size_t i=0; i < count; i++)
                result += libdivide_s32_do_alg1(numers[i], &fast_d);
            break;
        case 2:
            for (size_t i=0; i < count; i++)
                result += libdivide_s32_do_alg2(numers[i], &fast_d);
            break;
        case 3:
            for (size_t i=0; i < count; i++)
                result += libdivide_s32_do_alg3(numers[i], &fast_d);
            break;
        case 4:
            for (size_t i=0; i < count; i++)
                result += libdivide_s32_do_alg4(numers[i], &fast_d);
            break;
    }
    return result;
}

As you can see, this manual unswitching is laborious, and any gains are generally modest.

The full unswitching API is long. Click here to see it all.


Preprocessor defines

libdivide's behavior can be tweaked by a few preprocessor macros: