Faster code Fridays: Understand division and speed of operations

Hello there and welcome back! Faster code Fridays is our weekly series that doesn’t ever fall on a Friday, unless our laziness becomes so strong that it interferes with our disregard for naming conventions. We figure we’ll forgetfully publish one of these things on a Friday at some point. Even a broken clock is right twice a day, eh?

If you’re a first-time visitor, Faster code Fridays highlights code optimization techniques that are useful for embedded systems. Embedded projects often involve time-critical tasks that demand maximum performance and minimum execution time. In fact, good coding practice matters more when working with microcontrollers, because you don’t have four 4 GHz cores and 8 GB of RAM to get you out of trouble. We’ll use Arduino-compatible code for most of our examples, though these techniques are applicable to AVR, PIC, or any number of other platforms.

Math-letes ‘R Us

Last week, we discussed why you should avoid floating point math. This week, we’ll discuss the relative speed of integer math operations. If you search for information on code optimization for microcontrollers or for C/C++, you’ll find lots of outdated articles on how division should be avoided at all costs, or replaced with some complicated bit-shifting wizardry. They will invariably mention that if you really want to kick it old-school and keep up with the hackers that write pure assembly, division workarounds are a must. Someday you may even reach enlightenment and be able to code by feeding raw bits into program memory.
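To give one concrete example of the kind of workaround those articles recommend: when the divisor happens to be a power of two, an unsigned division can be replaced with a right shift. (The variable names below are ours, purely for illustration.)

    unsigned int total = 240;
    unsigned int average      = total / 8;   // ordinary division
    unsigned int averageShift = total >> 3;  // same result for unsigned values; no divide routine needed

For unsigned operands the two lines give identical results; for signed negative values the rounding differs, which is one reason this trick is easy to get wrong.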

However, modern compilers have come a long way and are capable of doing a lot of optimization for you. If you poke around and read some more recent, knowledgeable opinions, many suggest that using division is no big deal, even on resource-constrained systems. So… where does the truth fall?

Testing the speed of math operations using Arduino

We decided to run some simple tests to see how the standard math operations (*, /, +, -) stacked up in terms of execution time.

To do this, we wrote a simple Arduino sketch that performs a single operation on two operands using nested for loops. For the purposes of our experiment, we chose 0-100 for the range of each operand. This is not exactly a robust sample size, but it does establish a range of values and throws in a mix of primes, powers of 2, etc. to keep things interesting.

Our sketch includes two global variables: int x to store calculation results and unsigned long time to track when each test completes. The full sketch looks like this:

int x;
unsigned long time;

void setup ()
{
    Serial.begin(115200);
}

void loop()
{
    time = micros();
    for (int i = 0; i <= 100; i++) {
        for (int j = 0; j <= 100; j++) {
            x = i/j;
        }
    }
    time = micros() - time;
    Serial.print("Completion time: ");
    Serial.println(time);
    while (1) {}
}

When run, this code iterates over 0-100 for i and 0-100 for j, dividing each possible combination of the two and writing the result to x. After completing all 10201 operations (101^2), it marks the elapsed time in microseconds and prints it to the serial monitor. Each iteration also includes loop overhead and a memory write, but these are consistent between tests, so the relative comparison stays accurate. By marking elapsed time immediately before and after the loops, we keep the call to setup() and the initialization of our two globals out of the measurement.

The experiment

First, we’ll run the test using division. The code was executed four times with extremely consistent results:

Speedy division on an AVR ATmega328P.

The division test completed in 153236 us each time. That’s a bit more than 0.15 seconds. Not bad.

Now let’s see what happens with multiplication. We made one change to the test code, switching x = i/j; to x = i*j;. This sketch was also run four times:

Multiplication is instantaneous!

Ruh-roh. According to our scientifically rigorous testing, the Arduino managed to finish those 10,201 multiply operations instantaneously. What happened?

We knew this chip was fast, but that’s a bit suspicious. In order to investigate the effect of compiler optimization, we ran the test again with the math operation commented out, like so:

int x;
unsigned long time;

void setup ()
{
    Serial.begin(115200);
}

void loop()
{
    time = micros();
    for (int i = 0; i <= 100; i++) {
        for (int j = 0; j <= 100; j++) {
            //x = i*j;
        }
    }
    time = micros() - time;
    Serial.print("Completion time: ");
    Serial.println(time);
    while (1) {}
}

In this case, the Arduino isn’t doing any math – it’s simply iterating over the nested loops as quickly as possible. In fact, if the compiler is smart, it may be able to determine that nothing is happening in these loops, and optimize them out of the compiled code entirely.
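As a rough sketch of what the optimizer is free to do here (we haven’t inspected the actual generated code), loop() could effectively collapse to the two micros() calls back to back:

    // Sketch only -- not actual avr-gcc output. With no observable work
    // inside the loops, the compiler may reduce loop() to roughly this:
    time = micros();
    time = micros() - time;    // measures almost nothing
    Serial.print("Completion time: ");
    Serial.println(time);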

Uploading and running four times results in this:

Multiplication is instantaneous!

Yep, those are exactly the same results as the multiplication test. It turns out the compiler is aggressively removing extraneous code. It saw that we only write to x inside the loops and never read it afterwards, so it optimized the multiplication away as if we had never written it. Interestingly, it was unable to make the same determination about our division. We’re not experts on avr-gcc, but we’d be curious to learn exactly which rules are at work here.

The experiment in which we get things working

Ok, clearly we need to change something to ensure the Arduino is actually performing all of these calculations. One possibility is to save the results of each calculation, as if we were going to use them later. However, saving 10201 results to an array isn’t practical – at two bytes per int, that’s roughly 20 KB, and the Arduino Uno’s ATmega328P has only 2 KB of SRAM.

Another approach is to stop the compiler from optimizing that section of code. We could declare x as volatile, which tells the compiler that every read and write of x must actually be performed, so the multiplication can no longer be thrown away. But volatile also forces every access to go through RAM rather than a register, which adds overhead of its own to the very thing we’re trying to measure. What else can we do?
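For reference, the volatile version is a one-line change (a minimal sketch of that alternative, not the code we ultimately used):

    volatile int x;    // every write to x must now actually happen
    unsigned long time;

    // ...the rest of the sketch is unchanged; the multiplication in the
    // inner loop can no longer be optimized away, but each assignment
    // now costs an extra RAM access.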

From the treasures of Stack Overflow: it turns out that if we place an empty assembly statement inside our nested loops, the compiler will leave them alone while still treating our variables normally:

    for (int i = 0; i <= 100; i++) {
        for (int j = 0; j <= 100; j++) {
            x = i*j;
            asm("");
        }
    }

Running the division test again, we reach completion in 153236 us, the same as our original case. This makes sense; if the compiler wasn’t optimizing the division code to begin with, adding an empty assembly instruction wouldn’t affect the compiled code at all.

For the multiplication test, each run now completes in 3896 us. Holy difference-maker, Batman. That’s blazing fast. It’s also much more realistic than the first result, suggesting that our compiler trick worked and the loops are no longer being optimized away.

The end result is that multiplication is 97.46% quicker, or alternatively, 39.3x faster than division in this case. Per iteration, that works out to roughly 15 us for division versus about 0.4 us for multiplication, loop overhead included. Interestingly, results for addition and subtraction were exactly the same as multiplication: 3896 us per test.

Implications of slow-as-molasses division

The moral of the story here is that although modern compilers can make clever optimizations in certain cases, they can’t do it all the time. In your particular project, taking a little extra time for division might not be a problem. But if you have a time-critical application, it’s worth investigating whether you could speed things up with workarounds or alternate calculation steps. Normally, we would recommend paying attention to these micro-optimizations only when profiling or debugging points to them as the culprit; the choice of overall program structure and algorithms usually has a much larger effect. Nevertheless, step-through debugging is difficult on the Arduino and similar platforms, so it’s good to be aware of potential bottlenecks ahead of time.
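As one example of the kind of restructuring we mean – a sketch with hypothetical names, not code from our test – a division inside a loop can often be hoisted out so it runs once instead of thousands of times:

    // samples[] and N are assumed to be defined elsewhere (hypothetical data).

    // Dividing every sample as it arrives: N divisions.
    int average = 0;
    for (int i = 0; i < N; i++) {
        average += samples[i] / N;    // also loses precision to truncation
    }

    // Accumulating first and dividing once: a single division and a more
    // accurate result (sum is a long to avoid overflow).
    long sum = 0;
    for (int i = 0; i < N; i++) {
        sum += samples[i];
    }
    int average2 = sum / N;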

Remember that the actual time taken for each type of math operation may vary by compiler or device. Slow division is a common issue in microcontrollers, and the ATmega328 on the Arduino is hit especially hard because it has no hardware divide instruction at all – every division goes through a software routine. Your processor may be better or worse. For more information on how compilers optimize some division, check out this Stack Overflow question.
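One related detail from that discussion: many compilers can replace division by a compile-time constant with a cheaper multiply-and-shift sequence, while a divisor that’s only known at runtime leaves no choice but the full divide routine. A hedged sketch (the variables here are made up for illustration):

    int reading = analogRead(A0);
    int divisor = reading + 1;     // value only known at run time

    int a = reading / 10;          // constant divisor: the compiler may be able
                                   // to substitute a multiply/shift sequence
    int b = reading / divisor;     // runtime divisor: a real division is required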

That wraps things up for today. Do you have an example where speed of calculation affected your project? How about suggestions for future Faster code Fridays? Let us know in the comments. And if you enjoyed this article, be sure to subscribe to our RSS feed or email newsletter.

Update: after several tips from our awesome commenters, we’ve changed our test methodology to produce better results. Thanks to Harley and Robert for the tips!

5 thoughts on “Faster code Fridays: Understand division and speed of operations”

  1. Nice write-up. Thanks!

    Regarding multiple runs: I’d be highly suspicious if multiple runs didn’t have the same result. In a simple environment like this, the code should execute in a highly deterministic manner. It should be exactly the same each time. This is in contrast to a PC environment where there are non-deterministic events happening and you need to make multiple runs to get an average.

    Regarding speed differences: A quick perusal through the AVR opcode summary page shows instructions for Add, Subtract and Multiply. So these are done in hardware. However, no opcode for Division means the compiler is generating code to handle this case. The speed difference shows the relative efficiency of math implemented in hardware vs. software. It probably also explains why the optimizer doesn’t remove the loops when division is done, but does for multiplication.

    • Thanks Harley! Those are great points. We come from an academic background so running multiple trials is in our blood, even when it shouldn’t make a difference :-) . And the hardware vs. software distinction makes a lot of sense.

  2. There is no way the 72 us figure is correct. It sounds like the compiler is still optimizing out one of the loops. With a 16 MHz clock, a multiply takes 0.125 us. Not including overhead for the loops and storing the result, 10201 multiplies would take at least 1275.125 us.

    • Hm, you’re right. We looked around for a better solution and found a couple of different options for preventing optimization. The post is updated with the new results. Thanks for your help!

  3. I tried your example with x as a long, forcing the multiplication to long, changed x= to x+= and started at j=1 to avoid divide by 0. Result time 2732? With the longs changed to int the result is the same as before. Approx 3900. Aside from Integer rollovers that don’t occur, why is long multiplication quicker? The answer seems correct.

    I found this while actually trying to find out if a float multiplication is faster than a long division.

    Anyways, the code is below:

    long x;
    unsigned long time;

    void setup ()
    {
        Serial.begin(115200);
    }

    void loop()
    {
        time = micros();
        for (int i = 0; i <= 100; i++) {
            for (int j = 1; j <= 100; j++) {
                x += (long)i*j;
                asm("");
            }
        }
        time = micros() - time;
        Serial.print("Completion time: ");
        Serial.println(time);
        Serial.print("value: ");
        Serial.println(x);
        while (1) {}
    }
