TransWikia.com

Mod slow in CompiledFunction

Mathematica Asked by b3m2a1 on December 10, 2020

I’ve been playing with a little compiled Ising model implementation. It’s brutally slow. Decomposing things, I’ve found that the primary culprit appears to be Mod. Take this trivial function:

modProper =
  Compile[
   {
    {initCoords, _Integer, 2},
    {tSteps, _Integer}
    },
   Module[
    {
     coordMat = initCoords,
     modMat,
     gridNRows,
     gridNCols
     },
    gridNRows = Length@coordMat;
    gridNCols = Length@coordMat[[1]];
    Do[
     modMat =
       Table[
        Mod[i - 1, gridNRows, 1],
        {i, gridNRows},
        {j, gridNCols}
        ];,
     {n, tSteps}
     ];
    coordMat
    ],
   RuntimeOptions ->
    {
     "CatchMachineOverflow" -> False,
     "CatchMachineIntegerOverflow" -> False,
     "CompareWithTolerance" -> False,
     "EvaluateSymbolically" -> False
     },
   CompilationTarget -> "C",
   Parallelization -> True
   ];

For a 100×100 grid, it takes a full 0.0002 seconds per iteration:

isingGrid = RandomChoice[{-1, 1}, {100, 100}];
AbsoluteTiming[modProper[isingGrid, 5000]][[1]]/5000

0.000206102

This seems fast until you realize that each iteration only does 10000 Mod calculations. It’s even a little slower than the uncompiled, vectorized version:

With[{range = Range[10000] - 10},
 AbsoluteTiming[Mod[range, 5000, 1]][[1]]
 ]

0.000163

Is there a reason it should be this slow?

4 Answers

Regarding your original code (this does not answer the question of why Mod is so slow; it just tries to get rid of it), here is my version of it, two generations later. We already identified Mod as the bottleneck, so let's get rid of it by precomputing the indices and passing them to the compiled function as additional arguments:

isingProper5 = Compile[{{initCoords, _Integer, 2}, {tSteps, _Integer}, {storeTraj, _Integer},
    {up, _Integer, 1}, {down, _Integer, 1}, 
    {left, _Integer, 1}, {right, _Integer, 1}
    },
   Module[{totalCoords, realTraj, storedSteps, storedStepPointer = 1,
     coordMat = initCoords,
     peMat, gridNRows, gridNCols, giL, giM, gjL, gjM, c},
    (*number of steps to be used in the trajectory*)

    realTraj = Min@{Max@{storeTraj, 1}, tSteps};
    (*steps to be cached*)
    storedSteps = 
     Table[Ceiling[tSteps*i/realTraj], {i, Min@{Max@{realTraj, 1}, tSteps}}];
    (*preallocated trajectory block*)

    totalCoords = Table[initCoords, {i, realTraj}];
    gridNRows = Length@initCoords;
    gridNCols = Dimensions[initCoords][[2]];
    Do[
     peMat =(*calculate PE on the grid from nearest neighbors*)

      Table[
       giL = Compile`GetElement[up, i];
       giM = Compile`GetElement[down, i];
       Table[
        gjL = Compile`GetElement[left, j];
        gjM = Compile`GetElement[right, j];
        Times[
          - Compile`GetElement[coordMat, i, j],
          Plus[
           Compile`GetElement[coordMat, giL, j],
           Compile`GetElement[coordMat, giM, j],
           Compile`GetElement[coordMat, i, gjL],
           Compile`GetElement[coordMat, i, gjM]
           ]
          ], {j, gridNCols}]
       , {i, gridNRows}
       ];
     If[n == storedSteps[[storedStepPointer]],
      totalCoords[[storedStepPointer]] = coordMat;
      storedStepPointer++];
     , {n, tSteps}
     ];
    totalCoords
    ],
   RuntimeOptions -> "Speed",
   CompilationTarget -> "C"
   ];

Here is a timing example:

m = n = 100;
isingGrid = Developer`ToPackedArray[RandomChoice[{-1, 1}, {m, n}]];
up = RotateLeft[Range[m]];    (*up/down index the rows*)
down = RotateRight[Range[m]];
left = RotateLeft[Range[n]];  (*left/right index the columns*)
right = RotateRight[Range[n]];

t5 = AbsoluteTiming[
   isingProper5[isingGrid, 1000, 1, up, down, left, right];
   ][[1]]/1000

0.000020027

Compared to isingProper3, an earlier generation of this function, this is about 36 times as fast:

t3 = AbsoluteTiming[
    isingProper3[isingGrid, 1000, 1]
    ][[1]]/1000

t3/t5

0.00105735

36.2294

So far, this code is also not parallelized. You would have to recast it as a Listable function in order to profit from Parallelization -> True.

I am also quite surprised by the slowness of Mod. Usually, we are advised to replace memory-bound tasks with CPU-bound ones on modern hardware; in this case, the opposite was more performant.

Disclaimer

I haven't checked the result for correctness yet.

Thinking about it even more, I wonder whether the Do loop in the compiled function gets optimized away by the compiler. If I replace the Do by Table and make it the return value, execution takes much longer, and I doubt that this is solely due to memory allocation...

Moreover, the following isolates a single iteration from the Do-loop.

cg = Compile[{{a, _Integer, 2}, {up, _Integer, 1}, {down, _Integer, 
     1}, {left, _Integer, 1}, {right, _Integer, 1}},
   Table[
    Times[
     (- Compile`GetElement[a, i, j]),
     Plus[
      Compile`GetElement[a, Compile`GetElement[up, i], j],
      Compile`GetElement[a, Compile`GetElement[down, i], j],
      Compile`GetElement[a, i, Compile`GetElement[left, j]],
      Compile`GetElement[a, i, Compile`GetElement[right, j]
       ]
      ]
     ], {i, Dimensions[a][[1]]}, {j, Dimensions[a][[2]]}]
   ,
   RuntimeOptions -> "Speed", CompilationTarget -> "C"];

Its runtime is approximately twice the average runtime of a single iteration in isingProper5.

tcg = cg[isingGrid, up, down, left, right]; // RepeatedTiming // First

0.000038

I am not sure what the reason is. It cannot be calling overhead alone, because the factor of 2 persists also for m = n = 1000 and m = n = 2000.

Answered by Henrik Schumacher on December 10, 2020

The analysis below may be totally off base, but at least this post will give you some tools to do the analysis yourself.


Let's take a look at what actually happens. For this, we'll wrap gcc with our own script, which will save the source code and gcc arguments for further examination. Here's what we do from Mathematica:

$CCompiler = {"Name" -> "GCC",
 "Compiler" -> CCompilerDriver`GCCCompiler`GCCCompiler,
 "CompilerInstallation" -> "/tmp", "CompilerName" -> Automatic};

If you're not on Linux, you may want to change this appropriately. The idea is to take the first element of the list returned by CCompilers[] and change the "CompilerInstallation" option to point to the wrapper instead of the default path (which was "/usr/bin" in my case).

And here's the wrapper script, which we put to /tmp/gcc (don't forget to chmod +x it):

#!/bin/sh
echo "$*" > /tmp/mma-gcc.log
i=0
for opt in "$@"; do
    [ -f "$opt" ] && cp "$opt" /tmp/src-$i.c
    i=$((i+1))
done
exec gcc "$@"

After we run the Mathematica code with Compile[..., we get /tmp/src-9.c or a similarly-named file, which is our source code. To experiment further with compilation options, you can start from the arguments given to gcc, which are saved in /tmp/mma-gcc.log.

Now let's look inside the C source file we've recovered. The Do loop in your Mathematica code corresponds to the following C code (indented by me, // comments also mine):

lab8: // do loop
    I0_7 = I0_1;
    I0_8 = I0_3;
    I0_9 = I0_5;
    dims[0] = I0_7;
    dims[1] = I0_8;
    err = funStructCompile->MTensor_allocate(T0_2, 2, 2, dims);
    if( err)
    {
        goto error_label;
    }
    P3 = MTensor_getIntegerDataMacro(*T0_2);
    I0_10 = I0_5;
    goto lab25;
lab14: // loop over i
    I0_12 = I0_5;
    goto lab24;
lab16: // loop over j
    I0_13 = I0_10 + I0_11;
    I0_16 = I0_1;
    I0_17 = I0_13 + I0_11;
    {
        mint S0 = FP1((void*) (&I0_18), (void*) (&I0_17),
                      (void*) (&I0_16), 1, UnitIncrements, 0);/*  Quotient  */
        err = S0 == 0 ? 0 : LIBRARY_NUMERICAL_ERROR;
        if( err)
        {
            goto error_label;
        }
    }
    I0_17 = I0_16 * I0_18;
    I0_18 = -I0_17;
    I0_17 = I0_13 + I0_18;
    P3[I0_9++] = I0_17;
lab24:
    if( ++I0_12 <= I0_8)
    {
        goto lab16; // loop over j
    }
lab25:
    if( ++I0_10 <= I0_7)
    {
        goto lab14; // loop over i
    }
lab26:
    if( ++I0_6 <= I0_4)
    {
        goto lab8; // do loop
    }

The first thing that catches the eye is that the code is not vectorized in any obvious way. Moreover, it's not even auto-vectorizer-friendly, since the goto construct is too flexible for the compiler to reliably prove that it actually represents a loop, and to detect what type of loop it is. Another obstacle to auto-vectorization and many other optimizations is the call to the Quotient function via the pointer FP1, declared as

static BinaryMathFunctionPointer FP1;

So, it's not really surprising that this does not run faster than ordinary Mathematica code. And maybe Mathematica also does some vectorization of its normal code, which happens to be faster on your CPU.

Answered by Ruslan on December 10, 2020

This is only an observation, as I'm not so sure we can call Mod slow. The question is: how do we choose a baseline? Consider this large matrix:

data = RandomInteger[1000, {2000, 2000}];

One of the simplest operations is probably adding an integer to each element. So let's time this

AbsoluteTiming[Do[data + 1, {20}]][[1]]/20
(* 0.00219195 *)

This seems to indicate that about 2 ms is an appropriate lower bound. How long does Mod take?

AbsoluteTiming[Do[Mod[data, 5, 1], {20}]][[1]]/20
(* 0.0117109 *)

So it takes about 5x the time of an integer addition. Is this slow-down justified? In my opinion, this is expected. An equivalent definition of Mod[m,n,1] in C looks like this

m - floor((m-1.0)/n)*n

We have several things to do. We need to cast to float/double and perform an expensive division. I guess floor is pretty cheap. To simplify the multiplication by n, we could cast back to an integer, but nevertheless, it seems obvious that there is much more work to do than in a simple addition.

I have tried several compiled versions of low-level mod implementations, including the above C code in a custom LibraryLink function. I was not able to match the plain speed of Mod; there is too much overhead. What I haven't tried is using better parallelization than that of Compile. You could theoretically write a LibraryLink function that takes a matrix and calculates the mod in parallel. However, whether that would be spoiled by the overhead of transferring the data remains to be proven.

Answered by halirutan on December 10, 2020

I looked at your original code and tried it with the newer FunctionCompile function.

Here is your code:

modProper1 = 
 Compile[{{initCoords, _Integer, 2}, {tSteps, _Integer}}, 
  Module[{coordMat = initCoords, modMat, gridNRows, gridNCols}, 
   gridNRows = Length@coordMat;
   gridNCols = Length@coordMat[[1]];
   Do[modMat = 
      Table[Mod[i - 1, gridNRows, 1], {i, gridNRows}, {j, 
        gridNCols}];, {n, tSteps}];
   coordMat], 
  RuntimeOptions -> {"CatchMachineOverflow" -> False, 
    "CatchMachineIntegerOverflow" -> False, 
    "CompareWithTolerance" -> False, "EvaluateSymbolically" -> False},
   CompilationTarget -> "C", Parallelization -> True]

And its timing:

isingGrid = RandomChoice[{-1, 1}, {200, 200}];

{timing1, result1} = AbsoluteTiming[modProper1[isingGrid, 5000]];

On my machine, timing1 gives values like 3.2443 seconds (sometimes better, sometimes worse).

Here is my best attempt with FunctionCompile:

modProper2 = FunctionCompile[
  Function[{
    Typed[initCoords, 
     TypeSpecifier["PackedArray"]["MachineInteger", 2]],
    Typed[tSteps, "MachineInteger"]
    },
   Native`UncheckedBlock@
    Module[{coordMat = initCoords, modMat, gridNRows, gridNCols},
     gridNRows = Length@coordMat;
     gridNCols = Length@coordMat[[1]];
     Do[modMat = 
       Table[Mod[i - 1, gridNRows, 1], {i, gridNRows}, {j, 
         gridNCols}], {n, tSteps}];
     coordMat
     ]], "CompilerOptions" -> {"LLVMOptimization" -> 
     "ClangOptimization"[3], "AbortHandling" -> False}]

When I run this:

{timing2, result2} = AbsoluteTiming[modProper2[isingGrid, 5000]];

The result for timing2 gives values like 0.0750443, so about 40x faster.

The results are the same:

In[.]:= result2 == result1

Out[.]= True

It uses two features that I believe are not documented (yet):

  1. Native`UncheckedBlock -- this turns off error checking
  2. CompilerOptions -- turning off the abort handlers also speeds things up (a lot)

Answered by Arnoud Buzing on December 10, 2020
