Mathematica Asked by b3m2a1 on December 10, 2020
I’ve been playing with a little compiled Ising model implementation. It’s brutally slow. Decomposing things, I’ve found that the primary culprit appears to be Mod. Take this trivial function:
modProper =
Compile[
{
{initCoords, _Integer, 2},
{tSteps, _Integer}
},
Module[
{
coordMat = initCoords,
modMat,
gridNRows,
gridNCols
},
gridNRows = Length@coordMat;
gridNCols = Length@coordMat[[1]];
Do[
modMat =
Table[
Mod[i - 1, gridNRows, 1],
{i, gridNRows},
{j, gridNCols}
];,
{n, tSteps}
];
coordMat
],
RuntimeOptions ->
{
"CatchMachineOverflow" -> False,
"CatchMachineIntegerOverflow" -> False,
"CompareWithTolerance" -> False,
"EvaluateSymbolically" -> False
},
CompilationTarget -> "C",
Parallelization -> True
];
Taking a 100×100 grid, it takes a full 0.0002 seconds per iteration:
isingGrid = RandomChoice[{-1, 1}, {100, 100}];
AbsoluteTiming[modProper[isingGrid, 5000]][[1]]/5000
0.000206102
This seems fast until you realize that you’re only doing 10000 Mod calculations per iteration. It’s even a little bit slower than the uncompiled, vectorized version:
With[{range = Range[10000] - 10},
AbsoluteTiming[Mod[range, 5000, 1]][[1]]
]
0.000163
Is there a reason it should be this slow?
Regarding your original code (this does not answer the question of why Mod is that slow; it just tries to get rid of it), here is my version of it, two generations later. We already identified Mod as the bottleneck, so let's get rid of it by precomputing the indices and submitting them to the compiled function as additional arguments:
isingProper5 = Compile[{{initCoords, _Integer, 2}, {tSteps, _Integer}, {storeTraj, _Integer},
{up, _Integer, 1}, {down, _Integer, 1},
{left, _Integer, 1}, {right, _Integer, 1}
},
Module[{totalCoords, realTraj, storedSteps, storedStepPointer = 1,
coordMat = initCoords,
peMat, gridNRows, gridNCols, giL, giM, gjL, gjM, c},
(*number of steps to be used in the trajectory*)
realTraj = Min@{Max@{storeTraj, 1}, tSteps};
(*steps to be cached*)
storedSteps =
Table[Ceiling[tSteps*i/realTraj], {i, Min@{Max@{realTraj, 1}, tSteps}}];
(*preallocated trajectory block*)
totalCoords = Table[initCoords, {i, realTraj}];
gridNRows = Length@initCoords;
gridNCols = Dimensions[initCoords][[2]];
Do[
peMat =(*calculate PE on the grid from nearest neighbors*)
Table[
giL = Compile`GetElement[up, i];
giM = Compile`GetElement[down, i];
Table[
gjL = Compile`GetElement[left, j];
gjM = Compile`GetElement[right, j];
Times[
- Compile`GetElement[coordMat, i, j],
Plus[
Compile`GetElement[coordMat, giL, j],
Compile`GetElement[coordMat, giM, j],
Compile`GetElement[coordMat, i, gjL],
Compile`GetElement[coordMat, i, gjM]
]
], {j, gridNCols}]
, {i, gridNRows}
];
If[n == storedSteps[[storedStepPointer]],
totalCoords[[storedStepPointer]] = coordMat;
storedStepPointer++];
, {n, tSteps}
];
totalCoords
],
RuntimeOptions -> "Speed",
CompilationTarget -> "C"
];
Here is a timing example:
m = n = 100;
isingGrid = Developer`ToPackedArray[RandomChoice[{-1, 1}, {m, n}]];
up = RotateLeft[Range[n]];
down = RotateRight[Range[n]];
left = RotateLeft[Range[m]];
right = RotateRight[Range[m]];
t5 = AbsoluteTiming[
isingProper5[isingGrid, 1000, 1, up, down, left, right];
][[1]]/1000
0.000020027
Compared to the original function, this is about 36 times as fast:
t3 = AbsoluteTiming[
isingProper3[isingGrid, 1000, 1]
][[1]]/1000
t3/t5
0.00105735
36.2294
So far, this code is also not parallelized. You have to cast it into a Listable function in order to profit from Parallelization -> True.
I am also quite surprised by the slowness of Mod. Usually, we are advised to replace as many memory-bound tasks as possible with CPU-bound tasks on modern hardware. In this case, the opposite was more performant.
Disclaimer
I haven't checked the result for correctness yet.
Thinking about it even more, I wonder whether the Do-loop in the compiled function gets optimized away by the compiler. If I replace the Do by Table and make it the return value, execution takes way longer, and I doubt that this is solely due to memory allocation...
Moreover, the following isolates a single iteration from the Do-loop.
cg = Compile[{{a, _Integer, 2}, {up, _Integer, 1}, {down, _Integer, 1},
    {left, _Integer, 1}, {right, _Integer, 1}},
   Table[
    Times[
     -Compile`GetElement[a, i, j],
     Plus[
      Compile`GetElement[a, Compile`GetElement[up, i], j],
      Compile`GetElement[a, Compile`GetElement[down, i], j],
      Compile`GetElement[a, i, Compile`GetElement[left, j]],
      Compile`GetElement[a, i, Compile`GetElement[right, j]]
      ]
     ], {i, Dimensions[a][[1]]}, {j, Dimensions[a][[2]]}],
   RuntimeOptions -> "Speed", CompilationTarget -> "C"];
Its runtime is approximately twice the average runtime of an iteration in isingProper5.
tcg = cg[isingGrid, up, down, left, right]; // RepeatedTiming // First
0.000038
I am not sure what the reason is. It cannot be calling overhead alone, because the factor of 2 remains also for m = n = 1000 and m = n = 2000.
Answered by Henrik Schumacher on December 10, 2020
The analysis below may be totally off base, but at least this post will give you some tools to do the analysis yourself.
Let's take a look at what actually happens. For this, we'll wrap gcc with our own script, which will save the source code and gcc arguments for further examination. Here's what we do from Mathematica:
$CCompiler = {"Name" -> "GCC",
"Compiler" -> CCompilerDriver`GCCCompiler`GCCCompiler,
"CompilerInstallation" -> "/tmp", "CompilerName" -> Automatic};
If you're not on Linux, you may want to change this appropriately. The idea is to take the first element of the list returned by CCompilers[] and change the "CompilerInstallation" option to point to the wrapper instead of the default path (which was "/usr/bin" in my case).
And here's the wrapper script, which we put at /tmp/gcc (don't forget to chmod +x it):
#!/bin/sh
echo "$*" > /tmp/mma-gcc.log
i=0
for opt in "$@"; do
[ -f "$opt" ] && cp "$opt" /tmp/src-$i.c
i=$((i+1))
done
exec gcc "$@"
After we run your Mathematica Compile[...] code, we get /tmp/src-9.c or a similarly-named file, which is our source code. To further experiment with compilation options, you can start from the arguments given to gcc, which are saved in /tmp/mma-gcc.log.
Now let's look inside the C source file we've recovered. The Do loop in your Mathematica code corresponds to the following C code (indented by me, // comments also mine):
lab8: // do loop
I0_7 = I0_1;
I0_8 = I0_3;
I0_9 = I0_5;
dims[0] = I0_7;
dims[1] = I0_8;
err = funStructCompile->MTensor_allocate(T0_2, 2, 2, dims);
if( err)
{
goto error_label;
}
P3 = MTensor_getIntegerDataMacro(*T0_2);
I0_10 = I0_5;
goto lab25;
lab14: // loop over i
I0_12 = I0_5;
goto lab24;
lab16: // loop over j
I0_13 = I0_10 + I0_11;
I0_16 = I0_1;
I0_17 = I0_13 + I0_11;
{
mint S0 = FP1((void*) (&I0_18), (void*) (&I0_17),
(void*) (&I0_16), 1, UnitIncrements, 0);/* Quotient */
err = S0 == 0 ? 0 : LIBRARY_NUMERICAL_ERROR;
if( err)
{
goto error_label;
}
}
I0_17 = I0_16 * I0_18;
I0_18 = -I0_17;
I0_17 = I0_13 + I0_18;
P3[I0_9++] = I0_17;
lab24:
if( ++I0_12 <= I0_8)
{
goto lab16; // loop over j
}
lab25:
if( ++I0_10 <= I0_7)
{
goto lab14; // loop over i
}
lab26:
if( ++I0_6 <= I0_4)
{
goto lab8; // do loop
}
The first thing that catches the eye is that the code is not vectorized in any obvious way. Moreover, it's not even auto-vectorizer-friendly, since the goto construct is too flexible for the compiler to reliably prove that it actually represents a loop, and to detect what type of loop it is. Another obstacle to auto-vectorization and many other optimizations is the call to the Quotient function via the pointer FP1, declared as
static BinaryMathFunctionPointer FP1;
So it's not really surprising that this is no faster than ordinary Mathematica code. And maybe Mathematica also does some vectorization in its normal evaluation, which happens to be faster on your CPU.
Answered by Ruslan on December 10, 2020
This is only an observation, as I'm not so sure we can call Mod slow. The question is: how do we choose a baseline? Consider this large matrix
data = RandomInteger[1000, {2000, 2000}];
One of the simplest operations is probably adding an integer to each element. So let's time this
AbsoluteTiming[Do[data + 1, {20}]][[1]]/20
(* 0.00219195 *)
This seems to indicate that 2 ms is an appropriate lower bound. How long does Mod take?
AbsoluteTiming[Do[Mod[data, 5, 1], {20}]][[1]]/20
(* 0.0117109 *)
So it takes about 5x the time of an integer addition. Is this slowdown justified? In my opinion, this is expected. An equivalent definition of Mod[m,n,1] in C looks like this
m - floor((m-1.0)/n)*n
We have several things to do: we need to cast to float/double and make an expensive division. I guess floor is pretty cheap. To simplify the multiplication by n, we could cast back to an integer, but nevertheless, it seems obvious that there is much more work to do than in a simple addition.
I have tried several compiled versions of low-level mod implementations, including the above C code in a custom LibraryLink function. I was not able to match the plain speed of Mod. There is too much overhead. What I haven't tried is using a better parallelization than that of Compile. You could theoretically write a LibraryLink function that takes a matrix and calculates the mod in parallel. However, whether this would be spoiled by the overhead of transferring the data remains to be seen.
Answered by halirutan on December 10, 2020
I looked at your original code and tried it with the newer FunctionCompile function.
Here is your code:
modProper1 =
Compile[{{initCoords, _Integer, 2}, {tSteps, _Integer}},
Module[{coordMat = initCoords, modMat, gridNRows, gridNCols},
gridNRows = Length@coordMat;
gridNCols = Length@coordMat[[1]];
Do[modMat =
Table[Mod[i - 1, gridNRows, 1], {i, gridNRows}, {j,
gridNCols}];, {n, tSteps}];
coordMat],
RuntimeOptions -> {"CatchMachineOverflow" -> False,
"CatchMachineIntegerOverflow" -> False,
"CompareWithTolerance" -> False, "EvaluateSymbolically" -> False},
CompilationTarget -> "C", Parallelization -> True]
And its timing:
isingGrid = RandomChoice[{-1, 1}, {200, 200}];
{timing1, result1} = AbsoluteTiming[modProper1[isingGrid, 5000]];
On my machine, timing1 gives values like 3.2443 (sometimes better, sometimes worse).
Here is my best attempt with FunctionCompile:
modProper2 = FunctionCompile[
Function[{
Typed[initCoords,
TypeSpecifier["PackedArray"]["MachineInteger", 2]],
Typed[tSteps, "MachineInteger"]
},
Native`UncheckedBlock@
Module[{coordMat = initCoords, modMat, gridNRows, gridNCols},
gridNRows = Length@coordMat;
gridNCols = Length@coordMat[[1]];
Do[modMat =
Table[Mod[i - 1, gridNRows, 1], {i, gridNRows}, {j,
gridNCols}], {n, tSteps}];
coordMat
]], "CompilerOptions" -> {"LLVMOptimization" ->
"ClangOptimization"[3], "AbortHandling" -> False}]
When I run this:
{timing2, result2} = AbsoluteTiming[modProper2[isingGrid, 5000]];
The result for timing2 gives values like 0.0750443 (so about 40x faster).
The results are the same:
In[.]:= result2 == result1
Out[.]= True
It uses two features that I believe are not documented (yet):
Native`UncheckedBlock -- this turns off error checking
"CompilerOptions" -- turning off the abort handlers also speeds things up (a lot)
Answered by Arnoud Buzing on December 10, 2020