Mathematica Asked on December 28, 2021
After reading about how to avoid CopyTensor in various cases, I still do not understand the optimal solution for the fast Hadamard transform below (CopyTensor appears in line 4 of the code):
FHT = Compile[{{vec, _Complex, 1}},
Module[{i, j, x, y, h, state, len, num},
h = 1;
state = vec;
len = Length[state];
num = (Sqrt[2]^Log[2, len]);
While[h < len,
For[i = 1, i <= len - 1, i = i + 2*h,
For[j = i, j <= i + h - 1, j++,
x = state[[j]];
y = state[[j + h]];
state[[j]] = x + y;
state[[j + h]] = x - y;
];
];
h *= 2;
];
state/num
], RuntimeOptions -> {"CatchMachineUnderflow" -> False,
"CatchMachineOverflow" -> False,
"CatchMachineIntegerOverflow" -> False,
"CompareWithTolerance" -> False, "EvaluateSymbolically" -> False,
"RuntimeErrorHandler" -> False, "WarningMessages" -> False},
CompilationOptions -> {"ExpressionOptimization" -> True,
"InlineExternalDefinitions" -> True}, "CompilationTarget" -> "C",
RuntimeAttributes -> {Listable}, Parallelization -> True
]
I know that Mathematica has a pre-compiled function, but it is much slower than my example. The only problem I have is that it is not clear how to pass the array by reference. Is there an easy, efficient way to do that? I am interested in transforming arrays of $2^{24}$–$2^{30}$ elements.
Since it was mentioned in the comments, this is how the algorithm compares against the built-in one:
L = 24;
state = Normalize[Table[RandomReal[{-1, 1}], {2^L}]];
AbsoluteTiming[state2 = DiscreteHadamardTransform[state
, Method -> "BitComplement"];]
AbsoluteTiming[state3 = FHT[state];]
Total[Abs[state2 - state3]]
We get
{22.2306, Null}
{1.42747, Null}
-1.75*10^-15 + 0. I
Optimal Solution
The currently optimal solution to the problem is the one given by Henrik Schumacher. In my opinion, a faster transform can be achieved only by a more efficient algorithm or a parallel one. For completeness, here is Henrik's code adapted for a complex argument:
Needs["CCompilerDriver`"];
Module[{name, file, lib}, name = "libFHT";
file = Export[FileNameJoin[{$TemporaryDirectory, name <> ".cpp"}], "
#include \"WolframLibrary.h\"
#include <tgmath.h>
EXTERN_C DLLEXPORT int " <> name <>
"(WolframLibraryData libData, mint Argc, MArgument *Args,
MArgument Res)
{
MTensor vec = MArgument_getMTensor(Args[0]);
mcomplex* v = libData->MTensor_getComplexData(vec);
mint len = libData->MTensor_getDimensions(vec)[0];
mint h = 1;
mreal num = pow(sqrt(2.), -log2((mreal) len) );
mcomplex x,y;
while(h<len)
{
for( mint i = 0; i < len-1; i = i + 2*h)
{
for( mint j = i; j < i+h; j++)
{
x = v[j];
y = v[j+h];
v[j] = {x.ri[0]+y.ri[0],x.ri[1]+y.ri[1]};
v[j+h] = {x.ri[0]-y.ri[0],x.ri[1]-y.ri[1]};
}
}
h = h*2;
}
for( mint k = 0; k<len; k++)
{
v[k] = {v[k].ri[0]*num,v[k].ri[1]*num};
}
return LIBRARY_NO_ERROR;
}", "Text"];
lib = CreateLibrary[{file}, name,
"TargetDirectory" -> $TemporaryDirectory];
Quiet[LibraryFunctionUnload[cf]];
cf = LibraryFunctionLoad[lib, name, {{Complex, 1, "Shared"}}, {"Void"}]
]
This is a slightly faster rewrite of OP's CompiledFunction. It exploits faster read access through Compile`GetElement. It is about twice as fast as OP's original function (which took about 1.51672 seconds on my machine). But this speed-up is mostly due to changing the argument pattern from {{vec, _Complex, 1}} to {{vec, _Real, 1}} (because the former enforces the use of slower complex double arithmetic).
FHT = Compile[{{vec, _Real, 1}},
Module[{i, j, x, y, h, state, len, num},
h = 1;
state = vec;
len = Length[state];
num = (Sqrt[2.]^Log[2, len]);
While[h < len,
For[i = 1, i <= len - 1, i += 2*h,
For[j = i, j <= i + h - 1, j++,
x = Compile`GetElement[state, j];
y = Compile`GetElement[state, j + h];
state[[j]] = x + y;
state[[j + h]] = x - y;
];
];
h *= 2;
];
state/num
],
CompilationTarget -> "C",
RuntimeAttributes -> {Listable},
Parallelization -> True,
RuntimeOptions -> "Speed"
];
In contrast to CompiledFunctions, LibraryFunctions can use shared memory. This is one way to do it:
Needs["CCompilerDriver`"];
Module[{name, file, lib},
name = "libFHT";
file = Export[FileNameJoin[{$TemporaryDirectory, name <> ".cpp"}],
"
#include \"WolframLibrary.h\"
#include <tgmath.h>
EXTERN_C DLLEXPORT int " <> name <>
"(WolframLibraryData libData, mint Argc, MArgument *Args, MArgument Res)
{
MTensor vec = MArgument_getMTensor(Args[0]);
mreal* v = libData->MTensor_getRealData(vec);
mint len = libData->MTensor_getDimensions(vec)[0];
mint h = 1;
mreal num = pow(sqrt(2.), -log2((mreal) len) );
mreal x, y;
while(h<len)
{
for( mint i = 0; i < len-1; i = i + 2*h)
{
for( mint j = i; j < i+h; j++)
{
x = v[j];
y = v[j+h];
v[j] = x+y;
v[j+h] = x-y;
}
}
h = h*2;
}
for( mint k = 0; k<len; k++)
{
v[k] *= num;
}
return LIBRARY_NO_ERROR;
}"
,
"Text"
];
lib = CreateLibrary[{file}, name,
"TargetDirectory" -> $TemporaryDirectory
(*,"ShellCommandFunction"->Print
,"ShellOutputFunction"->Print*)
];
Quiet[LibraryFunctionUnload[cf]];
cf = LibraryFunctionLoad[lib, name, {{Real, 1, "Shared"}}, {"Void"}]
]
Here is a comparison:
L = 24;
state = Normalize[RandomReal[{-1, 1}, {2^L}]];
state3 = FHT[state]; // AbsoluteTiming // First
cf[state]; // AbsoluteTiming // First
Max[Abs[state3 - state]]
0.722481
0.322641
2.1684*10^-19
So one can reduce the computation time by roughly 50% by using a library function in this case. Not that much in view of the additional programming effort, but still something. Note that cf returns "Void" and writes its result through the shared tensor, so after the call state itself holds the transformed data; that is why the last line compares state3 directly against state.
Crucial here is the line

mreal* v = libData->MTensor_getRealData(vec);

which provides the pointer to the array underlying the MTensor vec, and the argument pattern

{{Real, 1, "Shared"}}

in the call to LibraryFunctionLoad.
Answered by Henrik Schumacher on December 28, 2021