
Avoiding CopyTensor of large array in Compiled functions

Mathematica Asked on December 28, 2021

After reading about how to avoid CopyTensor in various cases, I still do not understand the optimal answer for the fast Hadamard transform below (CopyTensor appears in line 4 of the code, the assignment state = vec):

 FHT = Compile[{{vec, _Complex, 1}}, 
  Module[{i, j, x, y, h, state, len, num},
   h = 1;
   state = vec;
   len = Length[state];
   num = (Sqrt[2]^Log[2, len]);
   While[h < len,
    For[i = 1, i <= len - 1, i = i + 2*h,
     For [j = i, j <= i + h - 1, j++,
       x = state[[j]];
       y = state[[j + h]];
       state[[j]] = x + y;
       state[[j + h]] = x - y;
       ];
     ];
    h *= 2;
    ];
   state/num
   ], RuntimeOptions -> {"CatchMachineUnderflow" -> False, 
    "CatchMachineOverflow" -> False, 
    "CatchMachineIntegerOverflow" -> False, 
    "CompareWithTolerance" -> False, "EvaluateSymbolically" -> False, 
    "RuntimeErrorHandler" -> False, "WarningMessages" -> False}, 
  CompilationOptions -> {"ExpressionOptimization" -> True, 
    "InlineExternalDefinitions" -> True}, "CompilationTarget" -> "C", 
  RuntimeAttributes -> {Listable}, Parallelization -> True
  ]
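
For reference, the CopyTensor instruction can be inspected in the compiled instruction listing. A quick check (this snippet uses the standard CompiledFunctionTools` package and is illustrative, not part of the original post):

Needs["CompiledFunctionTools`"]
CompilePrint[FHT] (* the listing contains a CopyTensor instruction for the assignment state = vec *)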

I know that Mathematica has a built-in function (DiscreteHadamardTransform), but it is much slower than my example. The only problem I have is that it is not clear how to pass the array by reference. Is there an easy, efficient way to do that? I am interested in transforming arrays of $2^{24}$ to $2^{30}$ elements.

Since it was mentioned in the comments, this is how the algorithm compares against the built-in one:

L = 24;
state = Normalize[Table[RandomReal[{-1, 1}], {2^L}]];
AbsoluteTiming[state2 = DiscreteHadamardTransform[state, Method -> "BitComplement"];]
AbsoluteTiming[state3 = FHT[state];]
Total[Abs[state2 - state3]]

We get

{22.2306, Null}
{1.42747, Null}
-1.75*10^-15 + 0. I

Optimal Solution

The current optimal solution to the problem is given by Henrik Schumacher. In my opinion, a faster transform can be achieved only by a more efficient algorithm or a parallel one. For completeness, I present Henrik's code adapted for a complex argument:

Needs["CCompilerDriver`"];
Module[{name, file, lib},
 name = "libFHT";
 file = Export[FileNameJoin[{$TemporaryDirectory, name <> ".cpp"}],
   "
    #include \"WolframLibrary.h\"
    #include <tgmath.h> 
    
    EXTERN_C DLLEXPORT int " <> name <> 
    "(WolframLibraryData libData, mint Argc, MArgument *Args, MArgument Res)
    {
        MTensor vec = MArgument_getMTensor(Args[0]);
        mcomplex* v = libData->MTensor_getComplexData(vec);
        mint len = libData->MTensor_getDimensions(vec)[0];
        mint h = 1;
        mreal num = pow(sqrt(2.), -log2((mreal) len) );
        mcomplex x, y;
    
        while(h<len)
        {
            for( mint i = 0; i < len-1; i = i + 2*h)
            {
                for( mint j = i; j < i+h; j++)
                {
                    x = v[j];
                    y = v[j+h];
                    v[j] = {x.ri[0]+y.ri[0], x.ri[1]+y.ri[1]};
                    v[j+h] = {x.ri[0]-y.ri[0], x.ri[1]-y.ri[1]};
                }
            }
            h = h*2;
        }
        
        for( mint k = 0; k<len; k++)
        {
            v[k] = {v[k].ri[0]*num, v[k].ri[1]*num};
        }
        
        return LIBRARY_NO_ERROR;
    }"
   ,
   "Text"
   ];
 lib = CreateLibrary[{file}, name,
   "TargetDirectory" -> $TemporaryDirectory
   ];
 
 Quiet[LibraryFunctionUnload[cf]];
 cf = LibraryFunctionLoad[lib, name, {{Complex, 1, "Shared"}}, {"Void"}]
 ]
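
A quick check of the complex version might look like this (a sketch; the test size L and the names stateC and ref are illustrative, and it assumes the block above compiled and loaded cf successfully). Because of the "Shared" argument, cf overwrites its input:

L = 20;
stateC = Normalize[RandomComplex[{-1 - I, 1 + I}, 2^L]];
ref = FHT[stateC]; (* the compiled function from the question *)
cf[stateC]; (* returns Null and transforms stateC in place *)
Max[Abs[ref - stateC]] (* should be at machine-precision level *)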

One Answer

This is a slightly faster rewrite of the OP's CompiledFunction. It exploits faster read access through Compile`GetElement. It is about twice as fast as the OP's original function (which took about 1.51672 seconds on my machine). But this speed-up is mostly due to changing the argument pattern from {{vec, _Complex, 1}} to {{vec, _Real, 1}} (because the former enforces the use of slower complex double arithmetic).

FHT = Compile[{{vec, _Real, 1}},
   Module[{i, j, x, y, h, state, len, num},
    h = 1;
    state = vec;
    len = Length[state];
    num = (Sqrt[2.]^Log[2, len]);
    While[h < len,
     For[i = 1, i <= len - 1, i += 2*h,
      For[j = i, j <= i + h - 1, j++,
        x = Compile`GetElement[state, j];
        y = Compile`GetElement[state, j + h];
        state[[j]] = x + y;
        state[[j + h]] = x - y;
        ];
      ];
     h *= 2;
     ];
    state/num
    ],
   CompilationTarget -> "C",
   RuntimeAttributes -> {Listable},
   Parallelization -> True,
   RuntimeOptions -> "Speed"
   ];

In contrast to CompiledFunctions, LibraryFunctions can use shared memory. This is one way to do it:

Needs["CCompilerDriver`"];
Module[{name, file, lib},
 name = "libFHT";
 file = Export[FileNameJoin[{$TemporaryDirectory, name <> ".cpp"}],
   "
    #include"WolframLibrary.h"
    #include <tgmath.h> 
    
    EXTERN_C DLLEXPORT int " <> name <> 
    "(WolframLibraryData libData, mint Argc, MArgument *Args, MArgument Res)
    {
        MTensor vec = MArgument_getMTensor(Args[0]);
        mreal* v = libData->MTensor_getRealData(vec);
        mint len = libData->MTensor_getDimensions(vec)[0];
        mint h = 1;
        mreal num = pow(sqrt(2.), -log2((mreal) len) );
        mreal x, y;
    
        while(h<len)
        {
            for( mint i = 0; i < len-1; i = i + 2*h)
            {
                for( mint j = i; j < i+h; j++)
                {
                    x = v[j];
                    y = v[j+h];
                    v[j] = x+y;
                    v[j+h] = x-y;
                }
            }
            h = h*2;
        }
        
        for( mint k = 0; k<len; k++)
        {
            v[k] *= num;
        }
        
        return LIBRARY_NO_ERROR;
    }"
   ,
   "Text"
   ];
 lib = CreateLibrary[{file}, name,
   "TargetDirectory" -> $TemporaryDirectory
   (*,"ShellCommandFunction"[Rule]Print
   ,"ShellOutputFunction"[Rule]Print*)
   ];
 
 Quiet[LibraryFunctionUnload[cf]];
 cf = LibraryFunctionLoad[lib, name, {{Real, 1, "Shared"}}, {"Void"}]
 ]

Here is a comparison:

L = 24;
state = Normalize[RandomReal[{-1, 1}, {2^L}]];
state3 = FHT[state]; // AbsoluteTiming // First
cf[state]; // AbsoluteTiming // First
Max[Abs[state3 - state]]

0.722481

0.322641

2.1684*10^-19

So one can reduce the computation time by roughly 50% by using a library function in this case. Not that much in view of the additional programming effort, but still something.

Crucial here is the line

mreal* v = libData->MTensor_getRealData(vec);

which provides one with the pointer to the array underlying the MTensor vec and the argument pattern

{{Real, 1, "Shared"}}

in the call to LibraryFunctionLoad.
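
A consequence of the "Shared" passing is that the library function writes directly into the caller's array, so the argument is overwritten rather than copied. A small sketch of this side effect (the names v and w are illustrative, not from the original answer):

v = Normalize[RandomReal[{-1, 1}, 2^10]];
w = FHT[v]; (* compiled version: returns a new array and leaves v untouched *)
cf[v]; (* library version: returns Null and overwrites v with the transform *)
Max[Abs[v - w]] (* should be at machine-precision level *)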

Answered by Henrik Schumacher on December 28, 2021
