What could work without many changes is to use the native MKL provider. The multiplication is then performed in native code, which the profiler will probably treat as a black-box call. Note that using a native provider significantly changes the performance of the multiplication itself, but that may be OK if you've already shown that it is not the bottleneck.