Matrix multiply on Adreno GPUs – Part 2: Host code and kernel