cuda - Is any optimization done if one runs the same kernel with the same input again and again?


If I run the same kernel with the same input several times, like this:

#define n 2000
for (int i = 0; i < n; i++) {
    mykernel<<<1,120>>>(...);
}

what happens? I timed it and played around with n: halving n (to 1000) halved the time it took.

Yet I'm a bit cautious to believe it really runs the kernel 2000 times, because the speed-up over the non-CUDA code is so dramatic (~900 sec vs ~0.9 sec). Does CUDA do some kind of optimization in this case, such as caching the results?

Setting CUDA_LAUNCH_BLOCKING=1 didn't change anything.

mykernel replaces the inner loop of the non-CUDA code.

Hardware: GeForce GTX 260

CUDA doesn't do any optimization of this kind, nor does it cache results. If you launch 2000 kernels, it runs 2000 kernels.

However, kernel launches are asynchronous, so measuring the time it takes to launch 2000 kernel instances in a loop is not the same as measuring the total execution time of those 2000 kernel instances. What you are seeing is an artifact of incorrect time measurement, not a true speed-up.
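For what it's worth, here is a minimal sketch of how the loop could be timed so that the measurement covers the kernels' actual execution rather than just the launch overhead, using CUDA events and an explicit synchronization. The dummy mykernel and its float* argument are assumptions for illustration; the question does not show the real kernel.

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real mykernel (not shown in the question).
__global__ void mykernel(float *data)
{
    int i = threadIdx.x;
    data[i] *= 2.0f;
}

int main()
{
    const int n = 2000;
    float *d_data;
    cudaMalloc(&d_data, 120 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < n; i++) {
        mykernel<<<1,120>>>(d_data);   // launches are only queued here, asynchronously
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);        // wait until all queued kernels have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Total GPU time for %d launches: %f ms\n", n, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}

Without the cudaEventSynchronize (or a cudaDeviceSynchronize before stopping a CPU timer), the loop only measures how long it takes to enqueue the launches, which is why halving n appears to halve the time while the overall number still looks implausibly small.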

