Would you like to make your CUDA code run faster on any GPU? Are you tired of manually searching for the best-performing thread block sizes, only to have to search all over again on the next GPU? How do you know whether that one optimization still improves performance on a different GPU? Stop worrying and start auto-tuning! We'll introduce some of the core concepts of GPU code optimization and explain how you can automatically optimize your GPU code using Python. Using a real-world example of a GPU application from localization microscopy, we take a step-by-step approach to creating an optimized CUDA kernel and auto-tuning it for the latest NVIDIA A100 GPU. Using Jupyter Notebooks, we analyze the auto-tuning search space and gain a deeper understanding of our kernel's performance and the combined effect of different optimizations.
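
To make the idea concrete, the core of auto-tuning is a search over a space of tunable parameters (such as thread block dimensions), benchmarking each configuration and keeping the fastest. The sketch below illustrates that loop in pure Python; the `mock_kernel_time` function is a hypothetical stand-in for actually compiling and timing a CUDA kernel, which is what a real auto-tuning tool would do on the GPU:

```python
import itertools

def auto_tune(configs, benchmark):
    """Benchmark every configuration and return the fastest one.

    configs   -- iterable of candidate parameter tuples
    benchmark -- callable returning a runtime (lower is better)
    """
    timings = {cfg: benchmark(cfg) for cfg in configs}
    best = min(timings, key=timings.get)
    return best, timings

def mock_kernel_time(cfg):
    """Hypothetical stand-in for timing a real kernel launch.

    A synthetic model in which configurations with 128 threads
    per block are fastest; a real tuner would measure the GPU.
    """
    block_x, block_y = cfg
    return abs(block_x * block_y - 128) + 1.0

# Search space: all combinations of candidate block dimensions.
search_space = list(itertools.product([16, 32, 64, 128], [1, 2, 4, 8]))
best_cfg, timings = auto_tune(search_space, mock_kernel_time)
print("fastest configuration:", best_cfg)
```

In practice the search space grows quickly as parameters are added, which is why dedicated auto-tuning tools offer smarter search strategies than the exhaustive sweep shown here.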