
Sri Harshavardhan Reddy Deverapalli

GPU Kernel & Performance Engineer | CUDA · CUTLASS · Tensor Cores · HPC

I build and profile CUDA/CUTLASS kernels for tensor, sparse, and AI workloads. My MS thesis developed an H100 block-sparse tensor contraction pipeline that achieved a 3.01× mean / 6.06× max speedup over cuTENSOR 2.5.0 and ~95% of H100 FP64 Tensor Core peak.

Technical Focus

  • CUDA kernel profiling studies: GEMM, WMMA Tensor Core GEMM, reductions, softmax, and FlashAttention-lite
  • GPU performance analysis: Nsight Compute, Nsight Systems, roofline modeling, occupancy, memory bandwidth, and register-pressure tuning (a roofline sketch follows this list)
  • Research: high-performance block-sparse tensor contractions for quantum many-body simulation on NVIDIA H100
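
Roofline modeling here comes down to comparing a kernel's arithmetic intensity (FLOPs per byte moved) against the machine's compute-to-bandwidth ridge point. A minimal back-of-envelope sketch, with placeholder peak figures rather than measured H100 numbers:

```cuda
// Back-of-envelope roofline check for an M x N x K FP32 GEMM.
// The peak-FLOP and bandwidth figures are placeholder assumptions,
// not measured H100 numbers.
#include <cstdio>

int main() {
    const double M = 4096, N = 4096, K = 4096;
    const double flops = 2.0 * M * N * K;                // one FMA = 2 FLOPs
    const double bytes = 4.0 * (M * K + K * N + M * N);  // ideal traffic: read A, B; write C (FP32)
    const double ai    = flops / bytes;                  // arithmetic intensity, FLOP/byte

    const double peak_flops = 60e12;                     // assumed compute roof, FLOP/s
    const double peak_bw    = 3e12;                      // assumed memory roof, B/s
    const double ridge      = peak_flops / peak_bw;      // machine balance, FLOP/byte

    printf("AI = %.1f FLOP/byte, ridge = %.1f -> %s-bound\n",
           ai, ridge, ai < ridge ? "memory" : "compute");
    return 0;
}
```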

LinkedIn · Google Scholar · Email

Pinned

  1. cuda-kernel-profiling-studies

    Nsight-driven CUDA kernel profiling studies for GEMM, Tensor Core GEMM, reductions, softmax, and attention against vendor baselines.

  2. 01_tiled_gemm

    Profile-driven FP32 CUDA GEMM optimization: naive → tiled → coalesced → register-blocked → bank-padded, benchmarked against cuBLAS. A minimal tiled-GEMM sketch follows below.

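    For orientation, here is a minimal shared-memory tiled FP32 GEMM in the spirit of the repo's middle stages (illustrative only, assuming square row-major matrices with N a multiple of the tile size; the register-blocked and bank-padded variants build on this pattern):

```cuda
// C = A * B, row-major FP32. Launch: grid(N/TILE, N/TILE), block(TILE, TILE).
#define TILE 32

__global__ void tiled_gemm(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Coalesced loads of one A tile and one B tile into shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```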

  3. 02_mixed_precision_gemm

    WMMA FP16→FP32 Tensor Core GEMM with shared-memory tiling and cp.async-style pipelining, benchmarked against cuBLAS. A warp-level WMMA sketch follows below.

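    A warp-level sketch of the kind of 16×16×16 FP16→FP32 WMMA tile multiply involved (illustrative; the shared-memory staging and cp.async-style pipeline are omitted, and the fragment layouts here are assumptions, not the repo's exact kernel):

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a 16x16 FP32 tile of C from a 16xK row-major A
// and a Kx16 col-major B, accumulating over K in 16-wide steps.
__global__ void wmma_tile(const half* A, const half* B, float* C, int K, int ldc) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a, A + k, K);   // next 16 columns of A
        wmma::load_matrix_sync(b, B + k, K);   // next 16 rows of B
        wmma::mma_sync(acc, a, b, acc);        // Tensor Core 16x16x16 MMA
    }
    wmma::store_matrix_sync(C, acc, ldc, wmma::mem_row_major);
}
```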

  4. 04_fused_softmax

    Row-wise CUDA softmax kernels: shared-memory reduction, warp-shuffle reduction, and online softmax, benchmarked against cuDNN. A warp-shuffle sketch follows below.

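    A minimal warp-shuffle variant, assuming one 32-thread warp per row and the numerically stable max-subtract form (illustrative, not the repo's exact kernels):

```cuda
#include <math.h>

__inline__ __device__ float warp_max(float v) {
    for (int o = 16; o > 0; o >>= 1)
        v = fmaxf(v, __shfl_down_sync(0xffffffff, v, o));
    return __shfl_sync(0xffffffff, v, 0);   // broadcast lane 0's result
}

__inline__ __device__ float warp_sum(float v) {
    for (int o = 16; o > 0; o >>= 1)
        v += __shfl_down_sync(0xffffffff, v, o);
    return __shfl_sync(0xffffffff, v, 0);
}

// Launch: one warp per row, i.e. <<<rows, 32>>>.
__global__ void row_softmax(const float* x, float* y, int N) {
    const float* row = x + blockIdx.x * N;
    float*       out = y + blockIdx.x * N;

    float m = -INFINITY;                    // row max, for numerical stability
    for (int i = threadIdx.x; i < N; i += 32) m = fmaxf(m, row[i]);
    m = warp_max(m);

    float s = 0.0f;                         // normalizer: sum(exp(x - m))
    for (int i = threadIdx.x; i < N; i += 32) s += expf(row[i] - m);
    s = warp_sum(s);

    for (int i = threadIdx.x; i < N; i += 32) out[i] = expf(row[i] - m) / s;
}
```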

  5. 05_flash_attention_lite

    Single-head CUDA attention kernel: naive SDPA → fused softmax → occupancy-tuned variants, benchmarked against cuDNN SDPA with Nsight Compute profiling. A sketch of the online-softmax update follows below.

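    The fused variants hinge on the online-softmax recurrence: as score tiles stream in, a running max, normalizer, and output accumulator are rescaled in place, so the full attention matrix never materializes. A scalar sketch of that update for one query element (illustrative):

```cuda
// s: incoming attention score; v: the matching value element.
// m: running max; l: running normalizer; o: unnormalized output
// accumulator. The final output is o / l once all tiles are seen.
__device__ void online_softmax_update(float s, float v,
                                      float& m, float& l, float& o) {
    float m_new = fmaxf(m, s);
    float scale = __expf(m - m_new);   // rescale old statistics to the new max
    float p     = __expf(s - m_new);   // weight of the new score
    l = l * scale + p;
    o = o * scale + p * v;
    m = m_new;
}
```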

  6. fft_final

    Sparse binary 2D FFT on CUDA/cuFFT with memory-footprint optimization, streaming tiles, Hermitian symmetry, and Nsight analysis. A cuFFT R2C sketch follows below.

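    A minimal cuFFT real-to-complex sketch showing the Hermitian-symmetry saving: an R2C transform stores only the non-redundant half-spectrum, NX × (NY/2 + 1) complex values instead of NX × NY (illustrative; error checking and the streaming-tile logic are omitted):

```cuda
#include <cufft.h>

// 2D forward FFT of real device input; d_out must hold NX * (NY/2 + 1) values.
void fft2d_r2c(float* d_in, cufftComplex* d_out, int NX, int NY) {
    cufftHandle plan;
    cufftPlan2d(&plan, NX, NY, CUFFT_R2C);  // real-to-complex plan
    cufftExecR2C(plan, d_in, d_out);        // half-spectrum output
    cufftDestroy(plan);
}
```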