TL

Tony Lee

ML Inference Performance Engineering • GPU Serving Systems • CUDA/Triton Optimization

tungsheng@gmail.com linkedin.com/in/tonyslee8

Professional Summary

ML inference performance engineer optimizing GPU-backed LLM serving from CUDA/Triton kernels to vLLM runtime behavior and Kubernetes capacity controls, with recent work on KV-cache limits, latency/cost tradeoffs, autoscaling/admission control, H200 Tensor Core matmul tuning, RMSNorm bandwidth profiling, and decode-step replay.

Selected Projects

GPU Inference Decision Lab (AWS EKS + Karpenter + vLLM)

  • Built an AWS EKS/vLLM inference performance lab that turns KV-cache/context-length behavior, autoscaling, admission control, scheduler/quantization probes, useful-work cost, and failure-mitigation drills into supported, rejected, or pending architecture evidence.
  • Measured burst and 8192/300 long-context behavior: bounded queues preserved 100% delivery near 2s p95 for bursts, 1.20 req/s long-context traffic repeated 62.66-63.40s p95 latency, and FP8 KV was rejected after delivery fell to 47.58-69.12%.

CUDA/Triton GPU Kernel Lab (PyTorch baselines + A10G/H200)

  • Built a reproducible CUDA/Triton benchmarking lab for LLM-shaped GPU primitives, comparing custom Triton kernels against PyTorch/cuBLAS baselines with correctness checks, latency percentiles, bandwidth, TFLOP/s, roofline analysis, and Nsight evidence.
  • Separated supported wins from caveats: RMSNorm fp16 reached 5.901x with 90.91% DRAM throughput, while same-stream dynamic decode replay reached about 0.156 ms p50 / 0.230 ms p95 as a synthetic resident-KV upper bound.
  • Extended the H200 matmul track with standard and persistent-wave Triton schedules; the focused 512x11008x4096 bf16 standard row reached 471.4 TFLOP/s (89.41% of PyTorch/cuBLAS), while persistent waves remained below the standard schedule.

Experience

Staff Software Engineer

DTEX Systems

Sep 2024 - Present

  • Designed distributed platform services across UI, API, data, and infrastructure boundaries, with emphasis on reliability, deployment safety, and cross-service coordination.
  • Improved release stability by containerizing UI services, decoupling shared dependencies, and isolating high-risk deployment surfaces.
  • Led a launch-critical OpenSearch Dashboards migration, resolving deployment and environment issues with AWS/OpenSearch engineers.
  • Built local development and validation automation that cut dev/test iteration latency by 20% and improved onboarding.

Senior Technical Lead

Cisco Systems, Inc.

Aug 2019 - Apr 2024

  • Led architecture for the Onboarding Experience platform, helping enterprise customers complete onboarding in under 30 minutes.
  • Drove system design across frontend, backend, APIs, and integrations, resolving tradeoffs across US, Europe, and India engineering groups.
  • Led AngularJS-to-React modernization to remove security risk and improve maintainability; introduced Cypress end-to-end testing that cut test creation time by 50%.
  • Built CI/CD and test automation improvements that reduced regression risk in high-visibility customer onboarding flows.

Senior Software Engineer

Tico Co., Ltd.

Jan 2019 - Aug 2019

  • Improved message loading performance by 60% through frontend and API optimization and reduced codebase size by 30% through modular refactoring.

Co-founder / CTO

Popup Technology Co., Ltd.

Apr 2016 - Oct 2018

  • Built the core platform from scratch, led a 5-person engineering team, and shipped features that doubled revenue within 12 months.

Skills

ML Infrastructure / Inference

vLLM • LLM Serving • OpenAI-Compatible APIs • Autoscaling • Admission Control • KV Cache / Context Length • Quantization Evaluation • CUDA / Triton Benchmarking • Kernel Fusion • GPU Profiling • GPU Capacity Planning • GPU Scheduling (NVIDIA Device Plugin) • Inference Load Testing (k6)

Leadership

Architecture Design • Cross-Functional Leadership • Technical Mentorship

Infrastructure / Cloud

Terraform • Kubernetes (EKS) • AWS (VPC, IAM, EC2, ALB) • Karpenter • CI/CD Automation • Prometheus / Grafana • DCGM Exporter

Languages

Go • Python • TypeScript • SQL

Backend

Node.js • GraphQL • REST

Education

M. E. Electrical Engineering

University of California, San Diego

Sep 2006 - Jun 2008

B. S. Electrical Engineering

University of Illinois at Urbana-Champaign

Sep 2003 - Dec 2005