Sarah S.

Bio

Software developer with hands-on experience in system-level AI deployment, GPU optimization, and embedded systems. Strong analytical thinker and fast self-learner with proven ability to profile and optimize high-performance infrastructure. Skilled in backend development, real-time systems, and AI technologies.

Skills

Python

Kubernetes

Redis

CUDA

PyTorch

vLLM

llm-d

LMCache

Kvrocks

Bootcamp Project

Hyperconverged KV-Cache Offloading for Cost-Efficient LLM Inference

Datacenter-scale LLM inference framework that offloads KV-cache to a hyperconverged KV-store, increasing capacity and robustness while keeping high hit-rates and good user experience.

Mentored by: Pliops

Data Science Bootcamp 2025 (Data)

Responsibilities:

• Contributed to an R&D team developing data-center-scale infrastructure for high-performance LLM inference.
• Deep Exploration of vLLM & llm-d Architecture: Led a comprehensive exploration of vLLM’s internal mechanisms and execution model, alongside an architectural breakdown of llm-d. This included mapping request lifecycles, understanding component interactions, and characterizing the full behavior of the inference pipeline from input to output.
• KV-Cache Optimization in a Hyper-Converged Environment: Engineered a hyper-converged setup for KV-cache offloading, leveraging llm-d and LMCache to streamline cache handling and Kvrocks for persistent key-value storage. Implemented memory- and system-level optimizations that improved cache locality and boosted inference stability and throughput.
• GPU Performance Profiling & Multi-Pipeline Benchmarking: Executed rigorous profiling sessions on advanced GPU clusters, evaluating three different inference pipelines. Identified performance constraints and delivered up to 2× gains in TPS and RPS per user when comparing LMCache-backed (DRAM-cached) runs against baseline vLLM execution paths.
• Kubernetes-Based Deployments (Single-Node & Multi-Node): Implemented both single-node and multi-node deployments of llm-d and LMCache on Kubernetes clusters, enabling reproducible performance testing, streamlined rollout strategies, and consistent behavior across environments.
• Hyper-Converged Routing Simplification & System Streamlining: Redesigned routing behavior by removing storage-aware routing dependencies, enabling a truly hyper-converged inference architecture. This optimization significantly reduced network overhead, simplified data paths, and improved the system’s operational efficiency.

Additional Projects

2025 | Interactive Music & Language Learning Platform | singlezone

• Developed a .NET Core REST API implementing core logic.

• Integrated AWS S3 for storage and the Whisper API for chat, transcription, and translation features.

• Implemented JWT authentication and built data models using EF Core + MySQL.

https://singlezonereact.onrender.com/

2025 | Real-Time Embedded System | Sensor Reading and Response System

• Developed a real-time embedded system using temperature and distance sensors.

• Implemented control logic for a warning LED using GPIO.

• Designed and managed RTOS tasks based on priority.

• Developed in C using STM32CubeIDE.

English Level

Native