Mentored by: Next Silicon
AI-powered optimization engine for next-generation computing

An advanced AI system for optimizing computational workflows and resource allocation. Uses machine learning to predict optimal configurations, reduce latency, and improve efficiency. Features include predictive analytics, automated tuning, and adaptive learning for continuous improvement.
Cohort: Embedded Systems Bootcamp 2025 (Embedded)
Responsibilities:
Implemented SYCL kernels, connected them to the GGML operator system, and integrated them into the llama.cpp graph builder, including validation on real GPU hardware.
Performed kernel-level GPU optimizations, executed controlled before/after benchmarks, and achieved significant speedups, including a 4× performance improvement for ARANGE.
Handled maintainer reviews, refined code design, updated documentation, ensured CI stability, and successfully merged both operators into ggml / llama.cpp.
Developed a mechanism that reduces Attention complexity from O(n²) to O(n·k) by dynamically selecting the Top-K most relevant tokens per query.
Implemented buildsparsekmask, which generates the mask using existing GGML operators only, without modifying model weights or architecture (a standalone sketch of the mask construction follows this list).
Adjusted the design so Sparse-K is computed inside Flash Attention during graph build, following reviewer guidelines for clean, backend-consistent integration.
Ensured the Sparse-K mask is automatically applied in every Attention layer, with no additional per-layer code required.
Ran full Prompt Evaluation and Decode benchmarks, performed profiling and comparisons, and validated a 2.3× speedup with no accuracy degradation.
Since Sparse-K relies exclusively on existing GGML operators, any supported backend (such as CUDA) can execute it naturally. Performed HPC runs to confirm correctness and efficiency.
Downloaded models, performed GGUF conversion and quantization, and added Sparse-K metadata fields so all Sparse-K settings are loaded directly from the model, without environment variables.
...and more contributions not listed here
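For readers unfamiliar with the Top-K masking idea referenced above, here is a minimal, standalone C++ sketch of the mask construction, assuming a dense per-query score matrix held in plain vectors. The Matrix alias, the function name, and the layout are illustrative only; the actual integration builds the equivalent mask from existing GGML operators inside the computation graph.

```cpp
// Minimal sketch of the Sparse-K masking idea: for each query row of the score
// matrix, keep the top_k highest-scoring key positions (additive mask 0) and
// suppress the rest (-inf), so softmax ignores them. Illustrative only; the real
// integration builds this mask from existing GGML operators during graph build.
#include <algorithm>
#include <limits>
#include <numeric>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

Matrix build_sparse_k_mask(const Matrix& scores, int top_k) {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    Matrix mask(scores.size());

    for (size_t iq = 0; iq < scores.size(); ++iq) {
        const auto& row = scores[iq];
        mask[iq].assign(row.size(), neg_inf);          // drop everything by default

        // indices of the top_k largest scores in this row
        std::vector<int> idx(row.size());
        std::iota(idx.begin(), idx.end(), 0);
        const int k = std::min<int>(top_k, (int) idx.size());
        std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                          [&](int a, int b) { return row[a] > row[b]; });

        for (int j = 0; j < k; ++j) {
            mask[iq][idx[j]] = 0.0f;                   // keep: no penalty added
        }
    }
    return mask;                                       // added to scores before softmax
}
```

Because kept positions contribute 0 and dropped positions contribute negative infinity, the mask can simply be added to the attention scores before softmax, which is the additive-mask convention attention kernels already expect.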
Responsibilities:
Learning and Deepening Knowledge in Machine Learning and Deep Learning: In the first phase, I thoroughly studied the fundamentals of ML and DL, including modern architectures, language models, and GPU-based data flows. This provided a strong theoretical and practical foundation for understanding computational graphs in depth and for developing custom, production-grade GPU operators.
Developing and Upstreaming Two GPU Operators (COUNT_EQUAL and MEAN) in SYCL for llama.cpp: I developed two custom GPU operators, COUNT_EQUAL and MEAN, implemented in SYCL for efficient parallel execution (a reduction sketch follows this list). As part of the process, I studied the GGML operator system and the llama.cpp graph builder in depth, analyzed the GPU execution flow and memory model, and integrated the new kernels into the existing runtime with full hardware validation. Both operators were benchmarked and optimized, achieving approximately a 3× performance improvement over the previous implementation. I completed the full upstream contribution process, including code reviews, CI stability fixes, documentation updates, and design adjustments, resulting in both operators being successfully merged into the official llama.cpp project.
End-to-End Integration of Sparse-K Attention into the GGML / llama.cpp Execution Flow: Integrated the Sparse-K attention mechanism directly into the model's execution graph, enabling runtime selection of the most relevant attention regions. This reduced the effective attention complexity from quadratic to linear-in-K, resulting in significantly faster execution on long sequences.
Graph-Level Construction of the Sparse-K Mask Using Existing GGML Primitives: Designed and implemented a dynamic mask-generation pipeline that builds the Sparse-K mask entirely from existing GGML operators. This approach preserved full backend compatibility and kept the model architecture unchanged.
Embedding Sparse-K into the Flash Attention Path: Refined the design based on maintainer guidance and integrated Sparse-K directly into the Flash Attention computation path during graph build. This ensured a clean integration aligned with GGML's internal architecture and backend conventions.
Seamless Application Across All Attention Layers: Extended the Sparse-K mechanism so that the generated mask is automatically applied to every attention layer, without needing per-layer adjustments or custom logic.
Comprehensive Benchmarking and Model Validation: Ran full performance analysis, including decode throughput, prompt evaluation, and profiling, and confirmed a 2.3× performance improvement with no degradation in model accuracy.
Backend Execution Validation: Verified correct Sparse-K execution on multiple backends (e.g., CUDA) by leveraging GGML's existing operators, ensuring efficient HPC performance without any backend-specific modifications.
Converting Hugging Face Models to GGUF and Embedding Sparse-K Parameters: I downloaded pretrained models from Hugging Face and converted them to the GGUF format, including proper quantization and preservation of the model structure. During this process, I embedded the Sparse-K configuration parameters directly into the GGUF metadata, allowing all Sparse-K settings to be loaded automatically from the model without relying on environment variables (a metadata-loading sketch follows this list). This integration ensured seamless runtime behavior and full compatibility across backends, enabling efficient HPC execution with Sparse-K fully applied.
...and more contributions not listed here
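As a rough illustration of the kind of kernel COUNT_EQUAL requires, here is a short SYCL sketch of a parallel equality count using a SYCL 2020 reduction. The standalone function, the USM allocation, and the raw-pointer arguments are assumptions made for the sketch; the real ggml-sycl operator works on ggml tensors and the backend's own buffers.

```cpp
// Illustrative SYCL sketch of a COUNT_EQUAL-style reduction: count positions where
// two int32 buffers agree. Standalone USM setup; not the actual ggml-sycl kernel.
#include <sycl/sycl.hpp>
#include <cstdint>

int64_t count_equal(sycl::queue& q, const int32_t* a, const int32_t* b, size_t n) {
    // a and b are assumed to be USM device- or shared-accessible allocations
    int64_t* d_count = sycl::malloc_shared<int64_t>(1, q);
    *d_count = 0;

    q.submit([&](sycl::handler& h) {
        auto sum = sycl::reduction(d_count, sycl::plus<int64_t>());
        h.parallel_for(sycl::range<1>(n), sum, [=](sycl::id<1> i, auto& acc) {
            if (a[i] == b[i]) acc += 1;   // each work-item contributes 0 or 1
        });
    }).wait();

    int64_t result = *d_count;
    sycl::free(d_count, q);
    return result;
}
```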
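To make the GGUF-metadata point concrete, here is a hedged sketch of how a Sparse-K setting could be read back at load time through ggml's GGUF C API. The key name "sparsek.top_k", the helper name, and the fallback logic are assumptions for illustration only; the project's actual key names and loading path are not reproduced here.

```cpp
// Illustrative sketch of reading a Sparse-K setting from GGUF metadata via ggml's
// gguf.h API. The key name "sparsek.top_k" is an assumed placeholder, not the
// actual key used in the project.
#include "gguf.h"
#include <cstdint>

uint32_t load_sparsek_top_k(const char* gguf_path, uint32_t fallback) {
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context* ctx = gguf_init_from_file(gguf_path, params);
    if (!ctx) {
        return fallback;                       // file missing or not a valid GGUF
    }

    uint32_t top_k = fallback;
    const int64_t key_id = gguf_find_key(ctx, "sparsek.top_k");
    if (key_id >= 0) {
        top_k = gguf_get_val_u32(ctx, key_id); // value stored at conversion time
    }

    gguf_free(ctx);
    return top_k;
}
```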
Responsibilities:
Explore core Machine Learning and Deep Learning principles, with a strong emphasis on Large Language Models (LLMs).
Gain proficiency in SYCL, focusing on writing scalable, efficient parallel code that maximizes GPU utilization.
Developed full CPU and SYCL GPU support for new numerical operators (FLOOR, CEIL, ROUND, TRUNC): implemented kernels, wired them into the GGML operator system, and integrated them into the llama.cpp graph builder, with validation on both CPU and GPU hardware (a kernel sketch follows this list). PR-LINK: Implement operators in CPU backend PR-LINK: Implement operators in SYCL backend
Optimized these operators at the kernel level and ran controlled before/after benchmarks, achieving significant speedups in their execution.
Opened and maintained pull requests to upstream ggml/llama.cpp, addressing reviewer feedback, refining code design, updating documentation and tests, ensuring CI stability, and successfully merging the new operators into the main codebase.
Study the Attention mechanism as described in “Attention Is All You Need” (2017).
Study FlashAttention from the paper “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness” (2022).
Implement a SYCL-based FlashAttention backend in llama.cpp, handling integration challenges and design trade-offs to deliver high-performance attention on GPU. The implementation includes efficient matrix multiplication with matrix transposition for better memory access patterns, explicit staging of data in on-chip SRAM, and custom implementations of Softmax and value normalization within the kernel. Profile, benchmark, and refine the FlashAttention path, iteratively tuning it for higher throughput, lower latency, and improved overall efficiency (an online-softmax tiling sketch follows this list). PR-LINK: Implement FLASH_ATTN for SYCL backend
...and more contributions not listed here
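As a pointer to what these unary operators look like on the SYCL side, below is a compact sketch of a FLOOR-style elementwise kernel; CEIL, ROUND, and TRUNC differ only in the math builtin used. The standalone queue-and-pointer signature is an assumption for the sketch rather than the actual ggml-sycl dispatch code.

```cpp
// Illustrative SYCL sketch of a FLOOR-style elementwise kernel: one work-item per
// element, using the built-in sycl::floor. CEIL/ROUND/TRUNC follow the same shape
// with sycl::ceil, sycl::round, and sycl::trunc. Not the exact ggml-sycl code.
#include <sycl/sycl.hpp>

void floor_f32(sycl::queue& q, const float* src, float* dst, size_t n) {
    // src and dst are assumed to be USM device allocations owned by the backend
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        dst[i[0]] = sycl::floor(src[i[0]]);
    }).wait();
}
```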
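The FlashAttention item above hinges on the tiled online-softmax update, so here is a CPU-side C++ sketch of that recurrence for a single query. The tile handling, data layout, and function name are simplifications; the SYCL kernel additionally stages K/V tiles in on-chip memory and parallelizes across queries and heads.

```cpp
// CPU-side sketch of the tiled "online softmax" update used by FlashAttention,
// for a single query vector. Real kernels process K/V in on-chip tiles; here the
// tiles are just index ranges. Illustrative only, not the SYCL kernel itself.
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

std::vector<float> flash_attn_one_query(const std::vector<float>& q,
                                        const std::vector<std::vector<float>>& K,
                                        const std::vector<std::vector<float>>& V,
                                        size_t tile) {
    const size_t d = q.size();
    const float scale = 1.0f / std::sqrt((float) d);

    float m = -std::numeric_limits<float>::infinity(); // running max of scores
    float l = 0.0f;                                    // running softmax denominator
    std::vector<float> acc(d, 0.0f);                   // running weighted sum of V

    for (size_t start = 0; start < K.size(); start += tile) {
        const size_t end = std::min(start + tile, K.size());

        // scores for this tile and the new running max
        std::vector<float> s(end - start);
        float m_new = m;
        for (size_t j = start; j < end; ++j) {
            float dot = 0.0f;
            for (size_t c = 0; c < d; ++c) dot += q[c] * K[j][c];
            s[j - start] = dot * scale;
            m_new = std::max(m_new, s[j - start]);
        }

        // rescale the previous accumulator/denominator to the new max, then add the tile
        const float correction = std::exp(m - m_new);
        l *= correction;
        for (size_t c = 0; c < d; ++c) acc[c] *= correction;
        for (size_t j = start; j < end; ++j) {
            const float w = std::exp(s[j - start] - m_new);
            l += w;
            for (size_t c = 0; c < d; ++c) acc[c] += w * V[j][c];
        }
        m = m_new;
    }

    for (size_t c = 0; c < d; ++c) acc[c] /= l;        // final normalization
    return acc;
}
```

The key invariant is that acc and l are always expressed relative to the current running maximum m, so each new tile only needs a single rescale by exp(m_old - m_new).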
Responsibilities:
Download an LLM from Hugging Face, convert it into a llama.cpp-compatible format, and run it locally using llama.cpp.
Study foundational Machine Learning and Deep Learning concepts, with an in-depth focus on Large Language Models (LLMs).
Learn SYCL, emphasizing best practices for writing highly parallel and efficient code that fully exploits GPU capabilities.
Implement the CONCAT operator for the SYCL backend in llama.cpp, including effective workload distribution across GPU hardware. PR-LINK: Implement CONCAT for SYCL backend
Implement the PAD_REFLECT_1D operator for the SYCL backend in llama.cpp, including effective workload distribution across GPU hardware (an index-mapping sketch follows this list). PR-LINK: Implement PAD_REFLECT_1D for SYCL backend
Study the Attention mechanism as described in “Attention Is All You Need” (2017).
Study FlashAttention from the paper “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness” (2022).
Implement Flash Attention in llama.cpp for the SYCL backend, addressing development challenges and managing design trade-offs to maximize performance. PR-LINK: Implement FLASH_ATTN for SYCL backend
Measure, benchmark, and optimize the Flash Attention implementation, iterating toward improved throughput and efficiency.
...and more contributions not listed here
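To illustrate the index mapping mentioned in the PAD_REFLECT_1D item, here is a small SYCL sketch for a single contiguous row. The single-row signature and the assumption that both pads are smaller than the row length are simplifications; the real ggml-sycl operator handles full multi-dimensional tensors and their strides.

```cpp
// Illustrative SYCL sketch of reflect padding for one row: each output position
// maps back to a mirrored input position, so the kernel is a pure gather.
// Simplified to a single contiguous row; not the exact ggml-sycl kernel.
#include <sycl/sycl.hpp>
#include <cstdint>

void pad_reflect_1d_row(sycl::queue& q, const float* src, float* dst,
                        int64_t n, int64_t pad_left, int64_t pad_right) {
    const int64_t n_out = n + pad_left + pad_right;      // requires pad_left, pad_right < n
    q.parallel_for(sycl::range<1>((size_t) n_out), [=](sycl::id<1> id) {
        int64_t j = (int64_t) id[0] - pad_left;          // position in the input frame
        if (j < 0)  j = -j;                              // reflect off the left edge
        if (j >= n) j = 2 * (n - 1) - j;                 // reflect off the right edge
        dst[id[0]] = src[j];
    }).wait();
}
```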