
© 2025 ExtraTech Bootcamps. All rights reserved.


Yael S.

GitHub | Blog Post

Bio

Software & Embedded Developer specializing in High-Performance Computing (HPC), GPU acceleration (SYCL), and open-source development. Experienced in C++ and Python, parallel programming, and performance optimization for large-scale models. A strong analytical thinker and fast learner with solid system-level understanding and a proven ability to deliver efficient, high-quality solutions.

Skills

Python
C++
CI/CD
CUDA
PyTorch
SYCL

Bootcamp Project

NextOptAI

AI-powered optimization engine for next-generation computing


Mentored by: Next Silicon

Embedded Systems Bootcamp 2025

Responsibilities:

  • Learning and Deepening Knowledge in Machine Learning and Deep Learning: In the first phase, I thoroughly studied the fundamentals of ML and DL, including modern architectures, language models, and GPU-based data flows. This provided a strong theoretical and practical foundation for understanding computational graphs in depth and for developing custom, production-grade GPU operators.

  • Developing and Upstreaming Two GPU Operators (COUNTEQUAL and MEAN) in SYCL for llama.cpp: I developed two custom GPU operators, COUNTEQUAL and MEAN, implemented in SYCL for efficient parallel execution. As part of the process, I studied the GGML operator system and the llama.cpp graph builder in depth, analyzed the GPU execution flow and memory model, and integrated the new kernels into the existing runtime with full hardware validation. Both operators were benchmarked and optimized, achieving approximately a 3× performance improvement over the previous implementation. I completed the full upstream contribution process, including code reviews, CI stability fixes, documentation updates, and design adjustments, resulting in both operators being merged into the official llama.cpp project.

  • Sparse-K Attention: End-to-End Integration into the GGML / llama.cpp Execution Flow: Integrated the Sparse-K attention mechanism directly into the model’s execution graph, enabling runtime selection of the most relevant attention regions. This reduced the effective attention complexity from quadratic to linear-in-K, resulting in significantly faster execution on long sequences.

  • Graph-Level Construction of the Sparse-K Mask Using Existing GGML Primitives: Designed and implemented a dynamic mask-generation pipeline that builds the Sparse-K mask entirely from existing GGML operators. This approach preserved full backend compatibility and kept the model architecture unchanged.

  • Embedding Sparse-K into the Flash Attention Path: Refined the design based on maintainer guidance and integrated Sparse-K directly into the Flash Attention computation path during graph build. This ensured a clean integration aligned with GGML’s internal architecture and backend conventions.

  • Seamless Application Across All Attention Layers: Extended the Sparse-K mechanism so that the generated mask is automatically applied to every attention layer, with no per-layer adjustments or custom logic.

  • Comprehensive Benchmarking and Model Validation: Ran full performance analysis, including decode throughput, prompt evaluation, and profiling, and confirmed a 2.3× performance improvement with no degradation in model accuracy.

  • Backend Execution Validation: Verified correct Sparse-K execution on multiple backends (e.g., CUDA) by leveraging GGML’s existing operators, ensuring efficient HPC performance without backend-specific modifications.

  • Converting Hugging Face Models to GGUF and Embedding Sparse-K Parameters: I downloaded pretrained models from Hugging Face and converted them into the GGUF format, with proper quantization and preservation of the model structure. During this process, I embedded Sparse-K configuration parameters directly into the GGUF metadata, allowing the model to load all Sparse-K settings automatically without relying on environment variables. This ensured seamless runtime behavior and full compatibility across backends, enabling efficient HPC execution with Sparse-K fully applied.
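
The semantics of the two upstreamed operators can be illustrated with a plain-C++ reference, assuming COUNTEQUAL counts the positions where two equally sized tensors agree and MEAN averages a row of values; the actual implementations are parallel SYCL kernels inside GGML, not this sequential sketch.

```cpp
#include <cstddef>
#include <vector>

// Reference semantics of the COUNTEQUAL operator: count the positions
// at which two equally sized tensors hold the same value.
int count_equal(const std::vector<int>& a, const std::vector<int>& b) {
    int n = 0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
        if (a[i] == b[i]) ++n;
    return n;
}

// Reference semantics of the MEAN operator: arithmetic mean of a row.
float row_mean(const std::vector<float>& row) {
    float sum = 0.0f;
    for (float v : row) sum += v;
    return row.empty() ? 0.0f : sum / static_cast<float>(row.size());
}
```

On the GPU both operations become parallel reductions, which is where the benchmarked speedup comes from.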
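
The Sparse-K idea described above can be sketched in a few lines: each query keeps only its K highest-scoring keys, and the selection is turned into an additive mask (0 for kept keys, negative infinity for masked ones) that is added to the attention scores before softmax. This is an illustrative CPU sketch of the mechanism, not the project's GGML graph-operator implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Sparse-K sketch: each query keeps only its K highest-scoring keys,
// so per-query attention work scales with K rather than sequence length.
std::vector<std::size_t> top_k_keys(const std::vector<float>& scores, std::size_t k) {
    std::vector<std::size_t> idx(scores.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    k = std::min(k, idx.size());
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](std::size_t a, std::size_t b) { return scores[a] > scores[b]; });
    idx.resize(k);
    return idx;
}

// Additive attention mask: 0 for kept keys, -infinity for masked ones.
// A mask built from existing graph operators is applied the same way,
// by adding it to the attention scores before softmax.
std::vector<float> sparse_k_mask(const std::vector<float>& scores, std::size_t k) {
    std::vector<float> mask(scores.size(), -INFINITY);
    for (std::size_t i : top_k_keys(scores, k)) mask[i] = 0.0f;
    return mask;
}
```

Building the mask from existing primitives, as the bullet above describes, is what keeps every backend working without custom kernels.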
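
Embedding the Sparse-K settings in GGUF metadata works because GGUF stores string-keyed key/value pairs alongside the tensors, so the runtime can read configuration at load time instead of from environment variables. The sketch below illustrates that shape with a plain map; the key names are invented for the example and are not the project's real metadata keys.

```cpp
#include <map>
#include <string>

// Hypothetical illustration of Sparse-K settings as string-keyed
// metadata, mirroring GGUF's key/value section. Key names here are
// made up for the sketch, not the real keys used in the project.
std::map<std::string, std::string> sparse_k_metadata(int top_k, int window) {
    return {
        {"sparse_k.top_k", std::to_string(top_k)},
        {"sparse_k.window", std::to_string(window)},
    };
}
```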

Yael S. - Task Preview

Additional Projects

Practicum at ExtraTech (Kamatech) — 03/2025 to 07/2025

During my practicum, I optimized a C++ performance-measurement system and integrated it with Python using pybind11. I implemented parallel execution using threads and pipes to collect real-time metrics, performed benchmarking and unit testing, and focused on advanced algorithms, performance optimization, and efficient code design. This experience provided hands-on exposure to developing production-quality, high-performance software solutions.
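
The pipe-based metric path mentioned above can be sketched as follows: a producer writes a metric sample into a POSIX pipe and a consumer reads it back. In the real system the producer ran on a worker thread; in this sketch both ends live in one function, which is safe because a single sample fits well within the pipe buffer. This is a minimal illustration, not the practicum's actual code.

```cpp
#include <cstdint>
#include <unistd.h>

// Sketch: push one metric sample through a POSIX pipe and read it back,
// mimicking the producer/consumer metric-collection path. Returns the
// sample on success, 0 on any I/O failure.
std::uint64_t roundtrip_metric(std::uint64_t sample) {
    int fds[2];
    if (pipe(fds) != 0) return 0;
    std::uint64_t out = 0;
    if (write(fds[1], &sample, sizeof(sample)) ==
        static_cast<ssize_t>(sizeof(sample))) {
        ssize_t n = read(fds[0], &out, sizeof(out));
        if (n != static_cast<ssize_t>(sizeof(out))) out = 0;
    }
    close(fds[0]);
    close(fds[1]);
    return out;
}
```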


Grocery Store Management System (Academic Project)

Developed a full-stack inventory and order management application using React, Node.js, and MySQL. Built a dynamic React user interface with forms and data views, implemented business logic through an Express REST API, and designed the database schema and queries in MySQL to ensure efficient data handling and smooth user interactions.


Drop of Milk Clinic System (Academic Project)

Created a client-server system for managing appointments and patient records using Angular, C#, and SQL Server. Developed an Angular frontend with routing and reactive forms, implemented the backend with ASP.NET Core controllers and services, and managed data persistence through SQL Server combined with Entity Framework, enabling reliable and maintainable client-server interactions.

English Level

Working Proficiency