There's such a massive performance differential vs. SIMD, though, that I learned to appreciate SIMD (via Highway) as a sweet spot of low-dependency portability sitting between plain C loops and the messy world of GPUs and their fat tree of dependencies.
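To give a flavor of what that looks like, here's a minimal sketch of a dot product in Highway. This is static dispatch only and the function name is mine; real gemma.cpp kernels go through Highway's dynamic-dispatch machinery and do more unrolling, so treat this as a shape-of-the-API sketch rather than their actual code:

    #include "hwy/highway.h"
    namespace hn = hwy::HWY_NAMESPACE;

    // Dot product over float arrays using whatever SIMD width the target has.
    float DotProduct(const float* HWY_RESTRICT a, const float* HWY_RESTRICT b,
                     size_t n) {
      const hn::ScalableTag<float> d;     // widest available float vector
      const size_t lanes = hn::Lanes(d);
      auto acc = hn::Zero(d);
      size_t i = 0;
      for (; i + lanes <= n; i += lanes) {
        // acc += a[i..] * b[i..], one vector at a time
        acc = hn::MulAdd(hn::LoadU(d, a + i), hn::LoadU(d, b + i), acc);
      }
      float sum = hn::GetLane(hn::SumOfLanes(d, acc));
      for (; i < n; ++i) sum += a[i] * b[i];  // scalar tail
      return sum;
    }

The same source compiles to SSE4/AVX2/AVX-512/NEON/SVE without changes, which is exactly the low-dependency portability I mean.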
If anyone wants to learn the basics - whip out your favorite LLM pair programmer and ask it to help you study the kernels in the ops/ library of gemma.cpp:
In particular it's got a single ~600 line file (https://github.com/robitec97/gemma3.c/blob/main/gemma3_kerne...) with a clear, straightforward implementation of every major function used in inference on Google's models, from GELU to RoPE.
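GELU is a good example of how small these kernels are. This is the standard tanh approximation, written from memory rather than copied from that file (the actual gemma3.c version may differ in layout), just to show the shape:

    #include <cmath>
    #include <cstddef>

    // GELU (tanh approximation), applied in place over a vector of activations:
    // gelu(x) = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    void gelu_inplace(float* x, size_t n) {
      const float k = 0.7978845608f;  // sqrt(2/pi)
      for (size_t i = 0; i < n; ++i) {
        float v = x[i];
        x[i] = 0.5f * v * (1.0f + std::tanh(k * (v + 0.044715f * v * v * v)));
      }
    }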
I'm curious how many more functions you'd need to add to have full coverage of every publicly available LLM innovation (e.g. QK-Norm from Qwen3, SwiGLU, etc.).
Obviously llama.cpp has a much bigger library but it's lovely to see everything in one clean file.
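For a sense of how small some of those additions would be, SwiGLU is just an elementwise combination of the FFN's gate and up projections (the surrounding matmuls are omitted, and the function/argument names here are mine, not from any of these codebases):

    #include <cmath>
    #include <cstddef>

    // SwiGLU: silu(gate) * up, elementwise.  silu(g) = g * sigmoid(g).
    void swiglu(const float* gate, const float* up, float* out, size_t n) {
      for (size_t i = 0; i < n; ++i) {
        float g = gate[i];
        float silu = g / (1.0f + std::exp(-g));
        out[i] = silu * up[i];
      }
    }

QK-Norm is similarly tiny: it's just the existing RMSNorm kernel applied to the query and key vectors before attention.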
Did we need any proof of that?
When I got started, I was led to ollama and other local-LLM freemium tools.
I didn't necessarily assume that they weren't C++ (I don't even know), but I do think that, as implied, Python duct-tape solutions are more popular than llama.cpp.
I know very little about AI, but these are the things that come to mind for me here.
Umm, we do. It's still one of the best for EU-country support/help chatbot use cases. It's got good (best?) multilingual support out of the box, it's very "safe" (won't swear, won't display Chinese characters, etc.), and it's pretty fast.