pinnochio 2 days ago
Apologies for the tangent, but isn't this like saying "sliced tomato featuring BLT sandwich"?
trynumber9 2 days ago
No. It's trying to analyze the CPU core but clarifies the device under test, as that may have performance implications. Cooling differs between devices, and there may be manufacturer-configured power limits.
pinnochio 2 days ago
I get what they're doing. I've never seen that phrasing before.
cmrdporcupine 2 days ago
This is awesome. I'm going to have to spend some time digging through this.

I got one of these GB10s, but the ASUS variety. So far fairly happy with it. Most days I don't remember I'm on ARM.

It's pretty performant, snappy, about the same speed as my other mini PC, a Ryzen 9 7940HS Minisforum UM 790 Pro, but with double the number of cores and many times the amount of RAM.

storystarling 2 days ago
Have you tried running any local LLMs via llama.cpp? I am curious if that high RAM is effectively usable as unified memory for larger models. I wonder if the memory bandwidth is sufficient to get decent performance on something like a 70b model or if it bottlenecks.
justaboutanyone 2 days ago
You can run a large-ish MoE model at good speeds, like gpt-oss-120b; it's snappy enough even with big context.

But large and dense at the same time is a bit slow.

Running a local LLM will be a load of money for something much slower than the API providers though.

storystarling 2 days ago
Makes sense regarding the MoE performance. I am not sure the cost argument holds up for high volume workloads though. If you are running batch jobs 24/7 the hardware pays for itself in a few months compared to API opex. It really just comes down to utilization.
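
To make the utilization point concrete, here is a rough break-even sketch. Every number in the example call is a made-up placeholder (hardware price, API price per million tokens, throughput), not a real quote or benchmark, and the function name is just for illustration:

```python
def breakeven_days(hardware_cost, api_cost_per_mtok, tokens_per_sec,
                   utilization, power_watts=0.0, electricity_per_kwh=0.0):
    """Days until the hardware cost equals the API bill it displaces.

    utilization is the fraction of the day the box is actually generating
    tokens (0.0 to 1.0). Power is charged for the full 24 hours, which is
    a simplification.
    """
    tokens_per_day = tokens_per_sec * utilization * 86_400
    api_savings_per_day = tokens_per_day / 1e6 * api_cost_per_mtok
    power_cost_per_day = power_watts / 1000 * 24 * electricity_per_kwh
    net_savings_per_day = api_savings_per_day - power_cost_per_day
    if net_savings_per_day <= 0:
        return float("inf")  # never pays for itself at this utilization
    return hardware_cost / net_savings_per_day


# Hypothetical inputs: $3000 box, $10 per million tokens, 100 tok/s aggregate.
for utilization in (1.0, 0.1):
    days = breakeven_days(3000, 10, 100, utilization)
    print(f"utilization {utilization:.0%}: break-even in about {days:.0f} days")
```

With those placeholder inputs, dropping from 24/7 batch use to 10% utilization stretches the payback period by 10x, which is the whole argument in one line.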
storystarling 24 hours ago
Do you have specific t/s numbers for those dense models? I'm curious just how severe the memory bandwidth bottleneck gets in practice.
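
For context, this is the back-of-envelope bound I have in mind: during decode every generated token has to stream the active weights from memory once, so bandwidth divided by bytes per token caps the tok/s. A minimal sketch; the 273 GB/s figure and the ~5.5 bits per weight are my assumptions, not measured or confirmed values:

```python
def decode_tok_per_sec_upper_bound(bandwidth_gb_s, active_params_billion,
                                   bits_per_weight):
    """Rough ceiling on decode speed for a memory-bandwidth-bound model.

    Ignores KV cache reads, compute limits, and any overhead, so real
    numbers will land below this.
    """
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Assumed inputs: 273 GB/s memory bandwidth, dense 72B model, ~5.5 bits/weight.
bound = decode_tok_per_sec_upper_bound(273, 72, 5.5)
print(f"dense 72B ceiling: about {bound:.1f} tok/s")
```

For an MoE model only the active parameters per token go into `active_params_billion`, which is why those can be so much faster on the same memory.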

I'm not sure I agree on the cost aspect though. For high-volume production workloads the API bills scale linearly and can get painful fast. If you can amortize the hardware over a year and keep the data local for privacy, the math often works out in favor of self-hosting.

justaboutanyone 7 hours ago
For Qwen2.5-72B-Instruct-Q5_K_M at 32k context, I fed it a 26k token file (truncated fiction novel) asking it to summarize, and it input processed at 224 tok/s and output generated at 3 tok/s. Not really good enough for interactive use without frustration. Not just from watching it reply, but also the long wait for it to actually read the book.

On the same hardware gpt-oss-120b at 128k context, I fed it a longer version of the input (a whole novel, 97k tok), and it input processed at 1650 tok/s and output generated at 27 tok/s. Just fast enough IMO
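
If you want to turn those rates into wall-clock waits, a quick sketch. The prompt sizes and tok/s are the numbers above; the 500-token summary length is an assumption I picked for illustration:

```python
def response_time_s(prompt_tokens, prefill_tps, output_tokens, decode_tps):
    """Total seconds: time to read the prompt plus time to write the reply."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


SUMMARY_TOKENS = 500  # assumed reply length, not measured

runs = {
    "Qwen2.5-72B Q5_K_M, 26k prompt": (26_000, 224, 3),
    "gpt-oss-120b, 97k prompt": (97_000, 1650, 27),
}
for name, (prompt, prefill_tps, decode_tps) in runs.items():
    prefill_s = prompt / prefill_tps
    total_s = response_time_s(prompt, prefill_tps, SUMMARY_TOKENS, decode_tps)
    print(f"{name}: {prefill_s:.0f}s to read, {total_s:.0f}s total")
```

So the dense model spends close to two minutes just reading the shorter file, while the MoE model gets through the whole novel in about one.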

cmrdporcupine 21 hours ago
I bought it primarily so I could learn some of the toolchain for fine-tuning / training stuff, not so much for running inference, which it's only "ok" at.

If I was primarily interested in that, I would have probably bought one of the cheaper Strix Halo machines.

It's also just a decent non-Mac ARM64 workstation with large quantities of RAM, which in 2026 is a bit of a unicorn.

crest 2 days ago
I would love to see a comparison between the A725 and X925 cores.
geerlingguy 2 days ago
Not quite in the same depth, but there are some more general benchmarks across all cores and latencies here: https://github.com/geerlingguy/sbc-reviews/issues/92
arjie 2 days ago
Wow, this repo and the ai-benchmarks repo are the ones I wanted https://github.com/geerlingguy/ai-benchmarks/issues/34

Thank you for doing these. Earned a star and a watch from me on both! Minor sponsor donation as gratitude.

Would be sick to have an RSS feed for your data releases.

geerlingguy 19 hours ago
Will consider that at some point; a lot of the time is just spent getting the data, heh.
ksec 2 days ago
Note to myself: Cortex X925 was originally called X5. The current-generation X930 is now called C1-Ultra, used in the MediaTek Dimensity 9500.