> Both of our models are trained on top of DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Qwen-32B.
Not to take away from their work but this shouldn't be buried at the bottom of the page - there's a gulf between completely new models and fine-tuning.
Agreed. Also, their name makes it seem like it is a totally new model.
If they needed to assign their own name to it, they could at least have included the parent (and grandparent) model names in the name.
Just like the name DeepSeek-R1-Distill-Qwen-7B clearly says that it is a distilled Qwen model.
DeepSeek probably would have done this anyway, but they did release a Llama 8B distillation and the Meta terms of use require any derivative works to have Llama in the name. So it also might have just made sense to do for all of them.
Otoh, there aren't many frontier labs that have actually done finetunes.
> the Meta terms of use require any derivative works to have Llama in the name
Technically it requires the derivatives to begin with "llama". So "DeepSeek-R1-Distill-Llama-8B" isn't OK by the license, while "Llama-3_1-Nemotron-Ultra-253B-v1" would be OK.
> [...] If you use the Llama Materials or any outputs or results of the Llama Materials to create, train, fine tune, or otherwise improve an AI model, which is distributed or made available, you shall also include “Llama” at the beginning of any such AI model name.
I've previously written a summary that includes all parts of the license that I think others are likely to have missed: https://notes.victor.earth/youre-probably-breaking-the-llama...
I suspect that we'll see a lot of variations on this: with open models catching up to SOTA and foundation models staying relatively static, there will be many new SOTAs built off of existing foundation models.
How many of the latest databases are postgres forks?
Also, am I reading that right? They trained it not only on another model, not only on one that is itself a distillation of another model, but on one that is much lower in parameter count (7B)?
They took the best available models for the architecture they chose (in two sizes), and fine tuned those models with additional training data. They don't say where they got that training data, or what combo of SFT and/or RLHF they used. It's likely that the training data was generated by larger models.
This has been happening a lot on r/LocalLLaMA for a few months now: big headline claims followed by "oh yeah, it's a finetune."
How is the score on AIME2024 relevant if AIME2024 has been used to train the model?
That is pretty much a universal problem. If you look at the problems anyone's models have solved, they are all well represented in the corpus.
Remember that AIME is intended for high schoolers to solve in 3 hours with just pencils, erasers, rulers, and compasses. There is an entire industry providing supplementary material to prepare students for concepts that are not directly covered in typical school material.
Since various blogs and tests pulling from previous years make it into all the common sources like Stack Overflow/Exchange, Reddit, etc., their explicitly stating that they trained on AIME problems prior to 2024 doesn't change much.
Basically expect any model to train on all AIME problems available before their knowledge cutoff date.
To me, "How is the score on AIME2024 relevant" is because it is still not that high (from a practical consideration) despite directly training on it.
Mix in every model's score falling dramatically on AIME 2025, and it demonstrates the above; it also hints at Rao's claim that what matters is compiling the verifier into the training/scratch-space/prompt/fine-tuning, etc., in a way the model can reliably access.
Google Gemini (2.5 Pro) made the same "mistake": their data cutoff is January 2025, and AIME 2024 was in February 2024.
github repo: https://github.com/SkyworkAI/Skywork-OR1
blog: https://capricious-hydrogen-41c.notion.site/Skywork-Open-Rea...
huggingface: https://huggingface.co/collections/Skywork/skywork-or1-67fa1...
I tend to prefer running locally non-thinking models since they output the result significantly faster.
Any specific model recommendations for running locally?
Also, what tasks are you using them for?
Phi 4. It's fast and reasonable enough, but with local models you have to know what you want to do. If you want a chat bot, use something with Hermes tunes; if you want code, you want a coder - a lot of people like the DeepSeek distill of Qwen instruct for coding.
There's no local equivalent of "does everything kinda well" like ChatGPT or Gemini, except maybe the 70B and larger models, but those are slow without datacenter cards with enough RAM to hold them.
I just asked your very question a day or two ago because I put back together a machine with a 3060 12GB and wondered what sota was on that amount of RAM.
If you use LM Studio it will auto-pick which quantized model to get, but you can pick a larger quant if you want. You pick a model and a parameter size, and it will choose the "best" quantization for your hardware. Generally.
Thank you for the insightful reply
> There's no equivalent to "does everything kinda well" like chatgpt or Gemini on local, except maybe the 70B and larger, but those are slow
Is there something like a “prompt router”, that can automatically decide what model to use based on the type of prompt/task?
there's RouteLLM: https://github.com/lm-sys/RouteLLM
nvidia has LLMRouter https://build.nvidia.com/nvidia/llm-router
llama-index also supports routing https://docs.llamaindex.ai/en/stable/examples/low_level/rout...
semantic router seems interesting https://github.com/aurelio-labs/semantic-router/
you could also just use langchain to route https://jimmy-wang-gen-ai.medium.com/llm-router-in-langchain...
interesting paper PickLLM: Context-Aware RL-Assisted Large Language Model Routing
https://arxiv.org/abs/2412.12170
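For intuition, here is a minimal sketch of the routing idea these tools build on: classify the prompt cheaply, then dispatch to whichever local model fits. This is a toy keyword heuristic, not the actual logic of RouteLLM or semantic-router, and the model names are just placeholders.

```python
# Toy prompt router: pick a local model based on crude prompt features.
# Hint lists and model tags are illustrative placeholders.
CODE_HINTS = ("def ", "class ", "traceback", "compile", "```", "bug", "refactor")
MATH_HINTS = ("prove", "integral", "theorem", "solve for")

def route(prompt: str) -> str:
    p = prompt.lower()
    if any(h in p for h in CODE_HINTS):
        return "qwen2.5-coder:32b"   # coder model for code-looking prompts
    if any(h in p for h in MATH_HINTS):
        return "deepseek-r1:32b"     # reasoning model for math-looking prompts
    return "phi4"                    # general fallback

print(route("Why does this traceback mention a KeyError?"))  # -> qwen2.5-coder:32b
print(route("What's a good weekend trip near Lisbon?"))      # -> phi4
```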
I have a machine with 3x 1080 Tis that I use for batch work, sending the question first to an LLM and then to an LRM, returning to review the faster results and killing the job if they are acceptable. Ollama, or just llama.cpp on podman, makes this trivial.
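As a rough illustration of that flow, here is a minimal sketch assuming a local Ollama server on its default port with both model tags already pulled; the model names and the acceptance check are placeholders, not the poster's actual setup.

```python
# "Fast model first, reasoning model second" batch flow against a local Ollama server.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def ask(model: str, prompt: str) -> str:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

question = "Summarize the tradeoffs of quantizing a 32B model to Q4."

# 1. Get the fast answer from a plain instruct LLM.
fast_answer = ask("gemma3:27b", question)
print("fast:", fast_answer)

# 2. Only fall back to the slower reasoning model (LRM) if the fast answer
#    isn't acceptable for the task at hand.
if "quantization" not in fast_answer.lower():   # stand-in for a real acceptance check
    slow_answer = ask("deepseek-r1:32b", question)
    print("slow:", slow_answer)
```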
But knowing in advance which model will be better is impossible; only broad heuristics that may or may not be correct for any individual prompt can be used.
While there are better options if you were buying today, an old, out-of-date system with out-of-date GPUs works well in this batch model.
gemma-3-27b-it-Q6_K_L works fine with these, and that, combined with an additional submit to DeepSeek-R1-Distill-Qwen-32B, is absolutely fine on a system that would otherwise just be shut down.
I have a very bright line about preventing inter-customer leakage risk that may be irrational, but with that mixture I find I am better off looking at scholarly papers than trying the commercial models.
My primary task is FP64-throughput limited, and thus I am stuck on the Titan V; the fact that it is ~6 times faster than the 4090 and ~5 times faster than the 5090 at FP64 is the only reason I don't have newer GPUs.
You can add 4x 1080 Ti at a 200W power limit with common PSUs and get the memory, but at 3x 1080 Ti performance is already limited by the PCIe bus.
As they seem to sell for the same price, I would probably buy the Titan V today. But the point is: if you are fine with even smaller models, you can run queries in parallel or even cross-verify, which dramatically helps with planning tasks even with the foundational models.
But series/parallel runs do a lot, and if you are using them for code, running a linter, etc., on the structured output saves a lot of time when evaluating the multiple responses.
No connection to them at all, but bartowski on Hugging Face puts a massive amount of time and effort into re-quantizing models.
If you don't have a restriction like my FP64 need, you can get 70B models running on two 24GB GPUs without much 'cost' to accuracy.
That would be preferable to a router IMHO.
> My primary task is FP64-throughput limited, and thus I am stuck on the Titan V; the fact that it is ~6 times faster than the 4090 and ~5 times faster than the 5090 at FP64 is the only reason I don't have newer GPUs.
Interesting. Very interesting. Why FP64 as opposed to BF16? A different sort of model? I don't even know where to find FP64 models (not that I've looked).
Also, bartowski may be on Hugging Face, but they're also part of the LM Studio group and frequently chat on that Discord. Actually, at least 3 of the main model converter/quant people are on that Discord.
I haven't got two 24GB cards, yet, but maybe soon, with the way people are hogging the 5000 series.
Edit: I realize that they're increasing the marketing FLOPS by halving the precision; the current-gen stuff is all "fast" at FP16 (or BF16, brain float 16-bit). So when Nvidia finishes and releases a card with double the FLOPS at 8-bit, will that card be 8 times slower at FP64?
My primary task isn't ML, and 64bit is needed for numerical stability.
For the Titan V, FP64 throughput was 1/2 of FP32; it was the last consumer generation to have that.
For the Titan RTX the ratio drops to 1/32, and for newer consumer NVIDIA cards FP64 is typically 1/64th of FP32.
So the Titan RTX, with 16 FP32 TFLOP/s, drops to 0.5 FP64 TFLOP/s,
while the Titan V, starting at 15 FP32 TFLOP/s, still has 7.5 FP64 TFLOP/s.
The 5090 has 104.9 FP16/FP32 TFLOP/s, but only 1.64 FP64 TFLOP/s.
Basically Nvidia decided most people didn't need FP64, and chose to improve quantized performance instead.
If you can run on a GPU, that Titan V has more FP64 FLOPS than even an AMD Ryzen Threadripper PRO 7995WX.
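Just to sanity-check those numbers: effective FP64 throughput is the FP32 rate times the card's FP64:FP32 ratio (ratios as listed above).

```python
# Back-of-the-envelope FP64 throughput from FP32 rate and FP64:FP32 ratio.
cards = {
    "Titan V":   (15.0, 1 / 2),    # last consumer card with a 1/2 ratio
    "Titan RTX": (16.0, 1 / 32),
    "RTX 5090":  (104.9, 1 / 64),
}
for name, (fp32_tflops, ratio) in cards.items():
    print(f"{name}: {fp32_tflops * ratio:.2f} FP64 TFLOP/s")
# Titan V: 7.50, Titan RTX: 0.50, RTX 5090: 1.64
```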
While researching this I discovered another fast FP64 card: the R9 280X by AMD/ATI. Although the memory is weak, only 3GB! But I suppose if you need the numerical accuracy, there's always that, and those cards are like $40 (in the US, on eBay, sold listings), compared to $400 for the Titan. If you need 4x the RAM, though, I guess you're stuck paying 10x the price!
I mostly like to evaluate them whenever I ask a remote model (Claude 3.7, ChatGPT 4.5), to see how far they have progressed. From my tests, Qwen 2.5 Coder 32B is still the best local model for coding tasks. I've also tried Phi 4, Nemotron, Mistral Small, and QwQ 32B. I'm using a MacBook Pro M4 with 46GB RAM.
From their Notion page:
> Skywork-OR1-32B-Preview delivers the 671B-parameter Deepseek-R1 performance on math tasks (AIME24 and AIME25) and coding tasks (LiveCodeBench).
Impressive, if true: much better performance than the vanilla distills of R1.
Plus it’s a fully open-source release (including data selection and training code).
Interesting – focusing on the 671B parameter model feels like a significant step. It’s a compelling contrast to the previous models and sets a strong benchmark. It’s great that they’re embracing open weights and data too – that’s a crucial aspect for innovation.
> It’s great that they’re embracing open […] data too…
It could be, but as I type this it's currently vaporware: https://huggingface.co/datasets/Skywork/Skywork-OR1-RL-Data
I know one can rent consumer GPUs on the internet, where people like you and me offer their free GPU time to people who need it for a price. They basically get a GPU-enabled VM on your machine.
But is there something like a distributed network akin to SETI@home and the like which is free for training models? Where consensus is reached on which model gets trained, and any derivative works must be open source, including all the tooling and hosting platform? Would this even be possible to do, given that the latency between nodes is very high and the bandwidth limited?
> Would this even be possible to do, given that the latency between nodes is very high and the bandwidth limited?
Yes, it's possible. But no, it would not be remotely sensible given the performance implications. There is a reason why Nvidia is a multi trillion dollar company, and it's as much about networking as it is about GPUs.
Back in the early days of AI art, before AI became way too cringe to think about, I wondered about this exact thing[0]. The problem I learned later is that most AI training (and inference) is not dependent so much on the GPU compute, but on memory bandwidth and communication. A huge chunk of AI training is just figuring out how to minimize or hide the bottleneck the inter-GPU interconnect imposes so you can scale to multiple cards.
The BOINC model of distributed computing is to separate everything into little work units that can be sent out to multiple machines who then return a result that can be integrated back into the whole. If you were to train foundation models this way, you'd be packaging up the current model state n and a certain amount of trainset items into a work unit, and the result would be model weight offsets to be added back into model state n+1. But you wouldn't be able to benefit from any of the gradients calculated by other users until they submitted their work units and n+1 got calculated. So there'd be a lot of redundant work and training progress would slow down, versus a closely-coupled set of GPUs where they have enough bandwidth to exchange gradients every batch.
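A rough sketch of that BOINC-style loop, just to make the data flow concrete; this is a toy numpy illustration (pretend model, pretend gradients), not real training or networking code.

```python
# BOINC-style round: ship model state n plus a data shard to each volunteer,
# collect weight offsets, fold them into model state n+1.
import numpy as np

def make_work_unit(weights: np.ndarray, dataset: np.ndarray, shard: int, n_shards: int):
    """Package model state n and one slice of the training set for a volunteer."""
    return weights.copy(), np.array_split(dataset, n_shards)[shard]

def volunteer_train(weights: np.ndarray, shard: np.ndarray, lr: float = 0.01) -> np.ndarray:
    """Stand-in for local training: return a weight offset, not new weights."""
    grad = shard.mean(axis=0) - weights   # pretend gradient
    return lr * grad                      # the "result" sent back to the server

def integrate(weights: np.ndarray, offsets: list[np.ndarray]) -> np.ndarray:
    """Server side: fold all returned offsets into model state n+1."""
    return weights + np.mean(offsets, axis=0)

weights = np.zeros(4)
data = np.random.randn(1000, 4)
for step in range(3):                     # one "generation" per full round trip
    units = [make_work_unit(weights, data, s, 10) for s in range(10)]
    offsets = [volunteer_train(w, shard) for w, shard in units]
    weights = integrate(weights, offsets) # volunteers never see each other's
    print(step, weights)                  # gradients within a round
```

Note how no volunteer benefits from anyone else's gradients until the whole round completes, which is exactly the redundancy/slowdown problem described above.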
For the record, I never actually built a distributed training cluster. But when I learned what AI actually wants in order to go fast, I realized distributed training probably couldn't compete with just renting big GPUs.
Most people do not have GPUs with enough RAM to do meaningful AI work. Generative AI models work autoregressively: that is, all of their weights are repeatedly used in a tight loop. In order for a GPU to provide a meaningful speedup it needs to have the whole model in GPU memory, because PCIe is slow (high latency) and also slow (low bandwidth). Nvidia knows this and that's why they are very stingy on GPU VRAM. Furthermore, training a model takes more memory than merely running it; I believe gradients are something like the number of weights times your batch size in terms of memory usage. There's two ways I could see around this, both of which are going to cause further problems:
- You could make 'mini' workunits where certain specific layers of the model are frozen and do not generate gradients. So you'd only train, say, 10% of the model at any one time. This is how you train very large models in centralized computing; you put a slice of the model on each GPU and exchange activations and gradients each batch. But we're on a distributed computer, so we don't have that kind of tight coupling, and we converge slower or not at all if we do this.
- You can change the model architecture to load specific chunks of weights at each layer, with another neural network to decide what chunks to load for each token. This is known as a "Mixture of Experts" model and it's the most efficient way we know of to stream weights in and out of a GPU, but training has to be aware of it and you can't change the size of the chunks to fit the current GPU. MoE lets a model have access to a lot of weights, but the scaling is worse. e.g. an 8x44B parameter MoE model is NOT equivalent to a 352B non-MoE model. It also causes problems with training that you have to solve for: very common bits of knowledge will be replicated across chunks, and certain chunks can become favored by the model because they're getting more gradients, which causes them to be favored more, so they get more gradients.
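To make that second option concrete, here is a toy sketch of MoE routing: a small gating network picks the top-k expert weight chunks per token, so only those chunks need to be resident on the GPU. It is a numpy illustration with made-up sizes, not a production MoE layer.

```python
# Minimal Mixture-of-Experts routing for a single token.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

gate_w = rng.standard_normal((d_model, n_experts))            # router network
experts = rng.standard_normal((n_experts, d_model, d_model))  # expert weight chunks

def moe_layer(x: np.ndarray) -> np.ndarray:
    """x: (d_model,) activation for one token."""
    logits = x @ gate_w
    chosen = np.argsort(logits)[-top_k:]                       # top-k experts for this token
    probs = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()
    # Only the chosen experts' weights are needed now; the rest could stay off-GPU.
    return sum(p * (x @ experts[i]) for p, i in zip(probs, chosen))

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)   # (16,)
```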
[0] My specific goal was to train a txt2img model purely on public domain Wikimedia Commons data, which failed for different reasons having to do with the fact that most of AI is just dataset sorting.