When I asked the 32B R1 distilled model for its context window, it said 4k... I don't know if that's true, since it might not know its own architecture, but if it is, 4k doesn't leave much room, especially for its <thinking> tokens. I've also seen some negative feedback on the model. It could be that the benchmarks are misleading because the model was simply trained on them, or maybe the model is so new that the hyperparameters haven't been set up properly. We'll see in the next few days, I guess. From my testing there are hints of something interesting in there, but I don't like its extremely censored nature either. And I don't mean the CCP stuff, I mean the sanitized corpo safety nonsense it was most likely trained on....
Yeah, asking the model simply wouldn't work. Models don't have any concept of "themselves". They are just large matrices of floating-point numbers that we multiply together to predict the next token.
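To make that concrete, here's a toy sketch in plain Python. The vocabulary, the weights, and the function names are all made up for illustration; a real model has billions of weights and many layers, but the idea of "look up a vector, multiply by a matrix, pick the likeliest token" is the same.

```python
import math

# Toy vocabulary and hand-picked weights: everything here is invented for illustration.
VOCAB = ["the", "cat", "sat"]
EMBED = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one 2-d vector per token
W_OUT = [[0.1, 2.0, 0.3], [0.2, 0.1, 3.0]]    # 2 x 3 projection to vocabulary logits

def matvec(vec, mat):
    """Multiply a row vector of length n by an n x m matrix."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
            for j in range(len(mat[0]))]

def softmax(xs):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def next_token(token):
    """Predict the most likely next token from the current one."""
    hidden = EMBED[VOCAB.index(token)]
    probs = softmax(matvec(hidden, W_OUT))
    return VOCAB[probs.index(max(probs))]

print(next_token("the"))
```

Nothing in those numbers records facts about the model itself, so whatever it says about its own specs is just another predicted string.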
For the model to answer correctly, its context size would have to be in the training data, which would not make sense to do.
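The place to look is the model's config, not the model's answer. A minimal sketch, assuming a Hugging Face-style config.json where the limit is stored under `max_position_embeddings` (many models use that key, but not all). The JSON below and the `context_window` helper are hypothetical, and the number is illustrative, not the real setting of any particular model.

```python
import json

# Hypothetical excerpt of a config.json; the value is illustrative only.
EXAMPLE_CONFIG = '{"model_type": "example", "max_position_embeddings": 32768}'

def context_window(config_text):
    """Return the configured maximum number of positions, or None if the key is absent."""
    config = json.loads(config_text)
    return config.get("max_position_embeddings")

print(context_window(EXAMPLE_CONFIG))
```

Keep in mind that whatever runtime you use to serve the model may apply its own, smaller context limit, so check those settings too.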