
Char/s vs Token/s in OSS

A heads-up for people using this model.

I ran some benchmarks, and compared to other models, OSS has weak tokens at about 1.2 characters per token. Most models I tested hover between 3.5 and 4.5 characters per token. This has two implications:
1) Context is measured in tokens, so you should use roughly 3x the context size you would use with other models for the same workload. If you use 4000 tokens in Llama 3.2, you need around 12000 tokens in OSS to do the same work.
2) Speed is measured in tokens per second; if you look just at that, the model appears a lot faster than it is, because of that same 3x factor on English text. When comparing across models, look at characters per second instead, with the caveat that the conversion rate varies quite a bit, especially on short text.
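To make those two conversions concrete, here is a minimal sketch. The ratios are the averages from my benchmark below; the model names and function names are just placeholders, and you should measure the ratio for your own workload before relying on it:

```python
# Average chars-per-token ratios measured in this benchmark (assumptions
# for any other model or workload -- measure your own).
CHARS_PER_TOKEN = {
    "oss20b": 1.198,
    "qwen3-14b": 4.031,
}

def equivalent_context(tokens: int, from_model: str, to_model: str) -> int:
    """Token budget on `to_model` that covers roughly the same text
    as `tokens` does on `from_model`."""
    chars = tokens * CHARS_PER_TOKEN[from_model]
    return round(chars / CHARS_PER_TOKEN[to_model])

def chars_per_second(tokens_per_second: float, model: str) -> float:
    """Convert a reported tok/s figure into chars/s for a fairer
    cross-model speed comparison."""
    return tokens_per_second * CHARS_PER_TOKEN[model]

# 4000 tokens on a ~4 char/token model needs roughly 3x more on OSS:
print(equivalent_context(4000, "qwen3-14b", "oss20b"))  # → 13459
```

The same caveat as above applies: the ratio is an average, so the conversion is only approximate on short text.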

Detailed stats below, over around 200 000 tokens. I don't expect many surprises for most English text. I haven't measured programming: it's plausible OSS compresses a lot better if you do mainly code, but I haven't measured that in this bench.

OSS20B

"n_char_per_token": {
"n_avg": 1.1982278934686585,
"n_rms": 1.2870897420857887,
"n_std": 0.4699467198482436,
"n_min": 0.449079754601227,
"n_max": 2.161705551086082,
},

Qwen 3 14B

"n_char_per_token": {
"n_avg": 4.030572019404191,
"n_rms": 4.082795976764328,
"n_std": 0.6509317815862128,
"n_min": 2.5442857142857145,
"n_max": 4.843283582089552,
},
