eric orso • 9 months ago
Char/s vs Token/s in OSS
A heads-up for people using this model.
I ran some benchmarks, and compared to other models, OSS has a weak tokenizer at about 1.2 characters per token. Most models I tested hover between 3.5 and 4.5 characters per token. This has two implications:
1) Context is measured in tokens, so you should budget around 3X the context size you would use with other models for the same workload. If you use 4,000 tokens in Llama 3.2, it takes around 12,000 tokens in OSS to do the same work.
2) Speed is measured in tokens per second; if you look only at that, the model looks a lot faster than it is, because of that same 3X factor on English text. When comparing across models you should look at characters per second, with the caveat that the conversion rate varies quite a bit, especially on short texts.
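To make point 2 concrete, here is a minimal sketch of the conversion from token throughput to character throughput; the numbers are hypothetical, just plugging in the rough ratios mentioned above:

```python
def chars_per_sec(tokens_per_sec: float, chars_per_token: float) -> float:
    """Convert reported token throughput into effective text throughput."""
    return tokens_per_sec * chars_per_token

# Hypothetical example: a model reporting 50 tok/s at 1.2 chars/token
# emits 60 chars/s, while one reporting 20 tok/s at 4.0 chars/token
# emits 80 chars/s -- the "slower" model actually writes text faster.
fast_looking = chars_per_sec(50, 1.2)   # 60.0 chars/s
slow_looking = chars_per_sec(20, 4.0)   # 80.0 chars/s
```

The point is only that tok/s numbers are not comparable across vocabularies without this normalization.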
Detailed stats over around 200,000 tokens; I don't expect many surprises on other English text. I haven't measured programming; it's plausible that OSS's compression gets a lot better if you do mainly code, but I haven't measured that in this bench.
OSS20B
"n_char_per_token": {
"n_avg": 1.1982278934686585,
"n_rms": 1.2870897420857887,
"n_std": 0.4699467198482436,
"n_min": 0.449079754601227,
"n_max": 2.161705551086082,
},
Qwen 3 14B
"n_char_per_token": {
"n_avg": 4.030572019404191,
"n_rms": 4.082795976764328,
"n_std": 0.6509317815862128,
"n_min": 2.5442857142857145,
"n_max": 4.843283582089552,
},
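For reproducibility, the stats above can be recomputed with a small stdlib-only helper. This is a sketch of my assumption about how the fields are defined (per-chunk ratios, population std); the exact chunking Eric used is not stated, so the input pairs here are hypothetical:

```python
import math

def char_per_token_stats(samples):
    """Compute chars-per-token statistics over benchmark chunks.

    samples: list of (n_chars, n_tokens) pairs, one per text chunk,
    from whatever tokenizer is being benchmarked (hypothetical input).
    Returns the same fields as the stats blocks above.
    """
    ratios = [c / t for c, t in samples]
    n = len(ratios)
    avg = sum(ratios) / n
    rms = math.sqrt(sum(r * r for r in ratios) / n)
    std = math.sqrt(sum((r - avg) ** 2 for r in ratios) / n)
    return {
        "n_avg": avg,
        "n_rms": rms,
        "n_std": std,
        "n_min": min(ratios),
        "n_max": max(ratios),
    }

# Toy usage with made-up chunk sizes:
stats = char_per_token_stats([(4, 2), (9, 3)])
```

For the toy input the ratios are 2.0 and 3.0, so `n_avg` is 2.5 and `n_max` is 3.0.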

1 comment
Dominik Kundel • 8 months ago
Hey Eric,
Yes, the two models use different vocabularies. The gpt-oss models use the same vocabulary as o3 and o4-mini, if you want full visibility into the tokens:
https://github.com/openai/tiktoken/blob/main/tiktoken/model.py#L20
Comparing models with different vocabularies can indeed show variability, so it depends on your overall goal. For example, if you look at the cost to run Artificial Analysis' intelligence index, the token usage of gpt-oss-120b and Qwen 3 reasoning ends up in the same territory:
https://artificialanalysis.ai/?intelligence-tab=openWeights&models=gpt-oss-120b%2Cgpt-oss-20b%2Cqwen3-235b-a22b-instruct-2507-reasoning%2Cqwen3-30b-a3b-2507-reasoning%2Cqwen3-235b-a22b-instruct-reasoning#output-tokens-used-to-run-artificial-analysis-intelligence-index
But yes, it's helpful to remember that there are vocabulary differences between different open models.