Can you save on LLM tokens using images instead of text? (pagewatch.ai)
21 points by lpellis 6 days ago
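Before the discussion, a minimal sketch (an editor's illustration, not taken from the linked article) of how the comparison in the title might be measured: count the text tokens of a passage with a BPE tokenizer, and estimate the image tokens for the same passage rendered to an image. The patch-based accounting (one token per 16x16 patch, as in the ViT paper linked in the thread) and the helper names text_token_count / image_token_count are assumptions for illustration; real providers use their own formulas.

# Editor's sketch under the assumptions above; not the article's method.
import math

import tiktoken  # pip install tiktoken

def text_token_count(passage: str, encoding_name: str = "cl100k_base") -> int:
    """Count text tokens with a BPE tokenizer."""
    return len(tiktoken.get_encoding(encoding_name).encode(passage))

def image_token_count(width_px: int, height_px: int, patch_px: int = 16) -> int:
    """Assumed ViT-style accounting: one token per patch of the image."""
    return math.ceil(width_px / patch_px) * math.ceil(height_px / patch_px)

passage = "word " * 1000                             # ~1000 words of text
print("text tokens :", text_token_count(passage))    # roughly one token per word for this toy input
print("image tokens:", image_token_count(256, 256))  # 256 tokens under this assumption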
bikeshaving 5 hours ago
Does this mean we’ll finally get empirical proof for the aphorism “a picture is worth a thousand words”?
https://en.wikipedia.org/wiki/A_picture_is_worth_a_thousand_...

  heltale 4 hours ago
  I suppose it’s only worth 256 words at a time right now. ;)
  https://arxiv.org/abs/2010.11929

    estebarb 4 hours ago
    The CALM paper (https://shaochenze.github.io/blog/2025/CALM/) says it is possible to compress 4 tokens into a single embedding, so... image = 4 × 256 = 1024 words > 1000 words. QED

      bikeshaving 2 hours ago
      2.4% relative error is not bad.

      behnamoh 2 hours ago
      How do you decompress those 4 words from one token?

ashed96 an hour ago
In my experience, LLMs tend to take noticeably longer to process images than text.

floodfx 5 hours ago
Why are completion tokens higher with image prompts when the text output was about the same?

  Garlef 2 hours ago
  "Thinking" mode.