What Are AI Tokens? And How Many Do You Use for Text, Images, and Video?

If you build anything with AI (chatbots, summaries, code generation, image/video features), you’ll quickly hear the word “tokens.” Tokens are the meter that most AI models use for text — and increasingly for multimodal inputs too.

But tokens aren’t the same as words, and image/video costs don’t always work the same way across providers.

This article breaks it down in a practical, builder-friendly way: what tokens are, how much you typically consume, and what changes your bill for text, images, and video.

1) What is a token?

A token is a chunk of text that an AI model reads and writes. It can be:

  • a whole word (“website”)

  • part of a word (“web” + “site”)

  • punctuation, spaces, or symbols

For English, a reliable rule of thumb is:

  • 1 token ≈ 4 characters

  • 1 token ≈ ¾ of a word

  • 100 tokens ≈ 60–80 words (roughly 75 words is a common estimate)

Important: tokenization varies by language. Some languages tend to use more tokens per word than English.
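If you need exact counts rather than rules of thumb, you can tokenize text locally. Here’s a minimal sketch using tiktoken, OpenAI’s open-source tokenizer library (other providers ship their own tokenizers, so the same text can yield different counts on different models):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is one of OpenAI's published encodings; other model
# families use different encodings and will produce different counts.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["website", "tokenization", "Hello, world!"]:
    tokens = enc.encode(text)
    print(f"{text!r} -> {len(tokens)} tokens: {tokens}")
```

Counting tokens client-side like this is also how you enforce input caps before a request ever reaches the API.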

2) Input tokens vs output tokens (and why both matter)

When you use an AI model, you pay (or consume credits) for two things:

Input tokens

Everything you send:

  • your prompt

  • system instructions

  • chat history / conversation context

  • tool definitions (if you’re using tools/functions)

Output tokens

Everything the model generates:

  • the response text

  • sometimes “thinking” / reasoning tokens, depending on the model and its billing rules

This is why long conversations get expensive: every new message often includes a growing history.
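To see why history compounds, here’s a toy cost model. The prices are hypothetical placeholders (real rate cards vary by provider and model, and output tokens are often priced higher than input tokens):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_m: float, price_out_per_m: float) -> float:
    """Cost of a single request, given prices per 1M tokens."""
    return (
        input_tokens / 1_000_000 * price_in_per_m
        + output_tokens / 1_000_000 * price_out_per_m
    )

# Hypothetical prices, $ per 1M tokens -- check your provider's rate card.
PRICE_IN, PRICE_OUT = 3.00, 15.00

# Each turn resends the growing conversation, so input grows every time.
history = 0
for turn in range(1, 6):
    new_prompt = 200                 # tokens in the new user message
    inp = history + new_prompt       # history + new message all count as input
    out = 300                        # tokens in the model's reply
    cost = request_cost(inp, out, PRICE_IN, PRICE_OUT)
    print(f"turn {turn}: input={inp:>5} tokens, cost=${cost:.4f}")
    history = inp + out              # the reply joins the history too
```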

3) Typical token consumption for text responses

Here are practical, real-world estimates (English):

  • Short reply (2–3 sentences): ~60–150 tokens

  • One solid paragraph: ~100–200 tokens

  • A detailed answer (like a full help section): ~400–1,200 tokens

  • A long blog-style response: ~1,500–3,500+ tokens (output only)

And remember: input tokens can be as large as (or larger than) output tokens if you paste docs, code, or long context.

A simple way to estimate quickly:

  • If you wrote ~300 words, that’s roughly 400 tokens (give or take).

  • If you want a ~1,000-word output, that’s often ~1,300 tokens.

These are approximations — but accurate enough for SaaS feature planning.
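These rules of thumb are easy to wire into a small helper, as a sketch:

```python
def tokens_from_words(word_count: int) -> int:
    # 1 token ~= 3/4 of a word, so tokens ~= words * 4/3
    return round(word_count * 4 / 3)

def tokens_from_chars(char_count: int) -> int:
    # 1 token ~= 4 characters
    return round(char_count / 4)

print(tokens_from_words(300))    # -> 400
print(tokens_from_words(1000))   # -> 1333, i.e. the ~1,300 above
```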

4) Tokens for images (it depends on the platform)

This is where many founders are caught off guard: images aren’t “free.” Even when pricing is “per image,” your prompt text still uses tokens.

Image understanding

When an AI model “looks at” an image, the image is internally converted into a structured representation that consumes compute. Depending on the platform:

  • small images may count as a fixed token amount

  • larger images may be split into tiles, each consuming tokens

That means an “AI that can see images” can consume a meaningful number of tokens from the image alone, before it generates a single word of text.
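To make tile-based billing concrete, here’s a minimal sketch. The base and per-tile constants are assumptions for illustration (modeled on the tiling schemes some providers document); real values differ by provider and model, and some providers also downscale images before tiling:

```python
import math

TILE_SIZE = 512          # px; a common documented tile size (assumed here)
BASE_TOKENS = 85         # flat per-image cost (illustrative assumption)
TOKENS_PER_TILE = 170    # per-tile cost (illustrative assumption)

def image_input_tokens(width_px: int, height_px: int) -> int:
    tiles = math.ceil(width_px / TILE_SIZE) * math.ceil(height_px / TILE_SIZE)
    return BASE_TOKENS + tiles * TOKENS_PER_TILE

print(image_input_tokens(512, 512))     # 1 tile   -> 255 tokens
print(image_input_tokens(1920, 1080))   # 12 tiles -> 2,125 tokens
```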

Image generation

For image generation, pricing is often:

  • text input tokens (prompt)

  • image output cost (per image / resolution / quality tier)

So images are a mix of token-based and asset-based billing.

5) Tokens for video (and how quickly they add up)

Video is the fastest way to burn usage — because it’s large and time-based.

Video understanding

Some AI platforms convert video into tokens per second. A common reference point is:

  • ~260 tokens per second of video

  • ~30 tokens per second of audio

That means:

  • 10 seconds of video ≈ 2,600 tokens

  • 60 seconds ≈ 15,600 tokens

And that’s before any text output is generated.
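Here’s that math as a tiny helper, using the reference rates above (real rates vary by provider and model):

```python
VIDEO_TOKENS_PER_SEC = 260   # reference rate from above
AUDIO_TOKENS_PER_SEC = 30    # reference rate from above

def video_input_tokens(seconds: float, include_audio: bool = True) -> int:
    rate = VIDEO_TOKENS_PER_SEC + (AUDIO_TOKENS_PER_SEC if include_audio else 0)
    return round(seconds * rate)

print(video_input_tokens(10, include_audio=False))   # 2,600 tokens
print(video_input_tokens(60, include_audio=False))   # 15,600 tokens
print(video_input_tokens(60))                        # 17,400 with the audio track
```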

Video generation

For video generation, pricing is often per second of generated video, based on:

  • resolution

  • frame rate

  • quality tier

Tokens still apply to the prompt and any text interaction, but seconds of video output are usually the main cost driver.
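A rough cost sketch, with entirely hypothetical per-second rates (real pricing depends on the provider, resolution, frame rate, and tier):

```python
# $/second by quality tier -- placeholder numbers, not any provider's pricing.
RATE_PER_SECOND = {"draft": 0.05, "standard": 0.15, "premium": 0.50}

def video_generation_cost(seconds: float, tier: str) -> float:
    return seconds * RATE_PER_SECOND[tier]

print(f"${video_generation_cost(10, 'standard'):.2f}")   # a 10-second clip
```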

6) Practical examples (so you can budget AI features)

Here’s what common SaaS actions usually imply:

Example A: “Summarize this meeting”

  • Input: transcript (often thousands of tokens)

  • Output: summary (200–800 tokens)

Big cost driver: transcript length.

Example B: “Explain this bug”

  • Input: logs + code snippets (can explode input tokens)

  • Output: explanation and fix steps (300–1,200 tokens)

Big cost driver: pasted code and logs.

Example C: “Analyze this screenshot”

  • Input: image representation + prompt

  • Output: explanation text

Big cost driver: image size and resolution.

Example D: “Generate a 10-second video ad”

  • Input: prompt (small)

  • Output: video seconds (expensive)

Big cost driver: number of seconds and quality level.
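Putting it together, here’s a budgeting sketch for the token-metered features above. The token counts and prices are illustrative assumptions; replace them with measurements from your own logs and your provider’s rate card:

```python
PRICE_IN, PRICE_OUT = 3.00, 15.00   # $ per 1M tokens (hypothetical)

FEATURES = {
    # name: (typical input tokens, typical output tokens) -- assumed values
    "summarize_meeting":  (8_000, 500),    # transcript dominates the input
    "explain_bug":        (5_000, 800),    # logs + code snippets in the prompt
    "analyze_screenshot": (2_200, 400),    # image tiles + a short prompt
}

for name, (tin, tout) in FEATURES.items():
    cost = tin / 1e6 * PRICE_IN + tout / 1e6 * PRICE_OUT
    print(f"{name:>18}: ~${cost:.4f} per request")
```

Video generation budgets differently: multiply expected seconds by your per-second rate, then cap duration per request.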

7) How to reduce token usage without hurting quality

If you’re building AI features into a web app, this is how you keep costs under control:

  • Trim context: don’t resend the full chat history every time (see the sketch after this list)

  • Summarize memory: compress old context into smaller summaries

  • Cache results: reuse outputs for repeated requests

  • Route models: cheap model for simple tasks, expensive model only when needed

  • Set limits: cap max input size, output length, and video duration
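Here’s the context-trimming idea as a minimal sketch: keep the system message, then keep the newest turns that still fit a token budget (count_tokens can be any counter, e.g. one built on the tokenizer example earlier):

```python
def trim_history(messages: list[dict], max_tokens: int, count_tokens) -> list[dict]:
    """Keep system messages plus the newest turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    used = sum(count_tokens(m["content"]) for m in system)
    kept = []
    for msg in reversed(rest):                    # walk newest-first
        cost = count_tokens(msg["content"])
        if used + cost > max_tokens:
            break                                 # budget exhausted
        kept.append(msg)
        used += cost
    return system + list(reversed(kept))          # back to chronological order

# Usage with a crude counter (4 chars ~= 1 token):
# trimmed = trim_history(chat, max_tokens=2_000, count_tokens=lambda s: len(s) // 4)
```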

This is also why “unlimited AI” plans often fail: tokens and seconds are real variable costs.

Bottom line

  • Tokens are the units AI models use to process text.

  • Text costs scale with input + output tokens.

  • Images and video can consume large amounts of usage, especially video.

  • For SaaS products, AI should be treated like infrastructure, not a fixed-cost feature.

Teams that understand tokens early build better pricing, better limits, and more sustainable AI products.

Sorca Marian

Founder, CEO & CTO of Self-Manager.net & abZGlobal.net | Senior Software Engineer

https://self-manager.net/