How do I count words in Python?

len(text.split()) counts whitespace-separated words. For more analysis, see our Word Counter.
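
As a rough sketch of how the whitespace rule behaves, the example below compares it with a regex-based count. The helper names `count_words` and `count_words_alnum` are illustrative, not part of any library, and the regex variant is just one possible definition of a "word".

```python
import re

def count_words(text: str) -> int:
    """Count whitespace-separated tokens (punctuation-only tokens included)."""
    return len(text.split())

def count_words_alnum(text: str) -> int:
    """Count runs of word characters, so a lone dash is not counted."""
    return len(re.findall(r"\w+", text))

sample = "Hello, world - this is Python"
print(count_words(sample))        # 6 (the lone "-" is a token)
print(count_words_alnum(sample))  # 5
```

Which definition is right depends on the use case; the regex version will also split contractions such as "it's" into two matches.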

How do I count specific characters?

text.count('a') counts occurrences of 'a'. For multiple characters at once, use collections.Counter: Counter(text).
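
A short illustration of both approaches, using a made-up sample string. Note that `str.count` counts non-overlapping matches, which matters for multi-character substrings.

```python
from collections import Counter

text = "banana"

# Single character or substring: str.count
print(text.count("a"))    # 3
print(text.count("ana"))  # 1, matches do not overlap

# Every character in one pass: Counter
counts = Counter(text)
print(counts["a"])            # 3
print(counts["n"])            # 2
print(counts["z"])            # 0, missing keys count as zero
print(counts.most_common(1))  # [('a', 3)]
```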

How do I detect emoji in Python?

Use the 'emoji' package: pip install emoji; emoji.is_emoji(c). For analysis, try our Emoji Counter.

What about counting bytes in a file?

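os.path.getsize('file.txt') returns the byte count without reading the file. len(open('file.txt', 'rb').read()) gives the same number but loads the whole file into memory and leaves the file handle to be closed by the garbage collector. For a character count, open the file in text mode with the correct encoding.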
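
A minimal sketch of the byte-versus-character difference for a file. The helper `file_sizes` is hypothetical, and the example assumes the file is UTF-8; a temporary file is created so the snippet runs on its own.

```python
import os
import tempfile

def file_sizes(path, encoding="utf-8"):
    """Return (byte_count, code_point_count) for a text file."""
    byte_count = os.path.getsize(path)
    # newline="" disables newline translation so the counts stay comparable
    with open(path, encoding=encoding, newline="") as f:
        char_count = len(f.read())
    return byte_count, char_count

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write("h\u00e9llo 世界")
    sizes = file_sizes(path)

print(sizes)  # (13, 8)
```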

How to Count Characters in Python

Python's len() is a reliable starting point for counting characters: Python 3 strings are sequences of Unicode code points, so there are no surrogate-pair surprises as in JavaScript. Gotchas remain around combining marks, emoji sequences, and byte counting. For a browser-based reference implementation, see our Character Counter; for emoji-specific edge cases, try our Emoji Counter.

Method 1: len() returns code points

Python 3 strings are sequences of Unicode code points. len() returns the count directly — no UTF-16 surprises.

python

text = "Hello, world!"
print(len(text))         # 13

print(len("😀"))          # 1 — correct in Python 3
print(len("中文"))         # 2

# Combining marks still count separately:
print(len("e\u0301"))    # 2: e + combining acute accent (renders as é)
print(len("\u00e9"))     # 1: precomposed é

Method 2: Normalize before counting (combining marks)

If your text mixes precomposed and decomposed accented characters, normalize first.

python

import unicodedata

text_decomposed = "e\u0301llo"   # e + combining acute accent (U+0301), renders as "éllo"
text_normalized = unicodedata.normalize('NFC', text_decomposed)

print(len(text_decomposed))  # 5
print(len(text_normalized))  # 4 — é now one code point
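
One caveat worth a quick check: NFC only shortens a sequence when Unicode defines a precomposed form for it. The sketch below uses a hypothetical helper, `nfc_len`, to show a case that composes and one that does not.

```python
import unicodedata

def nfc_len(s: str) -> int:
    """Length in code points after NFC normalization."""
    return len(unicodedata.normalize("NFC", s))

print(nfc_len("e\u0301"))  # 1, composes to U+00E9
print(nfc_len("q\u0303"))  # 2, there is no precomposed q with tilde
```

When every user-visible character must count as one, normalization alone is not enough and grapheme segmentation is the safer tool.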

Method 3: Grapheme clusters

Python's standard library doesn't include grapheme segmentation. Use the third-party 'grapheme' or 'regex' package for accurate visible-character counts.

python

# pip install grapheme
import grapheme

text = "👨‍👩‍👧‍👦 family"
print(len(text))                 # 14 code points (the family emoji alone is 7)
print(grapheme.length(text))     # 8  — visible characters

# Alternative: pip install regex
import regex
print(len(regex.findall(r'\X', text)))  # 8

Method 4: Byte counting

For storage, network, and API limits, encode the string and count bytes.

python

text = "Hello, 世界 😀"

# UTF-8 byte count
print(len(text.encode('utf-8')))   # 18

# Compare with character count
print(len(text))                    # 11

# UTF-16 (2 bytes per BMP character, 4 per character outside the BMP, such as 😀)
print(len(text.encode('utf-16-le')))  # 24
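
When a limit is expressed in bytes, slicing the string by character count is not enough. The following is a minimal sketch of one way to trim to a UTF-8 byte budget; `truncate_utf8` is a hypothetical helper, not a standard-library function.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Cutting bytes may split a multi-byte character; errors="ignore"
    # drops the incomplete trailing sequence instead of raising.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

text = "Hello, 世界 😀"
print(truncate_utf8(text, 10))  # Hello, 世
print(truncate_utf8(text, 9))   # "Hello, " (the partial 世 is dropped)
```

This keeps code points intact but can still cut through a grapheme cluster such as a ZWJ emoji sequence.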

Method 5: Characters without whitespace

Quick filter for non-whitespace characters.

python

text = "Hello, world!"
print(len(text.replace(' ', '')))       # 12, but only removes the space character
print(sum(1 for c in text if not c.isspace()))  # 12, also skips tabs and newlines

# Count specific categories
import string
letters_only = sum(1 for c in text if c in string.ascii_letters)
print(letters_only)  # 10
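
string.ascii_letters only covers A-Z and a-z. For text beyond ASCII, one option is to group characters by Unicode general category. The sketch below assumes that the first letter of the category code (L for letters, N for numbers, P for punctuation, Z for separators) is a fine enough grouping; `category_counts` is an illustrative name.

```python
import unicodedata
from collections import Counter

def category_counts(text: str) -> Counter:
    """Tally characters by the major class of their Unicode general category."""
    return Counter(unicodedata.category(c)[0] for c in text)

counts = category_counts("Hello, world! 123")
print(counts["L"])  # 10 letters
print(counts["N"])  # 3 digits
print(counts["P"])  # 2 punctuation marks
print(counts["Z"])  # 2 spaces
```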

Common Pitfalls

⚠ Python 2 was different

Python 2 strings were bytes by default. If you're maintaining legacy code, len() on a Python 2 str gives bytes, not characters. Use u'string' literals or upgrade.

⚠ Emoji ZWJ sequences

The family emoji 👨‍👩‍👧‍👦 is built from 7 code points: four person emoji joined by three zero-width joiners (U+200D). len() returns 7; users see 1. Use grapheme segmentation for the visible count.
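
To see where the 7 comes from, the sequence can be taken apart with the standard library alone. The emoji is written with escapes here so the code points are unambiguous.

```python
import unicodedata

# Family emoji spelled out: MAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, BOY
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

print(len(family))             # 7
print(family.count("\u200D"))  # 3 zero-width joiners

# List each code point with its Unicode name.
for ch in family:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```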

⚠ Normalization matters for comparison

If you compare strings (e.g., search, deduplication), normalize both with unicodedata.normalize('NFC', s) first. Otherwise visually identical strings won't match.
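
A small sketch of the mismatch and the fix, using a hypothetical helper named `same_text`:

```python
import unicodedata

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "caf\u00e9"     # é as a single code point
decomposed = "cafe\u0301"  # e followed by a combining acute accent

print(composed == decomposed)           # False
print(same_text(composed, decomposed))  # True
```

For case-insensitive matching, str.casefold() can be applied as well; whether that is appropriate depends on the application.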

See a Working Character Counter

Our Character Counter is built using the patterns from this tutorial. Open the dev tools to inspect the live implementation.

FAQ

In Python 3, len() returns the number of Unicode code points. An emoji that is a single code point counts as one; a ZWJ sequence counts as several, one per code point in the sequence.