String Length in JavaScript vs Python vs PHP vs Go vs SQL — A Complete Comparison

Ask a beginner "how do you get the length of a string?" and they'll probably give you an answer in seconds. Ask a seasoned developer the same question and they'll pause, because string length is not simple. Between UTF-16 code units, byte counts, Unicode code points, grapheme clusters, and database collations, every language makes a different trade-off. Get it wrong and you'll truncate user names, reject valid emoji, or silently corrupt data.

In this post, we'll walk through how five major languages — JavaScript, Python, PHP, Go, and SQL — handle string length, where their pitfalls lie, and when you need to reach for an alternative function. If you've ever been bitten by an emoji breaking your character limit, this one's for you.

JavaScript: UTF-16 Code Units

JavaScript strings are sequences of UTF-16 code units. The .length property returns the number of 16-bit code units, not the number of visible characters.

const str = "hello";
console.log(str.length); // 5 — straightforward

Now try with an emoji:

const emoji = "😄";
console.log(emoji.length); // 2 — one emoji, two code units

Emoji like 😄 (U+1F604) lie outside the Basic Multilingual Plane (BMP) and are encoded as a surrogate pair — two 16-bit code units. JavaScript faithfully reports both.
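The surrogate-pair arithmetic can be checked outside a JavaScript engine. The following is a minimal Python sketch, not part of any library: the helper names utf16_surrogates and utf16_length are made up for illustration. It splits a supplementary code point into its two 16-bit halves and counts UTF-16 code units the way .length does.

```python
def utf16_surrogates(code_point: int) -> tuple:
    """Split a supplementary-plane code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP use a surrogate pair")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def utf16_length(text: str) -> int:
    """Number of UTF-16 code units, which is what JavaScript's .length reports."""
    return len(text.encode("utf-16-le")) // 2


high, low = utf16_surrogates(0x1F604)   # U+1F604 is the emoji used above
print(hex(high), hex(low))              # 0xd83d 0xde04
print(utf16_length("😄"))               # 2
print(utf16_length("中文"))             # 2
```

The two values printed first are the same code units JavaScript exposes as "\uD83D\uDE04".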

CJK characters are typically within the BMP, so they work fine:

const cjk = "中文";
console.log(cjk.length); // 2 — correct, these are BMP characters

But many less-common CJK characters (CJK Unified Ideographs Extension B and later, some Cantonese characters, historical characters) are in supplementary planes and also use surrogate pairs:

const rare = "𫝀"; // U+2B740, CJK Extension D
console.log(rare.length); // 2 (surrogate pair)

To count real Unicode code points (still not grapheme clusters), use the spread operator or Array.from:

const codePoints = [...emoji].length;
console.log(codePoints); // 1 — single code point

If you need grapheme-aware length (important for things like flag emoji or skin-tone sequences), reach for Intl.Segmenter:

const flags = "🇺🇸🇨🇦"; // US flag + Canada flag
console.log(flags.length);   // 8 — four surrogate pairs
console.log([...flags].length); // 4 — four code points

const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = [...segmenter.segment(flags)];
console.log(graphemes.length); // 2 — two flags as rendered

Bottom line: JavaScript's .length counts UTF-16 code units. Use [...str].length for code points, and Intl.Segmenter for grapheme clusters.

Python: Unicode-Aware by Default

Python 3 strings are sequences of Unicode code points. The built-in len() function returns the number of code points, not bytes. This is almost always what you want.

s = "hello"
print(len(s))  # 5

Emoji just work:

emoji = "😄"
print(len(emoji))  # 1 — one code point

Python's internal representation is flexible — it uses one, two, or four bytes per character depending on the largest code point in the string (PEP 393). But you never see that complexity.
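If you are curious, the flexible storage can be observed indirectly. This sketch assumes CPython (other implementations may store strings differently) and uses a hypothetical helper, bytes_per_char, that measures how much memory one extra copy of a character adds to a string.

```python
import sys


def bytes_per_char(ch: str) -> int:
    """Extra memory CPython needs when one more copy of ch is appended."""
    return sys.getsizeof(ch * 101) - sys.getsizeof(ch * 100)


print(bytes_per_char("a"))   # 1 on CPython (fits in one byte)
print(bytes_per_char("中"))  # 2 on CPython (BMP character)
print(bytes_per_char("😄"))  # 4 on CPython (supplementary plane)
```

None of this changes what len() returns; it only affects memory use.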

When you need byte length, for example to stay within a byte-based limit such as a MySQL index key or a fixed-size protocol field, encode explicitly:

s = "😄"
print(len(s.encode("utf-8")))  # 4 — 4 bytes in UTF-8

CJK characters also vary in byte length:

cjk = "中文"
print(len(cjk))                # 2 — code points
print(len(cjk.encode("utf-8"))) # 6 — 3 bytes per character in UTF-8

Pitfall: Python 2's str was bytes, and unicode was code points. Python 3 fixed this, but if you maintain legacy code, beware of mixing str and bytes.

Bottom line: len() in Python 3 counts code points, which matches intuition for most text, though multi-code-point sequences such as flag emoji still count as more than one. Use len(s.encode('utf-8')) when you need byte length.

PHP: strlen vs mb_strlen

PHP has one of the most confusing behaviors: strlen() returns the byte length, not the character length.

echo strlen("hello"); // 5 — ASCII, bytes = characters
echo strlen("😄");    // 4 — 4 bytes in UTF-8
echo strlen("中文");  // 6 — 3 bytes per character

This is a historical artifact — PHP's core string functions predate broad Unicode support. The "multi-byte string" extension (mbstring) provides the correct function:

echo mb_strlen("😄", "UTF-8");  // 1
echo mb_strlen("中文", "UTF-8"); // 2

The common pitfall is forgetting the encoding parameter. Without it, mb_strlen() falls back to the internal encoding. Since PHP 5.6 that defaults to default_charset, which is UTF-8 out of the box, but an older install, a php.ini override, or a call to mb_internal_encoding() can change it. Always pass "UTF-8" explicitly:

// Common mistake — encoding defaults to internal_encoding
echo mb_strlen("😄"); // might be 1, might be 4, depends on config

// Correct
echo mb_strlen("😄", "UTF-8"); // 1

Pitfall: Substring operations have the same issue. substr("😄", 0, 1) returns garbage bytes. Always use mb_substr() for multi-byte strings.
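The same failure is easy to reproduce in Python by slicing the encoded bytes, which is effectively what a byte-based substring does. The sketch below also shows one possible way to cut a string to a byte budget without splitting a character; truncate_utf8 is a hypothetical helper name, not a standard function.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    clipped = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops the incomplete sequence a byte slice can leave behind.
    return clipped.decode("utf-8", errors="ignore")


raw = "😄".encode("utf-8")[:1]       # what a byte-based substring keeps
print(raw)                           # b'\xf0', not valid UTF-8 on its own
print(repr(truncate_utf8("😄", 1)))  # '' (the emoji does not fit in 1 byte)
print(truncate_utf8("中文", 4))      # 中 (3 bytes used, the partial 文 is dropped)
```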

Bottom line: strlen() returns bytes. Use mb_strlen($str, 'UTF-8') for character count. Always pass the encoding explicitly.

Go: len vs utf8.RuneCountInString

In Go, strings are read-only byte slices. There is no character type — Go uses rune (an alias for int32) to represent a Unicode code point.

s := "hello"
fmt.Println(len(s)) // 5 — bytes, same as characters for ASCII

But len() counts bytes, not characters:

emoji := "😄"
fmt.Println(len(emoji)) // 4 — 4 bytes in UTF-8

cjk := "中文"
fmt.Println(len(cjk)) // 6 — 3 bytes each

Go does not provide a built-in character-length method on strings. You need the unicode/utf8 package:

import "unicode/utf8"

emoji := "😄"
fmt.Println(utf8.RuneCountInString(emoji)) // 1 — one rune

cjk := "中文"
fmt.Println(utf8.RuneCountInString(cjk)) // 2

A common pattern for range-looping over characters:

s := "😄中文"
for i, r := range s {
    fmt.Printf("%d: %c\n", i, r)
}
// Output:
// 0: 😄
// 4: 中
// 7: 文

Notice the index jumps from 0 to 4 to 7 — those are byte offsets, not character positions.
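To make the byte offsets less mysterious, here is a small Python sketch that rebuilds them by hand. The generator name byte_offsets is an assumption for illustration; it simply adds up the UTF-8 width of each character, which is the value Go's range loop reports as the index.

```python
def byte_offsets(text: str):
    """Yield (utf8_byte_offset, character) pairs, like Go's range over a string."""
    offset = 0
    for ch in text:
        yield offset, ch
        offset += len(ch.encode("utf-8"))


print(list(byte_offsets("😄中文")))  # [(0, '😄'), (4, '中'), (7, '文')]
```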

Pitfall: Slicing a string slices bytes, so s[:2] on "😄" produces an invalid UTF-8 sequence. Use string([]rune(s)[:n]) for character-based slicing.

Bottom line: len(s) on a Go string returns bytes. Use utf8.RuneCountInString(s) for Unicode code points. Convert to []rune for character-based indexing.

SQL: LENGTH vs LEN vs CHAR_LENGTH

SQL is trickier because every database has different function names and semantics.

MySQL / MariaDB

SELECT LENGTH('😄');    -- 4 (bytes in UTF-8)
SELECT CHAR_LENGTH('😄'); -- 1 (characters)
SELECT CHARACTER_LENGTH('😄'); -- 1 (alias for CHAR_LENGTH)

MySQL's LENGTH() returns bytes in the string's character set. VARCHAR(n) is different: n counts characters, not bytes. Byte length still matters for utf8mb4 columns, because each character can take up to 4 bytes, and MySQL's row-size limit (65,535 bytes) and index key limits are measured in bytes.

SQL Server

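SELECT LEN(N'😄');        -- 2 under most collations (UTF-16 code units); 1 under a supplementary-character (_SC) collation
SELECT DATALENGTH(N'😄'); -- 4 (bytes: two UTF-16 code units, 2 bytes each)

SQL Server's LEN() returns a character count (not bytes) and ignores trailing spaces. For NVARCHAR, a "character" is a UTF-16 code unit unless the collation is supplementary-character aware (_SC), so an emoji counts as 2 by default. DATALENGTH() returns bytes: 2 per UTF-16 code unit for NVARCHAR, which means 4 for any character outside the BMP. Note the N prefix: without it the literal is VARCHAR and, under a non-UTF-8 collation, the emoji is replaced by question marks.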

PostgreSQL

SELECT LENGTH('😄');   -- 1 (characters by default)
SELECT OCTET_LENGTH('😄'); -- 4 (bytes in UTF-8)
SELECT BIT_LENGTH('😄'); -- 32 (bits)

PostgreSQL's LENGTH() returns characters, not bytes. This is the friendliest default.

SQLite

SELECT LENGTH('😄'); -- 1 (characters)

SQLite also returns characters by default.
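SQLite is easy to check from a script because Python ships with the sqlite3 module. The sketch below runs against a throwaway in-memory database; sqlite_lengths is a hypothetical helper name. Casting the text to a BLOB is one way to make LENGTH() count bytes instead of characters.

```python
import sqlite3


def sqlite_lengths(text: str) -> tuple:
    """Return (character count, byte count) as SQLite itself reports them."""
    conn = sqlite3.connect(":memory:")
    try:
        row = conn.execute(
            "SELECT LENGTH(?), LENGTH(CAST(? AS BLOB))", (text, text)
        ).fetchone()
    finally:
        conn.close()
    return tuple(row)


print(sqlite_lengths("😄"))    # (1, 4)
print(sqlite_lengths("中文"))  # (2, 6)
```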

Bottom line: Know your database. MySQL's LENGTH() is bytes; PostgreSQL's and SQLite's LENGTH() is characters. CHAR_LENGTH() and OCTET_LENGTH() are standard SQL names that work in both MySQL and PostgreSQL, so prefer them when you need portability.

Unicode Warning

Why does any of this matter? Because real-world applications store user input, and user input includes emoji, accented characters, CJK text, and Arabic script. Here's what different measurement strategies return for the same input "I ❤️ devtools 😄", where the heart is U+2764 followed by the variation selector U+FE0F:

| Method | Result | Type |
| --- | --- | --- |
| JavaScript .length | 16 | UTF-16 code units |
| Python len() | 15 | Unicode code points |
| PHP mb_strlen() | 15 | Unicode code points |
| Go utf8.RuneCountInString() | 15 | Unicode code points |
| Go len() | 22 | Bytes (UTF-8) |
| Python len(s.encode()) | 22 | Bytes (UTF-8) |
| MySQL LENGTH() | 22 | Bytes (UTF-8) |
| MySQL CHAR_LENGTH() | 15 | Characters |

MySQL's VARCHAR(n) counts characters, so this string fits in a VARCHAR(15) column. Validate it against a 15-unit limit with JavaScript's .length and you'll reject it (16 code units). Validate with Go's len() and you'll reject it too (22 bytes), even though the database would have stored it without complaint. The opposite mistake is worse: a limit that really is byte-based, such as an index key length or a fixed-size protocol field, will overflow if you validate with a character count.

Always match your validation method to the storage layer. If your database column is VARCHAR(255) in MySQL, the limit is 255 characters, so validate using code points (and keep byte length in mind for row-size and index limits). If it's NVARCHAR(255) in SQL Server, the limit is 255 UTF-16 code units (byte pairs), so a supplementary character such as an emoji uses two of them; validate using UTF-16 code units.
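One way to keep validation honest is to make the unit an explicit argument instead of an implicit property of whichever length function is nearby. The following is a minimal sketch under that assumption; measure and fits are hypothetical names, and the unit labels are illustrative rather than taken from any framework.

```python
def measure(text: str) -> dict:
    """Length of text in the three units a storage layer is likely to count."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


def fits(text: str, limit: int, unit: str) -> bool:
    """True if text is within limit when measured in the given unit."""
    return measure(text)[unit] <= limit


name = "😄" * 150
print(measure(name))                   # {'code_points': 150, 'utf8_bytes': 600, 'utf16_units': 300}
print(fits(name, 255, "code_points"))  # True
print(fits(name, 255, "utf16_units"))  # False
print(fits(name, 255, "utf8_bytes"))   # False
```

The same 150-emoji string passes or fails a limit of 255 depending only on the unit, which is the whole point of choosing it deliberately.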

Need a quick way to check string length across all these representations? Try the String Length Calculator on DevFormatters — it shows UTF-16 code units, Unicode code points, UTF-8 bytes, and grapheme clusters side by side.

Conclusion

String length is not a universal concept. Each language made different historical decisions about encoding, and those decisions still ripple through their APIs today:

  • JavaScript counts UTF-16 code units; use [...s].length for code points or Intl.Segmenter for grapheme clusters.
  • Python 3 counts Unicode code points; use len(s.encode()) for bytes.
  • PHP's strlen() counts bytes; always use mb_strlen($s, 'UTF-8') for characters.
  • Go counts bytes; use utf8.RuneCountInString(s) for runes.
  • SQL varies by vendor; know whether LENGTH() returns bytes or characters.

The next time someone says "just call .length," remember that 👨‍👩‍👧‍👦 (family emoji) is 25 bytes in UTF-8, 11 UTF-16 code units, 7 Unicode code points, and one visible grapheme cluster. Each of those numbers could be considered "the length" depending on your context.
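Those numbers are easy to verify by breaking the sequence into its code points. This Python sketch spells the family emoji with escapes so the four people and three zero-width joiners (U+200D) are visible, then adds up each code point's width in UTF-8 and UTF-16.

```python
# man, ZWJ, woman, ZWJ, girl, ZWJ, boy
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

utf8_total = utf16_total = 0
for ch in family:
    utf8 = len(ch.encode("utf-8"))
    utf16 = len(ch.encode("utf-16-le")) // 2
    utf8_total += utf8
    utf16_total += utf16
    print(f"U+{ord(ch):04X}  utf8={utf8}  utf16={utf16}")

print(utf8_total, utf16_total, len(family))  # 25 11 7
```

Each person is 4 bytes and 2 code units; each joiner is 3 bytes and 1 code unit.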

Bookmark the String Length Calculator for those times you need to see exactly what each language will count. And when you're designing APIs or writing database migrations, always ask: am I validating bytes, code points, or grapheme clusters?