Question 1

Why is 🌏 one row but len = 2?

Accepted Answer

🌏 (U+1F30F) and other supplementary-plane codepoints (above U+FFFF) take two UTF-16 code units, a surrogate pair, in JavaScript strings, even though each is a single codepoint. The tool counts codepoints (via Array.from) for the row count, but reports `string.length` separately so you can see the discrepancy.
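A minimal sketch of the two counts, using only standard JavaScript string methods. The variable names are illustrative and are not taken from the tool's source.

```javascript
// 🌏 written as an escape so the example does not depend on file encoding.
const globe = "\u{1F30F}";

// string.length counts UTF-16 code units.
const codeUnits = globe.length;

// Array.from iterates by codepoint, so a surrogate pair counts once.
const codePoints = Array.from(globe).length;

// The two code units that make up the surrogate pair, in hex.
const surrogates = [globe.charCodeAt(0), globe.charCodeAt(1)].map((u) =>
  u.toString(16).toUpperCase()
);

console.log(codeUnits);  // 2
console.log(codePoints); // 1
console.log(surrogates); // [ 'D83C', 'DF0F' ]
console.log(globe.codePointAt(0).toString(16).toUpperCase()); // 1F30F
```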

Question 2

Total bytes vs UTF-8 column — same?

Accepted Answer

Yes. Total bytes = sum of each row's UTF-8 byte count, computed via TextEncoder for accuracy on edge cases. Useful for sizing storage or wire format.
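As a rough illustration, the sum of per-row byte counts can be checked against encoding the whole string at once. The `sample` string below matches the example table on this page; `rows`, `totalFromRows`, and `totalFromString` are hypothetical names, not the tool's own.

```javascript
// "Hello, 世界! 🌏" written with escapes for the non-ASCII characters.
const sample = "Hello, \u4E16\u754C! \u{1F30F}";
const encoder = new TextEncoder();

// One row per codepoint, each with its own UTF-8 byte count.
const rows = Array.from(sample, (ch) => ({
  char: ch,
  bytes: encoder.encode(ch).length,
}));

// Sum of the per-row counts versus encoding the whole string in one call.
const totalFromRows = rows.reduce((sum, row) => sum + row.bytes, 0);
const totalFromString = encoder.encode(sample).length;

console.log(rows.length);     // 12
console.log(totalFromRows);   // 19
console.log(totalFromString); // 19
```

For well-formed strings the two totals agree, so either can be used when sizing storage.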

Question 3

What's mojibake?

Accepted Answer

Garbled text from interpreting bytes in the wrong encoding. Classic: UTF-8 'é' (C3 A9) read as Latin-1 becomes 'Ã©', because each byte is shown as its own character (C3 is 'Ã', A9 is '©'). This tool can help diagnose it: paste the garbled string and check whether the codepoints match what UTF-8 bytes decoded as Latin-1 would produce.
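The Latin-1 case can be reproduced and reversed in a few lines. This is a sketch under stated assumptions: `garble` and `ungarble` are hypothetical helpers, not part of the tool, and they model strict Latin-1 (each byte maps to the codepoint of the same value). Text that was decoded as Windows-1252 differs for bytes 80 to 9F and would need extra handling.

```javascript
// Hypothetical helpers for illustration; not the tool's own code.

// Simulate the bug: encode as UTF-8, then read each byte as a Latin-1 character.
function garble(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

// Try to undo it: treat each character as one byte, then decode as UTF-8.
// Returns null when the input cannot be wrong-decoded UTF-8.
function ungarble(text) {
  const units = Array.from(text, (ch) => ch.codePointAt(0));
  if (units.some((u) => u > 0xff)) return null; // not representable as Latin-1 bytes
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(Uint8Array.from(units));
  } catch {
    return null; // the bytes were not valid UTF-8
  }
}

console.log(garble("\u00E9"));         // "Ã©" (U+00C3 U+00A9)
console.log(ungarble("\u00C3\u00A9")); // "é" (U+00E9)
console.log(ungarble("\u00E9"));       // null: a lone E9 byte is not valid UTF-8
```

If `ungarble` returns a readable string, the input was very likely UTF-8 that had been decoded as Latin-1.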

Question 4

What about combining characters / grapheme clusters?

Accepted Answer

We show codepoints, not graphemes. 'é' can be one codepoint (U+00E9) or two (e + combining acute, U+0065 + U+0301). The visual character is the same; the byte representation isn't: C3 A9 in UTF-8 for the first form, 65 CC 81 for the second. For proper grapheme counting, you'd need Intl.Segmenter, which is beyond this tool's scope.
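For readers who do need grapheme counts, here is a small sketch comparing the two forms of 'é'. It assumes a runtime that provides Intl.Segmenter (recent browsers and Node.js); the helper names `utf8Length` and `graphemeCount` are illustrative only.

```javascript
const precomposed = "\u00E9"; // é as one codepoint
const decomposed = "e\u0301"; // e followed by a combining acute accent

// UTF-8 byte length of a string.
const utf8Length = (s) => new TextEncoder().encode(s).length;

// Count user-perceived characters (grapheme clusters) with Intl.Segmenter.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemeCount = (s) => Array.from(segmenter.segment(s)).length;

console.log(Array.from(precomposed).length, utf8Length(precomposed)); // 1 2
console.log(Array.from(decomposed).length, utf8Length(decomposed));   // 2 3
console.log(graphemeCount(precomposed), graphemeCount(decomposed));   // 1 1

// NFC normalization composes the pair into the single codepoint.
console.log(decomposed.normalize("NFC") === precomposed); // true
```

Normalizing input to NFC before comparing or storing it is one common way to avoid treating the two forms as different strings.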

Char	Codepoint	UTF-8 bytes	HTML entity	CSS escape	Block
H	U+0048	48	&#x48;	\0048	Basic Latin (ASCII)
e	U+0065	65	&#x65;	\0065	Basic Latin (ASCII)
l	U+006C	6C	&#x6C;	\006C	Basic Latin (ASCII)
l	U+006C	6C	&#x6C;	\006C	Basic Latin (ASCII)
o	U+006F	6F	&#x6F;	\006F	Basic Latin (ASCII)
,	U+002C	2C	&#x2C;	\002C	Basic Latin (ASCII)
␠	U+0020	20	&#x20;	\0020	Basic Latin (ASCII)
世	U+4E16	E4 B8 96	&#x4E16;	\4E16	CJK Unified Ideographs
界	U+754C	E7 95 8C	&#x754C;	\754C	CJK Unified Ideographs
!	U+0021	21	&#x21;	\0021	Basic Latin (ASCII)
␠	U+0020	20	&#x20;	\0020	Basic Latin (ASCII)
🌏	U+1F30F	F0 9F 8C 8F	&#x1F30F;	\1F30F	Miscellaneous Symbols & Pictographs

Unicode Character Inspector
