AZ Tools

Unicode Character Inspector

Text

Every codepoint in the input gets a row showing: the character itself, the codepoint in `U+HHHH` hex, the UTF-8 byte sequence, an HTML decimal entity (`&#NNNN;`), a CSS escape (`\HHHH`), and the Unicode block it belongs to. Useful for debugging mojibake, finding the exact codepoint of a confusing character (is that a hyphen-minus or an em dash?), or seeing how many bytes your string takes in UTF-8 storage. Surrogate pairs are handled correctly because the input is iterated with `Array.from`, which walks the string by codepoint rather than by UTF-16 code unit.

string.length: 13
Codepoints: 12
UTF-8 bytes: 19

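| Char | Codepoint | UTF-8 bytes | HTML entity | CSS escape | Block |
| --- | --- | --- | --- | --- | --- |
| H | U+0048 | 48 | `&#72;` | `\0048` | Basic Latin (ASCII) |
| e | U+0065 | 65 | `&#101;` | `\0065` | Basic Latin (ASCII) |
| l | U+006C | 6C | `&#108;` | `\006C` | Basic Latin (ASCII) |
| l | U+006C | 6C | `&#108;` | `\006C` | Basic Latin (ASCII) |
| o | U+006F | 6F | `&#111;` | `\006F` | Basic Latin (ASCII) |
| , | U+002C | 2C | `&#44;` | `\002C` | Basic Latin (ASCII) |
| ␠ | U+0020 | 20 | `&#32;` | `\0020` | Basic Latin (ASCII) |
| 世 | U+4E16 | E4 B8 96 | `&#19990;` | `\4E16` | CJK Unified Ideographs |
| 界 | U+754C | E7 95 8C | `&#30028;` | `\754C` | CJK Unified Ideographs |
| ! | U+0021 | 21 | `&#33;` | `\0021` | Basic Latin (ASCII) |
| ␠ | U+0020 | 20 | `&#32;` | `\0020` | Basic Latin (ASCII) |
| 🌏 | U+1F30F | F0 9F 8C 8F | `&#127759;` | `\1F30F` | Miscellaneous Symbols & Pictographs |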

Codepoints are iterated with `Array.from` (surrogate-pair safe). Block names cover the most common Unicode ranges; niche blocks may show '—'.
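As a rough illustration of how one table row could be derived, here is a minimal sketch. It is not the tool's actual source: `inspectText`, `hex`, and the three-entry `BLOCKS` table are hypothetical names and data chosen for the example, and a real block lookup would cover far more ranges.

```javascript
// Hypothetical, deliberately tiny block table (assumption, not the tool's data).
const BLOCKS = [
  [0x0000, 0x007f, "Basic Latin (ASCII)"],
  [0x4e00, 0x9fff, "CJK Unified Ideographs"],
  [0x1f300, 0x1f5ff, "Miscellaneous Symbols & Pictographs"],
];

const encoder = new TextEncoder();

// Uppercase hex, left-padded with zeros to at least `width` digits.
function hex(n, width) {
  return n.toString(16).toUpperCase().padStart(width, "0");
}

function inspectText(text) {
  // Array.from iterates by codepoint, so a surrogate pair yields one element.
  return Array.from(text, (char) => {
    const cp = char.codePointAt(0);
    const block = BLOCKS.find(([lo, hi]) => cp >= lo && cp <= hi);
    return {
      char,
      codepoint: "U+" + hex(cp, 4),
      // Encode this single codepoint and print each byte as two hex digits.
      utf8: Array.from(encoder.encode(char), (b) => hex(b, 2)).join(" "),
      htmlEntity: "&#" + cp + ";",
      cssEscape: "\\" + hex(cp, 4),
      // Fall back to a dash when the codepoint is outside the known ranges.
      block: block ? block[2] : "\u2014",
    };
  });
}
```

Calling `inspectText("Hello, 世界! 🌏")` returns 12 row objects, one per codepoint. When pasting a CSS escape into a stylesheet, note that an escape shorter than six hex digits needs a trailing space if the next character is itself a hex digit.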

How to use

  1. Paste or type text in the input box.
  2. Read each character's metadata in the table.
  3. Copy the parsed table as TSV with the copy button.
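
The TSV produced in step 3 could be built along these lines. This is only a sketch: the `COLUMNS` keys and the row object shape are assumptions made for the example, not the tool's real field names.

```javascript
// Hypothetical column keys; the row objects are assumed to carry these fields.
const COLUMNS = ["char", "codepoint", "utf8", "htmlEntity", "cssEscape", "block"];

function toTsv(rows) {
  // Tabs and line breaks inside a cell would break the TSV layout, so they
  // are replaced with a space (this also affects an inspected tab character).
  const clean = (value) => String(value).replace(/[\t\r\n]/g, " ");
  const header = COLUMNS.join("\t");
  const lines = rows.map((row) =>
    COLUMNS.map((key) => clean(row[key])).join("\t")
  );
  return [header, ...lines].join("\n");
}
```

In a browser, the resulting string could then be handed to `navigator.clipboard.writeText`, which returns a promise and needs a secure context.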

Frequently asked questions

Why is 🌏 one row when `string.length` counts it as 2?
Emoji and other supplementary-plane codepoints (above U+FFFF) take two UTF-16 code units in JavaScript strings, but each is a single codepoint. The tool uses codepoints (`Array.from`) for the row count and reports `string.length` separately so you can see the discrepancy.
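
The discrepancy can be reproduced in a few lines. The helper names `lengths` and `codeUnitsOf` are made up for this sketch.

```javascript
// Compare the UTF-16 code unit count with the codepoint count.
function lengths(text) {
  return {
    codeUnits: text.length,               // what string.length reports
    codepoints: Array.from(text).length,  // what the row count reports
  };
}

// List the UTF-16 code units of a string as uppercase hex.
function codeUnitsOf(text) {
  const units = [];
  for (let i = 0; i < text.length; i++) {
    units.push(text.charCodeAt(i).toString(16).toUpperCase());
  }
  return units;
}
```

For "🌏" (U+1F30F), `lengths` reports 2 code units and 1 codepoint, and `codeUnitsOf` shows the surrogate pair D83C DF0F.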
Is the total byte count the same as the sum of the UTF-8 column?
Yes. Total bytes = sum of each row's UTF-8 byte count, computed via TextEncoder for accuracy on edge cases. Useful for sizing storage or wire format.
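
A small sketch of that equality, assuming nothing beyond the standard `TextEncoder`; the function names are illustrative only.

```javascript
const utf8 = new TextEncoder();

// Byte count of the whole string in one pass.
function totalUtf8Bytes(text) {
  return utf8.encode(text).length;
}

// Byte count of each codepoint, in the same order as the table rows.
function perRowUtf8Bytes(text) {
  return Array.from(text, (char) => utf8.encode(char).length);
}
```

For the sample input "Hello, 世界! 🌏", the per-row counts are 1 for each ASCII character, 3 for each CJK ideograph, and 4 for the emoji, which adds up to the reported total of 19.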
What's mojibake?
Garbled text produced by interpreting bytes in the wrong encoding. The classic case: UTF-8 'é' (C3 A9) read as Latin-1 becomes 'Ã©'. This tool can help diagnose it: paste the garbled string and check whether the codepoints match what wrongly decoded UTF-8 would produce.
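
The round trip can be sketched as follows. `decodeLatin1`, `garble`, and `tryRepair` are hypothetical helpers, and the sketch assumes strict ISO-8859-1, where every byte maps to the codepoint of the same value. Real-world mojibake often involves Windows-1252 instead, which differs in the 0x80 to 0x9F range and is not handled here.

```javascript
// Decode bytes as ISO-8859-1: each byte becomes the codepoint of the same value.
function decodeLatin1(bytes) {
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

// Simulate the mistake: encode as UTF-8, then read the bytes as Latin-1.
function garble(text) {
  return decodeLatin1(new TextEncoder().encode(text));
}

// Try to undo it: treat each character as one byte and decode as UTF-8.
// Returns null when the input cannot be Latin-1-decoded UTF-8.
function tryRepair(garbled) {
  const values = Array.from(garbled, (ch) => ch.codePointAt(0));
  if (values.some((cp) => cp > 0xff)) return null; // not a single byte
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(Uint8Array.from(values));
  } catch {
    return null; // the byte sequence is not valid UTF-8
  }
}
```

Here `garble("é")` yields 'Ã©' (U+00C3 U+00A9), and `tryRepair` on that string gives back 'é'. Pure ASCII input comes back unchanged, so a result identical to the input means there was nothing to repair.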
What about combining characters / grapheme clusters?
We show codepoints, not graphemes. 'é' can be one codepoint (U+00E9) or two (e + combining acute, U+0065 + U+0301). The visual character is the same; the byte representation isn't. Proper grapheme counting needs `Intl.Segmenter`, which is beyond this tool's scope.
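
For readers who do need grapheme counts, a minimal sketch with `Intl.Segmenter` might look like this. It assumes a runtime that ships `Intl.Segmenter` (recent browsers and Node.js versions do); `countGraphemes` is a name invented for the example.

```javascript
// Count user-perceived characters (grapheme clusters) rather than codepoints.
function countGraphemes(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text)).length;
}

const precomposed = "\u00E9";  // é as a single codepoint
const decomposed = "e\u0301";  // e followed by a combining acute accent

// Same visual character, different codepoint and byte counts.
const codepointCounts = [precomposed, decomposed].map((s) => Array.from(s).length);
const byteCounts = [precomposed, decomposed].map((s) => new TextEncoder().encode(s).length);
```

Both strings count as one grapheme, while `codepointCounts` is `[1, 2]` and `byteCounts` is `[2, 3]`. If the two forms need to compare equal, `decomposed.normalize("NFC")` converts the two-codepoint form into the precomposed one.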
