Normalize Unicode Text
Advanced Unicode normalization with NFC/NFD/NFKC/NFKD forms, smart detection, and analysis.
What is the Unicode Normalizer?
Did you know "café" can be written two ways? As c-a-f-é (4 characters) OR as c-a-f-e-´ (5 characters). To a computer these are completely different strings, which breaks search, sorting, and user logins. The Normalize Unicode Text tool fixes this by converting text into a single standard form (such as NFC or NFD). It's essential for anyone handling data from multiple sources. For simpler cleanup tasks, check out the Non-ASCII Remover.
Features
4 Unicode Forms
Complete coverage: NFC (compose), NFD (decompose), NFKC (compat compose), NFKD (compat decompose). All standard forms available.
Smart Detection
Auto-analyzes combining marks, compatibility chars (①½fi), ligatures. Recommends optimal form based on content.
Character Analysis
6 metrics: input/output length, changes, combining marks, compatibility chars, form. Understand text composition.
Unicode Code Points
Highlight mode shows transformations with tooltips (U+00E9 → U+0065 U+0301). Perfect for debugging.
Undo/Redo History
5-level history for testing different forms. Easy comparison between NFC/NFD/NFKC/NFKD effects.
File Processing
Upload/download files, batch mode for line-by-line processing. Handle entire datasets efficiently.
Use Cases
🔍 Search Indexing
If a user searches for "café" (NFC), they should find "café" (NFD). Normalizing text before indexing ensures matches succeed regardless of which form the text was stored in.
🗄️ Database Deduplication
Prevent duplicate records that look identical but have different underlying code point sequences. Standardize to NFC before saving.
🔎 URL Safety
Browsers and servers may treat visually identical but differently encoded paths as distinct resources. Ensure your URLs are consistent across all platforms.
📂 File System Compatibility
macOS (HFS+) stores filenames in a decomposed, NFD-style form, while Windows and Linux typically keep them as entered (usually NFC). Normalizing filenames prevents odd errors when moving files between operating systems.
How to Use
- Enter or Upload Text: Type/paste text or upload .txt/.md files. Smart detection automatically analyzes combining marks and compatibility characters.
- Choose Normalization Form: Select NFC (compose - most common), NFD (decompose), NFKC (compat compose - for search), or NFKD (compat decompose - maximum decomposition).
- Review Detection: Check the smart detection hint to see what Unicode features were found and which form is recommended.
- Enable Options: Use Batch Mode for line-by-line, Whitespace Normalizer, Highlight Mode to see code points, Auto-Copy for clipboard.
- Check Statistics: Review the 6 metrics to understand the normalization impact, such as combining marks found, compatibility chars, and characters changed.
- Compare Forms: Use Comparison Mode to see original vs normalized side-by-side. Use Undo/Redo to test different forms.
- Copy or Download: Save the normalized result with form-specific filename (e.g., normalized-nfc-xxx.txt).
Examples
Normalization Form Comparison
Input with combining marks:
café (NFD: e + ́)
NFC Output (composed):
café (single é char)
Input with compat chars:
① ½ ﬁnal (with ﬁ ligature)
NFKC Output (normalized):
1 1⁄2 final
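These conversions can be reproduced in code with the standard String.prototype.normalize() method available in JavaScript/TypeScript; the sketch below mirrors the examples above rather than the tool's internals:

```ts
// Decomposed input: "cafe" followed by U+0301 COMBINING ACUTE ACCENT
const decomposed = "cafe\u0301";            // renders as "café", 5 code points
console.log(decomposed.normalize("NFC"));   // "café" with é as a single code point (U+00E9)

// Compatibility characters: ① (U+2460), ½ (U+00BD), ﬁ ligature (U+FB01)
const compat = "\u2460 \u00BD \uFB01nal";
console.log(compat.normalize("NFKC"));      // "1 1⁄2 final"
```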
Frequently Asked Questions
What is Unicode normalization and why is it needed?
Unicode normalization ensures that visually identical text has a consistent binary representation. The same character can be represented in multiple ways in Unicode: for example, 'é' can be stored as a single precomposed character (U+00E9) OR as 'e' (U+0065) + combining acute accent (U+0301). Both look identical but are stored differently. This causes problems:
- Text comparison fails - 'café' (NFC) ≠ 'café' (NFD) even though they look the same.
- Database lookups miss matches - searching for one form won't find the other.
- String sorting breaks - identical-looking strings sort differently.
- File naming issues - the same filename can exist twice.
- URL conflicts - web paths become inconsistent.
Unicode normalization converts text to a standard form, ensuring consistent storage, reliable comparisons, predictable sorting, and universal compatibility. It's essential for any application handling international text, especially databases, search engines, file systems, and web applications.
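As a concrete illustration, both representations of 'café' render identically yet compare as different strings until they are normalized (a minimal example using the built-in normalize() method):

```ts
const precomposed = "caf\u00E9";   // é stored as a single code point (U+00E9)
const decomposed  = "cafe\u0301";  // e (U+0065) + combining acute accent (U+0301)

console.log(precomposed === decomposed);                       // false: different code points
console.log([...precomposed].length, [...decomposed].length);  // 4 5
console.log(precomposed.normalize("NFC") === decomposed.normalize("NFC")); // true
```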
What are the 4 Unicode normalization forms and when should I use each?
The 4 standard Unicode normalization forms serve different purposes:
- NFC (Canonical Composition) - Composes characters using canonical equivalence: 'e' + accent → 'é'. This is the most common form, used by web standards and most databases. Choose NFC when you want compact storage and broad compatibility.
- NFD (Canonical Decomposition) - Decomposes characters into base + combining marks: 'é' → 'e' + accent. Used by macOS's HFS+ file system for filenames and for text processing that needs to analyze individual components. Choose NFD when you need to separate diacritics or perform linguistic analysis.
- NFKC (Compatibility Composition) - Like NFC but also normalizes compatibility characters such as ligatures (ﬁ→fi), circled numbers (①→1), and fractions (½→1⁄2). Used for search normalization and data cleanup. Choose NFKC when you need semantic equivalence for matching purposes.
- NFKD (Compatibility Decomposition) - Like NFD but also decomposes compatibility characters. The most aggressive form. Choose NFKD for maximum decomposition before further processing.
Default recommendation: Use NFC for general storage and web content. Use NFKC for search indexing and user input normalization.
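To see how the four forms differ on one input, a string containing a decomposed accent and two compatibility characters can be run through each form; a short sketch:

```ts
// Sample: ﬁ ligature (U+FB01), circled one (U+2460), decomposed é (e + U+0301)
const sample = "\uFB01 \u2460 e\u0301";

for (const form of ["NFC", "NFD", "NFKC", "NFKD"] as const) {
  console.log(form, sample.normalize(form));
}
// NFC  → "ﬁ ① é"  (é composed; compatibility characters untouched)
// NFD  → "ﬁ ① é"  (é stays decomposed; compatibility characters untouched)
// NFKC → "fi 1 é"  (compatibility characters folded; é composed)
// NFKD → "fi 1 é"  (compatibility characters folded; é decomposed)
```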
How does Smart Form Detection work?
Smart Form Detection automatically analyzes your text to identify its Unicode characteristics and recommend the optimal normalization form. It scans for three key indicators:
- Combining Marks (U+0300-U+036F) - Separate diacritical marks attached to base characters. If detected, your text likely uses the NFD (decomposed) form. The tool counts them and displays: 'X combining marks (NFD form)'.
- Compatibility Characters - Circled numbers (①②③), vulgar fractions (½¼¾), Roman numerals (ⅰⅱⅲ), superscripts, subscripts, and enclosed alphanumerics. If detected: 'X compatibility chars' with an NFKC recommendation.
- Ligatures - Combined characters like ﬁ (fi ligature), ﬂ (fl ligature), ﬀ (ff ligature). If detected: 'X ligatures' with an NFKC recommendation.
The tool then provides a recommendation: combining marks but no compatibility characters → NFC (compose to standard form); compatibility characters or ligatures → NFKC (normalize compatibility characters); otherwise it shows '✅ Text appears to be in standard form.' This automatic analysis saves you from having to understand the technical details while ensuring you choose the right normalization strategy.
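A rough sketch of this kind of detection logic is shown below; the function name and rules are illustrative, and the tool's actual heuristics may differ:

```ts
function recommendForm(text: string): "NFC" | "NFKC" | "none" {
  // Count combining diacritical marks (U+0300-U+036F).
  const combining = (text.match(/[\u0300-\u036F]/g) ?? []).length;
  // Treat a character as "compatibility" if NFKC changes it beyond what NFC does.
  const compat = [...text].filter(ch => ch.normalize("NFKC") !== ch.normalize("NFC")).length;

  if (compat > 0) return "NFKC";   // ligatures, circled numbers, fractions, ...
  if (combining > 0) return "NFC"; // decomposed text: compose it
  return "none";                   // text already appears to be in standard form
}

console.log(recommendForm("cafe\u0301"));    // "NFC"
console.log(recommendForm("\u2460 \uFB01")); // "NFKC"
```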
What do the statistics tell me about my text?
The tool displays 6 metrics to help you understand your text's Unicode composition:
- Input Length - Total characters in the original text. Baseline measurement.
- Output Length - Characters after normalization. Can change with composition/decomposition: 'é' (1 char in NFC) becomes 'e' + accent (2 chars in NFD).
- Changed - Count of characters actually modified by normalization. Zero means the text was already in the target form.
- Combining Marks - Number of separate diacritical marks detected (U+0300-U+036F range). A high count indicates NFD or NFKD form; zero indicates NFC or NFKC.
- Compat Chars - Compatibility characters found (circled numbers, fractions, ligatures, etc.). Indicates whether NFKC/NFKD normalization will have an effect.
- Form - Which normalization form was applied (NFC/NFD/NFKC/NFKD).
Example interpretations: '0 changed' = text was already in the target form, no conversion needed. '15 combining marks, 0 compat' = text is in NFD form; use NFC to compose. '5 compat chars' = text has special characters that NFKC will normalize. A length increase from 50 to 75 when converting NFC→NFD = decomposition added combining marks. These metrics help validate your normalization choice and understand your text's structure.
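Most of these metrics can be derived directly from the built-in normalize() method; the sketch below is illustrative (the 'Changed' count requires a character-level diff and is omitted here):

```ts
function analyze(input: string, form: "NFC" | "NFD" | "NFKC" | "NFKD") {
  const output = input.normalize(form);
  return {
    form,
    inputLength: [...input].length,    // code points before normalization
    outputLength: [...output].length,  // code points after normalization
    combiningMarks: (input.match(/[\u0300-\u036F]/g) ?? []).length,
    // A character counts as "compatibility" if NFKD changes it beyond what NFD does.
    compatChars: [...input].filter(ch => ch.normalize("NFKD") !== ch.normalize("NFD")).length,
  };
}

// Decomposed "café" plus a circled one:
console.log(analyze("cafe\u0301 \u2460", "NFC"));
// → { form: "NFC", inputLength: 7, outputLength: 6, combiningMarks: 1, compatChars: 1 }
```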
Can I process files and use batch mode?
Yes! The tool supports both file processing and batch mode:
- File Upload - Click 'Upload' to load .txt or .md files. Perfect for normalizing entire documents, configuration files, or exported data. The tool reads the file, applies your selected normalization form, and lets you download the normalized result with a timestamped filename (e.g., 'normalized-nfc-1642534567.txt').
- Batch Mode - When enabled, processes each line independently. Essential for CSV/TSV data (normalize each row while preserving structure), line-based logs (process entries individually without affecting formatting), lists of names, addresses, or URLs where each line is a separate entity, and configuration files (normalize values line-by-line).
Both features work together: upload a file AND enable Batch Mode for line-by-line processing of the file content. The Whitespace Normalizer option complements these by collapsing extra spaces and trimming lines after normalization. Typical use cases: normalize a database export before import (upload CSV), normalize 1000 product names in a list (upload + batch mode), normalize user-generated content files for consistent storage, or clean up downloaded text data with mixed Unicode forms.
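Conceptually, batch mode is just per-line normalization; a minimal sketch (assuming newline-delimited text, with the optional whitespace cleanup folded in):

```ts
function normalizeLines(text: string, form: "NFC" | "NFD" | "NFKC" | "NFKD"): string {
  return text
    .split("\n")
    .map(line => line.normalize(form).replace(/[ \t]+/g, " ").trim()) // whitespace-normalizer step
    .join("\n");
}

const rows = "Zoe\u0308,Berlin\nRene\u0301,Paris";  // decomposed umlaut and accent
console.log(normalizeLines(rows, "NFC"));            // "Zoë,Berlin\nRené,Paris"
```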
What does the highlight mode show?
Highlight Changes mode provides visual feedback showing exactly which characters were affected by normalization:
- Blue background - Characters that changed appear highlighted.
- Dotted blue underline - An additional visual indicator.
- Unicode code point tooltips - Hover over highlighted characters to see the transformation in Unicode code points (e.g., 'U+00E9 → U+0065 U+0301', showing é decomposed to e + combining acute).
This is invaluable for understanding normalization (see visually what happens when you switch between forms such as NFC→NFD or NFKC→NFKD), debugging text issues (identify which characters are causing problems in your data), quality assurance (verify the right characters are being normalized before saving), learning Unicode (understand how characters like 'é' can be stored as either precomposed or decomposed), and validation (confirm compatibility characters like '①' are being converted: NFKC ① → 1). The highlighting works in both normal and Comparison modes. In Comparison mode, you see original and normalized text side-by-side with changes clearly marked. This transparency is essential when dealing with subtle Unicode differences that are invisible to the eye but critical for data consistency.
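The code point notation shown in the tooltips can be reproduced with a small helper; the formatting function below is illustrative:

```ts
const toCodePoints = (s: string): string =>
  [...s]
    .map(ch => "U+" + ch.codePointAt(0)!.toString(16).toUpperCase().padStart(4, "0"))
    .join(" ");

const original = "\u00E9";                    // precomposed é
const normalized = original.normalize("NFD"); // decomposes to e + combining acute
console.log(`${toCodePoints(original)} → ${toCodePoints(normalized)}`);
// U+00E9 → U+0065 U+0301
```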
How does normalization fix text comparison issues?
Text comparison failures are among the most common Unicode problems that normalization solves. The problem: two strings that look identical can fail comparison because they use different Unicode representations. For example, 'café' can be stored as:
- Form 1 (NFC): c + a + f + é (U+00E9 precomposed).
- Form 2 (NFD): c + a + f + e + ́ (U+0065 + U+0301 combining).
Comparing these directly gives Form1 === Form2 → FALSE, even though they look identical. This breaks:
- Database searches - users searching for 'café' won't find 'café' stored in the other form.
- Username/email validation - the same email can be registered twice.
- File deduplication - identical filenames are stored separately.
- Password comparison - the same password fails authentication.
The solution: normalize both strings to the same form before comparison: normalize(Form1, NFC) === normalize(Form2, NFC) → TRUE. By converting both to NFC (or any consistent form), you ensure reliable searches (all variations of 'café' match), working unique constraints (databases properly reject duplicates), consistent sorting (identical strings sort together), and predictable behavior (text operations work as expected). This tool lets you normalize user input, database content, or file data to ensure one canonical representation for each logical string, eliminating comparison failures.
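In application code the fix is a one-line habit: normalize both sides before every comparison. A minimal helper (the function name is illustrative):

```ts
function unicodeEquals(a: string, b: string): boolean {
  // Bring both strings to the same canonical form before comparing.
  return a.normalize("NFC") === b.normalize("NFC");
}

const nfc = "caf\u00E9";    // precomposed é
const nfd = "cafe\u0301";   // e + combining acute accent
console.log(nfc === nfd);             // false: raw comparison fails
console.log(unicodeEquals(nfc, nfd)); // true: normalized comparison succeeds
```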
What are combining marks and why do they matter?
Combining marks (also called diacritical marks or combining characters) are Unicode characters that modify the preceding base character without occupying their own space. They live in the range U+0300 to U+036F and include accents, umlauts, tildes, etc. How they work: instead of using a single precomposed character like 'é' (U+00E9), you can use the base character 'e' (U+0065) + a combining acute accent (U+0301) = 'é'. Both produce visually identical results but are stored completely differently. Why they matter:
- NFD vs NFC - NFD uses combining marks (decomposed); NFC uses precomposed characters (composed).
- Character counting - 'café' is 4 characters in NFC but 5 in NFD (the extra combining mark).
- String manipulation - slicing 'café' (NFD) after the fourth character separates the 'e' from its accent, leaving a broken character.
- Database storage - some databases have poor combining-mark support.
- Search issues - searching for precomposed 'é' won't match decomposed 'e' + combining accent.
Our tool displays the combining-mark count in statistics, detects their presence and recommends an appropriate normalization, converts between composed (NFC) and decomposed (NFD) forms, and shows the transformation visually in highlight mode. For most applications, NFC (composed, no separate combining marks) is preferred for simpler handling and better compatibility.
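The counting and slicing pitfalls are easy to reproduce; a short sketch:

```ts
const composed = "caf\u00E9";                 // "café" with precomposed é
const decomposed = composed.normalize("NFD"); // "café" as e + combining accent

console.log([...composed].length);   // 4: é is one precomposed code point
console.log([...decomposed].length); // 5: e plus a separate combining mark

// Naive slicing of decomposed text can strand a combining mark:
console.log(decomposed.slice(0, 4)); // "cafe": the accent (U+0301) is left behind
```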
What are compatibility characters and should I normalize them?
Compatibility characters are Unicode characters that exist for backward compatibility with older character sets but have semantically equivalent alternatives. Examples include:
- Circled numbers: ① ② ③ (equivalent to 1, 2, 3).
- Vulgar fractions: ½ ¼ ¾ (equivalent to 1⁄2, 1⁄4, 3⁄4 or 1/2, 1/4, 3/4).
- Roman numerals: ⅰ ⅱ ⅲ (equivalent to i, ii, iii).
- Ligatures: ﬁ ﬂ ﬀ (equivalent to fi, fl, ff).
- Enclosed alphanumerics: ⒜ ⒝ ⒞.
- Superscripts, subscripts, and letterlike symbols: ² ³ ℓ.
When to normalize: use NFKC or NFKD to convert these to their semantic equivalents for search normalization (so users searching '1' also find '①'), data cleanup (consistent representation across systems), database import (removing special formatting variants), plain-text extraction (converting formatted text to searchable content), and text analysis (ensuring 'ﬁ' and 'fi' are counted the same). When NOT to normalize: keep compatibility characters when preserving formatting (the visual distinction is meaningful), for display purposes (the special character conveys information), or for historical accuracy (maintaining the original text representation). Our tool detects compatibility characters in your text, counts them in statistics (the Compat Chars metric), recommends NFKC when they are detected, and preserves them with NFC/NFD (it only normalizes them with NFKC/NFKD).
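For search-oriented cleanup, NFKC folding is often combined with lowercasing and, optionally, accent stripping; the accent-stripping step in this sketch goes beyond Unicode normalization itself and is shown only as a common companion technique:

```ts
function searchKey(s: string): string {
  return s
    .normalize("NFKC")                // fold compatibility chars: ① → 1, ﬁ → fi, ½ → 1⁄2
    .toLowerCase()
    .normalize("NFD")                 // decompose so accents become separate marks
    .replace(/[\u0300-\u036F]/g, ""); // optional: drop combining marks ("café" → "cafe")
}

console.log(searchKey("\u2460 Caf\u00E9 \uFB01le")); // "1 cafe file"
```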
How is this different from other text normalization tools?
Our Unicode normalization tool offers several advantages over basic normalizers:
- Complete Form Coverage - Many tools only offer NFC or NFD. We provide all 4 standard forms (NFC/NFD/NFKC/NFKD), giving you full control over normalization strategy.
- Smart Detection - Most tools require you to know which form to use. Ours automatically detects combining marks, compatibility characters, and ligatures, then recommends the optimal form.
- Visual Analysis - Basic tools just convert text. We show exactly what changed (highlight mode), Unicode code points for transformations (tooltips), character composition analysis (statistics), and combining-mark and compatibility-character counts.
- Batch Processing - Process entire files or line-by-line datasets, not just single snippets.
- Real-time Statistics - Immediate feedback on normalization impact: length changes, character analysis, form verification.
- Comparison View - Side-by-side before/after view for verification.
- Professional Features - File upload/download, undo/redo history, auto-copy, whitespace normalization.
- Educational - Visual feedback helps you understand Unicode concepts instead of blindly converting.
Use our tool when you need to choose between multiple normalization forms, want to understand what's changing in your text, are processing large datasets or files, or need professional-grade Unicode normalization with validation.