Text that is repeatedly put through a web-based content management system can get badly mutilated. If that text is treated as XML, the mutilation can be fatal.
While investigating this I established the best encoding (numeric entities) to use and discovered a few problems that I hadn't been aware of.
This article is derived from internal notes and mentions named character entities, literal Unicode glyphs, UTF-8 encoding, numerically encoded characters (decimal and hexadecimal), TrueType fonts, XML, XHTML, Content Management...
First published 10 September 2004
I have been working on a content management system. It takes web pages encoded as XHTML, lets users edit the content, then publishes the content back to the web.
I used third-party tools, expecting to simply plug them in and ignore the internal plumbing. It didn't work out that way. Even though I severely limited the editing, the XML was getting mangled and needed rehabilitation. Among the problems, some characters were getting lost and others were transformed beyond recognition.
Not all of the several ways that characters can be represented in XML (or XHTML) survive repeated processing.
I looked at three of these representations: numeric character references, named character entities and literal characters.
The table below shows how these three forms look. It contains the characters which have named entities and which tend to be most commonly needed in English, western European languages, mathematics... Even though the three columns look the same, under the skin they are written differently and they don't always behave the same. Some of these differences are described in the notes.
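As a rough illustration of "written differently under the skin", the sketch below takes one character from the table (the pound sign, number 163) and decodes each form. It uses Python's standard `html.unescape`, which follows HTML rules rather than XML rules; the `forms` dictionary and the `decode` helper are names made up for this example and are not part of the system described here.

```python
import html

# Several ways of writing the pound sign (code point 163) in markup.
forms = {
    "numeric (decimal)": "&#163;",
    "numeric (hex)": "&#xA3;",
    "named": "&pound;",
    "literal": "\u00a3",
}


def decode(markup: str) -> str:
    """Resolve character references the way an HTML parser would."""
    return html.unescape(markup)


for label, markup in forms.items():
    char = decode(markup)
    # The source text differs, but the decoded code point is the same.
    print(f"{label:18} source={markup!r:12} code point={ord(char)}")
```

Every form decodes to the same code point here, which is why the columns look identical in a browser. What differs is the sequence of bytes each tool in the chain has to read and write.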
The same page will also look different if displayed with different fonts. In some fonts characters may show up, while in others they may go missing. (Missing characters usually show as a hollow box.) I have picked one font that shows a lot of the characters which are missing on this page. It may be seen on this version of the page. If you have the font called Arial Unicode MS installed on your computer, the missing characters should show up. If not, it will look the same as this page.
This displays using Arial, if you have it installed. Failing that it may display in Syntax, Helvetica or your default sans-serif font.
3 Representations | ||||||
Name | Description | Number | Numeric | Named | Literal | Notes |
nbsp | Non-breaking space | 160 | | | | Named: the named entity (&nbsp;) is not legal XML, though it is legal XHTML. This makes these entities unsuitable for processing as XML. |
iexcl | Inverted exclamation | 161 | ¡ | ¡ | ¡ | Literal: this literal and all other Latin literals were lost in a processing test. |
cent | Cent sign | 162 | ¢ | ¢ | ¢ | Literal |
pound | Pound sterling sign | 163 | £ | £ | £ | Literal |
curren | General currency sign | 164 | ¤ | ¤ | ¤ | Literal |
yen | Yen sign | 165 | ¥ | ¥ | ¥ | Literal |
brvbar | Broken vertical bar | 166 | ¦ | ¦ | ¦ | Literal |
sect | Section sign | 167 | § | § | § | Literal |
uml | Umlaut (diaeresis) | 168 | ¨ | ¨ | ¨ | Literal |
copy | Copyright sign | 169 | © | © | © | Literal |
ordf | Feminine ordinal | 170 | ª | ª | ª | Literal |
laquo | Left angle quote, guillemet left | 171 | « | « | « | Literal |
not | Not sign | 172 | ¬ | ¬ | ¬ | Literal |
shy | Soft hyphen | 173 | | | | Literal |
reg | Registered trademark | 174 | ® | ® | ® | Literal |
macr | Macron accent | 175 | ¯ | ¯ | ¯ | Literal |
deg | Degree sign | 176 | ° | ° | ° | Literal |
plusmn | Plus or minus sign | 177 | ± | ± | ± | Literal |
sup2 | Superscript two | 178 | ² | ² | ² | Literal |
sup3 | Superscript three | 179 | ³ | ³ | ³ | Literal |
acute | Acute accent | 180 | ´ | ´ | ´ | Literal |
micro | Micro sign | 181 | µ | µ | µ | Literal |
para | Paragraph sign (pilcrow) | 182 | ¶ | ¶ | ¶ | Literal |
middot | Middle dot | 183 | · | · | · | Literal |
cedil | Cedilla | 184 | ¸ | ¸ | ¸ | Literal |
sup1 | Superscript one | 185 | ¹ | ¹ | ¹ | Literal |
ordm | Masculine ordinal | 186 | º | º | º | Literal |
raquo | Right angle quote, guillemet right | 187 | » | » | » | Literal |
frac14 | One Quarter (vulgar fraction) | 188 | ¼ | ¼ | ¼ | Literal |
frac12 | One Half (vulgar fraction) | 189 | ½ | ½ | ½ | Literal |
frac34 | Three Quarters (vulgar fraction) | 190 | ¾ | ¾ | ¾ | Literal |
iquest | Inverted question mark | 191 | ¿ | ¿ | ¿ | Literal |
Agrave | Capital A, grave | 192 | À | À | À | Literal |
Aacute | Capital A, acute | 193 | Á | Á | Á | Literal |
Acirc | Capital A, circumflex | 194 | Â | Â | Â | Literal |
Atilde | Capital A, tilde | 195 | Ã | Ã | Ã | Literal |
Auml | Capital A, umlaut (diaeresis) | 196 | Ä | Ä | Ä | Literal |
Aring | Capital A, ring | 197 | Å | Å | Å | Literal |
AElig | Capital AE diphthong (ligature) | 198 | Æ | Æ | Æ | Literal |
Ccedil | Capital C, cedilla | 199 | Ç | Ç | Ç | Literal |
Egrave | Capital E, grave | 200 | È | È | È | Literal |
Eacute | Capital E, acute | 201 | É | É | É | Literal |
Ecirc | Capital E, circumflex | 202 | Ê | Ê | Ê | Literal |
Euml | Capital E, umlaut (diaeresis) | 203 | Ë | Ë | Ë | Literal |
Igrave | Capital I, grave | 204 | Ì | Ì | Ì | Literal |
Iacute | Capital I, acute | 205 | Í | Í | Í | Literal |
Icirc | Capital I, circumflex | 206 | Î | Î | Î | Literal |
Iuml | Capital I, umlaut (diaeresis) | 207 | Ï | Ï | Ï | Literal |
ETH | Capital Eth, Icelandic | 208 | Ð | Ð | Ð | Literal |
Ntilde | Capital N, tilde | 209 | Ñ | Ñ | Ñ | Literal |
Ograve | Capital O, grave | 210 | Ò | Ò | Ò | Literal |
Oacute | Capital O, acute | 211 | Ó | Ó | Ó | Literal |
Ocirc | Capital O, circumflex | 212 | Ô | Ô | Ô | Literal |
Otilde | Capital O, tilde | 213 | Õ | Õ | Õ | Literal |
Ouml | Capital O, umlaut (diaeresis) | 214 | Ö | Ö | Ö | Literal |
times | Multiplication sign | 215 | × | × | × | Literal |
Oslash | Capital O, slash | 216 | Ø | Ø | Ø | Literal |
Ugrave | Capital U, grave | 217 | Ù | Ù | Ù | Literal |
Uacute | Capital U, acute | 218 | Ú | Ú | Ú | Literal |
Ucirc | Capital U, circumflex | 219 | Û | Û | Û | Literal |
Uuml | Capital U, umlaut (diaeresis) | 220 | Ü | Ü | Ü | Literal |
Yacute | Capital Y, acute | 221 | Ý | Ý | Ý | Literal |
THORN | Capital Thorn, Icelandic | 222 | Þ | Þ | Þ | Literal |
szlig | Small sharp s, German (sz ligature) | 223 | ß | ß | ß | Literal |
agrave | Small a, grave | 224 | à | à | à | Literal |
aacute | Small a, acute | 225 | á | á | á | Literal |
acirc | Small a, circumflex | 226 | â | â | â | Literal |
atilde | Small a, tilde | 227 | ã | ã | ã | Literal |
auml | Small a, umlaut (diaeresis) | 228 | ä | ä | ä | Literal |
aring | Small a, ring | 229 | å | å | å | Literal |
aelig | Small ae diphthong (ligature) | 230 | æ | æ | æ | Literal |
ccedil | Small c, cedilla | 231 | ç | ç | ç | Literal |
egrave | Small e, grave | 232 | è | è | è | Literal |
eacute | Small e, acute | 233 | é | é | é | Literal |
ecirc | Small e, circumflex | 234 | ê | ê | ê | Literal |
euml | Small e, umlaut (diaeresis) | 235 | ë | ë | ë | Literal |
igrave | Small i, grave | 236 | ì | ì | ì | Literal |
iacute | Small i, acute | 237 | í | í | í | Literal |
icirc | Small i, circumflex | 238 | î | î | î | Literal |
iuml | Small i, umlaut (diaeresis) | 239 | ï | ï | ï | Literal |
eth | Small eth, Icelandic | 240 | ð | ð | ð | Literal |
ntilde | Small n, tilde | 241 | ñ | ñ | ñ | Literal |
ograve | Small o, grave | 242 | ò | ò | ò | Literal |
oacute | Small o, acute | 243 | ó | ó | ó | Literal |
ocirc | Small o, circumflex | 244 | ô | ô | ô | Literal |
otilde | Small o, tilde | 245 | õ | õ | õ | Literal |
ouml | Small o, umlaut (diaeresis) | 246 | ö | ö | ö | Literal |
divide | Division sign | 247 | ÷ | ÷ | ÷ | Literal |
oslash | Small o, slash | 248 | ø | ø | ø | Literal |
ugrave | Small u, grave | 249 | ù | ù | ù | Literal |
uacute | Small u, acute | 250 | ú | ú | ú | Literal |
ucirc | Small u, circumflex | 251 | û | û | û | Literal |
uuml | Small u, umlaut (diaeresis) | 252 | ü | ü | ü | Literal |
yacute | Small y, acute | 253 | ý | ý | ý | Literal |
thorn | Small thorn, Icelandic | 254 | þ | þ | þ | Literal |
yuml | Small y, umlaut (diaeresis) | 255 | ÿ | ÿ | ÿ | Literal |
OElig | Latin Capital OE (ligature) | 338 | Œ | Œ | Œ | Literal |
oelig | Latin Small OE (ligature) | 339 | œ | œ | œ | Literal |
Scaron | Capital S with caron | 352 | Š | Š | Š | Literal |
scaron | Small s with caron | 353 | š | š | š | Literal |
Yuml | Capital Y, umlaut (diaeresis) | 376 | Ÿ | Ÿ | Ÿ | Literal |
fnof | florin (latin small f with hook) | 402 | ƒ | ƒ | ƒ | Literal |
circ | Circumflex accent | 710 | ˆ | ˆ | ˆ | Literal |
tilde | Small tilde | 732 | ˜ | ˜ | ˜ | Literal |
Alpha | Capital Greek Alpha | 913 | Α | Α | Α | |
Beta | Capital Greek Beta | 914 | Β | Β | Β | |
Gamma | Capital Greek Gamma | 915 | Γ | Γ | Γ | |
Delta | Capital Greek Delta | 916 | Δ | Δ | Δ | |
Epsilon | Capital Greek Epsilon | 917 | Ε | Ε | Ε | |
Zeta | Capital Greek Zeta | 918 | Ζ | Ζ | Ζ | |
Eta | Capital Greek Eta | 919 | Η | Η | Η | |
Theta | Capital Greek Theta | 920 | Θ | Θ | Θ | |
Iota | Capital Greek Iota | 921 | Ι | Ι | Ι | |
Kappa | Capital Greek Kappa | 922 | Κ | Κ | Κ | |
Lambda | Capital Greek Lambda | 923 | Λ | Λ | Λ | |
Mu | Capital Greek Mu | 924 | Μ | Μ | Μ | |
Nu | Capital Greek Nu | 925 | Ν | Ν | Ν | |
Xi | Capital Greek Xi | 926 | Ξ | Ξ | Ξ | |
Omicron | Capital Greek Omicron | 927 | Ο | Ο | Ο | |
Pi | Capital Greek Pi | 928 | Π | Π | Π | |
Rho | Capital Greek Rho | 929 | Ρ | Ρ | Ρ | |
Sigma | Capital Greek Sigma | 931 | Σ | Σ | Σ | |
Tau | Capital Greek Tau | 932 | Τ | Τ | Τ | |
Upsilon | Capital Greek Upsilon | 933 | Υ | Υ | Υ | |
Phi | Capital Greek Phi | 934 | Φ | Φ | Φ | |
Chi | Capital Greek Chi | 935 | Χ | Χ | Χ | |
Psi | Capital Greek Psi | 936 | Ψ | Ψ | Ψ | |
Omega | Capital Greek Omega | 937 | Ω | Ω | Ω | |
alpha | Small Greek Alpha | 945 | α | α | α | |
beta | Small Greek Beta | 946 | β | β | β | |
gamma | Small Greek Gamma | 947 | γ | γ | γ | |
delta | Small Greek Delta | 948 | δ | δ | δ | |
epsilon | Small Greek Epsilon | 949 | ε | ε | ε | |
zeta | Small Greek Zeta | 950 | ζ | ζ | ζ | |
eta | Small Greek Eta | 951 | η | η | η | |
theta | Small Greek Theta | 952 | θ | θ | θ | |
iota | Small Greek Iota | 953 | ι | ι | ι | |
kappa | Small Greek Kappa | 954 | κ | κ | κ | |
lambda | Small Greek Lambda | 955 | λ | λ | λ | |
mu | Small Greek Mu | 956 | μ | μ | μ | |
nu | Small Greek Nu | 957 | ν | ν | ν | |
xi | Small Greek Xi | 958 | ξ | ξ | ξ | |
omicron | Small Greek Omicron | 959 | ο | ο | ο | |
pi | Small Greek Pi | 960 | π | π | π | |
rho | Small Greek Rho | 961 | ρ | ρ | ρ | |
sigmaf | Small Greek final Sigma | 962 | ς | ς | ς | |
sigma | Small Greek Sigma | 963 | σ | σ | σ | |
tau | Small Greek Tau | 964 | τ | τ | τ | |
upsilon | Small Greek Upsilon | 965 | υ | υ | υ | |
phi | Small Greek Phi | 966 | φ | φ | φ | |
chi | Small Greek Chi | 967 | χ | χ | χ | |
psi | Small Greek Psi | 968 | ψ | ψ | ψ | |
omega | Small Greek Omega | 969 | ω | ω | ω | |
thetasym | Small Greek theta | 977 | ϑ | ϑ | ϑ | Common: this glyph (character) is not present in the common fonts tested. |
upsih | Greek Upsilon with hook | 978 | ϒ | ϒ | ϒ | Common |
piv | Greek Pi symbol | 982 | ϖ | ϖ | ϖ | Common |
ensp | En space | 8194 | | | | |
emsp | Em space | 8195 | | | | |
thinsp | Thin space | 8201 | | | | |
zwnj | Zero width non-joiner | 8204 | | | | |
zwj | Zero width joiner | 8205 | | | | |
lrm | Left-to-right mark | 8206 | | | | |
rlm | Right-to-left mark | 8207 | | | | |
ndash | En dash | 8211 | – | – | – | Literal |
mdash | Em dash | 8212 | — | — | — | Literal |
lsquo | Left single quotation mark | 8216 | ‘ | ‘ | ‘ | Literal |
rsquo | Right single quotation mark | 8217 | ’ | ’ | ’ | Literal |
sbquo | Single low-9 quotation mark | 8218 | ‚ | ‚ | ‚ | Literal |
ldquo | Left double quotation mark | 8220 | “ | “ | “ | Literal |
rdquo | Right double quotation mark | 8221 | ” | ” | ” | Literal |
bdquo | Double low-9 quotation mark | 8222 | „ | „ | „ | Literal |
dagger | Dagger | 8224 | † | † | † | Literal |
Dagger | Double Dagger | 8225 | ‡ | ‡ | ‡ | Literal |
bull | Bullet / Small black circle | 8226 | • | • | • | Literal |
hellip | Horizontal Ellipsis | 8230 | … | … | … | Literal |
permil | Per mille (thousand) sign | 8240 | ‰ | ‰ | ‰ | Literal |
prime | Prime / Minutes / Feet | 8242 | ′ | ′ | ′ | |
Prime | Double prime | 8243 | ″ | ″ | ″ | |
lsaquo | Single left-pointing angle quotation mark | 8249 | ‹ | ‹ | ‹ | Literal |
rsaquo | Single right-pointing angle quotation mark | 8250 | › | › | › | Literal |
oline | Overline / Spacing overscore | 8254 | ‾ | ‾ | ‾ | |
frasl | Fraction Slash | 8260 | ⁄ | ⁄ | ⁄ | |
euro | Euro sign* | 8364 | € | € | € | Literal |
image | Blackletter capital I (imaginary part) | 8465 | ℑ | ℑ | ℑ | Common |
weierp | Script capital P / Weierstrass p | 8472 | ℘ | ℘ | ℘ | Common |
real | Blackletter capital R (real part) | 8476 | ℜ | ℜ | ℜ | Common |
trade | Trademark symbol | 8482 | ™ | ™ | ™ | Literal |
alefsym | Alef symbol / First transfinite | 8501 | ℵ | ℵ | ℵ | Common |
larr | Leftwards arrow | 8592 | ← | ← | ← | |
uarr | Upwards arrow | 8593 | ↑ | ↑ | ↑ | |
rarr | Rightwards arrow | 8594 | → | → | → | |
darr | Downwards arrow | 8595 | ↓ | ↓ | ↓ | |
harr | Left Right arrow | 8596 | ↔ | ↔ | ↔ | |
crarr | Downwards arrow with corner leftwards | 8629 | ↵ | ↵ | ↵ | Common |
lArr | Leftwards double arrow | 8656 | ⇐ | ⇐ | ⇐ | Common |
uArr | Upwards double arrow | 8657 | ⇑ | ⇑ | ⇑ | Common |
rArr | Rightwards double arrow | 8658 | ⇒ | ⇒ | ⇒ | |
dArr | Downwards double arrow | 8659 | ⇓ | ⇓ | ⇓ | Common |
hArr | Left Right double arrow | 8660 | ⇔ | ⇔ | ⇔ | |
forall | For All | 8704 | ∀ | ∀ | ∀ | |
part | Partial Differential | 8706 | ∂ | ∂ | ∂ | |
exist | There exists | 8707 | ∃ | ∃ | ∃ | Common |
empty | Empty Set | 8709 | ∅ | ∅ | ∅ | Common |
nabla | Nabla / Backward difference | 8711 | ∇ | ∇ | ∇ | |
isin | Element Of... | 8712 | ∈ | ∈ | ∈ | |
notin | Not an element of | 8713 | ∉ | ∉ | ∉ | Common |
ni | Contains as member | 8715 | ∋ | ∋ | ∋ | |
prod | n-ary product / product sign | 8719 | ∏ | ∏ | ∏ | |
sum | n-ary summation | 8721 | ∑ | ∑ | ∑ | |
minus | Minus sign | 8722 | − | − | − | |
lowast | Asterisk operator | 8727 | ∗ | ∗ | ∗ | Common |
radic | Square root / Radical sign | 8730 | √ | √ | √ | |
prop | Proportional to | 8733 | ∝ | ∝ | ∝ | Common |
infin | Infinity symbol | 8734 | ∞ | ∞ | ∞ | |
ang | Angle | 8736 | ∠ | ∠ | ∠ | |
and | Logical And / Wedge | 8743 | ∧ | ∧ | ∧ | |
or | Logical Or / Vee | 8744 | ∨ | ∨ | ∨ | |
cap | Intersection | 8745 | ∩ | ∩ | ∩ | |
cup | Union / Cup | 8746 | ∪ | ∪ | ∪ | Common |
int | Integral | 8747 | ∫ | ∫ | ∫ | |
there4 | Therefore | 8756 | ∴ | ∴ | ∴ | |
sim | Tilde operator | 8764 | ∼ | ∼ | ∼ | |
cong | Approximately equal to | 8773 | ≅ | ≅ | ≅ | Common |
asymp | Almost equal to / Asymptotic | 8776 | ≈ | ≈ | ≈ | |
ne | Not equal to | 8800 | ≠ | ≠ | ≠ | |
equiv | Identical to / Equivalent | 8801 | ≡ | ≡ | ≡ | |
le | Less than or equal to | 8804 | ≤ | ≤ | ≤ | |
ge | Greater than or equal to | 8805 | ≥ | ≥ | ≥ | |
sub | Subset of | 8834 | ⊂ | ⊂ | ⊂ | |
sup | Superset of | 8835 | ⊃ | ⊃ | ⊃ | |
nsub | Not a subset of | 8836 | ⊄ | ⊄ | ⊄ | Common |
sube | Subset of or equal to | 8838 | ⊆ | ⊆ | ⊆ | |
supe | Superset of or equal to | 8839 | ⊇ | ⊇ | ⊇ | |
oplus | Circle plus | 8853 | ⊕ | ⊕ | ⊕ | |
otimes | Circled times | 8855 | ⊗ | ⊗ | ⊗ | Common |
perp | Orthogonal / Perpendicular to / Up tack | 8869 | ⊥ | ⊥ | ⊥ | |
sdot | Dot operator | 8901 | ⋅ | ⋅ | ⋅ | Common |
lceil | Left ceiling | 8968 | ⌈ | ⌈ | ⌈ | Common |
rceil | Right ceiling | 8969 | ⌉ | ⌉ | ⌉ | Common |
lfloor | Left floor | 8970 | ⌊ | ⌊ | ⌊ | Common |
rfloor | Right floor | 8971 | ⌋ | ⌋ | ⌋ | Common |
lang | Left pointing angle bracket | 9001 | 〈 | 〈 | 〈 | Common |
rang | Right pointing angle bracket | 9002 | 〉 | 〉 | 〉 | Common |
loz | Lozenge | 9674 | ◊ | ◊ | ◊ | |
spades | Black Spade suit | 9824 | ♠ | ♠ | ♠ | |
clubs | Black Clubs suit | 9827 | ♣ | ♣ | ♣ | |
hearts | Black Hearts suit | 9829 | ♥ | ♥ | ♥ | |
diams | Black Diamonds suit | 9830 | ♦ | ♦ | ♦ | Common |
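The note against nbsp can be checked with a few lines of code. The sketch below is only an illustration, not the tooling used for the tests in this article: it assumes Python's standard `xml.etree.ElementTree` parser with no DTD loaded, and `parses_as_xml` and `to_numeric_refs` are hypothetical helper names. The second helper uses the standard `xmlcharrefreplace` error handler to turn literal characters into decimal numeric references; it does not touch named entities.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(fragment: str) -> bool:
    """Return True if a generic XML parser (no DTD loaded) accepts the fragment."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


def to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII literal with a decimal numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# The named entity is undefined to a plain XML parser; the numeric form is not.
print(parses_as_xml("<p>a&nbsp;b</p>"))
print(parses_as_xml("<p>a&#160;b</p>"))

# Literals become pure ASCII, which leaves nothing for an encoding mix-up to damage.
print(to_numeric_refs("caf\u00e9 \u20ac5"))
```

Because the output of `to_numeric_refs` is plain ASCII, it should pass unchanged through tools that disagree about character encodings, provided those tools leave the `&#...;` references themselves alone.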
The table suggests a few things: