Convert named HTML entities to numeric HTML entities

2022-08-30 19:07:02

Is there a PHP function that converts named HTML entities to their respective numeric HTML entities?

For example:

$str = "Oggi &egrave; un bel giorno";
echo entities_to_unicode($str); // Oggi &#232; un bel giorno

Thanks in advance, and have a nice day!


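Answer 1

You are looking for a simple translation from named HTML entities to their numeric counterparts.

This can be done with a translation table (an array) and the string translation function strtr: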

$translated = strtr($string, $HTML401NamedToNumeric);

This works when $string is UTF-8 encoded or uses a single-byte character set.
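As a minimal sketch of how the strtr call behaves, the snippet below uses a hypothetical three-entry excerpt of the table and a hypothetical wrapper named named_to_numeric (neither name comes from the answer itself). Entities missing from the table are left untouched.

```php
<?php
// Hypothetical three-entry excerpt of the translation table, for illustration only.
$table = array(
    '&egrave;' => '&#232;',
    '&nbsp;'   => '&#160;',
    '&euro;'   => '&#8364;',
);

// Hypothetical wrapper around strtr(); the name is an assumption.
function named_to_numeric(string $string, array $table): string
{
    // With an array argument, strtr() tries the longest keys first and
    // never rescans text it has already replaced.
    return strtr($string, $table);
}

echo named_to_numeric('Oggi &egrave; un bel giorno', $table), "\n";
// prints: Oggi &#232; un bel giorno
```

Because strtr does plain byte-wise substring replacement on ASCII keys, it does not need to know the encoding of the rest of the string.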

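Below is a sample array of the HTML 4.01 named entities as specified by the W3C, for use as shown above. It contains 252 entities. If you want to support XHTML, there is one more (I put it at the end):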

$HTML401NamedToNumeric = array(
    '&nbsp;'     => '&#160;',  # no-break space = non-breaking space, U+00A0 ISOnum
    '&iexcl;'    => '&#161;',  # inverted exclamation mark, U+00A1 ISOnum
    '&cent;'     => '&#162;',  # cent sign, U+00A2 ISOnum
    '&pound;'    => '&#163;',  # pound sign, U+00A3 ISOnum
    '&curren;'   => '&#164;',  # currency sign, U+00A4 ISOnum
    '&yen;'      => '&#165;',  # yen sign = yuan sign, U+00A5 ISOnum
    '&brvbar;'   => '&#166;',  # broken bar = broken vertical bar, U+00A6 ISOnum
    '&sect;'     => '&#167;',  # section sign, U+00A7 ISOnum
    '&uml;'      => '&#168;',  # diaeresis = spacing diaeresis, U+00A8 ISOdia
    '&copy;'     => '&#169;',  # copyright sign, U+00A9 ISOnum
    '&ordf;'     => '&#170;',  # feminine ordinal indicator, U+00AA ISOnum
    '&laquo;'    => '&#171;',  # left-pointing double angle quotation mark = left pointing guillemet, U+00AB ISOnum
    '&not;'      => '&#172;',  # not sign, U+00AC ISOnum
    '&shy;'      => '&#173;',  # soft hyphen = discretionary hyphen, U+00AD ISOnum
    '&reg;'      => '&#174;',  # registered sign = registered trade mark sign, U+00AE ISOnum
    '&macr;'     => '&#175;',  # macron = spacing macron = overline = APL overbar, U+00AF ISOdia
    '&deg;'      => '&#176;',  # degree sign, U+00B0 ISOnum
    '&plusmn;'   => '&#177;',  # plus-minus sign = plus-or-minus sign, U+00B1 ISOnum
    '&sup2;'     => '&#178;',  # superscript two = superscript digit two = squared, U+00B2 ISOnum
    '&sup3;'     => '&#179;',  # superscript three = superscript digit three = cubed, U+00B3 ISOnum
    '&acute;'    => '&#180;',  # acute accent = spacing acute, U+00B4 ISOdia
    '&micro;'    => '&#181;',  # micro sign, U+00B5 ISOnum
    '&para;'     => '&#182;',  # pilcrow sign = paragraph sign, U+00B6 ISOnum
    '&middot;'   => '&#183;',  # middle dot = Georgian comma = Greek middle dot, U+00B7 ISOnum
    '&cedil;'    => '&#184;',  # cedilla = spacing cedilla, U+00B8 ISOdia
    '&sup1;'     => '&#185;',  # superscript one = superscript digit one, U+00B9 ISOnum
    '&ordm;'     => '&#186;',  # masculine ordinal indicator, U+00BA ISOnum
    '&raquo;'    => '&#187;',  # right-pointing double angle quotation mark = right pointing guillemet, U+00BB ISOnum
    '&frac14;'   => '&#188;',  # vulgar fraction one quarter = fraction one quarter, U+00BC ISOnum
    '&frac12;'   => '&#189;',  # vulgar fraction one half = fraction one half, U+00BD ISOnum
    '&frac34;'   => '&#190;',  # vulgar fraction three quarters = fraction three quarters, U+00BE ISOnum
    '&iquest;'   => '&#191;',  # inverted question mark = turned question mark, U+00BF ISOnum
    '&Agrave;'   => '&#192;',  # latin capital letter A with grave = latin capital letter A grave, U+00C0 ISOlat1
    '&Aacute;'   => '&#193;',  # latin capital letter A with acute, U+00C1 ISOlat1
    '&Acirc;'    => '&#194;',  # latin capital letter A with circumflex, U+00C2 ISOlat1
    '&Atilde;'   => '&#195;',  # latin capital letter A with tilde, U+00C3 ISOlat1
    '&Auml;'     => '&#196;',  # latin capital letter A with diaeresis, U+00C4 ISOlat1
    '&Aring;'    => '&#197;',  # latin capital letter A with ring above = latin capital letter A ring, U+00C5 ISOlat1
    '&AElig;'    => '&#198;',  # latin capital letter AE = latin capital ligature AE, U+00C6 ISOlat1
    '&Ccedil;'   => '&#199;',  # latin capital letter C with cedilla, U+00C7 ISOlat1
    '&Egrave;'   => '&#200;',  # latin capital letter E with grave, U+00C8 ISOlat1
    '&Eacute;'   => '&#201;',  # latin capital letter E with acute, U+00C9 ISOlat1
    '&Ecirc;'    => '&#202;',  # latin capital letter E with circumflex, U+00CA ISOlat1
    '&Euml;'     => '&#203;',  # latin capital letter E with diaeresis, U+00CB ISOlat1
    '&Igrave;'   => '&#204;',  # latin capital letter I with grave, U+00CC ISOlat1
    '&Iacute;'   => '&#205;',  # latin capital letter I with acute, U+00CD ISOlat1
    '&Icirc;'    => '&#206;',  # latin capital letter I with circumflex, U+00CE ISOlat1
    '&Iuml;'     => '&#207;',  # latin capital letter I with diaeresis, U+00CF ISOlat1
    '&ETH;'      => '&#208;',  # latin capital letter ETH, U+00D0 ISOlat1
    '&Ntilde;'   => '&#209;',  # latin capital letter N with tilde, U+00D1 ISOlat1
    '&Ograve;'   => '&#210;',  # latin capital letter O with grave, U+00D2 ISOlat1
    '&Oacute;'   => '&#211;',  # latin capital letter O with acute, U+00D3 ISOlat1
    '&Ocirc;'    => '&#212;',  # latin capital letter O with circumflex, U+00D4 ISOlat1
    '&Otilde;'   => '&#213;',  # latin capital letter O with tilde, U+00D5 ISOlat1
    '&Ouml;'     => '&#214;',  # latin capital letter O with diaeresis, U+00D6 ISOlat1
    '&times;'    => '&#215;',  # multiplication sign, U+00D7 ISOnum
    '&Oslash;'   => '&#216;',  # latin capital letter O with stroke = latin capital letter O slash, U+00D8 ISOlat1
    '&Ugrave;'   => '&#217;',  # latin capital letter U with grave, U+00D9 ISOlat1
    '&Uacute;'   => '&#218;',  # latin capital letter U with acute, U+00DA ISOlat1
    '&Ucirc;'    => '&#219;',  # latin capital letter U with circumflex, U+00DB ISOlat1
    '&Uuml;'     => '&#220;',  # latin capital letter U with diaeresis, U+00DC ISOlat1
    '&Yacute;'   => '&#221;',  # latin capital letter Y with acute, U+00DD ISOlat1
    '&THORN;'    => '&#222;',  # latin capital letter THORN, U+00DE ISOlat1
    '&szlig;'    => '&#223;',  # latin small letter sharp s = ess-zed, U+00DF ISOlat1
    '&agrave;'   => '&#224;',  # latin small letter a with grave = latin small letter a grave, U+00E0 ISOlat1
    '&aacute;'   => '&#225;',  # latin small letter a with acute, U+00E1 ISOlat1
    '&acirc;'    => '&#226;',  # latin small letter a with circumflex, U+00E2 ISOlat1
    '&atilde;'   => '&#227;',  # latin small letter a with tilde, U+00E3 ISOlat1
    '&auml;'     => '&#228;',  # latin small letter a with diaeresis, U+00E4 ISOlat1
    '&aring;'    => '&#229;',  # latin small letter a with ring above = latin small letter a ring, U+00E5 ISOlat1
    '&aelig;'    => '&#230;',  # latin small letter ae = latin small ligature ae, U+00E6 ISOlat1
    '&ccedil;'   => '&#231;',  # latin small letter c with cedilla, U+00E7 ISOlat1
    '&egrave;'   => '&#232;',  # latin small letter e with grave, U+00E8 ISOlat1
    '&eacute;'   => '&#233;',  # latin small letter e with acute, U+00E9 ISOlat1
    '&ecirc;'    => '&#234;',  # latin small letter e with circumflex, U+00EA ISOlat1
    '&euml;'     => '&#235;',  # latin small letter e with diaeresis, U+00EB ISOlat1
    '&igrave;'   => '&#236;',  # latin small letter i with grave, U+00EC ISOlat1
    '&iacute;'   => '&#237;',  # latin small letter i with acute, U+00ED ISOlat1
    '&icirc;'    => '&#238;',  # latin small letter i with circumflex, U+00EE ISOlat1
    '&iuml;'     => '&#239;',  # latin small letter i with diaeresis, U+00EF ISOlat1
    '&eth;'      => '&#240;',  # latin small letter eth, U+00F0 ISOlat1
    '&ntilde;'   => '&#241;',  # latin small letter n with tilde, U+00F1 ISOlat1
    '&ograve;'   => '&#242;',  # latin small letter o with grave, U+00F2 ISOlat1
    '&oacute;'   => '&#243;',  # latin small letter o with acute, U+00F3 ISOlat1
    '&ocirc;'    => '&#244;',  # latin small letter o with circumflex, U+00F4 ISOlat1
    '&otilde;'   => '&#245;',  # latin small letter o with tilde, U+00F5 ISOlat1
    '&ouml;'     => '&#246;',  # latin small letter o with diaeresis, U+00F6 ISOlat1
    '&divide;'   => '&#247;',  # division sign, U+00F7 ISOnum
    '&oslash;'   => '&#248;',  # latin small letter o with stroke, = latin small letter o slash, U+00F8 ISOlat1
    '&ugrave;'   => '&#249;',  # latin small letter u with grave, U+00F9 ISOlat1
    '&uacute;'   => '&#250;',  # latin small letter u with acute, U+00FA ISOlat1
    '&ucirc;'    => '&#251;',  # latin small letter u with circumflex, U+00FB ISOlat1
    '&uuml;'     => '&#252;',  # latin small letter u with diaeresis, U+00FC ISOlat1
    '&yacute;'   => '&#253;',  # latin small letter y with acute, U+00FD ISOlat1
    '&thorn;'    => '&#254;',  # latin small letter thorn, U+00FE ISOlat1
    '&yuml;'     => '&#255;',  # latin small letter y with diaeresis, U+00FF ISOlat1
    '&fnof;'     => '&#402;',  # latin small f with hook = function = florin, U+0192 ISOtech
    '&Alpha;'    => '&#913;',  # greek capital letter alpha, U+0391
    '&Beta;'     => '&#914;',  # greek capital letter beta, U+0392
    '&Gamma;'    => '&#915;',  # greek capital letter gamma, U+0393 ISOgrk3
    '&Delta;'    => '&#916;',  # greek capital letter delta, U+0394 ISOgrk3
    '&Epsilon;'  => '&#917;',  # greek capital letter epsilon, U+0395
    '&Zeta;'     => '&#918;',  # greek capital letter zeta, U+0396
    '&Eta;'      => '&#919;',  # greek capital letter eta, U+0397
    '&Theta;'    => '&#920;',  # greek capital letter theta, U+0398 ISOgrk3
    '&Iota;'     => '&#921;',  # greek capital letter iota, U+0399
    '&Kappa;'    => '&#922;',  # greek capital letter kappa, U+039A
    '&Lambda;'   => '&#923;',  # greek capital letter lambda, U+039B ISOgrk3
    '&Mu;'       => '&#924;',  # greek capital letter mu, U+039C
    '&Nu;'       => '&#925;',  # greek capital letter nu, U+039D
    '&Xi;'       => '&#926;',  # greek capital letter xi, U+039E ISOgrk3
    '&Omicron;'  => '&#927;',  # greek capital letter omicron, U+039F
    '&Pi;'       => '&#928;',  # greek capital letter pi, U+03A0 ISOgrk3
    '&Rho;'      => '&#929;',  # greek capital letter rho, U+03A1
    '&Sigma;'    => '&#931;',  # greek capital letter sigma, U+03A3 ISOgrk3
    '&Tau;'      => '&#932;',  # greek capital letter tau, U+03A4
    '&Upsilon;'  => '&#933;',  # greek capital letter upsilon, U+03A5 ISOgrk3
    '&Phi;'      => '&#934;',  # greek capital letter phi, U+03A6 ISOgrk3
    '&Chi;'      => '&#935;',  # greek capital letter chi, U+03A7
    '&Psi;'      => '&#936;',  # greek capital letter psi, U+03A8 ISOgrk3
    '&Omega;'    => '&#937;',  # greek capital letter omega, U+03A9 ISOgrk3
    '&alpha;'    => '&#945;',  # greek small letter alpha, U+03B1 ISOgrk3
    '&beta;'     => '&#946;',  # greek small letter beta, U+03B2 ISOgrk3
    '&gamma;'    => '&#947;',  # greek small letter gamma, U+03B3 ISOgrk3
    '&delta;'    => '&#948;',  # greek small letter delta, U+03B4 ISOgrk3
    '&epsilon;'  => '&#949;',  # greek small letter epsilon, U+03B5 ISOgrk3
    '&zeta;'     => '&#950;',  # greek small letter zeta, U+03B6 ISOgrk3
    '&eta;'      => '&#951;',  # greek small letter eta, U+03B7 ISOgrk3
    '&theta;'    => '&#952;',  # greek small letter theta, U+03B8 ISOgrk3
    '&iota;'     => '&#953;',  # greek small letter iota, U+03B9 ISOgrk3
    '&kappa;'    => '&#954;',  # greek small letter kappa, U+03BA ISOgrk3
    '&lambda;'   => '&#955;',  # greek small letter lambda, U+03BB ISOgrk3
    '&mu;'       => '&#956;',  # greek small letter mu, U+03BC ISOgrk3
    '&nu;'       => '&#957;',  # greek small letter nu, U+03BD ISOgrk3
    '&xi;'       => '&#958;',  # greek small letter xi, U+03BE ISOgrk3
    '&omicron;'  => '&#959;',  # greek small letter omicron, U+03BF NEW
    '&pi;'       => '&#960;',  # greek small letter pi, U+03C0 ISOgrk3
    '&rho;'      => '&#961;',  # greek small letter rho, U+03C1 ISOgrk3
    '&sigmaf;'   => '&#962;',  # greek small letter final sigma, U+03C2 ISOgrk3
    '&sigma;'    => '&#963;',  # greek small letter sigma, U+03C3 ISOgrk3
    '&tau;'      => '&#964;',  # greek small letter tau, U+03C4 ISOgrk3
    '&upsilon;'  => '&#965;',  # greek small letter upsilon, U+03C5 ISOgrk3
    '&phi;'      => '&#966;',  # greek small letter phi, U+03C6 ISOgrk3
    '&chi;'      => '&#967;',  # greek small letter chi, U+03C7 ISOgrk3
    '&psi;'      => '&#968;',  # greek small letter psi, U+03C8 ISOgrk3
    '&omega;'    => '&#969;',  # greek small letter omega, U+03C9 ISOgrk3
    '&thetasym;' => '&#977;',  # greek small letter theta symbol, U+03D1 NEW
    '&upsih;'    => '&#978;',  # greek upsilon with hook symbol, U+03D2 NEW
    '&piv;'      => '&#982;',  # greek pi symbol, U+03D6 ISOgrk3
    '&bull;'     => '&#8226;', # bullet = black small circle, U+2022 ISOpub
    '&hellip;'   => '&#8230;', # horizontal ellipsis = three dot leader, U+2026 ISOpub
    '&prime;'    => '&#8242;', # prime = minutes = feet, U+2032 ISOtech
    '&Prime;'    => '&#8243;', # double prime = seconds = inches, U+2033 ISOtech
    '&oline;'    => '&#8254;', # overline = spacing overscore, U+203E NEW
    '&frasl;'    => '&#8260;', # fraction slash, U+2044 NEW
    '&weierp;'   => '&#8472;', # script capital P = power set = Weierstrass p, U+2118 ISOamso
    '&image;'    => '&#8465;', # blackletter capital I = imaginary part, U+2111 ISOamso
    '&real;'     => '&#8476;', # blackletter capital R = real part symbol, U+211C ISOamso
    '&trade;'    => '&#8482;', # trade mark sign, U+2122 ISOnum
    '&alefsym;'  => '&#8501;', # alef symbol = first transfinite cardinal, U+2135 NEW
    '&larr;'     => '&#8592;', # leftwards arrow, U+2190 ISOnum
    '&uarr;'     => '&#8593;', # upwards arrow, U+2191 ISOnum
    '&rarr;'     => '&#8594;', # rightwards arrow, U+2192 ISOnum
    '&darr;'     => '&#8595;', # downwards arrow, U+2193 ISOnum
    '&harr;'     => '&#8596;', # left right arrow, U+2194 ISOamsa
    '&crarr;'    => '&#8629;', # downwards arrow with corner leftwards = carriage return, U+21B5 NEW
    '&lArr;'     => '&#8656;', # leftwards double arrow, U+21D0 ISOtech
    '&uArr;'     => '&#8657;', # upwards double arrow, U+21D1 ISOamsa
    '&rArr;'     => '&#8658;', # rightwards double arrow, U+21D2 ISOtech
    '&dArr;'     => '&#8659;', # downwards double arrow, U+21D3 ISOamsa
    '&hArr;'     => '&#8660;', # left right double arrow, U+21D4 ISOamsa
    '&forall;'   => '&#8704;', # for all, U+2200 ISOtech
    '&part;'     => '&#8706;', # partial differential, U+2202 ISOtech
    '&exist;'    => '&#8707;', # there exists, U+2203 ISOtech
    '&empty;'    => '&#8709;', # empty set = null set = diameter, U+2205 ISOamso
    '&nabla;'    => '&#8711;', # nabla = backward difference, U+2207 ISOtech
    '&isin;'     => '&#8712;', # element of, U+2208 ISOtech
    '&notin;'    => '&#8713;', # not an element of, U+2209 ISOtech
    '&ni;'       => '&#8715;', # contains as member, U+220B ISOtech
    '&prod;'     => '&#8719;', # n-ary product = product sign, U+220F ISOamsb
    '&sum;'      => '&#8721;', # n-ary summation, U+2211 ISOamsb
    '&minus;'    => '&#8722;', # minus sign, U+2212 ISOtech
    '&lowast;'   => '&#8727;', # asterisk operator, U+2217 ISOtech
    '&radic;'    => '&#8730;', # square root = radical sign, U+221A ISOtech
    '&prop;'     => '&#8733;', # proportional to, U+221D ISOtech
    '&infin;'    => '&#8734;', # infinity, U+221E ISOtech
    '&ang;'      => '&#8736;', # angle, U+2220 ISOamso
    '&and;'      => '&#8743;', # logical and = wedge, U+2227 ISOtech
    '&or;'       => '&#8744;', # logical or = vee, U+2228 ISOtech
    '&cap;'      => '&#8745;', # intersection = cap, U+2229 ISOtech
    '&cup;'      => '&#8746;', # union = cup, U+222A ISOtech
    '&int;'      => '&#8747;', # integral, U+222B ISOtech
    '&there4;'   => '&#8756;', # therefore, U+2234 ISOtech
    '&sim;'      => '&#8764;', # tilde operator = varies with = similar to, U+223C ISOtech
    '&cong;'     => '&#8773;', # approximately equal to, U+2245 ISOtech
    '&asymp;'    => '&#8776;', # almost equal to = asymptotic to, U+2248 ISOamsr
    '&ne;'       => '&#8800;', # not equal to, U+2260 ISOtech
    '&equiv;'    => '&#8801;', # identical to, U+2261 ISOtech
    '&le;'       => '&#8804;', # less-than or equal to, U+2264 ISOtech
    '&ge;'       => '&#8805;', # greater-than or equal to, U+2265 ISOtech
    '&sub;'      => '&#8834;', # subset of, U+2282 ISOtech
    '&sup;'      => '&#8835;', # superset of, U+2283 ISOtech
    '&nsub;'     => '&#8836;', # not a subset of, U+2284 ISOamsn
    '&sube;'     => '&#8838;', # subset of or equal to, U+2286 ISOtech
    '&supe;'     => '&#8839;', # superset of or equal to, U+2287 ISOtech
    '&oplus;'    => '&#8853;', # circled plus = direct sum, U+2295 ISOamsb
    '&otimes;'   => '&#8855;', # circled times = vector product, U+2297 ISOamsb
    '&perp;'     => '&#8869;', # up tack = orthogonal to = perpendicular, U+22A5 ISOtech
    '&sdot;'     => '&#8901;', # dot operator, U+22C5 ISOamsb
    '&lceil;'    => '&#8968;', # left ceiling = apl upstile, U+2308 ISOamsc
    '&rceil;'    => '&#8969;', # right ceiling, U+2309 ISOamsc
    '&lfloor;'   => '&#8970;', # left floor = apl downstile, U+230A ISOamsc
    '&rfloor;'   => '&#8971;', # right floor, U+230B ISOamsc
    '&lang;'     => '&#9001;', # left-pointing angle bracket = bra, U+2329 ISOtech
    '&rang;'     => '&#9002;', # right-pointing angle bracket = ket, U+232A ISOtech
    '&loz;'      => '&#9674;', # lozenge, U+25CA ISOpub
    '&spades;'   => '&#9824;', # black spade suit, U+2660 ISOpub
    '&clubs;'    => '&#9827;', # black club suit = shamrock, U+2663 ISOpub
    '&hearts;'   => '&#9829;', # black heart suit = valentine, U+2665 ISOpub
    '&diams;'    => '&#9830;', # black diamond suit, U+2666 ISOpub
    '&quot;'     => '&#34;',   # quotation mark = APL quote, U+0022 ISOnum
    '&amp;'      => '&#38;',   # ampersand, U+0026 ISOnum
    '&lt;'       => '&#60;',   # less-than sign, U+003C ISOnum
    '&gt;'       => '&#62;',   # greater-than sign, U+003E ISOnum
    '&OElig;'    => '&#338;',  # latin capital ligature OE, U+0152 ISOlat2
    '&oelig;'    => '&#339;',  # latin small ligature oe, U+0153 ISOlat2
    '&Scaron;'   => '&#352;',  # latin capital letter S with caron, U+0160 ISOlat2
    '&scaron;'   => '&#353;',  # latin small letter s with caron, U+0161 ISOlat2
    '&Yuml;'     => '&#376;',  # latin capital letter Y with diaeresis, U+0178 ISOlat2
    '&circ;'     => '&#710;',  # modifier letter circumflex accent, U+02C6 ISOpub
    '&tilde;'    => '&#732;',  # small tilde, U+02DC ISOdia
    '&ensp;'     => '&#8194;', # en space, U+2002 ISOpub
    '&emsp;'     => '&#8195;', # em space, U+2003 ISOpub
    '&thinsp;'   => '&#8201;', # thin space, U+2009 ISOpub
    '&zwnj;'     => '&#8204;', # zero width non-joiner, U+200C NEW RFC 2070
    '&zwj;'      => '&#8205;', # zero width joiner, U+200D NEW RFC 2070
    '&lrm;'      => '&#8206;', # left-to-right mark, U+200E NEW RFC 2070
    '&rlm;'      => '&#8207;', # right-to-left mark, U+200F NEW RFC 2070
    '&ndash;'    => '&#8211;', # en dash, U+2013 ISOpub
    '&mdash;'    => '&#8212;', # em dash, U+2014 ISOpub
    '&lsquo;'    => '&#8216;', # left single quotation mark, U+2018 ISOnum
    '&rsquo;'    => '&#8217;', # right single quotation mark, U+2019 ISOnum
    '&sbquo;'    => '&#8218;', # single low-9 quotation mark, U+201A NEW
    '&ldquo;'    => '&#8220;', # left double quotation mark, U+201C ISOnum
    '&rdquo;'    => '&#8221;', # right double quotation mark, U+201D ISOnum
    '&bdquo;'    => '&#8222;', # double low-9 quotation mark, U+201E NEW
    '&dagger;'   => '&#8224;', # dagger, U+2020 ISOpub
    '&Dagger;'   => '&#8225;', # double dagger, U+2021 ISOpub
    '&permil;'   => '&#8240;', # per mille sign, U+2030 ISOtech
    '&lsaquo;'   => '&#8249;', # single left-pointing angle quotation mark, U+2039 ISO proposed
    '&rsaquo;'   => '&#8250;', # single right-pointing angle quotation mark, U+203A ISO proposed
    '&euro;'     => '&#8364;', # euro sign, U+20AC NEW
);

And the additional one for XHTML:

    '&apos;'     => '&#39;',   # apostrophe = APL quote, U+0027 ISOnum
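If maintaining the table by hand is undesirable, an equivalent table can probably be derived from PHP's built-in HTML 4.01 data. The sketch below is not part of the answer above: the function name build_named_to_numeric_table is hypothetical, and it assumes PHP 7.2 or later with the mbstring extension loaded (for mb_ord).

```php
<?php
// Sketch: derive the named-to-numeric table from PHP's own HTML 4.01 data.
// Assumes PHP 7.2+ and the mbstring extension; the function name is hypothetical.
function build_named_to_numeric_table(): array
{
    // Maps each character to its entity, e.g. "è" => "&egrave;".
    $charToNamed = get_html_translation_table(
        HTML_ENTITIES,
        ENT_QUOTES | ENT_HTML401,
        'UTF-8'
    );

    $table = array();
    foreach ($charToNamed as $char => $named) {
        // Skip entries that are already numeric references.
        if (strpos($named, '&#') === 0) {
            continue;
        }
        // mb_ord() returns the Unicode code point of the character.
        $table[$named] = '&#' . mb_ord($char, 'UTF-8') . ';';
    }
    return $table;
}

$table = build_named_to_numeric_table();
echo strtr('Oggi &egrave; un bel giorno', $table), "\n";
```

The hand-written array has the advantage of working without mbstring and of documenting every entry; the generated one avoids transcription mistakes.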

Answer 2

This solution is based on code from php.net:

function entities_to_unicode($str) {
    $str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
    $str = preg_replace_callback("/(&#[0-9]+;)/", function($m) { return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES"); }, $str);
    return $str;
}

$str = 'Oggi &egrave; un bel giorno';
echo entities_to_unicode($str); // Oggi è un bel giorno (a decoded UTF-8 character, not a numeric entity)
