Character Entity References in HTML 4 and XHTML 1.0

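A Complete Guide to HTML Entities
This article explains what HTML entities are for and shows how to use them to produce special characters that are not on the keyboard, such as currency symbols, math symbols and Greek letters. It provides complete HTML entity tables covering special characters, accented characters, punctuation, math symbols and Greek letters.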

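HTML entities are commonly used to produce typographic characters that are not on the keyboard, such as €, ∞, ≠ and ©.

An HTML entity begins with an ampersand (&) and ends with a semicolon (;). Between the two is the string (or number) that identifies the entity.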
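As a quick illustration of the syntax above, here is a minimal Python sketch that decodes the same character written as a named reference and as a numeric reference. It uses `html.unescape` from the standard library. The hexadecimal form (`&#x20AC;`) is not covered in this article; it is included only as an assumption-free extra, since HTML 4 also accepts it.

```python
import html

# The same euro sign written three ways: by name, by decimal number,
# and by hexadecimal number.
named = html.unescape("&euro;")
decimal = html.unescape("&#8364;")
hexadecimal = html.unescape("&#x20AC;")

print(named, decimal, hexadecimal)
print(named == decimal == hexadecimal)
```

All three references should decode to the same single character, which is why the tables below can list a name and a number side by side.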

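The tables below are copied from a fairly complete HTML entity reference published abroad. I keep them here as a personal memo and share them with everyone.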

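Each table has five columns.

The first column contains the entity reference in the common &entity_name; form.

The second column shows how that entity is rendered in the browser.

The third column contains the entity reference in the &#number; form.

The fourth column shows how that numeric reference is rendered in the browser.

The fifth column is a description of the entity.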
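To make the five-column layout concrete, the following sketch rebuilds one table row from an entity name. It relies on `html.entities.name2codepoint` from the Python standard library; the helper name `entity_row` is hypothetical and only serves this example.

```python
from html.entities import name2codepoint


def entity_row(name: str, description: str = "") -> str:
    """Build one tab-separated row in the five-column layout (hypothetical helper)."""
    codepoint = name2codepoint[name]  # raises KeyError for names outside HTML 4
    char = chr(codepoint)
    # Columns: named reference, rendered, numeric reference, rendered, description
    return "\t".join([f"&{name};", char, f"&#{codepoint};", char, description])


print(entity_row("copy", "copyright sign"))
```

The second and fourth columns are identical by construction, because the named and numeric forms refer to the same code point.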
---------
HTML special characters

Entity	Entity Displayed	Number	Number Displayed	Description
&amp;	&	&#38;	&	ampersand
&gt;	>	&#62;	>	greater-than sign
&lt;	<	&#60;	<	less-than sign
&quot;	"	&#34;	"	quotation mark = APL quote
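These four characters are the ones that usually have to be escaped when text is placed inside HTML markup. A small sketch, assuming Python's standard `html.escape` is an acceptable way to do it (the sample string `raw` is made up for the example):

```python
import html

raw = 'if a < b && c > d: print("ok")'

# html.escape replaces &, < and >; with the default quote=True it also
# replaces double and single quotes.
escaped = html.escape(raw)
print(escaped)

# Decoding the escaped text gives back the original string.
print(html.unescape(escaped) == raw)
```

Note that the ampersand has to be escaped along with the other characters, otherwise the "&" in the original text would be read as the start of an entity.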

HTML entities for accents and Western European language characters

Entity	Entity Displayed	Number	Number Displayed	Description
&acute;	´	&#180;	´	acute accent = spacing acute
&cedil;	¸	&#184;	¸	cedilla = spacing cedilla
&circ;	ˆ	&#710;	ˆ	modifier letter circumflex accent
&macr;	¯	&#175;	¯	macron = spacing macron = overline = APL overbar
&middot;	·	&#183;	·	middle dot = Georgian comma = Greek middle dot
&tilde;	˜	&#732;	˜	small tilde
&uml;	¨	&#168;	¨	diaeresis = spacing diaeresis

&Aacute;	Á	&#193;	Á	latin capital letter A with acute
&aacute;	á	&#225;	á	latin small letter a with acute
&Acirc;	Â	&#194;	Â	latin capital letter A with circumflex
&acirc;	â	&#226;	â	latin small letter a with circumflex
&AElig;	Æ	&#198;	Æ	latin capital letter AE = latin capital ligature AE
&aelig;	æ	&#230;	æ	latin small letter ae = latin small ligature ae
&Agrave;	À	&#192;	À	latin capital letter A with grave = latin capital letter A grave
&agrave;	à	&#224;	à	latin small letter a with grave = latin small letter a grave
&Aring;	Å	&#197;	Å	latin capital letter A with ring above = latin capital letter A ring
&aring;	å	&#229;	å	latin small letter a with ring above = latin small letter a ring
&Atilde;	Ã	&#195;	Ã	latin capital letter A with tilde
&atilde;	ã	&#227;	ã	latin small letter a with tilde
&Auml;	Ä	&#196;	Ä	latin capital letter A with diaeresis
&auml;	ä	&#228;	ä	latin small letter a with diaeresis
&Ccedil;	Ç	&#199;	Ç	latin capital letter C with cedilla
&ccedil;	ç	&#231;	ç	latin small letter c with cedilla
&Eacute;	É	&#201;	É	latin capital letter E with acute
&eacute;	é	&#233;	é	latin small letter e with acute
&Ecirc;	Ê	&#202;	Ê	latin capital letter E with circumflex
&ecirc;	ê	&#234;	ê	latin small letter e with circumflex
&Egrave;	È	&#200;	È	latin capital letter E with grave
&egrave;	è	&#232;	è	latin small letter e with grave
&ETH;	Ð	&#208;	Ð	latin capital letter ETH
&eth;	ð	&#240;	ð	latin small letter eth
&Euml;	Ë	&#203;	Ë	latin capital letter E with diaeresis
&euml;	ë	&#235;	ë	latin small letter e with diaeresis
&Iacute;	Í	&#205;	Í	latin capital letter I with acute
&iacute;	í	&#237;	í	latin small letter i with acute
&Icirc;	Î	&#206;	Î	latin capital letter I with circumflex
&icirc;	î	&#238;	î	latin small letter i with circumflex
&Igrave;	Ì	&#204;	Ì	latin capital letter I with grave
&igrave;	ì	&#236;	ì	latin small letter i with grave
&Iuml;	Ï	&#207;	Ï	latin capital letter I with diaeresis
&iuml;	ï	&#239;	ï	latin small letter i with diaeresis
&Ntilde;	Ñ	&#209;	Ñ	latin capital letter N with tilde
&ntilde;	ñ	&#241;	ñ	latin small letter n with tilde
&Oacute;	Ó	&#211;	Ó	latin capital letter O with acute
&oacute;	ó	&#243;	ó	latin small letter o with acute
&Ocirc;	Ô	&#212;	Ô	latin capital letter O with circumflex
&ocirc;	ô	&#244;	ô	latin small letter o with circumflex
&OElig;	Œ	&#338;	Œ	latin capital ligature OE
&oelig;	œ	&#339;	œ	latin small ligature oe (note)
&Ograve;	Ò	&#210;	Ò	latin capital letter O with grave
&ograve;	ò	&#242;	ò	latin small letter o with grave
&Oslash;	Ø	&#216;	Ø	latin capital letter O with stroke = latin capital letter O slash
&oslash;	ø	&#248;	ø	latin small letter o with stroke = latin small letter o slash
&Otilde;	Õ	&#213;	Õ	latin capital letter O with tilde
&otilde;	õ	&#245;	õ	latin small letter o with tilde
&Ouml;	Ö	&#214;	Ö	latin capital letter O with diaeresis
&ouml;	ö	&#246;	ö	latin small letter o with diaeresis
&Scaron;	Š	&#352;	Š	latin capital letter S with caron
&scaron;	š	&#353;	š	latin small letter s with caron
&szlig;	ß	&#223;	ß	latin small letter sharp s = ess-zed
&THORN;	Þ	&#222;	Þ	latin capital letter THORN
&thorn;	þ	&#254;	þ	latin small letter thorn
&Uacute;	Ú	&#218;	Ú	latin capital letter U with acute
&uacute;	ú	&#250;	ú	latin small letter u with acute
&Ucirc;	Û	&#219;	Û	latin capital letter U with circumflex
&ucirc;	û	&#251;	û	latin small letter u with circumflex
&Ugrave;	Ù	&#217;	Ù	latin capital letter U with grave
&ugrave;	ù	&#249;	ù	latin small letter u with grave
&Uuml;	Ü	&#220;	Ü	latin capital letter U with diaeresis
&uuml;	ü	&#252;	ü	latin small letter u with diaeresis
&Yacute;	Ý	&#221;	Ý	latin capital letter Y with acute
&yacute;	ý	&#253;	ý	latin small letter y with acute
&yuml;	ÿ	&#255;	ÿ	latin small letter y with diaeresis
&Yuml;	Ÿ	&#376;	Ÿ	latin capital letter Y with diaeresis

HTML entities for punctuation

Entity	Entity Displayed	Number	Number Displayed	Description
&cent;	¢	&#162;	¢	cent sign
&curren;	¤	&#164;	¤	currency sign
&euro;	€	&#8364;	€	euro sign
&pound;	£	&#163;	£	pound sign
&yen;	¥	&#165;	¥	yen sign = yuan sign

&brvbar;	¦	&#166;	¦	broken bar = broken vertical bar
&bull;	•	&#8226;	•	bullet = black small circle (note)
&copy;	©	&#169;	©	copyright sign
&dagger;	†	&#8224;	†	dagger
&Dagger;	‡	&#8225;	‡	double dagger
&frasl;	⁄	&#8260;	⁄	fraction slash
&hellip;	…	&#8230;	…	horizontal ellipsis = three dot leader
&iexcl;	¡	&#161;	¡	inverted exclamation mark
&image;	ℑ	&#8465;	ℑ	blackletter capital I = imaginary part
&iquest;	¿	&#191;	¿	inverted question mark = turned question mark
&lrm;	‎	&#8206;	‎	left-to-right mark (for formatting only)
&mdash;	—	&#8212;	—	em dash
&ndash;	–	&#8211;	–	en dash
&not;	¬	&#172;	¬	not sign
&oline;	‾	&#8254;	‾	overline = spacing overscore
&ordf;	ª	&#170;	ª	feminine ordinal indicator
&ordm;	º	&#186;	º	masculine ordinal indicator
&para;	¶	&#182;	¶	pilcrow sign = paragraph sign
&permil;	‰	&#8240;	‰	per mille sign
&prime;	′	&#8242;	′	prime = minutes = feet
&Prime;	″	&#8243;	″	double prime = seconds = inches
&real;	ℜ	&#8476;	ℜ	blackletter capital R = real part symbol
&reg;	®	&#174;	®	registered sign = registered trade mark sign
&rlm;	‏	&#8207;	‏	right-to-left mark (for formatting only)
&sect;	§	&#167;	§	section sign
&shy;		&#173;		soft hyphen = discretionary hyphen (displays incorrectly on Mac)
&sup1;	¹	&#185;	¹	superscript one = superscript digit one
&trade;	™	&#8482;	™	trade mark sign
&weierp;	℘	&#8472;	℘	script capital P = power set = Weierstrass p

&bdquo;	„	&#8222;	„	double low-9 quotation mark
&laquo;	«	&#171;	«	left-pointing double angle quotation mark = left pointing guillemet
&ldquo;	“	&#8220;	“	left double quotation mark
&lsaquo;	‹	&#8249;	‹	single left-pointing angle quotation mark (note)
&lsquo;	‘	&#8216;	‘	left single quotation mark
&raquo;	»	&#187;	»	right-pointing double angle quotation mark = right pointing guillemet
&rdquo;	”	&#8221;	”	right double quotation mark
&rsaquo;	›	&#8250;	›	single right-pointing angle quotation mark (note)
&rsquo;	’	&#8217;	’	right single quotation mark
&sbquo;	‚	&#8218;	‚	single low-9 quotation mark

&emsp;	 	&#8195;	 	em space
&ensp;	 	&#8194;	 	en space
&nbsp;	 	&#160;	 	no-break space = non-breaking space
&thinsp;	 	&#8201;	 	thin space
&zwj;	‍	&#8205;	‍	zero width joiner
&zwnj;	‌	&#8204;	‌	zero width non-joiner

HTML entities for math symbols and Greek letters

Entity	Entity Displayed	Number	Number Displayed	Description
&deg;	°	&#176;	°	degree sign
&divide;	÷	&#247;	÷	division sign
&frac12;	½	&#189;	½	vulgar fraction one half = fraction one half
&frac14;	¼	&#188;	¼	vulgar fraction one quarter = fraction one quarter
&frac34;	¾	&#190;	¾	vulgar fraction three quarters = fraction three quarters
&ge;	≥	&#8805;	≥	greater-than or equal to
&le;	≤	&#8804;	≤	less-than or equal to
&minus;	−	&#8722;	−	minus sign
&sup2;	²	&#178;	²	superscript two = superscript digit two = squared
&sup3;	³	&#179;	³	superscript three = superscript digit three = cubed
&times;	×	&#215;	×	multiplication sign

&alefsym;	ℵ	&#8501;	ℵ	alef symbol = first transfinite cardinal (note)
&and;	∧	&#8743;	∧	logical and = wedge
&ang;	∠	&#8736;	∠	angle
&asymp;	≈	&#8776;	≈	almost equal to = asymptotic to
&cap;	∩	&#8745;	∩	intersection = cap
&cong;	≅	&#8773;	≅	approximately equal to
&cup;	∪	&#8746;	∪	union = cup
&empty;	∅	&#8709;	∅	empty set = null set = diameter
&equiv;	≡	&#8801;	≡	identical to
&exist;	∃	&#8707;	∃	there exists
&fnof;	ƒ	&#402;	ƒ	latin small f with hook = function = florin
&forall;	∀	&#8704;	∀	for all
&infin;	∞	&#8734;	∞	infinity
&int;	∫	&#8747;	∫	integral
&isin;	∈	&#8712;	∈	element of
&lang;	⟨	&#9001;	〈	left-pointing angle bracket = bra (note)
&lceil;	⌈	&#8968;	⌈	left ceiling = apl upstile
&lfloor;	⌊	&#8970;	⌊	left floor = apl downstile
&lowast;	∗	&#8727;	∗	asterisk operator
&micro;	µ	&#181;	µ	micro sign
&nabla;	∇	&#8711;	∇	nabla = backward difference
&ne;	≠	&#8800;	≠	not equal to
&ni;	∋	&#8715;	∋	contains as member (note)
&notin;	∉	&#8713;	∉	not an element of
&nsub;	⊄	&#8836;	⊄	not a subset of
&oplus;	⊕	&#8853;	⊕	circled plus = direct sum
&or;	∨	&#8744;	∨	logical or = vee
&otimes;	⊗	&#8855;	⊗	circled times = vector product
&part;	∂	&#8706;	∂	partial differential
&perp;	⊥	&#8869;	⊥	up tack = orthogonal to = perpendicular
&plusmn;	±	&#177;	±	plus-minus sign = plus-or-minus sign
&prod;	∏	&#8719;	∏	n-ary product = product sign (note)
&prop;	∝	&#8733;	∝	proportional to
&radic;	√	&#8730;	√	square root = radical sign
&rang;	⟩	&#9002;	〉	right-pointing angle bracket = ket (note)
&rceil;	⌉	&#8969;	⌉	right ceiling
&rfloor;	⌋	&#8971;	⌋	right floor
&sdot;	⋅	&#8901;	⋅	dot operator (note)
&sim;	∼	&#8764;	∼	tilde operator = varies with = similar to (note)
&sub;	⊂	&#8834;	⊂	subset of
&sube;	⊆	&#8838;	⊆	subset of or equal to
&sum;	∑	&#8721;	∑	n-ary summation (note)
&sup;	⊃	&#8835;	⊃	superset of (note)
&supe;	⊇	&#8839;	⊇	superset of or equal to
&there4;	∴	&#8756;	∴	therefore

&Alpha;	Α	&#913;	Α	greek capital letter alpha
&alpha;	α	&#945;	α	greek small letter alpha
&Beta;	Β	&#914;	Β	greek capital letter beta
&beta;	β	&#946;	β	greek small letter beta
&Chi;	Χ	&#935;	Χ	greek capital letter chi
&chi;	χ	&#967;	χ	greek small letter chi
&Delta;	Δ	&#916;	Δ	greek capital letter delta
&delta;	δ	&#948;	δ	greek small letter delta
&Epsilon;	Ε	&#917;	Ε	greek capital letter epsilon
&epsilon;	ε	&#949;	ε	greek small letter epsilon
&Eta;	Η	&#919;	Η	greek capital letter eta
&eta;	η	&#951;	η	greek small letter eta
&Gamma;	Γ	&#915;	Γ	greek capital letter gamma
&gamma;	γ	&#947;	γ	greek small letter gamma
&Iota;	Ι	&#921;	Ι	greek capital letter iota
&iota;	ι	&#953;	ι	greek small letter iota
&Kappa;	Κ	&#922;	Κ	greek capital letter kappa
&kappa;	κ	&#954;	κ	greek small letter kappa
&Lambda;	Λ	&#923;	Λ	greek capital letter lambda
&lambda;	λ	&#955;	λ	greek small letter lambda
&Mu;	Μ	&#924;	Μ	greek capital letter mu
&mu;	μ	&#956;	μ	greek small letter mu
&Nu;	Ν	&#925;	Ν	greek capital letter nu
&nu;	ν	&#957;	ν	greek small letter nu
&Omega;	Ω	&#937;	Ω	greek capital letter omega
&omega;	ω	&#969;	ω	greek small letter omega
&Omicron;	Ο	&#927;	Ο	greek capital letter omicron
&omicron;	ο	&#959;	ο	greek small letter omicron
&Phi;	Φ	&#934;	Φ	greek capital letter phi
&phi;	φ	&#966;	φ	greek small letter phi
&Pi;	Π	&#928;	Π	greek capital letter pi
&pi;	π	&#960;	π	greek small letter pi
&piv;	ϖ	&#982;	ϖ	greek pi symbol
&Psi;	Ψ	&#936;	Ψ	greek capital letter psi
&psi;	ψ	&#968;	ψ	greek small letter psi
&Rho;	Ρ	&#929;	Ρ	greek capital letter rho
&rho;	ρ	&#961;	ρ	greek small letter rho
&Sigma;	Σ	&#931;	Σ	greek capital letter sigma
&sigma;	σ	&#963;	σ	greek small letter sigma
&sigmaf;	ς	&#962;	ς	greek small letter final sigma (note)
&Tau;	Τ	&#932;	Τ	greek capital letter tau
&tau;	τ	&#964;	τ	greek small letter tau
&Theta;	Θ	&#920;	Θ	greek capital letter theta
&theta;	θ	&#952;	θ	greek small letter theta
&thetasym;	ϑ	&#977;	ϑ	greek small letter theta symbol
&upsih;	ϒ	&#978;	ϒ	greek upsilon with hook symbol
&Upsilon;	Υ	&#933;	Υ	greek capital letter upsilon
&upsilon;	υ	&#965;	υ	greek small letter upsilon
&Xi;	Ξ	&#926;	Ξ	greek capital letter xi
&xi;	ξ	&#958;	ξ	greek small letter xi
&Zeta;	Ζ	&#918;	Ζ	greek capital letter zeta
&zeta;	ζ	&#950;	ζ	greek small letter zeta

HTML entities for arrows and other symbols

Entity	Entity Displayed	Number	Number Displayed	Description
&crarr;	↵	&#8629;	↵	downwards arrow with corner leftwards = carriage return
&darr;	↓	&#8595;	↓	downwards arrow
&dArr;	⇓	&#8659;	⇓	downwards double arrow
&harr;	↔	&#8596;	↔	left right arrow
&hArr;	⇔	&#8660;	⇔	left right double arrow
&larr;	←	&#8592;	←	leftwards arrow
&lArr;	⇐	&#8656;	⇐	leftwards double arrow (note)
&rarr;	→	&#8594;	→	rightwards arrow
&rArr;	⇒	&#8658;	⇒	rightwards double arrow (note)
&uarr;	↑	&#8593;	↑	upwards arrow
&uArr;	⇑	&#8657;	⇑	upwards double arrow

&clubs;	♣	&#9827;	♣	black club suit = shamrock
&diams;	♦	&#9830;	♦	black diamond suit
&hearts;	♥	&#9829;	♥	black heart suit = valentine
&spades;	♠	&#9824;	♠	black spade suit (note)

&loz;	◊	&#9674;	◊	lozenge