How nbsp problems arise in XML files and how to solve them

When an XML file encounters 'nbs... (http://gv.ca/dtd/character-entities.dtd) to solve the problem.

1.   nbsp usage, the definitive, full answer (and you thought it was 42?)
2.   nbsp doesn't work
3.   nbsp in output
4.   Another explanation
5.   nbsp, why doesn't it work

1.

nbsp usage, the definitive, full answer (and you thought it was 42?)

Jeni Tennison.


  > Could somebody explain to me WHY '&amp;amp;' translates to '&amp;'
  > but '&amp;nbsp;' doesn't change at all?

Let's consider this simple stylesheet:

  <?xml version="1.0" encoding="ISO-8859-1"?>
  <xsl:stylesheet version="1.0"
                  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  
  <xsl:template match="/">
    <html>
      <head><title>Test</title></head>
      <body>
        <p>Non-breaking&amp;nbsp;space</p>
        <p>Non-breaking&#160;space</p>
      </body>
    </html>
  </xsl:template>
                  
  </xsl:stylesheet>

This stylesheet is stored on the hard disk as a series of bytes. The bytes match characters according to the ISO-8859-1 encoding (see the encoding pseudo-attribute on the XML declaration?).

When the XML parser reads in this as an XML document, it decodes the bytes into Unicode characters. It also parses the document, recognising things like start tags (e.g. <p>), built-in entity references (e.g. &amp;) and character references (e.g. &#160;).

The parser knows that &amp; stands for an & character (because it knows XML) and knows that &#160; stands for a non-breaking space character (because it knows XML and Unicode).

The parser reports to the XSLT processor when elements occur and what characters text is made up of, but doesn't report whether a particular character was originally serialized as the plain character (an actual space character), an entity reference or a character reference.

As far as an XSLT processor is concerned, therefore, the following elements in the stylesheet (or in an XML source document) would all be reported as *exactly* the same (a p element containing a text node whose string value is a double-quote character):

    <p>"</p>
    <p>&#34;</p>
    <p>&#x22;</p>
    <p>&quot;</p>
    <p><![CDATA["]]></p>
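
This equivalence is easy to check with any XML parser. The following is a minimal sketch, assuming Python and its standard-library xml.etree.ElementTree parser (a choice made here for illustration; the discussion itself applies to XSLT processors in general). It parses each of the five forms and compares the text that the parser reports.

```python
import xml.etree.ElementTree as ET

# Five different serializations of a p element holding one double-quote.
forms = [
    '<p>"</p>',
    '<p>&#34;</p>',
    '<p>&#x22;</p>',
    '<p>&quot;</p>',
    '<p><![CDATA["]]></p>',
]

# The parser reports only the resulting characters, not how they were written.
texts = [ET.fromstring(form).text for form in forms]
all_same = all(text == '"' for text in texts)
print(all_same)
```

Nothing in the parsed result records which of the five spellings was used, which is why an XSLT processor cannot tell them apart either.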

The two p elements, as serialized in the stylesheet, look like:

    <p>Non-breaking&amp;nbsp;space</p>
    <p>Non-breaking&#160;space</p>

For the first p element, the XML parser reports the string (here containing no escaping of any kind - every character is a literal character):

Non-breaking&nbsp;space

For the second p element, the XML parser reports the string (here containing an underscore character as a stand-in for a non-breaking space, since you can't see non-breaking spaces in emails):

Non-breaking_space

The XSLT processor builds a result tree from the stylesheet, which contains these text nodes and looks something like:

    /
    +- html
       +- head
       |  +- title
       |     +- text: "Test"
       +- body
          +- p
          |  +- text: "Non-breaking&nbsp;space"
          +- p
             +- text: "Non-breaking_space"

This tree exists in memory. All the characters are Unicode characters.

Once the XSLT processor has finished its transformation, it serializes this result tree. There are three methods that it could use to serialize the result tree: xml, html and text. The choice is controlled by the method attribute of xsl:output. It could also use any encoding - any mapping of characters to bytes - which is controlled by the encoding attribute of xsl:output.

The most straight-forward output method is the XML output method. In the XML output method, element nodes are serialized as a start tag, followed by content, followed by an end tag. Any characters in the element content that have to be escaped due to XML rules are escaped. So if you have a less-than sign in your text node, then it is automatically escaped to &lt;. If you have an ampersand in your text node then it is automatically escaped to &amp;. If you have a character that can't be represented by the encoding that you're using, then it is escaped using character references (e.g. &#160;).

Let's use a really really basic encoding, ASCII, which only covers 128 characters (and doesn't include non-breaking spaces). You can usually make your stylesheet generate ASCII with:

  <xsl:output encoding="ASCII" />

The non-breaking space character isn't covered by ASCII, so the non-breaking space character has to be escaped in the serialization using a character reference. So the serialization of the output tree will look like:

  <html>
    <head><title>Test</title></head>
    <body>
      <p>Non-breaking&amp;nbsp;space</p>
      <p>Non-breaking&#160;space</p>
    </body>
  </html>

If you used an encoding that covers the non-breaking space character, such as ISO-8859-1 or UTF-8 or UTF-16, then the non-breaking space character would be output as a literal non-breaking space character, and you'd get (substituting _ for non-breaking space characters again):

  <html>
    <head><title>Test</title></head>
    <body>
      <p>Non-breaking&amp;nbsp;space</p>
      <p>Non-breaking_space</p>
    </body>
  </html>
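
The effect of the output encoding can be reproduced outside XSLT. The sketch below assumes Python's standard-library xml.etree.ElementTree serializer as a stand-in for the XML output method of an XSLT processor; the variable names are illustrative only. It serializes the same text node under three encodings.

```python
import xml.etree.ElementTree as ET

# Build the second paragraph in memory; "\u00a0" is the non-breaking space.
p = ET.Element("p")
p.text = "Non-breaking\u00a0space"

# ASCII cannot represent U+00A0, so a character reference is written.
as_ascii = ET.tostring(p, encoding="us-ascii")

# UTF-8 and ISO-8859-1 can represent it, so raw bytes are written instead.
as_utf8 = ET.tostring(p, encoding="utf-8")
as_latin1 = ET.tostring(p, encoding="iso-8859-1")

# A literal ampersand in a text node is always escaped on output.
q = ET.Element("p")
q.text = "Non-breaking&nbsp;space"
escaped = ET.tostring(q, encoding="us-ascii")

print(as_ascii, as_utf8, as_latin1, escaped, sep="\n")
```

The in-memory tree is identical in the first three cases; only the bytes chosen to represent it differ.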

Trouble arises, however, when you try to view a document that's been saved using UTF-8 in an editor that doesn't support UTF-8. Such an editor typically tries to interpret the sequence of bytes that it reads from the file as ISO-8859-1 characters. It's a bit like taking an English document and trying to read it as if it were written in German. Some of the words might make sense, but most of the time you get gobbledy-gook.

Specifically, because UTF-8 uses two bytes for the non-breaking space character whereas ISO-8859-1 uses one, when you try to read a UTF-8 document as if it were ISO-8859-1, you see two characters where you expect one. The first byte of the UTF-8 sequence (hex C2) is the same as the byte that is used in ISO-8859-1 to mean the Â character, while the second byte (hex A0) is the one that ISO-8859-1 uses for the non-breaking space itself. So you tend to see Â_ rather than just _, for example.
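
The misreading can be reproduced in a few lines for the UTF-8 case. This is a minimal sketch in Python (an assumption made for illustration): it encodes a non-breaking space as UTF-8 and then decodes the bytes as ISO-8859-1, which is what a misconfigured editor or browser effectively does.

```python
# U+00A0 (non-breaking space) encoded as UTF-8 is the two bytes C2 A0.
nbsp = "\u00a0"
utf8_bytes = nbsp.encode("utf-8")

# Decoding those bytes with the wrong encoding yields two characters:
# C2 is "Â" in ISO-8859-1, and A0 is the non-breaking space itself.
misread = utf8_bytes.decode("iso-8859-1")

print(utf8_bytes.hex(), [hex(ord(ch)) for ch in misread])
```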

Let's return to looking at the possible serializations of the result tree. The next possible serialization is HTML. HTML is serialized more-or-less the same as XML, with a few differences. The difference that is pertinent here is that when you use the html output method, XSLT processors are allowed to serialize a character using the entities defined in HTML, rather than as a native character (if the character can be represented in the encoding) or a character reference (if it can't). In our case, XSLT processors are allowed to serialize the non-breaking space character as the HTML character entity reference &nbsp;. So serializing as HTML, you may get:

  <html>
    <head><title>Test</title></head>
    <body>
      <p>Non-breaking&amp;nbsp;space</p>
      <p>Non-breaking&nbsp;space</p>
    </body>
  </html>
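
To make the idea concrete, here is a rough sketch of what such a substitution might look like, written in Python with the standard-library html.entities table. The function serialize_text_as_html is hypothetical and is not how any particular XSLT processor is implemented; it only illustrates escaping markup characters and then preferring an HTML entity name where one exists.

```python
from html.entities import codepoint2name
from xml.sax.saxutils import escape


def serialize_text_as_html(text: str) -> str:
    """Hypothetical helper: escape markup, then use HTML entity names."""
    out = []
    for ch in escape(text):  # escape() handles &, < and >
        name = codepoint2name.get(ord(ch))
        # Only non-ASCII characters with a known HTML name are replaced.
        if ord(ch) > 127 and name is not None:
            out.append(f"&{name};")
        else:
            out.append(ch)
    return "".join(out)


first = serialize_text_as_html("Non-breaking&nbsp;space")
second = serialize_text_as_html("Non-breaking\u00a0space")
print(first, second, sep="\n")
```

The literal ampersand in the first string is still escaped, while the real non-breaking space in the second string becomes the named entity.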

Finally, let's consider the text output method. In the text output method, everything aside from text nodes is ignored, and the text is output without any automatic escaping. If a character can be represented in the encoding that you use, then it will be serialized as a native character. If it can't be, then the XSLT processor gives you an error. In our case, assuming that we're using an encoding that supports the non-breaking space character, we'd get something like (again with _ representing the non-breaking space):

  Non-breaking&nbsp;spaceNon-breaking_space

  > And, how would you suggest someone actually get '&nbsp;' into the
  > output in order to avoid the issue which started this thread in the
  > first place? (browsers assuming a different encoding type than is
  > sent, and therefore mistranslating character 160 as 'Â' instead of '
  > ') I have yet to see a browser which misunderstands '&nbsp;'.

Hopefully, what I've explained above makes it clear that a browser that sees a non-breaking space character as an Â followed by a non-breaking space character is making that error because it is reading the result of the transformation as if it is in one encoding (e.g. ISO-8859-1) when in fact it is in another encoding (e.g. UTF-8).

There are several solutions:

- change the browser so that it auto-detects the actual encoding that's being used in the HTML/XML document (and make sure that you're reporting the correct encoding in the HTTP headers)
- change the serialization process so that you use an encoding that the browser is expecting, by adding encoding="ISO-8859-1" to the xsl:output element
- change the serialization process so that you use an encoding that doesn't include the non-breaking space character, so that the processor uses a character reference for it, for example using ASCII as the encoding
- use the HTML output method with an XSLT processor that serializes non-breaking spaces as &nbsp;

Cheers, Jeni

P.S. There is another solution that will work with some processors, but not all - disabling output escaping for the text node that contains the relevant characters. But since you can solve the problem a lot more elegantly with one of the methods above, there's no reason to use it.  Jeni Tennison

2.

nbsp doesn't work

Mike Brown


> How can I insert a tab and/or space
> characters into my html output from the xsl?
> &nbsp;, etc aren't legal in the xsl document....

This is the all-time #1 FAQ.

Regardless, just pick one:

1. &#160;

2. &#xA0;

3. &nbsp; after putting <!DOCTYPE xsl:stylesheet [<!ENTITY nbsp "&#160;">]> at the top of your stylesheet, after the XML declaration but before anything else (or reference any DTD containing that entity declaration);

4. type the character directly, if your keyboard and/or OS provide a way for you to do so, and your editor can be counted on to save the document in an encoding that supports that character, and you've made the encoding declaration match your editor's output.
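
The first three options can be checked with any XML parser. The sketch below assumes Python's standard-library xml.etree.ElementTree (chosen here only for illustration, with a p element standing in for a stylesheet). It shows that an undeclared &nbsp; is rejected, while a declared entity and the two character references all yield the same character.

```python
import xml.etree.ElementTree as ET

# Without a declaration, &nbsp; is an undefined entity and parsing fails.
try:
    ET.fromstring("<p>a&nbsp;b</p>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

# Declaring the entity in an internal DTD subset makes the reference legal.
declared = ET.fromstring(
    '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]><p>a&nbsp;b</p>'
)

# Character references need no declaration at all.
decimal = ET.fromstring("<p>a&#160;b</p>")
hexadecimal = ET.fromstring("<p>a&#xA0;b</p>")

print(undeclared_ok, declared.text == decimal.text == hexadecimal.text)
```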

3.

nbsp in output

Mike Brown

> I'm generating HTML from XML 
> The output HTML needs to contain some "&nbsp;". But until now I could not
> find a way to implement that.

&nbsp; is, by definition, &#160;.

Just put &#160; (or &#xA0;, the hex equivalent) in your stylesheet to represent the non-breaking space character in the stylesheet tree and result tree. When the result tree is output, the character will be output as either &#160; or &nbsp;, assuming you have <xsl:output method="html"/> in the stylesheet.

Wendell Piez outlines a use in tables with empty cells.

Outputting spaces in html table cells

Use &#160; for a non-breaking space. Your XML parser does not pick up the named entity &nbsp; because it hasn't been declared. But a numbered character reference (which is what &#160; is) will be recognized -- #160 is a non-breaking space.

You can even declare nbsp in an internal subset of your stylesheet if you want a friendlier representation of the character.

> There is some code before this that generates a table. If the value of
> "blah" is blank, and I was outputting this to HTML, then Netscape would
> not handle blank <td/> fields in an elegant manner because it would shift
> the next column over one to replace the blank column. Normally, I would
> insert an '&nbsp;' between each <td> tag so that Netscape would render a
> space and not ignore the cell, but as you know, '&' is reserved in XML. I
> tried &amp;, but that doesn't render a space but rather the real '&'
> symbol. So my question is what is the best way to solve this problem?

4.

Another explanation

Trevor Nash

In an attempt to reduce the number of 'how do I get &nbsp;' questions, I have tried to update Dave Pawson's FAQ on the subject: text follows. I also sent a message to the list owners to see if we can get the search mechanism tweaked to make it easier to find &nbsp;

I actually found it quite hard to locate definitive answers on the subject which cover all the angles, partly because it has been discussed so many times, and partly because some need to be edited for language ;-)

I have paraphrased my recollections of what has been said about dealing with badly configured / old browsers. I would welcome pointers to actual messages off the list which I could quote instead, and any improvements on the ones I have chosen.

How to output &nbsp; in HTML

[ existing text from the nbsp topic ]
Mike Brown:

> I'm generating HTML from XML 
> The output HTML needs to contain some "&nbsp;". But until now I could not
> find a way to implement that.

&nbsp; is, by definition, &#160;. Just put &#160; (or &#xA0;) in your stylesheet to represent the non-breaking space character in the stylesheet tree and result tree. When the result tree is output, the character will be output as either &#160; or &nbsp;, assuming you have <xsl:output method="html"/> in the stylesheet.

> I thought the &nbsp; entity was predefined in xml.

It is not predefined. Only &lt; &gt; &amp; &quot; &apos; are predefined. You can either use &#160; or &#xA0;, or you can define an entity like nbsp for the same.

Try:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE xsl:stylesheet [ <!ENTITY nbsp "&#160;"> ]>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
  <!-- templates go here; &nbsp; can now be used in them -->
</xsl:stylesheet>

Apparently one motivation for trying to get &nbsp; into the output is to cope with browsers that either cannot handle the encoding being used or have been set up incorrectly (the advice is to set to 'auto detect' if this option is available).

Mike Brown:
(http://www.biglist.com/lists/xsl-list/archives/200001/msg00255.html)

> Another part of my problem was that a literal character #160 was
> mysteriously coming through not as a non-breaking space, but as a Â
> character, which is ANSI #194.

&#160; in an XML document always refers to UCS character code U+00A0. This character must be encoded upon output in a document. If your document is encoded as ISO-8859-1, the character will manifest as the single byte A0 (in hex, or 160 in decimal). US-ASCII has no byte for it, so there it has to be written as a character reference. If your document is encoded with UTF-8, it will be the pair of bytes C2 A0.

If you are looking at the UTF-8 encoded document in an editor or shell/terminal window that doesn't know to interpret hex C2 A0 as a UTF-8 sequence, then you'll probably see Â (the character in many character sets/fonts at position hex C2, aka decimal 194) followed by an invisible character (A0, which if interpreted as an ISO-8859-1 character is itself the non-breaking space).
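
The byte values are straightforward to verify. This is a minimal sketch in Python (an assumption made for illustration, not part of the original message) that encodes U+00A0 under each encoding mentioned and falls back to a character reference where the encoding has no byte for it.

```python
nbsp = "\u00a0"  # U+00A0, the character that &#160; refers to

latin1 = nbsp.encode("iso-8859-1")
utf8 = nbsp.encode("utf-8")

# US-ASCII has no byte for U+00A0; ask for a character reference instead.
ascii_ref = nbsp.encode("ascii", errors="xmlcharrefreplace")

print(latin1.hex(), utf8.hex(), ascii_ref)
```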

If you don't like the encoding your XSLT processor gives you normally, you can use the encoding attribute on the xsl:output element to specify a particular encoding (provided your processor knows how to deal with it).

Ref: http://www.w3.org/TR/xslt#output

If you are having to deal with old browsers and/or misconfigured clients which you do not have the power to change, then you might be left with no choice other than getting &nbsp; into the output. There is no nice way to do this (as I hope we have already established, the standards are constructed such that it should not be necessary). But if it has to be done, here are the choices, and their caveats:

Choose a processor, such as Saxon, which gives you additional control over the serialisation. Caveat: ties you to one processor.

Use <xsl:text disable-output-escaping="yes">&amp;nbsp;</xsl:text>, possibly with the DTD subset trick described above to keep the stylesheet readable. Caveat: disable-output-escaping doesn't have to be honoured by the processor. Even if it seems to work, it can be fragile because it may be ignored if you later decide to send the output via a DOM, or you use variables and node-set() to store part of your output. See also the DOE (disable-output-escaping) topic.

Use an element or processing instruction to represent the non-breaking space, and substitute it with a custom serialiser. Caveat: hard work, and ties you to a specific processor or class of processors.

Wendell Piez outlines a use in tables with empty cells.

Outputting spaces in html table cells

Use &#160; for a non-breaking space. Your XML parser does not pick up the named entity &nbsp; because it hasn't been declared. But a numbered character reference (which is what &#160; is) will be recognized -- #160 is a non-breaking space.

Some references:

- On the finer points of encodings and character references: List archive
- Mike Brown on browser character encodings: List archive

5.

nbsp, why doesn't it work

Ragulf Pickaxe, David Carlisle


> By googling I found a suggestion to use &#160; instead.

> Is there a reason why &nbsp; is not working?

"&nbsp;" is an HTML entity. XML only predefines five entities: "&lt;" "&gt;" "&amp;" "&quot;" "&apos;"

Therefore all other characters that you need must be written using their character code, as you have found with "&#160;".

It doesn't work because XSLT files have to be well-formed XML, and in XML (and HTML) entities must be defined before use. Most HTML browsers implicitly use a catalogue that defines the entities in the HTML DTD, including nbsp, but in general it's just an undefined reference, unless you define it.

++++++++++++++++++++


As many of you may have noticed, the DOM parser gives errors if the '&nbsp;' entity is present. The E_WARN message looks like:

Warning: DOMDocument::load() [function.load]: Entity 'nbsp' not defined in ...

There are several ways to solve this:
a) The hard way
<xsl:text disable-output-escaping="yes"> &amp;nbsp;</xsl:text>

b) Defining &nbsp;
At the top of the document, after the XML declaration, add:
   <!DOCTYPE xsl:stylesheet [ 
    <!ENTITY nbsp "&#160;" >
    ]>

c) External Doctype
In case you need other HTML entities, you can reference an external doctype with the proper definitions:

<!DOCTYPE page SYSTEM "http://gv.ca/dtd/character-entities.dtd">

Of course, you can download the file and place it on your server.
