How do I strip the HTML content out of a JSON file without breaking it?
The same way as with any other serialized data structure: by using a proper parser (and, in this case, a small recursive function).
import json
import re
json_string = """{
"prop_1": {
"prop_1_1": ["some data", 17, "more data"],
"prop_1_2": "here some , too"
},
"prop_2": "and more "
}"""
def unhtml(string):
    # replace <tag>...</tag>, possibly more than once
    done = False
    while not done:
        temp = re.sub(r'<(\w+)[^>]*>[\s\S]*?</\1>', '', string)
        done = temp == string
        string = temp
    # replace remaining standalone tags, if any
    string = re.sub(r'<[^>]*>', '', string)
    # collapse runs of whitespace left behind by the removed markup
    string = re.sub(r'\s{2,}', ' ', string)
    return string.strip()
def cleanup(element):
    # walk lists and dicts recursively, cleaning every string in place
    if isinstance(element, list):
        for i, item in enumerate(element):
            element[i] = cleanup(item)
    elif isinstance(element, dict):
        for key in element.keys():
            element[key] = cleanup(element[key])
    elif isinstance(element, str):  # basestring on Python 2
        element = unhtml(element)
    return element
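Use it like this:
data = json.loads(json_string)
cleanup(data)
json_string = json.dumps(data)
print(json_string)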
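Note that a regex that throws out HTML tags solves only half of the problem: all character entities (such as &amp; or &lt;) will remain in the string.
Rewrite unhtml() to use a proper parser.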
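A minimal sketch of what such a rewrite could look like, assuming Python 3 and only the standard library's html.parser module. The class name _TextExtractor and the choice to skip script and style contents are my own assumptions, not part of the answer above. Also note that this version keeps the text inside ordinary tags and drops only the markup, which may or may not be what you want.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores all tags."""

    SKIP = {'script', 'style'}  # assumption: their contents are never wanted

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities like &amp;
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # keep text only when we are not inside a skipped element
        if not self._skip_depth:
            self.parts.append(data)


def unhtml(string):
    parser = _TextExtractor()
    parser.feed(string)
    parser.close()  # flush any text the parser is still buffering
    text = ''.join(parser.parts)
    return re.sub(r'\s{2,}', ' ', text).strip()
```

Because it has the same name and signature, this unhtml() can be dropped in without changing cleanup(). For messy real-world markup, a dedicated HTML library is probably a better fit than this sketch.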