A plain-text file, test.txt
Welcome to World Wide Spam, Inc.

These are the corporate web pages of *World Wide Spam*, Inc. We hope
you find your stay enjoyable, and that you will sample many of our
products.

A short history of the company

World Wide Spam was started in the summer of 2000. The business
concept was to ride the dot-com wave and to make money both through
bulk email and by selling canned meat online.

After receiving several complaints from customers who weren't
satisfied by their bulk email, World Wide Spam altered their profile,
and focused 100% on canned goods. Today, they rank as the world's
13,892nd online supplier of SPAM.

Destinations

From this page you may visit several of our interesting web pages:

- What is SPAM? (http://wwspam.fu/whatisspam)

- How do they make it? (http://wwspam.fu/howtomakeit)

- Why should I eat it? (http://wwspam.fu/whyeatit)

How to get in touch with us

You can get in touch with us in *many* ways: By phone (555-1234), by
email (wwspam@wwspam.fu) or by visiting our customer feedback page
(http://wwspam.fu/feedback).
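First implementation
The first step is to split the text into paragraphs (blocks).
A simple way to find a block is to collect every line encountered until a blank line is reached, then return the lines collected so far. The returned lines form one block.
Text block generator, util.py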
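def lines(file):
    # Yield every line of the file, then one extra blank line so that
    # the final block is always terminated.
    for line in file:
        yield line
    yield '\n'

def blocks(file):
    block = []
    for line in lines(file):
        if line.strip():
            # Non-blank line: it belongs to the current block.
            block.append(line)
        elif block:
            # Blank line after some content: the block is complete.
            yield ''.join(block).strip()
            block = []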
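To see the generator at work, here is a minimal sketch written for Python 3 that feeds blocks() an in-memory file. The sample string and the use of io.StringIO are assumptions made for illustration; they are not part of the original program.

```python
import io

def lines(file):
    # Yield every line, then one extra blank line to flush the last block.
    for line in file:
        yield line
    yield '\n'

def blocks(file):
    block = []
    for line in lines(file):
        if line.strip():
            block.append(line)
        elif block:
            yield ''.join(block).strip()
            block = []

# Hypothetical sample input: three blocks, with two blank lines before the last.
sample = io.StringIO("Title line\n\nFirst paragraph,\nsecond line.\n\n\n- item\n")
result = list(blocks(sample))
print(result)
```

Consecutive blank lines do not produce empty blocks, because the elif branch only fires when something has already been collected.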
Adding some HTML markup, simple.py
import sys, re
from util import *

print('<html><head><title>...</title><body>')

title = True
for block in blocks(sys.stdin):
    # Turn *emphasized* text into <em>emphasized</em>.
    block = re.sub(r'\*(.+?)\*', r'<em>\1</em>', block)
    if title:
        # The first block is treated as the document title.
        print('<h1>')
        print(block)
        print('</h1>')
        title = False
    else:
        print('<p>')
        print(block)
        print('</p>')

print('</body></html>')
Run it from the command prompt (cmd) with: python simple.py < test.txt > test.html
<html><head><title>...</title><body>
<h1>
Welcome to World Wide Spam, Inc.
</h1>
<p>
These are the corporate web pages of <em>World Wide Spam</em>, Inc. We hope
you find your stay enjoyable, and that you will sample many of our
products.
</p>
<p>
A short history of the company
</p>
<p>
World Wide Spam was started in the summer of 2000. The business
concept was to ride the dot-com wave and to make money both through
bulk email and by selling canned meat online.
</p>
<p>
After receiving several complaints from customers who weren't
satisfied by their bulk email, World Wide Spam altered their profile,
and focused 100% on canned goods. Today, they rank as the world's
13,892nd online supplier of SPAM.
</p>
<p>
Destinations
</p>
<p>
From this page you may visit several of our interesting web pages:
</p>
<p>
- What is SPAM? (http://wwspam.fu/whatisspam)
</p>
<p>
- How do they make it? (http://wwspam.fu/howtomakeit)
</p>
<p>
- Why should I eat it? (http://wwspam.fu/whyeatit)
</p>
<p>
How to get in touch with us
</p>
<p>
You can get in touch with us in <em>many</em> ways: By phone (555-1234), by
email (wwspam@wwspam.fu) or by visiting our customer feedback page
(http://wwspam.fu/feedback).
</p>
</body></html>
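The only inline markup handled so far is the emphasis substitution. The following is a small sketch of that one step, assuming Python 3; the helper name emphasize and the sample string are hypothetical. It also shows why the pattern uses the non-greedy (.+?) rather than (.+).

```python
import re

# Same pattern as in simple.py: text between two asterisks, non-greedy.
EMPHASIS = re.compile(r'\*(.+?)\*')

def emphasize(block):
    # Hypothetical helper wrapping the substitution used in simple.py.
    return EMPHASIS.sub(r'<em>\1</em>', block)

text = 'in *many* ways and *more* ways'
non_greedy = emphasize(text)

# For comparison: a greedy group swallows everything between the first
# and the last asterisk in the block.
greedy = re.sub(r'\*(.+)\*', r'<em>\1</em>', text)

print(non_greedy)
print(greedy)
```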
Next, we take the implementation further.
The components: a parser, rules, filters, and handlers
Handlers, handlers.py
class Handler:
    """
    An object that handles method calls from the Parser.

    The Parser will call the start() and end() methods at the
    beginning and end of each block, with the proper block name as a
    parameter. The sub() method will be used in regular expression
    substitution. When called with a name such as 'emphasis', it will
    return a proper substitution function.
    """
    def callback(self, prefix, name, *args):
        # Look up a method such as start_paragraph; do nothing if it is missing.
        method = getattr(self, prefix + name, None)
        if callable(method):
            return method(*args)

    def start(self, name):
        self.callback('start_', name)

    def end(self, name):
        self.callback('end_', name)

    def sub(self, name):
        def substitution(match):
            result = self.callback('sub_', name, match)
            # No matching sub_ method: leave the matched text unchanged.
            if result is None:
                return match.group(0)
            return result
        return substitution

class HTMLRenderer(Handler):
    """
    A specific handler used for rendering HTML.

    The methods in HTMLRenderer are accessed from the superclass
    Handler's start(), end(), and sub() methods. They implement basic
    markup as used in HTML documents.
    """
    def start_document(self):
        print('<html><head><title>...</title></head><body>')

    def end_document(self):
        print('</body></html>')

    def start_paragraph(self):
        print('<p>')

    def end_paragraph(self):
        print('</p>')

    def start_heading(self):
        print('<h2>')

    def end_heading(self):
        print('</h2>')

    def start_list(self):
        print('<ul>')

    def end_list(self):
        print('</ul>')

    def start_listitem(self):
        print('<li>')

    def end_listitem(self):
        print('</li>')

    def start_title(self):
        print('<h1>')

    def end_title(self):
        print('</h1>')

    def sub_emphasis(self, match):
        return '<em>%s</em>' % match.group(1)

    def sub_url(self, match):
        return '<a href="%s">%s</a>' % (match.group(1), match.group(1))

    def sub_mail(self, match):
        return '<a href="mailto:%s">%s</a>' % (match.group(1), match.group(1))

    def feed(self, data):
        print(data)
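The name-based dispatch in callback() is the core of the handler design. The sketch below, assuming Python 3, repeats a compact Handler and adds a hypothetical CollectingRenderer that appends to a list instead of printing, so the effect of each call can be inspected. The class name, the 'table' and 'strong' names, and the sample text are illustrative assumptions. In this sketch, sub() returns the matched text unchanged when no sub_ method exists.

```python
import re

class Handler:
    def callback(self, prefix, name, *args):
        # getattr with a default of None makes unknown names harmless.
        method = getattr(self, prefix + name, None)
        if callable(method):
            return method(*args)

    def start(self, name):
        self.callback('start_', name)

    def end(self, name):
        self.callback('end_', name)

    def sub(self, name):
        def substitution(match):
            result = self.callback('sub_', name, match)
            if result is None:
                return match.group(0)
            return result
        return substitution

class CollectingRenderer(Handler):
    # Hypothetical handler: collects output in a list rather than printing.
    def __init__(self):
        self.out = []

    def start_paragraph(self):
        self.out.append('<p>')

    def end_paragraph(self):
        self.out.append('</p>')

    def sub_emphasis(self, match):
        return '<em>%s</em>' % match.group(1)

    def feed(self, data):
        self.out.append(data)

h = CollectingRenderer()
h.start('paragraph')
h.feed(re.sub(r'\*(.+?)\*', h.sub('emphasis'), 'a *b* c'))
h.end('paragraph')
h.start('table')  # no start_table method: silently ignored

# No sub_strong method is defined, so the text comes back unchanged.
unknown = re.sub(r'\*(.+?)\*', h.sub('strong'), 'a *b* c')
print(h.out, unknown)
```

Supporting another output format then only requires another subclass with its own start_, end_, and sub_ methods; the dispatch code stays the same.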
Rules, rules.py
class Rule:
    """
    Base class for all rules.
    """
    def action(self, block, handler):
        handler.start(self.type)
        handler.feed(block)
        handler.end(self.type)
        # True tells the parser to stop applying rules to this block.
        return True

class HeadingRule(Rule):
    """
    A heading is a single line that is at most 70 characters and
    that doesn't end with a colon.
    """
    type = 'heading'

    def condition(self, block):
        return '\n' not in block and len(block) <= 70 and block[-1] != ':'

class TitleRule(HeadingRule):
    """
    The title is the first block in the document, provided that it is
    a heading.
    """
    type = 'title'
    first = True

    def condition(self, block):
        if not self.first:
            return False
        self.first = False
        return HeadingRule.condition(self, block)

class ListItemRule(Rule):
    """
    A list item is a paragraph that begins with a hyphen. As part of
    the formatting, the hyphen is removed.
    """
    type = 'listitem'

    def condition(self, block):
        return block[0] == '-'

    def action(self, block, handler):
        handler.start(self.type)
        handler.feed(block[1:].strip())
        handler.end(self.type)
        return True

class ListRule(ListItemRule):
    """
    A list begins between a block that is not a list item and a
    subsequent list item. It ends after the last consecutive list
    item.
    """
    type = 'list'
    inside = False

    def condition(self, block):
        return True

    def action(self, block, handler):
        if not self.inside and ListItemRule.condition(self, block):
            handler.start(self.type)
            self.inside = True
        elif self.inside and not ListItemRule.condition(self, block):
            handler.end(self.type)
            self.inside = False
        # False lets the remaining rules handle the block itself.
        return False

class ParagraphRule(Rule):
    """
    A paragraph is simply a block that isn't covered by any of the
    other rules.
    """
    type = 'paragraph'

    def condition(self, block):
        return True
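The rule conditions can be tried out in isolation. The following sketch, assuming Python 3, restates the heading and list-item tests as plain functions and applies them in a fixed order. The function names and classify() are hypothetical, and the sketch deliberately ignores the stateful parts (the title and the enclosing list). It uses endswith and startswith, which behave like the index tests above for non-empty blocks.

```python
def is_heading(block):
    # Single line, at most 70 characters, not ending with a colon.
    return '\n' not in block and len(block) <= 70 and not block.endswith(':')

def is_listitem(block):
    # A list item begins with a hyphen.
    return block.startswith('-')

def classify(block):
    # Hypothetical helper: first matching condition wins, paragraph is the fallback.
    if is_listitem(block):
        return 'listitem'
    if is_heading(block):
        return 'heading'
    return 'paragraph'

samples = [
    'Destinations',
    'From this page you may visit several of our interesting web pages:',
    '- What is SPAM? (http://wwspam.fu/whatisspam)',
    'World Wide Spam was started in the summer of 2000. The business\n'
    'concept was to ride the dot-com wave and to make money both through',
]
kinds = [classify(block) for block in samples]
print(kinds)
```

The second sample is a single line, but its trailing colon keeps it from being treated as a heading, which is why it is rendered as a paragraph in front of the list.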
Main program, markup.py
import sys, re
from handlers import *
from util import *
from rules import *

class Parser:
    """
    A Parser reads a text file, applying rules and controlling a
    handler.
    """
    def __init__(self, handler):
        self.handler = handler
        self.rules = []
        self.filters = []

    def addRule(self, rule):
        self.rules.append(rule)

    def addFilter(self, pattern, name):
        # A filter is a function that rewrites inline markup in a block.
        def filter(block, handler):
            return re.sub(pattern, handler.sub(name), block)
        self.filters.append(filter)

    def parse(self, file):
        self.handler.start('document')
        for block in blocks(file):
            for filter in self.filters:
                block = filter(block, self.handler)
            for rule in self.rules:
                if rule.condition(block):
                    # A rule returning True ends rule processing for this block.
                    last = rule.action(block, self.handler)
                    if last:
                        break
        self.handler.end('document')

class BasicTextParser(Parser):
    """
    A specific Parser that adds rules and filters in its
    constructor.
    """
    def __init__(self, handler):
        Parser.__init__(self, handler)
        # Rules are tried in the order they are added.
        self.addRule(ListRule())
        self.addRule(ListItemRule())
        self.addRule(TitleRule())
        self.addRule(HeadingRule())
        self.addRule(ParagraphRule())

        self.addFilter(r'\*(.+?)\*', 'emphasis')
        self.addFilter(r'(http://[\.a-zA-Z/]+)', 'url')
        self.addFilter(r'([\.a-zA-Z]+@[\.a-zA-Z]+[a-zA-Z]+)', 'mail')

handler = HTMLRenderer()
parser = BasicTextParser(handler)

parser.parse(sys.stdin)
Run it from the command prompt (cmd) with: python markup.py < test.txt > test.html
<html><head><title>...</title></head><body>
<h1>
Welcome to World Wide Spam, Inc.
</h1>
<p>
These are the corporate web pages of <em>World Wide Spam</em>, Inc. We hope
you find your stay enjoyable, and that you will sample many of our
products.
</p>
<h2>
A short history of the company
</h2>
<p>
World Wide Spam was started in the summer of 2000. The business
concept was to ride the dot-com wave and to make money both through
bulk email and by selling canned meat online.
</p>
<p>
After receiving several complaints from customers who weren't
satisfied by their bulk email, World Wide Spam altered their profile,
and focused 100% on canned goods. Today, they rank as the world's
13,892nd online supplier of SPAM.
</p>
<h2>
Destinations
</h2>
<p>
From this page you may visit several of our interesting web pages:
</p>
<ul>
<li>
What is SPAM? (<a href="http://wwspam.fu/whatisspam">http://wwspam.fu/whatisspam</a>)
</li>
<li>
How do they make it? (<a href="http://wwspam.fu/howtomakeit">http://wwspam.fu/howtomakeit</a>)
</li>
<li>
Why should I eat it? (<a href="http://wwspam.fu/whyeatit">http://wwspam.fu/whyeatit</a>)
</li>
</ul>
<h2>
How to get in touch with us
</h2>
<p>
You can get in touch with us in <em>many</em> ways: By phone (555-1234), by
email (<a href="mailto:wwspam@wwspam.fu">wwspam@wwspam.fu</a>) or by visiting our customer feedback page
(<a href="http://wwspam.fu/feedback">http://wwspam.fu/feedback</a>).
</p>
</body></html>
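The links in this output come from the url and mail filters. Here is a minimal sketch, assuming Python 3, that applies the same two patterns to a single block, using plain functions in place of the handler's sub_url and sub_mail methods. The function names and the example.com address are hypothetical and only serve the illustration.

```python
import re

# The patterns registered in BasicTextParser.
URL = r'(http://[\.a-zA-Z/]+)'
MAIL = r'([\.a-zA-Z]+@[\.a-zA-Z]+[a-zA-Z]+)'

def link(match):
    # Hypothetical stand-in for HTMLRenderer.sub_url.
    return '<a href="%s">%s</a>' % (match.group(1), match.group(1))

def mailto(match):
    # Hypothetical stand-in for HTMLRenderer.sub_mail.
    return '<a href="mailto:%s">%s</a>' % (match.group(1), match.group(1))

block = 'email (wwspam@wwspam.fu) or (http://wwspam.fu/feedback).'
# Filters run in the order they were added: url first, then mail.
block = re.sub(URL, link, block)
block = re.sub(MAIL, mailto, block)
print(block)

# The character class has no digits or hyphens, so such URLs are cut short.
partial = re.search(URL, 'http://example.com/page2').group(1)
print(partial)
```

Because the character class excludes the closing parenthesis, the bracket after each address stays outside the link. The same restriction means that URLs containing digits, hyphens, or query strings are matched only in part.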
Rendered result
Welcome to World Wide Spam, Inc.
These are the corporate web pages of World Wide Spam, Inc. We hope you find your stay enjoyable, and that you will sample many of our products.
A short history of the company
World Wide Spam was started in the summer of 2000. The business concept was to ride the dot-com wave and to make money both through bulk email and by selling canned meat online.
After receiving several complaints from customers who weren't satisfied by their bulk email, World Wide Spam altered their profile, and focused 100% on canned goods. Today, they rank as the world's 13,892nd online supplier of SPAM.
Destinations
From this page you may visit several of our interesting web pages:
- What is SPAM? (http://wwspam.fu/whatisspam)
- How do they make it? (http://wwspam.fu/howtomakeit)
- Why should I eat it? (http://wwspam.fu/whyeatit)
How to get in touch with us
You can get in touch with us in many ways: By phone (555-1234), by email (wwspam@wwspam.fu) or by visiting our customer feedback page (http://wwspam.fu/feedback).