Designing an HTML-Based C Syntax Highlighting Program in Python

Article link: https://blog.youkuaiyun.com/gashero/article/details/585829

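Academic year 2005-2006, Semester 1

Principles of Compilers

Course Design Report

Class: 02 Computer Science (2)

Student No.: 19

Name: 刘晓明 (Liu Xiaoming)

Grade:

Supervisor: 卢朝辉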

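I. Design Objectives

To deepen my understanding of compiler principles, strengthen hands-on practice and program development skills, and improve my ability to analyze and solve problems.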

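II. Design Tasks

1. Token recognition

C constants

C identifiers

2. Program text processing

Convert all letters in C comments to uppercase

Convert all C reserved words to uppercase

3. Recursive descent analysis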

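III. Design Process

1. Overall design

The program reads a C source file, performs lexical analysis on it, and writes an HTML file in which the tokens are highlighted for display. It also writes a token table. The generated HTML file is named out.html and the token table file is named token.txt.

To run it, change to the dist folder and run main *.c, replacing * with the name of a C source file. The extension is normally .c, but other extensions also work. To start with the default settings, double-click run.bat in the dist directory; by default it analyzes sample.c.

The program consists of three modules: the HTML module handles the details of generating the HTML file; the wordfix module performs the lexical analysis; the main module provides file I/O and overall program control.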

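2. HTML.py

Implements the details of HTML file generation. It contains the following functions:

writehead()

Writes the HTML file header.

writeline(line)

Writes some data to the HTML file followed by two kinds of line break: one for the HTML source and one for the rendered HTML.

writeident(line)

Writes an identifier to the HTML file.

writekeyword(line)

Writes a keyword to the HTML file.

writecomment(line)

Writes a comment or a preprocessor line to the HTML file.

writeconst(line)

Writes constants such as numbers and strings to the HTML file.

writeoper(line)

Writes operators and delimiters to the HTML file.

writetail()

Writes the end of the HTML file and closes it.

fixmark(instr)

Browsers cannot display certain special characters directly, so they must be converted to other strings in the HTML file beforehand; fixmark performs this conversion. The characters converted include several kinds of whitespace as well as &, ", >, and <.

Finally, the HTML module provides a main routine for unit testing. The module has a single member, outfile, which stores the output HTML file handle globally.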
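To illustrate the kind of conversion fixmark performs, here is a minimal sketch. The name escape_html and the entity chosen for each character are assumptions made for illustration; the mapping in the original module may differ, especially for whitespace.

```python
# Hypothetical helper illustrating the kind of conversion fixmark() performs.
# The entity table is an assumption, not necessarily the author's mapping.
ENTITIES = {
    '&': '&amp;',
    '"': '&quot;',
    '<': '&lt;',
    '>': '&gt;',
    ' ': '&nbsp;',
    '\t': '&nbsp;' * 4,   # a tab shown as four non-breaking spaces
}

def escape_html(text):
    """Replace characters a browser would misinterpret or collapse."""
    # Working character by character avoids escaping the '&' of an
    # entity that was produced a moment earlier.
    return ''.join([ENTITIES.get(c, c) for c in text])
```

Converting one character at a time, rather than chaining string replacements, means the order of the table entries does not matter.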

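3. main.py

Provides program startup and file I/O. It contains the following functions:

openfile(filename)

Opens a file with error handling. It returns the file handle on success and False on failure.

showfile(filename)

Tests that a file can be opened and displays its contents.

The main routine opens the files, sets the global variables of each module, and starts the lexical analyzer.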

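4. wordfix.py

The lexical analysis module. It has the following members:

infile

The source file to read, written in a subset of C.

outfile

The file handle of the output HTML file; the module does not use it directly.

tokenfile

The file handle used to write the tokens.

outlines

The source program after case conversion, as a list of strings.

htmllines

The converted HTML output, as a list of strings.

identlist

The list of identifiers, as a list of strings.

digitlist

The list of numeric constants, as a list of strings.

stringlist

The list of strings, as a list of strings; it stores the C string literals read from the program.

keywords

The list of reserved words of C, plus a few from C++.

The module contains the following functions:

IsDigit(d)

Tests whether the input character is a digit and returns True or False.

IsChar(c)

Tests whether the input character is an English letter, uppercase or lowercase.

IsBlank(b)

Tests whether the input character is whitespace: a space, tab, line feed, or carriage return.

IsKeyword(word)

Tests whether the input word is a reserved word. If it is, the function returns its position in the list; otherwise it returns False.

wordfix()

The lexical analysis function, written according to the state transition diagram in the documentation. Because the program has to output the source with its formatting and comments preserved, the scanner does not read a buffer at a time; it reads one line at a time, then recognizes and processes that line.

The end of the module also provides a main routine for unit testing.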
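A function that returns either a list position or False, as IsKeyword is described above, has one pitfall: position 0 compares equal to False in Python. The sketch below uses a shortened keyword list and the hypothetical names lookup and is_keyword to show the pitfall and one way around it.

```python
KEYWORDS = ['auto', 'break', 'case']   # shortened list, for illustration only

def lookup(word):
    """Return the index of word in KEYWORDS, or False if it is absent."""
    try:
        return KEYWORDS.index(word)
    except ValueError:
        return False

def is_keyword(word):
    """Identity test, so that index 0 is not mistaken for False."""
    return lookup(word) is not False
```

With this list, lookup('auto') returns 0, and 0 == False is true, so a caller that compares the result with == False would classify 'auto' as an ordinary identifier. Testing with "is False" or "is not False" avoids that.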

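5. Token recognition process

After each line is read, its leading whitespace is stripped and written to the HTML document. The scanner then checks whether anything remains; if not, the line is blank and it moves on to the next line. If there is content, several cases are distinguished. If the first character is an English letter or an underscore, identifier recognition begins. The following characters of an identifier may be letters, underscores, or digits, and recognition stops at the first character of any other kind. The identifier is then looked up in the reserved word table to decide whether it is a reserved word, and the two cases are handled separately.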
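The identifier rule above can be sketched as follows. The function and helper names, the tuple returned, and the shortened keyword set are assumptions made for illustration; they are not taken from the original module.

```python
KEYWORDS = set(['int', 'return', 'while'])   # shortened, for illustration

def is_ident_start(c):
    """A letter or an underscore may start an identifier."""
    return c == '_' or ('a' <= c <= 'z') or ('A' <= c <= 'Z')

def is_ident_part(c):
    """Later characters may also be digits."""
    return is_ident_start(c) or ('0' <= c <= '9')

def scan_identifier(line, start):
    """Scan an identifier beginning at line[start].

    Returns (kind, word, next_pos), where kind is 'KEY' or 'ID'.
    Assumes line[start] is a letter or an underscore.
    """
    pos = start + 1
    while pos < len(line) and is_ident_part(line[pos]):
        pos += 1
    word = line[start:pos]
    if word in KEYWORDS:
        kind = 'KEY'
    else:
        kind = 'ID'
    return kind, word, pos
```

For example, scanning 'int x1;' from position 0 yields the keyword 'int' and stops at position 3; scanning again from position 4 yields the identifier 'x1'.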

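If the first character is a digit, number recognition begins; the following characters may be digits or a decimal point. The recognized number is stored in the numeric constant list.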
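A minimal sketch of this number rule, with the hypothetical name scan_number. It follows the rule exactly as stated, so it does not check that at most one decimal point appears.

```python
def scan_number(line, start):
    """Scan a run of digits and decimal points beginning at line[start].

    Returns (text, next_pos). Because any mix of digits and '.' is
    accepted, a malformed run such as '1.2.3' comes back as one token.
    """
    pos = start
    while pos < len(line) and ('0' <= line[pos] <= '9' or line[pos] == '.'):
        pos += 1
    return line[start:pos], pos
```

A stricter scanner would allow one decimal point only, and would also need rules for suffixes, exponents, and hexadecimal constants, none of which the rule above covers.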

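If the next characters are '//' or '#', the rest of the line is recognized as a single-line comment, and it is displayed as a single-line comment in the HTML file. The two are treated differently during case conversion: '//' comments are converted to uppercase, while preprocessor lines starting with '#' are left unchanged. A single-line comment ends at the line break.

If the next characters are '/*', a multi-line comment begins and the program enters the multi-line comment state. It then searches the rest of the same line for the string '*/'; if it is found, the program leaves the multi-line comment state. The comment is then written out. Each time a new line is read, the program checks whether it is still in the multi-line comment state. If so, it searches the line for '*/': if found, it leaves the multi-line comment state; if not, it stays in that state. In both cases the line is written to the HTML file as a comment.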
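The multi-line comment state can be sketched as a flag carried from one line to the next. The function name and the list-of-flags return value are assumptions made for illustration. As in the description, each line is treated as a whole, so code before '/*' or after '*/' on the same line is not separated out, and a '/*' inside a string literal is not detected.

```python
def mark_comment_lines(lines):
    """Return one boolean per line: True if the line is emitted as
    (part of) a /* ... */ comment."""
    in_comment = False
    flags = []
    for line in lines:
        if in_comment:
            flags.append(True)
            if '*/' in line:
                in_comment = False      # terminator found, leave the state
        else:
            opener = line.find('/*')
            if opener == -1:
                flags.append(False)
            else:
                flags.append(True)
                # Search after the opener, so that '/*/' does not
                # count as an opened and closed comment.
                in_comment = line.find('*/', opener + 2) == -1
    return flags
```

Starting the search two characters past the opener matters for input such as '/*/', where the '*' would otherwise be counted both as part of the opener and as part of the terminator.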

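If a '"' (double quote) is read, the scanner enters the string recognition state; no character inside the string is recognized as any other lexical symbol. A string may not contain a direct line break.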
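A sketch of the string rule, with the hypothetical name scan_string. It looks for the next double quote on the same line and does not handle escaped quotes, which is an assumption consistent with the simple rule described above.

```python
def scan_string(line, start):
    """Scan a double-quoted string beginning at line[start].

    Returns (text, next_pos, closed). If the line has no closing
    quote, the rest of the line is returned and closed is False.
    Escaped quotes inside the string are not handled in this sketch.
    """
    end = line.find('"', start + 1)
    if end == -1:
        return line[start:], len(line), False
    return line[start:end + 1], end + 1, True
```

The third element of the result lets the caller tell a complete literal from an unterminated one, which corresponds to the rule that a string may not continue onto the next line.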

Next come the two-character operators, covering 12 of the more common ones.

Then come the single-character operators, covering 22 common ones.
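Checking the two-character operators before the single-character ones is what makes '+=' come out as one token rather than '+' followed by '='. A minimal sketch, in which the operator sets are the ones used in the wordfix listing and the function name and return convention are assumptions:

```python
TWO_CHAR_OPS = ['++', '--', '==', '!=', '<<', '>>',
                '+=', '-=', '*=', '/=', '&&', '||']
ONE_CHAR_OPS = '+-()[]*/,={};&%~|^?:<>'

def scan_operator(line, start):
    """Return (operator, next_pos), or (None, start) if nothing matches."""
    pair = line[start:start + 2]
    if pair in TWO_CHAR_OPS:            # longest match first
        return pair, start + 2
    ch = line[start:start + 1]
    if ch and ch in ONE_CHAR_OPS:
        return ch, start + 1
    return None, start
```

Operators outside these two sets, such as '<=', '->' or '.', are not matched as a unit by this sketch.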

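Last comes the fallback case: every remaining symbol is treated as an "other" symbol and is still written to the output source text.

The exception handling in the program is mainly there to deal with indexing past the end of the line string. This happens when the end of a line is reached unexpectedly. The end of line has to be handled differently depending on the state the scanner was in.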
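The difference between indexing and slicing is central to this kind of end-of-line handling: indexing past the end of a string raises IndexError, while slicing returns an empty string. A small sketch with hypothetical function names:

```python
def char_at(line, pos):
    """Indexing: raises IndexError when pos is past the end of line."""
    return line[pos]

def slice_at(line, pos):
    """Slicing: returns '' instead of raising when pos is past the end."""
    return line[pos:pos + 1]

def read_chars(line):
    """Collect characters until indexing signals the end of the line."""
    chars = []
    pos = 0
    while True:
        try:
            chars.append(char_at(line, pos))
        except IndexError:
            break                       # end of line reached
        pos += 1
    return chars
```

A scanner that mixes the two styles has to check for the empty string after every slice, because only the indexing form triggers the exception handler.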

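IV. Reflections

In this course design I put into practice the compiler principles I had just learned, together with Python, a language I had been using for less than 20 days, and I came away with a few observations. First, the material in the compiler course is very practical: it gave concrete guidance throughout the programming, down to the choice of variables. Many of the tools taught in the course, such as state transition diagrams and finite automata, are very effective and greatly reduce the difficulty of designing a compiler. The lexical analyzer example in the textbook was also a great help.

As for Python, I only started learning it on the 25th of last month, and it left a deep impression on me: it is a language that is comfortable to program in. Writing the lexical analyzer in Python was also the only one of this semester's many programming exercises that I finished ahead of schedule. In 11 hours of programming I wrote and debugged about 460 lines of code. That time also includes preparing the C sample, handling the HTML file, configuring the colors, and so on.

Apart from Windows, which I used because time was short, all the software tools used in this course design were open source: Python 2.4.2, vim 6.3, Notepad++ 3.4, gcc 3.3.3, and grep. This has given me confidence that I can work entirely with open source software.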

#setup.py
from distutils.core import setup
import py2exe

setup(console=["main.py","HTML.py","wordfix.py"])

# run : python setup.py py2exe

# main.py
# -*- coding: gb2312 -*-

##### imports #####

import sys
import os
import HTML
import wordfix

########## functions ##########
def openfile(filename):
    '''return a file handle to read

    returns False if the file cannot be opened'''
    if not os.path.isfile(filename):
        return False
    try:
        f=open(filename,'r')
    except IOError,detail:
        # report anything other than "[Errno 2] No such file or directory"
        if str(detail)[1:8]!='Errno 2':
            print detail
        return False
    else:
        return f

def showfile(filename):
    'test a text file if it can show out'
    f=openfile(filename)
    if f is False:
        print 'File Not Found!'
        sys.exit()
    while True:
        line=f.readline()
        if line=='':
            break
        # strip the trailing newline before printing
        if line[-1]=='\n':
            line=line[0:len(line)-1]
        print line
    f.close()

##### main #####

if __name__=='__main__':
    print 'main()'
    #print 'WordFix v 0.1'
    #print 'Copyright@1999-2006, Harry Gashero Liu.'
    if len(sys.argv)<2:
        print 'not enough params, exit!'
        sys.exit()
    f=openfile(sys.argv[1])
    if f is False:
        # stop here: the scanner cannot run without an input file
        print 'open file failed'
        sys.exit()
    wordfix.infile=f
    # open the output only once the input is known to be readable
    outfile=open('out.html','w')
    HTML.outfile=outfile
    HTML.writehead()
    wordfix.outfile=outfile
    tokenfile=open('token.txt','w')
    wordfix.tokenfile=tokenfile
    wordfix.wordfix()
    HTML.writetail()
    tokenfile.close()
    f.close()
    print 'end of program'
    print wordfix.identlist
    print wordfix.digitlist
    print wordfix.stringlist
    print wordfix.outlines
    #fff=open('othertxt.txt','w')
    #for x in wordfix.outlines:
    #    fff.write(x+'\n')
    #fff.close()
    #fff=open('list.txt','w')
    #for x in wordfix.identlist:
    #    fff.write(x+'\n')
    #fff.close()

# HTML.py
# -*- coding: gb2312 -*-

# NOTE: the HTML markup inside the string literals of this listing was
# lost when it was published. The <font> tags, the colour names (taken
# from the docstrings), '<br>' and the entities in fixmark() are
# reconstructions and should be read as assumptions.

# the output file handle
outfile=''

def writehead():
    "write a HTML file's header"
    outfile.write('<html><head>\n')
    outfile.write('<title>word fix result</title>\n')
    outfile.write('</head>\n<body bgcolor="#E0E8FF">\n')

def writeline(line):
    "write a HTML section to file"
    # '<br>' breaks the rendered line, '\n' breaks the HTML source line
    outfile.write(fixmark(line)+'<br>\n')

def writeident(line):
    "write ident in gray"
    outfile.write('<font color="gray">'+fixmark(line)+'</font>')

def writekeyword(line):
    "write keyword in green"
    outfile.write('<font color="green">'+fixmark(line)+'</font>')

def writecomment(line):
    "write comment in light blue"
    outfile.write('<font color="lightblue">'+fixmark(line)+'</font>')

def writeconst(line):
    "write const in red"
    outfile.write('<font color="red">'+fixmark(line)+'</font>')

def writeoper(line):
    "write operator in yellow"
    outfile.write('<font color="yellow">'+fixmark(line)+'</font>')

def writetail():
    "write a HTML file's tail"
    outfile.write('</body>\n</html>\n')
    outfile.close()

def fixmark(instr):
    'fix space to html space'
    newc=''
    for c in instr:
        if c==' ':
            newc+='&nbsp;'
        elif c=='\t':
            newc+='&nbsp;&nbsp;&nbsp;&nbsp;'
        elif c=='&':
            newc+='&amp;'
        elif c=='"':
            newc+='&quot;'
        elif c=='>':
            newc+='&gt;'
        elif c=='<':
            newc+='&lt;'
        elif c=='\n':
            newc+='<br>'
        else:
            newc+=c
    return newc

# unit testing
if __name__=='__main__':
    f=open('test.html','w')
    outfile=f
    ##########
    writehead()
    writeident('python ')
    writekeyword('int void shit ')
    writeline('')
    writecomment('a comment then')
    writeconst('12345')
    writeoper('** +++ -- new delete')
    # writetail() also closes the file
    writetail()

# wordfix.py
# -*- coding: gb2312 -*-

import HTML

##### global variables #####
infile=''      # input file: the source program to read, an open file handle
outfile=''     # output file: the HTML output, an open file handle
tokenfile=''   # token output file of the lexical analysis, an open file handle
outlines=[]    # output lines: the modified source program
htmllines=[]   # output HTML lines
identlist=[]   # table of identifiers found
digitlist=[]   # table of numeric constants found
stringlist=[]  # table of string literals found
keywords=['auto','break','case','char','continue','default',
    'do','double','else','entry','enum','extern','for',
    'float','goto','if','int','long','new','NULL','register',
    'return','short','signed','sizeof','static','struct',
    'switch','typedef','union','unsigned','void','while']

def IsDigit(d):
    "test whether the input character is a digit"
    return d in ['0','1','2','3','4','5','6','7','8','9']

def IsChar(c):
    "test whether the input character is an English letter, upper or lower case"
    # chained comparisons are also safe for the empty string, which the
    # scanner passes in at the end of a line (ord('') would raise TypeError)
    return ('a'<=c<='z') or ('A'<=c<='Z')

def IsBlank(b):
    "test whether the input character is a space, tab, line feed or carriage return"
    return b in [' ','\t','\n','\r']

def IsKeyword(word):
    "return the position of word in keywords, or False if it is not a keyword"
    # position 0 ('auto') compares equal to False, so callers must test
    # the result with 'is False' rather than '==False'
    try:
        return keywords.index(word)
    except ValueError:
        return False

def wordfix():
    'word fix: scan infile line by line, write the HTML and the token table'
    newline=''
    ch=''
    start=0
    nowpos=0
    word=''
    state=''
    #initial
    HTML.outfile=outfile
    while True:
        line=infile.readline()
        if line=='':
            break
        if line[-1]=='\n':
            line=line[0:len(line)-1]
        # start process
        newline=''
        start=0
        nowpos=0
        word=''
        print 'LINE: ',line
        if state=='multicomment':
            # still inside a multi-line comment
            newline=line
            if line.find('*/')!=-1:
                # terminator found: leave the multi-line comment state
                state=''
            outlines.append(newline.upper())
            HTML.writecomment(newline)
            HTML.writeline('')
            continue
        else:
            state=''
        while True:
            try:
                start=nowpos
                nowpos+=1
                ch=line[start]  # raises IndexError at the end of the line
                #print 'doing char : ',ch,' :: ',ord(ch),'nowpos=',nowpos
                while IsBlank(ch):
                    # skip (and echo) all whitespace
                    state='blank'
                    newline+=ch
                    HTML.writecomment(ch)
                    start+=1
                    nowpos+=1
                    ch=line[start:nowpos]
                if ch=='':
                    # only whitespace was left on the line
                    outlines.append(newline)
                    HTML.writeline('')
                    break
                if IsChar(ch) or ch=='_':
                    # identifier or keyword
                    state='ident'
                    nowpos+=1
                    ch=line[nowpos-1:nowpos]
                    while IsChar(ch) or IsDigit(ch) or ch=='_':
                        nowpos+=1
                        ch=line[nowpos-1:nowpos]
                    nowpos-=1
                    word=line[start:nowpos]
                    if IsKeyword(word) is False:
                        # identifier
                        identlist.append(word)
                        newline+=word
                        tokenfile.write('ID\t\t'+word+'\n')
                        HTML.writeident(word)
                    else:
                        # keyword
                        newline+=word.upper()
                        tokenfile.write('KEY\t\t'+word+'\n')
                        HTML.writekeyword(word)
                    start=nowpos
                    continue
                if IsDigit(ch):
                    # numeric constant
                    state='digit'
                    nowpos+=1
                    ch=line[nowpos-1:nowpos]
                    while IsDigit(ch) or ch=='.':
                        nowpos+=1
                        ch=line[nowpos-1:nowpos]
                    nowpos-=1
                    word=line[start:nowpos]
                    digitlist.append(word)
                    newline+=word
                    tokenfile.write('DIGIT\t\t'+word+'\n')
                    HTML.writeconst(word)
                    start=nowpos
                    continue
                elif line[start:start+2]=='//' or ch=='#':
                    # single-line comment; preprocessor lines are treated the same
                    state='singlecomment'
                    print 'a single comment'
                    word=line[start:]
                    if ch!='#':
                        newline+=word.upper()
                    else:
                        # preprocessor lines keep their case
                        newline+=word
                    HTML.writecomment(word)
                    HTML.writeline('')
                    outlines.append(newline)
                    break
                elif line[start:start+2]=='/*':
                    # multi-line comment
                    state='multicomment'
                    print 'go into multi comment'
                    # search after the opener so that '/*/' is not taken as closed
                    if line.find('*/',start+2)!=-1:
                        # terminator on the same line: leave the state
                        state=''
                    # either way the rest of the line is written as a comment
                    word=line[start:]
                    newline+=word.upper()
                    HTML.writecomment(word)
                    HTML.writeline('')
                    outlines.append(newline)
                    break
                elif ch=='"':
                    # string literal (escaped quotes are not handled)
                    state='string'
                    end=line.find('"',start+1)
                    if end!=-1:
                        state=''
                        nowpos=end+1
                        word=line[start:nowpos]
                        newline+=word
                        HTML.writeconst(word)
                        stringlist.append(word)
                        start=nowpos
                        continue
                    else:
                        # no closing quote on this line: an error, written out as-is
                        state=''
                        word=line[start:]
                        newline+=word
                        HTML.writeconst(word)
                        HTML.writeline('')
                        outlines.append(newline)
                        break
                elif line[start:start+2] in ['++','--','==','!=','<<','>>',
                        '+=','-=','*=','/=','&&','||']:
                    # two-character operators
                    word=line[start:start+2]
                    newline+=word
                    nowpos+=1
                    HTML.writeoper(word)
                    tokenfile.write('OPER\t\t'+word+'\n')
                    continue
                elif ch in '+-()[]*/,={};&%~|^?:<>':
                    # single-character operators and delimiters
                    state='signal'
                    word=ch
                    newline+=word
                    HTML.writeoper(word)
                    tokenfile.write('OPER\t\t'+word+'\n')
                    continue
                else:
                    # any other symbol; note that it is not written to the HTML file
                    state='other sign'
                    newline+=ch
                    #print 'doing char : ',ch,' :: ',ord(ch),' in else'
                    tokenfile.write('sign\t\t'+ch+'\n')
                    continue
            except IndexError:
                # reached the end of the line. The token scans above use
                # slices, which never raise, so a pending token has normally
                # been written already and word is empty here.
                if state=='blank':
                    # end of line while skipping whitespace
                    outlines.append(newline)
                    HTML.writeline('')
                    break
                elif state=='ident':
                    # end of line while scanning an identifier
                    word=line[start:]
                    if word!='':
                        if IsKeyword(word) is False:
                            # identifier
                            identlist.append(word)
                            newline+=word
                            tokenfile.write('ID\t\t'+word+'\n')
                            HTML.writeident(word)
                        else:
                            # keyword
                            newline+=word.upper()
                            tokenfile.write('KEY\t\t'+word+'\n')
                            HTML.writekeyword(word)
                    outlines.append(newline)
                    HTML.writeline('')
                    break
                elif state=='singlecomment':
                    print 'singlecomment here'
                    break
                elif state=='digit':
                    # end of line while scanning a number
                    word=line[start:]
                    if word!='':
                        newline+=word
                        digitlist.append(word)
                        tokenfile.write('DIGIT\t\t'+word+'\n')
                        HTML.writeconst(word)
                    outlines.append(newline)
                    HTML.writeline('')
                    break
                else:
                    HTML.writeline('')
                    outlines.append(newline)
                    break

########## main() to unit testing ##########
if __name__=="__main__" :
    if IsDigit('4'):
        print 'digit 4'
    if IsChar('c'):
        print 'char c'
    if IsBlank(' '):
        print 'blank'
    if IsKeyword('int'):
        print 'keyword int: ',IsKeyword('int')
    print 'end'

##################################################