Lexical Analysis with Pygments


Pygments is a Python library for formatting and syntax-highlighting source code. It has four components:

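  1. lexer - parses source code into a stream of tokens. Each token has two attributes: its type and its value. Pygments supports many languages; see Appendix A for the list.
  2. filters - the token stream produced by the lexer is passed through filters, which can modify the type or value of selected tokens according to some condition.
  3. formatter - the token stream is finally written out in whatever format the formatter defines, for example HTML, LaTeX or RTF.
  4. style - defines the look of the output, for example which colour each token type is shown in and whether it is bold.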
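
As a rough sketch of how the four components fit together, the snippet below sends a one-line Java fragment through a lexer, a filter, and a formatter that uses a named style. The sample source, the built-in `keywordcase` filter and the `default` style are choices made for this illustration only, and the calls are those of a current Pygments release rather than 0.8.1.

```python
from pygments import highlight
from pygments.lexers import get_lexer_by_name
from pygments.formatters import HtmlFormatter

code = 'public class Demo {}'   # sample input, made up for this sketch

# 1. lexer: turns the source text into (token type, value) pairs
lexer = get_lexer_by_name('java')

# 2. filter: rewrites tokens on the way through; the built-in
#    'keywordcase' filter upper-cases every keyword token
lexer.add_filter('keywordcase', case='upper')

# 3. formatter + 4. style: render the token stream as HTML, coloured
#    according to the 'default' style
formatter = HtmlFormatter(style='default')

html = highlight(code, lexer, formatter)
print(html)
```

Because the filter sits between the lexer and the formatter, the keywords appear upper-cased in the generated HTML while the class name is left alone.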

At the time of writing, the latest Pygments release is 0.8.1.
http://pygments.org/

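I came to Pygments not for syntax highlighting but because I needed to find every hard-coded string in a project's Java source code and replace it with another value, ignoring any strings that appear inside comments. Only the lexer component is needed for that: find the tokens whose type is Token.Literal.String and replace each one with the required value.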
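
The replacement idea can be sketched as follows. This is not the original project's code: `replace_strings`, `is_string_literal` and the sample Java text are hypothetical names and data introduced here, and the `stripnl`/`ensurenl` lexer options assume a reasonably recent Pygments release.

```python
from itertools import groupby

from pygments.lexers import get_lexer_by_name
from pygments.token import String


def is_string_literal(token):
    """True for string tokens, but not for character literals such as 'a'."""
    ttype, _ = token
    return ttype in String and ttype not in String.Char


def replace_strings(source, replace):
    """Return `source` with every Java string literal passed through `replace`."""
    # stripnl/ensurenl are switched off so that the token values
    # concatenate back to exactly the original text.
    lexer = get_lexer_by_name('java', stripnl=False, ensurenl=False)
    out = []
    # Some Pygments versions emit the quotes and the body of a literal as
    # separate String tokens, so adjacent String tokens are merged first.
    for is_str, group in groupby(lexer.get_tokens(source), key=is_string_literal):
        text = ''.join(value for _, value in group)
        out.append(replace(text) if is_str else text)
    return ''.join(out)


java = 'class A {\n    String s = "hello"; // "not me"\n}\n'
print(replace_strings(java, lambda literal: '"REPLACED"'))
```

The callback receives the literal including its quotes. The quoted text inside the comment is left alone because the lexer reports it as part of a Comment token rather than a String token.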

The table below lists the supported token types (short alias on the left, full name on the right):
Text        Token.Text
Whitespace  Token.Text.Whitespace
Error       Token.Error
Other       Token.Other 
Keyword     Token.Keyword 
Name        Token.Name 
Literal     Token.Literal 
String      Token.Literal.String
Number      Token.Literal.Number
Operator    Token.Operator
Punctuation Token.Punctuation
Comment     Token.Comment
Generic     Token.Generic
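
Token types are arranged in a tree, which matters when deciding whether a token "is a string". The short sketch below pokes at that tree using names from `pygments.token`. It is only an illustration of how the aliases in the table relate to the full names.

```python
from pygments.token import Token, String, Comment, string_to_tokentype

# The short names in the left column are aliases for the full paths.
print(String is Token.Literal.String)        # True

# Token types form a tree; "in" tests whether a type sits under another.
print(String.Double in String)               # True: a subtype of String
print(String in Token.Literal)               # True
print(Comment.Single in String)              # False

# Every type knows its parent, and a dotted name can be turned back
# into a token type.
print(String.parent)                         # Token.Literal
print(string_to_tokentype('Literal.String') is String)   # True
```

Testing with `in` rather than `is` also catches the more specific subtypes (for example String.Double) that some lexers emit.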

Approach 1:
Iterate over all tokens with the get_tokens() function.

# author: A.TNG
# date: 1/26/2008

from pygments.lexers import get_lexer_by_name

SEPARATOR = "-------------------------------"

path = "input.java"
lexer = get_lexer_by_name('java')

# Read the whole Java source file.
with open(path, 'r') as handle:
    text = handle.read()

# Print every (token type, value) pair and write the same lines to output.txt.
with open("output.txt", 'w') as handle:
    for ttype, val in lexer.get_tokens(text):
        print(ttype, val)
        handle.write(str(ttype) + '   ' + val + '\n')
        print(SEPARATOR)
        handle.write(SEPARATOR + '\n')

 

Approach 2:
Format the code with RawTokenFormatter; the result is the collection of tokens. The principle is essentially the same as in approach 1.

# author: A.TNG
# date: 1/26/2008

from pygments import highlight
from pygments.lexers import get_lexer_by_name
from pygments.formatters import RawTokenFormatter

def code2token(code, lang='java'):
    # stripall=True removes leading and trailing whitespace from the input.
    lexer = get_lexer_by_name(lang, encoding='utf-8', stripall=True)
    formatter = RawTokenFormatter()
    # On Python 3 with a current Pygments, the raw token dump is bytes, not str.
    return highlight(code, lexer, formatter)

with open("input.java", 'r') as handle:
    data = handle.read()

ret = code2token(data)

# Binary mode, because the raw token dump is bytes.
with open("output.java", 'wb') as handle:
    handle.write(ret)
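
The raw dump stores one token per line, as the token type, a tab, and the Python repr of the value. The sketch below reads it back into pairs. `parse_raw_tokens` is a hypothetical helper written for this note, and the line layout is assumed to be that of a current Pygments release.

```python
import ast

from pygments import highlight
from pygments.lexers import get_lexer_by_name
from pygments.formatters import RawTokenFormatter


def parse_raw_tokens(raw):
    """Split RawTokenFormatter output into (type name, value) pairs."""
    if isinstance(raw, bytes):
        raw = raw.decode('ascii')
    pairs = []
    for line in raw.splitlines():
        if not line:
            continue
        type_name, value_repr = line.split('\t', 1)
        # The value is stored as a Python string literal, so
        # ast.literal_eval recovers the original text safely.
        pairs.append((type_name, ast.literal_eval(value_repr)))
    return pairs


raw = highlight('int x = 1;', get_lexer_by_name('java'), RawTokenFormatter())
for type_name, value in parse_raw_tokens(raw):
    print(type_name, repr(value))
```

Joining the values of all pairs gives back the lexed text, which is a quick way to check that nothing was lost on the way through the formatter.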

 

****************************************

Appendix A
1. Diff (Lexer)
    Short names: diff
    Filename extensions: *.diff, *.patch
    Mimetypes: text/x-diff, text/x-patch
2. Delphi (Lexer)
    Short names: delphi, pas, pascal, objectpascal
    Filename extensions: *.pas
    Mimetypes: text/x-pascal
3. JavaScript+Mako (Lexer)
    Short names: js+mako, javascript+mako
    Filename extensions:
    Mimetypes: application/x-javascript+mako, text/x-javascript+mako, text/javascript+mako
4. Myghty (Lexer)
    Short names: myghty
    Filename extensions: *.myt, autodelegate
    Mimetypes: application/x-myghty
5. HTML+Genshi (Lexer)
    Short names: html+genshi, html+kid
    Filename extensions:
    Mimetypes: text/html+genshi
6. Raw token data (Lexer)
    Short names: raw
    Filename extensions: *.raw
    Mimetypes: application/x-pygments-tokens
7. DylanLexer (Lexer)
    Short names: dylan
    Filename extensions: *.dylan
    Mimetypes: text/x-dylan
8. Brainfuck (Lexer)
    Short names: brainfuck, bf
    Filename extensions: *.bf, *.b
    Mimetypes: application/x-brainfuck
9. MoinMoin/Trac Wiki markup (Lexer)
    Short names: trac-wiki, moin
    Filename extensions:
    Mimetypes: text/x-trac-wiki
10. reStructuredText (Lexer)
    Short names: rst, rest, restructuredtext
    Filename extensions: *.rst, *.rest
    Mimetypes: text/x-rst
11. C (Lexer)
    Short names: c
    Filename extensions: *.c, *.h
    Mimetypes: text/x-chdr, text/x-csrc
12. HTML (Lexer)
    Short names: html
    Filename extensions: *.html, *.htm, *.xhtml
    Mimetypes: text/html, application/xhtml+xml
13. INI (Lexer)
    Short names: ini, cfg
    Filename extensions: *.ini, *.cfg
    Mimetypes: text/x-ini
14. Genshi (Lexer)
    Short names: genshi, kid, xml+genshi, xml+kid
    Filename extensions: *.kid
    Mimetypes: application/x-genshi, application/x-kid
15. CSS+Mako (Lexer)
    Short names: css+mako
    Filename extensions:
    Mimetypes: text/css+mako
16. Bash (Lexer)
    Short names: bash, sh
    Filename extensions: *.sh
    Mimetypes: application/x-sh, application/x-shellscript
17. Mako (Lexer)
    Short names: mako
    Filename extensions: *.mao
    Mimetypes: application/x-mako
18. HTML+PHP (Lexer)
    Short names: html+php
    Filename extensions: *.phtml
    Mimetypes: application/x-php, application/x-httpd-php, application/x-httpd-php3, application/x-httpd-php4, application/x-httpd-php5
19. HTML+Django/Jinja (Lexer)
    Short names: html+django, html+jinja
    Filename extensions:
    Mimetypes: text/html+django, text/html+jinja
20. CSS+PHP (Lexer)
    Short names: css+php
    Filename extensions:
    Mimetypes: text/css+php
21. Lua (Lexer)
    Short names: lua
    Filename extensions: *.lua
    Mimetypes: text/x-lua, application/x-lua
22. VimL (Lexer)
    Short names: vim
    Filename extensions: *.vim, .vimrc
    Mimetypes: text/x-vim
23. CSS+Genshi Text (Lexer)
    Short names: css+genshitext, css+genshi
    Filename extensions:
    Mimetypes: text/css+genshi
24. OCaml (Lexer)
    Short names: ocaml
    Filename extensions: *.ml, *.mli
    Mimetypes: text/x-ocaml
25. CSS+Myghty (Lexer)
    Short names: css+myghty
    Filename extensions:
    Mimetypes: text/css+myghty
26. C# (Lexer)
    Short names: csharp, c#
    Filename extensions: *.cs
    Mimetypes: text/x-csharp
27. IRC logs (Lexer)
    Short names: irc
    Filename extensions:
    Mimetypes: text/x-irclog
28. Text only (Lexer)
    Short names: text
    Filename extensions: *.txt
    Mimetypes: text/plain
29. Smarty (Lexer)
    Short names: smarty
    Filename extensions: *.tpl
    Mimetypes: application/x-smarty
30. Haskell (Lexer)
    Short names: haskell
    Filename extensions: *.hs
    Mimetypes:
31. Python (Lexer)
    Short names: python, py
    Filename extensions: *.py, *.pyw
    Mimetypes: text/x-python, application/x-python
32. CSS+Django/Jinja (Lexer)
    Short names: css+django, css+jinja
    Filename extensions:
    Mimetypes: text/css+django, text/css+jinja
33. CSS+Smarty (Lexer)
    Short names: css+smarty
    Filename extensions:
    Mimetypes: text/css+smarty
34. Redcode (Lexer)
    Short names: redcode
    Filename extensions: *.cw
    Mimetypes:
35. JavaScript+Myghty (Lexer)
    Short names: js+myghty, javascript+myghty
    Filename extensions:
    Mimetypes: application/x-javascript+myghty, text/x-javascript+myghty, text/javascript+mygthy
36. Ruby irb session (Lexer)
    Short names: rbcon, irb
    Filename extensions:
    Mimetypes: text/x-ruby-shellsession
37. Ruby (Lexer)
    Short names: rb, ruby
    Filename extensions: *.rb, *.rbw, Rakefile, *.rake, *.gemspec, *.rbx
    Mimetypes: text/x-ruby, application/x-ruby
38. JavaScript+Ruby (Lexer)
    Short names: js+erb, javascript+erb, js+ruby, javascript+ruby
    Filename extensions:
    Mimetypes: application/x-javascript+ruby, text/x-javascript+ruby, text/javascript+ruby
39. Makefile (Lexer)
    Short names: make, makefile, mf
    Filename extensions: *.mak, Makefile, makefile
    Mimetypes: text/x-makefile
40. RHTML (Lexer)
    Short names: rhtml, html+erb, html+ruby
    Filename extensions: *.rhtml
    Mimetypes: text/html+ruby
41. Django/Jinja (Lexer)
    Short names: django, jinja
    Filename extensions:
    Mimetypes: application/x-django-templating, application/x-jinja
42. ApacheConf (Lexer)
    Short names: apacheconf, aconf, apache
    Filename extensions: .htaccess, apache.conf, apache2.conf
    Mimetypes: text/x-apacheconf
43. TeX (Lexer)
    Short names: tex, latex
    Filename extensions: *.tex, *.aux, *.toc
    Mimetypes: text/x-tex, text/x-latex
44. Genshi Text (Lexer)
    Short names: genshitext
    Filename extensions:
    Mimetypes: application/x-genshi-text, text/x-genshi
45. Java (Lexer)
    Short names: java
    Filename extensions: *.java
    Mimetypes: text/x-java
46. JavaScript+Genshi Text (Lexer)
    Short names: js+genshitext, js+genshi, javascript+genshitext, javascript+genshi
    Filename extensions:
    Mimetypes: application/x-javascript+genshi, text/x-javascript+genshi, text/javascript+genshi
47. Boo (Lexer)
    Short names: boo
    Filename extensions: *.boo
    Mimetypes: text/x-boo
48. XML+Ruby (Lexer)
    Short names: xml+erb, xml+ruby
    Filename extensions:
    Mimetypes: application/xml+ruby
49. Batchfile (Lexer)
    Short names: bat
    Filename extensions: *.bat, *.cmd
    Mimetypes: application/x-dos-batch
50. Python console session (Lexer)
    Short names: pycon
    Filename extensions:
    Mimetypes: text/x-python-doctest
51. HTML+Smarty (Lexer)
    Short names: html+smarty
    Filename extensions:
    Mimetypes: text/html+smarty
52. Objective-C (Lexer)
    Short names: objective-c, objectivec, obj-c, objc
    Filename extensions: *.m
    Mimetypes: text/x-objective-c
53. JavaScript (Lexer)
    Short names: js, javascript
    Filename extensions: *.js
    Mimetypes: application/x-javascript, text/x-javascript, text/javascript
54. D (Lexer)
    Short names: d
    Filename extensions: *.d, *.di
    Mimetypes: text/x-dsrc
55. JavaScript+Django/Jinja (Lexer)
    Short names: js+django, javascript+django, js+jinja, javascript+jinja
    Filename extensions:
    Mimetypes: application/x-javascript+django, application/x-javascript+jinja, text/x-javascript+django, text/x-javascript+jinja, text/javascript+django, text/javascript+jinja
56. Python Traceback (Lexer)
    Short names: pytb
    Filename extensions: *.pytb
    Mimetypes: text/x-python-traceback
57. VB.net (Lexer)
    Short names: vb.net, vbnet
    Filename extensions: *.vb, *.bas
    Mimetypes: text/x-vbnet, text/x-vba
58. BBCode (Lexer)
    Short names: bbcode
    Filename extensions:
    Mimetypes: text/x-bbcode
59. HTML+Myghty (Lexer)
    Short names: html+myghty
    Filename extensions:
    Mimetypes: text/html+myghty
60. PHP (Lexer)
    Short names: php, php3, php4, php5
    Filename extensions: *.php, *.php[345]
    Mimetypes: text/x-php
61. MiniD (Lexer)
    Short names: minid
    Filename extensions: *.md
    Mimetypes: text/x-minidsrc
62. XML+Smarty (Lexer)
    Short names: xml+smarty
    Filename extensions:
    Mimetypes: application/xml+smarty
63. CSS (Lexer)
    Short names: css
    Filename extensions: *.css
    Mimetypes: text/css
64. Scheme (Lexer)
    Short names: scheme
    Filename extensions: *.scm
    Mimetypes: text/x-scheme, application/x-scheme
65. MuPAD (Lexer)
    Short names: mupad
    Filename extensions: *.mu
    Mimetypes:
66. JavaScript+Smarty (Lexer)
    Short names: js+smarty, javascript+smarty
    Filename extensions:
    Mimetypes: application/x-javascript+smarty, text/x-javascript+smarty, text/javascript+smarty
67. JavaScript+PHP (Lexer)
    Short names: js+php, javascript+php
    Filename extensions:
    Mimetypes: application/x-javascript+php, text/x-javascript+php, text/javascript+php
68. Perl (Lexer)
    Short names: perl, pl
    Filename extensions: *.pl, *.pm
    Mimetypes: text/x-perl, application/x-perl
69. SQL (Lexer)
    Short names: sql
    Filename extensions: *.sql
    Mimetypes: text/x-sql
70. XML+PHP (Lexer)
    Short names: xml+php
    Filename extensions:
    Mimetypes: application/xml+php
71. CSS+Ruby (Lexer)
    Short names: css+erb, css+ruby
    Filename extensions:
    Mimetypes: text/css+ruby
72. Debian Sourcelist (Lexer)
    Short names: sourceslist, sources.list
    Filename extensions: sources.list
    Mimetypes:
73. XML+Django/Jinja (Lexer)
    Short names: xml+django, xml+jinja
    Filename extensions:
    Mimetypes: application/xml+django, application/xml+jinja
74. Java Server Page (Lexer)
    Short names: jsp
    Filename extensions: *.jsp
    Mimetypes: application/x-jsp
75. XML (Lexer)
    Short names: xml
    Filename extensions: *.xml, *.xsl, *.rss
    Mimetypes: text/xml, application/xml, image/svg+xml, application/rss+xml, application/atom+xml, application/xsl+xml, application/xslt+xml
76. Groff (Lexer)
    Short names: groff, nroff, man
    Filename extensions: *.[1234567], *.man
    Mimetypes: application/x-troff, text/troff
77. C++ (Lexer)
    Short names: cpp, c++
    Filename extensions: *.cpp, *.hpp, *.c++, *.h++
    Mimetypes: text/x-c++hdr, text/x-c++src
78. XML+Mako (Lexer)
    Short names: xml+mako
    Filename extensions:
    Mimetypes: application/xml+mako
79. ERB (Lexer)
    Short names: erb
    Filename extensions:
    Mimetypes: application/x-ruby-templating
80. HTML+Mako (Lexer)
    Short names: html+mako
    Filename extensions:
    Mimetypes: text/html+mako
81. XML+Myghty (Lexer)
    Short names: xml+myghty
    Filename extensions:
    Mimetypes: application/xml+myghty
82. Befunge (Lexer)
    Short names: befunge
    Filename extensions: *.befunge
    Mimetypes: application/x-befunge
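
The list above reflects the Pygments release that was current when this was written. A listing in the same layout can be produced for whichever version is installed with `get_all_lexers()`. The sketch below is one way to do that; `describe_lexers` is a name chosen here, not part of Pygments.

```python
from pygments.lexers import get_all_lexers


def describe_lexers():
    """Return one text entry per lexer known to the installed Pygments."""
    entries = []
    for name, aliases, filenames, mimetypes in get_all_lexers():
        entries.append(
            f"{name} (Lexer)\n"
            f"    Short names: {', '.join(aliases)}\n"
            f"    Filename extensions: {', '.join(filenames)}\n"
            f"    Mimetypes: {', '.join(mimetypes)}"
        )
    return entries


for number, entry in enumerate(describe_lexers(), start=1):
    print(f"{number}. {entry}")
```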