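I have been trying to get to the bottom of this for a long time. I have to say, a lot of the people behind developer tools (or open-source projects) seem hopelessly muddled: they advertise their tools as suitable for research, yet have no idea what researchers actually need, and the documentation they write is a mess.
With the old Joern documentation I never managed to figure out how to generate a PDG. Now it is finally clear: https://docs.joern.io/exporting/
Let's use an example taken from someone else's paper. Say I want to generate the PDG for the following Example.c file: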
int main(int argc, char **argv)
{
    char *items[] = {"boat", "car", "truck", "train"};
    int index = Untrusted();
    printf("You selected %s\n", items[index-1]);
    int upbound = sizeof(items) / sizeof(items[0]);
    printf("Last item %s\n", items[upbound - 1]);
}
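This file cannot be compiled as it stands (Untrusted(), for one, is not defined anywhere), but thanks to its island grammar Joern can still analyze it. With Joern installed, we run: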
./joern-parse Example.c
./joern-export --repr pdg --out Example
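This creates a directory named Example containing files such as 0-pdg.dot, 1-pdg.dot and 2-pdg.dot. Usually we only care about the first one, which here holds the graph for main. Let's see what that file looks like: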
digraph main {
"1000100" [label = "(METHOD,main)" ]
"1000135" [label = "(METHOD_RETURN,int)" ]
"1000101" [label = "(PARAM,int argc)" ]
"1000102" [label = "(PARAM,char **argv)" ]
"1000105" [label = "(<operator>.assignment,*items[] = {\"boat\", \"car\", \"truck\", \"train\"})" ]
"1000108" [label = "(<operator>.assignment,index = Untrusted())" ]
"1000111" [label = "(printf,printf(\"You selected %s\n\", items[index-1]))" ]
"1000115" [label = "(<operator>.subtraction,index-1)" ]
"1000119" [label = "(<operator>.assignment,upbound = sizeof(items) / sizeof(items[0]))" ]
"1000121" [label = "(<operator>.division,sizeof(items) / sizeof(items[0]))" ]
"1000122" [label = "(<operator>.sizeOf,sizeof(items))" ]
"1000124" [label = "(<operator>.sizeOf,sizeof(items[0]))" ]
"1000128" [label = "(printf,printf(\"Last item %s\n\", items[upbound - 1]))" ]
"1000132" [label = "(<operator>.subtraction,upbound - 1)" ]
"1000119" -> "1000135" [ label = "DDG: sizeof(items) / sizeof(items[0])"]
"1000128" -> "1000135" [ label = "DDG: items[upbound - 1]"]
"1000102" -> "1000135" [ label = "DDG: argv"]
"1000124" -> "1000135" [ label = "DDG: items[0]"]
"1000101" -> "1000135" [ label = "DDG: argc"]
"1000111" -> "1000135" [ label = "DDG: printf(\"You selected %s\n\", items[index-1])"]
"1000122" -> "1000135" [ label = "DDG: items"]
"1000111" -> "1000135" [ label = "DDG: items[index-1]"]
"1000128" -> "1000135" [ label = "DDG: printf(\"Last item %s\n\", items[upbound - 1])"]
"1000108" -> "1000135" [ label = "DDG: Untrusted()"]
"1000132" -> "1000135" [ label = "DDG: upbound"]
"1000115" -> "1000135" [ label = "DDG: index"]
"1000100" -> "1000101" [ label = "DDG: "]
"1000100" -> "1000102" [ label = "DDG: "]
"1000100" -> "1000105" [ label = "DDG: "]
"1000100" -> "1000108" [ label = "DDG: "]
"1000100" -> "1000111" [ label = "DDG: "]
"1000105" -> "1000111" [ label = "DDG: items"]
"1000108" -> "1000115" [ label = "DDG: index"]
"1000100" -> "1000115" [ label = "DDG: "]
"1000100" -> "1000119" [ label = "DDG: "]
"1000100" -> "1000121" [ label = "DDG: "]
"1000100" -> "1000122" [ label = "DDG: "]
"1000100" -> "1000128" [ label = "DDG: "]
"1000119" -> "1000132" [ label = "DDG: upbound"]
"1000100" -> "1000132" [ label = "DDG: "]
}
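To work with a file like this programmatically rather than just render it, here is a minimal Python sketch that reads the node and edge lines into plain data structures. It assumes every node and edge sits on a single line in exactly the layout shown above; the function name parse_pdg_dot and the two regular expressions are my own, not part of Joern.

```python
import re

# Node line, e.g.  "1000100" [label = "(METHOD,main)" ]
NODE_RE = re.compile(r'^"(\d+)"\s*\[\s*label\s*=\s*"(.*)"\s*\]$')
# Edge line, e.g.  "1000105" -> "1000111" [ label = "DDG: items"]
EDGE_RE = re.compile(r'^"(\d+)"\s*->\s*"(\d+)"\s*\[\s*label\s*=\s*"(.*)"\s*\]$')


def parse_pdg_dot(text):
    """Return (nodes, edges) from the text of one *-pdg.dot file.

    nodes maps node id -> label; edges is a list of (src, dst, label).
    Assumption: one node or edge per line, formatted as in the dump above.
    """
    nodes, edges = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        edge = EDGE_RE.match(line)
        if edge:
            edges.append((edge.group(1), edge.group(2), edge.group(3).strip()))
            continue
        node = NODE_RE.match(line)
        if node:
            nodes[node.group(1)] = node.group(2)
    return nodes, edges
```

Calling parse_pdg_dot(open("Example/0-pdg.dot").read()) should give a dictionary of node labels keyed by ID and a list of (source, target, label) edges. The IDs are kept as strings, exactly as they appear in the file.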
If we then run:
dot -Tpng 0-pdg.dot -o 0-pdg.png
we get 0-pdg.png, a rendering of the graph described by the file above.

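It is immediately obvious that the nodes of this graph are not statements but finer-grained nodes, similar to AST nodes. When analyzing source code, however, we often need to map things back to a specific line (a diff, for example, works line by line). So what do we do? Sorry, the Joern documentation will not give us the answer. Someone asked the same question here, for instance: https://gitter.im/joern-code-analyzer/community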

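Heh, but nobody replied to him. So we have to work something out ourselves.
According to the introduction here, we can query everything in the generated CPG (Code Property Graph): https://docs.joern.io/quickstart#querying-the-code-property-graph
For example, we can use: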
cpg.method.name.l
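to return the names of all functions in the analyzed code. Judging from the 0-pdg.dot file above, every node has a unique ID, so in principle we should be able to print the line number that belongs to each ID.
The simplest way is: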
cpg.all.l
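The output does include the nodes' line numbers, but it is not easy to process. The reference card here explains the alternatives: https://docs.joern.io/cpgql/reference-card#execution-directives
Besides .l, we can also output JSON, which is much easier to analyze (a side rant about .l: nothing before it in the query is abbreviated, so why not call it .list? It had to be .l, which is far too easy to confuse with .1):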
cpg.all.toJsonPretty |> "Json.txt"
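Once the JSON is on disk, linking node IDs to lines takes only a few lines of Python. The sketch below is a guess at how one might do it, not something taken from the Joern documentation. It assumes the file holds a JSON array of objects, that each object stores its node ID under "id" or "_id" and its line under "lineNumber" as a plain integer, and that edges come as (source, target, label) tuples read from the .dot file. Check those key names against your own Json.txt, since they may differ between Joern versions.

```python
import json
from collections import defaultdict


def line_numbers_from_records(records, id_keys=("id", "_id"), line_key="lineNumber"):
    """Map node id (as a string) -> source line for records that carry one.

    The key names are assumptions about the JSON layout; adjust as needed.
    """
    lines = {}
    for rec in records:
        if not isinstance(rec, dict):
            continue
        node_id = next((rec[k] for k in id_keys if k in rec), None)
        line = rec.get(line_key)
        if node_id is None or not isinstance(line, int):
            continue  # nodes without an id or a line number are skipped
        lines[str(node_id)] = line
    return lines


def load_line_numbers(json_path, **kwargs):
    """Read the exported JSON file and build the id -> line map."""
    with open(json_path, encoding="utf-8") as fh:
        return line_numbers_from_records(json.load(fh), **kwargs)


def collapse_to_lines(edges, lines):
    """Turn node-level (src, dst, label) edges into line-level edges.

    Edges with an endpoint of unknown line, or with both endpoints on the
    same line, are dropped. Returns {(src_line, dst_line): sorted labels}.
    """
    grouped = defaultdict(set)
    for src, dst, label in edges:
        a, b = lines.get(str(src)), lines.get(str(dst))
        if a is None or b is None or a == b:
            continue
        grouped[(a, b)].add(label)
    return {pair: sorted(labels) for pair, labels in grouped.items()}
```

With lines = load_line_numbers("Json.txt") and the edges of 0-pdg.dot, collapse_to_lines(edges, lines) gives a dependence graph whose nodes are source lines, which is the granularity a line-based diff needs. Dropping same-line edges is a design choice of this sketch; keep them if dependencies within a line matter to you.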
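What if we want to analyze many source files in batch? The interactive approach clearly will not work for that, so we turn to the interpreter: https://docs.joern.io/interpreter
For example, we save the following as test.sc: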
@main def exec() = {
  // Load the CPG written by joern-parse
  loadCpg("cpg.bin")
  // Dump every node as pretty-printed JSON
  cpg.all.toJsonPretty |> "Temp.json"
}
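For the outer loop over many files, a small Python driver can repeat the two commands from earlier, once per source file. This is only a sketch under a few assumptions: joern-parse and joern-export live in a directory I call joern_dir, joern-parse writes cpg.bin into the current working directory and joern-export reads it from there (which is how the manual run above behaved), and no two source files share the same base name. The function names plan_commands and run_batch are mine.

```python
import subprocess
from pathlib import Path


def plan_commands(source, work_root, joern_dir):
    """Return (workdir, commands) for one source file without running anything."""
    source = Path(source).resolve()
    joern_dir = Path(joern_dir).resolve()
    # One working directory per source file, named after the file's stem.
    workdir = Path(work_root).resolve() / source.stem
    commands = [
        # Assumed to write cpg.bin into the working directory.
        [str(joern_dir / "joern-parse"), str(source)],
        # Assumed to read that cpg.bin and write the .dot files into ./pdg.
        [str(joern_dir / "joern-export"), "--repr", "pdg", "--out", "pdg"],
    ]
    return workdir, commands


def run_batch(sources, work_root, joern_dir):
    """Run the planned commands for every source; return the sources that failed."""
    failed = []
    for source in sources:
        workdir, commands = plan_commands(source, work_root, joern_dir)
        workdir.mkdir(parents=True, exist_ok=True)
        try:
            for cmd in commands:
                subprocess.run(cmd, cwd=workdir, check=True)
        except (subprocess.CalledProcessError, OSError):
            failed.append(str(source))
    return failed
```

Something like run_batch(Path("samples").glob("*.c"), "out", "/path/to/joern") would then leave one directory per file under out, each with its own cpg.bin and pdg folder. Giving each file a separate working directory keeps the runs from overwriting each other's cpg.bin.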
