- Analyzing Network Protocols of Application Layer Using Hidden Semi-Markov Model. The dataset was captured by the authors themselves and covers two text protocols (HTTP and SSDP) and four binary protocols (BitTorrent, QQ, DNS, and NetBIOS).
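- Bit-oriented format extraction approach for automatic binary protocol reverse. The binary protocol samples are: AIS as a "representative of protocols with cable formats" (?), HDLC as the representative link-layer protocol, and NetBIOS and ICMP as the representative network-layer protocols.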
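- 基于递归聚类的报文结构提取方法 (A message structure extraction method based on recursive clustering). Tested on the FTP, DNS, Emule, SIP, and SMB protocols with sample sizes of 500 and 5000.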
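- 基于最大似然概率的协议关键词长度确定方法 (A method for determining protocol keyword length based on maximum likelihood probability; 通信学报, Journal on Communications). Uses the HTTP, FTP, SMTP, POP, SSDP, and BitTorrent protocols; the samples were preprocessed with noise filtering, session reconstruction, message reassembly, and truncation of long messages.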
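- Exploiting Semantic for the Automatic Reverse Engineering of Communication Protocols.
This fairly good, survey-style article is very detailed and runs two types of experiments. The first infers known protocols and compares the result with the protocol specification; the samples are the FTP and SAMBA protocols. The second infers unknown protocols and compares the result with those obtained by other methods; the samples are a P2P protocol (used by a botnet known as ZeroAccess) and a proprietary VoIP protocol.
The original English explains it more clearly: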
Our comparative study relies on six datasets: two of them correspond to a well-known text protocol (FTP), two of them to the well-known SAMBAv2 binary protocol (SMB), one to a P2P botnet protocol and one to a typical commercial proprietary product. To create the calibration datasets we used the first solution detailed in section 6.2 to create scripts that execute various actions with predefined parameters on the protocol implementation. We annotated the captured traces with the executed actions and contextual data used to generate them.
To create the evaluation datasets, we used traces captured in both academic and professional environments. The realistic FTP dataset is a subset of traces published by LBNL [104], collected in a university network. We arbitrarily considered the first 1000 packets in three different days of capture (days 10, 11 and 12) to produce a dataset of reasonable size. The second realistic dataset comes from a full day of SMB traffic captured in a company network. Users agreed to participate and behaved in a normal way. We retained a portion of the whole traffic that represents 1000 packets. The obtained dataset is composed of 937 distinct SMB packets, covering 22 different true formats. By true format, we hereby refer to the format detailed in protocol specifications.
For anonymity reasons, the LBNL dataset only includes traces that hold no precise definition of the context in which they were captured. In such a situation, we would have used the last solution proposed in Section 6.2 to obtain the necessary semantic information. However, in that case these datasets would not reflect the same quality as those used for calibration. Returned results would therefore be difficult to interpret as various factors would have influenced them. Thus, to ensure consistency between parameters used for calibration and evaluation, we extracted from the evaluation network traces the same types of contextual data as those we used for calibration. We relied on the Wireshark tool, which can be used to extract the contextual information we were looking for. We followed the same approach on the SMB datasets.
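
As a note to self, the evaluation-set construction described in the excerpt (keep the first packets of selected capture days, then count how many distinct packets remain) could be sketched roughly as below. This is only a minimal sketch under assumptions: the excerpt does not say whether the 1000-packet limit applies per day or in total, so a per-day limit is assumed here, and the function names, the `captures_by_day` mapping, and the toy payloads are all hypothetical stand-ins for payloads parsed from real capture files.

```python
from typing import Dict, Iterable, List


def build_evaluation_subset(
    captures_by_day: Dict[int, Iterable[bytes]],
    days: List[int],
    per_day_limit: int = 1000,
) -> List[bytes]:
    """Keep the first `per_day_limit` payloads from each selected day.

    Assumption: the limit is applied per day; the excerpt is ambiguous.
    """
    subset: List[bytes] = []
    for day in days:
        for index, payload in enumerate(captures_by_day[day]):
            if index >= per_day_limit:
                break
            subset.append(payload)
    return subset


def count_distinct(payloads: Iterable[bytes]) -> int:
    """Count byte-for-byte distinct payloads."""
    return len(set(payloads))


if __name__ == "__main__":
    # Toy stand-in data; real payloads would come from parsed capture files.
    toy = {
        10: [b"USER anonymous\r\n", b"PASS guest\r\n", b"QUIT\r\n"],
        11: [b"USER anonymous\r\n", b"LIST\r\n"],
    }
    subset = build_evaluation_subset(toy, days=[10, 11], per_day_limit=2)
    print(len(subset), count_distinct(subset))  # prints "4 3"
```

Counting distinct payloads with a set is the simplest reading of "937 distinct SMB packets"; the paper may have used a different notion of equality (for example, ignoring volatile header fields).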