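June 16, 2020. First hands-on experiment with BERT: getting a basic BERT fine-tuning run working on my own data.
Walkthrough followed: https://zhuanlan.zhihu.com/p/50774647
Dependency: the BERT source code released by Google (https://github.com/google-research/bert)
Experimental setup: each triple is converted to text of the form "ptext : vtext", and a classifier is trained to judge how similar two triples are (a classification problem with three labels, ["contradiction", "entailment", "neutral"], corresponding to dissimilar, similar, and moderately similar, respectively).
(The main change is to run_classifier.py in the BERT source.)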
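As a small illustration of the setup above, the sketch below renders a triple in the "ptext : vtext" form and assigns each of the three labels an integer id in list order, mirroring the enumerate-style label map that run_classifier.py builds from its label list. The names triple_to_text and LABEL_TO_ID and the sample values are hypothetical and do not come from the original code.

```python
# The three labels used in these notes, in the order given above.
LABELS = ["contradiction", "entailment", "neutral"]

# Integer id for each label, assigned in list order.
LABEL_TO_ID = {label: i for i, label in enumerate(LABELS)}


def triple_to_text(ptext, vtext):
    """Render a triple's property text and value text as 'ptext : vtext'."""
    return "{} : {}".format(ptext, vtext)


if __name__ == "__main__":
    # Hypothetical sample triple, for illustration only.
    print(triple_to_text("birthPlace", "Berlin"))
    print(LABEL_TO_ID)
```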
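0. Building the supervised data:
(1) Pair triple 0 of each entity description with each of its other triples to form triple pairs;
(2) Label each triple pair using an existing text-similarity score (here Jaro-Winkler similarity; for more string similarity measures see https://pypi.org/project/strsim )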
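Steps (1) and (2) can be sketched without the database or any third-party package. In this sketch the similarity function is passed in as a parameter, and the usage example plugs in difflib.SequenceMatcher.ratio from the standard library purely as a stand-in: it is not Jaro-Winkler and will give different scores. The names label_from_similarity, pair_first_with_rest, and difflib_ratio and the sample strings are hypothetical; the 0.3 and 0.7 thresholds are the ones used in these notes.

```python
from difflib import SequenceMatcher


def label_from_similarity(sim, low=0.3, high=0.7):
    """Map a similarity score in [0, 1] to one of the three labels.

    Scores exactly equal to low or high fall into "neutral".
    """
    if sim < low:
        return "contradiction"
    if sim > high:
        return "entailment"
    return "neutral"


def pair_first_with_rest(texts, similarity):
    """Pair texts[0] with every later text and label each pair."""
    if not texts:
        return []
    first = texts[0]
    return [
        (first, other, label_from_similarity(similarity(first, other)))
        for other in texts[1:]
    ]


def difflib_ratio(a, b):
    # Stand-in similarity (NOT Jaro-Winkler), used only so the sketch
    # runs with the standard library alone.
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    # Hypothetical sample texts in the "ptext : vtext" form.
    sample = ["name : Alice", "name : Alicia", "height : 180"]
    for row in pair_first_with_rest(sample, difflib_ratio):
        print(row)
```

Passing the similarity function in keeps the pairing and thresholding logic testable on its own, and a Jaro-Winkler scorer such as JaroWinkler().similarity can be supplied in place of the stand-in.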
from strsimpy.jaro_winkler import JaroWinkler  # provided by the strsimpy package

def get_lines_from_eids(eids):
    # Build (text_a, text_b, label) tuples: triple 0 of each entity paired with each of its other triples.
    jarowinkler = JaroWinkler()  # for computing similarity
    lines = []
    for eid in eids:
        tid_ttext_list = _load_ttext_for_eid(eid)
        ti = tid_ttext_list[0]
        ttexti = ti[1]
        for j in range(1, len(tid_ttext_list)):
            tj = tid_ttext_list[j]
            ttextj = tj[1]
            sim = jarowinkler.similarity(ttexti, ttextj)  # in range [0, 1]
            # Below 0.3 is "contradiction", above 0.7 is "entailment", anything in between is "neutral".
            label = "contradiction" if sim < 0.3 else "entailment" if sim > 0.7 else "neutral"
            lines.append((ttexti, ttextj, label))
    return lines

def _load_ttext_for_eid(eid):
    # getConnESBM() is the author's database connection helper (not shown in this excerpt).
    conn = getConnESBM()
    sql = 'select tid, ptext, vtext from jwsall_triple_info where tid in (select id from jwsall_triple where eid=%d) order by tid' % eid
    cursor = conn.cursor()
    cursor.execute(sql)
    result = cursor.f