A First Try at BERT

Sophia_Liu

Category: ML code. Tags: tensorflow, deep learning, machine learning

Article link: https://blog.youkuaiyun.com/xiao_xia_/article/details/106795157

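This post records the author's first attempt at fine-tuning BERT on custom data, covering data preparation, model configuration, training, and result analysis. By modifying the BERT source code, the author builds a classifier that judges how similar two triples are, and discusses how to integrate the pretrained BERT model into a custom model without using the estimator.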

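June 16, 2020: my first hands-on practice with BERT, an attempt to get a simple BERT fine-tuning run working on my own data.

Process reference: https://zhuanlan.zhihu.com/p/50774647

Dependencies: the BERT source code released by Google (https://github.com/google-research/bert)

Experimental setup: each triple is converted to text of the form "ptext : vtext", and a classifier is trained to judge the degree of similarity between two triples. This is a classification problem with three labels, ["contradiction", "entailment", "neutral"], corresponding to dissimilar, similar, and moderately similar, respectively.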
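As a rough illustration of this setup, the sketch below renders a triple as "ptext : vtext" and assigns each label an integer id by its position in the label list. The helper names `triple_to_text` and `build_label_map` and the sample triple are hypothetical; they are not taken from the BERT source or from the author's code.

```python
# Label set from the experimental setup; list order determines the integer ids.
LABELS = ["contradiction", "entailment", "neutral"]


def triple_to_text(ptext, vtext):
    """Render one triple as 'ptext : vtext' (hypothetical helper)."""
    return "%s : %s" % (ptext, vtext)


def build_label_map(labels):
    """Map each label string to an integer id by its position in the list."""
    return {label: i for i, label in enumerate(labels)}


# Made-up example triple, for illustration only.
text = triple_to_text("birth place", "Berlin")
label_map = build_label_map(LABELS)
print(text)
print(label_map)
```

A pair of such texts then plays the role of the two input sentences of a sentence-pair classification example.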

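(The main changes are to run_classifier.py in the BERT source code.)

0. Constructing the supervised data:

(1) Pair the 0th triple in each entity description with each of the other triples to form triple pairs;

(2) Label each triple pair according to an existing text similarity score (here, for example, Jaro-Winkler similarity; for more string similarity measures see https://pypi.org/project/strsim ).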
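The labeling rule in step (2) can be sketched in isolation, without the database or the similarity library. The thresholds 0.3 and 0.7 are the ones used in the code that follows; the function name `label_from_similarity` and the sample scores are made up for illustration. Counting the resulting labels is a cheap way to check whether the chosen thresholds give a badly imbalanced training set.

```python
from collections import Counter


def label_from_similarity(sim, low=0.3, high=0.7):
    """Turn a similarity score in [0, 1] into one of the three labels."""
    if sim < low:
        return "contradiction"  # dissimilar
    if sim > high:
        return "entailment"     # similar
    return "neutral"            # in between, boundaries included


# Made-up similarity scores, for illustration only.
scores = [0.05, 0.3, 0.55, 0.7, 0.92]
labels = [label_from_similarity(s) for s in scores]
counts = Counter(labels)
print(labels)
print(counts)
```

Note that scores exactly equal to 0.3 or 0.7 fall into "neutral", because both comparisons are strict.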


from strsimpy.jaro_winkler import JaroWinkler


def get_lines_from_eids(eids):
    """Build (text_a, text_b, label) tuples for the given entity ids."""
    jaro_winkler = JaroWinkler()  # similarity scorer
    lines = []
    for eid in eids:
        tid_ttext_list = _load_ttext_for_eid(eid)
        if not tid_ttext_list:
            continue  # no triples for this entity, nothing to pair
        ttexti = tid_ttext_list[0][1]  # text of the 0th triple
        # Pair the 0th triple with every other triple of the same entity.
        for tj in tid_ttext_list[1:]:
            ttextj = tj[1]
            sim = jaro_winkler.similarity(ttexti, ttextj)  # in range [0, 1]
            if sim < 0.3:
                label = "contradiction"  # dissimilar
            elif sim > 0.7:
                label = "entailment"  # similar
            else:
                label = "neutral"  # in between
            lines.append((ttexti, ttextj, label))
    return lines


def _load_ttext_for_eid(eid):
    conn = getConnESBM()
    sql = 'select tid, ptext, vtext from jwsall_triple_info where tid in (select id from jwsall_triple where eid=%d) order by tid'%eid
    cursor = conn.cursor()
    cursor.execute(sql)
    result = cursor.f