A Brief Summary of Part of the Code from the 2020 Language and Intelligence Technology Competition

weixin_41710583

Tags: natural language processing, machine learning, data mining, deep learning, tensorflow

Article link: https://blog.youkuaiyun.com/weixin_41710583/article/details/111463343

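This post records my understanding of parts of the code and the concepts I ran into along the way: converting tokens into a numeric dictionary, how NotImplementedError is used, the __iter__ and __next__ methods, checking whether a method exists on an object, the meaning of __len__, and loading command-line arguments. Notes on the remaining code will be added later.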

The notes below cover the items listed in the title and focus mainly on the encode part of the code.

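01 In the with open block of the code below, each new token is stored with the current size of the dictionary, len(token_dict), as its value. Every token in the vocabulary file is therefore mapped to a unique integer ID, in file order.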

from bert4keras.tokenizers import Tokenizer

maxlen = 128


# tokenizer = Tokenizer(dict_path, do_lower_case=True)


def load_vocab(dict_path, encoding='utf-8', simplified=False, startswith=None):
    """Read the vocabulary from a BERT vocab file.
    """
    token_dict = {}
    with open(dict_path, encoding=encoding) as reader:
        for line in reader:
            token = line.split()
            token = token[0] if token else line.strip()
            token_dict[token] = len(token_dict)

    if simplified:  # filter out redundant tokens
        new_token_dict, keep_tokens = {}, []
        startswith = startswith or []
        for t in startswith:
            new_token_dict[t] = len(new_token_dict)
            keep_tokens.append(token_dict[t])

        for t, _ in sorted(token_dict.items(), key=lambda s: s[1]):
            if t not in new_token_dict:
                keep = True
                if len(t) > 1:
                    for c in Tokenizer.stem(t):
                        if (
                            Tokenizer._is_cjk_character(c) or
                            Tokenizer._is_punctuation(c)
                        ):
                            keep = False
                            break
                if keep:
                    new_token_dict[t] = len(new_token_dict)
                    keep_tokens.append(token_dict[t])

        return new_token_dict, keep_tokens
    else:
        return token_dict
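
The ID assignment in the with open loop can be tried in isolation. The following is a minimal sketch, assuming a small in-memory vocabulary in place of a real BERT vocab file; the helper name build_token_dict and the sample tokens are made up for illustration and are not part of bert4keras.

```python
import io


def build_token_dict(reader):
    """Map each token (one per line) to an integer ID in reading order."""
    token_dict = {}
    for line in reader:
        parts = line.split()
        token = parts[0] if parts else line.strip()
        # len(token_dict) is the number of tokens stored so far,
        # so it is exactly the next unused ID.
        token_dict[token] = len(token_dict)
    return token_dict


# Hypothetical vocabulary standing in for a vocab file on disk.
vocab_file = io.StringIO("[PAD]\n[UNK]\n[CLS]\n[SEP]\nhello\n")
token_dict = build_token_dict(vocab_file)
print(token_dict)
# {'[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3, 'hello': 4}
```

Because the value is taken before the new key is inserted, the first token gets ID 0 and the IDs stay contiguous.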

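02 NotImplementedError in the code below: the base class defines the method only as a placeholder, and subclasses are expected to override it with a real implementation. If a subclass does not define the method, the call falls back to the base-class version, which raises NotImplementedError.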

    def _tokenize(self, text):
        """Basic tokenization function.
        """
        raise NotImplementedError
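
The fallback behaviour can be seen with two small classes. This is only a sketch: BaseTokenizer and WhitespaceTokenizer are hypothetical names chosen for illustration, not the classes from the competition code.

```python
class BaseTokenizer:
    """Base class: defines the interface but not the behaviour."""

    def tokenize(self, text):
        # Public entry point; delegates to the hook below.
        return self._tokenize(text)

    def _tokenize(self, text):
        # Placeholder that subclasses are expected to override.
        raise NotImplementedError


class WhitespaceTokenizer(BaseTokenizer):
    """Subclass that supplies a real _tokenize."""

    def _tokenize(self, text):
        return text.split()


print(WhitespaceTokenizer().tokenize("a b c"))  # ['a', 'b', 'c']

try:
    # No override here, so the base-class placeholder runs.
    BaseTokenizer().tokenize("a b c")
except NotImplementedError:
    print("BaseTokenizer does not implement _tokenize")
```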

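03 Using __iter__: as I understand it, this method makes an object iterable by returning an iterator. It is usually paired with __next__, which returns the next value on each call and raises StopIteration once no values remain.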
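
A minimal sketch of the two methods working together, using a hypothetical Countdown class that is not taken from the competition code:

```python
class Countdown:
    """Iterator that yields start, start - 1, ..., 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # The object is its own iterator, so it returns itself.
        return self

    def __next__(self):
        if self.current <= 0:
            # Signals to for-loops and list() that iteration is finished.
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


values = list(Countdown(3))
print(values)  # [3, 2, 1]
```

A for loop calls __iter__ once to get the iterator and then calls __next__ repeatedly until StopIteration is raised.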

04 hasattr(self.data, '__len__') checks whether the object self.data has a __len__ attribute, that is, whether len() can be called on it.
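
One plausible reason for such a check is to tell sized containers apart from generators, which have no length. A small sketch under that assumption; the function name describe is hypothetical:

```python
def describe(data):
    """Report whether len() can be used on data."""
    if hasattr(data, '__len__'):
        return "sized: %d items" % len(data)
    return "unsized"


sized = describe([1, 2, 3])
unsized = describe(x for x in range(3))  # generators define no __len__
print(sized)    # sized: 3 items
print(unsized)  # unsized
```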

05 __len__ is a method defined on a class, for example class a. Calling len(a()) invokes the __len__ method of that instance.
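
A short sketch of a class that supports len(), using a hypothetical Dataset wrapper around a list of samples:

```python
class Dataset:
    """Thin wrapper around a list of samples."""

    def __init__(self, samples):
        self.samples = list(samples)

    def __len__(self):
        # Called by the built-in len(); must return a non-negative int.
        return len(self.samples)


n = len(Dataset(["a", "b", "c"]))
print(n)  # 3
```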

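06 Loading command-line arguments: when a script needs managed parameters, they can be registered with argparse as shown below. I have not yet studied this package in detail; more notes to follow.

import argparse

parser = argparse.ArgumentParser('shortA')
parser.add_argument('--verbose', action='store_true', help='print QPA info of every sample')
parser.add_argument('--debug', action='store_true', help='debug mode')
# Parse an explicit empty list instead of sys.argv (useful inside a notebook)
args = parser.parse_args(args=[])
args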

_StoreTrueAction(option_strings=['--debug'], dest='debug', nargs=0, const=True, default=False, type=None, choices=None, help='debug mode', metavar=None)

Namespace(debug=False, verbose=False)
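
To see what the store_true action does, the same parser can be given an explicit argument list. This is a minimal sketch that reuses the two flags above; the variable names default_args and debug_args are chosen here for illustration.

```python
import argparse

parser = argparse.ArgumentParser('shortA')
parser.add_argument('--verbose', action='store_true', help='print QPA info of every sample')
parser.add_argument('--debug', action='store_true', help='debug mode')

# With no flags, every store_true option keeps its default of False.
default_args = parser.parse_args(args=[])

# Passing --debug flips only that flag to True.
debug_args = parser.parse_args(args=['--debug'])

print(default_args.debug, default_args.verbose)  # False False
print(debug_args.debug, debug_args.verbose)      # True False
```

Passing args=[] makes parse_args ignore sys.argv, which avoids errors from unrelated arguments when the code runs inside a notebook.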