Original: multi-process execution with multiprocessing imap
from multiprocessing import Pool
from tqdm import tqdm
import multiprocessing
import time

def preprocess_data_multiprocessing(x):
    time.sleep(0.1)
    return x + 5

data = range(20)
if __name__ == '__main__':
    multiprocessing.freeze_support()
    t…
2021-12-18 18:33:35
534
Original: SequenceSummary
from torch import nn
from torch.nn import Identity
from typing import Callable, Optional
import torch

PretrainedConfig = None
get_activation = None

class SequenceSummary(nn.Module):
    r"""
    Compute a single vector summary of a sequence hidden states…
2021-05-19 13:58:02
296
Original: a custom Conv1D vs. nn.Conv1d
from torch import nn
import torch

class Conv1D(nn.Module):
    """
    1D-convolutional layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2).
    Basically works like a linear layer but the weights are transposed.

    Args:
        …
2021-05-10 10:40:16
435
Original: beam_search in transformers
"""
The beam_search procedure in transformers' generation_beam_search.py:
when the decoder input is [N, 1], with N the batch_size, and num_beams is set to k, the input is expanded to [N*k, 1];
feeding it through the decoder produces logits of shape [N*k, T], where T is the total number of tokens (the vocabulary size);
the logits are added to the running beam_score to form the new beam_score, which is ranked with topk to obtain
next_beam_scores, next_beam_index and next_beam_tokens (a sketch of this top-k step follows the entry);
be…
"""
2021-04-27 16:49:39
2341
2
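A minimal sketch of the top-k step described above, with toy shapes; batch_size, num_beams, vocab_size and the random logits are illustrative stand-ins, not the transformers internals:

import torch

batch_size, num_beams, vocab_size = 2, 3, 5
logits = torch.randn(batch_size * num_beams, vocab_size)   # decoder output for the N*k beams
next_token_scores = torch.log_softmax(logits, dim=-1)
beam_scores = torch.zeros(batch_size * num_beams, 1)        # running score per beam
next_token_scores = next_token_scores + beam_scores         # new beam_score
# flatten the k beams and the vocab per batch item, then rank with topk
flat = next_token_scores.view(batch_size, num_beams * vocab_size)
next_beam_scores, next_tokens = torch.topk(flat, 2 * num_beams, dim=1)
next_beam_index = next_tokens // vocab_size                 # which beam each candidate extends
next_beam_tokens = next_tokens % vocab_size                 # which token extends it
print(next_beam_scores.shape, next_beam_index.shape, next_beam_tokens.shape)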
Original: _calc_banned_ngram_tokens in NoRepeatNGramLogitsProcessor
# transformers.generation_logits_process: the purpose of NoRepeatNGramLogitsProcessor's
# _calc_banned_ngram_tokens is to generate n-grams without repetition (see the sketch after this entry).
import torch
from typing import List, Iterable

def _get_ngrams(ngram_size: int, prev_input_ids: torch.Tensor, num_hypos: int):
    generated_ngra…
2021-04-09 15:43:11
507
1
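A self-contained sketch of the banned-n-gram idea with plain Python containers; the banned_tokens_for helper is mine for illustration, not the transformers API:

def banned_tokens_for(prev_tokens, ngram_size=2):
    """Tokens that, if generated next, would repeat an n-gram already in prev_tokens."""
    generated = {}
    for i in range(len(prev_tokens) - ngram_size + 1):
        prefix = tuple(prev_tokens[i:i + ngram_size - 1])
        generated.setdefault(prefix, set()).add(prev_tokens[i + ngram_size - 1])
    current_prefix = tuple(prev_tokens[len(prev_tokens) - ngram_size + 1:])
    return generated.get(current_prefix, set())

print(banned_tokens_for([1, 2, 3, 1, 2]))  # {3}: emitting 3 would repeat the bigram (2, 3)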
Original: understanding the rat-sql PackedSequence LSTM
import itertools
import operator
from typing import Tuple
import torch

class RecurrentDropoutLSTMCell(torch.jit.ScriptModule):
    __constants__ = ['hidden_size']

    def __init__(self, input_size, hidden_size, dropout=0.):
        super(RecurrentDrop…
2021-02-04 15:11:51
346
1
Original: rat-sql transformer
import copy
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
import entmax

# Adapted from
# https://github.com/tensorflow/tensor2tensor/blob/0b156ac533ab53f65f44966381f6e147c7371eee/tensor2tensor/layers/common_attention.py
…
2021-02-02 16:42:58
261
Original: rattosql RelationalTransformerUpdate.compute_relations
import itertools
import numpy as np
import torch

def clamp(value, abs_max):
    value = max(-abs_max, value)
    value = min(abs_max, value)
    return value

class RelationalTransformerUpdate(torch.nn.Module):
    qq_max_dist = 2
    cc_max_dist = 2
    …
2021-02-02 15:24:20
122
1
Original: usage of select in ratsql PackedSequencePlus
import torch
import numpy as np
import itertools, operator

class PS():
    def __init__(self, data):
        self.data = data

class Findembedding():
    def __init__(self, ins):
        """
        Given a list of sequences, sort them by length, build the packed-sequence embedding,
        and then, according to…
2021-02-02 14:04:33
205
Original: rat-sql registry.py
import collections
import collections.abc
import inspect
import sys

_REGISTRY = collections.defaultdict(dict)

def register(kind, name):  # registers a class: `kind` is the class's category, `name` is the class name
    kind_registry = _REGISTRY[kind]
    def decorator(obj):
        if name in kind_registry…
2021-01-12 09:42:21
301
1
Original: understanding ratsql LookupEmbeddings
import torch
import numpy as np
from torch import nn

class LookupEmbeddings(torch.nn.Module):
    def __init__(self, device, vocab, embedder, emb_size, learnable_words=[]):
        super().__init__()
        self._device = device
        self.vocab = voc…
2021-01-04 16:07:48
397
Original: pad_sequence, pack_padded_sequence, pad_packed_sequence
# See https://stackoverflow.com/questions/51030782/why-do-we-pack-the-sequences-in-pytorch
import torch
from torch import nn

seq_batch = [torch.tensor([[1, 1], [2, 2], [3, 3], [4…
2020-12-10 17:35:46
150
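A minimal runnable round trip through the three functions, assuming two sequences of lengths 4 and 2 with feature size 2:

import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

seqs = [torch.ones(4, 2), 2 * torch.ones(2, 2)]                           # lengths 4 and 2
padded = pad_sequence(seqs, batch_first=True)                             # (2, 4, 2), zero-padded
packed = pack_padded_sequence(padded, lengths=[4, 2], batch_first=True)   # lengths must be descending
unpacked, lengths = pad_packed_sequence(packed, batch_first=True)
print(padded.shape, lengths)                                              # torch.Size([2, 4, 2]) tensor([4, 2])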
Original: sort-index bookkeeping for lists
import operator

def argsort(items, key=lambda x: x, reverse=False):
    """
    :param items: the original list, List[List]
    :param key: the sort key function
    :param reverse: whether to sort in reverse order
    :return: orig_to_sort: the indices into the original list, in sorted order
             sort_to_orig: for each original item, its position in the sorted list
             (see the sketch after this entry)
    """
    …
2020-12-04 16:59:15
355
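A minimal sketch of the two index lists described in the docstring, assuming orig_to_sort is the sorted-order permutation and sort_to_orig its inverse:

def argsort(items, key=lambda x: x, reverse=False):
    # orig_to_sort[i] = original index of the item at sorted position i
    orig_to_sort = sorted(range(len(items)), key=lambda i: key(items[i]), reverse=reverse)
    # sort_to_orig[j] = sorted position of original item j (the inverse permutation)
    sort_to_orig = [0] * len(items)
    for sorted_pos, orig_idx in enumerate(orig_to_sort):
        sort_to_orig[orig_idx] = sorted_pos
    return orig_to_sort, sort_to_orig

orig_to_sort, sort_to_orig = argsort([30, 10, 20])
print(orig_to_sort)  # [1, 2, 0]
print(sort_to_orig)  # [2, 0, 1]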
Original: training BERT+LSTM on long texts cut into short sentences
# The learning rate matters a lot: with lr=2e-5 training accuracy reaches 0.99;
# with lr=1e-3 training accuracy stays at 0.6 and the loss will not come down.
# The LSTM sequences are variable-length; choose a sensible batch size at test time to avoid running out of memory.
# Since BERT embeddings are used, try to cut the text into sequences of about 500 tokens for sensible
# memory use??? The code below does not actually do that.
import gluonnlp as nlp
import mxnet as mx
from mxnet.gluon.block import HybridBlock
from mxnet…
2020-11-24 14:47:55
2007
Original: Chinese named entity recognition, mxnet_bertner_cn
Code based on https://nlp.gluon.ai/model_zoo/ner/index.html
1. Changed single-GPU training to multi-GPU training.
2. Modified the data-loading part and added data processing for the prediction path.
3. For prediction, consider splitting sentences longer than seq_len at punctuation into multiple samples.
4. ner_predict.py can be used standalone for prediction (it does need the saved model parameters).
5. A CRF module may be added later.
finetune_bertcn.py code:
import argparse
import logging
impor…
2020-11-16 12:55:25
499
Original: caveats when using self.params
from mxnet.gluon import nn
from mxnet import nd

class MyDense(nn.HybridBlock):
    def __init__(self, units, in_units, **kwargs):
        super().__init__(**kwargs)
        self.embedding = nn.Embedding(3, 5)
        self.weight = self.params.get('weight…
2020-11-12 15:00:35
2764
1
Original: a small issue hit with gluon.utils.split_and_load in multi-GPU training — MXNetError: Check failed: (*begin < *end): Invalid begin, en…
import numpy as np
from mxnet import gluon, npx, nd

data = np.arange(15).reshape(3, 5)
dataloader = gluon.data.DataLoader(data, batch_size=2, shuffle=False, last_batch='keep')
devices = [npx.gpu(0), npx.gpu(1)]
for da…
2020-11-11 10:16:23
672
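A hedged reading of the error: with last_batch='keep' the final batch here has size 1, which cannot be sliced across two devices, and that is plausibly what trips the begin/end check. One workaround sketch, with CPU contexts standing in for the GPUs so it runs anywhere:

import numpy as np
import mxnet as mx
from mxnet import gluon

data = np.arange(15).reshape(3, 5)
# 'discard' (or 'rollover') keeps every batch divisible by the number of devices
dataloader = gluon.data.DataLoader(data, batch_size=2, shuffle=False, last_batch='discard')
devices = [mx.cpu(0), mx.cpu(1)]
for batch in dataloader:
    parts = gluon.utils.split_and_load(batch, devices)
    print([p.shape for p in parts])  # [(1, 5), (1, 5)]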
Original: glue.truncate_seqs_equal, glue.concat_sequences
from gluonnlp.data.bert import glue

seqs = [[1, 2, 3], [4, 5, 6]]
print(glue.truncate_seqs_equal(seqs, 4))
seqs = [[1, 2, 3], [4, 5, 6]]
print(glue.truncate_seqs_equal(seqs, 5))
seqs = [['is', 'this', 'jacksonville', '?'], ['no', 'it', 'is', 'not', '.']…
2020-11-09 14:30:43
201
Original: interpretability of the self-attentive sentence embedding classifier
# Personal understanding.
# Code reference: https://nlp.gluon.ai/examples/sentiment_analysis/self_attentive_sentence_embedding.html
# This model can be used to extract the keywords that drive a classification; it counts as a semi-supervised approach.
# The training data was extremely imbalanced, with few manual labels and many wrong samples; accuracy was 85%
# (other algorithms reach 99% with imbalance-aware sampling), and no improvement is planned for now,
# so the extracted keywords turned out poor.
# This algorithm feels better suited to explaining short-text classification.
import os
import jso…
2020-11-03 17:08:36
959
Original: gluonnlp.model.attention_cell API example
from gluonnlp.model import attention_cell
from mxnet import nd
import mxnet as mx
import numpy as np

mx.random.seed(10000)
att_score = nd.arange(12).reshape(2, 2, 3)
print('att_score: ', att_score)
mask = nd.random.randint(0, 2, shape=(2, 2, 3)).astype('float32…
2020-10-15 17:41:25
226
Original: 2016 Google machine translation, English-to-Chinese
Code reference for the 2016 Google machine translation model: https://gluon-nlp.mxnet.io/examples/machine_translation/gnmt.html
My understanding of the encoder-decoder framework: the encoder and decoder use several bidirectional RNNs with several unidirectional RNNs stacked on top, and the hidden states of the corresponding layers are handed to the decoder for initialization. Notably, of the encoder's bidirectional-RNN hidden states, only the backward RNN's hidden state is used to initialize the decoder. The encoder's and decoder's…
2020-10-12 13:04:05
415
Original: crudely filtering similar texts with the tf of tfidf, part 2 (compute-performance optimization)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import numpy as np

all_list = ['大雨預報1:16pm:大雨正影響台北東部,市民應提高警覺',
            '大雨預報1:02pm:大雨正影響台北東部,市民應提高警覺',
            '大雨預報12:35pm:大雨正影響台北東部,市民應提高警覺',
            '大雨預報3:46pm:未…
2020-09-18 13:11:08
351
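A minimal sketch of the filtering idea, assuming character n-gram features and an illustrative similarity threshold of 0.9:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

docs = ['大雨預報1:16pm:大雨正影響台北東部', '大雨預報1:02pm:大雨正影響台北東部', '今日天氣晴朗']
tfidf = TfidfVectorizer(analyzer='char', ngram_range=(2, 3)).fit_transform(docs)
sim = linear_kernel(tfidf, tfidf)   # cosine similarity, since tf-idf rows are L2-normalized
keep = []
for i in range(len(docs)):
    if all(sim[i, j] < 0.9 for j in keep):   # keep i only if unlike everything already kept
        keep.append(i)
print([docs[i] for i in keep])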
Original: crudely filtering similar texts with the tf of tfidf
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import numpy as np

all_list = ['大雨預報1:16pm:大雨正影響台北東部,市民應提高警覺',
            '大雨預報1:02pm:大雨正影響台北東部,市民應提高警覺',
            '大雨預報12:35pm:大雨正影響台北東部,市民應提高警覺',
            '大雨預報3:46pm:未…
2020-09-17 14:55:07
247
Original: FixedBucketSampler, part 2
import numpy as np
import gluonnlp
from gluonnlp.data.sampler import ConstWidthBucket, _match_bucket_keys, _bucket_stats
import warnings

class Sampler(object):
    """Base class for samplers.

    All samplers should subclass `Sampler` and define `__iter…
2020-09-15 12:46:28
292
Original: FixedBucketSampler (part 1): _match_bucket_keys
import numpy as np

def _match_bucket_keys(bucket_keys, seq_lengths):
    """
    :param bucket_keys: the right-edge value of each bucket interval.
    :param seq_lengths: the list of sequence lengths, one per sentence.
    :return: the sample ids held by each bucket, so they can be retrieved easily
             (see the sketch after this entry).
    """
    bucket_key_npy = np.array(bucket_keys, dtype=np.int32)…
2020-09-15 12:44:20
228
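A minimal sketch of the same matching idea with np.searchsorted; the real gluonnlp implementation differs in details such as overflow handling:

import numpy as np

bucket_keys = [5, 10, 15, 20]                # right edge of each interval
seq_lengths = np.array([3, 7, 15, 18, 2])
# first bucket whose right edge is >= the sequence length
bucket_ids = np.searchsorted(bucket_keys, seq_lengths, side='left')
buckets = [[] for _ in bucket_keys]
for sample_id, b in enumerate(bucket_ids):
    buckets[b].append(sample_id)
print(buckets)  # [[0, 4], [1], [2], [3]]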
Original: ExpWidthBucket (exponential) / ConstWidthBucket for splitting data into intervals (binning)
# From the max and min lengths and the number of buckets, compute the bucket_keys
# (the right-edge value of each interval) with an exponentially growing interval-width function.
import mxnet as mx
import math

INT_TYPES = mx.base.integer_types

class BucketScheme:
    r"""Base class for generating bucket keys."""
    def __call__(self, max_lengths, min_lengths, num_buckets):
        …
2020-09-14 18:56:09
213
Original: comparing pkuseg and jieba segmentation quality; BERTBasicTokenizer
Judging by the results, pkuseg's web model outperforms jieba on internet text. Concretely:
1. Splitting time-granularity tokens, e.g. 2010/10
2. Splitting URLs, e.g. https://d.weibo.com/102803_ctg1_3288_-_ctg1_3288?from=faxian_hot&mod=fenlei#
3. Stripping invalid characters: pkuseg does this kind of cleanup, jieba does not
4. ...
A side-by-side sketch follows this entry.
import pkuseg
seg = pkuseg.pkuseg()  # the matching fine-grained domain model is downloaded automatically
text = seg.cut(…
2020-09-11 11:07:05
964
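A minimal side-by-side sketch, assuming pkuseg's 'web' model and jieba's default mode (both libraries download or build their models on first use):

import pkuseg
import jieba

text = '大雨正影響台北東部,市民應提高警覺'
seg = pkuseg.pkuseg(model_name='web')   # web-domain model
print('pkuseg:', seg.cut(text))
print('jieba :', jieba.lcut(text))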
Original: a small Python trick: the parent inheriting the child's estate
# In inheritance, the parent's __init__ can read an attribute that only the child defines.
class A:
    def __init__(self):
        self.a = 1
        self.b = 2
        print('继承中的父承子业: ', self.c)

class B(A):
    def __init__(self):
        self.c = 3
        super(B, self).__init__()

if __name__ == '__main__':
    b = B()

Result:…
2020-09-09 09:57:45
222
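Why this works, as a small runnable sketch: attribute lookup happens on the instance at call time, so as long as self.c is assigned before super().__init__() runs, the parent's __init__ can read it:

class A:
    def __init__(self):
        print('c seen by the parent:', self.c)  # resolved on the instance at call time

class B(A):
    def __init__(self):
        self.c = 3          # assigned BEFORE delegating to the parent
        super().__init__()

B()  # prints: c seen by the parent: 3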
Original: study notes on nlp.data.batchify.CorpusBPTTBatchify
import gluonnlp as nlp
from gluonnlp.data.dataset import CorpusDataset, SimpleDataset
from mxnet import np, npx
import mxnet as mx
import math

npx.set_np()
# batch_size = 5
bptt = 6

def wordtoword_splitter(s):
    """Split character by character."""
    return list(s)

def _slic…
2020-09-04 17:56:48
420
Original: a brief introduction to transform, _LazyTransformDataset and transform_first
import io
import os
from gluonnlp.data.dataset import Dataset, SimpleDataset

class Dataset(object):
    """Abstract dataset class. All datasets should have this interface.

    Subclasses need to override `__getitem__`, which returns the i-th element…
2020-09-04 10:23:39
432
Original: basic usage of SimpleDataset and CorpusDataset
import io
import os
from gluonnlp.data.dataset import Dataset

def line_splitter(s):
    """Split a string at newlines.

    Parameters
    ----------
    s : str
        The string to be split

    Returns
    -------
    List[str]
        Li…
2020-09-04 09:59:05
580
Original: a quick walkthrough of gluonnlp.vocab
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under…
2020-08-26 17:54:24
836
Original: Bert WordpieceTokenizer
Bert WordpieceTokenizer tokenizes character spans. My understanding: it splits out ever smaller pieces that exist in the vocab. The matching idea is greedy longest-match: the end of the candidate span moves backwards until the span matches a vocab entry; the match is kept, and the remainder of the span is then searched the same way. (A complete runnable sketch follows this entry.)

def tokenize(text):
    unk_token = '<unk>'
    vocab = ["un", "##aff", "##able"]
    output_tokens = []
    for token in [text]:
        chars = list(token…
2020-08-21 18:23:10
3067
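A complete, runnable version of the greedy longest-match loop sketched above, using the snippet's tiny vocab; the wordpiece_tokenize name is mine:

def wordpiece_tokenize(text, vocab=("un", "##aff", "##able"), unk_token='<unk>'):
    chars = list(text)
    tokens, start = [], 0
    while start < len(chars):
        end, cur = len(chars), None
        while start < end:                       # shrink the candidate from the right
            piece = ''.join(chars[start:end])
            if start > 0:
                piece = '##' + piece             # continuation pieces carry the ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return [unk_token]                   # nothing matched: the whole word is unknown
        tokens.append(cur)
        start = end
    return tokens

print(wordpiece_tokenize('unaffable'))  # ['un', '##aff', '##able']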
Original: building a sampler for imbalanced data: Imbalanced Dataset Sampler
from fastNLP.io import SST2Pipe
from fastNLP import DataSetIter
from torchsampler import ImbalancedDatasetSampler

pipe = SST2Pipe()
databundle = pipe.process_from_file()
vocab = databundle.vocabs['words']
print(databundle)
print(databundle.datasets['train…
2020-08-12 10:06:51
1506
4
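For comparison, a minimal sketch of the same goal with plain PyTorch's WeightedRandomSampler instead of torchsampler; the inverse-frequency weights are an illustrative choice:

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

labels = torch.tensor([0, 0, 0, 0, 1])            # imbalanced: four 0s, one 1
class_counts = torch.bincount(labels).float()
weights = 1.0 / class_counts[labels]              # each sample weighted by inverse class frequency
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
loader = DataLoader(TensorDataset(torch.arange(5), labels), batch_size=2, sampler=sampler)
for x, y in loader:
    print(x.tolist(), y.tolist())                 # class 1 now appears far more often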
Original: understanding the mx.nd.sparse.csr_matrix function
"""
For a concrete understanding of csr_matrix see
https://cloud.tencent.com/developer/article/1387734
"""
def csr_matrix(arg1, shape=None, ctx=None, dtype=None):
    """Creates a `CSRNDArray`, a 2D array with compressed sparse row (CSR) format.

    The CSRNDArray can be instantiated…
2020-06-12 13:53:00
343
Original: understanding the pad mechanism of gluonnlp.data.batchify
import math
import mxnet as mx
import numpy as np
import warnings

def _pad_arrs_to_max_length(arrs, pad_axis, pad_val, use_shared_mem, dtype, round_to=None):
    """Inner implementation of the Pad batchify: pads every array in the [arr, arr, ...] list
    to the maximum size along the padding axis.

    Parameters
    …
2020-06-09 21:38:06
424
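The padding idea itself, as a small numpy sketch assuming 1-D integer sequences and pad_val=0:

import numpy as np

def pad_to_max(arrs, pad_val=0):
    """Stack variable-length 1-D sequences, padding each to the longest length."""
    max_len = max(len(a) for a in arrs)
    out = np.full((len(arrs), max_len), pad_val, dtype=np.int64)
    for i, a in enumerate(arrs):
        out[i, :len(a)] = a
    return out

print(pad_to_max([[1, 2, 3], [4, 5], [6]]))
# [[1 2 3]
#  [4 5 0]
#  [6 0 0]]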
Original: understanding gluonnlp's fasttext_ngram_hashes generator
from numba import njit
import numpy as np

def _fasttext_ngram_hashes(word, ns, bucket_size):
    """Generate the word's n-gram hash codes for each n in ns."""
    hashes = []
    max_n = np.max(ns)
    for i in range(len(word)):  # pylint: disable=consider-using-enumerate
        if (wor…
2020-05-16 21:10:12
342
Original: understanding the usage of gluonts' AddObservedValuesIndicator
# class AddObservedValuesIndicator(SimpleTransformation):
#     """
#     Replaces missing values in a numpy array (NaNs) with a dummy value and adds
#     an "observed"-indicator that is ``1`` when …
2020-02-21 17:38:26
329
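What the transform amounts to, as a numpy sketch with an assumed dummy value of 0.0:

import numpy as np

values = np.array([1.0, np.nan, 3.0, np.nan])
observed = (~np.isnan(values)).astype(np.float32)   # 1 where a real value exists
filled = np.where(np.isnan(values), 0.0, values)    # NaNs replaced by the dummy value
print(filled)    # [1. 0. 3. 0.]
print(observed)  # [1. 0. 1. 0.]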
Original: rounding a timestamp down to the most recent 5-minute mark
print(pi_start_time)
from datetime import datetime, timedelta

def normalize_minutes(dt, round_mins=5):
    mins = dt.minute - (dt.minute % round_mins)
    nearest5interval = datetime(dt.year, dt.month, …
2019-12-26 10:36:07
508
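A complete version of the helper, hedged: the truncated original presumably rebuilds the datetime field by field, while datetime.replace reaches the same result:

from datetime import datetime

def normalize_minutes(dt, round_mins=5):
    """Round dt down to the most recent round_mins boundary."""
    mins = dt.minute - (dt.minute % round_mins)
    return dt.replace(minute=mins, second=0, microsecond=0)

print(normalize_minutes(datetime(2019, 12, 26, 10, 36, 7)))  # 2019-12-26 10:35:00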
Original: a PyTorch implementation of CBOW
Source code with a few Chinese annotations, purely for study:
https://github.com/joosthub/PyTorchNLPBook/blob/master/chapters/chapter_5/5_2_CBOW/5_2_Continuous_Bag_of_Words_CBOW.ipynb
import json
import os
from argparse import Namespace
from tq…
2019-12-19 18:58:00
1324
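For orientation, a minimal CBOW forward pass, assuming a toy vocabulary; this is a sketch of the idea, not the notebook's exact model:

import torch
from torch import nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim=50):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.fc = nn.Linear(embedding_dim, vocab_size)

    def forward(self, context):          # context: (batch, window)
        emb = self.embedding(context)      # (batch, window, dim)
        return self.fc(emb.mean(dim=1))    # average the context, predict the center word

model = CBOW(vocab_size=100)
logits = model(torch.randint(0, 100, (4, 6)))  # 4 examples, 6 context tokens each
print(logits.shape)                            # torch.Size([4, 100])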