BERT for unsupervised text tasks

This post explores how BERT and similar self-attention architectures can be used for various text processing tasks, including improving sentence relatedness assessment with a context-window training task and generating distributed representations of large documents.

This post discusses how we use BERT and similar self-attention architectures to address various text crunching tasks at Ether Labs.

Self-attention architectures have caught the attention of NLP practitioners in recent years. They were first proposed in Vaswani et al., where the authors used a multi-headed self-attention architecture for machine translation tasks.

BERT Architecture Overview

  • BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al.
  • Each word in BERT gets n_layers * (num_heads * attn.vector) representations that capture the word in its current context
  • For example, in BERT base: n_layers = 12, num_heads = 12, attn.vector = dim(64)
  • In this case, we have 12 x 12 x 64 representational sub-spaces for each word to leverage
  • This leaves us with both a challenge and an opportunity to leverage such rich representations, unlike any earlier LM architecture (see the sketch below)
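
As a quick illustration, the sketch below inspects these per-layer representations for BERT base. It assumes the Hugging Face transformers library and the generic bert-base-uncased checkpoint, neither of which is specified in the post.

```python
# A minimal sketch: inspect the per-layer hidden states of bert-base-uncased.
# 12 encoder layers, hidden size 768 = 12 heads x 64 dims per token.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("Performance appraisals are often ignored.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple of (n_layers + 1) tensors of shape (batch, seq_len, 768):
# the embedding-layer output plus one tensor per encoder layer.
hidden_states = outputs.hidden_states
print(len(hidden_states) - 1)    # 12 encoder layers
print(hidden_states[-1].shape)   # torch.Size([1, seq_len, 768]) = 12 heads x 64 dims
```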

Sentence relatedness with BERT

BERT representations can be a double-edged sword, given the richness of those representations. In our experiments with BERT, we have observed that conventional similarity metrics like cosine similarity can often be misleading. For example, consider the pair-wise cosine similarities in the case below (from a BERT model fine-tuned for HR-related discussions):

text1: Performance appraisals are both one of the most crucial parts of a successful business, and one of the most ignored.

text2: On the other, actual HR and business team leaders sometimes have a lackadaisical “I just do it because I have to” attitude.

text3: If your organization still sees employee appraisals as a concept they need to showcase just so they can “fit in” with other companies who do the same thing, change is the order of the day. How can you do that in a way that everyone likes?

text1 <> text2: 0.613270938396454

text1 <> text3: 0.634544332325459

text2 <> text3: 0.772294402122498
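
For reference, here is a minimal sketch of how such pair-wise scores can be computed. It uses the generic bert-base-uncased checkpoint as a stand-in for our fine-tuned model and assumes mean pooling over the last hidden layer; the post does not specify the pooling strategy.

```python
# A minimal sketch for computing pair-wise cosine similarities from BERT features.
# Mean pooling over the last hidden layer is an assumption, not a detail from the post.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

def embed(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        last_hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return last_hidden.mean(dim=1).squeeze(0)            # mean pooling over tokens

def cosine(a, b):
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

text1 = "Performance appraisals are both one of the most crucial parts of a successful business, and one of the most ignored."
text2 = "On the other, actual HR and business team leaders sometimes have a lackadaisical attitude."
print(cosine(embed(text1), embed(text2)))
```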

A metric that ranks text1<>text3 higher than any other pair would be desirable. How do we get there?

Out of the box, BERT is pre-trained using two unsupervised tasks: Masked LM and Next Sentence Prediction (NSP).

Masked LM is a variant of the conventional language-model training setup, the next-word prediction task. For more details, please refer to Section 3.1 of the original paper.

The Next Sentence Prediction (NSP) task is a novel approach proposed by the authors to capture the relationship between sentences, beyond mere similarity.

For the text-pair relatedness challenge above, NSP seems an obvious fit. To extend its abilities beyond a single sentence, we have formulated a new training task.

From NSP to Context window

In a context-window setup, we label each pair of sentences occurring within a window of n sentences as 1, and 0 otherwise. For example, consider the following paragraph:

As a manager, it is important to develop several soft skills to keep your team charged. Invest time outside of work in developing effective communication skills and time management skills. Skills like these make it easier for your team to understand what you expect of them in a precise manner. Check in with your team members regularly to address any issues and to give feedback about their work to make it easier to do their job better. Encourage them to give you feedback and ask any questions as well. Effective communications can help you identify issues and nip them in the bud before they escalate into bigger problems.

For a context window of n = 3, we generate the following training examples:

Invest time outside of work in developing effective communication skills and time management skills. <SEP> Check in with your team members regularly to address any issues and to give feedback about their work to make it easier to do their job better. Label: 1

As a manager, it is important to develop several soft skills to keep your team charged. <SEP> Effective communications can help you identify issues and nip them in the bud before they escalate into bigger problems. Label: 0

Effective communications can help you identify issues and nip them in the bud before they escalate into bigger problems. <SEP> Check in with your team members regularly to address any issues and to give feedback about their work to make it easier to do their job better. Label: 1
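
As an illustration, here is a minimal sketch of how such labeled pairs can be generated from a document that has already been split into sentences. The helper and its parameters are hypothetical, not the exact pipeline used at Ether Labs.

```python
# A minimal sketch for generating context-window training pairs:
# sentence pairs whose positions fall within a window of n are labeled 1, others 0.
import random
from itertools import combinations

def make_context_window_pairs(sentences, n=3, num_negatives=2, seed=0):
    rng = random.Random(seed)
    pairs = []
    for i, j in combinations(range(len(sentences)), 2):
        if j - i < n:
            # Positive pair: the two sentences fall within the same window of n sentences.
            pairs.append((sentences[i], sentences[j], 1))
    # Negative pairs: sampled from sentence pairs that fall outside the window.
    far_apart = [(i, j) for i, j in combinations(range(len(sentences)), 2) if j - i >= n]
    for i, j in rng.sample(far_apart, min(num_negatives, len(far_apart))):
        pairs.append((sentences[i], sentences[j], 0))
    return pairs
```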

This training paradigm enables the model to learn the relationship between sentences beyond pair-wise proximity. After fine-tuning BERT on HR data with the context-window task, we got the following pair-wise relatedness scores:

text1 <> text2: 0.1215614

text1 <> text3: 0.899943

text2 <> text3: 0.480266

This captures sentence relatedness beyond similarity. In practice, we use a weighted combination of the cosine similarity and the context-window score to measure the relationship between two sentences.
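
A minimal sketch of such a weighted combination follows; the weight alpha is a hypothetical hyperparameter, the post does not give a value.

```python
# A minimal sketch of combining cosine similarity with the context-window score;
# alpha is an illustrative weight, not a value from the post.
import numpy as np

def relatedness(emb_a, emb_b, context_window_score, alpha=0.5):
    cos = float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
    return alpha * cos + (1.0 - alpha) * context_window_score
```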

Document Embeddings

Generating feature representations for large documents (for retrieval tasks) has always been a challenge for the NLP community. Concatenating sentence representations makes them impractical for downstream tasks, and averaging or other aggregation approaches (like p-means word embeddings) fail beyond a certain document length. We have explored several ways to address these problems and found the following approaches to be effective:

BERT+RNN Encoder

We set up a supervised task to encode document representations, taking inspiration from RNN/LSTM-based sequence prediction tasks.

[step-1] extract BERT features for each sentence in the document

[step-2] train an RNN/LSTM encoder to predict the next sentence’s feature vector at each time step

[step-3] use the final hidden state of the RNN/LSTM as the encoded representation of the document

This approach works well for smaller documents but is not effective for larger ones due to the limitations of RNN/LSTM architectures.
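
A minimal PyTorch sketch of this idea is shown below; the framework choice and layer sizes are assumptions, not details from the post.

```python
# A minimal sketch of the BERT+RNN encoder: an LSTM reads sentence-level BERT features,
# is trained to predict the next sentence's feature vector, and its final hidden state
# serves as the document representation. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DocEncoder(nn.Module):
    def __init__(self, feat_dim=768, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.next_sent = nn.Linear(hidden_dim, feat_dim)  # predicts the next sentence feature

    def forward(self, sent_feats):                  # sent_feats: (batch, n_sentences, feat_dim)
        outputs, (h_n, _) = self.lstm(sent_feats)
        preds = self.next_sent(outputs[:, :-1, :])  # predictions for sentences 2..n
        loss = F.mse_loss(preds, sent_feats[:, 1:, :])
        doc_embedding = h_n[-1]                     # final hidden state = document embedding
        return doc_embedding, loss
```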

Distributed Document Representations

Generating a single feature vector for an entire document fails to capture its whole essence, even when using BERT-like architectures. We have reformulated the problem of document embedding as identifying the candidate text segments within the document which, in combination, capture its maximum information content. We use the following approaches to get the distributed representations: feature clustering and feature graph partitioning.

Feature clustering

[step-1] split the candidate document into text chunks

[step-2] extract BERT features for each text chunk

[step-3] run the k-means clustering algorithm on the chunk features, using the relatedness score (discussed in the previous section) as the similarity metric, until convergence

[step-4] use the text segments closest to each centroid as the document embedding candidates

A general rule of thumb is to use a large chunk size and a small number of clusters. In practice, these values can be fixed for a specific problem type.
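
A minimal sketch of the clustering step follows. It uses standard Euclidean k-means from scikit-learn as a stand-in for clustering with the relatedness score, and assumes chunk embeddings have already been extracted.

```python
# A minimal sketch of feature clustering: cluster chunk embeddings with k-means and
# keep the chunk closest to each centroid as a document embedding candidate.
# Standard Euclidean k-means is a stand-in for the relatedness-score metric described above.
import numpy as np
from sklearn.cluster import KMeans

def cluster_representatives(chunks, chunk_embeddings, n_clusters=5, seed=0):
    X = np.asarray(chunk_embeddings)
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(X)
    reps = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # The chunk closest to the centroid represents this cluster.
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        reps.append(chunks[members[np.argmin(dists)]])
    return reps
```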

Feature Graph Partitioning

[step-1] split the candidate document into text chunks

[step-2] extract BERT features for each text chunk

[step-3] build a graph with text chunks as nodes and the relatedness scores between chunks as edge weights

[step-4] run a community detection algorithm (e.g., the Louvain algorithm) to extract community subgraphs

[step-5] use graph metrics like node/edge centrality and PageRank to identify the most influential node in each subgraph; these nodes are used as the document embedding candidates
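
A minimal sketch of this pipeline is shown below. It assumes networkx (version 2.8 or later for louvain_communities) and a relatedness(i, j) helper of the kind described in the sentence-relatedness section; both are assumptions rather than details from the post.

```python
# A minimal sketch of feature graph partitioning: text chunks become nodes, relatedness
# scores become edge weights, Louvain communities are extracted, and the highest-PageRank
# node in each community is kept as a document embedding candidate.
import networkx as nx

def graph_representatives(chunks, relatedness):  # relatedness(i, j) -> float (assumed helper)
    G = nx.Graph()
    G.add_nodes_from(range(len(chunks)))
    for i in range(len(chunks)):
        for j in range(i + 1, len(chunks)):
            G.add_edge(i, j, weight=relatedness(i, j))
    communities = nx.community.louvain_communities(G, weight="weight", seed=0)
    reps = []
    for community in communities:
        # The most influential node (by PageRank within the subgraph) represents the community.
        pr = nx.pagerank(G.subgraph(community), weight="weight")
        reps.append(chunks[max(pr, key=pr.get)])
    return reps
```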

Conclusion

This post highlighted some novel approaches to using BERT for various text tasks. These approaches can be adapted to many use cases with minimal effort. More to come on Language Models, NLP, Geometric Deep Learning, Knowledge Graphs, contextual search and recommendations. Stay tuned!

 

“BERT for unsupervised text tasks” by Venkata Dikshit

 
