StarSpace Series, Part 1: TagSpace

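This post introduces StarSpace applied to tag and word embeddings, specifically on a news topic classification task. After training, the model maps text into the label space. The post gives details of the training data, including how the dataset is composed and what the samples look like, together with how the model is saved and evaluated. It also discusses training on data in different formats and gives application examples.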

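Problem type

TagSpace: word and tag embeddings
Use: learn a mapping from short texts to relevant hashtags, for example as described in this paper. This is a typical classification application.

Model: the learned mapping goes from a bag of words to a bag of tags, obtained by learning embeddings for both. For example, the input “restaurant has great food <\tab> #restaurant <\tab> #yum” is translated into the graph shown below. (Nodes in the graph are the entities whose embeddings are learned, and edges are the relationships between entities.)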

[Figure: the example input drawn as a graph of word nodes and tag nodes]
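To make the input format concrete, here is a minimal Python sketch that splits the example line into its word side and its tag side. The function name `parse_example` and the choice of `#` as the tag prefix are assumptions made for this illustration only; StarSpace does its own parsing internally, and this is not its code.

```python
def parse_example(line, tag_prefix="#"):
    """Split a tab-separated example into (words, tags).

    Hypothetical helper: tokens starting with tag_prefix are treated
    as tags, everything else as input words.
    """
    words, tags = [], []
    for field in line.split("\t"):
        for token in field.split():
            if token.startswith(tag_prefix):
                tags.append(token)
            else:
                words.append(token)
    return words, tags


words, tags = parse_example("restaurant has great food\t#restaurant\t#yum")
print(words)
print(tags)
```

The words form the input side of the example and the tags form the label side; both kinds of entity receive an embedding in the same space.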

Training data

The AG’s news topic classification dataset is constructed by choosing 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and testing 7,600.
In short: news articles in 4 classes, 120,000 training articles in total. The classes are:
World
Sports
Business
Sci/Tech
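The totals quoted above follow directly from the per-class counts. A small Python check (the constant names are chosen here for illustration):

```python
CLASSES = ["World", "Sports", "Business", "Sci/Tech"]
TRAIN_PER_CLASS = 30_000  # per-class training samples, from the dataset description
TEST_PER_CLASS = 1_900    # per-class testing samples, from the dataset description

total_train = len(CLASSES) * TRAIN_PER_CLASS
total_test = len(CLASSES) * TEST_PER_CLASS
print(total_train, total_test)  # 120000 7600
```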

Sample data

The file classes.txt contains a list of classes corresponding to each label.

__label__2 , garca winds up best in tough going , given what sergio garca has achieved in his career already it is difficult to believe he is only 24 years old . he had a 67 yesterday , four under , to share the volvo masters lead with his fellow spaniard
__label__3 , us shares take a tumble on oil prices , new york , nov 23 ( afp ) - wall street shares slid on tuesday as oil prices surged higher and investors sensed weaknesses in the technology sector .
__label__4 , product review blackberry 7100t smartphone ( newsfactor ) , newsfactor - research in motion ' s ( nasdaq rimm ) quad-band \blackberry 7100t with \pda capabilities is a gsm/gprs ( 850/900/1800/1900 mhz ) cellular handset that can make and receive phone calls in more than 100 countries around the world .
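Each sample line starts with a `__label__N` token followed by the already tokenized text. The sketch below reads one such line in Python. It assumes that label ids 1 to 4 follow the class order listed earlier (World, Sports, Business, Sci/Tech), which matches the three samples shown but should be confirmed against classes.txt. `parse_line` and `CLASS_NAMES` are hypothetical names used only in this example.

```python
LABEL_PREFIX = "__label__"

# Assumption: ids 1..4 follow the class order listed earlier in this post.
CLASS_NAMES = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}


def parse_line(line):
    """Return (class_name, tokens) for one fastText-style line."""
    labels, tokens = [], []
    for token in line.split():
        if token.startswith(LABEL_PREFIX):
            labels.append(int(token[len(LABEL_PREFIX):]))
        else:
            tokens.append(token)
    if len(labels) != 1:
        raise ValueError("expected exactly one label per line")
    return CLASS_NAMES[labels[0]], tokens


sample = "__label__3 , us shares take a tumble on oil prices , new york"
name, tokens = parse_line(sample)
print(name)
print(tokens[:4])
```

Note that punctuation such as the leading comma is kept as an ordinary token, as it is in the sample lines above.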

Training

./classification_ag_news.sh
Downloading dataset ag_news
Compiling StarSpace
make: *** No targets specified and no makefile found.  Stop.
Start to train on ag_news data:
Arguments:
lr: 0.01
dim: 10
epoch: 5
maxTrainTime: 8640000
validationPatience: 10
saveEveryEpoch: 0
loss: hinge
margin: 0.05
similarity: dot
maxNegSamples: 10
negSearchLimit: 5
batchSize: 5
thread: 20
minCount: 1
minCountLabel: 1
label: __label__
label: __label__
ngrams: 1
bucket: 2000000
adagrad: 0
trainMode: 0
fileFormat: fastText
normalizeText: 0
dropoutLHS: 0
dropoutRHS: 0
useWeight: 0
weightSep: :
Start to initialize starspace model.
Build dict from input file : /tmp/starspace/data/ag_news.train
Read 5M words
Number of words 
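The log above reports `loss: hinge`, `margin: 0.05` and `similarity: dot`. As a rough illustration of what such a setting means, here is a standard margin ranking (hinge) loss with dot-product similarity in plain Python. This is a sketch of the general technique, not StarSpace's actual implementation: the function names, the toy vectors and the choice to sum over negatives are all assumptions made for this example.

```python
def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def hinge_loss(text_vec, pos_label_vec, neg_label_vecs, margin=0.05):
    """Margin ranking loss: penalize each negative label whose score
    comes within `margin` of the positive label's score."""
    pos_score = dot(text_vec, pos_label_vec)
    return sum(
        max(0.0, margin - pos_score + dot(text_vec, neg))
        for neg in neg_label_vecs
    )


# Toy 2-dimensional embeddings (made up for illustration).
text = [1.0, 0.0]
positive = [0.5, 0.0]
negatives = [[0.48, 0.0], [0.1, 0.0]]

loss = hinge_loss(text, positive, negatives)
print(loss)  # approximately 0.03: only the first negative violates the margin
```

In this toy case the second negative already scores more than the margin below the positive label, so it contributes nothing to the loss.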