Some Useful Corpora

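This post surveys genre-specific corpora and resources for English, including SMS messages, chat logs, newswires, emails, and FAQs, and provides links for obtaining them.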

Suggested corpora and resources (in English unless stated otherwise;
not all of them are free of charge)

Genre-specific corpora:
- Genre: SMS Messages = NUS SMS corpus:
http://wing.comp.nus.edu.sg:8080/SMSCorpus/  (English / Chinese)

- Genre: chatlogs = CODIAC chatlogs
( http://data.eol.ucar.edu/codiac/dss/id=92.124 ;
http://data.eol.ucar.edu/codiac/dss/id=88.044 ;
http://data.eol.ucar.edu/codiac/dss/id=107.010 )

- Genre: chatlogs = Many Eyes datasets: some chatlogs can be found
here:
http://www-958.ibm.com/software/data/cognos/manyeyes/datasets

- Genre: chats and Switchboard conversations = Samples of the
Switchboard corpus and the NPS Chat corpus are included in NLTK data
( http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml ). The NPS
Chat corpus ( http://faculty.nps.edu/cmartell/NPSChat.htm ) is a
POS-tagged chat corpus, and the Switchboard corpus
( http://spot.colorado.edu/~michaeli/Lexsubj/swbd.html ) is a corpus of
telephone conversations.
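
As a rough illustration of working with the POS-tagged NPS Chat sample, the sketch below counts tags across posts. It assumes the `nltk` package is installed and that `nltk.download("nps_chat")` can fetch the sample; the helper names `tag_counts` and `load_nps_chat_tagged` are made up for this example.

```python
from collections import Counter


def tag_counts(tagged_posts):
    """Count POS tags over an iterable of posts.

    Each post is expected to be a list of (word, tag) pairs.
    """
    counts = Counter()
    for post in tagged_posts:
        counts.update(tag for _word, tag in post)
    return counts


def load_nps_chat_tagged():
    """Return the POS-tagged posts from NLTK's NPS Chat sample.

    Assumption: nltk is installed and the data download succeeds.
    The import is kept local so tag_counts works without nltk.
    """
    import nltk
    nltk.download("nps_chat", quiet=True)
    from nltk.corpus import nps_chat
    return nps_chat.tagged_posts()
```

With the data available, `tag_counts(load_nps_chat_tagged()).most_common(5)` would list the five most frequent tags in the sample.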

- The Linguistic Data Consortium has a large collection of telephone
conversations, spanning many files and a variety of languages (not
free). See
http://www.ldc.upenn.edu/Catalog/byType.jsp#lexicon,%20speech,%20text

- Genre: blogs = The corporate weblogs dataset in the TREC datasets
( http://ir.dcs.gla.ac.uk/test_collections/ ) is not free. Helpful wiki:
http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG
- Genre: corporate blogs = It is possible to pull corporate blog feeds
or scrape the blogs from this list:
http://www.debbieweil.com/blog/list-of-67-big-brand-corporate-blogs/

- The Göteborg Spoken Language Corpus and other corpora in
Swedish ( http://spraakbanken.gu.se/ )

- Genre: tweets = The Twitter corpus associated with the paper
www.stanford.edu/~alecmgo/papers/TwitterDistantSupervision09.pdf is
here: https://sites.google.com/site/twittersentimenthelp/for-researchers

- Genre: tweets and other microblogs = Microblog track
http://sites.google.com/site/trecmicroblogtrack/ (not free)

- Genre: newswires = Reuters newswire collections:
http://trec.nist.gov/data/reuters/reuters.html

- Genre: emails = Enron corpus ( http://www.cs.cmu.edu/~enron/ );
categorized Enron emails ( http://sgi.nu/enron/corpora.php )
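
A minimal sketch of reading such emails with Python's standard `email` package follows. It assumes the corpus has been unpacked into a directory tree of plain-text, RFC 822 style message files; the function names, the sample message, and the `latin-1` decoding choice are assumptions for illustration, not part of the corpus documentation.

```python
from email import message_from_string
from email.policy import default
from pathlib import Path


def parse_message(raw_text):
    """Parse one raw message into a small dict of headers plus body."""
    msg = message_from_string(raw_text, policy=default)
    body = msg.get_body(preferencelist=("plain",))
    return {
        "from": str(msg.get("From", "")),
        "to": str(msg.get("To", "")),
        "subject": str(msg.get("Subject", "")),
        "body": body.get_content().strip() if body is not None else "",
    }


def iter_messages(root):
    """Yield (path, parsed message) for every file under root.

    Assumption: every regular file under root is a single message.
    latin-1 is used because it can decode any byte sequence.
    """
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            yield path, parse_message(path.read_text(encoding="latin-1"))


# Hypothetical message, used only to show the expected input shape.
SAMPLE = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Meeting\n"
    "\n"
    "See you at 10.\n"
)
```

Calling `parse_message(SAMPLE)` returns a dict with the sender, recipient, subject, and body text; `iter_messages` applies the same parsing to a local copy of the corpus.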

- Genre: emails = Junk email corpus
( http://clg.wlv.ac.uk/resources/junk-emails/index.php )

- Genre: FAQs = 200 FAQs
( http://www.itri.brighton.ac.uk/~Marina.Santini/#Download )

Resources:
- For words and concepts, there are two main resources for English. The
first is WordNet, originally from Princeton; it is included in NLTK (and
can also be obtained separately). It organizes English words according
to their relationships: synonymy, hyponymy, part-whole (meronymy), etc.
The other resource is the Word Association Norms, available from the
University of South Florida ( http://w3.usf.edu/FreeAssociation/ ).
- Article: Hella Koo Finding: Twitter Dialect -
http://blogs.wsj.com/ideas-market/2011/02/08/hella-koo-finding-twitter-dialect/
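
To make the WordNet relationships mentioned in this list concrete, here is a small sketch using NLTK's WordNet interface. It assumes `nltk` is installed and that `nltk.download("wordnet")` succeeds; `describe_synset` and `lookup` are hypothetical helper names introduced for this example.

```python
def describe_synset(synset):
    """Collect synonyms, hyponyms, and parts for one WordNet synset."""
    return {
        # Lemma names in a synset are the synonyms for that sense.
        "synonyms": sorted(synset.lemma_names()),
        # Hyponyms are more specific concepts.
        "hyponyms": sorted(s.name() for s in synset.hyponyms()),
        # Part meronyms are the pieces of the whole.
        "parts": sorted(s.name() for s in synset.part_meronyms()),
    }


def lookup(word):
    """Describe every synset of word.

    Assumption: nltk is installed and the WordNet data can be downloaded.
    """
    import nltk
    nltk.download("wordnet", quiet=True)
    from nltk.corpus import wordnet as wn
    return [describe_synset(s) for s in wn.synsets(word)]
```

For instance, `lookup("car")` would return one dictionary per sense of the word, each listing that sense's synonyms, hyponyms, and parts.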

- Genre: tweets = The suggestion is to use the Twitter API to crawl a
Twitter dataset.
- DiscoverText is a program that makes it easy to collect Twitter
feeds. Its website is here:
http://discovertext.com/defaultDT2.aspx
A free 30-day trial is available and can be used to gather a batch of
Twitter messages.

Note:
Genre: tweets = The Edinburgh Tweets corpus has been withdrawn:
http://demeter.inf.ed.ac.uk/
