Data Mining Concepts

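This article gives an overview of several key concepts and terms in information technology, including use cases, stemming, stop words, broad query versus exact match, double-blind experiments, TF-IDF, random walks, Monte Carlo methods, heuristics, and recall and precision.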


Use case: a use case is a list of steps, typically defining interactions between a role (known in UML as an "actor") and a system, to achieve a goal. The actor can be a human or an external system.
Stemming: In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form.
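To make the idea concrete, here is a minimal sketch of a suffix-stripping stemmer. The suffix list and the name `naive_stem` are assumptions made for illustration; real stemmers (the Porter algorithm, for example) use far richer rule sets.

```python
# Naive suffix-stripping stemmer (illustrative only; not the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, deliberately tiny rule set


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print([naive_stem(w) for w in ["fishing", "fished", "fishes", "fish"]])
# ['fish', 'fish', 'fish', 'fish']
```

All four inflected forms collapse to the same stem, which is what lets a search for "fishing" also match documents containing "fished".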
Stop words: In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text).
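As a rough sketch, stop-word filtering can be a simple set lookup over tokens. The stop-word list and the function name `remove_stop_words` below are assumptions for illustration; practical lists are much longer and language specific.

```python
# Hypothetical sample list; real stop-word lists are much larger.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "to", "and"}


def remove_stop_words(text: str) -> list[str]:
    """Lowercase, split on whitespace, and drop tokens found in STOP_WORDS."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]


print(remove_stop_words("The price of a ticket is high"))
# ['price', 'ticket', 'high']
```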
Broad query and exact match: in the results of a broad query, the keywords can appear in any order; in an exact match, they must appear in exactly the order given in the query.
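One possible way to express the difference in code is sketched below. It assumes whitespace tokenization and treats an exact match as the query words appearing contiguously and in order; the function names are hypothetical, and real search engines define these match types in more detail.

```python
def broad_match(query: str, text: str) -> bool:
    """True if every query keyword occurs somewhere in the text, in any order."""
    words = text.lower().split()
    return all(keyword in words for keyword in query.lower().split())


def exact_match(query: str, text: str) -> bool:
    """True if the query keywords occur contiguously and in the given order."""
    keywords = query.lower().split()
    words = text.lower().split()
    n = len(keywords)
    return any(words[i:i + n] == keywords for i in range(len(words) - n + 1))


text = "mining of large data sets"
print(broad_match("data mining", text))  # True
print(exact_match("data mining", text))  # False
```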
Double-blind experiment: an experimental procedure in which neither the subjects of the experiment nor the persons administering it know the critical aspects of the experiment. A double-blind procedure is used to guard against both experimenter bias and placebo effects.
tf-idf: term frequency–inverse document frequency is a numerical statistic that reflects how important a word is to a document in a collection or corpus. The number of times a term occurs in a document is called its term frequency. Tf-idf is the product of two statistics, term frequency and inverse document frequency:

$\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \times \mathrm{idf}(t, D)$ ; $\mathrm{idf}(t, D) = \log\dfrac{|D|}{|\{d \in D : t \in d\}| + 1}$

where tf(t,d) can be the raw frequency of term t in document d, |D| is the number of documents in the corpus, and the denominator is the number of documents in which t occurs, plus 1 (the added 1 avoids division by zero for terms that occur in no document).
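The formula can be sketched directly in Python. This is a minimal illustration that uses raw frequency for tf and the "+1" form of idf given above; the function names and the tiny corpus are assumptions, and libraries often use slightly different smoothing.

```python
import math


def tf(term, doc):
    """Raw term frequency: how many times `term` occurs in the token list `doc`."""
    return doc.count(term)


def idf(term, docs):
    """log(|D| / (number of documents containing term + 1))."""
    doc_freq = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (doc_freq + 1))


def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


# Hypothetical corpus of four tokenized documents.
docs = [
    ["data", "mining", "data"],
    ["web", "graph"],
    ["random", "walk"],
    ["monte", "carlo"],
]
score = tfidf("data", docs[0], docs)  # 2 * log(4 / 2)
print(round(score, 3))  # 1.386
```

Note that with this variant a term occurring in every document gets a negative idf, since |D| / (|D| + 1) is less than 1.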
Random walk: A random walk is a mathematical formalization of a path that consists of a succession of random steps.
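A minimal sketch of the simplest case, a one-dimensional walk on the integers where each step is +1 or -1 with equal probability. The function name and the `seed` parameter are assumptions added for reproducibility.

```python
import random


def random_walk(steps, seed=None):
    """Return the list of positions visited by a 1-D walk starting at 0."""
    rng = random.Random(seed)
    position = 0
    path = [position]
    for _ in range(steps):
        position += rng.choice((-1, 1))  # one random step left or right
        path.append(position)
    return path


path = random_walk(10, seed=42)
print(path)
```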
Monte Carlo methods: also known as statistical simulation methods; they rely on repeated random sampling, in contrast to deterministic methods.
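A classic illustration, sketched here under the assumption that a uniform pseudo-random generator is good enough: estimate pi by sampling points in the unit square and counting how many fall inside the quarter circle. The function name and parameters are hypothetical.

```python
import random


def estimate_pi(samples, seed=None):
    """Estimate pi as 4 * (fraction of random points inside the quarter circle)."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples


print(estimate_pi(100_000, seed=1))
```

The result varies with the seed and gets closer to pi only slowly as the sample count grows, which is the typical trade-off against a deterministic method.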
Heuristic: refers to experience-based techniques for problem solving, learning, and discovery. Examples of this method include using a rule of thumb, an educated guess, an intuitive judgment, or common sense. The most fundamental heuristic is trial and error.
Recall and precision: precision is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved. E.g.: suppose a program for recognizing dogs in scenes identifies 7 dogs in a scene containing 9 dogs and some cats. If 4 of the identifications are correct, but 3 are actually cats, the program's precision is 4/7 (about 0.57) while its recall is 4/9 (about 0.44).
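The arithmetic of the dog example can be checked with a short sketch. The function and parameter names are assumptions chosen for readability.

```python
def precision_recall(true_positives, retrieved, relevant):
    """precision = TP / retrieved, recall = TP / relevant."""
    precision = true_positives / retrieved
    recall = true_positives / relevant
    return precision, recall


# Dog example: 7 identifications, 4 of them correct, 9 dogs in the scene.
p, r = precision_recall(true_positives=4, retrieved=7, relevant=9)
print(f"precision={p:.3f} recall={r:.3f}")
# precision=0.571 recall=0.444
```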
1 kilogram equals approximately 2.2 pounds.
CMS: Content Management System.
SERP: Search Engine Results Page, the page of results returned by a search engine in response to a keyword query.
Web graph: the directed graph formed by taking all World Wide Web pages as nodes and the hyperlinks between them as edges.
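As a small sketch, a web graph can be held as an adjacency list that maps each page to the pages it links to. The page names below are made up, and `in_degrees` is a hypothetical helper that counts incoming links.

```python
# Hypothetical three-page web: each key links to the pages in its list.
web_graph = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
}


def in_degrees(graph):
    """Count incoming hyperlinks for every page in the graph."""
    counts = {page: 0 for page in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


print(in_degrees(web_graph))
# {'a.html': 1, 'b.html': 1, 'c.html': 2}
```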
Ad-hoc & A priori:
- Ad-hoc: generally signifies a solution designed for a specific problem or task, non-generalizable, and not intended to be adapted to other purposes.
- A priori: a priori knowledge or justification is independent of experience.
Heuristic: refers to experience-based techniques for problem solving, learning, and discovery that give a solution which is not guaranteed to be optimal. Where the exhaustive search is impractical, heuristic methods are used to speed up the process of finding a satisfactory solution via mental shortcuts to ease the cognitive load of making a decision. Examples of this method include using a rule of thumb, an educated guess, an intuitive judgment, stereotyping, or common sense.
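A small sketch of a heuristic that is fast but not guaranteed to be optimal: greedy coin change, which always takes the largest coin that still fits. The coin set below is a hypothetical one chosen so that the greedy answer is worse than the best one.

```python
def greedy_change(amount, coins):
    """Rule of thumb: repeatedly take the largest coin that still fits."""
    result = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            amount -= coin
            result.append(coin)
    return result


print(greedy_change(6, [1, 3, 4]))
# [4, 1, 1]
```

The greedy rule uses three coins here, while the optimal answer is two coins (3 + 3). Finding the optimum in general requires a search such as dynamic programming, which the heuristic trades away for speed.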
True positive, etc:

Positive = identified and negative = rejected.

Therefore:
True positive = correctly identified
False positive = incorrectly identified
True negative = correctly rejected
False negative = incorrectly rejected
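These four outcomes can be tallied from paired predictions and ground-truth labels. The sketch below assumes boolean labels where True means "identified" (for predictions) or "actually positive" (for ground truth); the function name is hypothetical.

```python
from collections import Counter


def confusion_counts(predicted, actual):
    """Count TP, FP, TN, and FN over paired boolean labels."""
    counts = Counter()
    for pred, truth in zip(predicted, actual):
        if pred and truth:
            counts["TP"] += 1  # correctly identified
        elif pred and not truth:
            counts["FP"] += 1  # incorrectly identified
        elif not pred and not truth:
            counts["TN"] += 1  # correctly rejected
        else:
            counts["FN"] += 1  # incorrectly rejected
    return counts


predicted = [True, True, False, False, True]
actual = [True, False, False, True, True]
print(dict(confusion_counts(predicted, actual)))
# {'TP': 2, 'FP': 1, 'TN': 1, 'FN': 1}
```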

A power law relationship between two quantities x and y can be written as y = ax^k, where a and k are constants.
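Taking logarithms of both sides gives log y = log a + k log x, so a power law appears as a straight line with slope k on a log-log plot. The sketch below uses that fact to recover a and k by ordinary least squares on the logged values; it assumes all x and y are positive, and the function name is hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log y = log a + k log x; returns (a, k)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    k = (sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
         / sum((u - mean_x) ** 2 for u in log_x))
    a = math.exp(mean_y - k * mean_x)
    return a, k


xs = [1, 2, 4, 8]
ys = [3 * x ** 2 for x in xs]  # synthetic data with a = 3, k = 2
a, k = fit_power_law(xs, ys)
print(round(a, 6), round(k, 6))
```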
A priori and a posteriori: Latin for "from the earlier" and "from the later", respectively.
s.t.: abbreviation for "such that" or "subject to".