Package org.apache.mahout.classifier.sgd
I. Interface Summary
1. Interface Gradient
Provides the ability to inject a gradient into the SGD logistic
regression. A typical use is to optimize a ranking score such as
AUC instead of a normal loss function.
2. Interface PriorFunction
A prior is used to regularize the learning algorithm. This allows a
trade-off to be made between complexity of the model being learned
and the accuracy with which the model fits the training data. There
are different definitions of complexity which can be approximated
using different priors. For large sparse systems, such as text
classification, the L1 prior, which favors sparse models, is often
used.
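To make the sparsity effect of an L1 prior concrete, here is a minimal sketch. The class and method names (L1PriorSketch, logP, shrink) and the parameter lambda are hypothetical and chosen for illustration; this is not the Mahout PriorFunction source. It assumes the prior is applied as a shrinkage step that clamps a weight to zero whenever the step would flip its sign.

```java
/** Hypothetical sketch of an L1 (Laplace) prior; names are illustrative only. */
final class L1PriorSketch {
    /** Log-density of the prior up to an additive constant: -|beta|. */
    static double logP(double beta) {
        return -Math.abs(beta);
    }

    /**
     * Moves a weight toward zero by learningRate * lambda and clamps it
     * to exactly zero if the step would change its sign.
     */
    static double shrink(double beta, double learningRate, double lambda) {
        double shrunk = beta - Math.signum(beta) * learningRate * lambda;
        return shrunk * beta < 0 ? 0.0 : shrunk;
    }

    public static void main(String[] args) {
        System.out.println(shrink(0.5, 0.1, 1.0));   // large weight: reduced in magnitude
        System.out.println(shrink(0.05, 0.1, 1.0));  // small weight: clamped to zero
        System.out.println(shrink(-0.5, 0.1, 1.0));  // negative weight: moves up toward zero
    }
}
```

Because small weights are set to exactly zero rather than merely made smaller, repeated shrinkage of this kind tends to leave many coefficients at zero, which is the sparse-model behavior the description refers to.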
3. Interface RecordFactory
A record factory knows how to convert a line of data into
fields and then into a vector.
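The two-stage conversion (line to fields, fields to vector) can be sketched as follows. This is an illustration under simplifying assumptions, not the Mahout RecordFactory interface: the class and method names are hypothetical, fields are comma-separated without quoting, and every field is numeric.

```java
import java.util.Arrays;

/** Hypothetical sketch: line of text -> fields -> dense numeric vector. */
final class RecordFactorySketch {
    /** Splits a comma-separated line into raw fields (no quoting support). */
    static String[] toFields(String line) {
        return line.split(",", -1);
    }

    /** Parses each field as a double; assumes all fields are numeric. */
    static double[] toVector(String[] fields) {
        double[] vector = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            vector[i] = Double.parseDouble(fields[i].trim());
        }
        return vector;
    }

    public static void main(String[] args) {
        double[] v = toVector(toFields("1.5, 2, 3"));
        System.out.println(Arrays.toString(v));
    }
}
```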
II. Class Summary
1 public abstract class AbstractOnlineLogisticRe
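A generic definition of a logistic regression classifier that
returns probabilities for a feature vector. The classifier uses a
1-of-(n-1) coding in which the 0-th category is not stored
explicitly.
Provides the SGD-based algorithm for learning a logistic regression
classifier, but leaves out all handling of learning rates. Any
extension of this abstract class must define the overall and
per-term annealing itself.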
All Implemented Interfaces:
OnlineLearner
Direct Known Subclasses:
OnlineLogisticRegression
1.1 lambda
public AbstractOnlineLogisticRe
Parameters:
lambda - New value of lambda, the weighting factor for the prior
distribution.
Returns:
This, so other configurations can be chained.
1.2 link
public Vector link(Vector v)
Computes the inverse link function, by default the logistic link
function.
Parameters:
v - The output of the linear combination in a GLM. Note that v is
modified in place.
Returns:
A version of v with the link function applied.
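As an illustration of a logistic inverse link over a score vector, the sketch below maps n-1 linear scores to n-1 probabilities, treating the unstored 0-th category as having score 0. It uses a plain double[] instead of a Mahout Vector, and the class name is hypothetical; the max-shift for numerical stability is an assumption of this sketch, not a statement about the library's code. Like the documented method, it overwrites its argument.

```java
/** Hypothetical sketch of a multinomial logistic inverse link on n-1 scores. */
final class VectorLinkSketch {
    /**
     * Replaces each score v[i] with exp(v[i]) / (1 + sum_j exp(v[j])).
     * The input array is modified in place and returned.
     */
    static double[] link(double[] v) {
        double max = Double.NEGATIVE_INFINITY;
        for (double x : v) {
            max = Math.max(max, x);
        }
        // Shift by max(0, max) so no exponent is positive; category 0 has score 0.
        double shift = Math.max(0.0, max);
        double sum = Math.exp(-shift); // contribution of the implicit category 0
        for (int i = 0; i < v.length; i++) {
            v[i] = Math.exp(v[i] - shift);
            sum += v[i];
        }
        for (int i = 0; i < v.length; i++) {
            v[i] /= sum;
        }
        return v;
    }

    public static void main(String[] args) {
        double[] p = link(new double[] {0.0, 0.0});
        System.out.println(p[0] + " " + p[1]); // three equally likely categories
    }
}
```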
1.3 link
public double link(double r)
Computes the binomial logistic inverse link function.
Parameters:
r - The value to transform.
Returns:
The logistic function of r, that is 1 / (1 + exp(-r)).
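For the two-category case the inverse link is the logistic function 1 / (1 + exp(-r)). A minimal sketch follows; the class name is hypothetical, and the branch on the sign of r is one common way to avoid overflow in exp, offered here as an assumption rather than as the library's exact code.

```java
/** Hypothetical sketch of the scalar logistic inverse link. */
final class ScalarLinkSketch {
    /** Returns 1 / (1 + exp(-r)) without overflowing for large |r|. */
    static double link(double r) {
        if (r < 0.0) {
            double s = Math.exp(r);   // r < 0, so s is in (0, 1)
            return s / (1.0 + s);
        }
        double s = Math.exp(-r);      // r >= 0, so s is in (0, 1]
        return 1.0 / (1.0 + s);
    }

    public static void main(String[] args) {
        System.out.println(link(0.0));
        System.out.println(link(4.0));
        System.out.println(link(-4.0));
    }
}
```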
1.4 classifyNoLink
public Vector classifyNoLink(Vector instance)
Description copied from class: AbstractVectorClassifier
Classify a vector, but don't apply the inverse link function. For
logistic regression and other generalized linear models, this is
just the linear part of the classification.
Overrides:
classifyNoLink in class AbstractVectorClassifier
Parameters:
instance - A feature vector to be classified.
Returns:
A vector of scores. If transformed by the link function, these will
become probabilities.
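The "linear part" can be pictured as one dot product per stored category. The sketch below assumes the coefficients are held as an (n-1)-by-d array named beta, which is a hypothetical layout chosen for illustration; it uses plain arrays rather than Mahout types.

```java
/** Hypothetical sketch: linear scores for the n-1 stored categories. */
final class LinearScoresSketch {
    /** Returns beta * x, one score per row of beta, with no link applied. */
    static double[] classifyNoLink(double[][] beta, double[] x) {
        double[] scores = new double[beta.length];
        for (int i = 0; i < beta.length; i++) {
            double dot = 0.0;
            for (int j = 0; j < x.length; j++) {
                dot += beta[i][j] * x[j];
            }
            scores[i] = dot;
        }
        return scores;
    }

    public static void main(String[] args) {
        double[][] beta = {{1.0, 2.0}, {0.0, -1.0}};
        double[] scores = classifyNoLink(beta, new double[] {3.0, 4.0});
        System.out.println(scores[0] + " " + scores[1]);
    }
}
```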
1.5 classifyScalarNoLink
public double classifyScalarNoLink(Vector instance)
1.6 classify
public Vector classify(Vector instance)
Returns n-1 probabilities, one for each category but the 0-th. The
probability of the 0-th category is 1 - sum(this result).
Specified by:
classify in class AbstractVectorClassifier
Parameters:
instance - A vector of features to be classified.
Returns:
A vector of probabilities, one for each of the first n-1
categories.
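Since the 0-th probability is implied rather than returned, a caller who needs all n values can rebuild them. The helper below is a hypothetical illustration of that arithmetic on plain arrays, not a Mahout method.

```java
/** Hypothetical sketch: expand n-1 probabilities into a full distribution. */
final class FullDistributionSketch {
    /** Prepends p0 = 1 - sum(pRest) to the n-1 given probabilities. */
    static double[] fullDistribution(double[] pRest) {
        double[] full = new double[pRest.length + 1];
        double sum = 0.0;
        for (int i = 0; i < pRest.length; i++) {
            full[i + 1] = pRest[i];
            sum += pRest[i];
        }
        full[0] = 1.0 - sum;
        return full;
    }

    public static void main(String[] args) {
        double[] full = fullDistribution(new double[] {0.2, 0.3});
        System.out.println(java.util.Arrays.toString(full));
    }
}
```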
1.7 classifyScalar
public double classifyScalar(Vector instance)
Returns a single scalar probability in the case where we have two
categories. Using this method avoids one vector allocation compared
with calling classify(), or two vector allocations compared with
classifyFull().
Specified by:
classifyScalar in class AbstractVectorClassifier
Parameters:
instance - The vector of features to be classified.
Returns:
The probability of the first of two categories.
Throws:
java.lang.IllegalArgumentException
See Also:
AbstractVectorClassifier
1.8 train
public void train(long trackingKey,
Updates the model using a particular target variable value and a
feature vector.
There may be an assumption that, if multiple passes through the
training data are necessary, the tracking key for a record will be
the same on each pass, that there will be a relatively large number
of distinct tracking keys, and that the low-order bits of the
tracking keys will not correlate with any of the input variables.
This tracking key is used to assign training examples to different
test/training splits.
Examples of useful tracking keys include id numbers for the training
records, derived from a database id for the base table from which
the record is derived, or the offset of the original data record in
a data file.
Specified by:
train in interface OnlineLearner
Parameters:
trackingKey - The tracking key for this training example.
groupKey - An optional value that allows examples to be grouped in
the computation of the update to the model.
actual - The value of the target variable. This value should be in
the half-open interval [0..n) where n is the number of target
categories.
instance - The feature vector for this example.
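To show what "updating the model" with one target value and one feature vector can look like, here is a minimal sketch of a single SGD step for the two-category case. It makes several simplifying assumptions that are not taken from the Mahout source: a fixed learning rate, no prior, no annealing, and no use of the tracking or group keys. All names are hypothetical.

```java
/** Hypothetical sketch of one SGD update for binary logistic regression. */
final class SgdStepSketch {
    /** Probability of category 1 under the current weights. */
    static double predict(double[] beta, double[] x) {
        double dot = 0.0;
        for (int j = 0; j < x.length; j++) {
            dot += beta[j] * x[j];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    /** Moves beta along the log-likelihood gradient: rate * (actual - p) * x. */
    static void train(double[] beta, int actual, double[] x, double rate) {
        double error = actual - predict(beta, x);
        for (int j = 0; j < x.length; j++) {
            beta[j] += rate * error * x[j];
        }
    }

    public static void main(String[] args) {
        double[] beta = new double[2];
        double[] x = {1.0, 2.0};
        train(beta, 1, x, 0.1);
        System.out.println(predict(beta, x)); // rises above 0.5 after one positive example
    }
}
```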
1.9 train
public void train(long trackingKey,
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a
feature vector.
There may be an assumption that, if multiple passes through the
training data are necessary, the tracking key for a record will be
the same on each pass, that there will be a relatively large number
of distinct tracking keys, and that the low-order bits of the
tracking keys will not correlate with any of the input variables.
This tracking key is used to assign training examples to different
test/training splits.
Examples of useful tracking keys include id numbers for the training
records, derived from a database id for the base table from which
the record is derived, or the offset of the original data record in
a data file.
Specified by:
train in interface OnlineLearner
Parameters:
trackingKey - The tracking key for this training example.
actual - The value of the target variable. This value should be in
the half-open interval [0..n) where n is the number of target
categories.
instance - The feature vector for this example.
1.10 train
public void train(int actual,
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a
feature vector.
There may be an assumption that, if multiple passes through the
training data are necessary, the training examples will be
presented in the same order. This is because the order of training
examples may be used to assign records to different data splits for
evaluation by cross-validation. Without this order invariance, a
record might be assigned to a training split on one pass and a test
split on another, and error estimates could be seriously affected.
If re-ordering is necessary, use the alternative API, which allows a
tracking key to be added to the training example.
Specified by:
train in interface OnlineLearner
Parameters:
actual - The value of the target variable. This value should be in
the half-open interval [0..n) where n is the number of target
categories.
instance - The feature vector for this example.
2 AdaptiveLogisticRegressi
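Maintains a pool of ordinary OnlineLogisticRegression learners.
One wrinkle is that the pool actually holds CrossFoldLearner objects
(each containing several OnlineLogisticRegression objects).
These pools allow performance to be estimated even when many passes
are made over the data. If you already have good parameters, you may
prefer to run a single CrossFoldLearner with those settings.
The fitness measure used here is AUC. The use of AUC means that
non-binary cases can be handled only by extending OnlineAuc.
Constructor: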
public AdaptiveLogisticRegressi
Method Summary
2.1 train
public void train(int actual,
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a
feature vector.
There may be an assumption that, if multiple passes through the
training data are necessary, the training examples will be
presented in the same order. This is because the order of training
examples may be used to assign records to different data splits for
evaluation by cross-validation. Without this order invariance, a
record might be assigned to a training split on one pass and a test
split on another, and error estimates could be seriously affected.
If re-ordering is necessary, use the alternative API, which allows a
tracking key to be added to the training example.
Specified by:
train in interface OnlineLearner
Parameters:
actual - The value of the target variable. This value should be in
the half-open interval [0..n) where n is the number of target
categories.
instance - The feature vector for this example.
2.2 train
public void train(long trackingKey,
Updates the model using a particular target variable value and a
feature vector.
There may be an assumption that, if multiple passes through the
training data are necessary, the tracking key for a record will be
the same on each pass, that there will be a relatively large number
of distinct tracking keys, and that the low-order bits of the
tracking keys will not correlate with any of the input variables.
This tracking key is used to assign training examples to different
test/training splits.
Examples of useful tracking keys include id numbers for the training
records, derived from a database id for the base table from which
the record is derived, or the offset of the original data record in
a data file.
Specified by:
train in interface OnlineLearner
Parameters:
trackingKey - The tracking key for this training example.
actual - The value of the target variable. This value should be in
the half-open interval [0..n) where n is the number of target
categories.
instance - The feature vector for this example.
2.3 train
public void train(long trackingKey,
Updates the model using a particular target variable value and a
feature vector.
There may be an assumption that, if multiple passes through the
training data are necessary, the tracking key for a record will be
the same on each pass, that there will be a relatively large number
of distinct tracking keys, and that the low-order bits of the
tracking keys will not correlate with any of the input variables.
This tracking key is used to assign training examples to different
test/training splits.
Examples of useful tracking keys include id numbers for the training
records, derived from a database id for the base table from which
the record is derived, or the offset of the original data record in
a data file.
Specified by:
train in interface OnlineLearner
Parameters:
trackingKey - The tracking key for this training example.
groupKey - An optional value that allows examples to be grouped in
the computation of the update to the model.
actual - The value of the target variable. This value should be in
the half-open interval [0..n) where n is the number of target
categories.
instance - The feature vector for this example.
2.4 auc
public double auc()
Returns the AUC of the current best member of the population. If
no member is best, usually because no training has been done yet,
the result is NaN.
Returns:
The AUC of the best member of the population or NaN if we can't
figure that out.
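For readers unfamiliar with AUC, the sketch below computes it exactly for a small batch by comparing every positive score with every negative score, and returns NaN when no such pair exists. This is a swapped-in batch computation for illustration only; it is not the online estimate that a class such as OnlineAuc maintains, and all names are hypothetical.

```java
/** Hypothetical sketch: exact pairwise AUC for binary labels (0 or 1). */
final class AucSketch {
    /**
     * Fraction of (positive, negative) pairs in which the positive example
     * scores higher; ties count as half. Returns NaN if there are no pairs.
     */
    static double auc(double[] scores, int[] labels) {
        double wins = 0.0;
        long pairs = 0;
        for (int i = 0; i < scores.length; i++) {
            if (labels[i] != 1) {
                continue;
            }
            for (int j = 0; j < scores.length; j++) {
                if (labels[j] != 0) {
                    continue;
                }
                pairs++;
                if (scores[i] > scores[j]) {
                    wins += 1.0;
                } else if (scores[i] == scores[j]) {
                    wins += 0.5;
                }
            }
        }
        return pairs == 0 ? Double.NaN : wins / pairs;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.3, 0.1};
        int[] labels = {1, 0, 1, 0};
        System.out.println(auc(scores, labels));
    }
}
```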
3 Class CrossFoldLearner
public class CrossFoldLearner extends AbstractVectorClassifier
Does cross-fold validation of log-likelihood and AUC on several
online logistic regression models. Each record is passed to all but
one of the models for training and to the remaining model for
evaluation. In order to maintain proper segregation between the
different folds across training data iterations, data should either
be passed to this learner in the same order each time the training
data is traversed or a tracking key such as the file offset of the
training record should be passed with each training example.
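The "all but one" routing can be sketched with a tracking key reduced modulo the number of folds. This is an assumption about one reasonable way to do the assignment, with hypothetical names; it is not copied from CrossFoldLearner. It also shows why a stable key matters: the same key always selects the same held-out model, on every pass.

```java
/** Hypothetical sketch: route a record to one evaluation fold and n-1 training folds. */
final class CrossFoldRoutingSketch {
    /** Index of the model that only evaluates this record; stable per key. */
    static int heldOutFold(long trackingKey, int folds) {
        return (int) Math.floorMod(trackingKey, (long) folds);
    }

    /** True if the given model should train on the record with this key. */
    static boolean trainsOn(int model, long trackingKey, int folds) {
        return model != heldOutFold(trackingKey, folds);
    }

    public static void main(String[] args) {
        int folds = 5;
        long key = 7L; // for example, a file offset or database id
        for (int model = 0; model < folds; model++) {
            String role = trainsOn(model, key, folds) ? "train" : "evaluate";
            System.out.println("model " + model + ": " + role);
        }
    }
}
```

Because the fold is chosen from the low-order part of the key, keys whose low-order bits correlate with an input variable would bias the splits.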
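This article covers the core components and implementation details of SGD logistic regression in Apache Mahout: the gradient-injection interface, the regularizing prior function and the record factory interfaces, and the properties and methods of the abstract online logistic regression classifier, such as computing probabilities for feature vectors, chained configuration options and the binomial inverse link function. It also looks at where AdaptiveLogisticRegression and CrossFoldLearner apply.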