org.apache.mahout.clas… in Mahout

License: CC 4.0 BY-SA


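This article covers the core components and implementation details of SGD logistic regression in Apache Mahout: the gradient injection interface, the prior function used for regularization, and the record factory, followed by the abstract online logistic regression classifier and its methods, including probability computation for feature vectors, chainable configuration options, and the binomial inverse link function. It also looks at where AdaptiveLogisticRegression and CrossFoldLearner are used.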

Package org.apache.mahout.classifier.sgd
I. Interface Summary
1. Interface Gradient
Provides the ability to inject a gradient into the SGD logistic regression. A typical use is to optimize a ranking score such as AUC instead of a normal loss function.
2. Interface PriorFunction
A prior is used to regularize the learning algorithm. This allows a trade-off between the complexity of the model being learned and the accuracy with which the model fits the training data. There are different definitions of complexity, which can be approximated using different priors. For large sparse systems, such as text classification, the L1 prior is often used because it favors sparse models.
3. Interface RecordFactory
A record factory understands how to convert a line of data into fields and then into a vector.
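To make the effect of an L1-style prior (item 2 above) more concrete, the following is a minimal sketch in plain Java. It does not use Mahout's PriorFunction interface; the class name, the shrink method, and its parameters are hypothetical and chosen only for illustration. The idea shown is that each weight is pulled toward zero by a fixed step, and a weight that would cross zero is clamped to exactly zero, which is why this kind of penalty tends to produce sparse models.

```java
public class L1PriorSketch {
    // Hypothetical helper, not part of Mahout: shrink a weight toward zero by
    // a fixed step (learningRate * lambda) and clamp it at zero on overshoot.
    public static double shrink(double weight, double learningRate, double lambda) {
        double step = learningRate * lambda;
        double shrunk = weight - Math.signum(weight) * step;
        // A sign change means the penalty overshot zero, so the weight is dropped.
        return shrunk * weight < 0 ? 0.0 : shrunk;
    }

    public static void main(String[] args) {
        double[] weights = {0.5, -0.03, 0.0, -1.2};
        for (double w : weights) {
            System.out.println(w + " -> " + shrink(w, 0.1, 0.5));
        }
    }
}
```

Small weights such as -0.03 end up at exactly zero, while larger weights are only reduced in magnitude.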
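II. Class Summary
1 public abstract class AbstractOnlineLogisticRegression extends AbstractVectorClassifier implements OnlineLearner
Generic definition of a logistic regression classifier that returns probabilities for a feature vector. The classifier uses 1-of-(n-1) coding, in which the 0-th category is not stored explicitly.
Provides the SGD-based algorithm for learning a logistic regression classifier, but omits all annealing of the learning rates. Any extension of this abstract class must define the overall and per-term annealing for itself.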
All Implemented Interfaces:
OnlineLearner
Direct Known Subclasses:
OnlineLogisticRegression
1.1 lambda
public AbstractOnlineLogisticRegression lambda(double lambda)
Chainable configuration option.

Parameters:
lambda - New value of lambda, the weighting factor for the prior distribution.
Returns:
This, so other configurations can be chained.
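The "chainable" style described here simply means that each setter returns the object itself. A minimal sketch of the pattern in plain Java follows; the class ChainableConfigSketch, its learningRate option, and both default values are hypothetical stand-ins and are not taken from Mahout.

```java
public class ChainableConfigSketch {
    private double lambda = 1.0e-5;     // hypothetical default
    private double learningRate = 1.0;  // hypothetical default

    // Each setter returns this, so calls can be chained.
    public ChainableConfigSketch lambda(double lambda) {
        this.lambda = lambda;
        return this;
    }

    public ChainableConfigSketch learningRate(double learningRate) {
        this.learningRate = learningRate;
        return this;
    }

    public double getLambda() {
        return lambda;
    }

    public double getLearningRate() {
        return learningRate;
    }

    public static void main(String[] args) {
        ChainableConfigSketch config = new ChainableConfigSketch().lambda(0.01).learningRate(0.5);
        System.out.println(config.getLambda() + " " + config.getLearningRate());
    }
}
```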
1.2 link
public Vector link(Vector v)
Computes the inverse link function, by default the logistic link function.

Parameters:
v - The output of the linear combination in a GLM. Note that the value of v is disturbed.
Returns:
A version of v with the link function applied.
1.3 link
public double link(double r)
Computes the binomial logistic inverse link function.

Parameters:
r - The value to transform.
Returns:
The logistic function (inverse logit) of r, that is, 1 / (1 + exp(-r)).
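For the binomial case, the inverse link maps a real-valued score r to a probability through the logistic function 1 / (1 + exp(-r)). The sketch below is a plain Java illustration under that assumption, not a copy of Mahout's implementation; the class and method names are hypothetical. It branches on the sign of r so that Math.exp is only ever called with a non-positive argument, which avoids overflow for large |r|.

```java
public class LogisticLinkSketch {
    // Numerically stable logistic function: exp() is only evaluated on
    // non-positive arguments, so it cannot overflow.
    public static double logistic(double r) {
        if (r < 0.0) {
            double s = Math.exp(r);
            return s / (1.0 + s);
        } else {
            double s = Math.exp(-r);
            return 1.0 / (1.0 + s);
        }
    }

    public static void main(String[] args) {
        for (double r : new double[] {-1000.0, -2.0, 0.0, 2.0, 1000.0}) {
            System.out.println(r + " -> " + logistic(r));
        }
    }
}
```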
1.4 classifyNoLink
public Vector classifyNoLink(Vector instance)
Description copied from class: AbstractVectorClassifier
Classify a vector, but don't apply the inverse link function. For logistic regression and other generalized linear models, this is just the linear part of the classification.

Overrides:
classifyNoLink in class AbstractVectorClassifier
Parameters:
instance - A feature vector to be classified.
Returns:
A vector of scores. If transformed by the link function, these will become probabilities.
1.5 classifyScalarNoLink
public double classifyScalarNoLink(Vector instance)
1.6 classify
public Vector classify(Vector instance)
Returns n-1 probabilities, one for each category but the 0-th. The probability of the 0-th category is 1 - sum(this result).

Specified by:
classify in class AbstractVectorClassifier
Parameters:
instance - A vector of features to be classified.
Returns:
A vector of probabilities, one for each of the first n-1 categories.
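The relationship between the n-1 returned probabilities and the implicit 0-th category can be sketched in plain Java. This is an illustration under an assumption, namely that the 0-th category acts as a reference category with a linear score fixed at 0, as in standard multinomial logistic regression; the class and method names are hypothetical, and arrays stand in for Mahout's Vector type.

```java
import java.util.Arrays;

public class ReducedCodingSketch {
    // scores holds the linear scores of categories 1..n-1; category 0 is
    // assumed to have an implicit score of 0 and is not stored.
    public static double[] link(double[] scores) {
        double max = 0.0; // start from the implicit score of category 0
        for (double s : scores) {
            max = Math.max(max, s);
        }
        // Subtracting max before exponentiating keeps exp() from overflowing.
        double denominator = Math.exp(-max); // contribution of category 0
        double[] probabilities = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            probabilities[i] = Math.exp(scores[i] - max);
            denominator += probabilities[i];
        }
        for (int i = 0; i < probabilities.length; i++) {
            probabilities[i] /= denominator;
        }
        return probabilities;
    }

    // The probability of the 0-th category is whatever mass is left over.
    public static double categoryZero(double[] probabilities) {
        double sum = 0.0;
        for (double p : probabilities) {
            sum += p;
        }
        return 1.0 - sum;
    }

    public static void main(String[] args) {
        double[] p = link(new double[] {0.0, 0.0});
        System.out.println(Arrays.toString(p) + " p0=" + categoryZero(p));
    }
}
```

With all scores equal to 0, the three categories (two stored, one implicit) each receive a probability of one third.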
1.7 classifyScalar
public double classifyScalar(Vector instance)
Returns a single scalar probability in the case where we have two categories. Using this method avoids an extra vector allocation as opposed to calling classify() or an extra two vector allocations relative to classifyFull().
Specified by:
classifyScalar in class AbstractVectorClassifier
Parameters:
instance - The vector of features to be classified.
Returns:
The probability of the first of two categories.
Throws:
java.lang.IllegalArgumentException - If the classifier doesn't have two categories.
See Also:
AbstractVectorClassifier.classify(Vector)
1.8 train
public void train(long trackingKey,
                  java.lang.String groupKey,
                  int actual,
                  Vector instance)
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a feature vector.
There may be an assumption that, if multiple passes through the training data are necessary, the tracking key for a record will be the same on each pass, that there will be a relatively large number of distinct tracking keys, and that the low-order bits of the tracking keys will not correlate with any of the input variables. The tracking key is used to assign training examples to different test/training splits.

Specified by:
train in interface OnlineLearner
Parameters:
trackingKey - The tracking key for this training example.
groupKey - An optional value that allows examples to be grouped in the computation of the update to the model.
actual - The value of the target variable. This value should be in the half-open interval [0..n) where n is the number of target categories.
instance - The feature vector for this example.
1.9 train
public void train(long trackingKey,
int actual,
Vector instance)
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a feature vector.
There may be an assumption that, if multiple passes through the training data are necessary, the tracking key for a record will be the same on each pass, that there will be a relatively large number of distinct tracking keys, and that the low-order bits of the tracking keys will not correlate with any of the input variables. The tracking key is used to assign training examples to different test/training splits.

Examples of useful tracking keys include ID numbers for the training records, derived from a database ID for the base table from which the record is derived, or the offset of the original data record in a data file.
Specified by:
train in interface OnlineLearner
Parameters:
trackingKey - The tracking key for this training example.
actual - The value of the target variable. This value should be in the half-open interval [0..n) where n is the number of target categories.
instance - The feature vector for this example.
1.10 train
public void train(int actual,
Vector instance)
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a feature vector.
There may be an assumption that, if multiple passes through the training data are necessary, the training examples will be presented in the same order each time. This is because the order of training examples may be used to assign records to different data splits for evaluation by cross-validation. Without this order invariance, the same record might land in a training split on one pass and in a test split on another, and error estimates could be seriously affected.

If re-ordering is necessary, the alternative API, which allows a tracking key to be attached to each training example, can be used.

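2 AdaptiveLogisticRegression
Maintains a pool of ordinary OnlineLogisticRegression learners, each member of which has a different learning rate.
One wrinkle is that the pool actually holds CrossFoldLearners, each of which contains several OnlineLogisticRegression objects.
These pools allow performance to be estimated even when many passes are made over the data. If you already have good parameters, you may prefer to run a single CrossFoldLearner with those settings.
The fitness measure used here is AUC. Using AUC means that AdaptiveLogisticRegression is best suited to classification problems with a binary target variable.
Non-binary cases could be handled by extending OnlineAuc.
Constructor: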
public AdaptiveLogisticRegression(int numCategories,
                                  int numFeatures,
                                  PriorFunction prior)
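Since AUC is the fitness measure mentioned above, a small worked example may help. The sketch below computes AUC for a binary target by direct pairwise comparison in plain Java. It is a batch illustration only, not the online estimate that Mahout's OnlineAuc maintains; the class and method names are hypothetical. It returns NaN when AUC is undefined because one of the two classes is missing.

```java
public class AucSketch {
    // AUC as the fraction of (positive, negative) pairs in which the positive
    // example receives the higher score; ties count as half a win.
    public static double auc(int[] labels, double[] scores) {
        double wins = 0.0;
        long pairs = 0;
        for (int i = 0; i < labels.length; i++) {
            if (labels[i] != 1) {
                continue;
            }
            for (int j = 0; j < labels.length; j++) {
                if (labels[j] != 0) {
                    continue;
                }
                pairs++;
                if (scores[i] > scores[j]) {
                    wins += 1.0;
                } else if (scores[i] == scores[j]) {
                    wins += 0.5;
                }
            }
        }
        // With no positive or no negative examples, AUC is undefined.
        return pairs == 0 ? Double.NaN : wins / pairs;
    }

    public static void main(String[] args) {
        int[] labels = {1, 0, 1, 0};
        double[] scores = {0.9, 0.1, 0.4, 0.6};
        System.out.println(auc(labels, scores)); // 3 of 4 pairs are ordered correctly: 0.75
    }
}
```

The quadratic pairwise loop is chosen for clarity; a sort-based computation would be preferable for large data sets.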
Method summary
2.1 train
public void train(int actual,
                  Vector instance)
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a feature vector.
There may be an assumption that, if multiple passes through the training data are necessary, the training examples will be presented in the same order each time. This is because the order of training examples may be used to assign records to different data splits for evaluation by cross-validation. Without this order invariance, the same record might land in a training split on one pass and in a test split on another, and error estimates could be seriously affected.

If re-ordering is necessary, the alternative API, which allows a tracking key to be attached to each training example, can be used.

Specified by:
train in interface OnlineLearner
Parameters:
actual - The value of the target variable. This value should be in the half-open interval [0..n) where n is the number of target categories.
instance - The feature vector for this example.
2.2 train
public void train(long trackingKey,
int actual,
Vector instance)
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a feature vector.
There may be an assumption that, if multiple passes through the training data are necessary, the tracking key for a record will be the same on each pass, that there will be a relatively large number of distinct tracking keys, and that the low-order bits of the tracking keys will not correlate with any of the input variables. The tracking key is used to assign training examples to different test/training splits.

Specified by:
train in interface OnlineLearner
Parameters:
trackingKey - The tracking key for this training example.
actual - The value of the target variable. This value should be in the half-open interval [0..n) where n is the number of target categories.
instance - The feature vector for this example.
2.3 train
public void train(long trackingKey,
                  java.lang.String groupKey,
                  int actual,
                  Vector instance)
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a feature vector.
There may be an assumption that, if multiple passes through the training data are necessary, the tracking key for a record will be the same on each pass, that there will be a relatively large number of distinct tracking keys, and that the low-order bits of the tracking keys will not correlate with any of the input variables. The tracking key is used to assign training examples to different test/training splits.

2.4 auc
public double auc()
Returns:
The AUC of the best member of the population, or NaN if it cannot be determined.

3 Class CrossFoldLearner
public class CrossFoldLearner extends AbstractVectorClassifier implements OnlineLearner
Does cross-fold validation of log-likelihood and AUC on several online logistic regression models. Each record is passed to all but one of the models for training and to the remaining model for evaluation. In order to maintain proper segregation between the different folds across training data iterations, data should either be passed to this learner in the same order each time the training data is traversed or a tracking key such as the file offset of the training record should be passed with each training example.
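The way a tracking key can keep folds separated across passes can be sketched in plain Java. This sketch assumes that the held-out model is selected by taking the tracking key modulo the number of folds; that rule, along with the class and method names, is an assumption made for illustration rather than a statement of Mahout's exact implementation. Because the same key always yields the same fold, a record is never used to train the model that evaluates it, regardless of the order in which records arrive.

```java
import java.util.ArrayList;
import java.util.List;

public class CrossFoldSketch {
    // Index of the model that evaluates (and does not train on) this record.
    // floorMod keeps the result non-negative even for negative keys.
    public static int heldOutFold(long trackingKey, int folds) {
        return (int) Math.floorMod(trackingKey, (long) folds);
    }

    // Indices of the models that train on this record: all but the held-out one.
    public static List<Integer> trainingFolds(long trackingKey, int folds) {
        int heldOut = heldOutFold(trackingKey, folds);
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < folds; i++) {
            if (i != heldOut) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int folds = 5;
        for (long key : new long[] {0L, 7L, 12L, -1L}) {
            System.out.println("key " + key + ": evaluate on model " + heldOutFold(key, folds)
                    + ", train models " + trainingFolds(key, folds));
        }
    }
}
```

This also shows why the low-order bits of the tracking key should not correlate with the input variables: such a correlation would place systematically different records in different folds.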