【Estimation of the Number of Clusters】PG-means: learning the number of clusters in data (NIPS): personal notes

I. Introduction

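Title: PG-means: learning the number of clusters in data
Venue: NIPS 2006
Task: Estimate the number of clusters $k$ in unlabeled data and cluster the data.
Idea: Start from a small $k$ (at least 1) and fit a Gaussian mixture model (GMM) to the raw unlabeled data. Project both the data and the GMM parameters (means and covariances) to one dimension, then draw samples from the projected GMM. Next, use the Kolmogorov-Smirnov (KS) test to check whether the projected data and the samples match. If they match, the iteration stops; otherwise set $k = k + 1$, refit the GMM, and repeat the projection, sampling, and test.
Code: GitHub
Note: Despite its name, PG-means differs from X-means and G-means, which build on $k$-means: it is based on a GMM.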
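The project, sample, and test round described in the Idea above can be sketched for a single projection as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, samples are stored as rows (an $n \times d$ array), the GMM parameters are assumed to be already fitted, and the accept/reject decision uses the standard large-sample critical value of the two-sample KS test, which may differ from what the paper or the linked code does.

```python
import numpy as np


def _unit(direction):
    """Return the projection direction scaled to unit length."""
    P = np.asarray(direction, dtype=float)
    return P / np.linalg.norm(P)


def project_gmm(weights, means, covs, direction):
    """Project GMM parameters onto one direction P.

    Each mean becomes P^T mu and each covariance becomes P^T Sigma P.
    """
    P = _unit(direction)
    means_1d = np.asarray(means, dtype=float) @ P                 # shape (k,)
    vars_1d = np.einsum("i,kij,j->k", P, np.asarray(covs, dtype=float), P)
    return np.asarray(weights, dtype=float), means_1d, vars_1d


def sample_projected_gmm(weights, means_1d, vars_1d, n_samples, rng):
    """Draw samples from the 1-D mixture defined by the projected parameters."""
    comp = rng.choice(len(weights), size=n_samples, p=weights)
    return rng.normal(means_1d[comp], np.sqrt(vars_1d[comp]))


def ks_two_sample(a, b, alpha=0.05):
    """Two-sample KS statistic with the asymptotic critical value (an approximation)."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    stat = float(np.max(np.abs(cdf_a - cdf_b)))
    crit = float(np.sqrt(-0.5 * np.log(alpha / 2.0))
                 * np.sqrt((a.size + b.size) / (a.size * b.size)))
    return stat, crit, bool(stat <= crit)


def one_projection_test(X, weights, means, covs, direction, alpha=0.05, rng=None):
    """Project data (rows are samples) and model, sample, and run the KS test."""
    rng = np.random.default_rng() if rng is None else rng
    w, m1, v1 = project_gmm(weights, means, covs, direction)
    x_1d = np.asarray(X, dtype=float) @ _unit(direction)
    samples = sample_projected_gmm(w, m1, v1, x_1d.size, rng)
    return ks_two_sample(x_1d, samples, alpha)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two well-separated clusters, tested against a k = 1 model.
    X = np.vstack([rng.normal([-10, 0], 1.0, size=(1000, 2)),
                   rng.normal([10, 0], 1.0, size=(1000, 2))])
    # For k = 1 the fitted GMM is essentially the sample mean and covariance.
    stat, crit, accept = one_projection_test(
        X, [1.0], [X.mean(axis=0)], [np.cov(X, rowvar=False)], [1.0, 0.0], rng=rng)
    print(f"KS statistic={stat:.3f}, critical value={crit:.3f}, accept={accept}")
```

In the full method this test would be repeated over several projections, and $k$ is increased whenever any projection is rejected; here the direction is fixed only to keep the example deterministic in its setup.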

PG-means
The figure shows the algorithm flow of PG-means; the next section describes it in detail.

II. Details

1. Algorithm steps

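Input: unlabeled data $\{\pmb X\}_{d\times n}$ ($n$ is the number of samples, $d$ is the sample dimension), confidence threshold $\alpha$, number of projections $p$
Output: the predicted number of clusters and the clustering result.
(1) Initialize $k=1$
(2) Fit a GMM with $k$ components to $\pmb X$; the GMM has $k$ means $\{\pmb\mu\}_{d\times 1}$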
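For reference, the model fitted in step (2) can be written in standard GMM notation as below. The mixing weights $\pi_j$, the covariances $\pmb\Sigma_j$, and the unit projection vector $\pmb P$ are symbols assumed here for illustration and are not taken from the text above. The second line states the general fact that a linear projection of a Gaussian mixture is again a Gaussian mixture, which is what makes the one-dimensional test possible:

$$
p(\pmb x) = \sum_{j=1}^{k} \pi_j \, \mathcal N(\pmb x \mid \pmb\mu_j, \pmb\Sigma_j), \qquad \sum_{j=1}^{k} \pi_j = 1
$$

$$
\pmb P^\top \pmb x \sim \sum_{j=1}^{k} \pi_j \, \mathcal N\!\left(\pmb P^\top \pmb\mu_j,\; \pmb P^\top \pmb\Sigma_j \pmb P\right)
$$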
