
J Pharmacol Pharmacother. 2011 Apr-Jun; 2(2): 140–142.

doi: 10.4103/0976-500X.81920

PMCID: PMC3127352

PMID: 21772786

Measures of central tendency: The mean

S. Manikandan



In any research, an enormous amount of data is collected and, to describe it meaningfully, one needs to summarise it. The bulk of the data can be reduced by organising it into a frequency table or histogram.[1] A frequency distribution organises the heap of data into a few meaningful categories. Collected data can also be summarised as a single index/value that represents the entire data set. These measures may also help in the comparison of data sets.


CENTRAL TENDENCY

Central tendency is defined as “the statistical measure that identifies a single value as representative of an entire distribution.”[2] It aims to provide an accurate description of the entire data. It is the single value that is most typical/representative of the collected data. The term “number crunching” is used to illustrate this aspect of data description. The mean, median and mode are the three commonly used measures of central tendency.


MEAN

Mean is the most commonly used measure of central tendency. There are different types of mean, viz. arithmetic mean, weighted mean, geometric mean (GM) and harmonic mean (HM). If mentioned without an adjective (as mean), it generally refers to the arithmetic mean.

Arithmetic mean

Arithmetic mean (or, simply, "mean") is nothing but the average. It is computed by adding all the values in the data set and dividing the sum by the number of observations. If we have the raw data, the mean is given by the formula

Mean = ∑X / n

where ∑ (the uppercase Greek letter sigma) refers to summation, X refers to the individual value and n is the number of observations in the sample (the sample size). Research articles published in journals do not provide raw data and, in such a situation, readers can compute the mean from the frequency distribution (if provided).

For grouped data, the mean is calculated as

Mean = ∑fX / n

where f is the frequency, X is the midpoint of the class interval and n is the number of observations.[3] The standard statistical notations (in relation to measures of central tendency) are mentioned in [Table 1]. Readers are cautioned that the mean calculated from a frequency distribution is not exactly the same as that calculated from the raw data; it approaches the raw-data mean as the number of intervals increases.[4]

Table 1

Standard statistical notations
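As a sketch of the two formulas above, here is a minimal Python illustration; the raw values, class intervals and frequencies are invented for the example:

```python
def mean_raw(values):
    # Mean = sum of X divided by n, from raw observations
    return sum(values) / len(values)

def mean_grouped(midpoints, freqs):
    # Mean = sum of f * X divided by n, from a frequency distribution,
    # where X is the midpoint of each class interval and f its frequency
    n = sum(freqs)
    return sum(f * x for f, x in zip(freqs, midpoints)) / n

raw = [2, 4, 4, 6, 8, 9]
print(mean_raw(raw))                   # 5.5

midpoints = [2.5, 7.5]                 # class intervals 0-5 and 5-10
freqs = [3, 3]                         # observations falling in each interval
print(mean_grouped(midpoints, freqs))  # 5.0
```

Note that the grouped mean (5.0) only approximates the raw mean (5.5), in line with the caution stated above.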


ADVANTAGES

The mean uses every value in the data and hence is a good representative of the data. Ironically, this value often never appears in the raw data itself.

Repeated samples drawn from the same population tend to have similar means. The mean is therefore the measure of central tendency that best resists the fluctuation between different samples.[6]

It is closely related to standard deviation, the most common measure of dispersion.


DISADVANTAGES

The important disadvantage of the mean is that it is sensitive to extreme values/outliers, especially when the sample size is small.[7] It is therefore not an appropriate measure of central tendency for a skewed distribution.[8]

The mean cannot be calculated for nominal or non-numerical ordinal data. Even though the mean can be calculated for numerical ordinal data, it often does not give a meaningful value, e.g. stage of cancer.

Weighted mean

The weighted mean is calculated when certain values in a data set are more important than others.[9] A weight wi is attached to each value xi to reflect this importance, and the weighted mean is given by

Weighted mean = ∑(wixi) / ∑wi

For example, when the weighted mean is used to represent the average duration of stay by a patient in a hospital, the total number of cases presenting to each ward is taken as the weight.
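A minimal sketch of this calculation in Python; the per-ward stay durations and case counts below are hypothetical:

```python
def weighted_mean(values, weights):
    # Weighted mean = sum(w_i * x_i) / sum(w_i)
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Hypothetical mean stay (days) per ward, weighted by cases per ward
stay_days = [3.0, 7.0, 14.0]
cases = [120, 50, 10]
print(weighted_mean(stay_days, cases))  # ~4.72 days
```

The busy ward (120 cases) dominates the result, which is exactly the behaviour the weights are meant to capture.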

Geometric mean

It is defined as the arithmetic mean of the values taken on a log scale. Equivalently, it is the nth root of the product of the n observations:

GM = (x1 × x2 × … × xn)^(1/n)

GM is an appropriate measure when values change exponentially and in case of skewed distribution that can be made symmetrical by a log transformation. GM is more commonly used in microbiological and serological research. One important disadvantage of GM is that it cannot be used if any of the values are zero or negative.
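The log-scale definition translates directly into code. A minimal sketch, with made-up titre values of the kind used in serological work:

```python
import math

def geometric_mean(values):
    # GM = exp(mean of log x) = nth root of the product;
    # all values must be strictly positive
    return math.exp(sum(math.log(v) for v in values) / len(values))

titres = [4, 8, 16, 32]        # hypothetical serial antibody titres
print(geometric_mean(titres))  # 11.313..., i.e. 2 ** 3.5
```

Passing a zero or negative value raises a `ValueError` from `math.log`, which mirrors the limitation described above.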

Harmonic mean

It is the reciprocal of the arithmetic mean of the reciprocals of the observations:

HM = n / ∑(1/x)

Equivalently, the reciprocal of the HM is the arithmetic mean of the reciprocals of the individual observations.

HM is appropriate in situations where the reciprocals of values are more useful. HM is used when we want to determine the average sample size of a number of groups, each of which has a different sample size.

See https://en.wikipedia.org/wiki/Harmonic_mean for examples.
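A minimal sketch of the definition in Python; the group sizes are hypothetical:

```python
def harmonic_mean(values):
    # HM = n / sum(1/x); undefined if any value is zero
    return len(values) / sum(1 / v for v in values)

# Hypothetical group sizes whose "average" size is wanted
sizes = [10, 20, 40]
print(harmonic_mean(sizes))  # ~17.14
```

Python's standard library also provides `statistics.harmonic_mean` for the same purpose.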


DEGREE OF VARIATION BETWEEN THE MEANS

If all the values in a data set are the same, then all three means (arithmetic mean, GM and HM) are identical. As the variability in the data increases, the difference among these means also increases. Whenever the values differ, the arithmetic mean is greater than the GM, which in turn is greater than the HM.[5]
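This ordering can be checked numerically. A small sketch with two illustrative values:

```python
import math

values = [2.0, 8.0]  # illustrative values with some spread

am = sum(values) / len(values)                                  # arithmetic mean: 5.0
gm = math.exp(sum(math.log(v) for v in values) / len(values))   # geometric mean: 4.0
hm = len(values) / sum(1 / v for v in values)                   # harmonic mean: 3.2
assert am > gm > hm  # the ordering holds whenever the values differ
```

With identical values (say `[5.0, 5.0]`) all three quantities coincide, as stated above.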

The other measures of central tendency (median and mode) and the guidelines for selecting the appropriate measure of central tendency will be dealt with in the subsequent issue.


Footnotes

Source of Support: Nil

Conflict of Interest: None declared


REFERENCES

1. Manikandan S. Frequency distribution. J Pharmacol Pharmacother. 2011;2:54–6.

2. Gravetter FJ, Wallnau LB. Statistics for the behavioral sciences. 5th ed. Belmont: Wadsworth – Thomson Learning; 2000.

3. Rao PS Sundar, Richard J. Introduction to biostatistics and research methods. 4th ed. New Delhi, India: Prentice Hall of India Pvt Ltd; 2006.

4. Sundaram KR, Dwivedi SN, Sreenivas V. Medical statistics: principles and methods. 1st ed. New Delhi, India: BI Publications Pvt Ltd; 2010.

5. Norman GR, Streiner DL. Biostatistics: the bare essentials. 2nd ed. Hamilton: BC Decker Inc; 2000.

6. Glaser AN. High yield biostatistics. 1st ed. New Delhi, India: Lippincott Williams and Wilkins; 2000.

7. Dawson B, Trapp RG. Basic and clinical biostatistics. 4th ed. New York: McGraw-Hill; 2004.

8. Swinscow TD, Campbell MJ. Statistics at square one. 10th ed. New Delhi, India: Viva Books Private Limited; 2003.

9. Petrie A, Sabin C. Medical statistics at a glance. 3rd ed. Oxford: Wiley-Blackwell; 2009.


Articles from Journal of Pharmacology & Pharmacotherapeutics are provided here courtesy of Wolters Kluwer -- Medknow Publications

K-means is an unsupervised clustering method that partitions a data set into K clusters so that points within the same cluster are as similar as possible, while points in different clusters are as dissimilar as possible. The algorithm iteratively optimises an objective function, gradually adjusting the cluster centres and the assignment of data points[^1].

### Algorithm steps

1. **Initialise the cluster centres**: randomly select K points from the data set as the initial centres, where K is the preset number of clusters[^2].
2. **Assign data points**: for each data point, compute its distance to every cluster centre and assign it to the nearest cluster.
3. **Update the cluster centres**: recompute each cluster's centre, usually as the mean of all data points assigned to that cluster.
4. **Repeat steps 2 and 3** until the centres no longer change appreciably, or a preset number of iterations is reached.

### Properties

- **Objective function**: K-means minimises the sum of squared distances between each data point and the centre of its cluster, which also serves as a criterion for assessing clustering quality[^1].
- **Convergence**: the algorithm is guaranteed to converge after finitely many iterations, but it may converge to a local rather than the global optimum. It is therefore usually recommended to run it several times and keep the best result.
- **When it is appropriate**: K-means suits data whose clusters are roughly spherical and regular in shape. For non-spherical or more complex structures, other clustering algorithms may be needed.

### Example implementation

Below is a simple Python implementation of K-means:

```python
import numpy as np

def kmeans(X, K, max_iters=100):
    # Randomly choose K distinct data points as the initial centroids
    centroids = X[np.random.choice(X.shape[0], K, replace=False)]
    for _ in range(max_iters):
        # Assign each point to the nearest centroid
        distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
        labels = np.argmin(distances, axis=1)
        # Recompute each centroid as the mean of its assigned points;
        # keep the old centroid if a cluster ends up empty
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        # Stop when the centroids no longer change
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example data
X = np.random.rand(100, 2)

# Run K-means
labels, centroids = kmeans(X, K=3)
```

In the code above, `X` is the input data matrix, `K` is the number of clusters, and `max_iters` is the maximum number of iterations. The function returns the label of each data point and the final cluster centres.

### Caveats

- **Sensitivity to initialisation**: because K-means is sensitive to the initial centres, initialisation schemes such as K-means++ are recommended to improve clustering quality.
- **Choosing K**: selecting a suitable K is one of the key problems with K-means. Common approaches include the elbow method and the silhouette score.