U25%(1,16) and U25%(1,168)on《C4.5:programs for machine learning》_《c4.5: programs for machine learning》-优快云博客

本文链接：https://blog.youkuaiyun.com/appleyuchi/article/details/83834101

本文深入解析了UCF算法，一种用于计算子树误判计数的统计方法，特别是在决策树剪枝过程中的应用。介绍了算法的基本公式，包括如何确定系数Coeff，以及其在悲观错误率计算中的作用。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

when calculating
$U_{CF}$ (e,N)
CF: Confidence Level(here is 25%)
e:misclassifying counts of current subtree we focus on
N:counts of sub-datasets relevant to current subtree who is under judgment whether to be pruned or not.

“Pr” is short for “Pessimistic error rate”.

$U_{CF}$ (e,N)=Pr= $e+0.5+Coeff22+Coeff2⋅{(e+0.5）[1−e+0.5N]+Coeff24}N+Coeff2①\frac{e+0.5+ \frac{Coeff^2}{2}+ \sqrt{ Coeff^2·\{ (e+0.5）[1- \frac{e+0.5}{N} ]+ \frac{Coeff^2}{4} \} } } {N+Coeff^2}①$

The method to get $Coeff\ Coeff$ is as follows:

$Level−val[i−1]val[i]−val[i−1]②\frac{Coeff-Deviation[i-1]}{Deviation[i]-Deviation[i-1]}=\frac{Confidence \ Level-val[i-1]}{val[i]-val[i-1]}②$

Now Let’s analyse the meaning of ② and how to get “i” in formula ②.

What is Deviation and Val?
Val[] = { 0, 0.001, 0.005, 0.01, 0.05, 0.10, 0.20, 0.40, 1.00},
Deviation[] = {4.0, 3.09, 2.58, 2.33, 1.65, 1.28, 0.84, 0.25, 0.00};

for Normal Distribution：
P{X>Deviation[i]}=val[i],
for example:
P{X>2.58}=0.005
--------------------

Now let’s analyse the whole meaning of formula ②

the red line is $P(X>x)=∫x+∞12πe−x22dxP(X>x)=\int_x^{+∞}\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}dx$

the blue line is $f(x)=12πe−x22f(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}$
and the two points are:
(deviation[i],val[i])=(0.25,0.4)
(deviation[i-1,val[i-1])=(0.84,0.2)

Let’s amplify the above figure to the following picture:
在这里插入图片描述
so tan C= $Level−Val[i−1]Deviation[i−1]−Coeff\frac{Val[i]-Val[i-1]}{Deviation[i-1]-Deviation[i]}=\frac{Confidence\ Level-Val[i-1]}{Deviation[i-1]-Coeff}$

then we can get ②and compute $C o e f f$ and $Coeff^2$

but how to get “i” in above formula?
when $\ level≤Val[i]$
is satisfied,
we get “i”.
when Confidence Level=0.25
i=7
val[i-1]=0.2
val[i]=0.4
deviation[i-1]=0.84
deviation[i]=0.25
substitute the above values in ②
we get Coeff=0.6925
then,
$Coeff^2=0.479556$
--------------------

**Now Let’s calculate **
$U_{0.25}(1,16)$
CF=0.25
e=1
N=16
$Coeff^2=0.479556$
We’ll get :
Pr=

$1+0.5+0.4795562+0.479556⋅{(1+0.5）[1−1+0.516]+0.4795564}16+0.479556\frac{1+0.5+ \frac{0.479556}{2}+ \sqrt{ 0.479556·\{ (1+0.5）[1- \frac{1+0.5}{16} ]+ \frac{0.479556}{4} \} } } {16+0.479556}$ =0.156681

N·Pr=16×0.156681=2.507
Compared with Quinlan’s book page 42th:

Now Let’s calculate
$U_{0.25}(1,168)$
CF=0.25
e=1
N=168

Pr= $1+0.5+0.4795562+0.479556⋅{(1+0.5）[1−1+0.5168]+0.4795564}168+0.479556=0.015536\frac{1+0.5+ \frac{0.479556}{2}+ \sqrt{ 0.479556·\{ (1+0.5）[1- \frac{1+0.5}{168} ]+ \frac{0.479556}{4} \} } } {168+0.479556}=0.015536$

$N·U_{0.25}(1,168)=N·Pr=168·0.015536=2.61$
Compared with Quinlan’s Book on Page 42th:
在这里插入图片描述

The code to veriy the above value:

/*************************************************************************/
/*									 */
/*  Statistical routines for C4.5					 */
/*  -----------------------------					 */
/*									 */
/*************************************************************************/
#include <stdio.h> 
#include <math.h>

									
/*************************************************************************/
/*									 */
/*  Compute the additional errors if the error rate increases to the	 */
/*  upper limit of the confidence level.  The coefficient is the	 */
/*  square of the number of standard deviations corresponding to the	 */
/*  selected confidence level.  (Taken from Documenta Geigy Scientific	 */
/*  Tables (Sixth Edition), p185 (with modifications).)			 */
/*									 */
/*************************************************************************/
float CF=0.25;

float Val[] = {  0,  0.001, 0.005, 0.01, 0.05, 0.10, 0.20, 0.40, 1.00},//概率
      Dev[] = {4.0,  3.09,  2.58,  2.33, 1.65, 1.28, 0.84, 0.25, 0.00};//这个就是分位点
      // P{X>Dev[i]}=val[i],
      // according to standardized normal distribution N(0,1)
      // for example:
      // P{X>2.58}=0.005,


float AddErrs(float N, float e)//叶子的悲观错误
{
    static float Coeff=0;
    float Val0, Pr;

    if ( ! Coeff )//compute only once when pruning trees
    {
	/*  Compute and retain the coefficient value, interpolating from
	    the values in Val and Dev  */

	int i;

	i = 0;
	while ( CF > Val[i] ) i++;//这里可以看到，是根据置信度对悲观错误数量进行计算的。
    //但是呢，也没有像理论中直接计算“悲观错误率”

	Coeff = Dev[i-1] +
		  (Dev[i] - Dev[i-1]) * (CF - Val[i-1]) /(Val[i] - Val[i-1]);
	Coeff = Coeff * Coeff;
    printf("Coeff=%f\n",Coeff);//CF定下来以后，Coeff就是定的
    }

    if ( e < 1E-6 )
    {
	return N * (1 - exp(log(CF) / N));
    }
    else
    if ( e < 0.9999 )
    {
	Val0 = N * (1 - exp(log(CF) / N));
	return Val0 + e * (AddErrs(N, 1.0) - Val0);//这里在进行递归调用
    }
    else
    if ( e + 0.5 >= N )
    {
	return 0.67 * (N - e);
    }
    else
    {
        float fenzi;
        fenzi=(
        e + 0.5 + Coeff/2+ sqrt
        ( 
            Coeff * 
                (
                (e + 0.5) * (1 - (e + 0.5)/N) + Coeff/4
                )
        ) 
            );

        printf("fenzi=%f\n",fenzi);
	Pr = (
        e + 0.5 + Coeff/2+ sqrt
        ( 
            Coeff * 
                (
                (e + 0.5) * (1 - (e + 0.5)/N) + Coeff/4
                )
        ) 
            )
             / (N + Coeff);
       printf("Pr=%f\n",Pr);
	return (N * Pr - e);
    }
}




int main()
{

float result=AddErrs(16, 1);
printf("result=%f\n",result);
    exit(0);


}

running method:

gcc -o stats stats.c -lm
./stats

the result is:
Coeff=0.479556
fenzi=2.582031
Pr=0.156681
result=1.506894