原文:http://metamerist.com/cbrt/cbrt.htm
In Search of a Fast Cube Root
Following are thoughts and observations regarding the machine extraction ofcube roots with a particular emphasis on speed of calculations.
Cube roots are a topic of interest, because they're used in many commonformulae and a fast cube root function isn't included with Microsoft VisualStudio.
In the absence of a special cube root function, a typical strategy iscalculation via a power function (e.g., pow(x, 1.0/3.0)).This can be problematic in terms of speed and in terms of accuracywhen negative numbers aren't handled properly.
All of the implementations of cube root functions I discovered rely on Newton'smethod. Notable implementations include William. M.Kahan's cube root function as found in the BSD sources (kahan'scbrt) and Ken Turkowski's as presented in an Apple technical report (CubeRoot).Kahan's implementation is particularly robust, handling negative numbers, NaN,INF and subnormals. Turkowski's paper includes interesting facts and insights.
Cube Root by Newton's method
Given anapproximation ‘a’ of the cube root of ‘R’,the following formula will produce ‘b’, a closerapproximation to the cube root of ‘R’.
or
Repeated application yields results progressively closer to the true cube rootof R, with the number of bits of precision doubling with each iteration.Consequently, the accuracy of the initial guess fed into the iteration isespecially important. Desired degrees of precision can be achieved with feweriterations when initial estimates are more accurate. Computationally, each iterationis relatively inexpensive: one divide, two multiplies and two adds.
Turkowski's approach factors the cube root of R into the product of a power oftwo and a cube root of a small value with a limited range. He then uses aquadratic approximation to estimate the cube root of the small value. Thisapproach yields an initial value with approximately 6 bits of precision to seedthe Newton-Raphson iteration.
Kahan uses a bit hack to generate an initial estimate with 5 bits of precision.The approach exploits the properties of IEEE 754 floating point numbers byleveraging the fact that their binary representation is close to a log2reppresentation. The following code demonstrates the technique applied for thedouble precision case. (Note: Kahan's code also includes support forhandling subnormal numbers. The following code assumes 32-bit integers).
inline double cbrt_5d(double d)
{
const unsigned int B1 = 715094163;
double t = 0.0;
unsigned int* pt = (unsigned int*) &t;
unsigned int* px = (unsigned int*) &d;
pt[1]=px[1]/3+B1;
return t;
}
General Nth Root Approximations
I've generalized this approach to provide initial approximations for nth rootsgenerally for IEEE 754 floats and doubles in the following:
template<int n>
inline float nth_rootf(floatx)
{
const int ebits = 8;
const int fbits = 23;
int& i = (int&) x;
const int bias = (1 << (ebits-1))-1;
i = (i - (bias << fbits)) / n + (bias << fbits);
return x;
}
template <int n>
inline double nth_rootd(doublex)
{
const int ebits = 11;
const int fbits = 52;
_int64& i = (_int64&) x;
const _int64 bias = (1 << (ebits-1))-1;
i = (i - (bias << fbits)) / n + (bias << fbits);
return x;
}
To use the code to generate an nth root approximation, n is specified as a C++template parameter. For example, a float approximation for the cube root of 5.0can be can be generated with the call nth_rootf<3>(5.0f). In my experience, thesefunctions provide estimates in the range of 5 to 6 bits of precision.
Halley's Method applied to CubeRoots
Recent investigations into faster converging cube root approximations led to amethod offered by Otis E. Lancaster in Machine Method for the Extraction ofa Cube Root1. Lancaster’s method is calculated relativelyefficiently, and it exhibits cubic convergence in contrast to the quadraticconvergence of Newton's method. It takes the following form (where ‘b’ is a closer approximation tothe cube root of R than ‘a’).
The method is actually an application of a method devised by comet discovererEdmond Halley (see Halley'sMethod). Over the years it hasbeen repeatedly rediscovered and attributed to others including Lambert andBailey. With slight massaging, the equation can be reduced to the following:
In this form, the calculation cost amounts to one divide, three multiplies andthree additions. Although this exceeds Newton-Raphson by a multiply and an add,speed improvements can be reaped thanks to the faster rate at whichHalley’s method converges and the relatively high cost of divide, whichis a factor in each iteration of either method.
Comparisonof Various Methods
Followingare times and accuracy estimates of various methods including the bit hackapproximation (cbrt_5f, cbrt_5d), up to three iterations of Halley’smethod and up to four iterations of Newton’s methods. These results wereobtained on an Intel Core Duo system using C++ compiled with Microsoft VisualStudio 2005.
Inthe tests, one million cube roots are calculated uniformly across the rangefrom 0.0 to 1.0. Times are reported in milliseconds. The second columnindicates an estimate of the minimum number of bits of precision in the worstcase found. The last column indicates an estimate of the average number of bitsof precision per evaluation.
Thecbrt_5f and cbrt_5d functions were used to seed the iterative methods.It’s important to note that a better approximation will converge evenmore quickly. Given the strategy used, a single iteration of Halley’smethod appears sufficient for 16 bits of precision. Also, note the costs ofvarious methods in comparison to the standard pow( ) function.
32-bit float tests
----------------------------------------
cbrt_5f 8.8 ms 5 mbp 6.223 abp
pow 144.5 ms 23 mbp 23.000 abp
halley x 1 31.8ms 15 mbp 18.961 abp
halley x 2 59.0ms 23 mbp 23.000 abp
newton x 1 23.4ms 10 mbp 12.525 abp
newton x 2 48.9ms 20 mbp 22.764 abp
newton x 3 72.0ms 23 mbp 23.000 abp
newton x 4 89.6ms 23 mbp 23.000 abp
64-bit double tests
----------------------------------------
cbrt_5d 23.5 ms 5 mbp 6.299 abp
pow 151.3 ms 52 mbp 52.000 abp
halley x 1 56.5ms 16 mbp 19.503 abp
halley x 2 95.3ms 47 mbp 51.575 abp
halley x 3 122.1 ms 51 mbp 51.965 abp
newton x 1 48.8ms 10 mbp 12.596 abp
newton x 2 78.6ms 20 mbp 25.205 abp
newton x 3 104.5 ms 40 mbp 47.681 abp
newton x 4 134.7 ms 51 mbp 51.992 abp
Thesource code is available here.
1. Otis E. Lancaster, Machine Method for the Extraction of Cube Root Journal ofthe American Statistical Association, Vol. 37, No. 217. (Mar., 1942), pp.112-115.
// CubeRoot.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <windows.h>
#include <math.h>
// get accurate timer (Win32)
double GetTimer()
{
LARGE_INTEGER F, N;
QueryPerformanceFrequency(&F);
QueryPerformanceCounter(&N);
return double(N.QuadPart)/double(F.QuadPart);
}
typedef float (*cuberootfnf) (float);
typedef double (*cuberootfnd) (double);
// estimate bits of precision (32-bit float case)
inline int bits_of_precision(float a, float b)
{
const double kd = 1.0 / log(2.0);
if (a==b)
return 23;
const double kdmin = pow(2.0, -23.0);
double d = fabs(a-b);
if (d < kdmin)
return 23;
return int(-log(d)*kd);
}
// estiamte bits of precision (64-bit double case)
inline int bits_of_precision(double a, double b)
{
const double kd = 1.0 / log(2.0);
if (a==b)
return 52;
const double kdmin = pow(2.0, -52.0);
double d = fabs(a-b);
if (d < kdmin)
return 52;
return int(-log(d)*kd);
}
// cube root via x^(1/3)
float pow_cbrtf(float x)
{
return pow(x, 1.0f/3.0f);
}
// cube root via x^(1/3)
double pow_cbrtd(double x)
{
return pow(x, 1.0/3.0);
}
// cube root approximation using bit hack for 32-bit float
__forceinline float cbrt_5f(float f)
{
unsigned int* p = (unsigned int *) &f;
*p = *p/3 + 709921077;
return f;
}
// cube root approximation using bit hack for 64-bit float
// adapted from Kahan's cbrt
__forceinline double cbrt_5d(double d)
{
const unsigned int B1 = 715094163;
double t = 0.0;
unsigned int* pt = (unsigned int*) &t;
unsigned int* px = (unsigned int*) &d;
pt[1]=px[1]/3+B1;
return t;
}
// cube root approximation using bit hack for 64-bit float
// adapted from Kahan's cbrt
__forceinline double quint_5d(double d)
{
return sqrt(sqrt(d));
const unsigned int B1 = 71509416*5/3;
double t = 0.0;
unsigned int* pt = (unsigned int*) &t;
unsigned int* px = (unsigned int*) &d;
pt[1]=px[1]/5+B1;
return t;
}
// iterative cube root approximation using Halley's method (float)
__forceinline float cbrta_halleyf(const float a, const float R)
{
const float a3 = a*a*a;
const float b= a * (a3 + R + R) / (a3 + a3 + R);
return b;
}
// iterative cube root approximation using Halley's method (double)
__forceinline double cbrta_halleyd(const double a, const double R)
{
const double a3 = a*a*a;
const double b= a * (a3 + R + R) / (a3 + a3 + R);
return b;
}
// iterative cube root approximation using Newton's method (float)
__forceinline float cbrta_newtonf(const float a, const float x)
{
// return (1.0 / 3.0) * ((a + a) + x / (a * a));
return a - (1.0f / 3.0f) * (a - x / (a*a));
}
// iterative cube root approximation using Newton's method (double)
__forceinline double cbrta_newtond(const double a, const double x)
{
return (1.0/3.0) * (x / (a*a) + 2*a);
}
// cube root approximation using 1 iteration of Halley's method (double)
double halley_cbrt1d(double d)
{
double a = cbrt_5d(d);
return cbrta_halleyd(a, d);
}
// cube root approximation using 1 iteration of Halley's method (float)
float halley_cbrt1f(float d)
{
float a = cbrt_5f(d);
return cbrta_halleyf(a, d);
}
// cube root approximation using 2 iterations of Halley's method (double)
double halley_cbrt2d(double d)
{
double a = cbrt_5d(d);
a = cbrta_halleyd(a, d);
return cbrta_halleyd(a, d);
}
// cube root approximation using 3 iterations of Halley's method (double)
double halley_cbrt3d(double d)
{
double a = cbrt_5d(d);
a = cbrta_halleyd(a, d);
a = cbrta_halleyd(a, d);
return cbrta_halleyd(a, d);
}
// cube root approximation using 2 iterations of Halley's method (float)
float halley_cbrt2f(float d)
{
float a = cbrt_5f(d);
a = cbrta_halleyf(a, d);
return cbrta_halleyf(a, d);
}
// cube root approximation using 1 iteration of Newton's method (double)
double newton_cbrt1d(double d)
{
double a = cbrt_5d(d);
return cbrta_newtond(a, d);
}
// cube root approximation using 2 iterations of Newton's method (double)
double newton_cbrt2d(double d)
{
double a = cbrt_5d(d);
a = cbrta_newtond(a, d);
return cbrta_newtond(a, d);
}
// cube root approximation using 3 iterations of Newton's method (double)
double newton_cbrt3d(double d)
{
double a = cbrt_5d(d);
a = cbrta_newtond(a, d);
a = cbrta_newtond(a, d);
return cbrta_newtond(a, d);
}
// cube root approximation using 4 iterations of Newton's method (double)
double newton_cbrt4d(double d)
{
double a = cbrt_5d(d);
a = cbrta_newtond(a, d);
a = cbrta_newtond(a, d);
a = cbrta_newtond(a, d);
return cbrta_newtond(a, d);
}
// cube root approximation using 2 iterations of Newton's method (float)
float newton_cbrt1f(float d)
{
float a = cbrt_5f(d);
return cbrta_newtonf(a, d);
}
// cube root approximation using 2 iterations of Newton's method (float)
float newton_cbrt2f(float d)
{
float a = cbrt_5f(d);
a = cbrta_newtonf(a, d);
return cbrta_newtonf(a, d);
}
// cube root approximation using 3 iterations of Newton's method (float)
float newton_cbrt3f(float d)
{
float a = cbrt_5f(d);
a = cbrta_newtonf(a, d);
a = cbrta_newtonf(a, d);
return cbrta_newtonf(a, d);
}
// cube root approximation using 4 iterations of Newton's method (float)
float newton_cbrt4f(float d)
{
float a = cbrt_5f(d);
a = cbrta_newtonf(a, d);
a = cbrta_newtonf(a, d);
a = cbrta_newtonf(a, d);
return cbrta_newtonf(a, d);
}
double TestCubeRootf(const char* szName, cuberootfnf cbrt, double rA, double rB, int rN)
{
const int N = rN;
float dd = float((rB-rA) / N);
// calculate 1M numbers
int i=0;
float d = (float) rA;
double t = GetTimer();
double s = 0.0;
for(d=(float) rA, i=0; i<N; i++, d += dd)
{
s += cbrt(d);
}
t = GetTimer() - t;
printf("%-10s %5.1f ms ", szName, t*1000.0);
double maxre = 0.0;
double bits = 0.0;
double worstx=0.0;
double worsty=0.0;
int minbits=64;
for(d=(float) rA, i=0; i<N; i++, d += dd)
{
float a = cbrt((float) d);
float b = (float) pow((double) d, 1.0/3.0);
int bc = bits_of_precision(a, b);
bits += bc;
if (b > 1.0e-6)
{
if (bc < minbits)
{
minbits = bc;
worstx = d;
worsty = a;
}
}
}
bits /= N;
printf(" %3d mbp %6.3f abp\n", minbits, bits);
return s;
}
double TestCubeRootd(const char* szName, cuberootfnd cbrt, double rA, double rB, int rN)
{
const int N = rN;
double dd = (rB-rA) / N;
int i=0;
double t = GetTimer();
double s = 0.0;
double d = 0.0;
for(d=rA, i=0; i<N; i++, d += dd)
{
s += cbrt(d);
}
t = GetTimer() - t;
printf("%-10s %5.1f ms ", szName, t*1000.0);
double bits = 0.0;
double maxre = 0.0;
double worstx = 0.0;
double worsty = 0.0;
int minbits = 64;
for(d=rA, i=0; i<N; i++, d += dd)
{
double a = cbrt(d);
double b = pow(d, 1.0/3.0);
int bc = bits_of_precision(a, b); // min(53, count_matching_bitsd(a, b) - 12);
bits += bc;
if (b > 1.0e-6)
{
if (bc < minbits)
{
bits_of_precision(a, b);
minbits = bc;
worstx = d;
worsty = a;
}
}
}
bits /= N;
printf(" %3d mbp %6.3f abp\n", minbits, bits);
return s;
}
int _tmain(int argc, _TCHAR* argv[])
{
// a million uniform steps through the range from 0.0 to 1.0
// (doing uniform steps in the log scale would be better)
double a = 0.0;
double b = 1.0;
int n = 1000000;
printf("32-bit float tests\n");
printf("----------------------------------------\n");
TestCubeRootf("cbrt_5f", cbrt_5f, a, b, n);
TestCubeRootf("pow", pow_cbrtf, a, b, n);
TestCubeRootf("halley x 1", halley_cbrt1f, a, b, n);
TestCubeRootf("halley x 2", halley_cbrt2f, a, b, n);
TestCubeRootf("newton x 1", newton_cbrt1f, a, b, n);
TestCubeRootf("newton x 2", newton_cbrt2f, a, b, n);
TestCubeRootf("newton x 3", newton_cbrt3f, a, b, n);
TestCubeRootf("newton x 4", newton_cbrt4f, a, b, n);
printf("\n\n");
printf("64-bit double tests\n");
printf("----------------------------------------\n");
TestCubeRootd("cbrt_5d", cbrt_5d, a, b, n);
TestCubeRootd("pow", pow_cbrtd, a, b, n);
TestCubeRootd("halley x 1", halley_cbrt1d, a, b, n);
TestCubeRootd("halley x 2", halley_cbrt2d, a, b, n);
TestCubeRootd("halley x 3", halley_cbrt3d, a, b, n);
TestCubeRootd("newton x 1", newton_cbrt1d, a, b, n);
TestCubeRootd("newton x 2", newton_cbrt2d, a, b, n);
TestCubeRootd("newton x 3", newton_cbrt3d, a, b, n);
TestCubeRootd("newton x 4", newton_cbrt4d, a, b, n);
printf("\n\n");
getchar();
return 0;
}