Maximum/minimum representable integers.
The maximum representable integer is the largest integer i for which i+1>i holds true. Using the while loop, determine your maximum integer and compare it with "int.MaxValue".
Something like
    int i=1; while(i+1>i) {i++;}
    Write("my max int = {0}\n",i);
It can take some seconds to calculate.
The minimum representable integer is the most negative integer i for which i-1<i holds true. Using the while loop, determine your minimum integer and compare it with "int.MinValue".
Using the while loop, calculate the machine epsilon for the types float and double.
Something like
    double x=1; while(1+x!=1){x/=2;} x*=2;
    float y=1F; while((float)(1F+y) != 1F){y/=2F;} y*=2F;
There seem to be no predefined values for these numbers in C# (I couldn't find any in any case; note that double.Epsilon is the smallest positive double, not the machine epsilon). However, in an IEEE 64-bit floating-point number (double), where 1 bit is reserved for the sign and 11 bits for the exponent, there are 52 bits remaining for the fraction; therefore the double machine epsilon must be about System.Math.Pow(2,-52). For single precision (float) the machine epsilon should be about System.Math.Pow(2,-23). Check this.
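One possible way to do the check is sketched below. It wraps the two loops from the hint above in small functions and prints their results next to the expected powers of two. The class name EpsilonCheck and the method names DoubleEpsilon and FloatEpsilon are made up for this illustration; they are not part of the exercise.

```csharp
using System;

public static class EpsilonCheck {
    // Halve x until adding it to 1 no longer changes the sum,
    // then step back by one factor of 2.
    public static double DoubleEpsilon() {
        double x = 1;
        while (1 + x != 1) { x /= 2; }
        return x * 2;
    }

    // Same loop in single precision; the cast forces the sum to be
    // rounded to float before the comparison.
    public static float FloatEpsilon() {
        float y = 1F;
        while ((float)(1F + y) != 1F) { y /= 2F; }
        return y * 2F;
    }

    public static void Main() {
        double x = DoubleEpsilon();
        float y = FloatEpsilon();
        Console.WriteLine($"double: loop = {x:e}, 2^-52 = {Math.Pow(2, -52):e}");
        Console.WriteLine($"float:  loop = {y:e}, 2^-23 = {(float)Math.Pow(2, -23):e}");
    }
}
```

If the arithmetic is done in strict IEEE double and single precision, the two numbers on each printed line should agree.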
Suppose tiny=epsilon/2. Calculate the two sums,
    sumA=1+tiny+tiny+...+tiny;
    sumB=tiny+tiny+...+tiny+1;
which should seemingly be the same, and print out the values sumA-1 and sumB-1. Something like
    int n=(int)1e6;
    double epsilon=Pow(2,-52);
    double tiny=epsilon/2;
    double sumA=0,sumB=0;
    sumA+=1; for(int i=0;i<n;i++){sumA+=tiny;}
    for(int i=0;i<n;i++){sumB+=tiny;} sumB+=1;
    WriteLine($"sumA-1 = {sumA-1:e} should be {n*tiny:e}");
    WriteLine($"sumB-1 = {sumB-1:e} should be {n*tiny:e}");
Explain why there is a difference.
Consider
    double d1 = 0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1;
    double d2 = 8*0.1;
Both doubles "d1" and "d2" should be equal to 0.8, and then the "==" operator should produce the "true" result. However, try
    WriteLine($"d1={d1:e15}");
    WriteLine($"d2={d2:e15}");
    WriteLine($"d1==d2 ? => {d1==d2}");
and see that this is not the case (not on my box in any case). That is because the decimal number 0.1 cannot be represented exactly as a finite binary fraction, and a double stores only 52 fraction bits, so every 0.1 is already rounded and the rounding errors accumulate differently in the sum and in the product.
For this reason, one needs a more complex comparison algorithm. Two doubles in a finite-digit representation can only be compared to within a given absolute and/or relative precision (where the values for the precision depend on the task at hand and generally must be supplied by the user).
Therefore, implement a function with the signature
    bool approx(double a, double b, double acc=1e-9, double eps=1e-9)
that returns "true" if the numbers 'a' and 'b' are equal either with absolute precision "acc",
    |a-b| < acc
or with relative precision "eps",
    |a-b|/Max(|a|,|b|) < eps
and returns "false" otherwise.
Something like
    public static bool approx(double a, double b, double acc=1e-9, double eps=1e-9){
        if(Abs(b-a) < acc) return true;
        else if(Abs(b-a) < Max(Abs(a),Abs(b))*eps) return true;
        else return false;
    }
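To see what the function buys you, it can be applied to the "d1" and "d2" example above. The following is a minimal sketch: the class name ApproxDemo and the extra test values (1e20, 1e5, 1.0, 1.1) are chosen for illustration only, and approx is copied from the hint with the same default precisions.

```csharp
using System;
using static System.Math;

public static class ApproxDemo {
    // True if a and b agree within absolute precision acc
    // or within relative precision eps.
    public static bool approx(double a, double b, double acc = 1e-9, double eps = 1e-9) {
        if (Abs(b - a) < acc) return true;
        if (Abs(b - a) < Max(Abs(a), Abs(b)) * eps) return true;
        return false;
    }

    public static void Main() {
        double d1 = 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1;
        double d2 = 8 * 0.1;
        // Exact comparison versus comparison within the given precision.
        Console.WriteLine($"d1==d2 ? => {d1 == d2}");
        Console.WriteLine($"approx(d1,d2) ? => {approx(d1, d2)}");
        // Large numbers: the absolute test fails, the relative test decides.
        Console.WriteLine($"approx(1e20,1e20+1e5) ? => {approx(1e20, 1e20 + 1e5)}");
        // Clearly different numbers fail both tests.
        Console.WriteLine($"approx(1.0,1.1) ? => {approx(1.0, 1.1)}");
    }
}
```

The absolute test handles numbers close to zero, where the relative error is not meaningful, while the relative test handles large numbers, whose spacing in the double representation can by itself exceed "acc".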