Question

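Is there a safe way to reliably determine whether an integral type T can store a floating-point integer value f (so f == floor(f)) without overflow?

Keep in mind that there is no guarantee that the floating-point type F is IEC 559 (IEEE 754) compatible, and that signed integer overflow is undefined behavior in C++. I'm interested in a solution that is correct according to the current C++ standard (C++17 at the time of writing) and avoids undefined behavior.

The following naive approach is not reliable: because of floating-point rounding, there is no guarantee that type F can represent std::numeric_limits<I>::max().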

#include <cmath>
#include <limits>
#include <type_traits>

template <typename I, typename F>
bool is_safe_conversion(F x)
{
    static_assert(std::is_floating_point_v<F>);
    static_assert(std::is_integral_v<I>);

    // 'fmax' may have a different value than expected
    static constexpr F fmax = static_cast<F>(std::numeric_limits<I>::max());

    return std::abs(x) <= fmax; // this test may give incorrect results
}
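
To make the failure concrete, here is a minimal sketch specialised to std::int32_t and float. It assumes float is IEEE 754 binary32 and that the int-to-float conversion rounds to nearest, neither of which the standard guarantees; the names naive_bound and naive_is_safe are illustrative only.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Bound used by the naive check for I = std::int32_t, F = float.
// Assumption: binary32 float and round-to-nearest conversion, so that
// 2^31 - 1 is converted to 2^31.
inline float naive_bound()
{
    return static_cast<float>(std::numeric_limits<std::int32_t>::max());
}

// The naive test from the question, specialised to int32_t/float.
inline bool naive_is_safe(float x)
{
    return std::abs(x) <= naive_bound();
}
```

Under those assumptions naive_bound() is 2147483648.0f, so naive_is_safe(2147483648.0f) returns true although 2^31 does not fit in std::int32_t.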

Any ideas?

Answer 1

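Is there a safe way to reliably determine whether an integral type T can store a floating-point integer value f?

Yes. The key is to test, using floating-point math, whether f is in the range T::MIN - 0.999... to T::MAX + 0.999..., with no rounding issues. Bonus: the rounding mode does not matter.

There are 3 failure paths: too big, too small, not-a-number.

The code below assumes int/double. I'll leave the C++ templating to OP.

Forming an exact T::MAX + 1 with floating-point math is easy because INT_MAX is a Mersenne number. (We are not talking about Mersenne primes here.)

The code takes advantage of the following:
A Mersenne number divided by 2 with integer math is also a Mersenne number.
Converting an integer-type power-of-2 constant to a floating-point type is certain to be exact.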

#define DBL_INT_MAXP1 (2.0*(INT_MAX/2+1)) 
// Below needed when -INT_MAX == INT_MIN
#define DBL_INT_MINM1 (2.0*(INT_MIN/2-1))
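
The same trick can be written as a template. This is only a sketch: max_plus_one is a hypothetical helper name, and it assumes the floating-point radix is 2 and that max + 1 lies within the finite range of F.

```cpp
#include <cstdint>
#include <limits>
#include <type_traits>

// Returns std::numeric_limits<I>::max() + 1 as an F, computed without
// integer overflow: max/2 + 1 is a power of 2 that fits in I, and doubling
// a power of 2 is exact in a radix-2 floating-point type.
template <typename I, typename F>
constexpr F max_plus_one()
{
    static_assert(std::is_integral_v<I>);
    static_assert(std::is_floating_point_v<F>);
    static_assert(std::numeric_limits<F>::radix == 2, "sketch assumes radix 2");
    return F(2) * static_cast<F>(std::numeric_limits<I>::max() / 2 + 1);
}

// Compile-time usage: INT32_MAX + 1 == 2^31.
static_assert(max_plus_one<std::int32_t, double>() == 2147483648.0);
```

Because the bound is a compile-time constant, it can replace the DBL_INT_MAXP1 macro for any integer type that satisfies the stated assumptions.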

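Forming an exact T::MIN - 1 is hard because its absolute value is usually a power of 2 plus 1, and the relative precision of the integer type and the FP type is not known. Instead, the code can subtract the exact power of 2 and compare against -1.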

int double_to_int(double x) {
  if (x < DBL_INT_MAXP1) {
    #if -INT_MAX == INT_MIN
    // rare non-2's complement machine 
    if (x > DBL_INT_MINM1) {
      return (int) x;
    }
    #else
    if (x - INT_MIN > -1.0) {
      return (int) x;
    }
    #endif 
    Handle_Underflow();
  } else if (x > 0) {
    Handle_Overflow();
  } else {
    Handle_NaN();
  }
}
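
A templated variant of the same test might look like the sketch below. It is not the answer's code: fits_after_truncation is a hypothetical name, and the sketch assumes a radix-2 floating-point type, a two's complement integer type (so that min() is a negative power of 2 and converts exactly), and that max() + 1 is within the finite range of F.

```cpp
#include <cstdint>
#include <limits>
#include <type_traits>

// True if truncating x toward zero yields a value representable in I.
// NaN and infinities are rejected.
template <typename I, typename F>
constexpr bool fits_after_truncation(F x)
{
    static_assert(std::is_integral_v<I>);
    static_assert(std::is_floating_point_v<F>);

    // Exact max() + 1, formed as in DBL_INT_MAXP1.
    constexpr F max_p1 =
        F(2) * static_cast<F>(std::numeric_limits<I>::max() / 2 + 1);

    if (!(x < max_p1))
        return false; // too large, or NaN

    if constexpr (std::is_signed_v<I>)
    {
        // min() is a power of 2 here, so the conversion is exact; subtract
        // it and compare against -1 instead of forming min() - 1.
        constexpr F min = static_cast<F>(std::numeric_limits<I>::min());
        return x - min > F(-1);
    }
    else
    {
        return x > F(-1);
    }
}

// Compile-time usage: 2^31 does not fit in a 32-bit signed integer.
static_assert(!fits_after_truncation<std::int32_t, double>(2147483648.0));
```

The three failure paths collapse into a single false result here; a caller that needs to tell them apart can test x > 0 and x != x after a failure, as the int/double version does.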

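On floating-point types with a non-binary radix (FLT_RADIX != 2)

With FLT_RADIX = 4, 8, 16, ..., the conversion is also exact. With FLT_RADIX == 10, the code is exact at least up to a 34-bit int, since double must encode +/-10^10 exactly. So a problem arises only on a machine with FLT_RADIX == 10 and a 64-bit int, which is low risk. From memory, the last FLT_RADIX == 10 machine in production was over a decade ago.

The integer type is always encoded as 2's complement (most common), 1s' complement, or sign-magnitude. INT_MAX is always a power of 2 minus 1. INT_MIN is always a negative power of 2, or that value plus 1. Effectively, always base 2.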
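
These assumptions can be checked at compile time. The following is a small sketch with hypothetical helper names (max_is_mersenne, radix_is_power_of_two); it only tests the properties named above and is not part of the answer's code.

```cpp
#include <limits>
#include <type_traits>

// True if std::numeric_limits<I>::max() has the form 2^n - 1. The check is
// done in the unsigned counterpart of I, so m + 1 cannot overflow a signed
// type.
template <typename I>
constexpr bool max_is_mersenne()
{
    static_assert(std::is_integral_v<I> && !std::is_same_v<I, bool>);
    using U = std::make_unsigned_t<I>;
    constexpr U m = static_cast<U>(std::numeric_limits<I>::max());
    return (m & static_cast<U>(m + 1)) == 0;
}

// True if the radix of F is a power of 2 (2, 4, 8, 16, ...), the case in
// which in-range powers of 2 are represented exactly.
template <typename F>
constexpr bool radix_is_power_of_two()
{
    static_assert(std::is_floating_point_v<F>);
    constexpr int r = std::numeric_limits<F>::radix;
    return r > 0 && (r & (r - 1)) == 0;
}

// Compile-time usage.
static_assert(max_is_mersenne<int>());
```

A static_assert on radix_is_power_of_two<double>() would make a build fail on the rare decimal-radix platform instead of silently relying on the 34-bit argument.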

Answer 2

Any ideas?

#include <limits>

template <typename I, typename F>
constexpr F maxConvertible()
{
    I i = std::numeric_limits<I>::max();
    F f = F(i);
    while(F(i) == f)
    {
        --i;
    }
    return F(i);
}

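Because of rounding we may have ended up with a maximum that is too large, so we count down until we reach the next smaller representable double, which should fit in the integer type...

An open issue: this works fine if the conversion to double rounds up. However, even IEEE 754 allows different rounding modes (if round-to-nearest applies, which should be the most common rounding mode on current hardware, rounding up will always occur...).

I have not yet found a way to safely detect rounding down (I may add one later; at least for detecting round-to-nearest there is already a solution here). If rounding down happens, we get some false negatives for values near the integer type's maximum and minimum, which you might consider acceptable for the few exotic architectures that actually round down.

Independent of rounding up or down, there is a special case for signed integers anyway: if the integer is represented in two's complement and has more bits than the mantissa of the floating-point type, then the type's minimum value is representable as a floating-point value, while some greater values are not. Catching this case needs special handling.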
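
One way to sidestep both the rounding-mode question and the potentially very long countdown is to build the bound in integer arithmetic first. The sketch below is an alternative, not this answer's method: it clears the low bits of max() that F cannot hold, so the remaining value converts exactly whatever the rounding mode. The name max_convertible is illustrative, and the sketch assumes a radix-2 floating-point type whose finite range covers max().

```cpp
#include <cstdint>
#include <limits>
#include <type_traits>

// Largest value of type F that is not greater than
// std::numeric_limits<I>::max().
template <typename I, typename F>
constexpr F max_convertible()
{
    static_assert(std::is_integral_v<I>);
    static_assert(std::is_floating_point_v<F>);
    static_assert(std::numeric_limits<F>::radix == 2, "sketch assumes radix 2");

    constexpr int int_bits = std::numeric_limits<I>::digits; // value bits in I
    constexpr int fp_bits  = std::numeric_limits<F>::digits; // significand bits in F

    I i = std::numeric_limits<I>::max();
    if constexpr (int_bits > fp_bits)
    {
        // Keep only the top fp_bits bits; the result has at most fp_bits
        // significant bits and therefore converts without rounding.
        constexpr int drop = int_bits - fp_bits;
        i = static_cast<I>((i >> drop) << drop);
    }
    return static_cast<F>(i);
}

// Compile-time usage: a 53-bit significand holds INT32_MAX exactly.
static_assert(max_convertible<std::int32_t, double>() == 2147483647.0);
```

For std::int32_t and a 24-bit float significand this yields 2^31 - 128, the value the countdown loop would reach under round-to-nearest after 64 decrements.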

Answer 3

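This approach uses the definition of floating-point formats in the C (not C++, see the first comment) standard. Knowing the number of digits in the significand (provided by numeric_limits::digits) and the exponent limit (provided by numeric_limits::max_exponent), we can prepare exact values as end points.

I believe it will work in all conforming C++ implementations, subject to the modest additional requirements stated in the initial comment. It supports floating-point formats with or without infinities, with ranges wider or narrower than the destination integer format, and with any rounding rules (because it uses only floating-point arithmetic with exactly representable results, so rounding should never be needed).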

/*  This code demonstrates safe conversion of floating-point to integer in
    which the input floating-point value is converted to integer if and only if
    it is in the supported domain for such conversions (the open interval
    (Min-1, Max+1), where Min and Max are the minimum and maximum values
    representable in the integer type).  If the input is not in range, an error
    is thrown and no conversion is performed.  This throw can be replaced by any
    desired error-indication mechanism so that all behavior is defined.

    There are a few requirements not fully covered by the C++ standard.  They
    should be uncontroversial and supported by all reasonable C++
    implementations:

        The floating-point format is as described in C 2011 5.2.4.2.2 (modeled
        by the product of a sign, a number of digits in some base b, and base b
        raised to an exponent).  I do not see this explicitly specified in the
        C++ standard, but it is implied by the characteristics specified in
        std::numeric_limits.  (For example, C++ requires numeric_limits to
        provide the number of base-b digits in the floating-point
        representation, where b is the radix used, which means the
        representation must have base-b digits.)

        The following operations are exact in floating-point.  (All of them
        are elementary operations and have mathematical results that are
        exactly representable, so there is no need for rounding, and hence
        exact results are expected in any sane implementation.)

            Dividing by the radix of the floating-point format, within its
            range.

            Multiplying by +1 or -1.

            Adding or subtracting two values whose sum or difference is
            representable.

        std::numeric_limits<FPType>::min_exponent is not greater than
        -std::numeric_limits<FPType>::digits.  (The code can be modified to
        eliminate this requirement.)
*/


#include <iostream> //  Not needed except for demonstration.
#include <limits>


/*  Define a class to support safe floating-point to integer conversions.

    This sample code throws an exception when a source floating-point value is
    not in the domain for which a correct integer result can be produced, but
    the throw can be replaced with any desired code, such as returning an error
    indication in an auxiliary object.  (For example, one could return a pair
    consisting of a success/error status and the destination value, if
    successful.)

    FPType is the source floating-point type.
    IType is the destination integer type.
*/
template<typename FPType, typename IType> class FPToInteger
{
private:

    /*  Wrap the bounds we need in a static object so it can be easily
        initialized just once for the entire program.
    */
    static class StaticData
    {
    private:

        /*  This function helps us find the FPType values just inside the
            interval (Min-1, Max+1), where Min and Max are the minimum and
            maximum values representable in the integer type.

            It returns the FPType of the same sign of x+s that has the greatest
            magnitude less than x+s, where s is -1 or +1 according to whether x
            is non-positive or positive.
        */
        static FPType BiggestFPType(IType x)
        {
            /*  All references to "digits" in this routine refer to digits in
                base std::numeric_limits<FPType>::radix.  For example, in base
                3, 77 would have four digits (2212).  Zero is considered to
                have zero digits.

                In this routine, "bigger" and "smaller" refer to magnitude.  (3
                is greater than -4, but -4 is bigger than 3.) */

            //  Abbreviate std::numeric_limits<FPType>::radix.
            const int Radix = std::numeric_limits<FPType>::radix;

            //  Determine the sign.
            int s = 0 < x ? +1 : -1;

            //  Count how many digits x has.
            IType digits = 0;
            for (IType t = x; t; ++digits)
                t /= Radix;

            /*  If the FPType type cannot represent finite numbers this big,
                return the biggest finite number it can hold, with the desired
                sign.
            */
            if (std::numeric_limits<FPType>::max_exponent < digits)
                return s * std::numeric_limits<FPType>::max();

            //  Determine whether x is exactly representable in FPType.
            if (std::numeric_limits<FPType>::digits < digits)
            {
                /*  x is not representable, so we will return the next lower
                    representable value by removing just as many low digits as
                    necessary.  Note that x+s might be representable, but we
                    want to return the biggest FPType less than it, which, in
                    this case, is also the biggest FPType less than x.
                */

                /*  Figure out how many digits we have to remove to leave at
                    most std::numeric_limits<FPType>::digits digits.
                */
                digits = digits - std::numeric_limits<FPType>::digits;

                //  Calculate Radix to the power of digits.
                IType t = 1;
                while (digits--) t *= Radix;

                return x / t * t;
            }
            else
            {
                /*  x is representable.  To return the biggest FPType smaller
                    than x+s, we will fill the remaining digits with Radix-1.
                */

                //  Figure out how many additional digits FPType can hold.
                digits = std::numeric_limits<FPType>::digits - digits;

                /*  Put a 1 in the lowest available digit, then subtract from 1
                    to set each digit to Radix-1.  (For example, 1 - .001 =
                    .999.)
                */
                FPType t = 1;
                while (digits--) t /= Radix;
                t = 1-t;

                //  Return the biggest FPType smaller than x+s.
                return x + s*t;
            }
        }

    public:

        /*  These values will be initialized to the greatest FPType value less
            than std::numeric_limits<IType>::max()+1 and the least FPType value
            greater than std::numeric_limits<IType>::min()-1.
        */
        const FPType UpperBound, LowerBound;

        //  Constructor to initialize supporting data for FPToInteger.
        StaticData()
            : UpperBound(BiggestFPType(std::numeric_limits<IType>::max())),
              LowerBound(BiggestFPType(std::numeric_limits<IType>::min()))
        {
            //  Show values, just for illustration.
            std::cout.precision(99);
            std::cout << "UpperBound = " << UpperBound << ".\n";
            std::cout << "LowerBound = " << LowerBound << ".\n";
        }

    } Data;


public:


    FPType value;


    //  Constructor.  Just remember the source value.
    FPToInteger(FPType x) : value(x) {}


    /*  Perform the conversion.  If the conversion is defined, return the
        converted value.  Otherwise, throw an exception.
    */
    operator IType()
    {
        if (Data.LowerBound <= value && value <= Data.UpperBound)
            return value;
        else
            throw "Error, source floating-point value is out of range.";
    }
};


template<typename FPType, typename IType>
    typename FPToInteger<FPType, IType>::StaticData
        FPToInteger<FPType, IType>::Data;


typedef double FPType;
typedef int    IType;


//  Show what the class does with a requested value.
static void Test(FPType x)
{
    try
    {
        IType y = FPToInteger<FPType, IType>(x);
        std::cout << x << " -> " << y << ".\n";
    }
    catch (...)
    {
        std::cout << x << " is not in the domain.\n";
    }
}


#include <cmath>


int main(void)
{
    std::cout.precision(99);

    //  Simple demonstration (not robust testing).
    Test(0);
    Test(0x1p31);
    Test(std::nexttoward(0x1p31, 0));
    Test(-0x1p31-1);
    Test(std::nexttoward(-0x1p31-1, 0));
}
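
For a 32-bit int and an IEEE 754 binary64 double (both are assumptions, not guarantees), working through BiggestFPType by hand gives UpperBound = 2^31 - 2^-22 and LowerBound = -2^31 - 1 + 2^-21. The sketch below cross-checks those two values with std::nextafter; the function names are illustrative only.

```cpp
#include <cmath>

// Greatest double below 2^31, i.e. below INT_MAX + 1 for a 32-bit int.
inline double upper_bound_int32()
{
    return std::nextafter(0x1p31, 0.0);
}

// Least double above -2^31 - 1, i.e. above INT_MIN - 1 for a 32-bit int.
inline double lower_bound_int32()
{
    return std::nextafter(-0x1p31 - 1.0, 0.0);
}
```

Both bounds lie strictly inside the open interval (Min-1, Max+1), which is why the conversion operator can use closed comparisons (<=) against them.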

Answer 4

Can you not just do

static_cast<F>(static_cast<I>(x)) == floor(x)

？
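
Note that static_cast<I>(x) is itself undefined behavior when the truncated value of x cannot be represented in I, so the round trip cannot be the first step. A hedged sketch that performs the range test before the cast is shown below. checked_cast is a hypothetical name, and the sketch assumes a radix-2 floating-point type, a two's complement integer type (so min() converts exactly), and that max() + 1 is within the finite range of F.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>
#include <optional>
#include <type_traits>

// Returns x converted to I if x is a whole number representable in I,
// otherwise std::nullopt. The cast is only reached once x is known to be
// in range, so it never has undefined behavior.
template <typename I, typename F>
std::optional<I> checked_cast(F x)
{
    static_assert(std::is_integral_v<I>);
    static_assert(std::is_floating_point_v<F>);

    // Exact max() + 1 and exact min() under the stated assumptions.
    constexpr F max_p1 =
        F(2) * static_cast<F>(std::numeric_limits<I>::max() / 2 + 1);
    constexpr F min = static_cast<F>(std::numeric_limits<I>::min());

    if (!(x >= min && x < max_p1))
        return std::nullopt; // out of range, infinite, or NaN

    if (std::trunc(x) != x)
        return std::nullopt; // has a fractional part

    return static_cast<I>(x);
}
```

For example, checked_cast<std::int32_t>(42.0) holds 42, while 42.5, 2147483648.0, and NaN all produce an empty optional.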

Reliable overflow detection of floating-point/integer type conversion
