Перечислять коэффициенты числа непосредственно в порядке возрастания без сортировки?

Существует ли эффективный алгоритм для перечисления множителей числа n в порядке возрастания, без сортировки? Под "эффективным" я имею в виду:

Алгоритм избегает поиска грубой силы для делителей, начиная с факторизации первичной мощности n.
Сложность алгоритма выполнения - O (d log₂ d) или лучше, где d - количество делителей n.
Пространственная сложность алгоритма O (d).
Алгоритм избегает операции сортировки. То есть факторы производятся по порядку, а не выходят из строя и сортируются впоследствии. Хотя перечисление с использованием простого рекурсивного подхода, а затем сортировка - O (d log₂ d), для всех обращений к памяти, участвующих в сортировке, существует очень уродливая стоимость.

Тривиальный пример: n = 360 = 2 ³ × 3² × 5, который имеет d = 24 фактора: {1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 18, 20, 24, 30, 36, 40, 45, 60, 72, 90, 120, 180, 360}.

Более серьезным примером является n = 278282512406132373381723386382308832000 = 2 × × 3 × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × 71 × 73 × 79, что имеет d = 318504960 факторов (очевидно, слишком много, чтобы перечислять здесь!). Между прочим, это число имеет наибольшее число факторов любого числа до 2 ^ 128.

Я мог бы поклясться, что несколько месяцев назад я увидел описание такого алгоритма с образцом кода, но теперь я не могу найти его нигде. Он использовал какой-то волшебный трюк для поддержания списка индексов предшественников в выходном списке для каждого основного фактора. (Обновление: я запутался в генерации факторов с номерами Хэмминга, которые работают аналогично.)

Update

В итоге я использовал решение, которое является O (d) во время выполнения, имеет чрезвычайно низкую накладную память, создавая вывод O (d) на месте и значительно быстрее, чем любой другой метод, о котором я знаю. Я отправил это решение в качестве ответа, с исходным кодом C. Это сильно оптимизированная, упрощенная версия красивого алгоритма, представленного здесь в Haskell другим автором, Will Ness. Я выбрал "Ответ" в качестве принятого ответа, поскольку он предоставил очень элегантное решение, соответствующее всем требованиям, как изначально указано.

Ответ 1

Мы можем объединить потоки кратных, созданных таким образом, чтобы в первую очередь не было дубликатов.

Начиная со списка [1] для каждого уникального простого множителя p, мы умножаем список на p итеративно k раз (где k - кратность p), чтобы получить k новые списки и объединить их вместе с этим входящим списком.

В Haskell

ordfactors n =          --   <----------------------<---<---<-----\
  foldr g [1]           -- g (19,1) (g (7,1) (g (3,2) (g (2,3) [1])))
    . reverse           -- [(19,1),(7,1),(3,2),(2,3)]              ^
    . primePowers       -- [(2,3),(3,2),(7,1),(19,1)]              |
    $ n                 -- 9576                    --->--->--->----/
       where
         g (p,k) xs = mergeAsTree 
                        . take (k+1)          -- take 3 [a,b,c,d..] = [a,b,c]
                        . iterate (map (p*))  -- iterate f x = [x, f x, f(f x),...]
                        $ xs

{-  g (2,3) [1] = let a0 = [1]        
                      a1 = map (2*) a0               -- [2]
                      a2 = map (2*) a1               -- [4]
                      a3 = map (2*) a2               -- [8]
                  in mergeAsTree [ a0, a1, a2, a3 ]  -- xs2 = [1,2,4,8]

    g (3,2) xs2 = let b0 = xs2                       -- [1,2,4,8]
                      b1 = map (3*) b0               -- [3,6,12,24]
                      b2 = map (3*) b1               -- [9,18,36,72]
                  in mergeAsTree [ b0, b1, b2 ]      -- xs3 = [1,2,3,4,6,8,9,12,...] 

    g (7,1) xs3 = mergeAsTree [ xs3, map (7*) xs3 ]  -- xs4 = [1,2,3,4,6,7,8,9,...]

    g (19,1) xs4 = mergeAsTree [ xs4, map (19*) xs4 ]  
-}
mergeAsTree xs   = foldi merge [] xs    -- [a,b]     --> merge a b
                                        -- [a,b,c,d] --> merge (merge a b) (merge c d)
foldi f z []     = z
foldi f z (x:xs) = f x (foldi f z (pairs f xs))
pairs f (x:y:t)  = f x y : pairs f t
pairs _ t        = t

foldi упорядочивает двоичный merge s (который предполагает, что дубликатов в потоках не объединяется, для упрощенной операции) в виде дерева для скорости; тогда как foldr работает линейно.

primePowers n | n > 1 =           -- 9576 => [(2,3),(3,2),(7,1),(19,1)]
  map (\xs -> (head xs,length xs)) --                                ^
    . group                       -- [[2,2,2],[3,3],[7],[19]]        |
    . factorize                   -- [2,2,2,3,3,7,19]                |
    $ n                           -- 9576             --->--->--->---/

group - стандартная функция, объединяющая смежные равные в списке вместе, а factorize - хорошо известная пробная версия, который генерирует простые множители числа в неубывающем порядке:

factorize n | n > 1 = go n (2:[3,5..])   -- 9576 = 2*2*2*3*3*7*19
   where                                 -- produce prime factors only
     go n (d:t)
        | d*d > n    = [n]
        | r == 0     =  d : go q (d:t)
        | otherwise  =      go n t
            where 
              (q,r)  = quotRem n d
factorize _ = []

(.) является оператором функциональной компоновки, соединяющим выход своей функции аргумента справа от входа в левую. Он (и $) можно читать вслух как "из".

Ответ 2

Короче: многократно вытаскивайте из кучи следующий наименьший коэффициент, а затем отбрасывайте каждый кратный этого фактора, который по-прежнему остается фактором n. Используйте трюк, чтобы избежать дублирования, так что размер кучи никогда не превышает d. Сложность времени равна O (kd log d), где k - число различных простых факторов.

Ключевое свойство, которое мы используем, состоит в том, что если x и y оба являются факторами n с y = x * p для некоторого фактора p >= 2 - т.е. если простые множители x являются собственным подмножеством простые множители y - тогда x < у. Это означает, что всегда безопасно выводить x до того, как любой такой y будет даже вставлен в кучу. Куча эффективно используется только для сравнения факторов, когда ни один из них не является подмножеством другого.

Первая попытка: Дублирует медленные события вниз

Будет полезно сначала описать алгоритм, который дает правильный ответ, но также создает много дубликатов:

Установить prev = NULL.
Вставьте 1 в кучу H.
Извлеките вершину кучи t из H. Если куча пуста, остановитесь.
Если t == prev, то goto 3. [EDIT: Fixed]
Вывод t.
Установить prev = t.
Для каждого отдельного простого множителя p из n:
- Если n% (t * p) == 0 (т.е. если t * p все еще является фактором n), нажимайте t * p на H.
Перейти к 3.

Единственная проблема с этим алгоритмом заключается в том, что он может генерировать один и тот же множитель много раз. Например, если n = 30, то коэффициент 15 будет генерироваться как дочерний фактор фактора 5 (путем умножения на простой коэффициент 3), а также как дочерний фактор фактора 3 (путем умножения на 5). Один из способов - заметить, что любые дубликаты должны быть считаны в непрерывном блоке, когда они достигают вершины кучи, поэтому вы можете просто проверить, равна ли вершина кучи справедливому извлеченному значению и сохранить извлечение и отбрасывание, если это так. Но возможен лучший подход:

Убийство дубликатов в источнике

Сколько способов может генерировать фактор x? Сначала рассмотрим случай, когда х не содержит простых множителей с кратностью > 1. В этом случае, если он содержит т различных простых факторов, то существуют m-1 "родительские" факторы, которые будут генерировать его как "ребенка" в предыдущем алгоритм - каждый из этих родителей состоит из некоторого подмножества m-1 из m простых факторов, причем остальным простым фактором является тот, который добавляется к ребенку. (Если x имеет простой множитель с кратностью > 1, то на самом деле есть m родителей.) Если бы у нас был способ решить, что именно один из этих родителей является "избранным", который на самом деле порождает x в качестве ребенка, и это правило привело к тесту, который может быть применен к каждому родителю в момент выталкивания родителя, тогда мы могли бы избежать создания каких-либо дубликатов в первую очередь.

Мы можем использовать следующее правило: для любого заданного x выберите потенциальный родитель y, который не имеет наибольшего из x m факторов. Это приводит к простому правилу: родительский элемент y производит дочерний элемент x тогда и только тогда, когда x = y * p для некоторого p, большего или равного любому простому коэффициенту, уже находящемуся в y. Это легко проверить: просто пропустите простые коэффициенты в порядке убывания, порождая детей для каждого, пока мы не нажмем основной фактор, который уже делит y. В предыдущем примере родительский элемент 3 будет производить 15, но родительский элемент 5 не будет (потому что 3 < 5) - так что 15 действительно будет выпущен только один раз. Для n = 30 полное дерево выглядит следующим образом:

Обратите внимание, что каждый фактор генерируется ровно один раз.

Новый, не требующий дублирования алгоритм выглядит следующим образом:

Вставьте 1 в кучу H.
Извлеките вершину кучи t из H. Если куча пуста, остановитесь.
Вывод t.
Для каждого отдельного простого множителя p из n в порядке убывания:
- Если n% (t * p) == 0 (т.е. если t * p все еще является фактором n), нажимайте t * p на H.
- Если t% p == 0 (т.е. если t уже содержит p как фактор), остановите.
Перейти к 2.

Ответ 3

Этот ответ дает реализацию C, которая, по моему мнению, является самой быстрой и наиболее эффективной с точки зрения памяти.

Обзор алгоритма. Этот алгоритм основан на подходе слияния снизу вверх, введенном Will Ness в другом ответе, но далее упрощается, так что списки, которые сливаются, на самом деле никогда не существуют где-либо в памяти. Элемент head каждого списка ухожен и хранится в небольшом массиве, а все остальные элементы списков строятся "на лету" по мере необходимости. Это использование "phantom lists" файлов воображения текущего кода значительно сокращает объем памяти, а также объем доступа к памяти, как чтение, так и запись, а также улучшает пространственную локальность, что, в свою очередь, значительно увеличивается скорость алгоритма. Факторы на каждом уровне записываются непосредственно в их конечное место отдыха в выходном массиве по порядку.

Основная идея состоит в том, чтобы произвести факторы, используя математическую индукцию по факторизации первичной мощности. Например, для создания коэффициентов 360 коэффициенты 72 вычисляются, а затем умножаются на соответствующие полномочия 5, в этом случае {1,5}; для получения коэффициентов 72 коэффициенты 8 вычисляются, а затем умножаются на соответствующие полномочия 3, в данном случае {1,3,9}; для получения коэффициентов 8 базовый случай 1 умножается на соответствующие полномочия 2, в этом случае {1,2,4,8}. Таким образом, на каждом индуктивном этапе вычисляется декартово произведение между множествами существующих факторов и множествами простых степеней, и результаты сводятся к целым числам посредством умножения.

Ниже приведено изображение для номера 360. Слева направо - ячейки памяти; одна строка представляет один временной шаг. Время представлено как текущее по вертикали.

Пространственная сложность. Временные структуры данных для создания факторов крайне малы: используется только O (log₂ (n)) пространство, где n - число, факторы которого генерируются. Например, в 128-битной реализации этого алгоритма в C число, например 333,939,014,887,358,848,058,068,063,658,770,598,400 (логарифм с базой-2 составляет ≈127,97) выделяет 5,1 ГБ для хранения списка его 318,504,960 факторов, но использует только 360 байт дополнительных накладных расходов для составления этого списка. В лучшем случае для любого 128-битного номера требуется примерно 5 КБ служебных данных.

Среда выполнения.. Время выполнения полностью зависит от показателей факторизации простой мощности (например, первичной подписи). Для достижения наилучших результатов наибольшие экспоненты должны обрабатываться сначала, а наименьшие показатели - последним, так что время выполнения доминирует на последних этапах слияниями с низкой сложностью, которые на практике часто оказываются бинарными слияниями. Асимптотическая среда исполнения O (c (n) d (n)), где d (n) - число делителей n и где c (n) - некоторая константа, зависящая от простой сигнатуры n. То есть каждый класс эквивалентности, связанный с простой сигнатурой, имеет другую константу. Первичные подписи, в которых доминируют маленькие экспоненты, имеют меньшие константы; первичные подписи, в которых доминируют большие экспоненты, имеют большие константы. Таким образом, сложность выполнения кластеризуется с помощью простой подписи.

Графики. Производительность исполнения была профилирована на 3,4 ГГц. Intel Core i7 для выборки из 66 591 значений n, имеющих d (n) коэффициенты для уникального d (n) до 160 миллионов. Наибольшее значение n, профилированное для этого графика, составило 14 550 525 518 294 259 162 244 162 727 840 640 000, что составляет 159 744 000 факторов за 2,35 секунды.

Количество отсортированных факторов, производимых в секунду, по существу, является инверсией вышеизложенного. В данных очевидна кластеризация с помощью простой подписи. На производительность также влияют размеры кеша L1, L2 и L3, а также ограничения физической оперативной памяти.

Исходный код. Приведенный ниже пример является концептуальной программой на языке программирования C (в частности, C11). Он был протестирован на x86-64 с помощью Clang/LLVM, хотя он также отлично работает с GCC.

/*==============================================================================

DESCRIPTION

   This is a small proof-of-concept program to test the idea of generating the
   factors of a number in ascending order using an ultra-efficient sortless
   method.


INPUT

   Input is given on the command line, either as a single argument giving the
   number to be factored or an even number of arguments giving the 2-tuples that
   comprise the prime-power factorization of the desired number.  For example,
   the number

      75600 = 2^4 x 3^3 x 5^2 x 7

   can be given by the following list of arguments:

      2 4 3 3 5 2 7 1

   Note:  If a single number is given, it will require factoring to produce its
   prime-power factorization.  Since this is just a small test program, a very
   crude factoring method is used that is extremely fast for small prime factors
   but extremely slow for large prime factors.  This is actually fine, because
   the largest factor lists occur with small prime factors anyway, and it is the
   production of large factor lists at which this program aims to be proficient.
   It is simply not interesting to be fast at producing the factor list of a
   number like 17293823921105882610 = 2 x 3 x 5 x 576460797370196087, because
   it has only 32 factors.  Numbers with tens or hundreds of thousands of
   factors are much more interesting.


OUTPUT

   Results are written to standard output.  A list of factors in ascending order
   is produced, followed by runtime required to generate the list (not including
   time to print it).


AUTHOR

   Todd Lehman
   2015/05/10

*/

//-----------------------------------------------------------------------------
#include <inttypes.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdarg.h>
#include <string.h>
#include <ctype.h>
#include <time.h>
#include <math.h>
#include <assert.h>

//-----------------------------------------------------------------------------
typedef  unsigned int  uint;
typedef  uint8_t       uint8;
typedef  uint16_t      uint16;
typedef  uint32_t      uint32;
typedef  uint64_t      uint64;
typedef  __uint128_t   uint128;

#define  UINT128_MAX  (uint128)(-1)

#define  UINT128_MAX_STRLEN  39

//-----------------------------------------------------------------------------
#define  ARRAY_CAPACITY(x)  (sizeof(x) / sizeof((x)[0]))

//-----------------------------------------------------------------------------
// This structure encode a single prime-power pair for the prime-power
// factorization of numbers, for example 3 to the 4th power.

#pragma pack(push, 8)
typedef struct
{
  uint128  p;  // Prime.
  uint8    e;  // Power (exponent).
}
PrimePower;   // 24 bytes using 8-byte packing
#pragma pack(pop)

//-----------------------------------------------------------------------------
// Prime-power factorization structure.
//
// This structure is sufficient to represent the prime-power factorization of
// all 128-bit values.  The field names ω and Ω are dervied from the standard
// number theory functions ω(n) and Ω(n), which count the number of unique and
// non-unique prime factors of n, respectively.  The field name d is derived
// from the standard number theory function d(n), which counts the number of
// divisors of n, including 1 and n.
//
// The maximum possible value here of ω is 26, which occurs at
// n = 232862364358497360900063316880507363070 = 2 x 3 x 5 x 7 x 11 x 13 x 17 x
// 19 x 23 x 29 x 31 x 37 x 41 x 43 x 47 x 53 x 59 x 61 x 67 x 71 x 73 x 79 x
// 83 x 89 x 97 x 101, which has 26 unique prime factors.
//
// The maximum possible value of Ω here is 127, which occurs at n = 2^127 and
// n = 2^126 x 3, both of which have 127 non-unique prime factors.
//
// The maximum possible value of d here is 318504960, which occurs at
// n = 333939014887358848058068063658770598400 = 2^9 x 3^5 x 5^2 x 7^2 x 11^2 x
// 13^2 x 17 x 19 x 23 x 29 x 31 x 37 x 41 x 43 x 47 x 53 x 59 x 61 x 67 x 71 x
// 73 x 79.
//
#pragma pack(push, 8)
typedef struct
{
  PrimePower  f[32];  // Primes and their exponents.
  uint8       ω;      // Count of prime factors without multiplicity.
  uint8       Ω;      // Count of prime factors with multiplicity.
  uint32      d;      // Count of factors of n, including 1 and n.
  uint128     n;      // Value of n on which all other fields depend.
}
PrimePowerFactorization;  // 656 bytes using 8-byte packing
#pragma pack(pop)

#define  MAX_ω  26
#define  MAX_Ω  127

//-----------------------------------------------------------------------------
// Fatal error:  print error message and abort.

void fatal_error(const char *format, ...)
{
  va_list args;
  va_start(args, format);
  vfprintf(stderr, format, args);
  exit(1);
}

//------------------------------------------------------------------------------
uint128 uint128_from_string(const char *const str)
{
  assert(str != NULL);

  uint128 n = 0;
  for (int i = 0; isdigit(str[i]); i++)
    n = (n * 10) + (uint)(str[i] - '0');

  return n;
}

//------------------------------------------------------------------------------
void uint128_to_string(const uint128 n,
                       char *const strbuf, const uint strbuflen)
{
  assert(strbuf != NULL);
  assert(strbuflen >= UINT128_MAX_STRLEN + 1);

  // Extract digits into string buffer in reverse order.
  uint128 a = n;
  char *s = strbuf;
  do { *(s++) = '0' + (uint)(a % 10); a /= 10; } while (a != 0);
  *s = '\0';

  // Reverse the order of the digits.
  uint l = strlen(strbuf);
  for (uint i = 0; i < l/2; i++)
    { char t = strbuf[i]; strbuf[i] = strbuf[l-1-i]; strbuf[l-1-i] = t; }

  // Verify result.
  assert(uint128_from_string(strbuf) == n);
}

//------------------------------------------------------------------------------
char *uint128_to_static_string(const uint128 n, const uint i)
{
  static char str[2][UINT128_MAX_STRLEN + 1];
  assert(i < ARRAY_CAPACITY(str));
  uint128_to_string(n, str[i], ARRAY_CAPACITY(str[i]));
  return str[i];
}

//------------------------------------------------------------------------------
// Compute sorted list of factors, given a prime-power factorization.

uint128 *compute_factors(const PrimePowerFactorization ppf)
{
  const uint128 n =       ppf.n;
  const uint    d = (uint)ppf.d;
  const uint    ω = (uint)ppf.ω;

  if (n == 0)
    return NULL;

  uint128 *factors = malloc((d + 1) * sizeof(*factors));
  if (!factors)
    fatal_error("Failed to allocate array of %u factors.", d);
  uint128 *const factors_end = &factors[d];


  // --- Seed the factors[] array.

  factors_end[0] = 0;   // Dummy value to simplify looping in bottleneck code.
  factors_end[-1] = 1;  // Seed value.

  if (n == 1)
    return factors;


  // --- Iterate over all prime factors.

  uint range = 1;
  for (uint i = 0; i < ω; i++)
  {
    const uint128 p = ppf.f[i].p;
    const uint    e = ppf.f[i].e;

    // --- Initialize phantom input lists and output list.
    assert(e < 128);
    assert(range < d);
    uint128 *restrict in[128];
    uint128 pe[128], f[128];
    for (uint j = 0; j <= e; j++)
    {
      in[j] = &factors[d - range];
      pe[j] = (j == 0)? 1 : pe[j-1] * p;
      f[j] = pe[j];
    }
    uint active_list_count = 1 + e;
    range *= 1 + e;
    uint128 *restrict out = &factors[d - range];

    // --- Merge phantom input lists to output list, until all input lists are
    //     extinguished.
    while (active_list_count > 0)
    {
      if (active_list_count == 1)
      {
        assert(out == in[0]);
        while (out != factors_end)
          *(out++) *= pe[0];
        in[0] = out;
      }
      else if (active_list_count == 2)
      {
        // This section of the code is the bottleneck of the entire factor-
        // producing algorithm.  Other portions need to be fast, but this
        // *really* needs to be fast; therefore, it has been highly optimized.
        // In fact, it is by far most frequently the case here that pe[0] is 1,
        // so further optimization is warranted in this case.
        uint128 f0 = f[0], f1 = f[1];
        uint128 *in0 = in[0], *in1 = in[1];
        const uint128 pe0 = pe[0], pe1 = pe[1];

        if (pe[0] == 1)
        {
          while (true)
          {
            if (f0 < f1)
              { *(out++) = f0; f0 = *(++in0);
                if (in0 == factors_end) break; }
            else
              { *(out++) = f1; f1 = *(++in1) * pe1; }
          }
        }
        else
        {
          while (true)
          {
            if (f0 < f1)
              { *(out++) = f0; f0 = *(++in0) * pe0;
                if (in0 == factors_end) break; }
            else
              { *(out++) = f1; f1 = *(++in1) * pe1; }
          }
        }

        f[0] = f0; f[1] = f1;
        in[0] = in0; in[1] = in1;
      }
      else if (active_list_count == 3)
      {
        uint128 f0 = f[0], f1 = f[1], f2 = f[2];
        uint128 *in0 = in[0], *in1 = in[1], *in2 = in[2];
        const uint128 pe0 = pe[0], pe1 = pe[1], pe2 = pe[2];

        while (true)
        {
          if (f0 < f1)
          {
            if (f0 < f2)
              { *(out++) = f0; f0 = *(++in0) * pe0;
                if (in0 == factors_end) break; }
            else
              { *(out++) = f2; f2 = *(++in2) * pe2; }
          }
          else
          {
            if (f1 < f2)
              { *(out++) = f1; f1 = *(++in1) * pe1; }
            else
              { *(out++) = f2; f2 = *(++in2) * pe2; }
          }
        }

        f[0] = f0; f[1] = f1, f[2] = f2;
        in[0] = in0; in[1] = in1, in[2] = in2;
      }
      else if (active_list_count >= 3)
      {
        while (true)
        {
          // Chose the smallest multiplier.
          uint k_min = 0;
          uint128 f_min = f[0];
          for (uint k = 0; k < active_list_count; k++)
            if (f[k] < f_min)
              { f_min = f[k]; k_min = k; }

          // Write the output factor, advance the input pointer, and
          // produce a new factor in the array f[] of list heads.
          *(out++) = f_min;
          f[k_min] = *(++in[k_min]) * pe[k_min];
          if (in[k_min] == factors_end)
            { assert(k_min == 0); break; }
        }
      }

      // --- Remove the newly emptied phantom input list.  Note that this is
      //     guaranteed *always* to be the first remaining non-empty list.
      assert(in[0] == factors_end);
      for (uint j = 1; j < active_list_count; j++)
      {
        in[j-1] = in[j];
        pe[j-1] = pe[j];
         f[j-1] =  f[j];
      }
      active_list_count -= 1;
    }

    assert(out == factors_end);
  }


  // --- Validate array of sorted factors.
  #ifndef NDEBUG
  {
    for (uint k = 0; k < d; k++)
    {
      if (factors[k] == 0)
        fatal_error("Produced a factor of 0 at index %u.", k);

      if (n % factors[k] != 0)
        fatal_error("Produced non-factor %s at index %u.",
                    uint128_to_static_string(factors[k], 0), k);

      if ((k > 0) && (factors[k-1] == factors[k]))
        fatal_error("Duplicate factor %s at index %u.",
                    uint128_to_static_string(factors[k], 0), k);

      if ((k > 0) && (factors[k-1] > factors[k]))
        fatal_error("Out-of-order factors %s and %s at indexes %u and %u.",
                    uint128_to_static_string(factors[k-1], 0),
                    uint128_to_static_string(factors[k], 1),
                    k-1, k);
    }
  }
  #endif


  return factors;
}

//------------------------------------------------------------------------------
// Print prime-power factorization of a number.

void print_ppf(const PrimePowerFactorization ppf)
{
  printf("%s = ", uint128_to_static_string(ppf.n, 0));
  if (ppf.n == 0)
  {
    printf("0");
  }
  else
  {
    for (uint i = 0; i < ppf.ω; i++)
    {
      if (i > 0)
        printf(" x ");

      printf("%s", uint128_to_static_string(ppf.f[i].p, 0));

      if (ppf.f[i].e > 1)
        printf("^%"PRIu8"", ppf.f[i].e);
    }
  }
  printf("\n");
}

//------------------------------------------------------------------------------
int compare_powers_ascending(const void *const pf1,
                             const void *const pf2)
{
  const PrimePower f1 = *((const PrimePower *)pf1);
  const PrimePower f2 = *((const PrimePower *)pf2);

  return  (f1.e < f2.e)?  -1:
          (f1.e > f2.e)?  +1:
                           0;  // Not an error; duplicate exponents are common.
}

//------------------------------------------------------------------------------
int compare_powers_descending(const void *const pf1,
                              const void *const pf2)
{
  const PrimePower f1 = *((const PrimePower *)pf1);
  const PrimePower f2 = *((const PrimePower *)pf2);

  return  (f1.e < f2.e)?  +1:
          (f1.e > f2.e)?  -1:
                           0;  // Not an error; duplicate exponents are common.
}

//------------------------------------------------------------------------------
int compare_primes_ascending(const void *const pf1,
                             const void *const pf2)
{
  const PrimePower f1 = *((const PrimePower *)pf1);
  const PrimePower f2 = *((const PrimePower *)pf2);

  return  (f1.p < f2.p)?  -1:
          (f1.p > f2.p)?  +1:
                           0;  // Error; duplicate primes must never occur.
}

//------------------------------------------------------------------------------
int compare_primes_descending(const void *const pf1,
                              const void *const pf2)
{
  const PrimePower f1 = *((const PrimePower *)pf1);
  const PrimePower f2 = *((const PrimePower *)pf2);

  return  (f1.p < f2.p)?  +1:
          (f1.p > f2.p)?  -1:
                           0;  // Error; duplicate primes must never occur.
}

//------------------------------------------------------------------------------
// Sort prime-power factorization.

void sort_ppf(PrimePowerFactorization *const ppf,
              const bool primes_major,      // Best false
              const bool primes_ascending,  // Best false
              const bool powers_ascending)  // Best false
{
  int (*compare_primes)(const void *, const void *) =
    primes_ascending? compare_primes_ascending : compare_primes_descending;

  int (*compare_powers)(const void *, const void *) =
    powers_ascending? compare_powers_ascending : compare_powers_descending;

  if (primes_major)
  {
    mergesort(ppf->f, ppf->ω, sizeof(ppf->f[0]), compare_powers);
    mergesort(ppf->f, ppf->ω, sizeof(ppf->f[0]), compare_primes);
  }
  else
  {
    mergesort(ppf->f, ppf->ω, sizeof(ppf->f[0]), compare_primes);
    mergesort(ppf->f, ppf->ω, sizeof(ppf->f[0]), compare_powers);
  }
}

//------------------------------------------------------------------------------
// Compute prime-power factorization of a 128-bit value.  Note that this
// function is designed to be fast *only* for numbers with very simple
// factorizations, e.g., those that produce large factor lists.  Do not attempt
// to factor large semiprimes with this function.  (The author does know how to
// factor large numbers efficiently; however, efficient factorization is beyond
// the scope of this small test program.)

PrimePowerFactorization compute_ppf(const uint128 n)
{
  PrimePowerFactorization ppf;

  if (n == 0)
  {
    ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 0, .n = 0 };
  }
  else if (n == 1)
  {
    ppf = (PrimePowerFactorization){ .f[0] = { .p = 1, .e = 1 },
                                     .ω = 1, .Ω = 1, .d = 1, .n = 1 };
  }
  else
  {
    ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 1, .n = n };

    uint128 m = n;
    for (uint128 p = 2; p * p <= m; p += 1 + (p > 2))
    {
      if (m % p == 0)
      {
        assert(ppf.ω <= MAX_ω);
        ppf.f[ppf.ω].p = p;
        ppf.f[ppf.ω].e = 0;
        while (m % p == 0)
          { m /= p; ppf.f[ppf.ω].e += 1; }
        ppf.d *= (1 + ppf.f[ppf.ω].e);
        ppf.Ω += ppf.f[ppf.ω].e;
        ppf.ω += 1;
      }
    }
    if (m > 1)
    {
      assert(ppf.ω <= MAX_ω);
      ppf.f[ppf.ω].p = m;
      ppf.f[ppf.ω].e = 1;
      ppf.d *= 2;
      ppf.Ω += 1;
      ppf.ω += 1;
    }
  }

  return ppf;
}

//------------------------------------------------------------------------------
// Parse prime-power factorization from a list of ASCII-encoded base-10 strings.
// The values are assumed to be 2-tuples (p,e) of prime p and exponent e.
// Primes must not exceed 2^128 - 1 and must not be repeated.  Exponents must
// not exceed 2^8 - 1, but can of course be repeated.  The constructed value
// must not exceed 2^128 - 1.

PrimePowerFactorization parse_ppf(const uint pairs, const char *const values[])
{
  assert(pairs <= MAX_ω);

  PrimePowerFactorization ppf;

  if (pairs == 0)
  {
    ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 0, .n = 0 };
  }
  else
  {
    ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 1, .n = 1 };

    for (uint i = 0; i < pairs; i++)
    {
      ppf.f[i].p = uint128_from_string(values[(i*2)+0]);
      ppf.f[i].e = (uint8)strtoumax(values[(i*2)+1], NULL, 10);

      // Validate prime value.
      if (ppf.f[i].p < 2)  // (Ideally this would actually do a primality test.)
        fatal_error("Factor %s is invalid.",
                    uint128_to_static_string(ppf.f[i].p, 0));

      // Accumulate count of unique prime factors.
      if (ppf.ω > UINT8_MAX - 1)
        fatal_error("Small-omega overflow at factor %s^%"PRIu8".",
                    uint128_to_static_string(ppf.f[i].p, 0), ppf.f[i].e);
      ppf.ω += 1;

      // Accumulate count of total prime factors.
      if (ppf.Ω > UINT8_MAX - ppf.f[i].e)
        fatal_error("Big-omega wverflow at factor %s^%"PRIu8".",
                    uint128_to_static_string(ppf.f[i].p, 0), ppf.f[i].e);
      ppf.Ω += ppf.f[i].e;

      // Accumulate total divisor count.
      if (ppf.d > UINT32_MAX / (1 + ppf.f[i].e))
        fatal_error("Divisor count overflow at factor %s^%"PRIu8".",
                    uint128_to_static_string(ppf.f[i].p, 0), ppf.f[i].e);
      ppf.d *= (1 + ppf.f[i].e);

      // Accumulate value.
      for (uint8 k = 1; k <= ppf.f[i].e; k++)
      {
        if (ppf.n > UINT128_MAX / ppf.f[i].p)
          fatal_error("Value overflow at factor %s.",
                      uint128_to_static_string(ppf.f[i].p, 0));
        ppf.n *= ppf.f[i].p;
      }
    }
  }

  return ppf;
}

//------------------------------------------------------------------------------
// Main control.  Parse command line and produce list of factors.

int main(const int argc, const char *const argv[])
{
  bool primes_major     = false;
  bool primes_ascending = false;
  bool powers_ascending = false;

  PrimePowerFactorization ppf;


  // --- Parse prime-power sort specifier (if present).

  uint value_base = 1;
  uint value_count = (uint)argc - 1;
  if ((argc > 1) && (argv[1][0] == '-'))
  {
    static const struct
    {
      char *str; bool primes_major, primes_ascending, powers_ascending;
    }
    sort_options[] =
    {
                        // Sorting criteria:
                        // ----------------------------------------
      { "ep", 0,0,0 },  // Exponents descending, primes descending
      { "Ep", 0,0,1 },  // Exponents ascending, primes descending
      { "eP", 0,1,0 },  // Exponents descending, primes ascending
      { "EP", 0,1,1 },  // Exponents ascending, primes ascending
      { "p",  1,0,0 },  // Primes descending (exponents irrelevant)
      { "P",  1,1,0 },  // Primes ascending (exponents irrelevant)
    };

    bool valid = false;
    for (uint i = 0; i < ARRAY_CAPACITY(sort_options); i++)
    {
      if (strcmp(&argv[1][1], sort_options[i].str) == 0)
      {
        primes_major     = sort_options[i].primes_major;
        primes_ascending = sort_options[i].primes_ascending;
        powers_ascending = sort_options[i].powers_ascending;
        valid = true;
        break;
      }
    }

    if (!valid)
      fatal_error("Bad sort specifier: \"%s\"", argv[1]);

    value_base += 1;
    value_count -= 1;
  }


  // --- Prime factorization from either a number or a raw prime factorization.

  if (value_count == 1)
  {
    uint128 n = uint128_from_string(argv[value_base]);
    ppf = compute_ppf(n);
  }
  else
  {
    if (value_count % 2 != 0)
      fatal_error("Odd number of arguments (%u) given.", value_count);
    uint pairs = value_count / 2;
    ppf = parse_ppf(pairs, &argv[value_base]);
  }


  // --- Sort prime factorization by either the default or the user-overridden
  //     configuration.

  sort_ppf(&ppf, primes_major, primes_ascending, powers_ascending);
  print_ppf(ppf);


  // --- Run for (as close as possible to) a fixed amount of time, tallying the
  //     elapsed CPU time.

  uint128 iterations = 0;
  double cpu_time = 0.0;
  const double cpu_time_limit = 0.10;
  uint128 memory_usage = 0;
  while (cpu_time < cpu_time_limit)
  {
    clock_t clock_start = clock();
    uint128 *factors = compute_factors(ppf);
    clock_t clock_end = clock();
    cpu_time += (double)(clock_end - clock_start) / (double)CLOCKS_PER_SEC;
    memory_usage = sizeof(*factors) * ppf.d;

    if (++iterations == 0) //1)
    {
      for (uint32 i = 0; i < ppf.d; i++)
        printf("%s\n", uint128_to_static_string(factors[i], 0));
    }

    if (factors) free(factors);
  }


  // --- Print the average amount of CPU time required for each iteration.

  uint memory_scale = (memory_usage >= 1e9)? 9:
                      (memory_usage >= 1e6)? 6:
                      (memory_usage >= 1e3)? 3:
                                             0;
  char *memory_units = (memory_scale == 9)? "GB":
                       (memory_scale == 6)? "MB":
                       (memory_scale == 3)? "KB":
                                            "B";

  printf("%s  %"PRIu32" factors  %.6f ms  %.3f ns/factor  %.3f %s\n",
         uint128_to_static_string(ppf.n, 0),
         ppf.d,
         cpu_time/iterations * 1e3,
         cpu_time/iterations * 1e9 / (double)(ppf.d? ppf.d : 1),
         (double)memory_usage / pow(10, memory_scale),
         memory_units);

  return 0;
}

Ответ 4

[Я отправляю этот ответ так же, как формальность для полноты. Я уже выбрал кого-то другого в качестве принятого ответа.]

Обзор алгоритма. При поиске самого быстрого способа генерации списка факторов в памяти (64-разрядные значения без знака в моем случае), я остановился на гибридном алгоритме, который реализует два -мерный размер ведра, который использует внутренние знания ключей сортировки (т.е. они являются целыми числами и поэтому могут быть вычислены). Конкретный метод является чем-то более близким к "ProxMapSort", но с двумя уровнями ключей (основной, второстепенный), а не только одним. Основной ключ - это просто логарифм базы-2 значения. Меньшим ключом является минимальное количество наиболее значимых цифр значения, необходимого для обеспечения разумного распространения на втором уровне ведер. Факторы производятся сначала во временном рабочем массиве несортированных факторов. Затем анализируется их распределение и выделяется и заполняется массив индексов ковша. Наконец, факторы хранятся непосредственно на месте в конечном отсортированном массиве, используя сортировку вставки. Подавляющее большинство ведер имеют только 1, 2 или 3 значения. Примеры приведены в исходном коде, который прилагается в нижней части этого ответа.

Пространственная сложность.. Использование памяти примерно в 4 раза выше, чем у решения на основе Quicksort, но это на самом деле довольно незначительно, поскольку максимальная память когда-либо использовалась в худшем случае (для 64-битного ввода) составляет 5.5 МБ, из которых 4.0 МБ удерживается всего лишь в течение нескольких миллисекунд.

Среда выполнения. Производительность намного лучше, чем ручное кодированное решение на основе Quicksort: для чисел с нетривиальным числом факторов он неформально примерно в 2,5 раза быстрее. На моем 3,4 ГГц. Intel i7, он производит 184 320 факторов 18 401 055 938 125 568 800 в отсортированном порядке в 0,0052 секунды или около 96 тактов на один фактор или около 35 миллионов факторов в секунду.

Графики. Производительность памяти и времени выполнения была профилирована для 47 616 канонических представителей классов эквивалентности простых подписей чисел до 2⁶⁴-1. Это так называемые "сильно факторизуемые числа" в 64-битном пространстве поиска.

Общая продолжительность выполнения ~ 2.5x лучше, чем решение на основе Quicksort для нетривиальных коэффициентов, показанное ниже на этом логарифмическом графике:

Количество отсортированных факторов, производимых в секунду, по существу, является инверсией вышеизложенного. Производительность на основе каждого фактора снижается после сладкого пятна около 2000 факторов, но не сильно. На производительность влияют размеры кеша L1, L2 и L3, а также количество уникальных первичных факторов числа, которое учитывается, что примерно совпадает с логарифмом входного значения.

Использование пиковой памяти - это прямая линия в этом логарифмическом графике, так как она пропорциональна логарифму базы-2 числа факторов. Обратите внимание, что использование пиковой памяти происходит только в течение очень короткого периода времени; кратковременные рабочие массивы отбрасываются в течение миллисекунд. После того, как временные массивы будут отброшены, остаётся окончательный список факторов, который является тем же минимальным использованием, что и в решении на основе Quicksort.

 /*==============================================================================

 DESCRIPTION

    This is a small proof-of-concept program to test the idea of "sorting"
    factors using a form of bucket sort.  The method is essentially a 2D version
    of ProxMapSort that has tuned for vast, nonlinear distributions using two
    keys (major, minor) rather than one.  The major key is simply the floor of
    the base-2 logarithm of the value, and the minor key is derived from the most
    significant bits of the value.


 INPUT

    Input is given on the command line, either as a single argument giving the
    number to be factored or an even number of arguments giving the 2-tuples that
    comprise the prime-power factorization of the desired number.  For example,
    the number

       75600 = 2^4 x 3^3 x 5^2 x 7

    can be given by the following list of arguments:

       2 4 3 3 5 2 7 1

    Note:  If a single number is given, it will require factoring to produce its
    prime-power factorization.  Since this is just a small test program, a very
    crude factoring method is used that is extremely fast for small prime factors
    but extremely slow for large prime factors.  This is actually fine, because
    the largest factor lists occur with small prime factors anyway, and it is the
    production of large factor lists at which this program aims to be proficient.
    It is simply not interesting to be fast at producing the factor list of a
    number like 17293823921105882610 = 2 x 3 x 5 x 576460797370196087, because
    it has only 32 factors.  Numbers with tens or hundreds of thousands of
    factors are much more interesting.


 OUTPUT

    Results are written to standard output.  A list of factors in ascending order
    is produced, followed by runtime (in microseconds) required to generate the
    list (not including time to print it).


 STATISTICS

    Bucket size statistics for the 47616 canonical representatives of the prime
    signature equivalence classes of 64-bit numbers:

    ==============================================================
    Bucket size     Total count of factored       Total count of
         b          numbers needing size b      buckets of size b
    --------------------------------------------------------------
         1               47616 (100.0%)         514306458  (76.2%)
         2               47427  (99.6%)         142959971  (21.2%)
         3               43956  (92.3%)          16679329   (2.5%)
         4               27998  (58.8%)            995458   (0.1%)
         5                6536  (13.7%)             33427  (<0.1%)
         6                 400   (0.8%)               729  (<0.1%)
         7                  12  (<0.1%)                18  (<0.1%)
    --------------------------------------------------------------
         ~               47616 (100.0%)         674974643 (100.0%)
    --------------------------------------------------------------

    Thus, no 64-bit number (of the input set) ever requires more than 7 buckets,
    and the larger the bucket size the less frequent it is.  This is highly
    desirable.  Note that although most numbers need at least 1 bucket of size 5,
    the vast majority of buckets (99.9%) are of size 1, 2, or 3, meaning that
    insertions are extremely efficient.  Therefore, the use of insertion sort
    for the buckets is clearly the right choice and is arguably optimal for
    performance.


 AUTHOR

    Todd Lehman
    2015/05/08

 */

 #include <inttypes.h>
 #include <limits.h>
 #include <stdbool.h>
 #include <stdlib.h>
 #include <stdio.h>
 #include <stdarg.h>
 #include <string.h>
 #include <time.h>
 #include <math.h>
 #include <assert.h>

 typedef  unsigned int  uint;
 typedef  uint8_t       uint8;
 typedef  uint16_t      uint16;
 typedef  uint32_t      uint32;
 typedef  uint64_t      uint64;

 #define  ARRAY_CAPACITY(x)  (sizeof(x) / sizeof((x)[0]))

 //-----------------------------------------------------------------------------
 // This structure is sufficient to represent the prime-power factorization of
 // all 64-bit values.  The field names ω and Ω are dervied from the standard
 // number theory functions ω(n) and Ω(n), which count the number of unique and
 // non-unique prime factors of n, respectively.  The field name d is derived
 // from the standard number theory function d(n), which counts the number of
 // divisors of n, including 1 and n.
 //
 // The maximum possible value here of ω is 15, which occurs for example at
 // n = 7378677391061896920 = 2^3 x 3^2 x 5 x 7 x 11 x 13 x 17 x 19 x 23 x 29
 // 31 x 37 x 41 x 43 x 47, which has 15 unique prime factors.
 //
 // The maximum possible value of Ω here is 63, which occurs for example at
 // n = 2^63 and n = 2^62 x 3, both of which have 63 non-unique prime factors.
 //
 // The maximum possible value of d here is 184320, which occurs at
 // n = 18401055938125660800 = 2^7 x 3^4 x 5^2 x 7^2 x 11 x 13 x 17 x 19 x 23 x
 // 29 x 31 x 37 x 41.
 //
 // Maximum possible exponents when exponents are sorted in decreasing order:
 //
 //    Index   Maximum   Bits   Example of n
 //    -----   -------   ----   --------------------------------------------
 //        0        63      6   (2)^63
 //        1        24      5   (2*3)^24
 //        2        13      4   (2*3*5)^13
 //        3         8      4   (2*3*5*7)^8
 //        4         5      3   (2*3*5*7*11)^5
 //        5         4      3   (2*3*5*7*11*13)^4
 //        6         3      2   (2*3*5*7*11*13*17)^3
 //        7         2      2   (2*3*5*7*11*13*17*19)^2
 //        8         2      2   (2*3*5*7*11*13*17*19*23)^2
 //        9         1      1   (2*3*5*7*11*13*17*19*23*29)^1
 //       10         1      1   (2*3*5*7*11*13*17*19*23*29*31)^1
 //       11         1      1   (2*3*5*7*11*13*17*19*23*29*31*37)^1
 //       12         1      1   (2*3*5*7*11*13*17*19*23*29*31*37*41)^1
 //       13         1      1   (2*3*5*7*11*13*17*19*23*29*31*37*41*43)^1
 //       14         1      1   (2*3*5*7*11*13*17*19*23*29*31*37*41*43*47)^1
 //    -----   -------   ----   --------------------------------------------
 //       15        63     37
 //
 #pragma pack(push, 8)
 typedef struct
 {
   uint8   e[16];  // Exponents.
   uint64  p[16];  // Primes in increasing order.
   uint8   ω;      // Count of prime factors without multiplicity.
   uint8   Ω;      // Count of prime factors with multiplicity.
   uint32  d;      // Count of factors of n, including 1 and n.
   uint64  n;      // Value of n on which all other fields of this struct depend.
 }
 PrimePowerFactorization;  // 176 bytes with 8-byte packing
 #pragma pack(pop)

 #define  MAX_ω  15
 #define  MAX_Ω  63

 //-----------------------------------------------------------------------------
 // Fatal error:  print error message and abort.

 void fatal_error(const char *format, ...)
 {
   va_list args;
   va_start(args, format);
   vfprintf(stderr, format, args);
   exit(1);
 }

 //-----------------------------------------------------------------------------
 // Compute 64-bit 2-adic integer inverse.

 uint64 uint64_inv(const uint64 x)
 {
   assert(x != 0);

   uint64 y = 1;
   for (uint i = 0; i < 6; i++)  // 6 = log2(log2(2**64)) = log2(64)
     y = y * (2 - (x * y));

   return y;
 }

 //------------------------------------------------------------------------------
 // Compute 2 to arbitrary power.  This is just a portable and abstract way to
 // write a left-shift operation.  Note that the use of the UINT64_C macro here
 // is actually required, because the result of 1U<<x is not guaranteed to be a
 // 64-bit result; on a 32-bit compiler, 1U<<32 is 0 or is undefined.

 static inline
 uint64 uint64_pow2(x)
 {
   return UINT64_C(1) << x;
 }

 //------------------------------------------------------------------------------
 // Deduce native word size (int, long, or long long) for 64-bit integers.
 // This is needed for abstracting certain compiler-specific intrinsic functions.

 #if UINT_MAX == 0xFFFFFFFFFFFFFFFFU
   #define UINT64_IS_U
 #elif ULONG_MAX == 0xFFFFFFFFFFFFFFFFUL
   #define UINT64_IS_UL
 #elif ULLONG_MAX == 0xFFFFFFFFFFFFFFFFULL
   #define UINT64_IS_ULL
 #else
   //error "Unable to deduce native word size of 64-bit integers."
 #endif

 //------------------------------------------------------------------------------
 // Define abstracted intrinsic function for counting leading zeros.  Note that
 // the value is well-defined for nonzero input but is compiler-specific for
 // input of zero.

 #if   defined(UINT64_IS_U) && __has_builtin(__builtin_clz)
   #define UINT64_CLZ(x) __builtin_clz(x)
 #elif defined(UINT64_IS_UL) && __has_builtin(__builtin_clzl)
   #define UINT64_CLZ(x) __builtin_clzl(x)
 #elif defined(UINT64_IS_ULL) && __has_builtin(__builtin_clzll)
   #define UINT64_CLZ(x) __builtin_clzll(x)
 #else
   #undef UINT64_CLZ
 #endif

 //------------------------------------------------------------------------------
 // Compute floor of base-2 logarithm y = log_2(x), where x > 0.  Uses fast
 // intrinsic function if available; otherwise resorts to hand-rolled method.

 static inline
 uint uint64_log2(uint64 x)
 {
   assert(x > 0);

   #if defined(UINT64_CLZ)
     return 63 - UINT64_CLZ(x);
   #else
     #define S(k) if ((x >> k) != 0) { y += k; x >>= k; }
     uint y = 0; S(32); S(16); S(8); S(4); S(2); S(1); return y;
     #undef S
   #endif
 }

 //------------------------------------------------------------------------------
 // Compute major key, given a nonzero number.  The major key is simply the
 // floor of the base-2 logarithm of the number.

 static inline
 uint major_key(const uint64 n)
 {
   assert(n > 0);
   uint k1 = uint64_log2(n);
   return k1;
 }

 //------------------------------------------------------------------------------
 // Compute minor key, given a nonzero number, its major key, k1, and the
 // bit-size b of major bucket k1.  The minor key, k2, is is computed by first
 // removing the most significant 1-bit from the number, because it adds no
 // information, and then extracting the desired number of most significant bits
 // from the remainder.  For example, given the number n=1463 and a major bucket
 // size of b=6 bits, the keys are computed as follows:
 //
 //    Step 0:  Given number              n = 0b10110110111 = 1463
 //
 //    Step 1:  Compute major key:        k1 = floor(log_2(n)) = 10
 //
 //    Step 2:  Remove high-order 1-bit:  n' = 0b0110110111 = 439
 //
 //    Step 3:  Compute minor key:        k2 = n' >> (k1 - b)
 //                                          = 0b0110110111 >> (10 - 6)
 //                                          = 0b0110110111 >> 4
 //                                          = 0b011011
 //                                          = 27

 static inline
 uint minor_key(const uint64 n, const uint k1, const uint b)
 {
   assert(n > 0); assert(k1 >= 0); assert(b > 0);
   const uint k2 = (uint)((n ^ uint64_pow2(k1)) >> (k1 - b));
   return k2;
 }

 //------------------------------------------------------------------------------
 // Raw unsorted factor.

 #pragma push(pack, 4)

 typedef struct
 {
   uint64  n;   // Value of factor.
   uint32  k1;  // Major key.
   uint32  k2;  // Minor key.
 }
 UnsortedFactor;

 #pragma pop(pack)

 //------------------------------------------------------------------------------
 // Compute sorted list of factors, given a prime-power factorization.

 static uint64 memory_usage;

 uint64 *compute_factors(const PrimePowerFactorization ppf)
 {
   memory_usage = 0;

   if (ppf.n == 0)
     return NULL;

   uint64 *sorted_factors = calloc(ppf.d, sizeof(*sorted_factors));
   if (!sorted_factors)
     fatal_error("Failed to allocate array of %"PRIu32" factors.", ppf.d);
   memory_usage += ppf.d * sizeof(*sorted_factors);

   UnsortedFactor *unsorted_factors = malloc(ppf.d * sizeof(*unsorted_factors));
   if (!unsorted_factors)
     fatal_error("Failed to allocate array of %"PRIu32" factors.", ppf.d);
   memory_usage += ppf.d * sizeof(*unsorted_factors);


   // These arrays are indexed by the major key of a number.
   uint32 major_counts[64];   // Counts of factors in major buckets.
   uint32 major_spans[64];    // Counts rounded up to power of 2.
   uint32 major_bits[64];     // Base-2 logarithm of bucket size.
   uint32 major_indexes[64];  // Indexes into minor array.
   memset(major_counts,  0, sizeof(major_counts));
   memset(major_spans,   0, sizeof(major_spans));
   memset(major_bits,    0, sizeof(major_bits));
   memset(major_indexes, 0, sizeof(major_indexes));


   // --- Step 1:  Produce unsorted list of factors from prime-power
   //     factorization.  At the same time, count groups of factors by their
   //     major keys.
   {
     // This array is for counting in the multi-radix number system dictated by
     // the exponents of the prime-power factorization.  An invariant is that
     // e[i] <= ppf.e[i] for all i (0 < i <ppf.ω).
     uint8 e[MAX_ω];
     for (uint i = 0; i < ppf.ω; i++)
       e[i] = 0;

     // Initialize inverse-prime-powers.  This array allows for division by
     // p[i]**e[i] extremely quickly in the main loop below.  Note that 2-adic
     // inverses are not defined for even numbers (of which 2 is the only prime),
     // so powers of 2 must be handled specially.
     uint64 pe_inv[MAX_ω];
     for (uint i = 0; i < ppf.ω; i++)
     {
       uint64 pe = 1; for (uint j = 1; j <= ppf.e[i]; j++) pe *= ppf.p[i];
       pe_inv[i] = uint64_inv(pe);
     }

     uint64 n = 1;  // Current factor accumulator.
     for (uint k = 0; k < ppf.d; k++)   // k indexes into unsorted_factors[].
     {
       //printf("unsorted_factors[%u] = %"PRIu64"   j = %u\n", k, n, j);
       assert(ppf.n % n == 0);
       unsorted_factors[k].n = n;

       uint k1 = major_key(n);
       assert(k1 < ARRAY_CAPACITY(major_counts));
       unsorted_factors[k].k1 = k1;
       major_counts[k1] += 1;

       // Increment the remainder of the multi-radix number e[].
       for (uint i = 0; i < ppf.ω; i++)
       {
         if (e[i] == ppf.e[i])  // Carrying is occurring.
         {
           if (ppf.p[i] == 2)
             n >>= ppf.e[i];  // Divide n by 2 ** ppf.e[i].
           else
             n *= pe_inv[i];  // Divide n by ppf.p[i] ** ppf.e[i].

           e[i] = 0;
         }
         else  // Carrying is not occurring.
         {
           n *= ppf.p[i];
           e[i] += 1;
           break;
         }
       }
     }
     assert(n == 1);  // n always cycles back to 1, not to ppf.n.

     assert(unsorted_factors[ppf.d-1].n == ppf.n);
   }


   // --- Step 2:  Define the major bits array, the major spans array, the major
   //     index array, and count the total spans.

   uint32 total_spans = 0;
   {
     uint32 k = 0;
     for (uint k1 = 0; k1 < ARRAY_CAPACITY(major_counts); k1++)
     {
       uint32 count = major_counts[k1];
       uint32 bits = (count <= 1)? count : uint64_log2(count - 1) + 1;
       major_bits[k1] = bits;
       major_spans[k1] = (count > 0)? (UINT32_C(1) << bits) : 0;
       major_indexes[k1] = k;
       k += major_spans[k1];
     }
     total_spans = k;
   }


   // --- Step 3:  Allocate and populate the minor counts array.  Note that it
   //     must be initialized to zero.

   uint32 *minor_counts = calloc(total_spans, sizeof(*minor_counts));
   if (!minor_counts)
     fatal_error("Failed to allocate array of %"PRIu32" counts.", total_spans);
   memory_usage += total_spans * sizeof(*minor_counts);

   for (uint k = 0; k < ppf.d; k++)
   {
     const uint64 n = unsorted_factors[k].n;
     const uint k1 = unsorted_factors[k].k1;
     const uint k2 = minor_key(n, k1, major_bits[k1]);
     assert(k2 < major_spans[k1]);
     unsorted_factors[k].k2 = k2;
     minor_counts[major_indexes[k1] + k2] += 1;
   }


   // --- Step 4:  Define the minor indexes array.
   //
   // NOTE:  Instead of allocating a separate array, the earlier-allocated array
   // of minor indexes is simply repurposed here using an alias.

   uint32 *minor_indexes = minor_counts;  // Alias the array for repurposing.

   {
     uint32 k = 0;
     for (uint i = 0; i < total_spans; i++)
     {
       uint32 count = minor_counts[i];  // This array is the same array...
       minor_indexes[i] = k;            // ...as this array.
       k += count;
     }
   }


   // --- Step 5:  Populate the sorted factors array.  Note that the array must
   //              be initialized to zero earlier because values of zero are used
   //              as sentinels in the bucket lists.

   for (uint32 i = 0; i < ppf.d; i++)
   {
     uint64 n = unsorted_factors[i].n;
     const uint k1 = unsorted_factors[i].k1;
     const uint k2 = unsorted_factors[i].k2;

     // Insert factor into bucket using insertion sort (which happens to be
     // extremely fast because we know the bucket sizes are always very small).
     uint32 k;
     for (k = minor_indexes[major_indexes[k1] + k2];
          sorted_factors[k] != 0;
          k++)
     {
       assert(k < ppf.d);
       if (sorted_factors[k] > n)
         { uint64 t = sorted_factors[k]; sorted_factors[k] = n; n = t; }
     }
     sorted_factors[k] = n;
   }


   // --- Step 6:  Validate array of sorted factors.
   {
     for (uint32 k = 1; k < ppf.d; k++)
     {
       if (sorted_factors[k] == 0)
         fatal_error("Produced a factor of 0 at index %"PRIu32".", k);

       if (ppf.n % sorted_factors[k] != 0)
         fatal_error("Produced non-factor %"PRIu64" at index %"PRIu32".",
                     sorted_factors[k], k);

       if (sorted_factors[k-1] == sorted_factors[k])
         fatal_error("Duplicate factor %"PRIu64" at index %"PRIu32".",
                     sorted_factors[k], k);

       if (sorted_factors[k-1] > sorted_factors[k])
         fatal_error("Out-of-order factors %"PRIu64" and %"PRIu64" "
                     "at indexes %"PRIu32" and %"PRIu32".",
                     sorted_factors[k-1], sorted_factors[k], k-1, k);
     }
   }


   free(minor_counts);
   free(unsorted_factors);

   return sorted_factors;
 }

 //------------------------------------------------------------------------------
 // Compute prime-power factorization of a 64-bit value.  Note that this function
 // is designed to be fast *only* for numbers with very simple factorizations,
 // e.g., those that produce large factor lists.  Do not attempt to factor
 // large semiprimes with this function.  (The author does know how to factor
 // large numbers efficiently; however, efficient factorization is beyond the
 // scope of this small test program.)

 PrimePowerFactorization compute_ppf(const uint64 n)
 {
   PrimePowerFactorization ppf;

   if (n == 0)
   {
     ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 0, .n = 0 };
   }
   else if (n == 1)
   {
     ppf = (PrimePowerFactorization){ .p = { 1 }, .e = { 1 },
                                      .ω = 1, .Ω = 1, .d = 1, .n = 1 };
   }
   else
   {
     ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 1, .n = n };

     uint64 m = n;
     for (uint64 p = 2; p * p <= m; p += 1 + (p > 2))
     {
       if (m % p == 0)
       {
         assert(ppf.ω <= MAX_ω);
         ppf.p[ppf.ω] = p;
         ppf.e[ppf.ω] = 0;
         while (m % p == 0)
           { m /= p; ppf.e[ppf.ω] += 1; }
         ppf.d *= (1 + ppf.e[ppf.ω]);
         ppf.Ω += ppf.e[ppf.ω];
         ppf.ω += 1;
       }
     }
     if (m > 1)
     {
       assert(ppf.ω <= MAX_ω);
       ppf.p[ppf.ω] = m;
       ppf.e[ppf.ω] = 1;
       ppf.d *= 2;
       ppf.Ω += 1;
       ppf.ω += 1;
     }
   }

   return ppf;
 }

 //------------------------------------------------------------------------------
 // Parse prime-power factorization from a list of ASCII-encoded base-10 strings.
 // The values are assumed to be 2-tuples (p,e) of prime p and exponent e.
 // Primes must not exceed 2^64 - 1.  Exponents must not exceed 2^8 - 1.  The
 // constructed value must not exceed 2^64 - 1.

 PrimePowerFactorization parse_ppf(const uint pairs, const char *const values[])
 {
   assert(pairs <= MAX_ω);

   PrimePowerFactorization ppf;

   if (pairs == 0)
   {
     ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 0, .n = 0 };
   }
   else
   {
     ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 1, .n = 1 };

     for (uint i = 0; i < pairs; i++)
     {
       ppf.p[i] = (uint64)strtoumax(values[(i*2)+0], NULL, 10);
       ppf.e[i] =  (uint8)strtoumax(values[(i*2)+1], NULL, 10);

       // Validate prime value.
       if (ppf.p[i] < 2)  // (Ideally this would actually do a primality test.)
         fatal_error("Factor %"PRIu64" is invalid.", ppf.p[i]);

       // Accumulate count of unique prime factors.
       if (ppf.ω > UINT8_MAX - 1)
         fatal_error("Small-omega overflow at factor %"PRIu64"^%"PRIu8".",
                     ppf.p[i], ppf.e[i]);
       ppf.ω += 1;

       // Accumulate count of total prime factors.
       if (ppf.Ω > UINT8_MAX - ppf.e[i])
         fatal_error("Big-omega wverflow at factor %"PRIu64"^%"PRIu8".",
                     ppf.p[i], ppf.e[i]);
       ppf.Ω += ppf.e[i];

       // Accumulate total divisor count.
       if (ppf.d > UINT32_MAX / (1 + ppf.e[i]))
         fatal_error("Divisor count overflow at factor %"PRIu64"^%"PRIu8".",
                     ppf.p[i], ppf.e[i]);
       ppf.d *= (1 + ppf.e[i]);

       // Accumulate value.
       for (uint8 k = 1; k <= ppf.e[i]; k++)
       {
         if (ppf.n > UINT64_MAX / ppf.p[i])
           fatal_error("Value overflow at factor %"PRIu64".", ppf.p[i]);
         ppf.n *= ppf.p[i];
       }
     }
   }

   return ppf;
 }

 //------------------------------------------------------------------------------
 // Main control.  Parse command line and produce list of factors.

 int main(const int argc, const char *const argv[])
 {
   PrimePowerFactorization ppf;

   uint values = (uint)argc - 1;  // argc is always guaranteed to be at least 1.

   if (values == 1)
   {
     ppf = compute_ppf((uint64)strtoumax(argv[1], NULL, 10));
   }
   else
   {
     if (values % 2 != 0)
       fatal_error("Odd number of arguments (%u) given.", values);
     uint pairs = values / 2;
     ppf = parse_ppf(pairs, &argv[1]);
   }

   // Run for (as close as possible to) a fixed amount of time, tallying the
   // elapsed CPU time.
   uint64 iterations = 0;
   double cpu_time = 0.0;
   const double cpu_time_limit = 0.05;
   while (cpu_time < cpu_time_limit)
   {
     clock_t clock_start = clock();
     uint64 *factors = compute_factors(ppf);
     clock_t clock_end = clock();
     cpu_time += (double)(clock_end - clock_start) / (double)CLOCKS_PER_SEC;

     if (++iterations == 1)
     {
       for (uint32 i = 0; i < ppf.d; i++)
         printf("%"PRIu64"\n", factors[i]);
     }

     if (factors) free(factors);
   }

   // Print the average amount of CPU time required for each iteration.
   uint mem_scale = (memory_usage >= 1e9)? 9:
                    (memory_usage >= 1e6)? 6:
                    (memory_usage >= 1e3)? 3:
                                           0;
   char *mem_units = (mem_scale == 9)? "GB":
                     (mem_scale == 6)? "MB":
                     (mem_scale == 3)? "KB":
                                        "B";

   printf("%"PRIu64"  %"PRIu32" factors  %.6f ms  %.3f ns/factor  %.3f %s\n",
          ppf.n,
          ppf.d,
          cpu_time/iterations * 1e3,
          cpu_time/iterations * 1e9 / (double)(ppf.d? ppf.d : 1),
          (double)memory_usage / pow(10, mem_scale),
          mem_units);

   return 0;
 }