Degenerate Conic

Natural Sorting

Sorting is one of the fundamental problems in computer science, so of course Fortran does not include any intrinsic sorting routine (we've got Bessel functions, though!) String sorting is a special case of this problem which includes various choices to consider, for example:

Natural or ASCII sorting
Case sensitive (e.g., 'A'<'a') or case insensitive (e.g., 'A'=='a')

finder

"Natural" sorting (also called "alphanumeric sorting" means to take into account numeric values in the string, rather than just comparing the ASCII value of each of the characters. This can produce an order that looks more natural to a human for strings that contain numbers. For example, in a "natural" sort, the string "case2.txt" will come before "case100.txt", since the number 2 comes before the number 100. For example, natural sorting is the method used to sort file names in the MacOS X Finder (see image at right). While, interestingly, an ls -l from a Terminal merely does a basic ASCII sort.

For string sorting routines written in modern Fortran, check out my GitHub project stringsort. This library contains routines for both natural and ASCII string sorting. Natural sorting is achieved by breaking up each string into chunks. A chunk consists of a non-numeric character or a contiguous block of integer characters. A case insensitive search is done by simply converting each character to lowercase before comparing them. I make no claim that the routines are particularly optimized. One limitation is that contiguous integer characters are stored as an integer(INT32) value, which has a maximum value of 2147483647. Although note that it is easy to change the code to use integer(INT64) variables to increase this range up to 9223372036854775807 if necessary. Eliminating integer size restrictions entirely is left as an exercise for the reader.

Consider the following test case:

character(len=*),dimension(6) :: &
str = [ 'z1.txt ', &
        'z102.txt', &
        'Z101.txt', &
        'z100.txt', &
        'z10.txt ', &
        'Z11.txt ' ]

This list can be sorted (at least) four different ways:

Case Insensitive

ASCII

z1.txt z10.txt z100.txt Z101.txt z102.txt Z11.txt

natural

z1.txt z10.txt Z11.txt z100.txt Z101.txt z102.txt

Case Sensitive

ASCII

Z101.txt Z11.txt z1.txt z10.txt z100.txt z102.txt

natural

Z11.txt Z101.txt z1.txt z10.txt z100.txt z102.txt

Each of these can be done using stringsort with the following subroutine calls:

call lexical_sort_recursive(str,case_sensitive=.false.)
call lexical_sort_natural_recursive(str,case_sensitive=.false.)

call lexical_sort_recursive(str,case_sensitive=.true.)
call lexical_sort_natural_recursive(str,case_sensitive=.true.)

*Original Quicksort algorithm by Tony Hoare, 1961 (Communications of the ACM)*

The routines use the quicksort algorithm, which was originally created for sorting strings (specifically words in Russian sentences so they could be looked up in a Russian-English dictionary). The algorithm is easily implemented in modern Fortran using recursion (non-recursive versions were also available before recursion was added to the language in Fortran 90). Quicksort was named one of the top 10 algorithms of the 20th century by the ACM (Fortran was also on the list).

Topological Sorting

Topological sorting can be used to determine the order in which a collection of interdependent tasks must be performed. For example, when building a complex modern Fortran application, there can be many modules with complex interdependencies (via use association). A Fortran building system (like FoBiS, Foray, or SCons) will perform a topological sort to determine the correct compilation order. In graph theory, the collection of tasks are vertices of a directed graph. If there are no circular dependencies, then it is a directed acyclic graph (DAG).

An example modern Fortran implementation of a topological sorting algorithm is given here. The only public class is dag, which is used to define the graph and perform the sort. The toposort method performs a "depth-first" traversal of the graph using the recursive subroutine dfs. Each vertex is only visited once, and so the algorithm runs in linear time. The routine also triggers an error message for circular dependencies.

module toposort_module
!! Topological sorting, using a recursive
!! depth-first method. The vertices are
!! integer values from (1, 2, ..., nvertices)

implicit none

private

type :: vertex
    !! a vertex of a directed acyclic graph (DAG)
    integer,dimension(:),allocatable :: edges
    integer :: ivertex = 0 !vertex number
    logical :: checking = .false.
    logical :: marked = .false.
    contains
    procedure :: set_edges
end type vertex

type,public :: dag
    !! a directed acyclic graph (DAG)
    type(vertex),dimension(:),allocatable :: vertices
    contains
    procedure :: set_vertices => dag_set_vertices
    procedure :: set_edges => dag_set_edges
    procedure :: toposort => dag_toposort
end type dag

contains

subroutine set_edges(me,edges)
    !! specify the edge indices for this vertex
    implicit none
    class(vertex),intent(inout) :: me
    integer,dimension(:),intent(in) :: edges
    allocate(me%edges(size(edges)))
    me%edges = edges
end subroutine set_edges

subroutine dag_set_vertices(me,nvertices)
    !! set the number of vertices in the dag
    implicit none
    class(dag),intent(inout) :: me
    integer,intent(in) :: nvertices !! number of vertices
    integer :: i
    allocate(me%vertices(nvertices))
    me%vertices%ivertex = [(i,i=1,nvertices)]
end subroutine dag_set_vertices

subroutine dag_set_edges(me,ivertex,edges)
    !! set the edges for a vertex in a dag
    implicit none
    class(dag),intent(inout) :: me
    integer,intent(in) :: ivertex !! vertex number
    integer,dimension(:),intent(in) :: edges
    call me%vertices(ivertex)%set_edges(edges)
end subroutine dag_set_edges

subroutine dag_toposort(me,order)
    !! main toposort routine

    implicit none

    class(dag),intent(inout) :: me
    integer,dimension(:),allocatable,intent(out) :: order

    integer :: i,n,iorder

    n = size(me%vertices)
    allocate(order(n))
    iorder = 0 ! index in order array
    do i=1,n
        if (.not. me%vertices(i)%marked) &
            call dfs(me%vertices(i))
    end do

contains

    recursive subroutine dfs(v)
        !! depth-first graph traversal

        type(vertex),intent(inout) :: v

        integer :: j

        if (v%checking) then
            error stop 'Error: circular dependency.'
        else
            if (.not. v%marked) then
                v%checking = .true.
                if (allocated(v%edges)) then
                    do j=1,size(v%edges)
                        call dfs(me%vertices(v%edges(j)))
                    end do
                end if
                v%checking = .false.
                v%marked = .true.
                iorder = iorder + 1
                order(iorder) = v%ivertex
            end if
        end if

    end subroutine dfs

end subroutine dag_toposort

end module toposort_module

An example use of this module is given below. Here, we define a graph with five vertices: task 2 depends on 1, task 3 depends on both 1 and 5, and task 4 depends on 5.

program main

    use toposort_module
    implicit none

    type(dag) :: d
    integer,dimension(:),allocatable :: order

    call d%set_vertices(5)
    call d%set_edges(2,[1])   ! 2 depends on 1
    call d%set_edges(3,[5,1]) ! 3 depends on 5 and 1
    call d%set_edges(4,[5])   ! 4 depends on 5
    call d%toposort(order)

    write(*,*) order

end program main

The result is:

Which is the evaluation order. As far as I can find, the above code is the only modern Fortran topological sorting implementation available on the internet. There is a FORTRAN 77 subroutine here, and a Fortran 90-ish one here (however, neither of them check for circular dependencies).

Sorting with Magic Numbers

Updated July 7, 2014

Poorly commented sourcecode is one of my biggest pet peeves. Not only should the comments explain what the routine does, but where the algorithm came from and what its limitations are. Consider ISORT, a non-recursive quicksort routine from SLATEC, written in 1976. In it, we encounter the following code:

R = 0.375E0
...
IF (R .LE. 0.5898437E0) THEN
    R = R+3.90625E-2
ELSE
    R = R-0.21875E0
ENDIF

What are these magic numbers? An internet search for "0.5898437" yields dozens of hits, many of them different versions of this algorithm translated into various other programming languages (including C, C++, and free-format Fortran). Note: the algorithm here is the same one, but also prefaces this code with the following helpful comment:

! And now...Just a little black magic...

Frequently, old code originally written in single precision can be improved on modern computers by updating to full-precision constants (replacing 3.14159 with a parameter computed at compile time as acos(-1.0_wp) is a classic example). Is this the case here? Sometimes WolframAlpha is useful in these circumstances. It tells me that a possible closed form of 0.5898437 is: \(\frac{57}{100\pi}+\frac{13\pi}{100} \approx 0.58984368009\). Hmmmmm... The reference given in the header is no help [1], it doesn't contain this bit of code at all.

It turns out, what this code is doing is generating a pseudorandom number to use as the pivot element in the quicksort algorithm. The code produces the following repeating sequence of values for R:

This places the pivot point near the middle of the set. The source of this scheme is from Reference [2], and is also mentioned in Reference [3]. According to [2]:

These four decimal constants, which are respectively 48/128, 75.5/128, 28/128, and 5/128, are rather arbitrary.

The other magic numbers in this routine are the dimensions of these variables:

INTEGER IL(21), IU(21)

These are workspace arrays used by the subroutine, since it does not employ recursion. But, since they have a fixed size, there is a limit to the size of the input array this routine can sort. What is that limit? You would not know from the documentation in this code. You have to go back to the original reference [1] (where, in fact, these arrays only had 16 elements). There, it explains that the arrays IL(K) and IU(K) permit sorting up to \(2^{k+1}-1\) elements (131,071 elements for the k=16 case). With k=21, that means the ISORT routine will work for up to 4,194,303 elements. So, keep that in mind if you are using this routine.

There are many other implementations of the quicksort algorithm (which was declared one of the top 10 algorithms of the 20th century). In the LAPACK quicksort routine DLASRT, k=32 and the "median of three" method is used to select the pivot. The quicksort method in R is the same algorithm as ISORT, except that k=40 (this code also has the advantage of being properly documented, unlike the other two).

References

R. C. Singleton, "Algorithm 347: An Efficient Algorithm for Sorting With Minimal Storage", Communications of the ACM, Vol. 12, No. 3 (1969).
R. Peto, "Remark on Algorithm 347", Communications of the ACM, Vol. 13 No. 10 (1970).
R. Loeser, "Survey on Algorithms 347, 426, and Quicksort", ACM Transactions on Mathematical Software (TOMS), Vol. 2 No. 3 (1976).

posted at 01:25 · Programming · Fortran SLATEC Sorting

Algorithms • Modern Fortran Programming • Orbital Mechanics

May 30, 2016

Natural Sorting

Case Insensitive

ASCII

natural

Case Sensitive

ASCII

natural

See also

May 17, 2015

Topological Sorting

See also

Jul 05, 2014

Sorting with Magic Numbers

References