Natural Sorting

Sorting is one of the fundamental problems in computer science, so of course Fortran does not include any intrinsic sorting routine (we’ve got Bessel functions, though!) String sorting is a special case of this problem which includes various choices to consider, for example:

  • Natural or ASCII sorting
  • Case sensitive (e.g., ‘A’<'a') or case insensitive (e.g., 'A'=='a')

finder“Natural” sorting (also called “alphanumeric sorting” means to take into account numeric values in the string, rather than just comparing the ASCII value of each of the characters. This can produce an order that looks more natural to a human for strings that contain numbers. For example, in a “natural” sort, the string “case2.txt” will come before “case100.txt”, since the number 2 comes before the number 100. For example, natural sorting is the method used to sort file names in the MacOS X Finder (see image at right). While, interestingly, an ls -l from a Terminal merely does a basic ASCII sort.

For string sorting routines written in modern Fortran, check out my GitHub project stringsort. This library contains routines for both natural and ASCII string sorting. Natural sorting is achieved by breaking up each string into chunks. A chunk consists of a non-numeric character or a contiguous block of integer characters. A case insensitive search is done by simply converting each character to lowercase before comparing them. I make no claim that the routines are particularly optimized. One limitation is that contiguous integer characters are stored as an integer(INT32) value, which has a maximum value of 2147483647. Although note that it is easy to change the code to use integer(INT64) variables to increase this range up to 9223372036854775807 if necessary. Eliminating integer size restrictions entirely is left as an exercise for the reader.

Consider the following test case:

character(len=*),dimension(6) :: &
    str = [ 'z1.txt  ', &
            'z102.txt', &
            'Z101.txt', &
            'z100.txt', &
            'z10.txt ', &
            'Z11.txt '  ]

This list can be sorted (at least) four different ways:

Case Insensitive

ASCII

z1.txt
z10.txt
z100.txt
Z101.txt
z102.txt
Z11.txt

natural

z1.txt
z10.txt
Z11.txt
z100.txt
Z101.txt
z102.txt

Case Sensitive

ASCII

Z101.txt
Z11.txt
z1.txt
z10.txt
z100.txt
z102.txt

natural

Z11.txt
Z101.txt
z1.txt
z10.txt
z100.txt
z102.txt

Each of these can be done using stringsort with the following subroutine calls:

 call lexical_sort_recursive(str,case_sensitive=.false.)
 call lexical_sort_natural_recursive(str,case_sensitive=.false.)

 call lexical_sort_recursive(str,case_sensitive=.true.)
 call lexical_sort_natural_recursive(str,case_sensitive=.true.)
quicksort

Original Quicksort algorithm by Tony Hoare, 1961 (Communications of the ACM)

The routines use the quicksort algorithm, which was originally created for sorting strings (specifically words in Russian sentences so they could be looked up in a Russian-English dictionary). The algorithm is easily implemented in modern Fortran using recursion (non-recursive versions were also available before recursion was added to the language in Fortran 90). Quicksort was named one of the top 10 algorithms of the 20th century by the ACM (Fortran was also on the list).

See also

Tagged with: , ,

Leave a Reply

Your email address will not be published. Required fields are marked *

*