index                 package:rindex                 R Documentation

_I_n_d_e_x_i_n_g _v_e_c_t_o_r_s

_D_e_s_c_r_i_p_t_i_o_n:

     Indexing allows to extract small fractions of large vectors very
     quickly.

_U_s_a_g_e:

      index(x, uni = NULL, batch = NULL, verbose = FALSE)
      c.index(...)
      indexAddTree(obj, batch = NULL)
      indexDelTree(obj)
      indexAutobatch(n, batch = 64)

_A_r_g_u_m_e_n_t_s:

       x: a vector (currently only character) 

     uni: set to TRUE or FALSE to save checking for duplicates (default
          NULL for checking) 

   batch: set to preferred batch size to influence RAM - speed -
          trade-off (default NULL for 'indexAutobatch') 

 verbose: set to TRUE to report timing of index creation (default
          FALSE) 

       n: number of elements to build a tree on 

     obj: an object of class 'index' 

     ...: objects of class 'index' 

_D_e_t_a_i_l_s:


            *Basic functions*  _creating, modifying and removing indices_
                        index  create an index object and build a tree
                      c.index  concatenate index objects (currently not tuned)
               indexAutobatch  calculates the optimal index resolution (size of a leaf) given number of elements and desired batch size (default 64)
                 indexAddTree  build or rebuild tree ('Calloc')
                 indexDelTree  remove tree ('Free')
                         'rm'  removing index object removes tree at next garbage collection 'gc'
          *Index information*  _information, printing and retrieving all values_
                 'indexNodes'  returns number of tree nodes
                 'indexBytes'  returns indes size in bytes
                'print.index'  prints index info and optionally tree
                  'str.index'  removes class and calls NextMethod("str")
               'length.index'  identical to 'length' of original vector
                'names.index'  currently forbidden
              'names<-.index'  currently forbidden
               *Basic access*  _information, printing and retrieving all values_
                 'sort.index'  identical to 'sort' of original vector, but much faster
                'order.index'  identical to 'order' of original vector, but much faster
                    '[.index'  'index[]' returns original vector, subsetting works identical to susetting original vector '[' (via 'Next.Method')
                  '[<-.index'  currently forbidden
                'is.na.index'  identical to 'is.na' of original vector, but much faster
           *Low level search*  _low level search functions return positions in _index order_ (sorted)_
                  'indexFind'  finding exact values in index
              'indexFindlike'  finding values in index that begin like search value (character indices only)
           *Mid level search*  _mid level search functions return positions in _index order_ (sorted)_
          'indexFindInterval'  finding a sequence of exact or approximate values
                 'indexMatch'  finding positions of vector of search values
          *High level search*  _high level search functions return positions in _original order_ (unsorted)_
                    'indexEQ'  index EQual value
                    'indexNE'  index NotEqual value
                    'indexLT'  index LowerThan value
                    'indexGT'  index GreaterThan value
                    'indexLE'  index LowerEqual value
                    'indexGE'  index GreaterEqual value
       *High level operators*  _high level operators return logical vectors in _original order_ (unsorted)_
                   '==.index'  index EQual value
                   '!=.index'  index NotEqual value
                    '<.index'  index LowerThan value
                    '>.index'  index GreaterThan value
                   '<=.index'  index LowerEqual value
                   '>=.index'  index GreaterEqual value
             *match and %in%*  _high level matching and %in% behave as expected_
                      'match'  redefined version of 'match' automatically recognizing index tables
                       '%in%'  redefined version of '%in%' (redefinition needed for finding redefined 'match' in spite of namespaces)

_V_a_l_u_e:

     An object of class 'index', i.e. a list with components 

     val: sorted vector of values

     pos: integer vector of original positions of sorted values

       n: number of values (including NAs)

     nNA: number of NAs

   batch: resolution of tree

     uni: logical flagging the index as unique (TRUE) or non-unique
          (FALSE)

    tree: external pointer to C-tree

_T_h_e_o_r_y:

     Linear search has O(n) time complexity. By using an index tree,
     search time can be reduced to O(log(n)). An index that can be used
     with any R-vector 'x' of length 'n' needs to store the original
     values together with the original positions, i.e. 'list(val=x,
     pos=order(x))', thus requires - strongly simplified - 2*n RAM. If
     we store this information in a binary tree, each value pair
     {val,pos} is stored in its own node together with two pointers,
     the memory requirements are - strongly simplified - 4*n RAM. The
     b-tree stores more than one value and two pointers in one node and
     thus minimizes the number of nodes that need to be read (from
     disk). However, used in RAM, b-trees increase the search time
     because they impurify logarithmic search across nodes with linear
     search within nodes. By contrast, the t-tree is optimized for RAM:
     it stores only two pointers but many sorted values within each
     node: for branching only min and max values need to be searched,
     linear search is only required at the final leaf node. However,
     realizing a  t-tree within R's memory model requires additional
     overhead for the SEXPREC data structures. We avoid that by
     defining a static read only tree (and save implementation of
     insert and delete operations). The b*tree and t*tree versions of
     the mentioned indices reduce the size of the search nodes by
     storing the data itself in the leafnodes only. This leads to some
     redundant storage but speeds up search. Some implementations
     connect the leafnodes by extra pointers to speed up linear search.
     If we take this principle to the extreme, we can save these extra
     pointers and merge all leave nodes into one single big leaf. By
     doing so we loose the ability to update the index, but we gain a
     static read-only tree structure that supports very fast linear
     search as well as logarithmic search.


                .
              /   
            .       .
          /      /      C-tree
        _________________  R-val
        _________________  R-pos

     We implement this efficiently by storing the sorted values vector
     together with the order positions as standard R (SEXPREC) objects
     and add a pure C tree that is built from pointered struct nodes.
     The leaf nodes do use integer addressing instead of pointers to
     identify the associated part of the SEXPREC vector (pointers can't
     be used because the R garbage collector may move the vector). The
     topnode  is linked into R using an external pointer. The tree
     itself can be removed explicitely from memory using
     'indexDelTree'. If the index object containing the external
     pointer is deleted, the tree will be freed from memory at the next
     garbage collection through its finalization method. If the index
     object does not contain a valid external pointer to the tree -
     e.g. when loading an index object from disk - the tree will be
     quickly transparently rebuild or can be build explicitely via
     'indexAddTree'.

_B_e_n_e_f_i_t_s:

        *  much faster access to small fractions of large vectors (or
           few rownames of long data.frames)

        *  index allows and handles duplicated values

        *  index allows and handles NAs (unique version of index can
           still have more than one NA)

        *  index sort order is identical to (implemented via) 'order'
           with 'na.last=TRUE'.

        *  sorted values and original positions are stored simply as
           standard R vectors

        *  values in original sequence can be retrieved via '[' from
           the index (original vector not required, thus no duplication
           of values)

        *  several vector methods give expected results ('length',
           'is.na', 'sort', 'order', '[', 'Comparison operators')

        *  the tree needs minimum RAM due to lean C structs (index is
           close to 2*n)

        *  the tree resolution can be user-defined (RAM - speed -
           trade-off)

        *  the tree will not consume disk space when the index is saved

        *  the tree is build up transparently when needed the first
           time (e.g. after loading the index from disk)

        *  the tree is build up very quickly because the main work -
           the sorting - needs not to be repeated

        *  the tree is removed transparently from the garbage collector
           when the index is deleted

        *  this type of tree should perfectly complement the new disk
           vectors (packages 'ff' and 'R.huge')

_L_i_m_i_t_a_t_i_o_n_s:

        *  currently indexing is only available for character vectors
           (not logical, integer, real, complex)

        *  currently indices cannot be stored as columns of data frames

        *  functions that want to retrieve the index values in original
           order need to call 'indexname[]' if the complete vector is
           needed

        *  building the index is rather slow (as slow as sorting in R
           is, building the tree is very fast)

        *  like all indices the index will reduce performance if a
           large fraction of the vector is accessed.

        *  not all methods have been maximally tuned via C-code (e.g.
           'c.index').

_O_p_e_n _q_u_e_s_t_i_o_n_s:

        *  Will R core officially export C entry points to UTF-8
           resistant strcmp (STRCOLL) and strncmp (nothing yet)?

        *  Shall we make order() generic to allow dispatch of
           order.index like we have already dispatch of sort.index?

        *  Shall we - in R base - modify match and %in% to avoid
           masking them by rindex?

        *  Can we call the evaluator from C in order to make
           <>=comparisons independent of atomic mode (UTF-8 etc.) or
           will this kill to much performance?

        *  Can we call the evaluator from C for [-accessing the vector
           elements in order to generalize the index to ff or R.huge or
           will this kill to much performance?

_N_o_t_e:

     For each 'FOOindexFOO' related to the 'index' class a function
     'FOOrindexFOO' exists related to the 'rindex' class. The 'rindex'
     class is a pure R-prototype and is kept for regression-testing
     using 'binregtest' from package 'regtest'. The regression tests
     are in 'dontshow' sections in the examples of this help. If you
     run 'example(index)' the regression tests will (unavoidably)
     trigger warnings.

_A_u_t_h_o_r(_s):

     Jens Oehlschlgel

_R_e_f_e_r_e_n_c_e_s:

     Tobin J. Lehman, Michael J. Carey (1986) A Study of Index
     Structures for Main Memory Database Management Systems.
     Proceedings of the 12th International Conference on Very Large
     Data Bases, 294 - 303. 
        Pfaff, Ben (2004). An Introduction to Binary Search Trees and
     Balanced Trees, Libavl Binary Search Tree Library. Free Software
     Foundation, Inc.

_S_e_e _A_l_s_o:

     'order', 'sort' , 'match'

_E_x_a_m_p_l_e_s:

       #library(rindex)
       x <- sample(c(rep("c", 5), paste("d", 1:5, sep=""), c(letters[c(1,2,5,6)], NA)))

       cat("\n")
       cat("creating an index from atomic vector\n")
       i <- index(x)
       i
       cat("creating an index by combining indices\n")
       i2 <- c(i,i)
       i2
       cat("if the index (or index$tree) is removed, the C-tree is removed at the next garbage collection\n")
       cat("the index tree can also removed and created explicitely\n")
       i <- indexDelTree(i)
       i
       i <- indexAddTree(i, batch=3)
       print(i, tree=TRUE)
       indexNodes(i)
       indexBytes(i)

       cat("\n")
       cat("extracting the original vector\n")
       i[]
       cat("subsetting works as expected\n")
       i[1:3]
       cat("accessing the sorted data is much faster\n")
       sort(i)[1:3]
       cat("accessing the ordering is also much faster (order.index is not dispatched since order is not yet generic)\n")
       order.index(i)[1:3]

       identical(is.na(i),is.na(x))
       identical(length(i),length(x))

       cat("\n")
       cat("LOW LEVEL SEARCH returns position in SORTED VECTOR\n")
       cat("low level search for position of lowest instance of value in the index\n")
       indexFind(i, "c")
       cat("low level search for position of highest instance of value in the index\n")
       indexFind(i, "c", findlow=FALSE)
       cat("low level search for position of lowest instance beginning like search value\n")
       indexFindlike(i,"d")
       cat("low level search for position of highest instance beginning like search value\n")
       indexFindlike(i,"d",findlow=FALSE)

       cat("\n")
       cat("MID LEVEL SEARCH also returns position in SORTED VECTOR\n")
       cat("mid level search for a set of values\n")
       indexMatch(i,c("c","f"), findlow=TRUE)  # giving parameter findlow= suppresses the warning issued on non-unique indices
       sort(i)[indexMatch(i,c("c","f"), findlow=TRUE)]
       i[indexMatch(i,c("c","f"), findlow=TRUE, what="pos")]
       indexMatch(i,c("c","f"), findlow=TRUE, what="val")
       indexMatch(i,c("c","f"), findlow=TRUE, what="pos")
       indexMatch(i,c("c","f"), findlow=FALSE, what="pos")

       cat("mid level search for interval of values\n")
       indexFindInterval(i,"b","f")
       cat("by default the searched endpoints are included\n")
       sort(i)[indexFindInterval(i,"b","f")]
       cat("but they can be excluded\n")
       sort(i)[indexFindInterval(i,"b","f",high.include=FALSE)]
       cat("by default the searched endpoints need not to be present\n")
       sort(i)[indexFindInterval(i,"a1","e1")]
       cat("but this can be required\n")
       sort(i)[indexFindInterval(i,"a1","e1",low.exact=TRUE)]
       cat("each of the searched endpoints can be defined via indexFindlike\n")
       sort(i)[indexFindInterval(i,"c","d",FUN=indexFindlike)]

       cat("\n")
       cat("HIGH LEVEL SEARCH returns POSITION(s) IN ORIGINAL VECTOR but in SEQUENCE OF INDEX\n")
       indexEQ(i,"d3")
       indexNE(i,"d3")
       indexLT(i,"d3")
       indexLE(i,"d3")
       indexGT(i,"d3")
       indexGE(i,"d3")
       cat("searching for several values returns a list\n")
       indexEQ(i,c("b","c","z",NA))
       indexEQ(i,c("b","c","z",NA), what="val")

       cat("\n")
       cat("HIGH LEVEL OPERATORS returns TRUE/FALSE AT ORIGINAL VECTOR POSITIONS\n")
       i=="d3"
       i!="d3"
       i<"d3"
       i<="d3"
       i>"d3"
       i>="d3"

       cat("HIGH LEVEL match AND %in% behave as expected\n")
       match(c("b","c","z",NA), x)
       match(c("b","c","z",NA), i)
       c("b","c","z",NA) %in% x
       c("b","c","z",NA) %in% i



     ## Not run: 

        cat("function timefactor helps with timing\n")

        n <-  1000000
        x <- sample(1:n)
        names(x) <- paste("a", x, sep="")
        d <- data.frame(x=as.vector(x), row.names=names(x))

        nsub  <- 100
        i <- sample(1:n, nsub)
        ni <- names(x)[i]

        ind <- index(names(x), verbose=TRUE)
        ind

        # test vectors
        cat("character susetting is by magnitude slower than integer subsettting\n")
        timefactor( x[ni] , x[i] , 10, 10000)
        cat("character susetting is approx as slow as matching\n")
        timefactor( x[ni] , x[match(ni,names(x))] , 10, 10)
        cat("for small fractions of n indexing is much faster\n")
        timefactor( x[ni] , x[indexMatch(ind,ni)] , 10, 100)

        # test dataframes
        cat("character susetting is by magnitude slower than integer subsettting\n")
        timefactor( d[ni,] , d[i,] , 1, 100)
        cat("obvious implementation problem (in R-2.5.1 subsetting is much slower than via matching)\n")
        timefactor( d[ni,] , d[match(ni,rownames(d)),] , 1, 1)
        cat("for small fractions of n indexing is much faster\n")
        timefactor( d[ni,] , d[indexMatch(ind,ni),] , 1, 10)

     ## End(Not run)

