gcl                   package:gcl                   R Documentation

_G_C_L: _a _f_u_z_z_y _r_u_l_e _c_l_a_s_s_i_f_i_e_r _g_e_n_e_r_a_t_o_r

_D_e_s_c_r_i_p_t_i_o_n:

     'gcl' is an R function that computes a fuzzy rules classifier
     given numeric input data as the data frame or matrix 'mydata'.
     'gcl' returns an R function that implements the computed
     classifier.

_U_s_a_g_e:

     classifier <- gcl(mydata, nlev=3, filter=1.2, multi=NULL, gcl.verbose=FALSE, ...)
     classifier <- sgcl(mydata, cb=gcl, s.fold=4, s.verbose=FALSE, s.eval=acc.eval, ...)
     classifier <- tcl(mydata, t.nlev = 3, g = gainr, inf.lim = 0.5, ...)

_A_r_g_u_m_e_n_t_s:

  mydata: The input data frame or matrix must have column names.   The
          last column is  taken  to  contain the class labels. All
          entries but the entries in the last column must be numerical.

    nlev: =<integer larger than 1>
           Default value: 3
           Sets how many fuzzy sets the values in each column will be
          represented by. The fuzzy sets have triangular shape and are
          determined by three numbers: the first 0 crossing, the 1
          crossing, and the last 0 crossing. Memberships before the
          first and after the last 0 crossing are 0.

  filter: =<NULL, matrix, data frame, index vector, or positive real
          number>
           Default value: 1.2
           Which data to use for empirical filtering of the rules
          following the rule generation stage. The objective of the
          filtering is to remove redundant rules. The data used for
          this is determined as follows: If filter is NULL, no
          filtering is done. If filter is a matrix or data frame, this
          will be used. If filter is an index vector (boolean or
          integer), the rows in the data indexed by the index vector
          are used for filtering. If filter is a positive real number,
          a subset of the supplied data is sampled for construction of
          the rules such that each row is included with probability 1
          minus the fractional part of filter, i.e., 1 - (filter -
          floor(filter)). If filter < 1, the rows not used for rule
          construction are used for rule filtering, i.e., to identify
          redundant rules and remove them. If filter >= 1, all the
          data is used for filtering.

   multi: =<NULL or positive integer>
           Default value: NULL
           If multi is NULL, rules are created from the entire input
          data set. If multi is not NULL, the input data is
          partitioned into multi equally sized sets. Rules are created
          from each of the multi possible unions of (multi - 1) of
          these sets. The concatenation of the resulting lists of
          rules is taken as the output of the rule generation stage.

gcl.verbose: =<TRUE or FALSE>
           Default value: FALSE
           Make gcl output a little info while running.

      cb: =<classifier builder function>
           Default value: gcl
           Which classifier builder to use.

  s.fold: =<positive integer>
           Default value: 4
           The number of folds used for cross validation in sgcl.

  s.eval: =<function(classifier function, data) returning a numeric
          matrix>
           Default value: acc.eval
           The evaluator used by sgcl. The default, acc.eval, computes
          accuracy.

s.verbose: =<TRUE or FALSE>
           Default value: FALSE
           Make sgcl output a little info while running.

  t.nlev: =<integer larger than 1, or 0>
           Default value: 3
           Sets how many fuzzy sets the values in each column will be
          represented by. The fuzzy sets have triangular shape and are
          determined by three numbers: the first 0 crossing, the 1
          crossing, and the last 0 crossing. Memberships before the
          first and after the last 0 crossing are 0. Can be set to 0
          in order to build a non-fuzzy classification tree.

       g: =<function taking two vectors of equal length returning a
          number>
           Default value: gainr
           The splitting function used by tcl. The two implemented
          choices are gain and gainr: gain is the information gain
          known from information theory, and gainr is the gain ratio.

 inf.lim: =<non-negative real number>
           Default value: 0.5
           If the information content in the outcome attribute is less
          than this  limit for the current partition class under
          consideration, tcl will not split further.
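
     The numeric case of 'filter' described above can be sketched in
     base R as follows. This is only an illustration of the documented
     rule, not the code used by 'gcl'. The helper name
     'split_by_filter' and its return format are assumptions made for
     this example.

```r
# Hypothetical helper illustrating the documented rule; not gcl's own code.
split_by_filter <- function(n, filter) {
  frac    <- filter - floor(filter)   # fractional part of filter
  p_build <- 1 - frac                 # per-row probability of rule construction
  build   <- runif(n) < p_build       # TRUE: row is used to construct rules
  # filter < 1: only the rows left out of construction are used for filtering;
  # filter >= 1: all rows are used for filtering
  check   <- if (filter < 1) !build else rep(TRUE, n)
  list(p_build = p_build, build = build, check = check)
}

set.seed(1)
s <- split_by_filter(n = 150, filter = 1.2)
s$p_build      # 0.8
sum(s$check)   # 150, since filter >= 1
```

     With the default value 1.2, each row thus has probability 0.8 of
     being used for rule construction, and all rows are used for
     filtering.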

_D_e_t_a_i_l_s:

     *gcl*

     This function computes a fuzzy rules classifier given numeric
     input data as the data frame or matrix mydata. 

     The algorithm for doing so is described in Vinterbo et al., 2005. 

     When applied, 'gcl' returns another R function that implements the
     found classifier. This computed classifier function takes one
     argument, a vector, matrix or data frame to be classified, and
     outputs a vector of class memberships for each input vector,
     matrix or data frame row. (See examples section below). 

     Even though the paper cited above is on classification using gene
     expression data, numerical data in general can be used. For
     instance


     > library(gcl)
     > library(datasets)
     > data(iris)
     > classifier <- gcl(iris, nlev=5)
     > acc.eval(classifier, iris)

     computes a fuzzy rule classifier for Edgar Anderson's Iris Data
     set and evaluates the classifier accuracy on the same data set. 
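
     To make the evaluation step concrete, the following base R
     sketch shows one plausible way to compute accuracy from a matrix
     of class memberships. It is not the implementation of 'acc.eval',
     which may differ (for instance in what it returns). The names
     'accuracy_sketch' and 'toy_classifier' are hypothetical, and the
     sketch assumes a data frame whose last column holds the class
     labels.

```r
# Hypothetical stand-in for an accuracy evaluator; acc.eval itself may differ.
accuracy_sketch <- function(classifier, data) {
  labels <- as.character(data[[ncol(data)]])   # last column holds class labels
  m <- classifier(data[-ncol(data)])           # membership matrix, one column per class
  # predict the class whose column has the largest membership in each row
  predicted <- colnames(m)[max.col(m, ties.method = "first")]
  mean(predicted == labels)                    # fraction of correctly classified rows
}

# Toy classifier (made up for this example): crisp memberships from one column.
toy_classifier <- function(x) {
  pl <- x[["Petal.Length"]]
  cbind(setosa     = as.numeric(pl < 2.5),
        versicolor = as.numeric(pl >= 2.5 & pl < 4.8),
        virginica  = as.numeric(pl >= 4.8))
}

data(iris)
acc <- accuracy_sketch(toy_classifier, iris)
acc
```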

     The function 'gcl' can also be given an optional argument 'cfun =
     function(attribute.values,outcomes,...)' that given a vector
     attribute.values and a vector outcomes determines the inclusion
     cost that should be associated with the attribute that has the
     values found in attribute.values. An example could be
     'function(a,b) 1/abs(cor(a,b))' that associates less cost with an
     attribute that has a higher absolute value correlation with the
     outcome. Note that the values given to the function cfun are the
     values for the attribute after discretization. 
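
     As an illustration of the cost function mentioned above, the
     sketch below applies it to made-up discretized attribute values.
     The function name 'cost' and the toy vectors are assumptions for
     this example only. The attribute that tracks the outcome more
     closely receives the lower inclusion cost.

```r
# Inclusion cost from the text: lower cost for attributes that correlate
# more strongly (in absolute value) with the outcome.
cost <- function(attribute.values, outcomes, ...) {
  1 / abs(cor(attribute.values, outcomes))
}

# Made-up discretized attributes (levels 1..3) and a binary outcome.
outcome <- c(0, 0, 0, 1, 1, 1)
strong  <- c(1, 1, 2, 3, 3, 3)   # tracks the outcome closely
weak    <- c(1, 3, 2, 2, 3, 3)   # only loosely related to the outcome

cost(strong, outcome) < cost(weak, outcome)   # TRUE

# Such a function would be passed as the 'cfun' argument, e.g.
# gcl(mydata, cfun = cost)   (not run here)
```

     Note that an attribute with zero correlation yields an infinite
     cost, and a constant attribute yields 'NA' (with a warning from
     'cor').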

     *computed classifier*

     The computed classifier is a function that takes one argument, the
     numeric vector, matrix or data frame to be classified. When
     applied it outputs a vector of class memberships for each input
     vector, matrix or data frame row. The input data has to have
     (column) names compatible with the names of the data from which
     the classifier function was generated. Otherwise, the classifier
     function cannot operate. 

     The data supplied to the computed classifier function cannot
     contain non-numeric data. Specifically, if a classifier input data
     frame contains a non-numeric class labels column (typically a
     factor), this must be removed before application. For example, use


       > classifier(inputdata[-ncol(inputdata)])

     if the offending column is the last one. 
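
     A small base R helper along the following lines can be used to
     check column names and drop non-predictor columns before calling
     a computed classifier. The helper 'prepare_input' and the
     variable 'trained_names' are hypothetical and not part of the gcl
     package.

```r
# Hypothetical helper: keep only the columns a classifier was trained on,
# and fail early with a readable message if any of them are missing.
prepare_input <- function(newdata, trained_names) {
  absent <- setdiff(trained_names, colnames(newdata))
  if (length(absent) > 0) {
    stop("missing columns: ", paste(absent, collapse = ", "))
  }
  newdata[, trained_names, drop = FALSE]
}

data(iris)
trained_names <- colnames(iris)[-ncol(iris)]   # predictors; last column holds labels
x <- prepare_input(iris, trained_names)
colnames(x)                  # the four measurement columns, without Species
all(sapply(x, is.numeric))   # TRUE
```

     The result 'x' could then be passed to a computed classifier
     function.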

     The computed classifier function can be "dumped" to file by using
     R's 'dump' function. If classifier is the name of the computed
     function,  then


       > dump("classifier","classifier.r")

     creates a file 'classifier.r' containing the R source code of the
     function classifier. This source code can then be distributed and
     will work as a stand-alone program. 

     If the computed classifier function is called with no argument,
     or with a 'NULL' argument, it returns a documentation string.
     The content of this string is decided by the value of the
     'gcl.decorate' option at the time of the gcl call. If
     'getOption("gcl.decorate")' returns 1, the string contains the
     fuzzy rules in a human readable format. If it returns 2
     (default), each rule is also followed by the three numbers
     determining the membership function of each antecedent fuzzy
     proposition. If it returns NULL, no information about the rules
     is generated. This can be used to save space and loading time.
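
     The three numbers that determine a membership function are the
     first 0 crossing, the 1 crossing, and the last 0 crossing of a
     triangular fuzzy set (see 'nlev'). A minimal sketch of such a
     function in base R is given below. The name 'tri_membership' is
     hypothetical, and the sketch assumes finite values a < b < c. How
     'gcl' treats infinite end points is not covered here.

```r
# Triangular membership function defined by three numbers:
# the first 0 crossing (a), the 1 crossing (b), and the last 0 crossing (c).
# Hypothetical helper, not part of the gcl package; assumes finite a < b < c.
tri_membership <- function(x, a, b, c) {
  rising  <- (x - a) / (b - a)     # climbs from 0 at a to 1 at b
  falling <- (c - x) / (c - b)     # drops from 1 at b to 0 at c
  pmax(0, pmin(rising, falling))   # membership is 0 before a and after c
}

tri_membership(c(0, 0.5, 1, 1.5, 2, 3), a = 0, b = 1, c = 2)
# 0.0 0.5 1.0 0.5 0.0 0.0
```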

     The computed classifier function returned has four attributes
     that can be accessed by the 'attributes()' function. They are
     'summary.gcl.rnum', 'summary.gcl.amean', 'summary.gcl.natt' and
     'summary.gcl.nlev'.  If 'getOption("gcl.decorate")' returns a
     positive number, they contain the number of rules in the
     classifier, the average number of attributes in the rule
     antecedents, the number of distinct attributes found in the rules,
     and the value of the 'nlev' parameter passed to the 'gcl'
     function. The classifier function object returned by 'tcl' has
     similar attributes. 

     *sgcl*

     The function 'sgcl' partitions the input data 'mydata' into two
     data sets, training and holdout. It then performs an n-fold (given
     by the parameter 's.fold') cross validation over the training set,
     using the classifier builder 'cb' (default 'gcl') to generate 
     classifiers. This process results in classifiers c_i for i =
     1,2,...,n with associated performance measures p_i. Each
     classifier c_i generated  during the cross validation is applied
     to the holdout data set, resulting in associated performance
     measure q_i. For each classifier c_i, the expression

                 (q_i + p_i)/2 * 1/(1 + |q_i - p_i|)

     is evaluated, and the classifier that maximizes this expression is
     returned by 'sgcl'. The rationale for this is that we want the
     classifier with the best consistent performance. In addition to
     the arguments listed above, 'sgcl' takes the arguments that 'cb'
     and 'cv' take. The default performance measure used by sgcl is
     accuracy as computed by 'acc.eval'. Ties are broken arbitrarily.  
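
     The selection rule above can be illustrated with a few lines of
     base R. The function name 'consistency_score' and the accuracy
     values are made up for this example. Note that 'which.max' picks
     the first maximum, which is just one way of breaking ties.

```r
# Selection score from the text: the mean of the two performance measures,
# penalized by the gap between them.
consistency_score <- function(p, q) {
  (q + p) / 2 * 1 / (1 + abs(q - p))
}

# Made-up accuracies for four classifiers c_1, ..., c_4.
p <- c(0.95, 0.90, 0.85, 0.99)   # cross validation performance p_i
q <- c(0.80, 0.88, 0.84, 0.70)   # holdout performance q_i

scores <- consistency_score(p, q)
best <- which.max(scores)   # index of the classifier that would be selected
best                        # 2
```

     Classifier 4 has the highest cross validation accuracy here, but
     its large drop on the holdout data gives it the lowest score.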

     *tcl*

     The experimental function 'tcl' computes a classification tree
     classifier using a recursive partitioning algorithm similar to
     ID3.
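
     For readers unfamiliar with ID3-style splitting criteria, the
     following base R sketch computes information gain and gain ratio
     for a discrete attribute. It follows the textbook definitions and
     is not the code of the package functions 'gain' and 'gainr'. The
     names 'entropy', 'info_gain' and 'gain_ratio' are chosen for this
     example.

```r
# Shannon entropy (in bits) of a vector of discrete labels.
entropy <- function(y) {
  p <- as.numeric(table(y)) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of splitting the outcome y by the attribute x.
info_gain <- function(x, y) {
  parts <- split(y, x, drop = TRUE)   # outcome values per partition class
  w <- lengths(parts) / length(y)     # relative size of each class
  entropy(y) - sum(w * sapply(parts, entropy))
}

# Gain ratio: information gain normalized by the entropy of the split itself.
# A constant x has zero split entropy, so the ratio is undefined (NaN).
gain_ratio <- function(x, y) info_gain(x, y) / entropy(x)

x <- c("a", "a", "b", "b")
y <- c(0, 0, 1, 1)
info_gain(x, y)    # 1: the split separates the outcomes perfectly
gain_ratio(x, y)   # 1
```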

_V_a_l_u_e:

     The functions 'gcl', 'tcl', and 'sgcl' return a function
     representing the computed classifier. 

     The computed classifier function returns one of the following: a
     matrix of class memberships with as many columns as the original
     data had class labels, a text string describing the classifier,
     or 'NULL'.

_N_o_t_e:

     If the column names do not match between the original data and the
     data to be classified by the computed function, the error  'Error
     in x[[ind]] : subscript out of bounds' is likely.

     Note that applying sgcl to small data sets is not advisable as the
     data is split repeatedly, making the learning and filtering sets
     even smaller.

_A_u_t_h_o_r(_s):

     Staal A. Vinterbo (C) 2007
      staal@dsg.harvard.edu

_R_e_f_e_r_e_n_c_e_s:

     Vinterbo, S.A.; Kim, E. and Ohno-Machado, L. _Small, fuzzy and
     interpretable gene expression based classifiers_. Bioinformatics,
     2005, 21, 1964-1970. <URL:
     http://bioinformatics.oxfordjournals.org/cgi/content/abstract/21/9/1964>

_S_e_e _A_l_s_o:

     <URL: http://www.r-project.org/>

_E_x_a_m_p_l_e_s:

     ## run the demo
     demo(gcldemo)

     ## play with the iris data set:
     ## Not run: 
     library(datasets)
     data(iris)
     classifier <- gcl(iris, nlev=5)
     acc.eval(classifier, iris)
     ## End(Not run)

     ## compare performance of gcl and tcl
     ## Not run: 
     library(datasets)
     data(iris)
     cv52(iris, gcl, tcl, acc.eval, nlev=5, t.nlev=5)
     ## End(Not run)

     ## or a little more complex
     library(gcl)
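     # truth table for two bits, one row per input combination
     count <- matrix(c(0,0,0,1,1,0,1,1),ncol=2,byrow=TRUE)
     # append the XOR of the two bits as the class label column
     xordata <- cbind(count, apply(count, 1, function(x) xor(x[1],x[2])))
     colnames(xordata) <- c("Bit.1", "Bit.2", "XOR")
     # two fuzzy sets per column, no rule filtering (filter=NULL)
     cf <- gcl(xordata, nlev=2, filter=NULL)
     # calling the classifier without arguments returns its description
     cat(cf())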
     ## Not run: 
     # should produce something like:
     Generated by gcl v1.06c Sat Nov 12 19:25:12 2005.
      nlev=2, filtering: no filtering took place
      rule generation: no subsampling.
      (c) Copyright 2005, Staal Vinterbo, all rights reserved.
     Bit.1=2 & Bit.2=2 => XOR=0 [ 0 1 Inf ],[ 0 1 Inf ]
     Bit.1=2 & Bit.2=1 => XOR=1 [ 0 1 Inf ],[ -Inf 0 1 ]
     Bit.1=1 & Bit.2=2 => XOR=1 [ -Inf 0 1 ],[ 0 1 Inf ]
     Bit.1=1 & Bit.2=1 => XOR=0 [ -Inf 0 1 ],[ -Inf 0 1 ]
     ## End(Not run)
     v <- c(0,1)
     names(v) <- colnames(xordata)[1:2]
     cf(v)
     ## Not run: 
     # produces:
                 0 1
            [1,] 0 1
     dump("cf", "cf.r")
     rm(cf)
     source("cf.r")
     cf(v)
     # produces:
                 0 1
            [1,] 0 1
     ## End(Not run)

