interestMeasure            package:arules            R Documentation

_C_a_l_c_u_l_a_t_i_n_g _v_a_r_i_o_u_s _a_d_d_i_t_i_o_n_a_l _i_n_t_e_r_e_s_t _m_e_a_s_u_r_e_s

_D_e_s_c_r_i_p_t_i_o_n:

     Provides the generic function 'interestMeasure' and the S4 method
     needed to calculate various additional interest measures for
     existing sets of itemsets or rules.

_U_s_a_g_e:

     interestMeasure(x, method, transactions = NULL, ...)

_A_r_g_u_m_e_n_t_s:

       x: a set of itemsets or rules. 

  method: name of the interest measure (see Details for the available
          measures).

transactions: the transaction data set used to mine the associations.

     ...: further arguments for the measure calculation. 

_D_e_t_a_i_l_s:

     For itemsets the following measures are implemented:  

     "_a_l_l_C_o_n_f_i_d_e_n_c_e" (see, Omiencinski, 2003) is defined on itemsets as
          the minimum confidence of all possible rule generated from
          the itemset.

     "_c_r_o_s_s_S_u_p_p_o_r_t_R_a_t_i_o" (see, Xiong et al., 2003) is defined on
          itemsets as the ratio of the support of the least frequent
          item to the support of the most frequent item.  Cross-support
          patterns have a ratio smaller than a set threshold. Normally
          many found patterns are cross-support patterns which contain
          frequent as well as rare items. Such patterns often tend to
          be spurious.
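
     The two itemset measures can be illustrated with plain R
     arithmetic. The sketch below does not call the package; the
     supports are made-up values for a hypothetical itemset {A, B, C}.
     It relies on the fact that the minimum confidence over all rules
     generated from an itemset is reached when the most frequent
     single item forms the lhs.

```r
# Hypothetical supports of the single items in the itemset {A, B, C}
item_support <- c(A = 0.60, B = 0.30, C = 0.05)

# Hypothetical support of the whole itemset
itemset_support <- 0.04

# allConfidence: itemset support divided by the largest item support
all_confidence <- itemset_support / max(item_support)

# crossSupportRatio: least frequent item relative to most frequent item
cross_support_ratio <- min(item_support) / max(item_support)

print(c(allConfidence = all_confidence,
        crossSupportRatio = cross_support_ratio))
```

     A low crossSupportRatio such as the one above flags an itemset
     that mixes a very frequent item with a rare one.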

     For rules the following measures are implemented:  

     "_c_h_i_S_q_u_a_r_e" (see Liu et al. 1999). The chi-square statistic  to
          test for independence between the lhs and rhs of the rule.
          The critical value of the chi-square distribution with 1
          degree of  freedom (2x2 contengency table) at alpha=0.05  is
          3.84; higher chi-square values indicate that the lhs and the
          rhs are not independent.  
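
          As a rough illustration, the statistic for a single rule can
          be reproduced in base R from a 2x2 contingency table. The
          counts below are invented, and the test is run without
          continuity correction, which is an assumption and not
          necessarily what the package does internally.

```r
# Hypothetical counts for a rule X -> Y
n_trans <- 1000  # number of transactions
c_X     <- 200   # transactions containing X
c_Y     <- 300   # transactions containing Y
c_XY    <- 100   # transactions containing both X and Y

# 2x2 contingency table of lhs against rhs
tab <- matrix(c(c_XY,       c_X - c_XY,
                c_Y - c_XY, n_trans - c_X - c_Y + c_XY),
              nrow = 2, byrow = TRUE,
              dimnames = list(lhs = c("X", "not X"),
                              rhs = c("Y", "not Y")))

# Pearson chi-square statistic (no continuity correction: an assumption)
chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)

# Critical value for 1 degree of freedom at alpha = 0.05
critical <- qchisq(0.95, df = 1)

print(chi2 > critical)
```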

     "_c_o_s_i_n_e" (see Tan et al. 2004) equivalent to the IS measure. 
          Range: 0...1. 

     "_c_o_n_v_i_c_t_i_o_n" (see Brin et al. 1997) defined as  P(X)P(not Y)/P(X
          and not Y).  Range: 0.5...1... Inf (1 indicates unrelated
          items).
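
          The definition can be checked by hand with a few lines of
          base R. The counts are invented for illustration; the second
          form, (1 - P(Y)) / (1 - confidence), is an algebraically
          equivalent rewrite of the same formula.

```r
# Hypothetical counts for a rule X -> Y
n_trans <- 1000
c_X     <- 200
c_Y     <- 300
c_XY    <- 100

p_X       <- c_X / n_trans
p_not_Y   <- 1 - c_Y / n_trans
p_X_not_Y <- (c_X - c_XY) / n_trans  # X occurs without Y

# conviction as defined in the text
conviction <- p_X * p_not_Y / p_X_not_Y

# equivalent form based on the confidence of the rule
confidence     <- c_XY / c_X
conviction_alt <- p_not_Y / (1 - confidence)

print(c(conviction = conviction, conviction_alt = conviction_alt))
```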

     "_g_i_n_i" gini index (see Tan et al. 2004). Range: 0...1.

     "_h_y_p_e_r_L_i_f_t" (see, Hahsler et al., 2005) is an adaptation of the
          lift measure which is more robust for low counts. It is based
          on the idea that under independence the count c_{XY} of the
          transactions which contain all items in a rule X -> Y follows
          a hypergeometric distribution  (represented by the random
          variable C_{XY}) with the parameters given by the counts  c_X
          and  c_Y.

          Lift is defined for the rule X -> Y as:

         lift(X -> Y) = P(X+Y)/(P(X)*P(Y)) = c_XY / E[C_XY],

          where E[C_XY] = c_X c_Y / m with m being the number of
          transactions in the database.

          Hyper-lift is defined as:

                hyperlift(X -> Y) = c_XY / Q_d[C_XY],

          where Q_d[C_XY] is the d-quantile of the hypergeometric
          distribution. The quantile level can be given as parameter
          'd' (default: 'd=0.99'). Range: 0...Inf.
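
          A minimal sketch of both formulas using base R's
          hypergeometric quantile function 'qhyper'. The counts are
          invented, and 'n_trans' stands for m in the formulas (renamed
          here to avoid a clash with the 'm' argument of 'qhyper').
          This is an illustration of the definition, not the package's
          own code.

```r
# Hypothetical counts for a rule X -> Y
n_trans <- 1000  # m in the formulas above
c_X     <- 200
c_Y     <- 300
c_XY    <- 100

# Lift: observed count relative to the expected count under independence
expected <- c_X * c_Y / n_trans
lift     <- c_XY / expected

# Hyper-lift: observed count relative to the d-quantile of the
# hypergeometric distribution (c_Y "successes" among n_trans
# transactions, c_X draws)
d   <- 0.99
Q_d <- qhyper(d, m = c_Y, n = n_trans - c_Y, k = c_X)
hyper_lift <- c_XY / Q_d

print(c(lift = lift, hyperLift = hyper_lift))
```

          Because the 0.99 quantile lies above the expected count here,
          hyper-lift is smaller than lift for the same rule.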


     "_h_y_p_e_r_C_o_n_f_i_d_e_n_c_e" (based on Hahsler et al., 2005) calculates the
          confidence level that we observe too high/low counts  for
          rules X -> Y using the hypergeometric model. Since the counts
          are drawn from a hypergeometric distribution  (represented by
          the random variable C_{XY}) with known parameters given by
          the counts  c_X and  c_Y, we can calculate a confidence
          interval for the observed counts  c_{XY} stemming from the
          distribution. Hyperconfidence reports the confidence level 
          (significance level if 'significance=TRUE' is used) for

          _c_o_m_p_l_e_m_e_n_t_s - 1 - P[C_{XY} >= c_{XY} | c_X, c_Y]

          _s_u_b_s_t_i_t_u_t_e_s - 1 - P[C_{XY} < c_{XY} | c_X, c_Y].

          A confidence level above, e.g., 0.95 indicates that there is
          less than a 5% chance that the count for the rule was
          generated randomly.

          By default complementary effects are mined; substitutes can
          be found by using the parameter 'complements = FALSE'.
          Range: 0...1.
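
          The two probabilities can be sketched with base R's
          'phyper'. The counts are invented and 'n_trans' denotes the
          number of transactions; this follows the formulas as written
          in this section and is not taken from the package source.

```r
# Hypothetical counts for a rule X -> Y
n_trans <- 1000
c_X     <- 200
c_Y     <- 300
c_XY    <- 100

# complements: 1 - P[C_XY >= c_XY] = P[C_XY <= c_XY - 1]
hc_complements <- phyper(c_XY - 1, m = c_Y, n = n_trans - c_Y, k = c_X)

# substitutes: 1 - P[C_XY < c_XY] = P[C_XY > c_XY - 1]
hc_substitutes <- phyper(c_XY - 1, m = c_Y, n = n_trans - c_Y, k = c_X,
                         lower.tail = FALSE)

print(c(complements = hc_complements, substitutes = hc_substitutes))
```

          With an observed count far above the expected 60, the
          complements value is close to 1 and the substitutes value is
          close to 0.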

     "_i_m_p_r_o_v_e_m_e_n_t" (see Bayardo et al. 2000) the  improvement of a rule
          is  the minimum difference between its confidence and the
          confidence of any proper sub-rule with the same consequent.
          Range: 0...1.

     "_l_e_v_e_r_a_g_e" (see Piatetsky-Shapiro 1991) defined as P(X->Y) -
          (P(X)P(Y)). It measures the difference of X and Y appearing
          together in the data set  and what would be expected if X and
          Y where statistically dependent.  Range: {0...1}.

     "_p_h_i" the correlation coefficient phi  (see Tan et al. 2004)
          Range: -1 (perfect neg. correlation) to +1 (perfect pos.
          correlation).

     "_o_d_d_s_R_a_t_i_o" (see Tan et al. 2004). The odds of finding X in
          transactions which contain Y divided by the odds of finding X
          in transactions which do not contain Y. Range: 0...1... Inf (
          1 indicates that Y is not associated to X). 
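
     Leverage, phi and the odds ratio all derive from the same 2x2
     table of counts, so they can be checked together by hand. The
     sketch below uses invented counts and plain R arithmetic; the
     variable names are illustrative only.

```r
# Hypothetical counts for a rule X -> Y
n_trans <- 1000
c_X     <- 200
c_Y     <- 300
c_XY    <- 100

p_X  <- c_X / n_trans
p_Y  <- c_Y / n_trans
p_XY <- c_XY / n_trans

# leverage: observed joint probability minus the product of the marginals
leverage <- p_XY - p_X * p_Y

# phi: leverage scaled by the standard deviations of the two indicators
phi <- leverage / sqrt(p_X * (1 - p_X) * p_Y * (1 - p_Y))

# odds ratio: odds of X given Y divided by odds of X given not Y
odds_X_given_Y     <- c_XY / (c_Y - c_XY)
odds_X_given_not_Y <- (c_X - c_XY) / (n_trans - c_X - c_Y + c_XY)
odds_ratio <- odds_X_given_Y / odds_X_given_not_Y

print(c(leverage = leverage, phi = phi, oddsRatio = odds_ratio))
```

     For these counts the odds ratio is 3, so X is three times as
     likely (in odds terms) in transactions that contain Y.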

     Note that calculating the interest measures requires the support
     (for rules also confidence and lift) stored in the quality slot
     of 'x'. These measures are returned by the mining algorithms
     implemented in this package. Note also that the calculation of
     some measures is quite slow since we do not have access to the
     original itemset structure which was used for mining.

_V_a_l_u_e:

     A numeric vector containing the values of the interest measure
     for each association in the set of associations 'x'.

_R_e_f_e_r_e_n_c_e_s:

     R. Bayardo, R. Agrawal, and D. Gunopulos (2000). Constraint-based
     rule mining in large, dense databases.  _Data Mining and Knowledge
     Discovery_, 4(2/3):217-240.

     Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur
     (1997). Dynamic itemset counting and implication rules for market
     basket data. In _SIGMOD 1997, Proceedings ACM SIGMOD International
     Conference on Management of Data_, pages 255-264, Tucson, Arizona,
     USA.

     Michael Hahsler, Kurt Hornik, and Thomas Reutterer (2005). 
     _Implications of probabilistic data modeling for rule mining_. 
     Report 14, Research Report Series, Department of Statistics and
     Mathematics, Wirtschaftsuniversitaet Wien, Augasse 2-6, 1090 Wien,
     Austria.

     Bing Liu, Wynne Hsu, and Yiming Ma (1999). Pruning and summarizing
     the discovered associations. In _KDD '99: Proceedings of the fifth
     ACM SIGKDD international conference on Knowledge discovery and
     data mining_, pages 125-134.  ACM Press.

     Edward R. Omiecinski (2003). Alternative interest measures for
     mining associations in databases. _IEEE Transactions on Knowledge
     and Data Engineering_, 15(1):57-69.

     Pang-Ning Tan, Vipin Kumar, and Jaideep Srivastava (2004).
     Selecting the right objective measure for association analysis.
     _Information Systems_, 29(4):293-313.

     Gregory Piatetsky-Shapiro (1991). Discovery, analysis, and
     presentation of strong rules. In _Knowledge Discovery in
     Databases_, pages 229-248.

     Hui Xiong, Pang-Ning Tan, and Vipin Kumar (2003). Mining strong
     affinity association patterns in data sets with skewed support
     distribution. In Bart Goethals and Mohammed J. Zaki, editors,
     _Proceedings of the IEEE International Conference on Data Mining_,
     November 19-22, 2003, Melbourne, Florida, pages 387-394.

_S_e_e _A_l_s_o:

     'itemsets-class', 'rules-class'

_E_x_a_m_p_l_e_s:

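     ## load the package and mine rules from the Income data set
     library("arules")
     data("Income")
     rules <- apriori(Income)

     ## add hyperConfidence to the quality measures of the rules
     quality(rules) <- cbind(quality(rules),
             hyperConfidence = interestMeasure(rules,
                     method = "hyperConfidence", transactions = Income))

     ## inspect the rules with the highest hyperConfidence
     inspect(head(SORT(rules, by = "hyperConfidence")))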

