clean                 package:dprep                 R Documentation

_D_a_t_a_s_e_t _C_l_e_a_n_i_n_g

_D_e_s_c_r_i_p_t_i_o_n:

     A function to eliminate rows and columns that have a percentage
     of missing values greater than the allowed tolerance.

_U_s_a_g_e:

     clean(w, tol.col = 0.5, tol.row = 0.3, name = "")

_A_r_g_u_m_e_n_t_s:

       w: the dataset to be examined and cleaned

 tol.col: maximum ratio of missing values allowed in columns. The
          default value is 0.5. Columns with a larger ratio of missing
          values will be eliminated unless they have been determined
          to be relevant attributes.

 tol.row: maximum ratio of missing values allowed in rows. The default
          value is 0.3. Rows with a ratio of missing values larger
          than the established tolerance will be eliminated.

    name: name of the dataset to be used for the optional report
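
     The two tolerances are compared against per-column and per-row
     ratios of missing values. The following is a minimal sketch of
     how such ratios could be computed in base R; it is not the code
     of 'clean' itself. The data frame 'toy' and the names
     'col.ratio', 'row.ratio', 'cols.over' and 'rows.over' are
     hypothetical, and the exception for relevant attributes is not
     modelled.

```r
# Hypothetical toy data (not part of dprep) with some missing values
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, NA, 4),
  c = c(1, 2, 3, 4),
  d = c(1, 2, NA, 4)
)

# Same values as the documented defaults of clean()
tol.col <- 0.5
tol.row <- 0.3

# Ratio of missing values in each column and in each row
col.ratio <- colMeans(is.na(toy))
row.ratio <- rowMeans(is.na(toy))

# Columns and rows whose ratio exceeds the corresponding tolerance
cols.over <- names(toy)[col.ratio > tol.col]
rows.over <- which(row.ratio > tol.row)
```

     Both ratios here are computed on the untouched data. Whether
     'clean' recomputes the row ratios after dropping columns is not
     stated on this page, so the sketch makes no assumption about the
     order of the two steps.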

_D_e_t_a_i_l_s:

     This function can create an optional report on the cleaning
     process if the comment symbols are removed from the last lines of
     the function's code. The report is returned to the workspace,
     where it can be reexamined as needed. The report object's name
     begins with 'Clean.rep'.

_V_a_l_u_e:

       w: the original dataset, with missing values that were in
          relevant variables imputed

_A_u_t_h_o_r(_s):

     Caroline Rodriguez

_R_e_f_e_r_e_n_c_e_s:

     Acuna, E. and Rodriguez, C. (2004). The treatment of missing
     values and its effect in the classifier accuracy. In D. Banks, L.
     House, F.R. McMorris, P. Arabie, W. Gaul (Eds.). Classification,
     Clustering and Data Mining Applications. Springer-Verlag
     Berlin-Heidelberg, 639-648.

_S_e_e _A_l_s_o:

     'ce.impute'

_E_x_a_m_p_l_e_s:

     #-----Dataset cleaning-----
     data(hepatitis)
     hepa.cl=clean(hepatitis,0.5,.03,name="hepatitis-clean")

