upData                 package:Hmisc                 R Documentation

_U_p_d_a_t_e _a _D_a_t_a _F_r_a_m_e _o_r _C_l_e_a_n_u_p _a _D_a_t_a _F_r_a_m_e _a_f_t_e_r _I_m_p_o_r_t_i_n_g

_D_e_s_c_r_i_p_t_i_o_n:

     'cleanup.import' will correct errors and shrink the size of data
     frames created by the S-Plus 'File ... Import' dialog or by other
     methods such as 'scan' and 'read.table'.  By default, double
     precision numeric variables are changed to single precision
     (S-Plus only) or to integer when they contain no fractional
     components.  Infinite values or values greater than 1e20 in
     absolute value are set to NA.  This solves problems of importing
     Excel spreadsheets that contain occasional character values for
     numeric columns, as S-Plus converts these to 'Inf' without
     warning.  There is also an option to convert variable names to
     lower case and to add labels to variables.  Adding labels is
     easier if you import a CNTLOUT dataset created by SAS PROC
     CONTENTS and pass it through the 'sasdict' option, as shown in
     the last example below.
     'cleanup.import' can also transform character or factor variables
     to dates.
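
     The numeric rules described above can be sketched in base R.
     This is an illustration only, not the code that 'cleanup.import'
     runs.  The helper names 'clean_numeric' and 'factor_to_numeric'
     are hypothetical, and the factor rule follows the description of
     the 'force.numeric' argument.

```r
# Hypothetical helpers for illustration; they are not part of Hmisc.

clean_numeric <- function(x, big = 1e20) {
  # Infinite values and values larger than 'big' in absolute value
  # are set to NA
  x[is.infinite(x) | (!is.na(x) & abs(x) > big)] <- NA
  # Store as integer when no remaining value has a fractional part
  ok <- x[!is.na(x)]
  if (length(ok) > 0 && all(ok == round(ok)) &&
      all(abs(ok) <= .Machine$integer.max))
    storage.mode(x) <- "integer"
  x
}

factor_to_numeric <- function(f) {
  lev <- levels(f)
  num <- suppressWarnings(as.numeric(lev))
  # Convert only if every level is numeric or the empty string;
  # "" maps to NA
  if (all(lev == "" | !is.na(num))) num[as.integer(f)] else f
}

a <- clean_numeric(c(1, 2, Inf, 3e25))       # integer storage, two NAs
b <- clean_numeric(c(0.5, 2, NA))            # stays double
g <- factor_to_numeric(factor(c("10", "", "2.5")))
```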

     'upData' updates a data frame without attaching it in search
     position one.  New variables can be
     added, old variables can be modified, variables can be removed or
     renamed, and '"labels"' and '"units"' attributes can be provided. 
     Various checks are made for errors and inconsistencies, with
     warnings issued to help the user.  Levels of factor variables can
     be replaced, especially using the 'list' notation of the standard
     'merge.levels' function.  Unless 'force.single' is set to 'FALSE',
     'upData' also converts double precision vectors to single
     precision (if not under R), or to integer if no fractional values
     are present in a vector.
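
     The 'list' notation for levels can be tried in base R, whose
     'levels<-' replacement function for factors accepts a named list
     in which each name is a new level and each element gives the old
     levels to combine.  The sketch below assumes that the 'levels'
     argument of 'upData' uses the same form, as the Examples section
     suggests; it does not call 'upData' itself.

```r
y <- factor(c("a", "b1", "b2", "b1"))

# Each list name is a new level; its value lists the old levels
# that are merged into it
levels(y) <- list(a = "a", b = c("b1", "b2"))

new_levels <- levels(y)
new_values <- as.character(y)
```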

     Both 'cleanup.import' and 'upData' will fix a problem with data
     frames created under S-Plus before version 5 that are used in
     S-Plus 5 or later.  The problem was caused by use of the 'label'
     function to set a variable's class to '"labelled"'.  These classes
     are removed as the S version 4 language does not support multiple
     inheritance.  Failure to run data frames through one of the two
     functions when these conditions apply will result in simple
     numeric variables being set to 'factor' in some cases.  Extraneous
     '"AsIs"' classes are also removed.

     For S-Plus, a function 'exportDataStripped' is provided that
     allows exporting of data to other systems by removing the
     attributes 'label', 'imputed', 'format', 'units', and 'comment'.
     It calls 'exportData' after stripping these attributes;
     otherwise 'exportData' will fail.

     'csv.get' reads comma-separated text data files, optionally
     translating variable names to lower case after making them
     valid S names.  The original names, which may not be legal S
     names, are used as variable labels.  Character or factor
     variables containing dates can be converted to date variables.
     'cleanup.import' is invoked to finish the job.
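
     The steps that 'csv.get' is described as performing can be
     approximated in base R as follows.  This is a rough sketch, not
     the function's implementation: the file contents, the column
     names, and the date format '%d/%m/%Y' are invented for the
     example, and labels are stored as a plain '"label"' attribute
     rather than through the 'label' function.

```r
# Write a small CSV file with names that are not valid S names
tmp <- tempfile(fileext = ".csv")
writeLines(c("Patient ID,Visit Date,SBP",
             "1,03/01/2004,120",
             "2,17/02/2004,135"), tmp)

raw <- read.csv(tmp, check.names = FALSE, stringsAsFactors = FALSE)
orig_names <- names(raw)

# Make names valid, then lower case (the effect of lowernames=TRUE)
names(raw) <- tolower(make.names(orig_names))

# Convert the date column using an explicit input format
raw$visit.date <- as.Date(raw$visit.date, format = "%d/%m/%Y")

# Keep the original names as variable labels
for (i in seq_along(raw)) attr(raw[[i]], "label") <- orig_names[i]

unlink(tmp)
```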

_U_s_a_g_e:

     cleanup.import(obj, labels, lowernames=FALSE, 
                    force.single=TRUE, force.numeric=TRUE, rmnames=TRUE,
                    big=1e20, sasdict, pr, datevars=NULL, dateformat='

     upData(object, ..., 
            rename, drop, labels, units, levels,
            force.single=TRUE, lowernames=FALSE, moveUnits=FALSE)

     exportDataStripped(data, ...)

     csv.get(file, lowernames=FALSE, datevars=NULL, dateformat='%d%b%Y', ...)

_A_r_g_u_m_e_n_t_s:

     obj: a data frame or list

  object: a data frame or list

    data: a data frame

force.single: By default, double precision variables are converted to
          single precision (in S-Plus only) unless
          'force.single=FALSE'. 'force.single=TRUE' will also convert
          vectors having only integer values to have a storage mode of
          integer, in R or S-Plus. 

force.numeric: Sometimes importing will cause a numeric variable to be
          changed to a factor vector.  By default, 'cleanup.import'
          will check each factor variable to see if the levels contain
          only numeric values and '""'.  In that case, the variable
          will be converted to numeric, with '""' converted to NA.  Set
          'force.numeric=FALSE' to prevent this behavior.  

 rmnames: set to 'FALSE' to not have 'cleanup.import' remove 'names' or
          '.Names' attributes from variables 

  labels: for 'cleanup.import', a character vector the same length as
          the number of variables in 'obj'.  These character values are
          taken to be variable labels, in the same order as the
          variables in 'obj'.  For 'upData', 'labels' is a named list
          or named vector with variables in no specific order. 

lowernames: set this to 'TRUE' to change variable names to lower case.
          'upData' does this before applying any other changes, so
          variable names given inside arguments to 'upData' need to be
          lower case if 'lowernames=TRUE'.  

     big: a value such that values larger than this in absolute value
          are set to missing by 'cleanup.import' 

 sasdict: the name of a data frame containing a raw imported SAS PROC
          CONTENTS CNTLOUT= dataset.  This is used to define variable
          names and to add attributes to the new data frame specifying
          the original SAS dataset name and label. 

      pr: set to 'TRUE' or 'FALSE' to force or prevent printing of the
          current variable number being processed.  By default, such
          messages are printed if the product of the number of
          variables and number of observations in 'obj' exceeds
          500,000. 

datevars: character vector of names (after 'lowernames' is applied) of
          variables to consider as a factor or character vector
          containing dates in a format matching 'dateformat'

dateformat: for 'cleanup.import' and 'csv.get', the input format of the
          date variables (see 'strptime')

     ...: for 'upData', one or more expressions of the form
          'variable=expression', to derive new variables or change old
          ones. For 'exportDataStripped', optional arguments that are
          passed to 'exportData'.  For 'csv.get', arguments to pass to
          'read.csv'. 

  rename: list or named vector specifying old and new names for
          variables.  Variables are renamed before any other operations
          are done.  For example, to rename variables 'age' and 'sex'
          to 'Age' and 'gender', respectively, specify
          'rename=list(age="Age", sex="gender")' or
          'rename=c(age=...)'.  

    drop: a vector of variable names to remove from the data frame 

   units: a named vector or list defining '"units"' attributes of
          variables, in no specific order 

  levels: a named list defining '"levels"' attributes for factor
          variables, in no specific order.  The values in this list may
          be character vectors redefining 'levels' (in order) or
          another list (see 'merge.levels' if using S-Plus). 

moveUnits: set to 'TRUE' to look for units of measurements in variable
          labels and move them to a '"units"' attribute.  If an
          expression in a label is enclosed in parentheses or brackets
          it is assumed to be units if 'moveUnits=TRUE'. 

    file: a file name to import
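
     The idea behind 'moveUnits' can be illustrated with a small
     base R helper.  'split_units' is a hypothetical name, and the
     sketch makes a simplifying assumption: it only recognizes a
     parenthesized or bracketed expression at the end of the label.
     The matching rule used by 'upData' may differ.

```r
# Hypothetical helper for illustration; it is not part of Hmisc.
split_units <- function(lab) {
  # Capture the text before a trailing "(...)" or "[...]" and the
  # text inside the delimiters
  pat <- "^(.*?)\\s*[(\\[]([^)\\]]*)[)\\]]\\s*$"
  m <- regmatches(lab, regexec(pat, lab, perl = TRUE))[[1]]
  if (length(m) == 3) list(label = m[2], units = m[3])
  else list(label = lab, units = NULL)
}

r1 <- split_units("Systolic BP (mmHg)")   # label and units separated
r2 <- split_units("Weight [kg]")
r3 <- split_units("Age")                  # no units found
```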

_V_a_l_u_e:

     a new data frame

_A_u_t_h_o_r(_s):

     Frank Harrell, Vanderbilt University

_S_e_e _A_l_s_o:

     'sas.get', 'data.frame', 'describe', 'label', 'read.csv',
     'strptime', 'POSIXct'

_E_x_a_m_p_l_e_s:

     ## Not run: 
     dat <- read.table('myfile.asc')
     dat <- cleanup.import(dat)
     ## End(Not run)
     dat <- data.frame(a=(1:3)/7, y=c('a','b1','b2'), z=1:3)
     dat2 <- upData(dat, x=x^2, x=x-5, m=x/10, 
                    rename=c(a='x'), drop='z',
                    labels=c(x='X', y='test'),
                    levels=list(y=list(a='a',b=c('b1','b2'))))
     dat2
     describe(dat2)
     dat <- dat2    # copy to original name and delete dat2 if OK
     rm(dat2)

     # If you import a SAS dataset created by PROC CONTENTS CNTLOUT=x.datadict,
     # the LABELs from this dataset can be added to the data.  Let's also
     # convert names to lower case for the main data file
     ## Not run: 
     mydata2 <- cleanup.import(mydata2, lowernames=TRUE, sasdict=datadict)
     ## End(Not run)

