## cleaningdata_assignment1

Assignment for the Getting and Cleaning Data course.

Methodology:

The run_analysis function is the entry point for generating both data sets.

Data set 1

The first data set is generated as follows:

1. Generate a list of column names. The first column is named "subject" (sourced from the subject files) and the last column is named "class" (the classification of the observation, sourced from the y data files). In between are the X data columns (feature observation data, sourced from the X files), whose names are read from the features.txt file. As part of this process, the feature column names are scrubbed by removing "()" and converting every "-" to ".". A more thorough approach would have been to use the information in the feature_info file to create a hard-coded list of labels. The approach used here is more programmatic, as a time-saver, but the results are not optimal.
2. Create a data frame for the test data and a second data frame for the training data, then combine the two using rbind to obtain a single large data set. A single function, processDirectory, forms a data frame from the combination of the X, y, and subject files within a given directory (either test or train, in this case).
3. Determine which columns to retain. The columns identifying the subject and the activity for the observation (the "subject", "class", and "activity" columns) are retained. An observation column is removed from the data set if it contains neither the text "mean()" (case insensitive) nor the text "std". Note that this removes the meanFreq() variables, since they are not semantically the same as the mean() variables (see the feature descriptions for the data set).
4. Add an activity column to the end of the data frame. This is a translation of the "class" column into a readable activity name. The activity names are read from the activity_labels.txt file. It is assumed that the position of an activity name in that file corresponds to the number (index) of each y value in the testing and training sets.
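
The steps above can be sketched in base R on a tiny in-memory example. This is only an illustration under stated assumptions, not the code in run_analysis.R: the helper name `scrub_names`, the four sample feature names, and the toy values are made up for the sketch, and the real code reads its inputs from the data set's files through processDirectory.

```r
# A few feature names in the style of features.txt (illustrative subset).
features <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
              "tBodyAcc-meanFreq()-X", "tBodyAcc-max()-X")

# Step 1: scrub the names by dropping "()" and turning "-" into ".".
# scrub_names is a hypothetical helper, not a function from the repository.
scrub_names <- function(x) {
  gsub("-", ".", gsub("()", "", x, fixed = TRUE), fixed = TRUE)
}
col_names <- c("subject", scrub_names(features), "class")

# Step 2: one data frame per directory, stacked with rbind.
# These toy rows stand in for what processDirectory would build.
test  <- data.frame(1L, 0.1, 0.2, 0.3, 0.4, 2L)
train <- data.frame(2L, 0.5, 0.6, 0.7, 0.8, 1L)
names(test)  <- col_names
names(train) <- col_names
combined <- rbind(test, train)

# Step 3: keep the id columns plus features whose original name contains
# "mean()" (case insensitive) or "std". meanFreq() does not match "mean()".
keep_feature <- grepl("mean\\(\\)", features, ignore.case = TRUE) |
  grepl("std", features, fixed = TRUE)
selected <- combined[, c("subject", scrub_names(features)[keep_feature], "class")]

# Step 4: translate each class number to a name by position in the label list.
activity_labels <- c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
                     "SITTING", "STANDING", "LAYING")
selected$activity <- activity_labels[selected$class]
```

Matching against the original feature names rather than the scrubbed ones is a choice made for this sketch: once "()" has been stripped, "mean" and "meanFreq" can no longer be told apart by a simple substring test.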

Data set 2

The second data set is generated via the generate_means function, which is called separately. The input to this function should be the data set output by the run_analysis function.

The second data set is generated as follows:

1. Melt the data set on subject + activity, using all of the selected fields from features.txt as the measurement variables. This yields one row per measurement for each combination of subject + activity.
2. Apply the dcast operation to the melted data set to compute the mean of each variable, grouped by subject + activity.
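
A minimal sketch of this melt and dcast sequence, assuming the reshape2 package supplies both functions (the README does not name the package). The data frame `tidy` and its values are invented stand-ins for the first data set, so this is not the body of generate_means.

```r
library(reshape2)  # assumed source of melt() and dcast()

# Toy stand-in for data set 1: two subjects with repeated observations.
tidy <- data.frame(
  subject = c(1L, 1L, 2L, 2L),
  activity = c("WALKING", "WALKING", "SITTING", "SITTING"),
  tBodyAcc.mean.X = c(0.2, 0.4, 0.1, 0.3),
  tBodyAcc.std.X = c(1.0, 3.0, 2.0, 4.0)
)
measure_vars <- c("tBodyAcc.mean.X", "tBodyAcc.std.X")

# Long form: one row per subject + activity + measurement variable.
molten <- melt(tidy, id.vars = c("subject", "activity"),
               measure.vars = measure_vars)

# Wide form again, averaging the values within each subject + activity group.
means <- dcast(molten, subject + activity ~ variable, fun.aggregate = mean)
```

Listing the measurement variables explicitly keeps any other column, such as "class", out of the averages.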
