| 
 | 1 | +---  | 
 | 2 | +title: "Data Scientist’s Toolbox Course Notes"  | 
 | 3 | +author: "Xing Su"  | 
 | 4 | +output: pdf_document  | 
 | 5 | +---  | 
 | 6 | + | 
 | 7 | +## CLI (Command Line Interface)  | 
 | 8 | + | 
 | 9 | +* `/` = root directory  | 
 | 10 | +* `~` = home directory  | 
 | 11 | +* `pwd` = print working directory (current directory)  | 
 | 12 | +* `clear` = clear screen  | 
 | 13 | +* `ls` = list stuff  | 
 | 14 | +    *  `-a` = see all (hidden)  | 
 | 15 | +    *  `-l` = details  | 
 | 16 | +* `cd` = change directory  | 
 | 17 | +* `mkdir` = make directory  | 
 | 18 | +* `touch` = creates an empty file  | 
 | 19 | +* `cp` = copy  | 
 | 20 | +    * `cp <file> <directory>` = copy a file to a directory  | 
 | 21 | +    * `cp -r <directory> <newDirectory>` = copy a directory and all of its contents to a new directory  | 
 | 22 | +        * `-r` = recursive  | 
 | 23 | +* `rm` = remove  | 
 | 24 | +    * `-r` = remove entire directories (no undo)  | 
 | 25 | +* `mv` = move  | 
 | 26 | +    * `mv <file> <directory>` = move file to directory  | 
 | 27 | +    * `mv <fileName> <newName>` = rename file  | 
 | 28 | +* `echo` = print arguments you give/variables  | 
 | 29 | +* `date` = print current date   | 
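
The commands above can be combined into a short session. This is a minimal sketch that works inside a temporary directory; every file and directory name in it (`demo`, `notes.txt`, `backup`, `renamed.txt`) is hypothetical and only there for illustration.

```shell
# Work inside a throwaway directory so nothing in the home directory is touched.
# All names below (demo, notes.txt, backup, renamed.txt) are hypothetical.
workdir=$(mktemp -d)
cd "$workdir"
pwd                          # print the working directory

mkdir demo                   # make a directory
touch demo/notes.txt         # create an empty file
ls -al demo                  # list all entries (including hidden ones) with details

cp demo/notes.txt .          # copy a file into the current directory
cp -r demo backup            # copy a directory and its contents recursively
mv notes.txt renamed.txt     # rename a file
mv renamed.txt backup        # move a file into a directory

rm -r demo                   # remove a directory and its contents (no undo)
echo "Contents of backup:"   # print the arguments given
ls backup
date                         # print the current date
```

Note that `mv` covers both renaming and moving: the result depends on whether the second argument is an existing directory.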
 | 30 | + | 
 | 31 | + | 
 | 32 | + | 
 | 33 | +## GitHub  | 
 | 34 | + | 
 | 35 | +* **Workflow**   | 
 | 36 | +    1. make edits in workspace  | 
 | 37 | +    2. update index/add files  | 
 | 38 | +    3. commit to local repo   | 
 | 39 | +    4. push to remote repository  | 
 | 40 | +* `git add .` = add all new files to be tracked  | 
 | 41 | +* `git add -u` = updates tracking for files that are renamed or deleted  | 
 | 42 | +* `git add -A` = both of the above  | 
 | 43 | +    * ***Note**: `add` is performed before committing*  | 
 | 44 | +* `git commit -m "message"` = commit the changes you want to be saved to the local copy  | 
 | 45 | +* `git checkout -b branchname` = create a new branch and switch to it  | 
 | 46 | +* `git branch` = tells you what branch you are on  | 
 | 47 | +* `git checkout master` = move back to the master branch  | 
 | 48 | +* `git pull` = fetch changes from the remote and merge them into your current branch (not the same as a pull request, which asks the owner of a repo to merge your changes)  | 
 | 49 | +* `git push` = send local commits to the remote (GitHub)  | 
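
The four workflow steps can be tried end to end without a GitHub account. The sketch below assumes a local bare repository standing in for the remote; the paths, the file `notes.md`, the branch name `experiment`, and the user identity are all hypothetical.

```shell
# Set up a throwaway local repository plus a bare repository acting as the remote.
# Paths, file names, and the identity below are hypothetical.
workdir=$(mktemp -d)
git init --quiet --bare "$workdir/remote.git"
git init --quiet "$workdir/project"
cd "$workdir/project"
git config user.name "Example User"
git config user.email "user@example.com"
git remote add origin "$workdir/remote.git"

# 1. make edits in the workspace
echo "first draft" > notes.md

# 2. update the index / add files
git add -A

# 3. commit to the local repository
git commit --quiet -m "Add notes"

# 4. push to the remote repository
# The default branch may be called master or main, so look its name up.
branch=$(git rev-parse --abbrev-ref HEAD)
git push --quiet origin "$branch"

# Create a new branch, check which branch is active, then switch back.
git checkout --quiet -b experiment
git branch
git checkout --quiet "$branch"
```

With a real GitHub repository, `origin` would point at the repository's URL instead of a local path; the rest of the sequence stays the same.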
 | 50 | + | 
 | 51 | + | 
 | 52 | + | 
 | 53 | +## Markdown  | 
 | 54 | + | 
 | 55 | +* `##` = signifies secondary heading (bold big font)  | 
 | 56 | +* `###` = signifies tertiary heading (slightly smaller font than secondary, not bold)  | 
 | 57 | +* `*` = bullet list item  | 
 | 58 | + | 
 | 59 | + | 
 | 60 | + | 
 | 61 | +## R Packages  | 
 | 62 | + | 
 | 63 | +* Primary location for R packages --> CRAN  | 
 | 64 | +* `available.packages()` = all packages available  | 
 | 65 | +* `head(rownames(a),3)` = returns the first three package names, where `a <- available.packages()`  | 
 | 66 | +* `install.packages("nameOfPackage")` = install single package  | 
 | 67 | +* `install.packages(c("nameOfPackage", "nameOfPackage", "nameOfPackage"))` = install multiple packages  | 
 | 68 | +* Bioconductor Packages:  | 
 | 69 | +    *  `source("https://bioconductor.org/biocLite.R")`  | 
 | 70 | +    *  `biocLite()` = install Bioconductor packages (newer Bioconductor releases use `BiocManager::install()` instead)  | 
 | 71 | +* `library(packagename)` = load package  | 
 | 72 | +* `search()` = list the packages currently attached (the search path); use `ls("package:packagename")` to see the functions in a package after loading  | 
 | 73 | + | 
 | 74 | + | 
 | 75 | + | 
 | 76 | +## Types of Data Science Questions  | 
 | 77 | + | 
 | 78 | +* in order of difficulty: ***Descriptive*** --> ***Exploratory*** --> ***Inferential*** --> ***Predictive*** --> ***Causal*** --> ***Mechanistic***  | 
 | 79 | +* **Descriptive analysis** = describe set of data, interpret what you see (census, Google Ngram)  | 
 | 80 | +* **Exploratory analysis** = discovering connections (correlation does not = causation)  | 
 | 81 | +* **Inferential analysis** = use a relatively small sample of data to draw conclusions about a broader population  | 
 | 82 | +* **Predictive analysis** = use data on one object to predict values for another (if X predicts Y, it does not mean X causes Y)  | 
 | 83 | +* **Causal analysis** = how does changing one variable affect another, using randomized studies, strong assumptions, gold standard for statistical analysis  | 
 | 84 | +* **Mechanistic analysis** = understand the exact changes in variables that lead to changes in other variables, modeled by empirical equations (engineering/physics)  | 
 | 85 | + | 
 | 86 | + | 
 | 87 | + | 
 | 88 | +## Data  | 
 | 89 | +* **Data** = values of qualitative or quantitative variables, belonging to a set of items (usually population)  | 
 | 90 | +* **Variables** = measurement/characteristic of an item (qualitative vs quantitative)  | 
 | 91 | +* **Data** = not always structured, usually raw file, different formats  | 
 | 92 | +* The question is the most important thing; the data comes second  | 
 | 93 | +* **Big data** = now possible to collect data cheap, but not necessarily all useful (need the right data)  | 
 | 94 | + | 
 | 95 | +## Experimental Design  | 
 | 96 | +* Formulate your question in advance  | 
 | 97 | +* **Statistical inference** = select subset, run experiment, calculate descriptive statistics, use inferential statistics to determine if results can be applied broadly  | 
 | 98 | +* ***[Inference]*** **Variability** = lower variability + clearer differences = decision  | 
 | 99 | +* ***[Inference]*** **Confounding** = an underlying variable might be causing the correlation (sometimes called spurious correlation)  | 
 | 100 | +    * dealing with confounding: fix variables, stratify (all options), randomize  | 
 | 101 | +* ***[Prediction]*** collect observations for different variable values, build predictive functions  | 
 | 102 | +    *  similar problems of probability/sampling and confounding variables  | 
 | 103 | +* ***[Prediction]*** Difficult to tell which of the different distributions an observation comes from (size of effects is important)  | 
 | 104 | +* ***[Prediction]*** Positive/negative statuses: True positive, false positive, false negative, true negative  | 
 | 105 | +    * **Sensitivity** = Pr(positive test | disease)  | 
 | 106 | +    * **Specificity** = Pr(negative test | no disease)  | 
 | 107 | +    * **Positive Predictive Value** = Pr(disease | positive test)  | 
 | 108 | +    * **Negative Predictive Value** = Pr(no disease | negative test)  | 
 | 109 | +    * **Accuracy** = Pr(correct outcome)  | 
 | 110 | +* **Data dredging** = use data to fit hypothesis   | 
 | 111 | +* **Good experiments** = have replication, measure variability, generalize problem, transparent  | 
 | 112 | +* Prediction is not inference, and beware of data dredging  | 
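
The conditional probabilities listed under the positive/negative statuses can be estimated from the four counts. As a sketch, write $TP$, $FP$, $FN$ and $TN$ for the numbers of true positives, false positives, false negatives and true negatives (this shorthand is an assumption added here, not notation from the notes):

$$
\begin{aligned}
\text{Sensitivity} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} \\
\text{Positive Predictive Value} &= \frac{TP}{TP + FP} \\
\text{Negative Predictive Value} &= \frac{TN}{TN + FN} \\
\text{Accuracy} &= \frac{TP + TN}{TP + FP + FN + TN}
\end{aligned}
$$

Sensitivity and specificity condition on the true disease status, while the two predictive values condition on the test result, which is why the denominators differ.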
 | 113 | + | 