|  | 
|  | 1 | +--- | 
|  | 2 | +title: "Data Scientist’s Toolbox Course Notes" | 
|  | 3 | +author: "Xing Su" | 
|  | 4 | +output: pdf_document | 
|  | 5 | +--- | 
|  | 6 | + | 
|  | 7 | +## CLI (Command Line Interface) | 
|  | 8 | + | 
|  | 9 | +* `/` = root directory | 
|  | 10 | +* `~` = home directory | 
|  | 11 | +* `pwd` = print working directory (current directory) | 
|  | 12 | +* `clear` = clear screen | 
|  | 13 | +* `ls` = list directory contents | 
|  | 14 | +    *  `-a` = show all files, including hidden ones | 
|  | 15 | +    *  `-l` = show details (long format) | 
|  | 16 | +* `cd` = change directory | 
|  | 17 | +* `mkdir` = make directory | 
|  | 18 | +* `touch` = creates an empty file | 
|  | 19 | +* `cp` = copy | 
|  | 20 | +    * `cp <file> <directory>` = copy a file to a directory | 
|  | 21 | +    * `cp -r <directory> <newDirectory>` = copy a directory and all of its contents to newDirectory | 
|  | 22 | +            * `-r` = recursive | 
|  | 23 | +* `rm` = remove | 
|  | 24 | +    * `-r` = remove entire directories (no undo) | 
|  | 25 | +* `mv` = move | 
|  | 26 | +    * `mv <file> <directory>` = move file to directory | 
|  | 27 | +    * `mv <fileName> <newName>` = rename file | 
|  | 28 | +* `echo` = print the arguments you give it, including expanded variables | 
|  | 29 | +* `date` = print current date  | 
|  | 30 | + | 
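The commands above can be tried together in one short session. This is a minimal sketch, assuming a POSIX-like shell with `mktemp` available; the file and directory names (`notes`, `todo.txt`, `backup.txt`, `old.txt`, `notes_copy`) are made up for illustration, and everything happens inside a throwaway directory.

```shell
# Work inside a temporary directory so nothing outside it is touched
workdir="$(mktemp -d)"
cd "$workdir"

mkdir notes                         # make a directory
touch notes/todo.txt                # create an empty file
cp notes/todo.txt backup.txt        # copy a file
cp -r notes notes_copy              # copy a directory and its contents
mv backup.txt notes/                # move a file into a directory
mv notes/backup.txt notes/old.txt   # rename a file
ls -a -l notes                      # list all files, with details
rm -r notes_copy                    # remove a directory (no undo)
echo "Working in: $(pwd)"           # print arguments and command output
date                                # print the current date and time
```

At the end, `notes` holds `todo.txt` and `old.txt`, and `notes_copy` no longer exists.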
|  | 31 | + | 
|  | 32 | + | 
|  | 33 | +## GitHub | 
|  | 34 | + | 
|  | 35 | +* **Workflow**  | 
|  | 36 | +    1. make edits in workspace | 
|  | 37 | +    2. update index/add files | 
|  | 38 | +    3. commit to local repo  | 
|  | 39 | +    4. push to remote repository | 
|  | 40 | +* `git add .` = add all new files to be tracked | 
|  | 41 | +* `git add -u` = updates tracking for files that are renamed or deleted | 
|  | 42 | +* `git add -A` = both of the above | 
|  | 43 | +    * ***Note**: `add` is performed before committing* | 
|  | 44 | +* `git commit -m "message"` = commit the changes you want to be saved to the local copy | 
|  | 45 | +* `git checkout -b branchname` = create a new branch and switch to it | 
|  | 46 | +* `git branch` = tells you what branch you are on | 
|  | 47 | +* `git checkout master` = move back to the master branch | 
|  | 48 | +* `git pull` = fetch changes from the remote and merge them into your current branch (not the same as a GitHub pull request, which asks the owner of a repo to merge your changes) | 
|  | 49 | +* `git push` = send local commits to the remote (GitHub) | 
|  | 50 | + | 
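The workflow above can be walked through in a scratch repository. This is a minimal sketch, assuming `git` is installed; the file name, commit message, branch name `experiment`, and the user name and email are placeholders. It uses `git checkout -` (switch to the previous branch) instead of `git checkout master`, because the default branch may be named `master` or `main` depending on the Git configuration.

```shell
# Create a throwaway repository
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.email "you@example.com"   # placeholder identity
git config user.name "Your Name"

echo "first draft" > notes.md             # 1. make edits in the workspace
git add -A                                # 2. update the index
git commit -q -m "Add notes"              # 3. commit to the local repo

git checkout -q -b experiment             # create a new branch and switch to it
git branch                                # current branch is marked with *
git checkout -q -                         # go back to the previous branch

# 4. git push   (needs a configured remote, so it is left commented out)
```

`git pull` is also left out because, like `git push`, it only makes sense once a remote repository has been set up.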
|  | 51 | + | 
|  | 52 | + | 
|  | 53 | +## Markdown | 
|  | 54 | + | 
|  | 55 | +* `##` = signifies secondary heading (bold big font) | 
|  | 56 | +* `###` = signifies tertiary heading (slightly smaller font than secondary, not bold) | 
|  | 57 | +* `*` = bullet list item | 
|  | 58 | + | 
|  | 59 | + | 
|  | 60 | + | 
|  | 61 | +## R Packages | 
|  | 62 | + | 
|  | 63 | +* Primary location for R packages --> CRAN | 
|  | 64 | +* `available.packages()` = all packages available | 
|  | 65 | +* `head(rownames(a), 3)` = returns the first three package names, where `a <- available.packages()` | 
|  | 66 | +* `install.packages("nameOfPackage")` = install single package | 
|  | 67 | +* `install.packages(c("nameOfPackage", "nameOfPackage", "nameOfPackage"))` = install multiple packages | 
|  | 68 | +* Bioconductor Packages: | 
|  | 69 | +    *  `source("https://bioconductor.org/biocLite.R")` | 
|  | 70 | +    *  `biocLite()` = install Bioconductor packages (newer Bioconductor releases use `BiocManager::install()` instead) | 
|  | 71 | +* `library(packagename)` = load package | 
|  | 72 | +* `search()` = list the attached packages and environments on the search path (use `ls("package:packagename")` to see the functions in a loaded package) | 
|  | 73 | + | 
|  | 74 | + | 
|  | 75 | + | 
|  | 76 | +## Types of Data Science Questions | 
|  | 77 | + | 
|  | 78 | +* in order of difficulty: ***Descriptive*** --> ***Exploratory*** --> ***Inferential*** --> ***Predictive*** --> ***Causal*** --> ***Mechanistic*** | 
|  | 79 | +* **Descriptive analysis** = describe set of data, interpret what you see (census, Google Ngram) | 
|  | 80 | +* **Exploratory analysis** = discovering connections (correlation does not = causation) | 
|  | 81 | +* **Inferential analysis** = use conclusions from a small sample of data to say something about the broader population | 
|  | 82 | +* **Predictive analysis** = use data on one object to predict values for another (if X predicts Y, that does not mean X causes Y) | 
|  | 83 | +* **Causal analysis** = how does changing one variable affect another, using randomized studies; strong assumptions; gold standard for statistical analysis | 
|  | 84 | +* **Mechanistic analysis** = understand the exact changes in variables that lead to changes in other variables, modeled by empirical equations (engineering/physics) | 
|  | 85 | + | 
|  | 86 | + | 
|  | 87 | + | 
|  | 88 | +## Data | 
|  | 89 | +* **Data** = values of qualitative or quantitative variables, belonging to a set of items (usually population) | 
|  | 90 | +* **Variables** = measurement/characteristic of an item (qualitative vs quantitative) | 
|  | 91 | +* **Data** = not always structured, usually raw file, different formats | 
|  | 92 | +* The most important thing is the question; the data comes second | 
|  | 93 | +* **Big data** = now possible to collect data cheaply, but not necessarily all useful (need the right data) | 
|  | 94 | + | 
|  | 95 | +## Experimental Design | 
|  | 96 | +* Formulate your question in advance | 
|  | 97 | +* **Statistical inference** = select subset, run experiment, calculate descriptive statistics, use inferential statistics to determine if results can be applied broadly | 
|  | 98 | +* ***[Inference]*** **Variability** = lower variability + clearer differences = decision | 
|  | 99 | +* ***[Inference]*** **Confounding** = underlying variable might be causing the correlation (sometimes called Spurious correlation) | 
|  | 100 | +    * dealing with confounding: fix variables, stratify (all options), randomize | 
|  | 101 | +* ***[Prediction]*** collect observations for different variable values, build predictive functions | 
|  | 102 | +    *  similar problems of probability/sampling and confounding variables | 
|  | 103 | +* ***[Prediction]*** Difficult to tell which of several distributions an observation comes from (size of effects is important) | 
|  | 104 | +* ***[Prediction]*** Positive/negative statuses: True positive, false positive, false negative, true negative | 
|  | 105 | +    * **Sensitivity** = Pr(positive test | disease) | 
|  | 106 | +    * **Specificity** = Pr(negative test | no disease) | 
|  | 107 | +    * **Positive Predictive Value** = Pr(disease | positive test) | 
|  | 108 | +    * **Negative Predictive Value** = Pr(no disease | negative test) | 
|  | 109 | +    * **Accuracy** = Pr(correct outcome) | 
|  | 110 | +* **Data dredging** = use data to fit hypothesis  | 
|  | 111 | +* **Good experiments** = have replication, measure variability, generalize problem, transparent | 
|  | 112 | +* Prediction is not inference, and beware of data dredging | 
|  | 113 | + | 
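The prediction measures listed above can also be written in terms of the four outcome counts. As a sketch, let $TP$, $FP$, $FN$, and $TN$ be the numbers of true positives, false positives, false negatives, and true negatives. These symbols are notation introduced here for illustration and do not appear in the original notes.

$$
\begin{aligned}
\text{Sensitivity} &= \Pr(\text{positive test} \mid \text{disease}) = \frac{TP}{TP + FN} \\
\text{Specificity} &= \Pr(\text{negative test} \mid \text{no disease}) = \frac{TN}{TN + FP} \\
\text{Positive Predictive Value} &= \Pr(\text{disease} \mid \text{positive test}) = \frac{TP}{TP + FP} \\
\text{Negative Predictive Value} &= \Pr(\text{no disease} \mid \text{negative test}) = \frac{TN}{TN + FN} \\
\text{Accuracy} &= \Pr(\text{correct outcome}) = \frac{TP + TN}{TP + FP + FN + TN}
\end{aligned}
$$

Sensitivity and specificity condition on the true disease status, while the two predictive values condition on the test result, so the denominators differ.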