Commit ff4dbf6

Added all notes
1 parent 373279b commit ff4dbf6

87 files changed

+31700
-0
lines changed

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
---
title: "Data Scientist’s Toolbox Course Notes"
author: "Xing Su"
output: pdf_document
---

## CLI (Command Line Interface)

* `/` = root directory
* `~` = home directory
* `pwd` = print working directory (current directory)
* `clear` = clear screen
* `ls` = list directory contents
    * `-a` = show all entries, including hidden ones
    * `-l` = long format with details
* `cd` = change directory
* `mkdir` = make directory
* `touch` = create an empty file (or update the timestamp of an existing one)
* `cp` = copy
    * `cp <file> <directory>` = copy a file into a directory
    * `cp -r <directory> <newDirectory>` = copy a directory and everything in it to a new directory
        * `-r` = recursive
* `rm` = remove
    * `-r` = remove entire directories (no undo)
* `mv` = move
    * `mv <file> <directory>` = move file to directory
    * `mv <fileName> <newName>` = rename file
* `echo` = print the arguments you give it (including expanded variables)
* `date` = print current date and time
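
The commands above can be tried out safely in a scratch directory. The session below is only a sketch: the directory and file names (`notes`, `backup`, `todo.txt`, `done.txt`) are made up for illustration, and `mktemp -d` is assumed to be available to create the throwaway directory.

```shell
# Work inside a throwaway directory so nothing important is touched.
sandbox=$(mktemp -d)
cd "$sandbox"
pwd                                  # print the working directory

mkdir notes backup                   # make two directories
touch notes/todo.txt                 # create an empty file
echo "read chapter 1" > notes/todo.txt

cp notes/todo.txt backup             # copy a file into a directory
cp -r notes notes_copy               # copy a directory recursively
mv backup/todo.txt backup/done.txt   # rename a file
mv backup/done.txt notes             # move a file into a directory

ls -a -l notes                       # list everything in notes, with details
rm -r notes_copy                     # remove a directory and its contents (no undo)
```

At the end, `notes` holds `todo.txt` and `done.txt`, `backup` is empty, and `notes_copy` is gone.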

## GitHub

* **Workflow**
    1. make edits in workspace
    2. update index/add files
    3. commit to local repo
    4. push to remote repository
* `git add .` = stage new and modified files under the current directory
* `git add -u` = stage changes to files that are already tracked (modifications and deletions), without adding new files
* `git add -A` = stage all changes (new, modified, and deleted files)
* ***Note**: `add` is performed before committing*
* `git commit -m "message"` = commit the staged changes to the local repository
* `git checkout -b branchname` = create a new branch and switch to it
* `git branch` = list local branches and mark the one you are on
* `git checkout master` = move back to the master branch
* `git pull` = fetch changes from the remote and merge them into your current branch
    * a *pull request* is a separate GitHub feature: it asks the owner of a repo to merge your changes into their branch/repo
* `git push` = send local commits to the remote (GitHub)
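
The workflow above can be rehearsed end to end without a GitHub account. This is a sketch under assumptions, not a prescribed setup: a local bare repository (`remote.git`) stands in for GitHub, and the file name, branch name `draft`, and user identity are all hypothetical.

```shell
sandbox=$(mktemp -d)
git init -q --bare "$sandbox/remote.git"      # stand-in for the GitHub remote
git init -q "$sandbox/work"
cd "$sandbox/work"
git symbolic-ref HEAD refs/heads/master       # name the first branch master, as in the notes
git config user.name "Example User"           # hypothetical identity, local to this repo
git config user.email "user@example.com"
git remote add origin "$sandbox/remote.git"

echo "first note" > notes.md                  # 1. make edits in the workspace
git add -A                                    # 2. update the index
git commit -q -m "Add notes"                  # 3. commit to the local repo
git push -q origin master                     # 4. push to the remote

git checkout -q -b draft                      # create a branch and switch to it
echo "second note" >> notes.md
git add -u                                    # stage the change to the tracked file
git commit -q -m "Extend notes"
git branch                                    # lists draft and master, current one marked
git checkout -q master                        # move back to master
```

After this runs, `master` has one commit (also present on the stand-in remote) and `draft` has two.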

## Markdown

* `##` = signifies secondary heading (bold big font)
* `###` = signifies tertiary heading (slightly smaller font than secondary, not bold)
* `*` = bullet list item

## R Packages

* Primary location for R packages --> CRAN
* `a <- available.packages()` = matrix of all packages available (one row per package)
    * `head(rownames(a), 3)` = returns the first three package names in `a`
* `install.packages("nameOfPackage")` = install a single package
* `install.packages(c("nameOfPackage", "nameOfPackage", "nameOfPackage"))` = install multiple packages
* Bioconductor Packages:
    * `source("https://bioconductor.org/biocLite.R")`
    * `biocLite()` = install Bioconductor packages (newer Bioconductor releases replace this with `BiocManager::install()`)
* `library(packagename)` = load package
* `search()` = list the packages currently attached to the search path
    * `ls("package:packagename")` = see all functions in a package after loading

## Types of Data Science Questions

* in order of difficulty: ***Descriptive*** --> ***Exploratory*** --> ***Inferential*** --> ***Predictive*** --> ***Causal*** --> ***Mechanistic***
* **Descriptive analysis** = describe a set of data and interpret what you see (census, Google Ngram)
* **Exploratory analysis** = discover connections between variables (correlation does not = causation)
* **Inferential analysis** = use a relatively small sample of data to draw conclusions about the broader population
* **Predictive analysis** = use data on one object to predict values for another (if X predicts Y, it does not follow that X causes Y)
* **Causal analysis** = how changing one variable affects another; relies on randomized studies and strong assumptions; the gold standard for statistical analysis
* **Mechanistic analysis** = understand the exact changes in variables that lead to changes in other variables, modeled by empirical equations (engineering/physics)

## Data
* **Data** = values of qualitative or quantitative variables, belonging to a set of items (usually a population)
* **Variables** = measurement/characteristic of an item (qualitative vs quantitative)
* **Data** = not always structured; usually arrives as raw files in different formats
* The question is the most important thing; the data comes second
* **Big data** = it is now cheap to collect data, but not all of it is useful (you need the right data)

## Experimental Design
* Formulate your question in advance
* **Statistical inference** = select a subset, run the experiment, calculate descriptive statistics, then use inferential statistics to determine whether the results can be applied broadly
* ***[Inference]*** **Variability** = lower variability + clearer differences = easier decision
* ***[Inference]*** **Confounding** = an underlying variable might be causing the correlation (sometimes called spurious correlation)
    * dealing with confounding: fix variables, stratify (all options), randomize
* ***[Prediction]*** collect observations for different variable values, build predictive functions
    * similar problems of probability/sampling and confounding variables
* ***[Prediction]*** Difficult to tell which of several distributions an observation comes from (size of effects important)
* ***[Prediction]*** Positive/negative statuses: true positive, false positive, false negative, true negative
    * **Sensitivity** = Pr(positive test | disease)
    * **Specificity** = Pr(negative test | no disease)
    * **Positive Predictive Value** = Pr(disease | positive test)
    * **Negative Predictive Value** = Pr(no disease | negative test)
    * **Accuracy** = Pr(correct outcome)
* **Data dredging** = searching the data until something fits the hypothesis
* **Good experiments** = have replication, measure variability, generalize to the problem of interest, transparent
* Prediction is not inference, and beware of data dredging
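
The five prediction measures listed above all come from the same four counts, so a small worked example may help. The function below is a sketch: the name `confusion_metrics` and the counts passed to it are invented for illustration, and it assumes `awk` is available and that none of the denominators is zero.

```shell
# Print the five measures for a 2x2 table of test result vs. true status.
# Arguments: true positives, false positives, false negatives, true negatives.
confusion_metrics() {
  awk -v tpos="$1" -v fpos="$2" -v fneg="$3" -v tneg="$4" 'BEGIN {
    total = tpos + fpos + fneg + tneg
    printf "sensitivity %.4f\n", tpos / (tpos + fneg)   # Pr(positive test | disease)
    printf "specificity %.4f\n", tneg / (tneg + fpos)   # Pr(negative test | no disease)
    printf "ppv %.4f\n",         tpos / (tpos + fpos)   # Pr(disease | positive test)
    printf "npv %.4f\n",         tneg / (tneg + fneg)   # Pr(no disease | negative test)
    printf "accuracy %.4f\n",    (tpos + tneg) / total  # Pr(correct outcome)
  }'
}

# Hypothetical counts: 100 people with the disease, 900 without.
confusion_metrics 90 50 10 850
```

With these made-up counts the test catches 90% of cases (sensitivity 0.9000) and is 94% accurate overall, yet only about 64% of positive results are true cases (ppv 0.6429), because the disease is rare in this sample.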

1_DATASCITOOLBOX/Data_Scientists_Toolbox_Course_Notes.html

Lines changed: 226 additions & 0 deletions
Large diffs are not rendered by default.
191 KB
Binary file not shown.
