Commit 52f6240

Merge pull request sux13#21 from leduke2000/master
minor changes to GETDATA
2 parents 3833e62 + 3bb0e2d commit 52f6240

1 file changed: +9 -9 lines changed

3_GETDATA/Getting and Cleaning Data Course Notes.Rmd

Lines changed: 9 additions & 9 deletions
@@ -120,7 +120,7 @@ $\pagebreak$
 * `xpathSApply(rootNode, "//price", xmlValue)` = get the values of all elements with tag "price"
 * **extract content by attributes**
 * `doc <- htmlTreeParse(url, useInternal = True)`
-* `scores <- xpathSApply(doc, "//li@class='score'", xmlvalue)` = look for li elements with `class = "score"` and return their value
+* `scores <- xpathSApply(doc, "//li[@class='score']", xmlvalue)` = look for li elements with `class = "score"` and return their value


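The corrected XPath predicate in this hunk can be checked with a small sketch, assuming the `XML` package is installed. The inline HTML string and its values are invented for illustration. Because R is case-sensitive, the sketch spells the logical as `TRUE` and the extractor as `xmlValue`, which differs from the `True` and `xmlvalue` still present in the committed lines.

```r
library(XML)

# Hypothetical inline page, so the example runs without network access
html <- '<html><body><ul>
  <li class="team">Home</li>
  <li class="score">24</li>
  <li class="team">Away</li>
  <li class="score">17</li>
</ul></body></html>'

# asText = TRUE parses the string itself instead of treating it as a URL or file
doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# [@class='score'] is a predicate: keep only li elements whose class attribute is "score"
scores <- xpathSApply(doc, "//li[@class='score']", xmlValue)
print(scores)
```

Without the square brackets, `//li@class='score'` is not a valid XPath expression, which is what the commit fixes.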
@@ -153,14 +153,14 @@ $\pagebreak$
 ## data.table
 * inherits from `data.frame` (external package) $\rightarrow$ all functions that accept `data.frame` work on `data.table`
 * can be much faster (written in C), ***much much faster*** at subsetting/grouping/updating
-* **syntax**: `dt <- data.table(x = rnorm(9), y = rep(c(a, b, c), each = 3), z = rnorm(9)`
+* **syntax**: `dt <- data.table(x = rnorm(9), y = rep(c("a","b","c"), each = 3), z = rnorm(9))`
 * `tables()` = returns all data tables in memory
 * shows name, nrow, MB, cols, key
 * some subset works like before = `dt[2, ], dt[dt$y=="a",]`
 * `dt[c(2, 3)]` = subset by rows, rows 2 and 3 in this case
 * **column subsetting** (modified for `data.table`)
 * argument after comma is called an ***expression*** (collection of statements enclosed in `{}`)
-* `dt[, list(means(x), sum(z)]` = returns mean of x column and sum of z column (no `""` needed to specify column names, x and z in example)
+* `dt[, list(mean(x), sum(z))]` = returns mean of x column and sum of z column (no `""` needed to specify column names, x and z in example)
 * `dt[, table(y)]` = get table of y value (perform any functions)
 * **add new columns**
 * `dt[, w:=z^2]`
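The corrected lines in this hunk can be exercised together in a minimal sketch, assuming the `data.table` package is installed. The seed and the names `stats` and `counts` are not in the notes and are only there to make the result easy to inspect.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only to make the random draws repeatable
dt <- data.table(x = rnorm(9), y = rep(c("a", "b", "c"), each = 3), z = rnorm(9))

# Row subsetting works as it does for a data.frame
dt[2, ]
dt[dt$y == "a", ]

# The argument after the comma is an expression evaluated inside the table,
# so columns are referenced by bare name
stats <- dt[, list(mean(x), sum(z))]   # one-row result with default names V1 and V2
counts <- dt[, table(y)]               # frequency table of y

# := adds the column by reference, without copying dt
dt[, w := z^2]
print(stats)
print(names(dt))
```

The uncorrected version fails for three separate reasons: `a`, `b`, `c` are unquoted and so looked up as objects, `means` is not a function, and both calls are missing a closing parenthesis.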
@@ -176,9 +176,9 @@ $\pagebreak$
 * **special variables**
 * `.N` = returns integer, length 1, containing the number (essentially count)
 * `dt <- data.table (x=sample(letters[1:3], 1E5, TRUE))` = generates data table
-* `dt[, .N by =x]` = creates a table to count observations by the value of x
+* `dt[, .N, by =x]` = creates a table to count observations by the value of x
 * **keys** (quickly filter/subset)
-* *example*: `dt <- data.table(x = rep(c("a", "b", "c"), each 100), y = rnorm(300))` = generates data table
+* *example*: `dt <- data.table(x = rep(c("a", "b", "c"), each = 100), y = rnorm(300))` = generates data table
 * `setkey(dt, x)` = set the key to the x column
 * `dt['a']` = returns a data frame, where x = 'a' (effectively filter)
 * **joins** (merging tables)
@@ -187,9 +187,9 @@ $\pagebreak$
 * `setkey(dt1, x); setkey(dt2, x)` = sets the keys for both data tables to be column x
 * `merge(dt1, dt2)` = returns a table, combine the two tables using column x, filtering to only the values that match up between common elements the two x columns (i.e. 'a') and the data is merged together
 * **fast reading of files**
-* *example*: `big_df <- data.frame(norm(1e6), norm(1e6))` = generates data table
+* *example*: `big_df <- data.frame(rnorm(1e6), rnorm(1e6))` = generates data table
 * `file <- tempfile()` = generates empty temp file
-* `write.table(big.df, file=file, row.names=FALSE, col.names = TRUE, sep = "\t". quote = FALSE)` = writes the generated data from big.df to the empty temp file
+* `write.table(big_df, file=file, row.names=FALSE, col.names = TRUE, sep = "\t", quote = FALSE)` = writes the generated data from big.df to the empty temp file
 * `fread(file)` = read file and load data = much faster than `read.table()`


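The grouping, key, join, and file-reading lines touched by the two hunks above can be run end to end in a sketch like the following, assuming `data.table` is installed. The toy tables `dt1` and `dt2`, the column names given to `big_df`, and the result names (`counts`, `only_a`, `joined`, `big_dt`) are illustrative choices, not part of the notes.

```r
library(data.table)

# Count rows per group with the special symbol .N (note the comma before by)
dt <- data.table(x = sample(letters[1:3], 1e5, TRUE))
counts <- dt[, .N, by = x]

# Keyed subsetting: after setkey(), a character value in i is matched against the key
keyed <- data.table(x = rep(c("a", "b", "c"), each = 100), y = rnorm(300))
setkey(keyed, x)
only_a <- keyed["a"]

# Keyed join on hypothetical toy tables: only key values present in both survive
dt1 <- data.table(x = c("a", "a", "b", "dt1"), y = 1:4)
dt2 <- data.table(x = c("a", "b", "dt2"), z = 5:7)
setkey(dt1, x); setkey(dt2, x)
joined <- merge(dt1, dt2)

# Write a large tab-separated file, then read it back with fread()
big_df <- data.frame(x = rnorm(1e6), y = rnorm(1e6))  # column names added for readability
file <- tempfile()
write.table(big_df, file = file, row.names = FALSE, col.names = TRUE,
            sep = "\t", quote = FALSE)
big_dt <- fread(file)
unlink(file)  # remove the temp file once it has been read
```

Wrapping the `fread(file)` and an equivalent `read.table(file, header = TRUE, sep = "\t")` call in `system.time()` is one way to compare the two readers on a given machine.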
@@ -202,7 +202,7 @@ $\pagebreak$
 * free/widely used open sources database software, widely used for Internet base applications
 * each row = record
 * data are structured in databases $\rightarrow$ series tables (dataset) $\rightarrow$ fields (columns in dataset)
-* `dbConnect(MySQL(), user = "genome", db = "hg19", host = "genome-mysql.cse.ucsc.edu)` = open a connection to the database
+* `dbConnect(MySQL(), user = "genome", db = "hg19", host = "genome-mysql.cse.ucsc.edu")` = open a connection to the database
 * `db = "hg19"` = select specific database
 * `MySQL()` can be replaced with other arguments to use other data structures
 * `dbGetQuery(db, "show databases;")` = return the result from the specified SQL query executed through the connection
@@ -473,7 +473,7 @@ $\pagebreak$
 ## Subsetting and Sorting
 * **subsetting**
 * `x <- data.frame("var1" = sample(1:5), "var2" = sample(6:10), "var3" = (11:15))` = initiates a data frame with three names columns
-* `x <- x[sample(1:5)` = this scrambles the rows
+* `x <- x[sample(1:5),]` = this scrambles the rows
 * `x$var2[c(2,3)] = NA` = setting the 2nd and 3rd element of the second column to NA
 * `x[1:2, "var2"]` = subsetting the first two row of the the second column
 * `x[(x$var1 <= 3 | x$var3 > 15), ]` = return all rows of x where the first column is less than or equal to three or where the third column is bigger than 15

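The subsetting lines in the last hunk need only base R, so they can be checked with a short sketch. The seed and the names `first_two` and `low_var1` are assumptions added for illustration.

```r
set.seed(13333)  # arbitrary seed for repeatable shuffles
x <- data.frame("var1" = sample(1:5), "var2" = sample(6:10), "var3" = (11:15))

# The trailing comma matters: x[i, ] selects rows, whereas x[i] selects columns
x <- x[sample(1:5), ]

# Set the 2nd and 3rd elements of var2 to NA
x$var2[c(2, 3)] <- NA

# First two rows of var2, returned as a plain vector
first_two <- x[1:2, "var2"]

# var3 never exceeds 15 in this data, so only the var1 <= 3 condition matches
low_var1 <- x[(x$var1 <= 3 | x$var3 > 15), ]
print(x)
print(low_var1)
```

With the comma fixed, the row shuffle keeps all three columns; a single-index form such as `x[sample(1:5)]` would instead try to select five columns from a three-column data frame.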