notebooks/WhyScala.md (24 additions, 15 deletions)
@@ -7,6 +7,7 @@
* Scala Days NYC, May 5th, 2016
* GOTO Chicago, May 24, 2016
* Strata + Hadoop World London, June 3, 2016
* Scala Days Berlin, June 16th, 2016
See also the [Spark Notebook](http://spark-notebook.io) version of this content, available at [github.com/data-fellas/scala-for-data-science](https://github.com/data-fellas/scala-for-data-science).
@@ -17,6 +18,8 @@ While Python and R are traditional languages of choice for Data Science, [Spark]
However, using one language for all work has advantages, such as simplifying the software development process: the build, test, and deployment steps, coding conventions, and so on.
If you want a thorough introduction to Scala, see [Dean's book](http://shop.oreilly.com/product/0636920033073.do).
So, what are the advantages, as well as the disadvantages, of Scala?
## 1. Functional Programming Plus Objects
@@ -37,7 +40,7 @@ Scala also implements some _functional_ features using _object-oriented inherita
* **R:** As a statistics language, R is more functional than object-oriented.
* **Java:** An object-oriented language, but with recently introduced functional constructs: _lambdas_ (anonymous functions) and collection operations that follow a more _functional_ style rather than an _imperative_ one (i.e., where mutating the collection is embraced).
There are a few differences between Java's and Scala's approaches to OOP and FP that are worth mentioning specifically:
### 1a. Traits vs. Interfaces
Scala's object model adds a _trait_ feature, which is a more powerful concept than Java 8 interfaces. Before Java 8, Java had no [mixin composition](https://en.wikipedia.org/wiki/Mixin) capability, even though composition is generally [preferred over inheritance](https://en.wikipedia.org/wiki/Composition_over_inheritance).
@@ -47,7 +50,7 @@ Imagine that you want to define reusable logging code and mix it into other clas
Scala traits fully support mixin composition by allowing both field and method definitions, with flexible rules for how behavior is overridden once the traits are mixed into classes.
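For illustration, here is a minimal sketch of the kind of reusable logging mixin described above; the `Logging` trait and the classes mixing it in are hypothetical names, not code from this notebook:

```scala
// A reusable logging mixin: the method definition lives in the trait itself.
trait Logging {
  def log(message: String): Unit = println(s"[${getClass.getSimpleName}] $message")
}

// Any class can mix it in; no delegation boilerplate is needed.
class DataLoader extends Logging {
  def load(path: String): Unit = log(s"loading $path")
}

class ReportWriter extends Logging {
  def write(name: String): Unit = log(s"writing report $name")
}
```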
### 1b. Java Streams
When you use the Java 8 collections, you can convert the traditional collections to a "stream", which is lazy and gives you more functional operations. However, the conversions back and forth can be tedious, e.g., converting to a stream for functional processing, then converting back in order to pass the results to older APIs. Scala collections are more consistently functional.
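By contrast, a plain Scala collection can be processed functionally with no conversion step in either direction. A small illustrative sketch (the values here are arbitrary):

```scala
// map and filter return ordinary collections; no .stream()/.collect() round trip.
val words  = List("spark", "scala", "data", "science")
val counts = words.filter(_.length > 4).map(w => (w, w.length))
// counts: List[(String, Int)] = List((spark,5), (scala,5), (science,7))
```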
### The Virtue of Functional Collections
Let's examine how concisely we can operate on a collection of values in Scala and Spark.
@@ -84,7 +87,7 @@ Let's compare a Scala collections calculation vs. the same thing in Spark; how m
Note that for the numbers between 1 and 100, inclusive, exactly 1/4 of them are prime!
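The collections version of that calculation might look roughly like the following sketch; the `isPrime` shown here is a simple illustrative definition, and the notebook's own helper may differ:

```scala
// Illustrative primality test (good enough for small n).
def isPrime(n: Int): Boolean = n > 1 && (2 to math.sqrt(n).toInt).forall(n % _ != 0)

val primeCounts = (1 to 100).
  map(i => (i, isPrime(i))).
  groupBy(tuple => tuple._2).
  map(tuple => (tuple._1, tuple._2.size))
// primeCounts: Map[Boolean, Int], e.g. Map(false -> 75, true -> 25)
```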
@@ -95,7 +98,7 @@ Note how similar the following code is to the previous example. After constructi
However, because Spark collections are "lazy" by default (i.e., not evaluated until we ask for results), we explicitly print the results so Spark evaluates them!
Note the inferred type, an `RDD` with records of type `(Boolean, Int)`, meaning two-element tuples.
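In other words, transformations only describe the computation; nothing runs until an action is invoked. A hedged sketch of that idea, reusing `sparkContext` and `isPrime` from the surrounding examples (this is not the notebook's exact cell):

```scala
val primesRDD = sparkContext.parallelize(1 to 100).
  map(i => (isPrime(i), i))          // lazy: builds an execution plan, computes nothing yet
primesRDD.take(5).foreach(println)   // action: forces evaluation and prints a few (Boolean, Int) records
```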
@@ -123,19 +126,19 @@ What about the other languages?
## 2. Interpreter (REPL)
In the notebook, we've already been using the Scala interpreter (a.k.a. the REPL: Read, Eval, Print, Loop) behind the scenes. It makes notebooks like this one possible!
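For a flavor of what that looks like outside a notebook, here is a tiny hypothetical REPL session (prompts and type printouts vary slightly by Scala version):

```scala
$ scala
scala> def square(x: Int) = x * x
square: (x: Int)Int

scala> (1 to 5).map(square).sum
res0: Int = 55
```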
What about the other languages?
* **Python:** Also has an interpreter, and [iPython/Jupyter](https://ipython.org/) was one of the first widely used notebook environments.
* **R:** Also has an interpreter and notebook/IDE environments.
* **Java:** Does _not_ have an interpreter and can't be programmed in a notebook environment. However, Java 9 will have a REPL, after 20+ years!
## 3. Tuple Syntax
In data, you work with records of `n` fields (for some value of `n`) all the time. Support for `n`-element _tuples_ is very convenient and Scala has a shorthand syntax for instantiating tuples. We used it twice previously to return two-element tuples in the anonymous functions passed to the `map` methods above:
```scala
sparkContext.parallelize(1 to 100).
  map(i => (i, isPrime(i))).               // <-- here
  groupBy(tuple => tuple._2).
  map(tuple => (tuple._1, tuple._2.size))  // <-- here
```
@@ -174,7 +177,7 @@ This is one of the most powerful features you'll find in most functional languag
Let's rewrite our previous primes example:
```scala
sparkContext.parallelize(1 to 100).
  map(i => (i, isPrime(i))).
  groupBy{ case (_, primality) => primality }.                 // Syntax: { case pattern => body }
  map{ case (primality, values) => (primality, values.size) }  // same here
```
@@ -184,7 +187,7 @@ sparkContext.parallelize(1 until 100).
The output is:
```scala
(true,25)
(false,75)
```
Note the `case` keyword and `=>` separating the pattern from the body to execute if the pattern matches.
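The same `case` syntax also works in a standalone `match` expression or anywhere a function literal is expected; here is a small illustrative example outside of Spark:

```scala
// Pattern matching on two-element tuples, mirroring the (primality, count) records above.
val results = Seq((true, 25), (false, 75))
results.foreach {
  case (true, count)  => println(s"$count primes")
  case (false, count) => println(s"$count non-primes")
}
```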
@@ -268,7 +271,7 @@ j: Int = 10
Recall our previous Spark example, where we wrote nothing about types, but they were inferred:
```scala
sparkContext.parallelize(1 to 100).
  map(i => (i, isPrime(i))).
  groupBy{ case (_, primality) => primality }.
  map{ case (primality, values) => (primality, values.size) }
```
@@ -330,11 +333,10 @@ Get the root directory of the notebooks:
```scala
val root = sys.env("NOTEBOOKS_DIR")
```
Load the airports data into a [DataFrame](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame).
We cache the results, so Spark will keep the data in memory since we'll run a few queries over it. `DataFrame.show` is convenient for displaying the first `N` records (20 by default).
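A minimal sketch of what that load-and-cache step might look like; the file name, format, and reader call are assumptions for illustration, and the notebook's actual cell may differ:

```scala
// Hypothetical path and format; the real data source in the notebook may be different.
val airports = sqlContext.read.json(s"$root/airports.json")
airports.cache()   // keep the data in memory for the several queries that follow
airports.show()    // display the first 20 records by default
```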
```scala
grouped.show(100)  // 50 states + territories < 100
```
Here is the output:
@@ -460,7 +468,7 @@ What about the other languages?
* **Java:** Limited to so-called _fluent_ APIs, similar to our collections and RDD examples above.
## 9. And a Few Other Things...
There are many more Scala features that the other languages don't have or don't support as nicely. Some are actually quite significant for general programming tasks, but they are used less frequently in Spark code. Here they are, for completeness.
### 9A. Singletons Are a Built-in Feature
Scala lets you implement the _Singleton Design Pattern_ without writing special logic to ensure there's only one instance.
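For example, an `object` declaration gives you a lazily initialized, thread-safe single instance with no extra code; the name and contents here are illustrative:

```scala
// One instance, created on first use; no private constructors or static factories required.
object AppConfig {
  val sparkMaster: String = sys.env.getOrElse("SPARK_MASTER", "local[*]")
  def describe(): String  = s"master = $sparkMaster"
}

println(AppConfig.describe())   // usage: just reference the object by name
```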
@@ -713,6 +721,7 @@ SuccessOrFailure<? extends Object> sof = null;
```java
sof = new Success<String>("foo");
```
This is harder for the user, who has to understand what's allowed in this case: both what the designer intended and some technical rules of type theory.
It's much better if the _designer_ of `SuccessOrFailure[T]`, who understands the desired behavior, defines the allowed variance behavior at the _definition site_, which Scala supports. Recall from above: