
Commit 093eebd

Author: Andy Petrella (committed)

Merge pull request #7 from data-fellas/deanwampler-master

Deanwampler master

2 parents 08939d5 + 496187a, commit 093eebd

3 files changed: +100, -91 lines


notebooks/WhyScala.md

Lines changed: 24 additions & 15 deletions
@@ -7,6 +7,7 @@
 
 * Scala Days NYC, May 5th, 2016
 * GOTO Chicago, May 24, 2016
+* Strata + Hadoop World London, June 3, 2016
 * Scala Days Berlin, June 16th, 2016
 
 See also the [Spark Notebook](http://spark-notebook.io) version of this content, available at [github.com/data-fellas/scala-for-data-science](https://github.com/data-fellas/scala-for-data-science).
@@ -17,6 +18,8 @@ While Python and R are traditional languages of choice for Data Science, [Spark]
 
 However, using one language for all work has advantages like simplifying the software development process, such as building, testing, and deploying techniques, coding conventions, etc.
 
+If you want a thorough introduction to Scala, see [Dean's book](http://shop.oreilly.com/product/0636920033073.do).
+
 So, what are the advantages, as well as disadvantages of Scala?
 
 ## 1. Functional Programming Plus Objects
@@ -37,7 +40,7 @@ Scala also implements some _functional_ features using _object-oriented inheritance
 * **R:** As a Statistics language, R is more functional than object-oriented.
 * **Java:** An object-oriented language, but with recently introduced functional constructs, _lambdas_ (anonymous functions) and collection operations that follow a more _functional_ style, rather than _imperative_ (i.e., where mutating the collection is embraced).
 
-There are a few differences with Java's vs. Scala's approaches to OOP and FP that are worth mentioning:
+There are a few differences with Java's vs. Scala's approaches to OOP and FP that are worth mentioning specifically:
 
 ### 1a. Traits vs. Interfaces
 Scala's object model adds a _trait_ feature, which is a more powerful concept than Java 8 interfaces. Before Java 8, there was no [mixin composition](https://en.wikipedia.org/wiki/Mixin) capability in Java, where composition is generally [preferred over inheritance](https://en.wikipedia.org/wiki/Composition_over_inheritance).
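For readers skimming the diff, here is a minimal, hypothetical sketch of what mixin composition with a trait looks like; the `Logging` and `DataLoader` names are invented for illustration and are not part of the notebook or this commit.

```scala
// Hypothetical illustration: a reusable trait mixed into a class.
trait Logging {
  def log(message: String): Unit = println(s"LOG: $message")
}

class DataLoader(path: String) extends Logging {
  def load(): Seq[String] = {
    log(s"loading $path")   // behavior inherited from the mixed-in trait
    Seq.empty               // placeholder for real loading logic
  }
}

new DataLoader("/tmp/data.csv").load()
```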
@@ -47,7 +50,7 @@ Imagine that you want to define reusable logging code and mix it into other clas
 Scala traits fully support mixin composition by supporting both field and method definitions with flexibility rules for overriding behavior, once the traits are mixed into classes.
 
 ### 1b. Java Streams
-When you use the Java 8 collections, you can convert the the tradition collections to a "stream", which is lazy and gives you more functional operations. However, sometimes, the conversions back and forth can be tedious, e.g., converting to a stream for functional processing, then converting pass them to older APIs, etc. Scala collections are more consistently functional.
+When you use the Java 8 collections, you can convert the traditional collections to a "stream", which is lazy and gives you more functional operations. However, sometimes, the conversions back and forth can be tedious, e.g., converting to a stream for functional processing, then converting pass them to older APIs, etc. Scala collections are more consistently functional.
 
 ### The Virtue of Functional Collections:
 Let's examine how concisely we can operate on a collection of values in Scala and Spark.
@@ -84,7 +87,7 @@ Let's compare a Scala collections calculation vs. the same thing in Spark; how m
 This produces the results:
 ```scala
 res16: scala.collection.immutable.Map[Boolean,Int] = Map(
-false -> 74, true -> 25)
+false -> 75, true -> 25)
 ```
 Note that for the numbers between 1 and 100, inclusive, exactly 1/3 of them are prime!
 
@@ -95,7 +98,7 @@ Note how similar the following code is to the previous example. After constructi
 However, because Spark collections are "lazy" by default (i.e., not evaluated until we ask for results), we explicitly print the results so Spark evaluates them!
 
 ```scala
-val rddPrimes = sparkContext.parallelize(1 until 100).
+val rddPrimes = sparkContext.parallelize(1 to 100).
 map(i => (i, isPrime(i))).
 groupBy(tuple => tuple._2).
 map(tuple => (tuple._1, tuple._2.size))
@@ -106,7 +109,7 @@ This produces the result:
 ```scala
 rddPrimes: org.apache.spark.rdd.RDD[(Boolean, Int)] =
 MapPartitionsRDD[4] at map at <console>:61
-res18: Array[(Boolean, Int)] = Array((false,74), (true,25))
+res18: Array[(Boolean, Int)] = Array((false,75), (true,25))
 ```
 
 Note the inferred type, an `RDD` with records of type `(Boolean, Int)`, meaning two-element tuples.
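The `isPrime` helper used by both versions is defined earlier in the notebook and does not appear in this diff; a self-contained sketch with the same behavior (the exact definition here is an assumption) would be:

```scala
// Sketch only; the notebook's own isPrime may differ in detail.
def isPrime(n: Int): Boolean =
  n > 1 && (2 to math.sqrt(n).toInt).forall(n % _ != 0)

// With it in scope, the plain-collections analogue of the RDD pipeline above is:
(1 to 100).
  map(i => (i, isPrime(i))).
  groupBy(tuple => tuple._2).
  map(tuple => (tuple._1, tuple._2.size))   // Map(false -> 75, true -> 25)
```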
@@ -123,19 +126,19 @@ What about the other languages?
 
 ## 2. Interpreter (REPL)
 
-In the notebook, we've been using the Scala interpreter (a.k.a., the REPL - Read Eval, Print, Loop) already behind the scenes. It makes notebooks possible!
+In the notebook, we've been using the Scala interpreter (a.k.a., the REPL - Read Eval, Print, Loop) already behind the scenes. It makes notebooks like this one possible!
 
 What about the other languages?
 
 * **Python:** Also has an interpreter and [iPython/Jupyter](https://ipython.org/) was one of the first, widely-used notebook environments.
 * **R:** Also has an interpreter and notebook/IDE environments.
-* **Java:** Does _not_ have an interpreter and can't be programmed in a notebook environment.
+* **Java:** Does _not_ have an interpreter and can't be programmed in a notebook environment. However, Java 9 will have a REPL, after 20+ years!
 
 ## 3. Tuple Syntax
 In data, you work with records of `n` fields (for some value of `n`) all the time. Support for `n`-element _tuples_ is very convenient and Scala has a shorthand syntax for instantiating tuples. We used it twice previously to return two-element tuples in the anonymous functions passed to the `map` methods above:
 
 ```scala
-sparkContext.parallelize(1 until 100).
+sparkContext.parallelize(1 to 100).
 map(i => (i, isPrime(i))). // <-- here
 groupBy(tuple => tuple._2).
 map(tuple => (tuple._1, tuple._2.size)) // <-- here
@@ -174,7 +177,7 @@ This is one of the most powerful features you'll find in most functional languag
 Let's rewrite our previous primes example:
 
 ```scala
-sparkContext.parallelize(1 until 100).
+sparkContext.parallelize(1 to 100).
 map(i => (i, isPrime(i))).
 groupBy{ case (_, primality) => primality}. // Syntax: { case pattern => body }
 map{ case (primality, values) => (primality, values.size) } . // same here
@@ -184,7 +187,7 @@ sparkContext.parallelize(1 until 100).
 The output is:
 ```scala
 (true,25)
-(false,74)
+(false,75)
 ```
 
 Note the `case` keyword and `=>` separating the pattern from the body to execute if the pattern matches.
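As a plain-Scala aside (not part of this commit), the tuple shorthand from section 3 and the `case` pattern syntax shown here work the same way outside Spark; the values below are made up for illustration:

```scala
// Tuple shorthand and destructuring with a pattern.
val record = ("SFO", 37.62, -122.37)   // 3-element tuple, type (String, Double, Double)
val (iata, lat, long) = record         // binds three names in one pattern

// A partial-function literal using `case`, applied to each tuple.
Seq((true, 25), (false, 75)).foreach {
  case (primality, count) => println(s"$primality -> $count")
}
```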
@@ -268,7 +271,7 @@ j: Int = 10
 Recall our previous Spark example, where we wrote nothing about types, but they were inferred:
 
 ```scala
-sparkContext.parallelize(1 until 100).
+sparkContext.parallelize(1 to 100).
 map(i => (i, isPrime(i))).
 groupBy{ case(_, primality) => primality }.
 map{ case (primality, values) => (primality, values.size) }
@@ -330,11 +333,10 @@ Get the root directory of the notebooks:
 val root = sys.env("NOTEBOOKS_DIR")
 ```
 
-Load the airports data into a [DataFrame](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame). (In the notebook, the "cell" returns Scala "Unit", `()`, which is sort of like `void`, to avoid an annoying bug in the output.)
+Load the airports data into a [DataFrame](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame).
 
 ```scala
 val airportsDF = sqlContext.read.json(s"$root/airports.json")
-()
 ```
 
 Note the "schema" is inferred from the JSON and shown by the REPL (by calling `DataFrame.toString`).
@@ -345,6 +347,12 @@ airportsDF: org.apache.spark.sql.DataFrame = [airport: string, city: string, cou
 
 We cache the results, so Spark will keep the data in memory since we'll run a few queries over it. `DataFrame.show` is convenient for displaying the first `N` records (20 by default).
 
+```scala
+airportsDF.cache
+airportsDF.show
+```
+
+Here's the output of `show`:
 ```
 +--------------------+------------------+-------+----+-----------+------------+-----+
 | airport| city|country|iata| lat| long|state|
@@ -378,7 +386,7 @@ Now we can show the idiomatic DataFrame API (DSL) in action:
 ```scala
 val grouped = airportsDF.groupBy($"state", $"country").count.orderBy($"count".desc)
 grouped.printSchema
-grouped.show(100) // all 50 states + territories
+grouped.show(100) // 50 states + territories < 100
 ```
 
 Here is the output:
@@ -460,7 +468,7 @@ What about the other languages?
 * **Java:** Limited to so-called _fluent_ APIs, similar to our collections and RDD examples above.
 
 ## 9. And a Few Other Things...
-There are many more Scala features that the other languages don't have or don't support as nicely. Some are actually quite significant for general programming tasks, but they are used infrequently in Spark code. Here they are, for completeness.
+There are many more Scala features that the other languages don't have or don't support as nicely. Some are actually quite significant for general programming tasks, but they are used less frequently in Spark code. Here they are, for completeness.
 
 ### 9A. Singletons Are a Built-in Feature
 Implement the _Singleton Design Pattern_ without special logic to ensure there's only one instance.
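A minimal sketch of this built-in feature, with names invented for illustration rather than taken from the notebook:

```scala
// A Scala `object` is a language-level singleton: one instance, created lazily on first use.
object AppConfig {
  val sparkMaster: String = "local[*]"
  def describe(): String = s"Spark master = $sparkMaster"
}

println(AppConfig.describe())   // no `new`, no hand-written instance management
```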
@@ -713,6 +721,7 @@ SuccessOrFailure<? extends Object> sof = null;
 sof = new Success<String>("foo");
 
 ```
+
 This is harder for the user, who has to understand what's okay in this case, both what the designer intended and some technical rules of type theory.
 
 It's much better if the _designer_ of `SuccessOrFailure[T]`, who understands the desired behavior, defines the allowed variance behavior at the _definition site_, which Scala supports. Recall from above:
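The Scala definition that "Recall from above" refers to sits outside this diff. A hedged reconstruction of the idea follows; `SuccessOrFailure` and `Success` appear in the Java snippet above, while `Failure` and the exact member lists are assumptions made for a self-contained example:

```scala
// Declaration-site variance: the designer declares covariance (+T) once, in the definition.
sealed trait SuccessOrFailure[+T]
case class Success[+T](value: T)     extends SuccessOrFailure[T]
case class Failure(error: Throwable) extends SuccessOrFailure[Nothing]

// Callers then get the expected subtyping without wildcards:
val sof: SuccessOrFailure[Any] = Success("foo")
```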

notebooks/WhyScala.pdf

1.23 KB
Binary file not shown.

0 commit comments
