Commit f6ead88
[SPARK-32245][INFRA] Run Spark tests in Github Actions
This PR aims to run the Spark tests in GitHub Actions. The main idea, briefly:

- Reuse `dev/run-tests.py` with the SBT build.
- Reuse the modules in `dev/sparktestsupport/modules.py` to test each module.
- Pass the modules to test into `dev/run-tests.py` directly via the `TEST_ONLY_MODULES` environment variable, for example `pyspark-sql,core,sql,hive`.
- `dev/run-tests.py` does _not_ take dependent modules into account; it tests solely the specified modules.

Another thing to note is the `SlowHiveTest` annotation. Running the tests in the Hive modules takes too long, so the slow tests are extracted and run as a separate job. The split was derived from the actual elapsed times in Jenkins:

![Screen Shot 2020-07-09 at 7 48 13 PM](https://user-images.githubusercontent.com/6477701/87050238-f6098e80-c238-11ea-9c4a-ab505af61381.png)

So the Hive tests are separated into two jobs: one runs the slow test cases and the other runs the remaining test cases.

_Note that_ the current GitHub Actions build virtually copies what the default PR builder on Jenkins does (without other profiles such as JDK 11, Hadoop 2, etc.). The only exception is Kinesis: https://github.com/apache/spark/pull/29057/files#diff-04eb107ee163a50b61281ca08f4e4c7bR23

From last week onwards, the Jenkins machines have become very unstable for several reasons:

- Apparently, the machines became extremely slow; almost no tests can pass.
- One machine (worker 4) started to have a corrupt `.m2`, which fails the build.
- The documentation build fails from time to time for an unknown reason, specifically on the Jenkins machines. It is disabled for now at apache#29017.
- Almost all PRs are currently blocked by this instability.

The advantages of using GitHub Actions:

- It avoids depending on the few people who can access the cluster.
- It reduces the elapsed build time: we can split the tests (e.g., SQL, ML, CORE) and run them in parallel, so the total build time is significantly reduced.
- It allows controlling the environment more flexibly.
- Other contributors can test and propose fixes to the GitHub Actions configuration, so the build-management cost is distributed.

Note that:

- The current build in Jenkins takes _more than 7 hours_. With GitHub Actions it takes _less than 2 hours_.
- We can now control the environments easily, especially for Python.
- The tests and build look more stable than Jenkins's.

This is a dev-only change with no user-facing impact. Tested at #4.

Closes apache#29057 from HyukjinKwon/migrate-to-github-actions.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
1 parent 34544d6 commit f6ead88
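As a rough illustration of the mechanism described above, each matrix cell of the new workflow exports its values as environment variables and then calls `dev/run-tests`. A local approximation of one cell might look like the sketch below (variable names are taken from the workflow added in this commit; whether `dev/run-tests` behaves identically outside GitHub Actions is an assumption, not something this diff verifies):

```bash
# Sketch: approximate a single matrix cell of the new workflow locally.
export HADOOP_PROFILE=hadoop2.6                      # Hadoop profile used by the matrix
export TEST_ONLY_MODULES=pyspark-sql,pyspark-mllib   # test only these modules; dependents are NOT pulled in
./dev/run-tests --parallelism 2                      # same invocation as the "Run tests" step below
```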

File tree: 19 files changed, +434 −159 lines


.github/workflows/branch-2.4.yml

Lines changed: 0 additions & 104 deletions
This file was deleted.

.github/workflows/master.yml

Lines changed: 221 additions & 0 deletions
```yaml
name: master

on:
  pull_request:
    branches:
    - branch-2.4

jobs:
  # TODO(SPARK-32248): Recover JDK 11 builds
  # Build: build Spark and run the tests for specified modules.
  build:
    name: "Build modules: ${{ matrix.modules }} ${{ matrix.comment }} (JDK ${{ matrix.java }}, ${{ matrix.hadoop }})"
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        java:
          - 1.8
        hadoop:
          - hadoop2.6
        # TODO(SPARK-32246): We don't test 'streaming-kinesis-asl' for now.
        # Kinesis tests depends on external Amazon kinesis service.
        # Note that the modules below are from sparktestsupport/modules.py.
        modules:
          - |-
            core, unsafe, kvstore, avro,
            network_common, network_shuffle, repl, launcher
            examples, sketch, graphx
          - |-
            catalyst, hive-thriftserver
          - |-
            streaming, sql-kafka-0-10, streaming-kafka-0-10,
            mllib-local, mllib,
            yarn, mesos, kubernetes, hadoop-cloud, spark-ganglia-lgpl,
            streaming-flume, streaming-flume-sink, streaming-kafka-0-8
          - |-
            pyspark-sql, pyspark-mllib
          - |-
            pyspark-core, pyspark-streaming, pyspark-ml
          - |-
            sparkr
        # Here, we split Hive and SQL tests into some of slow ones and the rest of them.
        included-tags: [""]
        excluded-tags: [""]
        comment: [""]
        include:
          # Hive tests
          - modules: hive
            java: 1.8
            hadoop: hadoop2.6
            included-tags: org.apache.spark.tags.SlowHiveTest
            comment: "- slow tests"
          - modules: hive
            java: 1.8
            hadoop: hadoop2.6
            excluded-tags: org.apache.spark.tags.SlowHiveTest
            comment: "- other tests"
          # SQL tests
          - modules: sql
            java: 1.8
            hadoop: hadoop2.6
            included-tags: org.apache.spark.tags.ExtendedSQLTest
            comment: "- slow tests"
          - modules: sql
            java: 1.8
            hadoop: hadoop2.6
            excluded-tags: org.apache.spark.tags.ExtendedSQLTest
            comment: "- other tests"
    env:
      TEST_ONLY_MODULES: ${{ matrix.modules }}
      TEST_ONLY_EXCLUDED_TAGS: ${{ matrix.excluded-tags }}
      TEST_ONLY_INCLUDED_TAGS: ${{ matrix.included-tags }}
      HADOOP_PROFILE: ${{ matrix.hadoop }}
      # GitHub Actions' default miniconda to use in pip packaging test.
      CONDA_PREFIX: /usr/share/miniconda
    steps:
    - name: Checkout Spark repository
      uses: actions/checkout@v2
    # Cache local repositories. Note that GitHub Actions cache has a 2G limit.
    - name: Cache Scala, SBT, Maven and Zinc
      uses: actions/cache@v1
      with:
        path: build
        key: build-${{ hashFiles('**/pom.xml') }}
        restore-keys: |
          build-
    - name: Cache Maven local repository
      uses: actions/cache@v2
      with:
        path: ~/.m2/repository
        key: ${{ matrix.java }}-${{ matrix.hadoop }}-maven-${{ hashFiles('**/pom.xml') }}
        restore-keys: |
          ${{ matrix.java }}-${{ matrix.hadoop }}-maven-
    - name: Cache Ivy local repository
      uses: actions/cache@v2
      with:
        path: ~/.ivy2/cache
        key: ${{ matrix.java }}-${{ matrix.hadoop }}-ivy-${{ hashFiles('**/pom.xml') }}-${{ hashFiles('**/plugins.sbt') }}
        restore-keys: |
          ${{ matrix.java }}-${{ matrix.hadoop }}-ivy-
    - name: Install JDK ${{ matrix.java }}
      uses: actions/setup-java@v1
      with:
        java-version: ${{ matrix.java }}
    # PySpark
    - name: Install PyPy3
      # SQL component also has Python related tests, for example, IntegratedUDFTestUtils.
      # Note that order of Python installations here matters because default python3 is
      # overridden by pypy3.
      uses: actions/setup-python@v2
      if: contains(matrix.modules, 'pyspark') || (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
      with:
        python-version: pypy3
        architecture: x64
    - name: Install Python 3.6
      uses: actions/setup-python@v2
      if: contains(matrix.modules, 'pyspark') || (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
      with:
        python-version: 3.6
        architecture: x64
    - name: Install Python 2.7
      uses: actions/setup-python@v2
      if: contains(matrix.modules, 'pyspark') || (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
      with:
        python-version: 2.7
        architecture: x64
    - name: Install Python packages
      if: contains(matrix.modules, 'pyspark') || (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
      # PyArrow is not supported in PyPy yet, see ARROW-2651.
      # TODO(SPARK-32247): scipy installation with PyPy fails for an unknown reason.
      run: |
        python3 -m pip install numpy pyarrow pandas scipy
        python3 -m pip list
        python2 -m pip install numpy pyarrow pandas scipy
        python2 -m pip list
        pypy3 -m pip install numpy pandas
        pypy3 -m pip list
    # SparkR
    - name: Install R 3.6
      uses: r-lib/actions/setup-r@v1
      if: contains(matrix.modules, 'sparkr')
      with:
        r-version: 3.6
    - name: Install R packages
      if: contains(matrix.modules, 'sparkr')
      run: |
        sudo apt-get install -y libcurl4-openssl-dev
        sudo Rscript -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'devtools', 'e1071', 'survival', 'arrow', 'roxygen2'), repos='https://cloud.r-project.org/')"
        # Show installed packages in R.
        sudo Rscript -e 'pkg_list <- as.data.frame(installed.packages()[, c(1,3:4)]); pkg_list[is.na(pkg_list$Priority), 1:2, drop = FALSE]'
    # Run the tests.
    - name: "Run tests: ${{ matrix.modules }}"
      run: |
        # Hive tests become flaky when running in parallel as it's too intensive.
        if [[ "$TEST_ONLY_MODULES" == "hive" ]]; then export SERIAL_SBT_TESTS=1; fi
        mkdir -p ~/.m2
        ./dev/run-tests --parallelism 2
        rm -rf ~/.m2/repository/org/apache/spark

  # Static analysis, and documentation build
  lint:
    name: Linters, licenses, dependencies and documentation generation
    runs-on: ubuntu-latest
    steps:
    - name: Checkout Spark repository
      uses: actions/checkout@v2
    - name: Cache Maven local repository
      uses: actions/cache@v2
      with:
        path: ~/.m2/repository
        key: docs-maven-repo-${{ hashFiles('**/pom.xml') }}
        restore-keys: |
          docs-maven-
    - name: Install JDK 1.8
      uses: actions/setup-java@v1
      with:
        java-version: 1.8
    - name: Install Python 3.6
      uses: actions/setup-python@v2
      with:
        python-version: 3.6
        architecture: x64
    - name: Install Python linter dependencies
      run: |
        pip3 install flake8 sphinx numpy
    - name: Install R 3.6
      uses: r-lib/actions/setup-r@v1
      with:
        r-version: 3.6
    - name: Install R linter dependencies and SparkR
      run: |
        sudo apt-get install -y libcurl4-openssl-dev
        sudo Rscript -e "install.packages(c('devtools'), repos='https://cloud.r-project.org/')"
        sudo Rscript -e "devtools::install_github('jimhester/[email protected]')"
        ./R/install-dev.sh
    - name: Install Ruby 2.7 for documentation generation
      uses: actions/setup-ruby@v1
      with:
        ruby-version: 2.7
    - name: Install dependencies for documentation generation
      run: |
        sudo apt-get install -y libcurl4-openssl-dev pandoc
        pip install sphinx mkdocs numpy
        gem install jekyll jekyll-redirect-from rouge
        sudo Rscript -e "install.packages(c('devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2'), repos='https://cloud.r-project.org/')"
    - name: Scala linter
      run: ./dev/lint-scala
    - name: Java linter
      run: ./dev/lint-java
    - name: Python linter
      run: ./dev/lint-python
    - name: R linter
      run: ./dev/lint-r
    - name: License test
      run: ./dev/check-license
    - name: Dependencies test
      run: ./dev/test-dependencies.sh
    - name: Run documentation build
      run: |
        cd docs
        jekyll build
```
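For reference, the checks in the `lint` job are plain scripts in the repository, so they can be run locally in the same order, assuming the Python, R, and Ruby dependencies listed in the job are already installed:

```bash
# The same checks the `lint` job runs, invoked from the repository root.
./dev/lint-scala
./dev/lint-java
./dev/lint-python
./dev/lint-r
./dev/check-license
./dev/test-dependencies.sh
(cd docs && jekyll build)   # documentation build
```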
SlowHiveTest.java (new file)

Lines changed: 30 additions & 0 deletions

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.tags;

import org.scalatest.TagAnnotation;

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

@TagAnnotation
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.METHOD, ElementType.TYPE})
public @interface SlowHiveTest { }
```
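This annotation is what the two `hive` cells in the workflow matrix key on. A hedged sketch of what those two cells effectively run is below; the translation of the tag variables into ScalaTest include/exclude filters happens inside `dev/run-tests.py` and is assumed here rather than copied from it:

```bash
# "hive - slow tests" cell: only tests annotated with @SlowHiveTest.
TEST_ONLY_MODULES=hive \
TEST_ONLY_INCLUDED_TAGS=org.apache.spark.tags.SlowHiveTest \
SERIAL_SBT_TESTS=1 \
./dev/run-tests --parallelism 2

# "hive - other tests" cell: everything in the hive module except @SlowHiveTest.
TEST_ONLY_MODULES=hive \
TEST_ONLY_EXCLUDED_TAGS=org.apache.spark.tags.SlowHiveTest \
SERIAL_SBT_TESTS=1 \
./dev/run-tests --parallelism 2
```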

dev/run-pip-tests

Lines changed: 8 additions & 3 deletions
```diff
@@ -68,7 +68,7 @@ fi
 PYSPARK_VERSION=$(python3 -c "exec(open('python/pyspark/version.py').read());print(__version__)")
 PYSPARK_DIST="$FWDIR/python/dist/pyspark-$PYSPARK_VERSION.tar.gz"
 # The pip install options we use for all the pip commands
-PIP_OPTIONS="--upgrade --no-cache-dir --force-reinstall "
+PIP_OPTIONS="--user --upgrade --no-cache-dir --force-reinstall "
 # Test both regular user and edit/dev install modes.
 PIP_COMMANDS=("pip install $PIP_OPTIONS $PYSPARK_DIST"
               "pip install $PIP_OPTIONS -e python/")
@@ -81,8 +81,12 @@ for python in "${PYTHON_EXECS[@]}"; do
     VIRTUALENV_PATH="$VIRTUALENV_BASE"/$python
     rm -rf "$VIRTUALENV_PATH"
     if [ -n "$USE_CONDA" ]; then
+      if [ -f "$CONDA_PREFIX/etc/profile.d/conda.sh" ]; then
+        # See also https://github.com/conda/conda/issues/7980
+        source "$CONDA_PREFIX/etc/profile.d/conda.sh"
+      fi
       conda create -y -p "$VIRTUALENV_PATH" python=$python numpy pandas pip setuptools
-      source activate "$VIRTUALENV_PATH"
+      conda activate "$VIRTUALENV_PATH" || (echo "Falling back to 'source activate'" && source activate "$VIRTUALENV_PATH")
     else
       mkdir -p "$VIRTUALENV_PATH"
       virtualenv --python=$python "$VIRTUALENV_PATH"
@@ -115,6 +119,7 @@ for python in "${PYTHON_EXECS[@]}"; do
     cd /

     echo "Run basic sanity check on pip installed version with spark-submit"
+    export PATH="$(python3 -m site --user-base)/bin:$PATH"
     spark-submit "$FWDIR"/dev/pip-sanity-check.py
     echo "Run basic sanity check with import based"
     python "$FWDIR"/dev/pip-sanity-check.py
@@ -125,7 +130,7 @@ for python in "${PYTHON_EXECS[@]}"; do

     # conda / virtualenv environments need to be deactivated differently
     if [ -n "$USE_CONDA" ]; then
-      source deactivate
+      conda deactivate || (echo "Falling back to 'source deactivate'" && source deactivate)
     else
       deactivate
     fi
```
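One detail worth spelling out: with `pip install --user`, console scripts such as `spark-submit` are placed under the per-user base directory, which is not on `PATH` by default, hence the `export PATH=...` line added before the spark-submit sanity check. An illustrative way to see where those scripts land (not part of the script itself):

```bash
python3 -m site --user-base                              # per-user install prefix, e.g. ~/.local
ls "$(python3 -m site --user-base)/bin"                  # console scripts from `pip install --user` end up here
export PATH="$(python3 -m site --user-base)/bin:$PATH"   # same export the diff adds before spark-submit
```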
