Commit c074e06

[SPARK-32292][SPARK-32252][INFRA] Run the relevant tests only in GitHub Actions
This PR mainly proposes to run only the relevant tests, just as the Jenkins PR builder does. Currently, GitHub Actions always runs the full test suite, which wastes resources. While here, this PR also fixes three closely related issues:

1. The main idea: reuse the existing logic embedded in `dev/run-tests.py`, which the Jenkins PR builder uses in order to run only the related test cases.

2. Fix SPARK-32292 too, so that the doctests run. They failed because other git references were not available when the repository was cloned via `actions/checkout@v2`; with `fetch-depth: 0`, the full history is available.

3. Fix `dev/run-tests.py` to match `python/run-tests.py` in terms of its options. Environment variables such as `TEST_ONLY_XXX` were turned into proper options. For example:

   ```bash
   dev/run-tests.py --modules sql,core
   ```

   which is consistent with `python/run-tests.py`, for example:

   ```bash
   python/run-tests.py --modules pyspark-core,pyspark-ml
   ```

4. Lastly, fix a formatting issue in the module specification in the matrix:

   ```diff
   - network_common, network_shuffle, repl, launcher
   + network-common, network-shuffle, repl, launcher,
   ```

   which caused those modules to be built and tested incorrectly.

By running only the related tests, we save resources considerably and avoid unrelated flaky tests. In addition, the doctests of `dev/run-tests.py` now run properly, the usage is consistent between `dev/run-tests.py` and `python/run-tests.py`, and the `network-common`, `network-shuffle`, `launcher` and `examples` modules are now built and tested as intended.

No user-facing change; dev-only.

Manually tested in my own forked Spark: #7 #8 #9 #10 #11 #12

Closes apache#29086 from HyukjinKwon/SPARK-32292.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
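To make the "run only relevant tests" idea concrete, here is a minimal, hypothetical sketch of the changed-file detection this approach starts from. The real logic lives in `identify_changed_files_from_git_commits` in `dev/run-tests.py`; the exact `git` invocation below is an assumption for illustration, not Spark's implementation.

```python
# A minimal sketch (NOT Spark's actual implementation) of listing the files
# changed between two git refs, the starting point for picking test modules.
import subprocess


def changed_files(sha, base_ref):
    # Hypothetical approximation: the triple-dot form diffs `sha` against the
    # merge base of `base_ref` and `sha`, which is what a PR build wants.
    out = subprocess.check_output(
        ["git", "diff", "--name-only", "%s...%s" % (base_ref, sha)])
    return [f for f in out.decode("utf-8").splitlines() if f.strip()]


if __name__ == "__main__":
    for f in changed_files("HEAD", "master"):
        print(f)
```

Note that this only works if the relevant history is present in the clone, which is exactly why the workflow below adds `fetch-depth: 0` to the checkout step.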
1 parent f4924a2 commit c074e06

File tree

2 files changed: +98 −29 lines

.github/workflows/master.yml

Lines changed: 10 additions & 6 deletions
```diff
@@ -27,7 +27,7 @@ jobs:
         modules:
           - |-
             core, unsafe, kvstore, avro,
-            network_common, network_shuffle, repl, launcher
+            network-common, network-shuffle, repl, launcher,
             examples, sketch, graphx
           - |-
             catalyst, hive-thriftserver
@@ -61,16 +61,20 @@ jobs:
             excluded-tags: org.apache.spark.tags.SlowHiveTest
             comment: "- other tests"
     env:
-      TEST_ONLY_MODULES: ${{ matrix.modules }}
-      TEST_ONLY_EXCLUDED_TAGS: ${{ matrix.excluded-tags }}
-      TEST_ONLY_INCLUDED_TAGS: ${{ matrix.included-tags }}
+      MODULES_TO_TEST: ${{ matrix.modules }}
+      EXCLUDED_TAGS: ${{ matrix.excluded-tags }}
+      INCLUDED_TAGS: ${{ matrix.included-tags }}
       HADOOP_PROFILE: ${{ matrix.hadoop }}
       # GitHub Actions' default miniconda to use in pip packaging test.
       CONDA_PREFIX: /usr/share/miniconda
+      GITHUB_PREV_SHA: ${{ github.event.before }}
       ARROW_PRE_0_15_IPC_FORMAT: 1
     steps:
     - name: Checkout Spark repository
       uses: actions/checkout@v2
+      # In order to fetch changed files
+      with:
+        fetch-depth: 0
     # Cache local repositories. Note that GitHub Actions cache has a 2G limit.
     - name: Cache Scala, SBT, Maven and Zinc
       uses: actions/cache@v1
@@ -147,9 +151,9 @@ jobs:
     - name: "Run tests: ${{ matrix.modules }}"
       run: |
         # Hive tests become flaky when running in parallel as it's too intensive.
-        if [[ "$TEST_ONLY_MODULES" == "hive" ]]; then export SERIAL_SBT_TESTS=1; fi
+        if [[ "$MODULES_TO_TEST" == "hive" ]]; then export SERIAL_SBT_TESTS=1; fi
         mkdir -p ~/.m2
-        ./dev/run-tests --parallelism 2
+        ./dev/run-tests --parallelism 2 --modules "$MODULES_TO_TEST" --included-tags "$INCLUDED_TAGS" --excluded-tags "$EXCLUDED_TAGS"
         rm -rf ~/.m2/repository/org/apache/spark

   # Static analysis, and documentation build
```
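For reference, a small sketch of how the renamed workflow variables end up on the command line. The environment variable names come from the diff above; the wrapper function itself is hypothetical and only illustrates the mapping.

```python
# Hypothetical helper showing how the workflow's env vars (from the diff
# above) map onto the new dev/run-tests flags.
import os
import shlex


def build_command():
    modules = os.environ.get("MODULES_TO_TEST", "")
    included = os.environ.get("INCLUDED_TAGS", "")
    excluded = os.environ.get("EXCLUDED_TAGS", "")
    cmd = ["./dev/run-tests", "--parallelism", "2",
           "--modules", modules,
           "--included-tags", included,
           "--excluded-tags", excluded]
    # Quote each part so empty tag lists still pass as explicit "" arguments.
    return " ".join(shlex.quote(part) for part in cmd)


print(build_command())
```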

dev/run-tests.py

Lines changed: 88 additions & 23 deletions
```diff
@@ -100,12 +100,14 @@ def setup_test_environ(environ):
         os.environ[k] = v


-def determine_modules_to_test(changed_modules):
+def determine_modules_to_test(changed_modules, deduplicated=True):
     """
     Given a set of modules that have changed, compute the transitive closure of those modules'
     dependent modules in order to determine the set of modules that should be tested.

     Returns a topologically-sorted list of modules (ties are broken by sorting on module names).
+    If ``deduplicated`` is disabled, the modules are returned without taking the deduplication
+    by dependencies into account.

     >>> [x.name for x in determine_modules_to_test([modules.root])]
     ['root']
@@ -121,11 +123,30 @@ def determine_modules_to_test(changed_modules):
     ... # doctest: +NORMALIZE_WHITESPACE
     ['sql', 'avro', 'hive', 'mllib', 'sql-kafka-0-10', 'examples', 'hive-thriftserver',
      'pyspark-sql', 'repl', 'sparkr', 'pyspark-mllib', 'pyspark-ml']
+    >>> sorted([x.name for x in determine_modules_to_test(
+    ...     [modules.sparkr, modules.sql], deduplicated=False)])
+    ... # doctest: +NORMALIZE_WHITESPACE
+    ['avro', 'examples', 'hive', 'hive-thriftserver', 'mllib', 'pyspark-ml',
+     'pyspark-mllib', 'pyspark-sql', 'repl', 'sparkr', 'sql', 'sql-kafka-0-10']
+    >>> sorted([x.name for x in determine_modules_to_test(
+    ...     [modules.sql, modules.core], deduplicated=False)])
+    ... # doctest: +NORMALIZE_WHITESPACE
+    ['avro', 'catalyst', 'core', 'examples', 'graphx', 'hive', 'hive-thriftserver',
+     'mllib', 'mllib-local', 'pyspark-core', 'pyspark-ml', 'pyspark-mllib',
+     'pyspark-sql', 'pyspark-streaming', 'repl', 'root',
+     'sparkr', 'sql', 'sql-kafka-0-10', 'streaming', 'streaming-flume',
+     'streaming-flume-assembly', 'streaming-flume-sink', 'streaming-kafka-0-10',
+     'streaming-kafka-0-8', 'streaming-kinesis-asl']
     """
     modules_to_test = set()
     for module in changed_modules:
-        modules_to_test = modules_to_test.union(determine_modules_to_test(module.dependent_modules))
+        modules_to_test = modules_to_test.union(
+            determine_modules_to_test(module.dependent_modules, deduplicated))
     modules_to_test = modules_to_test.union(set(changed_modules))
+
+    if not deduplicated:
+        return modules_to_test
+
     # If we need to run all of the tests, then we should short-circuit and return 'root'
     if modules.root in modules_to_test:
         return [modules.root]
```
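To illustrate what the new `deduplicated=False` path does, here is a self-contained toy version of the transitive-closure walk over a made-up dependency graph. The `Module` class below is a hypothetical stand-in for `sparktestsupport.modules.Module`, not Spark's real one, and the sketch omits the root-module short-circuit and topological sort that the real function performs when `deduplicated=True`.

```python
# Toy illustration of the transitive-closure logic above. `Module` here is a
# hypothetical stand-in for sparktestsupport.modules.Module.
class Module:
    def __init__(self, name, dependent_modules=None):
        self.name = name
        self.dependent_modules = dependent_modules or []


def modules_to_test(changed, deduplicated=True):
    result = set()
    for m in changed:
        # Recurse into each changed module's dependents.
        result |= modules_to_test(m.dependent_modules, deduplicated)
    result |= set(changed)
    # With deduplicated=True the real function goes on to short-circuit on
    # the root module and topologically sort; this sketch stops here.
    return result


repl = Module("repl")
sql = Module("sql", [repl])         # in this toy graph, repl depends on sql
core = Module("core", [sql, repl])  # and both depend on core

print(sorted(m.name for m in modules_to_test([core], deduplicated=False)))
# ['core', 'repl', 'sql']
```

Returning the raw, un-deduplicated set is what lets `main()` intersect it with the `--modules` selection from the workflow matrix, as the dev/run-tests.py diff below shows.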
```diff
@@ -488,6 +509,24 @@ def parse_opts():
         "-p", "--parallelism", type="int", default=4,
         help="The number of suites to test in parallel (default %default)"
     )
+    parser.add_option(
+        "-m", "--modules", type="str",
+        default=None,
+        help="A comma-separated list of modules to test "
+             "(default: %s)" % ",".join(sorted([m.name for m in modules.all_modules]))
+    )
+    parser.add_option(
+        "-e", "--excluded-tags", type="str",
+        default=None,
+        help="A comma-separated list of tags to exclude in the tests, "
+             "e.g., org.apache.spark.tags.ExtendedHiveTest "
+    )
+    parser.add_option(
+        "-i", "--included-tags", type="str",
+        default=None,
+        help="A comma-separated list of tags to include in the tests, "
+             "e.g., org.apache.spark.tags.ExtendedHiveTest "
+    )

     (opts, args) = parser.parse_args()
     if args:
```
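The script uses the standard-library `optparse` module (hence the `type="str"` spelling, which `optparse` normalizes to `"string"`). A minimal, runnable reproduction of the new flags, with the help strings simplified relative to the diff:

```python
# Minimal optparse reproduction of the new dev/run-tests.py flags; only the
# option names come from the diff above, the help text is simplified.
from optparse import OptionParser

parser = OptionParser()
parser.add_option("-m", "--modules", type="str", default=None,
                  help="A comma-separated list of modules to test")
parser.add_option("-e", "--excluded-tags", type="str", default=None,
                  help="A comma-separated list of tags to exclude in the tests")
parser.add_option("-i", "--included-tags", type="str", default=None,
                  help="A comma-separated list of tags to include in the tests")

(opts, args) = parser.parse_args(["--modules", "sql,core"])
print(opts.modules)        # sql,core
print(opts.excluded_tags)  # None (optparse turns dashes into underscores)
```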
```diff
@@ -537,40 +576,72 @@ def main():
         # add path for Python3 in Jenkins if we're calling from a Jenkins machine
         os.environ["PATH"] = "/home/anaconda/envs/py3k/bin:" + os.environ.get("PATH")
     else:
-        # else we're running locally and can use local settings
+        # else we're running locally or Github Actions.
         build_tool = "sbt"
         hadoop_version = os.environ.get("HADOOP_PROFILE", "hadoop2.6")
-        test_env = "local"
+        if "GITHUB_ACTIONS" in os.environ:
+            test_env = "github_actions"
+        else:
+            test_env = "local"

     print("[info] Using build tool", build_tool, "with Hadoop profile", hadoop_version,
           "under environment", test_env)

     changed_modules = None
+    test_modules = None
     changed_files = None
-    should_only_test_modules = "TEST_ONLY_MODULES" in os.environ
+    should_only_test_modules = opts.modules is not None
     included_tags = []
+    excluded_tags = []
     if should_only_test_modules:
-        str_test_modules = [m.strip() for m in os.environ.get("TEST_ONLY_MODULES").split(",")]
+        str_test_modules = [m.strip() for m in opts.modules.split(",")]
         test_modules = [m for m in modules.all_modules if m.name in str_test_modules]
-        # Directly uses test_modules as changed modules to apply tags and environments
-        # as if all specified test modules are changed.
+
+        # If we're running the tests in Github Actions, attempt to detect and test
+        # only the affected modules.
+        if test_env == "github_actions":
+            if os.environ["GITHUB_BASE_REF"] != "":
+                # Pull requests
+                changed_files = identify_changed_files_from_git_commits(
+                    os.environ["GITHUB_SHA"], target_branch=os.environ["GITHUB_BASE_REF"])
+            else:
+                # Build for each commit.
+                changed_files = identify_changed_files_from_git_commits(
+                    os.environ["GITHUB_SHA"], target_ref=os.environ["GITHUB_PREV_SHA"])
+
+            modules_to_test = determine_modules_to_test(
+                determine_modules_for_files(changed_files), deduplicated=False)
+
+            if modules.root not in modules_to_test:
+                # If root module is not found, only test the intersected modules.
+                # If root module is found, just run the modules as specified initially.
+                test_modules = list(set(modules_to_test).intersection(test_modules))
+
         changed_modules = test_modules
-        str_excluded_tags = os.environ.get("TEST_ONLY_EXCLUDED_TAGS", None)
-        str_included_tags = os.environ.get("TEST_ONLY_INCLUDED_TAGS", None)
-        excluded_tags = []
-        if str_excluded_tags:
-            excluded_tags = [t.strip() for t in str_excluded_tags.split(",")]
-        included_tags = []
-        if str_included_tags:
-            included_tags = [t.strip() for t in str_included_tags.split(",")]
+        if len(changed_modules) == 0:
+            print("[info] There are no modules to test, exiting without testing.")
+            return
+
+    # If we're running the tests in AMPLab Jenkins, calculate the diff from the targeted branch, and
+    # detect modules to test.
     elif test_env == "amplab_jenkins" and os.environ.get("AMP_JENKINS_PRB"):
         target_branch = os.environ["ghprbTargetBranch"]
         changed_files = identify_changed_files_from_git_commits("HEAD", target_branch=target_branch)
         changed_modules = determine_modules_for_files(changed_files)
+        test_modules = determine_modules_to_test(changed_modules)
         excluded_tags = determine_tags_to_exclude(changed_modules)
+
+    # If there is no changed module found, tests all.
     if not changed_modules:
         changed_modules = [modules.root]
-        excluded_tags = []
+    if not test_modules:
+        test_modules = determine_modules_to_test(changed_modules)
+
+    if opts.excluded_tags:
+        excluded_tags.extend([t.strip() for t in opts.excluded_tags.split(",")])
+    if opts.included_tags:
+        included_tags.extend([t.strip() for t in opts.included_tags.split(",")])
+
     print("[info] Found the following changed modules:",
           ", ".join(x.name for x in changed_modules))

```
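The two `identify_changed_files_from_git_commits` call sites above differ only in which baseline the diff is computed against: the PR target branch (`GITHUB_BASE_REF`) for pull requests, or the previous head of the branch (`GITHUB_PREV_SHA`, i.e. `${{ github.event.before }}`) for push builds. A hypothetical sketch of that decision, with a plain `git diff` as a stand-in for the real helper:

```python
# Hypothetical sketch of the baseline selection above. The env var names are
# the ones GitHub Actions actually sets; diff_names() is a stand-in for
# identify_changed_files_from_git_commits, not its real implementation.
import os
import subprocess


def diff_names(sha, baseline):
    out = subprocess.check_output(["git", "diff", "--name-only", baseline, sha])
    return out.decode("utf-8").splitlines()


def detect_changed_files():
    base_ref = os.environ.get("GITHUB_BASE_REF", "")
    sha = os.environ.get("GITHUB_SHA", "HEAD")
    if base_ref != "":
        # Pull request build: diff against the target branch.
        return diff_names(sha, "origin/" + base_ref)
    # Push build: diff against the previous commit of the branch, exposed to
    # the job as GITHUB_PREV_SHA (${{ github.event.before }}).
    return diff_names(sha, os.environ.get("GITHUB_PREV_SHA", "HEAD~1"))
```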
```diff
@@ -585,8 +656,6 @@ def main():

     should_run_java_style_checks = False
     if not should_only_test_modules:
-        test_modules = determine_modules_to_test(changed_modules)
-
         # license checks
         run_apache_rat_checks()

@@ -643,10 +712,6 @@ def main():


 def _test():
-    if "TEST_ONLY_MODULES" in os.environ:
-        # TODO(SPARK-32252): Enable doctests back in Github Actions.
-        return
-
     import doctest
     failure_count = doctest.testmod()[0]
     if failure_count:
```
