This repository was archived by the owner on Dec 4, 2024. It is now read-only.

Conversation

@akirillov
Contributor

@akirillov akirillov commented Nov 28, 2018

What changes were proposed in this pull request?

Resolves DCOS-41580

The new integration test launches a Spark Structured Streaming job with checkpointing backed by HDFS, feeds test data to Kafka, and verifies that when a failure happens the aggregated data is recovered from a checkpoint rather than reprocessed. All components are Kerberized.

Apart from the test, this PR refactors the KDC/HDFS/Kafka-dependent tests so that these components can be reused across tests without being reinstalled every time.

Changes:

  • Spark structured streaming checkpointing recovery integration test
  • session-scoped shared KDC, HDFS, and Kafka services
  • separate service accounts and secrets for Kafka and HDFS
  • shared KDC, HDFS, and Kafka configuration and tooling extracted into 'fixtures' modules
  • Kerberized Kafka Feeder and Spark Structured Streaming test jobs
  • Spark version upgrade to 2.3.2
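The recovery mechanism under test hinges on the checkpoint location passed to the streaming job. A rough sketch of the kind of submit configuration involved (all paths and class names here are hypothetical, not the PR's actual values):

```python
# Hypothetical sketch: the HDFS checkpoint directory is what lets a restarted
# driver resume aggregation state instead of reprocessing already-seen Kafka data.
CHECKPOINT_DIR = "hdfs:///checkpoints/structured-streaming"  # hypothetical path

def submit_args(app_class: str, checkpoint_dir: str) -> list:
    # fixed-size executors, matching the style used elsewhere in these tests
    return [
        "--conf", "spark.cores.max=2",
        "--conf", "spark.executor.cores=1",
        "--class", app_class,  # hypothetical class name passed in below
        "--checkpoint-location", checkpoint_dir,
    ]

args = submit_args("StructuredStreamingWithCheckpointing", CHECKPOINT_DIR)
```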

How were these changes tested?

Integration tests from this repo

Release Notes

  • Upgrade of Spark version to 2.3.2
  • Structured Streaming checkpointing and recovery integration test

@akirillov akirillov force-pushed the DCOS-41580-spark-structured-streaming-checkpointing-test branch 4 times, most recently from 48dc64d to 9589ab1 Compare December 3, 2018 06:22
Contributor

@elezar elezar left a comment

Thanks @akirillov.

I have made some comments regarding the Python testing infrastructure here. It's a good cleanup, so these are generally minor. I didn't spend too much time reviewing the Spark code specifically.

this_file_dir = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.normpath(os.path.join(this_file_dir, '..', '..', 'tests')))
sys.path.append(os.path.normpath(os.path.join(this_file_dir, '..', '..', 'testing')))
sys.path.append(os.path.normpath(os.path.join(this_file_dir, '..', '..', 'spark-testing'))) No newline at end of file
Contributor

Nit: Newline:

Suggested change
sys.path.append(os.path.normpath(os.path.join(this_file_dir, '..', '..', 'spark-testing')))
sys.path.append(os.path.normpath(os.path.join(this_file_dir, '..', '..', 'spark-testing')))

Contributor Author

added
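As an aside, the repeated sys.path boilerplate in these init files could be wrapped in a small helper; a hypothetical sketch, not part of this PR:

```python
import os
import sys

def add_repo_paths(init_file: str, *relative_dirs: str) -> None:
    """Append directories two levels above init_file's directory to sys.path."""
    base = os.path.dirname(os.path.abspath(init_file))
    for rel in relative_dirs:
        path = os.path.normpath(os.path.join(base, "..", "..", rel))
        if path not in sys.path:  # avoid duplicate entries across test modules
            sys.path.append(path)

# as it would be called from, e.g., tests/fixtures/__init__.py
add_repo_paths("tests/fixtures/__init__.py", "tests", "testing", "spark-testing")
```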

additional_options=service_kerberos_options,
timeout_seconds=30 * 60)

yield kerberos_env
Contributor

Why does this yield kerberos_env? This fixture doesn't modify the env at all (that I can tell).

Contributor Author

removed in favour of using kerberos_env throughout the dependent tests and fixtures.


finally:
sdk_install.uninstall(HDFS_PACKAGE_NAME, HDFS_SERVICE_NAME)
kerberos_env.cleanup()
Contributor

In my opinion, this should be handled in the fixture providing kerberos_env.

Contributor Author

cleanup moved to KDC fixture



@pytest.fixture(scope='session')
def setup_hdfs_client(hdfs_with_kerberos):
Contributor

It is strange that this depends on hdfs_with_kerberos without accessing it. Should the client depend on the service in some way?

Contributor Author

The main intent of this dependency is to enforce ordering, so that the HDFS client is never created before the HDFS service. What would be a better way of implementing this?

Contributor

I realise that it's about ordering, so that's fine, but the dependence is not really explicit here. It takes knowledge of the client itself to know that it requires an HDFS service with the name hdfs.

What I have used in the past is something like: https://github.com/mesosphere/dcos-commons/blob/d83139ac0fb9c53719a9e5637a5704ae9dfe23d5/frameworks/hdfs/tests/test_kerberos_auth.py#L101

where hdfs_with_kerberos is a dict containing the required configuration, and this is explicitly passed to the client. It's not a requirement for this PR though.
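A minimal sketch of that pattern, with hypothetical names and plain functions standing in for the session-scoped pytest fixtures:

```python
def hdfs_service_info(kerberos_env: dict) -> dict:
    # what an hdfs_with_kerberos fixture could yield: enough configuration
    # for dependents to connect, instead of an implicit ordering-only dependency
    return {
        "service_name": "hdfs",
        "realm": kerberos_env["realm"],
        "kdc_hostname": kerberos_env["hostname"],
    }

def hdfs_client_options(service_info: dict) -> dict:
    # the client consumes the fixture's value, making the dependency explicit
    return {
        "service": service_info["service_name"],
        "kerberos": {
            "enabled": True,
            "realm": service_info["realm"],
            "kdc": {"hostname": service_info["kdc_hostname"]},
        },
    }

opts = hdfs_client_options(
    hdfs_service_info({"realm": "LOCAL", "hostname": "kdc.marathon.mesos"})
)
```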

"enabled": True,
"realm": sdk_auth.REALM,
"kdc": {
"hostname": hdfs_with_kerberos.get_host(),
Contributor

I think that we should explicitly use kerberos_env here.

Contributor Author

switched to kerberos_env

finally:
sdk_install.uninstall(KAFKA_PACKAGE_NAME, KAFKA_SERVICE_NAME)
kerberos_env.cleanup()
print('noop') No newline at end of file
Contributor

Suggested change
print('noop')

Contributor Author

this wasn't intended to be committed, removed


finally:
sdk_install.uninstall(KAFKA_PACKAGE_NAME, KAFKA_SERVICE_NAME)
kerberos_env.cleanup()
Contributor

As is the case with HDFS, we should not cleanup the Kerberos environment here.

Contributor Author

cleanup moved to KDC fixture

yield kerberos_env

finally:
print('noop')
Contributor

Suggested change
print('noop')

Contributor Author

this wasn't intended to be committed, removed

setup_hdfs_paths()

# running kafka producer
messages = ["test"] * 100
Contributor

Would it make sense to make these messages unique for a specific run, e.g. ["test0", "test1", ...]?

Suggested change
messages = ["test"] * 100
messages = ["test%d" % d for d in range(100)]

Contributor Author

Good catch! The new implementation uses two different word sets to verify both the count accuracy and aggregated data checkpointing.
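The idea can be sketched like this (hypothetical helper; the actual test feeds one word set before the driver failure and another after it):

```python
def make_word_set(prefix: str, n: int) -> list:
    # unique, prefixed messages make recovered state distinguishable
    # from reprocessed data after the driver restarts
    return ["%s-%d" % (prefix, i) for i in range(n)]

before_failure = make_word_set("first", 100)
after_failure = make_word_set("second", 100)

# disjoint sets, unique within each set
assert not set(before_failure) & set(after_failure)
assert len(set(before_failure)) == len(before_failure) == 100
```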


import spark_utils as utils
from tests.test_kafka import KAFKA_PACKAGE_NAME, test_pipeline
# from tests.test_kafka import KAFKA_PACKAGE_NAME, test_pipeline
Contributor

Let's rather remove this than comment it out.

Contributor Author

this was commented out by mistake; reverted now.

@akirillov akirillov force-pushed the DCOS-41580-spark-structured-streaming-checkpointing-test branch 2 times, most recently from 751ebf4 to 9c26c8c Compare December 3, 2018 20:00
@akirillov
Contributor Author

thanks for your comments @elezar. I cleaned up the code according to your suggestions and left a comment related to a wider refactoring. The Structured Streaming test itself is modified now to avoid false positives by using different word sets fed into Kafka before and after driver failure.

@akirillov akirillov force-pushed the DCOS-41580-spark-structured-streaming-checkpointing-test branch 2 times, most recently from ea2d212 to 1bcbe0b Compare December 3, 2018 22:05
Contributor

@elezar elezar left a comment

Thanks @akirillov.

I think the remaining points I had were more specific to any follow-up refactoring work that may be done.

Contributor

@samvantran samvantran left a comment

Overall looks good and I appreciate the cleanup/fixtures. Just a couple of comments/questions from my side.

manifest.json Outdated
"default_spark_dist": {
"hadoop_version": "2.7",
"uri": "https://downloads.mesosphere.com/spark/assets/spark-2.2.1-4-SNAPSHOT-bin-2.7.tgz"
"uri": "https://downloads.mesosphere.com/spark/assets/spark-2.3.2-1-SNAPSHOT-bin-2.7.3.tgz"
Contributor

Is the appended -1 in 2.3.2-1 from the upstream Spark tar creation or our own? If it's from us, we can probably do away with appending -1 to the actual version, as SNAPSHOT is enough to indicate dev status.

So downloads.mesosphere.com/.../spark-2.3.2-SNAPSHOT-bin.... If it's from upstream, please disregard.

Contributor Author

The Jenkins upload job fails if an object with the same key already exists, which complicated the SNAPSHOT replacement. Overall, I think we should still keep some sort of versioning to avoid failures when a snapshot is overwritten with broken/incompatible changes.

Contributor

If this is the case, do we need the word SNAPSHOT at all? We could go with a simpler versioning scheme like 2.3.2.1 or 2.3.2-1 if we make it known that the last digit represents a dev build. Or perhaps the proper thing to do is tag the file with a date-time.
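A date-time tag could look like this (a hypothetical naming scheme, purely to illustrate the suggestion):

```python
from datetime import datetime, timezone

def dev_artifact_name(spark_version: str, hadoop_version: str, stamp: str = "") -> str:
    # a UTC timestamp in the object key avoids the "key already exists"
    # upload failure without relying on -N-SNAPSHOT counters
    stamp = stamp or datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return "spark-%s-dev-%s-bin-%s.tgz" % (spark_version, stamp, hadoop_version)

name = dev_artifact_name("2.3.2", "2.7.3")
```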

manifest.json Outdated
{
"hadoop_version": "2.7",
"uri": "https://downloads.mesosphere.com/spark/assets/spark-2.2.1-4-SNAPSHOT-bin-2.7.tgz"
"uri": "https://downloads.mesosphere.com/spark/assets/spark-2.3.2-1-SNAPSHOT-bin-2.7.3.tgz"
Contributor

Don't we plan to use hadoop 2.7.7?

Contributor Author

There's a separate ticket for porting the CVE-fix commit from branch 2.2 to 2.3; in the current branch, Hadoop is at version 2.7.3. It makes sense to (at least) update the Hadoop dependency in the 2.3 branch first and use the updated tar here, to avoid having an outdated master in spark-build.

Contributor

Okay let's port the CVE updates first and rebase this

},
"kafka": {
"default_replication_factor": 3,
"num_partitions": 32
Contributor

This is what solved the hanging kafka consumers in the tests, correct?

Contributor Author

Yes, this unstuck the Spark Structured Streaming consumer. The spark_kafka test has stabilized too, but it looks like the stuck-consumption issue is still present there (a long freeze before actual consumption starts).

@@ -1,4 +1,4 @@
lazy val SparkVersion = sys.env.get("SPARK_VERSION").getOrElse("2.2.0")
lazy val SparkVersion = sys.env.getOrElse("SPARK_VERSION", "2.3.2")
Contributor

Not this file, but I think you accidentally added tests/jobs/scala/.DS_Store. Please remove it (you can also add .DS_Store to .gitignore to guard against this in the future).

Contributor Author

thanks for this catch! I'll add it to .gitignore and remove the file from the PR.

} catch {
case t: Throwable =>
t.printStackTrace()
Thread.sleep(30*60*1000)
Contributor

why sleep for 30 minutes?

Contributor Author

This was added to ease Kafka producer troubleshooting when errors occur (it triggers only when an exception is thrown). After verifying functionality, it looks like it's no longer needed.


# Wait until executor is running
LOGGER.info("Starting supervised driver {}".format(driver_task_id))
sdk_tasks.check_running(SPARK_APPLICATION_NAME, 1, timeout_seconds=600)
Contributor

Can you add expected_task_count=1? Otherwise the lone 1 is a bit mysterious. Or use a named constant, since I see something similar on L117.

Contributor Author

good catch. fixed that.


from tests.fixtures.hdfs import HDFS_SERVICE_NAME, HDFS_DATA_DIR, HDFS_HISTORY_DIR
from tests.fixtures.hdfs import HISTORY_PACKAGE_NAME
from tests.fixtures.hdfs import SPARK_SUBMIT_HDFS_KERBEROS_ARGS, KEYTAB_SECRET_PATH, GENERIC_HDFS_USER_PRINCIPAL
Contributor

nit: need 2 line breaks b/w imports and code

Contributor Author

added

@pytest.mark.skipif(not utils.hdfs_enabled(), reason='HDFS_ENABLED is false')
@pytest.mark.sanity
def test_history():
def test_history(kerberized_spark, hdfs_with_kerberos, setup_history_server):
Contributor

if you're depending on kerberized_spark (which internally depends on setup_history_server and hdfs_with_kerberos), do you need to list those two again here?

Contributor Author

The main intention here is to explicitly list the dependencies of the test to ease troubleshooting. It's not only in this test; a little bit of redundancy also provides a better view of the context.

Contributor

okay that's fair

kerberos_args = get_kerberized_kafka_spark_conf(spark_service_name)

producer_config = ["--conf", "spark.cores.max=2", "--conf", "spark.executor.cores=2",
producer_config = ["--conf", "spark.cores.max=2", "--conf", "spark.executor.cores=1",
Contributor

This will mean that we spawn 2 producers; is this intended?

Contributor Author

yes, the main goal here is to lower a single executor's resource demands while keeping parallelism at the same level.

sdk_tasks.check_running(KAFKA_SERVICE_NAME, 1, timeout_seconds=600)

consumer_config = ["--conf", "spark.cores.max=4", "--class", "KafkaConsumer"] + common_args
consumer_config = ["--conf", "spark.cores.max=2", "--conf", "spark.executor.cores=1",
Contributor

2 producers, 2 consumers - okay, looks like it is intended

Contributor Author

yes, it's the same here: fixed-size executors preserving the specified parallelism level.
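For reference, the executor count follows from the coarse-grained scheduling rule: total cores are capped by spark.cores.max and each executor claims spark.executor.cores, so:

```python
def num_executors(cores_max: int, executor_cores: int) -> int:
    # Spark launches floor(spark.cores.max / spark.executor.cores) executors
    return cores_max // executor_cores

assert num_executors(2, 2) == 1  # old config: one 2-core executor
assert num_executors(2, 1) == 2  # new config: two 1-core executors
```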

@akirillov akirillov force-pushed the DCOS-41580-spark-structured-streaming-checkpointing-test branch 5 times, most recently from e2e1ebf to 1889404 Compare December 5, 2018 21:18
Contributor

@samvantran samvantran left a comment

Approving since tests pass and my comments are non-blockers (other than updating the tgz to include the latest 2.3 CVE fixes)

@akirillov akirillov force-pushed the DCOS-41580-spark-structured-streaming-checkpointing-test branch from 1889404 to a09f417 Compare December 12, 2018 19:22
@akirillov akirillov force-pushed the DCOS-41580-spark-structured-streaming-checkpointing-test branch from a09f417 to aa1f35e Compare December 12, 2018 21:21
@akirillov akirillov merged commit a75ae0d into master Dec 13, 2018
@vishnu2kmohan vishnu2kmohan deleted the DCOS-41580-spark-structured-streaming-checkpointing-test branch February 19, 2019 19:12