
Conversation

@dongjoon-hyun (Member) commented Jun 14, 2025

What changes were proposed in this pull request?

Add Spark History Server example.

Why are the changes needed?

Since Apache Spark 4.0, Spark rolls event logs by default and compresses them by default.

However, additional configuration is still needed to let SHS manage the event log directories. This PR aims to provide a Spark History Server example with that configuration.
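For background, the settings involved look roughly like this in spark-defaults.conf form (a sketch, not part of this PR; the property names are from the Spark configuration docs, and the values shown are illustrative):

```properties
# Application side: Spark 4.0 rolls and compresses event logs by default
spark.eventLog.enabled                              true
spark.eventLog.rolling.enabled                      true
spark.eventLog.compress                             true

# History Server side: clean up aged logs and retain only recent rolled files
spark.history.fs.cleaner.enabled                    true
spark.history.fs.eventLog.rolling.maxFilesToRetain  10
```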

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manual review.

Was this patch authored or co-authored using generative AI tooling?

No.

```yaml
applicationTolerations:
  restartConfig:
    restartPolicy: Always
    maxRestartAttempts: 9223372036854775807
```
@dongjoon-hyun (Member Author):

Since the default value of maxRestartAttempts is 3 (for normal jobs), I set this as Long.MAX_VALUE because SHS is supposed to run forever.

```
$ jshell
|  Welcome to JShell -- Version 21.0.7
|  For an introduction type: /help intro

jshell> Long.MAX_VALUE
$1 ==> 9223372036854775807
```

@dongjoon-hyun (Member Author):

Thank you, @viirya . Merged to main.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-52481 branch June 14, 2025 23:19
@melin commented Jun 15, 2025

The Spark event log cannot be written directly to s3. I get the following error:

```
WARN S3ABlockOutputStream: Application invoked the Syncable API against stream writing to logs/spark/events/eventlog_v2_spark-61bc831c6f774629a1cd1bc95d6a7288/events_1_spark-61bc831c6f774629a1cd1bc95d6a7288. This is unsupported
```

@dongjoon-hyun (Member Author) commented Jun 16, 2025

@melin , it's a warning, not an error.

> The Spark event log cannot be directly written to s3. There are the following errors:

FYI, Apache Spark 3.2+ has the following by default.

@melin commented Jun 17, 2025

> @melin , it's a warning, not an error.
>
> > The Spark event log cannot be directly written to s3. There are the following errors:
>
> FYI, Apache Spark 3.2+ has the following by default.

spark.eventLog.dir cannot be conveniently set directly to an s3 path. There are two workarounds:

  1. Write directly to NFS shared storage.
  2. Write to the pod's local disk first, then upload to s3 once the task completes.

@dongjoon-hyun (Member Author) commented Jun 17, 2025

> > @melin , it's a warning, not an error.
> >
> > > The Spark event log cannot be directly written to s3. There are the following errors:
> >
> > FYI, Apache Spark 3.2+ has the following by default.
>
> spark.eventLog.dir cannot be directly set to s3 path, which is not very convenient. There are two solutions:
>
> 1. Write directly to the nfs shared storage
> 2. First, write to the pod locally. Once the task is completed, upload it to s3.

It seems that the above are not questions.

So, to be clear: no, your personal assessments and proposed solutions sound wrong to me, @melin . They don't make sense for long-running jobs and streaming jobs. In other words, that's not a community recommendation.

Apache Spark community has been using S3 (and S3-compatible object storage) directly as an Apache Spark event log directory for a long time.
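A minimal sketch of such a setup (the bucket name is hypothetical; the s3a option is from Hadoop's S3A documentation and is what downgrades the Syncable warning discussed above):

```properties
spark.eventLog.enabled                              true
spark.eventLog.dir                                  s3a://my-bucket/spark-events
# Hadoop 3.3.1+ / Spark 3.2+ default: treat hflush/hsync as no-ops on S3A
spark.hadoop.fs.s3a.downgrade.syncable.exceptions   true
```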

BTW, an Apache Spark PR is not a good place for this kind of Q&A. I'd recommend sending a message to [email protected] if you have any difficulties or want to discuss this.

@dongjoon-hyun (Member Author):

FYI, @melin . Here is the example.

@Ferdinanddb:
Hi @dongjoon-hyun , thank you for your work here.

I am trying to deploy a spark-history-server pod using your example. I see it running, but I cannot port-forward to it because port 18080 is not being used by the pod (and therefore not by the service).

Could you please tell me what I should do to fix this?

I have the following config:

```yaml
apiVersion: spark.apache.org/v1
kind: SparkApplication
metadata:
  name: spark-history-server
  namespace: spark-operator
spec:
  mainClass: "org.apache.spark.deploy.history.HistoryServer"
  sparkConf:
    # spark.jars.packages: "org.apache.hadoop:hadoop-aws:3.4.1"
    spark.jars: "https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/3.1.1/gcs-connector-3.1.1-shaded.jar"
    spark.jars.ivy: "/tmp/.ivy2.5.2"
    spark.driver.memory: "2g"
    spark.kubernetes.namespace: spark-operator
    spark.kubernetes.authenticate.driver.serviceAccountName: "spark"
    spark.kubernetes.container.image: "apache/spark:4.0.0-java21-scala"
    spark.history.fs.cleaner.enabled: "true"
    spark.history.fs.cleaner.maxAge: "30d"
    spark.history.fs.cleaner.maxNum: "100"
    spark.history.fs.eventLog.rolling.maxFilesToRetain: "10"

    spark.hadoop.fs.gs.impl: "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
    spark.hadoop.fs.AbstractFileSystem.gs.impl: "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
    spark.hadoop.fs.gs.auth.service.account.enable: "true"
    # spark.hadoop.fs.gs.auth.type: "COMPUTE_ENGINE"
    spark.hadoop.fs.gs.project.id: "rcim-prod-data-core-0"
    spark.hadoop.fs.defaultFS: "gs://SOME-BUCKET-GCS-spark-logs"
    spark.history.fs.logDirectory: "gs://SOME-BUCKET-GCS-spark-logs"
  runtimeVersions:
    sparkVersion: "4.0.0"
  applicationTolerations:
    restartConfig:
      restartPolicy: Always
      maxRestartAttempts: 9223372036854775807
```

I changed the value of my GCS bucket name, but again: I don't see any errors in the pod's logs.

When I execute the following port-forward command I get this:

```
kubectl port-forward svc/spark-history-server-0-driver-svc 8080:spark-ui --namespace spark-operator
Forwarding from 127.0.0.1:8080 -> 4040
Forwarding from [::1]:8080 -> 4040
Handling connection for 8080
E0630 21:06:51.716348 2639361 portforward.go:424] "Unhandled Error" err="an error occurred forwarding 8080 -> 4040: error forwarding port 4040 to pod 76c12381718e2542ae6e371d8c28992dc9e541439cd53b832df5aa5aaaeddbac, uid : failed to execute portforward in network namespace \"/var/run/netns/cni-5208164a-41a3-2769-bfaf-6b6f31eb1c80\": failed to connect to localhost:4040 inside namespace \"76c12381718e2542ae6e371d8c28992dc9e541439cd53b832df5aa5aaaeddbac\", IPv4: dial tcp4 127.0.0.1:4040: connect: connection refused IPv6 dial tcp6 [::1]:4040: connect: cannot assign requested address "
```

Thank you very much if you can help!

@dongjoon-hyun (Member Author):

You need to use spark.ui.port. Let me revise the example for you.
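A minimal sketch of that revision, assuming the SparkApplication schema used in the config above (spark.ui.port is a standard Spark property; the History Server's driver process serves its web UI on that port):

```yaml
spec:
  sparkConf:
    # Make the History Server web UI listen on 18080 instead of the
    # default driver UI port (4040), so the Service port-forward works.
    spark.ui.port: "18080"
```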

@dongjoon-hyun (Member Author):

Here is the PR, @Ferdinanddb .

@Ferdinanddb:
@dongjoon-hyun thanks for taking the time on this! I confirm that it now works :).
