Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.

Java 3,408 1,170 Updated Nov 29, 2025

salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning

Scala 2,270 401 Updated Sep 29, 2023

maxpumperla / elephas

Distributed deep learning with Keras & Spark

Python 1,577 311 Updated May 1, 2023

jadianes / spark-py-notebooks

Apache Spark & Python (PySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

Jupyter Notebook 1,667 917 Updated Mar 16, 2024

combust / mleap

MLeap: Deploy ML Pipelines to Production

Scala 1,527 315 Updated Dec 16, 2025

hi-primus / optimus

🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

Python 1,538 233 Updated Dec 2, 2024

jupyter-incubator / sparkmagic

Jupyter magics and kernels for working with remote Spark clusters

Python 1,363 455 Updated Sep 9, 2025

fugue-project / fugue

A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.

Python 2,128 98 Updated Dec 2, 2025

qubole / sparklens

Qubole Sparklens tool for performance tuning Apache Spark

Scala 586 143 Updated Jun 26, 2024

ericxiao251 / spark-syntax

This is a repo documenting the best practices in PySpark.

Jupyter Notebook 462 77 Updated Dec 8, 2022

microsoft / hyperspace

An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.

Scala 430 116 Updated Jan 14, 2022

lifeomic / sparkflow

Easy-to-use library to bring TensorFlow to Apache Spark

Python 297 45 Updated Oct 11, 2023

mozilla / jupyter-spark

Jupyter Notebook extension for Apache Spark integration

JavaScript 191 32 Updated Dec 1, 2020

julioasotodv / spark-df-profiling

Create HTML profiling reports from Apache Spark DataFrames

Python 197 76 Updated Feb 2, 2020

databrickslabs / automl-toolkit

Toolkit for Apache Spark ML for feature clean-up, feature importance calculation suite, information gain selection, distributed SMOTE, model selection and training, hyperparameter optimization and…

HTML 192 44 Updated Jun 1, 2021

AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

Scala 157 86 Updated Nov 28, 2025

Bergvca / pyspark_dist_explore

Data Exploration in PySpark made easy - Pyspark_dist_explore provides methods to get fast insights into your Spark DataFrames.

Python 102 16 Updated Aug 20, 2019

target / data-validator

A tool to validate data, built around Apache Spark.

Scala 100 34 Updated Dec 15, 2025

Rafael Ribeiro rafpyprog

Spark

YotpoLtd / metorikku

lucidworks / spark-solr

databricks / LearningSparkV2

oap-project / gazelle_plugin

linkedin / spark-tfrecord

apache / spark

horovod / horovod

JerryLead / SparkInternals

yahoo / TensorFlowOnSpark

microsoft / SynapseML

databricks / koalas

intel / BigDL

apache / linkis