Spark
A simplified, lightweight ETL Framework based on Apache Spark
Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ.
This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]
Native SQL Engine plugin for Spark SQL with vectorized SIMD optimizations.
Read and write TensorFlow TFRecord data from Apache Spark.
Apache Spark - A unified analytics engine for large-scale data processing
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Notes on the design and implementation of Apache Spark
TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
BigDL: Distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
Distributed Deep learning with Keras & Spark
Apache Spark & Python (PySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Jupyter magics and kernels for working with remote Spark clusters
A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.
Qubole Sparklens tool for performance tuning Apache Spark
This is a repo documenting the best practices in PySpark.
An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.
Easy-to-use library for bringing TensorFlow to Apache Spark
Jupyter Notebook extension for Apache Spark integration
Create HTML profiling reports from Apache Spark DataFrames
Toolkit for Apache Spark ML for feature clean-up, feature importance calculation suite, information gain selection, distributed SMOTE, model selection and training, hyperparameter optimization and…
A COBOL parser and Mainframe/EBCDIC data source for Apache Spark
Data Exploration in PySpark made easy - Pyspark_dist_explore provides methods to get fast insights in your Spark DataFrames.
A tool to validate data, built around Apache Spark.



