A project demonstrating the automated generation of PowerPoint presentations from web-scraped cricket data, showcasing data analysis and visualization skills.
This project showcases my ability to automate the generation of PowerPoint presentations from web-scraped data. I've developed a Python-based pipeline that extracts cricket statistics, processes them, and creates visually appealing PPTs with data visualizations.
- I employed web scraping techniques to gather data on cricket players and their century records from various online sources.
- The collected data encompasses player details (name, date of birth, place of birth, family information), career statistics (total centuries, centuries per year), and match information.
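As a minimal sketch of the scraping step, the snippet below pulls rows out of an HTML table using only the standard library. The sample markup, the column layout, and the `TableRowParser` name are assumptions for illustration; the project's actual sources and parsing library are not specified here.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, None outside <tr>
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # discard any cell left open by bad markup


# Hypothetical markup standing in for a downloaded statistics page.
SAMPLE_HTML = """
<table>
  <tr><th>Player</th><th>Year</th><th>Centuries</th></tr>
  <tr><td>Player A</td><td>2019</td><td>5</td></tr>
  <tr><td>Player B</td><td>2019</td><td>3</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)
header, *records = parser.rows
print(header)
print(records)
```

In a real run the markup would come from an HTTP response rather than a string constant, and the rows would then be handed to the data processing stage.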
- The scraped data was organized into efficient Pandas DataFrames for streamlined manipulation.
- Data cleaning and transformation steps were performed to address missing values and ensure data integrity.
- The processed DataFrames were saved as Excel files (personal_data.xlsx and processed_data.xlsx) for intermediate storage.
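The cleaning and aggregation steps might look roughly like the sketch below. The column names (`player`, `year`, `centuries`), the sample rows, and the choice to treat a missing count as zero are all assumptions for illustration, not the project's actual schema or rules.

```python
import pandas as pd

# Hypothetical raw records as they might arrive from the scraper.
raw = pd.DataFrame(
    {
        "player": ["Player A", "Player A", "Player B", "Player B", None],
        "year": ["2018", "2019", "2019", "2019", "2020"],
        "centuries": [2, None, 1, 3, 4],
    }
)


def clean_centuries(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without a player, fill missing counts with 0, fix dtypes."""
    cleaned = df.dropna(subset=["player"]).copy()
    cleaned["year"] = cleaned["year"].astype(int)
    cleaned["centuries"] = cleaned["centuries"].fillna(0).astype(int)
    return cleaned


def centuries_per_year(df: pd.DataFrame) -> pd.DataFrame:
    """Sum centuries for each player and year."""
    return df.groupby(["player", "year"], as_index=False)["centuries"].sum()


processed = centuries_per_year(clean_centuries(raw))
print(processed)

# Writing to Excel requires an engine such as openpyxl to be installed:
# processed.to_excel("processed_data.xlsx", index=False)
```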
- I developed a Python script leveraging the python-pptx library to automate the creation of visually informative PowerPoint presentations.
- The script dynamically reads data from the generated Excel files to create individual slides for each player, including:
  - Comprehensive player information
  - Detailed century statistics
  - Well-structured tables summarizing key data points
- Matplotlib was utilized to generate insightful visualizations of century trends over the years, presented as clear bar and line plots within the PPT.
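A minimal sketch of how one player slide could be assembled with python-pptx and Matplotlib is shown below. The function names, the slide geometry, and the `{year: centuries}` input shape are assumptions for illustration; the project's own classes may be organized differently.

```python
from io import BytesIO


def stats_table_rows(stats):
    """Turn {year: centuries} into a header plus rows of strings for a table."""
    rows = [("Year", "Centuries")]
    rows += [(str(year), str(count)) for year, count in sorted(stats.items())]
    return rows


def render_bar_chart(stats):
    """Draw centuries per year as a bar chart and return it as PNG bytes."""
    # Imported lazily so stats_table_rows works without Matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # headless backend, no display required
    import matplotlib.pyplot as plt

    years = sorted(stats)
    fig, ax = plt.subplots(figsize=(5, 3))
    ax.bar([str(y) for y in years], [stats[y] for y in years])
    ax.set_xlabel("Year")
    ax.set_ylabel("Centuries")
    buffer = BytesIO()
    fig.savefig(buffer, format="png", dpi=150, bbox_inches="tight")
    plt.close(fig)
    buffer.seek(0)
    return buffer


def build_player_deck(name, stats, output_path):
    """Create a one-slide deck with a title, a stats table, and a chart."""
    from pptx import Presentation
    from pptx.util import Inches

    prs = Presentation()
    # Layout 5 is "Title Only" in python-pptx's default template.
    slide = prs.slides.add_slide(prs.slide_layouts[5])
    slide.shapes.title.text = name

    rows = stats_table_rows(stats)
    frame = slide.shapes.add_table(
        len(rows), 2, Inches(0.5), Inches(1.5), Inches(3.5), Inches(0.4 * len(rows))
    )
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            frame.table.cell(r, c).text = value

    # add_picture accepts a file-like object, so no temporary file is needed.
    slide.shapes.add_picture(
        render_bar_chart(stats), Inches(4.5), Inches(1.5), width=Inches(5)
    )
    prs.save(output_path)
```

Calling `build_player_deck("Player A", {2018: 2, 2019: 5}, "player_demo.pptx")` would write a single-slide file; the player name, figures, and file name here are placeholders.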
- To provide a more engaging data exploration experience, I created interactive visualizations using Plotly and embedded them in separate HTML pages.
- These dynamic graphs allow for features like zooming and tooltips, enabling deeper data analysis.
- (Note: The generated PPTs are static. The interactive graphs reside in separate HTML files. Future integration could involve linking from the PPT or exporting the PPT to an HTML format.)
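One way to produce such a standalone interactive page is sketched below with Plotly's `write_html`. The helper names, the file naming scheme, and the `{year: centuries}` input are assumptions for illustration.

```python
import re
from pathlib import Path


def html_filename(player):
    """Derive a file name such as 'player_a_centuries.html' from a name."""
    slug = re.sub(r"[^a-z0-9]+", "_", player.lower()).strip("_")
    return f"{slug}_centuries.html"


def write_interactive_chart(player, stats, output_dir="."):
    """Save centuries per year as a standalone interactive HTML page."""
    # Imported lazily so html_filename stays usable without Plotly installed.
    import plotly.graph_objects as go

    years = sorted(stats)
    fig = go.Figure(
        go.Scatter(x=years, y=[stats[y] for y in years], mode="lines+markers")
    )
    fig.update_layout(
        title=f"Centuries per year: {player}",
        xaxis_title="Year",
        yaxis_title="Centuries",
    )
    path = Path(output_dir) / html_filename(player)
    # include_plotlyjs="cdn" keeps the file small by loading plotly.js online.
    fig.write_html(str(path), include_plotlyjs="cdn")
    return path
```

Zooming and hover tooltips come with Plotly's default figure controls. For the linking idea in the note above, python-pptx exposes `run.hyperlink.address` on text runs, which could point at the generated file.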
The project's codebase is thoughtfully structured for clarity and maintainability:
- runner.ipynb: This Jupyter Notebook orchestrates the entire PPT creation process, acting as the main execution point.
- prepare_data.ipynb: This notebook manages the data gathering and web scraping phases of the project.
- ppt_generator.py (Class): This Python class is responsible for data transformation, the generation of static graphs (for the PPT), and the creation of interactive HTML versions of these graphs.
- custom_presentation.py (Class): This class handles the styling of the PowerPoint presentation, the creation of individual slides, and the population of these slides with text, tables, and images.
- runner.ipynb: Modify the PPT_DATA variable within this file to adjust the filters applied when generating the PPTs (e.g., specific player groups or data ranges).
- prepare_data.ipynb: Update this file to modify the data sources or the web scraping logic to work with different or updated cricket statistics.
- ppt_generator.py (Class): Alter the data filtering and transformation logic within this class. You can also customize the appearance of the static graphs (for the PPT) and the interactive Plotly graphs (in HTML) here.
- custom_presentation.py (Class): Modify this file to change the overall style of the generated PowerPoint presentations, including the logo, color scheme, slide layouts, and font styles.
- prepare_data.ipynb: Execute this notebook first to fetch and process the latest cricket data. This step will generate or update the personal_data.xlsx and processed_data.xlsx files.
- runner.ipynb or main.py: After successfully running prepare_data.ipynb (or if you already have the personal_data.xlsx and processed_data.xlsx files), execute this notebook or the Python file. This will trigger the PPT generation process, creating the output PowerPoint files.
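Because the second step depends on the files written by the first, a small guard like the following could be run before generation starts. This is a sketch only: `missing_inputs` and `check_ready` are hypothetical helpers and are not claimed to exist in runner.ipynb or main.py.

```python
from pathlib import Path

REQUIRED_FILES = ("personal_data.xlsx", "processed_data.xlsx")


def missing_inputs(data_dir="."):
    """List the required Excel files that are not present in data_dir."""
    base = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


def check_ready(data_dir="."):
    """Raise a clear error if the data preparation outputs are absent."""
    missing = missing_inputs(data_dir)
    if missing:
        raise FileNotFoundError(
            f"Missing {', '.join(missing)}; run prepare_data.ipynb first."
        )
```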
The project currently generates two distinct PowerPoint presentations based on player gender: "player_Male.pptx" containing analysis for male cricket players and "player_Female.pptx" for female players.
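The per-gender split can be pictured with the short sketch below. The record layout and both helper functions are assumptions for illustration; the real PPT_DATA structure and filtering code are not shown in this document.

```python
def select_players(players, **filters):
    """Return the players whose fields match every given filter value."""
    return [p for p in players if all(p.get(k) == v for k, v in filters.items())]


def output_filename(gender):
    """Build the deck name for one gender group, e.g. player_Male.pptx."""
    return f"player_{gender}.pptx"


# Hypothetical records; the real data lives in the generated Excel files.
players = [
    {"name": "Player A", "gender": "Male"},
    {"name": "Player B", "gender": "Female"},
    {"name": "Player C", "gender": "Male"},
]

for gender in ("Male", "Female"):
    group = select_players(players, gender=gender)
    print(output_filename(gender), [p["name"] for p in group])
```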
This project relies on the following Python libraries. Ensure they are installed in your environment:
- matplotlib
- python-pptx
- pandas
- plotly
- scipy
You can install these dependencies using pip:
pip install matplotlib python-pptx pandas plotly scipy
- Robust error handling has been implemented within the web scraping and data processing stages to ensure the pipeline's stability.
- The codebase is designed with a modular architecture to enhance maintainability and facilitate future extensions.
- Potential future improvements include:
  - Implementing dynamic data updates to keep the presentations current.
  - Utilizing external configuration files for easier customization.
  - Further breaking down the code into smaller, more specialized modules.
  - Adding comprehensive logging for better monitoring and debugging.
  - Incorporating unit tests to ensure the reliability of individual components.
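Two of these ideas, external configuration and logging, could start from something as small as the sketch below. The logger name, the setting keys, and the default values are hypothetical and do not describe existing project code.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("ppt_pipeline")

# Hypothetical defaults, used when no configuration file is supplied.
DEFAULTS = {"genders": ["Male", "Female"], "output_dir": "output"}


def load_config(path):
    """Merge settings from a JSON file over DEFAULTS; fall back if missing."""
    config = dict(DEFAULTS)
    file = Path(path)
    if file.is_file():
        config.update(json.loads(file.read_text(encoding="utf-8")))
        logger.info("Loaded configuration from %s", file)
    else:
        logger.warning("Config file %s not found, using defaults", file)
    return config
```

A function this small is also easy to cover with a unit test, for example by asserting that a missing file yields the defaults.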
This project effectively demonstrates my ability to integrate diverse technical skills to create a fully automated workflow, from extracting raw data to generating insightful and visually appealing presentations. I am enthusiastic about leveraging and further developing these skills in a professional environment.
Feel free to reach out if you have any questions, suggestions, or would like to collaborate!