
25+ Data Science Projects to Build in 2026 (With GitHub Links)

By BuildIdeas Team · May 28, 2026 · 8 min read
Updated: May 2026

Beginner Projects

House Price Prediction with Regression & Streamlit Deployment

Build an end-to-end regression pipeline to predict residential house prices using features like location, area, bedrooms, and bathrooms. Includes EDA, feature engineering, model comparison, and a live Streamlit web app for real-time predictions — a clean, complete beginner project.

Python · Pandas · Scikit-learn · XGBoost · Streamlit · Matplotlib · Seaborn · Joblib
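
The model-comparison step described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the linked repository's code: the column names, the price formula, and the two candidate models are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (assumption); a real project would load a housing CSV.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "location": rng.choice(["north", "south", "east"], size=n),
    "area": rng.uniform(500, 3000, size=n),
    "bedrooms": rng.integers(1, 6, size=n),
    "bathrooms": rng.integers(1, 4, size=n),
})
premium = df["location"].map({"north": 50_000, "south": 20_000, "east": 0})
df["price"] = (
    100 * df["area"]
    + 10_000 * df["bedrooms"]
    + 5_000 * df["bathrooms"]
    + premium
    + rng.normal(0, 10_000, size=n)
)

X = df.drop(columns="price")
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# One-hot encode the categorical column, scale the numeric ones.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["location"]),
    ("num", StandardScaler(), ["area", "bedrooms", "bathrooms"]),
])

candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each candidate inside the same preprocessing pipeline and compare
# mean absolute error on the held-out split.
results = {}
for name, model in candidates.items():
    pipe = Pipeline([("preprocess", preprocess), ("model", model)])
    pipe.fit(X_train, y_train)
    results[name] = mean_absolute_error(y_test, pipe.predict(X_test))
```

Keeping preprocessing inside the pipeline means the fitted object can be saved as a single artifact (for example with Joblib) and loaded unchanged by the Streamlit app.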

Customer Churn Prediction for Telecom (with Flask API)

Predict whether a telecom customer will churn using classification algorithms such as Random Forest, XGBoost, and Gradient Boosting. Covers SMOTE for class imbalance, SHAP for explainability, hyperparameter tuning, and deployment via Flask, which makes it a well-rounded portfolio project.

Python · Pandas · Scikit-learn · XGBoost · SMOTE (imbalanced-learn) · SHAP · Flask · Matplotlib
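
To make the SMOTE step less of a black box, here is a small NumPy sketch of its core idea: each synthetic minority sample is an interpolation between a real minority sample and one of its nearest minority neighbours. The function name `smote_sketch` and the toy data are illustrative assumptions; a real project would more likely use the `SMOTE` class from imbalanced-learn.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating toward random k-nearest neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    k = min(k, len(X) - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a base sample and one of its neighbours for every new row.
    base = rng.integers(0, len(X), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Move a random fraction of the way from the base toward the neighbour.
    gap = rng.random((n_new, 1))
    return X[base] + gap * (X[picked] - X[base])

# Toy imbalanced data (assumption): 200 majority rows, 10 minority rows.
rng = np.random.default_rng(1)
X_majority = rng.normal(0.0, 1.0, size=(200, 2))
X_minority = rng.normal(3.0, 0.5, size=(10, 2))

X_synth = smote_sketch(X_minority, n_new=190)
X_minority_balanced = np.vstack([X_minority, X_synth])
```

Oversampling should be applied to the training split only, so that the test set keeps the real class balance.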

Movie Recommendation System with Collaborative Filtering

Build a personalized movie recommendation engine using collaborative filtering algorithms (SVD, NMF, KNNBasic) on the MovieLens dataset. Evaluates performance with RMSE, generates top-N user recommendations, and prepares the model for web deployment — a classic resume project with real-world value.

Python · Pandas · Scikit-learn · Surprise Library · Matplotlib · Seaborn · Jupyter Notebook
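
As a rough illustration of the RMSE evaluation and top-N recommendation steps, the sketch below trains a simplified matrix-factorization model with SGD on a handful of made-up ratings. It is written in the spirit of SVD-style recommenders but is not the Surprise implementation, and the ratings, hyperparameters, and helper names are assumptions.

```python
import numpy as np

# (user, item, rating) triples; a toy stand-in for MovieLens (assumption).
ratings = [
    (0, 0, 5.0), (0, 1, 4.0), (0, 2, 1.0),
    (1, 0, 4.0), (1, 1, 5.0), (1, 3, 1.0),
    (2, 2, 5.0), (2, 3, 4.0), (2, 0, 1.0),
    (3, 2, 4.0), (3, 3, 5.0), (3, 1, 2.0),
]
n_users, n_items = 4, 4

def train_mf(ratings, n_users, n_items, n_factors=2,
             lr=0.05, reg=0.02, epochs=200, seed=0):
    """Learn user factors P and item factors Q so that mu + P[u] @ Q[i] ~ rating."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0, 0.1, (n_users, n_factors))
    Q = rng.normal(0, 0.1, (n_items, n_factors))
    mu = float(np.mean([r for _, _, r in ratings]))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mu + P[u] @ Q[i])
            pu = P[u].copy()  # use the pre-update user vector for the item step
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return mu, P, Q

def rmse(ratings, mu, P, Q):
    errs = [r - (mu + P[u] @ Q[i]) for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errs))))

def top_n(user, mu, P, Q, seen, n=2):
    """Rank all items for a user by predicted rating, skipping items already rated."""
    scores = mu + Q @ P[user]
    ranked = [i for i in np.argsort(-scores).tolist() if i not in seen]
    return ranked[:n]

mu, P, Q = train_mf(ratings, n_users, n_items)
train_rmse = rmse(ratings, mu, P, Q)
baseline_rmse = rmse(ratings, mu, np.zeros_like(P), np.zeros_like(Q))
recs = top_n(0, mu, P, Q, seen={0, 1, 2})
```

The RMSE here is computed on the training ratings purely as a smoke test; a real evaluation would hold out a test split or use cross-validation.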

Twitter Sentiment Analysis with BERT & Streamlit

Fine-tune a BERT/DistilBERT model on Twitter data to classify tweets into positive, negative, and neutral sentiments. Features full preprocessing, model evaluation, and a live Streamlit UI for real-time sentiment prediction — demonstrates strong NLP fundamentals.

Python · Hugging Face Transformers · PyTorch · BERT · Streamlit · NLTK · Pandas · Matplotlib
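
The preprocessing stage for tweets might look something like the sketch below. The exact rules are an assumption (the project may clean text differently): URLs are dropped, user handles are replaced by a generic placeholder, and hashtag symbols are removed while keeping the word.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")

def clean_tweet(text: str) -> str:
    """Light normalization before tokenization; casing and punctuation are kept."""
    text = URL_RE.sub(" ", text)           # links carry little sentiment signal
    text = MENTION_RE.sub("@user", text)   # anonymize handles
    text = text.replace("#", "")           # keep the hashtag word itself
    return WHITESPACE_RE.sub(" ", text).strip()

cleaned = clean_tweet("@alice loving the new phone!! #happy https://example.com/x")
# cleaned == "@user loving the new phone!! happy"
```

Heavy cleaning such as stop-word removal or stemming is usually unnecessary for transformer models, since their subword tokenizers work on raw text.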

Customer Segmentation with K-Means Clustering (RFM + Streamlit)

Apply K-Means and Agglomerative clustering to segment customers based on RFM (Recency, Frequency, Monetary) analysis. Uses PCA for visualization and the elbow method to choose the number of clusters, and wraps everything in a Streamlit app. A good showcase of unsupervised ML skills for e-commerce roles.

Python · Pandas · Scikit-learn · Matplotlib · Seaborn · PCA · Streamlit · Yellowbrick
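
For readers new to RFM, this is roughly how the three features can be derived from a transaction log with pandas before clustering. The column names and the toy transactions are assumptions made for the example.

```python
import pandas as pd

# Toy transaction log (assumption); a real project would load order data.
tx = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c", "c", "c"],
    "order_date": pd.to_datetime([
        "2026-01-05", "2026-03-01", "2025-11-20",
        "2026-02-10", "2026-02-25", "2026-03-10",
    ]),
    "amount": [40.0, 60.0, 15.0, 100.0, 80.0, 120.0],
})

# Measure recency relative to the day after the last recorded order.
snapshot = tx["order_date"].max() + pd.Timedelta(days=1)

rfm = tx.groupby("customer_id").agg(
    recency=("order_date", lambda d: (snapshot - d.max()).days),  # days since last order
    frequency=("order_date", "count"),                            # number of orders
    monetary=("amount", "sum"),                                   # total spend
)

# K-Means is distance-based, so put the three features on a comparable scale.
rfm_scaled = (rfm - rfm.mean()) / rfm.std()
```

The scaled table is what would then be passed to K-Means or Agglomerative clustering.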

Stock Price Prediction using LSTM Neural Networks

Predict future stock closing prices using a multi-layer LSTM network trained on historical OHLCV data from Yahoo Finance. Covers sliding-window preprocessing, early stopping, RMSE/MAE/MAPE evaluation, and training/prediction visualizations — a strong time series portfolio project.

Python · PyTorch · yFinance · Pandas · NumPy · Matplotlib · Scikit-learn
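
The sliding-window preprocessing and the three error metrics can be sketched in NumPy as follows. The helper names and the tiny price series are assumptions, and a last-value baseline stands in for the LSTM so the example stays self-contained.

```python
import numpy as np

def make_windows(series, window):
    """Turn a 1-D series into (samples, window) inputs and next-step targets."""
    series = np.asarray(series, dtype=float)
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

def regression_metrics(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "mape": float(np.mean(np.abs(err / y_true)) * 100),  # percent; y_true must be non-zero
    }

closes = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])
X, y = make_windows(closes, window=3)
# X rows: [10, 11, 12], [11, 12, 13], [12, 13, 14]; y: [13, 14, 15]

# Persistence baseline: predict that tomorrow's close equals today's.
naive_pred = X[:, -1]
metrics = regression_metrics(y, naive_pred)
```

Reporting the same metrics for a naive baseline like this one gives the LSTM's numbers something meaningful to be compared against.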

Multiple Disease Prediction Web App (Diabetes, Heart, Parkinson's)

A deployed Streamlit web application that predicts the likelihood of diabetes, heart disease, and Parkinson's disease using supervised ML models trained on curated medical datasets. Demonstrates multi-model deployment and healthcare ML fundamentals — a visually impressive beginner project.

Python · Scikit-learn · Streamlit · Pandas · NumPy · Joblib · Jupyter Notebook

Image Classification with CNN using PyTorch (CIFAR-10)

Build and train a Convolutional Neural Network on the CIFAR-10 dataset for 10-class image classification. Covers data augmentation, batch normalization, learning rate scheduling, and TensorBoard visualization — a foundational computer vision project demonstrating deep learning proficiency.

Python · PyTorch · torchvision · NumPy · Matplotlib · TensorBoard

Credit Card Fraud Detection with XGBoost & LightGBM

Build a fraud detection system on a highly imbalanced Kaggle dataset (0.17% fraud rate) by comparing XGBoost, LightGBM, CatBoost, and Random Forest. Uses PR-AUC as the primary metric, with cross-validation and hyperparameter tuning, and reports a 97.71% AUC while accounting for the class imbalance.

Python · XGBoost · LightGBM · CatBoost · Scikit-learn · Pandas · Matplotlib · imbalanced-learn
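
To show what a PR-AUC style metric measures, here is a small NumPy sketch of average precision, one common way to summarize the precision-recall curve. The labels and scores are made up, and the function assumes there are no tied scores; in practice scikit-learn's `average_precision_score` would normally be used instead.

```python
import numpy as np

def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each true positive."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    y = np.asarray(y_true)[order]
    hits = np.cumsum(y)                             # positives found so far
    precision_at_k = hits / np.arange(1, len(y) + 1)
    return float(np.sum(precision_at_k * y) / y.sum())

# Toy example (assumption): 2 frauds among 10 transactions.
y_true = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.10, 0.20, 0.90, 0.80, 0.30, 0.15, 0.60, 0.25, 0.05, 0.40]

ap = average_precision(y_true, scores)
# The frauds are ranked 1st and 3rd, so ap = (1/1 + 2/3) / 2
```

With rare positives, precision drops quickly for every false alarm ranked above a fraud, which is why this metric is more informative than accuracy on data like this.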

Intermediate Projects

End-to-End MLOps Pipeline with FastAPI, MLflow & Docker

A production-grade MLOps pipeline with CI/CD, continuous training, and monitoring — trains models via MLflow, deploys via FastAPI, containerizes with Docker, orchestrates with Kubernetes, and tracks metrics with Prometheus and Grafana. A standout DevOps-for-ML portfolio piece.

Python · FastAPI · MLflow · DVC · Docker · Kubernetes · AWS (ECR/EKS) · Prometheus · Grafana · GitHub Actions

Sales & Demand Forecasting with ARIMA, Prophet & LSTM

Compare and evaluate ARIMA, Facebook Prophet, XGBoost, and LSTM models for retail sales forecasting on the Kaggle store-item demand dataset. Covers stationarity tests, seasonality decomposition, rolling forecasts, and error benchmarking — essential for supply chain and data science roles.

Python · Statsmodels · Prophet (Meta) · XGBoost · Keras/TensorFlow · Pandas · Matplotlib · Scikit-learn
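
A rolling (walk-forward) forecast evaluates each model only on data it could have seen at prediction time. The sketch below shows the loop with a seasonal-naive forecaster as a placeholder; the function names and the synthetic weekly series are assumptions, and ARIMA, Prophet, or an LSTM would slot in where `seasonal_naive` is called.

```python
import numpy as np

def rolling_forecast(series, n_test, forecast_fn):
    """One-step-ahead forecasts for the last n_test points, refit-free sketch."""
    preds = []
    for t in range(len(series) - n_test, len(series)):
        history = series[:t]          # only observations before time t
        preds.append(forecast_fn(history))
    return np.array(preds)

def seasonal_naive(history, season=7):
    """Predict the value observed one full season ago."""
    return history[-season]

# Synthetic daily sales (assumption): weekly cycle plus a linear trend.
t = np.arange(8 * 7)
sales = 100 + 10 * np.sin(2 * np.pi * t / 7) + 0.5 * t

n_test = 14
preds = rolling_forecast(sales, n_test, seasonal_naive)
mae = float(np.mean(np.abs(sales[-n_test:] - preds)))
# The baseline misses exactly one week of trend: 0.5 * 7 = 3.5
```

Any model that cannot beat this baseline's error on the same rolling scheme is probably not worth its extra complexity.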

NLP Text Summarization with HuggingFace Pegasus & FastAPI

Fine-tune Google's Pegasus transformer model on the SAMSum dialogue dataset to generate abstractive conversation summaries. Implements a full MLOps pipeline with modular architecture, comprehensive logging, ROUGE metric evaluation, and a FastAPI + Docker deployment — production-quality NLP engineering.

Python · HuggingFace Transformers · Pegasus · FastAPI · Docker · ROUGE · PyTorch · Jupyter Notebook
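
ROUGE scores can look opaque, so here is a simplified sketch of ROUGE-1, which counts overlapping unigrams between a reference summary and a generated one. It uses plain whitespace tokenization and no stemming, so its numbers will not match an official ROUGE package exactly; the example sentences are invented.

```python
from collections import Counter

def rouge1(reference: str, candidate: str) -> dict:
    """Unigram-overlap precision, recall, and F1 (simplified ROUGE-1)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped count of shared words
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1(
    reference="amanda baked cookies and will bring jerry some tomorrow",
    candidate="amanda will bring jerry cookies",
)
# All 5 candidate words appear in the 9-word reference:
# precision = 5/5, recall = 5/9
```

A short summary can reach perfect precision while missing most of the reference, which is why recall and F1 are usually reported alongside it.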

Real-Time Object Detection with YOLOv8 & OpenCV

Deploy a real-time object detection system using YOLOv8 on webcam or video input. Covers single-stage detection theory, confidence thresholding, bounding box rendering, and frame-by-frame inference — demonstrates the practical computer vision skills most CV job roles require.

Python · YOLOv8 (Ultralytics) · OpenCV · PyTorch · NumPy
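
Confidence thresholding is easiest to understand on raw detections. The pure-Python sketch below filters boxes by confidence and then applies greedy non-maximum suppression, a standard companion step that the detection library normally performs internally. The detection dictionaries, thresholds, and function names are assumptions for illustration and are not the Ultralytics API.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def filter_detections(dets, conf_threshold=0.5, iou_threshold=0.5):
    """Drop low-confidence boxes, then suppress same-class overlaps."""
    candidates = sorted(
        (d for d in dets if d["conf"] >= conf_threshold),
        key=lambda d: d["conf"],
        reverse=True,
    )
    kept = []
    for det in candidates:
        # Keep a box unless it heavily overlaps a stronger box of the same class.
        if all(det["label"] != k["label"]
               or iou(det["box"], k["box"]) < iou_threshold
               for k in kept):
            kept.append(det)
    return kept

detections = [
    {"label": "person", "conf": 0.92, "box": (10, 10, 110, 210)},
    {"label": "person", "conf": 0.75, "box": (15, 12, 112, 208)},   # duplicate of the first
    {"label": "dog", "conf": 0.30, "box": (200, 150, 300, 220)},    # below threshold
    {"label": "dog", "conf": 0.81, "box": (205, 148, 310, 225)},
]
kept = filter_detections(detections)
```

Raising `conf_threshold` trades missed objects for fewer false boxes, which is the main knob worth exposing in a webcam demo.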

Named Entity Recognition (NER) System with SpaCy

Build a production-style custom NER system trained on a custom-annotated dataset using spaCy — extracts structured entities (names, locations, organizations, medical terms) from raw text with displaCy HTML visualization and API-ready inference output.

Python · spaCy · displaCy · Pandas · Jupyter Notebook

Predictive Maintenance for Industrial Equipment (IoT + ML + MLOps)

An end-to-end MLOps system that ingests industrial sensor data, trains Random Forest and XGBoost failure prediction models, tracks experiments with MLflow, and serves predictions via FastAPI. It is containerized with Docker, CI/CD-ready via GitHub Actions, and cloud-deployable on AWS ECS. The project reports 92% accuracy.

Python · Scikit-learn · XGBoost · MLflow · FastAPI · Docker · GitHub Actions · AWS ECR/ECS · SMOTE · Pandas

RAG-Based LLM Document Chatbot with LangChain & Qdrant

Build a Retrieval-Augmented Generation (RAG) chatbot that answers user questions from uploaded PDFs using Llama 3.2 (via Ollama), BGE embeddings, and Qdrant as the vector database — orchestrated with LangChain and deployed as a Streamlit web app inside Docker.

Python · LangChain · Llama 3.2 (Ollama) · BGE Embeddings (HuggingFace) · Qdrant · Streamlit · Docker · Unstructured
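
The retrieval half of RAG can be illustrated without any of the heavy components. In the sketch below, a bag-of-words vector stands in for BGE embeddings and an in-memory matrix stands in for Qdrant; both substitutions, along with the sample chunks and function names, are assumptions made to keep the example self-contained.

```python
import re
import numpy as np

# Text chunks as they might come out of a PDF splitter (made-up content).
chunks = [
    "Invoices are due within 30 days of the billing date.",
    "The warranty covers manufacturing defects for two years.",
    "Support is available by email on weekdays.",
]

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

vocab = sorted({tok for c in chunks for tok in tokenize(c)})
index = {tok: i for i, tok in enumerate(vocab)}

def embed(text):
    """Toy embedding: L2-normalized word counts over the chunk vocabulary."""
    vec = np.zeros(len(vocab))
    for tok in tokenize(text):
        if tok in index:
            vec[index[tok]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# "Vector store": one normalized row per chunk.
chunk_matrix = np.stack([embed(c) for c in chunks])

def retrieve(question, k=1):
    """Return the k chunks with the highest cosine similarity to the question."""
    scores = chunk_matrix @ embed(question)
    top = np.argsort(-scores)[:k]
    return [chunks[int(i)] for i in top]

def build_prompt(question, k=1):
    """Assemble the context-grounded prompt that would be sent to the LLM."""
    context = "\n".join(retrieve(question, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

question = "How long does the warranty cover defects?"
best = retrieve(question)[0]
prompt = build_prompt(question)
```

In the full project the same two steps, similarity search and prompt assembly, are handled by the vector database and the orchestration framework.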

Fraud Detection with FastAPI, Streamlit & MLflow Tracking

A professional ML pipeline for credit card fraud detection using XGBoost with SHAP explainability, Streamlit for interactive fraud predictions, MLflow for experiment tracking, and Azure/GitHub Actions for CI/CD. Combines MLOps tooling with a business-critical use case, making it a strong intermediate project.

Python · XGBoost · LightGBM · SHAP · FastAPI · Streamlit · MLflow · Docker · Azure · GitHub Actions

AI-Powered Demand Forecasting Dashboard with Prophet & Streamlit

An AI-powered demand forecasting system combining ARIMA and Facebook Prophet with adaptive seasonality decomposition and inventory/revenue optimization. Ships with an interactive Streamlit dashboard featuring production-ready metrics — ideal for supply chain analytics portfolios.

Python · Prophet · ARIMA · Statsmodels · Streamlit · Plotly · Pandas · Scikit-learn · NumPy

Advanced Projects

End-to-End MLflow + FastAPI + MinIO ML Deployment Platform

A production-ready ML deployment platform integrating MLflow for experiment tracking and model registry, FastAPI for model serving, and MinIO (S3-compatible) for artifact storage — all orchestrated with Docker Compose. Demonstrates the full ML lifecycle from training to versioned API serving.

Python · MLflow · FastAPI · MinIO · Docker Compose · Scikit-learn · Pydantic · Uvicorn

Neural Collaborative Filtering Movie Recommendation Engine

Implement a deep learning-based recommendation engine using Neural Collaborative Filtering (NCF) with PyTorch — combines user and item embeddings, trains on the MovieLens dataset, and evaluates with Hit Rate and NDCG. A more advanced recommendation system beyond classical matrix factorization.

Python · PyTorch · Pandas · NumPy · Matplotlib · Scikit-learn · MovieLens Dataset
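
Hit Rate and NDCG are ranking metrics, which makes them less familiar than RMSE. The sketch below assumes the common leave-one-out setup, in which one held-out item per user is ranked among the model's recommendations; the user IDs, item IDs, and rankings are invented for the example.

```python
import math

def hit_rate_at_k(ranked_items, held_out_item, k):
    """1.0 if the held-out item appears in the top k, else 0.0."""
    return 1.0 if held_out_item in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, held_out_item, k):
    """With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == held_out_item:
            return 1.0 / math.log2(rank + 1)
    return 0.0

# user -> (model's ranked recommendations, held-out item the user actually chose)
rankings = {
    "u1": (["m3", "m7", "m1"], "m7"),   # hit at rank 2
    "u2": (["m2", "m9", "m4"], "m5"),   # miss
    "u3": (["m8", "m6", "m5"], "m8"),   # hit at rank 1
}

k = 3
hr = sum(hit_rate_at_k(r, t, k) for r, t in rankings.values()) / len(rankings)
ndcg = sum(ndcg_at_k(r, t, k) for r, t in rankings.values()) / len(rankings)
```

Hit Rate only asks whether the item made the list, while NDCG also rewards placing it near the top.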

Time Series Anomaly Detection with LSTM & Isolation Forest

Detect anomalies in multivariate time series (network traffic or server metrics) using a spectrum of methods, from a statistical IQR baseline through Isolation Forest to LSTM-based forecasting and reconstruction error. Benchmarks multiple detectors against each other for a rigorous, interview-ready ML project.

Python · TensorFlow/Keras · Scikit-learn · Pandas · NumPy · Matplotlib · Jupyter Notebook
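
The statistical IQR baseline mentioned above is simple enough to sketch in a few lines of NumPy. The multiplier of 1.5 is the conventional default, and the latency values are made up for the example.

```python
import numpy as np

def iqr_anomalies(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Toy server latency readings in milliseconds (assumption).
latency_ms = np.array([20, 22, 21, 19, 23, 20, 95, 21, 22, 20])
mask = iqr_anomalies(latency_ms)
anomaly_indices = np.flatnonzero(mask).tolist()
# Q1 = 20 and Q3 = 22, so the bounds are [17, 25] and only the 95 ms spike is flagged
```

A baseline like this ignores temporal context, which is the gap the forecasting-based detectors are meant to close.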

End-to-End Text Summarization MLOps Pipeline (AWS + CI/CD)

A complete text summarization system using the HuggingFace Transformers BERT model deployed on AWS with S3 and EC2 — covering full training pipeline, ROUGE score evaluation, FastAPI serving, and Streamlit deployment. Demonstrates cloud ML engineering with automated CI/CD for AWS-focused roles.

Python · Transformers · BERT · AWS S3 · AWS EC2 · Streamlit · FastAPI · GitHub Actions · Docker

IoT-Based Predictive Maintenance System with Kafka & Spark

A full-stack IoT predictive maintenance system that streams sensor data via Apache Kafka, processes it in real time with Spark, connects to AWS IoT Core, and runs ML failure prediction models tracked with MLflow. Demonstrates advanced data engineering + ML for IIoT roles.

Python · Apache Kafka · Apache Spark · MLflow · AWS IoT Core · Scikit-learn · Docker · Pandas

RAG Chatbot with Multi-LLM Support (OpenAI, Gemini, HuggingFace)

A production-ready RAG chatbot powered by LangChain that supports multiple LLM backends: OpenAI GPT-4, Google Gemini Pro, and HuggingFace Mistral. Users upload documents (PDF, CSV, DOCX), which are embedded into a Chroma vector store and queried with conversational retrieval. Deployed via Streamlit.

Python · LangChain · OpenAI API · Google Gemini · HuggingFace · ChromaDB · Streamlit · Pandas

Full MLOps Pipeline: Customer Churn (Airflow + DVC + Kubernetes)

A fully automated, production-grade ML pipeline for telecom customer churn prediction orchestrated with Apache Airflow — covering data ingestion, validation, feature engineering, model training, and monitoring with DVC-based data versioning for complete reproducibility. Deploys on Kubernetes for scale.

Python · Apache Airflow · DVC · Scikit-learn · XGBoost · Docker · Kubernetes · MLflow · PostgreSQL · GitHub Actions

Tips for Building Projects That Get You Hired

  1. Use real datasets from Kaggle, the UCI ML Repository, or government open data, not toy datasets.

  2. Always deploy: a live Streamlit or FastAPI URL shows you can ship, not just train models.

  3. Quantify everything in your resume: "94% F1-score on 50,000 samples" beats "built a classifier."

  4. Add model explainability using SHAP or LIME; interviewers often ask about this.

  5. Write a proper README with problem statement, approach, results, and a demo GIF.
