A machine learning solution for predicting the probability of loan default in the Kaggle Home Credit Default Risk competition.
This repository contains a solution for the Home Credit Default Risk competition on Kaggle. The goal of this competition is to predict how capable each applicant is of repaying a loan. The solution uses LightGBM as the main algorithm and employs model blending to improve prediction accuracy.
- Feature Engineering: Creates numerous features from multiple data sources to improve model performance
- LightGBM Model: Implements a gradient boosting model with optimized hyperparameters
- K-Fold Cross-Validation: Uses stratified k-fold cross-validation to ensure robust model evaluation
- Model Blending: Combines predictions from multiple models to create a stronger ensemble
- Feature Importance Analysis: Visualizes and exports feature importance for model interpretability
- Python 3.x
- Required libraries (see the install command below):
- pandas
- numpy
- LightGBM
- scikit-learn
- matplotlib
- seaborn
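All dependencies are available on PyPI and can be installed with pip:

```bash
pip install pandas numpy lightgbm scikit-learn matplotlib seaborn
```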
The solution expects the following data files (not included in this repository):
- application_train.csv
- application_test.csv
- bureau.csv
- bureau_balance.csv
- previous_application.csv
- POS_CASH_balance.csv
- installments_payments.csv
- credit_card_balance.csv
These files should be placed in the root directory of the project.
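For reference, a minimal loading sketch, assuming the files sit in the project root (the actual loading logic lives in `creditDefault.py` and may differ):

```python
import pandas as pd

# Read each raw table from the project root (Python 3.9+ for removesuffix).
files = [
    "application_train.csv", "application_test.csv", "bureau.csv",
    "bureau_balance.csv", "previous_application.csv",
    "POS_CASH_balance.csv", "installments_payments.csv",
    "credit_card_balance.csv",
]
tables = {name.removesuffix(".csv"): pd.read_csv(name) for name in files}
```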
To train the main LightGBM model and generate predictions:
```bash
python creditDefault.py
```
This will:
- Process all data files
- Create features
- Train a LightGBM model with 5-fold cross-validation (sketched below)
- Generate a submission file named `HomeCreditDefaultSubmit.csv`
- Output feature importance visualization
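For orientation, here is a condensed sketch of the stratified 5-fold loop; the feature matrices `X`/`X_test`, the target `y`, and the early-stopping patience are placeholders rather than the script's actual code (the callback API shown requires LightGBM >= 3.3):

```python
import lightgbm as lgb
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def train_cv(X, y, X_test, params, n_splits=5):
    """Stratified k-fold training; returns out-of-fold and test predictions."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    oof = np.zeros(len(X))
    test_pred = np.zeros(len(X_test))
    for fold, (trn_idx, val_idx) in enumerate(folds.split(X, y), start=1):
        model = lgb.LGBMClassifier(**params)
        model.fit(
            X.iloc[trn_idx], y.iloc[trn_idx],
            eval_set=[(X.iloc[val_idx], y.iloc[val_idx])],
            eval_metric="auc",
            callbacks=[lgb.early_stopping(200)],  # stop if AUC stalls for 200 rounds
        )
        oof[val_idx] = model.predict_proba(X.iloc[val_idx])[:, 1]
        test_pred += model.predict_proba(X_test)[:, 1] / n_splits
        print(f"Fold {fold} AUC: {roc_auc_score(y.iloc[val_idx], oof[val_idx]):.5f}")
    print(f"Overall AUC: {roc_auc_score(y, oof):.5f}")
    return oof, test_pred
```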
To blend multiple model predictions:
```bash
python blender.py
```
This will:
- Load prediction files from the `blended/` directory
- Create a weighted average of predictions (see the sketch after this list)
- Generate a final submission file named `blended.csv`
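A minimal sketch of the weighted average, assuming each file in `blended/` is a standard competition submission with `SK_ID_CURR` and `TARGET` columns and identically ordered rows; the equal weights here are illustrative:

```python
import glob
import pandas as pd

# Read every submission in blended/ and average their TARGET columns.
files = sorted(glob.glob("blended/*.csv"))
subs = [pd.read_csv(f) for f in files]
weights = [1.0 / len(subs)] * len(subs)  # illustrative: equal weights

blend = subs[0][["SK_ID_CURR"]].copy()
blend["TARGET"] = sum(w * s["TARGET"] for w, s in zip(weights, subs))
blend.to_csv("blended.csv", index=False)
```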
The solution uses a LightGBM classifier with the following key parameters, collected into a parameter dictionary below:
- Learning rate: 0.02
- Number of leaves: 34
- Max depth: 8
- Feature subsampling and row subsampling for regularization
- Early stopping to prevent overfitting
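Expressed as a dictionary compatible with `lgb.LGBMClassifier`, these settings might look as follows; only the first three values are stated above, and the subsampling fractions are illustrative:

```python
params = {
    "objective": "binary",
    "learning_rate": 0.02,    # as stated above
    "num_leaves": 34,         # as stated above
    "max_depth": 8,           # as stated above
    "colsample_bytree": 0.9,  # feature subsampling (illustrative value)
    "subsample": 0.87,        # row subsampling (illustrative value)
    "subsample_freq": 1,      # apply row subsampling every iteration
    "n_estimators": 10000,    # high cap; early stopping picks the best round
}
```

This dictionary plugs directly into the cross-validation sketch shown earlier.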
Feature engineering includes the following, with a condensed sketch after the list:
- Ratio features (e.g., credit to income ratio)
- Statistical aggregations (mean, max, min, etc.)
- Temporal features based on days and dates
- One-hot encoding for categorical variables
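A condensed illustration using real column names from `application_train.csv` and `bureau.csv` (the actual feature set is far larger):

```python
import pandas as pd

app = pd.read_csv("application_train.csv")
bureau = pd.read_csv("bureau.csv")

# Ratio feature, e.g. credit-to-income
app["CREDIT_INCOME_RATIO"] = app["AMT_CREDIT"] / app["AMT_INCOME_TOTAL"]

# Temporal feature: age in years from a days-based column
app["AGE_YEARS"] = app["DAYS_BIRTH"] / -365

# Statistical aggregations over each applicant's bureau records
agg = bureau.groupby("SK_ID_CURR")["AMT_CREDIT_SUM"].agg(["mean", "max", "min"])
agg.columns = [f"BUREAU_CREDIT_{c.upper()}" for c in agg.columns]
app = app.join(agg, on="SK_ID_CURR")

# One-hot encoding for categorical variables
app = pd.get_dummies(app, dummy_na=True)
```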
The model's performance is evaluated with the ROC AUC metric, which is printed during training for each fold and for the combined out-of-fold predictions.
This project is licensed under the MIT License - see the LICENSE file for details.