Credit Risk Scoring

Nowadays, one can invest in other people's loans through online peer-to-peer lending platforms such as Lending Club. On Lending Club, borrowers with higher credit scores (more trustworthy, less risky) pay lower interest rates on their loans, while borrowers with lower credit scores (less trustworthy, riskier) pay higher rates. From an investor's point of view, loans with higher interest rates are attractive because they promise a higher return on investment (ROI), but they also carry the risk of never being repaid at all (default). A machine learning model that can predict which high-interest loans are likely to be repaid therefore adds value by minimizing that risk.

For this project, I will be using libraries for data manipulation (Pandas, NumPy), data visualization (Matplotlib, Seaborn), and machine learning (Scikit-learn, XGBoost), along with some statistics to gain insight into the data and its trends. The dataset consists of 466285 rows and 75 columns. Using df.info(), we can see that some columns have missing values, so we should handle those first. First, we drop the columns that have many missing values. For the remaining missing values, we simply drop the affected rows, which reduces the row count from 466285 to 372162. Then we turn to the number of columns, because 75 columns is quite a lot. Columns with uninformative values, such as address, zip code, and URL, are dropped.
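The cleaning steps above can be sketched as follows. This is a minimal sketch on a toy frame; the column names and the 50% missing-value threshold are assumptions, not the article's exact choices.

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the Lending Club table (column names assumed).
df = pd.DataFrame({
    "loan_amnt": [1000, 2000, np.nan, 4000, 5000],
    "desc":      [np.nan, np.nan, np.nan, np.nan, "text"],  # mostly missing
    "url":       ["u1", "u2", "u3", "u4", "u5"],            # uninformative
    "int_rate":  [10.5, np.nan, 12.0, 9.9, 11.1],
})

# 1. Drop columns where more than half of the values are missing.
missing_frac = df.isna().mean()
df = df.drop(columns=missing_frac[missing_frac > 0.5].index)

# 2. Drop identifier-like columns that carry no predictive signal.
df = df.drop(columns=["url"])

# 3. Drop the remaining rows that still contain any missing value.
df = df.dropna()
print(df.shape)
```

On the real dataset the same three steps produce the row reduction described above.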

The next step is data exploration. We can explore the data to gain some insights. Here are some investigations that we can do.
  1. The most common job titles among loan applicants
  2. The distribution of loan terms, home ownership, and loan purposes
The results of this exploration and visualization can serve as an initial hypothesis for designing promotions or offers.
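Both investigations reduce to simple group counts in Pandas. A sketch on a small illustrative sample (the values are made up; on the real data you would plot these with seaborn.countplot):

```python
import pandas as pd

# Small sample standing in for the loan data (values illustrative).
df = pd.DataFrame({
    "emp_title": ["Teacher", "Nurse", "Teacher", "Manager", "Teacher"],
    "term":      [" 36 months", " 60 months", " 36 months", " 36 months", " 60 months"],
    "purpose":   ["debt_consolidation", "credit_card", "debt_consolidation",
                  "home_improvement", "debt_consolidation"],
})

# 1. Most common job titles among applicants.
top_jobs = df["emp_title"].value_counts().head(10)

# 2. Distribution of loan terms and purposes.
term_dist = df["term"].value_counts(normalize=True)
purpose_dist = df["purpose"].value_counts()
print(top_jobs.index[0], purpose_dist.index[0])
```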

Next, we can check the distribution of the numeric columns to see whether any column has an imbalanced distribution or no variation at all. It turns out some columns contain only a single value, so they can be dropped safely; removing them will not affect the model we build later. We should also check for outliers. There are many outlier detection methods to choose from; in this article, interquartile range (IQR) outlier detection is used. After removing the outliers, the row count decreases again from 372162 to 250872.
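Both checks are short in Pandas. The helper below implements the standard IQR rule (keep values within Q1 - 1.5*IQR to Q3 + 1.5*IQR); the toy column names are assumptions for illustration.

```python
import pandas as pd

def remove_iqr_outliers(df, column, k=1.5):
    """Keep rows whose value in `column` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[column].quantile(0.25), df[column].quantile(0.75)
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]

df = pd.DataFrame({
    "annual_inc": [40_000, 50_000, 55_000, 60_000, 1_000_000],
    "policy_code": [1, 1, 1, 1, 1],  # single-valued column
})

# Drop columns with a single unique value: they carry no information.
df = df.drop(columns=df.columns[df.nunique() <= 1])

# Remove income outliers with the IQR rule.
df = remove_iqr_outliers(df, "annual_inc")
print(df.shape)
```

The extreme income of 1,000,000 falls outside the IQR fence and is removed, while the single-valued column is dropped outright.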

Feature engineering is the next step. It is optional if you are already satisfied with your features. In this article, some new columns are created to help the model capture relationships between existing columns; each is a simple ratio of two columns. This is done because a machine learning model does not inherently know how columns relate to one another: it only processes the data it is given.
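A ratio feature makes a relationship explicit that a tree-based model would otherwise have to approximate through many splits. The feature names and formulas below are illustrative assumptions, not the article's exact choices:

```python
import pandas as pd

# Toy data; real columns come from the Lending Club table.
df = pd.DataFrame({
    "loan_amnt":   [10_000, 20_000],
    "annual_inc":  [50_000, 40_000],
    "installment": [300.0, 650.0],
})

# Simple division between columns, as described above.
df["loan_to_income"] = df["loan_amnt"] / df["annual_inc"]
df["payment_to_income"] = df["installment"] * 12 / df["annual_inc"]
print(df["loan_to_income"].tolist())  # [0.2, 0.5]
```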

After that, we move to the ordinal columns. We have to encode their values as numbers so the model can use them. The term, grade, sub_grade, and emp_length columns are converted to numeric values. You can choose the exact encoding as long as it preserves the ordering.
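One reasonable encoding (the specific mapped values are a choice, not the article's exact ones) looks like this:

```python
import pandas as pd

df = pd.DataFrame({
    "term":       [" 36 months", " 60 months"],
    "grade":      ["A", "C"],
    "emp_length": ["10+ years", "< 1 year"],
})

# Extract the number of months from the term string.
df["term"] = df["term"].str.extract(r"(\d+)", expand=False).astype(int)

# Grades are ordered A (best) to G (worst); map them to 1..7.
df["grade"] = df["grade"].map({g: i for i, g in enumerate("ABCDEFG", start=1)})

# Employment length: "< 1 year" -> 0, "1 year" -> 1, ..., "10+ years" -> 10.
emp_map = {"< 1 year": 0,
           **{f"{i} year{'s' if i > 1 else ''}": i for i in range(1, 10)},
           "10+ years": 10}
df["emp_length"] = df["emp_length"].map(emp_map)
print(df.values.tolist())  # [[36, 1, 10], [60, 3, 0]]
```

The sub_grade column (A1 through G5) can be mapped the same way with a larger dictionary.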

The target feature for this dataset is an indicator of whether a loan is good (0) or bad (1). To identify good loans, I use their loan statuses and print their counts. 'Current' and 'Fully Paid' are labeled 0, while 'Charged Off', 'Default', 'Late (16-30 days)', and 'Late (31-120 days)' are labeled 1. Lending Club provides a description for each status:
  - Current: The loan is up to date on all outstanding payments.
  - In Grace Period: The loan is past due but within the 15-day grace period.
  - Late (16-30): The loan has not been current for 16 to 30 days.
  - Late (31-120): The loan has not been current for 31 to 120 days.
  - Fully paid: The loan has been fully repaid, either at the expiration of the 3- or 5-year term or as a result of a prepayment.
  - Default: The loan has not been current for an extended period of time.
  - Charged Off: Loan for which there is no longer a reasonable expectation of further payments.
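The labeling described above can be written as a small filter-and-binarize step (a sketch; `target` is an assumed column name):

```python
import pandas as pd

df = pd.DataFrame({"loan_status": [
    "Current", "Fully Paid", "Charged Off", "Late (31-120 days)", "Default",
]})

good = {"Current", "Fully Paid"}
bad = {"Charged Off", "Default", "Late (16-30 days)", "Late (31-120 days)"}

# Keep only the statuses used for labeling, then binarize: bad loans -> 1.
df = df[df["loan_status"].isin(good | bad)].copy()
df["target"] = df["loan_status"].isin(bad).astype(int)
print(df["target"].tolist())  # [0, 0, 1, 1, 1]
```

Statuses outside both sets (such as 'In Grace Period') are excluded from the labeled data here.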

For the imbalanced data problem, we can combine SMOTE, which oversamples the minority class, with RandomUnderSampler, which undersamples the majority class. Then we build three different machine learning models: a decision tree, a random forest, and XGBoost. The three models are used to score loan applicants. Finally, the models are tested, and metrics such as accuracy, F1 score, and AUC are calculated. XGBoost turns out to be the best of the three, so the XGBoost model is saved with pickle for future use.
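The end-to-end flow can be sketched as below. To keep the sketch dependency-free, plain random oversampling stands in for SMOTE/RandomUnderSampler (which come from the imbalanced-learn package), synthetic data stands in for the loan table, and XGBoost's XGBClassifier is indicated in a comment; the rest of the pipeline is the same.

```python
import pickle
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.utils import resample

# Synthetic imbalanced data standing in for the processed loan table.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Rebalance the training set. The article uses SMOTE plus RandomUnderSampler
# from imbalanced-learn; simple random oversampling of the minority class is
# shown here instead.
X_min, y_min = X_train[y_train == 1], y_train[y_train == 1]
X_maj, y_maj = X_train[y_train == 0], y_train[y_train == 0]
X_up, y_up = resample(X_min, y_min, n_samples=len(y_maj), random_state=42)
X_bal = np.vstack([X_maj, X_up])
y_bal = np.concatenate([y_maj, y_up])

models = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # "xgboost": xgboost.XGBClassifier(...)  # third model in the article
}
scores = {}
for name, model in models.items():
    model.fit(X_bal, y_bal)
    proba = model.predict_proba(X_test)[:, 1]
    pred = model.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "auc": roc_auc_score(y_test, proba),
    }

# Keep the model with the best AUC for later use.
best = max(scores, key=lambda n: scores[n]["auc"])
with open("best_model.pkl", "wb") as f:
    pickle.dump(models[best], f)
print(best, scores[best])
```

The saved pickle can later be reloaded with `pickle.load` to score new applicants.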