theprofittable / freddiemacdata Goto Github PK

2.0 3.0 2.0 24 KB

transformation of the freddie mac data to The Profit Table's standard data dictionary

R 100.00%

freddiemacdata's Introduction

freddiemacdata

transformation of the freddie mac data to The Profit Table's standard data dictionary

Background

As part of a larger effort to increase transparency, Freddie Mac is making available loan-level credit performance data on a portion of fully amortizing fixed-rate mortgages that the company purchased or guaranteed from 1999 to 2016.

You can get more information around the data set, as well as the user manual and actual download here.

Scope

The scope of this repo is to convert the Freddie Mac data to the standard format used by The Profit Table for loan analysis in all of our packages. Please see this repo for the data dictionary. You can also acces our style guide and on boarding information there.

The main output of the script is a data frame (df) used for analysis using the loanportr package. We are currently reading in the Freddie Mac sample data from 2012 onwards as this results in a data frame of an easily manageable size.

Improvements and Work To Be Done

Create a very small sample set to accompany the loanportr package as lazy data.

NOTE: issue for the above still to be logged.

freddiemacdata's People

Contributors

Stargazers

Watchers

Forkers

floresfdev jimdg

freddiemacdata's Issues

create orig_month and loan_period

make assumptions around the fpd_month variable (assume orig_month is one month before) and then create loan_period from that.

Will be good to have these to time related fields as well, because any vintage graph can be grouped on a orig_month or fpd_month basis.

Missing fpd_period 1

Some contracts are missing fpd_period == 1.
Example contract_key == "F116Q3442182"

Rows with NA in several columns

Affected columns:

months_arrears
fico
cltv
dti
default_flag: value computed from months_arrears
default_month: value computed from months_arrears
fico_bin: value computed from fico
cltv_bin: value computed from cltv
dti_bin: value computed from dti

For example:
Missing fico value for contracts F116Q1191757 and F116Q3004784
Missing cltv value for contract F116Q3425682

binning of continuous variables

dti, cltv, fico.
new variable for each.
create 3 levels for each: low medium high .
calculate: min, 33rd percentile and 66th percentile,
bin1: - 33rd
bin2: 33rd - 66th
bin3: 66th +

create default_month and default_flag

Date transformation

@DandeVilliers can you please tackle this one. the pointintime_month and fpd_monthvariable needs to get the last day of the month added to it so it can be transformed from an integer to a date variable.
Do in seperate script (not data wrangling) after I've finished #1

create (hack) orig_month and loan_period

Sample data set creation

Background

The loanportr package needs a sample lazy data set, primarily to enable automated unit tests to be set up with the testthat package.

The Freddie Mac data set is an obvious candidate for use as a sample data set as it is open and it is (to my knowledge) the only open monthly performance loan data set available.

Problem

The dataset to be used as a sample set to go with the loanportr package should comply with some basic criteria:

It should comply with The Profit Table's standard data dictionary. Variable names that refer to the data dictionary will be in code blocks throughout this issue.
It should have all the variables necessary for using the functions in the loanportr package as well as some other variables that will be used in a typical IFRS9 provision model as loanportr will most likely be enriched in the coming months with functions to perform IFRS9 analysis or a new package will be created. Where the variables are not available in the current data set, variables will need to be created that mimic the typical behaviour of those that are needed.
It should be small enough so as to not make the package unnecessarily big.
It should be big enough in terms of the number of observations to produce meaningful analysis.
It should contain enough defaults and other events to calculate meaningful rates.
There should be a meaningful and fairly consistent number of contracts in each vintage or origination month.

Solution

Create a sample data set for the loanportr package by:

Loan selection

Randomly select a 100 from each orig_month in df. Note: this refers to a contract level selection, i.e. 100 unique contract_keys.
select all loans from the df that were ever in default (default_flag = 1). Note: Here you will have to work with monthly performance level data and then take it back to contract level.

Variables
Note that points marked with {future} do not need to be prioritised for the first version of the sample data set.

Add missing variable: interest_rate (add from Freddie Mac data set - int_rt) {future}
Create missing variable: instalment {future}
Create missing variable: net_advance For now, make this simply a fixed % of the closing_balance: 90% for loans not in default (default_flag = 0), 50% for loans that are in default (default_flag = 1) .
Create missing variable: payment {future}
Create missing variable: disclosure If default_flag = 0 then "PL". If default_flag = 1 then "NPL".

use 2012 data onwards for now

import and transform freddie mac data

Data continuity

Background

Background reading required for this task:

Freddie Mac data readme
Data dictionary
loanportr package readme
Check out some of the function code and documentation of loanportr in this directory
Style guide

Problem

The Freddie Mac data set exhibits a problem commonly found in loan data sets:
The performance history for some contracts stops before it has to. It has to stop when there is no further information available because no more time has elapsed.
The premature stopping of loan history typically happens because the loan has either been written off or is paid up, which results in there being no further observations (or monthly performance data) beyond a certain date or month.

E.g., a loan is originated in May 2017. Today's date is 6 November 2017. We will have monthly performance data for this loan up to Oct 2017 (last month). The problem that I am talking about here is where, for example, the loan data only runs up to July 2017 and not October. I.e. there are no further rows.

pointintime_month	orig_month	period_loan
2017-05	2017-05	1
2017-06	2017-05	2
2017-07	2017-05	3

Stops here, even though it should go up to pointintime_month 2017-10.

This is problematic because when you are trying to calculate the rate at which loans reach some status as a % of the total loans originated in the origination month, it throws out both numerator and the denominator. The loan dissapears from the total and the 'event' part of the rate calculation. Functions likevintalyse and early_default (in loanportr package)depend on monthly loan performance (rows) being populated up until the most recent pointintime_month.

Solution

The solution can be broken up into a few functions:

Write a function that tests whether this issue is prevelant in the loan data set.
IN: data in our standard format. Which, using the Freddie Mac example, is the df created in this script
OUT: TRUE or FALSE
Write a function that gives you a list of contract_key numbers that exhibit this problem.
IN: data in our standard format. Which, using the Freddie Mac example, is the df created in this script
OUT: list of contract_key
Write a function that takes the last available row of performance information for each of the loans that are exhibiting this problem, and inserts the extra rows needed. This will be logged as a separate Issue.

Bonus points: document your functions using a roxygen skeleton. You can do this by putting your cursor inside the curly bracket of the function, and in the Rstudio IDE go to Code -> Insert Roxygen Skeleton.

theprofittable / freddiemacdata Goto Github PK

freddiemacdata's Introduction

freddiemacdata

Background

Scope

Improvements and Work To Be Done

freddiemacdata's People

Contributors

Stargazers

Watchers

Forkers

freddiemacdata's Issues

Background

Problem

Solution

Background

Problem

Solution

Recommend Projects

Recommend Topics

Recommend Org