Coder Social home page Coder Social logo

freddiemacdata's Introduction

freddiemacdata

transformation of the freddie mac data to The Profit Table's standard data dictionary

Background

As part of a larger effort to increase transparency, Freddie Mac is making available loan-level credit performance data on a portion of fully amortizing fixed-rate mortgages that the company purchased or guaranteed from 1999 to 2016.

You can get more information around the data set, as well as the user manual and actual download here.

Scope

The scope of this repo is to convert the Freddie Mac data to the standard format used by The Profit Table for loan analysis in all of our packages. Please see this repo for the data dictionary. You can also acces our style guide and on boarding information there.

The main output of the script is a data frame (df) used for analysis using the loanportr package. We are currently reading in the Freddie Mac sample data from 2012 onwards as this results in a data frame of an easily manageable size.

Improvements and Work To Be Done

  • Create a very small sample set to accompany the loanportr package as lazy data.

NOTE: issue for the above still to be logged.

freddiemacdata's People

Contributors

floresfdev avatar dandevilliers avatar

Stargazers

Mohamed Hani ElMasry avatar  avatar

Watchers

James Cloos avatar  avatar Mohamed Hani ElMasry avatar

Forkers

floresfdev jimdg

freddiemacdata's Issues

create orig_month and loan_period

make assumptions around the fpd_month variable (assume orig_month is one month before) and then create loan_period from that.

Will be good to have these to time related fields as well, because any vintage graph can be grouped on a orig_month or fpd_month basis.

Missing fpd_period 1

Some contracts are missing fpd_period == 1.
Example contract_key == "F116Q3442182"

Rows with NA in several columns

Affected columns:

  • months_arrears
  • fico
  • cltv
  • dti
  • default_flag: value computed from months_arrears
  • default_month: value computed from months_arrears
  • fico_bin: value computed from fico
  • cltv_bin: value computed from cltv
  • dti_bin: value computed from dti

For example:
Missing fico value for contracts F116Q1191757 and F116Q3004784
Missing cltv value for contract F116Q3425682

binning of continuous variables

dti, cltv, fico.
new variable for each.
create 3 levels for each: low medium high .
calculate: min, 33rd percentile and 66th percentile,
bin1: - 33rd
bin2: 33rd - 66th
bin3: 66th +

Date transformation

@DandeVilliers can you please tackle this one. the pointintime_month and fpd_monthvariable needs to get the last day of the month added to it so it can be transformed from an integer to a date variable.
Do in seperate script (not data wrangling) after I've finished #1

Sample data set creation

Background

The loanportr package needs a sample lazy data set, primarily to enable automated unit tests to be set up with the testthat package.

The Freddie Mac data set is an obvious candidate for use as a sample data set as it is open and it is (to my knowledge) the only open monthly performance loan data set available.

Problem

The dataset to be used as a sample set to go with the loanportr package should comply with some basic criteria:

  • It should comply with The Profit Table's standard data dictionary. Variable names that refer to the data dictionary will be in code blocks throughout this issue.
  • It should have all the variables necessary for using the functions in the loanportr package as well as some other variables that will be used in a typical IFRS9 provision model as loanportr will most likely be enriched in the coming months with functions to perform IFRS9 analysis or a new package will be created. Where the variables are not available in the current data set, variables will need to be created that mimic the typical behaviour of those that are needed.
  • It should be small enough so as to not make the package unnecessarily big.
  • It should be big enough in terms of the number of observations to produce meaningful analysis.
  • It should contain enough defaults and other events to calculate meaningful rates.
  • There should be a meaningful and fairly consistent number of contracts in each vintage or origination month.

Solution

Create a sample data set for the loanportr package by:

Loan selection

  • Randomly select a 100 from each orig_month in df. Note: this refers to a contract level selection, i.e. 100 unique contract_keys.
  • select all loans from the df that were ever in default (default_flag = 1). Note: Here you will have to work with monthly performance level data and then take it back to contract level.

Variables
Note that points marked with {future} do not need to be prioritised for the first version of the sample data set.

  • Add missing variable: interest_rate (add from Freddie Mac data set - int_rt) {future}
  • Create missing variable: instalment {future}
  • Create missing variable: net_advance For now, make this simply a fixed % of the closing_balance: 90% for loans not in default (default_flag = 0), 50% for loans that are in default (default_flag = 1) .
  • Create missing variable: payment {future}
  • Create missing variable: disclosure If default_flag = 0 then "PL". If default_flag = 1 then "NPL".

Data continuity

Background

Background reading required for this task:

Problem

The Freddie Mac data set exhibits a problem commonly found in loan data sets:
The performance history for some contracts stops before it has to. It has to stop when there is no further information available because no more time has elapsed.
The premature stopping of loan history typically happens because the loan has either been written off or is paid up, which results in there being no further observations (or monthly performance data) beyond a certain date or month.

E.g., a loan is originated in May 2017. Today's date is 6 November 2017. We will have monthly performance data for this loan up to Oct 2017 (last month). The problem that I am talking about here is where, for example, the loan data only runs up to July 2017 and not October. I.e. there are no further rows.

pointintime_month orig_month period_loan
2017-05 2017-05 1
2017-06 2017-05 2
2017-07 2017-05 3

Stops here, even though it should go up to pointintime_month 2017-10.

This is problematic because when you are trying to calculate the rate at which loans reach some status as a % of the total loans originated in the origination month, it throws out both numerator and the denominator. The loan dissapears from the total and the 'event' part of the rate calculation. Functions likevintalyse and early_default (in loanportr package)depend on monthly loan performance (rows) being populated up until the most recent pointintime_month.

Solution

The solution can be broken up into a few functions:

  1. Write a function that tests whether this issue is prevelant in the loan data set.
    IN: data in our standard format. Which, using the Freddie Mac example, is the df created in this script
    OUT: TRUE or FALSE

  2. Write a function that gives you a list of contract_key numbers that exhibit this problem.
    IN: data in our standard format. Which, using the Freddie Mac example, is the df created in this script
    OUT: list of contract_key

  3. Write a function that takes the last available row of performance information for each of the loans that are exhibiting this problem, and inserts the extra rows needed. This will be logged as a separate Issue.

Bonus points: document your functions using a roxygen skeleton. You can do this by putting your cursor inside the curly bracket of the function, and in the Rstudio IDE go to Code -> Insert Roxygen Skeleton.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.