The fraud_nlp from weiansheng

#The Language of Fraud

“There’s a kind of fascination with the thought that a computer sleuth can discover things that are hidden there in the text. Things about the style of the writing that the reader can’t detect and the author can’t do anything about, a kind of signature or DNA or fingerprint of the way they write.” -- Peter Millican on use of forensic linguistics

Language use is constant. While other indicators for fraud, such as IP addresses, bank accounts, can be changed, language use is constant and indicative.

The inspiration for this project comes from an in-class case study we did on fraud detection. Bag of words approaches and text length showed some promising results for fraud detection. Based on that, I wanted to see if some deep learning approaches with language would yield better results as it would give a larger feature set with context in language use. As indicated in the above quote, authors have characteristic writing that is unique and traceable. If the use of language in perpetuating fraud could be thought of as a genre, I wanted to try to find via computational means the patterns of usage that indicate this 'genre of fraud'.

For featurization of the text descriptions, I chose the Stanford Core Parser because it gave a rich feature set should I choose to extend it further than I did for the current model. In this model, I have used only the syntactic depdendencies and part-of-speech tags given by the parser. Word2Vec was used for the featurization of the words themselves for two reasons: it trains very quickly and the gensim library within python allows for its ease of use. I then created a sparse matrix with a single dependency within a sentence represented by a row and then built a model using scikit-learn's logistic regression classifier.

The scoring in this model is such that every sentence is given a score by averaging the binary fraud/not fraud scores of its dependencies. Every event is then scored as fraudulent given that one sentence within is indicated as fraudulent.

The scripts for building the model are in the build scripts folder and the scoring scripts are within the scoring scripts folder.

This project is meant as a proof of concept model, not a working application.
###Required technologies for this product:

####NLP

####Python libraries

Pandas
Gensim
Scikit-learn
Numpy

####Database

Postgres

weiansheng / fraud_nlp Goto Github PK

fraud_nlp's Introduction

fraud_nlp's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent