
fraud_nlp's Introduction

#The Language of Fraud

“There’s a kind of fascination with the thought that a computer sleuth can discover things that are hidden there in the text. Things about the style of the writing that the reader can’t detect and the author can’t do anything about, a kind of signature or DNA or fingerprint of the way they write.” -- Peter Millican on use of forensic linguistics

Language use is constant. Other indicators of fraud, such as IP addresses and bank accounts, can be changed, but the way a person uses language stays consistent and is therefore indicative.

The inspiration for this project comes from an in-class case study on fraud detection, in which bag-of-words approaches and text length showed some promising results. Based on that, I wanted to see whether deep learning approaches to language would yield better results, since they provide a larger feature set that captures context in language use. As the quote above indicates, authors have characteristic writing that is unique and traceable. If the use of language in perpetrating fraud can be thought of as a genre, I wanted to find, by computational means, the patterns of usage that indicate this 'genre of fraud'.

For featurization of the text descriptions, I chose the Stanford CoreNLP parser because it gives a rich feature set, should I choose to extend the model further than I have so far. The current model uses only the syntactic dependencies and part-of-speech tags given by the parser. Word2Vec was used to featurize the words themselves for two reasons: it trains very quickly, and the gensim library for Python makes it easy to use. I then created a sparse matrix in which each row represents a single dependency within a sentence, and built a model on it using scikit-learn's logistic regression classifier.
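As a rough illustration of the one-row-per-dependency layout, here is a minimal sketch. It is not the project's actual build script: the toy dependencies, the `build_matrix` helper, the choice of `DictVectorizer` for the categorical features, and the random vectors standing in for trained Word2Vec embeddings are all assumptions made for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# ASSUMPTION: random vectors stand in for Word2Vec embeddings trained with gensim.
rng = np.random.default_rng(0)
VOCAB = ["buy", "tickets", "now", "join", "concert", "free", "money", "win"]
WORD_VECS = {word: rng.normal(size=4) for word in VOCAB}


def word_vec(word):
    """Look up a word vector, falling back to zeros for unknown words."""
    return WORD_VECS.get(word, np.zeros(4))


# ASSUMPTION: hand-written dependencies in place of real parser output.
# (relation, head POS, dependent POS, head word, dependent word, fraud label)
DEPS = [
    ("dobj", "VB", "NNS", "buy", "tickets", 0),
    ("advmod", "VB", "RB", "buy", "now", 0),
    ("dobj", "VB", "NN", "join", "concert", 0),
    ("amod", "NN", "JJ", "money", "free", 1),
    ("dobj", "VB", "NN", "win", "money", 1),
    ("advmod", "VB", "RB", "win", "now", 1),
]


def build_matrix(deps):
    """Build a sparse matrix with one row per dependency (hypothetical helper)."""
    categorical = [
        {"rel": rel, "head_pos": head_pos, "dep_pos": dep_pos}
        for rel, head_pos, dep_pos, _, _, _ in deps
    ]
    vectorizer = DictVectorizer(sparse=True)
    # One-hot encode the dependency relation and the two part-of-speech tags.
    one_hot = vectorizer.fit_transform(categorical)
    # Concatenate the head and dependent word vectors for each dependency.
    embeddings = np.array(
        [np.concatenate([word_vec(head), word_vec(dep)]) for _, _, _, head, dep, _ in deps]
    )
    features = hstack([one_hot, csr_matrix(embeddings)]).tocsr()
    labels = np.array([label for *_, label in deps])
    return features, labels, vectorizer


X, y, vectorizer = build_matrix(DEPS)
clf = LogisticRegression(max_iter=1000).fit(X, y)
predictions = clf.predict(X)  # one binary fraud/not-fraud prediction per dependency
```

In this sketch each dependency inherits the label of the event it came from, which is one plausible way to obtain row-level training labels; the description above does not say how the labels were assigned.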

Each sentence is scored by averaging the binary fraud/not-fraud scores of its dependencies. An event is then scored as fraudulent if at least one of its sentences is indicated as fraudulent.
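The scoring rule can be sketched in a few lines of Python. The function names are hypothetical, and the `threshold` that turns a sentence's averaged score into a fraud flag is an assumption, since the cutoff used by the scoring scripts is not stated here.

```python
def sentence_score(dependency_labels):
    """Average the binary fraud (1) / not-fraud (0) labels of a sentence's dependencies."""
    if not dependency_labels:
        return 0.0
    return sum(dependency_labels) / len(dependency_labels)


def event_is_fraud(sentences, threshold=0.5):
    """Flag an event if at least one sentence scores at or above the threshold.

    ASSUMPTION: threshold=0.5 is illustrative, not the project's value.
    """
    return any(sentence_score(labels) >= threshold for labels in sentences)


# Each inner list holds the per-dependency predictions for one sentence.
suspicious_event = [[0, 0, 1], [1, 1, 0]]
ordinary_event = [[0, 0], [0, 1, 0]]

suspicious_flag = event_is_fraud(suspicious_event)
ordinary_flag = event_is_fraud(ordinary_event)
```

Because a single flagged sentence is enough to flag the whole event, this rule favors catching fraud over avoiding false alarms, and longer descriptions have more chances to trigger it.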

The scripts for building the model are in the build scripts folder and the scoring scripts are within the scoring scripts folder.

This project is meant as a proof-of-concept model, not a working application.

###Required technologies for this project:

####NLP

####Python libraries

  • Pandas
  • Gensim
  • Scikit-learn
  • Numpy

####Database

  • Postgres
