Coder Social home page Coder Social logo

predicting_nationality_by_names's Introduction

Predicting_nationality_by_names

  • This is the project of predicting nationality by names.
  • In this project I use the Faker library to build a small binary lstm model to clean the data from names-dataset(link:https://github.com/philipperemy/name-dataset)
  • The reason why I choose this dataset instead of by creating fake name to train large lstm model is this 3.5GB data will capture the wide variety of names of a countries, which provide more valuable information than using fake names.
  • The downside of this large dataset is it has a lot of noise, since the data taken from 530M facebook users. The user may not name their name as their actual name.
  • From that problem, I have create a binary lstm model to clean the raw_dataset.

The workflow:

Given raw_dataset of specific country -> create small dataset where target country will label 1 and a list of other countries label 0 using Faker library -> Train binary lstm on small dataset -> Use trained lstm model to clean the raw dataset and save it in cleaned dataset folder -> train multi languages LSTM model.

  • Some insights and reports: The process of training small lstm is quite easy and fast, it depend of how many examples you want faker library to create, the common accuracy around 88% -> 96% for ~10 epochs, 1 hidden layer, 128 hiddens unit Traning the large LSTM took longer since I use two hidden layer and dataset is large ~100k examples. The final accuracy is 85.6% with batch_size=64

  • How you use them?

  • First you have to download the raw dataset link:https://github.com/philipperemy/name-dataset

  • clone my github repo, git clone https://github.com/BlueElf246/Predicting_nationality_by_names.git

  • In the run.py file src is the target language use want to clean from src raw dataset

the second argument in Fake_dataset is list of countries that will be label 0, the last in number of example for 1 label, 0 label for example:

src="English"

ds_creator = Fake_dataset(src, ['French','India','Italian','Spanish'], [1000,1000])

English ds will have 4k example, French, india,... each will have 1k examples

This will save the created file as src{df_name_num}.csv

df = ds_creator.create_dataset(df_name_num='')

This two lines will create pytoch dataset and train it using lstm model, use can modify epoch, num_category in train_model class init.

train_loader, valid_loader = create_model().train_test_loader(batch_size=batch_size, df=df, test_size=0.1)

train = train_model(batch_size=batch_size, model_name=src, epoch=8, layer=2).lstm_train_test_process(train_loader, valid_loader)

  • Finaly run the file run_filter.py to filter raw_dataset

f.Filter(name[11], "English"),the first the path to raw_dataset the second argument is model saved file name use want to use for example, English.pt.

In train_multi_lstm.py, we will combine all true_.csv dataset into a large and train it.

list_name = glob.glob("/Users/datle/Desktop/mybiggestprj/true_") change it to "/cleaned_dataset/true_.csv"

  • In this file, I bring all the model architecture and training function separately instead of using the same archtecture as small lstm.

The input size will be (number of example, length of char in a batch, 54), where 54 is the vector size encoded from a character. For more in detail, read my report.

Thanks for reading <3

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.