
mlh-fellowship / social-berterfly


Finding your MBTI personality type based on your Twitter activity using BERT

License: MIT License

Python 0.97% HTML 47.78% Jupyter Notebook 7.58% CSS 28.31% JavaScript 6.61% SCSS 8.74% Shell 0.02%
mbti nlp personality-predicting personality-test bert myers-briggs personality-profiling

social-berterfly's Introduction

Social BERTerfly 🦋

Open In Colab

Predicts your personality type, out of the 16 Myers-Briggs personality types, from your Twitter handle and compares it with the types of the people you follow.

It uses a machine learning classifier built on BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art language model, to predict the personality type of a given user from their recent tweets.

Getting Started: 🙌

How to run locally:

Follow the below steps to run and explore your personality types, as well as that of your friends!

  • git clone https://github.com/MLH-Fellowship/Social-BERTerfly.git

  • Download our model weights from the following Drive link:

    BERT_base_model

  • Place the downloaded .h5 model under server/models/.

  • Navigate to the server folder by:

    cd server/

  • Install dependencies by:

    pip install -r requirements.txt (you can install the packages in a virtualenv if you prefer)

  • Add your Twitter API keys and authorization credentials to the .env file. To get a Twitter API key, you can refer to this article. Do not open a PR with, or otherwise publish, a .env file containing your Twitter API key and credentials. Either create a separate copy of the .env file in your cloned repo and delete it after use, or uncomment the "/server/.env" line in .gitignore.

  • Create a new folder "twitter_data" in the same directory to store the fetched tweets.

  • Run the following in your terminal:

    flask run

    or, python app.py

  • Wait around 15 seconds for the model to load.

  • Visit the application at http://127.0.0.1:5000/ and enjoy exploring various personality traits for you and your following!

Note: Click the Submit button first to fetch the tweets and results. Once the personality type is displayed on the landing page, click Go to Dashboard for a detailed analysis.
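For the credentials step above, a quick check that the expected values are actually set can save a confusing failure later. The sketch below is only an illustration: the four variable names are hypothetical placeholders, so substitute whatever keys your .env file really defines.

```python
import os

# Hypothetical variable names; replace them with the keys used in your .env file.
REQUIRED_KEYS = (
    "CONSUMER_KEY",
    "CONSUMER_SECRET",
    "ACCESS_TOKEN",
    "ACCESS_TOKEN_SECRET",
)


def missing_credentials(env=os.environ, required=REQUIRED_KEYS):
    """Return the names of required settings that are absent or empty."""
    return [key for key in required if not env.get(key)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing Twitter credentials:", ", ".join(missing))
    else:
        print("All Twitter credentials are set.")
```

Running it before starting the server reports which values still need to be filled in.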

Start contributing! 📣

If you wish to contribute to our model, you can take a look at our notebook, and provide suggestions or comments.


An Example:

Landing Page:

[Screenshot: landing page]

A brief description of personality types:

[Screenshot: brief description of personality types]

Try it Out:

Head over to the Get Started section, enter your Twitter handle, and press Submit. The model should take approximately 15 seconds to return your predicted personality type on the screen.

Head over to the Dashboard:

Click on Go to Dashboard to get detailed personality analysis along with career suggestions.

[Screenshot: dashboard]

Compare personality types!:

Now you can also compare your personality type against that of your followers and friends!

[Screenshot: personality type comparison]

Tech Stack:

  • Twitter API for fetching tweets
  • tweepy for connecting the API with Python (https://pypi.org/project/tweepy/)
  • Flask for the backend server
  • Google Colaboratory for collaborating on the model and accessing the free TPU 😂
  • Keras for training and testing the BERT model
  • BERT as a SOTA model for tweet predictions. (https://arxiv.org/abs/1810.04805)
  • Bootstrap for the homepage and the dashboard UI
  • Chart.js for displaying graphs on the Dashboard

Implementation Details:

P.S: If you ain't into the boring stuff, head on over to the next section to contribute to our model and the app!

About MBTI

The Myers-Briggs Type Indicator (or MBTI for short) is a personality type system that divides everyone into 16 distinct personality types across 4 axes:

  • Introversion (I) – Extroversion (E)
  • Intuition (N) – Sensing (S)
  • Thinking (T) – Feeling (F)
  • Judging (J) – Perceiving (P)

It is one of the most popular personality tests in the world, if not the most popular. It is used in businesses, online, for fun, for research and much more. From a scientific or psychological perspective, it is based on Carl Jung's work on cognitive functions, i.e. Jungian typology: a model of 8 distinct functions, thought processes or ways of thinking that were suggested to be present in the mind. This work was later transformed into several different personality systems to make it more accessible, the most popular of which is the MBTI.
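To see where the number 16 comes from, the four axes can be combined programmatically. This is a small illustrative sketch, not code from the repository; the names AXES and MBTI_TYPES are made up for the example.

```python
from itertools import product

# One letter pair per MBTI axis, in the order the letters appear in a type code.
AXES = [("I", "E"), ("N", "S"), ("T", "F"), ("J", "P")]

# Choosing one letter from each axis gives 2 * 2 * 2 * 2 = 16 codes.
MBTI_TYPES = ["".join(letters) for letters in product(*AXES)]

print(len(MBTI_TYPES))  # 16
print(MBTI_TYPES[:4])   # ['INTJ', 'INTP', 'INFJ', 'INFP']
```

These 16 codes correspond to the 16 output classes of the classifier described later.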

Dataset

For the dataset, we used the well-known Myers-Briggs Personality Type Dataset, which pairs a large number of people's MBTI types with content written by them. It contains over 8600 rows, each holding one person's:

- Type (this person's 4-letter MBTI code/type)
- A section of each of the last 50 things they have posted (entries are separated by "|||", 3 pipe characters)
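Since every row packs up to 50 posts into one string, the first processing step is usually to split on the "|||" separator. A minimal sketch, with split_posts being a hypothetical helper name rather than a function from the repository:

```python
from typing import List


def split_posts(raw_posts: str) -> List[str]:
    """Split one row's text field into individual posts, dropping empty entries."""
    return [post.strip() for post in raw_posts.split("|||") if post.strip()]


row = "first post|||second post ||| third post|||"
print(split_posts(row))  # ['first post', 'second post', 'third post']
```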

BERT

Bidirectional Encoder Representations from Transformers (BERT) is a Transformer-based machine learning technique for natural language processing (NLP) pre-training developed by Google. BERT was created and published in 2018 by Jacob Devlin and his colleagues from Google. As of 2019, Google has been leveraging BERT to better understand user searches.

Data Fetching:

Using tweepy and Twitter API, we fetch the 50 latest tweets posted by the user according to the username entered. These tweets are stored in a .csv file and sent for preprocessing, and finally the cleaned texts are sent to the Keras model.
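The fetch-and-store step might look roughly like the sketch below. It assumes an already authenticated tweepy.API object is passed in as api and uses tweepy's user_timeline call with tweet_mode="extended" to get untruncated text. The function names and the single "text" column are assumptions for illustration; the repository's twitterscraper.py may be organised differently.

```python
import csv


def fetch_latest_tweets(api, handle, count=50):
    """Return the text of the latest `count` tweets posted by `handle`.

    `api` is expected to behave like an authenticated tweepy.API object.
    """
    statuses = api.user_timeline(
        screen_name=handle, count=count, tweet_mode="extended"
    )
    return [status.full_text for status in statuses]


def save_tweets_csv(texts, path):
    """Write one tweet per row under a single 'text' header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text"])
        writer.writerows([text] for text in texts)
```

Passing the API object in as an argument keeps the credentials handling separate and makes the function easy to test with a stand-in object.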

Data preprocessing:

We use regular expressions to detect and remove special characters such as '@' mentions and emojis from the posts, remove stopwords and punctuation, convert the text to lowercase, and apply stemming to reduce words to their roots. The preprocessed data is then divided with a train/test split and passed to the Keras model.
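A standard-library-only sketch of this kind of cleaning is shown below. It is not the project's actual preprocessing code: the stopword set is a tiny illustrative sample, the non-ASCII filter is a crude stand-in for emoji removal, URL stripping is an added assumption, and stemming is left out because it needs an external library such as nltk.

```python
import re
import string

# Tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "for"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ASCII_RE = re.compile(r"[^\x00-\x7f]")  # crude emoji/symbol filter
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Lowercase, strip URLs, mentions, non-ASCII symbols, punctuation and stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ASCII_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return " ".join(tokens)


print(clean_text("Loving the new BERT paper!! @friend https://example.com 🦋"))
# loving new bert paper
```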

BERT Model summary:

Layer (type)                 Output Shape              Param #   
=================================================================
input_word_ids (InputLayer)  [(None, 1500)]            0         
_________________________________________________________________
tf_bert_model_1 (TFBertModel ((None, 1500, 768), (None 109482240)) 
_________________________________________________________________
tf_op_layer_strided_slice_1  [(None, 768)]             0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                12304     
=================================================================
Total params: 109,494,544
Trainable params: 109,494,544
Non-trainable params: 0
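
The parameter counts in the summary can be cross-checked with a little arithmetic. The check assumes the final Dense layer has a bias term, which is consistent with the 12,304 figure reported above.

```python
hidden_size = 768          # width of the BERT output vector fed to the Dense layer
num_classes = 16           # one output unit per MBTI type
bert_params = 109_482_240  # taken from the tf_bert_model_1 row of the summary

# A Dense layer has one weight per (input, output) pair plus one bias per output.
dense_params = hidden_size * num_classes + num_classes
total_params = bert_params + dense_params

print(dense_params)  # 12304
print(total_params)  # 109494544
```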

Results achieved:

We trained an LSTM baseline and BERT-base to compare their accuracies.

Model               Train accuracy   Validation accuracy
LSTM baseline       18.96%           16.9%
BERT-base-uncased   85%              79%

Deployment:

The app uses Flask for the backend and model deployment, and Bootstrap for building the Dashboard and the Homepage UI.

Contributing:

Social BERTerfly is fully Open-Source and open for contributions! We request you to respect our contribution guidelines as defined in our CODE OF CONDUCT and CONTRIBUTING GUIDELINES.

Contributors

Made with ❤️ by Team Social-BERTerfly as part of MLH Explorer Fall Fellowship 2020 Sprint3.

Contributors: sh-biswas, susiejojo, v2dha


social-berterfly's Issues

Remaining tasks

Below are the remaining tasks to be completed before the final demo:

  • Get the Twitter handles the user follows and display them along with their personality types

  • Improve the UI of the dashboard. For instance, the personality traits card takes up a lot of blank space and looks uneven; remove the purchase banner if possible

  • Update the README with setup instructions

  • Put screenshots in README

  • Add a pull request template

  • Clean up the commented out code

If you guys think there are any other things left you can add it over here.

Edit: I think we can also display the full form of the personality type below the abbreviation for the personality. For instance, INTP will be Introvert-Intuitive-Thinker-Perceiving.
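
A possible way to produce the full form is a simple letter lookup. The sketch below uses the axis names from the README (Introversion, Intuition, and so on) rather than the exact wording in the example above; expand_type is a hypothetical helper name.

```python
# Letter-to-name mapping taken from the four MBTI axes listed in the README.
LETTER_NAMES = {
    "I": "Introversion", "E": "Extroversion",
    "N": "Intuition", "S": "Sensing",
    "T": "Thinking", "F": "Feeling",
    "J": "Judging", "P": "Perceiving",
}


def expand_type(code):
    """Turn a 4-letter MBTI code into its hyphen-separated full form."""
    return "-".join(LETTER_NAMES[letter] for letter in code.upper())


print(expand_type("INTP"))  # Introversion-Intuition-Thinking-Perceiving
```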

OSError: Unable to open file (file signature not found)

I get an error when I run this repo on an instance after using wget to download the bert_base_model.h5 file from Google Drive into the models folder. When I run the project, the error is:

Traceback (most recent call last):
...
with h5py.File(filepath, 'r') as f:
File "/home/ubuntu/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 406, in __init__
fid = make_fid(name, mode, userblock_size,
File "/home/ubuntu/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 173, in make_fid
fid = h5f.open(name, flags, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (file signature not found)

Can anyone suggest why the problem occurs?
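
One way to narrow this down: "file signature not found" means the file does not start with the 8-byte HDF5 signature, so whatever was downloaded is probably not the weights file. A possible cause, stated here as an assumption and not confirmed in this thread, is that the download saved an HTML page from Google Drive instead of the .h5 file. The sketch below checks the first bytes of the file; the function names are made up for the example.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # first 8 bytes of a regular HDF5 file


def looks_like_hdf5(path):
    """Return True if the file starts with the HDF5 signature."""
    with open(path, "rb") as f:
        return f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


def peek(path, n=64):
    """Return the first `n` bytes, useful for spotting an HTML page."""
    with open(path, "rb") as f:
        return f.read(n)
```

If looks_like_hdf5 returns False and peek shows something like b"<!DOCTYPE html", the file needs to be downloaded again.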

Set up backend server

We need a backend server to take the data entered by the user (or subsequently the response from the Twitter API) and send it to the predict function of the model.
API functionalities needed:

  • take user input in the form of text

  • send text for preprocessing and cleaning

  • send cleaned text to the model predict function

  • fetch the predictions from the predict function as a response

@V2dha pls add further points if any and check them as and when done.

Connect Twitter response to model

  • Clean Twitter API response based on the type of data returned. (may need to remove @, urls etc.)

  • Decide if there's a need to make a separate model from when the user enters text on the browser. This may be needed coz the cleaning function, the predict function will all be different.

  • Send the twitter response as a DataFrame to the model. Even a Pandas series should do.

  • Batch predict the top n tweets and run a majority vote on the predictions. (This won't work while fetching probabilities so we will need to probably normalise the sum of probabilities of each type returned)

  • Decide n by validation.

  • Check if expected response is being returned by the server.
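
The two aggregation options mentioned above (a majority vote over predicted labels, or summing and normalising per-class probabilities) can be sketched as follows. This is an illustration only, with hypothetical function names, and is not taken from the server code.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label predicted most often across the tweets."""
    return Counter(labels).most_common(1)[0][0]


def normalised_probabilities(prob_rows):
    """Sum per-class probabilities over all tweets and rescale them to sum to 1."""
    num_classes = len(prob_rows[0])
    sums = [sum(row[k] for row in prob_rows) for k in range(num_classes)]
    total = sum(sums)
    return [s / total for s in sums]


print(majority_vote(["INTP", "ENFJ", "INTP"]))              # INTP
print(normalised_probabilities([[0.5, 0.5], [1.0, 0.0]]))   # [0.75, 0.25]
```

The predicted type under the second option is the class with the largest normalised value.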

AttributeError: 'NoneType' object has no attribute 'to_csv' (Twitter scraper return null)

Hi and thanks for this cool project!

First issue on requirements, dataclasses==0.8 is not available on latest python, staying on 0.6 would be fine.
Also pyasn1-modules is already part of dist-package in ubuntu 20.04, I had to comment it out since it creates an error.

Now to the main issue. Once successfully installed on a clean ubuntu, the server starts fine, but when I submit a handle I get the following trace:

127.0.0.1 - - [17/Jan/2021 07:09:36] "OPTIONS /tweet_pred HTTP/1.1" 200 -
failed on_status, Failed to send request: Only unicode objects are escapable. Got None of type <class 'NoneType'>.
[2021-01-17 07:09:40,698] ERROR in app: Exception on /tweet_pred [POST]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.8/dist-packages/flask_cors/extension.py", line 165, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.8/dist-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/opt/Social-BERTerfly/server/app.py", line 36, in tweet
    tweet_return(user_handle)
  File "/opt/Social-BERTerfly/server/twitterscraper.py", line 82, in tweet_return
    twitter.get_user_tweets(str(tweet_handle)).to_csv(tweet_path)
AttributeError: 'NoneType' object has no attribute 'to_csv'
127.0.0.1 - - [17/Jan/2021 07:09:40] "POST /tweet_pred HTTP/1.1" 500 -

Looks like the twitter scraper does not return any results.
My instance is on GCP and the firewall allows external calls...

After looking a bit in the code I see that auth credentials are needed and that this is not a credential-free scraper like twint... too bad :)

So I think your readme should mention this part about creating a .env file, and about dependency issues as well.
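
Independently of the README change, the handler could fail with a clearer message when the scraper returns nothing. A minimal sketch of such a guard, where fetch stands in for the twitter.get_user_tweets call from the traceback and save_user_tweets is a hypothetical wrapper name:

```python
def save_user_tweets(fetch, handle, path):
    """Fetch tweets for `handle` and write them to `path`, failing loudly on None."""
    frame = fetch(str(handle))
    if frame is None:
        # Most likely missing or invalid Twitter credentials, as noted above.
        raise RuntimeError(
            "No tweets returned for %r; check the Twitter credentials in .env" % handle
        )
    frame.to_csv(path)
    return path
```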

Set up baseline model

The model broadly performs the following functions:

  • fetch the Kaggle MBTI dataset

  • preprocessing: clean the text, then use nltk to tokenise, stem, remove stopwords and run NER on it

  • visualise the given data classes and distribution

  • send data to the training model (preferably BERT)

  • choose hyperparameters for training

  • cross-validate

  • contrast BERT vs LSTM models

  • an evaluator function that reports accuracy and other metrics (I believe accuracy will be good enough for us, but we need to check the confusion matrix once)

  • create a predict function which accepts text, preprocesses it and returns the predicted class

ValueError: Cannot assign to variable tf_bert_model/bert/encoder/layer_._0/attention/self/query/kernel:0 due to variable shape (768, 12, 64) and value shape (768, 768) are incompatible


Design a frontend

We need to interact with the user, let them authorize us to use their social media profiles, or let them enter text in response to various questions.

  • Input side:

      • Accept text input (for testing)

      • Get the user to authenticate the APIs

  • Output side:

      • Display the personality type detected.

      • Display traits of the personality type

      • Maybe display names of eminent people of that personality.

Open to suggestions!

Deploy the baseline model

Depends on #2 and #1.

  • With the baseline accuracy, save the trained weights locally

  • send user input to the predict function

  • predict using pretrained baseline model weights

  • be able to display the predicted class on the browser.

Scrape user data

We need to fetch the user's top n (TBD) posts from various social media platforms.

  • Learn about the Twitter API (as the dataset is scraped from Twitter)

  • be able to send the response from Twitter API to the Flask server

  • Clearly Facebook has limitations. So we will look into YouTube and perform personality predictions for commenters using comments.

Add predictions for Twitter followers

Tasks to be done:

  • Fetch n followers by the username.

  • Fetch the top posts of each of the n followers.

  • for each follower, return prediction

  • return overall type prediction for n followers as json
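
The last two tasks could be combined into one JSON response. The sketch below assumes the per-follower predictions are already available as a handle-to-type mapping; the function name and the JSON field names are hypothetical.

```python
import json
from collections import Counter


def summarise_followers(predictions):
    """Build a JSON string from a {handle: predicted_type} mapping."""
    counts = Counter(predictions.values())
    most_common = counts.most_common(1)[0][0] if counts else None
    return json.dumps(
        {
            "followers": predictions,
            "type_counts": dict(counts),
            "most_common_type": most_common,
        },
        indent=2,
    )


print(summarise_followers({"alice": "INTP", "bob": "ENFJ", "carol": "INTP"}))
```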
