sharmaroshan / twitter-sentiment-analysis

This is a Natural Language Processing problem in which sentiment analysis is performed by classifying tweets as positive or negative using machine learning models, together with text mining, text analysis, data analysis, and data visualization.

License: GNU General Public License v3.0

Languages: Jupyter Notebook 99.04%, Python 0.96%

Topics: nlp, sentiment-analysis, data-analysis, bag-of-words, data-visualization, eda, machine-learning, classification, cross-validation, evaluation-metrics

Twitter-Sentiment-Analysis

Introduction

Natural Language Processing (NLP) is a hotbed of research in data science these days and one of the most common applications of NLP is sentiment analysis. From opinion polls to creating entire marketing strategies, this domain has completely reshaped the way businesses work, which is why this is an area every data scientist must be familiar with.

Thousands of text documents can be processed for sentiment (and other features including named entities, topics, themes, etc.) in seconds, compared to the hours it would take a team of people to manually complete the same task.

In this article, we will work through the sequence of steps needed to solve a general sentiment analysis problem. We will start with preprocessing and cleaning the raw text of the tweets. Then we will explore the cleaned text and try to build some intuition about the context of the tweets. After that, we will extract numerical features from the data and finally use these feature sets to train models and identify the sentiments of the tweets.

This is one of the most interesting challenges in NLP so I’m very excited to take this journey with you!

Understand the Problem Statement

Let’s go through the problem statement once, because it is crucial to understand the objective before working on the dataset. The problem statement is as follows:

The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to distinguish racist or sexist tweets from other tweets.

Formally, given a training sample of tweets and labels, where label ‘1’ denotes the tweet is racist/sexist and label ‘0’ denotes the tweet is not racist/sexist, your objective is to predict the labels on the given test dataset.

Note: The evaluation metric for this practice problem is the F1-Score.
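
For reference, the F1-Score is the harmonic mean of precision and recall on the positive class (here, label 1). The following is a minimal sketch in plain Python; the function name `f1_score_binary` and the sample labels are hypothetical and only serve as illustration.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: F1 is taken as 0 by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only (not taken from the dataset)
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]
score = f1_score_binary(y_true, y_pred)  # precision = recall = 2/3
```

Unlike plain accuracy, this score ignores true negatives, so a model cannot score well simply by predicting label 0 for every tweet.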

Tweets Preprocessing and Cleaning

Picture two scenarios of an office space: one is untidy and the other is clean and organized. You are searching for a document in this office space. In which scenario are you more likely to find the document easily? In the less cluttered one, of course, because each item is kept in its proper place. The data cleaning exercise is quite similar: if the data is arranged in a structured format, it becomes easier to find the right information.

Preprocessing the text data is an essential step because it makes the raw text ready for mining, i.e., it becomes easier to extract information from the text and apply machine learning algorithms to it. If we skip this step, there is a higher chance that we end up working with noisy and inconsistent data. The objective of this step is to remove noise that is less relevant to the sentiment of the tweets, such as punctuation, special characters, numbers, and terms that don’t carry much weight in the context of the text.
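
The cleaning described above can be sketched with a regular expression and a length filter. This is a minimal illustration, not the notebook's exact code: the helper name `clean_tweet`, the choice to keep the `#` character, and the three-character threshold for dropping short words are all assumptions.

```python
import re


def clean_tweet(text, min_word_len=3):
    """Remove punctuation, digits, and special characters, then drop short words."""
    # Replace everything except letters and '#' with a space
    # (keeping '#' is an assumption, so hashtags stay recognizable).
    text = re.sub(r"[^a-zA-Z#]", " ", text)
    # Drop very short words, which tend to carry little meaning.
    words = [w for w in text.split() if len(w) >= min_word_len]
    return " ".join(words).lower()


# Made-up example tweet
raw = "When a father is dysfunctional & so selfish!!! #run 2018"
cleaned = clean_tweet(raw)
# cleaned == "when father dysfunctional selfish #run"
```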

In one of the later stages, we will be extracting numeric features from our Twitter text data. This feature space is created using all the unique words present in the entire data. So, if we preprocess our data well, then we would be able to get a better quality feature space.
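
To make the idea of a feature space built from unique words concrete, here is a small bag-of-words sketch in plain Python. The helper names `build_vocabulary` and `bag_of_words` and the sample documents are hypothetical; a library vectorizer would normally do this work.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each unique word across all documents to a column index."""
    words = sorted({word for doc in docs for word in doc.split()})
    return {word: idx for idx, word in enumerate(words)}


def bag_of_words(doc, vocab):
    """Count vector for one document over the shared vocabulary."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocab]


# Made-up, already-cleaned documents
docs = ["love this happy day", "hate this day", "happy happy"]
vocab = build_vocabulary(docs)  # columns: day, happy, hate, love, this
vectors = [bag_of_words(doc, vocab) for doc in docs]
# vectors == [[1, 1, 0, 1, 1], [1, 0, 1, 0, 1], [0, 2, 0, 0, 0]]
```

Every leftover punctuation mark or stray token would add a column to this vocabulary, which is why careful cleaning leads to a smaller and more useful feature space.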

Let’s first read our data and load the necessary libraries.

Story Generation and Visualization from Tweets

In this section, we will explore the cleaned tweet text. Exploring and visualizing data, whether it is text or any other kind of data, is an essential step in gaining insights. Do not limit yourself to the methods covered in this tutorial; feel free to explore the data as much as possible.

Before we begin exploring, we must think about and ask questions related to the data at hand. A few probable questions are as follows:

What are the most common words in the entire dataset?
What are the most common words in the dataset for negative and positive tweets, respectively?
How many hashtags are there in a tweet?
Which trends are associated with my dataset?
Which trends are associated with either of the sentiments? Are they compatible with the sentiments?
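
Several of these questions can be answered with simple counting. The sketch below uses only the standard library; the helper names `top_words` and `extract_hashtags` and the sample tweets and labels are hypothetical.

```python
import re
from collections import Counter


def top_words(tweets, n=3):
    """Most frequent words across a collection of tweets."""
    counts = Counter(word for tweet in tweets for word in tweet.split())
    return counts.most_common(n)


def extract_hashtags(tweet):
    """Return the hashtags in a tweet, without the leading '#'."""
    return re.findall(r"#(\w+)", tweet)


# Made-up cleaned tweets with made-up labels
tweets = ["love this #sunny day", "hate this #rain", "love #sunny #weekend"]
labels = [0, 1, 0]

# How many hashtags are there in each tweet?
hashtags_per_tweet = [len(extract_hashtags(t)) for t in tweets]

# Which hashtags are associated with each label?
hashtags_by_label = {}
for tweet, label in zip(tweets, labels):
    hashtags_by_label.setdefault(label, Counter()).update(extract_hashtags(tweet))

# Most common words among the tweets with label 0
common_label_0 = top_words([t for t, l in zip(tweets, labels) if l == 0])
```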

End Notes

In this article, we learned how to approach a sentiment analysis problem. We started with preprocessing and exploration of data. Then we extracted features from the cleaned text using Bag-of-Words and TF-IDF. Finally, we were able to build a couple of models using both the feature sets to classify the tweets.
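
As a reminder of what TF-IDF features capture, the sketch below computes one common textbook variant (term frequency times the log of inverse document frequency) in plain Python. The function name `tf_idf` and the sample documents are hypothetical, and library implementations typically add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Per-document TF-IDF weights: (count / doc length) * log(N / doc frequency)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


# Made-up documents: "day" appears everywhere, "sad" in only one document
docs = ["happy day", "sad day", "happy happy day"]
weights = tf_idf(docs)
```

A word that appears in every document, such as "day" here, gets a weight of zero, while a word that is rare across documents, such as "sad", is weighted up.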

Did you find this article useful? Do you have any useful tricks? Did you use another method for feature extraction? Feel free to share your experiences in the comments below or on the discussion portal, and we’ll be more than happy to discuss them.

twitter-sentiment-analysis's Issues

Error at line 253

Your case study has helped my students understand sentiment analysis well.
But there is an error at line 253:
x_train, x_valid, y_train, y_valid = train_test_split(x, y, test_size = 0.25, random_state = 42)

Can you please tell me why I am getting this error?
I have been trying to resolve it since last week, but couldn't.

The error says:
(screenshot attached in the original issue; image not preserved)

Please guide me and my students through this

Thanks in advance

not working

AttributeError Traceback (most recent call last)
in
3
4 # importing gensim
----> 5 import gensim
6
7 # creating a word to vector model

~\anaconda3\lib\site-packages\gensim\__init__.py in
9 import logging
10
---> 11 from gensim import parsing, corpora, matutils, interfaces, models, similarities, utils # noqa:F401
12
13

~\anaconda3\lib\site-packages\gensim\parsing\__init__.py in
2
3 from .porter import PorterStemmer # noqa:F401
----> 4 from .preprocessing import (remove_stopwords, strip_punctuation, strip_punctuation2, # noqa:F401
5 strip_tags, strip_short, strip_numeric,
6 strip_non_alphanum, strip_multiple_whitespaces,

~\anaconda3\lib\site-packages\gensim\parsing\preprocessing.py in
24 import glob
25
---> 26 from gensim import utils
27 from gensim.parsing.porter import PorterStemmer
28

~\anaconda3\lib\site-packages\gensim\utils.py in
34 import numpy as np
35 import scipy.sparse
---> 36 from smart_open import open
37
38 from gensim import __version__ as gensim_version

~\anaconda3\lib\site-packages\smart_open\__init__.py in
32
33 from smart_open import version # noqa: E402
---> 34 from .smart_open_lib import open, parse_uri, smart_open, register_compressor # noqa: E402
35
36 _WARNING = """smart_open.s3_iter_bucket is deprecated and will stop functioning

~\anaconda3\lib\site-packages\smart_open\smart_open_lib.py in
33
34 from smart_open import compression
---> 35 from smart_open import doctools
36 from smart_open import transport
37

~\anaconda3\lib\site-packages\smart_open\doctools.py in
19
20 from . import compression
---> 21 from . import transport
22
23 PLACEHOLDER = ' smart_open/doctools.py magic goes here'

~\anaconda3\lib\site-packages\smart_open\transport.py in
20 NO_SCHEME = ''
21
---> 22 _REGISTRY = {NO_SCHEME: smart_open.local_file}
23 _ERRORS = {}
24 _MISSING_DEPS_ERROR = """You are trying to use the %(module)s functionality of smart_open

AttributeError: partially initialized module 'smart_open' has no attribute 'local_file' (most likely due to a circular import)

Error: TypeError: unhashable type: 'list'

Can i help?

import nltk
a = nltk.FreqDist(HT_regular)
d = pd.DataFrame({'Hashtag': list(a.keys()),
                  'Count': list(a.values())})

# selecting top 20 most frequent hashtags
d = d.nlargest(columns="Count", n = 20)
plt.figure(figsize=(16,5))
ax = sns.barplot(data=d, x= "Hashtag", y = "Count")
ax.set(ylabel = 'Count')
plt.show()


TypeError Traceback (most recent call last)
in ()
1 import nltk
----> 2 a = nltk.FreqDist(HT_regular)
3 d = pd.DataFrame({'Hashtag': list(a.keys()),
4 'Count': list(a.values())})
5

3 frames
/usr/lib/python3.7/collections/__init__.py in update(*args, **kwds)
653 super(Counter, self).update(iterable) # fast path when counter is empty
654 else:
--> 655 _count_elements(self, iterable)
656 if kwds:
657 self.update(kwds)

TypeError: unhashable type: 'list'
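
This error means that at least one element of `HT_regular` is itself a list. `nltk.FreqDist` is a subclass of `collections.Counter`, and a counter can only count hashable items such as strings. A likely fix, assuming `HT_regular` holds one list of hashtags per tweet, is to flatten it before counting. The sketch below uses `Counter` as a stand-in so that it runs without NLTK; the sample data and the names `HT_nested` and `HT_flat` are hypothetical.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for HT_regular: one list of hashtags per tweet
HT_nested = [["run"], ["love", "positive"], [], ["run", "healthy"]]

# Flatten the list of lists into a single list of strings
HT_flat = list(chain.from_iterable(HT_nested))

# Counting the flat list works; counting HT_nested raises the TypeError above
freq = Counter(HT_flat)
most_common = freq.most_common(1)  # [('run', 2)]
```

With the real data, passing the flattened list to `nltk.FreqDist` in place of `HT_regular` should behave the same way as `Counter` does here.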
