sharmaroshan / twitter-sentiment-analysis

This is a Natural Language Processing problem in which sentiment analysis is performed by classifying tweets as positive or negative with machine learning models. It covers classification, text mining, text analysis, data analysis, and data visualization.

License: GNU General Public License v3.0

Jupyter Notebook 99.04% Python 0.96%

nlp sentiment-analysis data-analysis bag-of-words data-visualization eda machine-learning classification cross-validation evaluation-metrics

twitter-sentiment-analysis's Introduction

Twitter-Sentiment-Analysis

Introduction

Natural Language Processing (NLP) is a hotbed of research in data science these days and one of the most common applications of NLP is sentiment analysis. From opinion polls to creating entire marketing strategies, this domain has completely reshaped the way businesses work, which is why this is an area every data scientist must be familiar with.

Thousands of text documents can be processed for sentiment (and other features including named entities, topics, themes, etc.) in seconds, compared to the hours it would take a team of people to manually complete the same task.

In this article, we will work through the sequence of steps needed to solve a general sentiment analysis problem on Twitter data. We will start with preprocessing and cleaning the raw text of the tweets. Then we will explore the cleaned text and try to get some intuition about the context of the tweets. After that, we will extract numerical features from the data, and finally we will use these feature sets to train models and identify the sentiments of the tweets.

This is one of the most interesting challenges in NLP so I’m very excited to take this journey with you!

Understand the Problem Statement

Let’s go through the problem statement once, because it is crucial to understand the objective before working on the dataset. The problem statement is as follows:

The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to distinguish racist or sexist tweets from other tweets.

Formally, given a training sample of tweets and labels, where label ‘1’ denotes the tweet is racist/sexist and label ‘0’ denotes the tweet is not racist/sexist, your objective is to predict the labels on the given test dataset.

Note: The evaluation metric for this practice problem is the F1-Score.
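For reference, F1 is the harmonic mean of precision and recall, computed here on the positive class (label 1, the racist/sexist tweets). The following is a minimal sketch in plain Python, not the scoring code used by the practice problem; the function name `f1_score_binary` and the sample labels are illustrative assumptions.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Return the F1 score of the `positive` class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        # No correct positive predictions: precision or recall is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for illustration only.
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]
score = f1_score_binary(y_true, y_pred)
print(round(score, 3))
```

Because hate-speech tweets are usually a small share of the data, F1 is more informative than plain accuracy: a model that predicts label 0 for every tweet scores 0.0 here even if most of its predictions are correct.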

Tweets Preprocessing and Cleaning

Picture two scenarios of an office space: one is untidy, and the other is clean and organized.

You are searching for a document in this office space. In which scenario are you more likely to find the document easily? Of course, in the less cluttered one because each item is kept in its proper place. The data cleaning exercise is quite similar. If the data is arranged in a structured format then it becomes easier to find the right information.

Preprocessing the text data is an essential step because it makes the raw text ready for mining, i.e., it becomes easier to extract information from the text and apply machine learning algorithms to it. If we skip this step, there is a higher chance that we end up working with noisy and inconsistent data. The objective of this step is to remove noise that is less relevant to finding the sentiment of tweets, such as punctuation, special characters, numbers, and terms that carry little weight in the context of the text.
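One possible cleaning pass can be sketched with the standard `re` module. This is an assumption-laden illustration rather than the notebook's actual code: the function name `clean_tweet`, the choice to drop `@user` handles, the choice to keep `#` so hashtags survive, and the use of a minimum word length as a stand-in for "terms that carry little weight" are all choices made for this example.

```python
import re


def clean_tweet(text, min_len=3):
    """Hypothetical cleaning helper: returns a lowercased, de-noised tweet."""
    text = re.sub(r"@\w+", " ", text)        # drop user handles such as @user
    text = re.sub(r"[^a-zA-Z#]", " ", text)  # replace punctuation, digits, symbols with spaces
    # Keep only words of at least `min_len` characters (assumed noise threshold).
    words = [w.lower() for w in text.split() if len(w) >= min_len]
    return " ".join(words)


raw = "@user When a father is dysfunctional and is so selfish!! #run 2019"
cleaned = clean_tweet(raw)
print(cleaned)
```

The length threshold is a blunt tool: it removes short filler words but can also remove short meaningful ones, so it is worth inspecting a sample of cleaned tweets before settling on a value.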

In one of the later stages, we will be extracting numeric features from our Twitter text data. This feature space is created using all the unique words present in the entire data. So, if we preprocess our data well, then we would be able to get a better quality feature space.
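To make the link between unique words and the feature space concrete, here is a minimal bag-of-words sketch in plain Python. The helper names `build_vocabulary` and `bag_of_words` and the two sample documents are made up for illustration; a real project would more likely rely on a library vectorizer.

```python
def build_vocabulary(docs):
    """Map every unique word in `docs` to a column index (hypothetical helper)."""
    unique_words = sorted({word for doc in docs for word in doc.split()})
    return {word: index for index, word in enumerate(unique_words)}


def bag_of_words(doc, vocab):
    """Count how often each vocabulary word occurs in `doc`."""
    vector = [0] * len(vocab)
    for word in doc.split():
        if word in vocab:
            vector[vocab[word]] += 1
    return vector


# Made-up, already-cleaned documents.
docs = ["good day good mood", "bad day"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Every distinct token becomes a column, so leftover noise such as stray punctuation or misspellings widens the feature space without adding signal.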

Let’s first read our data and load the necessary libraries.

Story Generation and Visualization from Tweets

In this section, we will explore the cleaned tweet text. Exploring and visualizing data, whether it is text or any other kind of data, is an essential step in gaining insights. Do not limit yourself to the methods covered in this tutorial; feel free to explore the data as much as possible.

Before we begin exploration, we must think and ask questions related to the data in hand. A few probable questions are as follows:

What are the most common words in the entire dataset?
What are the most common words in the dataset for negative and positive tweets, respectively?
How many hashtags are there in a tweet?
Which trends are associated with my dataset?
Which trends are associated with either of the sentiments? Are they compatible with the sentiments?
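Several of these questions can be answered with simple counting. The sketch below uses only the standard library; the tweets, the labels, and the helper name `extract_hashtags` are made up for illustration and are not taken from the dataset.

```python
import re
from collections import Counter


def extract_hashtags(tweet):
    """Return the hashtags in a tweet without the leading '#' (hypothetical helper)."""
    return re.findall(r"#(\w+)", tweet)


# Made-up cleaned tweets; label 1 marks the racist/sexist class.
tweets = ["love this sunny day #happy #sun", "hate this #angry", "love love #happy"]
labels = [0, 1, 0]

# Most common words across all tweets.
word_counts = Counter(word for tweet in tweets for word in tweet.split())

# Number of hashtags in each tweet.
hashtags_per_tweet = [len(extract_hashtags(tweet)) for tweet in tweets]

# Hashtag frequencies split by label.
hashtags_by_label = {0: Counter(), 1: Counter()}
for tweet, label in zip(tweets, labels):
    hashtags_by_label[label].update(extract_hashtags(tweet))

print(word_counts.most_common(3))
print(hashtags_per_tweet)
print(hashtags_by_label)
```

Comparing the per-label hashtag counts is a quick way to check whether the trends in each class are compatible with its sentiment.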

End Notes

In this article, we learned how to approach a sentiment analysis problem. We started with preprocessing and exploration of data. Then we extracted features from the cleaned text using Bag-of-Words and TF-IDF. Finally, we were able to build a couple of models using both the feature sets to classify the tweets.

Did you find this article useful? Do you have any useful tricks? Did you use any other method for feature extraction? Feel free to share your experiences in the comments below or on the discussion portal, and we’ll be more than happy to discuss.

twitter-sentiment-analysis's People

rkumar45 smoolya17 cnxtech engcomm ragha349 akhiladindi themayankbansal gwamakacharles mohakkamat ai-natural-language-processing-lab farhanjusoh nkululekotech navtikakumar sid515 devaatom ajjumaxy 100rabh1401 pranavpathare manthan2501 deepakahirwar ramannmathur ananyaas nthabisengmoela tamoghnokandar pmsorion anusha2105 grksharma deepak2233 christan7652 sagnikmitra marksilas subhasish28 tahseen-mulla s33k3rs kescardoso tiadwi jaswantcoder sthan41 gradled dpkprmr siddhantm99 rukeshlekkala sreshthakashyap sohamron sandithr subalakshmiariyaselvam prava1 amang09 medfouad79 pradeep7g asaif11 jatins-dev fulwade hemanth143710 ruhail-ali-khan miles-hub99 rola-ahmed crunchywaterlol mmmmosman joey00072 mickymocombe dipass-io vengat-jerry saurabh22111999 mohammad-hossein-ataie kashish121 dincerdogan jayeshmjadhav amaansayyad amitarp lamis-amd biglboy pjdpaulsagnik misssophieexplores siphon18 aniket-kote metypes junior-081 bcgreddy kskhoo-jason iz-nzy deven876 techsoft29 sidharthtomar manishnath001 vamsi1284 laharigowda02 xi-525-hub majogamit aishwaryasharma-stats pruthvirajpardeshi shivashirsath oboho-etuk homebrew-startup lucasndjoli ahmedmustahid babyyoda15 harishkumar1111 milangeorge2000 carolinebarra

twitter-sentiment-analysis's Issues

Final project 2

Chi

Error at line 253

Your case study has helped my students understand sentiment analysis well.
But there is an error at line 253:
x_train, x_valid, y_train, y_valid = train_test_split(x, y, test_size = 0.25, random_state = 42)

Can you please tell me why I am getting this error?
I have been trying to resolve it since last week, but couldn't.

The error says:

Please guide me and my students through this

Thanks in advance

not working

AttributeError Traceback (most recent call last)
in
3
4 # importing gensim
----> 5 import gensim
6
7 # creating a word to vector model

~\anaconda3\lib\site-packages\gensim\__init__.py in
9 import logging
10
---> 11 from gensim import parsing, corpora, matutils, interfaces, models, similarities, utils # noqa:F401
12
13

~\anaconda3\lib\site-packages\gensim\parsing\__init__.py in
2
3 from .porter import PorterStemmer # noqa:F401
----> 4 from .preprocessing import (remove_stopwords, strip_punctuation, strip_punctuation2, # noqa:F401
5 strip_tags, strip_short, strip_numeric,
6 strip_non_alphanum, strip_multiple_whitespaces,

~\anaconda3\lib\site-packages\gensim\parsing\preprocessing.py in
24 import glob
25
---> 26 from gensim import utils
27 from gensim.parsing.porter import PorterStemmer
28

~\anaconda3\lib\site-packages\gensim\utils.py in
34 import numpy as np
35 import scipy.sparse
---> 36 from smart_open import open
37
38 from gensim import __version__ as gensim_version

~\anaconda3\lib\site-packages\smart_open\__init__.py in
32
33 from smart_open import version # noqa: E402
---> 34 from .smart_open_lib import open, parse_uri, smart_open, register_compressor # noqa: E402
35
36 _WARNING = """smart_open.s3_iter_bucket is deprecated and will stop functioning

~\anaconda3\lib\site-packages\smart_open\smart_open_lib.py in
33
34 from smart_open import compression
---> 35 from smart_open import doctools
36 from smart_open import transport
37

~\anaconda3\lib\site-packages\smart_open\doctools.py in
19
20 from . import compression
---> 21 from . import transport
22
23 PLACEHOLDER = ' smart_open/doctools.py magic goes here'

~\anaconda3\lib\site-packages\smart_open\transport.py in
20 NO_SCHEME = ''
21
---> 22 _REGISTRY = {NO_SCHEME: smart_open.local_file}
23 _ERRORS = {}
24 _MISSING_DEPS_ERROR = """You are trying to use the %(module)s functionality of smart_open

AttributeError: partially initialized module 'smart_open' has no attribute 'local_file' (most likely due to a circular import)

Error: TypeError: unhashable type: 'list'

Can i help?

import nltk
a = nltk.FreqDist(HT_regular)
d = pd.DataFrame({'Hashtag': list(a.keys()),
'Count': list(a.values())})

# selecting top 20 most frequent hashtags

d = d.nlargest(columns="Count", n = 20)
plt.figure(figsize=(16,5))
ax = sns.barplot(data=d, x= "Hashtag", y = "Count")
ax.set(ylabel = 'Count')
plt.show()

TypeError Traceback (most recent call last)
in ()
1 import nltk
----> 2 a = nltk.FreqDist(HT_regular)
3 d = pd.DataFrame({'Hashtag': list(a.keys()),
4 'Count': list(a.values())})
5

3 frames
/usr/lib/python3.7/collections/__init__.py in update(*args, **kwds)
653 super(Counter, self).update(iterable) # fast path when counter is empty
654 else:
--> 655 _count_elements(self, iterable)
656 if kwds:
657 self.update(kwds)

TypeError: unhashable type: 'list'
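A likely cause, judging from the traceback alone: `nltk.FreqDist` counts items the way `collections.Counter` does (the failing frame is inside `Counter.update`), so every item must be hashable. If `HT_regular` is a list of lists, with one list of hashtags per tweet, each item is itself a list and counting fails. The sketch below reproduces the error with `collections.Counter` and flattens the nested list before counting. The sample hashtags and the names `HT_nested` and `HT_flat` are made up, and whether `HT_regular` is actually nested in the reporter's session is an assumption.

```python
from collections import Counter
from itertools import chain

# Made-up stand-in for HT_regular: one list of hashtags per tweet.
HT_nested = [["love", "happy"], [], ["happy"]]

# Counting the nested structure fails because lists are not hashable.
try:
    Counter(HT_nested)
    raised = False
    message = ""
except TypeError as err:
    raised = True
    message = str(err)

# Flatten to a single list of strings, then count.
HT_flat = list(chain.from_iterable(HT_nested))
counts = Counter(HT_flat)

print(raised, message)
print(counts.most_common())
```

If this matches the situation, passing the flattened list to `nltk.FreqDist` should avoid the error.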