Document dsbda codes for mini project PDF

Title	Document dsbda codes for mini project
Author	Aditya Deshmukh
Course	computer engineer
Institution	Savitribai Phule Pune University
Pages	8
File Size	200 KB
File Type	PDF
Total Downloads	106
Total Views	706


Description

DATA SCIENCE AND BIG DATA ANALYTICS PROJECT REPORT ON

TWEET SENTIMENT ANALYSIS MODEL USING SCIKIT-LEARN IN PYTHON

SUBMITTED BY 1. Deshmukh Aditya (TCOD35) 2. Mitali Doshi (TCOD77)

UNDER THE GUIDANCE OF

Ms. Neha Chankhore

Title: Develop a Tweet Sentiment Analysis model using the Scikit-learn library in Python

Objective: The student should be able to develop a Tweet Sentiment Analysis model using the Scikit-learn library in Python

Introduction: Supervised learning is when a model is trained on a labelled dataset, that is, a dataset that contains both input and output parameters. In this type of learning, both the training and validation datasets are labelled. Here we have three components: training data, test data and features. The dataset is usually split in the ratio 80:20, with 80% used as training data and the rest as test data. At testing time, inputs are fed from the remaining 20% of the data, which the model has never seen before; the model predicts a value for each input, and we compare it with the actual output to calculate the accuracy.

The three stages of the machine learning workflow are as follows:

1. Data Extraction and Cleaning: The data is extracted and cleaned using scripting languages such as Python or shell scripting. Here we extract and filter the useful data from the tweet datasets according to our needs.

2. Build ML Model: Once the data has been extracted and cleaned, we start building the model with tools such as TensorFlow or Azure ML. We build our sentiment analysis engine here, in the form of a Python script.

3. Build Software Infrastructure: Here we package the ML components as a software product for users, for example using JavaScript. A little knowledge of cloud infrastructure such as AWS (Amazon Web Services) is useful, as is some familiarity with GitHub for collaborating with other people.
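The 80:20 split described in the introduction can be sketched in a few lines of plain Python. This is a minimal illustration, not the report's own code: the function name split_train_test, the seed value and the placeholder tweets are assumptions made for the example. In a scikit-learn project the same job is normally done with sklearn.model_selection.train_test_split.

```python
import random


def split_train_test(samples, labels, test_fraction=0.2, seed=42):
    """Shuffle the data and split it into training and test portions."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    indices = list(range(len(samples)))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(samples) * test_fraction))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]
    return train, test


# Placeholder data (hypothetical): ten tweets with alternating labels.
tweets = ["tweet %d" % i for i in range(10)]
labels = [1, 0] * 5

train, test = split_train_test(tweets, labels)
print(len(train), len(test))  # 8 training examples, 2 test examples
```

The test portion is set aside before any training happens, so the accuracy measured on it reflects data the model has never seen.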

Logistic Regression: Logistic regression is a supervised machine learning technique for classification problems. Supervised machine learning algorithms train on a labelled dataset whose known outputs act as an answer key, used both to train the model and to evaluate its accuracy. The goal of the model is to learn and approximate a mapping function f(X) = Y from the input variables {x1, x2, ..., xn} to the output variable Y. It is called supervised because the model predictions are iteratively evaluated and corrected against the output values until an acceptable performance is achieved.
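Since the project title names Scikit-learn, the following is a minimal sketch of fitting a logistic regression classifier with that library. The feature values are made-up toy numbers, not data from the report. LogisticRegression, fit, predict and predict_proba are part of scikit-learn's public API; the default settings used here are an assumption, not the report's configuration.

```python
from sklearn.linear_model import LogisticRegression

# Toy features, made up for illustration:
# [sum of positive-class frequencies, sum of negative-class frequencies]
X_train = [[3, 0], [2, 1], [0, 3], [1, 2]]
y_train = [1, 1, 0, 0]  # 1 = positive tweet, 0 = negative tweet

# LogisticRegression fits its own intercept, so no bias column is needed.
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict labels for two unseen feature vectors.
predictions = model.predict([[4, 0], [0, 4]])
# Class probabilities; the columns follow the order of model.classes_.
probabilities = model.predict_proba([[4, 0]])
print(predictions)
```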

Sentiment Analysis using Logistic Regression: To build a sentiment classifier using logistic regression, we train the model on the Twitter sample dataset. The dataset is available as raw tweets in natural human language, which is not easy for a model to work with. We therefore have to pre-process and clean the data to break the text down into a format the model can use.

Architecture: Pre-processing of tweets includes the following steps:

1. Removing punctuation, hyperlinks and hashtags
2. Tokenization: converting a sentence into a list of words
3. Converting words to lower case
4. Removing stop words
5. Lemmatization/stemming: transforming a word to its root word

For this model, we will use NLTK's twitter_samples corpus as our labelled training data. The twitter_samples corpus contains 3 files:

negative_tweets.json: contains 5000 negative tweets
positive_tweets.json: contains 5000 positive tweets
tweets.20150430-223406.json: contains 20,000 positive and negative tweets

Feature Extraction: Machine learning models operate on numbers rather than raw text. We therefore need to transform the tweets into vectors that can later be fed into our model for training. There are many ways to represent text as vectors, depending on the context. We will use a frequency dictionary, which records how often each word appears in tweets of each class, to convert each tweet to a vector of 3 dimensions as below:

tweet = [1, Σ freq of words in positive class, Σ freq of words in negative class]

The leading 1 is the bias term. Since each tweet is represented as a vector, we can combine all the vectors into a single matrix.

Sigmoid Activation Function: Logistic regression fits its parameters using the maximum likelihood technique. The sigmoid is a mathematical function that takes any real value between -∞ and +∞ and maps it to a real value between 0 and 1. If the output of the sigmoid function is greater than 0.5, we classify the tweet as positive, and if it is less than 0.5, we classify it as negative.
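The pre-processing steps and the 3-dimensional feature vector described above can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the report's implementation: the stop-word list is a tiny hand-written sample, naive_stem is a crude stand-in for a real stemmer such as the ones NLTK provides, and the function names and example tweets are hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption; NLTK ships a fuller one).
STOP_WORDS = {"a", "an", "the", "is", "are", "am", "i", "to",
              "and", "of", "it", "this", "so"}


def naive_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tweet):
    """Clean one tweet and return its list of stemmed tokens."""
    tweet = re.sub(r"https?://\S+", "", tweet)  # remove hyperlinks
    tweet = tweet.replace("#", "")              # drop the hash sign, keep the word
    tweet = tweet.lower()                       # lower-case everything
    # Tokenize on letters only, which also discards punctuation and digits.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", tweet)
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def build_freqs(tweets, labels):
    """Map each (word, label) pair to how often it occurs."""
    freqs = defaultdict(int)
    for tweet, label in zip(tweets, labels):
        for word in preprocess(tweet):
            freqs[(word, label)] += 1
    return freqs


def extract_features(tweet, freqs):
    """Return [bias, sum of positive freqs, sum of negative freqs]."""
    words = preprocess(tweet)
    pos = sum(freqs.get((w, 1), 0) for w in words)
    neg = sum(freqs.get((w, 0), 0) for w in words)
    return [1.0, float(pos), float(neg)]


# Hypothetical training tweets: 1 = positive, 0 = negative.
train_tweets = [
    "I love this movie! #happy http://t.co/abc",
    "Great movie, so happy",
    "This is so sad",
    "Sad and crying :(",
]
train_labels = [1, 1, 0, 0]

freqs = build_freqs(train_tweets, train_labels)
features = extract_features("Happy movie, not sad", freqs)
print(features)  # [1.0, 4.0, 2.0]
```

In the example, "happy" and "movie" each occur twice in positive tweets and "sad" occurs twice in negative tweets, which gives the sums 4 and 2.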

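Cost Function and Gradient Descent: For logistic regression we use the cross-entropy (log loss) cost function. The cross-entropy cost can be divided into 2 cost functions, one for each output: -log(h) when the true label is 1 and -log(1 - h) when the true label is 0, where h is the predicted probability of the positive class.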
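The sigmoid and the cross-entropy cost can be written out directly. With m examples, true labels y and predicted probabilities h, the averaged cost is J = -(1/m) Σ [y log(h) + (1 - y) log(1 - h)]. The sketch below is illustrative: the function names and the eps clipping constant are assumptions chosen for the example.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    # The two branches avoid overflow in math.exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def cross_entropy_cost(probs, labels, eps=1e-12):
    """Average log loss over all examples."""
    total = 0.0
    for p, y in zip(probs, labels):
        # Clip so that log(0) can never occur.
        p = min(max(p, eps), 1.0 - eps)
        # Only one of the two terms is non-zero for a 0/1 label.
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


print(sigmoid(0.0))  # 0.5
good = cross_entropy_cost([0.9, 0.1], [1, 0])  # confident and correct
bad = cross_entropy_cost([0.1, 0.9], [1, 0])   # confident and wrong
print(good < bad)  # True
```

Confident correct predictions give a cost close to 0, while confident wrong predictions are penalized heavily, which is what drives gradient descent toward better weights.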

Training & Evaluating the Sentiment Classifier: Gradient descent is applied to obtain the theta vector of optimal weights. The sentiment of a new tweet is predicted using the sigmoid of the dot product of the extracted feature vector x and the theta vector. The threshold is set at 0.5: if y_pred > 0.5, the tweet is predicted as positive, otherwise as negative.
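The training and prediction procedure described above can be sketched as batch gradient descent in plain Python. The toy feature vectors, learning rate and iteration count are assumptions chosen so the example converges quickly; they are not values from the report.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def train(X, y, learning_rate=0.1, iterations=2000):
    """Batch gradient descent on the cross-entropy cost; returns theta."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        preds = [sigmoid(dot(row, theta)) for row in X]
        errors = [p - t for p, t in zip(preds, y)]
        # The gradient of the cost for weight j is mean(error * feature_j).
        grads = [sum(e * row[j] for e, row in zip(errors, X)) / m
                 for j in range(n)]
        theta = [t - learning_rate * g for t, g in zip(theta, grads)]
    return theta


def predict(x, theta, threshold=0.5):
    """Return 1 (positive) if sigmoid(x . theta) exceeds the threshold."""
    return 1 if sigmoid(dot(x, theta)) > threshold else 0


# Toy feature vectors [bias, positive freq sum, negative freq sum] (made up).
X = [[1, 3, 0], [1, 2, 1], [1, 0, 3], [1, 1, 2]]
y = [1, 1, 0, 0]

theta = train(X, y)
print(predict([1, 4, 0], theta))  # 1, predicted positive
print(predict([1, 0, 4], theta))  # 0, predicted negative
```

On a real corpus the frequency sums can run into the thousands, so a much smaller learning rate or some feature scaling is usually needed for gradient descent to stay stable.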

Implementation:

Step 1: Install Python. If you do not have Python installed on your PC, install the latest version of Python from the web.

Step 2: Download the pip package manager for Python. Once you have installed Python, we need to install some libraries which we are going to use in our workshop. We will install NumPy, Matplotlib and Pandas to work with our datasets. pip is a package management system used to install and manage software packages written in Python. Go to the folder where you saved the get-pip.py file. In Windows Explorer, use Shift + Right Click and then select "Open command window here" to open a command prompt in this directory. Then run the following command:

python get-pip.py

Step 3: Install Libraries. Open Command Prompt and run the following command to install the necessary libraries:

pip install numpy matplotlib pandas scikit-learn gym opencv-python

If the installation completes without any errors, you are all set.

CONCLUSION: Nowadays, sentiment analysis or opinion mining is a hot topic in machine learning. We are still far from detecting the sentiment of a corpus of texts very accurately, because of the complexity of the English language, and even more so if we consider other languages such as Chinese.

In this project we tried to show the basic way of classifying tweets into positive or negative categories using Logistic Regression as a baseline; other models can produce better results. We could further improve our classifier by extracting more features from the tweets and by trying different kinds of features.
