
Kathmandu University
Department of Computer Science and Engineering
Dhulikhel, Kavre

A Mini-Project Report on

"Text Summarizer"

COMP 473 - Speech and Language Processing
(For partial fulfillment of 4th Year / 2nd Semester in Computer Engineering)

Submitted by:
Ekta Chaudhary (11)
Nirusha Manandhar (31)
Sagun Lal Shrestha (53)
Ruchi Tandukar (57)

Submitted to:
Dr. Bal Krishna Bal
Associate Professor
Department of Computer Science and Engineering

Submission Date: July 31, 2020

Table of Contents

Abstract
List of Figures
List of Tables
List of Abbreviations
Chapter 1: Introduction
  1.1 Problem Definition
  1.2 Motivation
  1.3 Objectives
Chapter 2: Related Works
Chapter 3: Datasets
Chapter 4: Methods and Algorithms Used
  4.1 Word Embedding
  4.2 Text Rank Algorithm
  4.3 TF-IDF Algorithm
  4.4 Luhn Algorithm
Chapter 5: Experiments
  5.1 Data Preparation
  5.2 Tokenization and Text Preprocessing
  5.3 Building Model
Chapter 6: Discussion on Results
Chapter 7: Contributions
Chapter 8: Code
Chapter 9: Conclusion
References

Abstract

This report presents the mini-project assigned to eighth-semester students for the partial fulfillment of COMP 473, Speech and Language Processing, offered by the Department of Computer Science and Engineering, KU. With the ever-growing amount of text and information in digital space, it is nearly impossible to extract summaries manually. Hence, there is demand for an automatic system that can comprehend that data and deliver relevant information efficiently in a short time. In this project, we have developed an unsupervised extractive text summarizer that pulls out the most important and relevant information from a text to form a concise and accurate summary. The system is designed to generate summaries for both categories of dataset, i.e., single documents and multiple documents. Various extractive summarization algorithms, namely Text Rank, TF-IDF, and Luhn's algorithm, are used for experimenting and building the model.

Keywords: Natural Language Processing, Text Summarization, Extractive Text Summarization, Text Rank, TF-IDF, Luhn's algorithm

List of Figures

Figure 1 Words-frequency diagram
Figure 2 Significant word selection in Luhn's approach
Figure 3 Reading csv file
Figure 4 Wikipedia article extraction
Figure 5 Reading local file
Figure 6 Tokenization
Figure 7 Cleaning punctuations and symbols
Figure 8 Removing stop words
Figure 9 Word embedding extraction
Figure 10 Vector representation of string
Figure 11 Similarity matrix creation
Figure 12 Ranking sentences using pagerank
Figure 13 Output summary text generated by model using Text Rank algorithm
Figure 14 Word tokenization
Figure 15 Word frequency calculation for term frequency
Figure 16 Sentence importance calculation
Figure 17 Sentence index with corresponding importance scores
Figure 18 POS (noun and verb) tagging for each word
Figure 19 Term Frequency calculation function for each sentence
Figure 20 Inverse Document Frequency calculation function for each sentence
Figure 21 POS tagging, TF value, IDF value, and TF-IDF score value for each sentence
Figure 22 Finding most important sentences based on TF-IDF score and generating summary
Figure 23 Output summary generated by model using TF-IDF approach
Figure 24 Txt file summarization
Figure 25 Reading txt file
Figure 26 Output displayed
Figure 27 Other imports
Figure 28 Summary calculation
Figure 29 Output summary generated by model using Luhn algorithm
Figure 30 Final user interface for text summarization using Text Rank

List of Tables

Table 1 Work division

List of Abbreviations

HTML: Hyper Text Markup Language
NLP: Natural Language Processing
NLTK: Natural Language Toolkit
TF-IDF: Term Frequency-Inverse Document Frequency
UI: User Interface
URL: Uniform Resource Locator

Chapter 1: Introduction

The main concept of text summarization is selecting the most important information from the source text to generate a concise and meaningful summary while preserving the overall meaning and gist of the text. With the ever-growing amount of text and information circulating in the digital space day by day, a summarizer has become a handy tool for viewing information quickly and clearly while saving a lot of time. Text summarization methods can be categorized along two axes: extractive versus abstractive, and supervised versus unsupervised. The method proposed in this document is unsupervised extractive text summarization, which pulls out keywords or key sentences from the original voluminous document and assembles them to create a summary. This method works by identifying important sections of the text, cropping them out, and stitching together portions of the content to produce a condensed version. In essence, it highlights the important details from huge chunks of text. The algorithms used for extractive text summarization that are discussed in this paper are Text Rank, TF-IDF, and Luhn.

1.1 Problem Definition

In this age, there is an enormous amount of data on the internet, which is a valuable source of knowledge. So there is a need to develop an automatic system that can deliver a concise and accurate summary of a text. Summarization is greatly needed to consume the escalating amount of text data available online; in essence, it helps to extract relevant and important information faster. Hence, extensive research in this field of natural language processing is ongoing. An automatic text summarizer can reduce reading time and accelerate the research process to a great extent. This project therefore focuses on designing an automatic system that generates a clear and concise summary with the crucial details from a large document.


1.2 Motivation

As mentioned above, text summarization has been one of the most researched fields in natural language processing. The main reason this kind of research is gaining more and more attention is the increasing amount of information. The International Data Corporation (IDC) projects that the total amount of digital data circulating annually around the world will grow to 180 zettabytes by 2025. Summarizing thousands of documents by hand would be cumbersome for human beings, which motivates research into appropriate algorithms for this task that provide the core contents and hence save time and energy.

1.3 Objectives

The main objectives of developing this project are:

1. To develop a model that summarizes single as well as multiple documents entered by the user, in txt files as well as through Wikipedia articles, using various natural language processing techniques.
2. To explore, compare, and understand extractive summarization techniques such as the Text Rank algorithm, Luhn's algorithm, and the TF-IDF algorithm.
3. To understand the Natural Language Toolkit (NLTK) and its functions and apply it to text summarization for POS-tagging, tokenization, lemmatization, and so on.


Chapter 2: Related Works

The abundance of unstructured data and the need for condensed information have created peak demand for summarizing documents into short, readable summaries. Along with this demand, various research efforts on abstractive and extractive text summarization have been carried out. In one extractive and unsupervised text summarization project, a Text Rank approach using word embeddings, a similarity matrix, and graphs was used [1]. For word embedding, the project used GloVe (Global Vectors for Word Representation). The paper's output provided a bullet-point summary to the user, drawn from multiple documents. In a survey on extractive text summarization, various extractive techniques and their results were studied and analyzed, which indicated that semantic impressions were lacking in the summaries [2]. Moreover, these techniques were found to be simple and useful for short documents, but they did not address time and space complexity, and hence were slower for larger documents. The paper also identified the lack of a standard against which summarization models could be compared and properly evaluated; the issue arises in content selection, where choices are subjective. There are very few approaches to automatically evaluating summaries using paraphrases, such as ParaEval. In a study performed using the TF-IDF algorithm, the term frequency approach was rated 67% better by the sample respondents [3]. However, the traditional TF-IDF approach has also been found to be very discriminative and slow for larger documents [4].


Chapter 3: Datasets

There are many approaches to text summarization. This project covers both multiple-document and single-document summarization. We have taken a dataset of scraped articles in ".csv" format (tennis_article.csv), which falls under the single-domain, multiple-document summarization task. The dataset includes the title of each article (article_title), the actual article (article_text), and the source of the article (source). All these articles are concatenated to generate a single summary for the whole dataset. For the next approach, i.e., single-document text summarization, we have used a Wikipedia article as input. The article is accessed via an HTTPS request, with the topic taken as user input. The system is also designed to accept a local file in ".txt" format and generate its summary; this approach also belongs to the single-document summarization task.


Chapter 4: Methods and Algorithms Used

4.1 Word Embedding

Word embeddings are vector representations of words; these embeddings are used to create vectors for our sentences. Word embedding is one of the most popular representations of document vocabulary. It is capable of capturing the context of a word in a document and its similarity with other words. For word embedding we use GloVe, a popular model for distributed word representation. It is an unsupervised learning algorithm for obtaining vector representations of words, achieved by mapping words into a meaningful space where the distance between words is related to their semantic similarity.
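The report's embedding code appears only as figure screenshots, so the following is a minimal sketch of how GloVe vectors might be loaded and averaged into sentence vectors. The file name glove.6B.100d.txt, the 100-dimensional size, and the simple averaging scheme are assumptions, not the report's exact implementation.

    import numpy as np

    # Load pre-trained GloVe vectors into a dictionary mapping word -> vector.
    # The file name "glove.6B.100d.txt" is an assumption; any GloVe file works.
    embeddings = {}
    with open("glove.6B.100d.txt", encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

    def sentence_vector(sentence, dim=100):
        """Average the vectors of known words to get one vector per sentence."""
        vectors = [embeddings[w] for w in sentence.split() if w in embeddings]
        return np.mean(vectors, axis=0) if vectors else np.zeros(dim)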

4.2 Text Rank Algorithm

The Text Rank algorithm is based on the Page Rank algorithm, which was developed by Google for ranking web pages. Page Rank relies on the popularity of web pages, determined by the number and type of incoming links. In the Page Rank algorithm, each page's score is the probability of a user visiting that page. A square matrix of order n*n is created, where n is the number of web pages, and pages are ranked based on the probability scores computed in the matrix. The Text Rank algorithm uses the same concept, but ranks sentences instead of web pages. The similarity between each pair of sentences is evaluated and the similarity scores are stored in a similarity matrix of order n*n. The similarity matrix is then converted into a graph, and the top-ranked sentences are taken to generate the summary.
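A sketch of this pipeline using networkx's PageRank implementation follows; the cosine-similarity measure and the helper names (summarize_textrank, sentence_vectors) are assumptions for illustration, since the report's own code is shown only in figures.

    import numpy as np
    import networkx as nx
    from sklearn.metrics.pairwise import cosine_similarity

    def summarize_textrank(sentences, sentence_vectors, top_n=5):
        """Rank sentences with PageRank over a cosine-similarity graph."""
        n = len(sentences)
        # n*n matrix: entry (i, j) holds the similarity of sentences i and j.
        sim_matrix = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                if i != j:
                    sim_matrix[i][j] = cosine_similarity(
                        sentence_vectors[i].reshape(1, -1),
                        sentence_vectors[j].reshape(1, -1))[0, 0]
        # Convert the similarity matrix into a graph and apply PageRank.
        scores = nx.pagerank(nx.from_numpy_array(sim_matrix))
        # Take the top-ranked sentences, restored to document order.
        ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
        return " ".join(sentences[i] for i in sorted(ranked[:top_n]))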

4.3 TF-IDF Algorithm

TF-IDF stands for term frequency-inverse document frequency, one of the approaches used to score the importance of a word based on how often it appears in a specific document and in a collection of documents. The algorithm proceeds from the idea that a word appearing frequently in a document is important and should be given a high score, but a word appearing frequently across too many other documents is not a unique identifier and should be given a lower score. It is an unsupervised technique that converts the document text into a bag of words and then assigns a weight to each word. This approach

is useful in various applications like information retrieval, text mining, user modeling, and text summarization. The TF-IDF score for a word W is calculated with the following formulas:

TF(W) = (number of times term W appears in a document) / (total number of terms in the document)

IDF(W) = log_e((total number of documents) / (number of documents with term W in them))

Hence, TF-IDF(W) = TF(W) * IDF(W).

In extractive summarization, for an input document, sentence importance is scored based on the word importance scores evaluated by TF-IDF.
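As a rough illustration of how these formulas can drive sentence scoring, here is a sketch that treats each sentence as one "document" in the collection; the function name tfidf_sentence_scores and the summing scheme are assumptions, not the report's exact method.

    import math
    from collections import Counter

    def tfidf_sentence_scores(tokenized_sentences):
        """Score each sentence by the summed TF-IDF of its words, treating
        each sentence as one 'document' in the collection."""
        n_docs = len(tokenized_sentences)
        # Document frequency: in how many sentences does each word occur?
        df = Counter()
        for sent in tokenized_sentences:
            df.update(set(sent))
        scores = []
        for sent in tokenized_sentences:
            if not sent:
                scores.append(0.0)
                continue
            tf = Counter(sent)
            total = 0.0
            for word, count in tf.items():
                tf_w = count / len(sent)             # TF(W)
                idf_w = math.log(n_docs / df[word])  # IDF(W), natural log
                total += tf_w * idf_w
            scores.append(total)
        return scores

The sentences with the highest scores would then be selected for the summary.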

4.4 Luhn Algorithm

The Luhn algorithm is a heuristic method based on word frequency and is one of the earliest approaches to text summarization. It selects only the words of higher importance according to their frequency, and higher weights are assigned to words appearing at the beginning of the document. It considers the words lying in the shaded region of the following graph:

Figure 1 Words-frequency diagram


The region on the right signifies the most frequently occurring words, while the region on the left signifies the least frequently occurring ones. Luhn introduced the following criteria during text preprocessing:

1. Removing stop words
2. Stemming (e.g., likes -> like)

In this method, we select the sentences with the highest concentration of salient content terms, as illustrated below.

Figure 2 Significant word selection in Luhn’s approach

For example, suppose we have 10 words in a sentence and 4 of them are significant. To calculate significance, instead of dividing the number of significant words by the total number of words, we square the number of significant words and divide by the span that contains them. If the significant words in our example lie within a span of 6 words, the score is: Score = 4^2 / 6 ≈ 2.7
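A small sketch of this scoring rule follows; defining the span as running from the first to the last significant word is a simplification of Luhn's original bracketing rule.

    def luhn_score(sentence_words, significant_words):
        """Luhn sentence score: (significant words in span)^2 / span length,
        where the span runs from the first to the last significant word."""
        positions = [i for i, w in enumerate(sentence_words)
                     if w in significant_words]
        if not positions:
            return 0.0
        span = positions[-1] - positions[0] + 1
        return len(positions) ** 2 / span

    # The example from the text: a 10-word sentence with 4 significant
    # words lying within a span of 6 words scores 4**2 / 6, about 2.7.
    words = ["w1", "sig", "w3", "sig", "sig", "w6", "sig", "w8", "w9", "w10"]
    print(luhn_score(words, {"sig"}))  # 2.666...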


Chapter 5: Experiments

5.1 Data Preparation

As mentioned above, three datasets are used in this project. For the dataset of scraped articles, the first task is to read the data and concatenate all the articles so that we can generate a single summary covering all of them.

Figure 3 Reading csv file
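A minimal sketch of this step with pandas; the file and column names come from the Datasets chapter, while everything else is an assumption about the code shown in the figure.

    import pandas as pd

    # Read the scraped-articles dataset; the file and column names are
    # those given in the Datasets chapter.
    df = pd.read_csv("tennis_article.csv")

    # Concatenate all articles so one summary covers the whole dataset.
    full_text = " ".join(df["article_text"])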

For the dataset taken from a Wikipedia article, the topic is taken as input from the user. The URL for the input topic is generated and the article is accessed via an HTTPS request. A Python library called Beautiful Soup is used for HTML parsing, i.e., extracting the text contained within the paragraph tags (<p>) of the HTML.


Figure 4 Wikipedia article extraction
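A sketch of this extraction step; the URL pattern and the use of urllib are assumptions, since the report's exact code is shown only in the figure above.

    import urllib.request
    import bs4 as bs

    # Build the article URL from the user's topic; this URL pattern is an
    # assumption about how the report forms the request.
    topic = input("Enter a Wikipedia topic: ")
    url = "https://en.wikipedia.org/wiki/" + topic.strip().replace(" ", "_")

    # Fetch the page over HTTPS and parse it with Beautiful Soup.
    soup = bs.BeautifulSoup(urllib.request.urlopen(url).read(), "html.parser")

    # Keep only the text inside the paragraph (<p>) tags.
    article_text = " ".join(p.get_text() for p in soup.find_all("p"))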

A local file in txt format is read directly using the open() and read() functions.

Figure 5 Reading Local File
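A minimal sketch, assuming a hypothetical file name input.txt:

    # Read the whole local file into one string; "input.txt" stands in
    # for whatever file the user supplies.
    with open("input.txt", "r", encoding="utf-8") as f:
        article_text = f.read()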


5.2 Tokenization and Text Preprocessing

The text obtained from the steps above was split into sentences using the sent_tokenize() function of NLTK.

Figure 6 Tokenization
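A minimal sketch of this step; the sample input string is invented for illustration, and the punkt download refers to NLTK's standard tokenizer models.

    import nltk
    from nltk.tokenize import sent_tokenize

    nltk.download("punkt")  # sentence tokenizer models, downloaded once

    article_text = "Text summarization saves time. It picks the key sentences."
    sentences = sent_tokenize(article_text)
    # -> ['Text summarization saves time.', 'It picks the key sentences.']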

After splitting the text into sentences, the punctuation, numbers, and special characters were removed to make the data noise-free, and all letters were converted to lower case.

Figure 7 Cleaning punctuations and symbols

The stop words were imported and removed from the text.

Figure 8 Removing stop words
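A sketch of the cleaning and stop-word removal steps combined, continuing from the sentences list produced by the tokenization sketch above; the regular expression is an assumption about what "punctuations, numbers and special characters" covers in the report's figures.

    import re
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords")  # stop-word lists, downloaded once
    stop_words = set(stopwords.words("english"))

    def clean_sentence(sentence):
        """Strip punctuation, digits and symbols, lowercase, drop stop words."""
        text = re.sub(r"[^a-zA-Z\s]", " ", sentence).lower()
        return " ".join(w for w in text.split() if w not in stop_words)

    clean_sentences = [clean_sentence(s) for s in sentences]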


Pre-trained Wikipedia 2014 + Gigaword ...

