
Technical Research Paper On
Natural Language Processing with Spark NLP

Harsh Bhanderi (SY76891)
DATA 603: Platforms for Big Data Processing
University of Maryland, Baltimore County
December 1, 2020


Abstract

Natural Language Processing is one of the essential tools for data science teams around the world. With ever-growing data, most companies have already migrated to big data platforms such as Apache Hadoop and to cloud offerings like AWS, Azure, and GCP. These platforms are more than capable of managing big data and enable companies to run unstructured data analytics at scale, such as text categorization. Yet when it comes to machine learning, there is still a gap between big data frameworks and machine learning libraries. Popular Python machine learning libraries such as scikit-learn and Gensim are highly optimized to run on single-node machines and were not built for distributed environments. Apache Spark MLlib helps fill this void by providing implementations of the most common machine learning algorithms, such as Linear Regression, Logistic Regression, SVM, Random Forest, K-means, LDA, and many more.

In addition to machine learning algorithms, Spark MLlib also includes a plethora of feature transformers, such as Tokenizer, StopWordsRemover, and n-grams, and extractors, such as CountVectorizer, TF-IDF, and Word2Vec. While these transformers and extractors are sufficient to build a simple NLP pipeline, a more comprehensive and production-grade pipeline requires more sophisticated techniques such as stemming, lemmatization, part-of-speech tagging, and named entity recognition.

Spark NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant, and accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. Spark NLP comes with 330+ pre-trained pipelines and models in more than 46 languages. It supports state-of-the-art transformers such as BERT, XLNet, ELMo, ALBERT, and the Universal Sentence Encoder, which can be used effortlessly inside a cluster. Tokenization, part-of-speech tagging, named entity recognition, dependency parsing, spell checking, multi-class text classification, multi-class sentiment analysis, and several other NLP tasks are also available.

Through concrete illustrations, practical and theoretical explanations, and hands-on experiments with NLP on the Spark processing framework, this paper covers everything from fundamental linguistics and writing systems to sentiment analysis and search engines. Spark NLP is targeted at production use in software systems that outgrow older libraries such as spaCy, NLTK, and CoreNLP.


INTRODUCTION

Natural Language Processing is the central technique in text data analysis. Common uses of NLP include sentiment analysis, text categorization and classification, language modeling, and text summarization. With the increasing number of machine learning and artificial intelligence applications, natural language processing has become an essential component. Over the past few years, many NLP libraries have been developed, and many more are still being developed and tested. There are many open-source natural language processing libraries, such as NLTK (Natural Language Toolkit), TextBlob, spaCy, Gensim, and fastText.


These libraries provide many features, such as sentence detection, tokenization, stemming, lemmatization, part-of-speech (POS) tagging, named entity recognition (NER), dependency parsing, text matching, date matching, chunking, spell checking, sentiment detection, pre-trained models, and model training. Each of the libraries mentioned above provides many of these features, but no single library includes all of them; that is where Spark NLP comes in.


What is needed is one universal solution that includes all of these features, and that is Spark NLP. It can easily convert unstructured data into structured data and allows models to be trained with little extra effort. Many open-source pre-trained models are available to adapt and run in Spark NLP, such as the OpenAI transformer, ULMFiT, ERNIE, ELMo, BERT, RoBERTa, ALBERT, and XLNet. These pre-trained models support many NLP tasks.
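
As an illustration, loading one of these pre-trained models in Python takes only a few lines. This is a minimal sketch, assuming the Spark NLP Python package is installed; "small_bert_L2_128" is one published BERT model name, and other names can be looked up on the Spark NLP Models Hub.

    import sparknlp
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, BertEmbeddings

    spark = sparknlp.start()  # starts a Spark session with Spark NLP loaded

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    token = Tokenizer().setInputCols(["document"]).setOutputCol("token")

    # Download a published BERT model; the name here is one example from the hub
    bert = BertEmbeddings.pretrained("small_bert_L2_128", "en") \
        .setInputCols(["document", "token"]) \
        .setOutputCol("embeddings")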

There was an immediate need for a natural language processing library that is simple to use, includes all the features provided by other libraries, is available in widely used programming languages such as Python and Scala, is very fast, and supports large-scale datasets. John Snow Labs, a global artificial intelligence company, started developing the Spark NLP library to meet this need. John Snow Labs is a leader in the data science industry and actively develops and maintains the Spark NLP library.


SPARK NLP

Spark NLP is an open-source natural language processing library built on top of Apache Spark and its machine learning library. It is written in Scala and exposes both Python and Scala application programming interfaces. The library includes common NLP tasks such as tokenization, stemming, lemmatization, part-of-speech tagging, sentiment analysis, and spell checking. These features can be used via the pre-trained models made available by Spark NLP.
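
As a quick sketch, a pre-trained pipeline can be downloaded and applied in a few lines of Python. "explain_document_ml" is one of the published English pipelines (it bundles tokenization, POS tagging, lemmatization, stemming, and spell checking); the exact output keys depend on the pipeline version.

    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    spark = sparknlp.start()

    # Download a published pipeline by name and language
    pipeline = PretrainedPipeline("explain_document_ml", lang="en")

    # annotate() runs the whole pipeline on a single string
    result = pipeline.annotate("Spark NLP makes prodction NLP pipelines easy.")
    print(result["spell"])  # the misspelled "prodction" comes back corrected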

As of February 2019, the library was used by 16 percent of enterprise companies, making it the most widely used NLP library among such companies. Built natively on Apache Spark and TensorFlow, the library provides simple, performant, and accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment. Along with implementing NLP features, this library reuses the Spark ML pipeline.

As a native extension of the Spark ML API, the library offers the ability to train, customize, and save models so that they can run on a cluster or on other machines, or be stored for later use. It is also easy to extend and customize models and pipelines. Over the past few years, the rise of deep learning for natural language processing has meant that the algorithms implemented in common libraries, such as spaCy, Stanford CoreNLP, NLTK, and OpenNLP, are less accurate than what recent scientific papers have made possible.

Once just a part of the Hadoop ecosystem, Apache Spark is now becoming the big data platform of choice for companies, largely due to its ability to process streaming data. It is a versatile open-source engine that offers very high speed, ease of use, a standard interface, real-time stream processing, dynamic processing, graph processing, and in-memory processing as well as batch processing.

Spark has a module called Spark ML that introduces several machine learning components: estimators, which are algorithms that can be trained, and transformers, which are either the product of training an estimator or an algorithm that needs no training at all. Both estimators and transformers may be part of a pipeline, which is no more and no less than a sequence of stages executed in order, each possibly depending on the outcome of the previous one.
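
A minimal sketch of this idea using plain Spark ML components (the tiny dataset and column names are illustrative, not from the paper):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("spark-ml-demo").getOrCreate()

    train = spark.createDataFrame(
        [("spark is fast and scalable", 1.0), ("slow single node script", 0.0)],
        ["text", "label"],
    )

    tokenizer = Tokenizer(inputCol="text", outputCol="words")  # transformer: needs no training
    tf = HashingTF(inputCol="words", outputCol="features")     # transformer: needs no training
    lr = LogisticRegression(maxIter=10)                        # estimator: must be fit()

    pipeline = Pipeline(stages=[tokenizer, tf, lr])
    model = pipeline.fit(train)  # fitting the estimator stage yields a PipelineModel
    model.transform(train).select("text", "prediction").show()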

A. Annotators

In Spark NLP, all annotators are either estimators or transformers. An Estimator in Spark ML is an algorithm that can be fit on a DataFrame to produce a Transformer. A learning algorithm, for example, is an estimator that trains on a DataFrame and outputs a model. A Transformer is an algorithm that can turn one DataFrame into another. An ML model, for example, is a transformer that turns a DataFrame with features into a DataFrame with predictions.

There are two types of annotators in Spark NLP: AnnotatorApproach and AnnotatorModel. AnnotatorApproach extends Spark ML estimators, which are intended to be trained with fit(), and AnnotatorModel extends transformers, which are intended to transform DataFrames with transform(). Some of the Spark NLP annotators carry a Model suffix and some do not; the Model suffix is stated explicitly when the annotator is the product of a training phase. Some annotators, such as Tokenizer, are transformers but do not carry the Model suffix because they are not trained. Model annotators provide a pretrained() method on their static object to retrieve a public pre-trained version of the model.
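
The distinction shows up directly in code. In this hedged sketch, Tokenizer needs no trained weights, while the part-of-speech tagger PerceptronModel carries the Model suffix and is fetched with pretrained(); "pos_anc" is one published English POS model name.

    from sparknlp.annotator import Tokenizer, PerceptronModel

    # No Model suffix: Tokenizer is not the product of a training phase
    tokenizer = Tokenizer() \
        .setInputCols(["document"]) \
        .setOutputCol("token")

    # Model suffix: this annotator results from training, so a public
    # pre-trained version can be retrieved via the static pretrained() method
    pos_tagger = PerceptronModel.pretrained("pos_anc", "en") \
        .setInputCols(["document", "token"]) \
        .setOutputCol("pos")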


B. Pre-trained Models

Spark NLP provides pre-trained models in four languages (English, French, German, and Italian). All the user has to do is load the pre-trained model onto disk by specifying the model's name, then configure its parameters according to the use case and dataset. The user therefore does not have to worry about training a new model from scratch and can apply the pre-trained state-of-the-art algorithms directly to the data with transform().
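
For example, loading a pre-trained lemmatizer for a specific language only requires a model name and a language code. In this sketch, "lemma" is the name under which lemmatizer models are published for several languages; the exact name should be checked against the Models Hub.

    from sparknlp.annotator import LemmatizerModel

    # French lemmatizer: same annotator class, different language code
    lemmatizer_fr = LemmatizerModel.pretrained("lemma", "fr") \
        .setInputCols(["token"]) \
        .setOutputCol("lemma")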

C. Transformers

DocumentAssembler: To get through the NLP process, developers need their raw data annotated. This special transformer does that for us; it creates the first document-type annotation, which is consumed by the annotators further down the line.

TokenAssembler: This transformer reconstructs a document-type annotation from tokens, usually after they have been normalized, lemmatized, or spell-checked, so that the document can be used by further annotators.
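
A short sketch showing both transformers together (the column names are illustrative):

    from sparknlp.base import DocumentAssembler, TokenAssembler
    from sparknlp.annotator import Tokenizer, Normalizer

    # Entry point: wraps the raw "text" column in a document-type annotation
    document_assembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")

    tokenizer = Tokenizer() \
        .setInputCols(["document"]) \
        .setOutputCol("token")

    normalizer = Normalizer() \
        .setInputCols(["token"]) \
        .setOutputCol("normalized") \
        .setLowercase(True)

    # Rebuilds a document-type annotation from the cleaned-up tokens
    token_assembler = TokenAssembler() \
        .setInputCols(["document", "normalized"]) \
        .setOutputCol("clean_document")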

Pipeline: A pipeline is the sequence of algorithms to be run. Its stages might be: convert the document text into word tokens; normalize the tokens by applying stemming, lemmatization, and so on; convert the normalized tokens into numerical feature vectors such as embeddings or TF-IDF; and learn a prediction model from the feature vectors and labels. In basic terms, to define an ML workflow, a pipeline chains several transformers and estimators together.
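
Putting the pieces together, here is a hedged end-to-end sketch of such a pipeline; the example sentence and column names are illustrative, and LemmatizerModel.pretrained() downloads the default English lemmatizer.

    import sparknlp
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler, Finisher
    from sparknlp.annotator import Tokenizer, Normalizer, LemmatizerModel

    spark = sparknlp.start()

    document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    normalizer = Normalizer().setInputCols(["token"]).setOutputCol("normalized")
    lemmatizer = LemmatizerModel.pretrained().setInputCols(["normalized"]).setOutputCol("lemma")
    finisher = Finisher().setInputCols(["lemma"])  # turns annotations back into plain arrays

    nlp_pipeline = Pipeline(stages=[
        document_assembler, tokenizer, normalizer, lemmatizer, finisher
    ])

    df = spark.createDataFrame(
        [("Unstructured texts flow through each stage in order.",)], ["text"]
    )
    model = nlp_pipeline.fit(df)  # fit() trains or initializes every estimator stage
    result = model.transform(df)  # transform() pushes the data through all stages
    result.select("finished_lemma").show(truncate=False)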

A pipeline is defined as a series of stages, each of which is either a transformer or an estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each one; that is, the data flows through the fitted pipeline. Each stage's transform() method updates the dataset and passes it on to the next stage.

As the pipeline flow illustrates, every generated (output) column is directed to the next annotator as input, depending on the input column parameters. It is like building with Lego blocks: with a little imagination, developers can come up with incredible pipelines.


In addition to customizable pipelines, Spark NLP also provides pre-trained pipelines that are already built from those annotators and transformers for different use cases.

The Spark NLP library is used in enterprise programs, is built natively on Apache Spark and TensorFlow, and offers an all-in-one, state-of-the-art NLP solution, delivering simple, performant, and accurate NLP annotations for machine learning pipelines that can easily scale in a distributed environment.


