Big Data Analytics: A Review and Tools Comparison

V. Dhivya

1 Introduction

Big data analytics is the complex process of analysing large volumes of varied data sets, including social network content, videos, images, audio and sensor data. Big Data is not only the data itself; it also involves various tools, techniques and frameworks. Social media is the most important factor in the evolution of big data and analytics (Fig. 1). There are many other driving factors as well: retail, banking and finance, media and entertainment, healthcare, education, government and transportation are a few more examples of sectors whose data is growing into big data.

1.1 5 V’s of Big Data

Big Data is characterized by huge volume, fast velocity and a wide variety of information. The 5 V’s of Big Data are Volume, Variety, Value, Velocity and Veracity (Fig. 2).

V. Dhivya (B) Department of Computer Science, Apollo Arts & Science College, Chennai, India e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2021 D. Goyal et al. (eds.), Information Management and Machine Intelligence, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-15-4936-6_44


Fig. 1 Big data analytics

Fig. 2 5 V’s of Big data: Volume, Veracity, Velocity, Variety and Value

Volume. Data has grown exponentially in recent years. Huge volumes of data are generated by humans, machines and organizations. Volume alone does not make data Big Data; it is only one of its properties. A text file may be a few kilobytes, an audio file a few megabytes, and a video file a few gigabytes. Enormous amounts of data are produced through Instagram, email, YouTube and similar services, and companies like Google and Facebook handle these data.

Table 1 Unstructured versus structured data

80% Unstructured data: audio, video, social media postings, email, websites, mobile data, etc.
20% Structured data: databases (e.g. Oracle), tables

Velocity. Velocity refers to the rate at which data is generated and processed. It measures the speed of data production, modification and processing. An increase in data sources drives velocity; for example, YouTube receives approximately 72 hours of video uploads every minute.

Variety. Data comes in all varieties: structured data (databases), semi-structured data (XML files) and unstructured data (PDFs, email, video, transactions). Unstructured data creates a massive storage problem: about 80% of all data is unstructured, and only the remaining 20% is structured (Table 1).

Veracity. Veracity refers to how accurate the data is. Because such large amounts of data are collected, not all content can be assumed authentic; some may contain technical inaccuracies, errors, etc. Veracity covers not only data accuracy but also the level of importance of the data.

Value. Value refers to the worth of the data being extracted, i.e. deriving meaningful information from the entire collection of data. The most important initiative is to understand the costs and benefits of collecting and analysing the data; an endless amount of data is useless until it can be turned into valuable data [1, 2].

Big Data Analytics Process. The first stage is identifying the problem: what exactly is to be solved? The second stage is designing the data requirement: what kind of data is required for analysing the particular problem? The third stage is pre-processing the data, where cleaning of data takes place using tools such as Refine, along with some processing. The fourth stage is the analytical stage, where the processed data is analysed using methods and tools such as Hadoop MapReduce and Hive. The final stage is data visualization, where the data is visualized using tools such as Tableau and plotly (Fig. 3) [3].
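The five stages above can be sketched end to end in plain Python. This is a toy illustration with made-up records and field names, not the output of any specific big data tool:

```python
# Minimal sketch of the five analytics stages using plain Python.
# The records and field names are hypothetical examples.

# Stages 1-2: the problem is "which region sells most?",
# so the data requirement is per-region sales records.
raw = [
    {"region": "north", "sales": "120"},
    {"region": "south", "sales": None},      # dirty record
    {"region": "north", "sales": "80"},
    {"region": "south", "sales": "200"},
]

# Stage 3: pre-processing -- drop incomplete rows, convert types.
clean = [{"region": r["region"], "sales": int(r["sales"])}
         for r in raw if r["sales"] is not None]

# Stage 4: analytics -- aggregate sales per region.
totals = {}
for r in clean:
    totals[r["region"]] = totals.get(r["region"], 0) + r["sales"]

# Stage 5: visualization -- a crude text bar chart standing in
# for tools like Tableau or plotly.
for region, total in sorted(totals.items()):
    print(f"{region:>6} | {'#' * (total // 20)} {total}")
```

In a real deployment each stage would be backed by the tools named above (Refine, MapReduce, Hive, Tableau); the skeleton of the flow is the same.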


Fig. 3 Stages of Big data Analytics: identifying the problem, designing the data requirement, pre-processing the data, performing analytics over the data, and data visualization

2 Types of Big Data Analytics

See Fig. 4.

Fig. 4 Big data analytics types: descriptive, diagnostic, predictive and prescriptive analytics


2.1 Descriptive Analytics

Descriptive analytics is an important source for determining what to do next. It answers the question of what actually happened in the past. It uses techniques such as aggregation and mining to take a deep look at historical data, and then indicates what is happening now based on the incoming data. Example: with Google Analytics tools, the outcome helps a business interpret what literally happened in the past and then validate whether an advertising campaign was successful based on the page views.
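As a toy illustration of the aggregation idea, the sketch below compares average daily page views before and after a campaign. The numbers are invented, not real analytics output:

```python
# Descriptive sketch: aggregate past page views to judge a campaign.
# All figures below are hypothetical.
views_before = [100, 120, 110, 90]   # daily page views before the campaign
views_after  = [180, 200, 190, 230]  # daily page views after the campaign

avg_before = sum(views_before) / len(views_before)
avg_after  = sum(views_after) / len(views_after)

print(f"average before: {avg_before:.0f}, after: {avg_after:.0f}")
print("campaign helped" if avg_after > avg_before else "campaign did not help")
```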

2.2 Diagnostic Analytics

Diagnostic analytics is used to determine why something happened in the past. It takes a deeper look at the data to understand the root cause of an event, and is characterized by techniques such as drill-down, data discovery and data mining. Example: given time-series sales data, diagnostic analytics helps you understand why the company's sales decreased or increased over the past year.
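The drill-down idea can be sketched as follows, breaking a yearly change into quarters to locate the cause. The quarterly figures are hypothetical:

```python
# Diagnostic sketch: drill down into yearly sales (hypothetical figures)
# to find which quarter explains an overall drop.
sales = {
    2022: {"Q1": 50, "Q2": 60, "Q3": 55, "Q4": 70},
    2023: {"Q1": 52, "Q2": 30, "Q3": 56, "Q4": 71},
}

total_change = sum(sales[2023].values()) - sum(sales[2022].values())
by_quarter = {q: sales[2023][q] - sales[2022][q] for q in sales[2022]}
worst = min(by_quarter, key=by_quarter.get)   # quarter with biggest decline

print(f"total change: {total_change}")
print(f"worst quarter: {worst} ({by_quarter[worst]})")
```

The aggregate number only says sales fell; the drill-down shows that a single quarter is responsible, which is where root-cause investigation would start.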

2.3 Predictive Analytics

Predictive analytics predicts what might happen in the future. It uses statistical models and forecasting techniques to understand the future and answer what could happen. As the word suggests, it predicts the different possible future outcomes. Example: Southwest Airlines analyses sensor data from its planes to identify malfunctions and safety issues; this allows the airline to address possible problems and make repairs without interrupting flights or endangering passengers. It also reduces downtime and losses, and prevents delays and accidents.
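A minimal forecasting sketch, fitting an ordinary least-squares trend line to invented sensor readings rather than any airline's actual models:

```python
# Predictive sketch: fit a straight-line trend to past values and
# extrapolate one step ahead (ordinary least squares, pure Python).
history = [10.0, 12.0, 13.0, 15.0, 16.0]   # hypothetical past readings

n = len(history)
xs = list(range(n))
mean_x = sum(xs) / n
mean_y = sum(history) / n

# Least-squares slope and intercept for y = slope * x + intercept.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

forecast = intercept + slope * n   # predicted next value
print(f"trend: y = {slope:.2f}x + {intercept:.2f}, next value ~ {forecast:.1f}")
```

Real predictive systems use far richer statistical and machine-learning models, but the shape is the same: fit to history, then extrapolate.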

2.4 Prescriptive Analytics

Prescriptive analytics uses optimization and simulation algorithms to analyse the possible outcomes and say what should be done. It guides us to the best solution among a number of different possible actions. It uses a combination of techniques and tools such as business rules, algorithms, machine learning and computational modelling procedures. These techniques are then applied against input from many different datasets, including historical and transactional data.


Fig. 5 Comparing NOSQL DB

Example: a Google self-driving car analyses its environment and decides whether to slow down, speed up or change lanes [4].
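A heavily simplified sketch of the "choose the best action" idea, with invented outcome scores and weights standing in for real optimization or simulation models:

```python
# Prescriptive sketch: score each candidate action against a simple
# model and recommend the best one. All numbers are hypothetical.
actions = {
    "slow_down":   {"safety": 0.9, "speed": 0.3},
    "speed_up":    {"safety": 0.4, "speed": 0.9},
    "change_lane": {"safety": 0.7, "speed": 0.7},
}

def score(outcome, safety_weight=0.7):
    # Weighted trade-off: safety matters more than speed here.
    return safety_weight * outcome["safety"] + (1 - safety_weight) * outcome["speed"]

best = max(actions, key=lambda a: score(actions[a]))
print(f"recommended action: {best}")
```

A production system would derive these scores from simulation over historical and transactional data instead of hard-coding them, but the final step, picking the action that optimizes an objective, is the defining feature of prescriptive analytics.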

3 Cassandra Versus MongoDB Versus HBase Big Data Tools

See Fig. 5 and Table 2.

3.1 CAP Theorem

The CAP theorem is very important in the Big Data world. CAP stands for Consistency, Availability and Partition tolerance. According to the CAP theorem, a distributed database system can provide only two of these three guarantees. Cassandra provides AP, i.e. Availability and Partition tolerance; MongoDB and HBase provide CP, i.e. Consistency and Partition tolerance [5, 6] (Fig. 6).

4 Conclusion and Future Work

The volume of unstructured data generated has increased dramatically in recent years, and analysing such data is a great challenge for RDBMSs. This paper surveys the various tools, processes and analytics types to examine how effectively big data technologies have emerged to resolve this problem. Three NoSQL databases are also compared, and it is studied how each can be applied in different applications based on user requirements. Each big data tool has its own benefits and limitations, and minor issues remain in finding the signal in the noise, loss of data, and inaccurate data.


Table 2 Cassandra versus MongoDB versus HBase [5]

1. Overview
   Cassandra: Apache Cassandra is the leading NoSQL distributed data management system. It offers continuous availability, high scalability and performance, strong security and operational simplicity while lowering overall cost.
   MongoDB: MongoDB is a document-oriented, schemaless database; terabytes of data are stored in MongoDB.
   HBase: Apache HBase runs on top of HDFS and is a NoSQL key-value store. HBase operations run in real time on its database.

2. Data model
   Cassandra: Wide-column store model. It consists of keyspaces, the outermost container, and column families containing ordered collections of rows.
   MongoDB: Document store architecture; data in MongoDB consists of flexible-schema documents and collections.
   HBase: Data is split into tables, and tables are further split into columns. Schema definitions are not required for columns.

3. Implementation language
   Cassandra: Java. MongoDB: C++. HBase: Java.

4. Query language
   Cassandra: Cassandra Query Language (CQL). MongoDB: a dynamic object-based language and JavaScript. HBase: MapReduce.

5. Security
   Cassandra: client authorization and authentication; provides SSL encryption.
   MongoDB: client authorization and authentication; provides encryption, governance and auditing.
   HBase: client authorization and authentication; provides Thrift server roles.

6. Replication
   Cassandra: supports a selective replication factor. MongoDB: supports master-slave replication. HBase: supports a selective replication factor.

7. Use cases
   Cassandra: IoT, fraud detection, recommendation engines, product catalogs and messaging applications; top companies using Cassandra include Twitter and Netflix.
   MongoDB: IoT, mobile, real-time analytics, catalogs, personalization, etc.
   HBase: storing the history of patient data in the field of medicine; also used in analytics and prediction.

More efficient tools can therefore be developed to deal with these issues; such tools should handle noisy, inaccurate and missing data.


Fig. 6 CAP theorem: RDBMS provide CA (Consistency and Availability), Cassandra provides AP (Availability and Partition tolerance), and MongoDB and HBase provide CP (Consistency and Partition tolerance)

References

1. Acharjya, D. P., & Ahmed, K. (2016). A survey on big data analytics: Challenges, open research issues and tools. International Journal of Advanced Computer Science and Applications (IJACSA), 7(2).
2. https://www.edureka.co/hadoop.
3. Philip, C. L., Chen, Q., & Zhang, C. Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on big data. Information Sciences, 275, 314–347.
4. https://www.edureka.co/big-data-and-hadoop.
5. https://www.edureka.co/masters-program/big-dat-architect-training.
6. Kambatla, K., Kollias, G., Kumar, V., & Gram, A. (2014). Trends in big data analytics. Journal of Parallel and Distributed Computing, 74(7), 2561–2573.
7. Li, Q., Xing, J., Liu, O., & Chong, W. (2017). The impact of big data analytics on customers' online behaviour. In Proceedings of the International Multi-Conference of Engineers and Computer Scientists 2017, Vol. II, IMECS 2017, March 15–17, 2017, Hong Kong.

