Big Data and Business Analytics – Academic Notes
Devika Sanjay, Mahatma Gandhi University
MODULE 1 INTRODUCTION TO BIG DATA

What is Big Data? Big Data is a collection of data that is huge in volume and growing exponentially with time. Its size and complexity are so great that no traditional data management tool can store or process it efficiently. In other words, Big Data is still data, just of enormous size. The term stands for a collection of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications.

Examples of Big Data
Following are some of the examples of Big Data:
- The New York Stock Exchange generates about one terabyte of new trade data per day.
- Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day.
- A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time.
- Other sources of Big Data include stock exchanges, social media, etc.

Big Data Benefits
 Big data makes it possible to gain more complete answers because you have more information.
 More complete answers mean more confidence in the data, which means a completely different approach to tackling problems.
 Better customer insights.
 Improved operations.
 More insightful market intelligence.
 Smarter recommendations and targeting.
 Businesses can utilize outside intelligence when taking decisions.
 Early identification of risks to products/services, if any.

Types of Big Data
Following are the types of Big Data:
1. Structured
2. Unstructured
3. Semi-structured

Structured
Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data. Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and deriving value out of it. However, issues now arise when the size of such data grows to a huge extent, with typical sizes in the range of multiple zettabytes. Data stored in a relational database management system is one example of structured data.

Unstructured
Any data with unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges when it comes to processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Organizations today have a wealth of data available to them but, unfortunately, do not know how to derive value from it because the data is in raw, unstructured form. An example of unstructured data is the output returned by 'Google Search'.

Semi-structured
Semi-structured data can contain both forms of data. Semi-structured data is structured in form, but it is not actually defined by, for example, a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file, as in the sketch below.
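To make the XML example concrete, here is a minimal sketch using Python's standard library; the employee records are hypothetical. The tags give the data a structure in form, but unlike a relational table there is no fixed schema, so one record can simply omit a field:

```python
import xml.etree.ElementTree as ET

# Hypothetical semi-structured data: tags describe each record, but there
# is no table definition forcing every record to have the same fields.
xml_data = """
<employees>
    <employee><name>Asha</name><dept>Sales</dept></employee>
    <employee><name>Ravi</name></employee>  <!-- no dept element here -->
</employees>
"""

root = ET.fromstring(xml_data)
for emp in root.findall("employee"):
    name = emp.findtext("name")
    dept = emp.findtext("dept", default="(unknown)")
    print(name, dept)
```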

Characteristics of Big Data
Big data can be described by the following characteristics:
- Volume
- Variety
- Velocity
- Variability

(i) Volume – The name Big Data itself relates to a size which is enormous. The size of data plays a very crucial role in determining the value of data. Whether a particular data set can actually be considered Big Data is also dependent upon its volume. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data solutions.

(ii) Variety – The next aspect of Big Data is its variety. Variety refers to heterogeneous sources and to the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analyzing data.

(iii) Velocity – The term 'velocity' refers to the speed of generation of data; the data is dynamic in nature. How fast the data is generated and processed to meet demand determines the real potential in the data. Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.

(iv) Variability – This refers to the inconsistency which the data can show at times, hampering the process of handling and managing the data effectively.

Types of Analytics
- Descriptive Analytics, which uses data aggregation and data mining to provide insight into the past and answer: "What has happened?"
- Predictive Analytics, which uses statistical models and forecasting techniques to understand the future and answer: "What could happen?"
- Prescriptive Analytics, which uses optimization and simulation algorithms to advise on possible outcomes and answer: "What should we do?"

Descriptive Analytics: Insight into the past
Descriptive analytics does exactly what the name implies: it describes, or summarizes, raw data and turns it into something interpretable by humans. It is the analytics that describes the past. The past refers to any point of time at which an event has occurred, whether it is one minute ago or one year ago. Descriptive analytics is useful because it allows us to learn from past behaviors and understand how they might influence future outcomes. Common examples of descriptive analytics are reports that provide historical insights regarding the company's production, financials, operations, sales, inventory and customers.
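As a minimal illustration of the idea (not tied to any particular tool), the following Python sketch aggregates a hypothetical list of monthly sales figures into the kind of summaries a descriptive report would contain:

```python
import statistics

# Hypothetical monthly sales figures; descriptive analytics summarizes
# past data so that humans can interpret "what has happened".
monthly_sales = [120, 135, 128, 150, 161, 149]

print("Total sales:      ", sum(monthly_sales))
print("Average per month:", statistics.mean(monthly_sales))
print("Best month:       ", max(monthly_sales))
print("Std deviation:    ", round(statistics.stdev(monthly_sales), 2))
```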

Predictive Analytics: Understanding the future
Predictive analytics has its roots in the ability to "predict" what might happen. It provides companies with actionable insights based on data. It is important to remember that no statistical algorithm can predict the future with 100% certainty, because the foundation of predictive analytics is probability. Companies use these statistics to forecast what might happen in the future. Predictive analytics can be used throughout the organization, from forecasting customer behavior and purchasing patterns to identifying trends in sales activities.
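For intuition, here is a toy forecast in Python: it fits a straight line (ordinary least squares) to the hypothetical sales history from the previous sketch and extrapolates one month ahead. Real predictive models are far richer, and any such forecast carries uncertainty:

```python
# Fit y = intercept + slope * x to past months, then predict the next month.
monthly_sales = [120, 135, 128, 150, 161, 149]  # hypothetical history
n = len(monthly_sales)
xs = range(n)

x_mean = sum(xs) / n
y_mean = sum(monthly_sales) / n
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, monthly_sales))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean

forecast = intercept + slope * n  # x = n is the first unseen month
print(f"Forecast for next month: {forecast:.1f}")
```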

Prescriptive Analytics: Advise on possible outcomes
The relatively new field of prescriptive analytics allows users to prescribe a number of different possible actions and guides them towards a solution. At its best, prescriptive analytics predicts not only what will happen, but also why it will happen, providing recommendations regarding actions that will take advantage of the predictions. Prescriptive analytics uses a combination of techniques and tools such as business rules, algorithms, machine learning and computational modelling procedures. These techniques are applied to input from many different data sets, including historical and transactional data, real-time data feeds, and big data.
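A minimal sketch of the prescriptive step, built on an entirely hypothetical demand model: enumerate candidate actions (prices) and recommend the one that maximizes the predicted outcome (revenue). Production systems would use proper optimization or simulation instead of brute force:

```python
def predicted_demand(price):
    """Hypothetical demand model: demand falls as price rises."""
    return max(0.0, 200 - 8 * price)

# Enumerate candidate prices from 5.00 to 20.00 in steps of 0.50 and pick
# the action with the highest predicted revenue ("what should we do?").
candidate_prices = [p / 2 for p in range(10, 41)]
best_price = max(candidate_prices, key=lambda p: p * predicted_demand(p))
print(f"Recommended price: {best_price:.2f}, "
      f"expected revenue: {best_price * predicted_demand(best_price):.0f}")
```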

Who Uses Big Data? – Applications
 Banking – It is important to understand customers and boost their satisfaction, and it is equally important to minimize risk and fraud while maintaining regulatory compliance. Big data brings big insights, but it also requires financial institutions to stay one step ahead of the game with advanced analytics.
 Education – Educators armed with data-driven insight can make a significant impact on school systems, students and curriculums. By analyzing big data, they can identify at-risk students, make sure students are making adequate progress, and implement a better system for evaluation and support.
 Government – When government agencies are able to harness and apply analytics to their big data, they gain significant ground when it comes to managing utilities, running agencies, dealing with traffic congestion or preventing crime.
 Health care – Patient records. Treatment plans. Prescription information. When it comes to health care, everything needs to be done quickly, accurately and, in some cases, with enough transparency to satisfy stringent industry regulations. When big data is managed effectively, health care providers can uncover hidden insights that improve patient care.
 Manufacturing – More and more manufacturers are working in an analytics-based culture, which means they can solve problems faster and make more agile business decisions.
 Retail – Retailers need to know the best way to market to customers, the most effective way to handle transactions, and the most strategic way to bring back lapsed business.

Operational vs. Analytical Big Data
Operational Big Data technologies provide operational features to run real-time, interactive workloads that ingest and store data. MongoDB is a top technology for operational Big Data applications, with over 10 million downloads of its open source software. Operational Big Data is all about the normal day-to-day data that we generate. This could be online transactions, social media, or the data from a particular organization, etc. Examples:
 Online ticket bookings, which include rail tickets, flight tickets, movie tickets, etc.
 Online shopping on Amazon, Flipkart, Walmart, Snapdeal and many more.
 Data from social media sites like Facebook, Instagram, WhatsApp and a lot more.
 The employee details of any multinational company.

Analytical Big Data
Analytical Big Data technologies, on the other hand, are useful for retrospective, sophisticated analytics of your data. Hadoop is the most popular example of an Analytical Big Data technology. It is a little more complex than Operational Big Data. In short, Analytical Big Data is where the actual performance part comes into the picture, and crucial real-time business decisions are made by analyzing the Operational Big Data. Examples:
 Stock market analysis.
 Carrying out space missions, where every single bit of information is crucial.
 Weather forecast information.
 Medical fields, where a particular patient's health status can be monitored.

But picking an operational vs analytical Big Data solution isn’t the right way to think about the challenge. They are complementary technologies and you likely need both to develop a complete Big Data solution.

Traditional Approach vs. Google's Solution
Traditional Approach
In the traditional approach, an enterprise will have a computer to store and process big data. Here data is stored in an RDBMS like Oracle Database, MS SQL Server or DB2, and sophisticated software can be written to interact with the database, process the required data and present it to the users for analysis.

This approach works well where the volume of data can be accommodated by standard database servers, or up to the limit of the processor that is processing the data. But when it comes to dealing with huge amounts of data, it is a really tedious task to process such data through a single traditional database server.

Google's Solution
Google solved this problem using an algorithm called MapReduce. The algorithm divides the given task into small parts, assigns those parts to many computers connected over the network, and collects the results to form the final result dataset. The machines involved can be commodity hardware: single-CPU machines or servers with higher capacity, as sketched below.
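The divide-assign-collect idea can be mimicked on one machine with Python's standard multiprocessing module; this is only a sketch with made-up data, with worker processes standing in for the networked computers:

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Each worker handles its own small part of the overall task.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))                      # the "big" task
    chunks = [data[i:i + 100_000]                      # divide into parts
              for i in range(0, len(data), 100_000)]
    with Pool() as pool:
        partials = pool.map(process_chunk, chunks)     # assign to workers
    print(sum(partials))                               # collect the results
```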

Doug Cutting, Mike Cafarella and team took the solution provided by Google and started an open source project called HADOOP in 2005. Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different CPU nodes. In short, the Hadoop framework is capable of supporting applications that run on clusters of computers and perform complete statistical analysis of huge amounts of data. Big Data Analytics largely involves collecting data from different sources, munging it so that it becomes available for consumption by analysts, and finally delivering data products useful to the organization's business. The process of converting large amounts of unstructured raw data, retrieved from different sources, into a data product useful for organizations forms the core of Big Data Analytics.

Why is big data analytics important? Organizations can use big data analytics systems and software to make data-driven decisions that can improve business-related outcomes. The benefits may include more effective marketing, new revenue opportunities, customer personalization and improved operational efficiency. With an effective strategy, these benefits can provide competitive advantages over rivals.

How does big data analytics work? Data analysts, data scientists, predictive modelers, statisticians and other analytics professionals collect, process, clean and analyze growing volumes of structured transaction data as well as other forms of data not used by conventional BI and analytics programs.

Risks of Big Data
• Organizations can be overwhelmed by the data – they need the right people solving the right problems.
• Costs can escalate too fast – it isn't necessary to capture 100% of the data.
• Many sources of big data raise privacy concerns – addressed through self-regulation and legal regulation.

What is MapReduce? MapReduce was designed by Google as a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Though MapReduce was originally Google proprietary technology, it has become quite a generalized term in recent times. MapReduce comprises a Map() and a Reduce() procedure. The Map() procedure performs filtering and sorting operations on the data, whereas the Reduce() procedure performs a summary operation on the data. This model is based on modified versions of the map and reduce functions commonly available in functional programming. Libraries providing the Map() and Reduce() procedures have been written in many different languages. The most popular free implementation of MapReduce is Apache Hadoop.

The MapReduce algorithm consists of two functions: Map and Reduce. The Map function splits all the incoming data into small units that can be analyzed independently of one another. When the data is split, the Reduce function is launched; it analyzes each independent bit of data and then brings them all together. In this way, insights are discovered in the data.

Map() Procedure
There is always a master node in this infrastructure which takes an input. Right after taking the input, the master node divides it into smaller sub-inputs or sub-problems. These sub-problems are distributed to worker nodes. A worker node processes them and does the necessary analysis. Once the worker node completes the process for its sub-problem, it returns the result back to the master node.

Reduce() Procedure
All the worker nodes return the answers to the sub-problems assigned to them to the master node. The master node collects the answers and aggregates them into the answer to the original big problem which it was assigned. The framework runs the Map() and Reduce() procedures in parallel and independently of each other. All the Map() procedures can run parallel to each other, and once the worker nodes have completed their tasks they send the results back to the master node to compile into a single answer. This procedure can be very effective when it is implemented on a very large amount of data (Big Data). A single-machine simulation of this flow is sketched below.
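The classic word-count example makes the flow concrete. This is a minimal single-machine simulation in Python, with hypothetical input documents standing in for input splits; in a real cluster the map and reduce calls run in parallel on worker nodes:

```python
from collections import defaultdict

documents = ["big data is big", "data is data"]  # hypothetical input splits

# Map(): emit (key, value) pairs from each input split.
def map_fn(text):
    for word in text.split():
        yield (word, 1)

# Shuffle: group intermediate values by key before reducing.
groups = defaultdict(list)
for doc in documents:
    for key, value in map_fn(doc):
        groups[key].append(value)

# Reduce(): aggregate the values for each key into a summary.
def reduce_fn(key, values):
    return key, sum(values)

print(dict(reduce_fn(k, v) for k, v in groups.items()))
# {'big': 2, 'data': 3, 'is': 2}
```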

Big Data Platform

A big data platform is a type of IT solution that combines the features and capabilities of several big data applications and utilities within a single solution. It is an enterprise-class IT platform that enables an organization to develop, deploy, operate and manage a big data infrastructure/environment.
 A Big Data Platform is an integrated IT solution for Big Data management which combines several software systems, software tools and hardware to provide easy-to-use tools and systems to enterprises.
 It is a single one-stop solution for all the Big Data needs of an enterprise, irrespective of size and data volume.
 There are several open source and commercial Big Data Platforms in the market with varied features which can be used in a Big Data environment.
 A big data platform generally consists of big data storage, servers, databases, big data management, business intelligence and other big data management utilities.
 It also supports custom development, querying and integration with other systems.
 The primary benefit of a big data platform is to reduce the complexity of multiple vendors/solutions into one cohesive solution.
 Big data platforms are also delivered through the cloud, where the provider offers all-inclusive big data solutions and services.

Features of Big Data Platform
a) A Big Data platform should be able to accommodate new platforms and tools based on business requirements, because business needs can change due to new technologies or changes in business processes.
b) It should support linear scale-out.
c) It should have the capability for rapid deployment.
d) It should support a variety of data formats.
e) The platform should provide data analysis and reporting tools.
f) It should provide real-time data analysis software.
g) It should have tools for searching through large data sets.

Challenges associated with Big Data Platforms
 Analysis
 Capture
 Data Curation
 Search
 Sharing
 Storage
 Transfer
 Visualization
 Querying
 Updating

Popular Big Data Platforms
 Hadoop
 Cloudera
 Amazon Web Services
 Hortonworks
 MapR
 IBM Open Platform
 Microsoft HDInsight
 Intel Distribution for Apache Hadoop
 Datastax Enterprise Analytics
 Teradata Enterprise Access for Hadoop
 Pivotal HD

What is Hadoop?
 Hadoop is an open-source, Java-based programming framework and server software which is used to store and analyze data with the help of hundreds or even thousands of commodity servers in a clustered environment.
 Hadoop is designed to store and process large datasets extremely fast and in a fault-tolerant way.
 Hadoop uses HDFS (Hadoop Distributed File System) for storing data on a cluster of commodity computers. If any server goes down, Hadoop knows how to replicate the data, so there is no loss of data even on hardware failure.
 Hadoop is an Apache-sponsored project and it consists of many software packages which run on top of the Apache Hadoop system.
 Hadoop provides a set of tools and software that form the backbone of a Big Data analytics system.
 The Hadoop ecosystem provides the necessary tools and software for handling and analyzing Big Data.
 On top of the Hadoop system, many applications can be developed and plugged in to provide an ideal solution for Big Data needs.

Top Hadoop-based Commercial Big Data Analytics Platforms

Cloudera
 Cloudera is one of the first commercial Hadoop-based Big Data Analytics Platforms offering a Big Data solution.
 Its product range includes Cloudera Analytic DB, Cloudera Operational DB, Cloudera Data Science & Engineering and Cloudera Essentials.
 All these products are based on Apache Hadoop and provide real-time processing and analytics of massive data sets.

Amazon Web Services
 Amazon offers a Hadoop environment in the cloud as part of its Amazon Web Services package.
 The AWS Hadoop solution is a hosted solution which runs on Amazon's Elastic Compute Cloud (EC2) and Simple Storage Service (S3).
 Enterprises can use Amazon AWS to run their Big Data processing and analytics in the cloud environment.
 Amazon EMR allows companies to set up and easily scale Apache Hadoop, S...

