Big data Assignment - Analytics PDF

Title Big data Assignment - Analytics
Course Data Science and big data Analytics
Institution National University of Ireland Galway
Pages 11



Description

Business Information Systems
Individual Assignment Submission Form

Notes to candidates:

1. Assignments will not be accepted without a signed copy of this form.
2. This form must be stapled to any additional pages submitted.
3. Please refer to your course lecturer for full assignment submission details.
4. Any computer disc or CD submitted with the assignment must also be signed.
5. Unreadable, virus infected or blank discs/memory sticks will score zero.

Your Details
Surname: Nallabala
First Name: Sri Charan
Student ID Number: 17234324
Email: [email protected]
Contact Phone No: 0899411132
Lecturer: Anatoli Nachev
Course: MSc IA&A
Assignment: Big Data Analytics: Log Management

I hereby declare that the work submitted is entirely my own, and that ideas or extracts taken from other sources are properly acknowledged and referenced. Furthermore, I acknowledge that the penalty for plagiarism may include suspension from examination.

Signed: N. Sri Charan

Date: 09-03-2018

1) Script – A script is a program written in a scripting language and stored as a plain-text file containing a set of instructions/commands. A scripting engine (interpreter) can execute it directly, without a separate compilation step. Because scripts are plain text, they can be edited easily with any ordinary text editor to obtain the required behaviour. Scripts have become prominent with the growth of web-based applications. They are useful for automating processes, which reduces the man-hours needed to complete a task, and for converting raw data into meaningful information through instructions/commands written to the users' requirements. Open-source tools such as Hadoop and Hive can be integrated with AWS to provide a scalable architecture for analysing large volumes of log records. There are two types of scripting – 1) Client-Side scripting 2) Server-Side scripting

Client-Side Scripting – A client-side script runs in the client's browser, and its source code is visible to the user. Ex – JavaScript.
Server-Side Scripting – A server-side script runs on the server; visitors to the website see only its output, not its source code. Ex – Python, PHP.

Advantages of Scripts –
1) Because scripts are plain text, they are easy for non-programmers to read and execute.
2) Scripts do not need to be compiled.
3) Complicated tasks can be executed in comparatively few steps.
4) The text format allows users to edit scripts easily.

Disadvantages of Scripts –
1) Because the source code of a script is open, outsiders may be able to view it, which can expose confidential data.

(Mac, 2015)

EUROSTAR
Log records are usually in an unstructured or quasi-structured format, so they must be processed before they can be used, and this can be done with a script. In the given case of EUROSTAR, the log records are quasi-structured, in raw format, and difficult to analyse. To analyse them, EUROSTAR runs a Hive script containing instructions/commands that processes the log files on an AWS EMR cluster.

A Hive script normally accesses data by mapping its current format onto a database table with the help of a serializer/deserializer (SerDe). By using a SerDe, log files can be defined as tables.
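The idea of turning a raw log line into a table row can be illustrated outside Hive. The following is a minimal Python sketch, not the SerDe used in the script: the log layout (date, time, OS, browser and status separated by whitespace), the pattern and the function name are all assumptions made for illustration.

```python
import re

# Hypothetical log layout (an assumption): date, time, OS, browser, status,
# separated by whitespace. The real layout is whatever the script's SerDe defines.
LOG_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<os>\S+)\s+(?P<browser>\S+)\s+(?P<status>\d{3})$"
)

def parse_line(line):
    """Return one table row as a dict of named columns, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None

row = parse_line("2014-07-05 20:00:00 Android Chrome 200")
print(row)
```

Each named group plays the role of a table column, which is the sense in which a log file can be "defined as a table".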

SCRIPT Analysis -

The available log data contains information such as the date and time of user requests, the OS and browser used, and the status of each request. The above part of the Hive script creates a table over the log files stored in AWS S3, which users can then query for further analysis.

The above part of the script specifies the serializer/deserializer (SerDe). In Hive, the deserializer converts each stored log record into a row of column values that can be queried, and the serializer converts rows back into the storage format when data is written.

The above part of the Hive script contains the query requested by users and the overwrite instructions. The query calculates the total number of requests per operating system within the timeframe 2014-07-05 to 2014-08-05. After the script is executed on the AWS EMR cluster, the raw data is converted into useful information: the number of requests per operating system within that timeframe.

Android - 855
Linux - 813
MacOS - 852
OSX - 799
Windows - 883
iOS - 794
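The per-OS count within a date window can be sketched in a few lines of Python. This is only an illustration of the logic the query performs, under stated assumptions: the rows below are made-up sample data, not the Eurostar logs, and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical parsed rows (assumptions for illustration): (date, os_name) pairs.
rows = [
    ("2014-07-04", "Android"),   # before the window, excluded
    ("2014-07-05", "Android"),
    ("2014-07-20", "Windows"),
    ("2014-08-05", "Windows"),
    ("2014-08-06", "iOS"),       # after the window, excluded
]

START, END = "2014-07-05", "2014-08-05"

# ISO-formatted dates sort correctly as strings, so an inclusive range check
# behaves like BETWEEN ... AND ... on the date column.
requests_per_os = Counter(os_name for date, os_name in rows if START <= date <= END)
print(dict(requests_per_os))
```

Both boundary dates are included, which matches the inclusive behaviour of BETWEEN.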

From the above information EUROSTAR can analyse the user requests to its website by operating system. Windows users account for the most requests (883), followed by Android users (855), so Eurostar can promote and retain this business by introducing attractive new schemes for these users. iOS users account for the fewest requests (794), so Eurostar should take measures to improve, such as adapting its product line to attract iOS users as well and increase its market base.

2) Modified Script – The given script has been modified to find the request status of users for each OS. The query is modified as:
SELECT OS, STATUS, COUNT(*) FROM CloudFront_logs WHERE DateLog BETWEEN '2014-07-05' AND '2014-08-05' GROUP BY OS, STATUS;

After executing the script on the provided log data using the AWS EMR cluster, the output is as below.
Output –
Android-200-704
Linux-200-667
MacOS-200-694
OSX-200-637
Windows-200-706
iOS-200-604
Android-304-151
Linux-304-146
MacOS-304-158
OSX-304-162
Windows-304-177
iOS-304-190
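The grouping behaviour of the modified query can be tried locally without an EMR cluster. The sketch below runs the same SELECT against an in-memory SQLite table from Python; SQLite stands in for Hive here, the sample rows are invented, and an ORDER BY clause is added only to make the printed order predictable.

```python
import sqlite3

# In-memory stand-in for the Hive table; column names follow the modified query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CloudFront_logs (DateLog TEXT, OS TEXT, STATUS INTEGER)")
conn.executemany(
    "INSERT INTO CloudFront_logs VALUES (?, ?, ?)",
    [
        ("2014-07-05", "Android", 200),
        ("2014-07-06", "Android", 200),
        ("2014-07-07", "Android", 304),
        ("2014-07-08", "iOS", 200),
        ("2014-09-01", "iOS", 304),  # outside the date range, excluded
    ],
)

# Same query as in the text, plus ORDER BY for a deterministic row order.
result = conn.execute(
    "SELECT OS, STATUS, COUNT(*) FROM CloudFront_logs "
    "WHERE DateLog BETWEEN '2014-07-05' AND '2014-08-05' "
    "GROUP BY OS, STATUS ORDER BY OS, STATUS"
).fetchall()
print(result)
conn.close()
```

Each (OS, STATUS) pair becomes one output row, which is why the real output has two rows per operating system.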

The above output can be analysed as –

OS        Request Status   No of Requests
Android   200              704
Android   304              151
Linux     200              667
Linux     304              146
MacOS     200              694
MacOS     304              158
OSX       200              637
OSX       304              162
Windows   200              706
Windows   304              177
iOS       200              604
iOS       304              190
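As a consistency check, the two status counts for each OS should add up to the per-OS totals from the first query. The short Python sketch below does that arithmetic with the figures reported above; the share of status 304 per OS is an extra derived figure, not something the script itself outputs.

```python
# Counts copied from the output above: OS -> (status 200 count, status 304 count).
counts = {
    "Android": (704, 151),
    "Linux": (667, 146),
    "MacOS": (694, 158),
    "OSX": (637, 162),
    "Windows": (706, 177),
    "iOS": (604, 190),
}

# Total requests per OS is the sum of both status counts.
totals = {name: ok + not_modified for name, (ok, not_modified) in counts.items()}

# Percentage of each OS's requests that returned status 304.
share_304 = {
    name: round(100 * not_modified / (ok + not_modified), 1)
    for name, (ok, not_modified) in counts.items()
}

print(totals)
print(share_304)
```

The totals match the first query's output (for example 704 + 151 = 855 for Android), which suggests the two queries ran over the same set of records.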

The above output shows the request status as 200 and 304.

200 – the request completed successfully (HTTP 200 OK).
304 – the client sent a conditional GET request and the requested resource has not changed since the client's cached copy, so the server returns no content and the client reuses its cached version (HTTP 304 Not Modified).

For example, of the 855 total Android user requests, 704 succeeded with status 200 and 151 returned status 304, meaning the client's cached copy was still valid.

OS        Request Status   No of Requests
Android   200              704
Android   304              151

Query -2 –

The given script is modified to identify the most popular browser used by the users.
Query –
SELECT BROWSER, COUNT(*) FROM cloudfront_logs GROUP BY BROWSER

Output –
Chrome - 828
Firefox - 795
IE - 774
Lynx - 889
Opera - 835
Safari - 875
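Ranking the browsers from the counts above is a simple sort. The following Python sketch uses the reported figures; the variable names are illustrative only.

```python
# Request counts per browser, copied from the output above.
browser_counts = {
    "Chrome": 828,
    "Firefox": 795,
    "IE": 774,
    "Lynx": 889,
    "Opera": 835,
    "Safari": 875,
}

# Sort browsers from most to fewest requests.
ranked = sorted(browser_counts.items(), key=lambda item: item[1], reverse=True)
most_used, least_used = ranked[0], ranked[-1]
total_requests = sum(browser_counts.values())

print(ranked)
print(most_used, least_used, total_requests)
```

The browser counts sum to the same total as the per-OS counts, as expected when every request records both an OS and a browser.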

From the above output it is clear that the most requests come from the Lynx browser (889) and the fewest from IE (774). Eurostar can use this information to investigate why it receives fewer requests from IE users and more from Safari or Lynx users, and act accordingly to improve the business.

3) Log – A log is an automatically generated, continuous record of the actions performed in a system or organisation. Virtually every system or computer application generates log files based on the actions performed by its users; these actions are captured and recorded as logs in a database. Every component in a system creates log files, from hardware such as routers, firewalls and servers to software such as operating systems (Windows/Linux) and applications. In large organisational systems this can mean millions of log files generated daily, and analysing this data is necessary because it provides vital information to the organisation.

Management can use the generated log files in various ways with proper analytic tools. Every server or network has transaction log files, which record the committed transactions performed by the server. These log files can be used for recovery after a system/network crash (Torre, 2010). An organisation's log database is a wealth of data for performing analytics and drawing reasonable, business-relevant insights. Companies use log databases for purposes such as security, customer analysis, market analysis and predicting customer behaviour (Kent, 2006).

Log Management – Log management is an effective way of handling the millions of log files generated by capturing the actions of a variety of users, and of analysing that data to raise alerts on security issues or to perform statistical analysis of market trends. A log management system is a sequential process of collecting data, retaining it efficiently, searching it to access information across all the logs, parsing logs into a data structure so that searches return results faster, reporting (including scheduled reporting), and disposing of log data. In layman's terms, log management is simply dealing with logs. The National Institute of Standards and Technology (NIST) defines log management as "the process for generating, transmitting, storing, analysing, and disposing of computer security log data" (Torre, 2010).

Effective log management is necessary for an organisation to implement better security controls and compliance. Many regulations across the globe mandate audit logs, for example the Sarbanes-Oxley Act and HIPAA. In addition to the inherent benefits of log management, these regulations also compel organisations to review logs.
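The point that parsing logs into a data structure speeds up searching can be shown with a small index. This Python sketch is only an illustration under assumptions: the records, field names and the choice of indexing by source are all hypothetical.

```python
from collections import defaultdict

# Hypothetical parsed log records (illustrative assumptions).
records = [
    {"time": "2018-03-01T09:00:00", "source": "firewall", "message": "connection blocked"},
    {"time": "2018-03-01T09:05:00", "source": "webserver", "message": "GET /index.html 200"},
    {"time": "2018-03-01T09:07:00", "source": "firewall", "message": "connection allowed"},
]

# Build the index once: source name -> positions of the matching records.
index_by_source = defaultdict(list)
for position, record in enumerate(records):
    index_by_source[record["source"]].append(position)

def search_by_source(source):
    """Look up records through the index instead of scanning every record."""
    return [records[position] for position in index_by_source.get(source, [])]

firewall_events = search_by_source("firewall")
print(len(firewall_events))
```

Once the index is built, a search touches only the matching records, which is the benefit parsing brings when there are millions of entries.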

Basically, a log file has three parts, and there are three types of log files.

Three parts of a log file –
- The time when the log file was created
- The source from which the log file was generated
- Why the log file was created

Three types of log files –
1. Request log files
2. Manager log files
3. The Internal Concurrent Manager log file

- Request log files hold a timestamped record of every executed program. Every request produces a log file.
- Manager log files document the activity of a concurrent manager that is executing a request.
- The Internal Concurrent Manager log records the performance of the concurrent managers.

Local Host log – This captures the log of all transactions executed between the client and the application server. It is saved as localhost_access_log..txt. Some frameworks, such as Hibernate, produce log messages based on the connections established with the database.

Uses of Log Management to Organization –

1. Monitoring Employee Actions
A log management system allows management to keep track of the actions employees perform on systems. This helps management detect unauthorised access and suspicious activities, such as misuse of the organisation's confidential data, which improves IT security. Employees trying to copy organisational data onto their own drives can be traced easily by monitoring the log entries for storage devices being connected and removed.

2. Real Time Information
Companies can gain real-time information about the actions performed by users of their systems, whether onsite or offsite. This gives companies greater confidence about security and networking issues. Log files created by firewalls and routers are stored in the organisation's centralised database and help companies gain more control over their users' actions and prevent unauthorised access.
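The three parts of a log entry and the monitoring use case can be combined in one small sketch. The Python below is a hedged illustration, not a real monitoring tool: the entry contents, the watched keywords and the function name are assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class LogEntry:
    time: datetime   # when the entry was created
    source: str      # where the entry was generated
    reason: str      # why the entry was created

# Hypothetical entries (illustrative assumptions).
entries = [
    LogEntry(datetime(2018, 3, 1, 9, 0), "workstation-12", "user login"),
    LogEntry(datetime(2018, 3, 1, 9, 30), "workstation-12", "usb storage connected"),
    LogEntry(datetime(2018, 3, 1, 9, 45), "firewall", "unauthorised access blocked"),
]

# Hypothetical keywords an organisation might choose to watch for.
WATCHED_KEYWORDS = ("usb storage", "unauthorised")

def flag_suspicious(log_entries):
    """Return the entries whose reason mentions any watched keyword."""
    return [
        entry for entry in log_entries
        if any(keyword in entry.reason for keyword in WATCHED_KEYWORDS)
    ]

alerts = flag_suspicious(entries)
for alert in alerts:
    print(alert.time.isoformat(), alert.source, alert.reason)
```

A real deployment would read entries from the centralised log store and use far richer rules, but the filtering idea is the same.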

Besides the inherent benefits of log management, organisations also face challenges in balancing the resources available for log management against the huge volume of log data being created. The challenges can be analysed in three ways –
1) Generation of logs – Logs are generated from a variety of sources, whose formats may not be consistent with one another.
2) Confidentiality of produced logs – There is always a chance that employees working on log management misuse the data for their own or others' purposes, against organisation policy, leading to a breach of confidentiality.
3) Staff working on log management may not be experts in log analysis, which might lead to incorrect decision making.

Log data can be managed efficiently with proper analytic tools. Log data is already huge and grows day by day, which is the underlying reason for the emerging demand for analytics. Because log databases have grown so large, traditional tools cannot analyse the complete data. Big Data technologies have evolved to overcome these limitations.

BIG DATA – A collection of large amounts of data is called Big Data. Processing this data to extract useful information and uncover hidden, valuable patterns is called Big Data Analytics. Big Data is characterised by the 3Vs: Volume, Variety and Velocity.

Volume refers to the huge amount of data generated from sources such as website visit records, reports, email, tax filings and social media posts. The volume of data is measured in bytes. “According to computer giant IBM, 2.5 exabytes - that's 2.5 billion gigabytes (GB) - of data was generated every day in 2012. That's big by anyone's standards. About 75% of data is unstructured, coming from sources such as text, voice and video” (BBC, 2014)

Variety refers to the forms data can take, such as pictures, audio, video or text. Data can be structured or unstructured; as per IBM research, 75% of the data in existence is unstructured.

Velocity refers to the speed at which data must be analysed. Analysis of data is very important for decision making and for drawing reasonable conclusions. In the present competitive environment, it is important to analyse data as soon as it is generated to support faster decision making. (TechTarget, 2016)
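The conversion in the quoted figure can be checked with a few lines of arithmetic. The Python sketch below assumes decimal (SI) units, where 1 GB is 10^9 bytes and 1 EB is 10^18 bytes; the unstructured share applies the quoted 75% to the daily volume.

```python
# Decimal (SI) units are assumed: 1 GB = 10**9 bytes, 1 EB = 10**18 bytes.
BYTES_PER_GB = 10**9
BYTES_PER_EB = 10**18

exabytes_per_day = 2.5
gigabytes_per_day = exabytes_per_day * (BYTES_PER_EB // BYTES_PER_GB)

# Applying the quoted 75% unstructured share to the daily volume.
unstructured_gb_per_day = 0.75 * gigabytes_per_day

print(f"{gigabytes_per_day:,.0f} GB generated per day")
print(f"{unstructured_gb_per_day:,.0f} GB of that is unstructured")
```

Under these units one exabyte is a billion gigabytes, so 2.5 exabytes is indeed 2.5 billion GB, as the quote states.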

Big data solutions provide services such as aggregating log data from several servers into a single server or a single centralised log file. The aggregated logs are then transformed into a common format using big data technologies and can later be used for analysis and visualisation. (Josef, 2017) Big Data technology is therefore well suited to log management. Steps performed by big data technologies in log management –

- Log collection and centralization.
- Log transformation into a single common format.
- Storing of logs which are processed to common format.
- Processing of logs.
- Log analysis and visualization.
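The steps above can be sketched end to end on a toy scale. The Python below is a minimal illustration under assumptions, not a big data implementation: the two log formats, the field names and the in-memory list standing in for a central store are all hypothetical.

```python
import json
from collections import Counter

# Step 1, collection: hypothetical raw logs from two servers in different formats.
web_server_logs = [
    "2018-03-01 09:00:00 GET /home 200",
    "2018-03-01 09:03:00 GET /book 304",
]
app_server_logs = ['{"ts": "2018-03-01 09:02:00", "path": "/pay", "code": 200}']

def from_web(line):
    """Convert a space-separated web server line to the common format."""
    date, time, _method, path, status = line.split()
    return {"timestamp": f"{date} {time}", "path": path, "status": int(status)}

def from_app(line):
    """Convert a JSON application server line to the common format."""
    raw = json.loads(line)
    return {"timestamp": raw["ts"], "path": raw["path"], "status": raw["code"]}

# Steps 2 and 3, transformation and storage: a list stands in for the central store.
central_store = [from_web(line) for line in web_server_logs]
central_store += [from_app(line) for line in app_server_logs]

# Step 4, processing: order all records by time across both sources.
central_store.sort(key=lambda record: record["timestamp"])

# Step 5, analysis: count requests per status code.
status_counts = Counter(record["status"] for record in central_store)
print(central_store)
print(dict(status_counts))
```

In a real deployment each step would be handled by a distributed component, but the order of operations is the same as in the list above.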

Big data helps organizations sort out various issues in log management. “These applications are aided by big data’s processing power, machine learning, predictive analytics, and advanced search capabilities”. A big data implemented log analytics platform:

- Collects and stores raw log data from various information systems
- Executes the data through buffers
- Loads it into a log analytics stack for query parsing, search indexing, and trend visualization
- Allows organizations to perform overall analysis of user trends, pattern analysis, market forecasting, clustering trends etc.

Because of these underlying benefits, log management has become a priority for most organizations, and Big Data provides a key to efficient log management in terms of processing, analysis and visualisation. Log management using Big Data technologies has become a popular combination because of its successful results. In conclusion, log management is a crucial activity in any forward-looking organization, and using big data tools for it provides deep insights that are unparalleled by any other source.

Bibliography

BBC, 2014. Big Data: Are you ready for blast-off?. [Online] Available at: http://www.bbc.com/news/business-26383058
Josef, 2017. [Online] Available at: https://www.joe0.com/2017/02/05/applying-big-data-analytics-to-logging/
Kent, K., 2006. Guide to Computer Security Log Management, s.l.: NIST.
Mac, 2015. [Online] Available at: https://www.eukhost.com/blog/webhosting/server-side-scripting-pros-and-cons/
TechTarget, 2016. big data. [Online] Available at: http://searchcloudcomputing.techtarget.com/definition/big-data-Big-Data
Torre, D., 2010. What is log management and how to choose the right tools. [Online] Available at: https://www.csoonline.com/article/2126060/network-security/network-security-what-is-logmanagement-and-how-to-choose-the-right-tools.html

