DATA Science PDF

Title DATA Science
Author Tabassum Shaikh
Course Bsc. Computer Science
Institution University of Mumbai
Pages 43



Description

T.Y.B.Sc. Computer Science

PRACTICAL LAB MANUAL FOR Data Science

COMPILED BY: Thabasum Shaikh
Roll_No: 47

INDEX

Serial No: Contents
1 Practical of Data collection, Data curation and management for Large-scale Data system (such as MongoDB)
2 Practical of Principal Component Analysis
3 Practical of Time-series forecasting
4 Practical of Simple/Multiple Linear Regression
5 Practical of Hypothesis testing
6 Practical of Analysis of Variance
7 Practical of Decision Tree

Practical-2
Aim: Practical of Data collection, Data curation and management ... Unstructured data (NoSQL).

cmd

1. Create a collection

2. Insert a record in a collection

3. MongoDB insert document

4. Insert multiple documents in a collection

Query documents based on criteria

Equality criteria:

Greater-than criteria:

Not-equals criteria:

Greater-than-or-equal criteria:

7. MongoDB update document

MongoDB – limit() and skip() methods

MongoDB sort() method

Sorting documents using the sort() method

Let's display the EmployeeId of all the documents in descending order

To display the student_id field of all the students in ascending order

How to create an index in MongoDB

MongoDB – finding the indexes in a collection

Practical 3
Aim: Practical of Principal Component Analysis (PCA)

PCA - PCA is a method of obtaining important variables (in the form of components) from a large set of variables available in a data set.

It extracts a low-dimensional set of features by projecting away the irrelevant dimensions of a high-dimensional data set, with the aim of capturing as much information as possible.

• With fewer variables obtained while minimising the loss of information, visualization also becomes more meaningful.
• PCA is more useful when dealing with data of 3 or more dimensions.
• It is always performed on a symmetric correlation or covariance matrix. This means the matrix should be numeric and the data standardized.
• PCA is very useful in situations where the data at hand is very large.
• For example, in image compression, PCA can be used to store the image in terms of its leading components, using far fewer values than the full set of pixels.
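The point about standardized data can be checked directly. The following is a minimal sketch in R, assuming the built-in iris data as the example (the names X, cov_of_scaled and cor_matrix are illustrative only): the covariance matrix of standardized columns coincides with the correlation matrix of the raw columns, and that matrix is symmetric.

```r
# Numeric columns of the built-in iris data (assumed example data).
X <- iris[, 1:4]

# Standardize each column to mean 0 and standard deviation 1,
# then take the covariance matrix of the standardized data.
cov_of_scaled <- cov(scale(X))

# Correlation matrix of the raw (unstandardized) columns.
cor_matrix <- cor(X)

# Both checks are expected to hold.
print(isSymmetric(cor_matrix))
print(all.equal(cov_of_scaled, cor_matrix, check.attributes = FALSE))
```

This is why running PCA on the correlation matrix is equivalent to standardizing the variables first.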

iris - iris is a datasets library in R contains it .  Below example we are 150 observations (rows) with 4 features .


eigen() -- computes the eigenvalues and eigenvectors simultaneously. Therefore, save the result in a variable and access the component you need ($values or $vectors).
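A minimal sketch of that pattern follows. It assumes the decomposition is taken on the correlation matrix of the four numeric iris columns; the manual's own call is not preserved in this copy, so the argument to eigen() is an assumption.

```r
# Assumption: decompose the correlation matrix of the numeric iris columns.
Eigen_data <- eigen(cor(iris[, 1:4]))

# Eigenvalues, returned in decreasing order.
print(Eigen_data$values)

# Eigenvectors, one per column, in the same order as the eigenvalues.
print(Eigen_data$vectors)
```

Because a correlation matrix has ones on its diagonal, the eigenvalues sum to the number of variables (4 here).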

princomp -- performs a principal components analysis on the given numeric data matrix and returns the results as an object of class princomp.
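As a hedged illustration, the call might look like the sketch below. The cor = TRUE argument is an assumption (it tells princomp to use the correlation matrix); the manual's actual call is not visible in this copy.

```r
# Assumption: PCA on the correlation matrix of the numeric iris columns.
PCA_data <- princomp(iris[, 1:4], cor = TRUE)

# The result is an object of class "princomp".
print(class(PCA_data))

# Squared standard deviations of the components are the component variances.
variances <- unname(PCA_data$sdev^2)
print(variances)

# With cor = TRUE these match the eigenvalues of the correlation matrix.
print(all.equal(variances, eigen(cor(iris[, 1:4]))$values))
```

The loadings in PCA_data$loadings correspond to the eigenvectors of the same matrix, possibly up to sign.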

Eigen_data$values output-

> PCA_data$loadings[,1:4]

> Eigen_data$vectors

> biplot(PCA_data) -- A Principal Components Analysis Biplot (or PCA Biplot for short) is a two-dimensional chart that represents the relationship between the rows and columns of a table.

> screeplot(PCA_data, type="lines")
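The scree plot draws the component variances, so the same information can be read numerically. The sketch below assumes PCA_data was fitted with cor = TRUE on the four numeric iris columns; prop_var and cum_var are illustrative names.

```r
# Assumption: PCA on the correlation matrix of the numeric iris columns.
PCA_data <- princomp(iris[, 1:4], cor = TRUE)

# Variance of each component (the heights plotted by screeplot()).
variances <- PCA_data$sdev^2

# Share of total variance explained by each component, and the running total.
prop_var <- variances / sum(variances)
cum_var <- cumsum(prop_var)

print(round(prop_var, 4))
print(round(cum_var, 4))
```

A sharp drop in prop_var after the first components is what appears as the bend in the scree plot.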

This plot shows the bend at the second principal component.

Let us now fit two naive Bayes models:
1. One over the entire data.
2. The second on the first principal component.

> model2 = PCA_data$loadings[,1]

> model2_scores <- ...
> mod1 <- ...
> mod2 <- ...

> table(predict(mod1, iris[,1:4]), iris[,5])

> table(predict(mod2, model2_scores), iris[,5])
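The right-hand sides of the assignments to model2_scores, mod1 and mod2 are not preserved in this copy. The following is one plausible reconstruction, not the manual's confirmed code: it assumes the scores are the projection of the raw data onto the first loading vector, and that naiveBayes() from the e1071 package is the classifier used.

```r
# Assumption: PCA on the correlation matrix of the numeric iris columns.
PCA_data <- princomp(iris[, 1:4], cor = TRUE)

# Loadings of the first principal component (as in the text).
model2 <- PCA_data$loadings[, 1]

# Assumed reconstruction: project each observation onto the first component.
model2_scores <- as.matrix(iris[, 1:4]) %*% model2

# naiveBayes() comes from the e1071 package, which is an assumption here;
# the models are fitted only if that package is installed.
if (requireNamespace("e1071", quietly = TRUE)) {
  mod1 <- e1071::naiveBayes(iris[, 1:4], iris[, 5])   # all four features
  mod2 <- e1071::naiveBayes(model2_scores, iris[, 5]) # first component only
  print(table(predict(mod1, iris[, 1:4]), iris[, 5]))
  print(table(predict(mod2, model2_scores), iris[, 5]))
}
```

Each table is a confusion matrix of predicted against actual species, so the two models can be compared by their off-diagonal counts.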

Practical 5
AIM: Practical of Time-series forecasting

> data("AirPassengers") - The classic Box & Jenkins airline data. Monthly totals of international airline passengers, 1949 to 1960.

> class(AirPassengers) - R possesses a simple generic function mechanism which can be used for an object-oriented style of programming. Method dispatch takes place based on the class of the first argument to the generic function.

> start(AirPassengers) - This is the start of the time series.

> end(AirPassengers) - This is the end of the time series.

> frequency(AirPassengers) - The cycle of this time series is 12 months in a year.

> summary(AirPassengers) - The number of passengers is distributed across the spectrum.
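The inspection commands above can be run together as a short sketch; the variable names (ts_class, ts_start and so on) are illustrative and not from the manual.

```r
# Load the built-in monthly airline passenger series.
data("AirPassengers")

ts_class <- class(AirPassengers)      # class used for method dispatch
ts_start <- start(AirPassengers)      # first (year, month) of the series
ts_end   <- end(AirPassengers)        # last (year, month) of the series
ts_freq  <- frequency(AirPassengers)  # observations per year

print(ts_class)
print(ts_start)
print(ts_end)
print(ts_freq)
print(summary(AirPassengers))
```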

> abline(reg=lm(AirPassengers~time(AirPassengers))) - This function adds one or more straight lines through the current plot; here it draws the fitted linear trend.
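The regression passed to abline() can also be inspected on its own. This is a small sketch, with trend_fit as an illustrative name; it only fits the line and leaves the plotting to the commands in the text.

```r
# Fit a straight-line trend of passenger counts against time (in years).
trend_fit <- lm(AirPassengers ~ time(AirPassengers))

# Intercept and slope; the slope is the average change per year.
trend_coef <- unname(coef(trend_fit))
print(trend_coef)
```

A positive slope confirms the upward trend that the line shows on the plot.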

> cycle(AirPassengers) - time creates the vector of times at which a time series was sampled. cycle gives the positions in the cycle of each observation. frequency returns the number of samples per unit time and deltat the time interval between observations.
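The relationship between these four functions is easy to see on the first year of the series. A minimal sketch, with illustrative variable names:

```r
# Position of each observation within its yearly cycle (1 = January).
month_pos <- cycle(AirPassengers)

# Time of each observation, in fractional years.
obs_times <- time(AirPassengers)

# Interval between observations; the reciprocal of the frequency.
step <- deltat(AirPassengers)

print(month_pos[1:12])
print(obs_times[1:3])
print(step)
```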

> plot(aggregate(AirPassengers,FUN=mean)) - This will aggregate the cycles and display a year-on-year trend.
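The aggregated series that gets plotted can be examined directly. In this sketch yearly_mean is an illustrative name; aggregate() on a time series collapses each year of 12 monthly values into one value by default.

```r
# Average the 12 monthly values of each year into one yearly value.
yearly_mean <- aggregate(AirPassengers, FUN = mean)

print(yearly_mean)
print(frequency(yearly_mean))  # one observation per year after aggregation
```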

> boxplot(AirPassengers~cycle(AirPassengers)) - A box plot across months gives a sense of the seasonal effect.
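The same seasonal grouping can be summarized numerically. The sketch below is an illustration, not part of the manual: it averages the passenger counts for each calendar month, which is the grouping the box plot uses.

```r
# Group the observations by position in the yearly cycle and average them.
monthly_mean <- tapply(as.numeric(AirPassengers),
                       as.integer(cycle(AirPassengers)),
                       mean)

# Label the 12 groups with month abbreviations for readability.
names(monthly_mean) <- month.abb
print(round(monthly_mean, 1))
```

Months whose averages sit well above the rest indicate the seasonal peak.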

>acf(diff(log(AirPassengers)))
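The command above applies two transformations before computing the autocorrelation: log() to stabilize the growing variance and diff() to remove the trend. A minimal sketch that keeps the intermediate objects, with illustrative names and with plotting switched off:

```r
# Log-transform, then difference once to remove the trend.
stationary <- diff(log(AirPassengers))

# Autocorrelation function, computed without drawing the plot.
acf_vals <- acf(stationary, plot = FALSE)

# First entry is lag 0, which is always 1.
print(length(stationary))
print(acf_vals$acf[1])
```

Calling acf(stationary) without plot = FALSE draws the correlogram shown by the original command.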

> (fit...

