
Title: ML Mini Project
Author: Atif Hussain
Course: Machine Learning
Institution: Savitribai Phule Pune University
Pages: 9



Description

A MINI PROJECT REPORT ON “OPTIMIZING IRIS DATASET USING GENETIC ALGORITHM” SUBMITTED TO THE SAVITRIBAI PHULE PUNE UNIVERSITY, PUNE, FOR LAB PRACTICE III (MACHINE LEARNING), BACHELOR OF ENGINEERING (COMPUTER ENGINEERING)

SUBMITTED BY

Name: Ayush Singh (Seat No. B1721008)
Name: Sanchit Sarode (Seat No. B1721010)
Name: Gaurav Shankhpal (Seat No. B1721015)

DEPARTMENT OF COMPUTER ENGINEERING
D.Y. PATIL COLLEGE OF ENGINEERING, AKURDI, PUNE-44
SAVITRIBAI PHULE PUNE UNIVERSITY, 2020-21


INDEX

Abstract
Introduction
    Problem Statement
    Objective
    Scope and Limitation
Description
    What is Genetic Algorithm?
    General Principles of Genetic Algorithm
    Operators of Genetic Algorithm
        Initialization Operator
        Fitness Assignment Operator
        Selection Operator
        Crossover Operator
        Mutation Operator
    Process and Results
    Dataset to use
Advantages of Genetic Algorithm
Disadvantages of Genetic Algorithm
Conclusion

Abstract

The main objective of this project is to apply the Genetic Algorithm for optimization on a dataset obtained from the UCI ML repository. Genetic algorithms are evolutionary techniques used for optimization, following the survival-of-the-fittest idea. These methods do not guarantee optimal solutions; however, they usually provide good approximations in a reasonable time.

Introduction

Problem Statement

Apply the Genetic Algorithm for optimization on a dataset obtained from the UCI ML repository. For example: the IRIS Dataset, the Travelling Salesman Problem, or the KDD Dataset.

Objective

The goal of a Genetic Algorithm is to solve constrained and unconstrained optimization problems through a natural selection process that mimics biological evolution.

Scope and Limitation

This project demonstrates the optimization of a dataset and builds an understanding of Genetic Algorithms. Genetic algorithms are non-deterministic methods, so the solutions they provide may vary each time the algorithm is run on the same instance.

Description

What is Genetic Algorithm?

This is a stochastic method for function optimization based on the mechanics of natural genetics and biological evolution. In nature, organisms' genes tend to evolve over successive generations to better adapt to the environment. The genetic algorithm is a heuristic optimization method inspired by the procedures of natural evolution. Genetic algorithms operate on a population of individuals to produce better and better approximations. At each generation, a new population is created by selecting individuals according to their level of fitness in the problem domain and recombining them using operators borrowed from natural genetics. The offspring may also undergo mutation. This process leads to populations of individuals that are better suited to their environment than the individuals they were created from, just as in natural adaptation.
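The generational cycle just described can be sketched in Python. This is a minimal illustration, not the project's actual code: the function name, parameter defaults, and the toy fitness function (which simply counts ones) are all assumptions made for the example.

```python
import random

random.seed(0)

def genetic_algorithm(fitness_fn, n_genes, pop_size=10, generations=30):
    """Minimal generational GA over binary chromosomes."""
    # Random binary population (each gene: include a feature or not).
    population = [[random.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population as parents.
        parents = sorted(population, key=fitness_fn,
                         reverse=True)[:pop_size // 2]
        offspring = []
        while len(offspring) < pop_size:
            a, b = random.sample(parents, 2)
            # Uniform crossover: each gene comes from either parent.
            child = [random.choice(pair) for pair in zip(a, b)]
            # Mutation: flip each gene with probability 1/n_genes.
            child = [1 - g if random.random() < 1 / n_genes else g
                     for g in child]
            offspring.append(child)
        population = offspring
    return max(population, key=fitness_fn)

# Toy fitness: the number of ones in the chromosome.
best = genetic_algorithm(sum, n_genes=6)
print(best)
```

Each of the operators used in this loop is discussed in detail in the sections that follow.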


General Principles of Genetic Algorithm

In our case, each individual in the population represents a neural network. Genes here are binary values and represent the inclusion or not of particular features in the model. The number of genes is the total number of input variables in the dataset. The number of individuals, or population size, must be chosen for each application. Usually, it is set to 10·N, where N is the number of features.

Operators of Genetic Algorithm

Initialization Operator

The first step is to create and initialize the individuals in the population. As the genetic algorithm is a stochastic optimization method, the individuals' genes are usually initialized at random.

To illustrate this operator, consider a predictive model represented by a neural network with 6 possible features. If we generate a population of 4 individuals, we have 4 different neural networks with random features. The next figure illustrates this population.
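Such a random initialization might look like the following sketch, matching the example's population of 4 individuals with 6 binary genes each (the variable names are illustrative, not taken from the report):

```python
import random

random.seed(42)

n_features = 6   # genes per individual (possible input features)
pop_size = 4     # individuals in the population

# Each gene is 1 if the corresponding feature enters the model, else 0.
population = [[random.randint(0, 1) for _ in range(n_features)]
              for _ in range(pop_size)]

for i, individual in enumerate(population, start=1):
    print(f"Individual {i}: {individual}")
```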

As we can see, each individual is represented by 6 binary genes. Each positive gene means that the corresponding feature is included in the neural network.

Fitness Assignment Operator

After the initialization, we need to assign a fitness value to each individual in the population. We train each neural network with the training instances and then evaluate its error on the selection instances. A large selection error means low fitness. Individuals with greater fitness have a higher probability of being selected for recombination.

The most widely used method for fitness assignment is rank-based fitness assignment. With this method, the selection errors of all the individuals are sorted. The fitness assigned to each individual then depends only on its position in the rank, and not on the actual selection error. The fitness value assigned to each individual with the rank-based method is

Φ(i) = κ⋅R(i), for i = 1, ..., N.

Here κ is a constant called the selective pressure, and its value is fixed between 1 and 2. Higher selective pressure values give the fittest individuals a greater probability of recombination. The parameter R(i) is the rank of individual i.

Coming back to our example, the following table shows the selection error, the rank, and the corresponding fitness of each individual:

              Selection error   Rank   Fitness
Individual 1        0.9           1      1.5
Individual 2        0.6           3      4.5
Individual 3        0.7           2      3.0
Individual 4        0.5           4      6.0

Here we have chosen κ = 1.5 to calculate the fitness values. Note that, to calculate the population fitness, we have trained 4 different neural networks. We can plot the above fitness values in a pie chart, where the area for each individual is proportional to its fitness. The following picture shows the fitness pie.
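The rank-based formula Φ(i) = κ⋅R(i) can be reproduced directly from the example's selection errors. This is a small illustrative sketch; the dictionary layout is an assumption made for clarity:

```python
selection_errors = {1: 0.9, 2: 0.6, 3: 0.7, 4: 0.5}
k = 1.5  # selective pressure, fixed between 1 and 2

# Rank 1 goes to the largest selection error (the least fit individual).
ordered = sorted(selection_errors, key=selection_errors.get, reverse=True)
ranks = {ind: r for r, ind in enumerate(ordered, start=1)}

# Rank-based fitness: phi(i) = k * R(i).
fitness = {ind: k * ranks[ind] for ind in selection_errors}

print(fitness)  # {1: 1.5, 2: 4.5, 3: 3.0, 4: 6.0}
```

The printed values match the table above: the individual with the smallest selection error (#4) receives the highest fitness.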

As we can see, the fittest individual (#4) has the most significant area (40%), and the least fit individual (#1) has the smallest area (10%).

Selection Operator

After the fitness assignment has been performed, the selection operator chooses the individuals that will recombine for the next generation. The individuals most likely to survive are those best fitted to the environment; therefore, the selection operator selects individuals according to their fitness level. The number of selected individuals is N/2, where N is the population size. Elitism selection makes the fittest individuals survive directly into the next generation. The elitism size controls the number of directly selected individuals, and it is usually set to a small value (1 or 2). One of the most widely used selection methods is the roulette wheel, also called stochastic sampling with replacement. This method places all the individuals on a roulette, with areas proportional to their fitness, as we saw above. The roulette is then turned, and individuals are selected at random; each selected individual goes on to recombination. The next figure illustrates the selection process for our example.
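Elitism plus one roulette-wheel spin can be sketched as follows, using the fitness values from the running example (the function name and structure are assumptions for illustration):

```python
import random

random.seed(7)

fitness = {1: 1.5, 2: 4.5, 3: 3.0, 4: 6.0}

def roulette_select(fitness):
    """Spin the wheel once: pick an individual with probability
    proportional to its fitness."""
    total = sum(fitness.values())
    spin = random.uniform(0, total)
    cumulative = 0.0
    for individual, f in fitness.items():
        cumulative += f
        if spin <= cumulative:
            return individual
    return individual  # guard against floating-point edge cases

# Elitism: the fittest individual survives directly (elitism size 1).
elite = max(fitness, key=fitness.get)
# The remaining slot (N/2 = 2 selections in total) is one roulette spin.
selected = [elite, roulette_select(fitness)]
print(selected)
```

Because the wheel is spun at random, the second selection can be any individual, with fitter ones simply being more likely.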


In this case, neural network #4 has been selected by elitism, and #3 has been selected by the roulette wheel. Note that, although individual #2 has higher fitness than #3, it has not been selected, due to the stochastic nature of the genetic algorithm. Here the number of selected individuals is half of the population size.

Crossover Operator

Once the selection operator has chosen half of the population, the crossover operator recombines the selected individuals to generate a new population. This operator picks two individuals at random and combines their features, producing four offspring for the new population, until the new population reaches the same size as the old one. The uniform crossover method decides whether each of the offspring's features comes from one parent or the other. The next figure illustrates the uniform crossover method for our example.

Here we have generated four offspring from two parents. Some features of each neural network come from one parent, and the rest from the other. The population size remains constant.
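Uniform crossover can be sketched as below. The two parent chromosomes are hypothetical values chosen for the example, and the four offspring restore the population size of 4:

```python
import random

random.seed(3)

def uniform_crossover(parent_a, parent_b):
    """Each offspring gene comes from one parent or the other,
    chosen with equal probability."""
    return [random.choice(pair) for pair in zip(parent_a, parent_b)]

parent_a = [1, 0, 1, 1, 0, 0]
parent_b = [0, 1, 1, 0, 0, 1]

# Two selected parents produce enough offspring to restore the
# population size (4 in the running example).
offspring = [uniform_crossover(parent_a, parent_b) for _ in range(4)]
for child in offspring:
    print(child)
```

Where both parents agree on a gene, every child inherits that value; where they disagree, each child gets one value or the other at random.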


Mutation Operator

The crossover operator can generate offspring that are very similar to the parents, which may cause a new generation with low diversity. The mutation operator addresses this problem by randomly changing the value of some features in the offspring. To decide whether a feature is mutated, we generate a random number between 0 and 1. If this number is lower than a value called the mutation rate, that variable is flipped. The mutation rate is usually chosen to be 1/m, where m is the number of features. With that value, we mutate, on average, one feature of each individual. The next image shows the mutation of one of the offspring of the new generation.
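The per-gene flip with probability 1/m might look like this sketch (the offspring chromosome is a hypothetical example value):

```python
import random

random.seed(1)

def mutate(individual, mutation_rate=None):
    """Flip each binary gene with probability equal to the mutation
    rate (1/m by default, where m is the number of features)."""
    m = len(individual)
    if mutation_rate is None:
        mutation_rate = 1.0 / m
    return [1 - g if random.random() < mutation_rate else g
            for g in individual]

offspring = [1, 0, 1, 1, 0, 0]
mutated = mutate(offspring)
print(mutated)
```

With a rate of 1/m, each individual has one gene flipped on average per generation, which keeps diversity without destroying good solutions.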

As we can see, the fourth input of the neural network has been mutated. At this point, we have a new population.

Process and Results

The whole process of fitness assignment, selection, recombination, and mutation is repeated until a stopping criterion is satisfied. Each generation is likely to be better adapted to the environment than the previous one. The following figure shows the typical behavior of the genetic algorithm: the blue line represents the mean selection error of the individuals at each generation, and the red line represents the selection error of the best individual across all generations.


Dataset to use

For this project, we use the IRIS Flower Dataset from the UCI ML Repository.
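A sketch of how a binary chromosome could select columns of the Iris data for fitness evaluation, assuming scikit-learn is available. The report evaluates neural networks; here a k-nearest-neighbors classifier stands in as a lightweight substitute, so this is an assumption, not the project's actual model:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

def fitness(chromosome):
    """Mean cross-validated accuracy of a model trained on the
    features whose genes are 1 (0.0 if no feature is selected)."""
    columns = [i for i, gene in enumerate(chromosome) if gene == 1]
    if not columns:
        return 0.0
    model = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(model, X[:, columns], y, cv=5).mean()

acc_all = fitness([1, 1, 1, 1])    # all four features
acc_petal = fitness([0, 0, 1, 1])  # petal length and width only
print(acc_all, acc_petal)
```

Plugging a fitness function like this into the generational loop shown earlier turns the genetic algorithm into a feature selector for the Iris data.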

Advantages of Genetic Algorithm

• They usually perform better than traditional feature selection techniques.
• Genetic algorithms can manage data sets with many features.
• They don't need specific knowledge about the problem under study.
• These algorithms can be easily parallelized in computer clusters.

Disadvantages of Genetic Algorithm

• Genetic Algorithms might be costly in computational terms, since the evaluation of each individual requires the training of a model.
• These algorithms can take a long time to converge, since they have a stochastic nature.

Conclusion

Thus, we successfully optimized a dataset by applying a Genetic Algorithm.


