Lab-5 - To determine best probability distribution fits to unknown data. People often PDF

Title	Lab-5 - To determine best probability distribution fits to unknown data. People often
Course	Stochastic Processes Lab
Institution	Eastern Washington University
Pages	16
File Size	465.9 KB
File Type	PDF
Total Downloads	19
Total Views	143

Preview

CLICK TO PREVIEW PDF

Summary

To determine best probability distribution fits to unknown data. People often make assumptions about probability models without really checking. This lab explores how determining actual models can be difficult. ...

Description

LAB 5 EENG 338

(AUTHOR NAME)

Objectives To determine best probability distribution fits to unknown data. People often make assumptions about probability models without really checking. This lab explores how determining actual models can be difficult.

Part – 1 Given datasets (each 50x1 double) are: r1= 20 11 8 5 8 8 19 10 13 1 12 16 14 11 17 20 7 16 6 1 2 7 7 1 4 9 16 14 15 10 19 20 19 10 19 8 13 12 17 6 14 13 14 7 7 4 3 6 8 11 r2= 2 4 0 9 1

1 3 0

1 1

9 0

0 6

4 3

1 2

1 1

0 0

8 4

0 0

2 2

2 0

3 1

1 3

3 2

2 0

2 0

1 0

1 5

0 5

r3= 0 5 0 2 0

3 2 4

1 2

3 2

3 3

1 3

4 0

1 4

3 2

1 0

2 0

2 2

1 0

3 2

3 1

3 0

3 1

0 3

2 2

2 4 2 1 2 2 1 2

r4= 2 4 3 2 3

3 5 3

5 1

1 5

1 3

0 1

2 6

0 1

3 4

3 4

7 3

5 7

1 1

1 6

2 0

2 1

3 1

2 1

3 2

4 3

2 4

0 0

6 2

2 1

2 2

Following function is used to find fits of all valid parametric probability distributions to the datasets: [D PD] = ALLFITDIST(DATA)

(MATLAB Code is given in the Appendix)

It fits all valid parametric probability distributions to the data in vector DATA, and returns a struct D of fitted distributions and parameters and a struct of objects PD representing the fitted distributions. PD is an object in a class derived from the ProbDist class. List of distributions it will try to fit: Continuous (default)

   

Beta Birnbaum-Saunders Exponential Extreme value

            

Gamma Generalized extreme value Generalized Pareto Inverse Gaussian Logistic Log-logistic Lognormal Nakagami Normal Rayleigh Rician t location-scale Weibull

Discrete ('DISCRETE')

 Binomial  Negative binomial  Poisson Dataset - r1 [D PD] = allfitdist(r1,'PDF') Pr obab i l i t yDen s i t yFun ct i o n empi r i c al ge ner a l i z edpa r et o r ay l ei g h r i c i an na k aga mi

0. 4

Pr obabi l i t yDen s i t y

0. 3 5 0. 3 0. 2 5 0. 2 0. 1 5 0. 1 0. 0 5 0

Dataset – r2

2

4

6

8

10 12 Va l ue

14

16

[D PD] = allfitdist(r2,'PDF')

18

20

Pr obab i l i t yDen s i t yFun ct i o n 2. 5 empi r i c al ge ner a l i z edex t r emev a l ue ex pone nt i a l ge ner a l i z edpa r et o t l o cat i o ns c al e

Pr obab i l i t yDen s i t y

2

1. 5

1

0. 5

0 1

2

3

4

5 Va l ue

6

7

8

9

Dataset – r3 [D PD] = allfitdist(r3,'PDF')

Pr obab i l i t yDen s i t yFun ct i o n 3. 5 empi r i c al ge ner a l i z edpa r et o ex pone nt i a l no r ma l l og i s t i c

3

Pr obab i l i t yDen s i t y

2. 5 2

1. 5 1

0. 5

0 1

0

1

2

3 Va l ue

Dataset – r4

4

5

6

[D PD] = allfitdist(r4,'PDF') Pr obab i l i t yDen s i t yFun ct i o n 1. 8 empi r i c al ge ner a l i z edpa r et o ex pone nt i a l ge ner a l i z edex t r emev a l ue no r ma l

1. 6

Pr obab i l i t yDen s i t y

1. 4 1. 2 1 0. 8 0. 6 0. 4 0. 2 0 1

0

1

2

3 4 Va l ue

5

6

7

8

As seen from the plots, rician distribution for r1, generalized pareto distribution for r2 and normal distributions for r3 and r4 fit well compared to other distributions.

Part – 2 Mean and Standard Deviations of the datasets are: Dataset – r1 mean(r1) ans = 10.7600 std(r1) ans = 5.5238

Dataset – r2 mean(r2) ans = 1.9800 std(r2) ans = 2.2901

Dataset – r3

mean(r3) ans = 1.9000 std(r3) ans = 1.2817

Dataset – r4 mean(r4) ans = 2.7600 std(r4) ans = 1.8020

Part – 3 Histograms of random vectors (Normal Distribution, 50 samples) generated using randtool: Di st r i b ut i on

Sampl es

14 12

Count s

10 8 6 4 2 0 8

6

4

2

0 Val ues

2

4

6

8

Histograms of random vectors (Generalized Pareto Distribution, 50 samples) generated using randtool:

Di st r i but i on

Sampl es

16 14

Count s

12 10 8 6 4 2 0

0

5

10 Val ues

15

Histogram Comparisons Hi s t ogr am -Dat asetr 2

Di st r i but i on

15

Sa mpl es

16 14 12

10

10 8 6 5

4 2 0 0

0 1

2

3

4

5

6

7

8

9

10

8

Di s t r i b ut i on 14 12 10 8

6

5

10 Vl

Hi s t ogr am-Da t as e tr 4 12

0

Sampl e s

15

4 2

0 0

2 1

2

3

4

5

6

7

0 8

6

4

2

0

2

4

6

Part – 4 Histograms of random vectors (Normal Distribution, 1000 samples) generated using randtool: Di st r i but i on

Sampl es

90 80 70

Count s

60 50 40 30 20 10 0 8

6

4

2

0 Val ues

2

4

Now, using allfitdist function, [D PD] = allfitdist(normrv,'PDF')

6

8

8

Pr obabi l i t yDen s i t yFun ct i o n 0. 5 empi r i c al no r ma l t l o cat i o ns c al e l og i s t i c ge ner a l i z edex t r emev a l ue

0. 45 0. 4 Pr obab i l i t yDens i t y

0. 35 0. 3 0. 25 0. 2 0. 15 0. 1 0. 05 0 4

3

2

1

0 Val ue

1

2

3

4

As seen from the above plot, as the number of measurement samples is increased, noise is reduced and parametric probability distribution model fits better.

Summary Determining actual probability models to randomly generated data is difficult and best probability distribution fits are obtained as number of samples is increased, thereby reducing random noise.

Appendix – Function allfitdist function [D PD] = allfitdist(data,sortby,varargin)

%ALLFITDIST Fit all valid parametric probability distributions to data. % [D PD] = ALLFITDIST(DATA) fits all valid parametric probability % distributions to the data in vector DATA, and returns a struct D of % fitted distributions and parameters and a struct of objects PD % representing the fitted distributions. PD is an object in a class % derived from the ProbDist class. % % [...] = ALLFITDIST(DATA,SORTBY) returns the struct of valid distributions % sorted by the parameter SORTBY % NLogL - Negative of the log likelihood % BIC - Bayesian information criterion (default) % AIC - Akaike information criterion % AICc - AIC with a correction for finite sample sizes % % [...] = ALLFITDIST(...,'DISCRETE') specifies it is a discrete % distribution and does not attempt to fit a continuous distribution % to the data % % [...] = ALLFITDIST(...,'PDF') or (...,'CDF') plots either the PDF or CDF % of a subset of the fitted distribution. The distributions are plotted in % order of fit, according to SORTBY. % % List of distributions it will try to fit % Continuous (default) % Beta % Birnbaum-Saunders % Exponential % Extreme value % Gamma % Generalized extreme value % Generalized Pareto % Inverse Gaussian % Logistic % Log-logistic % Lognormal % Nakagami % Normal % Rayleigh % Rician % t location-scale % Weibull % % Discrete ('DISCRETE') % Binomial % Negative binomial % Poisson % %% Check Inputs if nargin == 0 data = 10.^((normrnd(2,10,1e4,1))/10); sortby='BIC'; varargin={'CDF'}; end if nargin==1 sortby='BIC'; end sortbyname={'NLogL','BIC','AIC','AICc'};

if ~any(ismember(lower(sortby),lower(sortbyname))) oldvar=sortby; %May be 'PDF' or 'CDF' or other commands if isempty(varargin) varargin={oldvar}; else varargin=[oldvar varargin]; end sortby='BIC'; end if nargin < 2, sortby='BIC'; end distname={'beta', 'birnbaumsaunders', 'exponential', ... 'extreme value', 'gamma', 'generalized extreme value', ... 'generalized pareto', 'inversegaussian', 'logistic', 'loglogistic', ... 'lognormal', 'nakagami', 'normal', ... 'rayleigh', 'rician', 'tlocationscale', 'weibull'}; if ~any(strcmpi(sortby,sortbyname)) error('allfitdist:SortBy','Sorting must be either NLogL, BIC, AIC, or AICc'); end %Input may be mixed of numeric and strings, find only strings vin=varargin; strs=find(cellfun(@(vs)ischar(vs),vin)); vin(strs)=lower(vin(strs)); %Next check to see if 'PDF' or 'CDF' is listed numplots=sum(ismember(vin(strs),{'pdf' 'cdf'})); if numplots>=2 error('ALLFITDIST:PlotType','Either PDF or CDF must be given'); end if numplots==1 plotind=true; %plot indicator indxpdf=ismember(vin(strs),'pdf'); plotpdf=any(indxpdf); indxcdf=ismember(vin(strs),'cdf'); vin(strs(indxpdf|indxcdf))=[]; %Delete 'PDF' and 'CDF' in vin else plotind=false; end %Check to see if discrete strs=find(cellfun(@(vs)ischar(vs),vin)); indxdis=ismember(vin(strs),'discrete'); discind=false; if any(indxdis) discind=true; distname={'binomial', 'negative binomial', 'poisson'}; vin(strs(indxdis))=[]; %Delete 'DISCRETE' in vin end strs=find(cellfun(@(vs)ischar(vs),vin)); n=numel(data); %Number of data points data = data(:); D=[]; %Check for NaN's to delete deldatanan=isnan(data); %Check to see if frequency is given indxf=ismember(vin(strs),'frequency'); if any(indxf) freq=vin{1+strs((indxf))}; freq=freq(:); if numel(freq)~=numel(data)

error('ALLFITDIST:PlotType','Matrix dimensions must agree'); end delfnan=isnan(freq); data(deldatanan|delfnan)=[]; freq(deldatanan|delfnan)=[]; %Save back into vin vin{1+strs((indxf))}=freq; else data(deldatanan)=[]; end

%% Run through all distributions in FITDIST function warning('off','all'); %Turn off all future warnings for indx=1:length(distname) try dname=distname{indx}; switch dname case 'binomial' PD=fitbinocase(data,vin,strs); %Special case case 'generalized pareto' PD=fitgpcase(data,vin,strs); %Special case otherwise %Built-in distribution using FITDIST PD = fitdist(data,dname,vin{:}); end NLL=PD.NLogL; % -Log(L) %If NLL is non-finite number, produce error to ignore distribution if ~isfinite(NLL) error('non-finite NLL'); end num=length(D)+1; PDs(num) = {PD}; %#ok k=numel(PD.Params); %Number of parameters D(num).DistName=PD.DistName; D(num).NLogL=NLL; D(num).BIC=-2*(-NLL)+k*log(n); D(num).AIC=-2*(-NLL)+2*k; D(num).AICc=(D(num).AIC)+((2*k*(k+1))/(n-k-1)); D(num).ParamNames=PD.ParamNames; D(num).ParamDescription=PD.ParamDescription; D(num).Params=PD.Params; D(num).Paramci=PD.paramci; D(num).ParamCov=PD.ParamCov; D(num).Support=PD.Support; catch err %#ok %Ignore distribution end end warning('on','all'); %Turn back on warnings if numel(D)==0 error('ALLFITDIST:NoDist','No distributions were found'); end %% Sort distributions indx1=1:length(D); %Identity Map sortbyindx=find(strcmpi(sortby,sortbyname));

switch sortbyindx case 1 [~,indx1]=sort([D.NLogL]); case 2 [~,indx1]=sort([D.BIC]); case 3 [~,indx1]=sort([D.AIC]); case 4 [~,indx1]=sort([D.AICc]); end %Sort D=D(indx1); PD = PDs(indx1); %% Plot if requested if plotind; plotfigs(data,D,PD,vin,strs,plotpdf,discind) end end function PD=fitbinocase(data,vin,strs) %% Special Case for Binomial % 'n' is estimated if not given vinbino=vin; %Check to see if 'n' is given indxn=any(ismember(vin(strs),'n')); %Check to see if 'frequency' is given indxfreq=ismember(vin(strs),'frequency'); if ~indxn %Use Method of Moment estimator %E[x]=np, V[x]=np(1-p) -> nhat=E/(1-(V/E)); if isempty(indxfreq)||~any(indxfreq) %Raw data mnx=mean(data); nhat=round(mnx/(1-(var(data)/mnx))); else %Frequency data freq=vin{1+strs(indxfreq)}; m1=dot(data,freq)/sum(freq); m2=dot(data.^2,freq)/sum(freq); mnx=m1; vx=m2-(m1^2); nhat=round(mnx/(1-(vx/mnx))); end %If nhat is negative, use maximum value of data if nhat 1e4; lgx = true; else lgx = false; end subplot(2,1,1) if lgx semilogx(xi,fi,'k',xi,ys) else plot(xi,fi,'k',xi,ys) end legend(['empirical',{D(inds).DistName}],'Location','NE') xlabel('Value'); ylabel('Cumulative Probability'); title('Cumulative Distribution Function'); grid on subplot(2,1,2) y = 1.1*bsxfun(@minus,ys,fi); if lgx semilogx(xi,bsxfun(@minus,ys,fi)) else plot(xi,bsxfun(@minus,ys,fi)) end ybnds = max(abs(y(:))); ax = axis; axis([ax(1:2) -ybnds ybnds]); legend({D(inds).DistName},'Location','NE') xlabel('Value'); ylabel('Error'); title('CDF Error'); grid on else %Discrete Data indxf=ismember(vin(strs),'frequency'); if any(indxf) [fi xi] = ecdf(data,'frequency',freq); else [fi xi] = ecdf(data);

end %Check unique xi, combine fi [xi,ign,indx]=unique(xi); %#ok fi=accumarray(indx,fi); inds = 1:min([max_num_dist,numel(PD)]); ys = cellfun(@(PD) cdf(PD,xi),PD(inds),'UniformOutput',0); ys=cat(2,ys{:}); subplot(2,1,1) stairs(xi,[fi ys]); legend(['empirical',{D(inds).DistName}],'Location','NE') xlabel('Value'); ylabel('Cumulative Probability'); title('Cumulative Distribution Function'); grid on subplot(2,1,2) y = 1.1*bsxfun(@minus,ys,fi); stairs(xi,bsxfun(@minus,ys,fi)) ybnds = max(abs(y(:))); ax = axis; axis([ax(1:2) -ybnds ybnds]); legend({D(inds).DistName},'Location','NE') xlabel('Value'); ylabel('Error'); title('CDF Error'); grid on end end end...