Title | Engineering Data Analysis Handsout Module 1-6 |
---|---|
Course | Engineering Data Analysis |
Institution | Technological Institute of the Philippines |
ENGINEERING DATA ANALYSIS (Summary)
1st Semester, S.Y. 2020-2021
Handsout for CE 023 (Engineering Data Analysis)
Lesson 1: Data Collection
Data Collection - is a systematic way of gathering and measuring information on different groups of people. The data collected can be used for research, hypothesis testing, and other intended purposes.
TYPES OF DATA
1. Quantitative - sets of data in numerical form that can be either counted or measured.
• Discrete Data - data that can be "counted" (e.g. No. of Pencils, No. of People)
• Continuous Data - data that can be "measured" (e.g. Height, Weight, and Temperature)
2. Qualitative - sets of data that describe characteristics and classifications.
• Binary Data - falls under two mutually exclusive categories (e.g. right/wrong, true/false)
• Nominal Data - named categories with no specific rank or order (e.g. blue/red/green)
• Ordinal Data - categories with specific rank or natural order (e.g. short, medium, tall)
Note: Data collection involves either sampling or experimentation.
SAMPLING
If you are collecting data about a small group of people, say, about 10 students, it is easy to tally and record them accordingly. But if the statistical population is too large to survey, it is better to gather data from a sample only. This process is called sampling. Sampling is the selection of a subset of individuals from within a statistical population to estimate characteristics of the whole population.
EXPERIMENTATION
On the other hand, experimentation is the collection of data in a more controlled manner. One example is the data you collect as a result of your laboratory experiments. Note that experimentation is not limited to the laboratory. Most companies use experimentation to test their hypotheses. For example, a company can launch a sales competition to test how salespeople react to different levels of performance incentives.
Lesson 2: Introduction to Probability
Probability - is a measure of the likelihood that a particular event will occur. To compute the probability of a particular event when all outcomes are equally likely:
Probability of an event = (number of ways it can happen) / (total number of outcomes)
If an event is certain to happen, its Probability = 1. If an event is impossible, its Probability = 0. Therefore, the Probability value ranges from 0 to 1.
Probability can be expressed as a decimal, fraction, or percentage. Let's take a look at these examples:
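As an illustration of the formula above, here is a minimal Python sketch. The die and card scenarios, and the helper names `probability` and `describe`, are assumptions made for this example and are not taken from the handout. It shows the same probability written as a fraction, a decimal, and a percentage.

```python
from fractions import Fraction


def probability(ways_event_can_happen, total_outcomes):
    """Classical probability, assuming all outcomes are equally likely."""
    if total_outcomes <= 0:
        raise ValueError("total_outcomes must be positive")
    if not 0 <= ways_event_can_happen <= total_outcomes:
        raise ValueError("ways_event_can_happen must be between 0 and total_outcomes")
    return Fraction(ways_event_can_happen, total_outcomes)


def describe(p):
    """Return the same probability as a fraction, a decimal, and a percentage."""
    return f"{p} = {float(p):.4f} = {float(p):.2%}"


# Assumed example 1: rolling an even number on a fair six-sided die.
die_faces = [1, 2, 3, 4, 5, 6]
even_faces = [face for face in die_faces if face % 2 == 0]
p_even = probability(len(even_faces), len(die_faces))

# Assumed example 2: drawing a heart from a standard 52-card deck (13 hearts).
p_heart = probability(13, 52)

# A certain event and an impossible event mark the two ends of the range.
p_certain = probability(6, 6)      # rolling a number from 1 to 6
p_impossible = probability(0, 6)   # rolling a 7

for label, p in [("even roll", p_even), ("heart", p_heart),
                 ("certain", p_certain), ("impossible", p_impossible)]:
    print(f"{label}: {describe(p)}")
```

Using `Fraction` keeps the results exact, so the fraction form is reduced automatically (13/52 becomes 1/4).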
Lesson 2.1: Basic Rules of Combining Probabilities
There are basic rules to follow on combining probabilities:
1. ADDITION RULE
(a) If the events are mutually exclusive, there is no overlap: if one event occurs, the other events cannot occur. In that case, the probability of occurrence of one or another of more than one event is the sum of the probabilities of the separate events. Mutually exclusive events are two or more events that cannot happen at the same time.
(b) If the events are not mutually exclusive, there can be overlap between them. This can be visualized using a Venn diagram. The probability of overlap must be subtracted from the sum of probabilities of the separate events.
Set Relations on Venn Diagram
Let's look at the Venn diagram (b) and (c)
• P [A ∩ B] = P [occurrence of both A and B], the intersection of events A and B.
• P [A ∪ B] = P [occurrence of A or B or both], the union of the two events A and B.
• If the two events being considered, A and B, are not mutually exclusive, so that there may be overlap between them, the Addition Rule becomes P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
If three events A, B, and C are not mutually exclusive:
P (A ∪ B ∪ C) = P (A) + P (B) + P (C) − P (A ∩ B) − P (A ∩ C) − P (B ∩ C) + P (A ∩ B ∩ C)
2. MULTIPLICATION RULE
(a) The basic idea for calculating the number of ways can be described as follows: If an operation can be performed in n₁ ways and if for each of these ways a second operation can be performed in n₂ ways, then the two operations can be performed together in n₁n₂ ways.
Note: For more than two operations: If an operation can be performed in n₁ ways, and if for each of these a second operation can be performed in n₂ ways, and for each of the first two a third operation can be performed in n₃ ways, and so forth, then the sequence of k operations can be performed in n₁n₂ ··· nₖ ways.
(b) The simplest form of the Multiplication Rule for probabilities is as follows: If the events are independent, then the occurrence of one event does not affect the probability of occurrence of another event. In that case, the probability of occurrence of more than one event together is the product of the probabilities of the separate events. (This is consistent with the basic idea of counting stated above.) If A and B are two separate events that are independent of one another, the probability of occurrence of both A and B together is given by P [A ∩ B] = P [A] × P [B]
(c) If the events are not independent, one event affects the probability of the other event. In this case, conditional probability must be used. The conditional probability of B given that A occurs, or on condition that A occurs, is written P [B | A]. This is read as the probability of B given A, or the probability of B on condition that A occurs.
Note: The multiplication rule for the occurrence of both A and B together when they are not independent is the product of the probability of one event and the conditional probability of the other:
P [A ∩ B] = P [A] × P [B | A] = P [B] × P [A | B]
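The addition and multiplication rules above can be checked by counting outcomes directly. The sketch below assumes a two-dice experiment; the events A, B, and C and the helper `prob` are illustrative choices, not part of the handout.

```python
from fractions import Fraction
from itertools import product

# Counting rule: 6 ways for the first die times 6 ways for the second die.
space = list(product(range(1, 7), repeat=2))


def prob(event):
    """P[event] = favorable outcomes / total outcomes over the two-dice space."""
    favorable = sum(1 for outcome in space if event(outcome))
    return Fraction(favorable, len(space))


def first_is_six(outcome):       # event A
    return outcome[0] == 6


def second_is_six(outcome):      # event B, independent of A
    return outcome[1] == 6


def total_at_least_ten(outcome):  # event C, NOT independent of A
    return outcome[0] + outcome[1] >= 10


p_a = prob(first_is_six)
p_b = prob(second_is_six)
p_c = prob(total_at_least_ten)
p_a_and_b = prob(lambda o: first_is_six(o) and second_is_six(o))
p_a_or_b = prob(lambda o: first_is_six(o) or second_is_six(o))
p_a_and_c = prob(lambda o: first_is_six(o) and total_at_least_ten(o))

# Addition rule for events that are not mutually exclusive.
assert p_a_or_b == p_a + p_b - p_a_and_b

# Multiplication rule for independent events.
assert p_a_and_b == p_a * p_b

# Conditional probability for dependent events: P[C | A] = P[A ∩ C] / P[A].
p_c_given_a = p_a_and_c / p_a
p_a_given_c = p_a_and_c / p_c
assert p_a_and_c == p_a * p_c_given_a == p_c * p_a_given_c
assert p_a_and_c != p_a * p_c  # the simple product fails when events are dependent

print("P[A or B] =", p_a_or_b)
print("P[A and B] =", p_a_and_b)
print("P[C | A] =", p_c_given_a)
```

Enumerating the sample space is only practical for small experiments, but it makes each rule easy to verify by hand.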
Lesson 3: Permutation and Combination
Permutation - is an arrangement of all or part of a set of objects. The number of permutations is the number of different arrangements in which items can be placed. Notice that if the order of the items is changed, the arrangement is different, so we have a different permutation. In permutations, the order is important!
• Rule 1. The number of permutations of n distinct objects is n!
• Rule 2. The number of permutations of n distinct objects taken r at a time is nPr = n! / (n − r)!
• Rule 3. If n items are arranged in a circle, the arrangement doesn't change if every item is moved by one place to the left or the right. Therefore in this situation, one item can be placed at random, and all the other items are placed relative to the first item. The number of permutations of n objects arranged in a circle is (n − 1)!
• Rule 4. The number of distinct permutations of n things of which n₁ are of one kind, n₂ of a second kind, ... , nₖ of a kth kind is n! / (n₁! n₂! ··· nₖ!)
Combinations - are similar to permutations, but with the important difference that combinations take no account of order. Thus, AB and BA are different permutations but the same combination of letters. The number of permutations is therefore at least as large as the number of combinations, and the ratio between them is the number of ways the chosen items can be arranged, which is r!
In general, the number of combinations of n items taken r at a time is nCr = nPr / r! = n! / [r! (n − r)!]
Lesson 3.1: SAMPLING DISTRIBUTION
Population and Sample
Often in practice, we are interested in drawing valid conclusions about a large group of individuals or objects. Instead of examining the entire group, called the population, which may be difficult or impossible to do, we may examine only a small part of this population, which is called a sample. We do this with the aim of inferring certain facts about the population from results found in the sample, a process known as statistical inference. The process of obtaining samples is called sampling. Let's take a look at these examples below.
a. We may wish to draw conclusions about the weights of 12,000 adult students (the population) by examining only 100 students (a sample) selected from this population.
b. We may wish to draw conclusions about the percentage of defective bolts produced in a factory during a given 6-day week by examining 20 bolts each day produced at various times during the day. In this case, all bolts produced during the week comprise the population, while the 120 selected bolts constitute a sample.
c. We may wish to draw conclusions about the fairness of a particular coin by tossing it repeatedly. The population consists of all possible tosses of the coin. A sample could be obtained by examining, say, the first 60 tosses of the coin and noting the percentages of heads and tails.
d. We may wish to draw conclusions about the colors of 200 marbles (the population) in an urn by selecting a sample of 20 marbles from the urn, where each marble selected is returned after its color is observed.
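Returning briefly to the counting rules of Lesson 3, the permutation rules and the combination formula can be checked by brute-force enumeration. This is a minimal sketch: the letters "ABCD", the word "LEVEL", and all variable names are assumptions chosen for illustration.

```python
import math
from collections import Counter
from itertools import combinations, permutations

items = "ABCD"
n, r = len(items), 2

# Rule 1: n distinct objects can be arranged in n! ways.
all_arrangements = list(permutations(items))
assert len(all_arrangements) == math.factorial(n)

# Rule 2: nPr = n! / (n - r)!
n_p_r = math.factorial(n) // math.factorial(n - r)
assert len(list(permutations(items, r))) == n_p_r

# Rule 3: arrangements in a circle. Rotations of the same circle are treated
# as identical by rotating every arrangement so that "A" comes first.
circular = set()
for arrangement in all_arrangements:
    start = arrangement.index("A")
    circular.add(arrangement[start:] + arrangement[:start])
assert len(circular) == math.factorial(n - 1)

# Rule 4: distinct permutations when some items repeat: n! / (n1! n2! ... nk!)
word = "LEVEL"
distinct_by_formula = math.factorial(len(word))
for count in Counter(word).values():
    distinct_by_formula //= math.factorial(count)
assert len(set(permutations(word))) == distinct_by_formula

# Combinations: nCr = nPr / r!, because each choice can be ordered in r! ways.
n_c_r = n_p_r // math.factorial(r)
assert len(list(combinations(items, r))) == n_c_r

print("4P2 =", n_p_r, " 4C2 =", n_c_r, " circular(4) =", len(circular),
      " distinct permutations of LEVEL =", distinct_by_formula)
```

In practice `math.perm(n, r)` and `math.comb(n, r)` (Python 3.8 and later) return the same values without enumerating anything.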
Sampling With or Without Replacement
If we draw an object from an urn, we have the choice of replacing or not replacing the object into the urn before we draw again. In the first case, a particular object can come up again and again, whereas in the second it can come up only once. Sampling where each member of a population may be chosen more than once is called sampling with replacement, while sampling where each member cannot be chosen more than once is called sampling without replacement. A finite population that is sampled with replacement can theoretically be considered infinite since samples of any size can be drawn without exhausting the population. For most practical purposes, sampling from a finite population that is very large can be considered as sampling from an infinite population.
Sample Distribution
The sampling distribution describes the expected behavior of a large number of simple random samples drawn from the same population.
Sample Mean
The sample mean is the sum of the observed sample values divided by the sample size n.
Sampling Distribution of Means
The sampling distribution of means is the distribution of the sample mean over all possible samples of the same size drawn from the same population.
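A small simulation can make these ideas concrete. The sketch below assumes a hypothetical population of 200 numbered marbles, a sample size of 20, and a fixed random seed; none of these values come from the handout. It draws samples with and without replacement, then approximates the sampling distribution of means by repeating the draw many times.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed (an assumption) so the run is repeatable

population = list(range(1, 201))  # hypothetical population: marbles numbered 1..200
sample_size = 20

# With replacement: the same marble may be drawn more than once.
with_replacement = rng.choices(population, k=sample_size)

# Without replacement: each marble can appear at most once.
without_replacement = rng.sample(population, k=sample_size)

# Approximate the sampling distribution of means by drawing many samples
# with replacement and recording the mean of each one.
sample_means = [
    statistics.mean(rng.choices(population, k=sample_size))
    for _ in range(2000)
]

population_mean = statistics.mean(population)
mean_of_sample_means = statistics.mean(sample_means)
spread_of_sample_means = statistics.stdev(sample_means)

print("population mean:", population_mean)
print("mean of sample means:", round(mean_of_sample_means, 2))
print("standard deviation of sample means:", round(spread_of_sample_means, 2))
```

The sample means cluster around the population mean, and their spread is much smaller than the spread of the population itself.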
Lesson 3.2: POINT ESTIMATION
Point Estimate
A Point Estimate of a parameter θ is a single number that can be regarded as a sensible value for θ. A point estimate is obtained by selecting a suitable statistic and computing its value from the given sample data. The selected statistic is called the point estimator of θ.
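As a sketch of the definition above, the statistic (the estimator) is a rule, and the point estimate is the single number the rule produces for one particular sample. The data below are hypothetical values invented for illustration, not measurements from the handout.

```python
import statistics

# Hypothetical sample: weights in kg of 8 students.
weights_kg = [58.2, 61.5, 55.0, 63.1, 59.8, 60.4, 57.3, 62.7]

# Estimator: the sample mean. Estimate: its value for this sample.
mean_weight_estimate = statistics.mean(weights_kg)

# Hypothetical inspection of 20 bolts: True marks a defective bolt.
is_defective = [False] * 18 + [True] * 2

# Estimator: the sample proportion. Estimate: its value for this sample.
defective_proportion_estimate = sum(is_defective) / len(is_defective)

print(f"point estimate of the mean weight: {mean_weight_estimate:.2f} kg")
print(f"point estimate of the defective proportion: {defective_proportion_estimate:.2f}")
```

A different sample from the same population would usually give a slightly different estimate, which is why the reliability of an estimate matters.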
Unbiased Estimators
Suppose we have two measuring instruments; one instrument has been accurately calibrated, but the other systematically gives readings smaller than the true value being measured. When each instrument is used repeatedly on the same object, because of measurement error, the observed measurements will not be identical. However, the measurements produced by the first instrument will be distributed about the true value in such a way that on average this instrument measures what it purports to measure, so it is called an unbiased instrument. The second instrument yields observations that have a systematic error component or bias.
Note: A Point Estimator θ̂ is said to be an unbiased estimator of θ if E(θ̂) = θ. If θ̂ is not unbiased, the difference E(θ̂) − θ is called the bias of θ̂.
Point Estimates and Interval Estimates
An estimate of a population parameter given by a single number is called a point estimate of the parameter. An estimate of a population parameter given by two numbers between which the parameter may be considered to lie is called an interval estimate of the parameter.
Note: A statement of the error or precision of an estimate is often called its reliability.
Lesson 4: Introduction to Statistics
Statistics - is defined as a branch of mathematics that deals with the collection, analysis, interpretation, and presentation of masses of numerical data.
What is the use of statistics?
(1) Statistics helps in providing a better understanding and exact description of a phenomenon of nature.
(2) Statistics helps in the proper and efficient planning of a statistical inquiry in any field of study.
(3) Statistics helps in collecting appropriate quantitative data.
(4) Statistics helps in presenting complex data in a suitable tabular, diagrammatic, and graphic form for easy and clear comprehension of the data.
(5) Statistics helps in understanding the nature and pattern of variability of a phenomenon through quantitative observations.
(6) Statistics helps in drawing valid inferences, along with a measure of their reliability, about the population parameters from the sample data.
Descriptive statistics - is the term given to the analysis of data that helps describe, show, or summarize data in a meaningful way such that, for example, patterns might emerge from the data. Descriptive statistics is at the heart of all quantitative analysis. Descriptive statistics do not, however, allow us to make conclusions beyond the data we have analyzed or reach conclusions regarding any hypotheses we might have made. They are simply a way to describe our data.
Typically, there are two general types of statistic that are used to describe data: Measures of Central Tendency and Measures of Variability.
Measures of Central Tendency - use a single value to describe the center of a data set. The mean, median, and mode are the three measures of central tendency.
1. Mean - is the arithmetic average, calculated by finding the sum of the study data and dividing it by the total number of data.
2. Median - is the middle value of the distribution. It is calculated by first listing the data in numerical order, then locating the value in the middle of the list.
Odd set of data - the middle value
Even set of data - the average of the two middle values
3. Mode - is the value that appears most frequently in the set of data.
Measures of Variation - indicate how spread out the study data is from a central value, i.e. the mean.
The following are the commonly used measures of variation:
1. Range - the difference between the maximum and minimum data.
2. Interquartile Range - quartiles divide the range of values into four parts, each containing one quarter of the values. The difference between Q3 and Q1 is called the Interquartile range. As in finding the median, it is necessary to list the values in numerical order. In case there are 2 values lying on Q1 or Q3, get the average.
3. Variance - in statistics is a measurement of the spread between numbers in a data set. That is, it measures how far each number in the set is from the mean and therefore from every other number in the set.
4. Standard Deviation - is defined as the square root of the variance.
NOTE:
Kurtosis - the sharpness of the peak of a frequency-distribution curve.
Skewness - the measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
Lesson 5: Curve Fitting, Regression, and Correlation
Curve Fitting
The general problem of finding equations of approximating curves that fit given sets of data is called curve fitting.
Underfitting (high bias, low variance) - too simple to explain the variance. If we have underfitted, this means that the model function does not have enough complexity (parameters) to fit the true function correctly.
Overfitting (low bias, high variance) - force-fitting, too good to be true. If we have overfitted, this means that we have too many parameters to be justified by the actual underlying data and have therefore built an overly complex model.
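Looking back at the measures of central tendency and variation from Lesson 4, the sketch below computes each of them with Python's standard `statistics` module. The data set is hypothetical. The quartiles are taken as the medians of the lower and upper halves of the sorted data, which is one common reading of the handout's rule and is an assumption here; other quartile conventions give slightly different values. Because the handout does not say whether the variance divides by n or by n − 1, both are shown.

```python
import statistics

data = [4, 8, 6, 5, 3, 8, 9, 7]  # hypothetical study data
ordered = sorted(data)           # [3, 4, 5, 6, 7, 8, 8, 9]
n = len(ordered)

# Measures of central tendency
mean = statistics.mean(ordered)
median = statistics.median(ordered)  # even count: average of the two middle values
mode = statistics.mode(ordered)

# Measures of variation
data_range = max(ordered) - min(ordered)

# Quartiles as medians of the lower and upper halves (assumed convention).
lower_half = ordered[: n // 2]
upper_half = ordered[(n + 1) // 2 :]
q1 = statistics.median(lower_half)
q3 = statistics.median(upper_half)
iqr = q3 - q1

population_variance = statistics.pvariance(ordered)  # divides by n
sample_variance = statistics.variance(ordered)       # divides by n - 1
population_std = statistics.pstdev(ordered)          # square root of the variance
sample_std = statistics.stdev(ordered)

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={data_range}, Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"variance (n)={population_variance}, variance (n-1)={sample_variance}")
print(f"std dev (n)={population_std:.4f}, std dev (n-1)={sample_std:.4f}")
```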
Regression - is a statistical method used to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables).
Simple Linear Regression
In statistics, simple linear regression is a linear regression model with a single explanatory variable.
That is, it concerns two-dimensional sample points with one independent variable and one dependent variable (conventionally, the x and y coordinates in a Cartesian coordinate system) and finds a linear function (a non-vertical straight line) that, as accurately as possible, predicts the dependent variable values as a function of the independent variable.
The simplest situation is a linear or straight-line relation between a single input and the response. Say the input and response are x and y, respectively. For this simple situation:
E(Y) = α + βx
Where: α and β are constant parameters that we want to estimate, called the regression coefficients. From a sample consisting of n pairs of data (x, y), we calculate estimates, a for α and b for β.
The equations below are called the least-squares equations or normal equations for estimating the coefficients, a and b, from the points (xi, yi):
Σy = an + bΣx
Σxy = aΣx + bΣx²
Referring to the equation y = a + bx, if we want the regression equation given by the points (xi, yi), these are the formulas you need to remember:
b = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²]
a = [Σy − bΣx] / n
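The least-squares line y = a + bx can be computed directly from the sums of x, y, xy, and x². The sketch below is one way to do it in plain Python. The function name `fit_line` and the data points are assumptions made for illustration, and the closed-form expressions in the comments are the standard solution of the normal equations.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)

    denominator = n * sum_xx - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are equal, so the line would be vertical")

    # Solution of the normal equations:
    #   sum_y  = a*n     + b*sum_x
    #   sum_xy = a*sum_x + b*sum_xx
    b = (n * sum_xy - sum_x * sum_y) / denominator
    a = (sum_y - b * sum_x) / n
    return a, b


# Hypothetical measurements that lie close to a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

a, b = fit_line(xs, ys)
predicted = [a + b * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

print(f"fitted line: y = {a:.2f} + {b:.2f}x")
print("sum of residuals (close to zero for a least-squares fit):", round(sum(residuals), 10))
```

For data that fall exactly on a line, such as the points (0, 1), (1, 3), (2, 5), (3, 7), the same function recovers a = 1 and b = 2.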
estimat...