
30280 APPLICATIONS FOR MANAGEMENT
Quantitative methods without tears

Please note that these lecture notes are available to BIEM students ONLY. They should not be distributed to other students at Bocconi or any other University.

Arnstein Aassve

BACHELOR IN INTERNATIONAL MANAGEMENT (BIEM)

Part 1: Doing applied research, data sources, sampling, STATA, data screening


Please read this first

These lecture notes are not complete. The first part (Part I) is fairly complete, as it covers what you are supposed to know about sampling, descriptive statistics and data sources. The part on regression analysis is incomplete. Here I have not taken the effort to go through all the material, as it is very well covered by our text book (and numerous other text books concerning linear regression). It does, however, give an example of linear regression applied to the Capital Asset Pricing Model (CAPM), which is an excellent way to introduce what linear regression is all about. The section on ANOVA is also incomplete (to put it mildly). Again, this is a topic that is covered in most text books. Here I recommend that you follow the lecture slides and supplement them with the material provided in your text book. The notes concerning factor analysis and cluster analysis are also incomplete, and here you should follow the lecture slides and the references given in the course outline.


1. What is applied research?

We start this part by discussing what applied research really is. We will of course revisit this issue throughout the course, but it is important to start by having a clear idea about what applied research entails. A fairly general way to define it is to argue that applied research is about "systematic enquiry in real world problems". A more specific definition would be to argue that applied research is "original investigation undertaken in order to acquire new knowledge but is directed primarily towards a specific, practical aim or objective" 1. Another definition is "Research directed toward a current need. The purpose of the research is to discover results that can be applied to the need" 2.

There are of course numerous examples of applied research. It might be that we are asked to help out in the design of modern hotels and guest services. Based on surveys, we could find out prospective guests' needs, preferences and attitudes. Perhaps we work in human resources and are asked to study employee satisfaction, on the grounds that satisfied workers are both more productive and tend to stay longer in the job. Perhaps we work in the financial sector and have been asked to analyse the main determinants of bank failure. In this case, we would have to collect information about banks that failed and compare them with those that did not. Typically, we would need to collect information about the characteristics of the banks as a means to estimate how these different characteristics tend to influence the likelihood of bank failure. Perhaps we are employed in marketing or advertising and we are asked to identify target groups for advertising campaigns, with the aim of increasing sales and hence profit. Such analysis is done on a grand scale: typically – again through surveys – we find out the characteristics of different television programs in order to decide which kind of advertising to show during each.

What these different examples have in common is that the research aim or research questions are fairly well defined. This is an important point. We always need to have a clear focus on what we want to find out. Without a clear research question, it becomes much harder to understand both which data sources to look for and which techniques to employ.

Statistical methods in applied research

In applied research, we use a range of statistical techniques. Whereas qualitative research is often employed, we focus here on quantitative research methods. This means that we want to use data and statistical techniques in order to answer our research question. Qualitative research refers to techniques such as in-depth interviews, where the sample size does not allow us to establish statistical relationships. A good example is expert interviews, where one asks experts as a means of improving our knowledge about a certain theme. In quantitative research, we use data, and typically we use what is known as multivariate statistics. We tend to distinguish between univariate, bivariate and multivariate statistics:

• Univariate statistics – summaries and inferences about a single [uni-] variable.

• Bivariate statistics – summaries and inferences about the relationships between two [bi-] variables.

• Multivariate statistics – summaries and inferences about the relationships between three or more [multiple] variables.

1 Australian Research Council, www.arc.gov.au/general/glossary.htm
2 Global Market Insite Inc., www.marketresearchterms.com/a.php

We also tend to distinguish two branches of statistics. These are:

• Descriptive statistics – summarise the data in hand, for example by computing the mean, median, range, variance and so on.

• Inferential statistics – draw conclusions about a population using data drawn from a sample of that population (see the short STATA illustration below).
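To make the distinction concrete, here is a minimal STATA sketch. It uses auto.dta, one of STATA's built-in example datasets, so the variable name (price) is simply taken from that file; any dataset of your own would do.

    * Load one of STATA's built-in example datasets
    sysuse auto, clear

    * Descriptive statistics: summarise the data in hand
    summarize price, detail

    * Inferential statistics: use the sample to estimate the population
    * mean of price, reported with a 95% confidence interval
    mean price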

As you might have guessed, we will focus on multivariate statistics in this course. The idea is to use multivariate statistics as a model of real world problems. In other words, we build these mathematical models in order to understand, explain, and predict events or phenomena. An inference is a conclusion based on what we know, and a statistical inference is a conclusion about the state of the world based on the statistical model that we have built (i.e. the model represents what we know). An important goal is of course to find the best possible model, while recognizing that it is a model and not the real world itself.

Let's consider a simple regression equation as an example:

Household expenditure = 0.6 * Household income

This is a simple regression equation where "household expenditure" is the dependent variable (more about this in Part II) and "household income" is an explanatory variable. Suppose the R-squared of this model is 0.35, which means that household income as an explanatory variable explains 35 percent of the variation in consumption expenditure. Whether this is a good model or not is debatable, and we can certainly include more explanatory variables in order to increase the R-squared. What the regression equation says is that as household income is increased by one unit, the level of household consumption increases by 0.6 units. This is very useful information, but at the same time we should also be aware that there is a lot of variation in household consumption that we are not able to explain. That is not surprising, because individuals might vary a lot in the way they decide to consume, and it does not all depend on their income. As we said, we can improve the model by including another explanatory variable – for instance the price level:

Household expenditure = 0.6 * Household income + 0.3 * Prices

As we add more explanatory variables, we are able to explain more of the variation in the dependent variable. Now the R-squared is 0.49, which means that we explain 49 percent of the total variation in the dependent variable. The model is better, but still there is a lot of variation that we are not able to explain. It is a model after all.
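As a small illustration of how such a regression could be estimated in STATA, the sketch below simulates a toy household dataset and fits the two equations above. All variable names, coefficients and the sample size are invented purely for illustration; the estimates and R-squared you obtain will depend on the simulated noise.

    * Simulate a toy household dataset (all numbers are illustrative)
    clear
    set seed 12345
    set obs 500
    generate income = rnormal(100, 20)      // household income
    generate prices = rnormal(50, 10)       // price level faced by the household
    generate expenditure = 0.6*income + 0.3*prices + rnormal(0, 15)

    * Simple regression: expenditure on income only
    regress expenditure income

    * Adding a second explanatory variable typically raises the R-squared
    regress expenditure income prices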

There are two basic categories of bivariate and multivariate statistical models, and it is of utmost importance that you understand the difference between the two. They are:

• Dependency (or "causal") models, which are concerned with causality, i.e. models that estimate the effects of independent variables, interventions, or time on dependent variables.

• Interdependency (or interrelationship) models, which are concerned with structure (relationships among variables or groups).

It is very important that we understand the difference between the two approaches. The term causality means that we are estimating how a change in one variable leads to a change in another variable. Regression analysis is an example of a dependency model, because here we estimate the effect of a change in an explanatory variable on the change in the dependent variable. In the example above, we estimated that a one-unit change in income had the effect of changing consumption expenditure by 0.6. We can make this statement as long as we are sure that all the assumptions underlying the linear regression technique are valid (more about this in Part II).

In order to understand better what causality means, it is useful to remind ourselves how applied research is done in the medical and life sciences. Suppose we are running a medical laboratory and we have invented a new drug that we believe increases longevity. In other words, we believe that if one takes this drug one ends up living longer. However, we are not sure, of course, so we need to devise an experiment where we test the new drug. One way to approach this is to do the experiment on mice. We inject the drug in 10 mice, for instance, whereas another 10 mice are not given the drug. Typically, we refer to the first group as the treatment group and the second group (i.e. those mice not given the drug) as the control group. We monitor the two groups of mice and record how long they live. If the drug works, then we should expect the treatment group to live longer.

In medical science, we also see such experiments done on humans. Again, we split patients into different groups. In contrast to the case with mice, we might want to devise three groups. Suppose we are again talking about introducing a new drug – this time a drug that is supposed to cure cancer patients. We draw a sample of cancer patients where the first group is not given the drug (the control group), the second group is given the drug (the treatment group) and the third group is given a placebo – that is, they are given a substance that is not actually the drug. It is necessary in these experiments to ensure that the patients do not actually know whether they were given the drug or not. Why is this so? The key idea behind an experiment is to isolate the effect of the drug. That is, as we observe the extent to which our cancer patients are recovering, we need to be sure how much of this is due to the new drug they have been administered (and nothing else). Again, over time we compare the treatment group, the control group and the group given the placebo, to see if there are significant improvements in recovery time among the treatment group. If there are not, then we need to conclude that our new drug does not actually work. But if they do recover more quickly, we can make the statement that taking the drug leads to a quicker recovery. The statistical technique we would employ in order to establish whether the drug works is called Analysis of Variance (ANOVA). ANOVA will establish to what extent the average recovery time for the treatment group is lower than the recovery time for the control group. We will introduce the ANOVA technique later in this course.
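A sketch of how such a comparison could be run in STATA is given below, using simulated data for a hypothetical two-group drug trial (treatment versus control only, for simplicity). The group sizes, recovery times and treatment effect are all made up for illustration; we return to ANOVA properly later in the course.

    * Simulated recovery times for a hypothetical treatment/control trial
    clear
    set seed 2024
    set obs 60
    generate treated = _n > 30                          // first 30 = control, last 30 = treatment
    generate recovery = 20 - 4*treated + rnormal(0, 3)  // assume the drug shortens recovery time

    * One-way analysis of variance: do mean recovery times differ across the groups?
    oneway recovery treated, tabulate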
A lot of management research seeks to discover the effect of an intervention on an outcome. In this case, the intervention can be thought of as the drug, and we are interested in whether the intervention had an impact on the outcome of interest. Perhaps the intervention is a new incentive scheme offered to workers, and the outcome variable is whether productivity has increased. Again, we would want to define two groups of workers, one that is offered the incentive scheme and the other not. After some time has passed we compare the productivity of the two groups, and check whether the group receiving the incentive has improved in terms of productivity relative to the group that did not. As we can see, there is an important time aspect here: we want to know the productivity of both groups before the intervention, and then their productivity some time after the intervention.

The problem in applied research in the social sciences is that we cannot easily subject humans to experiments. Certainly, we cannot put them into a laboratory and inject them with various drugs. This has clear ethical problems, nor is it at all clear that you would find willing subjects for the study. Instead, in the social sciences (including economics, management, sociology and finance) we need to rely on surveys and multivariate statistical techniques. It is here that regression analysis becomes so handy. Let's take the example of the imagined incentive scheme mentioned above. We can devise a survey where we collect information about workers' productivity, and we can also record whether they were exposed to the incentive scheme or not. However, there might be many other factors (other than the incentive scheme itself) that influence productivity. Multivariate techniques, and in particular regression analysis, are very useful in controlling for these other factors. So, for instance, we take our survey and include variables that may influence productivity and then assess whether – in addition – the intervention had a separate impact on productivity.
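The sketch below shows what such a regression might look like in STATA. The data are again simulated, and the control variables (years of experience and a training indicator) are hypothetical stand-ins for "other factors that influence productivity".

    * Simulated worker data (illustrative only)
    clear
    set seed 321
    set obs 400
    generate incentive = runiform() < 0.5   // 1 if exposed to the incentive scheme
    generate experience = 20*runiform()     // years of experience (a control variable)
    generate training = runiform() < 0.3    // 1 if the worker attended training (another control)
    generate productivity = 50 + 5*incentive + 0.8*experience + 3*training + rnormal(0, 6)

    * The coefficient on incentive estimates the separate impact of the scheme,
    * holding the other factors constant
    regress productivity incentive experience training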
Another example of this kind of analysis concerns government policy. Suppose we are interested in understanding whether a change in social security affects poverty rates. The approach we could take is to survey individuals, some of whom were exposed to the new policy and others not. We then compare the poverty rates of the two groups and check whether poverty has been reduced for the group exposed to the policy. Again, the problem is that there are many factors influencing whether an individual ends up poor or not. We know that those with higher education, those in full-time employment and those with fewer children are less likely to be poor – independent of whether they were exposed to the policy or not. The solution would be to run a regression where poverty is the dependent variable and education and employment status are explanatory variables – together with an indicator for whether the individual was exposed to the policy or not. It is the significance of the coefficient associated with this indicator that tells us whether the policy succeeded in reducing poverty or not.

So far, we have discussed dependency models. The other group is termed interdependency models. The simplest form of an interdependency model is the correlation coefficient between two variables. Two variables might be positively correlated, negatively correlated or not correlated at all (in which case the correlation coefficient is zero). However, the important point to notice here is that with a positive correlation we do not know to what extent a higher value of one variable leads to a higher value of the other. In other words, correlations do not reveal causality, no matter how you twist and turn it. It is also important to be aware that even if we employ regression techniques, we do not necessarily get causal effects. Consider the following example. Suppose you have information about city murder rates and the size of the police force. The city murder rate is the dependent variable and you want to explain to what extent the size of the police force might reduce the murder rate. Thus, you would expect a negative relationship, in the sense that the larger the police force, the fewer murders would take place. This makes intuitive sense, since a larger police force is able to monitor and prevent crime better than a smaller one. The problem here is that the size of the police force might itself be a response to high murder rates. In other words, if the murder rate is high, local politicians may allocate funds to increase the police force. You should realize that I have turned the causality statement on its head: now what I am saying is that the police force is the variable to be explained, and it depends on the murder rate. In terms of linear regression, the police force is the dependent variable and the murder rate is the explanatory variable. This is indeed a tricky problem, and in this case the relationship you estimate is at best the correlation between the two variables 3. There are more complex interdependency models than the simple correlation coefficient – and we are going to deal with them in great detail in this course.

3 The reason we cannot rely on linear regression to provide a measure of causality in this example is that one of the fundamental assumptions underlying linear regression is violated. We will discuss this in much more detail in Part II.

The first interdependency technique we will encounter is what is termed factor analysis. Here we are looking for correlation among variables, with the aim of constructing a new variable. For instance, suppose we are interested in creating a measure of trust (i.e. a new variable that can be used in regression analysis – either as a dependent variable or an explanatory variable). The important issue here is that we typically do not observe a variable that directly measures an individual's level of trust. Instead, we have variables derived from questions posed in surveys, such as "To what extent do you trust politicians?", "To what extent do you trust other people?" and so on. Factor analysis is a way to combine such variables into one new variable (or index). Needless to say, we only want to combine variables that are correlated, as there is little point in creating a new index that consists of very different variables. Importantly though, when performing factor analysis we do not care about causality – correlation is what matters.
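As a preview of Part III, the sketch below simulates three trust-related survey items that all reflect the same underlying attitude and combines them into a single index with STATA's factor command. The variable names and the way the items are generated are purely illustrative.

    * Simulate three correlated survey items driven by one underlying "trust" attitude
    clear
    set seed 99
    set obs 1000
    generate latent = rnormal()
    generate trust_politicians  = latent + rnormal(0, 0.5)
    generate trust_people       = latent + rnormal(0, 0.5)
    generate trust_institutions = latent + rnormal(0, 0.5)

    * Factor analysis: look for the common factor behind the three items...
    factor trust_politicians trust_people trust_institutions, pcf

    * ...and save the factor scores as a new variable (index) for later use
    predict trust_index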

Another important type of interdependency model is cluster analysis. Instead of looking for correlations between variables, we are here looking for correlations between observations. For instance, suppose we have information about country-level characteristics for European countries – for instance GDP per capita, the Human Development Index (HDI), life expectancy and the average level of schooling. The idea of cluster analysis is to use such information to see whether different countries cluster into distinct groups – here, groups of countries at distinct levels of development. Again, we are not saying anything about causality here – we are simply interested in seeing whether there are groups whose members correlate strongly with each other, while the groups themselves are not correlated with one another. We deal with factor analysis in Part III and cluster analysis in Part IV.
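And as a preview of Part IV, here is a minimal STATA sketch of k-means clustering on country-level indicators. The variables, their simulated values and the choice of three groups are assumptions made only for this example; with real data you would think carefully about which indicators to use and how many clusters to request.

    * Toy country-level data (illustrative values only)
    clear
    set seed 7
    set obs 30
    generate gdp_pc    = 10000 + 40000*runiform()   // GDP per capita
    generate hdi       = 0.6 + 0.4*runiform()       // Human Development Index
    generate life_exp  = 70 + 15*runiform()         // life expectancy
    generate schooling = 8 + 8*runiform()           // average years of schooling

    * Standardise the variables so no single one dominates the distance measure
    foreach v of varlist gdp_pc hdi life_exp schooling {
        egen z_`v' = std(`v')
    }

    * k-means clustering into three groups of "similar" countries
    cluster kmeans z_gdp_pc z_hdi z_life_exp z_schooling, k(3) name(devgroup)
    tabulate devgroup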

What is a variable?

So far, we have talked about dependent and independent variables. It is worthwhile reminding ourselves what a variable is, and recognizing that variables come in many forms and shapes that will have an impact on how we set up our statistical model of the world. In the most general terms, a variable is a measure of a phenomenon that varies or changes its values. Take daily temperature in Milano – this is a random variable, as we do not know which value the temperature will take tomorrow. Or – will the returns on the stock market go up or down tomorrow? In statistical terms we are dealing with random variables, and as you remember from your introductory course in statistics, random variables are very tricky to handle unless you know their distribution function (if you have forgotten these basic concepts, return to your statistics notes immediately). When I look around, many other things in society vary. Gender varies in the population (male or female), university grades vary among students, household income varies across households, and so on. When we are dealing with statistical models, such as linear regression, these variables are indeed random variables. In linear regression, however, we distinguish between different types of random variables. They are:

• Dependent variable (denoted Y) – a variable that is to be predicted or that changes in response to an intervention. In other words, the value of this variable depends on the values of other variables.

• Independent variable (denoted X) – a variable that is used to predict values of another variable, that records experi...

