Final Exam Responses PDF

Title Final Exam Responses
Course Data, Insights and Decisions
Institution University of New South Wales



Description

Question 1 Part A g_40: Based on the summary statistics, it can be determined that approximately 50.4% of the 90,198 players involved in the A/B test were assigned to the treatment group. Games: Based on the summary statistics, our initial analysis shows that the average number of games played per player is 51.9. This figure, however, may not be accurate, as outliers such as the maximum number of games (49,854) may skew the data and inflate the mean. This is supported by the extremely high standard deviation of 195.1 relative to the mean. Furthermore, the dataviz shows the distribution is highly positively skewed, with the majority of the data lying to the left of the graph. This reaffirms our previous finding, as the left-most column of the graph shows that approximately 55% of players play only 0-5 games within the first 14 days.
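The mean-versus-median diagnosis of skew described above can be sketched on simulated data. The lognormal parameters below are invented purely for illustration; the real Cookie Cats dataset is not reproduced here.

```python
import random
import statistics

random.seed(0)
# Simulated "games played" variable: a lognormal draw gives the heavy
# right skew described in the answer (most players near zero, a long tail)
games = [random.lognormvariate(1.5, 1.8) for _ in range(10_000)]

mean_games = statistics.mean(games)
median_games = statistics.median(games)
sd_games = statistics.stdev(games)

# Under strong positive skew the mean sits well above the median, and
# the SD is large relative to the mean (as in the reported summary stats)
print(mean_games > median_games, sd_games > mean_games)
```

The comparison of mean against median is the standard quick check: when the two diverge sharply and the SD dwarfs the mean, the mean is being dragged up by outliers, exactly the concern raised above.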

Ret_1: The initial analysis suggests that only approximately 44.5% of players play Cookie Cats within one day of installation.

Part B Overall, there is not enough information to test whether random assignment has been successful, as there may be factors other than the gate level which affect whether users play the game within one day of installation or the number of games played in 14 days.

To check the success of random assignment, Cookie Cats can add covariates to their data analysis. For example, numerical variables such as the number of daily hours spent on the phone or the number of other games installed can be considered. Moreover, categorical variables such as the type of apps users spend most time on can also be a good indicator for Cookie Cats to identify which groups of people are most likely to open the game within the first day and which users may have higher player activity.
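A balance check on such a covariate can be sketched as follows. The covariate (daily phone hours) and the assignment mechanism here are invented; under successful randomisation the treatment/control difference in the covariate mean should be close to zero.

```python
import random
import statistics

random.seed(1)
n = 10_000
treat = [random.randint(0, 1) for _ in range(n)]    # hypothetical g_40 assignment
hours = [random.gauss(3.0, 1.0) for _ in range(n)]  # hypothetical covariate

t_vals = [h for g, h in zip(treat, hours) if g == 1]
c_vals = [h for g, h in zip(treat, hours) if g == 0]

diff = statistics.mean(t_vals) - statistics.mean(c_vals)
se = (statistics.variance(t_vals) / len(t_vals)
      + statistics.variance(c_vals) / len(c_vals)) ** 0.5

# A difference within ~2 standard errors of zero is consistent with balance
print(round(diff, 3), abs(diff) < 1.96 * se)
```

Running this check for each available covariate is one concrete way to operationalise the suggestion above.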

Part C The estimated difference in average ret_1 when g_40 = 1 is -0.0059, which initially indicates that using gate 40 instead of gate 30 may reduce the proportion of people playing the game on their first day by 0.59 percentage points. However, upon generating the 95% CI for g_40 (-0.0059 ± 1.96 × 0.0033), the interval produced is [-0.012368, 0.000568]. Since this interval includes 0, we cannot reject the null hypothesis of no treatment effect, as the estimate is not precise. We can also consider the R² = 0.000 value, which indicates that g_40 explains essentially none of the variation in ret_1.
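The confidence-interval arithmetic above can be verified directly. The coefficient is taken from the regression output quoted in the answer; the standard error is the value implied by the reported interval endpoints.

```python
# 95% CI for the g_40 coefficient: estimate ± 1.96 × standard error
coef = -0.0059
se = 0.0033

lower = coef - 1.96 * se
upper = coef + 1.96 * se

# The null of no treatment effect cannot be rejected when the CI spans 0
contains_zero = lower <= 0 <= upper
print(round(lower, 6), round(upper, 6), contains_zero)
```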

Part D The 95% CI for the variable g_40 in the first version of the regression is calculated as follows: -1.16 ± 1.96 × 1.30. The interval generated is [-3.708, 1.388]. Since this CI includes 0, the estimated effect of g_40 = 1 is imprecise and therefore the null hypothesis of no treatment effect stands.

The second regression model may be more effective than the first in testing whether there is a treatment effect, since reducing the maximum number of games from 49,854 (as given in the summary stats) to 3,000 will reduce the effect of some outliers on the regression. However, capping the maximum at 3,000 overall makes little difference to the findings: even though 3,000 is significantly smaller than the previous maximum, it is still considerably larger than the mean of around 51 games ascertained from the dataviz. Therefore, it is still not a sensible regression.

The 95% CI for g_40 in the second regression model is calculated as follows: -0.046 ± 1.96 × 0.68. The interval generated is [-1.3788, 1.2868]. Since the CI includes 0, the null hypothesis of no treatment effect still stands despite the new model.
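The capping step behind the second model can be sketched on simulated data. The sample below is invented, with one extreme value injected to mimic the reported maximum of 49,854.

```python
import random
import statistics

random.seed(2)
games = [random.lognormvariate(1.5, 1.8) for _ in range(5_000)]
games[0] = 49_854.0  # inject an extreme outlier like the reported maximum

# Cap (truncate) the variable at 3,000 before re-running the regression
capped = [min(g, 3_000.0) for g in games]

# Capping pulls the mean back toward the bulk of the distribution, but the
# cap is still far above the typical player (~51 games on average)
print(max(capped) <= 3_000, statistics.mean(capped) < statistics.mean(games))
```

This illustrates the point made above: the cap removes the worst outliers, yet 3,000 remains so far from the centre of the data that the model is still dominated by the tail.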

Part E In summary, using the data and the summary statistics produced, the null hypothesis of no treatment effect on both key variables, ret_1 and games, still stands. This is because the generated coefficient for g_40 cannot be trusted to establish a relationship with either of the two key output variables. Moreover, an R² value of 0.000 further indicates that there is no correlation between the placement of the entry-level gate and the two key variables of concern. In answering the primary question, the given data and regression models suggest that moving the entry-level gate from level 30 to level 40 does not have an impact. However, a causal relationship with in-game purchases can neither be confirmed nor denied, as there is no variable in the data recording the number of in-game purchases. Moreover, we would need covariates to omit the effect of OVB and confounding issues if we were to confirm or deny a causal relationship.

Question 2 Part A I would use a data-driven, exploratory chart to address the manager's queries. Firstly, the chart will be data-driven because it will use the collected data, as seen in the dataviz used in the question, to find quantifiable differences and trends. Moreover, the chart will be exploratory, as the manager is trying to understand how trends have changed across different time periods rather than simply viewing them.

Part B Graph 1: 100% Stacked Column Chart

Graph 2: Line Chart

By aggregating the data to annual figures, I simplified the data according to the preferences of the manager. Secondly, I left-aligned the title and subtitle as this is naturally easier to interpret. I put the legend on top of the graph to make a direct link between the legend and the coloured columns/lines. Next, I chose a dark background with white font of an appropriate proportion (the title and subtitle together taking up approximately 20% of the page) to provide a good ratio for viewers. The colour palette is not overtly bright; instead it is a lighter, more neutral palette which can be easily distinguished and is easier on the eyes.

I chose the column chart as it accurately reflects the manager's main query concerning the proportional change in the customer service mix. Having only three columns as well as gridlines makes the chart much simpler to interpret, and the gridlines help viewers approximate the percentage of each mix. I also chose to omit axis titles in the column chart, as the use of percentages and year labels is obvious enough not to require further elements.

The line chart is also simplified to reflect the number of calls per year. The lines do not follow a single colour scheme and avoid saturated colours; instead, each line has a distinct colour to signify a different service channel.
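The data preparation behind a 100% stacked column chart can be sketched as follows: per-channel annual counts are converted to within-year percentage shares. The channel names and counts below are invented for illustration.

```python
# Hypothetical annual enquiry counts per customer-service channel
channels = ["Phone", "Email", "Chat", "Virtual agent"]
counts_by_year = {
    2019: [5200, 2100, 1300, 400],
    2020: [4100, 2300, 1900, 1700],
}

# Convert each year's counts to percentage shares of that year's total;
# these shares are what the 100% stacked columns encode
shares_by_year = {
    year: [round(100 * c / sum(counts), 1) for c in counts]
    for year, counts in counts_by_year.items()
}

print(shares_by_year)
```

Plotting these shares (one column per year, segments stacked to 100%) produces the proportional view the answer argues for: in this invented data, the phone share falls between years even though its raw count change alone would not make that obvious.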

Part C Personally, I would select the 100% stacked column chart to present to my manager, as it allows the manager to identify the changing composition of service channels on a proportional level more accurately than the line chart. E.g. in Graph 1, the declining percentage of phone calls is very easily comprehended and can immediately be compared with the increases/decreases in the other service channels above it.

Since the manager is more interested in determining the composition rather than simply identifying trends, I would further develop the column chart by presenting two versions of it. After presenting Graph 1, I would present another stacked column chart without proportions. This would allow the manager to see how changes in the customer service mix may also have been impacting the total number of enquiries. From this, the manager may decide to investigate trends. In that case, I would isolate the two variables of interest (i.e. phone calls and virtual agents) and stack them against each other, perhaps combining the two other channels into one variable and giving it a very neutral colour to signify non-interest. Furthermore, it may be appropriate to include a predictive column for the 2021 customer service mix to explore future possibilities regarding change.

Question 3 1) The predictive models that we used were highly effective in providing valuable insights into how different factors and variables play a role in predicting a key variable of concern. In the case of the team assessment, the use of a logistic regression and a classification tree was invaluable in allowing us to substantiate the answers required by our hypothetical firm to predict employee attrition. This highlights the great extent to which these types of predictive models can be utilised outside a university setting and in the business environment to explain or predict likely future outcomes for things such as employee attrition, which can have a significant impact on an organisation's productivity and even overall success. In the team assessment, one of the challenges we faced in creating the predictive models was determining which variables to include in the logistic regression, as including all of them would have decreased the viability and accuracy of our final predictions. This prompted us to seek further advice, and we determined that it would be viable to include only variables which were statistically significant (having low p-values). This challenge developed further when we tried to decide which variables to use for the classification tree, as some variables, even when significant, had disproportionate impacts on our findings. In the end, using confusion matrices helped us choose which models were most suited to be presented.
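The model-comparison step above can be sketched with a minimal confusion matrix. The labels and predictions below are invented; this only illustrates the technique, not the team's actual results.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

# 1 = employee left (attrition), 0 = stayed; invented example data
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(tp, tn, fp, fn, accuracy)
```

Computing these counts for each candidate model and comparing the resulting accuracies (and error types) is how a confusion matrix supports the model-selection decision described above.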

2) One of the key organisational benefits of using predictive models, especially ones which contain multiple variables of interest, is that they allow organisations to take a holistic approach towards implementing strategies which encompass multiple factors when their aim is to reach a key finding. E.g. using both the logistic regression and classification tree models, we were able to suggest strategies based on findings from both, such as the high impact of Overtime on attrition, which was identified in both models. Furthermore, predictive models which are complex and include many variables can help organisations isolate different negative factors of interest with reasonable certainty, as the covariates can mitigate confounding issues. On the other hand, due to the complexity of predictive models, it can often be difficult for a data analyst to identify which variables are of the greatest interest, as variables can have a disproportionate impact. E.g. when creating our models, we found that MonthlyIncome had a disproportionately large impact on attrition, and this created confusion within our team when creating recommendations. Finally, the viability of predictive models is limited by the data, and skews in the data can create problems. E.g. in the team report dataset, both our predictive models had high accuracies according to the confusion matrices when attrition was negative, but very low accuracies when it was positive. This indicates that predictive models should not be used on their own; other, more human solutions should also be implemented.
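The class-imbalance problem described above can be made concrete with a small sketch. The labels are invented: with 10% attrition, a naive model that always predicts "stayed" looks accurate overall while catching no attrition at all.

```python
# Invented labels: 10 leavers (1) among 100 employees
y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100  # naive model: always predict "stayed"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the positive (attrition) class: fraction of leavers caught
positives = [p for t, p in zip(y_true, y_pred) if t == 1]
positive_recall = sum(positives) / len(positives)

# High overall accuracy, zero recall on the class that actually matters
print(accuracy, positive_recall)
```

This is why per-class metrics from the confusion matrix, not overall accuracy alone, should drive conclusions when the outcome of interest is rare.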

Question 4 Part A The first risk associated with using K-W Vision to inform decision-making is that the business must ensure that the data it receives and implements into its dashboards is accurate. If the data for any of the key factors (production levels, channel price, product formulation, and promotional spending) is inaccurate, it may cause Kelsey-White to overestimate or underestimate its figures. For example, if the data Kelsey-White receives from its channels across the USA is inaccurate, this may lead to incorrect projections of sales in following years, negatively impacting decisions regarding production. This can have significant impacts on the company's profits. The second risk of using K-W Vision would be potential lapses in internal communication. A potential consequence of not having the data and the dashboard available and communicated to all levels of the business is inefficiencies and inconsistencies when different sections of the business use the data to facilitate decisions, whether at different levels of marketing or finance, or when determining how much Blue to produce. The final risk would be if K-W Vision is not an efficient dashboard. If K-W Vision is optimised to benefit external stakeholders such as shareholders and government agencies, it might not meet the needs of specific internal sections of the business. This may create inefficiencies in decision-making, as managers and executives may not be able to access the information they need.

Part B To mitigate the first risk of data inaccuracy, Kelsey-White can implement internal controls on data analysts and can perhaps employ risk managers to ensure that the data being collected is accurate. This role will involve frequent communication with all the individual sources of data.

To avoid the second risk of lapses in communication, Kelsey-White will need to ensure that the dashboard is made available to all key decision-makers in the business. This will allow the same information to be made consistently available to all parts of the business, including finance, operations, marketing and HR. Since all departments will have access to the same dashboard and the same information, managers of each department will be able to monitor and be aware of key changes and adjust their decisions accordingly.

Finally, to mitigate the risk of inefficiencies in the dashboard, Kelsey-White could invest in creating a separate dashboard aimed at external stakeholders, differentiated from the internal dashboard (K-W Vision). This will allow the company to meet the needs of all stakeholders and prevent inefficiency.

The risk which should be given the highest priority is ensuring that the data, graphs and reports are accurate (i.e. Risk 1), as this will have the most direct impact on the decision-making of each department of the business.

