Resources page - Museum Victoria
Following on from the workshop, this webpage contains a number of resources that will be useful as you work on your data projects. There are three main resources:
- An overview of the Smarter Data model and a summary of the three types of problems.
- A video series working through the Smarter Data model, applying this to the Monty Hall problem and the three types of problems. In total there's about 20 minutes of video content.
- Links to Microsoft Excel plugins that were used in the workshop and other information.
Smarter Data Model
The Smarter Data model is a structured way to think about solving problems. The model is useful for working through novel problems as each step informs the next, including the implications of the first round informing the intentions of the second round.
Intentions - This is what you want to achieve. A good intention is clear and links to one of the three types of problems.
Inputs - This is the data set needed for the intention. This may either be data you already have access to, or data that you need to collect.
Interpretation - This is the statistical test or visualisation (graph) you'll run to analyse the data from the input. This will usually be either (i) calculating probability (ii) testing for differences or (iii) measuring a relationship.
Implications - This is the conclusion you draw for the business. The implications often require further analysis or investigation.
The video above is a quick overview of the four components of the Smarter Data model: Intentions, Inputs, Interpretation and Implications. The Smarter Data model works by looking at both the item before and the item after the current point of focus.
Monty Hall Problem
The Monty Hall problem is a probability puzzle and a brain teaser. It was loosely based on the American TV game show Let's Make a Deal and is named after its original host, Monty Hall.
This video shows how to approach the Monty Hall problem using the Smarter Data model. I start by framing the intention clearly and this leads to determining the input data required. I don't formally conduct an interpretation, but running the game a number of times shows a clear pattern emerging.
In the video below I discuss ways to solve the Monty Hall Problem without using the Smarter Data model.
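The idea of running the game many times and watching a pattern emerge can be sketched as a short simulation. This is a minimal sketch rather than the method used in the videos: the function names, the number of games and the random seed are all assumptions chosen for illustration, and the host is assumed to always open a door that hides no prize.

```python
import random


def play_monty_hall(switch, rng):
    """Play one game and return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # Assumption: the host always opens a door that is neither the
    # contestant's pick nor the door hiding the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one remaining unopened door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, games=10_000, seed=1):
    """Fraction of games won when always switching (or always staying)."""
    rng = random.Random(seed)
    wins = sum(play_monty_hall(switch, rng) for _ in range(games))
    return wins / games


if __name__ == "__main__":
    print("win rate when staying:  ", win_rate(switch=False))
    print("win rate when switching:", win_rate(switch=True))
```

Over many games the staying strategy should win roughly one time in three and the switching strategy roughly two times in three.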
Three Types of Problems
Broadly speaking there are three types of data problems. There are problems that involve counting the number of outcomes (probability). There are problems that involve making a decision between two different situations (differences). There are problems that involve how two or more variables relate to each other (relationships).
Probability - When you're trying to assess how likely an outcome is compared to an expectation, probability is usually the right approach. The classic probability problem involves drawing coloured balls from a bag, or the chance of rolling a certain set of faces on dice. To solve a probability problem you need to be able to calculate the chance of each possible outcome and ensure these add up to one.
The most useful application of probability is the binomial distribution. This is the distribution used when there are only two possible outcomes (a sale or no sale), the chance of success is constant (always a 20% chance of a sale), the number of trials is fixed (30 phone calls per day) and the result of one trial doesn't influence the next trial (just because one person buys doesn't mean the next person will).
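As an illustration of counting outcomes, the sketch below lists every way two six-sided dice can land and converts the counts into probabilities. The function name and the choice of two dice are assumptions made for the example; exact fractions are used so the check that the probabilities add up to one is not affected by rounding.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def dice_total_probabilities(num_dice=2, sides=6):
    """Probability of each possible total when rolling fair dice."""
    # Every equally likely combination of faces, e.g. (1, 1), (1, 2), ...
    rolls = list(product(range(1, sides + 1), repeat=num_dice))
    counts = Counter(sum(roll) for roll in rolls)
    return {total: Fraction(count, len(rolls))
            for total, count in sorted(counts.items())}


probabilities = dice_total_probabilities()

# The chances of all possible outcomes must add up to exactly one.
assert sum(probabilities.values()) == 1

for total, chance in probabilities.items():
    print(total, chance)
```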
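The sales example can be worked through with the binomial formula. The sketch below uses the 30 calls and 20% chance of a sale from the text; the function name and the question about "10 or more sales" are assumptions added for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Chance of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n = 30   # phone calls per day (fixed number of trials)
p = 0.2  # chance of a sale on each call (constant)

# Chance of 0, 1, 2, ... 30 sales in a day.
probabilities = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(probabilities)               # should be one, up to rounding
expected_sales = n * p                   # average sales per day
p_at_least_10 = sum(probabilities[10:])  # chance of 10 or more sales

print(f"total probability: {total:.6f}")
print(f"expected sales per day: {expected_sales:.1f}")
print(f"chance of 10 or more sales: {p_at_least_10:.4f}")
```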
Differences - When you have two (or more) possible approaches, testing to see if there is a difference is usually the right approach. Testing of differences asks whether the mean (average) of one data set is far enough away from the mean of another data set to state they are different. To run a test of differences you'll need to set a significance level for drawing this conclusion (alpha = 0.05 is the usual choice).
The most useful application of differences is the t-test. This is the test of whether the distribution of one average is different to the distribution of another average. To form the distributions of these averages, individual data is collected. These distributions are called sampling distributions (of the mean) and are different to distributions of individual data as they use the standard error instead of the standard deviation. The standard error is the standard deviation divided by the square root of the sample size (n). As a rule of thumb, aim for samples of 30 or more so the normality requirement is reasonably met; smaller samples can work when the underlying data is itself close to normal. An example of testing differences might be to assess whether a change in the layout of the shop results in increased sales.
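The shop layout example can be sketched in code. This is a rough illustration under several assumptions: the daily sales figures are made up, the function names are hypothetical, the t statistic is the unequal-variance (Welch) form, and the p-value comes from a normal approximation rather than the t distribution. A statistics package, such as the Excel add-in listed under Supplemental Resources, would give the proper t-based p-value.

```python
import math
import random
import statistics
from statistics import NormalDist


def standard_error(sample):
    """Standard deviation divided by the square root of the sample size."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


def welch_t(before, after):
    """t statistic for the difference in means (after minus before)."""
    se_diff = math.sqrt(standard_error(before) ** 2
                        + standard_error(after) ** 2)
    return (statistics.mean(after) - statistics.mean(before)) / se_diff


def approx_p_value(t):
    """Two-sided p-value using a normal approximation.

    Assumption: only reasonable for larger samples; a proper t-test
    uses the t distribution instead.
    """
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Made-up daily sales figures, for illustration only.
rng = random.Random(42)
old_layout = [rng.gauss(1000, 150) for _ in range(30)]
new_layout = [rng.gauss(1100, 150) for _ in range(30)]

alpha = 0.05
t = welch_t(old_layout, new_layout)
p = approx_p_value(t)

print(f"t = {t:.2f}, approximate p = {p:.4f}")
print("means differ" if p < alpha else "no clear difference")
```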
Relationships - When you have multiple measures on one item you can determine if there is a relationship. When assessing relationships you need to have a data set which links different measures across individuals, locations or time. A relationship doesn't necessarily imply causation, it simply shows correlation. Causation (A caused B) is surprisingly difficult to 'prove'.
The most useful application of relationships is (simple) linear regression. Simple linear regression is the measure of the strength of a linear (straight line) relationship between two variables or measures. This is usually plotted on a graph as a series of x-y points, with the line of best fit determined mathematically. There are a number of ways to interpret simple linear regression models. The two values of most interest are the r-squared value and the coefficient of the slope. The r-squared value tells you how much of the variation in y is explained by x. The coefficient of the slope tells you what the change in y will be for a one unit increase in x. E.g. you might develop a simple linear regression model based on the revenue of an event and post-event online sales. If there was a strong linear trend with r-squared = 0.8 then we could say 80% of the variation in online sales is explained by the revenue of the event. The other 20% could come from online advertising or other factors. If the coefficient of the slope was 0.3 then we could conclude that, on average, for each dollar spent at the event $0.30 will be spent on post-event online sales.
The video above is a short overview of the three types of problems that can be solved with statistics and identifies the key features and an example of each. The three types of problems are: (i) probability (ii) testing differences and (iii) relationships.
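The slope and r-squared values can be calculated directly from the data. The sketch below uses the standard least-squares formulas; the function name and the revenue figures are assumptions made up for illustration and are not workshop data.

```python
def fit_line(xs, ys):
    """Least-squares line of best fit: returns slope, intercept, r-squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared and cross deviations from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                     # change in y per one unit of x
    intercept = mean_y - slope * mean_x   # value of y when x is zero
    r_squared = sxy ** 2 / (sxx * syy)    # share of variation in y explained
    return slope, intercept, r_squared


# Made-up figures, for illustration only (dollars).
event_revenue = [1000, 2000, 3000, 4000, 5000]
online_sales = [400, 600, 1100, 1200, 1700]

slope, intercept, r_squared = fit_line(event_revenue, online_sales)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, "
      f"r-squared = {r_squared:.2f}")
```

The slope reads as dollars of post-event online sales per dollar of event revenue, and r-squared as the share of the variation in online sales that the line accounts for.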
Supplemental Resources
I highly recommend Real Statistics by Dr Charles Zaiontz. Real Statistics works like a fully fledged statistics package but inside Excel. The Resources page is where you download the add-in for your version of Excel, and there are a number of useful tutorials as well.
James Surowiecki wrote The Wisdom of Crowds and is a renowned TED speaker. His work utilises research and case studies.