# Introduction

Following on from the speaker session on Wednesday 16th November 2016, this webpage contains a series of resources that will be useful as you work on future data projects. The main resources are:

1. An analysis of the Jelly Bean problem and its implications for conducting analysis with crowds.

2. An overview of the Smarter Data model.

3. A video series working through the Smarter Data model with the Monty Hall Problem.

4. An overview of the three types of problems in statistics.

**Jelly Bean Problem**

The Jelly Bean problem shows the power of a group of independent non-experts in making estimates about things they do not know about.

This video shows how to approach the Jelly Bean problem through statistics. I start by showing how rare perfectly accurate guesses are, then show how, as the sample size increases, the accuracy of the average improves and outlier values disappear.

The outcome is that crowds can give reasonable approximations. If you are looking to make estimates, you don't need big data. You can construct the information from a crowd.
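The effect of crowd size on the accuracy of the average can be sketched with a small simulation. This is only an illustration under stated assumptions: the jar size (`TRUE_COUNT`), the spread of guesses (`spread`) and the choice of normally distributed, independent guesses are all hypothetical, not data from the session.

```python
import random
import statistics

TRUE_COUNT = 500  # hypothetical number of jelly beans in the jar


def crowd_average(crowd_size, rng, spread=0.3):
    """Average of independent guesses centred on the true count.

    Assumption: each guess is normally distributed around TRUE_COUNT with a
    standard deviation of spread * TRUE_COUNT. Negative guesses are clipped to 0.
    """
    guesses = [
        max(0.0, rng.gauss(TRUE_COUNT, spread * TRUE_COUNT))
        for _ in range(crowd_size)
    ]
    return statistics.fmean(guesses)


def typical_error(crowd_size, repeats=200, seed=1):
    """Mean absolute error of the crowd average over many repeated crowds."""
    rng = random.Random(seed)
    errors = [
        abs(crowd_average(crowd_size, rng) - TRUE_COUNT) for _ in range(repeats)
    ]
    return statistics.fmean(errors)


for size in (1, 10, 100, 1000):
    print(f"crowd of {size:>4}: typical error about {typical_error(size):.1f} beans")
```

Under these assumptions a single guess is often far off, while the average of a large crowd lands close to the true count. The simulation relies on the guesses being independent, which is the first caveat discussed below.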

There are a couple of issues in using crowds that you should be aware of, however:

- For the crowd average to be accurate, guesses must be independent. Otherwise, people tend to follow other people's guesses.
- Crowds are effective for measurement activities, yet cannot come up with specific knowledge. The solution is to have groups of people within a specialised division make an estimate of a given activity. From these teams a series of small data sets can be formed.
- Sometimes there is no prior data to base an estimate on. The solution is to break unfamiliar activities into specific activities which can be estimated by all members of a division or team and then form results.

These were your results:

**Smarter Data Model**

The Smarter Data model is a structured way to think about solving problems. The model is useful for working through novel problems as each step informs the next, including the implications of the first round informing the intentions of the second round.

**Intentions** - This is what you want to achieve. A good intention is clear and links to one of the three types of problems.

**Inputs** - This is the data set needed for the intention. This may either be data you already have access to, or data that you need to collect.

**Interpretation** - This is the statistical test or visualisation (graph) you'll run to analyse the data from the input. This will usually be either (i) calculating probability (ii) testing for differences or (iii) measuring a relationship.

**Implications** - This is the conclusion you draw for the business. The implications often require further analysis or investigation.

The video above is a quick overview of the four components of the Smarter Data model: Intention, Inputs, Interpretation and Implications. The Smarter Data model works by looking at both the item before and the item after the current point of focus.

**Smarter Data Example – The Monty Hall Problem**

The Monty Hall problem is a probability puzzle and a brain teaser. It was loosely based on the American TV game show Let's Make a Deal and is named after its original host, Monty Hall.

This video shows how to approach the Monty Hall problem using the Smarter Data model. I start by framing the intention clearly, and this leads to determining the input data required. I don't formally conduct an interpretation, but in running the game a number of times we see a clear pattern emerge.

A walkthrough of the Monty Hall Problem with the Smarter Data model:

**Intention** - We want to know whether to swap or stay. This means we are going to be calculating the probability of getting the car under each action, or the difference between swapping and staying. This intention influences our next step. We need to know what input is needed to answer our question.

**Input** - We are going to need to run both cases, stay and swap, to understand the possible outcomes of each action. I produce a table to show these options, and then run through all options.

**Interpretation** - We then interpret our input data from our table in the form of a graph.

**Implication** - The implication from our table is that we should swap. However, the Smarter Data model does not end at this point. We might want to test our analysis with a larger sample size, or have new questions that arise from our analysis. This feeds back into new intentions.
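The walkthrough above can be sketched in code. This is a minimal illustration rather than the method used in the video: `enumerate_outcomes` plays the role of the table of all options, and `simulate` stands in for testing with a larger sample size. The function names, door numbering and number of games are assumptions made for the example.

```python
import random
from fractions import Fraction

DOORS = (1, 2, 3)


def wins(car, pick, swap):
    """Return True if the contestant ends up with the car.

    The host always opens a goat door other than the contestant's pick,
    so swapping wins exactly when the first pick was wrong.
    """
    return (pick != car) if swap else (pick == car)


def enumerate_outcomes(swap):
    """Win probability over all 9 equally likely (car, pick) combinations."""
    results = [wins(car, pick, swap) for car in DOORS for pick in DOORS]
    return Fraction(sum(results), len(results))


def simulate(swap, games=10_000, seed=0):
    """Play the game many times, modelling the host's choice explicitly."""
    rng = random.Random(seed)
    won = 0
    for _ in range(games):
        car = rng.choice(DOORS)
        pick = rng.choice(DOORS)
        # The host opens a door that hides a goat and is not the pick.
        opened = rng.choice([d for d in DOORS if d != pick and d != car])
        if swap:
            # Move to the one door that is neither picked nor opened.
            pick = next(d for d in DOORS if d != pick and d != opened)
        won += pick == car
    return won / games


print("stay (table):", enumerate_outcomes(swap=False))
print("swap (table):", enumerate_outcomes(swap=True))
print("stay (simulated):", simulate(swap=False))
print("swap (simulated):", simulate(swap=True))
```

The table gives a 1/3 chance of winning by staying and 2/3 by swapping, and the simulated proportions should land close to those values.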

In the video below I discuss ways to solve the Monty Hall Problem without using the Smarter Data model.

**Three Types of Problems**

Broadly speaking there are three types of data problems. There are problems that involve counting the number of outcomes (probability). There are problems that involve making a decision between two different situations (differences). There are also problems that involve how two or more variables relate to each other (relationships).

**Probability** - When you're trying to assess how likely an outcome is compared to an expectation, probability is usually the right approach. The classic probability problem involves drawing coloured balls from a bag, or the chance of rolling a certain set of faces on dice. To solve a probability problem you need to be able to calculate the chance of each possible outcome and ensure these sum to one.

The most useful application of probability is the binomial distribution. This is the distribution used when there are only two possible outcomes (a sale or no sale), the chance of success is constant (always a 20% chance of a sale), the number of trials is fixed (30 phone calls per day), and the result of one trial doesn't influence the next trial (one person buying doesn't influence the next person).
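As a sketch, the phone call example (30 calls per day, a 20% chance of a sale on each) can be worked through with the binomial formula. The helper name `binomial_pmf` and the threshold of 10 sales are assumptions chosen for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 30, 0.20  # 30 phone calls per day, 20% chance of a sale on each

# Probability of every possible number of sales, from 0 to 30.
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

expected_sales = n * p                 # long-run average number of sales per day
most_likely = pmf.index(max(pmf))      # the single most likely number of sales
prob_at_least_10 = sum(pmf[10:])       # chance of an unusually good day

print(f"probabilities sum to {sum(pmf):.6f}")
print(f"expected sales per day: {expected_sales:.1f}")
print(f"most likely number of sales: {most_likely}")
print(f"chance of 10 or more sales: {prob_at_least_10:.3f}")
```

The check that the probabilities sum to one is the same check described above for any probability problem.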

**Differences** - When you have two (or more) possible approaches, testing to see if there is a difference is usually the right approach. Testing differences asks whether the mean (average) of one data set is far enough away from the mean of another data set to state they are different. To run a test of differences you'll need to set a significance level for drawing this conclusion (alpha = 0.05 is a common choice).

The most useful application of differences is the t-test. This tests whether the average of one data set is different from the average of another, by comparing the distributions of the two averages. These distributions are formed from the individual data collected. They are called sampling distributions (of the mean) and differ from distributions of individual data in that they use the standard error instead of the standard deviation. The standard error is the standard deviation divided by the square root of the sample size (n). As a rule of thumb, use samples of 30 or more so that the normality requirement is reasonably met. An example of testing differences might be to assess whether a change in the layout of the shop results in increased sales.
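The shop layout example can be sketched as follows. The daily sales figures are made up for illustration, and the p-value uses a normal approximation to the t distribution, which is only a rough stand-in that is reasonable for larger samples. A statistics package would give the exact t-based value. The example uses Welch's form of the t statistic, which does not assume equal variances. That choice is an assumption, not something specified above.

```python
import math
import statistics


def standard_error(sample):
    """Standard deviation divided by the square root of the sample size."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


def welch_t(sample_a, sample_b):
    """t statistic for the difference in means (Welch's version)."""
    diff = statistics.mean(sample_b) - statistics.mean(sample_a)
    se_diff = math.sqrt(standard_error(sample_a) ** 2 + standard_error(sample_b) ** 2)
    return diff / se_diff


def approx_two_sided_p(t):
    """Rough two-sided p-value using a normal approximation (larger samples only)."""
    return 2 * (1 - statistics.NormalDist().cdf(abs(t)))


# Hypothetical daily sales for 30 days under each shop layout (made-up numbers).
old_layout = [100 + (day % 7) * 5 for day in range(30)]
new_layout = [110 + (day % 7) * 5 for day in range(30)]

t = welch_t(old_layout, new_layout)
p = approx_two_sided_p(t)
alpha = 0.05

print(f"standard error (old layout): {standard_error(old_layout):.2f}")
print(f"t statistic: {t:.2f}, approximate p-value: {p:.4f}")
print("different" if p < alpha else "no detectable difference")
```

If the p-value falls below alpha, the difference in average sales is larger than sampling variation alone would usually produce.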

**Relationships** - When you have multiple measures on one item you can determine if there is a relationship. When assessing relationships you need to have a data set which links different measures across individuals, locations or time. A relationship doesn't necessarily imply causation; it simply shows correlation. Causation (A caused B) is surprisingly difficult to 'prove'.

The most useful application of relationships is (simple) linear regression. Simple linear regression measures the strength of a linear (straight line) relationship between two variables or measures. This is usually plotted on a graph as a series of x-y points, with the line of best fit determined mathematically. There are a number of ways to interpret simple linear regression models. The two values of most interest are the r-squared value and the coefficient of the slope. The r-squared value tells you how much of the variation in y is explained by x. The coefficient of the slope tells you what the change in y will be for a one unit increase in x. For example, you might develop a simple linear regression model based on the revenue of an event and post-event online sales. If there was a strong linear trend with r-squared = 0.8, then we could say that 80% of the variation in online sales is explained by the revenue of the event. The other 20% could come from online advertising or other factors. If the coefficient of the slope was 0.3, then we could conclude that on average for each dollar spent at the event, $0.30 will be spent on post-event online sales.

The video above is a short overview of the three types of problems that can be solved with statistics and identifies the key features and an example of each. The three types of problems are: (i) probability (ii) testing differences and (iii) relationships.
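A minimal sketch of how the slope and r-squared value are calculated, using the least squares formulas. The event revenue and online sales figures are invented for illustration, and `fit_line` is a hypothetical helper, not a function from any particular package.

```python
import statistics


def fit_line(x, y):
    """Least squares line of best fit: returns (slope, intercept, r_squared)."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy**2 / (sxx * syy)  # share of variation in y explained by x
    return slope, intercept, r_squared


# Hypothetical data: revenue at five events and the online sales that followed.
event_revenue = [1000, 2000, 3000, 4000, 5000]
online_sales = [350, 580, 950, 1150, 1520]

slope, intercept, r_squared = fit_line(event_revenue, online_sales)

print(f"slope: {slope:.2f} dollars of online sales per dollar of event revenue")
print(f"r-squared: {r_squared:.2f}")
```

The slope is read as the average change in online sales for each extra dollar of event revenue, and r-squared as the share of the variation in online sales that the line accounts for.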

**Supplemental Resources**

I highly recommend **Real Statistics** by Dr Charles Zaiontz. Real Statistics works like a fully fledged statistics package but inside Excel. The Resources page is where you download the add-in for your version of Excel, and there are a number of useful tutorials as well.

James Surowiecki wrote *The Wisdom of Crowds* and is a renowned TED speaker. His work utilises research and case studies.