POSTED ON 20 MAR 2020

READING TIME: 5 MINUTES

Hypothesis testing of proportion-based samples

author-image
Daniel Bray
Team Leader

Part 2 of our Introduction to Hypothesis Testing series.

 

In part one of this series, I introduced the concept of hypothesis testing, and described the different elements that go into using the various tests. It ended with a cheat-sheet to help you choose which test to use based on the kind of data you’re testing.

In this second post I will go into more detail on proportion-based samples.

If any of the terms Null Hypothesis, Alternative Hypothesis, p-value are new to you, I’d suggest reviewing the first part of this series before moving on.

 

What is a proportion-based sample?

In these cases we’re interested in checking proportions. For example 17% of a sample matches some profile, and the rest does not. This could be a test comparing a single sample against some expected value, or comparing two different samples.

Note: These tests are only valid when there are only two possible options; and if the probability of one option is p, then the probability of the other must be (1 - p).

 

Requirements for the quality of the sample

For these tests the following sampling rules are required:

RandomThe sample must be a random sample from the entire population
NormalThe sample must reflect the distribution of the underlying population. For these tests a good rule of thumb is that:
  • Given a sample size of n
  • Given a sample proportion of p
  • Then both np and n(1-p) must be at least 10
For example: if a sample finds that 80% of issues were resolved in 5 days, and 20% were not, then that sample must have at least 10 issues resolved within 5 days, and at least 10 issues resolved in more than 5 days.
IndependentThe sample must be independent - for these tests, a good rule of thumb is that the sample size is less than 10% of the total population.

 

Code Samples for Proportion-based Tests

Note that all of these code samples are available on Github. They use the popular statsmodels library to perform the tests.

1-sample z-test

Compare the proportion in a sample to an expected value

Here we have a sample and we want to see if some proportion of that sample is greater than/less than/different to some expected test value.

In this example:

  • We expect more than 80% of the tests to pass, so our null hypothesis is: 80% of the tests pass
  • Our alternative hypothesis is: more than 80% of the tests pass
  • We sampled 500 tests, and found 410 passed
  • We use a 1-sample z-test to check if the sample allows us to accept or reject the null hypothesis

To calculate the p-value in Python:

from statsmodels.stats.proportion import proportions\_ztest

# can we assume anything from our sample
significance = 0.05

# our sample - 82% are good
sample\_success = 410
sample\_size = 500

# our Ho is  80%
null\_hypothesis = 0.80

# check our sample against Ho for Ha > Ho
# for Ha < Ho use alternative='smaller'
# for Ha != Ho use alternative='two-sided'
stat, p\_value = proportions\_ztest(count=sample\_success, nobs=sample\_size, value=null\_hypothesis, alternative='larger')

# report
print('z\_stat: %0.3f, p\_value: %0.3f' % (stat, p\_value))

if p\_value > significance:
   print ("Fail to reject the null hypothesis - we have nothing else to say")
else:
   print ("Reject the null hypothesis - suggest the alternative hypothesis is true")

 

2-sample z-test

Compare the proportions between 2 samples

Here we have two samples, defined by a proportion, and we want to see if we can make an assertion about whether the overall proportions of one of the underlying populations is greater than / less than / different to the other.

In this example, we want to compare two different populations to see how their tests relate to each other:

  • We have two samples - A and B. Our null hypothesis is that the proportions from the two populations are the same
  • Our alternative hypothesis is that the proportions from the two populations are different
  • From one population we sampled 500 tests and found 410 passed
  • From the other population, we sampled 400 tests and found 379 passed
  • We use a 2-sample z-test to check if the sample allows us to accept or reject the null hypothesis

To calculate the p-value in Python:

from statsmodels.stats.proportion import proportions\_ztest
import numpy as np

# can we assume anything from our sample
significance = 0.025

# our samples - 82% are good in one, and ~79% are good in the other
# note - the samples do not need to be the same size
sample\_success\_a, sample\_size\_a = (410, 500)
sample\_success\_b, sample\_size\_b = (379, 400)

# check our sample against Ho for Ha != Ho
successes = np.array(\[sample\_success\_a, sample\_success\_b\])
samples = np.array(\[sample\_size\_, sample\_size\_b\])

# note, no need for a Ho value here - it's derived from the other parameters
stat, p\_value = proportions\_ztest(count=successes, nobs=samples,  alternative='two-sided')

# report
print('z\_stat: %0.3f, p\_value: %0.3f' % (stat, p\_value))

if p\_value > significance:
   print ("Fail to reject the null hypothesis - we have nothing else to say")
else:
   print ("Reject the null hypothesis - suggest the alternative hypothesis is true")

 

In the next post I will focus on hypothesis testing mean-based samples.

 


TECH LEADERS TRUST SONALAKE

We make software better every day

Get in touch

Copyright © 2023 Sonalake Limited, registered in Ireland No. 445927. All rights reserved.Privacy Policy

Nr Cert. 18611 ISO/IEC 27001
By clicking “Accept” you understand that Sonalake uses cookies to analyse site usage, understand our users, and assist in our marketing efforts. For more details, please see the Cookies section of our Privacy Policy.
Privacy Policy