https://www.coursera.org/learn/quantitative-methods/lecture/Dp5ip/2-01-empirical-cycle
In the first module we discussed how the scientific method developed, general philosophical approaches and the types of knowledge science aims to find. In this second module we'll make these abstract principles and concepts a little more concrete by discussing the empirical cycle and causality in more detail.
We’ll see how, and in what order these concepts are implemented when we conduct a research study. We'll also consider the main criteria for evaluating the methodological quality of a research study: Validity and reliability. The focus will be on internal validity and how internal validity can be threatened.
What would be your 'recipe' for the scientific method?
The scientific method leaves room for creativity when it comes to forming research questions; but once we have formulated a research question we need to test it methodically. The empirical cycle provides a general framework for combining systematic observation and logic to test our research questions. As you watch the video, ask yourself which principles or concepts you recognize from the videos in the first module.
2.01 Empirical Cycle
The empirical cycle captures the process of coming up with hypotheses about how stuff works and testing these hypotheses against empirical data in a systematic and rigorous way.
Being familiar with the five different phases of the cycle, will really help you keep the big picture in mind, especially when we get into specifics of things like experimental design and sampling.
So we'll start with the observation phase, obviously. This is where the magic happens. It's where an observation, again, obviously, sparks the idea for a new research hypothesis. We might observe an intriguing pattern or an unexpected event, anything that we find interesting and that we want to explain.
How we make the observation really doesn't matter. It can be a personal observation, an experience that somebody else shares with us. Even an imaginary observation that takes place entirely in your head.
Of course, observations generally come from previous research findings, which are systematically obtained. But in principle, anything goes. Okay, so, let's take as an example a personal observation of mine.
I have a horrible mother-in-law. I've talked to some friends and they also complain about their mother-in-law. So this looks like an interesting pattern to me. A pattern between type of person and likability.
Okay, so the observation phase is about observing a relation in one or more specific instances. In the induction phase this relation, observed in specific instances, is turned into a general rule. That's what induction means, taking a statement that's true in specific cases, and inferring that the statement is true in all cases, always. For example, from the observation that my friends and I have horrible mothers-in-law, I can induce the general rule that all mothers-in-law are horrible. Of course, this rule or hypothesis is not necessarily true. It could be wrong. That's what the rest of the cycle is about, testing our hypothesis.
In the induction phase, inductive reasoning is used to transform specific observations into a general rule or hypothesis. In the deduction phase, we deduce that the relations specified in the general rule should also hold in new, specific instances. From our hypothesis, we deduce an explicit expectation or prediction about new observations. For example, if all mothers-in-law are indeed horrible, then if I asked ten of my colleagues to rate their mother-in-law as either likeable, neutral or horrible, then they should all choose the category horrible.
Now in order to make such a prediction we need to determine the research setup. We need to decide on a definition of the relevant concepts, measurement instruments, procedures, the sample that we'll collect new data from. Et cetera, et cetera.
So, in the deduction phase, the hypothesis is transformed, by deductive reasoning and specification of the research setup, into a prediction about new empirical observations.
In the testing phase, the hypothesis is actually tested, by collecting new data, and comparing them to the prediction.
Now this almost always requires statistical processing, using descriptive statistics to summarize the observations for us, and inferential statistics to help us decide if the prediction was correct.
In our simple example we don't need statistics. Let's say that eight out of ten colleagues rate their mother-in-law as horrible. But, two rate her as neutral. Now we can see right away, our prediction didn't come true, it was refuted. All ten mothers-in-law should have been rated as horrible. So in the testing phase new empirical data is collected, and with the aid of statistics, the prediction is confirmed or disconfirmed. In the evaluation phase we interpret the results in terms of our hypothesis. If a prediction was confirmed, this only provides provisional support for a hypothesis. It doesn't mean that we've definitively proven the hypothesis, because it's always possible that in the future, we will find somebody who just loves their mother-in-law. In our example, the prediction was actually refuted.
This doesn't mean we should reject our hypothesis outright. In many cases there are plausible explanations for our failure to confirm. If these explanations have to do with a research setup, the hypothesis is preserved and investigated again. But, with a better research design. In other cases, the hypothesis is adjusted based on the results. The hypothesis is rejected and discarded only in very rare cases. In the evaluation phase, the results are interpreted in terms of the hypothesis, which is provisionally supported, adjusted, or rejected.
The observations collected in the testing phase can serve as new specific observations in the observation phase. This is why the process is described as a cycle. New empirical data obtained in the testing phase give rise to new insights that lead to a new run through. And that’s what empirical science comes down to. We try to hone in on the best hypotheses and build our understanding of the world as we go through the cycle, again and again.
What will it take for you to accept a hypothesis?
The empirical cycle describes how we transform an observation into a hypothesis, that is in turn, transformed into a prediction by specifying a research setup. So far, so good. But what does it mean if our prediction is confirmed? What if it's disconfirmed? What does this mean for our hypothesis: Do we accept it or do we reject it? These questions were only briefly addressed in the previous video, so we will take a closer look at them in the following video.
2.02 (Dis)confirmation
We're going to take a look at how we should interpret results that confirm or disconfirm our predictions, and whether we should confirm or reject our hypothesis accordingly.
Let's consider the hypothesis that all mothers-in-law are horrible. I formulated this hypothesis based on personal observations.
To test the hypothesis, I came up with a research setup. I decided to measure horribleness using a rating scale with the options likeable, neutral and horrible. I also decided to collect data from ten colleagues in my department.
With the research setup in place, I formulated the following prediction: if the hypothesis that all mothers-in-law are horrible is true, then all ten colleagues should choose the category horrible to describe their mother-in-law.
Okay, let's look at confirmation first. Suppose all ten colleagues rated their mother-in-law as horrible. The prediction is confirmed, but this doesn't mean the hypothesis has been proven. It's easily conceivable that we will be proven wrong in the future. If we were to repeat the study, we might find a person that simply adores their mother-in-law.
The point is that confirmation is never conclusive. The only thing we can say is that our hypothesis is provisionally supported. The more support from different studies, the more credence we afford a hypothesis. But we can never prove it.
Let me repeat that. No scientific empirical statement can ever be proven once and for all. The best we can do is produce overwhelming support for a hypothesis. Okay, now let's turn to disconfirmation. Suppose only eight out of ten colleagues rate their mother-in-law as horrible, and two actually rate her as neutral.
Obviously, in this case our prediction turned out to be false. Logically speaking, empirical findings that contradict the hypothesis should lead to its rejection. If our hypothesis states that all swans are white, and we then find black swans in Australia, we can very conclusively reject our hypothesis.
In practice, however, especially in the social sciences, there are often plausible alternative explanations for our failure to confirm. These are in fact so easy to find that we rarely reject the hypothesis outright. In many cases, these explanations have to do with methodological issues: the research design or the measurement instrument wasn't appropriate, maybe relevant background variables weren't controlled for. Et cetera, et cetera.
Coming back to our mother-in-law example, I could have made a procedural error while collecting responses from the two colleagues who rated their mother-in-law as neutral. Maybe I forgot to tell them that their responses were confidential, making them reluctant to choose the most negative category.
If there is a plausible methodological explanation for the failure to confirm, we preserve the hypothesis and instead choose to reject the auxiliary, implicit assumptions concerning the research design and the measurement.
Sometimes results do give rise to a modification of the hypothesis. Suppose that the eight colleagues who did have horrible mothers-in-law were all women, and the other two were men. Perhaps all mothers-in-law are indeed horrible, but only to their daughters-in-law.
What do you look for in a good research study?
If our predictions are confirmed we can't automatically conclude our hypothesis is supported. Alternatively, if our predictions are refuted, we don't necessarily reject our hypothesis. So how do we decide whether our results provide strong or weak support for our hypothesis? In the next video we'll discuss the general criteria to evaluate the methodological quality of a study. We will return to these criteria in much more detail later on.
2.03 Criteria
We follow the empirical cycle to come up with hypotheses and to test and evaluate them against observations. But once the results are in, a confirmation doesn't mean the hypothesis has been proven, and a disconfirmation doesn't automatically mean we reject it. So how do we evaluate the methodological quality of a study and its results?
Well there are two main criteria for evaluation: reliability and validity. Reliability is very closely related to replicability. A study is replicable if independent researchers are in principle able to repeat it. A research finding is reliable if we actually repeat the study and then find consistent results.
Validity is more complicated. A study is valid if the conclusion about the hypothesized relation between properties accurately reflects reality. In short, a study is valid if the conclusion based on the results, is true.
Suppose I hypothesize that loneliness causes feelings of depression. I deduce that if I decrease loneliness in elderly people by giving them a cat to take care of, their feelings of depression should also decrease. Now suppose I perform this study in a retirement home and find that depression actually decreases after residents take care of a cat. Can I now conclude that loneliness indeed causes depression?
Well, because this is still a pretty general question, we'll consider three more specific types of validity: construct, internal and external validity.
Construct validity is an important prerequisite for internal and external validity. A study has high construct validity if the properties or constructs that appear in the hypothesis are measured and manipulated accurately. In other words, our methods have high construct validity if they actually measure and manipulate the properties that we intended them to.
Suppose I accidentally measured an entirely different construct with for example my depression questionnaire. What if it measures feelings of social exclusion instead of depression?
Or suppose that taking care of the cat didn’t affect loneliness at all, but instead increased feelings of responsibility and self worth.
Then the results would only seem to support the hypothesis that loneliness causes depression, when in reality, we've manipulated a different cause and measured a different effect.
Developing accurate measurement and manipulation methods is one of the biggest challenges in the social and behavioral sciences.
But for now, I'll move on to internal validity. Internal validity is relevant when our hypothesis describes a causal relationship.
Let's assume for a second that our measurement and manipulation methods are valid. Can we conclude that depression went down because the elderly felt less lonely? Well, maybe something else caused the decrease in depression. For example, if the study started in the winter and ended in the spring, then maybe the change in season lowered depression.
Or maybe it wasn't the cat's company, but the increased physical exercise from cleaning the litter box and feeding bowl.
Alternative explanations like these threaten internal validity. If there's a plausible alternative explanation internal validity is low.
Now there are many different threats to internal validity that I will discuss in much more detail in later videos.
Okay, let's look at external validity. A study is externally valid if the hypothesized relationship supported by our findings also holds in other settings and other groups, in other words, if the results generalize to different people, groups, environments and times.
Let's return to our example. Will taking care of a cat decrease depression in teenagers and middle-aged people, too? Will the effect be the same for men and women? What about people from different cultures? Will a dog be as effective as a cat?
Of course, this is all hard to say based on the results of only elderly people and cats. If we'd included younger people from different cultural backgrounds and used other animals, we might have been more confident of the study's external validity.
I'll come back to external validity and how it can be threatened when we come to the subject of sampling.
How do you identify what caused an effect?
Causality is a very important concept in relation to internal validity. So before we consider internal validity in more detail, we'll first have a look at causality. When do we consider a relation to be causal? What is required? Try to answer this question before watching the video and see if your answers match up to the criteria listed in the video. Causality can be a controversial topic, we are very interested to hear what you think in the forums!
5.03 Probability Sampling
Probability sampling minimizes the selection threat to external validity. Before I discuss different types of probability sampling, let's consider the essential feature of probability sampling, and how this feature helps to minimize the risk of systematic bias in our selection of participants.
The essential feature of probability sampling is that for each element in the sampling frame, the probability of being included in the sample is known and non-zero. In other words, some form of random selection is required where any element could, in principle, end up in the sample.
To use probability sampling, we need to have a sampling frame: a list of all elements in the population that can be accessed or contacted. A sampling frame is necessary to determine each element's probability of being selected.
Now, let's see why probability sampling minimizes the threat of a systematic bias in our selection of participants.
Reducing systematic bias means reducing the risk of over- or underrepresentation of any population subgroup with a systematically higher or lower value on the property of interest.
Remember that in experiments, we used random assignment to get rid of systematic differences between the experimental and control conditions.
In the long run, any specific participant characteristic will be divided equally over the two groups. This means that any characteristic associated with a systematically higher or lower score on the dependent variable cannot bias the results, in the long run.
The same principle can be applied, not in the assignment, but in the selection of participants. We avoid systematic difference between the sample and the population by randomly selecting elements from the population.
In the long run, any specific participant characteristics will be represented in the sample, proportionally to their presence in the population. We call this a representative sample.
Suppose a population consists of 80% women. With repeated random sampling, we can expect the sample to contain 80% women, in the long run. Each individual element has the same probability of being selected, and since there are more women, female elements will be selected more often.
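To make this concrete, here is a minimal Python sketch (not from the lecture) that simulates repeated simple random sampling from a hypothetical population that is 80% women; the population size, sample size, and number of repetitions are assumptions chosen only to illustrate the long-run behavior.

```python
import random

# Hypothetical population of 10,000 people, 80% of whom are women.
population = ["woman"] * 8_000 + ["man"] * 2_000

# Share of women in each of 2,000 random samples of size 50.
shares = [random.sample(population, 50).count("woman") / 50
          for _ in range(2_000)]

# Individual samples vary, but the average share approaches 80%.
print(f"Average share of women across samples: {sum(shares) / len(shares):.1%}")
```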
Besides resulting in a representative sample in the long run, probability sampling has another advantage: it allows us to assess the accuracy of our sample estimate.
Probability sampling allows us to determine that with repeated sampling, in a certain percent of the samples, the sample value will differ from the real population value by no more than a certain margin of error.
This sounds complicated, and it is. But it basically means that we can judge how accurate our sample estimate is, in the long run. Given a certain risk of getting it wrong, we can assess the margin of error, meaning by how much the sample and population values will differ on average.
Consider an election between Conservative candidate A and Democratic candidate B. We want to estimate the proportion of people in the population that will vote for candidate A, as accurately as possible.
Random sampling allows us to make statements such as this: if we were to sample voters repeatedly, then in 90% of the samples, the true population proportion of votes for A would lie within 8 percentage points of our sample estimate.
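A statement like this can be checked by simulation. The following Python sketch is my own illustration, with an assumed true proportion of 52% for candidate A and an assumed sample size of 100; it counts how often the sample estimate falls within 8 percentage points of the true population value under repeated sampling.

```python
import random

# Hypothetical electorate of 1,000,000 voters; 52% favor candidate A (1 = vote for A).
population = [1] * 520_000 + [0] * 480_000
true_p = sum(population) / len(population)

def sample_proportion(pop, n):
    """Proportion voting for A in one simple random sample of size n."""
    return sum(random.sample(pop, n)) / n

# Repeated sampling: how often is the estimate within 8 percentage points?
estimates = [sample_proportion(population, 100) for _ in range(1_000)]
hit_rate = sum(abs(est - true_p) <= 0.08 for est in estimates) / len(estimates)
print(f"Samples within 8 points of the true value: {hit_rate:.0%}")  # roughly 90%
```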
5.04 Probability Sampling - Simple
There are several types of probability sampling. In this video, I'll discuss the two simplest types, simple random sampling and systematic sampling. The most basic form of probability sampling is simple random sampling. In simple random sampling, each element in the sampling frame has an equal and independent probability of being included in the sample.
Independent means the selection of any single element does not depend on another element being selected first. In other words, every possible combination of elements is equally likely to be sampled.
To obtain a simple random sample, we could write every unique combination of sampled elements on a separate card, shuffle the cards, and then blindly draw one card. Of course, if the population is large, then writing out all possible combinations is just too much work.
Instead, we can number all elements in the sampling frame and select them using random numbers. This can be done using random number tables, still found in the back of some statistics books, but these tables have become obsolete: we can now generate random number sequences with a computer.
For example, if our population consists of 12 million registered taxpayers, then we can generate a sequence of 200 unique random numbers between one and 12 million.
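As a rough sketch of how that selection could be done in Python (the taxpayer example is from the lecture, the code is mine), we can draw 200 unique numbers from the range 1 to 12 million:

```python
import random

# Simple random sample: 200 unique taxpayer numbers drawn from a
# sampling frame numbered 1 through 12,000,000.
selected = random.sample(range(1, 12_000_001), 200)
print(sorted(selected)[:10])  # first few selected taxpayer numbers
```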
Systematic sampling is a related method that also aims to obtain a random sample. In systematic sampling, only the first element is selected using a random number. The other elements are selected by systematically skipping a certain number of elements.
Suppose we want to sample the quality of bags of cat food coming off an assembly line. A random number gives us a starting point, say, the seventh bag. We then sample every tenth bag, so we select bag number seven, 17, 27, et cetera. It would be much harder to select elements according to random numbers, say, bag number seven, 30, 36, 41, et cetera, especially if the assembly line moves very fast.
With this approach, each element has an equal probability of being selected, but the probabilities are no longer independent.
Elements 17, 27, 37, et cetera, are only chosen if seven is chosen as the starting point. This is not a real problem; it just requires a little more statistical work to determine things like the margin of error. The real problem with systematic sampling is that it only results in a truly random sample if there's absolutely no pattern to the list of elements. What if the assembly line alternately produces cat food made with fish and cat food made with beef? Let's say all odd-numbered elements are made with fish. In our example, we would never sample the quality of cat food made with beef.
Of course, this is an exaggerated example, but it illustrates that systematic sampling can be dangerous. A preexisting list or ordering of elements can always contain a pattern that we're unaware of, resulting in a biased sample. So, systematic sampling only results in a truly random sample if it's absolutely certain that the list of elements is ordered randomly. We can make sure of this by randomly reordering the entire list. We can generate a sequence of random numbers of the same size as the list and then select elements from this list using systematic sampling. Of course, this is equivalent to random selection directly from the original list using random numbers.
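A small Python sketch of systematic sampling, using the cat-food-bag example; the shuffle step at the end reflects the advice to randomly reorder the list when a hidden pattern cannot be ruled out (frame size and step are assumed values).

```python
import random

def systematic_sample(frame, step=10):
    """Pick a random starting position, then take every step-th element."""
    start = random.randrange(step)   # random position among the first `step` elements
    return frame[start::step]

# Hypothetical frame of 100 bags coming off the assembly line.
bags = [f"bag_{i}" for i in range(1, 101)]
print(systematic_sample(bags))       # e.g. bag_7, bag_17, bag_27, ...

# Safer: randomly reorder the frame first, so any hidden pattern
# (e.g. fish and beef bags alternating) cannot bias the sample.
random.shuffle(bags)
print(systematic_sample(bags))
```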
5.05 Probability Sampling - Complex
There are many more sophisticated probability sampling methods. I'll discuss two methods that go beyond the basic idea of random sampling, but are still relatively simple. These are stratified random sampling and multi-stage cluster sampling.
In stratified random sampling, we divide the population into mutually exclusive strata. We sample from each stratum separately, using simple random sampling.
Stratified random sampling is useful for two reasons. First, it allows us to ensure that at least in terms of the sample strata, our sample is representative. This means sub-populations are represented in the sample in exactly the same proportion as they appear in the population.
With simple random sampling, we can expect the sample to be representative in the long run, but due to chance, in any particular sample, strata might be over or underrepresented. Second, stratification is useful because it can make sampling more efficient.
This means, all other things being equal, that we achieve a smaller margin of error with the same sample size.
Stratifying only increases efficiency if the strata differ strongly from each other relative to the differences within each stratum.
Imagine we want to sample the quality of cat food produced on an assembly line. The line produces cat food made with fish and cat food made with beef.
Suppose the average quality of beef cat food is higher than that of fish cat food. Also, the quality varies relatively little when we consider each type of food separately.
Under these circumstances, we will obtain a more accurate estimate of the population's food quality if we stratify on food type.
This is because quality is related to food type. Even a small overrepresentation of one food type can distort our overall estimate of food quality.
Stratifying prevents this distortion. If the quality does not differ between food types, then overrepresentation of one food type will not distort the overall estimate, and stratification will not improve efficiency. It is important to realize that stratified sampling requires that we know which stratum each element belongs to. If we can identify strata, then we also know their size.
As a consequence, the size of our sub-samples does not have to correspond to the size of the strata. We can calculate a representative estimate by weighting the sub-samples according to stratum size.
Why would we do this? Well, suppose our stratum of fish cat food is relatively small or is known to strongly vary in quality. In both cases, our estimate of the quality of fish cat food might be much less likely to be accurate than that of beef cat food. It might be worth it to take a bigger sample of fish cat food so we have a better chance of getting an accurate estimate. Of course, this means overrepresenting fish cat food.
We can correct for this overrepresentation by weighting the separate estimates of fish and beef cat food according to their stratum sizes before averaging them into an overall estimate of food quality. This way, the sample value is representative, efficient and more likely to be accurate.
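A minimal sketch of this weighting step in Python, with made-up quality scores and stratum shares; the fish stratum is deliberately oversampled, and the stratum weights correct for that when the overall estimate is computed.

```python
# stratum -> (share of the population, sampled quality scores); all values hypothetical
strata = {
    "beef": (0.70, [8.1, 7.9, 8.3, 8.0]),
    "fish": (0.30, [6.2, 5.8, 6.5, 6.0, 6.4, 5.9]),  # oversampled on purpose
}

# Weight each stratum's sample mean by its population share.
overall = sum(share * (sum(scores) / len(scores))
              for share, scores in strata.values())
print(f"Weighted overall quality estimate: {overall:.2f}")
```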
Let's turn to multi-stage cluster sampling, the final type of random sampling I want to discuss. Multi-stage cluster sampling allows us to use random sampling without going bankrupt.
Consider sampling frames that consist of all inhabitants, students or eligible voters in a certain country.
If we were to randomly select elements from these frames, we would have to travel all over the country. In most cases, this is just too expensive.
A solution is to randomly sample in stages by first selecting clusters of elements. Say, we want to sample math performance in the population of all Dutch students currently in their third year of secondary education.
We start by forming a sampling frame of all school districts. This is the first stage, where students are clustered in districts. We randomly select a very small sample of school districts. We can use stratification to make sure we include districts in urban and rural areas.
In the second stage, we randomly select schools from the previously selected districts. Students are now clustered in schools. In the third stage, third year math classes are randomly sampled from the previously selected schools.
We can even include a fourth stage where students are randomly sampled from the previously selected classes.
Stratification can be used in all of these stages. Multi-stage cluster sampling makes random sampling feasible, but the margin of error is harder to determine because the probability of being included in the sample is no longer the same for all elements, like it was with simple random sampling.
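The staged selection itself is easy to sketch in code. Below is a toy Python example (the district, school, and class names are invented) that randomly selects districts, then schools within those districts, then classes within those schools.

```python
import random

# Hypothetical sampling frame: district -> school -> third-year math classes.
frame = {
    "North": {"School A": ["A3a", "A3b"], "School B": ["B3a", "B3b"]},
    "South": {"School C": ["C3a", "C3b"], "School D": ["D3a", "D3b"]},
    "East":  {"School E": ["E3a", "E3b"], "School F": ["F3a", "F3b"]},
}

def multistage_sample(frame, n_districts=2, n_schools=2, n_classes=1):
    """Stage 1: sample districts; stage 2: schools; stage 3: classes."""
    classes = []
    for district in random.sample(list(frame), n_districts):
        for school in random.sample(list(frame[district]), n_schools):
            classes += random.sample(frame[district][school], n_classes)
    return classes

print(multistage_sample(frame))  # e.g. ['A3b', 'B3a', 'E3a', 'F3b']
```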
5.06 Non-Probability Sampling
Probability sampling can be contrasted with non-probability sampling. In non-probability sampling, some elements in the sampling frame have either a zero probability of being selected, or their probability is unknown. As a consequence, we cannot accurately determine the margin of error.
There are several types of non-probability sampling. I'll discuss the four most common types. Convenience sampling, snowball sampling, purposive sampling, and quota sampling.
Convenience sampling or accidental sampling is the simplest form of non-probability sampling. In convenience sampling, elements are selected that are the most convenient, the most easily accessible.
For example, if I'm interested in investigating the effectiveness of online lectures on study performance, I could recruit students in courses that I teach myself.
Of course, this is a highly selective sample of students from a particular university in a particular bachelor program.
Results will almost certainly be influenced by specific characteristics of this group, and might very well fail to generalize to all university students in my country, let alone students in other countries. So the risk of bias is high, and we have no way to determine how closely the sample value is likely to approach the population value.
Even so, convenience samples are used very often, because sometimes it's simply impossible to obtain a sampling frame.
In other cases, the effort and expense necessary to obtain a sampling frame are just not worth it. For example, when a universalistic causal hypothesis is investigated.
Snowball sampling is a specific type of convenience sampling. In snowball sampling, initially, a small group of participants is recruited. The sample is extended by asking the initial participants to provide contact information for possible new participants. These new participants are also asked to supply contacts. If all participants refer new ones, the initially small sample can grow large very quickly.
Suppose we want to sample patients who suffer from a rare type of cancer. We could approach a patient interest group for example, and ask the initial participants if they can put us in contact with other patients that they know through other interest groups, or through their hospital visits.
We continue to ask new participants to refer others to us until the required sample size is reached.
Snowball sampling is very useful for hard-to-reach, closed-community populations. Of course, all disadvantages of convenience sampling also apply to snowball sampling, maybe even more so, because there's the added risk that we're selecting a clique of friends, colleagues, or acquaintances. These people could share characteristics that differ systematically from others in the population.
In purposive sampling, elements are specifically chosen based on the judgement of the researcher. A purposive sample can consist of elements that are judged to be typical for the population, so that only a few element values are needed to estimate the population value.
A purposive sample can consist of only extreme elements. For example, to get an idea of the effectiveness of social workers working with extremely uncooperative problem families.
Elements can also be purposively chosen because they're very much alike, or, conversely, very different. For example, to get an idea of the range of values in the population.
Or, elements can consist of people who are judged to be experts. For example, when research concerns opinions on matters that require special knowledge. Purposive sampling is used mostly in qualitative research, so I won't go into further details here. Suffice it to say that purposive sampling suffers all the same disadvantages that convenience sampling does. The researcher's judgment can even form an additional source of bias.
Quota sampling is superficially similar to stratified random sampling. Participants in the sample are distinguished according to characteristics such as gender, age, ethnicity, or educational level.
The relative size of each category in the population is obtained from a national statistics institute, for example. This information is used to calculate how many participants are needed in each category, so that the relative category sizes in the sample correspond to the category sizes in the population. But instead of randomly selecting elements from each stratum, participants for each category are selected using convenience sampling.
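The quota calculation itself is just proportional arithmetic; here is a small Python illustration with assumed category shares and an assumed total sample size.

```python
# Hypothetical population shares per age category and a target sample size.
population_shares = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}
sample_size = 200

# Number of participants to recruit (by convenience) in each category.
quotas = {group: round(share * sample_size)
          for group, share in population_shares.items()}
print(quotas)  # {'18-34': 60, '35-54': 70, '55+': 70}
```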
Although this approach might seem to result in a representative sample, all kinds of biases could be present. Suppose the choice of participants is left to an interviewer, then it's possible that only people who seem friendly and cooperative are selected.
If a study uses non-probability sampling, the results should always be interpreted with great caution, and generalized only with very great reservation.
To what extent does a sample reflect the population?
Inherent in the concept of sampling is that a sample provides an incomplete picture of the population. We expect scores on the dependent variable in any particular sample to differ from the scores in the population, simply because the sample is a small subset of this population. This error is referred to as sampling error. Of course it would be useful to know, on average, how large this error is, or in other words, how precise and accurate our sample is in providing an estimate of the population values. This difficult concept will be discussed in the video on sampling error. Of course there are also other sources of error, both random and systematic. These errors, including potential biases, are referred to as non-sampling error and are discussed in a separate video.
5.07 Sampling Error
The goal of sampling is to estimate a value in the population as accurately as possible. But even if we use the most advanced sampling methods, there will always be some discrepancy between our sample value, the estimate, and the true value in the population.
This error can be categorized into two general types, sampling error and non-sampling error. In this video, I'll only discuss the first type, sampling error.
It's important to keep in mind that the true value in the population is almost always unknown. If we knew the population value, then we wouldn't need a sample.
This also means that for any particular sample, we cannot assess exactly how large the error is. However, for sampling error, it is relatively easy to estimate how large the error is on average. If we were to repeatedly draw random samples from the population, then under certain conditions, the average sample value of all these samples will correspond to the population value.
But, of course, individual samples will result in sample values that are different from the population value. Sampling error is the difference between sample and population value that we would expect due to chance.
We can estimate how large the sampling error is on average if we were to repeatedly draw new samples from the population. Note that this only works for randomly selected samples. The average error, called the standard error, can be estimated based on the values obtained in a single sample. We can then use the standard error to calculate a margin of error. You might think the margin of error tells us by how much our sample value differs from the population value at most, but we can't calculate exactly between what boundaries the true population value lies, because we're estimating the sampling error in the long run, over repeated samplings.
This information is captured in a confidence interval. A confidence interval allows us to say that, with repeated sampling, in a certain percentage of these samples, the true population value will differ from the sample value by no more than the margin of error.
Suppose we want to estimate the proportion of people that will vote for candidate A in an election. We sample 100 eligible voters and find that 60% of the sample says they'll vote for A.
We have to decide how confident we want to be. Let's say that with repeated sampling, we want the population value to fall within the margin of error at least 90% of the time.
With this decision, we can now calculate the margin of error. Let's say that the margin of error is 8%. This means we can say that with repeated sampling, the population value will differ from the sample value by no more than 8% in 90% of all the samples.
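The video doesn't spell out the calculation, but one common way to arrive at numbers like these is the normal-approximation formula for a proportion; the sketch below is an assumption on my part, not the lecturer's derivation, and it reproduces the roughly 8% margin for 60% support, n = 100, and 90% confidence.

```python
from math import sqrt
from statistics import NormalDist

p_hat, n, confidence = 0.60, 100, 0.90

z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # ~1.645 for 90% confidence
standard_error = sqrt(p_hat * (1 - p_hat) / n)       # ~0.049
margin_of_error = z * standard_error                 # ~0.08, i.e. about 8%

print(f"Margin of error: {margin_of_error:.1%}")
```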
Sampling error is related to the sample size. As sample size increases, sampling error will become smaller.
Sampling error is also influenced by the amount of variation in a population. If a population varies widely on the property of interest, then the sample value can also assume very different values. For a given sample size, sampling error will be larger in a population that shows more variation.
Okay, so to summarize. Sampling error is the difference between population and sample value due to chance, due to the fact that our sample is a limited incomplete subset of the population.
Sampling error is unsystematic, random error. It's comparable to the random error that makes a measurement instrument less reliable.
We can estimate how large the sampling error will be in the long run, which allows us to conclude how accurate our sample value is likely to be.
This only works under certain conditions. One of these conditions is that the sample is a random sample from the population.
5.08 Non-Sampling Error
Sampling error can be contrasted with non-sampling error. Sampling error is the difference between population and sample value due to the fact that our sample is a limited, incomplete subset of the population. Non-sampling error is the difference between population and sample value due to sources other than sampling error.
Two major sources of non-sampling error are sampling bias and error due to non-response. They're both related to the sampling procedure.
Sampling bias is a systematic form of error. Sampling bias is the difference between sample and population value due to a systematic under- or overrepresentation of certain elements in the population.
Sampling bias occurs when some elements have a much smaller or larger chance to be selected than was intended. Sampling bias can also occur when certain elements have no chance to be selected at all.
Suppose we want to estimate the proportion of people that will vote for candidate A in an election. Sampling bias could occur if participants were recruited on the street by an interviewer during working hours. This could lead to an underrepresentation of people who are employed full time. If these people would vote for candidate A more often, then we would systematically underestimate the percentage of voters for candidate A.
The risk of sampling bias is eliminated, at least in the long run, by using a probability sampling method.
With non-probability sampling, the risk of sampling bias is strong. Sampling bias is comparable to the systematic error that makes a measurement instrument less valid or less accurate. Non-response is another source of non-sampling error. Non-response refers to a lack of response to invitations or the explicit refusal to participate in a study. Non-response also includes participants who drop out during the study, and participants whose data are invalid because they did not participate seriously, because something went wrong, or because they did not understand or failed to comply with some aspect of the procedure.
If non-response is random, then you could say that non-response results in a smaller sample size and thereby slightly increases the margin of error. But sometimes non-response is not random. Sometimes specific subgroups in a population are less likely to participate.
If the subgroup has systematically different values on the property of interest, then non-response is a source of systematic error.
Suppose people with a lower socioeconomic status are less likely to participate in polls, and also prefer other candidates to candidate A.
In that case, we're missing responses of people that would not vote for A, which could lead to a systematic overestimation of the percentage of people that will vote for A.
Besides sampling bias and non-response, there are other sources of non-sampling error related to the sampling procedure. One example is an incomplete or inaccurate sampling frame, for example because the frame is out of date. Apart from non-sampling error related to the sampling procedure, there are two other general types of non-sampling error.
The first is data collection error. This type of error could be caused by errors in the instrument, such as poorly worded questions or untrained observers.
Data collection errors can also be due to errors in the procedure, such as giving inaccurate instructions, a failure of equipment, or distraction by fellow participants during data collection.
A final source of non-sampling error lies in the processing of data after they've been collected. Data entry errors can be made, for example, when data are entered into a data file manually, or when responses need to be recoded or aggregated in the data file.
As you can see, non-sampling error includes systematic error such as sampling bias, systematic non-response error, and systematic collection error due to faulty instruments or procedures. However, non-sampling error also includes random error such as random non-response error, random data collection and data processing errors.
One final remark: for random samples, the sampling error can be estimated. The size of non-sampling error is much harder to assess, even in random samples. There are all kinds of sophisticated techniques available to assess sampling bias and systematic and random non-response errors. Unfortunately, these techniques usually require rigorous sampling methods, large sample sizes, and all kinds of additional assumptions.
How large should your sample be?
The final question a researcher needs to answer - once the population is identified and a sampling method is chosen - is how many participants should be recruited. In principle more is always better, but there is a point of diminishing returns, where the cost of adding participants outweighs the benefit of higher precision. For probability sampling it is possible to calculate how large the sample should be given certain criteria for precision. If you found the concept of margin of error difficult, it is a good idea to re-watch the video on sampling error before watching this last video on sample size!
5.09 Sample Size
The goal of sampling is to obtain the best possible estimate of a population value within the limits of our budget and our time.
Suppose we've decided on a sampling method for our study, preferably a probability sampling method if this is at all possible.
The question that remains is how many elements we need to sample in order to get an accurate estimate of the population value. In principle, the answer is: the more, the better, because as sample size increases, the margin of error will decrease. Accidental over- or underrepresentation of certain elements will be less extreme and will become less likely. In other words, a bigger sample is always better in terms of accuracy. But this doesn't mean we should all collect samples consisting of tens of thousands of elements. This is because as the sample size grows, the decrease in the margin of error becomes smaller and smaller.
At a certain point, the cost of collecting more elements outweighs the decrease in the margin of error.
Say we want to estimate the proportion of votes for candidate A in an upcoming election. Suppose we have a sample of 500 eligible voters. Then the error won't be cut in half if we double the sample to 1,000 elements. The decrease in error will be much, much smaller.
Note that it's the absolute size of the sample that matters, not the relative size. It doesn't matter if we're estimating election results in Amsterdam, with slightly more than half a million eligible voters, or national elections with more than 13 million voters. As long as the samples are both randomly selected, the margin of error will be the same, all other things being equal. This seems very counterintuitive, but it's true nonetheless.
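The diminishing returns follow from the fact that, under the usual normal approximation, the margin of error shrinks with the square root of the sample size. A quick Python sketch, assuming 95% confidence and a worst-case proportion of 0.5 (my own illustrative settings):

```python
from math import sqrt

p, z = 0.5, 1.96  # worst-case proportion, z for 95% confidence (assumed settings)

for n in (500, 1_000, 2_000):
    moe = z * sqrt(p * (1 - p) / n)
    print(f"n = {n:>5}: margin of error ~ {moe:.1%}")
# Doubling n from 500 to 1,000 shrinks the error from ~4.4% to ~3.1%, not to half.
```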
Of course, there are other factors to consider when deciding on sample size. The variability of the population is an important factor. Heterogeneity or strong variation in the population on the property of interest results in a larger margin of error, all other things being equal.
If values in the population vary widely, then a sample is more likely to accidentally over or underestimate the true population value.
If the population is more homogenous or similar, meaning it takes on a narrow, limited set of values, well, then the sample value will automatically lie closer to the population value.
If a population is more homogenous, we can sample more efficiently. This means, all other things being equal, that we could achieve a smaller margin of error with the same sample size. Or, conversely, we could obtain the same margin of error with a smaller sample; in other words, more efficient.
If a probability sampling method is used, we could determine what margin of error we're willing to accept given a certain confidence level. We can say that we want our sample estimate of election results to differ by no more than 5% from the final results in 95% of the cases if we were to sample repeatedly.
We, or rather a computer, can now calculate exactly what sample size we need to obtain this margin of error at this confidence level.
This does require that we use random sampling and that we can estimate the variability in the population. For example, based on previous studies, old census data, or just a best guess if necessary.
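As a sketch of such a calculation (again using the normal approximation for a proportion, which is an assumption on my part, not a formula given in the video):

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(margin, confidence, p=0.5):
    """Sample size needed to estimate a proportion within `margin`.

    p = 0.5 is the most conservative guess when the population
    variability is unknown; use an estimate from earlier studies if available.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * p * (1 - p) / margin ** 2)

# No more than 5 percentage points off in 95% of repeated samples:
print(required_sample_size(margin=0.05, confidence=0.95))  # 385
```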
I'll just mention one other important factor to consider when determining the sample size. It's a good idea to plan ahead and compensate for non-response. Non-response refers to elements in the sample that cannot be contacted, that refuse to participate, fail to complete the study, or provide invalid responses. If the response rate can be estimated based on previous or comparable research, then we can take non-response into account and sample extra elements that will compensate for the expected loss of elements due to non-response.
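Compensating for expected non-response is then a one-line adjustment; the 70% response rate below is an assumed figure purely for illustration.

```python
from math import ceil

required_completes, expected_response_rate = 385, 0.70  # assumed values
to_recruit = ceil(required_completes / expected_response_rate)
print(to_recruit)  # 550 elements to sample, to end up with ~385 usable responses
```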