This session starts where the Data Analysis Tools course left off. This first set of videos provides you with some conceptual background about the major types of data you may work with, which will increase your competence in choosing the statistical analysis that’s most appropriate given the structure of your data, and in understanding the limitations of your data set. We also introduce you to the concept of confounding variables, which are variables that may be the reason for the association between your explanatory and response variable. Finally, you will gain experience in describing your data by writing about your sample, the study data collection procedures, and your measures and data management steps.
Lesson 1: Observational Data
So where did the data you're working with come from? Data can be generated from multiple sources, but mostly consist of two types: observational and experimental. How the data are generated is important because it determines what kinds of conclusions we can draw from the data we're working with.
There's one question we can ask to determine whether our data can be considered observational or experimental.
And that question is: was the explanatory variable manipulated? By manipulated, we mean that the values of the variable are controlled by the researcher. For example, suppose you participated in a study in which you completed a questionnaire by responding to questions about your level of exercise.
The data generated from this type of study are called observational data, because the researcher is not manipulating your level of exercise; you're simply reporting how much you exercise. On the other hand, suppose you are participating in a study in which researchers ask you to exercise at a certain level for a week, then compare you to other individuals who are asked not to exercise at all for the same week. The data generated from this type of study are what we call experimental data, because the researchers are manipulating the amount of exercise that you do. So if our answer to the question, was the explanatory variable manipulated, is no, then we're working with observational data.
Observational data can be generated in a number of ways. One way that observational data can be generated is through data reporting.
Data reporting is the process of collecting and organizing data, typically to monitor a process or phenomenon. Data reporting does not collect data with any specific hypotheses in mind, although the data are often analyzed later on to test specific hypotheses.
Data reporting can tell you what is happening, but it's data analysis that can tell you why it's happening.
As an example, individual states may report rates of diseases to the Centers for Disease Control, and the Centers for Disease Control can combine the information from each state, and from all over the world, into a data set that allows them to monitor global disease rates. Data reporting can also occur when you go to the doctor's office and they record information from your visit into an electronic medical record. This information is primarily collected for monitoring and reporting purposes. However, electronic medical record data can be compiled into a database that can be analyzed to gain insight into disease processes, health behaviors, and other health-related phenomena.
The Gapminder dataset that some of you are working with consists of data collected through data reporting. Observational data can also be collected through surveys of samples of individuals in the population. Much of our data are collected this way. In these survey research studies, a sample of individuals or observations is drawn from the population, and are asked to respond to questions posed by researchers, either in an interview or on a questionnaire.
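The sampling step in a survey research study can be sketched in a few lines of Python; the population size and sample size below are hypothetical, chosen only to show the mechanics of drawing a simple random sample:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 individual IDs; a survey study draws
# a simple random sample and asks each sampled person the questionnaire.
population = range(10_000)
sample_ids = random.sample(population, k=500)

# Every ID is distinct, because simple random sampling here is
# sampling without replacement.
print(len(sample_ids), len(set(sample_ids)))
```

Each sampled individual's questionnaire responses would then become one row of the observational data set.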
Responses to survey questions provide the data that we can analyze. Sample survey data are typically collected with certain hypotheses in mind to test. However, we often use existing sample survey data to test secondary hypotheses or to mine the data for additional insights. If you're working with the NESARC, AddHealth, or Outlook On Life dataset, you're working with observational data generated by survey research studies. Observational data might also be collected by simply observing the values of a variable or variables of interest as they naturally occur.
For example, a researcher might observe families at play on the beach and measure the number of times parents touch or speak to their children. Another example of an observational study is when surface weather observations, such as temperature, humidity, wind speed, and precipitation, are collected in order to predict future weather patterns or other climatological events, such as drought and flooding.
In both of these cases, family behavior observation and weather observation, the collection of data does not involve any interaction, or intervention, with the process as it is occurring.
Lesson 2: Experimental Data
So what if we go back to our question asking whether the explanatory variable was manipulated, and the answer is yes? Data that come from studies in which the explanatory variable is manipulated are called experimental data. Experimental data come from studies in which groups of observations are either pre-selected for, or randomly assigned to, the values of an explanatory variable and then observed on some response variable. There are two major types of experimental studies: true experimental studies and quasi-experimental studies. There are three components of a true experimental study. First, only one explanatory variable is manipulated, meaning that all other variables that could also be related to the response variable are held constant; the only thing that changes is the value of the explanatory variable being manipulated by the experimenter. Second, there must be a control group, against which other values of the explanatory variable are compared on the response variable. And third, observations must be randomly assigned to values of the explanatory variable. This means that every observation starts out with an equal probability of being in each group, and is then randomly chosen to be in one group or another. For example, an agricultural researcher might be interested in determining the effect of a new fertilizer on plant growth. In this study, each plant is an observation, fertilizer application is the explanatory variable, and plant growth is the response variable.
The researcher takes a sample of seedlings and randomly divides the sample into two groups. The first group of seedlings is fertilized and kept for three months in a room with a controlled amount of sunlight, watering, and air temperature.
The second group is kept in the same room under identical conditions for the same three months, with the exception that the plants in this group are not fertilized. Because the researcher is interested in the effect of the new fertilizer on plant growth, the plants that were not fertilized are the control group, and the plants that were fertilized are the treatment group.
After the three-month period, the researcher measures the height of each plant in both groups. The researcher found that the plants that were fertilized grew an average of two inches higher than the plants that were not fertilized. As a result, the researcher concluded that the fertilizer significantly increased plant growth and recommended that farmers be encouraged to use the fertilizer. So you can see in this experimental study that all other variables, with the exception of the explanatory variable of interest, are held constant in each group as a result of the experimental design. Because all other factors that could affect plant growth were held constant in this experiment, the researcher could conclude that the fertilizer caused the plants to grow higher. Most of the data we work with, however, are not produced by a true experiment. Most of the time we can't physically control all, or even any, of the other factors that might affect our response variable. So for most studies we are not able to determine whether one variable causes another, but we are able to determine associations.
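The design just described can be sketched as a small simulation. The heights, group size, and the two-inch fertilizer effect below are all hypothetical, invented only to illustrate random assignment and the group comparison:

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical sample of 20 seedlings, identified by index.
seedlings = list(range(20))
random.shuffle(seedlings)

# Random assignment: half to the treatment group, half to the control group.
treatment = seedlings[:10]
control = seedlings[10:]

# Simulated heights in inches after three months; the assumed two-inch
# fertilizer effect is built into the simulation, not measured.
heights = {i: random.gauss(10, 1) for i in control}
heights.update({i: random.gauss(12, 1) for i in treatment})

mean_treat = statistics.mean(heights[i] for i in treatment)
mean_ctrl = statistics.mean(heights[i] for i in control)
print(f"treatment mean: {mean_treat:.1f}, control mean: {mean_ctrl:.1f}")
```

Because every seedling had an equal chance of ending up in either group, any difference between the group means can be attributed to the fertilizer rather than to pre-existing differences between the groups.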
Random assignment is another way we can control for these other factors. The idea is that if every observation in the sample has an equal probability of being in each of the groups, and truly randomly ends up in one group or another, then the groups end up balanced in terms of the other factors. So if age is a factor, then the groups should have the same age variability, and this equal variability essentially controls for that factor. The same should be true for any other factor. However, randomization doesn't always work the way we want it to. In fact, randomization works best as your sample size approaches infinity. Unfortunately, we work with finite samples, which can often be pretty small. The smaller the sample, the greater the risk that the groups will be unbalanced on factors that could change how the treatment affects the response variable. If part of your job as a data analyst is to evaluate data from studies with random assignment, one of the first things you'll want to do is check for any imbalances between your treatment and control groups on key variables that could change how the treatment affects the response variable. If imbalances are identified, then those variables can be included in the statistical model used to predict the response variable, so that they can be statistically controlled. Statistical control is another commonly used strategy. If we include additional explanatory variables that could affect the association between the treatment and the response, then we can examine that association after adjusting for the other explanatory variables. While these are all good strategies for imposing as much control on a study as possible, they're not perfect, nor can we possibly control for everything that could affect the association between the treatment and response variable. For that reason, unlike a true experiment in which we are able to hold every other possible variable constant, we cannot determine causality.
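A first balance check of the kind described might look like this in Python. The ages and the imbalance threshold are purely illustrative:

```python
import statistics

# Hypothetical treatment/control groups with a key covariate, age.
treatment_ages = [34, 29, 41, 38, 25, 47, 31, 36]
control_ages = [52, 48, 45, 39, 51, 44, 50, 43]

def balance_summary(a, b):
    """Compare two groups' means and spreads on a covariate."""
    return {
        "mean_diff": statistics.mean(a) - statistics.mean(b),
        "sd_a": statistics.stdev(a),
        "sd_b": statistics.stdev(b),
    }

summary = balance_summary(treatment_ages, control_ages)
# A large mean difference flags an imbalance; the 5-year cutoff here
# is an arbitrary illustration, not a standard rule.
imbalanced = abs(summary["mean_diff"]) > 5
print(summary, imbalanced)
```

In this made-up sample the control group is about eleven years older on average, so age would be added to the statistical model as a covariate rather than trusted to randomization.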
We can only determine whether the treatment is associated with the response variable. Sometimes, we can't randomly assign people to a treatment or control group; in many cases, it would be unethical to do so. For example, if we're conducting a study to examine the association between cocaine use and memory processing, there's no way we could assign some participants to use cocaine. This would be completely unethical, and we would put our participants at significantly greater risk of harm. It certainly would not outweigh the benefit of any knowledge that would be gained by the study. Instead, we would have to identify people who either test positive for, or self-report, cocaine use and then test for memory processing differences between users and non-users. The "manipulation" of the explanatory variable is based on the fact that our treatment and control groups are pre-selected. In this study, cocaine users would be in our treatment group and non-users would be in our control group. So while it looks like an experimental design, it is missing the random assignment piece, and we call this a quasi-experimental design. We can increase the rigor of a quasi-experimental design by measuring as many confounding variables as possible, having a control group, and using a pre- and post-test design whenever possible. Still, a quasi-experimental design will not allow us to infer causality between an explanatory variable and our response variable.
Lesson 3: Confounding Variables
We're trying to determine which smoking-cessation method works best: drugs to alleviate nicotine addiction, therapy, a combination of drugs and therapy, or simply quitting with neither.
The explanatory variable is the method, while the response variable is eventual success or failure in quitting. Our study shows that the percentage succeeding with the combined drug and therapy method was highest, while the percentage succeeding with neither therapy nor drugs was lowest. In this example, there is clear evidence of an association between the method used and the success rate. Can we conclude that the combined drug and therapy method causes success more than using neither therapy nor drugs? >> It is at precisely this point that we confront the underlying weakness of most observational studies. Some members of the sample have opted for certain values of the explanatory variable, method of quitting, while others have opted for other values. Those individuals may differ in additional ways that would also play a role in the response of interest. For instance, suppose older people are more likely to choose certain methods to quit, and suppose older people in general tend to be more successful in quitting than younger people. The data would make it appear that the method itself was responsible for success, whereas in truth, it may just be that being older is the reason for success. We can express this scenario in terms of the key variables involved. >> In addition to the explanatory variable, or method, and the response variable, success or failure, a third lurking variable, age, is tied in, or confounded, with the explanatory variable's values and may itself cause the response to be success or failure. We could control for the lurking variable age by studying older and younger adults separately. Then, if both older and younger adults who chose one method have higher success rates than those opting for another method, we would be closer to producing evidence of causation.
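Studying older and younger adults separately amounts to a stratified comparison of success rates. A minimal sketch, with hypothetical (successes, total) counts per method within each age stratum:

```python
# Hypothetical quitting outcomes, stratified by age group.
# Each entry is (number who succeeded, number who tried the method).
data = {
    "younger": {"drugs+therapy": (30, 100), "neither": (10, 100)},
    "older":   {"drugs+therapy": (45, 100), "neither": (20, 100)},
}

def success_rate(successes, total):
    return successes / total

for age_group, methods in data.items():
    rates = {m: success_rate(*counts) for m, counts in methods.items()}
    print(age_group, rates)
# If drugs+therapy beats "neither" within BOTH age strata, the observed
# association cannot be explained by age alone.
```

With these invented numbers the method's advantage holds within each stratum, which is the pattern that would move us closer to evidence of causation.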
The diagram demonstrates how straightforward it is to control for the lurking variable age by modifying your study design. Notice that we did not claim that controlling for age would allow us to make a definite claim of causation.
This is due to the fact that other lurking variables may also be involved, such as the level of the participant's desire to quit. Specifically, those who have chosen to use the drug and therapy method may already be the ones who are most determined to succeed, while those who have chosen to quit without investing in drugs or therapy may from the outset be less committed to quitting. >> To attempt to control for this lurking variable, we could interview the individuals at the outset in order to rate their desire to quit on a scale of one to five, with one being the weakest and five the strongest. Then we could study the relationship between method and success separately for each of the five groups. But desire to quit is a very subjective thing, difficult to assign a specific number to. Realistically, we may be unable to effectively control for the lurking variable, desire to quit.
And who's to say that age and the desire to quit are the only lurking variables involved? There may be other subtle differences among individuals who choose one of the four methods to quit smoking, and researchers may fail to conceive of these subtle differences as they attempt to control for possible lurking variables. For example, smokers who opt to quit using neither therapy nor drugs may tend to be in a lower income bracket than those who opt for drugs and/or therapy, because the latter can afford those methods. Perhaps smokers in a lower income bracket also tend to be less successful in quitting because more of their family members and coworkers smoke.
Thus, socioeconomic status is yet another possible lurking variable in the relationship between cessation method and success rate.
>> It's because of the existence of a virtually unlimited number of potential lurking variables that we can never be 100% certain of a claim of causation based on an observational study. Observational studies cannot prove causality. On the other hand, observational studies are an extremely common tool used by researchers to attempt to draw conclusions about causal connections. To do this, great care must be taken to control for the most likely lurking variables. Only then can researchers assert that an observational study may suggest a causal relationship.
So far, we've discussed different ways in which data can be used to explore the relationship or association between two variables. When we explore the relationship between two variables, there is often a temptation to conclude from the observed association that changes in the explanatory variable cause changes in the response variable. In other words, you might be tempted to interpret the observed association as causation. The purpose of this part of the course is to convince you that this kind of interpretation is often wrong.
The motto of this section is one of the most fundamental principles of this course. Association does not imply causation.
This scatter plot illustrates how the number of firefighters sent to fires, on the x axis, is related to the amount of damage caused by fires, on the y axis, in a certain city. What variables might affect the extent of this damage?
The scatter plot clearly displays a fairly strong, though slightly curved, positive relationship between the two variables.
Would it then be reasonable to conclude that sending more firefighters to a fire causes more damage or that the city should send fewer firefighters to a fire in order to decrease the amount of damage done by the fire? Of course not. So what's going on here?
There is a third variable in the background, the seriousness of the fire, that's responsible for the observed relationship: more serious fires require more firefighters, and also cause more damage. This model will help you visualize the situation. The seriousness of the fire is a confounding variable. In statistics, a confounding variable, also known as a confounding factor, a lurking variable, a confound, or a confounder, is an extraneous variable that is associated, positively or negatively, with both the explanatory variable and the response variable. We need to control for these factors to avoid incorrectly believing that the response variable is associated with the explanatory variable.
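A small, entirely hypothetical simulation shows how a confounder manufactures this kind of association: seriousness drives both variables, yet firefighters and damage end up strongly correlated with each other even though neither causes the other:

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Simulated fires: seriousness drives BOTH the number of firefighters
# sent and the damage done; firefighters never cause damage here.
n = 200
seriousness = [random.uniform(1, 10) for _ in range(n)]
firefighters = [s * 3 + random.gauss(0, 1) for s in seriousness]
damage = [s * 50 + random.gauss(0, 20) for s in seriousness]

def corr(x, y):
    """Pearson correlation coefficient, computed from scratch."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# A strong positive correlation appears despite there being no causal link.
print(f"corr(firefighters, damage) = {corr(firefighters, damage):.2f}")
```

The simulation contains no causal path from firefighters to damage at all; the near-perfect correlation is produced entirely by the shared cause.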
>> Confounding is a major threat to the validity of inferences made about statistical associations. In the case of a confounding variable, the observed association with the response variable should be attributed to the confounder rather than the explanatory variable. In science, we test for confounders by including these third variables, or fourth or fifth or sixth, in our statistical models, where they may explain the association of interest. In other words, we want to demonstrate that our association of interest is significant even after controlling for potential confounders.
Lesson 4: Introduction to Multivariate Methods
>> Because adding potential confounding variables to our statistical model can help us to gain a deeper understanding of the relationship between variables, or lead us to rethink an association, it's important to learn about statistical tools that will allow us to examine multiple variables simultaneously, that is, to look at more than two or three variables at the same time. The general purpose of multivariate modeling techniques, such as multiple regression and logistic regression, is to learn more about the relationship between several explanatory variables and one response variable. >> These regression procedures are very widely used in research. In general, they allow us to ask, and hopefully answer, two questions: what is the best predictor of my response variable? And does variable A or variable B confound the relationship between my explanatory variable of interest and my response variable?
>> For example, educational researchers might want to learn about the best predictors of success in high school.
Sociologists may want to find out which of multiple social indicators best predict whether or not a new immigrant group will adapt to their new country of residence.
Biologists may want to find out which factors, such as temperature, barometric pressure, or humidity, best predict caterpillar reproduction.
So how can multivariate models help us to evaluate the presence or absence of confounding or lurking variables?
Since the difficulty arises because the lurking variable's values are tied in with those of the explanatory variable, one way to attempt to unravel the true nature of the relationship between explanatory and response variables is to separate out the effects of the lurking variable. >> You may have already identified a significant relationship between your explanatory and response variables, and now want to think about whether this is a real relationship, or whether instead the relationship is confounded by one or more lurking variables. >> For example, here's a graphical association between birth order and the number of cases of Down syndrome per 100,000 live births. As you can see, it looks like a linear association, where the first born in a family has the lowest likelihood of having Down syndrome.
With later birth order, up to a fifth-born child, there's increased risk of being born with Down syndrome. This is a statistically significant association when analyzed via a chi-square test of independence, with birth order as the categorical explanatory variable and the presence or absence of Down syndrome as the two-level categorical response variable.
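The chi-square statistic for a table like this can be computed by hand. The counts below are invented purely to illustrate the arithmetic; they are not the actual Down syndrome data:

```python
# Chi-square test of independence on a 2 x 5 contingency table:
# rows = Down syndrome yes/no, columns = birth order 1..5.
# All counts are hypothetical.
observed = [
    [60, 70, 85, 100, 120],               # Down syndrome cases
    [99940, 99930, 99915, 99900, 99880],  # unaffected births
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count under independence of row and column variables.
        expected = row_totals[i] * col_totals[j] / grand
        chi_sq += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"chi-square = {chi_sq:.2f} on {df} degrees of freedom")
```

In practice you would hand the same table to a statistics package (for example, `scipy.stats.chi2_contingency`) to also get the p-value; the computation above is only the statistic itself.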
Another statistically significant relationship is the association between maternal age at a child's birth and the likelihood that the child will have Down syndrome. You can see here that babies of younger women, up to about age 29, and even those of mothers aged 30 to 34, have really low rates of Down syndrome. Among mothers aged 35 to 39 and older, you see the rates are clearly higher.
>> Remember, in the case of a confounding variable, the observed association with the response variable should be attributed to the confounder rather than the explanatory variable. We test for confounders by including these third variables, or fourth or fifth, in our statistical models, where they may explain the association of interest.
>> In these examples, it's possible that the association between a child's birth order and risk for Down syndrome could be confounded by maternal age. Alternately, the association between maternal age and Down syndrome might be confounded by birth order, or both birth order and maternal age might independently predict the likelihood of a diagnosis of Down syndrome after controlling for each other. Here's a graph that answers this question by showing that maternal age confounds the relationship between birth order and Down syndrome, and that it's really maternal age, rather than birth order, that's associated with Down syndrome.
Here, you see birth order along the horizontal axis, the maternal age groups along the z-axis, and, on the y-axis, cases of Down syndrome per 100,000 live births. If we look across birth order separately for each maternal age group, we see that there really is no difference in rates of Down syndrome by birth order. In other words, once we control for the age of the mother, that is, examine the rates of Down syndrome across birth order for one maternal age group at a time, there's no association between birth order and Down syndrome.
If we look at rates of Down syndrome across maternal age for each individual birth order, we see an upward trend as maternal age increases.
So if you look across these colors, this is a great graphical representation in which we see that it isn't birth order that is associated with Down syndrome, it's maternal age. In other words, once we control for birth order, there's still an association between maternal age and Down syndrome. Birth order does not confound the relationship between maternal age and Down syndrome; the relationship holds even after controlling for birth order.
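The pattern in the graph, flat across birth order within each maternal age group but rising with maternal age at every birth order, can be sketched with hypothetical rates (the numbers below are illustrative, not the actual study data):

```python
# Hypothetical rates of Down syndrome per 100,000 live births, indexed
# by maternal age group and birth order (1 through 5).
rates = {
    "<30":   {1: 60,  2: 61,  3: 59,  4: 60,  5: 61},
    "30-34": {1: 110, 2: 112, 3: 109, 4: 111, 5: 110},
    "35+":   {1: 350, 2: 348, 3: 352, 4: 351, 5: 349},
}

# Within each age group, rates barely vary by birth order ...
for age, by_order in rates.items():
    spread = max(by_order.values()) - min(by_order.values())
    print(age, "spread across birth order:", spread)

# ... but at every birth order, rates climb steeply with maternal age.
for order in range(1, 6):
    trend = [rates[a][order] for a in ["<30", "30-34", "35+"]]
    print("birth order", order, "rates by age group:", trend)
```

Flat rows with steeply rising columns are exactly the signature of maternal age, not birth order, driving the association.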