Tuesday, June 7, 2016

Introduction to Regression

https://www.coursera.org/learn/regression-modeling-practice/home/week/1


This session starts where the Data Analysis Tools course left off. This first set of videos provides some conceptual background about the major types of data you may work with, which will increase your competence in choosing the statistical analysis that is most appropriate given the structure of your data, and in understanding the limitations of your data set. We also introduce the concept of confounding variables, which are variables that may be the reason for the association between your explanatory and response variables. Finally, you will gain experience in describing your data by writing about your sample, the study's data collection procedures, and your measures and data management steps.

Lesson 1: Observational Data



Lesson 2: Experimental Data


So what if we go back to our question asking whether the explanatory variable was manipulated, and the answer is yes? Data that come from studies in which the explanatory variable is manipulated are called experimental data. Experimental data come from studies in which groups of observations are either pre-selected or randomly assigned to the values of an explanatory variable and then observed on some response variable. There are two major types of experimental studies: true experimental studies and quasi-experimental studies. There are three components of a true experimental study. First, only one explanatory variable is manipulated, meaning that all other variables that could also be related to the response variable are held constant; the only thing that changes is the value of the explanatory variable being manipulated by the experimenter. Second, there must be a control group against which other values of the explanatory variable are compared on the response variable. And third, observations must be randomly assigned to values of the explanatory variable. This means that every observation starts out with an equal probability of being in each group and is then randomly chosen to end up in one group or another. For example, an agricultural researcher might be interested in determining the effect of a new fertilizer on plant growth. In this study, each plant is an observation, fertilizer application is the explanatory variable, and plant growth is the response variable.
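As a quick sketch of what random assignment might look like in the fertilizer example (my own illustration, not code from the course; the plant names and group labels are made up), each plant gets an equal chance of receiving the new fertilizer or serving as a control:

import random

random.seed(42)  # fixed seed so this sketch is reproducible

# Each plant is one observation; the fertilizer is the explanatory variable.
plants = ["plant_%d" % i for i in range(1, 21)]
assignment = {p: random.choice(["new_fertilizer", "control"]) for p in plants}

# Plant growth would later be measured as the response variable in each group.
n_treated = sum(g == "new_fertilizer" for g in assignment.values())
print(n_treated, "of", len(plants), "plants were assigned to the new fertilizer")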
Random assignment is another way we can control for these other factors. The idea is that if every observation in the sample has an equal probability of being in each of the groups, and truly ends up in one group or another at random, then the groups end up balanced in terms of the other factors. So if age is a factor, then the groups should have the same age variability, and this equal variability essentially controls for that factor. The same should be the case for any other factor. However, randomization doesn't always work the way we want it to. In fact, randomization works best as your sample size approaches infinity. Unfortunately, we work with finite samples, which can often be pretty small. The smaller the sample, the greater the risk that the groups will be unbalanced on factors that could change how the treatment affects the response variable. If part of your job as a data analyst is to evaluate data from studies with random assignment, one of the first things you'll want to do is check for any imbalances between your treatment and control groups on key variables that could change how the treatment affects the response variable. If imbalances are identified, then those variables can be included in the statistical model used to predict the response variable, so that they can be statistically controlled.

Statistical control is another commonly used strategy. If we include additional explanatory variables that could affect the association between the treatment and the response, then we can examine that association after adjusting for the other explanatory variables. While these are all good strategies for imposing as much control on a study as possible, they're not perfect, nor can we possibly control for everything that could affect the association between the treatment and response variable. For that reason, unlike a true experiment in which we are able to hold every other possible variable constant, we cannot determine causality; we can only determine whether the treatment is associated with the response variable.

Sometimes we can't randomly assign people to a treatment or control group. In many cases, it would be unethical to do so. For example, if we're conducting a study to examine the association between cocaine use and memory processing, there's no way we could assign some participants to use cocaine. This would be completely unethical and would put our participants at significantly greater risk of harm, which certainly would not outweigh the benefit of any knowledge gained by the study. Instead, we would have to identify people who either test positive for or self-report cocaine use, and then test for memory-processing differences between users and non-users. Rather than manipulating the explanatory variable, our treatment and control groups are pre-selected: in this study, cocaine users would be in our treatment group and non-users would be in our control group. So while it looks like an experimental design, it is missing the random assignment piece, and we call this a quasi-experimental design. We can increase the rigor of a quasi-experimental design by measuring as many confounding variables as possible, having a control group, and using a pre- and post-test design whenever possible. A quasi-experimental design will not allow us to infer causality between an explanatory variable and our response variable.
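To make the balance check and statistical control described above concrete, here is a minimal sketch (not from the course; the column names group, age, and response, and the simulated data, are hypothetical) that first compares a key covariate across the assigned groups and then adjusts for it in a regression model:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: 'group' is the assigned condition, 'age' is a covariate
# that might be unbalanced, and 'response' is the outcome of interest.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "group": rng.choice(["treatment", "control"], size=n),
    "age": rng.normal(45, 12, size=n),
})
df["response"] = 2.0 * (df["group"] == "treatment") + 0.1 * df["age"] + rng.normal(0, 1, size=n)

# 1) Check for imbalance: compare the covariate across groups before modeling.
print(df.groupby("group")["age"].mean())

# 2) Statistical control: include the covariate as an additional explanatory
#    variable so the treatment effect is estimated after adjusting for age.
model = smf.ols("response ~ C(group) + age", data=df).fit()
print(model.params)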

Lesson 3: Confounding Variables


Consider an observational study of people trying to quit, in which each person chooses one of several methods: drugs, therapy, a combination of both, or neither. The explanatory variable is the method, while the response variable is eventual success or failure in quitting. Our study shows that the percentage succeeding with the combination drug-and-therapy method was highest, while the percentage succeeding with neither therapy nor drugs was lowest. In this example, there is clear evidence of an association between the method used and the success rate. Can we conclude that the combination drug-and-therapy method causes success more than using neither therapy nor drugs? It is at precisely this point that we confront the underlying weakness of most observational studies. Some members of the sample have opted for certain values of the explanatory variable, the method of quitting, while others have opted for other values. Those individuals may be different in additional ways that also play a role in the response of interest. For instance, suppose older people are more likely to choose certain methods to quit, and suppose older people in general tend to be more successful in quitting than younger people. The data would make it appear that the method itself was responsible for success, whereas in truth it may just be that being older is the reason for success. We can express this scenario in terms of the key variables involved. In addition to the explanatory variable (method) and the response variable (success or failure), a third lurking variable, age, is tied in, or confounded, with the explanatory variable's values and may itself cause the response to be success or failure. We could control for the lurking variable age by studying older and younger adults separately. Then, if both older and younger adults who chose one method have higher success rates than those opting for another method, we would be closer to producing evidence of causation.
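A small illustration of the stratification idea (my own toy example with made-up data, not from the course): computing the success rate for each quitting method separately within younger and older participants shows whether the association survives once age is held roughly constant:

import pandas as pd

# Hypothetical records: quitting method, age group, and whether the person succeeded.
df = pd.DataFrame({
    "method":    ["drugs+therapy"] * 4 + ["neither"] * 4,
    "age_group": ["younger", "younger", "older", "older"] * 2,
    "success":   [1, 0, 1, 1, 0, 0, 1, 0],
})

# Success rate by method within each age stratum; if drugs+therapy still does
# better in both strata, age alone cannot explain the association.
print(df.groupby(["age_group", "method"])["success"].mean())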

Lesson 4: Introduction to Multivariate Methods

