---
title: "Week 02 lab 2: Correlation"
subtitle: "Correlation practice worksheet"
format:
  html:
    toc: true
    embed-resources: true
execute:
  echo: true
  warning: false
  message: false
---

## General remarks

This week, try to report the results of your statistical analyses in a more "professional" way. A good place to start is the APA standard (or "APA style"). There are many materials on the web, but I like this short guide from the University of Washington (read it!): https://psych.uw.edu/storage/writing_center/stats.pdf

## 0) Setup

Data files used in this worksheet (place them next to this `.qmd`):

- `infmort.xlsx`

## Sub-Saharan Africa and infant mortality

In Sub-Saharan Africa, more than half of mothers lose at least one child before the child's first birthday. The file `infmort.xlsx` contains data on 36 countries in the region: country, infant mortality, per capita income (in U.S. dollars), percentage of births to mothers under 20, percentage of births to mothers over 40, percentage of births less than 2 years apart, percentage of married women using contraception, and percentage of women with unmet family planning need.

## Exercise 0 (prerequisite, not graded)

Load the data from the `xlsx` file. To do that, you will need to find an appropriate package. Take this opportunity to do some research on loading different types of files into R. What files can you load? What are the most comprehensive packages for doing that? Can you, for example, load SPSS, SAS or Stata files? How? What about more exotic file types? Let's say you need to work with HDF5 files. Can you find (and install!) a package that makes this possible?

```{r}
#| lab: student
#| include: true
# Put your code here
```

## Exercise 1

Make a scatterplot of `InfMort` and `Income` and draw the line that best fits the data (the regression line). We did not cover this in class, so treat it as an opportunity to do independent research into the capabilities of R!

```{r}
#| lab: student
#| include: true
# Put your code here
```

*Put your answer here*

## Exercise 2

Calculate the correlations among all numeric variables in the dataset. Prepare a visualization of the resulting correlation matrix. What are the strongest predictors of infant mortality?

```{r}
#| lab: student
#| include: true
# Put your code here
```

*Put your answer here*

## Exercise 3

Run a statistical test for the two strongest predictors of infant mortality and describe the results of your analysis. What can you conclude from these findings? What are the limitations of the data?

```{r}
#| lab: student
#| include: true
# Put your code here
```

*Put your answer here*

## Exercise 4

How large a correlation would you need for the relationships shown in the previous exercises to be significant?

For the Pearson correlation test, use the test statistic:

$$
t = \frac{r\sqrt{N-2}}{\sqrt{1-r^2}}
$$

where $N$ is the sample size and $df = N - 2$.

Do this exercise in **both** ways:

1. Algebraic manipulation of the formula above.
2. Numerical solution with `uniroot`.

Then compare the two thresholds and relate them to the observed correlation from your data.

How `uniroot()` works (quick intuition):

- You define a function whose root you want (here: left side minus right side of the threshold equation).
- You give an interval where the function changes sign (one end positive, the other negative).
- `uniroot()` repeatedly narrows that interval until it finds the value where the function is approximately 0.
- The result is returned as a list; the actual solution is in `$root`.

```{r}
#| lab: student
#| include: true
# Put your code here
```

*Put your answer here*

## Exercise 5

If you have multivariate data at hand, the `cor` function is useful but has some limitations. For example, in contrast to `cor.test`, it does not compute p-values or confidence intervals. Of course, you can run `cor.test` separately for every pair of variables, but that is rather cumbersome.
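
To make the contrast concrete, here is a minimal sketch on R's built-in `mtcars` data. That dataset and the three columns picked from it are assumptions made purely for illustration; they are not the worksheet's data. The sketch shows that `cor()` returns only the coefficients, while `cor.test()` adds the p-value and confidence interval but handles one pair at a time.

```{r}
# Illustration only: built-in mtcars data, not infmort.xlsx
vars <- mtcars[, c("mpg", "wt", "hp")]

# cor() returns the matrix of Pearson coefficients and nothing else
r_mat <- cor(vars)
print(round(r_mat, 2))

# cor.test() handles a single pair, but reports t, df, p-value and a 95% CI
ct <- cor.test(vars$mpg, vars$wt)
print(ct$estimate)  # Pearson r for mpg vs wt
print(ct$p.value)   # p-value for H0: rho = 0
print(ct$conf.int)  # 95% confidence interval for rho

# Covering every pair by hand means looping over all column combinations
pairs <- combn(names(vars), 2)
p_values <- apply(pairs, 2, function(p) {
  cor.test(vars[[p[1]]], vars[[p[2]]])$p.value
})
names(p_values) <- paste(pairs[1, ], pairs[2, ], sep = "-")
print(p_values)
```

With three variables this is still manageable, but the number of pairs grows quickly as columns are added.
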
Several packages aim to streamline this process. My favorite is the `rstatix` package, which contains the `cor_test()` function. Using this function, calculate the correlations among all numeric variables from the previous exercises.

Scaffolding for this task:

1. Select only numeric variables from your working dataset.
2. Run `rstatix::cor_test()` on that numeric table.
3. Sort results by absolute value of correlation.
4. Print a compact table with at least: variable names, correlation, test statistic, p-value, and confidence interval.
5. Briefly compare the strongest associations with what you found in Exercises 2 and 3.

How `rstatix::cor_test()` works (quick intuition):

- You pass a data frame (usually numeric columns only).
- The function runs correlation tests and returns a tidy table (one row per variable pair).
- Typical output columns include variable names, `cor`, test statistic, `p`, and confidence interval bounds.
- Because the output is already a data frame, it is easy to sort/filter (for example, by `abs(cor)`).

```{r}
#| lab: student
#| include: true
# Put your code here
```

*Put your answer here*

## Exercise 6 (SHARKS)

We focused on testing a null hypothesis stating that the "true" correlation (population correlation) is equal to 0 ($H_0: \rho = 0$). There are, however, other interesting hypotheses about the correlation that we may want to test. For example, we can ask whether two population correlations differ ($H_0: \rho_1 = \rho_2$) or whether the population correlation is different from some specified value other than 0 (e.g. $H_0: \rho = 0.1$). To run those tests, you need to implement them yourself (hard) or find an appropriate package (easy).

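
As background, the usual textbook route to such tests is Fisher's $z$-transformation. The sketch below states it for a single correlation tested against a hypothesised value $\rho_0$. It is offered for orientation rather than as the method this worksheet requires, and it assumes approximately bivariate-normal data; $N$ is the sample size, as in Exercise 4.

$$
z_r = \operatorname{artanh}(r) = \tfrac{1}{2}\ln\frac{1+r}{1-r},
\qquad
Z = \left(z_r - z_{\rho_0}\right)\sqrt{N-3} \;\approx\; \mathcal{N}(0, 1) \quad \text{under } H_0: \rho = \rho_0
$$

Note that two correlations which share a variable and come from the same sample are dependent, so a comparison that treats them as coming from independent samples does not apply directly.
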
### a) Run a statistical test to determine whether the two highest correlations between demographic variables and infant mortality are different

```{r}
#| lab: student
#| include: true
# Put your code here
```

*Put your answer here*

### b) Run a statistical test to determine whether the highest correlation differs significantly from $0.1$

```{r}
#| lab: student
#| include: true
# Put your code here
```

*Put your answer here*