--- title: "Week 07 lab: Logistic regression workflow" subtitle: "Binary outcomes, odds ratios, predicted probabilities, and model checks" format: html: toc: true embed-resources: true execute: echo: true warning: false message: false --- ## Goal By the end of this lab you will be able to: 1. fit a logistic regression model with `glm(..., family = binomial)`, 2. interpret logistic-regression coefficients as log-odds and odds ratios, 3. compute and plot predicted probabilities with confidence intervals, and 4. compare and check logistic regression models. ## Reporting reminder As in previous labs, each exercise should leave behind two things: code that reproduces the analysis, and a short write-up in your own words. For logistic regression, the most readable interpretation is often not the raw coefficient table. Use the model output, odds ratios, and predicted probabilities together to make a clear statistical judgment. ## 0) Setup Put this file next to the worksheet: - `week07_lab_titanic.csv` Load packages: ```{r} library(tidyverse) library(broom) library(performance) ``` This week we will work with a cleaned Titanic passenger dataset. Each row represents one passenger from a classic Titanic contingency table. The crew rows were removed, and the remaining data were expanded into one row per passenger. The `passenger_id` column is only an arbitrary row identifier. The outcome is `survived`, coded as `1` for survived and `0` for died. Logistic regression models the probability that the outcome equals `1`, so here we will model the probability of survival. ```{r} titanic <- read.csv("week07_lab_titanic.csv") |> mutate( pclass = factor(pclass, levels = c("1st", "2nd", "3rd")), sex = factor(sex, levels = c("male", "female")), age_group = factor(age_group, levels = c("adult", "child")), survived = as.integer(survived) ) glimpse(titanic) summary(titanic) ``` ## 1) Exercise 1 Start with the observed survival rates before fitting any model. Create a table showing the proportion of passengers who survived within each combination of `sex` and `pclass`. Then make a plot that shows the same pattern. ```{r} #| lab: student #| include: true survival_rates <- NULL ``` Questions: - Which group has the highest observed survival rate? - Which group has the lowest observed survival rate? - Why can this problem not be solved with ordinary linear regression? *Put your answer here* ## 2) Exercise 2 Fit a first logistic regression model using only `sex` as a predictor. Store the model as `fit_sex`, print the model summary, and then make an odds-ratio table. The function `glm()` fits generalized linear models. The argument `family = binomial` tells R that the outcome is binary and that the model should use the logit link. In the printed summary, coefficients are on the log-odds scale. Exponentiating them converts them to odds ratios. You can build the odds-ratio table using any method you find clear; the essential operation is `exp()`. ```{r} #| lab: student #| include: true fit_sex <- NULL ``` Questions: - What does the intercept refer to in this model? - What does the `sexfemale` coefficient say on the log-odds scale? - What does the corresponding odds ratio say in plain language? *Put your answer here* ## 3) Exercise 3 Now add passenger class and age group. Fit an additive model called `fit_add`: `survived ~ sex + pclass + age_group` Compare `fit_sex` and `fit_add` with `anova(fit_sex, fit_add, test = "Chisq")`. For GLMs, this comparison is a likelihood-ratio test: it asks whether the larger model improves fit enough to justify the extra terms. ```{r} #| lab: student #| include: true fit_add <- NULL ``` Make an odds-ratio table for `fit_add`. Use it to interpret the effects of class and age group after accounting for sex. ```{r} #| lab: student #| include: true or_add <- NULL ``` Next, compute predicted probabilities for adult passengers in each combination of `sex` and `pclass`. The prediction grid is provided below. When you only need point predictions, `predict(..., type = "response")` returns fitted probabilities directly. ```{r} #| lab: student #| include: true pred_add <- tidyr::expand_grid( sex = levels(titanic$sex), pclass = levels(titanic$pclass), age_group = "adult" ) ``` Questions: - Does adding class and age group improve on the sex-only model? - How would you interpret the odds ratios for `pclass2nd` and `pclass3rd`? - What does the predicted-probability table add beyond the coefficient table? *Put your answer here* ## 4) Exercise 4 The additive model assumes that the effect of passenger class is the same for male and female passengers on the log-odds scale. Now fit an interaction model called `fit_int`: `survived ~ sex * pclass + age_group` Compare `fit_add` and `fit_int` with a likelihood-ratio test. Then use the provided prediction grid to plot predicted probabilities for adult passengers from the interaction model. ```{r} #| lab: student #| include: true fit_int <- NULL pred_int <- tidyr::expand_grid( sex = levels(titanic$sex), pclass = levels(titanic$pclass), age_group = "adult" ) ``` Finally, do a compact model check for `fit_int`. Print a few fit statistics, compute Nagelkerke's pseudo-R2, make the built-in residuals-vs-fitted plot, and then use a binned residual plot. The function `broom::glance()` gives one-row model summaries such as deviance and AIC. The function `performance::r2_nagelkerke()` reports a pseudo-R2 for logistic regression. The command `plot(fit_int, which = 1)` asks R for the residuals-vs-fitted diagnostic plot. The function `performance::binned_residuals()` groups observations with similar predicted probabilities and checks whether the average residuals stay near zero. ```{r} #| lab: student #| include: true binned_int <- NULL ``` Questions: - Does the interaction model improve on the additive model? - How does the class pattern differ between male and female passengers? - What do the model-checking outputs suggest about the fit of the interaction model? - If you had to report one model, which one would you report and why? *Put your answer here* ## 5) Checklist (before you submit anything) - Check that `survived == 1` is the event your model is predicting. - Interpret coefficients as log-odds only when needed; use odds ratios and predicted probabilities for clearer reporting. - Use likelihood-ratio tests to compare nested logistic models. - Treat pseudo-R2 and diagnostic plots as model checks, not as direct analogues of ordinary linear-regression diagnostics. - For logistic regression, binned residuals are often easier to read than raw residuals.