---
title: "Week 07 lab: Logistic regression workflow"
subtitle: "Binary outcomes, odds ratios, predicted probabilities, and model checks"
format:
  html:
    toc: true
    embed-resources: true
execute:
  echo: true
  warning: false
  message: false
---

## Goal

By the end of this lab you will be able to:

1. fit a logistic regression model with `glm(..., family = binomial)`,
2. interpret logistic-regression coefficients as log-odds and odds ratios,
3. compute and plot predicted probabilities with confidence intervals, and
4. compare and check logistic regression models.

## Reporting reminder

As in previous labs, each exercise should leave behind two things: code that reproduces the analysis, and a short write-up in your own words. For logistic regression, the most readable interpretation is often not the raw coefficient table. Use the model output, odds ratios, and predicted probabilities together to make a clear statistical judgment.

## 0) Setup

Put this file next to the worksheet:

- `week07_lab_titanic.csv`

Load packages:

```{r}
library(tidyverse)
library(broom)
library(performance)
```

This week we will work with a cleaned Titanic passenger dataset. Each row represents one passenger from a classic Titanic contingency table. The crew rows were removed, and the remaining data were expanded into one row per passenger. The `passenger_id` column is only an arbitrary row identifier.

The outcome is `survived`, coded as `1` for survived and `0` for died. Logistic regression models the probability that the outcome equals `1`, so here we will model the probability of survival.

```{r}
titanic <- read.csv("week07_lab_titanic.csv") |>
  mutate(
    pclass = factor(pclass, levels = c("1st", "2nd", "3rd")),
    sex = factor(sex, levels = c("male", "female")),
    age_group = factor(age_group, levels = c("adult", "child")),
    survived = as.integer(survived)
  )

glimpse(titanic)
summary(titanic)
```

## 1) Exercise 1

Start with the observed survival rates before fitting any model. Create a table showing the proportion of passengers who survived within each combination of `sex` and `pclass`. Then make a plot that shows the same pattern.

```{r}
#| lab: student
#| include: true
survival_rates <- NULL
```

Questions:

- Which group has the highest observed survival rate?
- Which group has the lowest observed survival rate?
- Why can this problem not be solved with ordinary linear regression?

*Put your answer here*

## 2) Exercise 2

Fit a first logistic regression model using only `sex` as a predictor. Store the model as `fit_sex`, print the model summary, and then make an odds-ratio table.

The function `glm()` fits generalized linear models. The argument `family = binomial` tells R that the outcome is binary and that the model should use the logit link. In the printed summary, coefficients are on the log-odds scale. Exponentiating them converts them to odds ratios. You can build the odds-ratio table using any method you find clear; the essential operation is `exp()`.

```{r}
#| lab: student
#| include: true
fit_sex <- NULL
```

Questions:

- What does the intercept refer to in this model?
- What does the `sexfemale` coefficient say on the log-odds scale?
- What does the corresponding odds ratio say in plain language?

*Put your answer here*

## 3) Exercise 3

Now add passenger class and age group. Fit an additive model called `fit_add`:

`survived ~ sex + pclass + age_group`

Compare `fit_sex` and `fit_add` with `anova(fit_sex, fit_add, test = "Chisq")`. For GLMs, this comparison is a likelihood-ratio test: it asks whether the larger model improves fit enough to justify the extra terms.

```{r}
#| lab: student
#| include: true
fit_add <- NULL
```

Make an odds-ratio table for `fit_add`. Use it to interpret the effects of class and age group after accounting for sex.

```{r}
#| lab: student
#| include: true
or_add <- NULL
```

Next, compute predicted probabilities for adult passengers in each combination of `sex` and `pclass`.

The prediction grid is provided below. When you only need point predictions, `predict(..., type = "response")` returns fitted probabilities directly.

```{r}
#| lab: student
#| include: true
pred_add <- tidyr::expand_grid(
  sex = levels(titanic$sex),
  pclass = levels(titanic$pclass),
  age_group = "adult"
)
```

Questions:

- Does adding class and age group improve on the sex-only model?
- How would you interpret the odds ratios for `pclass2nd` and `pclass3rd`?
- What does the predicted-probability table add beyond the coefficient table?

*Put your answer here*

## 4) Exercise 4

The additive model assumes that the effect of passenger class is the same for male and female passengers on the log-odds scale. Now fit an interaction model called `fit_int`:

`survived ~ sex * pclass + age_group`

Compare `fit_add` and `fit_int` with a likelihood-ratio test. Then use the provided prediction grid to plot predicted probabilities for adult passengers from the interaction model.

```{r}
#| lab: student
#| include: true
fit_int <- NULL
pred_int <- tidyr::expand_grid(
  sex = levels(titanic$sex),
  pclass = levels(titanic$pclass),
  age_group = "adult"
)
```

Finally, do a compact model check for `fit_int`. Print a few fit statistics, compute Nagelkerke's pseudo-R2, make the built-in residuals-vs-fitted plot, and then use a binned residual plot.

The function `broom::glance()` gives one-row model summaries such as deviance and AIC. The function `performance::r2_nagelkerke()` reports a pseudo-R2 for logistic regression. The command `plot(fit_int, which = 1)` asks R for the residuals-vs-fitted diagnostic plot. The function `performance::binned_residuals()` groups observations with similar predicted probabilities and checks whether the average residuals stay near zero.

```{r}
#| lab: student
#| include: true
binned_int <- NULL
```

Questions:

- Does the interaction model improve on the additive model?
- How does the class pattern differ between male and female passengers?
- What do the model-checking outputs suggest about the fit of the interaction model?
- If you had to report one model, which one would you report and why?

*Put your answer here*

## 5) Checklist (before you submit anything)

- Check that `survived == 1` is the event your model is predicting.
- Interpret coefficients as log-odds only when needed; use odds ratios and predicted probabilities for clearer reporting.
- Use likelihood-ratio tests to compare nested logistic models.
- Treat pseudo-R2 and diagnostic plots as model checks, not as direct analogues of ordinary linear-regression diagnostics.
- For logistic regression, binned residuals are often easier to read than raw residuals.