library(broom)
library(gt)
library(gtsummary)
library(ggrepel)
library(kableExtra)
library(haven)
library(car)
library(GGally)
library(Hmisc)
library(janitor)
library(knitr)
library(mosaic)
library(naniar)
library(patchwork)
library(sessioninfo)
library(tidyverse)
opts_chunk$set(comment=NA)
theme_set(theme_bw())
options(dplyr.summarise.inform = FALSE)5 Cumulative smoking years and BMI among women in Texas
Using Behavioral Risk Factors Survey System (BRFSS)
6 Setup and Data Ingest
This study aims to explore and measure the association between cumulative smoking years and body mass index among women in Texas. The study sample was extracted from the Behavioral Risk Factor Surveillance System (BRFSS) which is a national telephone based survey that collects state-level data. The target population (aged 18 years and older) for cellular telephone samples in 2022 consists of people residing in a private residence or college housing who have a working cellular telephone. Our data set will only include adult women below the age of 65 whom were Texas residents in 2022. The outcome of interest is body mass index (bmi) and the key predictor is the number of years the subjects smoked cigarettes (smoke_years). Other factors will be observed, including race category, weight and average drinking per day.
6.1 Initial Setup and Package Loads in R
6.2 Loading the Raw Data into R
We will load the BRFSS data set.
BRFSS <- read_xpt("C:/Users/albal/OneDrive/Desktop/MPHP 431/BRFSS/LLCP2022.XPT")7 Cleaning the Data
7.1 The Raw Data
Because the original data set is big, we will narrow down the raw data set to only include the variables of interest to our work.
BRFSS_study <- BRFSS|>
select(SEQNO, `_STATE`,`_SEX`,`_AGE65YR`, `_RFBMI5`, `_BMI5`, `_BMI5CAT`, WEIGHT2,`_YRSSMOK`, `_RFSMOK3`, SMOKDAY2, AVEDRNK3, `_RFDRHV8`, `_IMPRACE`)
glimpse(BRFSS_study)Rows: 445,132
Columns: 14
$ SEQNO <chr> "2022000001", "2022000002", "2022000003", "2022000004", "20…
$ `_STATE` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ `_SEX` <dbl> 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2,…
$ `_AGE65YR` <dbl> 2, 2, 1, 3, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 3, 2, 2, 1,…
$ `_RFBMI5` <dbl> 9, 2, 2, 1, 1, 2, 1, 2, 9, 2, 2, 1, 2, 2, 2, 2, 1, 2, 1, 1,…
$ `_BMI5` <dbl> NA, 2657, 2561, 2330, 2177, 2608, 2296, 2781, NA, 2905, 292…
$ `_BMI5CAT` <dbl> NA, 3, 3, 2, 2, 3, 2, 3, NA, 3, 3, 2, 3, 4, 3, 3, 2, 4, 2, …
$ WEIGHT2 <dbl> 9999, 150, 140, 140, 119, 187, 138, 162, 9999, 180, 165, 13…
$ `_YRSSMOK` <dbl> NA, NA, NA, 56, NA, NA, 42, NA, 1, NA, NA, NA, 8, NA, NA, N…
$ `_RFSMOK3` <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1,…
$ SMOKDAY2 <dbl> NA, NA, NA, 2, NA, NA, 3, NA, 3, NA, NA, NA, 3, NA, NA, NA,…
$ AVEDRNK3 <dbl> NA, NA, NA, NA, 2, NA, 2, NA, NA, 1, NA, NA, 2, 1, 1, NA, N…
$ `_RFDRHV8` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9,…
$ `_IMPRACE` <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1,…
The BRFSS_study includes 16 variables and 445,132 subjects. For each subject we selected the following variables:
- Demographic information including their state
_STATE, sex_SEX, age_AGE65YR, and race / ethnicity_IMPRACE - Their reported weight
WEIGHT2and the computed BMI level_BMI5and category_BMI5CAT - whether they are categorized as overweight
_RFBMI5 - The number of years they smoked cigarettes
_YRSSMOK - Their smoking frequency category
SMOKDAY2and whether they are current smokers or not_RFSMOK3 - Their average alcohol consumption per day
AVEDRNK3and whether they are categorized as heavy drinkers or not_RFDRHV8
7.2 Variable Filter
We will select our population of interest (Adult women, below the age of 65, residing in Texas) and filter other variables to include appropriate values for our analysis. We will also filter out missing values (such as un-answered and refused to answer categories):
TX <- BRFSS_study |>
filter( `_STATE` == 48, `_SEX`== 2, `_AGE65YR`== 1, WEIGHT2 < 0776, `_RFBMI5`< 3, `_RFSMOK3`< 3, SMOKDAY2 < 4, `_BMI5CAT` < 5 )|>
select( SEQNO, `_RFBMI5`, `_BMI5`, `_BMI5CAT`, WEIGHT2,`_YRSSMOK`, `_RFSMOK3`, SMOKDAY2, AVEDRNK3, `_RFDRHV8`, `_IMPRACE`)This is our initial data set
TX # A tibble: 1,164 × 11
SEQNO `_RFBMI5` `_BMI5` `_BMI5CAT` WEIGHT2 `_YRSSMOK` `_RFSMOK3` SMOKDAY2
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 20220009… 2 2663 3 170 11 1 3
2 20220009… 1 2001 2 124 28 1 3
3 20220010… 2 3004 4 175 41 2 1
4 20220010… 2 3986 4 225 11 1 3
5 20220010… 1 2455 2 143 23 2 1
6 20220010… 2 2887 3 163 NA 1 3
7 20220010… 1 2036 2 130 24 1 3
8 20220011… 2 2523 3 147 16 1 3
9 20220011… 2 4241 4 210 33 1 3
10 20220011… 2 4207 4 230 35 1 3
# ℹ 1,154 more rows
# ℹ 3 more variables: AVEDRNK3 <dbl>, `_RFDRHV8` <dbl>, `_IMPRACE` <dbl>
As we can see we need to adjust the decimal place of our outcome _BMI5. We will also rename our variables.
TX <- TX |>
mutate(bmi = `_BMI5`/100)|>
rename(race = `_IMPRACE`,
overweight = `_RFBMI5`,
bmi_category = `_BMI5CAT`,
weight = WEIGHT2,
smoke_years = `_YRSSMOK`,
smoke_currently = `_RFSMOK3`,
smoke_frequency = SMOKDAY2,
drink_daily = AVEDRNK3,
drink_heavy = `_RFDRHV8`)7.3 Which variables should be included in the tidy data set?
Our regression model will only include 5 predictors: smoke_years, race, weight, and drink_daily for our outcome of interest bmi.
7.4 Checking our outcome and key predictor
df_stats(~ bmi + smoke_years, data = TX) response min Q1 median Q3 max mean sd n missing
1 bmi 15.97 24.675 28.72 34.75 92.01 30.29957 8.247614 1164 0
2 smoke_years 1.00 11.000 21.00 32.00 56.00 22.18367 13.239163 1078 86
There are no missing observations in the outcome variables, but there are a few missing values in the main predictor variable. The bmi range extends from 16 to 65 with an average of 30, which is considered higher than the healthy level. Cumulative smoking years extends from 1 year to 56 years with an average of 19 years.
7.5 Checking the Quantitative Predictors
df_stats(~ weight + drink_daily, data = TX) response min Q1 median Q3 max mean sd n missing
1 weight 80 145 170 204.25 400 179.356529 49.01472 1164 0
2 drink_daily 1 1 2 3.00 99 4.560732 13.20164 601 563
There are no missing values in the weight values. The height range extends from 30.5 to 60.6 inches with low variation. This implies that the majority of our study sample is shorter than the average american woman. The weight has a high range extending from 80 pounds to 400 pounds, which indicates that a large proportion of our study sample is on the overweight side.
7.6 Checking the Categorical Variables
One of the predictors race is a non-ordinal, categorical variable which requires some modifications before analysis.
7.6.1 Race
First we will mutate the variable race into a factor then rename the variable categories:
TX$race <- as.factor(as.numeric(TX$race))Now we will re-level the categories.
TX$race <-
fct_relevel(TX$race,
"1", "2", "3", "4", "5", "6")Then, we will rename each category to clarify its meaning.
TX_b <- TX |>
mutate(
race = fct_recode(race,
"White" = "1",
"Black" = "2",
"Asian" = "3",
"American Indian" = "4",
"Hispanic" = "5",
"Other" = "6"))
tabyl(TX_b$race) TX_b$race n percent
White 807 0.693298969
Black 75 0.064432990
Asian 8 0.006872852
American Indian 15 0.012886598
Hispanic 229 0.196735395
Other 30 0.025773196
As we can see the Asian, American Indian, and Others categories have relatively low frequencies. So we will combine them under the Others category.
We will order our categories according their frequencies from the highest to lowest.
TX_b <- TX_b |>
mutate(race = fct_infreq(race))Then we will collapse the categories with the lowest frequencies under a category called Others and rename our variable from race to race_cat.
TX_b <- TX_b |>
mutate(race_cat = fct_lump_n(TX_b$race, n = 4))
tabyl(TX_b$race_cat) TX_b$race_cat n percent
White 807 0.69329897
Hispanic 229 0.19673540
Black 75 0.06443299
Other 53 0.04553265
The majority of the subjects are White, followed by Hispanic. The Black and Other races are remarkably lower. It is true that the unbalances distribution may impact external validity of our findings. However, it may also reflect the distribution of the target population being studied (women in Texas). It is important to note that the Other category includes Asians, Native Americans, and other races, all of which are non-Hispanic.
7.6.2 What about the subjects?
Each subject should have a sequential number SEQNO.
nrow(TX_b)[1] 1164
n_distinct(TX_b |> select(SEQNO))[1] 1164
The number of rows matches the number of SEQNO in our data set. We can rest assured that each subject has a sequential number.
7.7 Dealing with Missingness
The outcome does not have any missing observations. Neither do the predictors except for the key predictor smoke_years (86 NA) and drink_daily (563 NA).
miss_var_summary(TX_b)# A tibble: 13 × 3
variable n_miss pct_miss
<chr> <int> <num>
1 drink_daily 563 48.4
2 smoke_years 86 7.39
3 SEQNO 0 0
4 overweight 0 0
5 _BMI5 0 0
6 bmi_category 0 0
7 weight 0 0
8 smoke_currently 0 0
9 smoke_frequency 0 0
10 drink_heavy 0 0
11 race 0 0
12 bmi 0 0
13 race_cat 0 0
None of the subjects are missing more than 2 variables as demonstrated below.
miss_case_summary(TX_b)# A tibble: 1,164 × 3
case n_miss pct_miss
<int> <int> <dbl>
1 6 2 15.4
2 59 2 15.4
3 70 2 15.4
4 80 2 15.4
5 84 2 15.4
6 86 2 15.4
7 91 2 15.4
8 99 2 15.4
9 130 2 15.4
10 184 2 15.4
# ℹ 1,154 more rows
For this analysis we will assume that the observations are missing completely at random. In this case, using complete cases method to deal with the missing data is appropriate.
This is the complete data set for the analysis.
TX_CC <- TX_b |>
select(SEQNO, race_cat, overweight, bmi, bmi_category, weight, smoke_years, smoke_currently, smoke_frequency, drink_daily, drink_heavy) |>
drop_na()
TX_CC# A tibble: 565 × 11
SEQNO race_cat overweight bmi bmi_category weight smoke_years
<chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2022000985 White 2 26.6 3 170 11
2 2022001128 White 2 25.2 3 147 16
3 2022001266 White 2 25.1 3 185 33
4 2022001354 Hispanic 2 30.2 4 160 31
5 2022001400 Black 2 41.6 4 220 16
6 2022001422 White 2 25.8 3 160 25
7 2022001512 Hispanic 2 43.6 4 270 30
8 2022001541 White 2 31.6 4 196 50
9 2022001655 White 2 27.1 3 148 18
10 2022001845 White 2 26.6 3 165 47
# ℹ 555 more rows
# ℹ 4 more variables: smoke_currently <dbl>, smoke_frequency <dbl>,
# drink_daily <dbl>, drink_heavy <dbl>
We are left with a total of 565 subjects with complete observations of all variables.
8 Codebook and Data Description
8.1 The Codebook
| Variable | Initial Name | Type | Description/Levels |
|---|---|---|---|
SEQNO |
SEQNO | Character | Sequential number of study subjects |
race_cat |
_IMPRACE | 4 level Cat. | Race / ethnicity of respondent |
overweight |
_RFBMI5 | Binary | Adults women who have a body mass index greater than 25.00 (Overweight): yes or no |
bmi |
_BMI5 | Quantitative | Outcome: Computed body mass index |
bmi_category |
_BMI5CAT | 3 level Cat. | Computed body mass index categories: Underweight, Normal Weight, Overweight, Obese |
weight |
WEIGHT2 | Quantitative | Reported Weight in Pounds |
smoke_years |
_YRSSMOK | Quantitative | Key Predictor: Number of years respondent smoked cigarettes |
smoke_currently |
_RFSMOK3 | Binary | Adults who are current smokers: yes or no |
smoke_frequency |
SMOKDAY2 | 3 level Cat. | Reported smoking frequency: every day, some days, not at all |
drink_daily |
AVEDRNK3 | Quantitative | Average alcoholic drinks per day in past 30 days |
drink_heavy |
_RFDRHV8 | Binary | Heavy drinkers (adult women having more than 7 drinks per week): yes or no |
This table includes all variables used for both study 1 and study 2 analysis.
For this analysis (study 2) we will only use five predictors: smoke_years, weight, race_cat, and drink_daily.
Now we will create a subset from the complete cases set Texas_cc with the 5 variables of interest for study 2 analysis.
TX_analytic <- TX_CC |>
select(SEQNO, race_cat, bmi, smoke_years, weight, drink_daily) 8.2 Analytic Tibble
Our analytic tibble has 6 variables and 550 observations with complete cases.
TX_analytic# A tibble: 565 × 6
SEQNO race_cat bmi smoke_years weight drink_daily
<chr> <fct> <dbl> <dbl> <dbl> <dbl>
1 2022000985 White 26.6 11 170 2
2 2022001128 White 25.2 16 147 1
3 2022001266 White 25.1 33 185 2
4 2022001354 Hispanic 30.2 31 160 1
5 2022001400 Black 41.6 16 220 6
6 2022001422 White 25.8 25 160 1
7 2022001512 Hispanic 43.6 30 270 2
8 2022001541 White 31.6 50 196 1
9 2022001655 White 27.1 18 148 1
10 2022001845 White 26.6 47 165 2
# ℹ 555 more rows
We also need to confirm the presence of our tibble through this code because we are using df_print: paged in our YAML.
is_tibble(TX_analytic)[1] TRUE
8.3 Numerical Data Description
describe(TX_analytic |> select(-SEQNO)) |> html()Warning in png(file, width = 1 + k * w, height = h): 'width=13, height=13' are
unlikely values in pixels
5 Variables 565 Observations
race_cat
| n | missing | distinct |
|---|---|---|
| 565 | 0 | 4 |
Value White Hispanic Black Other Frequency 387 115 43 20 Proportion 0.685 0.204 0.076 0.035
bmi: COMPUTED BODY MASS INDEX
| n | missing | distinct | Info | Mean | pMedian | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 565 | 0 | 345 | 1 | 29.45 | 28.69 | 8.285 | 19.87 | 21.27 | 24.03 | 28.06 | 33.66 | 40.35 | 43.74 |
smoke_years: NUMBER OF YEARS SMOKED CIGARETTES
| n | missing | distinct | Info | Mean | pMedian | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 565 | 0 | 50 | 0.999 | 19.68 | 19 | 14.27 | 3.0 | 5.0 | 9.0 | 18.0 | 29.0 | 38.6 | 44.0 |
weight: REPORTED WEIGHT IN POUNDS
| n | missing | distinct | Info | Mean | pMedian | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 565 | 0 | 123 | 0.999 | 175.8 | 171 | 51.85 | 115.0 | 125.0 | 140.0 | 165.0 | 200.0 | 240.6 | 262.4 |
drink_daily: AVG ALCOHOLIC DRINKS PER DAY IN PAST 30
| n | missing | distinct | Info | Mean | pMedian | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 565 | 0 | 14 | 0.925 | 4.384 | 2 | 5.355 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 5.0 | 6.8 |
Value 1 2 3 4 5 6 7 8 10 12 15 77
Frequency 171 197 89 41 25 13 1 6 5 3 1 5
Proportion 0.303 0.349 0.158 0.073 0.044 0.023 0.002 0.011 0.009 0.005 0.002 0.009
Value 88 99
Frequency 5 3
Proportion 0.009 0.005
9 My Research Question
The impact of smoking cigarettes on health outcomes has been extensively researched throughout recent decades. However, the majority of these studies have been focused on men as the main consumers of cigarettes. Predatory tobacco marketing also targeted women by advertising cigarettes as an effective method for weight reduction. The latest national statistics also revealed a high prevalence of smoking among women from southern states. In this analysis, we aim to explore the impact of the number of years women smoked cigarettes on their body mass index. The body mass index is an important health outcome that highly relates to chronic diseases’ incidence. We will also observe the impact of other biological factors including weight and race / ethnicity, and average alcohol consumption per day as a behavioral factor.
To what extent can the number of years women smoked cigarettes (smoke_years) predict their bmi level, and can weight, race / ethnicity (race_cat), and daily alcohol consumption (drink_daily) improve the quality of our prediction?
10 Partitioning the Data
In the following steps, we will divide our analytic data set (TX_analytic) into two subsets called the training data (TX_training) and the test data (TX_test). The subjects will be assigned in either group randomly with 70% in the training set and 30% in the test set. We will set a seed (777) to allow future replication of the results.
set.seed(777)
TX_training <- TX_analytic |> slice_sample(prop = .70)
TX_test <- anti_join(TX_analytic, TX_training, by = "SEQNO")
dim(TX_analytic) [1] 565 6
dim(TX_training) [1] 395 6
dim(TX_test) [1] 170 6
The training set has 395 subjects and the test set has 170. The sum of the two sets equals 565 subjects which matches the original number of our analytic data set.
11 Transforming the Outcome
11.1 Visualizing the Outcome Distribution
We will visualize the outcome distribution using three plots: the histogram, normal Q-Q plot, and box plot with violin.
p1 <- ggplot(TX_training, aes(x = (bmi))) +
geom_histogram(bins = 30,
fill = "slateblue", col = "white")
p2 <- ggplot(TX_training, aes(sample = (bmi))) +
geom_qq(col = "slateblue") + geom_qq_line(col = "violetred") +
labs(y = "bmi", x = "Normal (0,1) quantiles") +
theme(aspect.ratio = 1)
p3 <- ggplot(TX_training, aes(x = "", y = (bmi))) +
geom_violin(fill = "slateblue", alpha = 0.1) +
geom_boxplot(fill = "slateblue", width = 0.3, notch = TRUE,
outlier.color = "slateblue", outlier.size = 3) +
labs(x = "") + coord_flip()
p1 + p2 - p3 +
plot_layout(ncol = 1, height = c(3, 2)) +
plot_annotation(title = "Body Mass Index (bmi)",
subtitle = str_glue("Model Development Sample: ", nrow(TX_training),
" Adult Women"))The outcome is clearly right skewed due to multiple extreme values (outliers). We will attempt to transform the outcome to improve the distribution.
11.2 Box-Cox function to assess the need for transformation of our outcome
We will use the Box-Cox function to see the suggested transformation to improve the skew of our outcome.
model_temp <- lm(bmi ~ smoke_years + race_cat + weight + drink_daily,
data = TX_training)
boxCox(model_temp)This plot provides a visual estimate of the suggested transformation of our outcome.
The code below can provide a numeric estimate:
powerTransform(model_temp)Estimated transformation parameter
Y1
0.6054038
The suggested transformation power is about 0.6 which suggests that the square root transformation could be the best approach in this case.
This is the scatter plot of the outcome bmi without transformation:
p1 <- ggplot(TX_training, aes(x = smoke_years , y = bmi)) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
geom_smooth(method = "lm", col = "red", formula = y ~ x, se = FALSE) +
labs(title = "bmi vs. years smoking")
p1Now will compare the scatter plots after transforming the outcome bmi alone and transforming both key predictor smoke_years and outcome bmi.
p2 <- ggplot(TX_training, aes(x = smoke_years, y = sqrt(bmi))) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
geom_smooth(method = "lm", col = "red", formula = y ~ x, se = FALSE) +
labs(title = "sqrt(bmi) vs. (years smoking)")
p3 <- ggplot(TX_training, aes(x = sqrt(smoke_years), y = sqrt(bmi))) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
geom_smooth(method = "lm", col = "red", formula = y ~ x, se = FALSE) +
labs(title = "sqrt(bmi) vs. sqrt(years smoking)")
p2 + p3Neither did applying the square root of the outcome alone nor the predictor and outcome together produce remarkable improvements, so we will continue using the data without transformation.
11.3 Numeric Summary of the outcome
We will observe the numeric summary of the outcome bmi in our training set:
favstats(~ bmi, data = TX_training) min Q1 median Q3 max mean sd n missing
16.04 23.805 28.13 33.44 65.23 29.39777 7.743024 395 0
The bmi level of the training sample ranges from 16.0 to 65.0. The average is 29.4 with a standard deviation of 7.7.
11.4 Numeric Summary of the Predictors
We will also observe the numeric summary of our predictors in the training sample:
Hmisc::describe(TX_training |>
select(race_cat, weight, smoke_years, drink_daily))select(TX_training, race_cat, weight, smoke_years, drink_daily)
4 Variables 395 Observations
--------------------------------------------------------------------------------
race_cat
n missing distinct
395 0 4
Value White Hispanic Black Other
Frequency 263 86 30 16
Proportion 0.666 0.218 0.076 0.041
--------------------------------------------------------------------------------
weight : REPORTED WEIGHT IN POUNDS
n missing distinct Info Mean pMedian Gmd .05
395 0 109 0.999 174.7 170 52.08 115.0
.10 .25 .50 .75 .90 .95
120.4 140.0 165.0 200.0 240.6 260.0
lowest : 91 94 100 105 106, highest: 313 320 330 357 380
--------------------------------------------------------------------------------
smoke_years : NUMBER OF YEARS SMOKED CIGARETTES
n missing distinct Info Mean pMedian Gmd .05
395 0 50 0.999 20.07 19.5 14.55 3
.10 .25 .50 .75 .90 .95
5 9 18 29 39 44
lowest : 1 2 3 4 5, highest: 46 47 48 50 51
--------------------------------------------------------------------------------
drink_daily : AVG ALCOHOLIC DRINKS PER DAY IN PAST 30
n missing distinct Info Mean pMedian Gmd .05
395 0 12 0.929 4.557 2 5.627 1
.10 .25 .50 .75 .90 .95
1 1 2 3 5 8
Value 1 2 3 4 5 6 8 10 12 77 88
Frequency 114 137 68 29 19 6 5 5 2 5 3
Proportion 0.289 0.347 0.172 0.073 0.048 0.015 0.013 0.013 0.005 0.013 0.008
Value 99
Frequency 2
Proportion 0.005
--------------------------------------------------------------------------------
- In terms of
race_cat, the sample is predominantly White, followed by Hispanic, Black, and Other (non-Hispanic) - The
weightof the subjects extends from 91 pounds to 380 pounds with an average of 174.5 pounds - The number of years the subjects smoked cigarettes (
smoke_years) ranges from 1 year to 51 years with an average of 20.1 years - The average alcoholic drinks per day (
drinks_daily) ranges from 1 drink per day to 99 drinks per day with an average of 4.9 drinks per day The last variable includes unrealistically high values which may have occurred due to subjects misunderstanding the questionnaire or an error when recording the answers.
11.5 Scatterplot Matrix
temp <- TX_training |>
select(race_cat, weight, smoke_years, drink_daily, bmi)
ggpairs(temp, title = "Scatterplot Matrix",
lower = list(combo = wrap("facethist", bins = 28)))As expected, the variable with the highest correlation (0.944) with the outcome is weight. The second highest is the key predictor smoke_years with a correlation of (0.044) followed by drink_daily (0.043). The two predictors with the highest correlation with each other are weight and smoke_years (0.063). The key predictor smoke_years has a slightly higher correlation with weight (0.063) than bmi (0.044). The predictor drink_daily has the least correlation with the outcome as well as other predictors.
Now we will observe our categorical predictor race_cat and the outcome of interest bmi:
favstats(bmi ~ race_cat, data = TX_training) race_cat min Q1 median Q3 max mean sd n missing
1 White 17.71 23.785 27.250 32.3650 65.23 28.70532 7.176522 263 0
2 Hispanic 16.04 25.845 29.695 36.4175 57.25 30.99837 7.860048 86 0
3 Black 18.29 23.400 28.430 40.8950 57.62 32.12800 11.088704 30 0
4 Other 16.50 23.175 28.295 30.3250 42.51 27.05750 6.600832 16 0
The Hispanic group has the highest median bmi, followed by the Black, Other, and White respectively. In this scenario, we preferred reporting the median instead of the mean due to the observed right skew in the outcome distribution. The Black group has the highest variation, which may be due to the low count of the two groups. The Hispanic and White group have similar variation (7.9 and 7.2 respectively). The Other group has the lowest count and lowest variation.
11.6 Collinearity Checking
As mentioned previously under the scatter plot matrix, the highest correlation (0.063) among predictors was found between weight and smoke_years. The correlation is not strong, therefore collinearity should not be an issue.
12 The Big Model
Now we will build the big model (AKA the kitchen sink model) which incorporates all the predictors. Our goal is to assess the effectiveness of the model in the training sample.
12.1 Fitting the kitchen sink model
The kitchen sink model predicts bmi using the key predictor smoke_years and four other predictors: weight, race_cat, and drink_daily.
model_big <- lm(bmi ~ smoke_years + race_cat + weight + drink_daily,
data = TX_training)summary(model_big)
Call:
lm(formula = bmi ~ smoke_years + race_cat + weight + drink_daily,
data = TX_training)
Residuals:
Min 1Q Median 3Q Max
-8.5123 -1.5223 0.0777 1.6020 7.9548
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.524513 0.500139 5.048 6.89e-07 ***
smoke_years -0.006080 0.009664 -0.629 0.5296
race_catHispanic 1.802500 0.302960 5.950 6.01e-09 ***
race_catBlack -0.675830 0.480843 -1.406 0.1607
race_catOther -0.287459 0.628599 -0.457 0.6477
weight 0.152161 0.002581 58.949 < 2e-16 ***
drink_daily 0.018491 0.009471 1.952 0.0516 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.435 on 388 degrees of freedom
Multiple R-squared: 0.9026, Adjusted R-squared: 0.9011
F-statistic: 599.2 on 6 and 388 DF, p-value: < 2.2e-16
This is a detailed summary of the kitchen sink (big) model.
12.2 Effect Sizes: Coefficient Estimates
We will tidy the coefficient estimates and calculate a 90% confidence interval to observe meaningful size and magnitude differences.
tidy(model_big, conf.int = TRUE, conf.level = 0.90) |>
select(term, estimate, std.error, conf.low, conf.high, p.value) |>
kable(dig = 4)| term | estimate | std.error | conf.low | conf.high | p.value |
|---|---|---|---|---|---|
| (Intercept) | 2.5245 | 0.5001 | 1.6999 | 3.3491 | 0.0000 |
| smoke_years | -0.0061 | 0.0097 | -0.0220 | 0.0099 | 0.5296 |
| race_catHispanic | 1.8025 | 0.3030 | 1.3030 | 2.3020 | 0.0000 |
| race_catBlack | -0.6758 | 0.4808 | -1.4686 | 0.1170 | 0.1607 |
| race_catOther | -0.2875 | 0.6286 | -1.3239 | 0.7490 | 0.6477 |
| weight | 0.1522 | 0.0026 | 0.1479 | 0.1564 | 0.0000 |
| drink_daily | 0.0185 | 0.0095 | 0.0029 | 0.0341 | 0.0516 |
The coefficients with highest absolute values are Hispanic race category (1.80), and Black race category (-0.68). However, only Hispanic race, weight, and smoke years have detectable significance in the big model.
12.3 Describing the Equation
The big model implies that the key predictor smoke_years:
For every year of smoking, the body mass index (bmi) is expected to decreased by 0.0061 points. (90% Confidence Interval: -0.0220 0.0099). Suppose we have two adult women, from Texas, from the same race, have the same weight, and alcohol daily intake. If one of them smoked cigarettes for a year longer than the other, her bmi level is predicted to be 0.0061 points lower than the lady that smoked a year less than she did. In this case, the key predictor does not yield a remarkable impact on the bmi level.
Race on the other hand has a higher impact. The Hispanic race, in particular, are predicted to have (1.80) (90% CI: 1.30, 2.30) increase in their bmi level compared to other women with the same features from White, Black and Other races.
weight increases bmi level by (0.15) with a 90% confidence interval (0.148, 0.156).
Average alcohol intake per day increases the bmi level by (0.019), with a 90% confidence interval (0.0029, 0.0341). Although its impact is lower than the other predictors, it remains higher than the key predictor.
13 The Smaller Model
Now we will build the small model which includes a fewer subset of the predictors listed in the big model. In order to decide which predictors should be removed, we will use the backwards step wise elimination approach.
13.1 Backwards Stepwise Elemination
Unfortunately this method was not suitable for our model because it removed the key predictor smoke_years. We will build our candidate models using a different method.
13.2 Another method to develop our small model:
Here, we will build three models, the first model is the original big model (modelA), the second is the first candidate to be the small model (modelB), and the third is the second candidate to be the small model (modelC).
modelA <- lm(bmi ~ smoke_years + race_cat + weight + drink_daily,
data = TX_training)
modelB <- lm(bmi ~ smoke_years + race_cat + weight,
data = TX_training)
modelC <- lm(bmi ~ smoke_years + race_cat,
data = TX_training)
modelD <- lm(bmi ~ smoke_years,
data = TX_training)anova(modelC, modelB, modelA, modelD)Analysis of Variance Table
Model 1: bmi ~ smoke_years + race_cat
Model 2: bmi ~ smoke_years + race_cat + weight
Model 3: bmi ~ smoke_years + race_cat + weight + drink_daily
Model 4: bmi ~ smoke_years
Res.Df RSS Df Sum of Sq F Pr(>F)
1 390 22927.8
2 389 2323.7 1 20604.1 3474.2192 < 2e-16 ***
3 388 2301.1 1 22.6 3.8117 0.05161 .
4 393 23576.5 -5 -21275.5 717.4843 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
repA <- glance(modelA) |> mutate(name = "modelA")
repB <- glance(modelB) |> mutate(name = "modelB")
repC <- glance(modelC) |> mutate(name = "modelC")
repD <- glance(modelD) |> mutate(name = "modelD")
fit_report <- bind_rows(repA, repB, repC, repD)
fit_report |>
select(name, r2 = r.squared, adjr2 = adj.r.squared, sigma,
AIC, BIC, nobs, df, df.res = df.residual) |>
kbl(digits = c(0,4,4,3,0,0,0,0,0)) |> kable_minimal()| name | r2 | adjr2 | sigma | AIC | BIC | nobs | df | df.res |
|---|---|---|---|---|---|---|---|---|
| modelA | 0.9026 | 0.9011 | 2.435 | 1833 | 1865 | 395 | 6 | 388 |
| modelB | 0.9016 | 0.9004 | 2.444 | 1835 | 1863 | 395 | 5 | 389 |
| modelC | 0.0294 | 0.0194 | 7.667 | 2737 | 2761 | 395 | 4 | 390 |
| modelD | 0.0019 | -0.0006 | 7.745 | 2742 | 2754 | 395 | 1 | 393 |
Models A and B are comparable in terms of the r squared, adjusted r squared, sigma, and AIC values. Model C and D’s performance is remarkably lower so we will exclude them. Model B will be selected as the small model.
13.3 Fitting the “small” model
Now we will fit the selected small model in the training sample.
model_small <- lm(bmi ~ smoke_years + race_cat + weight,
data = TX_training)
summary(model_small)
Call:
lm(formula = bmi ~ smoke_years + race_cat + weight, data = TX_training)
Residuals:
Min 1Q Median 3Q Max
-8.5548 -1.4312 0.0762 1.6328 7.9096
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.589277 0.500838 5.170 3.75e-07 ***
smoke_years -0.005799 0.009698 -0.598 0.550
race_catHispanic 1.819100 0.303933 5.985 4.91e-09 ***
race_catBlack -0.536627 0.477243 -1.124 0.262
race_catOther -0.216105 0.629799 -0.343 0.732
weight 0.152142 0.002591 58.731 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.444 on 389 degrees of freedom
Multiple R-squared: 0.9016, Adjusted R-squared: 0.9004
F-statistic: 713.1 on 5 and 389 DF, p-value: < 2.2e-16
13.4 Effect Sizes: Coefficient Estimates
We will produce 90% confidence intervals for our coefficient estimates to evaluate the size and magnitude of the predictors’ impact on the outcome.
tidy(model_small, conf.int = TRUE, conf.level = 0.90) |>
select(term, estimate, std.error, conf.low, conf.high, p.value) |>
kable(dig = 4)| term | estimate | std.error | conf.low | conf.high | p.value |
|---|---|---|---|---|---|
| (Intercept) | 2.5893 | 0.5008 | 1.7635 | 3.4150 | 0.0000 |
| smoke_years | -0.0058 | 0.0097 | -0.0218 | 0.0102 | 0.5502 |
| race_catHispanic | 1.8191 | 0.3039 | 1.3180 | 2.3202 | 0.0000 |
| race_catBlack | -0.5366 | 0.4772 | -1.3235 | 0.2502 | 0.2615 |
| race_catOther | -0.2161 | 0.6298 | -1.2545 | 0.8223 | 0.7317 |
| weight | 0.1521 | 0.0026 | 0.1479 | 0.1564 | 0.0000 |
13.5 Interpreting the Small Model Regression Equation
The key predictor smoke_years:
Similar to the big model, the key predictor in the small model predicts that for every year of smoking, the body mass index (bmi) is expected to decrease by 0.0058 points (90% Confidence Interval: -0.0218, 0.0102). Suppose we have two adult women, from Texas, from the same race, have the same weight, and alcohol daily intake. If one of them smoked cigarettes for a year longer than the other, her bmi level is predicted to be 0.0058 points lower than the lady that smoked a year less than she did. In this case, the key predictor does not yield a remarkable impact on the bmi level.
Other Predictors:
In the small model, Race also yielded the highest absolute coefficient value, particularly the Hispanic race whom are predicted to have (1.82) with a 90% confidence interval (1.318, 2.320) increase in their bmi level compared to other women with the same features from White, Black and Other non-Hispanic races.
weight increases bmi level by (0.15) with a 90% confidence interval (0.148, 0.156).
14 In-Sample Comparison
In this step, we will compare the big model’s and the small model’s performance in the training sample by assessing the R squared values and adjusted R squared values.
14.1 Quality of Fit
To accomplish this task, we will build two tibbles, one for each model. Each tibble will summarize the key results. Afterwards, we will combine the results in a single table training_comp and assess the models’ performance.
temp_a <- glance(model_big) |>
select(-logLik, -deviance) |>
round(digits = 3) |>
mutate(modelname = "big")
temp_b <- glance(model_small) |>
select(-logLik, -deviance) |>
round(digits = 3) |>
mutate(modelname = "small")
training_comp <- bind_rows(temp_a, temp_b) |>
select(modelname, nobs, df, AIC, BIC, everything())training_comp# A tibble: 2 × 11
modelname nobs df AIC BIC r.squared adj.r.squared sigma statistic
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 big 395 6 1833. 1865. 0.903 0.901 2.44 599.
2 small 395 5 1835. 1863. 0.902 0.9 2.44 713.
# ℹ 2 more variables: p.value <dbl>, df.residual <dbl>
The big model (which includes all predictors) has lower R squared and adjusted R squared values. It also has higher sigma, AIC and BIC values. The small model (includes all predictors except dink_daily) has a R squared and adjusted R squared values. The sigma, AIC and BIC values are relatively higher. In the training sample, the small model outperforms the big model in all of the summaries.
14.2 Assessing Assumptions
Now we will check the assumption of linear regression: linearity, normality, homoscedasticity, and influential values.
14.2.1 Residual Plots for the Big Model
First we will augment our results:
aug_big <- augment(model_big, data = TX_training) |>
mutate(bmi = bmi) Next we will generate the residual plots:
p1 <- ggplot(aug_big, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_point(data = aug_big |>
slice_max(abs(.resid), n = 3),
col = "red", size = 2) +
geom_text_repel(data = aug_big |>
slice_max(abs(.resid), n = 3),
aes(label = SEQNO), col = "red") +
geom_abline(intercept = 0, slope = 0, lty = "dashed") +
geom_smooth(method = "loess", formula = y ~ x, se = F) +
labs(title = "Residuals vs. Fitted",
x = "Fitted Value of bmi", y = "Residual")
p2 <- ggplot(aug_big, aes(sample = .std.resid)) +
geom_qq() +
geom_qq_line(col = "red") +
labs(title = "Normal Q-Q plot",
y = "Standardized Residual",
x = "Standard Normal Quantiles")
p3 <- ggplot(aug_big, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x, se = F) +
labs(title = "Scale-Location Plot",
x = "Fitted Value of bmi",
y = "|Std. Residual|^(1/2)")
p4 <- ggplot(aug_big, aes(x = .hat, y = .std.resid)) +
geom_point() +
geom_point(data = aug_big |> filter(.cooksd >= 0.5),
col = "red", size = 2) +
geom_text_repel(data = aug_big |> filter(.cooksd >= 0.5),
aes(label = SEQNO), col = "red") +
geom_smooth(method = "loess", formula = y ~ x, se = F) +
geom_vline(aes(xintercept = 3*mean(.hat)), lty = "dashed") +
labs(title = "Residuals vs. Leverage",
x = "Leverage", y = "Standardized Residual")
(p1 + p2) / (p3 + p4) +
plot_annotation(title = "Assessing Residuals for the big model")There are no serious issues except for the spread of the residuals in the residuals vs. fitted values, which indicate a problem with homoscedasticity.
14.2.2 Residual Plots for the Small Model
First we will augment the small model:
aug_small <- augment(model_small, data = TX_training) |>
mutate(bmi = bmi) Next we will generate the residual plots:
p1 <- ggplot(aug_small, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_point(data = aug_small |>
slice_max(abs(.resid), n = 3),
col = "red", size = 2) +
geom_text_repel(data = aug_small |>
slice_max(abs(.resid), n = 3),
aes(label = SEQNO), col = "red") +
geom_abline(intercept = 0, slope = 0, lty = "dashed") +
geom_smooth(method = "loess", formula = y ~ x, se = F) +
labs(title = "Residuals vs. Fitted",
x = "Fitted Value of bmi", y = "Residual")
p2 <- ggplot(aug_small, aes(sample = .std.resid)) +
geom_qq() +
geom_qq_line(col = "red") +
labs(title = "Normal Q-Q plot",
y = "Standardized Residual",
x = "Standard Normal Quantiles")
p3 <- ggplot(aug_small, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x, se = F) +
labs(title = "Scale-Location Plot",
x = "Fitted Value of bmi",
y = "|Std. Residual|^(1/2)")
p4 <- ggplot(aug_small, aes(x = .hat, y = .std.resid)) +
geom_point() +
geom_point(data = aug_small |> filter(.cooksd >= 0.5),
col = "red", size = 2) +
geom_text_repel(data = aug_small |> filter(.cooksd >= 0.5),
aes(label = SEQNO), col = "red") +
geom_smooth(method = "loess", formula = y ~ x, se = F) +
geom_vline(aes(xintercept = 3*mean(.hat)), lty = "dashed") +
labs(title = "Residuals vs. Leverage",
x = "Leverage", y = "Standardized Residual")
(p1 + p2) / (p3 + p4) +
plot_annotation(title = "Assessing Residuals for the big model")Similar to the big model, the small model also does not face any issues except for the assumption of homoscedasticity.
14.2.3 Does collinearity have a meaningful impact?
car::vif(model_big) GVIF Df GVIF^(1/(2*Df))
smoke_years 1.012975 1 1.006467
race_cat 1.057683 3 1.009391
weight 1.026062 1 1.012947
drink_daily 1.025737 1 1.012787
As we can see, the highest generalized variance inflation factor value in the big model is 1.058. This does not indicate a serious problem with collinearity. These conclusions can be applied to the small model as well.
14.3 Comparing the Models
Neither model has serious issues with regression assumptions, except for the issue of homoscedasticity. The small model, however, performed better than the big model in the training sample. It will be the main candidate for the time being.
15 Model Validation
We will test the two regression models’ performances using the test sample. Based on that, we can determine which model should be selected as the final model.
15.1 Calculating Prediction Errors
15.1.1 Big Model: Calculating Fits/Residuals
In this step, we will calculate the fitted and residual values using the augment function. We do not need to back-transform any variables because we used them without applying transformations.
aug_big <- augment(model_big, newdata = TX_test) |>
mutate(mod_name = "big",
bmi_fit = .fitted,
bmi_res = bmi - bmi_fit) |>
select(SEQNO, mod_name, bmi, bmi_fit, bmi_res, everything())
head(aug_big,3)# A tibble: 3 × 11
SEQNO mod_name bmi bmi_fit bmi_res race_cat smoke_years weight drink_daily
<chr> <chr> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 202200… big 26.6 28.4 -1.73 White 11 170 2
2 202200… big 25.1 30.5 -5.42 White 33 185 2
3 202200… big 26.6 27.4 -0.752 White 47 165 2
# ℹ 2 more variables: .fitted <dbl>, .resid <dbl>
15.1.2 Small Model: Calculating Fits/Residuals
We will apply the same steps for the small model.
aug_small <- augment(model_small, newdata = TX_test) |>
mutate(mod_name = "small",
bmi_fit = .fitted,
bmi_res = bmi - bmi_fit) |>
select(SEQNO, mod_name, bmi, bmi_fit, bmi_res, everything())
head(aug_small,3)# A tibble: 3 × 11
SEQNO mod_name bmi bmi_fit bmi_res race_cat smoke_years weight drink_daily
<chr> <chr> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 202200… small 26.6 28.4 -1.76 White 11 170 2
2 202200… small 25.1 30.5 -5.45 White 33 185 2
3 202200… small 26.6 27.4 -0.790 White 47 165 2
# ℹ 2 more variables: .fitted <dbl>, .resid <dbl>
15.1.3 Combining the Results
Now we sill combine the results of the two models.
test_comp <- union(aug_big, aug_small) |>
arrange(SEQNO, mod_name)
test_comp |> head()# A tibble: 6 × 11
SEQNO mod_name bmi bmi_fit bmi_res race_cat smoke_years weight drink_daily
<chr> <chr> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 202200… big 25.1 28.4 -3.26 White 12 170 2
2 202200… small 25.1 28.4 -3.28 White 12 170 2
3 202200… big 21.5 25.1 -3.62 Other 2 150 5
4 202200… small 21.5 25.2 -3.66 Other 2 150 5
5 202200… big 23.5 25.3 -1.83 White 7 150 1
6 202200… small 23.5 25.4 -1.88 White 7 150 1
# ℹ 2 more variables: .fitted <dbl>, .resid <dbl>
The resulting tibble test_comp will be used in the following steps to summarize, visualize, and identify the largest errors in both models.
15.2 Visualizing the Predictions
ggplot(test_comp, aes(x = bmi_fit, y = bmi)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, lty = "dashed") +
geom_smooth(method = "loess", col = "blue", se = FALSE, formula = y ~ x) +
facet_wrap( ~ mod_name, labeller = "label_both") +
labs(x = "Predicted bmi",
y = "Observed bmi",
title = "Observed vs. Predicted bmi",
subtitle = "Comparing Big to Small Model in Test Sample",
caption = "Dashed line is where Observed = Predicted")The big and small models have the same patter of prediction errors.
15.3 Summarizing the Errors
This summary includes the mean absolute prediction error (MAPE), the root mean squared prediction error (RMSPE), and the maximum absolute error calculated from predictions made by each model.
test_comp |>
group_by(mod_name) |>
summarise(n = n(),
MAPE = mean(abs(bmi_res)),
RMSPE = sqrt(mean(bmi_res^2)),
max_error = max(abs(bmi_res)))# A tibble: 2 × 5
mod_name n MAPE RMSPE max_error
<chr> <int> <dbl> <dbl> <dbl>
1 big 170 2.16 2.78 11.9
2 small 170 2.16 2.77 11.8
The small model has a slightly better MAPE, RMSPE and maximum error values than the big model. The MAPE value suggests that using these models to predict women’s bmi could overestimate it by approximately 2.2 points. This level of error in bmi measurement is not acceptable because this level of difference can easily change their BMI category.
15.3.1 Identify the largest errors
We will identify the subject with the worst fit in both models by calculating the maximum residual error in each model.
temp1 <- aug_big |>
filter(abs(bmi_res) == max(abs(bmi_res)))
temp2 <- aug_small |>
filter(abs(bmi_res) == max(abs(bmi_res)))
bind_rows(temp1, temp2)# A tibble: 2 × 11
SEQNO mod_name bmi bmi_fit bmi_res race_cat smoke_years weight drink_daily
<chr> <chr> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 202200… big 50.2 38.2 11.9 Black 31 240 2
2 202200… small 50.2 38.4 11.8 Black 31 240 2
# ℹ 2 more variables: .fitted <dbl>, .resid <dbl>
From the presented table, we can see that the same subject (2022009499) had the worst fit in both models.
15.3.2 Validated R-square values
Here is the squared correlation between our predicted bmi and our actual bmi in the test sample, using the big model.
cor(aug_big$bmi, aug_big$bmi_fit)^2[1] 0.8611934
This is the R-square we obtained within the test sample for the small model.
cor(aug_small$bmi, aug_small$bmi_fit)^2[1] 0.8627504
The squared correlation of the two models are about the same, with the small model having a slightly higher value. Though, both models have remarkably high validated R squared values.
15.4 Removing the largest error
This is the summary before removing the largest error:
test_comp |>
group_by(mod_name) |>
summarize(n = n(), MAPE = mean(abs(bmi_res)),
RMSPE = sqrt(mean(bmi_res^2)),
max_error = max(abs(bmi_res)),
median_APE = median(abs(bmi_res)),
valid_R2 = cor(bmi, bmi_fit)^2) |>
kbl(digits = c(0, 0, 4, 3, 2, 3, 3)) |> kable_minimal(font_size = 22)| mod_name | n | MAPE | RMSPE | max_error | median_APE | valid_R2 |
|---|---|---|---|---|---|---|
| big | 170 | 2.1613 | 2.780 | 11.94 | 1.750 | 0.861 |
| small | 170 | 2.1558 | 2.765 | 11.77 | 1.765 | 0.863 |
test_comp |> filter(SEQNO != "2022009499") |>
group_by(mod_name) |>
summarize(n = n(), MAPE = mean(abs(bmi_res)),
RMSPE = sqrt(mean(bmi_res^2)),
max_error = max(abs(bmi_res)),
median_APE = median(abs(bmi_res)),
valid_R2 = cor(bmi, bmi_fit)^2) |>
kbl(digits = c(0, 0, 4, 3, 2, 3, 3)) |> kable_minimal(font_size = 22)| mod_name | n | MAPE | RMSPE | max_error | median_APE | valid_R2 |
|---|---|---|---|---|---|---|
| big | 169 | 2.1034 | 2.633 | 7.99 | 1.739 | 0.873 |
| small | 169 | 2.0989 | 2.621 | 8.02 | 1.764 | 0.874 |
Removing the subject with the highest error increased the R squared and adjusted R squared value for both models, but the difference between them remains the same. The maximum error was significantly reduced after removing the subject. The big model has lower sigma and median APE values. The small model remains slightly better in terms of the MAPE, RMSPE, and valid R squared values.
15.5 Comparing the Models
The small model dominated the big model in the training sample. In the test sample, The small model was also slightly better in the majority of the summaries. Therefore, the small model will be selected as our final model. Though, the performance of the two models are fairly similar
16 Discussion
16.1 Chosen Model
In the training sample, the small model yielded better values in terms of R squared, adjusted R squared, AIC, BIC and sigma. In the test sample, the small model exhibited better performance in terms of MAPE, RMSPE, maximum error and Valid R squared, while the big model had better median APE value. After removing the subject with largest error, the small model’s performance remained slightly better. Based on these results, we will choose the small model. It has a lower number of predictors and better performance in both the training sample and testing sample.
16.2 Answering My Question
tidy(model_small, conf.int = TRUE, conf.level = 0.90) |>
select(term, estimate, std.error, conf.low, conf.high, p.value) |>
kable(dig = 4)| term | estimate | std.error | conf.low | conf.high | p.value |
|---|---|---|---|---|---|
| (Intercept) | 2.5893 | 0.5008 | 1.7635 | 3.4150 | 0.0000 |
| smoke_years | -0.0058 | 0.0097 | -0.0218 | 0.0102 | 0.5502 |
| race_catHispanic | 1.8191 | 0.3039 | 1.3180 | 2.3202 | 0.0000 |
| race_catBlack | -0.5366 | 0.4772 | -1.3235 | 0.2502 | 0.2615 |
| race_catOther | -0.2161 | 0.6298 | -1.2545 | 0.8223 | 0.7317 |
| weight | 0.1521 | 0.0026 | 0.1479 | 0.1564 | 0.0000 |
The linear regression model that predicted bmi for women in Texas:
bmi = 2.59 + smoke_years (-0.0058) + race_cat(Hispanic(1.82)/ Black(-0.05)/ Other(-0.22)) + weight (0.15)
In our study sample, the key predictor smoke_year, and other predictors, race category (Black and Other) were associated with decreased bmi levels. On the other hand, the race categories (White and Hispanic) and weight were associated with increased bmi levels. The only predictors that exhibited detectable impact on the outcome were weight and race category (White and Hispanic).
Quality of Fit:
The validated R squared is 0.863, which indicates that our model was able to predict 86.3% of variation in our outcome. The previous adjusted r squared value generated from the training sample was 0.90, which indicates that the model’s predictability was slightly overestimated. The remarkable high predictability of this model is likely due to the incorporation of the most important factor in calculating bmi levels (weight). The key predictor had a weak correlation with the outcome (-0.0058), which implies that it is not the best predictor. The Hispanic race category was highly correlated (1.59) with increased bmi.
Regression Assumptions:
There were no serious violations of linear regression assumptions, except for homoscedasticity, which can be observed in the residual vs. fitted plot.
16.3 Next Steps
The only predictors with missing observations was the key predictor smoke_years and another predictor that was added to assess interaction drink_daily. In this analysis, I assumed that those observations were missing completely at random. The next step would be to use imputation (single imputation followed by multiple imputation) to assess potential improvement in big model, which incorporated both variables, and whether imputation could change our decision of the best model.
Another step would be investigating the association between BMI and women from different racial backgrounds. In addition to race, the new model will incorporate age, socioeconomic level, and educational level. This way, we can observe if having a different race impacts BMI levels when all other relative factors (age, socioeconomic level, and educational level) are constant. We would also include women from various states to make the sample representative of the US women population.
16.4 Reflection
When we first loaded the raw data set, the number of observations was 445,132. After narrowing it down to my population of interest (adult women in Texas), the number of observations remained relatively high (1,164). The big sample size gave me a sense of reassurance that I have met the required number of observations for project B. However, after narrowing it down to complete cases of the variables of interest for study 2 analysis, I was left with only 565 observations. It remained above the minimum required number of observations, but I did not anticipate the significant reduction in the sample size. Had I known that I would lose a big number of observations in the process of data cleaning and management, I would have included women from multiple southern states, rather than Texas alone.
17 Session information
sessionInfo()R version 4.5.1 (2025-06-13 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)
Matrix products: default
LAPACK version 3.12.1
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.9.4 forcats_1.0.1 stringr_1.5.2 purrr_1.1.0
[5] readr_2.1.5 tidyr_1.3.1 tibble_3.3.0 tidyverse_2.0.0
[9] sessioninfo_1.2.3 patchwork_1.3.2 naniar_1.1.0 mosaic_1.9.2
[13] mosaicData_0.20.4 ggformula_1.0.0 dplyr_1.1.4 Matrix_1.7-3
[17] lattice_0.22-7 knitr_1.50 janitor_2.2.1 Hmisc_5.2-4
[21] GGally_2.4.0 car_3.1-3 carData_3.0-5 haven_2.5.5
[25] kableExtra_1.4.0 ggrepel_0.9.6 ggplot2_4.0.0 gtsummary_2.4.0
[29] gt_1.1.0 broom_1.0.10
loaded via a namespace (and not attached):
[1] ggiraph_0.9.2 tidyselect_1.2.1 viridisLite_0.4.2
[4] farver_2.1.2 S7_0.2.0 fastmap_1.2.0
[7] fontquiver_0.2.1 labelled_2.15.0 digest_0.6.37
[10] rpart_4.1.24 timechange_0.3.0 lifecycle_1.0.4
[13] cluster_2.1.8.1 magrittr_2.0.4 compiler_4.5.1
[16] rlang_1.1.6 tools_4.5.1 utf8_1.2.6
[19] yaml_2.3.10 data.table_1.17.8 labeling_0.4.3
[22] htmlwidgets_1.6.4 xml2_1.4.0 RColorBrewer_1.1-3
[25] abind_1.4-8 withr_3.0.2 foreign_0.8-90
[28] nnet_7.3-20 grid_4.5.1 mosaicCore_0.9.5
[31] gdtools_0.4.4 colorspace_2.1-2 MASS_7.3-65
[34] scales_1.4.0 ggridges_0.5.7 cli_3.6.5
[37] rmarkdown_2.30 generics_0.1.4 rstudioapi_0.17.1
[40] tzdb_0.5.0 splines_4.5.1 base64enc_0.1-3
[43] vctrs_0.6.5 jsonlite_2.0.0 fontBitstreamVera_0.1.1
[46] hms_1.1.4 visdat_0.6.0 Formula_1.2-5
[49] htmlTable_2.4.3 systemfonts_1.3.1 glue_1.8.0
[52] ggstats_0.11.0 stringi_1.8.7 gtable_0.3.6
[55] pillar_1.11.1 htmltools_0.5.8.1 R6_2.6.1
[58] textshaping_1.0.4 evaluate_1.0.5 backports_1.5.0
[61] snakecase_0.11.1 fontLiberation_0.1.0 Rcpp_1.1.0
[64] nlme_3.1-168 svglite_2.2.1 gridExtra_2.3
[67] checkmate_2.3.3 mgcv_1.9-3 xfun_0.52
[70] fs_1.6.6 pkgconfig_2.0.3