5  Cumulative smoking years and BMI among women in Texas

Using the Behavioral Risk Factor Surveillance System (BRFSS)

Author

Sarah Albalawi & Walaa Alshaia

Published

2025-10-28

6 Setup and Data Ingest

This study aims to explore and measure the association between cumulative smoking years and body mass index among women in Texas. The study sample was extracted from the Behavioral Risk Factor Surveillance System (BRFSS), a national telephone-based survey that collects state-level data. The target population (aged 18 years and older) for cellular telephone samples in 2022 consists of people residing in a private residence or college housing who have a working cellular telephone. Our data set will only include adult women below the age of 65 who were Texas residents in 2022. The outcome of interest is body mass index (bmi) and the key predictor is the number of years the subjects smoked cigarettes (smoke_years). Other factors will be observed, including race category, weight, and average drinks per day.

6.1 Initial Setup and Package Loads in R

library(broom)
library(gt)
library(gtsummary)
library(ggrepel)
library(kableExtra)
library(haven)
library(car)
library(GGally)
library(Hmisc)
library(janitor)
library(knitr)
library(mosaic)
library(naniar)
library(patchwork)
library(sessioninfo)
library(tidyverse) 


opts_chunk$set(comment=NA)

theme_set(theme_bw())
options(dplyr.summarise.inform = FALSE)

6.2 Loading the Raw Data into R

We will load the BRFSS data set.

BRFSS <- read_xpt("C:/Users/albal/OneDrive/Desktop/MPHP 431/BRFSS/LLCP2022.XPT")

7 Cleaning the Data

7.1 The Raw Data

Because the original data set is large, we will narrow the raw data down to include only the variables of interest for our work.

BRFSS_study <- BRFSS|>
  select(SEQNO, `_STATE`,`_SEX`,`_AGE65YR`, `_RFBMI5`, `_BMI5`, `_BMI5CAT`, WEIGHT2,`_YRSSMOK`, `_RFSMOK3`, SMOKDAY2, AVEDRNK3, `_RFDRHV8`, `_IMPRACE`)

glimpse(BRFSS_study)
Rows: 445,132
Columns: 14
$ SEQNO      <chr> "2022000001", "2022000002", "2022000003", "2022000004", "20…
$ `_STATE`   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ `_SEX`     <dbl> 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2,…
$ `_AGE65YR` <dbl> 2, 2, 1, 3, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 3, 2, 2, 1,…
$ `_RFBMI5`  <dbl> 9, 2, 2, 1, 1, 2, 1, 2, 9, 2, 2, 1, 2, 2, 2, 2, 1, 2, 1, 1,…
$ `_BMI5`    <dbl> NA, 2657, 2561, 2330, 2177, 2608, 2296, 2781, NA, 2905, 292…
$ `_BMI5CAT` <dbl> NA, 3, 3, 2, 2, 3, 2, 3, NA, 3, 3, 2, 3, 4, 3, 3, 2, 4, 2, …
$ WEIGHT2    <dbl> 9999, 150, 140, 140, 119, 187, 138, 162, 9999, 180, 165, 13…
$ `_YRSSMOK` <dbl> NA, NA, NA, 56, NA, NA, 42, NA, 1, NA, NA, NA, 8, NA, NA, N…
$ `_RFSMOK3` <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1,…
$ SMOKDAY2   <dbl> NA, NA, NA, 2, NA, NA, 3, NA, 3, NA, NA, NA, 3, NA, NA, NA,…
$ AVEDRNK3   <dbl> NA, NA, NA, NA, 2, NA, 2, NA, NA, 1, NA, NA, 2, 1, 1, NA, N…
$ `_RFDRHV8` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9,…
$ `_IMPRACE` <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1,…

The BRFSS_study tibble includes 14 variables and 445,132 subjects. For each subject we selected the following variables:

  • Demographic information including their state _STATE, sex _SEX, age _AGE65YR, and race / ethnicity _IMPRACE
  • Their reported weight WEIGHT2 and the computed BMI level _BMI5 and category _BMI5CAT
  • Whether they are categorized as overweight _RFBMI5
  • The number of years they smoked cigarettes _YRSSMOK
  • Their smoking frequency category SMOKDAY2 and whether they are current smokers or not _RFSMOK3
  • Their average alcohol consumption per day AVEDRNK3 and whether they are categorized as heavy drinkers or not _RFDRHV8

7.2 Variable Filter

We will select our population of interest (adult women, below the age of 65, residing in Texas) and filter other variables to include only values appropriate for our analysis. We will also filter out missing-value codes (such as the unanswered and refused-to-answer categories):

TX <- BRFSS_study |>
  filter(`_STATE` == 48,      # Texas (FIPS code 48)
         `_SEX` == 2,         # female respondents
         `_AGE65YR` == 1,     # aged 18 to 64
         WEIGHT2 < 776,       # weights reported in pounds; drops special codes
         `_RFBMI5` < 3,       # yes/no only; drops 9 = don't know/refused/missing
         `_RFSMOK3` < 3,      # yes/no only; drops 9
         SMOKDAY2 < 4,        # every day / some days / not at all
         `_BMI5CAT` < 5) |>   # valid BMI categories only
  select(SEQNO, `_RFBMI5`, `_BMI5`, `_BMI5CAT`, WEIGHT2, `_YRSSMOK`, `_RFSMOK3`, SMOKDAY2, AVEDRNK3, `_RFDRHV8`, `_IMPRACE`)

This is our initial data set:

TX 
# A tibble: 1,164 × 11
   SEQNO     `_RFBMI5` `_BMI5` `_BMI5CAT` WEIGHT2 `_YRSSMOK` `_RFSMOK3` SMOKDAY2
   <chr>         <dbl>   <dbl>      <dbl>   <dbl>      <dbl>      <dbl>    <dbl>
 1 20220009…         2    2663          3     170         11          1        3
 2 20220009…         1    2001          2     124         28          1        3
 3 20220010…         2    3004          4     175         41          2        1
 4 20220010…         2    3986          4     225         11          1        3
 5 20220010…         1    2455          2     143         23          2        1
 6 20220010…         2    2887          3     163         NA          1        3
 7 20220010…         1    2036          2     130         24          1        3
 8 20220011…         2    2523          3     147         16          1        3
 9 20220011…         2    4241          4     210         33          1        3
10 20220011…         2    4207          4     230         35          1        3
# ℹ 1,154 more rows
# ℹ 3 more variables: AVEDRNK3 <dbl>, `_RFDRHV8` <dbl>, `_IMPRACE` <dbl>

As we can see, we need to adjust the decimal place of our outcome _BMI5 (BRFSS stores BMI with two implied decimal places). We will also rename our variables.

TX <- TX |>
  mutate(bmi = `_BMI5`/100)|>
  rename(race = `_IMPRACE`,
         overweight = `_RFBMI5`,
         bmi_category = `_BMI5CAT`,
         weight = WEIGHT2,
         smoke_years = `_YRSSMOK`, 
         smoke_currently = `_RFSMOK3`, 
         smoke_frequency = SMOKDAY2,
         drink_daily = AVEDRNK3,
         drink_heavy = `_RFDRHV8`)

7.3 Which variables should be included in the tidy data set?

Our regression model will include four predictors: smoke_years, race, weight, and drink_daily, for our outcome of interest bmi.

7.4 Checking our outcome and key predictor

df_stats(~ bmi + smoke_years, data = TX)
     response   min     Q1 median    Q3   max     mean        sd    n missing
1         bmi 15.97 24.675  28.72 34.75 92.01 30.29957  8.247614 1164       0
2 smoke_years  1.00 11.000  21.00 32.00 56.00 22.18367 13.239163 1078      86

There are no missing observations in the outcome variable, but there are 86 missing values in the key predictor. The bmi range extends from 16.0 to 92.0 with an average of about 30.3, which is higher than the healthy level. Cumulative smoking years extends from 1 year to 56 years with an average of about 22 years.

7.5 Checking the Quantitative Predictors

df_stats(~ weight + drink_daily, data = TX)
     response min  Q1 median     Q3 max       mean       sd    n missing
1      weight  80 145    170 204.25 400 179.356529 49.01472 1164       0
2 drink_daily   1   1      2   3.00  99   4.560732 13.20164  601     563

There are no missing values in weight, which has a wide range extending from 80 pounds to 400 pounds with an average of about 179 pounds; this suggests that a sizable proportion of our study sample is on the heavier side. The drink_daily variable has 563 missing values, and its maximum of 99 drinks per day is implausible; we will return to these extreme values later.

7.6 Checking the Categorical Variables

One of the predictors, race, is a nominal (non-ordinal) categorical variable that requires some modification before analysis.

7.6.1 Race

First we will mutate the variable race into a factor then rename the variable categories:

TX$race <- as.factor(as.numeric(TX$race))

Now we will re-level the categories.

TX$race <- 
    fct_relevel(TX$race,
                "1", "2", "3", "4", "5", "6")

Then, we will rename each category to clarify its meaning.

TX_b <- TX |>
  mutate(
    race = fct_recode(race,
      "White" = "1",
      "Black" = "2",
      "Asian" = "3",
      "American Indian" = "4",
      "Hispanic" = "5",
      "Other" = "6"))

tabyl(TX_b$race)
       TX_b$race   n     percent
           White 807 0.693298969
           Black  75 0.064432990
           Asian   8 0.006872852
 American Indian  15 0.012886598
        Hispanic 229 0.196735395
           Other  30 0.025773196

As we can see, the Asian, American Indian, and Other categories have relatively low frequencies, so we will combine the smallest categories under the Other category.

We will order our categories according to their frequencies from highest to lowest.

TX_b <- TX_b |>
  mutate(race = fct_infreq(race))

Then we will collapse the categories with the lowest frequencies into the Other category and store the result in a new variable called race_cat.

TX_b <- TX_b |>
  mutate(race_cat = fct_lump_n(race, n = 4))

tabyl(TX_b$race_cat)
 TX_b$race_cat   n    percent
         White 807 0.69329897
      Hispanic 229 0.19673540
         Black  75 0.06443299
         Other  53 0.04553265

The majority of the subjects are White, followed by Hispanic; the Black and Other categories are markedly smaller. This unbalanced distribution may affect the external validity of our findings, although it may also reflect the distribution of the target population being studied (women in Texas). It is important to note that the Other category includes Asians, Native Americans, and other races, all of which are non-Hispanic.

7.6.2 What about the subjects?

Each subject should have a sequential number SEQNO.

nrow(TX_b)
[1] 1164
n_distinct(TX_b |> select(SEQNO))
[1] 1164

The number of rows matches the number of distinct SEQNO values in our data set, so we can be confident that each subject has a unique sequential number.

7.7 Dealing with Missingness

The outcome does not have any missing observations, and neither do most of the predictors; the exceptions are the key predictor smoke_years (86 NA) and drink_daily (563 NA).

miss_var_summary(TX_b)
# A tibble: 13 × 3
   variable        n_miss pct_miss
   <chr>            <int>    <num>
 1 drink_daily        563    48.4 
 2 smoke_years         86     7.39
 3 SEQNO                0     0   
 4 overweight           0     0   
 5 _BMI5                0     0   
 6 bmi_category         0     0   
 7 weight               0     0   
 8 smoke_currently      0     0   
 9 smoke_frequency      0     0   
10 drink_heavy          0     0   
11 race                 0     0   
12 bmi                  0     0   
13 race_cat             0     0   

None of the subjects are missing more than 2 variables as demonstrated below.

miss_case_summary(TX_b)
# A tibble: 1,164 × 3
    case n_miss pct_miss
   <int>  <int>    <dbl>
 1     6      2     15.4
 2    59      2     15.4
 3    70      2     15.4
 4    80      2     15.4
 5    84      2     15.4
 6    86      2     15.4
 7    91      2     15.4
 8    99      2     15.4
 9   130      2     15.4
10   184      2     15.4
# ℹ 1,154 more rows

For this analysis we will assume that the observations are missing completely at random (MCAR). In this case, the complete-cases method is an appropriate way to deal with the missing data.
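As an optional check on this assumption (a sketch, not part of the main pipeline), Little's MCAR test is available through mcar_test() in the naniar package we loaded earlier; a small p-value would cast doubt on the MCAR assumption.

# A sketch: Little's MCAR test on the quantitative variables
# (a small p-value argues against "missing completely at random")
TX_b |>
  select(bmi, weight, smoke_years, drink_daily) |>
  mcar_test()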

This is the complete data set for the analysis.

TX_CC <- TX_b |>
  select(SEQNO, race_cat, overweight, bmi, bmi_category, weight, smoke_years, smoke_currently, smoke_frequency, drink_daily, drink_heavy) |>
  drop_na()

TX_CC
# A tibble: 565 × 11
   SEQNO      race_cat overweight   bmi bmi_category weight smoke_years
   <chr>      <fct>         <dbl> <dbl>        <dbl>  <dbl>       <dbl>
 1 2022000985 White             2  26.6            3    170          11
 2 2022001128 White             2  25.2            3    147          16
 3 2022001266 White             2  25.1            3    185          33
 4 2022001354 Hispanic          2  30.2            4    160          31
 5 2022001400 Black             2  41.6            4    220          16
 6 2022001422 White             2  25.8            3    160          25
 7 2022001512 Hispanic          2  43.6            4    270          30
 8 2022001541 White             2  31.6            4    196          50
 9 2022001655 White             2  27.1            3    148          18
10 2022001845 White             2  26.6            3    165          47
# ℹ 555 more rows
# ℹ 4 more variables: smoke_currently <dbl>, smoke_frequency <dbl>,
#   drink_daily <dbl>, drink_heavy <dbl>

We are left with a total of 565 subjects with complete observations of all variables.

8 Codebook and Data Description

8.1 The Codebook

Variable | Initial Name | Type | Description / Levels
SEQNO | SEQNO | Character | Sequential number of study subjects
race_cat | _IMPRACE | 4-level Cat. | Race / ethnicity of respondent: White, Hispanic, Black, Other
overweight | _RFBMI5 | Binary | Adult women who have a body mass index greater than 25.00 (overweight): yes or no
bmi | _BMI5 | Quantitative | Outcome: computed body mass index
bmi_category | _BMI5CAT | 4-level Cat. | Computed body mass index categories: Underweight, Normal Weight, Overweight, Obese
weight | WEIGHT2 | Quantitative | Reported weight in pounds
smoke_years | _YRSSMOK | Quantitative | Key predictor: number of years respondent smoked cigarettes
smoke_currently | _RFSMOK3 | Binary | Adults who are current smokers: yes or no
smoke_frequency | SMOKDAY2 | 3-level Cat. | Reported smoking frequency: every day, some days, not at all
drink_daily | AVEDRNK3 | Quantitative | Average alcoholic drinks per day in the past 30 days
drink_heavy | _RFDRHV8 | Binary | Heavy drinkers (adult women having more than 7 drinks per week): yes or no

This table includes all variables used for both study 1 and study 2 analysis.

For this analysis (study 2) we will only use four predictors: smoke_years, weight, race_cat, and drink_daily.

Now we will create a subset of the complete-cases set TX_CC with the variables of interest for the study 2 analysis.

TX_analytic <- TX_CC |>
  select(SEQNO, race_cat, bmi, smoke_years, weight, drink_daily) 

8.2 Analytic Tibble

Our analytic tibble has 6 variables and 565 observations with complete cases.

TX_analytic
# A tibble: 565 × 6
   SEQNO      race_cat   bmi smoke_years weight drink_daily
   <chr>      <fct>    <dbl>       <dbl>  <dbl>       <dbl>
 1 2022000985 White     26.6          11    170           2
 2 2022001128 White     25.2          16    147           1
 3 2022001266 White     25.1          33    185           2
 4 2022001354 Hispanic  30.2          31    160           1
 5 2022001400 Black     41.6          16    220           6
 6 2022001422 White     25.8          25    160           1
 7 2022001512 Hispanic  43.6          30    270           2
 8 2022001541 White     31.6          50    196           1
 9 2022001655 White     27.1          18    148           1
10 2022001845 White     26.6          47    165           2
# ℹ 555 more rows

We also confirm that our object is a tibble with the code below, because we are using df_print: paged in our YAML.

is_tibble(TX_analytic)
[1] TRUE

8.3 Numerical Data Description

Hmisc::describe(TX_analytic |> select(-SEQNO))
select(TX_analytic, -SEQNO) 

 5  Variables      565  Observations
--------------------------------------------------------------------------------
race_cat 
       n  missing distinct 
     565        0        4 
                                              
Value         White Hispanic    Black    Other
Frequency       387      115       43       20
Proportion    0.685    0.204    0.076    0.035
--------------------------------------------------------------------------------
bmi : COMPUTED BODY MASS INDEX 
       n  missing distinct     Info     Mean  pMedian      Gmd      .05 
     565        0      345        1    29.45    28.69    8.285    19.87 
     .10      .25      .50      .75      .90      .95 
   21.27    24.03    28.06    33.66    40.35    43.74 

lowest : 16.04 16.31 16.50 17.23 17.71, highest: 56.12 56.64 57.25 57.62 65.23
--------------------------------------------------------------------------------
smoke_years : NUMBER OF YEARS SMOKED CIGARETTES 
       n  missing distinct     Info     Mean  pMedian      Gmd      .05 
     565        0       50    0.999    19.68       19    14.27      3.0 
     .10      .25      .50      .75      .90      .95 
     5.0      9.0     18.0     29.0     38.6     44.0 

lowest :  1  2  3  4  5, highest: 46 47 48 50 51
--------------------------------------------------------------------------------
weight : REPORTED WEIGHT IN POUNDS 
       n  missing distinct     Info     Mean  pMedian      Gmd      .05 
     565        0      123    0.999    175.8      171    51.85    115.0 
     .10      .25      .50      .75      .90      .95 
   125.0    140.0    165.0    200.0    240.6    262.4 

lowest :  91  94  95 100 105, highest: 315 320 330 357 380
--------------------------------------------------------------------------------
drink_daily : AVG ALCOHOLIC DRINKS PER DAY IN PAST 30 
       n  missing distinct     Info     Mean  pMedian      Gmd      .05 
     565        0       14    0.925    4.384        2    5.355      1.0 
     .10      .25      .50      .75      .90      .95 
     1.0      1.0      2.0      3.0      5.0      6.8 

Value          1     2     3     4     5     6     7     8    10    12    15    77
Frequency    171   197    89    41    25    13     1     6     5     3     1     5
Proportion 0.303 0.349 0.158 0.073 0.044 0.023 0.002 0.011 0.009 0.005 0.002 0.009
                       
Value         88    99
Frequency      5     3
Proportion 0.009 0.005 
--------------------------------------------------------------------------------

9 My Research Question

The impact of cigarette smoking on health outcomes has been extensively researched in recent decades. However, the majority of these studies have focused on men as the main consumers of cigarettes. Predatory tobacco marketing has also targeted women by advertising cigarettes as an effective method for weight reduction. The latest national statistics also reveal a high prevalence of smoking among women in southern states. In this analysis, we aim to explore the impact of the number of years women smoked cigarettes on their body mass index. Body mass index is an important health outcome that is closely related to the incidence of chronic disease. We will also observe the impact of other biological factors, including weight and race / ethnicity, and of average alcohol consumption per day as a behavioral factor.

To what extent can the number of years women smoked cigarettes (smoke_years) predict their bmi level, and can weight, race / ethnicity (race_cat), and daily alcohol consumption (drink_daily) improve the quality of our prediction?

10 Partitioning the Data

In the following steps, we will divide our analytic data set (TX_analytic) into two subsets: the training data (TX_training) and the test data (TX_test). Subjects will be assigned randomly, with 70% in the training set and 30% in the test set. We will set a seed (777) so that the results can be replicated.

set.seed(777) 

TX_training <- TX_analytic |> slice_sample(prop = .70)
TX_test <- anti_join(TX_analytic, TX_training, by = "SEQNO")

dim(TX_analytic) 
[1] 565   6
dim(TX_training) 
[1] 395   6
dim(TX_test) 
[1] 170   6

The training set has 395 subjects and the test set has 170. The sum of the two sets equals 565 subjects which matches the original number of our analytic data set.
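As an additional sanity check (a quick sketch), we can confirm that no subject appears in both partitions:

# Expect 0: no SEQNO should be shared by the training and test sets
nrow(semi_join(TX_training, TX_test, by = "SEQNO"))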

11 Transforming the Outcome

11.1 Visualizing the Outcome Distribution

We will visualize the outcome distribution using three plots: the histogram, normal Q-Q plot, and box plot with violin.

p1 <- ggplot(TX_training, aes(x = (bmi))) +
  geom_histogram(bins = 30, 
                 fill = "slateblue", col = "white")

p2 <- ggplot(TX_training, aes(sample = (bmi))) + 
  geom_qq(col = "slateblue") + geom_qq_line(col = "violetred") +
  labs(y = "bmi", x = "Normal (0,1) quantiles") + 
  theme(aspect.ratio = 1)

p3 <- ggplot(TX_training, aes(x = "", y = (bmi))) +
  geom_violin(fill = "slateblue", alpha = 0.1) + 
  geom_boxplot(fill = "slateblue", width = 0.3, notch = TRUE,
               outlier.color = "slateblue", outlier.size = 3) +
  labs(x = "") + coord_flip()

p1 + p2 - p3 +
  plot_layout(ncol = 1, height = c(3, 2)) + 
  plot_annotation(title = "Body Mass Index (bmi)",
         subtitle = str_glue("Model Development Sample: ", nrow(TX_training), 
                           " Adult Women"))

The outcome is clearly right-skewed, with multiple extreme values (outliers) in the upper tail. We will attempt to transform the outcome to improve the distribution.

11.2 Box-Cox function to assess the need for transformation of our outcome

We will use the Box-Cox function to see the suggested transformation to improve the skew of our outcome.

model_temp <- lm(bmi ~ smoke_years + race_cat + weight +  drink_daily,
                 data = TX_training)

boxCox(model_temp)

This plot provides a visual estimate of the suggested transformation of our outcome.

The code below can provide a numeric estimate:

powerTransform(model_temp)
Estimated transformation parameter 
       Y1 
0.6054038 

The suggested transformation power is about 0.6, which is closest to 0.5 and therefore suggests that the square root transformation could be the best candidate in this case.
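As a supplement (a sketch), calling summary() on the powerTransform object also reports likelihood ratio tests of the log transformation (lambda = 0) and of leaving the outcome untransformed (lambda = 1):

# A sketch: likelihood ratio tests for lambda = 0 (log) and lambda = 1 (identity)
summary(powerTransform(model_temp))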

This is the scatter plot of the outcome bmi without transformation:

p1 <- ggplot(TX_training, aes(x = smoke_years , y = bmi)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) + 
  geom_smooth(method = "lm", col = "red", formula = y ~ x, se = FALSE) +
  labs(title = "bmi vs. years smoking")

p1

Now we will compare the scatter plots after transforming the outcome bmi alone, and after transforming both the key predictor smoke_years and the outcome bmi.

p2 <- ggplot(TX_training, aes(x = smoke_years, y = sqrt(bmi))) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) + 
  geom_smooth(method = "lm", col = "red", formula = y ~ x, se = FALSE) + 
  labs(title = "sqrt(bmi) vs. (years smoking)")

p3 <- ggplot(TX_training, aes(x = sqrt(smoke_years), y = sqrt(bmi))) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) + 
  geom_smooth(method = "lm", col = "red", formula = y ~ x, se = FALSE) + 
  labs(title = "sqrt(bmi) vs. sqrt(years smoking)")

 p2 + p3

Neither transforming the outcome alone nor transforming both the predictor and the outcome produced a remarkable improvement, so we will continue using the data without transformation.

11.3 Numeric Summary of the outcome

We will observe the numeric summary of the outcome bmi in our training set:

favstats(~ bmi, data = TX_training)
   min     Q1 median    Q3   max     mean       sd   n missing
 16.04 23.805  28.13 33.44 65.23 29.39777 7.743024 395       0

The bmi level of the training sample ranges from 16.0 to 65.2. The average is 29.4 with a standard deviation of 7.7.

11.4 Numeric Summary of the Predictors

We will also observe the numeric summary of our predictors in the training sample:

Hmisc::describe(TX_training |> 
                    select(race_cat, weight, smoke_years, drink_daily))
select(TX_training, race_cat, weight, smoke_years, drink_daily) 

 4  Variables      395  Observations
--------------------------------------------------------------------------------
race_cat 
       n  missing distinct 
     395        0        4 
                                              
Value         White Hispanic    Black    Other
Frequency       263       86       30       16
Proportion    0.666    0.218    0.076    0.041
--------------------------------------------------------------------------------
weight : REPORTED WEIGHT IN POUNDS 
       n  missing distinct     Info     Mean  pMedian      Gmd      .05 
     395        0      109    0.999    174.7      170    52.08    115.0 
     .10      .25      .50      .75      .90      .95 
   120.4    140.0    165.0    200.0    240.6    260.0 

lowest :  91  94 100 105 106, highest: 313 320 330 357 380
--------------------------------------------------------------------------------
smoke_years : NUMBER OF YEARS SMOKED CIGARETTES 
       n  missing distinct     Info     Mean  pMedian      Gmd      .05 
     395        0       50    0.999    20.07     19.5    14.55        3 
     .10      .25      .50      .75      .90      .95 
       5        9       18       29       39       44 

lowest :  1  2  3  4  5, highest: 46 47 48 50 51
--------------------------------------------------------------------------------
drink_daily : AVG ALCOHOLIC DRINKS PER DAY IN PAST 30 
       n  missing distinct     Info     Mean  pMedian      Gmd      .05 
     395        0       12    0.929    4.557        2    5.627        1 
     .10      .25      .50      .75      .90      .95 
       1        1        2        3        5        8 
                                                                            
Value          1     2     3     4     5     6     8    10    12    77    88
Frequency    114   137    68    29    19     6     5     5     2     5     3
Proportion 0.289 0.347 0.172 0.073 0.048 0.015 0.013 0.013 0.005 0.013 0.008
                
Value         99
Frequency      2
Proportion 0.005
--------------------------------------------------------------------------------
  • In terms of race_cat, the sample is predominantly White, followed by Hispanic, Black, and Other (non-Hispanic)
  • The weight of the subjects extends from 91 pounds to 380 pounds with an average of about 175 pounds
  • The number of years the subjects smoked cigarettes (smoke_years) ranges from 1 year to 51 years with an average of 20.1 years
  • The average number of alcoholic drinks per day (drink_daily) ranges from 1 to 99 with an average of 4.6 drinks per day. This variable includes unrealistically high values, which may have occurred because subjects misunderstood the questionnaire or because of errors when recording the answers.

11.5 Scatterplot Matrix

temp <- TX_training |> 
  select(race_cat, weight, smoke_years, drink_daily, bmi) 

ggpairs(temp, title = "Scatterplot Matrix",
        lower = list(combo = wrap("facethist", bins = 28)))

As expected, the variable with the highest correlation with the outcome is weight (0.944). The second highest is the key predictor smoke_years (0.044), followed by drink_daily (0.043). The pair of predictors most correlated with each other are weight and smoke_years (0.063); note that smoke_years is slightly more correlated with weight (0.063) than with bmi (0.044). The predictor drink_daily has the weakest correlations with both the outcome and the other predictors.

Now we will observe our categorical predictor race_cat and the outcome of interest bmi:

favstats(bmi ~ race_cat, data = TX_training)
  race_cat   min     Q1 median      Q3   max     mean        sd   n missing
1    White 17.71 23.785 27.250 32.3650 65.23 28.70532  7.176522 263       0
2 Hispanic 16.04 25.845 29.695 36.4175 57.25 30.99837  7.860048  86       0
3    Black 18.29 23.400 28.430 40.8950 57.62 32.12800 11.088704  30       0
4    Other 16.50 23.175 28.295 30.3250 42.51 27.05750  6.600832  16       0

The Hispanic group has the highest median bmi, followed by the Black, Other, and White groups respectively. We prefer reporting the median instead of the mean here because of the observed right skew in the outcome distribution. The Black group has the highest variation, which may be due to its low count. The Hispanic and White groups have similar variation (7.9 and 7.2 respectively). The Other group has the lowest count and the lowest variation.
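To complement the numerical summary, here is a quick visual sketch (not part of the original output) of the bmi distributions across race_cat in the training sample:

# A sketch: boxplots of bmi by race / ethnicity category
ggplot(TX_training, aes(x = race_cat, y = bmi)) +
  geom_boxplot(fill = "slateblue", alpha = 0.3) +
  labs(title = "bmi by race / ethnicity",
       x = "", y = "Body Mass Index (bmi)")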

11.6 Collinearity Checking

As mentioned previously under the scatterplot matrix, the highest correlation (0.063) among predictors was found between weight and smoke_years. That correlation is weak, so collinearity should not be an issue.

12 The Big Model

Now we will build the big model (also known as the kitchen sink model), which incorporates all of the predictors. Our goal is to assess the effectiveness of the model in the training sample.

12.1 Fitting the kitchen sink model

The kitchen sink model predicts bmi using the key predictor smoke_years and three other predictors: weight, race_cat, and drink_daily.

model_big <- lm(bmi ~ smoke_years + race_cat + weight + drink_daily, 
                data = TX_training)
summary(model_big)

Call:
lm(formula = bmi ~ smoke_years + race_cat + weight + drink_daily, 
    data = TX_training)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.5123 -1.5223  0.0777  1.6020  7.9548 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       2.524513   0.500139   5.048 6.89e-07 ***
smoke_years      -0.006080   0.009664  -0.629   0.5296    
race_catHispanic  1.802500   0.302960   5.950 6.01e-09 ***
race_catBlack    -0.675830   0.480843  -1.406   0.1607    
race_catOther    -0.287459   0.628599  -0.457   0.6477    
weight            0.152161   0.002581  58.949  < 2e-16 ***
drink_daily       0.018491   0.009471   1.952   0.0516 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.435 on 388 degrees of freedom
Multiple R-squared:  0.9026,    Adjusted R-squared:  0.9011 
F-statistic: 599.2 on 6 and 388 DF,  p-value: < 2.2e-16

This is a detailed summary of the kitchen sink (big) model.

12.2 Effect Sizes: Coefficient Estimates

We will tidy the coefficient estimates and compute 90% confidence intervals to assess the size and direction of each effect.

tidy(model_big, conf.int = TRUE, conf.level = 0.90) |> 
  select(term, estimate, std.error, conf.low, conf.high, p.value) |> 
  kable(dig = 4)
term estimate std.error conf.low conf.high p.value
(Intercept) 2.5245 0.5001 1.6999 3.3491 0.0000
smoke_years -0.0061 0.0097 -0.0220 0.0099 0.5296
race_catHispanic 1.8025 0.3030 1.3030 2.3020 0.0000
race_catBlack -0.6758 0.4808 -1.4686 0.1170 0.1607
race_catOther -0.2875 0.6286 -1.3239 0.7490 0.6477
weight 0.1522 0.0026 0.1479 0.1564 0.0000
drink_daily 0.0185 0.0095 0.0029 0.0341 0.0516

The coefficients with the highest absolute values belong to the Hispanic race category (1.80) and the Black race category (-0.68). However, only the Hispanic race category, weight, and drink_daily show effects that are detectable at the 90% confidence level; the key predictor smoke_years does not.

12.3 Describing the Equation

The big model implies the following about the key predictor smoke_years:

For every additional year of smoking, the body mass index (bmi) is expected to decrease by 0.0061 points (90% confidence interval: -0.0220, 0.0099). Suppose we have two adult women from Texas, of the same race, with the same weight and daily alcohol intake. If one of them smoked cigarettes for one year longer than the other, her bmi level is predicted to be 0.0061 points lower than that of the woman who smoked a year less. In practical terms, the key predictor does not have a remarkable impact on the bmi level.
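To make this interpretation concrete, here is a hedged sketch using predict(); the covariate values for the two hypothetical women are invented for illustration:

# A sketch with invented covariate values: two otherwise-identical women
# who differ by exactly one year of smoking
two_women <- tibble(
  smoke_years = c(20, 21),
  race_cat = factor("White", levels = levels(TX_training$race_cat)),
  weight = 170,
  drink_daily = 2)

predict(model_big, newdata = two_women)
# the difference between the two predictions equals the smoke_years
# coefficient, about -0.006 bmi points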

Race, on the other hand, has a larger impact. Hispanic women in particular are predicted to have a bmi 1.80 points higher (90% CI: 1.30, 2.30) than White women (the reference category) with otherwise identical features.

Each additional pound of weight is associated with a 0.15-point increase in bmi (90% confidence interval: 0.148, 0.156).

Each additional average alcoholic drink per day is associated with a 0.019-point increase in bmi (90% confidence interval: 0.0029, 0.0341). Although its impact is smaller than that of the other predictors, it remains larger than that of the key predictor.

13 The Smaller Model

Now we will build the small model, which includes a smaller subset of the predictors in the big model. To decide which predictors should be removed, we will try the backwards stepwise elimination approach.

13.1 Backwards Stepwise Elimination

Unfortunately, this method was not suitable for our model because it removed the key predictor smoke_years. We will build our candidate models using a different method, sketched below.
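For transparency, this is a sketch of the elimination step we attempted (output omitted); stats::step() drops smoke_years, which is why we set this approach aside:

# A sketch: backwards stepwise elimination by AIC; it removes our key
# predictor smoke_years, so we do not adopt its selected model
step(model_big, direction = "backward")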

13.2 Another method to develop our small model:

Here, we will compare four models: the original big model (modelA), two candidates for the small model (modelB and modelC), and a model with the key predictor alone (modelD).

modelA <- lm(bmi ~ smoke_years + race_cat + weight + drink_daily, 
             data = TX_training)
modelB <- lm(bmi ~ smoke_years + race_cat + weight, 
             data = TX_training)
modelC <- lm(bmi ~ smoke_years + race_cat, 
             data = TX_training)
modelD <- lm(bmi ~ smoke_years, 
             data = TX_training)
anova(modelC, modelB, modelA, modelD)
Analysis of Variance Table

Model 1: bmi ~ smoke_years + race_cat
Model 2: bmi ~ smoke_years + race_cat + weight
Model 3: bmi ~ smoke_years + race_cat + weight + drink_daily
Model 4: bmi ~ smoke_years
  Res.Df     RSS Df Sum of Sq         F  Pr(>F)    
1    390 22927.8                                   
2    389  2323.7  1   20604.1 3474.2192 < 2e-16 ***
3    388  2301.1  1      22.6    3.8117 0.05161 .  
4    393 23576.5 -5  -21275.5  717.4843 < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
repA <- glance(modelA) |> mutate(name = "modelA")
repB <- glance(modelB) |> mutate(name = "modelB")
repC <- glance(modelC) |> mutate(name = "modelC")
repD <- glance(modelD) |> mutate(name = "modelD")

fit_report <- bind_rows(repA, repB, repC, repD)

fit_report |> 
    select(name, r2 = r.squared, adjr2 = adj.r.squared, sigma, 
           AIC, BIC, nobs, df, df.res = df.residual) |>
    kbl(digits = c(0,4,4,3,0,0,0,0,0)) |> kable_minimal()
name r2 adjr2 sigma AIC BIC nobs df df.res
modelA 0.9026 0.9011 2.435 1833 1865 395 6 388
modelB 0.9016 0.9004 2.444 1835 1863 395 5 389
modelC 0.0294 0.0194 7.667 2737 2761 395 4 390
modelD 0.0019 -0.0006 7.745 2742 2754 395 1 393

Models A and B are comparable in terms of the R squared, adjusted R squared, sigma, and AIC values. Models C and D perform remarkably worse, so we will exclude them. Model B will be selected as the small model.

13.3 Fitting the “small” model

Now we will fit the selected small model in the training sample.

model_small <- lm(bmi ~ smoke_years + race_cat + weight,
                  data = TX_training)

summary(model_small)

Call:
lm(formula = bmi ~ smoke_years + race_cat + weight, data = TX_training)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.5548 -1.4312  0.0762  1.6328  7.9096 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       2.589277   0.500838   5.170 3.75e-07 ***
smoke_years      -0.005799   0.009698  -0.598    0.550    
race_catHispanic  1.819100   0.303933   5.985 4.91e-09 ***
race_catBlack    -0.536627   0.477243  -1.124    0.262    
race_catOther    -0.216105   0.629799  -0.343    0.732    
weight            0.152142   0.002591  58.731  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.444 on 389 degrees of freedom
Multiple R-squared:  0.9016,    Adjusted R-squared:  0.9004 
F-statistic: 713.1 on 5 and 389 DF,  p-value: < 2.2e-16

13.4 Effect Sizes: Coefficient Estimates

We will produce 90% confidence intervals for our coefficient estimates to evaluate the size and direction of the predictors' effects on the outcome.

tidy(model_small, conf.int = TRUE, conf.level = 0.90) |> 
  select(term, estimate, std.error, conf.low, conf.high, p.value) |> 
  kable(dig = 4)
term estimate std.error conf.low conf.high p.value
(Intercept) 2.5893 0.5008 1.7635 3.4150 0.0000
smoke_years -0.0058 0.0097 -0.0218 0.0102 0.5502
race_catHispanic 1.8191 0.3039 1.3180 2.3202 0.0000
race_catBlack -0.5366 0.4772 -1.3235 0.2502 0.2615
race_catOther -0.2161 0.6298 -1.2545 0.8223 0.7317
weight 0.1521 0.0026 0.1479 0.1564 0.0000

13.5 Interpreting the Small Model Regression Equation

The key predictor smoke_years:

Similar to the big model, the key predictor in the small model predicts that for every additional year of smoking, the body mass index (bmi) is expected to decrease by 0.0058 points (90% confidence interval: -0.0218, 0.0102). Suppose we have two adult women from Texas, of the same race and with the same weight. If one of them smoked cigarettes for one year longer than the other, her bmi level is predicted to be 0.0058 points lower than that of the woman who smoked a year less. As before, the key predictor does not have a remarkable impact on the bmi level.

Other Predictors:

In the small model, race again has the highest absolute coefficient value: Hispanic women are predicted to have a bmi 1.82 points higher (90% confidence interval: 1.318, 2.320) than White women (the reference category) with otherwise identical features.

Each additional pound of weight is associated with a 0.15-point increase in bmi (90% confidence interval: 0.148, 0.156).

14 In-Sample Comparison

In this step, we will compare the big model's and the small model's performance in the training sample using fit summaries such as R squared, adjusted R squared, sigma, AIC, and BIC.

14.1 Quality of Fit

To accomplish this task, we will build two tibbles, one for each model. Each tibble will summarize the key results. Afterwards, we will combine the results in a single table training_comp and assess the models’ performance.

temp_a <- glance(model_big) |> 
  select(-logLik, -deviance) |>
  round(digits = 3) |>
  mutate(modelname = "big")

temp_b <- glance(model_small) |>
  select(-logLik, -deviance) |>
  round(digits = 3) |>
  mutate(modelname = "small")

training_comp <- bind_rows(temp_a, temp_b) |>
  select(modelname, nobs, df, AIC, BIC, everything())
training_comp
# A tibble: 2 × 11
  modelname  nobs    df   AIC   BIC r.squared adj.r.squared sigma statistic
  <chr>     <dbl> <dbl> <dbl> <dbl>     <dbl>         <dbl> <dbl>     <dbl>
1 big         395     6 1833. 1865.     0.903         0.901  2.44      599.
2 small       395     5 1835. 1863.     0.902         0.9    2.44      713.
# ℹ 2 more variables: p.value <dbl>, df.residual <dbl>

The two models fit the training sample almost identically. The big model (which includes all predictors) has very slightly higher R squared and adjusted R squared values and slightly lower sigma and AIC values, while the small model (all predictors except drink_daily) has the lower BIC. Given how small these differences are, the small model achieves essentially the same fit with one fewer predictor.

14.2 Assessing Assumptions

Now we will check the assumptions of linear regression: linearity, normality, homoscedasticity, and the absence of highly influential values.

14.2.1 Residual Plots for the Big Model

First we will augment our results:

aug_big <- augment(model_big, data = TX_training)

Next we will generate the residual plots:

p1 <- ggplot(aug_big, aes(x = .fitted, y = .resid)) +
  geom_point() + 
  geom_point(data = aug_big |> 
               slice_max(abs(.resid), n = 3),
             col = "red", size = 2) +
  geom_text_repel(data = aug_big |> 
               slice_max(abs(.resid), n = 3),
               aes(label = SEQNO), col = "red") +
  geom_abline(intercept = 0, slope = 0, lty = "dashed") +
  geom_smooth(method = "loess", formula = y ~ x, se = F) +
  labs(title = "Residuals vs. Fitted",
       x = "Fitted Value of bmi", y = "Residual") 

p2 <- ggplot(aug_big, aes(sample = .std.resid)) +
  geom_qq() + 
  geom_qq_line(col = "red") +
  labs(title = "Normal Q-Q plot",
       y = "Standardized Residual", 
       x = "Standard Normal Quantiles") 

p3 <- ggplot(aug_big, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
  geom_point() + 
  geom_smooth(method = "loess", formula = y ~ x, se = F) +
  labs(title = "Scale-Location Plot",
       x = "Fitted Value of bmi", 
       y = "|Std. Residual|^(1/2)") 

p4 <- ggplot(aug_big, aes(x = .hat, y = .std.resid)) +
  geom_point() + 
  geom_point(data = aug_big |> filter(.cooksd >= 0.5),
             col = "red", size = 2) +
  geom_text_repel(data = aug_big |> filter(.cooksd >= 0.5),
               aes(label = SEQNO), col = "red") +
  geom_smooth(method = "loess", formula = y ~ x, se = F) +
  geom_vline(aes(xintercept = 3*mean(.hat)), lty = "dashed") +
  labs(title = "Residuals vs. Leverage",
       x = "Leverage", y = "Standardized Residual") 

(p1 + p2) / (p3 + p4) +
  plot_annotation(title = "Assessing Residuals for the big model")

There are no serious issues except for the spread of the residuals in the residuals vs. fitted values plot, which suggests a problem with homoscedasticity (non-constant variance).
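As a supplement to the visual check, a formal score test for non-constant error variance (a sketch using car::ncvTest(), which we have already loaded) can be applied to the big model; a small p-value would support the heteroscedasticity concern:

# A sketch: score test for non-constant error variance (car package)
ncvTest(model_big)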

14.2.2 Residual Plots for the Small Model

First we will augment the small model:

aug_small <- augment(model_small, data = TX_training)

Next we will generate the residual plots:

p1 <- ggplot(aug_small, aes(x = .fitted, y = .resid)) +
  geom_point() + 
  geom_point(data = aug_small |> 
               slice_max(abs(.resid), n = 3),
             col = "red", size = 2) +
  geom_text_repel(data = aug_small |> 
               slice_max(abs(.resid), n = 3),
               aes(label = SEQNO), col = "red") +
  geom_abline(intercept = 0, slope = 0, lty = "dashed") +
  geom_smooth(method = "loess", formula = y ~ x, se = F) +
  labs(title = "Residuals vs. Fitted",
       x = "Fitted Value of bmi", y = "Residual") 

p2 <- ggplot(aug_small, aes(sample = .std.resid)) +
  geom_qq() + 
  geom_qq_line(col = "red") +
  labs(title = "Normal Q-Q plot",
       y = "Standardized Residual", 
       x = "Standard Normal Quantiles") 

p3 <- ggplot(aug_small, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
  geom_point() + 
  geom_smooth(method = "loess", formula = y ~ x, se = F) +
  labs(title = "Scale-Location Plot",
       x = "Fitted Value of bmi", 
       y = "|Std. Residual|^(1/2)") 

p4 <- ggplot(aug_small, aes(x = .hat, y = .std.resid)) +
  geom_point() + 
  geom_point(data = aug_small |> filter(.cooksd >= 0.5),
             col = "red", size = 2) +
  geom_text_repel(data = aug_small |> filter(.cooksd >= 0.5),
               aes(label = SEQNO), col = "red") +
  geom_smooth(method = "loess", formula = y ~ x, se = F) +
  geom_vline(aes(xintercept = 3*mean(.hat)), lty = "dashed") +
  labs(title = "Residuals vs. Leverage",
       x = "Leverage", y = "Standardized Residual") 

(p1 + p2) / (p3 + p4) +
  plot_annotation(title = "Assessing Residuals for the small model")

Similar to the big model, the small model also does not face any issues except for the assumption of homoscedasticity.

14.2.3 Does collinearity have a meaningful impact?

car::vif(model_big)
                GVIF Df GVIF^(1/(2*Df))
smoke_years 1.012975  1        1.006467
race_cat    1.057683  3        1.009391
weight      1.026062  1        1.012947
drink_daily 1.025737  1        1.012787

As we can see, the highest generalized variance inflation factor in the big model is about 1.06, well below common thresholds of concern, so there is no serious problem with collinearity. The same conclusion applies to the small model.
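For completeness, the same check can be run on the small model (a sketch; we expect similarly small values):

# A sketch: generalized VIFs for the small model
car::vif(model_small)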

14.3 Comparing the Models

Neither model has serious issues with the regression assumptions, apart from the homoscedasticity concern noted above. The small model achieves essentially the same fit as the big model with one fewer predictor, so it will be the main candidate for the time being.

15 Model Validation

We will test the two regression models’ performances using the test sample. Based on that, we can determine which model should be selected as the final model.

15.1 Calculating Prediction Errors

15.1.1 Big Model: Calculating Fits/Residuals

In this step, we will calculate the fitted and residual values using the augment function. We do not need to back-transform any variables because we used them without applying transformations.

aug_big <- augment(model_big, newdata = TX_test) |> 
  mutate(mod_name = "big",
         bmi_fit = .fitted,
         bmi_res = bmi - bmi_fit) |>
  select(SEQNO, mod_name, bmi, bmi_fit, bmi_res, everything())

head(aug_big,3)
# A tibble: 3 × 11
  SEQNO   mod_name   bmi bmi_fit bmi_res race_cat smoke_years weight drink_daily
  <chr>   <chr>    <dbl>   <dbl>   <dbl> <fct>          <dbl>  <dbl>       <dbl>
1 202200… big       26.6    28.4  -1.73  White             11    170           2
2 202200… big       25.1    30.5  -5.42  White             33    185           2
3 202200… big       26.6    27.4  -0.752 White             47    165           2
# ℹ 2 more variables: .fitted <dbl>, .resid <dbl>

15.1.2 Small Model: Calculating Fits/Residuals

We will apply the same steps for the small model.

aug_small <- augment(model_small, newdata = TX_test) |> 
  mutate(mod_name = "small",
         bmi_fit = .fitted,
         bmi_res = bmi - bmi_fit) |>
  select(SEQNO, mod_name, bmi, bmi_fit, bmi_res, everything())

head(aug_small,3)
# A tibble: 3 × 11
  SEQNO   mod_name   bmi bmi_fit bmi_res race_cat smoke_years weight drink_daily
  <chr>   <chr>    <dbl>   <dbl>   <dbl> <fct>          <dbl>  <dbl>       <dbl>
1 202200… small     26.6    28.4  -1.76  White             11    170           2
2 202200… small     25.1    30.5  -5.45  White             33    185           2
3 202200… small     26.6    27.4  -0.790 White             47    165           2
# ℹ 2 more variables: .fitted <dbl>, .resid <dbl>

15.1.3 Combining the Results

Now we will combine the results of the two models.

test_comp <- union(aug_big, aug_small) |>
  arrange(SEQNO, mod_name)

test_comp |> head()
# A tibble: 6 × 11
  SEQNO   mod_name   bmi bmi_fit bmi_res race_cat smoke_years weight drink_daily
  <chr>   <chr>    <dbl>   <dbl>   <dbl> <fct>          <dbl>  <dbl>       <dbl>
1 202200… big       25.1    28.4   -3.26 White             12    170           2
2 202200… small     25.1    28.4   -3.28 White             12    170           2
3 202200… big       21.5    25.1   -3.62 Other              2    150           5
4 202200… small     21.5    25.2   -3.66 Other              2    150           5
5 202200… big       23.5    25.3   -1.83 White              7    150           1
6 202200… small     23.5    25.4   -1.88 White              7    150           1
# ℹ 2 more variables: .fitted <dbl>, .resid <dbl>

The resulting tibble test_comp will be used in the following steps to summarize, visualize, and identify the largest errors in both models.

15.2 Visualizing the Predictions

ggplot(test_comp, aes(x = bmi_fit, y = bmi)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, lty = "dashed") + 
  geom_smooth(method = "loess", col = "blue", se = FALSE, formula = y ~ x) +
  facet_wrap( ~ mod_name, labeller = "label_both") +
  labs(x = "Predicted bmi",
       y = "Observed bmi",
       title = "Observed vs. Predicted bmi",
       subtitle = "Comparing Big to Small Model in Test Sample",
       caption = "Dashed line is where Observed = Predicted")

The big and small models show the same pattern of prediction errors.

15.3 Summarizing the Errors

This summary includes the mean absolute prediction error (MAPE), the root mean squared prediction error (RMSPE), and the maximum absolute error calculated from predictions made by each model.
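For reference, writing $e_i = \text{bmi}_i - \widehat{\text{bmi}}_i$ for the prediction error of subject $i$ in the test sample of $n$ women, these summaries are:

$$\text{MAPE} = \frac{1}{n}\sum_{i=1}^{n} |e_i|, \qquad \text{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^2}, \qquad \text{max error} = \max_i |e_i|$$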

test_comp |>
  group_by(mod_name) |>
  summarise(n = n(),
            MAPE = mean(abs(bmi_res)), 
            RMSPE = sqrt(mean(bmi_res^2)),
            max_error = max(abs(bmi_res)))
# A tibble: 2 × 5
  mod_name     n  MAPE RMSPE max_error
  <chr>    <int> <dbl> <dbl>     <dbl>
1 big        170  2.16  2.78      11.9
2 small      170  2.16  2.77      11.8

The small model has slightly better MAPE, RMSPE, and maximum error values than the big model. The MAPE suggests that predictions from these models miss women's observed bmi by approximately 2.2 points on average. This level of error is not acceptable for bmi, because a difference of that size can easily change a woman's BMI category.

15.3.1 Identify the largest errors

We will identify the subject with the worst fit in both models by calculating the maximum residual error in each model.

temp1 <- aug_big |>
  filter(abs(bmi_res) == max(abs(bmi_res)))

temp2 <- aug_small |>
  filter(abs(bmi_res) == max(abs(bmi_res)))

bind_rows(temp1, temp2)
# A tibble: 2 × 11
  SEQNO   mod_name   bmi bmi_fit bmi_res race_cat smoke_years weight drink_daily
  <chr>   <chr>    <dbl>   <dbl>   <dbl> <fct>          <dbl>  <dbl>       <dbl>
1 202200… big       50.2    38.2    11.9 Black             31    240           2
2 202200… small     50.2    38.4    11.8 Black             31    240           2
# ℹ 2 more variables: .fitted <dbl>, .resid <dbl>

From the presented table, we can see that the same subject (2022009499) had the worst fit in both models.

15.3.2 Validated R-square values

Here is the squared correlation between our predicted bmi and our actual bmi in the test sample, using the big model.

cor(aug_big$bmi, aug_big$bmi_fit)^2
[1] 0.8611934

This is the R-square we obtained within the test sample for the small model.

cor(aug_small$bmi, aug_small$bmi_fit)^2
[1] 0.8627504

The squared correlations of the two models are about the same, with the small model having a slightly higher value. Still, both models have remarkably high validated R squared values.

15.4 Removing the largest error

This is the summary before removing the largest error:

test_comp |>
  group_by(mod_name) |>
  summarize(n = n(), MAPE = mean(abs(bmi_res)), 
            RMSPE = sqrt(mean(bmi_res^2)),
            max_error = max(abs(bmi_res)), 
            median_APE = median(abs(bmi_res)),
            valid_R2 = cor(bmi, bmi_fit)^2) |>
  kbl(digits = c(0, 0, 4, 3, 2, 3, 3)) |> kable_minimal(font_size = 22)
mod_name n MAPE RMSPE max_error median_APE valid_R2
big 170 2.1613 2.780 11.94 1.750 0.861
small 170 2.1558 2.765 11.77 1.765 0.863
And here is the same summary after removing the subject with the largest error (SEQNO 2022009499):

test_comp |> filter(SEQNO != "2022009499") |>
  group_by(mod_name) |>
  summarize(n = n(), MAPE = mean(abs(bmi_res)), 
            RMSPE = sqrt(mean(bmi_res^2)),
            max_error = max(abs(bmi_res)), 
            median_APE = median(abs(bmi_res)),
            valid_R2 = cor(bmi, bmi_fit)^2) |>
  kbl(digits = c(0, 0, 4, 3, 2, 3, 3)) |> kable_minimal(font_size = 22)
mod_name n MAPE RMSPE max_error median_APE valid_R2
big 169 2.1034 2.633 7.99 1.739 0.873
small 169 2.0989 2.621 8.02 1.764 0.874

Removing the subject with the largest error increased the validated R squared values and reduced the error summaries for both models, but the difference between the models remains essentially the same. The maximum error was substantially reduced. The big model has the lower median APE, while the small model remains slightly better in terms of MAPE, RMSPE, and validated R squared.

15.5 Comparing the Models

The small model was essentially tied with the big model in the training sample, and in the test sample it was slightly better in the majority of the summaries. Therefore, the small model will be selected as our final model, though the performance of the two models is fairly similar.

16 Discussion

16.1 Chosen Model

In the training sample, the small model fit essentially as well as the big model while using one fewer predictor (the big model had marginally better R squared, adjusted R squared, and AIC; the small model had the better BIC). In the test sample, the small model exhibited better performance in terms of MAPE, RMSPE, maximum error, and validated R squared, while the big model had the better median APE. After removing the subject with the largest error, the small model's performance remained slightly better. Based on these results, we choose the small model: it has fewer predictors and at least comparable performance in both the training and test samples.

16.2 Answering My Question

tidy(model_small, conf.int = TRUE, conf.level = 0.90) |> 
  select(term, estimate, std.error, conf.low, conf.high, p.value) |> 
  kable(dig = 4)
term estimate std.error conf.low conf.high p.value
(Intercept) 2.5893 0.5008 1.7635 3.4150 0.0000
smoke_years -0.0058 0.0097 -0.0218 0.0102 0.5502
race_catHispanic 1.8191 0.3039 1.3180 2.3202 0.0000
race_catBlack -0.5366 0.4772 -1.3235 0.2502 0.2615
race_catOther -0.2161 0.6298 -1.2545 0.8223 0.7317
weight 0.1521 0.0026 0.1479 0.1564 0.0000

The linear regression model that predicts bmi for women in Texas is:

bmi = 2.59 − 0.0058 × smoke_years + 1.82 × (Hispanic) − 0.54 × (Black) − 0.22 × (Other) + 0.152 × weight

where Hispanic, Black, and Other are indicator variables, with White as the reference category.

In our study sample, the key predictor smoke_years and the Black and Other race categories were associated with slightly lower bmi levels relative to the White reference category, while the Hispanic race category and weight were associated with higher bmi levels. The only predictors that exhibited a detectable impact on the outcome were weight and the Hispanic race category.

Quality of Fit:

The validated R squared is 0.863, which indicates that our model was able to explain 86.3% of the variation in our outcome in the test sample. The adjusted R squared generated from the training sample was 0.90, which indicates that the model's predictive ability was slightly overestimated there. The remarkably high predictive ability of this model is likely due to the incorporation of the most important factor in calculating bmi (weight). The key predictor had a very small coefficient (-0.0058), which implies that it is not a useful predictor, while the Hispanic race category had a sizable coefficient (1.82) associated with increased bmi.

Regression Assumptions:

There were no serious violations of linear regression assumptions, except for homoscedasticity, which can be observed in the residual vs. fitted plot.

16.3 Next Steps

The only predictors with missing observations were the key predictor smoke_years and drink_daily. In this analysis, I assumed that those observations were missing completely at random. The next step would be to use imputation (single imputation followed by multiple imputation) to assess potential improvement in the big model, which incorporated both variables, and to check whether imputation could change our choice of the best model.
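As a sketch of what that next step might look like (the package choice and settings here are illustrative, not part of this analysis), multiple imputation with the mice package could be applied to the data before dropping incomplete cases, with results pooled across imputations:

# A sketch only, not run in this analysis: multiple imputation with mice
library(mice)

imp <- TX_b |>
  select(bmi, smoke_years, race_cat, weight, drink_daily) |>
  mice(m = 10, seed = 777, printFlag = FALSE)

fit_imp <- with(imp, lm(bmi ~ smoke_years + race_cat + weight + drink_daily))
summary(pool(fit_imp))  # pooled coefficient estimates across imputations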

Another step would be investigating the association between BMI and race / ethnicity among women. In addition to race, the new model would incorporate age, socioeconomic status, and educational level. This way, we can observe whether race is associated with BMI levels when all other relevant factors (age, socioeconomic status, and educational level) are held constant. We would also include women from various states to make the sample more representative of the US women population.

16.4 Reflection

When we first loaded the raw data set, the number of observations was 445,132. After narrowing it down to my population of interest (adult women in Texas), the number of observations remained relatively high (1,164). The large sample size gave me a sense of reassurance that I had met the required number of observations for project B. However, after narrowing it down to complete cases of the variables of interest for the study 2 analysis, I was left with only 565 observations. That remained above the minimum required number, but I did not anticipate such a substantial reduction in sample size. Had I known that I would lose so many observations during data cleaning and management, I would have included women from multiple southern states rather than Texas alone.

17 Session information

sessionInfo()
R version 4.5.1 (2025-06-13 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.4   forcats_1.0.1     stringr_1.5.2     purrr_1.1.0      
 [5] readr_2.1.5       tidyr_1.3.1       tibble_3.3.0      tidyverse_2.0.0  
 [9] sessioninfo_1.2.3 patchwork_1.3.2   naniar_1.1.0      mosaic_1.9.2     
[13] mosaicData_0.20.4 ggformula_1.0.0   dplyr_1.1.4       Matrix_1.7-3     
[17] lattice_0.22-7    knitr_1.50        janitor_2.2.1     Hmisc_5.2-4      
[21] GGally_2.4.0      car_3.1-3         carData_3.0-5     haven_2.5.5      
[25] kableExtra_1.4.0  ggrepel_0.9.6     ggplot2_4.0.0     gtsummary_2.4.0  
[29] gt_1.1.0          broom_1.0.10     

loaded via a namespace (and not attached):
 [1] ggiraph_0.9.2           tidyselect_1.2.1        viridisLite_0.4.2      
 [4] farver_2.1.2            S7_0.2.0                fastmap_1.2.0          
 [7] fontquiver_0.2.1        labelled_2.15.0         digest_0.6.37          
[10] rpart_4.1.24            timechange_0.3.0        lifecycle_1.0.4        
[13] cluster_2.1.8.1         magrittr_2.0.4          compiler_4.5.1         
[16] rlang_1.1.6             tools_4.5.1             utf8_1.2.6             
[19] yaml_2.3.10             data.table_1.17.8       labeling_0.4.3         
[22] htmlwidgets_1.6.4       xml2_1.4.0              RColorBrewer_1.1-3     
[25] abind_1.4-8             withr_3.0.2             foreign_0.8-90         
[28] nnet_7.3-20             grid_4.5.1              mosaicCore_0.9.5       
[31] gdtools_0.4.4           colorspace_2.1-2        MASS_7.3-65            
[34] scales_1.4.0            ggridges_0.5.7          cli_3.6.5              
[37] rmarkdown_2.30          generics_0.1.4          rstudioapi_0.17.1      
[40] tzdb_0.5.0              splines_4.5.1           base64enc_0.1-3        
[43] vctrs_0.6.5             jsonlite_2.0.0          fontBitstreamVera_0.1.1
[46] hms_1.1.4               visdat_0.6.0            Formula_1.2-5          
[49] htmlTable_2.4.3         systemfonts_1.3.1       glue_1.8.0             
[52] ggstats_0.11.0          stringi_1.8.7           gtable_0.3.6           
[55] pillar_1.11.1           htmltools_0.5.8.1       R6_2.6.1               
[58] textshaping_1.0.4       evaluate_1.0.5          backports_1.5.0        
[61] snakecase_0.11.1        fontLiberation_0.1.0    Rcpp_1.1.0             
[64] nlme_3.1-168            svglite_2.2.1           gridExtra_2.3          
[67] checkmate_2.3.3         mgcv_1.9-3              xfun_0.52              
[70] fs_1.6.6                pkgconfig_2.0.3