4 Social Factors and Children’s Mental Health

A cross-sectional study in the Midwest

Author

Walaa Alshaia and Sarah Albalawi

Published

2025-10-28

R Packages and Setup

Code

knitr::opts_chunk$set(comment = NA) 

library(janitor) 
library(broom)
library(gt)
library(gtsummary)
library(Hmisc)
library(ggrepel)
library(GGally)
library(knitr)
library(mosaic)
library(naniar)
library(patchwork)
library(simputation)
library(rms)
library(caret)
library(pROC)
library(sessioninfo)
library(tidyverse) 

opts_chunk$set(comment=NA)
options(dplyr.summarise.inform = FALSE)

theme_set(theme_bw())

5 Data Source

The data set used for this study is the National Health Interview Survey (NHIS). The NHIS is part of the Centers for Disease Control and Prevention (CDC) and the main source of statistics on the health of the civilian non-institutionalized population of the United States. The data is collected through interviews conducted at the households of the subjects. The target population are non-institutionalized US citizens, including group homes and shelters. The NHIS aims to obtain a representative sample of the US population through geographically clustered sampling techniques. The data collection is conducted throughout the year, from January to December. For this study, the 2022 NHIS sample child interview data set was used. The file can be found at https://www.cdc.gov/nchs/nhis/2022nhis.htm.

6 The Subjects

The subjects of this study are children (any individual below the age of 18) living in civilian non-institutionalized households. The sample is restricted to those in the Midwest region of the U.S.

7 Loading and Tidying the Data

7.1 Loading the Raw Data

For this study, we will use the 2022 NHIS sample child interview data set. The file can be found at https://www.cdc.gov/nchs/nhis/2022nhis.htm.

Code

R.child22 <- read_csv("child22.csv",
                      guess_max = 4000,
                      show_col_types = FALSE) |>
    mutate(across(where(is.character), as_factor))
                      

dim (R.child22)

[1] 7464  432

7.2 Cleaning the Data

In this step, we will mutate, rename, and regroup our variables as needed.

7.2.1 Selecting Variables of Interest

Given the extensive size of the raw data set, we will filter the data set to include the variables of interest for our analysis. Additionally, we will narrow the population of interest to those living in the Midwest.

Code

child_raw <- R.child22 |>
  select (HHX,REGION, SEX_C, AGEP_C, MAXEDUCP_C, PHSTAT_C, SDQTOT_C, VIOLENEV_C, HISPALLP_C) |> 
  filter(`REGION` == 2, AGEP_C < 97, SEX_C < 10, MAXEDUCP_C < 97 , PHSTAT_C < 10, SDQTOT_C < 90, VIOLENEV_C < 10, HISPALLP_C < 8)

child_raw

# A tibble: 1,191 × 9
   HHX    REGION SEX_C AGEP_C MAXEDUCP_C PHSTAT_C SDQTOT_C VIOLENEV_C HISPALLP_C
   <fct>   <dbl> <dbl>  <dbl>      <dbl>    <dbl>    <dbl>      <dbl>      <dbl>
 1 H0516…      2     2     17          8        3        8          2          2
 2 H0166…      2     2      7          1        1       14          2          1
 3 H0379…      2     1      5          1        1        2          2          1
 4 H0394…      2     1     10          9        1        0          2          2
 5 H0088…      2     1      5          5        1        6          2          2
 6 H0645…      2     1     17          9        1        4          2          4
 7 H0290…      2     1      7          8        2        3          2          2
 8 H0377…      2     2     13          8        2       11          2          2
 9 H0128…      2     1     15          8        2       88          8          2
10 H0038…      2     1     12          9        1        0          2          2
# ℹ 1,181 more rows

We will mutate the variables with levels containing non-relevant values, such as “refused to answer” or “not ascertained”, into NA.

Code

child_raw$SEX_C= na_if(child_raw$SEX_C, 7)
child_raw$SEX_C= na_if(child_raw$SEX_C, 8)
child_raw$SEX_C= na_if(child_raw$SEX_C, 9)
child_raw$SDQTOT_C= na_if(child_raw$SDQTOT_C,88)
child_raw$PHSTAT_C = na_if(child_raw$PHSTAT_C, 7)
child_raw$PHSTAT_C = na_if(child_raw$PHSTAT_C, 8)
child_raw$PHSTAT_C = na_if(child_raw$PHSTAT_C, 9)
child_raw$VIOLENEV_C = na_if(child_raw$VIOLENEV_C, 7)
child_raw$VIOLENEV_C = na_if(child_raw$VIOLENEV_C, 8)
child_raw$VIOLENEV_C = na_if(child_raw$VIOLENEV_C, 9)

child_raw

# A tibble: 1,191 × 9
   HHX    REGION SEX_C AGEP_C MAXEDUCP_C PHSTAT_C SDQTOT_C VIOLENEV_C HISPALLP_C
   <fct>   <dbl> <dbl>  <dbl>      <dbl>    <dbl>    <dbl>      <dbl>      <dbl>
 1 H0516…      2     2     17          8        3        8          2          2
 2 H0166…      2     2      7          1        1       14          2          1
 3 H0379…      2     1      5          1        1        2          2          1
 4 H0394…      2     1     10          9        1        0          2          2
 5 H0088…      2     1      5          5        1        6          2          2
 6 H0645…      2     1     17          9        1        4          2          4
 7 H0290…      2     1      7          8        2        3          2          2
 8 H0377…      2     2     13          8        2       11          2          2
 9 H0128…      2     1     15          8        2       NA         NA          2
10 H0038…      2     1     12          9        1        0          2          2
# ℹ 1,181 more rows

7.2.2 Converting Variable Types

We will convert the variables into factors, except for HHX, which will remain as a character. The variables AGE_C and SDQTOT_C are the only quantitative variables and will be converted into numeric variables.

Code

child_raw <- child_raw |>
  mutate_if(is.vector, as.factor)|>
  mutate(HHX = as.character(HHX))|>
  mutate(AGEP_C = as.numeric(AGEP_C))|>
  mutate(SDQTOT_C= as.numeric(SDQTOT_C))

child_raw

# A tibble: 1,191 × 9
   HHX    REGION SEX_C AGEP_C MAXEDUCP_C PHSTAT_C SDQTOT_C VIOLENEV_C HISPALLP_C
   <chr>  <fct>  <fct>  <dbl> <fct>      <fct>       <dbl> <fct>      <fct>     
 1 H0516… 2      2         14 8          3               9 2          2         
 2 H0166… 2      2          4 1          1              15 2          1         
 3 H0379… 2      1          2 1          1               3 2          1         
 4 H0394… 2      1          7 9          1               1 2          2         
 5 H0088… 2      1          2 5          1               7 2          2         
 6 H0645… 2      1         14 9          1               5 2          4         
 7 H0290… 2      1          4 8          2               4 2          2         
 8 H0377… 2      2         10 8          2              12 2          2         
 9 H0128… 2      1         12 8          2              NA <NA>       2         
10 H0038… 2      1          9 9          1               1 2          2         
# ℹ 1,181 more rows

7.2.3 Changing Variable Names

We will assign meaningful names to the variables of our study.

Code

child22 <- child_raw|>
  select(HHX, SEX_C, AGEP_C, MAXEDUCP_C, PHSTAT_C, SDQTOT_C, VIOLENEV_C, HISPALLP_C)

child22 <- child22 |>
  rename(sex = SEX_C,
         age = AGEP_C,
         race = HISPALLP_C,
         adult_edu = MAXEDUCP_C,
         health_status = PHSTAT_C, 
         sdq = SDQTOT_C, 
         violence = VIOLENEV_C,
         id = HHX)

child22

# A tibble: 1,191 × 8
   id      sex     age adult_edu health_status   sdq violence race 
   <chr>   <fct> <dbl> <fct>     <fct>         <dbl> <fct>    <fct>
 1 H051614 2        14 8         3                 9 2        2    
 2 H016609 2         4 1         1                15 2        1    
 3 H037925 1         2 1         1                 3 2        1    
 4 H039485 1         7 9         1                 1 2        2    
 5 H008818 1         2 5         1                 7 2        2    
 6 H064500 1        14 9         1                 5 2        4    
 7 H029084 1         4 8         2                 4 2        2    
 8 H037799 2        10 8         2                12 2        2    
 9 H012805 1        12 8         2                NA <NA>     2    
10 H003829 1         9 9         1                 1 2        2    
# ℹ 1,181 more rows

7.2.4 Working with Categorical Predictors

For this project, we have four categorical variables acting as predictors for both analyses and one binary outcome for the logistic regression analysis: sex, race, adult_edu, health_status, and violence

7.2.4.1 sex variable

We will rename the numeric levels of sex into meaningful labels:

Code

child22 <- child22|>
mutate(sex = fct_recode (sex, 
                                  "Male"= "1", 
                                  "Female"= "2"))
tabyl(child22$sex)

 child22$sex   n   percent
        Male 613 0.5146935
      Female 578 0.4853065

7.2.4.2 race variable

We will rename each category of the race variable to clarify the meaning of each level. We will also combine the levels with the lowest frequencies under other category.

Code

child22 <- child22 |>
  mutate(
    race = fct_recode(race,
      "Hispanic" = "1",
      "White" = "2",
      "African_American" = "3",
      "Asian" = "4",
      "other" = "5",
      "other" = "6",
      "other" = "7" ))

tabyl(child22$race)

     child22$race   n    percent
         Hispanic 163 0.13685978
            White 804 0.67506297
 African_American  78 0.06549118
            Asian  63 0.05289673
            other  83 0.06968934

7.2.4.3 adult_edu variable

We will start by renaming the adult_edu numeric levels into meaningful words:

Code

child22 <- child22 |>
mutate(adult_edu= fct_recode (adult_edu, 
                                  "Grade 1-11"= "1", 
                                  "12th grade,no diploma"= "2", 
                                  "GED"="3", 
                                  "High School Graduate"= "4", 
                                  "Some college, no degree"= "5", 
                                  "Associate degree-nonacademic-"="6",
                                  "Associate degree-academic-"="7",
                                  "Bachelor's degree"="8",
                                  "Master's degree"="9",
                                  "Professional School"="10"))
tabyl(child22$adult_edu)

             child22$adult_edu   n     percent
                    Grade 1-11  33 0.027707809
         12th grade,no diploma   7 0.005877414
                           GED   9 0.007556675
          High School Graduate 157 0.131821998
       Some college, no degree 172 0.144416457
 Associate degree-nonacademic-  62 0.052057095
    Associate degree-academic- 133 0.111670865
             Bachelor's degree 321 0.269521411
               Master's degree 212 0.178001679
           Professional School  85 0.071368598

We will collapse the present levels to the following:

Incomplete high school
Complete high school
Incomplete college
Associate’s Degree
Bachelor’s Degree
Graduate Degree

Code

child22 <- child22 |>
mutate (adult_edu= fct_collapse(adult_edu, incomplete_highschool = c("Grade 1-11", "12th grade,no diploma"),
                                complete_highschool = c("GED", "High School Graduate"),
                                incomplete_college = c("Some college, no degree"),
                                      associate_degree = c("Associate degree-nonacademic-", "Associate degree-academic-"),
                                      bachelors_degree = c("Bachelor's degree"),
                                      graduate_degree = c("Master's degree", "Professional School")))
tabyl(child22$adult_edu)

     child22$adult_edu   n    percent
 incomplete_highschool  40 0.03358522
   complete_highschool 166 0.13937867
    incomplete_college 172 0.14441646
      associate_degree 195 0.16372796
      bachelors_degree 321 0.26952141
       graduate_degree 297 0.24937028

7.2.4.4 health_status variable

We will start by renaming the health_status numeric levels into meaningful labels:

Code

child22 <- child22|>
mutate(health_status= fct_recode (health_status, 
                                  "Excellent"= "1", 
                                  "Very Good"= "2", 
                                  "Good"="3", 
                                  "Fair"= "4", 
                                  "Poor"= "5"))
tabyl(child22$health_status)

 child22$health_status   n      percent valid_percent
             Excellent 733 0.6154492024   0.615966387
             Very Good 289 0.2426532326   0.242857143
                  Good 141 0.1183879093   0.118487395
                  Fair  24 0.0201511335   0.020168067
                  Poor   3 0.0025188917   0.002521008
                  <NA>   1 0.0008396306            NA

We would like to collapse them into the following categories:

Excellent
Very Good
Not very good

Code

child22 <- child22 |>
mutate (health_status= fct_collapse(health_status, "Not very good" = c("Good","Fair", "Poor")))
tabyl(child22$health_status)

 child22$health_status   n      percent valid_percent
             Excellent 733 0.6154492024     0.6159664
             Very Good 289 0.2426532326     0.2428571
         Not very good 168 0.1410579345     0.1411765
                  <NA>   1 0.0008396306            NA

7.2.4.5 violence variable

We will rename the violence numeric levels into meaningful labels:

Code

child22 <- child22|>
mutate(violence = fct_recode (violence, 
                                  "Yes"= "1", 
                                  "No"= "2"))
tabyl(child22$violence)

 child22$violence    n    percent valid_percent
              Yes   90 0.07556675    0.07698888
               No 1079 0.90596138    0.92301112
             <NA>   22 0.01847187            NA

7.2.5 Assessment of Missing Values

We will employ the miss_var_summary function to assess missing values in the study variables. The two outcomes of the linear and logistic regression (sdq and violence) have missing values. One of the predictors (health_status) also has a missing value.

Code

miss_var_summary(child22)

# A tibble: 8 × 3
  variable      n_miss pct_miss
  <chr>          <int>    <num>
1 violence          22   1.85  
2 sdq               18   1.51  
3 health_status      1   0.0840
4 id                 0   0     
5 sex                0   0     
6 age                0   0     
7 adult_edu          0   0     
8 race               0   0

7.2.6 Arranging the Tibble

This is the count of the final data set child22 after refining the variables of interest:

Code

dim(child22)

[1] 1191    8

we have 1191 observations and 8 variables of interest (including id).

8 The Tidy Tibble

8.1 Listing the Tibble

We will observe the finalized version of our data set after cleaning and filtering the variables:

Code

child22

# A tibble: 1,191 × 8
   id      sex      age adult_edu             health_status   sdq violence race 
   <chr>   <fct>  <dbl> <fct>                 <fct>         <dbl> <fct>    <fct>
 1 H051614 Female    14 bachelors_degree      Not very good     9 No       White
 2 H016609 Female     4 incomplete_highschool Excellent        15 No       Hisp…
 3 H037925 Male       2 incomplete_highschool Excellent         3 No       Hisp…
 4 H039485 Male       7 graduate_degree       Excellent         1 No       White
 5 H008818 Male       2 incomplete_college    Excellent         7 No       White
 6 H064500 Male      14 graduate_degree       Excellent         5 No       Asian
 7 H029084 Male       4 bachelors_degree      Very Good         4 No       White
 8 H037799 Female    10 bachelors_degree      Very Good        12 No       White
 9 H012805 Male      12 bachelors_degree      Very Good        NA <NA>     White
10 H003829 Male       9 graduate_degree       Excellent         1 No       White
# ℹ 1,181 more rows

8.2 Size and Identifiers

The tibble frame child22 consists of 1191 rows (observations) and 8 columns (variables). The id functions as our indicator variable, guaranteeing that each row in our data set has a distinct identification. This is demonstrated in the code provided below.

Code

identical(nrow(child22), n_distinct(child22$id))

[1] TRUE

8.3 Save The Tibble

We will use the write_rds function to store the finalized data set.

Code

write_rds(child22,"child22.Rds")

9 The Code Book

9.1 Defining the Variables

Variable	Role	Type	Description
id	Subject Identifier Number	-	Subject character code
sex	input	Binary	Male and Female
age	input	Quantitative	Child age in years
race	input	5 Categories	Multiple race groups (White, Hispanic, African American, Asian, Other)
adult_edu	input	6 Categories	Level of education of the adults in the child’s family (Incomplete high school, Complete high school, Incomplete college, Associate degree, Bachelors degree, Graduate degree )
health_status	input	3 Categories	The health status of the children interviewed (Excellent, Very Good, Not very good)
sdq	Outcome	Quantitative	Strengths and Difficulties Questionnaire (SDQ) total score
violence	Outcome	Binary	The subject is a Victim of/ have witnessed violence (yes or no)

Code

tbl_summary(select(child22, -id),
        label = list(
            sex = "Sex",
            age = "Age in years",
            race = "Race category",
            adult_health = "Adult Educational level",
            Health_Status = "Child Health Status",
            sdq = "Strengths and Difficulties Questionnaire (SDQ) total score",
            violence = "Victim of or witnessed violence"),
        stat = list( all_continuous() ~ 
                "{median} [{min} to {max}]" ))

Characteristic	N = 1,191¹
Sex
Male	613 (51%)
Female	578 (49%)
Age in years	8.0 [1.0 to 14.0]
adult_edu
incomplete_highschool	40 (3.4%)
complete_highschool	166 (14%)
incomplete_college	172 (14%)
associate_degree	195 (16%)
bachelors_degree	321 (27%)
graduate_degree	297 (25%)
health_status
Excellent	733 (62%)
Very Good	289 (24%)
Not very good	168 (14%)
Unknown	1
Strengths and Difficulties Questionnaire (SDQ) total score	6.0 [1.0 to 31.0]
Unknown	18
Victim of or witnessed violence	90 (7.7%)
Unknown	22
Race category
Hispanic	163 (14%)
White	804 (68%)
African_American	78 (6.5%)
Asian	63 (5.3%)
other	83 (7.0%)
¹ n (%); Median [Min to Max]

9.2 Numerical Description

Code

describe(child22) |> html()

child22 Descriptives

child22

8 Variables 1191 Observations

n	missing	distinct
1191	0	1191

lowest : H000027 H000043 H000185 H000206 H000241 , highest: H067509 H067545 H067731 H067745 H067797

sex

n	missing	distinct
1191	0	2

 Value        Male Female
 Frequency     613    578
 Proportion  0.515  0.485

age

n	missing	distinct	Info	Mean	pMedian	Gmd	.05	.10	.25	.50	.75	.90	.95
1191	0	14	0.995	7.725	7.5	4.805	1	2	4	8	11	13	14

 Value          1     2     3     4     5     6     7     8     9    10    11    12
 Frequency     86    91    92    69    73    81    75    78    69    79   101   101
 Proportion 0.072 0.076 0.077 0.058 0.061 0.068 0.063 0.065 0.058 0.066 0.085 0.085
                       
 Value         13    14
 Frequency     98    98
 Proportion 0.082 0.082

adult_edu

n	missing	distinct
1191	0	6

 Value      incomplete_highschool   complete_highschool    incomplete_college
 Frequency                     40                   166                   172
 Proportion                 0.034                 0.139                 0.144
                                                                             
 Value           associate_degree      bachelors_degree       graduate_degree
 Frequency                    195                   321                   297
 Proportion                 0.164                 0.270                 0.249

health_status

n	missing	distinct
1190	1	3

 Value          Excellent     Very Good Not very good
 Frequency            733           289           168
 Proportion         0.616         0.243         0.141

sdq

n	missing	distinct	Info	Mean	pMedian	Gmd	.05	.10	.25	.50	.75	.90	.95
1173	18	31	0.995	7.665	7	6.1	1	2	3	6	11	16	19

lowest : 1 2 3 4 5 , highest: 27 28 29 30 31

violence

n	missing	distinct
1169	22	2

 Value        Yes    No
 Frequency     90  1079
 Proportion 0.077 0.923

race

n	missing	distinct
1191	0	5

 Value              Hispanic            White African_American            Asian
 Frequency               163              804               78               63
 Proportion            0.137            0.675            0.065            0.053
                            
 Value                 other
 Frequency                83
 Proportion            0.070

10 Linear Regression Plans

10.1 My First Research Question

Among children in the Midwest, Do specific social factors (age, sex, race, health status, and parental educational level) contribute to their mental health status score?

10.2 My Quantitative Outcome

For this research question, we will use the Strengths and Difficulties Questionnaire (SDQ) score to measure the subjects’ mental health status. The outcome is named sdq in the tibble. We are interested in using this outcome because it is an objective measure of the children mental health status. Hence, it will be a reliable tool in exploring potential associations with social factors, namely their age, sex, race, health status, and parental educational level.

Code

count(complete.cases(child22$sdq))

n_TRUE 
  1173

There are 1173 cases with complete sdq score values.

Code

p1 <- ggplot(child22, aes(sample = sdq)) +
  geom_qq(col = "navyblue") + geom_qq_line(col = "pink") + 
  theme(aspect.ratio = 1) +
    labs(y = "(SDQ) total score", 
         x = "Standard Normal Distribution")

p2 <- ggplot(child22, aes(x = sdq)) +
  geom_histogram(aes(y = stat(density)), 
                 bins = 10, fill = "navyblue", col = "white") +
  stat_function(fun = dnorm, 
                args = list(mean = mean(child22$sdq), 
                            sd = sd(child22$sdq)),
                col = "pink", lwd = 1.5) +
  labs(y = "count", x = "(SDQ) total score")

p3 <- ggplot(child22, aes(x = sdq, y = "")) +
  geom_violin(fill = "navyblue") +
  geom_boxplot(width = 0.3, col = "white", notch = TRUE, 
               outlier.color = "navyblue") +
  labs(x = "(SDQ) total score", y = "")

p1 + (p2 / p3 + plot_layout(heights = c(4,1))) +
  plot_annotation(title = "SDQ Total Scores from NHIS Data",
       subtitle = "Higher SDQ score indicates more psychological difficulties",
       caption = "n = 1173")

The outcome is not normally distributed, rather, it is remarkably right skewed. A transformation should be considered.

Code

describe(child22$sdq)

child22$sdq 
       n  missing distinct     Info     Mean  pMedian      Gmd      .05 
    1173       18       31    0.995    7.665        7      6.1        1 
     .10      .25      .50      .75      .90      .95 
       2        3        6       11       16       19 

lowest :  1  2  3  4  5, highest: 27 28 29 30 31

As we can see, the outcome sdq is a quantitative variable with more than 10 ordinal values.

10.3 My Planned Predictors (Linear Model)

The social background encompasses multiple variables including their sex, age, race, health_status, and parental educational level adult_edu.

The age variable is quantitative with more than 10 ordered observations:

Code

describe(child22$age)

child22$age 
       n  missing distinct     Info     Mean  pMedian      Gmd      .05 
    1191        0       14    0.995    7.725      7.5    4.805        1 
     .10      .25      .50      .75      .90      .95 
       2        4        8       11       13       14 
                                                                            
Value          1     2     3     4     5     6     7     8     9    10    11
Frequency     86    91    92    69    73    81    75    78    69    79   101
Proportion 0.072 0.076 0.077 0.058 0.061 0.068 0.063 0.065 0.058 0.066 0.085
                            
Value         12    13    14
Frequency    101    98    98
Proportion 0.085 0.082 0.082

race is a categorical variable with 5 levels. Each level contains over 30 observations:

Code

tabyl(child22$race)

     child22$race   n    percent
         Hispanic 163 0.13685978
            White 804 0.67506297
 African_American  78 0.06549118
            Asian  63 0.05289673
            other  83 0.06968934

We will demonstrate that the total number of candidate predictors we suggest is no more than: 4+(N1-100)/100. N1 is the total number of rows with complete outcome sdq data = 1173. 4+(1173-100)/100 = 14.73 ~ 15

The total number of candidate predictors is 5 (sex, age, race, health_status, and adult_edu) which is way below 15.

The first speculation we hold is that females may be associated with higher sdq scores. Our belief is based on the fact that females experience puberty earlier than their male counterpart, which is strongly associated with mental health outcomes. We believe that the child’s health status and parental educational level may have a negative association with their sdq mental health score. We also speculate that race may contribute to their mental health score, where children from minority groups may have higher scores compared to other race categories.

11 Logistic Regression Plans

11.1 My Second Research Question

Among children in the Midwest, Do specific social factors (age, sex, race, health status, and parental educational level) contribute to the probability of witnessing or experiencing violence?

11.2 My Binary Outcome

The outcome is the odds of children being victims or witnesses of violence. It is present in the tibble as violence. We are interested in using this measure to calculate the odds of children facing adverse events when selected social factors are taken into account. This can be a powerful tool in predicting and preventing child abuse. It can also be used in creating a “profile” of children most likely becoming victims of violence based on their social background.

The outcome variable:

Code

tabyl(child22$violence)

 child22$violence    n    percent valid_percent
              Yes   90 0.07556675    0.07698888
               No 1079 0.90596138    0.92301112
             <NA>   22 0.01847187            NA

violence has two categories: yes and no. There are 22 missing values.

11.3 My Planned Predictors (Logistic Model)

The social background encompasses multiple variables including the child’s sex, age, race, health_status, and parental educational level adult_edu.

The age variable is quantitative with more than 10 ordered observations:

Code

describe(child22$age)

child22$age 
       n  missing distinct     Info     Mean  pMedian      Gmd      .05 
    1191        0       14    0.995    7.725      7.5    4.805        1 
     .10      .25      .50      .75      .90      .95 
       2        4        8       11       13       14 
                                                                            
Value          1     2     3     4     5     6     7     8     9    10    11
Frequency     86    91    92    69    73    81    75    78    69    79   101
Proportion 0.072 0.076 0.077 0.058 0.061 0.068 0.063 0.065 0.058 0.066 0.085
                            
Value         12    13    14
Frequency    101    98    98
Proportion 0.085 0.082 0.082

race is a 5 level categorical variable, with each level having over 30 observations:

Code

tabyl(child22$race)

     child22$race   n    percent
         Hispanic 163 0.13685978
            White 804 0.67506297
 African_American  78 0.06549118
            Asian  63 0.05289673
            other  83 0.06968934

We will demonstrate that the total number of candidate predictors we suggest is no more than: 4+(N2-100)/100. N2 is the number of the smaller group in the outcome violence. In this case N2 = 90. 4+(90-100)/100 = 3.9 ~ 4

The total number of candidate predictors is 5 (sex, age, race, health_status, and adult_edu) which is more than the suggested number. We will have to remove one predictor before proceeding to analysis.

The first speculation we hold is that younger children (one of the most vulnerable populations) have higher odds of being victims of or witnessing violence. We also believe that the child’s health status and parental educational level may have a negative association with the odds of them being victims or witnesses of violence. Race could contribute to outcome, where children from minority groups may have higher odds of being victims of or witnessing violence compared to children from other race categories.

12 Linear Regression Analyses

12.1 Missingness

We will use miss_var_summary to assess the missingness

Code

miss_var_summary(child22) |> filter(n_miss > 0)

# A tibble: 3 × 3
  variable      n_miss pct_miss
  <chr>          <int>    <num>
1 violence          22   1.85  
2 sdq               18   1.51  
3 health_status      1   0.0840

We’ve noticed that there are missing observations in our outcome variable sdq and the predictor variable health_status.

12.1.1 Single Imputation Approach

For this analysis, we will proceed under the assumption that the observations are missing at random (MAR). Under this assumption, using a single imputation method to handle the missing values is suitable for our predictor health_status. We’ll use the simputation package to accomplish it. However, we will exclude all the missing observations in our outcome via complete cases.

Code

child22_sub <- child22 |>
  select(sdq, sex, race, adult_edu, age, health_status)|>
  filter(complete.cases(sdq))|>
  impute_cart(health_status ~ sex + race + adult_edu+ age+ sdq)
n_miss(child22_sub)

[1] 0

We used the n_miss function to ensure that we obtained zero missing observations.

12.2 Outcome Transformation

We will use Box-Cox function to determine whether our outcome requires transformation to improve its skewness.

Code

mod_temp <- lm(sdq ~ age + sex + race + adult_edu + health_status, data = child22_sub)

car::boxCox(mod_temp)

With a suggested transformation power of around 0.4, it appears that a square root transformation might be the most appropriate approach in the current case.

To verify the given suggestion, we will generate appropriate visualization to assess the skewness

Code

child22_sub <- child22_sub |>
  mutate(sqrtsdq = sqrt(sdq))

p1 <- ggplot(data = child22_sub, aes(sample = sqrtsdq)) +
  geom_qq(col = "cornflowerblue") + geom_qq_line(col = "red") +
  theme(aspect.ratio = 1) +
  labs(x = "Expected N(0,1)", y = "sqrt(SDQ)")

p2 <- ggplot(data = child22_sub, aes(x = sqrtsdq)) +
  geom_histogram(bins = 12 , fill = "cornflowerblue", col = "white") + 
  labs(x = "sqrt(SDQ)", y = "counts")

p3 <- ggplot(data = child22_sub, aes(x = sqrtsdq, y = "")) +
  geom_violin() +
  geom_boxplot(fill = "cornflowerblue", width = 0.3,
               outlier.color = "cornflowerblue", outlier.size = 3) +
  stat_summary(fun = "mean", geom = "point",
               shape = 23, size = 3, fill = "white") +
  labs(y = "", x = "sqrt(SDQ)")

p1 + (p2 / p3 + plot_layout(heights = c(2,1)))+
  plot_annotation(title = "Evaluating the square root of SDQ score")

The graphs provided demonstrate that although some skewness is still visible in both the normal Q-Q plot and histogram, there has been an improvement. Furthermore, the violin plot demonstrates a reduction of the outliers.

12.3 Scatterplot Matrix and Collinearity

We will evaluate collinearity between predictors via scatter plot matrix and variance inflation factors.

Scatter plot Matrix

Code

ggpairs(child22_sub, columns = c("age", "sex", "race", 
                               "health_status", "adult_edu", "sqrtsdq"), 
        title = "Scatterplot Matrix")

At the bottom right corner, we observe the density function plot representing our outcome variable sqrtsdq. The correlation among the predictors is fairly modest, and the scatter plot matrix reveals no no major concerns regarding collinearity among predictors.

Checking Variance Inflation Factors

Code

mod_A <- lm(sqrtsdq ~ age + sex + race + health_status + adult_edu, data = child22_sub)

car::vif(mod_A)

                  GVIF Df GVIF^(1/(2*Df))
age           1.029741  1        1.014762
sex           1.012537  1        1.006249
race          1.222110  4        1.025389
health_status 1.103322  2        1.024886
adult_edu     1.273337  5        1.024458

The predictors’ generalized variance inflation factors (GVIF) demonstrate a small level of collinearity. We therefore have no serious concerns with respect to collinearity.

12.4 Model A

12.4.1 Fitting Model A

Now we will build mod_A “main effect” which incorporates five predictors age, sex, race, health_status, adult_edu using lm.

Code

mod_A <- lm(sqrtsdq ~ age + sex + race + health_status + adult_edu, data = child22_sub)

Our next step will be to fit model A “main effect” with the ols() function from the rms package, which we will use later.

Code

dd <- datadist(child22_sub)
options(datadist = "dd")

mod_A_ols <- ols(sqrtsdq ~ age + sex + race + health_status + adult_edu,
                 data = child22_sub, x = TRUE, y = TRUE)

12.4.2 Tidied Coefficient Estimates (Model A)

We will use the tidy function to obtain the estimates, standard error, lower and upper confidence interval and p-values.

Code

tidy(mod_A, conf.int = TRUE, conf.level = 0.95) |>
  select(term, estimate, se = std.error, 
         low95 = conf.low, high95 = conf.high, 
         p = p.value) |>
 gt() |> fmt_number(decimals = 3) |> tab_options(table.font.size = 20)

term	estimate	se	low95	high95	p
(Intercept)	2.603	0.168	2.272	2.933	0.000
age	−0.015	0.007	−0.028	−0.001	0.030
sexFemale	−0.014	0.055	−0.123	0.095	0.801
raceWhite	0.185	0.087	0.015	0.355	0.033
raceAfrican_American	0.043	0.133	−0.219	0.304	0.750
raceAsian	−0.150	0.145	−0.435	0.135	0.302
raceother	0.300	0.130	0.045	0.555	0.021
health_statusVery Good	0.545	0.067	0.414	0.677	0.000
health_statusNot very good	0.714	0.085	0.548	0.880	0.000
adult_educomplete_highschool	−0.173	0.168	−0.503	0.158	0.306
adult_eduincomplete_college	−0.244	0.168	−0.575	0.086	0.148
adult_eduassociate_degree	−0.103	0.168	−0.433	0.226	0.540
adult_edubachelors_degree	−0.422	0.165	−0.746	−0.098	0.011
adult_edugraduate_degree	−0.354	0.166	−0.679	−0.029	0.033

12.4.3 Summarizing Fit (Model A)

In this section we will use glance function to obtain R-squared, Adj.r.squared, sigma, AIC and BIC, nobs, df, and df.residual.

Code

glance(mod_A) |>
  select(r2 = r.squared, adjr2 = adj.r.squared, sigma, 
         AIC, BIC, nobs, df, df.residual) |>
  kable(digits = c(3, 3, 2, 1, 1, 0, 0, 0))

r2	adjr2	sigma	AIC	BIC	nobs	df	df.residual
0.124	0.114	0.94	3209.4	3285.4	1173	13	1159

The R-squared value indicates that the model accounts for 12.4% of the variation in sqrtsdq. The adjusted R-squared is an index and it is is 0.11, and the model is based on 1173 observations, and has 13 df.

12.4.4 Regression Diagnostics (Model A)

Here, the code will yield four residual plots

Code

par(mfrow = c(2,2)); plot(mod_A); par(mfrow = c(2,2))

Given the outcome’s discrete nature and the presence of four categorical predictors, we note certain patterns in our linearity assumption in the residual vs. fitted plot and the homoscedasticity assumption in the scale-location plots. However, the Q-Q plot of residuals reveals a few outliers at the lower end, albeit not influential as per the residuals vs. leverage plot (nothing passed the Cook’s distance). In conclusion, while there are minor deviations, overall, our assumptions appear to be reasonably met.

12.5 Non-Linearity

Presented below is the needed Spearman \(\rho^2\) plot.

Code

plot(spearman2(sqrtsdq ~ age + sex + race + health_status + adult_edu,
                 data = child22_sub))

Based to the Spearman \(\rho^2\) plot, the most attractive candidate predictors for non-linear terms are health_status and adult_edu, respectively. We will include an interaction between health_status (2 df) and adult_edu (5 df), which is expected to contribute 10 degrees of freedom to our initial model.

Note: this approach has received approval from Prof. Thomas Love on 2024-03-17.

12.6 Model B

Our Model B will include one non-linear term the interaction between health_status and adult_edu, this should add about 10 degrees of freedom to our Model A.

12.6.1 Fitting Model B

Now we will build augmented model B with interaction term which incorporates five predictors age, sex, race, health_status*adult_edu using lm.

Code

mod_B <- lm(sqrtsdq ~ age + sex + race  + health_status * adult_edu, data = child22_sub)

Our next step will be to fit model B with the ols() function from the rms package.

Code

dd <- datadist(child22_sub)
options(datadist = "dd")

mod_B_ols <- ols(sqrtsdq ~ age + sex + race + health_status * adult_edu, data = child22_sub, x = TRUE, y = TRUE)

12.6.2 Tidied Coefficient Estimates (Model B)

We will use the tidy function to obtain the estimates, standard error, lower and upper confidence interval and p-values.

Code

tidy(mod_B, conf.int = TRUE, conf.level = 0.95) |>
  select(term, estimate, se = std.error, 
         low95 = conf.low, high95 = conf.high, 
         p = p.value) |>
gt() |> fmt_number(decimals = 3) |> tab_options(table.font.size = 20)

term	estimate	se	low95	high95	p
(Intercept)	2.754	0.229	2.306	3.203	0.000
age	−0.014	0.007	−0.027	−0.001	0.038
sexFemale	−0.011	0.056	−0.121	0.098	0.840
raceWhite	0.185	0.087	0.014	0.356	0.034
raceAfrican_American	0.055	0.135	−0.209	0.319	0.683
raceAsian	−0.168	0.146	−0.454	0.119	0.251
raceother	0.296	0.131	0.039	0.553	0.024
health_statusVery Good	0.221	0.419	−0.600	1.043	0.597
health_statusNot very good	0.431	0.333	−0.222	1.084	0.196
adult_educomplete_highschool	−0.218	0.244	−0.696	0.259	0.370
adult_eduincomplete_college	−0.382	0.240	−0.854	0.090	0.112
adult_eduassociate_degree	−0.273	0.236	−0.737	0.191	0.248
adult_edubachelors_degree	−0.609	0.230	−1.061	−0.158	0.008
adult_edugraduate_degree	−0.534	0.229	−0.984	−0.084	0.020
health_statusVery Good:adult_educomplete_highschool	0.232	0.454	−0.659	1.123	0.610
health_statusNot very good:adult_educomplete_highschool	−0.061	0.381	−0.809	0.688	0.873
health_statusVery Good:adult_eduincomplete_college	0.140	0.453	−0.748	1.028	0.758
health_statusNot very good:adult_eduincomplete_college	0.417	0.381	−0.331	1.164	0.274
health_statusVery Good:adult_eduassociate_degree	0.385	0.448	−0.493	1.264	0.390
health_statusNot very good:adult_eduassociate_degree	0.253	0.386	−0.503	1.010	0.511
health_statusVery Good:adult_edubachelors_degree	0.413	0.438	−0.447	1.274	0.346
health_statusNot very good:adult_edubachelors_degree	0.405	0.387	−0.355	1.164	0.296
health_statusVery Good:adult_edugraduate_degree	0.366	0.441	−0.499	1.231	0.406
health_statusNot very good:adult_edugraduate_degree	0.515	0.400	−0.269	1.300	0.198

12.6.3 Effects Plot for Model B

Here, we will present a plot of the effects from plot(summary(modelname))for this augmented model, using ols

Code

plot(summary(mod_B_ols))

12.6.4 Summarizing Fit (Model B)

In this section we will use glance function to obtain R-squared, Adj.r.squared, sigma, AIC and BIC, nobs, df, and df.residual.

Code

glance(mod_B) |>
  select(r2 = r.squared, adjr2 = adj.r.squared, sigma, 
         AIC, BIC, nobs, df, df.residual) |>
  kable(digits = c(3, 3, 2, 1, 1, 0, 0, 0))

r2	adjr2	sigma	AIC	BIC	nobs	df	df.residual
0.131	0.113	0.94	3220.3	3347	1173	23	1149

In our Model B, we use 23 degrees of freedom, and obtain an \(R^2\) value of 0.131. The R-squared value indicates that the model accounts for 13 % of the variation in sqrtsdq. The adjusted R-squared is an index and it is is 0.11, and the model is based on 1173 observations, and has 23 df.

12.6.5 Regression Diagnostics (Model B)

Here, the code will yield four residual plots

Code

par(mfrow = c(2,2)); plot(mod_B); par(mfrow = c(1,1))

Considering the discrete nature of the outcome and the presence of four categorical predictors, we observed certain patterns in the residual vs. fitted plot and the scale-location plots. However, these patterns do not substantially affect the linearity and homoscedasticity assumptions. Additionally, the Q-Q plot of residuals shows a few outliers at both ends. Among these outliers, observations 627 and 859 exhibit high leverage, but they are not influential according to the residuals vs. leverage plot (none exceed the Cook’s distance threshold). In summary, while there are minor deviations, our assumptions seem to be reasonably met overall.

12.7 Validating Models A and B

We will use the validate() function from the rms package to validate our ols fits that we previously created by re-sampling.

Code

set.seed(4321); (valA <- validate(mod_A_ols))

          index.orig training   test optimism index.corrected   Lower  Upper  n
R-square      0.1238   0.1344 0.1122   0.0223          0.1016  0.0489 0.1397 40
MSE           0.8804   0.8638 0.8921  -0.0284          0.9088  0.8445 0.9805 40
g             0.3994   0.4115 0.3827   0.0288          0.3707  0.3021 0.4284 40
Intercept     0.0000   0.0000 0.1844  -0.1844          0.1844 -0.3763 0.5931 40
Slope         1.0000   1.0000 0.9297   0.0703          0.9297  0.7624 1.1380 40

Code

set.seed(4322); (valB <- validate(mod_B_ols))

          index.orig training   test optimism index.corrected   Lower  Upper  n
R-square      0.1306   0.1439 0.1107   0.0332          0.0973  0.0579 0.1353 40
MSE           0.8736   0.8632 0.8936  -0.0305          0.9041  0.8266 0.9562 40
g             0.4124   0.4285 0.3815   0.0471          0.3653  0.3113 0.4232 40
Intercept     0.0000   0.0000 0.2737  -0.2737          0.2737 -0.1928 0.6795 40
Slope         1.0000   1.0000 0.8935   0.1065          0.8935  0.7394 1.0579 40

After performing the validation, the output provides the R-squared value, MSE (mean squared error), intercept, and slope, along with their respective indices: “index.orig,” “training,” “test,” and “optimism,” which represents the difference between the training and test samples. Additionally, it includes the “index.corrected,” calculated as “index.orig” minus “optimism.”

12.7.1 Validated \(R^2\), and MSE as well as IC statistics

Model	Validated \(R^2\)	Validated MSE	AIC	BIC	df
A	0.102	0.9088	3209.4	3285.4	13
B	0.097	0.9041	3220.3	3347	23

12.8 Final Linear Regression Model

The table above provides information about the validated R-squared, MSE, AIC, and BIC. These values were obtained from a validation comparison between the “augmented model” B and the “main effects” model A, predicting the transformed outcome (sqrtsdq). Validated R-squared represents the percentage of variation in the outcome, while MSE stands for mean squared error, and AIC and BIC are Akaike and Bayesian Information Criteria, respectively. A more favorable fit is indicated by a higher validated R-squared and lower values of MSE, AIC, and BIC. In this case, model A “main effects” or the simple model shows a higher validated R-squared value and lower AIC and BIC, suggesting the most desirable fit, despite Model B having a lower validated MSE.Additionally, it’s worth noting that the inclusion of interaction terms in the augmented model yields no apparent improvement.

12.8.1 Winning Model’s Parameter Estimates

Here, we will generate the content of mod_A_ols

Code

mod_A_ols

Linear Regression Model

ols(formula = sqrtsdq ~ age + sex + race + health_status + adult_edu, 
    data = child22_sub, x = TRUE, y = TRUE)

                Model Likelihood    Discrimination    
                      Ratio Test           Indexes    
Obs    1173    LR chi2    155.06    R2       0.124    
sigma0.9439    d.f.           13    R2 adj   0.114    
d.f.   1159    Pr(> chi2) 0.0000    g        0.399    

Residuals

     Min       1Q   Median       3Q      Max 
-2.24347 -0.68510 -0.03101  0.61304  2.97395 

                              Coef    S.E.   t     Pr(>|t|)
Intercept                      2.6027 0.1684 15.45 <0.0001 
age                           -0.0145 0.0067 -2.17 0.0298  
sex=Female                    -0.0140 0.0555 -0.25 0.8014  
race=White                     0.1848 0.0866  2.13 0.0331  
race=African_American          0.0426 0.1334  0.32 0.7498  
race=Asian                    -0.1500 0.1452 -1.03 0.3018  
race=other                     0.2998 0.1299  2.31 0.0212  
health_status=Very Good        0.5453 0.0670  8.14 <0.0001 
health_status=Not very good    0.7145 0.0846  8.44 <0.0001 
adult_edu=complete_highschool -0.1726 0.1684 -1.02 0.3058  
adult_edu=incomplete_college  -0.2440 0.1684 -1.45 0.1477  
adult_edu=associate_degree    -0.1031 0.1680 -0.61 0.5396  
adult_edu=bachelors_degree    -0.4218 0.1650 -2.56 0.0107  
adult_edu=graduate_degree     -0.3540 0.1656 -2.14 0.0328

The validated \(R^2\) value ( = 0.102) for our Model A

The output provides details about the formula used for the model, along with the number of observations. It includes the estimated residual standard deviation (sigma) and its corresponding degrees of freedom (df), as well as the model’s likelihood ratio test and discrimination indexes. Additionally, it presents a statistical summary, including the minimum, first quartile (1Q), median, third quartile (3Q), and maximum values for “sqrtsdq.” Finally, it lists the coefficient values, standard errors, t-test statistics, and p-values for all predictors.

12.8.2 Effects Plot for Winning Model

Here, we will present a plot of the effects from plot(summary(modelname))for this simple main effect model, using ols

Code

plot(summary(mod_A_ols))

12.8.3 Numerical Description of Effect Sizes

Here, we will obtain the effects summary of the ols model.

Code

summary(mod_A_ols) |> kable(digits = 3)

	Low	High	Diff.	Effect	S.E.	Lower 0.95	Upper 0.95	Type
age	4	11	7	-0.102	0.047	-0.194	-0.010	1
sex - Female:Male	1	2	NA	-0.014	0.055	-0.123	0.095	1
race - Hispanic:White	2	1	NA	-0.185	0.087	-0.355	-0.015	1
race - African_American:White	2	3	NA	-0.142	0.117	-0.372	0.088	1
race - Asian:White	2	4	NA	-0.335	0.127	-0.583	-0.086	1
race - other:White	2	5	NA	0.115	0.111	-0.102	0.332	1
health_status - Very Good:Excellent	1	2	NA	0.545	0.067	0.414	0.677	1
health_status - Not very good:Excellent	1	3	NA	0.714	0.085	0.548	0.880	1
adult_edu - incomplete_highschool:bachelors_degree	5	1	NA	0.422	0.165	0.098	0.746	1
adult_edu - complete_highschool:bachelors_degree	5	2	NA	0.249	0.095	0.063	0.435	1
adult_edu - incomplete_college:bachelors_degree	5	3	NA	0.178	0.092	-0.003	0.359	1
adult_edu - associate_degree:bachelors_degree	5	4	NA	0.319	0.088	0.147	0.491	1
adult_edu - graduate_degree:bachelors_degree	5	6	NA	0.068	0.078	-0.085	0.220	1

This effects summary of model A illustrates the impact on sqrtsdq when moving from the 25th to the 75th percentile of each variable. The estimates are accompanied by their respective standard errors and 95% confidence intervals, while maintaining the other variables at a constant level.

12.8.4 Effect Size Description

health_status : If two children share the same age, sex, race, and parental education, our model predicts that a child reporting “Not very good” overall health status will have a square root of SDQ score 0.741 higher than a child reporting “Excellent” overall health. This prediction comes with a meaningful 95% confidence interval of (0.548, 0.880).

12.8.5 Nomogram of Winning Model

Here, we will hypothetically generate predicted SDQ scores for two children, both female, one of White race and the other Hispanic. Both are 8 years old with reported “not very good” health statuses, and their parental education level is incomplete college education. Manually

Child one: age 8 = ~12.5 points, sex Female= 0 points, race White= ~ 47.5 points, health_status not very good = 100 points, adult_edu incomplete college= ~ 25, total points= ~ 185 and it is correspond to ~3.1 in the linear predictor scale.
Child two: age 8 = ~12.5 points, sex Female= 0 points, race Hispanic = ~ 21 points, health_status not very good = 100 points, adult_edu incomplete college= ~ 25, total points= ~ 185 and it is correspond to ~ 2.7 in the linear predictor scale.

These predicted values are square root.

Code

plot(nomogram(mod_A_ols, fun = exp))

12.8.6 Prediction for a New Subject

We will generate predicted SDQ scores for two children, both female, one of White race and the other Hispanic. Both are 8 years old with reported “not very good” health statuses, and their parental education level is incomplete college education. We will use the code below and then squaring the predictions.

Code

new_subjects <- 
  data.frame(age = c(8,8), sex = c("Female", "Female"), race = c("White", "Hispanic"),
             health_status = c("Not very good", "Not very good"), adult_edu = c("incomplete_college", "incomplete_college"))

preds1 <- predict.lm(mod_A, newdata = new_subjects, 
                     interval = "prediction", level = 0.95)

(preds1)^2

       fit      lwr     upr
1 9.781555 1.597199 24.9130
2 8.659805 1.156341 23.1378

Predictor Values	Predicted SDQ	95% Prediction Interval
age = 8, sex = Female, race= White, health_status = Not very good, adult_edu = incomplete_college	9.78	(1.59, 24.91)
age = 8, sex = Female, race= Hispanic, health_status = Not very good, adult_edu = incomplete_college	8.66	(1.16, 23.14)

The code below will generate the predicted values in square root form, and they align with the values obtained from the nomogram plot.

Code

new_subjects <- 
  data.frame(age = c(8,8), sex = c("Female", "Female"), race = c("White", "Hispanic"),
             health_status = c("Not very good", "Not very good"), adult_edu = c("incomplete_college", "incomplete_college"))

preds2 <- predict.lm(mod_A, newdata = new_subjects, 
                     interval = "prediction", level = 0.95)

preds2

       fit      lwr      upr
1 3.127548 1.263803 4.991292
2 2.942755 1.075333 4.810176

13 Logistic Regression Analyses

We will start by creating a data subset including the predictors for the logistic regression analysis:

Code

child22_log <- child22|>
  select(id, sex, age, race, health_status, adult_edu, violence)

child22_log

# A tibble: 1,191 × 7
   id      sex      age race     health_status adult_edu             violence
   <chr>   <fct>  <dbl> <fct>    <fct>         <fct>                 <fct>   
 1 H051614 Female    14 White    Not very good bachelors_degree      No      
 2 H016609 Female     4 Hispanic Excellent     incomplete_highschool No      
 3 H037925 Male       2 Hispanic Excellent     incomplete_highschool No      
 4 H039485 Male       7 White    Excellent     graduate_degree       No      
 5 H008818 Male       2 White    Excellent     incomplete_college    No      
 6 H064500 Male      14 Asian    Excellent     graduate_degree       No      
 7 H029084 Male       4 White    Very Good     bachelors_degree      No      
 8 H037799 Female    10 White    Very Good     bachelors_degree      No      
 9 H012805 Male      12 White    Very Good     bachelors_degree      <NA>    
10 H003829 Male       9 White    Excellent     graduate_degree       No      
# ℹ 1,181 more rows

Before proceeding any further, we will notice that our outcome violence is a factor. We will mutate it into a numeric variable with 0 and 1 levels.

Code

child22_log <- child22_log |> mutate(violence = as.numeric (violence))|> mutate(violence = 2 - violence)

13.1 Predictor Selection

In the logistic analysis plan, we decided to reduce the number of predictors from 5 variables to 4 variables. All of the predictors included in the model are child social factors including their age, sex, race, and health status. Parental educational level adult_edu is the only variable associated with parents. This variable also has the highest number of levels (6 educational levels) and therefore will use more degrees of freedom compared to the rest of the predictors. Based on these facts, we decided to remove this variable from the logistic model.

The finalized logistic model data set will include be:

Code

child22_log <- child22_log|>
   select(-adult_edu)

child22_log

# A tibble: 1,191 × 6
   id      sex      age race     health_status violence
   <chr>   <fct>  <dbl> <fct>    <fct>            <dbl>
 1 H051614 Female    14 White    Not very good        0
 2 H016609 Female     4 Hispanic Excellent            0
 3 H037925 Male       2 Hispanic Excellent            0
 4 H039485 Male       7 White    Excellent            0
 5 H008818 Male       2 White    Excellent            0
 6 H064500 Male      14 Asian    Excellent            0
 7 H029084 Male       4 White    Very Good            0
 8 H037799 Female    10 White    Very Good            0
 9 H012805 Male      12 White    Very Good           NA
10 H003829 Male       9 White    Excellent            0
# ℹ 1,181 more rows

13.2 Dealing with missingness

We will observe the number of subjects with missing observations and the proportion of missingness in our data set.

Code

miss_var_summary(child22_log)

# A tibble: 6 × 3
  variable      n_miss pct_miss
  <chr>          <int>    <num>
1 violence          22   1.85  
2 health_status      1   0.0840
3 id                 0   0     
4 sex                0   0     
5 age                0   0     
6 race               0   0

There is a total of 23 subjects with missing observations and the overall proportion of missingness is 1.9%. In this case, imputing the missing observations is neither essential nor unnecessary, so we will proceed with the imputation process.

We will use the single imputation method using simpute function:

Code

set.seed(432)
child22_si <- child22_log |> data.frame() |>
  impute_rhd(violence ~ age + sex + race) |>
  impute_cart(health_status ~ age + sex + race) |>
  as_tibble()

n_miss(child22_si)

[1] 0

13.3 Model Y: Main effects model

13.3.1 ModelY Results using `glm`

Now we will fit the main effects model using glm function.

Code

mody_g <- glm(violence ~ sex + age + race + health_status,
            data = child22_si, 
            family = binomial(link = logit))

Now we will produce a tidied table with the exponentiated estimates (odd ratios) with a 95% confidence interval.

Code

tidy(mody_g, exponentiate = TRUE, 
     conf.int = TRUE, conf.level = 0.95) |>
  select(term, estimate, std.error, 
         low95 = conf.low, high95 = conf.high, 
         p = p.value) |> 
  gt() |> 
  fmt_number(decimals = 3) |>
  tab_options(table.font.size = 20)

term	estimate	std.error	low95	high95	p
(Intercept)	0.028	0.426	0.012	0.062	0.000
sexFemale	1.360	0.223	0.879	2.114	0.168
age	1.059	0.028	1.003	1.119	0.040
raceWhite	1.058	0.347	0.556	2.194	0.871
raceAfrican_American	2.520	0.441	1.062	6.082	0.036
raceAsian	0.425	0.787	0.064	1.654	0.276
raceother	2.116	0.456	0.857	5.229	0.100
health_statusVery Good	1.766	0.257	1.059	2.911	0.027
health_statusNot very good	2.116	0.293	1.173	3.717	0.010

We will check for the presence of interaction between the predictors:

Code

rms::vif(mody_g)

                 sexFemale                        age 
                  1.008948                   1.016786 
                 raceWhite       raceAfrican_American 
                  2.344559                   1.785802 
                 raceAsian                  raceother 
                  1.163887                   1.718125 
    health_statusVery Good health_statusNot very good 
                  1.157662                   1.196977

As we can see, the inflation scores do not exceed the level of potential interaction. Therefore, we have no problems regarding collinearity between the predictors.

13.3.2 ModelY Results using `lrm`

We will refit the same model using the lrm function.

Code

d <- datadist(child22_si)
options(datadist = "d")

mody_r <- lrm(violence ~ sex + age + race + health_status,
            data = child22_si, x = TRUE, y = TRUE)

Now we will produce an effects plot of the odds ratio scale for the model variables.

Code

plot(summary(mody_r))

We can also assess the relationship between each predictor and the outcome independently.

Code

ggplot(Predict(mody_r, fun = plogis))

13.3.3 Key fit summary statistics

Code

mody_r

Logistic Regression Model

lrm(formula = violence ~ sex + age + race + health_status, data = child22_si, 
    x = TRUE, y = TRUE)

                      Model Likelihood      Discrimination    Rank Discrim.    
                            Ratio Test             Indexes          Indexes    
Obs          1191    LR chi2     29.43      R2       0.059    C       0.674    
 0           1100    d.f.            8     R2(8,1191)0.018    Dxy     0.348    
 1             91    Pr(> chi2) 0.0003    R2(8,252.1)0.081    gamma   0.350    
max |deriv| 5e-05                           Brier    0.069    tau-a   0.049    

                            Coef    S.E.   Wald Z Pr(>|Z|)
Intercept                   -3.5721 0.4259 -8.39  <0.0001 
sex=Female                   0.3074 0.2232  1.38  0.1684  
age                          0.0570 0.0277  2.06  0.0396  
race=White                   0.0563 0.3473  0.16  0.8713  
race=African_American        0.9244 0.4407  2.10  0.0359  
race=Asian                  -0.8567 0.7870 -1.09  0.2763  
race=other                   0.7494 0.4561  1.64  0.1004  
health_status=Very Good      0.5690 0.2569  2.21  0.0268  
health_status=Not very good  0.7495 0.2928  2.56  0.0105

Nagelkerke value is 0.059, which indicates that our model does not present an improvement compared to a null model. The C statistic value is 0.674, which indicates that the predictions of our model are not very reliable.

13.3.4 Confusion matrix

We will produce the confusion matrix using a cutting point of 0.08.

Code

mody_g_aug <- augment(mody_g, type.predict = "response")

cm <- confusionMatrix(
  data = factor(mody_g_aug$.fitted >= 0.08),
  reference = factor(mody_g_aug$violence == 1),
  positive = "TRUE")

cm

Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE   757   36
     TRUE    343   55
                                          
               Accuracy : 0.6818          
                 95% CI : (0.6545, 0.7082)
    No Information Rate : 0.9236          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.1149          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.60440         
            Specificity : 0.68818         
         Pos Pred Value : 0.13819         
         Neg Pred Value : 0.95460         
             Prevalence : 0.07641         
         Detection Rate : 0.04618         
   Detection Prevalence : 0.33417         
      Balanced Accuracy : 0.64629         
                                          
       'Positive' Class : TRUE

The sensitivity of our model is 60.4%, the specificity is 68.8%, and the positive predictive value is 13.8%.

13.4 Non-linearity

Spearman \(\rho^2\) plot

Code

plot(spearman2(violence ~ sex + age + race + health_status ,
            data = child22_si))

The Spearman \(\rho^2\) plot is suggesting that race and health_status are the most suitable predictors for adding an interaction term. Since both variables are categorical, we will add an interaction term between race and health_status.

13.5 Model Z: incorporating non-linear terms

Now we will fit another model with an interaction term between race and health_status.

Code

modz_r <- lrm(violence ~ sex + age + race + health_status + 
                 race %ia% health_status,
            data = child22_si, x = TRUE, y = TRUE)

Now we will confirm the number of additional degrees of freedom using anova:

Code

anova(modz_r)

                Wald Statistics          Response: violence 

 Factor                                              Chi-Square d.f. P     
 sex                                                  1.71       1   0.1914
 age                                                  4.40       1   0.0359
 race  (Factor+Higher Order Factors)                 15.45      12   0.2175
  All Interactions                                    4.57       8   0.8026
 health_status  (Factor+Higher Order Factors)        14.60      10   0.1473
  All Interactions                                    4.57       8   0.8026
 race * health_status  (Factor+Higher Order Factors)  4.57       8   0.8026
 TOTAL                                               32.85      16   0.0077

The interaction resulted in 8 additional degrees of freedom.

13.5.1 Modelz Results using `glm`

We can also use the glm function to produce tidied results of the interaction model.

Code

modz_g <- glm(violence ~ sex + age + race + health_status + 
                 race %ia% health_status,
                 data =child22_si, 
                 family =binomial(link =logit))

Tidied coefficients:

Code

tidy(modz_g, exponentiate = TRUE, 
     conf.int = TRUE, conf.level = 0.95) |>
  select(term, estimate, std.error, 
         low90 = conf.low, high90 = conf.high) |> 
  gt() |> 
  fmt_number(decimals = 3) |> 
  tab_options(table.font.size = 14)

term	estimate	std.error	low90	high90
(Intercept)	0.025	0.568	0.007	0.069
sexFemale	1.340	0.224	0.865	2.088
age	1.060	0.028	1.004	1.121
raceWhite	1.034	0.553	0.387	3.588
raceAfrican_American	4.955	0.664	1.392	20.109
raceAsian	1.188	0.892	0.159	6.415
raceother	2.128	0.735	0.479	9.465
health_statusVery Good	2.574	0.738	0.576	11.511
health_statusNot very good	1.996	0.796	0.374	9.610
race %ia% health_statusrace=White * health_status=Very Good	0.802	0.807	0.158	4.047
race %ia% health_statusrace=White * health_status=Not very good	1.587	0.879	0.280	9.725
race %ia% health_statusrace=African_American * health_status=Very Good	0.201	1.140	0.018	1.765
race %ia% health_statusrace=African_American * health_status=Not very good	0.426	1.055	0.052	3.532
race %ia% health_statusrace=Asian * health_status=Very Good	0.000	597.342	0.000	165.518
race %ia% health_statusrace=Asian * health_status=Not very good	0.000	720.840	0.000	0.000
race %ia% health_statusrace=other * health_status=Very Good	0.913	1.036	0.118	7.277
race %ia% health_statusrace=other * health_status=Not very good	0.971	1.231	0.076	10.872

13.5.2 Modelz Results using `lrm`

We will observe the plot of the effects using the lrm function:

Code

plot(summary(modz_r))

We can also assess the relationship between each predictor and the outcome independently.

Code

ggplot(Predict(modz_r, fun = plogis))

13.5.3 Key fit summary statistics

Code

modz_r

Logistic Regression Model

lrm(formula = violence ~ sex + age + race + health_status + race %ia% 
    health_status, data = child22_si, x = TRUE, y = TRUE)

                       Model Likelihood       Discrimination    Rank Discrim.    
                             Ratio Test              Indexes          Indexes    
Obs           1191    LR chi2     37.73       R2       0.075    C       0.683    
 0            1100    d.f.           16     R2(16,1191)0.018    Dxy     0.366    
 1              91    Pr(> chi2) 0.0017    R2(16,252.1)0.083    gamma   0.368    
max |deriv| 0.0003                            Brier    0.068    tau-a   0.052    

                                                    Coef     S.E.     Wald Z
Intercept                                            -3.6802   0.5678 -6.48 
sex=Female                                            0.2929   0.2242  1.31 
age                                                   0.0585   0.0279  2.10 
race=White                                            0.0334   0.5535  0.06 
race=African_American                                 1.6004   0.6636  2.41 
race=Asian                                            0.1720   0.8922  0.19 
race=other                                            0.7551   0.7355  1.03 
health_status=Very Good                               0.9455   0.7384  1.28 
health_status=Not very good                           0.6911   0.7955  0.87 
race=White * health_status=Very Good                 -0.2205   0.8066 -0.27 
race=White * health_status=Not very good              0.4618   0.8791  0.53 
race=African_American * health_status=Very Good      -1.6043   1.1401 -1.41 
race=African_American * health_status=Not very good  -0.8524   1.0549 -0.81 
race=Asian * health_status=Very Good                -12.6790 369.3752 -0.03 
race=Asian * health_status=Not very good            -12.5553 446.0065 -0.03 
race=other * health_status=Very Good                 -0.0913   1.0359 -0.09 
race=other * health_status=Not very good             -0.0294   1.2306 -0.02 
                                                    Pr(>|Z|)
Intercept                                           <0.0001 
sex=Female                                          0.1914  
age                                                 0.0359  
race=White                                          0.9519  
race=African_American                               0.0159  
race=Asian                                          0.8471  
race=other                                          0.3046  
health_status=Very Good                             0.2003  
health_status=Not very good                         0.3850  
race=White * health_status=Very Good                0.7846  
race=White * health_status=Not very good            0.5994  
race=African_American * health_status=Very Good     0.1594  
race=African_American * health_status=Not very good 0.4191  
race=Asian * health_status=Very Good                0.9726  
race=Asian * health_status=Not very good            0.9775  
race=other * health_status=Very Good                0.9298  
race=other * health_status=Not very good            0.9810

Negelkerke value is 0.075, which indicates that our model does not present an improvement compared to a null model. The C statistic value is 0.682, which indicates that the predictions of our model are not very reliable.

13.5.4 Confusion matrix

We will produce the confusion matrix using a cutting point of 0.08.

Code

modz_g_aug <- augment(modz_g, type.predict = "response")

cm <- confusionMatrix(
  data = factor(modz_g_aug$.fitted >= 0.08),
  reference = factor(modz_g_aug$violence == 1),
  positive = "TRUE")

cm

Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE   757   36
     TRUE    343   55
                                          
               Accuracy : 0.6818          
                 95% CI : (0.6545, 0.7082)
    No Information Rate : 0.9236          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.1149          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.60440         
            Specificity : 0.68818         
         Pos Pred Value : 0.13819         
         Neg Pred Value : 0.95460         
             Prevalence : 0.07641         
         Detection Rate : 0.04618         
   Detection Prevalence : 0.33417         
      Balanced Accuracy : 0.64629         
                                          
       'Positive' Class : TRUE

The sensitivity of our model is 60.4%, the specificity is 68.8%, and the positive predictive value is 13.8%. These values match the main effects model mody values.

13.6 Validating the models

We will tidy the summaries of the main effects model mody and interaction model modz and compare them:

Code

temp1 <- bind_rows(glance(mody_g), glance(modz_g)) |>
  mutate(model = c("Y", "Z")) |>
  select(model, AIC, BIC) 

temp2 <- tibble(model = c("Y", "Z"),
  auc = c(mody_r$stats["C"], modz_r$stats["C"]),
  r2_nag = c(mody_r$stats["R2"], modz_r$stats["R2"]))

left_join(temp1, temp2, by = "model") |> 
  gt() |> fmt_number(columns = AIC:BIC, decimals = 1) |>
  fmt_number(columns = auc:r2_nag, decimals = 4) |> 
  tab_options(table.font.size = 20)

model	AIC	BIC	auc	r2_nag
Y	631.5	677.2	0.6740	0.0585
Z	639.2	725.6	0.6828	0.0748

The interaction model modz has a slightly better performance compared to the main effects model mody.

Now we will validate the models to determine our final model:

Code

set.seed(432)
valY <- validate(mody_r, B = 40)
valZ <- validate(modz_r, B = 40)

val_1 <- bind_rows(valY[1,], valZ[1,]) |>
  mutate(model = c("Y", "Z"),
         AUC_nominal = 0.5 + (index.orig/2), 
         AUC_validated = 0.5 + (index.corrected/2)) |>
  select(model, AUC_nominal, AUC_validated)

val_2 <- bind_rows(valY[2,], valZ[2,]) |>
  mutate(model = c("Y", "Z"),
         R2_nominal = index.orig,
         R2_validated = index.corrected) |>
  select(model, R2_nominal, R2_validated)

val <- left_join(val_1, val_2, by = "model") 

val |> gt() |> fmt_number(decimals = 4) |> 
  tab_options(table.font.size = 20)

model	AUC_nominal	AUC_validated	R2_nominal	R2_validated
Y	0.6740	0.6376	0.0585	0.0207
Z	0.6828	0.6320	0.0748	0.0085

After validating the area under the curve value (AUC) and the R2 value, the main effects model mody displayed improved performance compared to the interaction model modz.

13.7 Final Model

After validating the models, the main effects model mody yielded better results. Both the Nagelkerke’s R squared and ROC values were improved compared to the interaction modz model. The main effects model mody will be the model of choice in this case.

The main effects model mody parameters and their coefficients with a 95% confidence interval is displayed in the following table:

Code

tidy(mody_g, exponentiate = TRUE, 
     conf.int = TRUE, conf.level = 0.95) |>
  select(term, estimate, std.error, 
         low95 = conf.low, high95 = conf.high, 
         p = p.value) |> 
  gt() |> 
  fmt_number(decimals = 3) |>
  tab_options(table.font.size = 20)

term	estimate	std.error	low95	high95	p
(Intercept)	0.028	0.426	0.012	0.062	0.000
sexFemale	1.360	0.223	0.879	2.114	0.168
age	1.059	0.028	1.003	1.119	0.040
raceWhite	1.058	0.347	0.556	2.194	0.871
raceAfrican_American	2.520	0.441	1.062	6.082	0.036
raceAsian	0.425	0.787	0.064	1.654	0.276
raceother	2.116	0.456	0.857	5.229	0.100
health_statusVery Good	1.766	0.257	1.059	2.911	0.027
health_statusNot very good	2.116	0.293	1.173	3.717	0.010

The social factor associated with the highest odds of violence is African American race (2.520) with a 95% CI (1.062, 6.082). The second highest social factor were Other race (2.116) with a 95% CI (0.857, 5.229) and health status Not very good (2.116) with a 95% CI (1.173, 3.717). The social factor associated with the lowest odds ratio of violence was the Asian race (0.425) with a 95% CI (0.064, 1.654 ). Among these factors, the only confidence intervals not containing 1 were African American race and Not very good health status, indicating that these factors produced meaningful differences in the odds ratio of violence.

To observe the effect sizes we will display the effect plot:

Code

plot(summary(mody_r))

According to this model, an 11 year old child has 1.5 the odds of witnessing/experiencing violence compared to a 4 year old child, when all other variables are held constant. Female children have about 1.4 times the odds of witnessing/experiencing violence compared to their male counterparts, also holding all other variables constant. When observing race, African American children have about 2.4 the odds witnessing/experiencing violence compared to White children, followed by Other races (2.0). Hispanic have about the same odds of witnessing/experiencing violence (about 1.0) while Asian children have lower odds of witnessing/experiencing violence (about 0.4) compared to White children. Compared to children with excellent health status, those with very good health status have about 1.7 the odds of witnessing/experiencing violence, while those with not very good health status have about 2.0 the odds of witnessing/experiencing violence. Again, these assumption are made when all other variables are held constant.

ROC Curve Plot

Code

predict.prob1 <- predict(mody_g, type = "response")
roc1 <- pROC::roc(mody_g$data$violence, predict.prob1)

plot(roc1, main = "ROC Curve: Main effects model `mody`", lwd = 2, col = "navy")
legend('bottomright', legend = paste("AUC: ", 
                                     round_half_up(auc(roc1),4)))

The AUC values displayed here is the nominal AUC.

Validated Nagelkerke R-squared and the C statistic values

Code

set.seed(432)
valY <- validate(mody_r, B = 40)

val_1 <- bind_rows(valY[1,]) |>
  mutate(model = c("Y"),
         AUC_nominal = 0.5 + (index.orig/2), 
         AUC_validated = 0.5 + (index.corrected/2)) |>
  select(model, AUC_nominal, AUC_validated)

val_2 <- bind_rows(valY[2,]) |>
  mutate(model = c("Y"),
         R2_nominal = index.orig,
         R2_validated = index.corrected) |>
  select(model, R2_nominal, R2_validated)

val <- left_join(val_1, val_2, by = "model") 

val |> gt() |> fmt_number(decimals = 4) |> 
  tab_options(table.font.size = 20)

model	AUC_nominal	AUC_validated	R2_nominal	R2_validated
Y	0.6740	0.6376	0.0585	0.0207

The validated area under the curve (AUC) value is 0.638. This means that our model’s predictions are slightly better than a random guess. The validated Nagelkerke’s R squared value is 0.022, which means that our model does not show an improvement compared to a null model.

Using the nomogram to make a prediction:

Code

plot(nomogram(mody_r, fun = plogis,
            fun.at = c(0.05, seq(0.1, 0.9, by = 0.1), 0.95),
            funlabel = "Pr(violence)"))

We can use the nomogram to predict the odds ration of children by imputing the values of their social factors. We will hypothesize two subjects. The first will be a female African American child who is 5 years old and does not have a very good health status. The second subject will be male Asian child who is 13 years old and has a very good health status. By using this nomogram the first subject is predicted to have approximately 2.1 the odds of witnessing/experiencing violence while the second subject will have about 0.05 the odds. The first subject has remarkably higher probability of witnessing/experiencing violence, while the second subject has a lower probability.

14 Discussion

The first research question was: Among children in the Midwest, Do specific social factors (age, sex, race, health status, and parental educational level) contribute to their mental health status score? Based on the final linear regression model, a couple of the social factors contributed meaningfully to the outcome, specifically, race (Asian) and health status (not very good). Having an Asian race contributed positively to the mental health score, while having a not very good health status contributed negatively.

The second research question was: Among children in the Midwest, Do specific social factors (age, sex, race, health status, and parental educational level) contribute to the probability of witnessing or experiencing violence? Based on the final logistic regression model, a couple of the social factors contributed meaningfully to the outcome, specifically, race (African American) and health status (not very good). Children with these characteristics presented remarkably higher probabilities of witnessing or experiencing violence compared to children with different social characteristics.

Project A was a challenging, yet achievable project. One of the prominent issues we faced was that the outcome of the linear regression model was discrete. There were a few difficulties during the analysis, and we wished we were more proficient in other suitable methods such as the Poisson or negative binomial regression. Another challenge was choosing an appropriate transformation for the outcome. We used the Box-Cox to guide us, however that was not useful. So, we opted for a different transformation until reaching a satisfying result. We realized that the Box-Cox can be misleading. We also wished we had collapsed some of the categorical levels as they had used more degrees of freedom in the model. However, this decision may compromise our ability to investigate meaningful differences between the categorical levels. The overall process of the project was not difficult as there was a roadmap guiding us (the instructions). However, making decisions on the order of the logistic models (the main effects model and interaction model using glm and lrm function) was confusing at times. We decided to organize the models in a way that facilitated comprehension for the audience. We learned that it is imperative to keep our study goal in mind during the analysis process.

15 Affirmation

I am certain that it is completely appropriate for these data to be shared with anyone, without any conditions. There are no concerns about privacy or security.

16 References

Centers for Disease Control and Prevention. (2023, June 29). NHIS - 2022 NHIS. Centers for Disease Control and Prevention. https://www.cdc.gov/nchs/nhis/2022nhis.htm.

17 Session Information

Code

xfun::session_info()

R version 4.5.1 (2025-06-13 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  abind_1.4-8             askpass_1.2.1           backports_1.5.0        
  base64enc_0.1-3         bigD_0.3.1              bit_4.6.0              
  bit64_4.6.0-1           bitops_1.0.9            blob_1.2.4             
  boot_1.3.31             broom_1.0.10            bslib_0.9.0            
  cachem_1.1.0            callr_3.7.6             car_3.1-3              
  carData_3.0-5           cards_0.7.0             cardx_0.3.0            
  caret_7.0-1             cellranger_1.1.0        checkmate_2.3.3        
  class_7.3-23            cli_3.6.5               clipr_0.8.0            
  clock_0.7.3             cluster_2.1.8.1         codetools_0.2-20       
  colorspace_2.1-2        commonmark_2.0.0        compiler_4.5.1         
  conflicted_1.2.0        cowplot_1.2.0           cpp11_0.5.2            
  crayon_1.5.3            curl_7.0.0              data.table_1.17.8      
  DBI_1.2.3               dbplyr_2.5.1            DEoptimR_1.1.4         
  Deriv_4.2.0             diagram_1.6.5           digest_0.6.37          
  doBy_4.7.0              doRNG_1.8.6.2           dplyr_1.1.4            
  dtplyr_1.3.2            e1071_1.7-16            evaluate_1.0.5         
  farver_2.1.2            fastmap_1.2.0           fontawesome_0.5.3      
  fontBitstreamVera_0.1.1 fontLiberation_0.1.0    fontquiver_0.2.1       
  forcats_1.0.1           foreach_1.5.2           foreign_0.8-90         
  Formula_1.2-5           fs_1.6.6                future_1.67.0          
  future.apply_1.20.0     gargle_1.6.0            gdtools_0.4.4          
  generics_0.1.4          GGally_2.4.0            ggformula_1.0.0        
  ggiraph_0.9.2           ggplot2_4.0.0           ggrepel_0.9.6          
  ggridges_0.5.7          ggstats_0.11.0          glmnet_4.1.10          
  globals_0.18.0          glue_1.8.0              googledrive_2.1.2      
  googlesheets4_1.1.2     gower_1.0.2             graphics_4.5.1         
  grDevices_4.5.1         grid_4.5.1              gridExtra_2.3          
  gt_1.1.0                gtable_0.3.6            gtsummary_2.4.0        
  hardhat_1.4.2           haven_2.5.5             highr_0.11             
  Hmisc_5.2-4             hms_1.1.4               htmlTable_2.4.3        
  htmltools_0.5.8.1       htmlwidgets_1.6.4       httr_1.4.7             
  ids_1.0.1               ipred_0.9-15            isoband_0.2.7          
  iterators_1.0.14        itertools_0.1.3         janitor_2.2.1          
  jquerylib_0.1.4         jsonlite_2.0.0          juicyjuice_0.1.0       
  KernSmooth_2.23.26      knitr_1.50              labeling_0.4.3         
  labelled_2.15.0         laeken_0.5.3            lattice_0.22-7         
  lava_1.8.1              lifecycle_1.0.4         listenv_0.9.1          
  litedown_0.7            lme4_1.1.37             lmtest_0.9.40          
  lubridate_1.9.4         magrittr_2.0.4          markdown_2.0           
  MASS_7.3-65             Matrix_1.7-3            MatrixModels_0.5-4     
  memoise_2.0.1           methods_4.5.1           mgcv_1.9.3             
  microbenchmark_1.5.0    mime_0.13               minqa_1.2.8            
  missForest_1.5          ModelMetrics_1.2.2.2    modelr_0.1.11          
  mosaic_1.9.2            mosaicCore_0.9.5        mosaicData_0.20.4      
  multcomp_1.4-29         mvtnorm_1.3-3           naniar_1.1.0           
  nlme_3.1-168            nloptr_2.2.1            nnet_7.3-20            
  norm_1.0.11.1           numDeriv_2016.8.1.1     openssl_2.3.4          
  parallel_4.5.1          parallelly_1.45.1       patchwork_1.3.2        
  pbkrtest_0.5.5          pillar_1.11.1           pkgconfig_2.0.3        
  plyr_1.8.9              polspline_1.1.25        prettyunits_1.2.0      
  pROC_1.19.0.1           processx_3.8.6          prodlim_2025.04.28     
  progress_1.2.3          progressr_0.17.0        proxy_0.4-27           
  ps_1.9.1                purrr_1.1.0             quantreg_6.1           
  R6_2.6.1                ragg_1.5.0              randomForest_4.7.1.2   
  ranger_0.17.0           rappdirs_0.3.3          rbibutils_2.3          
  RColorBrewer_1.1-3      Rcpp_1.1.0              RcppEigen_0.3.4.0.2    
  Rdpack_2.6.4            reactable_0.4.4         reactR_0.6.1           
  readr_2.1.5             readxl_1.4.5            recipes_1.3.1          
  reformulas_0.4.1        rematch_2.0.0           rematch2_2.1.2         
  reprex_2.1.1            reshape2_1.4.4          rlang_1.1.6            
  rmarkdown_2.30          rms_8.1-0               rngtools_1.5.2         
  robustbase_0.99.6       rpart_4.1.24            rstudioapi_0.17.1      
  rvest_1.0.5             S7_0.2.0                sandwich_3.1-1         
  sass_0.4.10             scales_1.4.0            selectr_0.4.2          
  sessioninfo_1.2.3       shape_1.4.6.1           simputation_0.2.9      
  snakecase_0.11.1        sp_2.2.0                SparseM_1.84-2         
  sparsevctrs_0.3.4       splines_4.5.1           SQUAREM_2021.1         
  stats_4.5.1             stats4_4.5.1            stringi_1.8.7          
  stringr_1.5.2           survival_3.8-3          sys_3.4.3              
  systemfonts_1.3.1       textshaping_1.0.4       TH.data_1.1-4          
  tibble_3.3.0            tidyr_1.3.1             tidyselect_1.2.1       
  tidyverse_2.0.0         timechange_0.3.0        timeDate_4051.111      
  tinytex_0.57            tools_4.5.1             tzdb_0.5.0             
  UpSetR_1.4.0            utf8_1.2.6              utils_4.5.1            
  uuid_1.2.1              V8_8.0.1                vcd_1.4.13             
  vctrs_0.6.5             VIM_6.2.6               viridis_0.6.5          
  viridisLite_0.4.2       visdat_0.6.0            vroom_1.6.6            
  withr_3.0.2             xfun_0.52               xgboost_1.7.11.1       
  xml2_1.4.0              yaml_2.3.10             zoo_1.8-14

R Packages and Setup

5 Data Source

6 The Subjects

7 Loading and Tidying the Data

7.1 Loading the Raw Data

7.2 Cleaning the Data

7.2.1 Selecting Variables of Interest

7.2.2 Converting Variable Types

7.2.3 Changing Variable Names

7.2.4 Working with Categorical Predictors

7.2.4.1 sex variable

7.2.4.2 race variable

7.2.4.3 adult_edu variable

7.2.4.4 health_status variable

7.2.4.5 violence variable

7.2.5 Assessment of Missing Values

7.2.6 Arranging the Tibble

8 The Tidy Tibble

8.1 Listing the Tibble

8.2 Size and Identifiers

8.3 Save The Tibble

9 The Code Book

9.1 Defining the Variables

9.2 Numerical Description

10 Linear Regression Plans

10.1 My First Research Question

10.2 My Quantitative Outcome

10.3 My Planned Predictors (Linear Model)

11 Logistic Regression Plans

11.1 My Second Research Question

11.2 My Binary Outcome

11.3 My Planned Predictors (Logistic Model)

12 Linear Regression Analyses

12.1 Missingness

12.1.1 Single Imputation Approach

12.2 Outcome Transformation

12.3 Scatterplot Matrix and Collinearity

12.4 Model A

12.4.1 Fitting Model A

12.4.2 Tidied Coefficient Estimates (Model A)

12.4.3 Summarizing Fit (Model A)

12.4.4 Regression Diagnostics (Model A)

12.5 Non-Linearity

12.6 Model B

12.6.1 Fitting Model B

12.6.2 Tidied Coefficient Estimates (Model B)

12.6.3 Effects Plot for Model B

12.6.4 Summarizing Fit (Model B)

12.6.5 Regression Diagnostics (Model B)

12.7 Validating Models A and B

12.7.1 Validated \(R^2\), and MSE as well as IC statistics

12.8 Final Linear Regression Model

12.8.1 Winning Model’s Parameter Estimates

12.8.2 Effects Plot for Winning Model

12.8.3 Numerical Description of Effect Sizes

12.8.4 Effect Size Description

12.8.5 Nomogram of Winning Model

12.8.6 Prediction for a New Subject

13 Logistic Regression Analyses

13.1 Predictor Selection

13.2 Dealing with missingness

13.3 Model Y: Main effects model

13.3.1 ModelY Results using glm

13.3.2 ModelY Results using lrm

13.3.3 Key fit summary statistics

13.3.4 Confusion matrix

13.4 Non-linearity

13.5 Model Z: incorporating non-linear terms

13.5.1 Modelz Results using glm

13.5.2 Modelz Results using lrm

13.5.3 Key fit summary statistics

13.5.4 Confusion matrix

13.6 Validating the models

13.7 Final Model

14 Discussion

15 Affirmation

16 References

17 Session Information

13.3.1 ModelY Results using `glm`

13.3.2 ModelY Results using `lrm`

13.5.1 Modelz Results using `glm`

13.5.2 Modelz Results using `lrm`