The data set used for this study is the National Health Interview Survey (NHIS). The NHIS is part of the Centers for Disease Control and Prevention (CDC) and the main source of statistics on the health of the civilian non-institutionalized population of the United States. The data is collected through interviews conducted at the households of the subjects. The target population are non-institutionalized US citizens, including group homes and shelters. The NHIS aims to obtain a representative sample of the US population through geographically clustered sampling techniques. The data collection is conducted throughout the year, from January to December. For this study, the 2022 NHIS sample child interview data set was used. The file can be found at https://www.cdc.gov/nchs/nhis/2022nhis.htm.
6 The Subjects
The subjects of this study are children (any individual below the age of 18) living in civilian non-institutionalized households. The sample is restricted to those in the Midwest region of the U.S.
In this step, we will mutate, rename, and regroup our variables as needed.
7.2.1 Selecting Variables of Interest
Given the extensive size of the raw data set, we will filter the data set to include the variables of interest for our analysis. Additionally, we will narrow the population of interest to those living in the Midwest.
We will convert the variables into factors, except for HHX, which will remain as a character. The variables AGE_C and SDQTOT_C are the only quantitative variables and will be converted into numeric variables.
For this project, we have four categorical variables acting as predictors for both analyses and one binary outcome for the logistic regression analysis: sex, race, adult_edu, health_status, and violence
7.2.4.1 sex variable
We will rename the numeric levels of sex into meaningful labels:
child22$sex n percent
Male 613 0.5146935
Female 578 0.4853065
7.2.4.2 race variable
We will rename each category of the race variable to clarify the meaning of each level. We will also combine the levels with the lowest frequencies under other category.
child22$health_status n percent valid_percent
Excellent 733 0.6154492024 0.615966387
Very Good 289 0.2426532326 0.242857143
Good 141 0.1183879093 0.118487395
Fair 24 0.0201511335 0.020168067
Poor 3 0.0025188917 0.002521008
<NA> 1 0.0008396306 NA
We would like to collapse them into the following categories:
Excellent
Very Good
Not very good
Code
child22 <- child22 |>mutate (health_status=fct_collapse(health_status, "Not very good"=c("Good","Fair", "Poor")))tabyl(child22$health_status)
child22$health_status n percent valid_percent
Excellent 733 0.6154492024 0.6159664
Very Good 289 0.2426532326 0.2428571
Not very good 168 0.1410579345 0.1411765
<NA> 1 0.0008396306 NA
7.2.4.5 violence variable
We will rename the violence numeric levels into meaningful labels:
child22$violence n percent valid_percent
Yes 90 0.07556675 0.07698888
No 1079 0.90596138 0.92301112
<NA> 22 0.01847187 NA
7.2.5 Assessment of Missing Values
We will employ the miss_var_summary function to assess missing values in the study variables. The two outcomes of the linear and logistic regression (sdq and violence) have missing values. One of the predictors (health_status) also has a missing value.
This is the count of the final data set child22 after refining the variables of interest:
Code
dim(child22)
[1] 1191 8
we have 1191 observations and 8 variables of interest (including id).
8 The Tidy Tibble
8.1 Listing the Tibble
We will observe the finalized version of our data set after cleaning and filtering the variables:
Code
child22
# A tibble: 1,191 × 8
id sex age adult_edu health_status sdq violence race
<chr> <fct> <dbl> <fct> <fct> <dbl> <fct> <fct>
1 H051614 Female 14 bachelors_degree Not very good 9 No White
2 H016609 Female 4 incomplete_highschool Excellent 15 No Hisp…
3 H037925 Male 2 incomplete_highschool Excellent 3 No Hisp…
4 H039485 Male 7 graduate_degree Excellent 1 No White
5 H008818 Male 2 incomplete_college Excellent 7 No White
6 H064500 Male 14 graduate_degree Excellent 5 No Asian
7 H029084 Male 4 bachelors_degree Very Good 4 No White
8 H037799 Female 10 bachelors_degree Very Good 12 No White
9 H012805 Male 12 bachelors_degree Very Good NA <NA> White
10 H003829 Male 9 graduate_degree Excellent 1 No White
# ℹ 1,181 more rows
8.2 Size and Identifiers
The tibble frame child22 consists of 1191 rows (observations) and 8 columns (variables). The id functions as our indicator variable, guaranteeing that each row in our data set has a distinct identification. This is demonstrated in the code provided below.
Code
identical(nrow(child22), n_distinct(child22$id))
[1] TRUE
8.3 Save The Tibble
We will use the write_rds function to store the finalized data set.
Code
write_rds(child22,"child22.Rds")
9 The Code Book
9.1 Defining the Variables
Variable
Role
Type
Description
id
Subject Identifier Number
-
Subject character code
sex
input
Binary
Male and Female
age
input
Quantitative
Child age in years
race
input
5 Categories
Multiple race groups (White, Hispanic, African American, Asian, Other)
adult_edu
input
6 Categories
Level of education of the adults in the child’s family (Incomplete high school, Complete high school, Incomplete college, Associate degree, Bachelors degree, Graduate degree )
health_status
input
3 Categories
The health status of the children interviewed (Excellent, Very Good, Not very good)
sdq
Outcome
Quantitative
Strengths and Difficulties Questionnaire (SDQ) total score
violence
Outcome
Binary
The subject is a Victim of/ have witnessed violence (yes or no)
Code
tbl_summary(select(child22, -id),label =list(sex ="Sex",age ="Age in years",race ="Race category",adult_health ="Adult Educational level",Health_Status ="Child Health Status",sdq ="Strengths and Difficulties Questionnaire (SDQ) total score",violence ="Victim of or witnessed violence"),stat =list( all_continuous() ~"{median} [{min} to {max}]" ))
Characteristic
N = 1,1911
Sex
Male
613 (51%)
Female
578 (49%)
Age in years
8.0 [1.0 to 14.0]
adult_edu
incomplete_highschool
40 (3.4%)
complete_highschool
166 (14%)
incomplete_college
172 (14%)
associate_degree
195 (16%)
bachelors_degree
321 (27%)
graduate_degree
297 (25%)
health_status
Excellent
733 (62%)
Very Good
289 (24%)
Not very good
168 (14%)
Unknown
1
Strengths and Difficulties Questionnaire (SDQ) total score
Value Yes No
Frequency 90 1079
Proportion 0.077 0.923
race
n
missing
distinct
1191
0
5
Value Hispanic White African_American Asian
Frequency 163 804 78 63
Proportion 0.137 0.675 0.065 0.053
Value other
Frequency 83
Proportion 0.070
10 Linear Regression Plans
10.1 My First Research Question
Among children in the Midwest, Do specific social factors (age, sex, race, health status, and parental educational level) contribute to their mental health status score?
10.2 My Quantitative Outcome
For this research question, we will use the Strengths and Difficulties Questionnaire (SDQ) score to measure the subjects’ mental health status. The outcome is named sdq in the tibble. We are interested in using this outcome because it is an objective measure of the children mental health status. Hence, it will be a reliable tool in exploring potential associations with social factors, namely their age, sex, race, health status, and parental educational level.
Code
count(complete.cases(child22$sdq))
n_TRUE
1173
There are 1173 cases with complete sdq score values.
Code
p1 <-ggplot(child22, aes(sample = sdq)) +geom_qq(col ="navyblue") +geom_qq_line(col ="pink") +theme(aspect.ratio =1) +labs(y ="(SDQ) total score", x ="Standard Normal Distribution")p2 <-ggplot(child22, aes(x = sdq)) +geom_histogram(aes(y =stat(density)), bins =10, fill ="navyblue", col ="white") +stat_function(fun = dnorm, args =list(mean =mean(child22$sdq), sd =sd(child22$sdq)),col ="pink", lwd =1.5) +labs(y ="count", x ="(SDQ) total score")p3 <-ggplot(child22, aes(x = sdq, y ="")) +geom_violin(fill ="navyblue") +geom_boxplot(width =0.3, col ="white", notch =TRUE, outlier.color ="navyblue") +labs(x ="(SDQ) total score", y ="")p1 + (p2 / p3 +plot_layout(heights =c(4,1))) +plot_annotation(title ="SDQ Total Scores from NHIS Data",subtitle ="Higher SDQ score indicates more psychological difficulties",caption ="n = 1173")
The outcome is not normally distributed, rather, it is remarkably right skewed. A transformation should be considered.
race is a categorical variable with 5 levels. Each level contains over 30 observations:
Code
tabyl(child22$race)
child22$race n percent
Hispanic 163 0.13685978
White 804 0.67506297
African_American 78 0.06549118
Asian 63 0.05289673
other 83 0.06968934
We will demonstrate that the total number of candidate predictors we suggest is no more than: 4+(N1-100)/100. N1 is the total number of rows with complete outcome sdq data = 1173. 4+(1173-100)/100 = 14.73 ~ 15
The total number of candidate predictors is 5 (sex, age, race, health_status, and adult_edu) which is way below 15.
The first speculation we hold is that females may be associated with higher sdq scores. Our belief is based on the fact that females experience puberty earlier than their male counterpart, which is strongly associated with mental health outcomes. We believe that the child’s health status and parental educational level may have a negative association with their sdq mental health score. We also speculate that race may contribute to their mental health score, where children from minority groups may have higher scores compared to other race categories.
11 Logistic Regression Plans
11.1 My Second Research Question
Among children in the Midwest, Do specific social factors (age, sex, race, health status, and parental educational level) contribute to the probability of witnessing or experiencing violence?
11.2 My Binary Outcome
The outcome is the odds of children being victims or witnesses of violence. It is present in the tibble as violence. We are interested in using this measure to calculate the odds of children facing adverse events when selected social factors are taken into account. This can be a powerful tool in predicting and preventing child abuse. It can also be used in creating a “profile” of children most likely becoming victims of violence based on their social background.
The outcome variable:
Code
tabyl(child22$violence)
child22$violence n percent valid_percent
Yes 90 0.07556675 0.07698888
No 1079 0.90596138 0.92301112
<NA> 22 0.01847187 NA
violence has two categories: yes and no. There are 22 missing values.
11.3 My Planned Predictors (Logistic Model)
The social background encompasses multiple variables including the child’s sex, age, race, health_status, and parental educational level adult_edu.
The age variable is quantitative with more than 10 ordered observations:
race is a 5 level categorical variable, with each level having over 30 observations:
Code
tabyl(child22$race)
child22$race n percent
Hispanic 163 0.13685978
White 804 0.67506297
African_American 78 0.06549118
Asian 63 0.05289673
other 83 0.06968934
We will demonstrate that the total number of candidate predictors we suggest is no more than: 4+(N2-100)/100. N2 is the number of the smaller group in the outcome violence. In this case N2 = 90. 4+(90-100)/100 = 3.9 ~ 4
The total number of candidate predictors is 5 (sex, age, race, health_status, and adult_edu) which is more than the suggested number. We will have to remove one predictor before proceeding to analysis.
The first speculation we hold is that younger children (one of the most vulnerable populations) have higher odds of being victims of or witnessing violence. We also believe that the child’s health status and parental educational level may have a negative association with the odds of them being victims or witnesses of violence. Race could contribute to outcome, where children from minority groups may have higher odds of being victims of or witnessing violence compared to children from other race categories.
12 Linear Regression Analyses
12.1 Missingness
We will use miss_var_summary to assess the missingness
We’ve noticed that there are missing observations in our outcome variable sdq and the predictor variable health_status.
12.1.1 Single Imputation Approach
For this analysis, we will proceed under the assumption that the observations are missing at random (MAR). Under this assumption, using a single imputation method to handle the missing values is suitable for our predictor health_status. We’ll use the simputation package to accomplish it. However, we will exclude all the missing observations in our outcome via complete cases.
We used the n_miss function to ensure that we obtained zero missing observations.
12.2 Outcome Transformation
We will use Box-Cox function to determine whether our outcome requires transformation to improve its skewness.
Code
mod_temp <-lm(sdq ~ age + sex + race + adult_edu + health_status, data = child22_sub)car::boxCox(mod_temp)
With a suggested transformation power of around 0.4, it appears that a square root transformation might be the most appropriate approach in the current case.
To verify the given suggestion, we will generate appropriate visualization to assess the skewness
Code
child22_sub <- child22_sub |>mutate(sqrtsdq =sqrt(sdq))p1 <-ggplot(data = child22_sub, aes(sample = sqrtsdq)) +geom_qq(col ="cornflowerblue") +geom_qq_line(col ="red") +theme(aspect.ratio =1) +labs(x ="Expected N(0,1)", y ="sqrt(SDQ)")p2 <-ggplot(data = child22_sub, aes(x = sqrtsdq)) +geom_histogram(bins =12 , fill ="cornflowerblue", col ="white") +labs(x ="sqrt(SDQ)", y ="counts")p3 <-ggplot(data = child22_sub, aes(x = sqrtsdq, y ="")) +geom_violin() +geom_boxplot(fill ="cornflowerblue", width =0.3,outlier.color ="cornflowerblue", outlier.size =3) +stat_summary(fun ="mean", geom ="point",shape =23, size =3, fill ="white") +labs(y ="", x ="sqrt(SDQ)")p1 + (p2 / p3 +plot_layout(heights =c(2,1)))+plot_annotation(title ="Evaluating the square root of SDQ score")
The graphs provided demonstrate that although some skewness is still visible in both the normal Q-Q plot and histogram, there has been an improvement. Furthermore, the violin plot demonstrates a reduction of the outliers.
12.3 Scatterplot Matrix and Collinearity
We will evaluate collinearity between predictors via scatter plot matrix and variance inflation factors.
At the bottom right corner, we observe the density function plot representing our outcome variable sqrtsdq. The correlation among the predictors is fairly modest, and the scatter plot matrix reveals no no major concerns regarding collinearity among predictors.
Checking Variance Inflation Factors
Code
mod_A <-lm(sqrtsdq ~ age + sex + race + health_status + adult_edu, data = child22_sub)car::vif(mod_A)
The predictors’ generalized variance inflation factors (GVIF) demonstrate a small level of collinearity. We therefore have no serious concerns with respect to collinearity.
12.4 Model A
12.4.1 Fitting Model A
Now we will build mod_A “main effect” which incorporates five predictors age, sex, race, health_status, adult_edu using lm.
Code
mod_A <-lm(sqrtsdq ~ age + sex + race + health_status + adult_edu, data = child22_sub)
Our next step will be to fit model A “main effect” with the ols() function from the rms package, which we will use later.
Code
dd <-datadist(child22_sub)options(datadist ="dd")mod_A_ols <-ols(sqrtsdq ~ age + sex + race + health_status + adult_edu,data = child22_sub, x =TRUE, y =TRUE)
12.4.2 Tidied Coefficient Estimates (Model A)
We will use the tidy function to obtain the estimates, standard error, lower and upper confidence interval and p-values.
The R-squared value indicates that the model accounts for 12.4% of the variation in sqrtsdq. The adjusted R-squared is an index and it is is 0.11, and the model is based on 1173 observations, and has 13 df.
Given the outcome’s discrete nature and the presence of four categorical predictors, we note certain patterns in our linearity assumption in the residual vs. fitted plot and the homoscedasticity assumption in the scale-location plots. However, the Q-Q plot of residuals reveals a few outliers at the lower end, albeit not influential as per the residuals vs. leverage plot (nothing passed the Cook’s distance). In conclusion, while there are minor deviations, overall, our assumptions appear to be reasonably met.
12.5 Non-Linearity
Presented below is the needed Spearman \(\rho^2\) plot.
Code
plot(spearman2(sqrtsdq ~ age + sex + race + health_status + adult_edu,data = child22_sub))
Based to the Spearman \(\rho^2\) plot, the most attractive candidate predictors for non-linear terms are health_status and adult_edu, respectively. We will include an interaction between health_status (2 df) and adult_edu (5 df), which is expected to contribute 10 degrees of freedom to our initial model.
Note: this approach has received approval from Prof. Thomas Love on 2024-03-17.
12.6 Model B
Our Model B will include one non-linear term the interaction between health_status and adult_edu, this should add about 10 degrees of freedom to our Model A.
12.6.1 Fitting Model B
Now we will build augmented model B with interaction term which incorporates five predictors age, sex, race, health_status*adult_edu using lm.
Code
mod_B <-lm(sqrtsdq ~ age + sex + race + health_status * adult_edu, data = child22_sub)
Our next step will be to fit model B with the ols() function from the rms package.
Code
dd <-datadist(child22_sub)options(datadist ="dd")mod_B_ols <-ols(sqrtsdq ~ age + sex + race + health_status * adult_edu, data = child22_sub, x =TRUE, y =TRUE)
12.6.2 Tidied Coefficient Estimates (Model B)
We will use the tidy function to obtain the estimates, standard error, lower and upper confidence interval and p-values.
In our Model B, we use 23 degrees of freedom, and obtain an \(R^2\) value of 0.131. The R-squared value indicates that the model accounts for 13 % of the variation in sqrtsdq. The adjusted R-squared is an index and it is is 0.11, and the model is based on 1173 observations, and has 23 df.
Considering the discrete nature of the outcome and the presence of four categorical predictors, we observed certain patterns in the residual vs. fitted plot and the scale-location plots. However, these patterns do not substantially affect the linearity and homoscedasticity assumptions. Additionally, the Q-Q plot of residuals shows a few outliers at both ends. Among these outliers, observations 627 and 859 exhibit high leverage, but they are not influential according to the residuals vs. leverage plot (none exceed the Cook’s distance threshold). In summary, while there are minor deviations, our assumptions seem to be reasonably met overall.
12.7 Validating Models A and B
We will use the validate() function from the rms package to validate our ols fits that we previously created by re-sampling.
After performing the validation, the output provides the R-squared value, MSE (mean squared error), intercept, and slope, along with their respective indices: “index.orig,” “training,” “test,” and “optimism,” which represents the difference between the training and test samples. Additionally, it includes the “index.corrected,” calculated as “index.orig” minus “optimism.”
12.7.1 Validated \(R^2\), and MSE as well as IC statistics
Model
Validated \(R^2\)
Validated MSE
AIC
BIC
df
A
0.102
0.9088
3209.4
3285.4
13
B
0.097
0.9041
3220.3
3347
23
12.8 Final Linear Regression Model
The table above provides information about the validated R-squared, MSE, AIC, and BIC. These values were obtained from a validation comparison between the “augmented model” B and the “main effects” model A, predicting the transformed outcome (sqrtsdq). Validated R-squared represents the percentage of variation in the outcome, while MSE stands for mean squared error, and AIC and BIC are Akaike and Bayesian Information Criteria, respectively. A more favorable fit is indicated by a higher validated R-squared and lower values of MSE, AIC, and BIC. In this case, model A “main effects” or the simple model shows a higher validated R-squared value and lower AIC and BIC, suggesting the most desirable fit, despite Model B having a lower validated MSE.Additionally, it’s worth noting that the inclusion of interaction terms in the augmented model yields no apparent improvement.
12.8.1 Winning Model’s Parameter Estimates
Here, we will generate the content of mod_A_ols
Code
mod_A_ols
Linear Regression Model
ols(formula = sqrtsdq ~ age + sex + race + health_status + adult_edu,
data = child22_sub, x = TRUE, y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 1173 LR chi2 155.06 R2 0.124
sigma0.9439 d.f. 13 R2 adj 0.114
d.f. 1159 Pr(> chi2) 0.0000 g 0.399
Residuals
Min 1Q Median 3Q Max
-2.24347 -0.68510 -0.03101 0.61304 2.97395
Coef S.E. t Pr(>|t|)
Intercept 2.6027 0.1684 15.45 <0.0001
age -0.0145 0.0067 -2.17 0.0298
sex=Female -0.0140 0.0555 -0.25 0.8014
race=White 0.1848 0.0866 2.13 0.0331
race=African_American 0.0426 0.1334 0.32 0.7498
race=Asian -0.1500 0.1452 -1.03 0.3018
race=other 0.2998 0.1299 2.31 0.0212
health_status=Very Good 0.5453 0.0670 8.14 <0.0001
health_status=Not very good 0.7145 0.0846 8.44 <0.0001
adult_edu=complete_highschool -0.1726 0.1684 -1.02 0.3058
adult_edu=incomplete_college -0.2440 0.1684 -1.45 0.1477
adult_edu=associate_degree -0.1031 0.1680 -0.61 0.5396
adult_edu=bachelors_degree -0.4218 0.1650 -2.56 0.0107
adult_edu=graduate_degree -0.3540 0.1656 -2.14 0.0328
The validated \(R^2\) value ( = 0.102) for our Model A
The output provides details about the formula used for the model, along with the number of observations. It includes the estimated residual standard deviation (sigma) and its corresponding degrees of freedom (df), as well as the model’s likelihood ratio test and discrimination indexes. Additionally, it presents a statistical summary, including the minimum, first quartile (1Q), median, third quartile (3Q), and maximum values for “sqrtsdq.” Finally, it lists the coefficient values, standard errors, t-test statistics, and p-values for all predictors.
12.8.2 Effects Plot for Winning Model
Here, we will present a plot of the effects from plot(summary(modelname))for this simple main effect model, using ols
Code
plot(summary(mod_A_ols))
12.8.3 Numerical Description of Effect Sizes
Here, we will obtain the effects summary of the ols model.
This effects summary of model A illustrates the impact on sqrtsdq when moving from the 25th to the 75th percentile of each variable. The estimates are accompanied by their respective standard errors and 95% confidence intervals, while maintaining the other variables at a constant level.
12.8.4 Effect Size Description
health_status : If two children share the same age, sex, race, and parental education, our model predicts that a child reporting “Not very good” overall health status will have a square root of SDQ score 0.741 higher than a child reporting “Excellent” overall health. This prediction comes with a meaningful 95% confidence interval of (0.548, 0.880).
12.8.5 Nomogram of Winning Model
Here, we will hypothetically generate predicted SDQ scores for two children, both female, one of White race and the other Hispanic. Both are 8 years old with reported “not very good” health statuses, and their parental education level is incomplete college education. Manually
Child one: age 8 = ~12.5 points, sex Female= 0 points, race White= ~ 47.5 points, health_status not very good = 100 points, adult_edu incomplete college= ~ 25, total points= ~ 185 and it is correspond to ~3.1 in the linear predictor scale.
Child two: age 8 = ~12.5 points, sex Female= 0 points, race Hispanic = ~ 21 points, health_status not very good = 100 points, adult_edu incomplete college= ~ 25, total points= ~ 185 and it is correspond to ~ 2.7 in the linear predictor scale.
These predicted values are square root.
Code
plot(nomogram(mod_A_ols, fun = exp))
12.8.6 Prediction for a New Subject
We will generate predicted SDQ scores for two children, both female, one of White race and the other Hispanic. Both are 8 years old with reported “not very good” health statuses, and their parental education level is incomplete college education. We will use the code below and then squaring the predictions.
Code
new_subjects <-data.frame(age =c(8,8), sex =c("Female", "Female"), race =c("White", "Hispanic"),health_status =c("Not very good", "Not very good"), adult_edu =c("incomplete_college", "incomplete_college"))preds1 <-predict.lm(mod_A, newdata = new_subjects, interval ="prediction", level =0.95)(preds1)^2
# A tibble: 1,191 × 7
id sex age race health_status adult_edu violence
<chr> <fct> <dbl> <fct> <fct> <fct> <fct>
1 H051614 Female 14 White Not very good bachelors_degree No
2 H016609 Female 4 Hispanic Excellent incomplete_highschool No
3 H037925 Male 2 Hispanic Excellent incomplete_highschool No
4 H039485 Male 7 White Excellent graduate_degree No
5 H008818 Male 2 White Excellent incomplete_college No
6 H064500 Male 14 Asian Excellent graduate_degree No
7 H029084 Male 4 White Very Good bachelors_degree No
8 H037799 Female 10 White Very Good bachelors_degree No
9 H012805 Male 12 White Very Good bachelors_degree <NA>
10 H003829 Male 9 White Excellent graduate_degree No
# ℹ 1,181 more rows
Before proceeding any further, we will notice that our outcome violence is a factor. We will mutate it into a numeric variable with 0 and 1 levels.
In the logistic analysis plan, we decided to reduce the number of predictors from 5 variables to 4 variables. All of the predictors included in the model are child social factors including their age, sex, race, and health status. Parental educational level adult_edu is the only variable associated with parents. This variable also has the highest number of levels (6 educational levels) and therefore will use more degrees of freedom compared to the rest of the predictors. Based on these facts, we decided to remove this variable from the logistic model.
The finalized logistic model data set will include be:
# A tibble: 1,191 × 6
id sex age race health_status violence
<chr> <fct> <dbl> <fct> <fct> <dbl>
1 H051614 Female 14 White Not very good 0
2 H016609 Female 4 Hispanic Excellent 0
3 H037925 Male 2 Hispanic Excellent 0
4 H039485 Male 7 White Excellent 0
5 H008818 Male 2 White Excellent 0
6 H064500 Male 14 Asian Excellent 0
7 H029084 Male 4 White Very Good 0
8 H037799 Female 10 White Very Good 0
9 H012805 Male 12 White Very Good NA
10 H003829 Male 9 White Excellent 0
# ℹ 1,181 more rows
13.2 Dealing with missingness
We will observe the number of subjects with missing observations and the proportion of missingness in our data set.
Code
miss_var_summary(child22_log)
# A tibble: 6 × 3
variable n_miss pct_miss
<chr> <int> <num>
1 violence 22 1.85
2 health_status 1 0.0840
3 id 0 0
4 sex 0 0
5 age 0 0
6 race 0 0
There is a total of 23 subjects with missing observations and the overall proportion of missingness is 1.9%. In this case, imputing the missing observations is neither essential nor unnecessary, so we will proceed with the imputation process.
We will use the single imputation method using simpute function:
Code
set.seed(432)child22_si <- child22_log |>data.frame() |>impute_rhd(violence ~ age + sex + race) |>impute_cart(health_status ~ age + sex + race) |>as_tibble()n_miss(child22_si)
[1] 0
13.3 Model Y: Main effects model
13.3.1 ModelY Results using glm
Now we will fit the main effects model using glm function.
Code
mody_g <-glm(violence ~ sex + age + race + health_status,data = child22_si, family =binomial(link = logit))
Now we will produce a tidied table with the exponentiated estimates (odd ratios) with a 95% confidence interval.
We will check for the presence of interaction between the predictors:
Code
rms::vif(mody_g)
sexFemale age
1.008948 1.016786
raceWhite raceAfrican_American
2.344559 1.785802
raceAsian raceother
1.163887 1.718125
health_statusVery Good health_statusNot very good
1.157662 1.196977
As we can see, the inflation scores do not exceed the level of potential interaction. Therefore, we have no problems regarding collinearity between the predictors.
13.3.2 ModelY Results using lrm
We will refit the same model using the lrm function.
Code
d <-datadist(child22_si)options(datadist ="d")mody_r <-lrm(violence ~ sex + age + race + health_status,data = child22_si, x =TRUE, y =TRUE)
Now we will produce an effects plot of the odds ratio scale for the model variables.
Code
plot(summary(mody_r))
We can also assess the relationship between each predictor and the outcome independently.
Code
ggplot(Predict(mody_r, fun = plogis))
13.3.3 Key fit summary statistics
Code
mody_r
Logistic Regression Model
lrm(formula = violence ~ sex + age + race + health_status, data = child22_si,
x = TRUE, y = TRUE)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 1191 LR chi2 29.43 R2 0.059 C 0.674
0 1100 d.f. 8 R2(8,1191)0.018 Dxy 0.348
1 91 Pr(> chi2) 0.0003 R2(8,252.1)0.081 gamma 0.350
max |deriv| 5e-05 Brier 0.069 tau-a 0.049
Coef S.E. Wald Z Pr(>|Z|)
Intercept -3.5721 0.4259 -8.39 <0.0001
sex=Female 0.3074 0.2232 1.38 0.1684
age 0.0570 0.0277 2.06 0.0396
race=White 0.0563 0.3473 0.16 0.8713
race=African_American 0.9244 0.4407 2.10 0.0359
race=Asian -0.8567 0.7870 -1.09 0.2763
race=other 0.7494 0.4561 1.64 0.1004
health_status=Very Good 0.5690 0.2569 2.21 0.0268
health_status=Not very good 0.7495 0.2928 2.56 0.0105
Nagelkerke value is 0.059, which indicates that our model does not present an improvement compared to a null model. The C statistic value is 0.674, which indicates that the predictions of our model are not very reliable.
13.3.4 Confusion matrix
We will produce the confusion matrix using a cutting point of 0.08.
Confusion Matrix and Statistics
Reference
Prediction FALSE TRUE
FALSE 757 36
TRUE 343 55
Accuracy : 0.6818
95% CI : (0.6545, 0.7082)
No Information Rate : 0.9236
P-Value [Acc > NIR] : 1
Kappa : 0.1149
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.60440
Specificity : 0.68818
Pos Pred Value : 0.13819
Neg Pred Value : 0.95460
Prevalence : 0.07641
Detection Rate : 0.04618
Detection Prevalence : 0.33417
Balanced Accuracy : 0.64629
'Positive' Class : TRUE
The sensitivity of our model is 60.4%, the specificity is 68.8%, and the positive predictive value is 13.8%.
13.4 Non-linearity
Spearman \(\rho^2\) plot
Code
plot(spearman2(violence ~ sex + age + race + health_status ,data = child22_si))
The Spearman \(\rho^2\) plot is suggesting that race and health_status are the most suitable predictors for adding an interaction term. Since both variables are categorical, we will add an interaction term between race and health_status.
13.5 Model Z: incorporating non-linear terms
Now we will fit another model with an interaction term between race and health_status.
Code
modz_r <-lrm(violence ~ sex + age + race + health_status + race %ia% health_status,data = child22_si, x =TRUE, y =TRUE)
Now we will confirm the number of additional degrees of freedom using anova:
Code
anova(modz_r)
Wald Statistics Response: violence
Factor Chi-Square d.f. P
sex 1.71 1 0.1914
age 4.40 1 0.0359
race (Factor+Higher Order Factors) 15.45 12 0.2175
All Interactions 4.57 8 0.8026
health_status (Factor+Higher Order Factors) 14.60 10 0.1473
All Interactions 4.57 8 0.8026
race * health_status (Factor+Higher Order Factors) 4.57 8 0.8026
TOTAL 32.85 16 0.0077
The interaction resulted in 8 additional degrees of freedom.
13.5.1 Modelz Results using glm
We can also use the glm function to produce tidied results of the interaction model.
Code
modz_g <-glm(violence ~ sex + age + race + health_status + race %ia% health_status,data =child22_si, family =binomial(link =logit))
race %ia% health_statusrace=White * health_status=Very Good
0.802
0.807
0.158
4.047
race %ia% health_statusrace=White * health_status=Not very good
1.587
0.879
0.280
9.725
race %ia% health_statusrace=African_American * health_status=Very Good
0.201
1.140
0.018
1.765
race %ia% health_statusrace=African_American * health_status=Not very good
0.426
1.055
0.052
3.532
race %ia% health_statusrace=Asian * health_status=Very Good
0.000
597.342
0.000
165.518
race %ia% health_statusrace=Asian * health_status=Not very good
0.000
720.840
0.000
0.000
race %ia% health_statusrace=other * health_status=Very Good
0.913
1.036
0.118
7.277
race %ia% health_statusrace=other * health_status=Not very good
0.971
1.231
0.076
10.872
13.5.2 Modelz Results using lrm
We will observe the plot of the effects using the lrm function:
Code
plot(summary(modz_r))
We can also assess the relationship between each predictor and the outcome independently.
Code
ggplot(Predict(modz_r, fun = plogis))
13.5.3 Key fit summary statistics
Code
modz_r
Logistic Regression Model
lrm(formula = violence ~ sex + age + race + health_status + race %ia%
health_status, data = child22_si, x = TRUE, y = TRUE)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 1191 LR chi2 37.73 R2 0.075 C 0.683
0 1100 d.f. 16 R2(16,1191)0.018 Dxy 0.366
1 91 Pr(> chi2) 0.0017 R2(16,252.1)0.083 gamma 0.368
max |deriv| 0.0003 Brier 0.068 tau-a 0.052
Coef S.E. Wald Z
Intercept -3.6802 0.5678 -6.48
sex=Female 0.2929 0.2242 1.31
age 0.0585 0.0279 2.10
race=White 0.0334 0.5535 0.06
race=African_American 1.6004 0.6636 2.41
race=Asian 0.1720 0.8922 0.19
race=other 0.7551 0.7355 1.03
health_status=Very Good 0.9455 0.7384 1.28
health_status=Not very good 0.6911 0.7955 0.87
race=White * health_status=Very Good -0.2205 0.8066 -0.27
race=White * health_status=Not very good 0.4618 0.8791 0.53
race=African_American * health_status=Very Good -1.6043 1.1401 -1.41
race=African_American * health_status=Not very good -0.8524 1.0549 -0.81
race=Asian * health_status=Very Good -12.6790 369.3752 -0.03
race=Asian * health_status=Not very good -12.5553 446.0065 -0.03
race=other * health_status=Very Good -0.0913 1.0359 -0.09
race=other * health_status=Not very good -0.0294 1.2306 -0.02
Pr(>|Z|)
Intercept <0.0001
sex=Female 0.1914
age 0.0359
race=White 0.9519
race=African_American 0.0159
race=Asian 0.8471
race=other 0.3046
health_status=Very Good 0.2003
health_status=Not very good 0.3850
race=White * health_status=Very Good 0.7846
race=White * health_status=Not very good 0.5994
race=African_American * health_status=Very Good 0.1594
race=African_American * health_status=Not very good 0.4191
race=Asian * health_status=Very Good 0.9726
race=Asian * health_status=Not very good 0.9775
race=other * health_status=Very Good 0.9298
race=other * health_status=Not very good 0.9810
Negelkerke value is 0.075, which indicates that our model does not present an improvement compared to a null model. The C statistic value is 0.682, which indicates that the predictions of our model are not very reliable.
13.5.4 Confusion matrix
We will produce the confusion matrix using a cutting point of 0.08.
Confusion Matrix and Statistics
Reference
Prediction FALSE TRUE
FALSE 757 36
TRUE 343 55
Accuracy : 0.6818
95% CI : (0.6545, 0.7082)
No Information Rate : 0.9236
P-Value [Acc > NIR] : 1
Kappa : 0.1149
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.60440
Specificity : 0.68818
Pos Pred Value : 0.13819
Neg Pred Value : 0.95460
Prevalence : 0.07641
Detection Rate : 0.04618
Detection Prevalence : 0.33417
Balanced Accuracy : 0.64629
'Positive' Class : TRUE
The sensitivity of our model is 60.4%, the specificity is 68.8%, and the positive predictive value is 13.8%. These values match the main effects model mody values.
13.6 Validating the models
We will tidy the summaries of the main effects model mody and interaction model modz and compare them:
After validating the area under the curve value (AUC) and the R2 value, the main effects model mody displayed improved performance compared to the interaction model modz.
13.7 Final Model
After validating the models, the main effects model mody yielded better results. Both the Nagelkerke’s R squared and ROC values were improved compared to the interaction modz model. The main effects model mody will be the model of choice in this case.
The main effects model mody parameters and their coefficients with a 95% confidence interval is displayed in the following table:
The social factor associated with the highest odds of violence is African Americanrace (2.520) with a 95% CI (1.062, 6.082). The second highest social factor were Otherrace (2.116) with a 95% CI (0.857, 5.229) and health statusNot very good (2.116) with a 95% CI (1.173, 3.717). The social factor associated with the lowest odds ratio of violence was the Asianrace (0.425) with a 95% CI (0.064, 1.654 ). Among these factors, the only confidence intervals not containing 1 were African Americanrace and Not very goodhealth status, indicating that these factors produced meaningful differences in the odds ratio of violence.
To observe the effect sizes we will display the effect plot:
Code
plot(summary(mody_r))
According to this model, an 11 year old child has 1.5 the odds of witnessing/experiencing violence compared to a 4 year old child, when all other variables are held constant. Female children have about 1.4 times the odds of witnessing/experiencing violence compared to their male counterparts, also holding all other variables constant. When observing race, African American children have about 2.4 the odds witnessing/experiencing violence compared to White children, followed by Other races (2.0). Hispanic have about the same odds of witnessing/experiencing violence (about 1.0) while Asian children have lower odds of witnessing/experiencing violence (about 0.4) compared to White children. Compared to children with excellent health status, those with very good health status have about 1.7 the odds of witnessing/experiencing violence, while those with not very good health status have about 2.0 the odds of witnessing/experiencing violence. Again, these assumption are made when all other variables are held constant.
ROC Curve Plot
Code
predict.prob1 <-predict(mody_g, type ="response")roc1 <- pROC::roc(mody_g$data$violence, predict.prob1)plot(roc1, main ="ROC Curve: Main effects model `mody`", lwd =2, col ="navy")legend('bottomright', legend =paste("AUC: ", round_half_up(auc(roc1),4)))
The AUC values displayed here is the nominal AUC.
Validated Nagelkerke R-squared and the C statistic values
The validated area under the curve (AUC) value is 0.638. This means that our model’s predictions are slightly better than a random guess. The validated Nagelkerke’s R squared value is 0.022, which means that our model does not show an improvement compared to a null model.
Using the nomogram to make a prediction:
Code
plot(nomogram(mody_r, fun = plogis,fun.at =c(0.05, seq(0.1, 0.9, by =0.1), 0.95),funlabel ="Pr(violence)"))
We can use the nomogram to predict the odds ration of children by imputing the values of their social factors. We will hypothesize two subjects. The first will be a female African American child who is 5 years old and does not have a very good health status. The second subject will be male Asian child who is 13 years old and has a very good health status. By using this nomogram the first subject is predicted to have approximately 2.1 the odds of witnessing/experiencing violence while the second subject will have about 0.05 the odds. The first subject has remarkably higher probability of witnessing/experiencing violence, while the second subject has a lower probability.
14 Discussion
The first research question was: Among children in the Midwest, Do specific social factors (age, sex, race, health status, and parental educational level) contribute to their mental health status score? Based on the final linear regression model, a couple of the social factors contributed meaningfully to the outcome, specifically, race (Asian) and health status (not very good). Having an Asian race contributed positively to the mental health score, while having a not very good health status contributed negatively.
The second research question was: Among children in the Midwest, Do specific social factors (age, sex, race, health status, and parental educational level) contribute to the probability of witnessing or experiencing violence? Based on the final logistic regression model, a couple of the social factors contributed meaningfully to the outcome, specifically, race (African American) and health status (not very good). Children with these characteristics presented remarkably higher probabilities of witnessing or experiencing violence compared to children with different social characteristics.
Project A was a challenging, yet achievable project. One of the prominent issues we faced was that the outcome of the linear regression model was discrete. There were a few difficulties during the analysis, and we wished we were more proficient in other suitable methods such as the Poisson or negative binomial regression. Another challenge was choosing an appropriate transformation for the outcome. We used the Box-Cox to guide us, however that was not useful. So, we opted for a different transformation until reaching a satisfying result. We realized that the Box-Cox can be misleading. We also wished we had collapsed some of the categorical levels as they had used more degrees of freedom in the model. However, this decision may compromise our ability to investigate meaningful differences between the categorical levels. The overall process of the project was not difficult as there was a roadmap guiding us (the instructions). However, making decisions on the order of the logistic models (the main effects model and interaction model using glm and lrm function) was confusing at times. We decided to organize the models in a way that facilitated comprehension for the audience. We learned that it is imperative to keep our study goal in mind during the analysis process.
15 Affirmation
I am certain that it is completely appropriate for these data to be shared with anyone, without any conditions. There are no concerns about privacy or security.
---title: "Social Factors and Children's Mental Health"subtitle: "A cross-sectional study in the Midwest"author: "Walaa Alshaia and Sarah Albalawi"date: last-modifiedformat: html: toc: true number-sections: true code-fold: show code-tools: true code-overflow: wrap embed-resources: true date-format: iso theme: spacelab ---## R Packages and Setup {.unnumbered}```{r}#| message: false#| warning: falseknitr::opts_chunk$set(comment =NA) library(janitor) library(broom)library(gt)library(gtsummary)library(Hmisc)library(ggrepel)library(GGally)library(knitr)library(mosaic)library(naniar)library(patchwork)library(simputation)library(rms)library(caret)library(pROC)library(sessioninfo)library(tidyverse) opts_chunk$set(comment=NA)options(dplyr.summarise.inform =FALSE)theme_set(theme_bw()) ```# Data SourceThe data set used for this study is the National Health Interview Survey (NHIS). The NHIS is part of the Centers for Disease Control and Prevention (CDC) and the main source of statistics on the health of the civilian non-institutionalized population of the United States. The data is collected through interviews conducted at the households of the subjects. The target population are non-institutionalized US citizens, including group homes and shelters. The NHIS aims to obtain a representative sample of the US population through geographically clustered sampling techniques. The data collection is conducted throughout the year, from January to December. For this study, the 2022 NHIS sample child interview data set was used. The file can be found at <https://www.cdc.gov/nchs/nhis/2022nhis.htm>.# The SubjectsThe subjects of this study are children (any individual below the age of 18) living in civilian non-institutionalized households. The sample is restricted to those in the Midwest region of the U.S.# Loading and Tidying the Data## Loading the Raw DataFor this study, we will use the 2022 NHIS sample child interview data set. The file can be found at <https://www.cdc.gov/nchs/nhis/2022nhis.htm>.```{r}R.child22 <-read_csv("child22.csv",guess_max =4000,show_col_types =FALSE) |>mutate(across(where(is.character), as_factor))dim (R.child22)```## Cleaning the DataIn this step, we will mutate, rename, and regroup our variables as needed.### Selecting Variables of InterestGiven the extensive size of the raw data set, we will filter the data set to include the variables of interest for our analysis. Additionally, we will narrow the population of interest to those living in the Midwest.```{r}child_raw <- R.child22 |>select (HHX,REGION, SEX_C, AGEP_C, MAXEDUCP_C, PHSTAT_C, SDQTOT_C, VIOLENEV_C, HISPALLP_C) |>filter(`REGION`==2, AGEP_C <97, SEX_C <10, MAXEDUCP_C <97 , PHSTAT_C <10, SDQTOT_C <90, VIOLENEV_C <10, HISPALLP_C <8)child_raw```We will mutate the variables with levels containing non-relevant values, such as "refused to answer" or "not ascertained", into `NA`.```{r}child_raw$SEX_C=na_if(child_raw$SEX_C, 7)child_raw$SEX_C=na_if(child_raw$SEX_C, 8)child_raw$SEX_C=na_if(child_raw$SEX_C, 9)child_raw$SDQTOT_C=na_if(child_raw$SDQTOT_C,88)child_raw$PHSTAT_C =na_if(child_raw$PHSTAT_C, 7)child_raw$PHSTAT_C =na_if(child_raw$PHSTAT_C, 8)child_raw$PHSTAT_C =na_if(child_raw$PHSTAT_C, 9)child_raw$VIOLENEV_C =na_if(child_raw$VIOLENEV_C, 7)child_raw$VIOLENEV_C =na_if(child_raw$VIOLENEV_C, 8)child_raw$VIOLENEV_C =na_if(child_raw$VIOLENEV_C, 9)child_raw```### Converting Variable TypesWe will convert the variables into factors, except for `HHX`, which will remain as a character. The variables `AGE_C` and `SDQTOT_C` are the only quantitative variables and will be converted into numeric variables.```{r}child_raw <- child_raw |>mutate_if(is.vector, as.factor)|>mutate(HHX =as.character(HHX))|>mutate(AGEP_C =as.numeric(AGEP_C))|>mutate(SDQTOT_C=as.numeric(SDQTOT_C))child_raw```### Changing Variable NamesWe will assign meaningful names to the variables of our study.```{r}child22 <- child_raw|>select(HHX, SEX_C, AGEP_C, MAXEDUCP_C, PHSTAT_C, SDQTOT_C, VIOLENEV_C, HISPALLP_C)child22 <- child22 |>rename(sex = SEX_C,age = AGEP_C,race = HISPALLP_C,adult_edu = MAXEDUCP_C,health_status = PHSTAT_C, sdq = SDQTOT_C, violence = VIOLENEV_C,id = HHX)child22```### Working with Categorical PredictorsFor this project, we have four categorical variables acting as predictors for both analyses and one binary outcome for the logistic regression analysis: `sex`, `race`, `adult_edu`, `health_status`, and `violence`#### sex variableWe will rename the numeric levels of `sex` into meaningful labels:```{r}child22 <- child22|>mutate(sex =fct_recode (sex, "Male"="1", "Female"="2"))tabyl(child22$sex)```#### race variableWe will rename each category of the `race` variable to clarify the meaning of each level. We will also combine the levels with the lowest frequencies under *other* category.```{r}child22 <- child22 |>mutate(race =fct_recode(race,"Hispanic"="1","White"="2","African_American"="3","Asian"="4","other"="5","other"="6","other"="7" ))tabyl(child22$race)```#### adult_edu variableWe will start by renaming the `adult_edu` numeric levels into meaningful words:```{r}#| warning: falsechild22 <- child22 |>mutate(adult_edu=fct_recode (adult_edu, "Grade 1-11"="1", "12th grade,no diploma"="2", "GED"="3", "High School Graduate"="4", "Some college, no degree"="5", "Associate degree-nonacademic-"="6","Associate degree-academic-"="7","Bachelor's degree"="8","Master's degree"="9","Professional School"="10"))tabyl(child22$adult_edu)```We will collapse the present levels to the following:- Incomplete high school- Complete high school- Incomplete college- Associate's Degree- Bachelor's Degree- Graduate Degree```{r}child22 <- child22 |>mutate (adult_edu=fct_collapse(adult_edu, incomplete_highschool =c("Grade 1-11", "12th grade,no diploma"),complete_highschool =c("GED", "High School Graduate"),incomplete_college =c("Some college, no degree"),associate_degree =c("Associate degree-nonacademic-", "Associate degree-academic-"),bachelors_degree =c("Bachelor's degree"),graduate_degree =c("Master's degree", "Professional School")))tabyl(child22$adult_edu)```#### health_status variableWe will start by renaming the `health_status` numeric levels into meaningful labels:```{r}child22 <- child22|>mutate(health_status=fct_recode (health_status, "Excellent"="1", "Very Good"="2", "Good"="3", "Fair"="4", "Poor"="5"))tabyl(child22$health_status)```We would like to collapse them into the following categories:- Excellent- Very Good- Not very good```{r}#| warning: falsechild22 <- child22 |>mutate (health_status=fct_collapse(health_status, "Not very good"=c("Good","Fair", "Poor")))tabyl(child22$health_status)```#### violence variableWe will rename the `violence` numeric levels into meaningful labels:```{r}child22 <- child22|>mutate(violence =fct_recode (violence, "Yes"="1", "No"="2"))tabyl(child22$violence)```### Assessment of Missing ValuesWe will employ the `miss_var_summary` function to assess missing values in the study variables. The two outcomes of the linear and logistic regression (`sdq` and `violence`) have missing values. One of the predictors (`health_status`) also has a missing value.```{r}miss_var_summary(child22)```### Arranging the TibbleThis is the count of the final data set `child22` after refining the variables of interest:```{r}dim(child22)```we have 1191 observations and 8 variables of interest (including `id`).# The Tidy Tibble## Listing the TibbleWe will observe the finalized version of our data set after cleaning and filtering the variables:```{r}child22```## Size and IdentifiersThe tibble frame `child22` consists of 1191 rows (*observations*) and 8 columns (*variables*). The `id` functions as our indicator variable, guaranteeing that each row in our data set has a distinct identification. This is demonstrated in the code provided below.```{r}identical(nrow(child22), n_distinct(child22$id))```## Save The TibbleWe will use the write_rds function to store the finalized data set.```{r}write_rds(child22,"child22.Rds")```# The Code Book## Defining the Variables| Variable | Role | Type | Description ||:---|:---|:---|:---|| **id** | Subject Identifier Number | \- | Subject character code || **sex** | input | Binary | Male and Female || **age** | input | Quantitative | Child age in years || **race** | input | 5 Categories | Multiple race groups (White, Hispanic, African American, Asian, Other) || **adult_edu** | input | 6 Categories | Level of education of the adults in the child's family (Incomplete high school, Complete high school, Incomplete college, Associate degree, Bachelors degree, Graduate degree ) || **health_status** | input | 3 Categories | The health status of the children interviewed (Excellent, Very Good, Not very good) || **sdq** | **Outcome** | Quantitative | Strengths and Difficulties Questionnaire (SDQ) total score || **violence** | **Outcome** | Binary | The subject is a Victim of/ have witnessed violence (yes or no) |```{r}tbl_summary(select(child22, -id),label =list(sex ="Sex",age ="Age in years",race ="Race category",adult_health ="Adult Educational level",Health_Status ="Child Health Status",sdq ="Strengths and Difficulties Questionnaire (SDQ) total score",violence ="Victim of or witnessed violence"),stat =list( all_continuous() ~"{median} [{min} to {max}]" ))```## Numerical Description```{r}#| warning: falsedescribe(child22) |>html()```# Linear Regression Plans## My First Research QuestionAmong children in the Midwest, Do specific social factors (age, sex, race, health status, and parental educational level) contribute to their mental health status score?## My Quantitative OutcomeFor this research question, we will use the *Strengths and Difficulties Questionnaire* (**SDQ**) score to measure the subjects' mental health status. The outcome is named `sdq` in the tibble. We are interested in using this outcome because it is an objective measure of the children mental health status. Hence, it will be a reliable tool in exploring potential associations with social factors, namely their age, sex, race, health status, and parental educational level.```{r}count(complete.cases(child22$sdq))```There are 1173 cases with complete `sdq` score values.```{r}#| warning: falsep1 <-ggplot(child22, aes(sample = sdq)) +geom_qq(col ="navyblue") +geom_qq_line(col ="pink") +theme(aspect.ratio =1) +labs(y ="(SDQ) total score", x ="Standard Normal Distribution")p2 <-ggplot(child22, aes(x = sdq)) +geom_histogram(aes(y =stat(density)), bins =10, fill ="navyblue", col ="white") +stat_function(fun = dnorm, args =list(mean =mean(child22$sdq), sd =sd(child22$sdq)),col ="pink", lwd =1.5) +labs(y ="count", x ="(SDQ) total score")p3 <-ggplot(child22, aes(x = sdq, y ="")) +geom_violin(fill ="navyblue") +geom_boxplot(width =0.3, col ="white", notch =TRUE, outlier.color ="navyblue") +labs(x ="(SDQ) total score", y ="")p1 + (p2 / p3 +plot_layout(heights =c(4,1))) +plot_annotation(title ="SDQ Total Scores from NHIS Data",subtitle ="Higher SDQ score indicates more psychological difficulties",caption ="n = 1173")```The outcome is not normally distributed, rather, it is remarkably right skewed. A transformation should be considered.```{r}describe(child22$sdq)```As we can see, the outcome `sdq` is a quantitative variable with more than 10 ordinal values.## My Planned Predictors (Linear Model)The social background encompasses multiple variables including their `sex`, `age`, `race`, `health_status`, and parental educational level `adult_edu`.The `age` variable is quantitative with more than 10 ordered observations:```{r}describe(child22$age)````race` is a categorical variable with 5 levels. Each level contains over 30 observations:```{r}tabyl(child22$race)```We will demonstrate that the total number of candidate predictors we suggest is no more than: 4+(N1-100)/100. N1 is the total number of rows with complete outcome `sdq` data = 1173. 4+(1173-100)/100 = 14.73 \~ 15The total number of candidate predictors is 5 (`sex`, `age`, `race`, `health_status`, and `adult_edu`) which is way below 15.The first speculation we hold is that females may be associated with higher `sdq` scores. Our belief is based on the fact that females experience puberty earlier than their male counterpart, which is strongly associated with mental health outcomes. We believe that the child's health status and parental educational level may have a negative association with their `sdq` mental health score. We also speculate that race may contribute to their mental health score, where children from minority groups may have higher scores compared to other race categories.# Logistic Regression Plans## My Second Research QuestionAmong children in the Midwest, Do specific social factors (age, sex, race, health status, and parental educational level) contribute to the probability of witnessing or experiencing violence?## My Binary OutcomeThe outcome is the odds of children being victims or witnesses of violence. It is present in the tibble as `violence`. We are interested in using this measure to calculate the odds of children facing adverse events when selected social factors are taken into account. This can be a powerful tool in predicting and preventing child abuse. It can also be used in creating a "profile" of children most likely becoming victims of violence based on their social background.The outcome variable:```{r}tabyl(child22$violence)````violence` has two categories: yes and no. There are 22 missing values.## My Planned Predictors (Logistic Model)The social background encompasses multiple variables including the child's `sex`, `age`, `race`, `health_status`, and parental educational level `adult_edu`.The `age` variable is quantitative with more than 10 ordered observations:```{r}describe(child22$age)````race` is a 5 level categorical variable, with each level having over 30 observations:```{r}tabyl(child22$race)```We will demonstrate that the total number of candidate predictors we suggest is no more than: 4+(N2-100)/100. N2 is the number of the smaller group in the outcome `violence`. In this case N2 = 90. 4+(90-100)/100 = 3.9 \~ 4The total number of candidate predictors is 5 (`sex`, `age`, `race`, `health_status`, and `adult_edu`) which is more than the suggested number. We will have to remove one predictor before proceeding to analysis.The first speculation we hold is that younger children (one of the most vulnerable populations) have higher odds of being victims of or witnessing violence. We also believe that the child's health status and parental educational level may have a negative association with the odds of them being victims or witnesses of violence. Race could contribute to outcome, where children from minority groups may have higher odds of being victims of or witnessing violence compared to children from other race categories.# Linear Regression Analyses## MissingnessWe will use `miss_var_summary` to assess the missingness```{r}miss_var_summary(child22) |>filter(n_miss >0)```We've noticed that there are missing observations in our outcome variable `sdq` and the predictor variable `health_status`.### Single Imputation ApproachFor this analysis, we will proceed under the assumption that the observations are missing at random (MAR). Under this assumption, using a single imputation method to handle the missing values is suitable for our predictor `health_status`. We’ll use the `simputation` package to accomplish it. However, we will exclude all the missing observations in our outcome via complete cases.```{r}child22_sub <- child22 |>select(sdq, sex, race, adult_edu, age, health_status)|>filter(complete.cases(sdq))|>impute_cart(health_status ~ sex + race + adult_edu+ age+ sdq)n_miss(child22_sub)```We used the `n_miss` function to ensure that we obtained zero missing observations.## Outcome TransformationWe will use Box-Cox function to determine whether our outcome requires transformation to improve its skewness.```{r}mod_temp <-lm(sdq ~ age + sex + race + adult_edu + health_status, data = child22_sub)car::boxCox(mod_temp)```With a suggested transformation power of around 0.4, it appears that a square root transformation might be the most appropriate approach in the current case.To verify the given suggestion, we will generate appropriate visualization to assess the skewness```{r}child22_sub <- child22_sub |>mutate(sqrtsdq =sqrt(sdq))p1 <-ggplot(data = child22_sub, aes(sample = sqrtsdq)) +geom_qq(col ="cornflowerblue") +geom_qq_line(col ="red") +theme(aspect.ratio =1) +labs(x ="Expected N(0,1)", y ="sqrt(SDQ)")p2 <-ggplot(data = child22_sub, aes(x = sqrtsdq)) +geom_histogram(bins =12 , fill ="cornflowerblue", col ="white") +labs(x ="sqrt(SDQ)", y ="counts")p3 <-ggplot(data = child22_sub, aes(x = sqrtsdq, y ="")) +geom_violin() +geom_boxplot(fill ="cornflowerblue", width =0.3,outlier.color ="cornflowerblue", outlier.size =3) +stat_summary(fun ="mean", geom ="point",shape =23, size =3, fill ="white") +labs(y ="", x ="sqrt(SDQ)")p1 + (p2 / p3 +plot_layout(heights =c(2,1)))+plot_annotation(title ="Evaluating the square root of SDQ score")```The graphs provided demonstrate that although some skewness is still visible in both the normal Q-Q plot and histogram, there has been an improvement. Furthermore, the violin plot demonstrates a reduction of the outliers.## Scatterplot Matrix and CollinearityWe will evaluate collinearity between predictors via scatter plot matrix and variance inflation factors.- Scatter plot Matrix```{r}#| warning: false#| message: falseggpairs(child22_sub, columns =c("age", "sex", "race", "health_status", "adult_edu", "sqrtsdq"), title ="Scatterplot Matrix")```At the bottom right corner, we observe the density function plot representing our outcome variable `sqrtsdq.` The correlation among the predictors is fairly modest, and the scatter plot matrix reveals no no major concerns regarding collinearity among predictors.- Checking Variance Inflation Factors```{r}mod_A <-lm(sqrtsdq ~ age + sex + race + health_status + adult_edu, data = child22_sub)car::vif(mod_A)```The predictors' generalized variance inflation factors (GVIF) demonstrate a small level of collinearity. We therefore have no serious concerns with respect to collinearity.## Model A### Fitting Model ANow we will build mod_A "main effect" which incorporates five predictors `age`, `sex`, `race`, `health_status`, `adult_edu` using `lm`.```{r}mod_A <-lm(sqrtsdq ~ age + sex + race + health_status + adult_edu, data = child22_sub)```Our next step will be to fit model A "main effect" with the `ols()` function from the `rms` package, which we will use later.```{r}dd <-datadist(child22_sub)options(datadist ="dd")mod_A_ols <-ols(sqrtsdq ~ age + sex + race + health_status + adult_edu,data = child22_sub, x =TRUE, y =TRUE)```### Tidied Coefficient Estimates (Model A)We will use the `tidy` function to obtain the estimates, standard error, lower and upper confidence interval and p-values.```{r}tidy(mod_A, conf.int =TRUE, conf.level =0.95) |>select(term, estimate, se = std.error, low95 = conf.low, high95 = conf.high, p = p.value) |>gt() |>fmt_number(decimals =3) |>tab_options(table.font.size =20)```### Summarizing Fit (Model A)In this section we will use `glance` function to obtain R-squared, Adj.r.squared, sigma, AIC and BIC, nobs, df, and df.residual.```{r}glance(mod_A) |>select(r2 = r.squared, adjr2 = adj.r.squared, sigma, AIC, BIC, nobs, df, df.residual) |>kable(digits =c(3, 3, 2, 1, 1, 0, 0, 0))```The R-squared value indicates that the model accounts for 12.4% of the variation in `sqrtsdq.` The adjusted R-squared is an index and it is is 0.11, and the model is based on 1173 observations, and has 13 df.### Regression Diagnostics (Model A)Here, the code will yield four residual plots```{r}#| fig-height: 8par(mfrow =c(2,2)); plot(mod_A); par(mfrow =c(2,2))```Given the outcome's discrete nature and the presence of four categorical predictors, we note certain patterns in our linearity assumption in the residual vs. fitted plot and the homoscedasticity assumption in the scale-location plots. However, the Q-Q plot of residuals reveals a few outliers at the lower end, albeit not influential as per the residuals vs. leverage plot (nothing passed the Cook's distance). In conclusion, while there are minor deviations, overall, our assumptions appear to be reasonably met.## Non-LinearityPresented below is the needed Spearman $\rho^2$ plot.```{r}plot(spearman2(sqrtsdq ~ age + sex + race + health_status + adult_edu,data = child22_sub))```Based to the Spearman $\rho^2$ plot, the most attractive candidate predictors for non-linear terms are `health_status` and `adult_edu`, respectively. We will include an interaction between `health_status` (2 df) and `adult_edu` (5 df), which is expected to contribute 10 degrees of freedom to our initial model.**Note**: this approach has received approval from Prof. Thomas Love on 2024-03-17.## Model BOur Model B will include one non-linear term the interaction between `health_status` and `adult_edu`, this should add about 10 degrees of freedom to our Model A.### Fitting Model BNow we will build augmented model B with interaction term which incorporates five predictors `age`, `sex`, `race`, `health_status`\*`adult_edu` using `lm`.```{r}mod_B <-lm(sqrtsdq ~ age + sex + race + health_status * adult_edu, data = child22_sub)```Our next step will be to fit model B with the `ols()` function from the `rms` package.```{r}dd <-datadist(child22_sub)options(datadist ="dd")mod_B_ols <-ols(sqrtsdq ~ age + sex + race + health_status * adult_edu, data = child22_sub, x =TRUE, y =TRUE)```### Tidied Coefficient Estimates (Model B)We will use the `tidy` function to obtain the estimates, standard error, lower and upper confidence interval and p-values.```{r}tidy(mod_B, conf.int =TRUE, conf.level =0.95) |>select(term, estimate, se = std.error, low95 = conf.low, high95 = conf.high, p = p.value) |>gt() |>fmt_number(decimals =3) |>tab_options(table.font.size =20)```### Effects Plot for Model BHere, we will present a plot of the effects from `plot(summary(modelname))`for this augmented model, using `ols````{r}plot(summary(mod_B_ols))```### Summarizing Fit (Model B)In this section we will use `glance` function to obtain R-squared, Adj.r.squared, sigma, AIC and BIC, nobs, df, and df.residual.```{r}glance(mod_B) |>select(r2 = r.squared, adjr2 = adj.r.squared, sigma, AIC, BIC, nobs, df, df.residual) |>kable(digits =c(3, 3, 2, 1, 1, 0, 0, 0))```In our Model B, we use `r glance(mod_B)$df` degrees of freedom, and obtain an $R^2$ value of `r round_half_up(glance(mod_B)$r.squared,3)`. The R-squared value indicates that the model accounts for 13 % of the variation in `sqrtsdq.` The adjusted R-squared is an index and it is is 0.11, and the model is based on 1173 observations, and has 23 df.### Regression Diagnostics (Model B)Here, the code will yield four residual plots```{r}#| fig-height: 8par(mfrow =c(2,2)); plot(mod_B); par(mfrow =c(1,1))```Considering the discrete nature of the outcome and the presence of four categorical predictors, we observed certain patterns in the residual vs. fitted plot and the scale-location plots. However, these patterns do not substantially affect the linearity and homoscedasticity assumptions. Additionally, the Q-Q plot of residuals shows a few outliers at both ends. Among these outliers, observations 627 and 859 exhibit high leverage, but they are not influential according to the residuals vs. leverage plot (none exceed the Cook's distance threshold). In summary, while there are minor deviations, our assumptions seem to be reasonably met overall.## Validating Models A and BWe will use the `validate()` function from the **rms** package to validate our `ols` fits that we previously created by re-sampling.```{r}set.seed(4321); (valA <-validate(mod_A_ols))set.seed(4322); (valB <-validate(mod_B_ols))```After performing the validation, the output provides the R-squared value, MSE (mean squared error), intercept, and slope, along with their respective indices: "index.orig," "training," "test," and "optimism," which represents the difference between the training and test samples. Additionally, it includes the "index.corrected," calculated as "index.orig" minus "optimism."### Validated $R^2$, and MSE as well as IC statistics {#sec-vallin}| Model | Validated $R^2$ | Validated MSE | AIC | BIC | df ||---:|:--:|:--:|:--:|:--:|:--:|| A | `r round_half_up(valA[1,5],3)` | `r round_half_up(valA[2,5],4)` | `r round_half_up(AIC(mod_A),1)` | `r round_half_up(BIC(mod_A),1)` | `r glance(mod_A)$df` || B | `r round_half_up(valB[1,5],3)` | `r round_half_up(valB[2,5],4)` | `r round_half_up(AIC(mod_B),1)` | `r round_half_up(BIC(mod_B),1)` | `r glance(mod_B)$df` |## Final Linear Regression ModelThe table above provides information about the validated R-squared, MSE, AIC, and BIC. These values were obtained from a validation comparison between the "augmented model" B and the "main effects" model A, predicting the transformed outcome (sqrtsdq). Validated R-squared represents the percentage of variation in the outcome, while MSE stands for mean squared error, and AIC and BIC are Akaike and Bayesian Information Criteria, respectively. A more favorable fit is indicated by a higher validated R-squared and lower values of MSE, AIC, and BIC. In this case, **model A "main effects"** or the simple model shows a higher validated R-squared value and lower AIC and BIC, suggesting the most desirable fit, despite Model B having a lower validated MSE.Additionally, it's worth noting that the inclusion of interaction terms in the augmented model yields no apparent improvement.### Winning Model's Parameter EstimatesHere, we will generate the content of mod_A_ols```{r}mod_A_ols```The validated $R^2$ value ( = `r round_half_up(valA[1,5],3)`) for our Model AThe output provides details about the formula used for the model, along with the number of observations. It includes the estimated residual standard deviation (sigma) and its corresponding degrees of freedom (df), as well as the model's likelihood ratio test and discrimination indexes. Additionally, it presents a statistical summary, including the minimum, first quartile (1Q), median, third quartile (3Q), and maximum values for "sqrtsdq." Finally, it lists the coefficient values, standard errors, t-test statistics, and p-values for all predictors.### Effects Plot for Winning ModelHere, we will present a plot of the effects from `plot(summary(modelname))`for this simple main effect model, using `ols````{r}plot(summary(mod_A_ols))```### Numerical Description of Effect SizesHere, we will obtain the effects summary of the `ols` model.```{r}summary(mod_A_ols) |>kable(digits =3)```This effects summary of model A illustrates the impact on `sqrtsdq` when moving from the 25th to the 75th percentile of each variable. The estimates are accompanied by their respective standard errors and 95% confidence intervals, while maintaining the other variables at a constant level.### Effect Size Description**health_status** : If two children share the same age, sex, race, and parental education, our model predicts that a child reporting "Not very good" overall health status will have a square root of SDQ score 0.741 higher than a child reporting "Excellent" overall health. This prediction comes with a meaningful 95% confidence interval of (0.548, 0.880).### Nomogram of Winning ModelHere, we will hypothetically generate predicted SDQ scores for two children, both female, one of White race and the other Hispanic. Both are 8 years old with reported "not very good" health statuses, and their parental education level is incomplete college education. Manually- Child one: age 8 = \~12.5 points, sex Female= 0 points, race White= \~ 47.5 points, health_status not very good = 100 points, adult_edu incomplete college= \~ 25, total points= \~ 185 and it is correspond to \~3.1 in the linear predictor scale.- Child two: age 8 = \~12.5 points, sex Female= 0 points, race Hispanic = \~ 21 points, health_status not very good = 100 points, adult_edu incomplete college= \~ 25, total points= \~ 185 and it is correspond to \~ 2.7 in the linear predictor scale.These predicted values are square root.```{r}#| fig-height: 8plot(nomogram(mod_A_ols, fun = exp))```### Prediction for a New SubjectWe will generate predicted SDQ scores for two children, both female, one of White race and the other Hispanic. Both are 8 years old with reported "not very good" health statuses, and their parental education level is incomplete college education. We will use the code below and then squaring the predictions.```{r}new_subjects <-data.frame(age =c(8,8), sex =c("Female", "Female"), race =c("White", "Hispanic"),health_status =c("Not very good", "Not very good"), adult_edu =c("incomplete_college", "incomplete_college"))preds1 <-predict.lm(mod_A, newdata = new_subjects, interval ="prediction", level =0.95)(preds1)^2```| Predictor Values | Predicted SDQ | 95% Prediction Interval ||:--:|:--:|:--:|| age = 8, sex = Female, race= White, health_status = Not very good, adult_edu = incomplete_college | 9.78 | (1.59, 24.91) || age = 8, sex = Female, race= Hispanic, health_status = Not very good, adult_edu = incomplete_college | 8.66 | (1.16, 23.14) |The code below will generate the predicted values in square root form, and they align with the values obtained from the nomogram plot.```{r}new_subjects <-data.frame(age =c(8,8), sex =c("Female", "Female"), race =c("White", "Hispanic"),health_status =c("Not very good", "Not very good"), adult_edu =c("incomplete_college", "incomplete_college"))preds2 <-predict.lm(mod_A, newdata = new_subjects, interval ="prediction", level =0.95)preds2```# Logistic Regression AnalysesWe will start by creating a data subset including the predictors for the logistic regression analysis:```{r}child22_log <- child22|>select(id, sex, age, race, health_status, adult_edu, violence)child22_log```Before proceeding any further, we will notice that our outcome `violence` is a factor. We will mutate it into a numeric variable with 0 and 1 levels.```{r}child22_log <- child22_log |>mutate(violence =as.numeric (violence))|>mutate(violence =2- violence)```## Predictor SelectionIn the logistic analysis plan, we decided to reduce the number of predictors from 5 variables to 4 variables. All of the predictors included in the model are child social factors including their age, sex, race, and health status. Parental educational level `adult_edu` is the only variable associated with parents. This variable also has the highest number of levels (6 educational levels) and therefore will use more degrees of freedom compared to the rest of the predictors. Based on these facts, we decided to remove this variable from the logistic model.The finalized logistic model data set will include be:```{r}child22_log <- child22_log|>select(-adult_edu)child22_log```## Dealing with missingnessWe will observe the number of subjects with missing observations and the proportion of missingness in our data set.```{r}miss_var_summary(child22_log)```There is a total of 23 subjects with missing observations and the overall proportion of missingness is 1.9%. In this case, imputing the missing observations is neither essential nor unnecessary, so we will proceed with the imputation process.We will use the single imputation method using `simpute` function:```{r}set.seed(432)child22_si <- child22_log |>data.frame() |>impute_rhd(violence ~ age + sex + race) |>impute_cart(health_status ~ age + sex + race) |>as_tibble()n_miss(child22_si) ```## Model Y: Main effects model### ModelY Results using `glm`Now we will fit the main effects model using `glm` function.```{r}mody_g <-glm(violence ~ sex + age + race + health_status,data = child22_si, family =binomial(link = logit))```Now we will produce a tidied table with the exponentiated estimates (odd ratios) with a 95% confidence interval.```{r}tidy(mody_g, exponentiate =TRUE, conf.int =TRUE, conf.level =0.95) |>select(term, estimate, std.error, low95 = conf.low, high95 = conf.high, p = p.value) |>gt() |>fmt_number(decimals =3) |>tab_options(table.font.size =20)```We will check for the presence of interaction between the predictors:```{r}rms::vif(mody_g)```As we can see, the inflation scores do not exceed the level of potential interaction. Therefore, we have no problems regarding collinearity between the predictors.### ModelY Results using `lrm`We will refit the same model using the `lrm` function.```{r}d <-datadist(child22_si)options(datadist ="d")mody_r <-lrm(violence ~ sex + age + race + health_status,data = child22_si, x =TRUE, y =TRUE)```Now we will produce an effects plot of the odds ratio scale for the model variables.```{r}plot(summary(mody_r))```We can also assess the relationship between each predictor and the outcome independently.```{r}ggplot(Predict(mody_r, fun = plogis))```### Key fit summary statistics```{r}mody_r```Nagelkerke value is 0.059, which indicates that our model does not present an improvement compared to a null model. The C statistic value is 0.674, which indicates that the predictions of our model are not very reliable.### Confusion matrixWe will produce the confusion matrix using a cutting point of 0.08.```{r}mody_g_aug <-augment(mody_g, type.predict ="response")cm <-confusionMatrix(data =factor(mody_g_aug$.fitted >=0.08),reference =factor(mody_g_aug$violence ==1),positive ="TRUE")cm```The sensitivity of our model is 60.4%, the specificity is 68.8%, and the positive predictive value is 13.8%.## Non-linearitySpearman $\rho^2$ plot```{r}plot(spearman2(violence ~ sex + age + race + health_status ,data = child22_si))```The Spearman $\rho^2$ plot is suggesting that `race` and `health_status` are the most suitable predictors for adding an interaction term. Since both variables are categorical, we will add an interaction term between `race` and `health_status`.## Model Z: incorporating non-linear termsNow we will fit another model with an interaction term between `race` and `health_status`.```{r}modz_r <-lrm(violence ~ sex + age + race + health_status + race %ia% health_status,data = child22_si, x =TRUE, y =TRUE)```Now we will confirm the number of additional degrees of freedom using anova:```{r}anova(modz_r)```The interaction resulted in 8 additional degrees of freedom.### Modelz Results using `glm`We can also use the glm function to produce tidied results of the interaction model.```{r}modz_g <-glm(violence ~ sex + age + race + health_status + race %ia% health_status,data =child22_si, family =binomial(link =logit))```Tidied coefficients:```{r}#| message: false#| warning: falsetidy(modz_g, exponentiate =TRUE, conf.int =TRUE, conf.level =0.95) |>select(term, estimate, std.error, low90 = conf.low, high90 = conf.high) |>gt() |>fmt_number(decimals =3) |>tab_options(table.font.size =14)```### Modelz Results using `lrm`We will observe the plot of the effects using the `lrm` function:```{r}plot(summary(modz_r))```We can also assess the relationship between each predictor and the outcome independently.```{r}ggplot(Predict(modz_r, fun = plogis))```### Key fit summary statistics```{r}modz_r```Negelkerke value is 0.075, which indicates that our model does not present an improvement compared to a null model. The C statistic value is 0.682, which indicates that the predictions of our model are not very reliable.### Confusion matrixWe will produce the confusion matrix using a cutting point of 0.08.```{r}modz_g_aug <-augment(modz_g, type.predict ="response")cm <-confusionMatrix(data =factor(modz_g_aug$.fitted >=0.08),reference =factor(modz_g_aug$violence ==1),positive ="TRUE")cm```The sensitivity of our model is 60.4%, the specificity is 68.8%, and the positive predictive value is 13.8%. These values match the main effects model `mody` values.## Validating the modelsWe will tidy the summaries of the main effects model `mody` and interaction model `modz` and compare them:```{r}temp1 <-bind_rows(glance(mody_g), glance(modz_g)) |>mutate(model =c("Y", "Z")) |>select(model, AIC, BIC) temp2 <-tibble(model =c("Y", "Z"),auc =c(mody_r$stats["C"], modz_r$stats["C"]),r2_nag =c(mody_r$stats["R2"], modz_r$stats["R2"]))left_join(temp1, temp2, by ="model") |>gt() |>fmt_number(columns = AIC:BIC, decimals =1) |>fmt_number(columns = auc:r2_nag, decimals =4) |>tab_options(table.font.size =20)```The interaction model `modz` has a slightly better performance compared to the main effects model `mody`.Now we will validate the models to determine our final model:```{r}set.seed(432)valY <-validate(mody_r, B =40)valZ <-validate(modz_r, B =40)val_1 <-bind_rows(valY[1,], valZ[1,]) |>mutate(model =c("Y", "Z"),AUC_nominal =0.5+ (index.orig/2), AUC_validated =0.5+ (index.corrected/2)) |>select(model, AUC_nominal, AUC_validated)val_2 <-bind_rows(valY[2,], valZ[2,]) |>mutate(model =c("Y", "Z"),R2_nominal = index.orig,R2_validated = index.corrected) |>select(model, R2_nominal, R2_validated)val <-left_join(val_1, val_2, by ="model") val |>gt() |>fmt_number(decimals =4) |>tab_options(table.font.size =20)```After validating the area under the curve value (AUC) and the R2 value, the main effects model `mody` displayed improved performance compared to the interaction model `modz`.## Final ModelAfter validating the models, the main effects model `mody` yielded better results. Both the Nagelkerke's R squared and ROC values were improved compared to the interaction `modz` model. The main effects model `mody` will be the model of choice in this case.The main effects model `mody` parameters and their coefficients with a 95% confidence interval is displayed in the following table:```{r}tidy(mody_g, exponentiate =TRUE, conf.int =TRUE, conf.level =0.95) |>select(term, estimate, std.error, low95 = conf.low, high95 = conf.high, p = p.value) |>gt() |>fmt_number(decimals =3) |>tab_options(table.font.size =20)```The social factor associated with the highest odds of violence is *African American* `race` (2.520) with a 95% CI (1.062, 6.082). The second highest social factor were *Other* `race` (2.116) with a 95% CI (0.857, 5.229) and `health status` *Not very good* (2.116) with a 95% CI (1.173, 3.717). The social factor associated with the lowest odds ratio of violence was the *Asian* `race` (0.425) with a 95% CI (0.064, 1.654 ). Among these factors, the only confidence intervals not containing 1 were **African American** `race` and **Not very good** `health status`, indicating that these factors produced meaningful differences in the odds ratio of violence.To observe the effect sizes we will display the effect plot:```{r}plot(summary(mody_r))```According to this model, an 11 year old child has 1.5 the odds of witnessing/experiencing violence compared to a 4 year old child, when all other variables are held constant. Female children have about 1.4 times the odds of witnessing/experiencing violence compared to their male counterparts, also holding all other variables constant. When observing race, African American children have about 2.4 the odds witnessing/experiencing violence compared to White children, followed by Other races (2.0). Hispanic have about the same odds of witnessing/experiencing violence (about 1.0) while Asian children have lower odds of witnessing/experiencing violence (about 0.4) compared to White children. Compared to children with excellent health status, those with very good health status have about 1.7 the odds of witnessing/experiencing violence, while those with not very good health status have about 2.0 the odds of witnessing/experiencing violence. Again, these assumption are made when all other variables are held constant.ROC Curve Plot```{r}#| message: false#| warning: falsepredict.prob1 <-predict(mody_g, type ="response")roc1 <- pROC::roc(mody_g$data$violence, predict.prob1)plot(roc1, main ="ROC Curve: Main effects model `mody`", lwd =2, col ="navy")legend('bottomright', legend =paste("AUC: ", round_half_up(auc(roc1),4)))```The AUC values displayed here is the *nominal* AUC.Validated Nagelkerke R-squared and the C statistic values```{r}set.seed(432)valY <-validate(mody_r, B =40)val_1 <-bind_rows(valY[1,]) |>mutate(model =c("Y"),AUC_nominal =0.5+ (index.orig/2), AUC_validated =0.5+ (index.corrected/2)) |>select(model, AUC_nominal, AUC_validated)val_2 <-bind_rows(valY[2,]) |>mutate(model =c("Y"),R2_nominal = index.orig,R2_validated = index.corrected) |>select(model, R2_nominal, R2_validated)val <-left_join(val_1, val_2, by ="model") val |>gt() |>fmt_number(decimals =4) |>tab_options(table.font.size =20)```The validated area under the curve (AUC) value is 0.638. This means that our model's predictions are slightly better than a random guess. The validated Nagelkerke's R squared value is 0.022, which means that our model does not show an improvement compared to a null model.Using the nomogram to make a prediction:```{r}plot(nomogram(mody_r, fun = plogis,fun.at =c(0.05, seq(0.1, 0.9, by =0.1), 0.95),funlabel ="Pr(violence)"))```We can use the nomogram to predict the odds ration of children by imputing the values of their social factors. We will hypothesize two subjects. The first will be a female African American child who is 5 years old and does not have a very good health status. The second subject will be male Asian child who is 13 years old and has a very good health status. By using this nomogram the first subject is predicted to have approximately 2.1 the odds of witnessing/experiencing violence while the second subject will have about 0.05 the odds. The first subject has remarkably higher probability of witnessing/experiencing violence, while the second subject has a lower probability.# DiscussionThe first research question was: Among children in the Midwest, Do specific social factors (age, sex, race, health status, and parental educational level) contribute to their mental health status score? Based on the final linear regression model, a couple of the social factors contributed meaningfully to the outcome, specifically, **race** (Asian) and **health status** (not very good). Having an Asian race contributed positively to the mental health score, while having a not very good health status contributed negatively.The second research question was: Among children in the Midwest, Do specific social factors (age, sex, race, health status, and parental educational level) contribute to the probability of witnessing or experiencing violence? Based on the final logistic regression model, a couple of the social factors contributed meaningfully to the outcome, specifically, **race** (African American) and **health status** (not very good). Children with these characteristics presented remarkably higher probabilities of witnessing or experiencing violence compared to children with different social characteristics.Project A was a challenging, yet achievable project. One of the prominent issues we faced was that the outcome of the linear regression model was discrete. There were a few difficulties during the analysis, and we wished we were more proficient in other suitable methods such as the Poisson or negative binomial regression. Another challenge was choosing an appropriate transformation for the outcome. We used the Box-Cox to guide us, however that was not useful. So, we opted for a different transformation until reaching a satisfying result. We realized that the Box-Cox can be misleading. We also wished we had collapsed some of the categorical levels as they had used more degrees of freedom in the model. However, this decision may compromise our ability to investigate meaningful differences between the categorical levels. The overall process of the project was not difficult as there was a roadmap guiding us (the instructions). However, making decisions on the order of the logistic models (the main effects model and interaction model using glm and lrm function) was confusing at times. We decided to organize the models in a way that facilitated comprehension for the audience. We learned that it is imperative to keep our study goal in mind during the analysis process.# AffirmationI am certain that it is completely appropriate for these data to be shared with anyone, without any conditions. There are no concerns about privacy or security.# References- Centers for Disease Control and Prevention. (2023, June 29). NHIS - 2022 NHIS. Centers for Disease Control and Prevention. <https://www.cdc.gov/nchs/nhis/2022nhis.htm>.# Session Information```{r}xfun::session_info()```