3  Preventive Dental Care and Hospitalization Trends Among Children

2022 National Survey of Children’s Health (NSCH) Data

Authors

Sarah Albalawi & Walaa Alshaia

Published

2025-10-28

3.1 R Packages and Setup

Code
knitr::opts_chunk$set(comment = NA)

library(bayestestR); library(bestglm)
library(car); library(caret); library(cutpointr)
library(Epi)
library(GGally); library(glmnet); library(gt)
library(insight)
library(janitor)
library(lme4)
library(MASS); library(mice); library(mosaic)
library(naniar); library(nnet)
library(patchwork); library(pROC); library(pscl)
library(quantreg)
library(ROCR); library(rstanarm)
library(survey); library(survival); library(survminer)
library(haven)
library(broom.mixed)
library(conflicted)
library(rms)
library(tidymodels)
library(tidyverse)

conflicts_prefer(dplyr::filter, dplyr::select, dplyr::summarize, dplyr::count, 
                 base::mean, base::sum, base::max,
                 car::vif, Matrix::update, rms::Predict)

options(dplyr.summarise.inform = FALSE)

theme_set(theme_bw())

4 Background

The study uses data from the 2022 National Survey of Children’s Health (NSCH), a survey-based data set whose subjects are children aged 0-17 years. The population includes children from all 50 states and the District of Columbia. The survey is a self-administered web-based and paper-based questionnaire completed by a parent or guardian, with one child from each household selected as the subject of the questionnaire.

This study focuses on children with hospitalization records. Medically compromised patients often face delays in receiving oral health care, which can negatively impact their oral and overall health. The aim of this study is to assess whether frequent hospital visits, a history of admission, and medication use affect a child's access to preventive dental care.

5 Research Questions

The overarching purpose of this study is to investigate the access and oral health status of children with hospitalization history. Medically compromised children often face barriers to preventive dental care, which negatively impacts their health outcomes.

Research question 1: What are the odds that children with hospitalization records receive preventive dental care when other variables (age) are taken into account?

Hospital admissions and ER visits were examined as potential barriers to preventive oral care access, since children with poor health status can face difficulties accessing preventive dental care.

Research question 2: What are the odds that children with hospitalization records experience chronic cavities when other variables (ER visits and medication) are taken into account?

Hospital admission, ER visits, and medication use were included to observe potential associations. Dental providers have sometimes been reluctant to treat children with poor health status, so children with more hospital visits may face dental care neglect and consequently have higher odds of experiencing cavities.

6 Data Ingest and Management

6.1 Loading the Raw Data

Code
nsch_2022e_topical <- read_dta("nsch_2022e_topical.dta")
dim(nsch_2022e_topical)
[1] 54103   490

6.1.1 Sampling the Data

We will select a random sample of 3,000 subjects from the 54,103 children in the original study population.

Code
set.seed(432)
nsch22_raw <- nsch_2022e_topical |>
  slice_sample(n = 3000)

6.2 Cleaning the Data

6.2.1 Selecting Variables We’ll Use

We select the variables we will use and set appropriate types for each.

Code
nsch22_raw <- nsch22_raw|>
  dplyr::select(hhid, sc_age_years, hospitaler, hospitalstay, sc_k2q10, dentistvisit, cavities)|>
  dplyr::mutate(across(where(is.numeric), as_factor))|>
  dplyr::mutate(hhid = as.character(hhid))|>
  dplyr::mutate(sc_age_years = as.numeric(sc_age_years))
nsch22_raw
# A tibble: 3,000 × 7
   hhid     sc_age_years hospitaler hospitalstay sc_k2q10 dentistvisit cavities
   <chr>           <dbl> <fct>      <fct>        <fct>    <fct>        <fct>   
 1 22017610            7 1          2            2        3            2       
 2 22075400           12 2          2            2        2            1       
 3 22103566           12 1          2            1        3            2       
 4 22317899           18 1          2            1        3            <NA>    
 5 22270255            4 1          2            2        2            2       
 6 22308731           14 1          2            2        3            2       
 7 22304506           17 1          2            2        3            2       
 8 22287109           14 1          2            2        2            2       
 9 22261632            2 1          2            2        <NA>         2       
10 22020482           12 1          2            1        1            2       
# ℹ 2,990 more rows

6.2.2 Changing Variable Names

Code
nsch22_raw <- nsch22_raw |>
  rename(age = sc_age_years,
         med_use = sc_k2q10,
         er_visits = hospitaler)
nsch22_raw
# A tibble: 3,000 × 7
   hhid       age er_visits hospitalstay med_use dentistvisit cavities
   <chr>    <dbl> <fct>     <fct>        <fct>   <fct>        <fct>   
 1 22017610     7 1         2            2       3            2       
 2 22075400    12 2         2            2       2            1       
 3 22103566    12 1         2            1       3            2       
 4 22317899    18 1         2            1       3            <NA>    
 5 22270255     4 1         2            2       2            2       
 6 22308731    14 1         2            2       3            2       
 7 22304506    17 1         2            2       3            2       
 8 22287109    14 1         2            2       2            2       
 9 22261632     2 1         2            2       <NA>         2       
10 22020482    12 1         2            1       1            2       
# ℹ 2,990 more rows

6.2.3 Working with Categorical Predictors

er_visits

Code
nsch22_raw |> tabyl (er_visits) 
 er_visits    n     percent valid_percent
         1 2511 0.837000000   0.839799331
         2  372 0.124000000   0.124414716
         3   92 0.030666667   0.030769231
         4   15 0.005000000   0.005016722
      <NA>   10 0.003333333            NA

The fourth category has only 15 observations, so we will collapse the variable into three categories instead of four.

Code
nsch22_raw <- nsch22_raw |>
  mutate(er_visits = fct_collapse(er_visits, "3" = c("3", "4")))
Code
tabyl(nsch22_raw$er_visits)
 nsch22_raw$er_visits    n     percent valid_percent
                    1 2511 0.837000000    0.83979933
                    2  372 0.124000000    0.12441472
                    3  107 0.035666667    0.03578595
                 <NA>   10 0.003333333            NA
Code
nsch22_raw <- nsch22_raw |>
  mutate(
    er_visits = fct_recode(er_visits,
      "None" = "1",
      "1 visit" = "2",
      "2 or more visits" = "3"))

tabyl(nsch22_raw$er_visits)
 nsch22_raw$er_visits    n     percent valid_percent
                 None 2511 0.837000000    0.83979933
              1 visit  372 0.124000000    0.12441472
     2 or more visits  107 0.035666667    0.03578595
                 <NA>   10 0.003333333            NA

hospitalstay

Code
nsch22_raw <- nsch22_raw |>
  mutate(hospitalstay = fct_recode(hospitalstay,
                                   "yes" = "1",
                                   "no" = "2"))
tabyl(nsch22_raw$hospitalstay)
 nsch22_raw$hospitalstay    n     percent valid_percent
                     yes   86 0.028666667    0.02880107
                      no 2900 0.966666667    0.97119893
                    <NA>   14 0.004666667            NA

med_use

Code
nsch22_raw <- nsch22_raw |>
  mutate(med_use = fct_recode(med_use,
                              "yes" = "1",
                              "no" = "2"))
tabyl(nsch22_raw$med_use)
 nsch22_raw$med_use    n      percent valid_percent
                yes  556 0.1853333333      0.185457
                 no 2442 0.8140000000      0.814543
               <NA>    2 0.0006666667            NA

6.2.4 Working with Categorical Outcomes

dentistvisit

Code
nsch22_raw <- nsch22_raw |>
  mutate(
    dentistvisit = fct_recode(dentistvisit,
      "No preventive visits" = "1",
      "1 visit" = "2",
      "2 or more visits" = "3"),
    dentistvisit = factor(dentistvisit,
      levels = c("No preventive visits", "1 visit", "2 or more visits"),
      ordered = TRUE))
is.ordered(nsch22_raw$dentistvisit)
[1] TRUE
Code
tabyl(nsch22_raw$dentistvisit)
 nsch22_raw$dentistvisit    n    percent valid_percent
    No preventive visits   86 0.02866667    0.03568465
                 1 visit  947 0.31566667    0.39294606
        2 or more visits 1377 0.45900000    0.57136929
                    <NA>  590 0.19666667            NA

cavities

Code
nsch22_raw <- nsch22_raw |>
  mutate(cavities = fct_recode(cavities,
                               "yes" = "1",
                               "no" = "2"))
tabyl(nsch22_raw$cavities)
 nsch22_raw$cavities    n     percent valid_percent
                 yes  282 0.094000000    0.09444072
                  no 2704 0.901333333    0.90555928
                <NA>   14 0.004666667            NA

6.2.5 Arranging the Tibble

We will rename the cleaned sample as nsch22.

Code
nsch22 <- nsch22_raw

dim(nsch22)
[1] 3000    7
Code
identical(nrow(nsch22), n_distinct(nsch22$hhid))
[1] TRUE

7 Code Book and Description

Variable | Role | Type | Description
hhid | ID | - | Unique household ID
dentistvisit | Outcome (multi-categorical regression) | 3 categories | Number of visits the child made during the past 12 months for preventive dental care: None, 1 visit, 2 or more visits
cavities | Outcome (logistic regression) | Binary | Whether the child experienced chronic cavities during the past 12 months (yes/no)
age | Input | Quantitative | Child's age in years
er_visits | Input | 3 categories | Number of child emergency room visits (None, 1 visit, 2 or more visits)
hospitalstay | Input | Binary | Whether the child was admitted to a hospital for at least one night during the past 12 months (yes/no)
med_use | Input | Binary | Whether the child needs medicine prescribed by a doctor (yes/no)

8 Analyses

8.1 Analysis 1

8.1.1 Research Question for Analysis 1

What are the odds that children with hospitalization records receive preventive dental care when other variables (age) are taken into account?

Children who have experienced hospitalization might have a higher likelihood of receiving preventive dental care, since frequent ER visits and hospitalization often indicate declining health status, which may prompt a more comprehensive approach to health care that encompasses preventive dental care. Furthermore, younger children may be more likely to receive preventive dental care due to parental involvement and regular check-ups, while older children might have more autonomy over their healthcare decisions and may be more proactive about preventive measures. A proper statistical analysis of the data is needed to test these expectations.

We plan to use an ordinal (proportional odds) logistic regression model to analyze the likelihood that children with a history of ER visits and hospital admission receive preventive dental care. The model will consider three predictors: age, history of hospitalization (hospitalstay), and the number of emergency room visits (er_visits). We will treat the outcome variable, dentistvisit, as an ordinal variable, acknowledging its natural order (No preventive visits, 1 visit, 2 or more visits).

8.1.2 Analysis 1 Outcome

Our outcome of interest is dentistvisit. It indicates how many times during the past 12 months the child saw a dentist or other oral health care provider for preventive dental care (No preventive visits, 1 visit, 2 or more visits).

Code
nsch22 |> filter(is.na(dentistvisit) == FALSE) |> nrow()
[1] 2410

There are 2410 rows with complete data on dentistvisit.

8.1.3 Analysis 1 Predictors

For Analysis 1, we will use three variables to predict dentistvisit:

  • age: child age in years
  • er_visits: the number of times the child visited a hospital emergency room in the past 12 months (None, 1 visit, 2 or more visits)
  • hospitalstay: Whether the child was admitted to the hospital to stay for at least one night during past 12 months (yes or no)

8.1.4 Missingness

First, we will generate a tibble that includes our variables of interest.

Code
nsch22_model_one <- nsch22  |>
 select(hhid, dentistvisit, age, er_visits, hospitalstay)

Then we’ll use miss_var_summary to see how many observations are missing for each variable.

Code
miss_var_summary(nsch22_model_one)
# A tibble: 5 × 3
  variable     n_miss pct_miss
  <chr>         <int>    <num>
1 dentistvisit    590   19.7  
2 hospitalstay     14    0.467
3 er_visits        10    0.333
4 hhid              0    0    
5 age               0    0    

We have 590 missing observations on the outcome dentistvisit.

Code
n_case_complete(nsch22_model_one); pct_complete_case(nsch22_model_one)
[1] 2401
[1] 80.03333
  • 2401 (80%) of the 3,000 subjects in nsch22_model_one have complete data on all variables.

  • The remaining 599 (20%) have missing data on the outcome or at least one predictor.

  • We will assume these data are missing at random (MAR) and proceed with single imputation.

  • We will filter by complete cases for the outcome.

Code
nsch22_model_one <- nsch22_model_one  |>
  filter(complete.cases(dentistvisit))

8.1.5 Single Imputation Approach

As we have missing values in our predictors hospitalstay and er_visits, we conduct single imputation under the assumption that missing values are missing at random (MAR).

Code
set.seed(432)

orderdata <- nsch22_model_one |> select(-hhid)

nsch22_order_mice1 <- mice(orderdata, m = 15, printFlag = FALSE)
Code
nsch22_order_si <- complete(nsch22_order_mice1) |> tibble()
Code
nsch22_order_si$hhid <- nsch22_model_one$hhid
dim(nsch22_order_si); n_miss(nsch22_order_si); nsch22_order_si 
[1] 2410    5
[1] 0
# A tibble: 2,410 × 5
   dentistvisit           age er_visits hospitalstay hhid    
   <ord>                <dbl> <fct>     <fct>        <chr>   
 1 2 or more visits         7 None      no           22017610
 2 1 visit                 12 1 visit   no           22075400
 3 2 or more visits        12 None      no           22103566
 4 2 or more visits        18 None      no           22317899
 5 1 visit                  4 None      no           22270255
 6 2 or more visits        14 None      no           22308731
 7 2 or more visits        17 None      no           22304506
 8 1 visit                 14 None      no           22287109
 9 No preventive visits    12 None      no           22020482
10 1 visit                  3 1 visit   no           22354728
# ℹ 2,400 more rows
  • The next step is to check the single imputation results against the observed distribution of the multi-categorical variable er_visits:
Code
nsch22_model_one |> tabyl (er_visits) |> adorn_pct_formatting()
        er_visits    n percent valid_percent
             None 2060   85.5%         85.6%
          1 visit  274   11.4%         11.4%
 2 or more visits   73    3.0%          3.0%
             <NA>    3    0.1%             -
Code
nsch22_order_si |> tabyl (er_visits) |> adorn_pct_formatting()
        er_visits    n percent
             None 2062   85.6%
          1 visit  274   11.4%
 2 or more visits   74    3.1%

The percentages for each level are essentially the same in both data sets, so the imputation looks reasonable.

Code
describe(nsch22_order_si)
nsch22_order_si 

 5  Variables      2410  Observations
--------------------------------------------------------------------------------
dentistvisit 
       n  missing distinct 
    2410        0        3 
                                                                         
Value      No preventive visits              1 visit     2 or more visits
Frequency                    86                  947                 1377
Proportion                0.036                0.393                0.571
--------------------------------------------------------------------------------
age 
       n  missing distinct     Info     Mean  pMedian      Gmd      .05 
    2410        0       18    0.996    10.61     10.5    5.556        3 
     .10      .25      .50      .75      .90      .95 
       4        6       11       15       17       18 
                                                                            
Value          1     2     3     4     5     6     7     8     9    10    11
Frequency      4    33   119   152   182   176   110   116   152   133   140
Proportion 0.002 0.014 0.049 0.063 0.076 0.073 0.046 0.048 0.063 0.055 0.058
                                                    
Value         12    13    14    15    16    17    18
Frequency    131   141   146   147   175   172   181
Proportion 0.054 0.059 0.061 0.061 0.073 0.071 0.075
--------------------------------------------------------------------------------
er_visits 
       n  missing distinct 
    2410        0        3 
                                                             
Value                  None          1 visit 2 or more visits
Frequency              2062              274               74
Proportion            0.856            0.114            0.031
--------------------------------------------------------------------------------
hospitalstay 
       n  missing distinct 
    2410        0        2 
                      
Value        yes    no
Frequency     59  2351
Proportion 0.024 0.976
--------------------------------------------------------------------------------
hhid 
       n  missing distinct 
    2410        0     2410 

lowest : 22000115 22000164 22000276 22000358 22000381
highest: 22356672 22356767 22356784 22356882 22357318
--------------------------------------------------------------------------------

Here’s a breakdown of the observations from the describe analysis:

Hospital ER Visits: Most children (2,062) had no hospital ER visits in the past year. A smaller portion had one visit (274), and even fewer had two or more visits (74).

Hospital Admissions: A large majority (2,351) of the children had not been admitted to the hospital overnight in the past year, while 59 had been admitted.

8.1.6 Checking the Outcome (order, levels, and distribution)

  • We will check if our outcome dentistvisit is an ordered factor.
Code
is.ordered(nsch22_order_si$dentistvisit)
[1] TRUE
  • We will check if the first level of dentistvisit is No preventive visits.
Code
nsch22_order_si |> tabyl(dentistvisit)
         dentistvisit    n    percent
 No preventive visits   86 0.03568465
              1 visit  947 0.39294606
     2 or more visits 1377 0.57136929
  • Bar Chart for dentistvisit

We will build a graphical summary of the distribution of our outcome dentistvisit.

Code
ggplot(nsch22_order_si, aes(x = dentistvisit, fill = dentistvisit)) + 
    geom_bar(aes(y = 
        (after_stat(count)/sum(after_stat(count))))) +
    geom_text(aes(y = 
        (after_stat(count))/sum(after_stat(count)), 
          label = scales::percent((after_stat(count)) / 
                            sum(after_stat(count)))),
              stat = "count", vjust = 1.5, 
              color = "black", size = 5) +
    scale_y_continuous(labels = scales::percent) +
    scale_fill_brewer(palette = "PuRd") +
    guides(fill = "none") + 
labs(title = "Preventive Dental Care",
         subtitle = "NSCH 22",
         x = "Dentist Visits",
         y = "Percentage",
         caption = "Number of children: 2410")

The graph reveals how preventive dental care is distributed among children in our NSCH 2022 sample. Notably, the majority of children received preventive dental care, with two or more visits being the most common frequency. While one visit was also relatively common, only about 4% of children in the sample had no preventive dental care in the past year.

Code
ggplot(nsch22_order_si, aes(x = dentistvisit, y = age, fill = dentistvisit)) + 
    geom_violin(trim = TRUE) +
    geom_boxplot(col = "white", width = 0.2) +
    scale_fill_brewer(palette = "PuRd") +
    guides(fill = "none")

8.1.7 Partitioning the Data

After dealing with missing values, we will split the data into training and testing samples using initial_split() from the rsample package, with 75% of subjects in the training sample and 25% in the testing sample.

Code
set.seed(4322024)

nsch22_o_splits <- 
    initial_split(nsch22_order_si, prop = 3/4)

nsch22_order_train <- training(nsch22_o_splits)
nsch22_order_test <- testing(nsch22_o_splits)

We want to look at the distribution of subjects in both splits.

Code
dim(nsch22_order_train); dim(nsch22_order_test)
[1] 1807    5
[1] 603   5

Also, we want to check the breakdown of the distribution of preventive dental visits across different categories for both the training and test data sets.

Code
nsch22_order_train |> tabyl(dentistvisit)
         dentistvisit    n    percent
 No preventive visits   66 0.03652463
              1 visit  719 0.39789707
     2 or more visits 1022 0.56557831
Code
nsch22_order_test |> tabyl(dentistvisit)
         dentistvisit   n   percent
 No preventive visits  20 0.0331675
              1 visit 228 0.3781095
     2 or more visits 355 0.5887231

8.1.8 Scatterplot Matrix and Collinearity

We will evaluate collinearity between predictors via a scatterplot matrix and variance inflation factors.

  • Scatterplot Matrix:
Code
GGally::ggpairs(nsch22_order_train |> select (age, er_visits, hospitalstay, dentistvisit))

The correlations among the predictors are fairly modest, and the scatterplot matrix reveals no major concerns regarding collinearity.

  • Checking Variance Inflation Factors:
Code
vif(polr(dentistvisit~ age+ er_visits+ hospitalstay, data= nsch22_order_train, Hess= TRUE))
                 GVIF Df GVIF^(1/(2*Df))
age          1.003689  1        1.001843
er_visits    1.099581  2        1.024016
hospitalstay 1.095761  1        1.046786

The predictors’ generalized variance inflation factors (GVIF) demonstrate a small level of collinearity. We therefore have no serious concerns with respect to collinearity.

8.1.9 Model Fitting

  • Fit a proportional odds logistic regression model

Our model, denoted m_polr, uses proportional odds logistic regression to predict dentistvisit from age, er_visits, and hospitalstay, fit with the polr function from the MASS package. We then summarize the model.

Based on prior analysis, we confirm that our outcome variable is an ordered factor.

Code
m_polr <- polr(dentistvisit~ age+ er_visits+ hospitalstay, data= nsch22_order_train, Hess= TRUE)
summary(m_polr)
Call:
polr(formula = dentistvisit ~ age + er_visits + hospitalstay, 
    data = nsch22_order_train, Hess = TRUE)

Coefficients:
                             Value Std. Error t value
age                        0.06507   0.009935  6.5489
er_visits1 visit          -0.21842   0.148048 -1.4753
er_visits2 or more visits -0.29149   0.292856 -0.9953
hospitalstayno            -0.72220   0.355314 -2.0326

Intercepts:
                             Value   Std. Error t value
No preventive visits|1 visit -3.3821  0.3896    -8.6811
1 visit|2 or more visits     -0.3268  0.3721    -0.8782

Residual Deviance: 2875.925 
AIC: 2887.925 

Our m_polr model includes two intercepts to accommodate the three levels of dentistvisit, along with four slopes corresponding to our three predictors (age, plus two indicators for er_visits and one for hospitalstay).

  • Summarizing the Fit
Code
# model including covariates
glance(m_polr) |> gt() |> 
  fmt_number(columns = logLik:deviance, decimals = 3) |>
  tab_options(table.font.size = 20)
edf logLik AIC BIC deviance df.residual nobs
6 −1,437.963 2,887.925 2,920.922 2,875.925 1801 1807

8.1.10 Interpreting the Model

We will exponentiate our model coefficients to facilitate interpretation.

Code
tidy(m_polr, exponentiate = TRUE, conf.int = TRUE) |>
filter(coef.type == "coefficient") |>
gt() |> fmt_number(estimate:conf.high, decimals = 3)
term estimate std.error statistic conf.low conf.high coef.type
age 1.067 0.010 6.549 1.047 1.088 coefficient
er_visits1 visit 0.804 0.148 −1.475 0.602 1.076 coefficient
er_visits2 or more visits 0.747 0.293 −0.995 0.421 1.334 coefficient
hospitalstayno 0.486 0.355 −2.033 0.234 0.954 coefficient

Coefficients Interpretation:

If we compare two children, A and B, with the same values of the other predictors:

  • If child A is one year older than child B, then the model estimates that child A has 1.067 times the odds of more preventive dental care visits (95% CI: 1.047, 1.088).

For hospital ER visits, the estimated odds ratios relative to no visits are approximately 0.80 for one visit and 0.75 for two or more visits. In other words, holding the other variables constant, children with one ER visit have about 0.80 times the odds, and children with two or more ER visits about 0.75 times the odds, of more preventive dental visits compared to children with no ER visits. The 95% CIs are (0.602, 1.076) and (0.421, 1.334), respectively; both include 1, so these effects are small and imprecisely estimated.

For children with no overnight hospital stay, the odds ratio of 0.486 (95% CI: 0.234, 0.954) indicates lower odds of more preventive dental care visits than for children with a hospital stay, holding all other variables constant. The interval excludes 1, but it is wide, so this estimate is imprecise.
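As a cross-check of the table above (a minimal sketch, assuming the m_polr fit from earlier in this analysis), exponentiating the raw polr slopes reproduces the odds ratios reported by tidy(), and the per-year age odds ratio can be compounded over larger age gaps.

Code
# Exponentiate the polr slopes to obtain odds ratios (coef() excludes the intercepts).
exp(coef(m_polr))
# expected: age ~ 1.067, er_visits1 visit ~ 0.804,
#           er_visits2 or more visits ~ 0.747, hospitalstayno ~ 0.486

# The per-year age odds ratio compounds over wider age differences,
# e.g. a 5-year age gap implies roughly:
1.067^5   # about 1.38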

8.1.11 Validating Models

  • Using lrm for Proportional Odds Logistic Regression

We will refit the model using the lrm function from the rms package.

Code
d <- datadist(nsch22_order_train)
options(datadist = "d")
m_lrm <- lrm(dentistvisit~ age+ er_visits+ hospitalstay, data= nsch22_order_train, x = T, y = T)

m_lrm
Logistic Regression Model

lrm(formula = dentistvisit ~ age + er_visits + hospitalstay, 
    data = nsch22_order_train, x = T, y = T)

                             Model Likelihood       Discrimination    Rank Discrim.    
                                   Ratio Test              Indexes          Indexes    
      Obs          1807    LR chi2      51.06       R2       0.035    C       0.590    
 No preventive visits66    d.f.             4      R2(4,1807)0.026    Dxy     0.180    
       1 visit      719    Pr(> chi2) <0.0001    R2(4,1366.2)0.034    gamma   0.188    
   2 or more visits1022                             Brier    0.238    tau-a   0.094    
      max |deriv| 2e-05                                                                

                           Coef    S.E.   Wald Z Pr(>|Z|)
y>=1 visit                  3.3821 0.3896  8.68  <0.0001 
y>=2 or more visits         0.3268 0.3721  0.88  0.3798  
age                         0.0651 0.0099  6.55  <0.0001 
er_visits=1 visit          -0.2184 0.1480 -1.48  0.1401  
er_visits=2 or more visits -0.2915 0.2929 -1.00  0.3196  
hospitalstay=no            -0.7222 0.3553 -2.03  0.0421  

The Nagelkerke R^2 value is low (0.035), and the C statistic of 0.590 suggests that the model has weak predictive ability.

We validate the summary statistics with validate.

Code
set.seed(4322023)
validate(m_lrm)
          index.orig training    test optimism index.corrected   Lower   Upper
Dxy           0.1800   0.1844  0.1777   0.0066          0.1734  0.1265  0.2255
R2            0.0347   0.0366  0.0328   0.0038          0.0310  0.0112  0.0474
Intercept     0.0000   0.0000  0.1398  -0.1398          0.1398 -1.0065  1.0738
Slope         1.0000   1.0000  0.9653   0.0347          0.9653  0.6833  1.3376
Emax          0.0000   0.0000  0.1018  -0.1018          0.1018 -0.0693  0.2659
D             0.0277   0.0293  0.0261   0.0032          0.0245  0.0082  0.0381
U            -0.0011  -0.0011 -1.2757   1.2746         -1.2757 -1.2827 -1.2703
Q             0.0288   0.0304  1.3018  -1.2714          1.3002  1.2799  1.3180
B             0.0353   0.0364  0.0353   0.0011          0.0343  0.0269  0.0426
g             0.3891   0.3948  0.3755   0.0193          0.3698  0.2625  0.4699
gp            0.0938   0.0947  0.0130   0.0817          0.0120 -0.0136  0.0356
           n
Dxy       40
R2        40
Intercept 40
Slope     40
Emax      40
D         40
U         40
Q         40
B         40
g         40
gp        40

We obtain a validated R^2 of 0.031 and a validated C statistic of 0.5 + (0.1734/2) ≈ 0.587, which suggests that the model has weak predictive ability.
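The bias-corrected C statistic can be recovered from the bias-corrected Somers' Dxy reported by validate(), since C = 0.5 + Dxy/2. A minimal arithmetic sketch using the index.corrected value shown above:

Code
# Convert the bias-corrected Dxy from validate(m_lrm) into a bias-corrected C statistic.
dxy_corrected <- 0.1734   # index.corrected value in the Dxy row above
0.5 + dxy_corrected / 2   # approximately 0.587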

8.1.12 Effects Plot

Before proceeding with any predictions, we present a plot illustrating the effect sizes in our model. The summary function applied to the lrm fit shows each predictor's effect by comparing its 75th percentile to its 25th percentile (for age, a change from 6 to 15 years).

Code
plot(summary(m_lrm))

Code
summary(m_lrm)
             Effects              Response : dentistvisit 

 Factor                            Low High Diff. Effect   S.E.     Lower 0.95
 age                               6   15    9     0.58561 0.089418  0.410350 
  Odds Ratio                       6   15    9     1.79610       NA  1.507300 
 er_visits - 1 visit:None          1    2   NA    -0.21842 0.148050 -0.508590 
  Odds Ratio                       1    2   NA     0.80379       NA  0.601340 
 er_visits - 2 or more visits:None 1    3   NA    -0.29147 0.292860 -0.865460 
  Odds Ratio                       1    3   NA     0.74716       NA  0.420860 
 hospitalstay - yes:no             2    1   NA     0.72222 0.355320  0.025817 
  Odds Ratio                       2    1   NA     2.05900       NA  1.026200 
 Upper 0.95
 0.760860  
 2.140100  
 0.071751  
 1.074400  
 0.282520  
 1.326500  
 1.418600  
 4.131500  

According to this model, when age increases from 6 to 15 years, the odds of more preventive dental visits are estimated to be about 1.8 times higher, with a 95% CI of (1.5, 2.1), holding all other variables constant. Equivalently, this 9-year increase in age is associated with an increase of about 0.59 in the log odds of dentistvisit (SE 0.09).

When the child's number of ER visits increases from none to one in the past 12 months, the estimated effect on the log odds of dentistvisit is -0.22 (SE 0.15). This corresponds to an odds ratio of 0.80, indicating that children with one ER visit have about 0.80 times the odds (roughly 20% lower) of more preventive dental care visits compared to children with no ER visits, with a 95% CI of (0.60, 1.07), holding all other variables constant. The impact is relatively minor, and the interval includes 1.

When the child's number of ER visits increases from none to two or more in the past 12 months, the estimated effect on the log odds of dentistvisit is -0.29 (SE 0.29), corresponding to an odds ratio of 0.75 and lower odds of more preventive dental visits, with a 95% CI of (0.42, 1.33), controlling for all other variables. This impact is also relatively small and imprecise.

Being admitted to the hospital for at least one night in the past 12 months is associated with an estimated effect on the log odds of dentistvisit of 0.72 (SE 0.36), or an estimated odds ratio of about 2.1, with a 95% confidence interval of (1.03, 4.13). The interval is wide and barely excludes 1, so this estimate is imprecise.
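As a quick arithmetic check of the age effect reported above, the 6-to-15 odds ratio can be recovered from the per-year slope for age (0.0651 on the log-odds scale). A minimal sketch:

Code
# Odds ratio implied by a 9-year increase in age (from 6 to 15).
exp(9 * 0.0651)   # approximately 1.80, matching the effect shown by summary(m_lrm)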

8.1.13 Nomogram of Model

Code
fun.1 <- function(x) 1 - plogis(x)
fun.3 <- function(x) 
    plogis(x - m_lrm$coef[1] + m_lrm$coef[2])

plot(nomogram(m_lrm,
    fun=list('Prob Y = 1 (Low)' = fun.1, 
             'Prob Y = 3 (High)' = fun.3)))

8.1.14 Testing the Proportional Odds Assumption

We will assess the proportional odds assumption by fitting the equivalent multinomial logit model with multinom. This model relaxes the proportional odds assumption, so comparing its fit to m_polr indicates whether that assumption is reasonable.

Code
m_multi <- multinom(dentistvisit~ age+ er_visits+ hospitalstay, data= nsch22_order_train)
# weights:  18 (10 variable)
initial  value 1985.192406 
iter  10 value 1549.164657
iter  20 value 1430.936986
final  value 1430.936934 
converged
Code
m_multi
Call:
multinom(formula = dentistvisit ~ age + er_visits + hospitalstay, 
    data = nsch22_order_train)

Coefficients:
                 (Intercept)         age er_visits1 visit
1 visit             2.691706 -0.05716066       0.01654727
2 or more visits    3.047038  0.01833096      -0.22362763
                 er_visits2 or more visits hospitalstayno
1 visit                          -1.073470      0.3349657
2 or more visits                 -1.107578     -0.4428863

Residual Deviance: 2861.874 
AIC: 2881.874 

We will compare the log likelihoods of the multinomial model and the proportional odds logistic regression model. The multinomial logit model includes 2 intercepts and 8 slopes (10 parameters), while the proportional odds model includes 2 intercepts and 4 slopes (6 parameters). Therefore, the difference between the two models is 4 parameters.

Code
LL_1 <- logLik(m_polr)
LL_1m <- logLik(m_multi)
(G <- -2 * (LL_1[1] - LL_1m[1]))
[1] 14.05155
Code
pchisq(G, 6, lower.tail = FALSE)
[1] 0.02906562

The p-value shown here was computed with 6 degrees of freedom; with the correct difference of 4 parameters it would be even smaller. Either way, the small p-value indicates that the more flexible multinomial logit model fits the training data somewhat better than the proportional odds model, so the proportional odds assumption is questionable.

As an additional graphical check of the proportional odds assumption, we examine binary score residuals from the lrm fit.

Code
par(mfrow = c(2,2))
resid(m_lrm, 'score.binary', pl=TRUE)

Code
par(mfrow= c(1,1))

8.1.15 AIC and BIC

Code
AIC(m_polr, m_multi)
        df      AIC
m_polr   6 2887.925
m_multi 10 2881.874
Code
BIC(m_polr,m_multi)
        df      BIC
m_polr   6 2920.922
m_multi 10 2936.868

m_multi demonstrates a slightly better (smaller) AIC. However, in terms of BIC, m_polr outperforms m_multi. Based on our initial assumption that the outcome is an ordered factor, we will proceed with predictions using the polr model.

Note:

We attempted to perform traditional validation summaries on a separate test set, but that approach proved unsuitable because our outcome variable dentistvisit is an ordered categorical factor rather than a numeric or binary outcome. Instead, we evaluate the model on the test sample by cross-tabulating predicted and observed categories in the next section.

8.1.16 Cross Tabulation Approach

  • Predicted vs. Observed: How well does the model fit?

Next, we will apply our proportional odds logistic regression model, m_polr, to predict the frequency of preventive dental visits for each of the 603 subjects in our test sample nsch22_order_test.

Code
addmargins(table(predict(m_polr, newdata = nsch22_order_test), nsch22_order_test$dentistvisit,
dnn = c("Predicted", "Observed Values")))
                      Observed Values
Predicted              No preventive visits 1 visit 2 or more visits Sum
  No preventive visits                    0       0                0   0
  1 visit                                 8      49               38  95
  2 or more visits                       12     179              317 508
  Sum                                    20     228              355 603

The polr model m_polr predicts the dentistvisit category correctly for (0 + 49 + 317) = 366 of the 603 children in the test sample, or about 60.7%.

Looking within each predicted category:

  • No children were predicted to fall in the No preventive visits category, so none of the 20 children who actually had no preventive visits were classified correctly.
  • Among children predicted to have one preventive dental care visit, 51.6% actually had one visit (49 of 95).
  • Among children predicted to have two or more preventive dental care visits, 62.4% actually had two or more (317 of 508).
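As a cross-check of the overall test-sample accuracy above (a minimal sketch, assuming the objects defined earlier in this analysis):

Code
# Overall accuracy in the test sample, computed directly from the predicted classes.
test_preds <- predict(m_polr, newdata = nsch22_order_test)
mean(test_preds == nsch22_order_test$dentistvisit)   # about 0.607 (366/603)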

Here, we apply the same m_polr fit to see what predictions it makes for the 1,807 subjects in the training sample nsch22_order_train.

Code
addmargins(table(predict(m_polr), nsch22_order_train$dentistvisit,
dnn = c("Predicted", "Observed Values")))
                      Observed Values
Predicted              No preventive visits 1 visit 2 or more visits  Sum
  No preventive visits                    0       0                0    0
  1 visit                                13     160              106  279
  2 or more visits                       53     559              916 1528
  Sum                                    66     719             1022 1807

The polr model m_polr predicts the dentistvisit category correctly for (0 + 160 + 916) = 1,076 of the 1,807 children in the training sample, or about 59.5%.

Looking within each predicted category:

  • No children were predicted to fall in the No preventive visits category, so none of the 66 children who actually had no preventive visits were classified correctly.
  • Among children predicted to have one preventive dental care visit, 57.3% actually had one visit (160 of 279).
  • Among children predicted to have two or more preventive dental care visits, about 60% actually had two or more (916 of 1,528).

8.1.17 Making Predictions

We will use our model m_polr to predict the frequency of preventive dental visits for two hypothetical children: - Child 1: 8 years old, 2 or more ER visits, and an overnight hospital stay. - Child 2: 15 years old, 2 or more ER visits, and an overnight hospital stay.

Code
temp.dat <- data.frame(name = c("1", "2"), age = c(8,15), er_visits = c("2 or more visits", "2 or more visits"), hospitalstay = c("yes", "yes"))

predict(m_polr, temp.dat, type = "p")
  No preventive visits   1 visit 2 or more visits
1           0.02630921 0.3382051        0.6354857
2           0.01684620 0.2498815        0.7332723

Based on the predictions from m_polr, both children are most likely to have two or more preventive dental visits, but Child 1 (8 years old) has a higher predicted probability of having only one preventive visit than Child 2, while Child 2 (15 years old) has a higher predicted probability of having two or more visits.

8.1.18 Multiple Imputation Approach

The code below uses a multiple imputation approach to account for the missing values in the nsch22_model_one data. It first generates multiple imputations with the aregImpute function and then fits a proportional odds model with lrm to the imputed data using the fit.mult.impute function, yielding the fit m_imp.

Code
set.seed(4322024)
d2 <- datadist(nsch22_model_one)
options(datadist = "d2")
imp_fit22 <- aregImpute(~ dentistvisit+age+ er_visits+ hospitalstay,
nk = c(0,3), tlinear = TRUE, data = nsch22_model_one,
B = 10, n.impute = 10, pr = FALSE)
m_imp <- fit.mult.impute(dentistvisit~age + er_visits+ hospitalstay,
fitter = lrm, xtrans = imp_fit22, pr = FALSE,
data = nsch22_model_one, fitargs=list(x=TRUE,y=TRUE))

m_imp
Logistic Regression Model

fit.mult.impute(formula = dentistvisit ~ age + er_visits + hospitalstay, 
    fitter = lrm, xtrans = imp_fit22, data = nsch22_model_one, 
    pr = FALSE, fitargs = list(x = TRUE, y = TRUE))

                             Model Likelihood       Discrimination    Rank Discrim.    
                                   Ratio Test              Indexes          Indexes    
      Obs          2410    LR chi2      65.23       R2       0.033    C       0.590    
 No preventive visits86    d.f.             4      R2(4,2410)0.025    Dxy     0.180    
       1 visit      947    Pr(> chi2) <0.0001    R2(4,1814.1)0.033    gamma   0.188    
   2 or more visits1377                             Brier    0.238    tau-a   0.093    
      max |deriv| 1e-05                                                                

                           Coef    S.E.   Wald Z Pr(>|Z|)
y>=1 visit                  3.1839 0.3328  9.57  <0.0001 
y>=2 or more visits         0.1296 0.3175  0.41  0.6830  
age                         0.0650 0.0086  7.53  <0.0001 
er_visits=1 visit          -0.1980 0.1293 -1.53  0.1258  
er_visits=2 or more visits -0.2857 0.2538 -1.13  0.2603  
hospitalstay=no            -0.5017 0.3018 -1.66  0.0964  

The Nagelkerke R^2 value (0.033) remains similar to that obtained from single imputation (0.035). Similarly, the C statistic of 0.590 matches the single-imputation result, again indicating weak predictive ability.

8.1.19 Comparing the Coefficients (Exponentiated)

Code
round_half_up(exp(m_lrm$coefficients),3)
                y>=1 visit        y>=2 or more visits 
                    29.434                      1.387 
                       age          er_visits=1 visit 
                     1.067                      0.804 
er_visits=2 or more visits            hospitalstay=no 
                     0.747                      0.486 
Code
round_half_up(exp(m_imp$coefficients),3)
                y>=1 visit        y>=2 or more visits 
                    24.141                      1.138 
                       age          er_visits=1 visit 
                     1.067                      0.820 
er_visits=2 or more visits            hospitalstay=no 
                     0.751                      0.606 

Comparing the lrm fits obtained from single versus multiple imputation shows that the odds ratios for age and for the "2 or more visits" level of er_visits are nearly identical, while the estimates for hospitalstay (0.486 vs. 0.606) and the intercepts differ more noticeably between the two imputation approaches.

8.2 Analysis 2

The outcome for this analysis is cavities. The predictors are er_visits, hospitalstay, and med_use.

We will start by creating a subset for the logistic analysis.

Code
nsch22_log <- nsch22|>
  select(hhid, er_visits, hospitalstay, med_use, cavities)

nsch22_log
# A tibble: 3,000 × 5
   hhid     er_visits hospitalstay med_use cavities
   <chr>    <fct>     <fct>        <fct>   <fct>   
 1 22017610 None      no           no      no      
 2 22075400 1 visit   no           no      yes     
 3 22103566 None      no           yes     no      
 4 22317899 None      no           yes     <NA>    
 5 22270255 None      no           no      no      
 6 22308731 None      no           no      no      
 7 22304506 None      no           no      no      
 8 22287109 None      no           no      no      
 9 22261632 None      no           no      no      
10 22020482 None      no           yes     no      
# ℹ 2,990 more rows

8.2.1 Prepare the binary outcome

We will observe the cavities outcome distribution in the study sample.

Code
nsch22_cc <- nsch22|>
  drop_na()
Code
ggplot(nsch22_cc, aes(x = cavities, fill = cavities)) + 
    geom_bar(aes(y = 
        (after_stat(count)/sum(after_stat(count))))) +
    geom_text(aes(y = 
        (after_stat(count))/sum(after_stat(count)), 
          label = scales::percent((after_stat(count)) / 
                            sum(after_stat(count)))),
              stat = "count", vjust = 1.5, 
              color = "black", size = 5) +
    scale_y_continuous(labels = scales::percent) +
    scale_fill_brewer(palette = "PuRd") +
    guides(fill = "none") + 
labs(title = "Chronic Cavities Experience",
         subtitle = "NSCH 22",
         x = "Chronic Cavities Experience",
         y = "Percentage",
         caption = "Number of children: 2391")

There are 2,391 subjects with complete cases out of the 3,000 subjects in the study sample. About 11% of children's guardians/parents reported that the child has experienced chronic cavities, while the remaining 89% did not.

8.2.2 Missingness

First we will observe the missingness in the data set:

Code
miss_var_summary(nsch22_log) |> gt()
variable n_miss pct_miss
hospitalstay 14 0.467
cavities 14 0.467
er_visits 10 0.333
med_use 2 0.0667
hhid 0 0

The proportion of incomplete cases in this sample is about 1.3%. The outcome cavities and the predictor hospitalstay have the highest proportion of missing values (0.47% each), followed by er_visits (0.33%) and med_use (0.07%). Overall, the amount of missingness is small, and we will assume the data are missing at random (MAR).

For this project, we will handle the missing values with multiple imputation using aregImpute.

Code
set.seed(432)
dd <- datadist(nsch22_raw)
options(datadist = "dd")

nsch22_imp <- 
    aregImpute(~ cavities + hospitalstay + er_visits + med_use, 
               nk = c(0, 3:5), tlinear = FALSE, data = nsch22_raw,
               B = 10, n.impute = 3, pr = FALSE)

8.2.3 Ordinary logistic model

First, we will fit an ordinary logistic model m2_ord.

Code
m2_ord <- 
    fit.mult.impute((cavities == "yes") ~  er_visits + hospitalstay + med_use,
                    fitter = lrm, xtrans = nsch22_imp, 
                    data = nsch22_raw, 
                    fitargs = list(x = TRUE, y = TRUE), pr = FALSE)

m2_ord
Logistic Regression Model

fit.mult.impute(formula = (cavities == "yes") ~ er_visits + hospitalstay + 
    med_use, fitter = lrm, xtrans = nsch22_imp, data = nsch22_raw, 
    pr = FALSE, fitargs = list(x = TRUE, y = TRUE))

                      Model Likelihood      Discrimination    Rank Discrim.    
                            Ratio Test             Indexes          Indexes    
Obs          3000    LR chi2     11.38      R2       0.008    C       0.548    
 FALSE       2718    d.f.            4     R2(4,3000)0.002    Dxy     0.097    
 TRUE         282    Pr(> chi2) 0.0227    R2(4,768.9)0.010    gamma   0.189    
max |deriv| 6e-05                           Brier    0.085    tau-a   0.016    

                           Coef    S.E.   Wald Z Pr(>|Z|)
Intercept                  -2.3388 0.4543 -5.15  <0.0001 
er_visits=1 visit          -0.0737 0.1976 -0.37  0.7091  
er_visits=2 or more visits -0.0377 0.3531 -0.11  0.9150  
hospitalstay=no             0.4844 0.4483  1.08  0.2799  
med_use=no                 -0.4938 0.1474 -3.35  0.0008  
Code
round_half_up(exp(m2_ord$coefficients),3)
                 Intercept          er_visits=1 visit 
                     0.096                      0.929 
er_visits=2 or more visits            hospitalstay=no 
                     0.963                      1.623 
                med_use=no 
                     0.610 

We will examine the effect sizes through this plot:

Code
plot(summary(m2_ord))

Our model estimates that children who have visited the hospital ER (1 visit or 2 or more visits) have slightly lower odds of experiencing chronic cavities (odds ratios of 0.93 and 0.96, respectively) compared to children with no ER visits. It also estimates that children admitted to stay at least one night in the hospital have lower odds of experiencing chronic cavities (about 0.62, the inverse of the 1.62 shown for hospitalstay = no) compared to children with no admission, while children who need prescribed medication have about 1.64 times the odds of experiencing chronic cavities (the inverse of the 0.61 shown for med_use = no). All of these estimates hold the other predictors constant.

All confidence intervals, with the exception of the one for med_use, cross the value of 1. This indicates that medication use is the only predictor clearly associated with the outcome and therefore the only one contributing appreciably to the model's predictive ability.

We will validate our model using the Bootstrap method.

Code
set.seed(432)
validate(m2_ord, B = 50)
          index.orig training    test optimism index.corrected   Lower  Upper
Dxy           0.0966   0.1091  0.0854   0.0238          0.0728  0.0247 0.1248
R2            0.0082   0.0120  0.0062   0.0058          0.0024 -0.0073 0.0124
Intercept     0.0000   0.0000 -0.5156   0.5156         -0.5156 -1.3299 1.1362
Slope         1.0000   1.0000  0.7703   0.2297          0.7703  0.3971 1.4355
Emax          0.0000   0.0000  0.2103  -0.2103          0.2103 -0.0414 0.5239
D             0.0035   0.0052  0.0025   0.0027          0.0008 -0.0038 0.0054
U            -0.0007  -0.0007  0.0003  -0.0010          0.0003 -0.0011 0.0032
Q             0.0041   0.0059  0.0022   0.0037          0.0004 -0.0072 0.0057
B             0.0848   0.0847  0.0850  -0.0003          0.0851  0.0752 0.0925
g             0.1655   0.2112  0.1541   0.0571          0.1084  0.0093 0.2117
gp            0.0153   0.0187  0.0136   0.0051          0.0102  0.0019 0.0199
           n
Dxy       50
R2        50
Intercept 50
Slope     50
Emax      50
D         50
U         50
Q         50
B         50
g         50
gp        50

The bias-corrected C statistic = 0.5 + (0.0728/2) ≈ 0.536. With such a low value, the model has poor predictive performance. The corrected Nagelkerke R^2 of about 0.002 further supports this conclusion.

We will check whether collinearity is a problem.

Code
rms::vif(m2_ord)
         er_visits=1 visit er_visits=2 or more visits 
                  1.025313                   1.083527 
           hospitalstay=no                 med_use=no 
                  1.092496                   1.024044 

The variance inflation values are not concerning for these predictors, so collinearity does not appear to be a problem for this model.

8.2.4 Bayesian logistic model

For the Bayesian method, the data source should be a data frame without missing values. With the proportion of missingness being small (about 1.3%), we can use the complete-case data set nsch22_cc.

We will fit the Bayesian model m2_bayes. Note that stan_glm is called below without a family argument, so it uses its default Gaussian family; the estimates are therefore on the probability scale rather than the log-odds scale:

Code
set.seed(432)

m2_bayes <- stan_glm((cavities == "yes") ~ er_visits + hospitalstay + med_use, 
               data = nsch22_cc, refresh = 0)


summary(m2_bayes)

Model Info:
 function:     stan_glm
 family:       gaussian [identity]
 formula:      (cavities == "yes") ~ er_visits + hospitalstay + med_use
 algorithm:    sampling
 sample:       4000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 2391
 predictors:   5

Estimates:
                            mean   sd   10%   50%   90%
(Intercept)                0.1    0.0  0.0   0.1   0.2 
er_visits1 visit           0.0    0.0  0.0   0.0   0.0 
er_visits2 or more visits  0.0    0.0 -0.1   0.0   0.0 
hospitalstayno             0.0    0.0  0.0   0.0   0.1 
med_useno                  0.0    0.0  0.0   0.0   0.0 
sigma                      0.3    0.0  0.3   0.3   0.3 

Fit Diagnostics:
           mean   sd   10%   50%   90%
mean_PPD 0.1    0.0  0.1   0.1   0.1  

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
                          mcse Rhat n_eff
(Intercept)               0.0  1.0  5256 
er_visits1 visit          0.0  1.0  6008 
er_visits2 or more visits 0.0  1.0  5206 
hospitalstayno            0.0  1.0  5508 
med_useno                 0.0  1.0  6127 
sigma                     0.0  1.0  5837 
mean_PPD                  0.0  1.0  5635 
log-posterior             0.0  1.0  2015 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
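As noted above, the default Gaussian family makes this fit effectively a Bayesian linear probability model. A hedged sketch of how a genuinely logistic Bayesian fit could be requested with the same rstanarm interface (not run in this report):

Code
# Sketch only: the same model with a binomial family, giving a Bayesian
# logistic regression rather than a linear probability model.
m2_bayes_logit <- stan_glm((cavities == "yes") ~ er_visits + hospitalstay + med_use,
                           family = binomial(link = "logit"),
                           data = nsch22_cc, refresh = 0)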

We will extract the posterior draws.

Code
posteriors <- get_parameters(m2_bayes)

head(posteriors)
  (Intercept) er_visits1 visit er_visits2 or more visits hospitalstayno
1  0.05580140     0.0020734743              -0.071229744     0.06069418
2  0.16153815     0.0245856682              -0.007515778    -0.02896179
3  0.04445948     0.0114038569              -0.031896033     0.07720803
4  0.16229485    -0.0004394557              -0.061206473    -0.01454756
5  0.01272410     0.0185294712               0.007591949     0.09752850
6  0.16632708     0.0052493745              -0.046523998    -0.02386816
    med_useno
1 -0.02149012
2 -0.02820832
3 -0.03507464
4 -0.03362259
5 -0.02284911
6 -0.02864820

A summary of the posterior distribution can be observed below.

Code
describe_posterior(posteriors, test = c("pd", "ROPE")) |> print_md(decimals = 3)
Summary of Posterior Distribution
Parameter Median 95% CI pd ROPE % in ROPE
(Intercept) 0.09 [ 0.00, 0.18] 97.65% [-0.10, 0.10] 57.18%
er_visits1 visit 0.01 [-0.03, 0.05] 70.50% [-0.10, 0.10] 100%
er_visits2 or more visits -0.02 [-0.10, 0.06] 70.12% [-0.10, 0.10] 100%
hospitalstayno 0.04 [-0.05, 0.13] 80.73% [-0.10, 0.10] 93.08%
med_useno -0.03 [-0.06, 0.00] 97.47% [-0.10, 0.10] 100%

A summary of the prior distributions can also be produced.

Code
describe_prior(m2_bayes) |> print_html(decimals = 3)
Parameter Prior_Distribution Prior_Location Prior_Scale
(Intercept) normal 0.11 0.77
er_visits1 visit normal 0.00 2.44
er_visits2 or more visits normal 0.00 4.50
hospitalstayno normal 0.00 5.18
med_useno normal 0.00 1.91

We can also visualize the coefficients and credible intervals.

Code
plot(m2_bayes, prob = 0.5, prob_outer = 0.95)

To compare the ordinary logistic model's results with the Bayesian model's, we produce a table of exponentiated coefficients, which also helps in interpreting them as odds ratios.

Code
broom.mixed::tidy(m2_bayes, exponentiate = TRUE, 
                  conf.int = TRUE, conf.level = 0.95) |> 
  gt() |> fmt_number(decimals = 3)
term estimate std.error conf.low conf.high
(Intercept) 1.096 0.051 1.001 1.196
er_visits1 visit 1.011 0.020 0.970 1.052
er_visits2 or more visits 0.979 0.040 0.905 1.058
hospitalstayno 1.039 0.048 0.953 1.135
med_useno 0.970 0.015 0.941 1.000
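
A side-by-side table can make this comparison more direct. A minimal sketch, assuming the ordinary logistic model fit earlier in this report was produced by glm() and stored as m2 (both the name and the fitting function are assumptions here):

Code
# Hypothetical comparison of exponentiated estimates and intervals;
# `m2` is an assumed name for the earlier glm() logistic fit
bind_rows(
  broom::tidy(m2, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) |>
    mutate(model = "glm"),
  broom.mixed::tidy(m2_bayes, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) |>
    mutate(model = "stan_glm")
) |>
  select(model, term, estimate, conf.low, conf.high) |>
  gt() |> fmt_number(decimals = 3)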

As with the ordinary logistic regression model, the Bayesian model predicts that children with a history of ER visits have essentially the same odds of experiencing chronic cavities as children with no ER visit history (1.01 and 0.98), and the same holds for an overnight hospital stay (1.04). Unlike the ordinary logistic regression model, medication use is also predicted to have essentially the same odds (0.97). All of the 95% intervals include the value of one, indicating that these variables have a weak association with the outcome (chronic cavities). Each of these comparisons holds the other variables constant.

Graphical Posterior Predictive Checks

Code
pp_check(m2_bayes, nreps = 10)

Code
pp_check(m2_bayes, plotfun = "stat_2d", stat = c("mean", "sd"))
Note: in most cases the default test statistic 'mean' is too weak to detect anything of interest.

9 Conclusions and Discussion

9.1 Analysis One

What are the odds of children with hospitalization records of receiving preventive dental care when other variables (age) are taken into account?

We hypothesized that children who experienced frequent ER visits and hospitalization might be more likely to receive dental care, and that younger children might be more likely to receive frequent dental care visits compared to older children.

The proportional odds logistic regression (POLR) model indicated that age plays a role in predicting dental care visits: the odds of a child receiving preventive dental care are estimated to increase by a factor of 1.067 for each additional year of age. The effect of ER visits was surprising. While we anticipated a positive association, the results showed slightly lower odds of dental visits among children with more ER visits. For children with a history of hospital admission, however, the model suggested a higher likelihood of dental care than for those without such a history. Still, the small effect sizes across all of these factors indicate limited individual impact.

Our POLR model predicting preventive dental visits in three levels (no preventive visits, 1 visit, 2 or more visits) from age, er_visits, and hospitalstay has weak predictive ability. The validated R² of 0.0310 and a validated C statistic of 0.5 + (0.1735 / 2) ≈ 0.587 indicate that the model does not fit the data well. It was, however, able to correctly classify dental-visit categories to some extent, particularly for children with one visit and with two or more visits (51.6% and 62%, respectively).
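
For reference, the validated C statistic quoted above uses the standard conversion C = 0.5 + Dxy/2 applied to the optimism-corrected Somers' Dxy. A minimal sketch of that calculation, assuming the ordinal model was fit with rms::lrm and stored as mod_polr (an assumed name):

Code
set.seed(432)
# Bootstrap validation of the ordinal fit; `mod_polr` is an assumed name
# for the rms::lrm fit of the dental-visit model described above
val <- validate(mod_polr, B = 40)
dxy_corrected <- val["Dxy", "index.corrected"]
0.5 + dxy_corrected / 2   # validated C statistic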

Using a larger sample and incorporating additional factors beyond those currently included would be important for improving the model's performance. Socioeconomic status and access to healthcare are candidate variables that could contribute to predicting preventive dental care among children in the US.

9.2 Analysis Two

What are the odds of children with hospitalization records of experiencing chronic cavities, when other variables (ER visits and medication) are taken into account?

In this study, we hypothesized that children with a hospitalization history have higher odds of experiencing chronic cavities, and we made the same assumption for children with a history of medication use. The ordinary logistic model estimated that children with a history of ER visits have essentially the same odds (0.929 and 0.963) of experiencing chronic cavities as children without such a history. As hypothesized, children who use medication have higher odds (1.61) of chronic cavities, while children admitted to a hospital for at least one night had lower odds (0.62). In the ordinary logistic regression model, medication use was the only variable whose confidence interval excluded 1, indicating a meaningful association with the outcome.

The Bayesian model produced similar results for the ER visit and hospital stay variables, but differed for medication use. It estimated that children with 1 ER visit or with 2 or more ER visits have essentially the same odds (1.01 and 0.98) as children with no history of ER visits, and the same was observed for overnight hospital stay (1.04) and medication use (0.97). All of the intervals contain the value of one, indicating that these variables have a weak association with the outcome (chronic cavities). As noted above, this fit used stan_glm()'s default gaussian family, which may partly explain why its estimates differ from those of the ordinary logistic model.

The Bayesian regression model helps us better understand these findings. The AUC of the fitted model was very low (0.56), and the Nagelkerke R² (0.002) likewise indicates weak predictive performance, so the predictions produced by the model are not reliable. The predictors have a weak association with the outcome and therefore cannot contribute meaningfully to the model's predictive ability.

Based on these findings, future studies of children with hospitalization records should use validated scores to characterize health status. Objective measures of oral health (such as the DMFT index) can provide more reliable results than subjective measures (parent-reported cavity experience).

10 References and Acknowledgments

10.1 References

The data set comes from the 2022 National Survey of Children's Health (NSCH). Data source: https://www.childhealthdata.org/learn-about-the-nsch/nsch-codebooks

10.2 Acknowledgments

First and foremost, we would like to express our deepest gratitude to Professor Love for his excellent teaching, unwavering support, and the abundance of resources he provided throughout the semester. His patience and guidance were invaluable. We also extend our heartfelt appreciation to Alex, Miza, Monica, Lindsey, and all the TAs for their tireless assistance and willingness to address all our concerns throughout the semester, especially during the project period.

11 Session Information

Code
xfun::session_info()
R version 4.5.1 (2025-06-13 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  abind_1.4-8             askpass_1.2.1           backports_1.5.0        
  base64enc_0.1-3         bayesplot_1.14.0        bayestestR_0.17.0      
  bestglm_0.37.3          BH_1.87.0.1             bigD_0.3.1             
  bit_4.6.0               bit64_4.6.0.1           bitops_1.0.9           
  blob_1.2.4              boot_1.3-31             broom_1.0.10           
  broom.mixed_0.2.9.6     bslib_0.9.0             cachem_1.1.0           
  callr_3.7.6             car_3.1-3               carData_3.0-5          
  caret_7.0-1             caTools_1.18.3          cellranger_1.1.0       
  checkmate_2.3.3         class_7.3-23            cli_3.6.5              
  clipr_0.8.0             clock_0.7.3             cluster_2.1.8.1        
  cmprsk_2.2-12           coda_0.19.4.1           codetools_0.2-20       
  colorspace_2.1-2        colourpicker_1.3.0      commonmark_2.0.0       
  compiler_4.5.1          conflicted_1.2.0        corrplot_0.95          
  cowplot_1.2.0           cpp11_0.5.2             crayon_1.5.3           
  crosstalk_1.2.2         curl_7.0.0              cutpointr_1.2.1        
  data.table_1.17.8       datasets_4.5.1          datawizard_1.3.0       
  DBI_1.2.3               dbplyr_2.5.1            Deriv_4.2.0            
  desc_1.4.3              diagram_1.6.5           dials_1.4.2            
  DiceDesign_1.10         digest_0.6.37           distributional_0.5.0   
  doBy_4.7.0              dplyr_1.1.4             DT_0.34.0              
  dtplyr_1.3.2            dygraphs_1.1.1.6        e1071_1.7.16           
  Epi_2.61                etm_1.1.2               evaluate_1.0.5         
  exactRankTests_0.8.35   farver_2.1.2            fastmap_1.2.0          
  fontawesome_0.5.3       fontBitstreamVera_0.1.1 fontLiberation_0.1.0   
  fontquiver_0.2.1        forcats_1.0.1           foreach_1.5.2          
  foreign_0.8-90          Formula_1.2-5           fs_1.6.6               
  furrr_0.3.1             future_1.67.0           future.apply_1.20.0    
  gargle_1.6.0            gdtools_0.4.4           generics_0.1.4         
  GGally_2.4.0            ggformula_1.0.0         ggiraph_0.9.2          
  ggplot2_4.0.0           ggpubr_0.6.2            ggrepel_0.9.6          
  ggridges_0.5.7          ggsci_4.1.0             ggsignif_0.6.4         
  ggstats_0.11.0          ggtext_0.1.2            glmnet_4.1-10          
  globals_0.18.0          glue_1.8.0              googledrive_2.1.2      
  googlesheets4_1.1.2     gower_1.0.2             GPfit_1.0-9            
  gplots_3.2.0            graphics_4.5.1          grDevices_4.5.1        
  grid_4.5.1              gridExtra_2.3           gridtext_0.1.5         
  grpreg_3.5.0            gt_1.1.0                gtable_0.3.6           
  gtools_3.9.5            hardhat_1.4.2           haven_2.5.5            
  highr_0.11              Hmisc_5.2-4             hms_1.1.4              
  htmlTable_2.4.3         htmltools_0.5.8.1       htmlwidgets_1.6.4      
  httpuv_1.6.16           httr_1.4.7              ids_1.0.1              
  igraph_2.2.0            infer_1.0.9             inline_0.3.21          
  insight_1.4.2           ipred_0.9-15            isoband_0.2.7          
  iterators_1.0.14        janitor_2.2.1           jomo_2.7-6             
  jpeg_0.1.11             jquerylib_0.1.4         jsonlite_2.0.0         
  juicyjuice_0.1.0        KernSmooth_2.23.26      km.ci_0.5-6            
  KMsurv_0.1-6            knitr_1.50              labeling_0.4.3         
  labelled_2.15.0         later_1.4.4             lattice_0.22-7         
  lava_1.8.1              lazyeval_0.2.2          leaps_3.2              
  lhs_1.2.0               lifecycle_1.0.4         listenv_0.9.1          
  litedown_0.7            lme4_1.1-37             loo_2.8.0              
  lubridate_1.9.4         magrittr_2.0.4          markdown_2.0           
  MASS_7.3-65             Matrix_1.7-3            MatrixModels_0.5-4     
  matrixStats_1.5.0       maxstat_0.7.26          memoise_2.0.1          
  methods_4.5.1           mgcv_1.9-3              mice_3.18.0            
  microbenchmark_1.5.0    mime_0.13               miniUI_0.1.2           
  minqa_1.2.8             mitml_0.4-5             mitools_2.4            
  modeldata_1.5.1         modelenv_0.2.0          ModelMetrics_1.2.2.2   
  modelr_0.1.11           mosaic_1.9.2            mosaicCore_0.9.5       
  mosaicData_0.20.4       multcomp_1.4-29         mvtnorm_1.3-3          
  naniar_1.1.0            nlme_3.1-168            nloptr_2.2.1           
  nnet_7.3-20             norm_1.0.11.1           numDeriv_2016.8-1.1    
  openssl_2.3.4           ordinal_2023.12.4.1     pan_1.9                
  parallel_4.5.1          parallelly_1.45.1       parsnip_1.3.3          
  patchwork_1.3.2         pbkrtest_0.5.5          pillar_1.11.1          
  pkgbuild_1.4.8          pkgconfig_2.0.3         pls_2.8-5              
  plyr_1.8.9              png_0.1.8               polspline_1.1.25       
  polynom_1.4.1           posterior_1.6.1         prettyunits_1.2.0      
  pROC_1.19.0.1           processx_3.8.6          prodlim_2025.04.28     
  progress_1.2.3          progressr_0.17.0        promises_1.3.3         
  proxy_0.4.27            ps_1.9.1                pscl_1.5.9             
  purrr_1.1.0             quantreg_6.1            QuickJSR_1.8.1         
  R6_2.6.1                ragg_1.5.0              rappdirs_0.3.3         
  rbibutils_2.3           RColorBrewer_1.1-3      Rcpp_1.1.0             
  RcppArmadillo_15.0.2.2  RcppEigen_0.3.4.0.2     RcppParallel_5.1.11-1  
  Rdpack_2.6.4            reactable_0.4.4         reactR_0.6.1           
  readr_2.1.5             readxl_1.4.5            recipes_1.3.1          
  reformulas_0.4.1        rematch_2.0.0           rematch2_2.1.2         
  reprex_2.1.1            reshape2_1.4.4          rlang_1.1.6            
  rmarkdown_2.30          rms_8.1-0               ROCR_1.0-11            
  rpart_4.1.24            rsample_1.3.1           rstan_2.32.7           
  rstanarm_2.32.2         rstantools_2.5.0        rstatix_0.7.3          
  rstudioapi_0.17.1       rvest_1.0.5             S7_0.2.0               
  sandwich_3.1-1          sass_0.4.10             scales_1.4.0           
  selectr_0.4.2           sfd_0.1.0               shape_1.4.6.1          
  shiny_1.11.1            shinyjs_2.1.0           shinystan_2.6.0        
  shinythemes_1.2.0       slider_0.3.2            snakecase_0.11.1       
  sourcetools_0.1.7.1     SparseM_1.84-2          sparsevctrs_0.3.4      
  splines_4.5.1           SQUAREM_2021.1          StanHeaders_2.32.10    
  stats_4.5.1             stats4_4.5.1            stringi_1.8.7          
  stringr_1.5.2           survey_4.4-8            survival_3.8-3         
  survminer_0.5.1         survMisc_0.5.6          sys_3.4.3              
  systemfonts_1.3.1       tailor_0.1.0            tensorA_0.36.2.1       
  textshaping_1.0.4       TH.data_1.1-4           threejs_0.3.4          
  tibble_3.3.0            tidymodels_1.4.1        tidyr_1.3.1            
  tidyselect_1.2.1        tidyverse_2.0.0         timechange_0.3.0       
  timeDate_4051.111       tinytex_0.57            tools_4.5.1            
  tune_2.0.1              tzdb_0.5.0              ucminf_1.2.2           
  UpSetR_1.4.0            utf8_1.2.6              utils_4.5.1            
  uuid_1.2.1              V8_8.0.1                vctrs_0.6.5            
  viridis_0.6.5           viridisLite_0.4.2       visdat_0.6.0           
  vroom_1.6.6             warp_0.2.1              withr_3.0.2            
  workflows_1.3.0         workflowsets_1.1.1      xfun_0.52              
  xml2_1.4.0              xtable_1.8-4            xts_0.14.1             
  yaml_2.3.10             yardstick_1.3.2         zoo_1.8-14