Output

Numerical results

The package provides the following outputs:

Data table of g-formula estimates: The result table of g-formula estimates is returned by the fit function, containing (1) the nonparametric estimates of the natural course risk/mean outcome, (2) the parametric g-formula estimates of the risk/mean outcome under each user-specified intervention, (3) the risk ratio between each intervention and the reference intervention (natural course by default, can be specified in the argument ‘‘ref_int’’), (4) the risk difference between each intervention and the reference intervention.
Simulated data table for interventions: The package gives the simulated data table in the simulation step under each specified intervention, which can be obtained by:
sim_data = g.summary_dict['sim_data']
To get the simulated data under a particular intervention:
sim_data = g.summary_dict['sim_data'][intervention_name]
The IP weights: To get the inverse probability weights when there is a censoring event:
ip_weights = g.summary_dict['IP_weights']
The model summary: The package gives the model summary for each covariate, the outcome, the competing event (if applicable), and the censoring event (if applicable). The argument ‘‘model_fits’’ must first be set to True; the model summary can then be obtained by:
fitted_models = g.summary_dict['model_fits_summary']
To get the fitted model for a particular variable:
fitted_model = g.summary_dict['model_fits_summary'][variable_name]
The coefficients: The package gives the parameter estimates of all the models, which can be obtained by:
model_coeffs = g.summary_dict['model_coeffs']
To get the coefficients of the model for a particular variable, please use:
model_coeffs = g.summary_dict['model_coeffs'][variable_name]
The standard errors: The package gives the standard errors of the parameter estimates of all the models, which can be obtained by:
model_stderrs = g.summary_dict['model_stderrs']
To get the standard errors of the model for a particular variable, please use:
model_stderrs = g.summary_dict['model_stderrs'][variable_name]
The variance-covariance matrices: The package gives the variance-covariance matrices of the parameter estimates of all the models, which can be obtained by:
model_vcovs = g.summary_dict['model_vcovs']
To get the variance-covariance matrix of the parameter estimates of the model for a particular variable, please use:
model_vcovs = g.summary_dict['model_vcovs'][variable_name]
The root mean square error: The package gives the RMSE values of the models, which can be obtained by:
rmses = g.summary_dict['rmses']
To get the RMSE of the model for a particular variable, please use:
rmses = g.summary_dict['rmses'][variable_name]
Nonparametric estimates at each time point: The package gives the nonparametric estimates of all covariates and risk at each time point for survival outcomes, which can be obtained by:
obs_estimates = g.summary_dict['obs_plot']
To get the nonparametric estimates of a particular variable, e.g., risk, please use:
obs_estimates = g.summary_dict['obs_plot']['risk']
Parametric estimates at each time point: The package gives the parametric estimates of all covariates and risk at each time point for survival outcomes, which can be obtained by:
est_estimates = g.summary_dict['est_plot']
To get the parametric estimates of a particular variable, e.g., risk, please use:
est_estimates = g.summary_dict['est_plot']['risk']
Hazard ratio: The package gives the hazard ratio between the two specified interventions, which can be obtained by:
hazard_ratio = g.summary_dict['hazard_ratio']
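
Items (3) and (4) of the result table are simple transformations of the risks in items (1) and (2). The following is a minimal sketch of that arithmetic, not the package's implementation. The function name compare_to_reference and the risk values are hypothetical, and index 0 is assumed to stand for the natural course, following the description of ‘‘ref_int’’.

```python
def compare_to_reference(risks, ref_int=0):
    """Return (risk_ratios, risk_differences) relative to risks[ref_int].

    risks[0] is assumed to be the natural-course risk; later entries are the
    user-specified interventions in order (illustrative layout only).
    """
    ref = risks[ref_int]
    ratios = [r / ref for r in risks]
    diffs = [r - ref for r in risks]
    return ratios, diffs


# Hypothetical end-of-follow-up risks: natural course, intervention 1, intervention 2
risks = [0.20, 0.10, 0.25]
ratios, diffs = compare_to_reference(risks, ref_int=0)
```

The reference intervention always has a risk ratio of 1 and a risk difference of 0, which is a quick sanity check when reading the result table.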

The package also implements nonparametric bootstrapping to obtain 95% confidence intervals for risk/mean estimates by repeating the algorithm on many bootstrap samples. Users can set the argument ‘‘nsamples’’ to specify the number of bootstrap samples to generate. Users may set the argument ‘‘parallel’’ to parallelize the bootstrapping and simulation steps under each intervention so that the algorithm runs faster. The argument ‘‘ncores’’ specifies the number of CPU cores to use in parallelization.
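
For longitudinal data, a nonparametric bootstrap typically resamples whole subjects rather than individual rows. The sketch below shows one such resampling step using only the standard library. It assumes records keyed by a hypothetical id field; the helper bootstrap_sample is illustrative, and the package's internal resampling may differ.

```python
import random
from collections import defaultdict


def bootstrap_sample(records, id_key="id", seed=1234):
    """Draw one bootstrap sample by resampling subjects with replacement.

    All rows of a drawn subject are kept together, and each draw receives a
    new id so that a subject drawn twice counts as two distinct subjects.
    """
    rows_by_id = defaultdict(list)
    for row in records:
        rows_by_id[row[id_key]].append(row)
    ids = list(rows_by_id)
    rng = random.Random(seed)
    drawn = rng.choices(ids, k=len(ids))  # sampling with replacement
    sample = []
    for new_id, old_id in enumerate(drawn):
        for row in rows_by_id[old_id]:
            sample.append({**row, id_key: new_id})
    return sample


# Toy data: 3 subjects with 2 time points each (hypothetical column names)
records = [{"id": i, "t0": t, "Y": 0} for i in range(3) for t in range(2)]
boot = bootstrap_sample(records)
```

Repeating this step once per bootstrap sample and rerunning the estimation on each sample yields the bootstrap estimates from which the confidence intervals are computed.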

The package provides two methods for calculating the confidence intervals, selected by the argument ‘‘ci_method’’. “percentile” uses the percentile bootstrap method, which takes the 2.5th and 97.5th percentiles of the bootstrap estimates as the 95% confidence interval. “normal” uses the normal bootstrap method, which combines the original estimate with the standard deviation of the bootstrap estimates to form a normal-approximation 95% confidence interval.
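
The two interval types can be sketched with the standard library as follows. This is an illustration of the formulas, not the package's code: the function names are hypothetical, the bootstrap estimates are made up, and the use of the sample standard deviation (n - 1 denominator) and linear interpolation between order statistics are assumptions.

```python
from statistics import NormalDist, quantiles, stdev


def percentile_ci(boot_estimates):
    """95% percentile interval: 2.5th and 97.5th percentiles of the bootstrap estimates."""
    # n=40 gives cut points every 2.5%; the first and last are the ones needed.
    cuts = quantiles(boot_estimates, n=40, method="inclusive")
    return cuts[0], cuts[-1]


def normal_ci(estimate, boot_estimates):
    """95% normal-approximation interval: estimate +/- z * SD of the bootstrap estimates."""
    z = NormalDist().inv_cdf(0.975)  # about 1.96
    sd = stdev(boot_estimates)
    return estimate - z * sd, estimate + z * sd


# Hypothetical bootstrap estimates of a risk
boot_estimates = [i / 100 for i in range(101)]
lo_p, hi_p = percentile_ci(boot_estimates)
lo_n, hi_n = normal_ci(0.5, boot_estimates)
```

The percentile interval depends only on the bootstrap estimates, whereas the normal interval is always centered on the original estimate.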

The g-formula estimates of bootstrap samples: The package gives the parametric g-formula estimates of all bootstrap samples, which can be obtained by:
g = ParametricGformula(..., nsamples = 20, parallel=True, ncores=10, ci_method = 'percentile', ...)
g.fit()
bootests = g.summary_dict['bootests']
To get the parametric g-formula estimates of a particular bootstrap sample, please use:
g.summary_dict['bootests']['sample_{id}_estimates']
where id is the sample id which should be an integer between 0 and ‘‘nsamples’’ - 1.

The coefficients of bootstrap samples: The package gives the parameter estimates of all the models for all generated bootstrap samples, which can be obtained by:

g = ParametricGformula(..., nsamples = 20, parallel=True, ncores=10, ci_method = 'percentile', boot_diag=True, ...)
g.fit()
bootcoeffs = g.summary_dict['bootcoeffs']

Note that ‘‘boot_diag’’ must be set to True if users want to obtain the coefficients, standard errors, or variance-covariance matrices of bootstrap samples.

To get the coefficients of a particular bootstrap sample, please use:
g.summary_dict['bootcoeffs']['sample_{id}_coeffs']

The standard errors of bootstrap samples: The package gives the standard errors of the parameter estimates of all the models for all generated bootstrap samples, which can be obtained by:
g = ParametricGformula(..., nsamples = 20, parallel=True, ncores=10, ci_method = 'percentile', boot_diag=True, ...)
g.fit()
bootstderrs = g.summary_dict['bootstderrs']
To get the standard errors of a particular bootstrap sample, please use:
g.summary_dict['bootstderrs']['sample_{id}_stderrs']
The variance-covariance matrices of bootstrap samples: The package gives the variance-covariance matrices of the parameter estimates of all the models for all generated bootstrap samples, which can be obtained by:
g = ParametricGformula(..., nsamples = 20, parallel=True, ncores=10, ci_method = 'percentile', boot_diag=True, ...)
g.fit()
bootvcovs = g.summary_dict['bootvcovs']
To get the variance-covariance matrices of a particular bootstrap sample, please use:
g.summary_dict['bootvcovs']['sample_{id}_vcovs']

Note that to get bootstrap results of coefficients, standard errors, and variance-covariance matrices, the argument ‘‘boot_diag’’ must be set to True.
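
The per-sample keys all follow the pattern sample_{id}_<suffix>. A small helper can build them and guard against an out-of-range sample id. This is a sketch only: sample_key and SUFFIXES are hypothetical names, and the dictionary below is a hand-written stand-in for g.summary_dict with placeholder values.

```python
# Suffix used in the per-sample key of each bootstrap table.
SUFFIXES = {
    "bootests": "estimates",
    "bootcoeffs": "coeffs",
    "bootstderrs": "stderrs",
    "bootvcovs": "vcovs",
}


def sample_key(table, sample_id, nsamples):
    """Build the per-sample key, e.g. 'sample_3_coeffs' for table 'bootcoeffs'."""
    if not 0 <= sample_id < nsamples:
        raise ValueError(f"sample_id must be between 0 and {nsamples - 1}")
    return f"sample_{sample_id}_{SUFFIXES[table]}"


# Stand-in for g.summary_dict with nsamples = 2 (values are placeholders)
summary_dict = {
    "bootcoeffs": {
        "sample_0_coeffs": {"L1": [0.1]},
        "sample_1_coeffs": {"L1": [0.2]},
    }
}
key = sample_key("bootcoeffs", 1, nsamples=2)
coeffs = summary_dict["bootcoeffs"][key]
```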

All the output results above can be saved with the argument ‘‘save_results’’. Once it is set to True, the results are saved locally in an automatically created folder. Users can also specify the folder path with the argument ‘‘save_path’’:

g = ParametricGformula(..., save_results = True, save_path = 'user-specified path', ...)
g.fit()

Arguments:

Arguments	Description
n_simul	(Optional) An integer indicating the number of subjects for whom to simulate data. It is set equal to the number (M) of subjects in obs_data, if not specified by users.
ref_int	(Optional) An integer indicating the intervention to be used as the reference for calculating the end-of-follow-up mean/risk ratio and mean/risk difference. 0 denotes the natural course, while subsequent integers denote user-specified interventions in the order that they are named in interventions. It is set to 0 if not specified by users.
nsamples	(Optional) An integer specifying the number of bootstrap samples to generate.
parallel	(Optional) A boolean value indicating whether to parallelize simulations of different interventions to multiple cores.
ncores	(Optional) An integer indicating the number of cores used in parallelization. It is set to 1 if not specified by users.
model_fits	(Optional) A boolean value indicating whether to return the parameter estimates of the models.
ci_method	(Optional) A string specifying the method for calculating the bootstrap 95% confidence intervals, if applicable. The options are “percentile” and “normal”. It is set to “percentile” if not specified by users.
boot_diag	(Optional) A boolean value indicating whether to return the parametric g-formula estimates as well as the coefficients, standard errors, and variance-covariance matrices of the parameters of the fitted models in the bootstrap samples.
save_results	(Optional) A boolean value indicating whether to save all the returned results to the save_path.
save_path	(Optional) A path to save all the returned results. A folder will be created automatically in the current working directory if the save_path is not specified by users.
seed	(Optional) An integer indicating the starting seed for simulations and bootstrapping. It is set to 1234 if not specified by users.

Graphical results

The package also provides two plotting functions: “plot_natural_course” and “plot_interventions”. The plot_natural_course function plots the curves of each covariate mean (for all types of outcomes) and risk (for survival outcomes only) under g-formula parametric and non-parametric estimation.

pygformula.plot.plot_natural_course(time_points, covnames, covtypes, time_name, obs_data, obs_means, est_means, censor, outcome_type, plot_name, marker, markersize, linewidth, colors, save_path, save_figure, boot_table)

This is an internal function that plots the results comparison of covariate means and risks between non-parametric estimates and g-formula parametric estimates.

Parameters:

time_points (Int) – An integer indicating the number of time points to simulate. It is set equal to the maximum number of records (K) that obs_data contains for any individual plus 1, if not specified by users.
covnames (List) – A list of strings specifying the names of the time-varying covariates in obs_data.
covtypes (List) – A list of strings specifying the “type” of each time-varying covariate included in covnames. The supported types: “binary”, “normal”, “categorical”, “bounded normal”, “zero-inflated normal”, “truncated normal”, “absorbing”, “categorical time”, “square time” and “custom”. The list must be the same length as covnames and in the same order.
time_name (Str) – A string specifying the name of the time variable in obs_data.
obs_data (DataFrame) – A data frame containing the observed data.
obs_means (Dict) – A dictionary, where the key is the covariate / risk name and the value is its observational mean at all the time points.
est_means (Dict) – A dictionary, where the key is the covariate / risk name and the value is its parametric mean at all the time points.
censor (Bool) – A boolean value indicating whether there is a censoring event.
outcome_type (Str) – A string specifying the “type” of outcome. The possible “types” are: “survival”, “continuous_eof”, and “binary_eof”.
plot_name (Str) – A string specifying the name for plotting, which is set to “all”, “risk” or one specific covariate name.
marker (Str) – A string used to customize the appearance of points in plotting.
markersize (Int) – An integer specifying the size of the markers in plotting.
linewidth (Float) – A number that specifies the width of the line in plotting.
colors (List) – A list that contains two strings, the first specifies the color for plotting nonparametric estimates, the second specifies the color for plotting the parametric estimates.
save_path (Path) – A path to save all the figure results. A folder will be created automatically in the current working directory if the save_path is not specified by users.
save_figure (Bool) – A boolean value indicating whether to save the figure or not.
boot_table (DataFrame) – A DataFrame with nonparametric risk and parametric risks of all interventions.

Return type:

Nothing is returned; the figure is shown.

The plot_interventions function plots the curves of risk under interventions of interest (for survival outcomes only).

pygformula.plot.plot_interventions(time_points, time_name, risk_results, int_descript, outcome_type, colors, marker, markersize, linewidth, save_path, save_figure, boot_table)

An internal function to plot the risk results comparison of all interventions and the natural course.

Parameters:

time_points (Int) – An integer indicating the number of time points to simulate. It is set equal to the maximum number of records (K) that obs_data contains for any individual plus 1, if not specified by users.
time_name (Str) – A string specifying the name of the time variable in obs_data.
risk_results (List) – A list that contains the risk estimates at all the time points of all interventions.
int_descript (List) – A list of strings, each describing a user-specified intervention.
outcome_type (Str) – A string specifying the “type” of outcome. The possible “types” are: “survival”, “continuous_eof”, and “binary_eof”.
colors (List) – A list that contains strings, each of which specifies the color for plotting the risk curve of the intervention.
marker (Str) – A string used to customize the appearance of points in plotting.
markersize (Int) – An integer specifying the size of the markers in plotting.
linewidth (Float) – A number that specifies the width of the line in plotting.
save_path (Path) – A path to save all the figure results. A folder will be created automatically in the current working directory if the save_path is not specified by users.
save_figure (Bool) – A boolean value indicating whether to save the figure or not.
boot_table (DataFrame) – A DataFrame with nonparametric risk and parametric risks of all interventions.

Return type:

Nothing is returned; the figure is shown.

Arguments for plotting:

Arguments	Description
plot_name	A string specifying the name for plotting, which is set to “all”, “risk” or one specific covariate name. Only applicable for the plot_natural_course function. The default is “all”.
colors	For the plot_natural_course function, a list with two elements specifying the colors of the non-parametric estimate curve and the parametric estimate curve, respectively. For the plot_interventions function, a list with m elements, where m is the number of interventions plus 1, specifying the colors of all intervention curves. Users can choose colors from matplotlib colors. If not specified, the function will use default colors.
marker	A string used to customize the appearance of points in plotting. Users can also choose markers from matplotlib markers library.
markersize	An integer specifying the size of the markers in plotting.
linewidth	A number that specifies the width of the line in plotting.
save_figure	A boolean value indicating whether to save the figure or not.

Users can call the ‘plot_natural_course’ function by:

g.plot_natural_course()

Users can call the ‘plot_interventions’ function by:

g.plot_interventions()

Note that the plotting functions can only be applied after calling the ‘g.fit’ function.

The figures can be saved with the argument ‘‘save_figure’’. Once it is set to True, the figures are saved locally in an automatically created folder. If the argument ‘‘save_path’’ is specified, the figures are saved to the corresponding folder.

Sample syntax:

g.plot_natural_course(plot_name='L1', colors=['blue', 'red'], markersize=5, linewidth=1, marker='v', save_figure=True)
g.plot_interventions(colors=['green', 'red', 'yellow'], markersize=5, linewidth=1, marker='v', save_figure=True)

Note

We recommend setting ‘‘save_figure’’ to True if users want to access the figure when running the package on a Linux system.
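
On a Linux machine without a display, an interactive figure window may not be available, so writing the figure to a file is the dependable way to inspect it. The sketch below uses plain matplotlib with a non-interactive backend to draw and save a comparison of nonparametric and parametric means. It is not the package's plotting code; the data values, axis labels, and output file name are hypothetical.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend: no display required
import matplotlib.pyplot as plt

# Hypothetical means of one covariate at 5 time points
time_points = list(range(5))
obs_means = [1.0, 1.2, 1.5, 1.7, 2.0]      # nonparametric estimates
est_means = [1.0, 1.25, 1.45, 1.75, 1.95]  # parametric g-formula estimates

fig, ax = plt.subplots()
ax.plot(time_points, obs_means, color="blue", marker="v",
        markersize=5, linewidth=1, label="nonparametric")
ax.plot(time_points, est_means, color="red", marker="v",
        markersize=5, linewidth=1, label="parametric g-formula")
ax.set_xlabel("time")
ax.set_ylabel("mean")
ax.legend()

# Write the figure to disk instead of opening a window.
out_path = os.path.join(tempfile.mkdtemp(), "natural_course.png")
fig.savefig(out_path)
plt.close(fig)
```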