The goal of the MPHIACausal package is to provide the code accompanying the paper. This README illustrates the basic usage of the package and reproduces some of the results in the paper. A more complete reproduction of the results can be found in “ResultReproduction.Rmd”.
To install the package from MPHIACausal_0.1.0.tar.gz, the easiest way is to use the devtools or remotes package as follows:
# install.packages("devtools")
devtools::install_local("MPHIACausal_0.1.0.tar.gz")
# install.packages("remotes")
remotes::install_local("MPHIACausal_0.1.0.tar.gz")
To speed up the proposed algorithm, we use parallel computing and set the number of parallel session workers to 10.
library(MPHIACausal)
options(CausalLearnWorkders = 10)
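To switch back to sequential execution later, the option can be unset (as is done in one of the examples below):
# options(CausalLearnWorkders = NULL)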
The package includes datasets, as well as functions for generating random datasets, for testing the causal structural learning algorithms:
data("Tri90Data")
can load the processed MPHIA Tri90 datasets for each target and gender.randomGaussian(n, p, prob, theta)
can generate random data of sample size \(n\) and number of covariates \(p\) together with its true skeleton (graphical structure without orientation).randomMPHIA(n, dn)
can generate random data of sample size \(n\) which mimic the MPHIA dataset dn
(dn
is one of Tri90AwareMale, Tri90AwareFemale, Tri90ARTMale, Tri90ARTFemale, Tri90VLSMale, Tri90VLSFemale) together with its true skeleton (graphical structure without orientation).rnet(G, data, n=nrow(data))
can generate a simulated dataset of sample size \(n\) according to the fitted distribution of graph G
on data
.The R functions corresponding to causal structural learning algorithms used in the paper can be accessed by getAlgo(method)
, such as getAlgo("GS")
, getAlgo("PC.stable")
, getAlgo("New")
(the proposed algorithm). The first argument to the causal structural learning functions should be the dataset. Other useful arguments are alpha
(the size of hypothesis testing), max.sx
(the upper-bound for the size of the conditional independence set), and undirected
(whether the learned graph is undirected or not). Note that the undirected
argument does not impact the skeleton of the learned graph.
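For example, a call to the proposed algorithm might look like the following (a minimal sketch; here data stands for any data frame of covariates, and the argument values are illustrative):
## learn a graph at significance level 0.05, with conditioning sets of size
## at most 3, keeping the edge orientations
# g <- getAlgo("New")(data, alpha = 0.05, max.sx = 3, undirected = FALSE)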
We first generate a random dataset with sample size \(n = 100\), \(p = 20\) covariates, \(\rho = 0.2\), and \(\theta = 1\), together with its true skeleton (graphical structure without orientation), using the random Gaussian data generation method.
data <- randomGaussian(n = 100, p = 20, prob = 0.2, theta = 1)
Alternatively, we can generate a random dataset that mimics the Tri90AwareMale MPHIA dataset with sample size \(n = 100\), together with its true skeleton (graphical structure without orientation).
data <- randomMPHIA(n = 100, dn = "Tri90AwareMale")
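In both cases, the returned object is a list whose components include the simulated dataset and the true skeleton, which are used below:
head(data$data)  ## the simulated dataset
data$skel        ## the true skeleton (graphical structure without orientation)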
Then we can apply the methods to the randomly generated data.
methods <- c("GS", "PC.stable", "IAMB", "MMPC", "New")
graphs <- lapply(methods,
                 function(method) {
                   suppressWarnings(getAlgo(method)(data$data,
                                                    alpha = 0.05))
                 })
names(graphs) <- methods
We can then compare the skeletons learned by the algorithms with the true skeleton as follows. TPR and TNR stand for the true positive rate and true negative rate, respectively.
results <- lapply(graphs, compareSkel,
                  skel0 = data$skel)
names(results) <- methods
results
#> $GS
#> TPR TNR
#> 0.08108108 0.99903241
#>
#> $PC.stable
#> TPR TNR
#> 0.05405405 0.99951621
#>
#> $IAMB
#> TPR TNR
#> 0.09009009 0.99903241
#>
#> $MMPC
#> TPR TNR
#> 0.09009009 0.99854862
#>
#> $New
#> TPR TNR
#> 0.3423423 0.9874214
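If a side-by-side comparison is preferred, the per-method results can also be collected into a single matrix (a small base-R sketch):
do.call(cbind, results)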
Here we apply the methods to the MPHIA male awareness dataset.
data("Tri90Data")
methods <- c("GS", "PC.stable", "IAMB", "MMPC", "New")
graphs <- lapply(methods,
                 function(method) {
                   f <- getAlgo(method)
                   suppressWarnings(f(Tri90Data$Tri90AwareMale,
                                      alpha = 0.05, max.sx = 3, undirected = FALSE))
                 })
names(graphs) <- methods
We can inspect some statistics of the graph learned by the proposed algorithm, check whether it matches the previous results saved in the package (Tri90Results), and examine the degrees of freedom and log-likelihoods as follows.
## The details of the graph learned by the proposed algorithm.
## The learning algorithm is incorrectly displayed as "Grow-Shrink" due to a coding issue.
graphs$New
#>
#> Bayesian network learned via Constraint-based methods
#>
#> model:
#> [partially directed graph]
#> nodes: 66
#> arcs: 113
#> undirected arcs: 14
#> directed arcs: 99
#> average markov blanket size: 6.33
#> average neighbourhood size: 3.42
#> average branching factor: 1.50
#>
#> learning algorithm: Grow-Shrink
#> conditional independence test: Mutual Information (cond. Gauss.)
#> alpha threshold: 0.05
#> tests used in the learning procedure: 17624
#> optimized: FALSE
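If the Rgraphviz package is available, the learned graph can also be visualized with bnlearn's plotting utilities (optional, not run here):
# bnlearn::graphviz.plot(graphs$New)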
## We can see that the graph just learned is the same as the result saved in the package. Here tp, fp, and fn stand for true positive, false positive, and false negative (all considering edge orientation), respectively.
data("Tri90Results")
bnlearn::compare(Tri90Results$Tri90AwareMale$New, graphs$New)
#> $tp
#> [1] 113
#>
#> $fp
#> [1] 0
#>
#> $fn
#> [1] 0
## We can check the degrees of freedom (df) and log-likelihood (score) of the learned graphs.
results <- lapply(graphs, fitNetwork, data = Tri90Data$Tri90AwareMale)
results
#> $GS
#> $GS$df
#> [1] 81
#>
#> $GS$score
#> [1] 2348.336
#>
#>
#> $PC.stable
#> $PC.stable$df
#> [1] 50
#>
#> $PC.stable$score
#> [1] 3076.681
#>
#>
#> $IAMB
#> $IAMB$df
#> [1] 71
#>
#> $IAMB$score
#> [1] 2650.435
#>
#>
#> $MMPC
#> $MMPC$df
#> [1] 44
#>
#> $MMPC$score
#> [1] 2218.742
#>
#>
#> $New
#> $New$df
#> [1] 948
#>
#> $New$score
#> [1] 17755.49
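For a quick overview, the degrees of freedom and scores can be pulled out of results into named vectors (a small sketch):
sapply(results, function(r) r$df)     ## degrees of freedom per method
sapply(results, function(r) r$score)  ## log-likelihood per method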
A code chunk to reproduce and check the graphs saved in the package is as follows. The code chunk is currently not evaluated to save time. You can also run “ResultReproduction.Rmd” to perform the check.
library(MPHIACausal)
options(CausalLearnWorkders = 10)
data("Tri90Data")
data("Tri90Results")
methods <- c("GS", "PC.stable", "IAMB", "MMPC", "New")
learnGraphs <- function(data) {
  r <- lapply(methods,
              function(method) {
                f <- getAlgo(method)
                suppressWarnings(f(data,
                                   alpha = 0.05, max.sx = 3,
                                   undirected = FALSE))
              })
  names(r) <- methods
  r
}
system.time(Tri90ResultsCheck <- lapply(Tri90Data, learnGraphs))
checkgraph <- function(g, g0) {
  comp <- bnlearn::compare(g0, g)
  ## the graphs match when there are no false positives and no false negatives
  if (comp$fp != 0 || comp$fn != 0) {
    cat("check failed.\n")
  } else {
    cat("check passed.\n")
  }
}
datasets <- names(Tri90Data)
for (d in datasets) {
  cat(paste0(d, "\n"))
  for (method in methods) {
    cat(paste0(method, ": "))
    checkgraph(Tri90ResultsCheck[[d]][[method]],
               Tri90Results[[d]][[method]])
  }
}
The complete simulation study takes a long time. Here is a quick example of one simulation that mimics the Tri90AwareMale MPHIA dataset with sample size \(n = 100\). TPR and TNR stand for the true positive rate and true negative rate, respectively.
suppressWarnings(SimulationI(n = 100, dn = "Tri90AwareMale"))
#> GS PC.stable IAMB MMPC New
#> TPR 0.08108108 0.05405405 0.08108108 0.08108108 0.2702703
#> TNR 0.99806483 0.99903241 0.99758104 0.99806483 0.9825835
We can repeat the simulation \(M = 5\) times and obtain the empirical true positive rate and true negative rate as follows.
## repeat the simulation M = 5 times
M <- 5
results <- replicate(M,
                     suppressWarnings(SimulationI(n = 100, dn = "Tri90AwareMale")),
                     simplify = FALSE)
## get average
Reduce("+", results) / M
#> GS PC.stable IAMB MMPC New
#> TPR 0.08828829 0.05945946 0.09369369 0.09369369 0.2504505
#> TNR 0.99874214 0.99961297 0.99903241 0.99903241 0.9811321
Note that a similar approach can be used for the simulation studies S.5.1 and S.5.3 below to obtain empirical true positive and true negative rates.
The complete simulation study S.5.1 takes a long time. Here is a quick example of one simulation that mimics the Tri90AwareMale MPHIA dataset with sample size \(n = 100\). TPR and TNR stand for the true positive rate and true negative rate, respectively.
suppressWarnings(SimulationS51(n = 100, dn = "Tri90AwareMale"))
#> MCI=2 MCI=3 MCI=4 MCI=5 MCI=Inf
#> TPR 0.3063063 0.3063063 0.3063063 0.3063063 0.3063063
#> TNR 0.9830672 0.9830672 0.9830672 0.9830672 0.9830672
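As noted above, the empirical rates for S.5.1 can be obtained by averaging repeated runs, in the same way as for Simulation I (a sketch, not evaluated here to save time):
M <- 5
resultsS51 <- replicate(M,
                        suppressWarnings(SimulationS51(n = 100, dn = "Tri90AwareMale")),
                        simplify = FALSE)
Reduce("+", resultsS51) / M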
The complete simulation study S.5.2 also takes a long time. The results of simulation study S.5.2 are saved in the package and can be retrieved easily. For example:
library(ROCR)
data("SimulationS52ROCRs")
SimulationS52AUC(SimulationS52ROCRs$Tri90AwareMale)
#> GS PC.stable IAMB MMPC New
#> 0.7080620 0.7262506 0.7189025 0.7643728 0.9223383
SimulationS52PlotROC(SimulationS52ROCRs$Tri90AwareMale)
Here is a detailed example showing how to replicate the results in simulation study S.5.2. In simulation study S.5.2, random datasets are generated according to one of the Tri90 graphs learned by the proposed algorithm, as follows:
data <- rnet(Tri90Results$Tri90AwareMale$New,
             Tri90Data$Tri90AwareMale)
data <- rnet(Tri90Results$Tri90AwareFemale$New,
             Tri90Data$Tri90AwareFemale)
data <- rnet(Tri90Results$Tri90ARTMale$New,
             Tri90Data$Tri90ARTMale)
data <- rnet(Tri90Results$Tri90ARTFemale$New,
             Tri90Data$Tri90ARTFemale)
data <- rnet(Tri90Results$Tri90VLSMale$New,
             Tri90Data$Tri90VLSMale)
data <- rnet(Tri90Results$Tri90VLSFemale$New,
             Tri90Data$Tri90VLSFemale)
Here we generate some pseudoTri90Data for illustration purposes.
pseudoTri90Data <- randomMPHIA(n = 100, dn = "Tri90AwareMale")$data
pseudoTri90Result <- MPHIACausal:::DAGLearn(pseudoTri90Data, max.sx = 2)
As in simulation study S.5.2, we can generate \(M\) random datasets with the same sample size as data. Here we set \(M = 10\).
M <- 10
datasets <- replicate(M, rnet(pseudoTri90Result, pseudoTri90Data),
                      simplify = FALSE)
For each method, we then compute an averaged graph and compare it to the true graph as follows.
options(CausalLearnWorkders = NULL)
methods <- c("GS", "PC.stable", "IAMB", "MMPC", "New")
ROCRs <- list()
for (method in methods) {
  ## learn M graphs from the M datasets using the current method
  graphs <- lapply(datasets,
                   function(data) {
                     methodFunction <- getAlgo(method)
                     suppressWarnings(methodFunction(data, max.sx = 2))
                   })
  ## build an "averaged graph" from the learned graphs and compare it to the "true" graph
  ROCRs[[method]] <- buildROCR(truth = pseudoTri90Result,
                               graphs = graphs)
}
## calculate AUC and plot ROC from the results
SimulationS52AUC(ROCRs, methods = methods)
#> GS PC.stable IAMB MMPC New
#> 0.6434824 0.6086957 0.6773408 0.6945422 0.9898530
SimulationS52PlotROC(ROCRs, methods = methods)
The complete simulation study S.5.3 takes a long time. Here is a quick example of one simulation with sample size \(n = 100\), \(p = 100\), and \(\rho = 0.01\). TPR and TNR stand for the true positive rate and true negative rate, respectively.
suppressWarnings(SimulationS53(n = 100, p = 100, prob = 0.01))
#> GS PC.stable IAMB MMPC New
#> TPR 0.3666667 0.6166667 0.5500000 0.5833333 0.6833333
#> TNR 0.9971660 0.9979757 0.9977733 0.9975709 0.9923077
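As noted earlier, the empirical true positive and true negative rates for S.5.3 can likewise be obtained by averaging repeated runs (a sketch, not evaluated here to save time):
M <- 5
resultsS53 <- replicate(M,
                        suppressWarnings(SimulationS53(n = 100, p = 100, prob = 0.01)),
                        simplify = FALSE)
Reduce("+", resultsS53) / M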