With an annual budget in excess of $500 billion, the U.S. Department of Defense (DoD) is one of the single largest spenders in the world. Each year, it awards billions of dollars in contracts to research, develop, build, procure, or otherwise produce materials and technology in support of the public defense. These contracts can take vastly different forms and terms, encompassing everything from next-generation weapon systems to replenishing office supplies.
Given the sheer scope and variety of these contracts, it can be difficult for those making procurement decisions to know reliably which new contracts are the best deals and which carry the highest risk. Drawing on a record of completed DoD contracts from the past 16 years, we use a Random Forest model to identify which combination of contract features (e.g., duration, cost, competitive bids) most powerfully predicts whether a given new contract will be terminated before completion.
This file contains the code to train, run, and evaluate the Random Forest model on data that have already been cleaned and aggregated to the appropriate contract format. The preliminary scripts that perform that cleaning and aggregation can be found separately on the project GitHub Repo.
The following packages are required for our analysis:
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.3.3
library(dplyr)
library(caret)
library(pROC)
## Warning: package 'pROC' was built under R version 3.3.3
The data used in this analysis have already been aggregated and somewhat cleaned; they are loaded from our online repository, which also includes the scripts used to access and aggregate the raw data. For more information on the data source, see the README file.
#clear the workspace and load the cleaned dataset
rm(list=ls(all=TRUE))
load("clean_dataset.Rda")
df <- dataset
In addition to the functions included in the packages already loaded, our analysis also makes use of three custom functions to check the accuracy of predictions. The first calculates the Mean F1 score, a measure of predictive accuracy that ranges from 0 to 1. The target for reliable predictions in this case is roughly 0.8 or above, but values very close to 1 could be a sign of overfitting in the test process.
#Mean F1 score function
meanf1 <- function(actual, predicted){
  #actual = a vector of actual labels
  #predicted = predicted labels
  classes <- unique(actual)
  results <- data.frame()
  for(k in classes){
    results <- rbind(results,
                     data.frame(class.name = k,
                                weight = sum(actual == k)/length(actual),
                                precision = sum(predicted == k & actual == k)/sum(predicted == k),
                                recall = sum(predicted == k & actual == k)/sum(actual == k)))
  }
  results$score <- results$weight * 2 * (results$precision * results$recall) / (results$precision + results$recall)
  return(sum(results$score))
}
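For illustration only (this toy example is not part of the analysis), the function can be checked against a small hand-made set of labels:
#hypothetical toy example: three of four labels predicted correctly (score is roughly 0.73)
actual_toy <- c("Terminated", "Not Terminated", "Not Terminated", "Terminated")
predicted_toy <- c("Terminated", "Not Terminated", "Terminated", "Terminated")
meanf1(actual_toy, predicted_toy)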
The second custom function is simply a shortcut for applying the meanf1 function to different subpopulations, simplifying the Train-Validate-Test process.
#Simple predict function
pred.check <- function(x,y) {
  #"x" is the partition for validation and "y" is the model
  #so I only need to modify one model when checking
  pred <- predict(y, x, type="class")
  z <- meanf1(x$IsTerminated, pred)
  print(z)
}
The third function simply calculates the “Area Under the Curve”, or AUC, score. AUC measures accuracy slightly differently than Mean F1, but both use the same 0–1 scale. For a binary outcome like IsTerminated, the difference between the AUC and 0.5 is the extent to which the model predicts outcomes more accurately than a random coin flip.
#simple AUC function
auc.check <- function(x,y) {
  #"x" is the partition for validation and "y" is the model
  pred <- predict(y, x, type="class")
  #convert to numeric binary so auc can be calculated
  pred <- ifelse(pred == "Terminated", 1, 0)
  real <- ifelse(x$IsTerminated == "Terminated", 1, 0)
  #calculate auc
  z <- auc(real, pred)
  print(z)
}
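Purely for illustration (this is not part of the original analysis), the 0.5 coin-flip baseline can be seen by scoring random binary guesses with pROC's auc() function:
#hypothetical illustration: random guesses score near 0.5, a perfect prediction scores 1
set.seed(1)
real_toy <- rbinom(1000, 1, 0.5)
guess_toy <- rbinom(1000, 1, 0.5)
auc(real_toy, guess_toy) #close to 0.5
auc(real_toy, real_toy) #exactly 1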
As this is a case of supervised learning, our analysis cannot make use of data points for which the outcome variable, IsTerminated, is unknown. (They can be predicted, but not verified, making them functionally equivalent to observations outside of the data set.) These observations are dropped.
#remove missing from outcome variables
df <- df[!is.na(df$IsTerminated),]
For the sake of parsimony, we also drop variables that are not included in our analysis, either because they are not predictive (e.g., the contract ID) or because they are missing so many observations that their inclusion would severely diminish the sample size without imputation.
#subset relevant variables (drop irrelevant, date-based, and more than 1mm missing)
drop <- names(df) %in% c("CSIScontractID", "StartFiscal_Year", "MinOfSignedDate",
"MinOfEffectiveDate", "unmodifiedSystemequipmentcode",
"UnmodifiedUltimateCompletionDate", "UnmodifiedLastDateToOrder",
"Unmodifiedmultipleorsingleawardidc", "UnmodifiedLastDateToOrder",
"UnmodifiedIsOnlyOneSource", "UnmodifiedIsFollowonToCompetedAction",
"UnmodifiedIsCombination", "Unmodifiedaddmultipleorsingawardidc",
"UnmodifiedCustomer","UnmodifiedIsFullAndOpen", "UnmodifiedIsSomeCompetition",
"UnmodifiedNumberOfOffersReceived",
"UnmodifiedPlaceCountryISO3", "pChangeOrderUnmodifiedBaseAndAll")
df <- df[!drop]
Some variables are recoded to a more amenable type in R, especially to de-factorize variables that R reads as factors by default. For some variables with a significant number of NA values, NA is recoded as a factor level to accommodate any predictive power that may be associated with failing to have that particular value. For example, it is possible that a factor being improperly or insufficiently documented in some cases might be associated with a higher risk of other organizational problems that could lead to contract termination.
#change class
df$UnmodifiedDays <- as.numeric(df$UnmodifiedDays)
df$UnmodifiedCurrentCompletionDate <- months(df$UnmodifiedCurrentCompletionDate)
df$UnmodifiedAwardOrIDVcontractactiontype <- as.character(df$UnmodifiedAwardOrIDVcontractactiontype)
df$UnmodifiedSubCustomer <- as.character(df$UnmodifiedSubCustomer)
#recode NA to dummy for those with >100k NA
df$UnmodifiedAwardOrIDVcontractactiontype[is.na(df$UnmodifiedAwardOrIDVcontractactiontype)] <- "Unknown"
df$UnmodifiedSubCustomer[is.na(df$UnmodifiedSubCustomer)] <- "Unknown"
#recode outcome IsTerminated with characters
df$IsTerminated <- ifelse(df$IsTerminated == 0, "Not Terminated", "Terminated")
#turn all char into factor
for(i in colnames(df)) {
  x <- class(df[,i])
  if(x == 'character') {
    df[,i] <- as.factor(df[,i])
  }
}
#turn all int into num
for(i in colnames(df)) {
  x <- class(df[,i])
  if(x != 'factor') {
    df[,i] <- as.numeric(df[,i])
  }
}
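As a quick check (not shown in the original output), we can confirm that every column is now either a factor or numeric:
#optional sanity check: tabulate the resulting column classes
table(sapply(df, class))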
Finally, we drop all incomplete observations instead of attempting to impute. (Random Forests do not accommodate missing values well.) We also rebalance the dataset on the outcome variable. Before balancing, there is roughly 1 Terminated contract for every 70 Not Terminated contracts; we reduce that imbalance by downsampling the Not Terminated contracts to roughly twice the number of Terminated contracts. The final dataset, after balancing the outcome and removing irrelevant data, contains 86,588 observations. This represents a very conservative approach for our proof of concept, but as we explore more robust predictions, we will want to incorporate upsampling or bootstrapping to make use of more observations.
#can only use complete cases
df_comp <- df[complete.cases(df),]
#fix balance
table(df_comp$IsTerminated)
##
## Not Terminated Terminated
## 2098584 28864
df_term <- df_comp[df_comp$IsTerminated == "Terminated",]
df_non <- df_comp[df_comp$IsTerminated == "Not Terminated",]
set.seed(3)
downsample <- df_non[sample(nrow(df_non),57724),]
df_balanced <- rbind(df_term,downsample)
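To confirm the new class balance (a check not included in the original output), the frequency table can be re-run on the balanced data:
#check the post-downsampling balance of the outcome variable
table(df_balanced$IsTerminated)
prop.table(table(df_balanced$IsTerminated))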
The available data are randomly split into three groups: 70% of observations go into a training set, 15% into a validation set used to check the predictive accuracy of the trained model, and the final 15% into a holdout set for a final test of the model settled on after experimenting with the train-validate combination.
Note that while the bootstrapped nature of the Random Forest approach does provide a certain number of “Out of Bag” cases for free testing, with the large number of observations available we can safely proceed with the more rigorous formal partitioning.
set.seed(100)
sample_train <- floor(.7*nrow(df_balanced))
sample_valid <- floor(.15*nrow(df_balanced))
index_train <- sort(sample(seq_len(nrow(df_balanced)), size=sample_train))
index_not.train <- setdiff(seq_len(nrow(df_balanced)), index_train)
index_valid <- sort(sample(index_not.train, size=sample_valid))
index_test <- setdiff(index_not.train, index_valid)
df_train <- df_balanced[index_train, ]
df_valid <- df_balanced[index_valid, ]
df_test <- df_balanced[index_test, ]
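As an optional sanity check (not part of the original analysis), the three partitions should be disjoint and together cover every row of df_balanced:
#optional sanity check: partitions are disjoint and exhaustive
stopifnot(length(intersect(index_train, index_valid)) == 0,
          length(intersect(index_train, index_test)) == 0,
          length(intersect(index_valid, index_test)) == 0,
          nrow(df_train) + nrow(df_valid) + nrow(df_test) == nrow(df_balanced))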
Our model of choice is a Random Forest that uses the 9 variables listed below to make a binary classification prediction for the outcome variable IsTerminated. The model is run on the training set df_train.
The configuration shown here, with 9 input variables, mtry = 2, and 100 trees, is the final result of iterating through different configurations of the Train-Validate loop. The number of decision trees is set to 100 instead of the default 500 because the plot of the fitted model below shows little precision benefit after the first 100 trees, despite increasing computational costs; the tuneRF output below likewise shows little benefit to sampling more than 2 candidate variables at each split.
###random forest classification
set.seed(123)
rf <- randomForest(IsTerminated ~
UnmodifiedIsInternational +
UnmodifiedCurrentCompletionDate +
UnmodifiedDays +
UnmodifiedPlatformPortfolio +
UnmodifiedProductOrServiceArea +
UnmodifiedSimpleArea +
UnmodifiedAwardOrIDVcontractactiontype +
UnmodifiedSubCustomer +
UnmodifiedTypeOfContractPricing,
data=df_train,
mtry = 2, ntree = 100,
#omit missing values
na.action = na.omit, importance = T,
type = "classification")
Because the importance parameter above is set to TRUE, we can display the relative importance of the different input variables in the model's predictions. The values shown have little directly interpretable meaning, but they show that UnmodifiedDays and UnmodifiedIsInternational appear as critical variables more often than the other input factors. This means that the length of the contract and the presence or absence of international work were more likely to predict a contract's termination than the included details of the contract or the targeted completion date at the time of signing.
#check importance of each variable
imp <- as.data.frame(sort(importance(rf)[,1],decreasing = TRUE),optional = T)
#column 1 of importance() is the class-wise mean decrease in accuracy
names(imp) <- "MeanDecreaseAccuracy"
imp
## MeanDecreaseAccuracy
## UnmodifiedIsInternational 13.3917727
## UnmodifiedDays 12.1054214
## UnmodifiedSubCustomer 11.0271321
## UnmodifiedAwardOrIDVcontractactiontype 10.2852295
## UnmodifiedTypeOfContractPricing 8.4455056
## UnmodifiedProductOrServiceArea 6.9308964
## UnmodifiedSimpleArea 6.7244371
## UnmodifiedPlatformPortfolio 2.5703421
## UnmodifiedCurrentCompletionDate 0.2953323
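For a visual version of this ranking (not included in the original output), the same importance measures can also be plotted directly from the fitted forest:
#optional: plot variable importance
varImpPlot(rf)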
As discussed above, tuneRF shows that there is relatively little accuracy benefit to sampling additional candidate variables at each split beyond mtry = 2.
Similarly, the basic plot of the Random Forest object rf shows that the decrease in error levels off well before the first 100 trees, justifying the use of 100 trees instead of the default 500.
tuneRF(df_train[,c("IsTerminated","UnmodifiedIsInternational","UnmodifiedCurrentCompletionDate",
"UnmodifiedDays","UnmodifiedPlatformPortfolio",
"UnmodifiedProductOrServiceArea","UnmodifiedSimpleArea",
"UnmodifiedAwardOrIDVcontractactiontype","UnmodifiedSubCustomer",
"UnmodifiedTypeOfContractPricing")],
#IsTerminated as the outcome variable
df_train$IsTerminated,
ntreeTry = 400, na.action = na.omit,
mtryStart = 1, stepFactor = 2,
improve = 0.001, trace = TRUE, plot = F)
## mtry = 1 OOB error = 1.05%
## Searching left ...
## Searching right ...
## mtry = 2 OOB error = 0%
## 0.9984227 0.001
## mtry = 4 OOB error = 0%
## 0 0.001
## mtry OOBError
## 1.OOB 1 1.046015e-02
## 2.OOB 2 1.649866e-05
## 4.OOB 4 1.649866e-05
#Plotting the Random Forest object shows the decreasing error with additional trees
plot(rf)
With this trained Random Forest model in hand, we can now evaluate its accuracy in predicting outcome values not only on the training set itself (as an internal consistency check) and the validation set, but also on the holdout test set.
pred.check(df_train, rf)
## [1] 0.8661777
pred.check(df_valid, rf)
## [1] 0.8640094
pred.check(df_test, rf)
## [1] 0.8544884
From this, we see a stable Mean F1 of around 0.85 in all three sets, a decently high score that means, roughly speaking, the model predicts contract terminations with about 85% accuracy.
Similarly, the AUC scores are favorably high and stable between train, validate, and test, suggesting a relatively robust predictive capability without any sign of overfitting.
auc.check(df_train, rf)
## Area under the curve: 0.8946
auc.check(df_valid, rf)
## Area under the curve: 0.892
auc.check(df_test, rf)
## Area under the curve: 0.8861
In conclusion, using only 9 input factors out of dozens of possible factors, a Random Forest model is able to predict whether or not a Defense contract will be terminated with roughly 85% accuracy. This result represents a very conservative approach, drawing on a relatively tiny set of potential variables and only those observations that are more stable and polished than the rest of the available data. As a proof of concept, this shows considerable promise if the approach were continued with the following improvements.
The set of observations used for this proof of concept is highly restricted to start. Of millions of total contracts, only roughly 86,000 are retained after applying the inclusion criteria and balancing the outcome, raising the risk of weaker-than-expected predictive power when working with cases that are even slightly different from the defined set.
More fine-grained tuning, especially around the many missing values in the aggregated data, would increase the predictive power and reliability of the model in two key ways. First, the integrity of more input variables could be more thoroughly established, allowing them to be added to the Random Forest without fear of endogeneity. Second, fewer observations would need to be dropped to run the model, increasing precision and generalizability by taking full advantage of as many example contracts as possible.
With some additional data cleaning and investigation, it should be possible to reliably apply a categorical variable to all contracts indicating whether or not the contract went over budget, and if so, a numerical variable showing by how much the budget was exceeded. If these values can be shown to be reliable at the contract level, it would be a straightforward endeavor to repurpose the Random Forest model used here to predict overages instead of terminations.
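As a rough sketch only: assuming a hypothetical, cleaned factor variable IsOverBudget were added to the dataset (it does not exist in the current data), the model call used above could be repurposed with minimal changes:
#hypothetical sketch: IsOverBudget is an assumed future variable, not present in the current data
rf_overage <- randomForest(IsOverBudget ~
                             UnmodifiedIsInternational +
                             UnmodifiedCurrentCompletionDate +
                             UnmodifiedDays +
                             UnmodifiedPlatformPortfolio +
                             UnmodifiedProductOrServiceArea +
                             UnmodifiedSimpleArea +
                             UnmodifiedAwardOrIDVcontractactiontype +
                             UnmodifiedSubCustomer +
                             UnmodifiedTypeOfContractPricing,
                           data = df_train,
                           mtry = 2, ntree = 100,
                           na.action = na.omit, importance = T,
                           type = "classification")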