Comparison of four statistical and machine learning methods for crash severity prediction

Amirfarrokh Iranitalab, Aemal Khattak

Research output: Contribution to journal › Article

34 Citations (Scopus)

Abstract

Crash severity prediction models enable agencies to predict the severity of a reported crash whose severity is unknown, or the severity of crashes that may be expected to occur in the future. This paper had three main objectives: comparing the performance of four statistical and machine learning methods, Multinomial Logit (MNL), Nearest Neighbor Classification (NNC), Support Vector Machines (SVM) and Random Forests (RF), in predicting traffic crash severity; developing a crash costs-based approach for comparing crash severity prediction methods; and investigating the effects of two data clustering methods, K-means Clustering (KC) and Latent Class Clustering (LCC), on the performance of crash severity prediction models. The 2012–2015 reported crash data from Nebraska, United States, were obtained, and two-vehicle crashes were extracted as the analysis data. The dataset was split into training/estimation (2012–2014) and validation (2015) subsets. The four prediction methods were trained/estimated on the training/estimation dataset, and the correct prediction rate for each crash severity level, the overall correct prediction rate and a proposed crash costs-based accuracy measure were computed on the validation dataset. The correct prediction rates and the proposed approach showed that NNC had the best prediction performance, both overall and for more severe crashes. RF and SVM ranked next, and MNL was the weakest method. Data clustering did not affect the prediction results of SVM. KC improved the prediction performance of MNL, NNC and RF, while LCC improved MNL and RF but weakened the performance of NNC. The overall correct prediction rate gave almost exactly the opposite results from the proposed approach, showing that neglecting crash costs can lead to misjudgment in choosing a prediction method.
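The contrast the abstract draws between the overall correct prediction rate and a costs-based measure can be illustrated with a small sketch. Everything below is an assumption made for illustration: the severity labels, the unit costs in `CRASH_COST`, the toy data, and the `cost_weighted_correct_rate` formula are hypothetical and are not the measure proposed in the paper, which the abstract does not define.

```python
from collections import Counter

# Hypothetical unit costs per severity level (illustrative values only,
# not taken from the paper).
CRASH_COST = {"PDO": 1.0, "injury": 20.0, "fatal": 500.0}


def per_level_correct_rate(y_true, y_pred):
    """Share of crashes at each observed severity level predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {level: hits[level] / n for level, n in totals.items()}


def overall_correct_rate(y_true, y_pred):
    """Share of all crashes whose severity was predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cost_weighted_correct_rate(y_true, y_pred, cost=CRASH_COST):
    """Assumed measure: share of total observed crash cost that belongs
    to correctly predicted crashes."""
    total = sum(cost[t] for t in y_true)
    correct = sum(cost[t] for t, p in zip(y_true, y_pred) if t == p)
    return correct / total


# Toy validation set: 8 property-damage-only, 1 injury, 1 fatal crash.
observed = ["PDO"] * 8 + ["injury", "fatal"]

# Predictor A always predicts the most common level.
always_pdo = ["PDO"] * 10
# Predictor B misclassifies three PDO crashes but catches both severe ones.
catches_severe = ["PDO"] * 5 + ["injury"] * 3 + ["injury", "fatal"]

for name, predicted in [("A", always_pdo), ("B", catches_severe)]:
    print(
        name,
        overall_correct_rate(observed, predicted),
        round(cost_weighted_correct_rate(observed, predicted), 4),
        per_level_correct_rate(observed, predicted),
    )
```

In this toy setup, predictor A scores higher on the overall correct prediction rate, while predictor B scores far higher once each crash is weighted by its assumed cost. That is one way the two criteria can rank methods in opposite order.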

Original language: English (US)
Pages (from-to): 27-36
Number of pages: 10
Journal: Accident Analysis and Prevention
Volume: 108
DOI: 10.1016/j.aap.2017.08.008
State: Published - Nov 2017

Keywords

  • Crash costs
  • Multinomial logit
  • Nearest neighbor classification
  • Random forests
  • Support vector machines
  • Traffic crash severity prediction

ASJC Scopus subject areas

  • Human Factors and Ergonomics
  • Safety, Risk, Reliability and Quality
  • Public Health, Environmental and Occupational Health

Cite this

Comparison of four statistical and machine learning methods for crash severity prediction. / Iranitalab, Amirfarrokh; Khattak, Aemal.

In: Accident Analysis and Prevention, Vol. 108, 11.2017, p. 27-36.

@article{8edb6b5b346148eca7b7ab682bac71f0,
title = "Comparison of four statistical and machine learning methods for crash severity prediction",
abstract = "Crash severity prediction models enable different agencies to predict the severity of a reported crash with unknown severity or the severity of crashes that may be expected to occur sometime in the future. This paper had three main objectives: comparison of the performance of four statistical and machine learning methods including Multinomial Logit (MNL), Nearest Neighbor Classification (NNC), Support Vector Machines (SVM) and Random Forests (RF), in predicting traffic crash severity; developing a crash costs-based approach for comparison of crash severity prediction methods; and investigating the effects of data clustering methods comprising K-means Clustering (KC) and Latent Class Clustering (LCC), on the performance of crash severity prediction models. The 2012–2015 reported crash data from Nebraska, United States was obtained and two-vehicle crashes were extracted as the analysis data. The dataset was split into training/estimation (2012–2014) and validation (2015) subsets. The four prediction methods were trained/estimated using the training/estimation dataset and the correct prediction rates for each crash severity level, overall correct prediction rate and a proposed crash costs-based accuracy measure were obtained for the validation dataset. The correct prediction rates and the proposed approach showed NNC had the best prediction performance in overall and in more severe crashes. RF and SVM had the next two sufficient performances and MNL was the weakest method. Data clustering did not affect the prediction results of SVM, but KC improved the prediction performance of MNL, NNC and RF, while LCC caused improvement in MNL and RF but weakened the performance of NNC. Overall correct prediction rate had almost the exact opposite results compared to the proposed approach, showing that neglecting the crash costs can lead to misjudgment in choosing the right prediction method.",
keywords = "Crash costs, Multinomial logit, Nearest neighbor classification, Random forests, Support vector machines, Traffic crash severity prediction",
author = "Amirfarrokh Iranitalab and Aemal Khattak",
year = "2017",
month = "11",
doi = "10.1016/j.aap.2017.08.008",
language = "English (US)",
volume = "108",
pages = "27--36",
journal = "Accident Analysis and Prevention",
issn = "0001-4575",
publisher = "Elsevier Limited",
}

TY - JOUR
T1 - Comparison of four statistical and machine learning methods for crash severity prediction
AU - Iranitalab, Amirfarrokh
AU - Khattak, Aemal
PY - 2017/11
Y1 - 2017/11

AB - Crash severity prediction models enable different agencies to predict the severity of a reported crash with unknown severity or the severity of crashes that may be expected to occur sometime in the future. This paper had three main objectives: comparison of the performance of four statistical and machine learning methods including Multinomial Logit (MNL), Nearest Neighbor Classification (NNC), Support Vector Machines (SVM) and Random Forests (RF), in predicting traffic crash severity; developing a crash costs-based approach for comparison of crash severity prediction methods; and investigating the effects of data clustering methods comprising K-means Clustering (KC) and Latent Class Clustering (LCC), on the performance of crash severity prediction models. The 2012–2015 reported crash data from Nebraska, United States was obtained and two-vehicle crashes were extracted as the analysis data. The dataset was split into training/estimation (2012–2014) and validation (2015) subsets. The four prediction methods were trained/estimated using the training/estimation dataset and the correct prediction rates for each crash severity level, overall correct prediction rate and a proposed crash costs-based accuracy measure were obtained for the validation dataset. The correct prediction rates and the proposed approach showed NNC had the best prediction performance in overall and in more severe crashes. RF and SVM had the next two sufficient performances and MNL was the weakest method. Data clustering did not affect the prediction results of SVM, but KC improved the prediction performance of MNL, NNC and RF, while LCC caused improvement in MNL and RF but weakened the performance of NNC. Overall correct prediction rate had almost the exact opposite results compared to the proposed approach, showing that neglecting the crash costs can lead to misjudgment in choosing the right prediction method.

KW - Crash costs
KW - Multinomial logit
KW - Nearest neighbor classification
KW - Random forests
KW - Support vector machines
KW - Traffic crash severity prediction
UR - http://www.scopus.com/inward/record.url?scp=85027874583&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85027874583&partnerID=8YFLogxK
U2 - 10.1016/j.aap.2017.08.008
DO - 10.1016/j.aap.2017.08.008
M3 - Article
C2 - 28841408
AN - SCOPUS:85027874583
VL - 108
SP - 27
EP - 36
JO - Accident Analysis and Prevention
JF - Accident Analysis and Prevention
SN - 0001-4575
ER -