TY - JOUR
T1 - A data-driven early warning system for Escherichia coli in water based on microbial community analysis using flow cytometry 2D histograms
AU - Erb, Isabel Katharina
AU - Gador, Niklas
AU - Jinbäck, Moa
AU - Lindberg, Elisabet
AU - Paul, Catherine
PY - 2025/12/1
Y1 - 2025/12/1
N2 - Traditional methods for microbial water quality testing take up to two days to produce results, leaving people in contact with the water at risk during this period. Flow cytometry, which can also be run online, is a fast and efficient way to profile microbes in water. In this study, Escherichia coli concentrations determined by Colilert18 and flow cytometry profiles were obtained from the same water samples, taken from sixteen bathing locations in Southern Sweden. Applying machine learning algorithms confirmed correlations and identified patterns in the microbial community, as described by the flow cytometry 2D histograms, that were associated with the presence of E. coli. A random forest algorithm was better than logistic regression and support vector machines at discriminating between water containing > 100 CFU/100 mL and water containing < 100 CFU/100 mL E. coli and, with optimised parameters, improved prediction accuracy to 80 % from 55 % for a baseline approach. The introduction of a two-threshold model, which only considered safe predictions, further improved accuracy to 87 % by using the prediction probability information in random forest. This approach, however, could produce predictions for only 65 % of the samples. A feature importance ranking using random forest identified the most important region within the flow cytometric 2D histogram for classification. This study suggests that machine learning can leverage microbial community information from flow cytometry which, when combined with established methods for quantifying indicators, can rapidly assess microbial water quality as an early warning system that complements traditional approaches.
AB - Traditional methods for microbial water quality testing take up to two days to produce results, leaving people in contact with the water at risk during this period. Flow cytometry, which can also be run online, is a fast and efficient way to profile microbes in water. In this study, Escherichia coli concentrations determined by Colilert18 and flow cytometry profiles were obtained from the same water samples, taken from sixteen bathing locations in Southern Sweden. Applying machine learning algorithms confirmed correlations and identified patterns in the microbial community, as described by the flow cytometry 2D histograms, that were associated with the presence of E. coli. A random forest algorithm was better than logistic regression and support vector machines at discriminating between water containing > 100 CFU/100 mL and water containing < 100 CFU/100 mL E. coli and, with optimised parameters, improved prediction accuracy to 80 % from 55 % for a baseline approach. The introduction of a two-threshold model, which only considered safe predictions, further improved accuracy to 87 % by using the prediction probability information in random forest. This approach, however, could produce predictions for only 65 % of the samples. A feature importance ranking using random forest identified the most important region within the flow cytometric 2D histogram for classification. This study suggests that machine learning can leverage microbial community information from flow cytometry which, when combined with established methods for quantifying indicators, can rapidly assess microbial water quality as an early warning system that complements traditional approaches.
M3 - Article
SN - 2589-9147
VL - 29
JO - Water Research X
JF - Water Research X
ER -