TY - JOUR
T1 - Acid sulfate soil mapping in western Finland
T2 - How to work with imbalanced datasets and machine learning
AU - Estévez, Virginia
AU - Mattbäck, Stefan
AU - Boman, Anton
AU - Liwata-Kenttälä, Pauliina
AU - Björk, Kaj Mikael
AU - Österholm, Peter
N1 - Publisher Copyright: © 2024 The Author(s)
PY - 2024/7
Y1 - 2024/7
AB - Imbalanced datasets are one of the main challenges in digital soil mapping. For these datasets, machine learning techniques commonly overestimate the majority classes and underestimate the minority ones. In general, this generates maps with poor precision and unrealistic results. Considering these maps for land use decision-making can have dire consequences. This is the case of acid sulfate (AS) soils, a type of harmful soil that can generate serious environmental damage when drained in agricultural or forestry activities. Therefore, it is necessary to create high-precision maps to avoid environmental damage. Although most soil class datasets in nature are imbalanced, this problem has hardly been studied. One of the main objectives of this work is the evaluation of different techniques to address the problem of imbalanced datasets. The methods considered to balance the dataset are an undersampling technique, the addition of more samples, and the combination of both. For increasing the number of samples from the minority class, we develop a new technique by creating artificial samples from the quaternary geological map. The method used for the modeling is Random Forest, one of the best methods for the classification of AS soils. Balancing the dataset improves the performance of the model in all the studied cases, where the values of the metrics for both classes are above 80%. The consideration of artificial non-AS soil samples improves the prediction of the model for the AS soils. Furthermore, we create AS soil probability maps for the four balanced datasets and the imbalanced dataset. The modeled AS soil probability maps created from the balanced datasets have high precision. A detailed comparison between the maps is made. The predictions of some of these maps match between 75%–80% of the study area. In addition, the extent of the AS soils obtained in all the cases is compared with the extent of the AS soils in the conventionally produced occurrence map. The good results of this study confirm the importance of balancing the dataset to improve the prediction and classification of AS soils.
KW - Acid sulfate soils
KW - Digital soil mapping
KW - Imbalanced dataset
KW - Machine learning
KW - Random Forest
KW - Resampling techniques
UR - http://www.scopus.com/inward/record.url?scp=85193479386&partnerID=8YFLogxK
U2 - 10.1016/j.geoderma.2024.116916
DO - 10.1016/j.geoderma.2024.116916
M3 - Article
AN - SCOPUS:85193479386
SN - 0016-7061
VL - 447
JO - Geoderma
JF - Geoderma
M1 - 116916
ER -