Research Results
How Far Can We Improve the Accuracy of a Machine Learning Classifier?
Research on Estimating the Best Possible Performance in Machine Learning (FY2024)
- ISHIDA Takashi (Research Scientist, RIKEN Center for Advanced Intelligence Project; Lecturer, Graduate School of Frontier Sciences, The University of Tokyo)
- ACT-X
- Researcher (2020-2022), Research on Bayes Error Estimation and Regularization Methods, Frontier of Mathematics and Information Science Area
Proposal of Bayes error estimation methods for appropriately evaluating machine learning data
Decisions on what kind of data to collect to train a machine learning model, and how much, directly impact the accuracy and cost of that system. Bayes error*1 estimation serves as a crucial criterion for such decisions.
Previously, Bayes error estimation methods had been studied only for specific scenarios. Takashi Ishida, a research scientist at RIKEN and a lecturer at the University of Tokyo, studies Bayes error estimation techniques that can be applied to a wide range of classification problems. Based on this research, he derived a new Bayes error estimation method for classification problems, and the resulting paper was accepted to ICLR 2023, an international conference focused on machine learning and deep learning, as a notable-top-5% paper.
Dr. Ishida also studies regularization methods for preventing overfitting. These are expected to be applicable to tuning the regularization hyperparameters (values set before training a machine learning model) based on the estimated Bayes error, and work is underway on new learning methods for classifiers and on regularization methods for neural networks.
Apart from his research on the Bayes error, Dr. Ishida studies regularization methods for situations in which hard labels*2 are assigned. He contributed to work that was accepted by Transactions on Pattern Analysis and Machine Intelligence, a journal of the Institute of Electrical and Electronics Engineers (IEEE), the world’s largest scientific society in the electronics field.
*1 Bayes error
The minimal prediction error that can be achieved in classification problems. It can be used as an indicator of the best achievable performance.
Identifying the appropriate quality and quantity of the dataset for effective learning
Machine learning is fundamental to the use of AI, and what kind of dataset to use for training a machine learning model is a factor of critical importance. In general, a dataset that is lacking in quantity or quality leads to poor prediction accuracy, while an appropriate amount of high-quality training data leads to higher accuracy. However, overtraining on data, even high-quality data, may cause a phenomenon known as “overfitting,” in which accuracy actually degrades. Moreover, some data may be unsuitable for machine learning in the first place, depending on its intended purpose or its characteristics. Deciding how much data to collect for training is therefore a difficult challenge. One way to approach this decision is to use the estimated Bayes error as a basic input.
The Bayes error in supervised learning*3 refers to the minimal prediction error that can be achieved in a given problem. If Bayes error estimation technology advances, researchers will be able to ascertain in advance the best achievable performance for the data analysis problems they want to work on. Then, depending on whether that best achievable performance suits their purposes, they can not only judge the feasibility of a project in advance, but also later decide whether to add data to reach appropriate performance or to stop training early to avoid overfitting. In other words, this advancement makes it possible to set quantitative targets for training machine learning models.
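As an illustration, the decision process described above can be sketched as a simple rule of thumb. The function name, threshold values, and message strings below are hypothetical, not part of the research; the only grounded fact is that the Bayes error is a floor no classifier can beat.

```python
def training_decision(model_error: float,
                      bayes_error_estimate: float,
                      target_error: float) -> str:
    """Hypothetical decision aid: compare current model error and a
    target error against the estimated Bayes error (the floor that
    no classifier can beat on this problem)."""
    if bayes_error_estimate > target_error:
        # Even a perfect classifier could not reach the target.
        return "infeasible: the target is below the estimated Bayes error"
    if model_error - bayes_error_estimate > 0.05:  # assumed gap threshold
        # There is still substantial room between the model and the floor.
        return "continue: add data or keep training, the gap is large"
    # The model is already close to the best achievable performance;
    # further training risks overfitting with little room for gain.
    return "stop: performance is close to the best achievable"

print(training_decision(model_error=0.12, bayes_error_estimate=0.10,
                        target_error=0.15))
```

The 0.05 gap threshold is an arbitrary placeholder; in practice it would depend on the cost of additional data and the value of further accuracy gains.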
In addition, if Bayes error estimation becomes possible for various datasets, it will significantly expand application opportunities, such as tuning the strength of regularization.
*2 Soft labels and hard labels
In supervised learning, the term “hard label” is used when the training data contains distinct positive/negative answers, while the term “soft label” is used when no deterministic answers are given, with only ratios provided.
*3 Supervised learning
In this category of machine learning, numerous pairs of input instances and labels are prepared in advance and used for training.
Enabling accurate estimation of the Bayes error even with limited data and supervision
A significant aspect of this research in Bayes error estimation is the development of a method that estimates the Bayes error solely from the maximum posterior probability of a class associated with each instance (input data point). As long as information on the class with the highest posterior probability is available, the instances themselves are not required.
This means that Bayes error estimation can be conducted even when it's cost-prohibitive to collect instances or when there are restrictions on data collection. Naturally, this method can also be applied in scenarios where instances are already available.
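A minimal numerical sketch of this idea for the binary case, assuming hypothetical Beta-distributed posteriors: the binary Bayes error equals E[min(p, 1−p)] = E[1 − max(p, 1−p)], so averaging one minus the maximum class posterior over instances estimates it without ever seeing the instances themselves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (assumed, not from the paper): for each instance we
# observe only the posterior probability of its most likely class,
# p_max = max(P(y=1|x), P(y=0|x)).
# Since the binary Bayes error is E[min(p, 1-p)] = E[1 - p_max],
# the sample mean of (1 - p_max) estimates it directly.
n = 100_000
p = rng.beta(2.0, 2.0, size=n)    # simulated class-1 posteriors P(y=1|x)
p_max = np.maximum(p, 1.0 - p)    # the only information the estimator needs

bayes_error_estimate = float(np.mean(1.0 - p_max))
print(f"estimated Bayes error: {bayes_error_estimate:.4f}")
# For Beta(2, 2) posteriors the exact value is 5/16 = 0.3125.
```

Note that the code never generates feature vectors: only the maximum posterior per instance enters the estimate, mirroring the instance-free property described above.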
The method has desirable statistical properties, such as consistency and unbiasedness (or asymptotic unbiasedness), even in more realistic scenarios, for example when only noisy soft labels*2 can be collected (labels whose value is not a definite 1 or 0 for positive or negative, but a degree such as 0.5 or 0.75), or when multiple hard labels can be collected per instance.
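The multiple-hard-labels scenario can be illustrated with a simple plug-in construction. This is a sketch under assumed Beta-distributed posteriors, not the exact estimator from the research: each instance's soft label is approximated by the mean of m hard labels before applying the binary formula E[1 − max(p, 1−p)].

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sketch: collect m independent hard labels per instance,
# approximate each soft label by the label mean, and plug it into the
# binary Bayes error formula. The plug-in is biased for finite m
# (the max is convex), but the bias shrinks as m grows, consistent with
# the asymptotic-unbiasedness property mentioned in the text.
n, m = 20_000, 100
p = rng.beta(2.0, 2.0, size=n)                          # true posteriors P(y=1|x)
hard = (rng.random((n, m)) < p[:, None]).astype(float)  # m hard labels each
p_hat = hard.mean(axis=1)                               # empirical soft labels

plugin_estimate = float(np.mean(1.0 - np.maximum(p_hat, 1.0 - p_hat)))
print(f"plug-in estimate from {m} hard labels per instance: {plugin_estimate:.4f}")
```

With these assumed posteriors the exact Bayes error is 5/16 = 0.3125; the plug-in value lands slightly below it for finite m, and increasing m tightens the gap.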
Experiments using artificial data demonstrated that the true Bayes error can be accurately estimated when soft labels are given, when noisy soft labels are given, and when positive confidence is given. For example, Fig. 1 shows that accurate estimation is achievable even with a very small quantity of data.
The research also estimated the Bayes error for more realistic datasets, such as benchmark datasets commonly used in machine learning. The performance achieved by recent models was close to the estimated Bayes error, suggesting that the best achievable performance has nearly been reached on these datasets. Other attempts in this research include applications such as estimating the difficulty of international academic conferences using the Bayes error.
As for regularization methods, the researcher is studying a new method for training classifiers using soft labels and has proposed a regularization method for neural networks.
Exploring methods to estimate the best achievable performance for various metrics
As described above, this research successfully proposed a method for estimating the Bayes error in binary classification, and going forward, it will seek to expand the scope further. The methodology has already been adapted by others to the multi-class classification case and has found applications in other settings. Future work will explore the potential of this approach for various evaluation metrics.