
HSE Researchers Develop Method to Verify Reliability of Computer-Based Cancer Recurrence Prediction

Research by a collaborative team from HSE University's Faculty of Biology and Biotechnology, Moscow State University, and the Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry of the Russian Academy of Sciences has been officially published in the international journal Stat. The study addresses a critical challenge in biomedicine: determining whether machine learning algorithms identify genuine biological patterns or merely overfit to random noise in data.

A "non-random classifier" based on IGFBP6 and ELOVL5 gene expression.
Image: a team of authors

The Urgent Need for Algorithm Verification

Contemporary clinical research frequently relies on gene expression analysis to forecast disease progression and guide treatment decisions. Researchers often work with gene pairs and construct linear classifiers based on them—computational models that assign individual patients to specific groups. However, when the number of analyzed gene pairs grows large, a serious methodological problem emerges: does the algorithm genuinely detect differences between patient populations, or has it simply stumbled upon a random combination that happens to work on the available dataset?
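As a toy illustration, a gene-pair linear classifier is simply a line in the plane of two expression values: patients fall on one side or the other. The weights, threshold, and expression values below are invented for the sketch and are not the published model:

```python
def linear_classifier(w1, w2, b):
    """A linear classifier on a gene pair: a patient is assigned to a group
    according to which side of the line w1*x + w2*y + b = 0 their
    (gene1, gene2) expression point falls on. Weights are illustrative."""
    def predict(expr1, expr2):
        score = w1 * expr1 + w2 * expr2 + b
        return "recurrence" if score > 0 else "no recurrence"
    return predict

# hypothetical expression values (arbitrary units), illustrative weights
predict = linear_classifier(w1=1.0, w2=-1.0, b=0.0)
print(predict(7.2, 3.1))  # prints "recurrence"
print(predict(2.0, 6.5))  # prints "no recurrence"
```

The danger the study addresses is precisely that, with enough candidate gene pairs, some such line will fit any dataset well by accident.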

"Consider screening 570 gene pairs searching for markers of breast cancer recurrence. Even without true underlying patterns, some pairs would randomly separate the data perfectly," explains Anton Zhiyanov, a researcher at the Laboratory of Molecular Physiology. "We needed a statistical procedure to distinguish such lucky accidents from authentic biomarkers."

Mathematical Framework for Verification

The authors developed a test grounded in probabilistic theory of linear separability. The core principle: if two samples differ biologically, the probability of their accidental linear separability should be extremely low. Conversely, if an algorithm operates by chance, this probability remains high even on synthetic data devoid of biological meaning.
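This intuition is easy to reproduce: draw two small "patient groups" from the same distribution, so that there is no biological signal at all, and count how often some line still separates them perfectly. The brute-force separability check below is a standard geometric test for the two-dimensional case, not the authors' implementation:

```python
import random

def separable(A, B):
    """Exact 2D test: A and B are (weakly) linearly separable iff some line
    through two points of the combined sample puts all of A in one closed
    half-plane and all of B in the other."""
    pts = A + B
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            (x1, y1), (x2, y2) = pts[i], pts[j]
            nx, ny = -(y2 - y1), (x2 - x1)   # normal to the line through i, j
            c = nx * x1 + ny * y1
            sa = [nx * x + ny * y - c for x, y in A]
            sb = [nx * x + ny * y - c for x, y in B]
            if (all(s >= 0 for s in sa) and all(s <= 0 for s in sb)) or \
               (all(s <= 0 for s in sa) and all(s >= 0 for s in sb)):
                return True
    return False

random.seed(0)
trials, n = 2000, 5
hits = 0
for _ in range(trials):
    # both "groups" come from the SAME 2D normal distribution:
    # any separating line found here is pure chance
    A = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
    B = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
    hits += separable(A, B)
print(f"chance separability with {n}+{n} points: {hits / trials:.2f}")
```

With samples this small, a noticeable fraction of trials is perfectly separable despite the complete absence of signal, which is exactly why a formal test is needed.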

The researchers mathematically derived upper bounds for p-values, statistical measures of how likely the observed separation is to arise purely by chance. Special attention was given to the two-dimensional case and normally distributed data, yielding practically applicable formulas. The algorithms were implemented in C++ using parallel computing, ensuring scalability to high-dimensional datasets.
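The paper's own bounds are not reproduced here, but a classical counting result in the same spirit (Cover, 1965) conveys the scale of the problem: for points in general position, the fraction of labelings realizable by an affine hyperplane is easy to compute:

```python
from math import comb

def cover_separable_fraction(n_points, dim):
    """Cover (1965): of the 2**n possible labelings of n points in general
    position in R^dim, exactly 2 * sum_{k=0}^{dim} C(n-1, k) are realizable
    by an affine hyperplane (a line when dim == 2)."""
    realizable = 2 * sum(comb(n_points - 1, k) for k in range(dim + 1))
    return realizable / 2 ** n_points

# with only 10 samples in the plane, ~9% of random labelings separate perfectly
print(f"{cover_separable_fraction(10, 2):.3f}")  # prints 0.090
```

The study's bounds are sharper than this background result, since they cover fixed group sizes and normally distributed data, but the qualitative message is the same: in low dimensions with small samples, perfect separation is cheap.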

Striking Results on Real Medical Data

When the team applied the new test to 570 candidate gene-pair markers of breast cancer recurrence, the findings proved sobering: 559 of the 570 proposed classifiers failed statistical verification. This means that the vast majority of models displaying high accuracy on the original data relied on random coincidences rather than true biological differences between patients.
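A generic sketch (not the authors' procedure) shows why uncorrected screening over 570 candidates misleads: even when every null hypothesis is true, a naive 0.05 threshold flags dozens of "markers", while a family-wise correction such as Bonferroni flags almost none:

```python
import random

random.seed(1)
m, alpha = 570, 0.05

# under the null (no real biomarkers), p-values are uniform on [0, 1]
pvals = [random.random() for _ in range(m)]

naive = sum(p < alpha for p in pvals)            # no correction
bonferroni = sum(p < alpha / m for p in pvals)   # family-wise error control

print(f"naive 'hits' among {m} null tests: {naive}")
print(f"after Bonferroni correction: {bonferroni}")
```

The study's contribution is an analytic bound tailored to linear separability rather than a generic correction, but the failure mode it guards against is the one simulated here.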

"This discovery illustrates the magnitude of the multiple testing problem in biomedicine," notes Alexander Tonevitsky, Dean of the Faculty of Biology and Biotechnology. "Without proper verification, researchers and clinicians may draw erroneous conclusions and recommend ineffective treatment strategies."

Identifying Authentic Biomarkers

However, the analysis also revealed classifiers meeting statistical criteria. A model based on the ELOVL5 and IGFBP6 gene pair, previously developed by HSE University researchers, merits particular attention. This combination not only demonstrated statistical significance according to the proposed test but was successfully validated on an independent dataset from The Cancer Genome Atlas (TCGA). Differences in expression levels of these genes genuinely correlated with recurrence risk, confirming the classifier's biological validity.

Practical Impact for Medicine and Biology

The developed approach provides an essential tool for the critical evaluation of algorithms in biological and clinical research. It enables:

  • Differentiating valid findings from multiple testing artifacts
  • Strengthening the evidence base for gene signature-based clinical decisions
  • Avoiding costly, unnecessary investigations based on unreliable predictions
  • Directing resources toward markers with confirmed statistical significance

Open Access to Tools

The study is published open access in the journal Stat. All computational algorithms are available in an open GitHub repository (https://github.com/zhiyanov/random-classifier), allowing the scientific community to apply the developed methods to their own investigations and datasets.

The work was supported by the HSE University Basic Research Program within the "Centers of Excellence" initiative and exemplifies the importance of integrating mathematical and biological approaches in addressing pressing medical challenges.