Deep Learning in Medical Imaging with Small Datasets

Akhil Konda
Apr 11, 2021

Deep learning (DL) has applications in diverse domains such as computer vision, natural language processing, credit underwriting, and finance. One of its most interesting and impactful applications is in healthcare, where it is being applied to medical imaging, drug discovery, and extracting valuable information from EHRs (electronic health records). In this blog post, we will focus on DL applications in medical imaging. To be specific, we review two papers that found clever ways to overcome the problem of small medical imaging datasets.

Deep learning requires massive amounts of training data to learn the underlying distribution and generalize to unseen data. Such large training datasets are typically created through human annotation or from web images with noisy labels such as hashtags. We cannot follow a similar approach in medical imaging: manually annotating medical images requires trained professionals with many years of domain experience, and the resulting datasets are highly imbalanced, with very few images containing a disease or abnormality. The following papers tackle these issues in different tasks.

The first paper is “Curriculum Learning Strategy to Classify various Lesions in Chest-PA X-ray Images” [1]. Its goal is to detect and localize various pulmonary abnormalities, including nodules, consolidation, interstitial opacity, pleural effusion, and pneumothorax, in chest-PA X-ray (CXR) images. The authors employed a two-step curriculum learning strategy to learn abnormality patterns for lesion detection and localization. CXR images of 6,069 healthy subjects and 3,417 patients at AMC (Asan Medical Center) and 1,035 healthy subjects and 4,404 patients at SNUBH (Seoul National University Bundang Hospital) were used.

To handle the problem of small datasets, the authors employed a curriculum learning strategy. The idea behind curriculum learning is to gradually train the model from simple to complex tasks. There are too few images to train a large model from scratch, and training on full-resolution images is slow, so the authors first trained the model on small patches of X-ray images with and without abnormalities. By training on small patches, the model learns to identify and localize abnormal patterns, which can later be reused for whole-image classification.

Below is a diagram of how the patches were generated.

Patch generation with/without abnormality
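
To make the idea concrete, here is a minimal sketch of how one such patch might be cut from an annotated image. The patch size and the rule of centering on a lesion pixel are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def sample_patch(image, lesion_mask, patch_size=224, abnormal=True, rng=None):
    """Cut one square patch from a 2-D CXR image.

    `patch_size` and the centering rule are illustrative assumptions;
    the mask is assumed to contain at least one pixel of the chosen type.
    """
    rng = rng or np.random.default_rng()
    h, w = image.shape
    half = patch_size // 2
    mask = lesion_mask.astype(bool)
    # Pick a random pixel of the requested type to center the patch on.
    ys, xs = np.nonzero(mask if abnormal else ~mask)
    i = rng.integers(len(ys))
    cy = int(np.clip(ys[i], half, h - half))
    cx = int(np.clip(xs[i], half, w - half))
    return image[cy - half:cy + half, cx - half:cx + half]
```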

In the second step, the model was fine-tuned on entire images. A ResNet-50 architecture (pre-trained on ImageNet) was modified and trained to detect the various disease patterns. ResNet normally uses a softmax activation for multi-class classification, but here a single CXR image can contain multiple lesions, making this a multi-label problem; the model was therefore modified to use an independent sigmoid activation per lesion class.
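
As a rough illustration of this modification (not the authors' exact code), here is how one might adapt an ImageNet-pretrained ResNet-50 in PyTorch for multi-label lesion classification:

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_LESIONS = 5  # nodule, consolidation, interstitial opacity, effusion, pneumothorax

# Start from an ImageNet-pretrained ResNet-50 and replace the final
# 1000-way softmax classifier with one logit per lesion type.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_LESIONS)

# BCEWithLogitsLoss applies an independent sigmoid to each output, so a
# single image can be positive for several lesions at once (multi-label).
criterion = nn.BCEWithLogitsLoss()

logits = model(torch.randn(2, 3, 224, 224))        # (batch, NUM_LESIONS)
targets = torch.tensor([[1., 0., 0., 1., 0.],
                        [0., 0., 0., 0., 0.]])     # multi-hot labels
loss = criterion(logits, targets)
probs = torch.sigmoid(logits)                      # per-lesion probabilities
```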

Below is the diagram showing the entire process of training.

(1) Training on patch images, (2) fine-tuning with entire images, (3) localization using a class activation map

Original CXR images have a high resolution of about 2000 × 2000 pixels. The ResNet model was pre-trained on general natural images, and such large inputs would mismatch its receptive field, so all images were resized to a fixed 1024 × 1024 pixels using bilinear interpolation. Sample-wise standardization was performed by subtracting each image's average pixel value. In addition, preprocessing techniques such as sharpening and blurring were randomly applied to images during training. Because X-ray images produced by different manufacturers exhibit different patterns, sharpening and blurring help make the model robust to these differences.
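
A minimal sketch of this preprocessing pipeline, assuming OpenCV; the specific blur and sharpen kernels below are my own illustrative choices, since the paper does not specify them:

```python
import numpy as np
import cv2

def preprocess(image, train=True, rng=None):
    """Resize, standardize, and randomly sharpen/blur one CXR image.

    The 1024x1024 bilinear resize and mean subtraction follow the paper;
    the blur/sharpen kernels are illustrative assumptions.
    """
    rng = rng or np.random.default_rng()
    img = cv2.resize(image.astype(np.float32), (1024, 1024),
                     interpolation=cv2.INTER_LINEAR)
    img -= img.mean()  # sample-wise standardization
    if train:
        choice = rng.integers(3)
        if choice == 0:    # blur: smooths away scanner-specific sharpness
            img = cv2.GaussianBlur(img, (5, 5), 1.0)
        elif choice == 1:  # sharpen: simple unsharp-style kernel
            kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]],
                              dtype=np.float32)
            img = cv2.filter2D(img, -1, kernel)
        # choice == 2: leave the image unchanged
    return img
```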

Results:

To evaluate performance, the curriculum learning-based model was compared with a baseline model trained directly on entire images.

Training curves of the curriculum learning-based model and the baseline model on the tuning dataset

The curriculum learning-based model converged faster than the baseline model, and both its accuracy and loss were better. Hence, even with a small dataset, the authors were able to train a deep learning model to higher accuracy in far fewer epochs by using a curriculum strategy built on smaller patches.

Next, we take a look at another paper that follows a similar approach.

Paper-2:

In the paper titled “Detecting Cancer Metastases on Gigapixel Pathology Images” [2], the authors propose an ingenious approach to automatically detect and localize tumors in images of roughly 100,000 × 100,000 pixels. Usually, identifying and localizing tumor regions is done by pathologists who carefully scan these large gigapixel pathology images, which is time-consuming and error-prone. The researchers start with a relatively small dataset called Camelyon16, which contains 270 such images.

Similar to the method in Paper 1, the authors trained their models on small patches of the images. Patches tackle two problems: (1) the dataset is very small (only 270 images), and (2) at 100,000 × 100,000 pixels, the images are far larger than those models are normally trained on. Hence, by extracting patches, we not only generate a lot of training data but also make it easier for the model to localize and detect tumors.

Below is an example of various patches from the original image.

For each input patch, they predicted the label of the center 128 × 128 region, labeling the patch as tumor if at least one pixel in that center region was annotated as tumor. Because there were few tumor patches per image, they devised a sampling method to avoid bias, sketched below. First, they choose uniformly at random whether to draw a tumor or a normal patch. Second, they choose, uniformly at random, an image that contains such a patch, and then pick a patch of the required type at random from it. They also increase the training data with augmentations such as rotations, flips, and changes to the brightness, hue, and contrast of the image.
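
A minimal sketch of this two-stage sampling, assuming a simple dictionary of candidate patch coordinates per slide (the data layout here is an assumption):

```python
import random

def sample_training_patch(slides):
    """Draw one class-balanced training patch, as described above.

    `slides` maps slide_id -> {"tumor": [...], "normal": [...]}, lists of
    candidate patch coordinates; this layout is an assumed structure.
    """
    # Step 1: pick the class uniformly, so tumor patches are not swamped
    # by the far more numerous normal ones.
    label = random.choice(["tumor", "normal"])
    # Step 2: pick, uniformly, a slide that has at least one such patch;
    # this keeps slides with many patches from dominating training.
    candidates = [s for s in slides if slides[s][label]]
    slide_id = random.choice(candidates)
    # Step 3: pick a patch of that type at random within the slide.
    coords = random.choice(slides[slide_id][label])
    return slide_id, label, coords
```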

After generation, the patches are fed into a standard InceptionV3 architecture to train the model. To evaluate the model, they run inference over patches in a sliding window across the slide (similar to how a pathologist might scan across an image with a microscope), generating a tumor-probability heatmap. For each patch, they first apply 8 different augmentations and average the predictions. For each slide, they then report the maximum value in the heatmap as the slide-level tumor prediction.
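
Here is a rough sketch of this inference procedure; the patch size, stride, heatmap grid, and model interface are illustrative assumptions rather than the paper's exact tiling details:

```python
import numpy as np
import torch

def predict_slide(model, slide, patch=299, stride=128, device="cpu"):
    """Tumor-probability heatmap via a sliding window with 8-fold TTA.

    `slide` is an H x W x 3 array; patch size and stride are assumptions.
    """
    model.eval()
    h, w = slide.shape[:2]
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heatmap = np.zeros((rows, cols), dtype=np.float32)
    with torch.no_grad():
        for r in range(rows):
            for c in range(cols):
                tile = slide[r*stride:r*stride+patch, c*stride:c*stride+patch]
                # 8 dihedral augmentations: 4 rotations x optional flip.
                views = []
                for k in range(4):
                    rot = np.rot90(tile, k)
                    views.extend([rot, np.fliplr(rot)])
                batch = torch.stack([
                    torch.from_numpy(np.ascontiguousarray(v))
                         .permute(2, 0, 1).float() for v in views
                ]).to(device)
                probs = torch.sigmoid(model(batch))
                heatmap[r, c] = probs.mean().item()  # average the 8 views
    # Slide-level score: maximum tumor probability anywhere on the slide.
    return heatmap, float(heatmap.max())
```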

Results:

The authors trained the model with stochastic gradient descent in TensorFlow, using 8 replicas, each running on an NVIDIA Pascal GPU with asynchronous gradient updates and a batch size of 32 per replica. They used the RMSProp optimizer with momentum 0.9 and decay 0.9. The initial learning rate was 0.05, decayed by 0.5 every 2 million examples. For fine-tuning the ImageNet-pretrained model, the researchers used an initial learning rate of 0.002.
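
For illustration, here is a rough PyTorch equivalent of that configuration (the original used TensorFlow; mapping RMSProp's decay term to PyTorch's `alpha`, and converting the 2M-example decay interval to steps at a global batch of 256, are my assumptions):

```python
import torch
from torchvision import models

# Any network would do here; InceptionV3 matches the paper's architecture.
model = models.inception_v3(weights=None)

# RMSProp with momentum 0.9; PyTorch calls RMSProp's decay term `alpha`.
optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=0.05,      # initial learning rate (0.002 when fine-tuning from ImageNet)
    momentum=0.9,
    alpha=0.9,
)

# Halve the learning rate every 2M examples. With a global batch of
# 8 replicas x 32 = 256 examples per step, that is ~7812 steps.
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, step_size=2_000_000 // 256, gamma=0.5
)
```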

The models were evaluated on two metrics: (1) AUC and (2) FROC. AUC (area under the receiver operating characteristic curve) was used to evaluate whole-image classification. FROC (free-response receiver operating characteristic) was used to evaluate tumor detection and localization. FROC is similar to ROC, except that the false positive rate on the x-axis is replaced by the average number of false positives per slide. To be precise, the reported FROC score is the average sensitivity at 0.25, 0.5, 1, 2, 4, and 8 average false positives per tumor-negative slide. The authors focused more on the FROC metric than on AUC because there were roughly twice as many tumors as slides (pathology images), making the metric more reliable.
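
Given sensitivities measured at those operating points, the FROC score reduces to a simple average. A minimal sketch, with purely hypothetical numbers:

```python
import numpy as np

def froc_score(sensitivity_at_fp):
    """Camelyon16-style FROC: mean sensitivity at fixed FP rates.

    `sensitivity_at_fp` maps average false positives per tumor-negative
    slide to the sensitivity achieved at that operating point.
    """
    fp_rates = [0.25, 0.5, 1, 2, 4, 8]
    return float(np.mean([sensitivity_at_fp[fp] for fp in fp_rates]))

# Hypothetical operating points, for illustration only:
print(froc_score({0.25: 0.70, 0.5: 0.78, 1: 0.85, 2: 0.89, 4: 0.91, 8: 0.92}))
```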

At 8 false positives per image, the model detects 92.4% of the tumors, compared with 82.7% for the previous best automated approach. For comparison, a human pathologist attempting exhaustive search achieved 73.2% sensitivity.

Generally, models pre-trained on ImageNet are used for medical images because the datasets are small. Here, however, the authors were able to do better than a pre-trained model (40X vs. 40X-pretrained) because patch generation and data augmentation gave them enough data to train without pre-trained weights. Since the pre-trained model was trained on a dataset from a different domain (ImageNet), it is perhaps unsurprising that it was no better than a model trained entirely on pathology images. To improve turnaround time, the authors also experimented with different model sizes and found that a small model (40X-small, 300K parameters) achieved accuracy similar to Inception (40X, 20M parameters). Finally, taking inspiration from the way pathologists examine an image at different magnification levels to gain context, they tried a multi-scale approach, but found no performance improvement when the 40X model was combined with lower-magnification inputs, as in the picture below.

The three colorful blocks represent InceptionV3 towers up to the second-to-last layer. Single-scale uses one tower with input images at 40X magnification; multi-scale uses inputs at several magnifications, fed to separate towers and merged
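
A minimal PyTorch sketch of this two-tower idea; merging the pooled features by concatenation is an assumption, as the figure does not spell out the merge:

```python
import torch
import torch.nn as nn
from torchvision import models

class MultiScaleTumorNet(nn.Module):
    """Sketch of the multi-scale variant: one Inception tower per input
    magnification, pooled features merged before a shared classifier.
    The 2048-d feature size is InceptionV3's; concatenation is assumed."""

    def __init__(self):
        super().__init__()
        self.tower_40x = self._make_tower()
        self.tower_20x = self._make_tower()
        self.head = nn.Linear(2 * 2048, 1)  # merged features -> tumor logit

    @staticmethod
    def _make_tower():
        net = models.inception_v3(weights=None, aux_logits=False)
        net.fc = nn.Identity()  # keep the 2048-d pooled features
        return net

    def forward(self, patch_40x, patch_20x):
        # Each tower sees the same tissue location at its own magnification;
        # InceptionV3 expects 299 x 299 inputs.
        merged = torch.cat([self.tower_40x(patch_40x),
                            self.tower_20x(patch_20x)], dim=1)
        return self.head(merged)
```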

The heatmaps generated for tumor localization (shown below) are smoother with the multi-scale approach.

Left to right: sample image, ground truth (tumor in white), and heatmap outputs (40X-ensemble-of-3, 40X+20X, and 40X+10X). The heatmaps of 40X and 40X-ensemble-of-3 look identical. The red circular regions in the bottom-left quadrant of the heatmaps are unannotated tumors.

As in previous approaches, the authors experimented with color normalization but found no performance improvement, most likely because the models learn features that are invariant to color. Finally, the authors also experimented with ensembling in two ways. First, averaging predictions across the 8 rotations and flips yielded a few percentage points of improvement. Second, ensembling independently trained models improved performance, but not significantly, and they found that adding more than 3 models to the ensemble actually decreased performance.

One of the interesting findings was that the model was resilient to noise in the training labels. The authors discovered that two slides in the Camelyon16 training set were erroneously labeled normal; even with these mislabeled images in the training data, the model still achieved high accuracy on the validation and test data.

The authors also performed additional validation by testing the models on another 110 pathology images that were digitized on different scanners, came from different patients, and were obtained using various tissue-preparation protocols. The model's performance on this unseen dataset was exactly in line with its performance on the Camelyon16 test set, with an AUC of 97.6%.

All of the experiments described above strongly indicate the effectiveness of the patching technique employed to train the deep learning model.

Conclusion:

We briefly discussed two papers that propose clever ways to train large deep learning models on small medical imaging datasets. Both papers used patch generation and various data augmentations to combat the problem of small datasets, and the authors showed that these steps improved not only the accuracy of their models but also their training time.

References:

[1] “Curriculum Learning Strategy to Classify various Lesions in Chest-PA X-ray Images,” Scientific Reports. https://www.nature.com/articles/s41598-019-51832-3

[2] “Detecting Cancer Metastases on Gigapixel Pathology Images,” arXiv:1703.02442. https://arxiv.org/abs/1703.02442

