RadBERT-CL: Factually-Aware Contrastive Learning For Radiology Report Classification

Nov 15, 2021

Paper information

Venue: Machine Learning for Health (ML4H) 2021 (paper link)
Affiliation: The University of Texas at Austin

Motivation

BERT and BlueBert do not handle uncertainty and negation well.

Key points

Data augmentation
- Needed because the method relies on contrastive pre-training
Relies heavily on Wu et al. (2020a)

Fang and Xie (2020) apply sentence-level back-translation to positive samples
Wu et al. (2020a) combine four sentence-level augmentations:
- Word deletion
- Span deletion
- Reordering
- Synonym substitution
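
The four operations listed above can be sketched on a token list as follows. This is a minimal illustration, not the implementation from Wu et al. (2020a): the function names, the probabilities, the span lengths, and the tiny `SYNONYMS` table are all assumptions made for the example.

```python
import random

# Hypothetical toy synonym table; a real system would use a proper lexical resource.
SYNONYMS = {"small": ["tiny"], "enlarged": ["increased"]}

def word_deletion(tokens, p, rng):
    """Drop each token independently with probability p, keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept or [rng.choice(tokens)]

def span_deletion(tokens, span_len, rng):
    """Remove one contiguous span of span_len tokens."""
    if len(tokens) <= span_len:
        return list(tokens)
    start = rng.randrange(len(tokens) - span_len + 1)
    return tokens[:start] + tokens[start + span_len:]

def reordering(tokens, span_len, rng):
    """Shuffle the tokens inside one randomly chosen span."""
    span_len = min(span_len, len(tokens))
    start = rng.randrange(len(tokens) - span_len + 1)
    span = tokens[start:start + span_len]
    rng.shuffle(span)
    return tokens[:start] + span + tokens[start + span_len:]

def synonym_substitution(tokens, p, rng):
    """Replace a token by a synonym with probability p when one is known."""
    out = []
    for t in tokens:
        options = SYNONYMS.get(t.lower())
        out.append(rng.choice(options) if options and rng.random() < p else t)
    return out

rng = random.Random(0)
tokens = "small right pleural effusion is noted".split()
print(span_deletion(tokens, 2, rng))
print(synonym_substitution(tokens, 1.0, rng))
```

Passing an explicit `random.Random` instance keeps the augmentations reproducible, which helps when debugging which views were generated.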

Method

The improvement targets the pre-training stage
- Contrastive pre-training is used to strengthen the model's factual awareness on the task
Data: radiology reports
- MIMIC-CXR (Johnson et al., 2019)
Labels: from CheXpert (Irvin et al., 2019), used during fine-tuning

Data augmentation

Info-preservation module

Corresponds to Section 3.2.1 of the paper
Goal: make sure the augmented reports still contain the important concepts
Approach: rule-based (similar to Dynamic-LCS)
- First lemmatize the augmented reports
- Greedily match the concepts in the RadLex ontology against the augmented reports
  - longer matches are returned when possible
Handling negations and uncertainties
- Negations
  - Build a dictionary of 30 negation expressions
    - not, without, clear of, ruled out, free of, disappearance of, without evidence of, no evidence of, absent, miss, …
- Uncertainty
  - Follows Chen et al., 2018
  - we create a dictionary of uncertainty keywords with a wide range of uncertain types, from speculations to inconsistencies present in the reports
  - This is also dictionary-based, but the paper gives no further details
- Negations and uncertainties are then handled following the NegBio approach
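
To make the greedy matching and the negation dictionary more concrete, here is a small sketch. It assumes the input is already lemmatized and lower-cased. `CONCEPTS` is a hypothetical stand-in for RadLex, `NEGATIONS` holds only a few of the cues listed above, and the fixed look-back window in `is_negated` is a simplification chosen for this example, not how NegBio itself works.

```python
# Hypothetical mini-vocabulary standing in for RadLex concepts (already lemmatized).
CONCEPTS = {("pleural", "effusion"), ("effusion",), ("pneumothorax",), ("edema",)}
# A few negation cues from the list above; the full dictionary has 30 entries.
NEGATIONS = [("no", "evidence", "of"), ("without",), ("free", "of"), ("not",)]
MAX_LEN = max(len(c) for c in CONCEPTS)

def match_concepts(tokens):
    """Greedy left-to-right match, preferring the longest concept at each position."""
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            cand = tuple(tokens[i:i + n])
            if cand in CONCEPTS:
                found.append((i, cand))
                i += n          # skip past the matched concept
                break
        else:
            i += 1              # no concept starts here
    return found

def is_negated(tokens, concept_start, window=4):
    """True if a negation cue appears within `window` tokens before the concept."""
    left = tokens[max(0, concept_start - window):concept_start]
    return any(
        tuple(left[j:j + len(cue)]) == cue
        for cue in NEGATIONS
        for j in range(len(left) - len(cue) + 1)
    )

tokens = "no evidence of pleural effusion or pneumothorax".split()
for start, concept in match_concepts(tokens):
    print(" ".join(concept), "negated" if is_negated(tokens, start) else "affirmed")
```

Trying the longest candidate first is what makes "pleural effusion" win over the shorter "effusion" at the same position.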

Sent-level augmentation

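Sentence segmentation
random word / phrase dropping (Wu et al., 2020)
- The info-preservation module is applied at the same time to make sure the transformed sentence keeps the important concepts
- Sentences that do not match any disease are discarded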
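
The interplay between random dropping and the info-preservation check can be sketched as a generate-and-verify loop. Everything here is illustrative: `CONCEPTS` is a hypothetical concept list, substring matching replaces the real RadLex matching, and the retry loop with `max_tries` is an assumption of this example, since the notes do not say what happens when an augmented sentence loses a concept.

```python
import random

# Hypothetical concept list standing in for the disease / RadLex vocabulary.
CONCEPTS = ["pleural effusion", "pneumothorax", "edema"]

def concepts_in(sentence):
    """Return the set of known concepts mentioned in the sentence."""
    text = sentence.lower()
    return {c for c in CONCEPTS if c in text}

def drop_words(sentence, p, rng):
    """Randomly drop each word with probability p."""
    return " ".join(w for w in sentence.split() if rng.random() >= p)

def augment_sentence(sentence, p=0.2, max_tries=10, seed=0):
    """Return an augmented sentence that keeps every original concept, or None."""
    rng = random.Random(seed)
    original = concepts_in(sentence)
    if not original:            # sentence mentions no disease: discard it
        return None
    for _ in range(max_tries):
        candidate = drop_words(sentence, p, rng)
        # Accept only a changed sentence whose concepts are fully preserved.
        if candidate != sentence and concepts_in(candidate) == original:
            return candidate
    return None

print(augment_sentence("there is a small right pleural effusion"))
print(augment_sentence("heart size is normal"))  # no concept, so None
```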

Doc-level augmentation (report 層級)

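Preprocessing:
- Remove extra whitespace and line breaks
- Remove unwanted words
Word deletion / Span deletion / Reordering / Synonym substitution (Wu et al., 2020)
- The info-preservation module is applied at the same time to make sure the transformed report keeps the important concepts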

Contrastive pre-training

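Uses the SimCLR (Chen et al., 2020) framework
The paper proposes three algorithms that differ in their data augmentation strategy, summarized below:

	Algorithm 1	Algorithm 2	Algorithm 3
Negation/Uncertainty	-	Positives only (disease confirmed present)	Both kept (disease present + disease absent)
Level	Doc-level	Sent-level	Sent-level
Same view	Reports of the same patient	Sentences about the same disease	Sentences stating the same fact (the disease is confirmed present)
Different view	Reports of different patients	Sentences about different diseases	Sentences stating different facts (the disease is absent, or a different disease is present)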
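
SimCLR trains with the NT-Xent (normalized temperature-scaled cross-entropy) loss. The sketch below is a plain-Python version for N positive pairs, where each anchor has exactly one positive and all other embeddings in the batch act as negatives. It is meant only to show the shape of the objective: the function names and the temperature value are assumptions, and it does not model how the three algorithms above choose their positive and negative views.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nt_xent(views_a, views_b, temperature=0.5):
    """NT-Xent loss for N positive pairs (views_a[i], views_b[i])."""
    z = list(views_a) + list(views_b)       # 2N embeddings
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)             # index of the positive partner
        # Similarities to every other embedding (the anchor itself is excluded).
        logits = [cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        total += math.log(sum(math.exp(l) for l in logits)) - pos_logit
    return total / (2 * n)

# Toy 2-d embeddings: the loss is lower when the paired views point the same way.
a = [[1.0, 0.0], [0.0, 1.0]]
print(nt_xent(a, [[1.0, 0.1], [0.1, 1.0]]))   # well-aligned pairs
print(nt_xent(a, [[0.1, 1.0], [1.0, 0.1]]))   # mismatched pairs
```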

模型

The encoder $f$ is BlueBert (Peng et al., 2019)
View 1 and view 2 are the original report and its augmented version
Each view is passed through $f$, and the output at the CLS token is taken
The projection head $g$ is a two-layer MLP with hidden size 768
- It includes ReLU and batch normalization (BN)
- CLS token => BN => MLP => ReLU => BN => MLP
The projection head $g$ is removed during fine-tuning
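
Written as a formula, the chain above reads as follows. This is one interpretation, under the assumption that each "MLP" step is a single linear layer with weight matrices $W_1$ and $W_2$ (biases omitted) and that $h$ denotes the CLS output of the encoder $f$ for an input view $x$:

$$
h = f(x)_{\mathrm{CLS}} \in \mathbb{R}^{768}, \qquad
z = g(h) = W_2 \, \mathrm{BN}\big(\mathrm{ReLU}\big(W_1 \, \mathrm{BN}(h)\big)\big)
$$

The contrastive loss is computed on $z$, while fine-tuning uses $h$ directly because $g$ is dropped.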

Experiments

Table 4

Compares the scores of different models on the 687-report test set
In terms of average score: Algorithm 3 > Algorithm 2 > Algorithm 1 > CheXbert > CheXpert

Table 5

Compares the effect of different pre-trained models
Pre-training with contrastive learning yields a gain of at least 6 percentage points

Table 6

Computes cosine similarity on a few example sentences to verify RadBERT-CL's ability to discriminate facts
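
The kind of check described for Table 6 can be sketched as below. The vectors are toy values invented for this example, not embeddings produced by RadBERT-CL or any other model, and the sentences are hypothetical. The sketch only shows how the comparison would be set up once real sentence embeddings are available.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-d vectors, invented for illustration; they are NOT model outputs.
emb = {
    "there is pleural effusion": [0.9, 0.1, 0.2],
    "pleural effusion is seen": [0.8, 0.2, 0.1],
    "no pleural effusion": [-0.7, 0.3, 0.2],
}
anchor = emb["there is pleural effusion"]
for sentence, vec in emb.items():
    print(f"{sentence!r}: {cosine_similarity(anchor, vec):.3f}")
```

A factually aware encoder would be expected to place the two affirmative sentences close together and the negated one further away, which is the pattern the toy vectors were chosen to mimic.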