Automatic Extraction of ICD-O-3 Primary Sites from Cancer Pathology Reports.

Title: Automatic Extraction of ICD-O-3 Primary Sites from Cancer Pathology Reports.
Publication Type: Journal Article
Year of Publication: 2013
Journal: AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science
Date Published: 2013
Abstract: Although registry-specific requirements exist, cancer registries primarily identify reportable cases using a combination of particular ICD-O-3 topography and morphology codes assigned to cancer case abstracts, of which free-text pathology reports form a main component. The codes are generally extracted from pathology reports by trained human coders, sometimes with the help of software programs. Here we present results that improve on the state of the art in automatic extraction of 57 generic sites from pathology reports using three representative machine learning algorithms for text classification. We use a dataset of 56,426 reports arising from 35 labs that report to the Kentucky Cancer Registry. Employing unigrams, bigrams, and named entities as features, our methods achieve a class-based micro F-score of 0.9 and macro F-score of 0.72. To our knowledge, this is the best result on extracting ICD-O-3 codes from pathology reports using a large number of possible codes. Given the large dataset we use (compared to other similar efforts), with reports from 35 different labs, we also expect our final models to generalize better when extracting primary sites from previously unseen reports.
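The gap between the micro F-score (0.9) and the macro F-score (0.72) reported in the abstract is typical when a few sites dominate the data: micro-averaging pools all decisions, while macro-averaging weights every class equally, so rare sites that are classified poorly pull the macro score down. The following is a minimal sketch of the two averages for single-label predictions. The function name `micro_macro_f1` and the toy labels are illustrative assumptions, not the authors' code or data.

```python
from collections import Counter


def micro_macro_f1(gold, pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # true class gains a false negative

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F-score.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute one F-score per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical toy data using ICD-O-3 topography codes:
# C50 (breast) is frequent; C34 (lung) and C18 (colon) are rare.
gold = ["C50"] * 8 + ["C34", "C18"]
pred = ["C50"] * 8 + ["C50", "C18"]  # the single C34 report is misclassified

micro, macro = micro_macro_f1(gold, pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case one missed report from a rare class barely moves the micro average but lowers the macro average substantially, which mirrors the pattern in the reported scores.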
Short Title: AMIA Jt Summits Transl Sci Proc