ClearTK: A Framework
for Statistical Biomedical Natural Language Processing
Philip Ogren
Philipp Wetzler
Department of Computer Science
University of Colorado at Boulder
Introduction
(contact philip@ogren.info)
UIMA 101
[Diagram: text enters through a collection reader, passes through analysis engines, and ends at consumers; each stage reads and writes the Common Analysis Structure (CAS).]
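The flow in this diagram can be sketched in a few lines. This is only an illustration in plain Python, not the UIMA API: `Cas`, `Annotation`, `collection_reader`, `whitespace_tokenizer`, `token_count_consumer`, and `run_pipeline` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    begin: int
    end: int
    type: str


@dataclass
class Cas:
    """Toy stand-in for the Common Analysis Structure: text plus annotations."""
    text: str
    annotations: list = field(default_factory=list)


def collection_reader(documents):
    """Yield one CAS per input document."""
    for text in documents:
        yield Cas(text)


def whitespace_tokenizer(cas):
    """Analysis engine: add a Token annotation for each whitespace-separated word."""
    pos = 0
    for word in cas.text.split():
        begin = cas.text.index(word, pos)
        end = begin + len(word)
        cas.annotations.append(Annotation(begin, end, "Token"))
        pos = end


def token_count_consumer(cas, counts):
    """Consumer: record how many Token annotations the CAS holds."""
    counts.append(sum(1 for a in cas.annotations if a.type == "Token"))


def run_pipeline(documents, engines):
    """Reader -> analysis engines -> consumer, all sharing one CAS per document."""
    counts = []
    for cas in collection_reader(documents):
        for engine in engines:
            engine(cas)
        token_count_consumer(cas, counts)
    return counts


counts = run_pipeline(
    ["fibrinogen degradation products", "plasminogen"],
    [whitespace_tokenizer],
)
```

Each stage only sees the shared CAS object, which is what lets stages be swapped or reordered independently.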
Statistical Biomedical Natural Language Processing 101
Training
  Manually annotate a bunch of data
  Extract features from text *
  Write out training data *
  Train a model
Run time
  Extract features from unseen text *
  Classify features with trained model *
  Create annotations
* ClearTK facilitates these tasks
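A minimal sketch of the training and run-time steps listed above, in plain Python. It does not use ClearTK or any real classifier library: the feature names, the one-line-per-token training format, and the vote-counting "model" are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def extract_features(tokens, i):
    """Extract a small feature set for the token at position i (hypothetical features)."""
    token = tokens[i]
    return {
        "lower": token.lower(),
        "has_digit": any(c.isdigit() for c in token),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
    }


def write_training_data(sentences):
    """Return one line per token: the label followed by feature=value pairs."""
    lines = []
    for tokens, labels in sentences:
        for i, label in enumerate(labels):
            feats = extract_features(tokens, i)
            pairs = " ".join(f"{k}={v}" for k, v in sorted(feats.items()))
            lines.append(f"{label} {pairs}")
    return lines


def train(lines):
    """Toy model: count how often each feature=value pair co-occurs with each label."""
    model = defaultdict(Counter)
    for line in lines:
        label, *pairs = line.split(" ")
        for pair in pairs:
            model[pair][label] += 1
    return model


def classify(model, feats):
    """Sum the label counts of every matching feature and return the top label."""
    votes = Counter()
    for k, v in sorted(feats.items()):
        votes.update(model.get(f"{k}={v}", {}))
    return votes.most_common(1)[0][0] if votes else "O"


# Training: manually annotated data (labels here are illustrative).
annotated = [
    (["The", "concentration", "of", "plasminogen"], ["O", "O", "O", "B"]),
    (["plasminogen", "and", "fibrinogen"], ["B", "O", "B"]),
]
model = train(write_training_data(annotated))

# Run time: extract features from unseen text and classify them.
unseen = ["of", "plasminogen"]
predicted = [classify(model, extract_features(unseen, i)) for i in range(len(unseen))]
```

The same `extract_features` function is used in both phases, which is the point of the starred steps: features seen at run time must match the ones written out for training.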
The concentration of alpha 2-macroglobulin, alpha 1-antitrypsin, plasminogen, C3-complement, fibrinogen degradation products (FDP) and fibrinolytic activity...
[Figure: the tokens of this sentence are labeled O, B, or I; the slide shows six O, five B, and four I labels.]
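Per-token B/I/O labels like these are usually decoded back into mention spans. The sketch below assumes the common convention that B begins a mention, I continues it, and O is outside any mention; the function name `bio_to_spans` and the token/label pairing in the example are illustrative and not taken from the slide.

```python
def bio_to_spans(labels):
    """Convert a list of B/I/O labels into (start, end) token spans, end exclusive."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == "B":
            if start is not None:       # a new mention closes the previous one
                spans.append((start, i))
            start = i
        elif label == "I":
            if start is None:           # tolerate an I with no preceding B
                start = i
        else:                           # "O" closes any open mention
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:               # mention running to the end of the input
        spans.append((start, len(labels)))
    return spans


tokens = ["of", "alpha", "2-macroglobulin", ",", "plasminogen"]
labels = ["O", "B", "I", "O", "B"]       # illustrative labeling
spans = bio_to_spans(labels)
mentions = [" ".join(tokens[s:e]) for s, e in spans]
```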
ClearTK Analysis Engine
[Diagram: UIMA CAS (input annotations) -> find foci of analysis -> extract features -> feature set -> classify -> interpret result / create annotations -> UIMA CAS (output annotations). The feature set can also be written out as training data.]
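The steps in this diagram could look roughly like the following in code. This is a plain-Python sketch, not ClearTK's classes: `ToyClassifierAnnotator`, `extract_features`, and the convention that "no classifier means training mode" are assumptions made for the example.

```python
def extract_features(tokens, i):
    """Build a feature set for one focus token (hypothetical features)."""
    token = tokens[i]
    return {"lower": token.lower(), "has_digit": any(c.isdigit() for c in token)}


class ToyClassifierAnnotator:
    """Either collects training data or classifies, depending on configuration."""

    def __init__(self, classifier=None):
        self.classifier = classifier     # callable: feature dict -> label
        self.training_data = []          # (features, gold label) pairs

    def process(self, tokens, gold_labels=None):
        annotations = []
        for i in range(len(tokens)):                      # find foci of analysis
            features = extract_features(tokens, i)        # extract features
            if self.classifier is None:
                # Training mode: write out the feature set with its gold label.
                self.training_data.append((features, gold_labels[i]))
            else:
                label = self.classifier(features)         # classify
                if label != "O":                          # interpret result
                    annotations.append((i, label))        # create annotation
        return annotations


# Training mode: no classifier yet, gold labels supplied.
trainer = ToyClassifierAnnotator()
trainer.process(["alpha", "1-antitrypsin"], ["B", "I"])

# Run-time mode: a stand-in classifier that flags tokens containing digits.
runner = ToyClassifierAnnotator(classifier=lambda f: "B" if f["has_digit"] else "O")
created = runner.process(["alpha", "1-antitrypsin"])
```

Keeping feature extraction identical in both modes means the only difference between training and run time is what happens to the feature set.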
ClearTK Summary
Generated features are understood by all ClearTK classifiers.
Some classifiers make better use of certain features than others (e.g. SVM and maxent differ in how well they handle numeric features), but all of them support the 4 basic feature types.
ClassifierAnnotator
We decided not to integrate training, to avoid complexity. Training is easily done outside of ClearTK and the resulting model is then packaged for use by ClearTK, which is more flexible.
Classification needs just one model file (our own format, a jar) and works the same way for all classifier types.
Sequential classifiers are a special case: they classify a list of samples instead of just one.
We don't implement any classifiers of our own. Good libraries exist and are easy to interface with; we just define a standardized interface suited to NLP purposes in a UIMA context.
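One way to picture a standardized interface over external libraries, including the sequential special case, is the adapter sketch below. None of these names come from ClearTK: `Classifier`, `SequentialClassifier`, `ThresholdBackend`, `ThresholdAdapter`, and `GreedySequenceAdapter` are hypothetical, and `ThresholdBackend` merely stands in for a third-party library.

```python
from abc import ABC, abstractmethod


class Classifier(ABC):
    """Classifies one sample (a feature dict) at a time."""

    @abstractmethod
    def classify(self, features):
        ...


class SequentialClassifier(ABC):
    """Classifies a whole list of samples in one call."""

    @abstractmethod
    def classify_sequence(self, feature_sets):
        ...


class ThresholdBackend:
    """Hypothetical stand-in for an external library with its own input format."""

    def predict_score(self, vector):
        return sum(vector)


class ThresholdAdapter(Classifier):
    """Adapts the backend's vector input to the shared feature-dict interface."""

    def __init__(self, backend, feature_names):
        self.backend = backend
        self.feature_names = feature_names

    def classify(self, features):
        vector = [float(features.get(name, 0)) for name in self.feature_names]
        return "B" if self.backend.predict_score(vector) > 0 else "O"


class GreedySequenceAdapter(SequentialClassifier):
    """Labels samples left to right; a positive sample after B or I becomes I."""

    def __init__(self, base):
        self.base = base

    def classify_sequence(self, feature_sets):
        labels = []
        for features in feature_sets:
            label = self.base.classify(features)
            if label == "B" and labels and labels[-1] in ("B", "I"):
                label = "I"
            labels.append(label)
        return labels


single = ThresholdAdapter(ThresholdBackend(), ["cap"])
sequence = GreedySequenceAdapter(single)
labels = sequence.classify_sequence([{"cap": 0}, {"cap": 1}, {"cap": 1}, {"cap": 0}])
```

Callers depend only on the two abstract interfaces, so a different backend can be substituted by writing another adapter.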
