
Slide 1


file time: 2008-02-16


David A. Campbell M. Phil., Stephen B. Johnson PhD

Department of Medical Informatics, Columbia University

ACL workshop on NLP in the biomedical domain 2002 
 

A Transformation-based Learner for Dependency Grammars in Discharge Summaries

 

Outline 

Introduction
Transformation Based Learning
The Learning Algorithm
Method & Results
Discussion & Conclusion

Introduction 

We pursue lexical acquisition through the syntactic relationships of words in medical corpora. Most language processors require a domain-specific semantic lexicon to function, but building one takes time and is costly. One solution to this bottleneck is to use machine learning to assist in categorizing lexemes into semantic classes.

Introduction-Dependency Grammars 

One approach to semantic categorization is the use of syntactic features. It is based on the assumption that lexemes that share similar syntactic relations to other lexemes in the corpus will be semantically similar. Dependency grammars generate parses in which each word in a sentence is related directly to the word that is its syntactic head.
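As a minimal sketch of that representation, a dependency parse can be stored as one head index per word. The example sentence, its head assignments, and the helper name `head_modifier_pairs` are illustrative assumptions and are not taken from the paper.

```python
# Hypothetical example sentence; the head assignments are illustrative only.
words = ["patient", "denies", "chest", "pain"]
# heads[i] is the index of the syntactic head of words[i]; -1 marks the root.
heads = [1, -1, 3, 1]


def head_modifier_pairs(words, heads):
    """Return (head, modifier) word pairs for every non-root word."""
    return [(words[h], words[i]) for i, h in enumerate(heads) if h != -1]


pairs = head_modifier_pairs(words, heads)
print(pairs)
```

Pairs of this kind are the syntactic features that a later semantic categorization step could consume.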

Introduction-Dependency Grammars 

Three attributes make dependency grammars well suited to this language: The semantics of a word are defined by a feature space of related words. A dependency grammar may be a better fit for parsing medical text. The syntactic grammar of medical English, specifically in discharge summaries, is simpler overall.

Transformation Based Learning 

TBL has been applied to many language learning problems, such as POS tagging, parsing and learning dependency grammars. The goal of TBL is to generate rules which transform the naïve training state into the goal state. It uses templates which describe the environment in the training corpus where a transformation can occur. A scoring function allows the comparison of the training state to the goal state, and the paired template and transformation with the highest score becomes a rule. The final product is an ordered set of rules which can be applied to any un-annotated corpus. It's a good choice for learning a dependency grammar of medical language.

Transformation Based Learning

 

The Learning Algorithm 

Algorithm
Template Design
Transformation
Rule Scoring

The Learning Algorithm-Algorithm

The goal of TBL is to generate rules that change the initial state into the goal state. The essential components include the template design, the transformation, and the scoring system. To improve efficiency, we use the indexed TBL method.
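The way these three components interact can be sketched as a plain greedy loop. This is a toy illustration and not the indexed TBL method used in the paper. The rule format `(tag, offset)`, the naïve start state (each word attached to the word on its left), the example tags, and the function names are all assumptions made for the sketch.

```python
def apply_rule(rule, tags, heads):
    """Return a new head list after applying one (tag, offset) rule."""
    tag, offset = rule
    new_heads = list(heads)
    for i, t in enumerate(tags):
        if t != tag:
            continue
        if offset is None:  # hypothetical "make this word the root" action
            new_heads[i] = -1
        elif 0 <= i + offset < len(tags):
            new_heads[i] = i + offset
    return new_heads


def score(heads, gold):
    """Count the words whose head matches the goal state."""
    return sum(h == g for h, g in zip(heads, gold))


def learn(tags, heads, gold, candidates):
    """Greedily keep, at each iteration, the rule that most improves the score."""
    rules = []
    best_score = score(heads, gold)
    while True:
        best_rule, best_heads = None, None
        for rule in candidates:
            new_heads = apply_rule(rule, tags, heads)
            s = score(new_heads, gold)
            if s > best_score:
                best_rule, best_heads, best_score = rule, new_heads, s
        if best_rule is None:  # no candidate improves the parse: stop
            return rules, heads
        rules.append(best_rule)
        heads = best_heads


# Toy data (assumed): POS tags, goal heads, and a naive left-attachment start.
tags = ["DT", "NN", "VB", "DT", "NN"]
gold = [1, 2, -1, 4, 2]
naive = [-1, 0, 1, 2, 3]
candidates = [(t, o) for t in ("DT", "NN", "VB") for o in (-2, -1, 1, 2, None)]

rules, final = learn(tags, naive, gold, candidates)
print(rules)
print(final == gold)
```

The learned rules come out as an ordered list, so applying them in that order to the same naïve start state reproduces the final parse.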

The Learning Algorithm-Template Design

Triggers are defined by the proximal relationship of two or more parts of speech within a sentence. In order to capture long distance relationships explicitly in a trigger, it would be necessary to expand the vicinity to be searched.  
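One way to picture a trigger is as a small record that says which part of speech to look for near a target word and how far to search. The field names below, the two scope values, and the example tags are assumptions for illustration. The slides mention six template parameters but do not list them, so this sketch does not try to reproduce them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    target_tag: str   # POS of the target word X
    context_tag: str  # POS that must appear near X
    direction: int    # +1 searches to the right of X, -1 to the left
    distance: int     # how many words away to look
    scope: str        # "within" (up to distance) or "exactly_at" (only at distance)


def matching_positions(tags, i, trig):
    """Return the positions near word i that satisfy the trigger."""
    if tags[i] != trig.target_tag:
        return []
    if trig.scope == "exactly_at":
        offsets = [trig.distance]
    else:
        offsets = range(1, trig.distance + 1)
    hits = []
    for off in offsets:
        j = i + trig.direction * off
        if 0 <= j < len(tags) and tags[j] == trig.context_tag:
            hits.append(j)
    return hits


tags = ["DT", "JJ", "NN", "VB", "NN"]
within = Trigger("DT", "NN", +1, 4, "within")
exact = Trigger("DT", "NN", +1, 2, "exactly_at")
within_hits = matching_positions(tags, 0, within)
exact_hits = matching_positions(tags, 0, exact)
print(within_hits, exact_hits)
```

Widening `distance` is what lets a trigger reach long-distance relationships, at the cost of examining more candidate positions.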

The Learning Algorithm-Transformation 

A transformation defines a change to the structure of the sentence. While transformations seem intuitive for POS tagging, they are not as transparent for parsing.

The Learning Algorithm-Rule Scoring 

The rule which produces the best parse for that iteration is the one that is chosen and applied before continuing on to the next iteration. Keeping in mind our goal of generating word-modifier pairs for subsequent machine learning, we chose an aggressive scoring function, counting only correct parent-child relationships.
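A scoring function of this kind can be sketched as a count of words whose predicted head equals the head in the goal parse, summed over all sentences. The function names and the toy head lists below are assumptions. Dividing the count by the number of words gives a dependency accuracy.

```python
def correct_dependencies(predicted, gold):
    """Count words whose predicted head equals the gold head, over all sentences."""
    return sum(p == g for ps, gs in zip(predicted, gold) for p, g in zip(ps, gs))


def dependency_accuracy(predicted, gold):
    """Fraction of words that received the correct head."""
    total = sum(len(gs) for gs in gold)
    return correct_dependencies(predicted, gold) / total


# Toy head lists (assumed): one list per sentence, -1 marks the root.
gold = [[1, -1, 3, 1], [-1, 0]]
predicted = [[1, -1, 1, 1], [-1, 0]]

n_correct = correct_dependencies(predicted, gold)
accuracy = dependency_accuracy(predicted, gold)
print(n_correct, round(accuracy, 3))
```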

Method & Results 

A corpus of 1000 sentences of text from medical discharge summaries was split into 830 sentences for training and 170 sentences for testing. The corpus was first POS tagged using a tagger trained specifically for discharge summaries, then parsed with a dependency grammar, and the TBL learner was allowed to learn rules on the training set. No restriction was placed on sentence length, and three sets of increasingly complex templates were used to learn rules.

Three template sets used

Chart 1 shows the improvement in accuracy gained through larger training sets.

Method & Results

 

Method & Results

 

Discussion  

The greatest drawback to this approach is the computing requirement: a consequence of the complex template design is that a large number of rules need to be kept in memory. It's crucial to incorporate rule pruning in the future to allow larger training sets and more complex templates. The learning algorithm is domain independent, so it could be applied to other text such as radiology reports, pathology reports and progress notes. Keeping in mind our goal of gathering head-modifier pairs for machine learning, a 77% accurate parse is approaching an acceptable parse.

Conclusion 

NLP in the medical domain will be more flexible and portable with assisted lexicon design. Because only a limited amount of training material is needed, the technique can be used on other medical domains without extensive manual parsing.

 

 

We know that NLP is a good way to make a computer understand what people say.

In this paper, I will introduce an NLP method for handling discharge summaries syntactically.

We attempt to acquire the related lexicon through the syntactic relationships of words in a medical corpus.

We require a syntactic parser that is flexible and portable, can capture the important pairs, and doesn't need a large training set.

1

 

2

 

In this paper, we assume that lexemes which share similar syntactic relations are semantically similar.

This idea was also investigated in general language several years ago. Using syntactic relationships to identify word classes should be simpler and more useful in this kind of language.

What is a dependency grammar?

In a sentence, each word is related to the word that is its syntactic head.

In a dependency grammar parse tree, every word except the root has exactly one syntactic head. Figure 1 shows the structure.

3

 

Several attributes of dependency grammars make them ideal for investigating this language.

First, the semantics of a word are defined by a feature space of related words. The meaning of this is straightforward.

Second, a dependency grammar may be a better fit for parsing medical text, because medical text frequently contains missing data, run-on sentences, or improper use of conjunctions. A run-on sentence joins two main clauses without a conjunction or with misused punctuation. All of these are non-standard grammar.

It can be difficult to find traditional phrases, but a dependency grammar may still capture useful syntactic relationships when accurate phrases are absent.

Third, D.G. uses relative syntactic relationships to identify useful structure.
 

4

 

The abbreviation is TBL.

It has been applied in many other areas, such as POS tagging, parsing and, in this study, learning D.G.

The goal of TBL is to generate rules which transform the training set into the goal state. TBL also needs less training data to generate rules than a probabilistic approach.

In order to do this, the algorithm has templates.

Later we'll use a scoring function to evaluate each paired template and transformation by comparing the training state to the goal state. The pair with the highest score becomes the rule.
 
 

5

 

The left tree is the initial state representing this sentence, and we can use the rules to transform it into the goal tree.

So TBL is a good choice for learning a D.G. of medical language.

6

 

This learning algorithm is divided into three main components:

Template design, to find patterns in sentences or in the tree.

Transformation, to define a change of structure.

And lastly scoring, to select the highest-scoring rule.

7

 

To summarize the overall algorithm: TBL needs to find rules that turn the initial parse into the correct parse. First, templates are used to find triggers in the syntactic tree. Once a trigger is found, a transformation is applied to move the parse toward the goal state, the result is scored, and the best-scoring candidate becomes a rule.

8

 

First, let us talk about template design.

We create triggers defined by the proximal relationship of two or more parts of speech within a sentence.

In order to capture long-distance relationships explicitly in a trigger, we must expand the range to be searched.

The six parameters below are the ones we define.

The next slide will make this clearer.

9

 

Look at example 1: X is the target, and the trigger searches the words to the right of X. The search in the tree works the same way.

Look at example 2: because its scope is "exactly at", we focus on only one word, the word at a distance of two words from X.

10

 

Let us look at the two examples directly.

We discuss the left example first, which is POS tagging. We can see all the parts of speech. In general POS tagging, the computer often assigns a wrong tag.

We can use a rule to correct it, like this.

Now the right example, bracket-tree parsing: after the bracket-tree parse, we use the rule to check it, and according to the rule we delete the brackets around "fly" and "on" and merge them with "THE".

11

 

Finally, we use a simple scoring function to evaluate those rules.

At every iteration, it is necessary to evaluate the goodness of the parse.

Many measures of parsing accuracy have been considered, including bracket sensitivity and specificity.
 

12

 

First, the entire corpus was POS tagged and parsed with a dependency grammar, and the TBL learner was allowed to learn rules on the training set.
 

13

 

Chart 1 shows the improvement in accuracy gained through larger training sets.

14

 

The three template sets generate three rule sets, each of which was evaluated on the 170-sentence test set.

Each template set was trained with increasing amounts of the training corpus to measure the effect of training set size on learning accuracy.

Last slide.

The best dependency accuracy and the number of rules generated for each template set are reported in Table 1.

To measure the effect of sentence length on parsing accuracy, the best parser rules were retested on two subsets of the test set.

See Table 2 first.

For all sets of templates, the learner produced a rule-based parser with dependency accuracy exceeding 75% when sentence length was unrestricted.
 

15

 

The third template set generated over 1870000 rules, all of which needed to be stored in memory, but only 240 rules were kept in the rule set.

Because each rule needs to store a list of pointers back to sentences, the size of a rule grows with the size of the training set.

It's crucial to incorporate rule pruning in the future to allow larger training sets and more complex templates.

We can also generate parsers for a number of medical corpora, including radiology reports, pathology reports and progress notes, without rebuilding the method.

16

 

The rules produced were intuitive and understandable. 

17
