Coreference resolution is the task of determining which linguistic expressions refer to the same real-world entity in natural language. Research on coreference resolution in the general English domain dates back to the 1960s and 1970s.

However, research on coreference resolution in clinical free text has not seen major development. Recent US government initiatives that promote the use of electronic health records (EHRs) provide opportunities to mine patient notes as more and more health care institutions adopt EHRs. Our goal was to review recent advances in general purpose coreference resolution to lay the foundation for methodologies in the clinical domain, facilitated by the availability of a shared lexical resource of gold standard coreference annotations, the Ontology Development and Information Extraction (ODIE) corpus.

It shows a wide QRS with a normal rhythm but no delta waves.

It has been widely acknowledged that unstructured clinical narratives are a rich source of information that complements the structured data in the electronic health record (EHR).

Applying natural language processing (NLP) technologies to extract information from the narratives can not only unlock information that is present only in the free-text portion of the EHR but also improve performance when combined with structured data.

Liao et al. Kullo et al. Savova et al. However, to take full advantage of the information in the clinical free text, coreference resolution is an indispensable component. Coreference serves the critical role of linking related information together. Garla et al.

Consider the short snippet in Example 1 (Table 1) from a clinical note. Small focus of invasive grade 2 (of 4) adenocarcinoma (m1) arising in association with a serrated adenoma (m2) with …. The focus of adenocarcinoma (m3) shows invasion into superficial submucosa and is located approximately.

The patient presents with gastrointestinal symptoms including nausea, vomiting. The patient has had symptoms for 10 days. In fact, is having that problem since early pregnancy but worst since 10 days.

Her pain control appears to be adequate with the Tramadol increased to q. She had an arthrogram in We have those films. They show the capsule is tight, and they show the cartilage of the glenoid is present.

I would support continuing speech therapy (m1) for his speech deficit (m2).

Patient fell down a flight of stairs. The incident caused minor hemorrhage.

Attributes, temporal descriptions, and the contextual information needed to understand whether conditions, symptoms, and treatments have occurred or are merely planned are often spread over several sentences or even paragraphs rather than contained in a single sentence, and they require coreference resolution for accurate interpretation.

For example, accurate assignment of attributes to named entities (Examples 2 and 5), accurate assignment of temporal information to an event (Example 3), and distinguishing planned events from events that occurred (Example 4) can only be achieved by resolving the coreferential phrases.

In Example 3, establishing that the chest discomfort occurs at rest and lasts 30 min requires resolving the two highlighted phrases. Resolving the three highlighted phrases in Example 5 is critical, as they hold the other pieces of information together. Only after linking the three phrases can one ascertain that the symptoms of nausea and vomiting occurred earlier but worsened recently.

Armed with a textual coreference resolution system, a higher-level system can resolve coreference between the narrative notes and the structured data to yield a richer picture. For example, such a system can link the detailed prescription and laboratory data from the EHR with the textual mentions in a clinical note.

In Example 6, the start date of the Tramadol and previous dosing information can be retrieved from the structured data. This, however, is beyond the scope of our review, which focuses on methods for coreference resolution in text.

Coreference resolution has long been recognized as a difficult task. Research in the general English domain dates back to the 1960s and 1970s [9, chapter 3].

Various systems, from heuristics-based to statistical ones, have been developed. In particular, efforts have grown over the last two decades, since the 6th and 7th Message Understanding Conferences (MUC) [10, 11] and the Automatic Content Extraction (ACE) program initiated shared tasks on coreference resolution and released their annotated corpora.

However, the clinical domain has not seen major development, which can be partially attributed to the lack of sharable annotated clinical text.

Recent US government initiatives that promote the use of electronic health records provide opportunities to mine patient notes as more and more health care institutions adopt EHRs. In this paper we review the approaches in the general English and biomedical literature domains and discuss the challenges of applying those techniques to the clinical narrative.

Hirst [9] provided a survey of research on anaphora during the early years. The approaches, mostly heuristic-based, have largely been superseded since the 1990s. The shift in research focus from heuristics to statistical and machine learning approaches can be seen in Mitkov [12].

Ng [13] concentrated exclusively on supervised machine learning approaches, which started in the mid-1990s. Our survey is not limited to particular methodologies and focuses on clinical applications.

Formally, coreference consists of two linguistic expressions: antecedent and anaphor. The anaphor is the expression whose interpretation depends on that of the other expression. The antecedent is the linguistic expression on which an anaphor depends.

A broader concept of anaphora includes a pair of linguistic expressions whose relationship does not have to be identity. These linguistic expressions, the antecedents and the anaphors, are collectively called markables in the MUC corpus. Two coreferring markables form a pair, while one or more pairs that refer to the same entity form a chain.
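The step from pairs to chains can be illustrated with a minimal sketch. It assumes markables are identified by plain string ids and merges any pairs that share a markable using a small union-find structure; the function name `build_chains` and the ids are hypothetical and not taken from any corpus or system discussed here.

```python
from collections import defaultdict


def build_chains(pairs):
    """Group coreferring pairs into chains (illustrative union-find sketch)."""
    parent = {}

    def find(x):
        # Register unseen markables as their own root, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Each pair links two markables that refer to the same entity.
    for antecedent, anaphor in pairs:
        union(antecedent, anaphor)

    # Collect markables by their root to obtain one set per chain.
    chains = defaultdict(set)
    for markable in list(parent):
        chains[find(markable)].add(markable)
    return sorted(sorted(chain) for chain in chains.values())


# Hypothetical markable ids: two pairs share "m2" and therefore form one chain.
print(build_chains([("m1", "m2"), ("m2", "m3"), ("m4", "m5")]))
```

Markables that never appear in a pair are not returned by this sketch; a fuller version would list them as singleton chains.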

In the ACE corpus, the linguistic expressions are called mentions, and the entities these mentions refer to are, naturally, entities. The coreference resolution task is to discover the antecedent for each anaphor in a document. Since the coreference relation is transitive, the set of all the transitive closures of the markables forms a partition, in other words, a set that contains the sets of markables in each chain. For text processing systems, such as information retrieval (IR) and information extraction (IE), identifying the exact antecedent is less important than correctly partitioning the markables.

Moreover, it is not always clear which is the antecedent. Therefore, most systems strive to generate a correct partition. The types of markables that a coreference resolution system resolves are specific to the domain.

The general English domain focuses on person, location, and organization [11]. The shared task in the biomedical literature domain focused on finding coreferential mentions of genes and proteins.

In the clinical narrative, however, the types are mainly disorders, signs or symptoms, anatomical sites, medications, and procedures. In addition to the difference in the markable types, Coden et al.

The average sentence length in clinical notes is only approximately half of that in general English texts. The vocabulary of clinical notes is also smaller than that of general English texts. Meystre et al.

Methods for coreference resolution need to account for these subdomain language characteristics, such as the word and sentence distance between coreferential mentions. Furthermore, different genres of clinical texts show different patterns.

For example, anatomical site concepts are more prevalent in procedure notes, including radiology, pathology, and operation notes, than in discharge summaries. During the past two decades, several systems have been developed to extract named entities (NEs) from clinical narrative, first specialized in certain report types [16-19], and later more general purpose [20, 3, 21, 22].

The community is now moving towards semantic analysis and discourse processing, including relation discovery and semantic role labeling.

However, there have been only a handful of efforts researching coreference in the clinical narrative. Hahn et al.

As part of the Ontology Development and Information Extraction (ODIE) project, a corpus of clinical text was doubly annotated and adjudicated [24] to include markables, pairs and chains. As part of their work on developing a tool for cancer characteristics information extraction, Coden et al. used an annotation schema that included coreference annotations for anatomical sites and histologies mapped to the International Classification of Diseases for Oncology (ICD-O) [26].

Two mentions that are exact strings and map to the same concept were annotated as coreferential. In addition, each anatomical site or histology mention is coreferenced with any instance of its parent anatomical site as defined by ICD-O.
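The two annotation rules just described can be sketched as a small predicate. This is only an illustration under stated assumptions: the concept ids and the one-entry parent table below are hypothetical placeholders, not real ICD-O codes, and the second rule is read here as applying to the direct parent site only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotated:
    text: str     # surface string of the mention
    concept: str  # concept id the mention maps to (placeholder ids here)


# Hypothetical child -> parent table standing in for the ICD-O site hierarchy.
PARENT_SITE = {"SITE_SIGMOID_COLON": "SITE_COLON"}


def corefer(a, b, parent_of=PARENT_SITE):
    """Apply the two schema rules described in the text to a mention pair."""
    # Rule 1: identical strings that map to the same concept.
    if a.text == b.text and a.concept == b.concept:
        return True
    # Rule 2: one mention's concept is the direct parent site of the other's.
    return parent_of.get(a.concept) == b.concept or parent_of.get(b.concept) == a.concept


sigmoid = Annotated("sigmoid colon", "SITE_SIGMOID_COLON")
colon = Annotated("colon", "SITE_COLON")
rectum = Annotated("rectum", "SITE_RECTUM")
print(corefer(sigmoid, colon), corefer(colon, rectum))
```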

Roberts et al. Our goal was to review recent advances in general purpose coreference resolution to lay the foundation for methodologies in the clinical domain, facilitated by the availability of a shared lexical resource of gold standard coreference annotations, the ODIE corpus.

The search returned about results. We also selected publications using the same keywords in PubMed, but excluded papers that focused on neuroscientific or psycholinguistic discoveries. This query yielded fewer than 10 papers. Finally, publications frequently referenced in the papers from the above two sets were also included.

Coreference resolution: A review of general methodologies and applications in the clinical domain

Early attempts at the coreference resolution task mainly involved heuristic approaches, motivated by linguistic theories. The general theme was to incorporate a knowledge source to prune unlikely antecedent candidates until a small set is obtained, and then select the best candidate based on the current focus of attention [28] or the preferred center.

These approaches tended to employ a multitude of features, including syntactic (the gender of the two mentions must agree), semantic (a mention with the same semantic role as the anaphor is given preference), and pragmatic (the topic under discussion usually remains unchanged unless there are indications otherwise) constraints and preferences.
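The prune-then-prefer pattern described above might look like the following minimal sketch. It is not a reconstruction of any particular published system: the `Mention` fields, the gender labels, and the recency tie-breaker are assumptions made for illustration, while the gender agreement constraint and the same-role preference follow the examples given in the text.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    text: str
    gender: str    # "m", "f", "n", or "unknown" (assumed label set)
    role: str      # e.g. "subject" or "object" (assumed label set)
    sentence: int  # index of the sentence containing the mention


def gender_agrees(candidate, anaphor):
    # Constraint: genders must match unless one of them is unknown.
    if "unknown" in (candidate.gender, anaphor.gender):
        return True
    return candidate.gender == anaphor.gender


def resolve(anaphor, candidates):
    """Prune candidates with a hard constraint, then rank by preferences."""
    viable = [
        c for c in candidates
        if c.sentence <= anaphor.sentence and gender_agrees(c, anaphor)
    ]
    if not viable:
        return None

    def score(c):
        same_role = 1 if c.role == anaphor.role else 0  # preference from the text
        recency = -(anaphor.sentence - c.sentence)      # assumed tie-breaker
        return (same_role, recency)

    return max(viable, key=score)


anaphor = Mention("she", "f", "subject", 2)
candidates = [
    Mention("the patient", "f", "subject", 0),
    Mention("her husband", "m", "subject", 1),
    Mention("the nurse", "f", "object", 1),
]
print(resolve(anaphor, candidates).text)
```

In this toy case the gender constraint removes "her husband", and the same-role preference then favors "the patient" over the more recent "the nurse".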

Many of them also resolved different types of anaphoric phrases at once, even some that are not strictly coreferential. Hobbs [30] employed a left-to-right, breadth-first tree search procedure on the syntactic parse tree of a sentence to find the first candidate that satisfies a set of hand-crafted constraints. The search started from the noun phrase (NP) immediately dominating the pronoun.

The candidate NP antecedent was selected based on two criteria. Criterion one selected as antecedent the NP on a branch to the left of the pronoun-dominating NP path.
