SentenceDetector (Apache cTAKES 4.0.0 API)

java.lang.Object
- org.apache.uima.analysis_component.AnalysisComponent_ImplBase
  - org.apache.uima.analysis_component.Annotator_ImplBase
    - org.apache.uima.analysis_component.JCasAnnotator_ImplBase
      - org.apache.ctakes.ytex.uima.annotators.SentenceDetector

All Implemented Interfaces:

org.apache.uima.analysis_component.AnalysisComponent
```
public class SentenceDetector
extends org.apache.uima.analysis_component.JCasAnnotator_ImplBase
```
Wraps the OpenNLP sentence detector in a UIMA annotator. Changes:
- split on paragraphs before feeding into maximum entropy model
- don't split on newlines
- split on periods
- split on semi-structured text such as checkboxes
Parameters (optional):
- paragraphPattern: regex used to split paragraphs. Default: PARAGRAPH_PATTERN.
- acronymPattern: if the text preceding a period matches this pattern, the detector does not split at the period. Default: ACRONYM_PATTERN.
- periodPattern: if the text following a period matches this pattern, the detector splits at the period. Default: PERIOD_PATTERN.
- splitPattern: regex used to split at semi-structured fields. Default: SPLIT_PATTERN.
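
The period rules described by acronymPattern and periodPattern can be sketched as below. This is a minimal illustration under stated assumptions, not the cTAKES implementation: the class name `PeriodSplitSketch` and both regex values are hypothetical stand-ins, and the real defaults are whatever ACRONYM_PATTERN and PERIOD_PATTERN hold.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class PeriodSplitSketch {
    // Hypothetical stand-ins; the real defaults are ACRONYM_PATTERN and PERIOD_PATTERN.
    static final Pattern ACRONYM = Pattern.compile("\\b(?:Mr|Mrs|Ms|Dr)$");
    static final Pattern PERIOD = Pattern.compile("^\\s+[A-Z]");

    /** Splits at a period only if no acronym precedes it and the following text matches PERIOD. */
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) != '.') {
                continue;
            }
            String before = text.substring(start, i);
            String after = text.substring(i + 1);
            boolean acronymBefore = ACRONYM.matcher(before).find();
            boolean splitAfter = PERIOD.matcher(after).find();
            if (!acronymBefore && splitAfter) {
                sentences.add(text.substring(start, i + 1).trim());
                start = i + 1;
            }
        }
        // Whatever remains after the last accepted split is the final sentence.
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(split("Seen by Dr. Smith today. Stable."));
    }
}
```

In this sketch the period after "Dr" is kept inside the sentence because the preceding text matches the acronym stand-in, while the period after "today" is followed by whitespace and a capital letter and so ends the sentence.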
Author:

Mayo Clinic, vijay

Field Summary

Fields
Modifier and Type	Field and Description
`static String`	`ACRONYM_PATTERN` vng change: do not split sentences at periods preceded by text matching this acronym pattern
`static String`	`PARAGRAPH_PATTERN` vng change: split paragraphs on this pattern
`static String`	`PARAM_SEGMENTS_TO_SKIP` Value is "SegmentsToSkip".
`static String`	`PERIOD_PATTERN` vng change: split sentences at periods followed by text matching this pattern
`static String`	`SD_MODEL_FILE_PARAM`
`static String`	`SPLIT_PATTERN` vng change: split sentences on these patterns

Constructor Summary

Constructors
Constructor and Description
`SentenceDetector()`

Method Summary

Modifier and Type	Method and Description
`protected int`	`annotateParagraph(org.apache.uima.jcas.JCas jcas, String text, int b, int e, int sentenceCount)` Splits paragraphs.
`protected int`	`annotateRange(org.apache.uima.jcas.JCas jcas, String text, int b, int e, int sentenceCount)` Detect sentences within a section of the text and add annotations to the CAS.
`static File`	`getFileInExistingDir(String fn)`
`static File`	`getReadableFile(String fn)`
`void`	`initialize(org.apache.uima.UimaContext aContext)`
`static void`	`main(String[] args)` Train a new sentence detector from the training data in the first file and write the model to the second file. The training data file is expected to have one sentence per line.
`static int`	`parseInt(String s, org.apache.log4j.Logger log)`
`void`	`process(org.apache.uima.jcas.JCas jcas)` Entry point for processing.
`static void`	`usage(org.apache.log4j.Logger log)`

Methods inherited from class org.apache.uima.analysis_component.JCasAnnotator_ImplBase
getRequiredCasInterface, process

Methods inherited from class org.apache.uima.analysis_component.Annotator_ImplBase
getCasInstancesRequired, hasNext, next

Methods inherited from class org.apache.uima.analysis_component.AnalysisComponent_ImplBase
batchProcessComplete, collectionProcessComplete, destroy, getContext, getResultSpecification, reconfigure, setResultSpecification

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - PARAM_SEGMENTS_TO_SKIP
```
public static final String PARAM_SEGMENTS_TO_SKIP
```
    Value is "SegmentsToSkip". This parameter specifies which sections to skip. The parameter should be of type String, should be multi-valued and optional.
    
    See Also:
    
    Constant Field Values
  - SD_MODEL_FILE_PARAM
```
public static final String SD_MODEL_FILE_PARAM
```
    See Also:
    
    Constant Field Values
  - PARAGRAPH_PATTERN
```
public static final String PARAGRAPH_PATTERN
```
    vng change: split paragraphs on this pattern
    
    See Also:
    
    Constant Field Values
  - ACRONYM_PATTERN
```
public static final String ACRONYM_PATTERN
```
    vng change: do not split sentences at periods preceded by text matching this acronym pattern
    
    See Also:
    
    Constant Field Values
  - PERIOD_PATTERN
```
public static final String PERIOD_PATTERN
```
    vng change: split sentences at periods followed by text matching this pattern
    
    See Also:
    
    Constant Field Values
  - SPLIT_PATTERN
```
public static final String SPLIT_PATTERN
```
    vng change: split sentences on these patterns
    
    See Also:
    
    Constant Field Values
- Constructor Detail
  - SentenceDetector
```
public SentenceDetector()
```
- Method Detail
  - initialize
```
public void initialize(org.apache.uima.UimaContext aContext)
                throws org.apache.uima.resource.ResourceInitializationException
```
    Specified by:
    
    initialize in interface org.apache.uima.analysis_component.AnalysisComponent
    
    Overrides:
    
    initialize in class org.apache.uima.analysis_component.AnalysisComponent_ImplBase
    
    Throws:
    
    org.apache.uima.resource.ResourceInitializationException
  - process
```
public void process(org.apache.uima.jcas.JCas jcas)
             throws org.apache.uima.analysis_engine.AnalysisEngineProcessException
```
    Entry point for processing.
    
    Specified by:
    
    process in class org.apache.uima.analysis_component.JCasAnnotator_ImplBase
    
    Throws:
    
    org.apache.uima.analysis_engine.AnalysisEngineProcessException
  - annotateParagraph
```
protected int annotateParagraph(org.apache.uima.jcas.JCas jcas,
                                String text,
                                int b,
                                int e,
                                int sentenceCount)
                         throws org.apache.uima.analysis_engine.AnalysisEngineProcessException
```
    Splits paragraphs. Arc v1.0 had a paragraph splitter, and sentences never crossed paragraph boundaries. The paragraph splitter was lost in the upgrade to cTAKES 1.3.2. Paragraphs are now split before the text is run through the maximum entropy model. This resolves situations where the model would split after a period, e.g.:
```
 Clinical History:
 Mr. So and so
 
```
    Without the paragraph splitter, the model splits after "Mr."; with the paragraph splitter, it does not.
    Parameters:
    
    jcas -
    
    text -
    
    b -
    
    e -
    
    sentenceCount -
    
    Returns:
    
    Throws:
    
    org.apache.uima.analysis_engine.AnalysisEngineProcessException
    
    org.apache.uima.analysis_engine.annotator.AnnotatorProcessException
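
    The idea of splitting paragraphs before sentence detection can be sketched as below. This is only an illustration under stated assumptions: the class `ParagraphSplitSketch`, the method `paragraphRanges`, and the regex are hypothetical, and the real default is whatever PARAGRAPH_PATTERN holds. Each returned range would then be passed to the sentence model on its own, so no sentence can cross a paragraph boundary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ParagraphSplitSketch {
    // Hypothetical pattern: a colon at end of line, or a blank line.
    static final Pattern PARAGRAPH = Pattern.compile(":\\r?\\n|\\r?\\n\\r?\\n");

    /** Returns the [begin, end) offsets of each paragraph within text[b, e). */
    static List<int[]> paragraphRanges(String text, int b, int e) {
        List<int[]> ranges = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(text).region(b, e);
        int start = b;
        while (m.find()) {
            // A paragraph runs up to and including the separator that ends it.
            if (m.end() > start) {
                ranges.add(new int[] {start, m.end()});
            }
            start = m.end();
        }
        if (start < e) {
            ranges.add(new int[] {start, e});
        }
        return ranges;
    }

    public static void main(String[] args) {
        String text = "Clinical History:\nMr. So and so\n\nFindings:\nNormal.";
        for (int[] r : paragraphRanges(text, 0, text.length())) {
            System.out.println("[" + text.substring(r[0], r[1]).trim() + "]");
        }
    }
}
```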
  - annotateRange
```
protected int annotateRange(org.apache.uima.jcas.JCas jcas,
                            String text,
                            int b,
                            int e,
                            int sentenceCount)
                     throws org.apache.uima.analysis_engine.AnalysisEngineProcessException
```
    Detects sentences within a section of the text and adds annotations to the CAS. Uses the OpenNLP sentence detector, then additionally forces sentences to end at end-of-line characters (splitting into multiple sentences). Sentences are also trimmed, and any sentence the detector forms that consists only of white space is ignored.
    
    Parameters:
    
    jcas - view of the CAS containing the text to run sentence detector against
    
    text - the document text
    
    section - the section this sentence is in
    
    sentenceCount - the number of sentences added already to the CAS (if processing one section at a time)
    
    Returns:
    
    The sum of sentenceCount and the number of Sentence annotations added to the CAS for this section
    
    Throws:
    
    org.apache.uima.analysis_engine.annotator.AnnotatorProcessException
    
    org.apache.uima.analysis_engine.AnalysisEngineProcessException
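
    The trimming and white-space handling described for annotateRange, together with its running count, can be sketched as below. This is a hedged illustration only: `TrimSpansSketch` and `keepTrimmed` are hypothetical names, the candidate spans stand in for whatever the OpenNLP model proposes, and the end-of-line splitting step is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class TrimSpansSketch {
    /**
     * Trims each candidate [begin, end) span, drops spans that are only white space,
     * and returns sentenceCount plus the number of spans kept.
     */
    static int keepTrimmed(String text, int[][] candidates, int sentenceCount, List<int[]> kept) {
        for (int[] c : candidates) {
            int begin = c[0];
            int end = c[1];
            while (begin < end && Character.isWhitespace(text.charAt(begin))) {
                begin++;
            }
            while (end > begin && Character.isWhitespace(text.charAt(end - 1))) {
                end--;
            }
            if (begin < end) {
                // In the real annotator this is where a Sentence annotation would be added to the CAS.
                kept.add(new int[] {begin, end});
                sentenceCount++;
            }
        }
        return sentenceCount;
    }

    public static void main(String[] args) {
        String text = "  First one.   \n  Second.";
        List<int[]> kept = new ArrayList<>();
        int total = keepTrimmed(text, new int[][] {{0, 15}, {15, 16}, {16, 25}}, 0, kept);
        for (int[] s : kept) {
            System.out.println("[" + text.substring(s[0], s[1]) + "]");
        }
        System.out.println("count=" + total);
    }
}
```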
  - main
```
public static void main(String[] args)
                 throws IOException
```
    Train a new sentence detector from the training data in the first file and write the model to the second file.
    The training data file is expected to have one sentence per line.
    
    Parameters:
    
    args - training_data_filename name_of_model_to_create iters? cutoff?
    
    Throws:
    
    IOException
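
    The argument layout for main (two required file names followed by optional iters and cutoff) can be read as in the sketch below. This is an assumption-laden illustration, not the actual parsing code: `TrainArgsSketch` is a hypothetical class, and the default values 100 and 5 are placeholders chosen for the example, since the defaults used by the real main are not stated here.

```java
final class TrainArgsSketch {
    // Hypothetical defaults; the values used by the real main are not documented here.
    static final int DEFAULT_ITERS = 100;
    static final int DEFAULT_CUTOFF = 5;

    final String trainingFile;
    final String modelFile;
    final int iters;
    final int cutoff;

    private TrainArgsSketch(String trainingFile, String modelFile, int iters, int cutoff) {
        this.trainingFile = trainingFile;
        this.modelFile = modelFile;
        this.iters = iters;
        this.cutoff = cutoff;
    }

    /** Parses: training_data_filename name_of_model_to_create iters? cutoff? */
    static TrainArgsSketch parse(String[] args) {
        if (args.length < 2 || args.length > 4) {
            throw new IllegalArgumentException(
                "usage: training_data_filename name_of_model_to_create iters? cutoff?");
        }
        int iters = args.length > 2 ? Integer.parseInt(args[2]) : DEFAULT_ITERS;
        int cutoff = args.length > 3 ? Integer.parseInt(args[3]) : DEFAULT_CUTOFF;
        return new TrainArgsSketch(args[0], args[1], iters, cutoff);
    }

    public static void main(String[] args) {
        TrainArgsSketch a = parse(new String[] {"sentences.txt", "sd.bin", "50"});
        System.out.println(a.trainingFile + " -> " + a.modelFile
            + " iters=" + a.iters + " cutoff=" + a.cutoff);
    }
}
```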
  - usage
```
public static void usage(org.apache.log4j.Logger log)
```
  - parseInt
```
public static int parseInt(String s,
                           org.apache.log4j.Logger log)
```
  - getReadableFile
```
public static File getReadableFile(String fn)
                            throws IOException
```
    Throws:
    
    IOException
  - getFileInExistingDir
```
public static File getFileInExistingDir(String fn)
                                 throws IOException
```
    Throws:
    
    IOException
