SentenceDetector

java.lang.Object
- JCasAnnotator_ImplBase
- - org.apache.ctakes.ytex.uima.annotators.SentenceDetector

```
public class SentenceDetector
extends JCasAnnotator_ImplBase
```
Wraps the OpenNLP sentence detector in a UIMA annotator. Changes:
- split on paragraphs before feeding text into the maximum entropy model
- don't split on newlines
- split on periods
- split on semi-structured text such as checkboxes

Parameters (optional):
- paragraphPattern: regex used to split paragraphs. Default: PARAGRAPH_PATTERN.
- acronymPattern: if the text preceding a period matches this pattern, the sentence is not split at that period. Default: ACRONYM_PATTERN.
- periodPattern: if the text following a period matches this pattern, the sentence is split at that period. Default: PERIOD_PATTERN.
- splitPattern: regex used to split at semi-structured fields. Default: SPLIT_PATTERN.
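
The interaction between acronymPattern and periodPattern can be illustrated with a small standalone sketch. The class name, the two regexes, and the rule that the acronym check takes precedence are all assumptions made for illustration; the real defaults live in ACRONYM_PATTERN and PERIOD_PATTERN, whose values are not shown on this page, and the real annotator may combine the checks differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the annotator's actual code.
class PeriodSplitSketch {
    // Assumed stand-in for ACRONYM_PATTERN: a title directly before the period.
    static final Pattern ACRONYM = Pattern.compile("\\b(?:Mr|Mrs|Ms|Dr)$");
    // Assumed stand-in for PERIOD_PATTERN: whitespace then a capital letter.
    static final Pattern AFTER_PERIOD = Pattern.compile("^\\s+[A-Z]");

    // Decide whether the period at periodIndex ends a sentence.
    static boolean splitsAt(String text, int periodIndex) {
        String before = text.substring(0, periodIndex);
        String after = text.substring(periodIndex + 1);
        if (ACRONYM.matcher(before).find()) {
            return false; // preceding text looks like an acronym: no split
        }
        return AFTER_PERIOD.matcher(after).find(); // split only if the following text matches
    }

    // Split text into trimmed sentences using the rule above.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '.' && splitsAt(text, i)) {
                sentences.add(text.substring(start, i + 1).trim());
                start = i + 1;
            }
        }
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("Seen by Dr. Smith today. Denies pain.")) {
            System.out.println(s);
        }
    }
}
```

In this sketch the period after "Dr" is kept inside the sentence, while the period after "today" ends it because a capitalized word follows.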

Field Summary

Fields
Modifier and Type	Field and Description
`static java.lang.String`	`ACRONYM_PATTERN` vng change: do not split a sentence at a period when the text preceding it matches this acronym pattern.
`private java.util.regex.Pattern`	`acronymPattern` vng change
`private UimaContext`	`context`
`private Logger`	`logger`
`private java.lang.String`	`NEWLINE`
`static java.lang.String`	`PARAGRAPH_PATTERN` vng change: split paragraphs on this pattern.
`private java.util.regex.Pattern`	`paragraphPattern` vng change
`static java.lang.String`	`PARAM_SEGMENTS_TO_SKIP` Value is "SegmentsToSkip".
`static java.lang.String`	`PERIOD_PATTERN` vng change: split a sentence at a period when the text following it matches this pattern.
`private java.util.regex.Pattern`	`periodPattern` vng change
`static java.lang.String`	`SD_MODEL_FILE_PARAM`
`private opennlp.tools.sentdetect.SentenceModel`	`sdmodel`
`private int`	`sentenceCount`
`private SentenceDetectorCtakes`	`sentenceDetector`
`private java.util.Set<?>`	`skipSegmentsSet`
`static java.lang.String`	`SPLIT_PATTERN` vng change: split sentences on these patterns.
`private java.util.regex.Pattern`	`splitPattern` vng change

Constructor Summary

Constructors
Constructor and Description
`SentenceDetector()`

Method Summary

Methods
Modifier and Type	Method and Description
`protected int`	`annotateParagraph(JCas jcas, java.lang.String text, int b, int e, int sentenceCount)` Splits paragraphs.
`protected int`	`annotateRange(JCas jcas, java.lang.String text, int b, int e, int sentenceCount)` Detect sentences within a section of the text and add annotations to the CAS.
`private java.util.regex.Pattern`	`compilePatternCheck(java.lang.String patternKey, java.lang.String patternDefault)` vng change
`private void`	`configInit()` Reads configuration parameters.
`static java.io.File`	`getFileInExistingDir(java.lang.String fn)`
`static java.io.File`	`getReadableFile(java.lang.String fn)`
`void`	`initialize(UimaContext aContext)`
`static void`	`main(java.lang.String[] args)` Train a new sentence detector from the training data in the first file and write the model to the second file. The training data file is expected to have one sentence per line.
`static int`	`parseInt(java.lang.String s, Logger log)`
`void`	`process(JCas jcas)` Entry point for processing.
`static void`	`usage(Logger log)`

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - PARAM_SEGMENTS_TO_SKIP
```
public static final java.lang.String PARAM_SEGMENTS_TO_SKIP
```
    Value is "SegmentsToSkip". This parameter specifies which sections to skip. The parameter should be of type String, should be multi-valued and optional.
    
    See Also:
    Constant Field Values
  - logger
```
private Logger logger
```
  - SD_MODEL_FILE_PARAM
```
public static final java.lang.String SD_MODEL_FILE_PARAM
```
    See Also:
    Constant Field Values
  - sdmodel
```
private opennlp.tools.sentdetect.SentenceModel sdmodel
```
  - PARAGRAPH_PATTERN
```
public static final java.lang.String PARAGRAPH_PATTERN
```
    vng change: split paragraphs on this pattern.
    
    See Also:
    Constant Field Values
  - ACRONYM_PATTERN
```
public static final java.lang.String ACRONYM_PATTERN
```
    vng change: do not split a sentence at a period when the text preceding it matches this acronym pattern.
    
    See Also:
    Constant Field Values
  - PERIOD_PATTERN
```
public static final java.lang.String PERIOD_PATTERN
```
    vng change: split a sentence at a period when the text following it matches this pattern.
    
    See Also:
    Constant Field Values
  - SPLIT_PATTERN
```
public static final java.lang.String SPLIT_PATTERN
```
    vng change: split sentences on these patterns.
    
    See Also:
    Constant Field Values
  - paragraphPattern
```
private java.util.regex.Pattern paragraphPattern
```
    vng change
  - splitPattern
```
private java.util.regex.Pattern splitPattern
```
    vng change
  - periodPattern
```
private java.util.regex.Pattern periodPattern
```
    vng change
  - acronymPattern
```
private java.util.regex.Pattern acronymPattern
```
    vng change
  - context
```
private UimaContext context
```
  - skipSegmentsSet
```
private java.util.Set<?> skipSegmentsSet
```
  - sentenceDetector
```
private SentenceDetectorCtakes sentenceDetector
```
  - NEWLINE
```
private java.lang.String NEWLINE
```
  - sentenceCount
```
private int sentenceCount
```
- Constructor Detail
  - SentenceDetector
```
public SentenceDetector()
```
- Method Detail
  - initialize
```
public void initialize(UimaContext aContext)
                throws ResourceInitializationException
```
    Throws:
    
    ResourceInitializationException
  - configInit
```
private void configInit()
                 throws ResourceAccessException,
                        InvalidFormatException,
                        java.io.IOException
```
    Reads configuration parameters.
    
    Throws:
    
    ResourceAccessException
    
    java.io.IOException
    
    InvalidFormatException
  - compilePatternCheck
```
private java.util.regex.Pattern compilePatternCheck(java.lang.String patternKey,
                                          java.lang.String patternDefault)
```
    vng change
  - process
```
public void process(JCas jcas)
             throws AnalysisEngineProcessException
```
    Entry point for processing.
    
    Throws:
    
    AnalysisEngineProcessException
  - annotateParagraph
```
protected int annotateParagraph(JCas jcas,
                    java.lang.String text,
                    int b,
                    int e,
                    int sentenceCount)
                         throws AnalysisEngineProcessException
```
    Splits paragraphs. Arc v1.0 had a paragraph splitter, so sentences never crossed paragraph boundaries. The paragraph splitter was lost in the upgrade to cTAKES 1.3.2. Paragraphs are now split before the text is run through the maximum entropy model; this resolves situations where the model would split after a period, e.g.:
```
 Clinical History:
 Mr. So and so
 
```
    Without the paragraph splitter, the model splits after "Mr."; with the paragraph splitter, it does not.
    Parameters:
    jcas -
    text -
    b -
    e -
    sentenceCount -
    
    Returns:
    
    Throws:
    
    AnalysisEngineProcessException
    
    AnnotatorProcessException
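
As a rough illustration of splitting a text range into paragraphs before sentence detection, the sketch below returns the offsets of each paragraph inside a [b, e) range. The class name and the blank-line regex are assumptions; the annotator's actual default is PARAGRAPH_PATTERN, whose value is not shown on this page. Each returned span would then be handed to the sentence model separately, so no sentence can cross a paragraph boundary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the annotator's actual code.
class ParagraphSplitSketch {
    // Assumed separator: a blank line (optionally containing spaces or tabs).
    static final Pattern PARAGRAPH = Pattern.compile("\\r?\\n[ \\t]*\\r?\\n");

    // Returns the [begin, end) offsets of every non-empty paragraph in text[b, e).
    static List<int[]> paragraphSpans(String text, int b, int e) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(text).region(b, e);
        int start = b;
        while (m.find()) {
            if (m.start() > start) {
                spans.add(new int[] {start, m.start()});
            }
            start = m.end(); // next paragraph begins after the separator
        }
        if (e > start) {
            spans.add(new int[] {start, e}); // trailing paragraph
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Clinical History:\n\nMr. So and so\n\nPlan: rest";
        for (int[] span : paragraphSpans(text, 0, text.length())) {
            System.out.println(span[0] + "-" + span[1] + ": " + text.substring(span[0], span[1]));
        }
    }
}
```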
  - annotateRange
```
protected int annotateRange(JCas jcas,
                java.lang.String text,
                int b,
                int e,
                int sentenceCount)
                     throws AnalysisEngineProcessException
```
    Detects sentences within a section of the text and adds annotations to the CAS. Uses the OpenNLP sentence detector, and then additionally forces sentences to end at end-of-line characters (splitting into multiple sentences). Sentences are also trimmed, and any sentence that consists only of white space is ignored.
    
    Parameters:
    jcas - view of the CAS containing the text to run sentence detector against
    text - the document text
    section - the section this sentence is in
    sentenceCount - the number of sentences added already to the CAS (if processing one section at a time)
    
    Returns:
    count The sum of sentenceCount and the number of Sentence annotations added to the CAS for this section
    
    Throws:
    
    AnnotatorProcessException
    
    AnalysisEngineProcessException
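
The trimming and whitespace-only rule described for annotateRange can be sketched on plain offsets. The class and method names are hypothetical, and the use of Character.isWhitespace as the definition of white space is an assumption; the real method works on CAS annotations rather than int arrays.

```java
// Hypothetical illustration only; not the annotator's actual code.
class SentenceTrimSketch {
    // Returns the trimmed [begin, end) span, or null if text[b, e) is only white space.
    static int[] trimSpan(String text, int b, int e) {
        while (b < e && Character.isWhitespace(text.charAt(b))) {
            b++; // drop leading white space
        }
        while (e > b && Character.isWhitespace(text.charAt(e - 1))) {
            e--; // drop trailing white space
        }
        return b < e ? new int[] {b, e} : null;
    }

    public static void main(String[] args) {
        String text = "  Denies pain. \n";
        int[] span = trimSpan(text, 0, text.length());
        System.out.println(span == null ? "ignored" : text.substring(span[0], span[1]));
    }
}
```

A caller would add a sentence annotation only when the result is non-null, so the sentence count is not incremented for empty spans.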
  - main
```
public static void main(java.lang.String[] args)
                 throws java.io.IOException
```
    Train a new sentence detector from the training data in the first file and write the model to the second file.
    The training data file is expected to have one sentence per line.
    
    Parameters:
    args - training_data_filename name_of_model_to_create iters? cutoff?
    
    Throws:
    
    java.io.IOException
  - usage
```
public static void usage(Logger log)
```
  - parseInt
```
public static int parseInt(java.lang.String s,
           Logger log)
```
  - getReadableFile
```
public static java.io.File getReadableFile(java.lang.String fn)
                                    throws java.io.IOException
```
    Throws:
    
    java.io.IOException
  - getFileInExistingDir
```
public static java.io.File getFileInExistingDir(java.lang.String fn)
                                         throws java.io.IOException
```
    Throws:
    
    java.io.IOException
