TokenizerPTB

java.lang.Object
- org.apache.ctakes.core.nlp.tokenizer.TokenizerPTB

```
public class TokenizerPTB
extends java.lang.Object
```
A class used to break natural text into tokens following PTB (Penn Treebank) rules. See Supplementary Guidelines for ETTB 2.0 dated April 6th, 2009. Token markup is kept external to the text rather than embedded in it: character offsets identify the boundaries of each token.
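
As an illustration of offset-based (external) token markup, here is a minimal sketch. It is not the cTAKES implementation: the `Span` class and `whitespaceSpans` method are hypothetical, the splitting is whitespace-only rather than PTB-based, and the end offset is assumed to be exclusive.

```java
import java.util.ArrayList;
import java.util.List;

public class StandoffOffsetsSketch {

    /** Hypothetical stand-off token: only offsets are stored, never the text itself. */
    static final class Span {
        final int begin; // offset of the first character of the token
        final int end;   // offset just past the last character (assumed exclusive)

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Splits on whitespace only (NOT the PTB rules) and records character offsets. */
    static List<Span> whitespaceSpans(String text) {
        List<Span> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace between tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int begin = i;
            // Consume the token itself.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > begin) {
                spans.add(new Span(begin, i));
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Take 5 mg daily";
        for (Span s : whitespaceSpans(text)) {
            // The original text is untouched; the covered text is recovered from the offsets.
            System.out.println(s.begin + "-" + s.end + " " + text.substring(s.begin, s.end));
        }
    }
}
```

Because the text is never modified, several layers of annotation can point into the same string without interfering with one another.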

Field Summary

Fields
Modifier and Type	Field and Description
`private static char`	`DASH`
`private static java.lang.String`	`ellipsis`
`(package private) static java.lang.String[]`	`emptyStringList`
`(package private) static java.util.ArrayList<BaseToken>`	`emptyTokenList`
`(package private) static java.lang.String[]`	`nameStartingWithApostrophe`
`private java.lang.String`	`possibleFinalPunctuation`
`(package private) static java.lang.String[]`	`testsForEmailAddress`
`(package private) static java.lang.String[]`	`testsForNumbers`
`private static java.lang.String[]`	`urlStarters`
`private java.lang.String`	`validOtherEmailAddressCharacters`

Constructor Summary

Constructors
Constructor and Description
`TokenizerPTB()` Constructor

Method Summary

Methods
Modifier and Type	Method and Description
`private int`	`checkFormat2(java.lang.String s)`
`private boolean`	`containsLetter(java.lang.String lowerCasedText, int currentPosition, int tokenLen)`
`private java.lang.Object`	`createToken(java.lang.Class<? extends BaseToken> clas, java.lang.String s, JCas jcas, int begin, int end, int offsetAdjustment)` if clas is null, determine token class for the caller if jcas is null,
`private java.lang.Class<? extends BaseToken>`	`determineTokenType(java.lang.String s, int begin, int end)`
`int`	`findFirstCharOfNextToken(java.lang.String s, int startPosition)`
`private int`	`getLengthIfIsNumberThatStartsWithPeriod(int currentPosition, java.lang.String textSegment)`
`private int`	`getLengthIfNameStartingWithApostrophe(int currentPosition, java.lang.String textSegment)`
`private int`	`getLenToNextNonDigit(java.lang.String s, int startingPosition)`
`private boolean`	`isContraction(char c)`
`private boolean`	`isEllipsis(int currentPosition, java.lang.String textSegment)`
`private boolean`	`isEndOfLine(char c)`
`private boolean`	`isNumericChar(char ch)` ",.0123456789"
`private boolean`	`isPossibleFinalPunctuation(char c)`
`private boolean`	`isTelephoneNumberChar(char ch)` "0123456789-"
`private int`	`lenIfIsAbbreviation(int currentPosition, java.lang.String mixedCaseText, int afterEndOfInputToConsider)` Assumes no white space between currentPosition and afterEndOfInputToConsider. If the last character of a sentence is a period, the period is not included with the abbreviation; it is counted as punctuation.
`private int`	`lenIfIsEmailAddress(int currentPosition, java.lang.String lowerCasedText, int endOfInputToConsider)` Assumes no white space between currentPosition and endOfInputToConsider
`private int`	`lenIfIsNumberContainingComma(int currentPosition, java.lang.String text, int nextNonNumericChar)` such as -4,012.67 or 5 or 5.5 or 4,000,153
`private int`	`lenIfIsPostalCode(int currentPosition, java.lang.String text, int nextNonPostalCodeChar)`
`private int`	`lenIfIsTelephoneNumber(int currentPosition, java.lang.String text, int nextNonTelephoneNumberChar)`
`private int`	`lenIfIsUrl(int currentPosition, java.lang.String lowerCasedText, int endOfInputToConsider)`
`static void`	`main(java.lang.String[] args)`
`(package private) static void`	`runEmailTests()`
`(package private) static void`	`runNumberTests()`
`private void`	`setCapitalization(WordToken wta, java.lang.String tokenText)`
`private void`	`setNumPosition(WordToken wta, java.lang.String tokenText)`
`private void`	`setNumType(NumToken nta, java.lang.String tokenText)`
`java.util.List<?>`	`tokenize(java.lang.String text)` Tokenize a string that is assumed to be the entire document (or at least to start at 0)
`java.util.List<?>`	`tokenizeTextSegment(JCas jcas, java.lang.String textSegment, int offsetAdjustment, boolean includeTextNotJustOffsets)` Tokenize text that starts at offset offsetAdjustment within the complete text
`private boolean`	`verify(int begin, int end, int offsetAdjustment)`
`private java.lang.Class<? extends BaseToken>`	`wordTokenOrNumToken(java.lang.String lowerCasedText, int currentPosition, int tokenLen)`

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

emptyStringList

static final java.lang.String[] emptyStringList

emptyTokenList

static final java.util.ArrayList<BaseToken> emptyTokenList

DASH
```
private static char DASH
```

ellipsis

private static java.lang.String ellipsis

nameStartingWithApostrophe

static java.lang.String[] nameStartingWithApostrophe

possibleFinalPunctuation

private java.lang.String possibleFinalPunctuation

validOtherEmailAddressCharacters

private java.lang.String validOtherEmailAddressCharacters

urlStarters

private static java.lang.String[] urlStarters

testsForNumbers

static java.lang.String[] testsForNumbers

testsForEmailAddress

static java.lang.String[] testsForEmailAddress

Constructor Detail
TokenizerPTB
```
public TokenizerPTB()
```
  Constructor

Method Detail

tokenizeTextSegment
```
public java.util.List<?> tokenizeTextSegment(JCas jcas,
                                    java.lang.String textSegment,
                                    int offsetAdjustment,
                                    boolean includeTextNotJustOffsets)
```
Tokenize text that starts at offset offsetAdjustment within the complete text

Parameters:
textSegment - the text to tokenize
offsetAdjustment - what to add to all offsets within textSegment to make them be offsets from the start of the text for the jcas
includeTextNotJustOffsets - whether to copy the text covered by this token into the token object itself

Returns:
the list of new tokens
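
The role of `offsetAdjustment` can be shown with plain string arithmetic. The sketch below does not call cTAKES: the class name, the helper `toDocumentOffset`, and the sample text are made up for illustration, and end offsets are assumed to be exclusive.

```java
public class OffsetAdjustmentSketch {

    /** Converts an offset relative to the segment into an offset relative to the whole text. */
    static int toDocumentOffset(int offsetInSegment, int offsetAdjustment) {
        return offsetInSegment + offsetAdjustment;
    }

    public static void main(String[] args) {
        String document = "Header line.\nPatient takes aspirin.";

        // The segment starts at this offset within the complete text.
        int offsetAdjustment = document.indexOf("Patient");
        String textSegment = document.substring(offsetAdjustment);

        // Offsets found while looking only at the segment.
        int beginInSegment = textSegment.indexOf("aspirin");
        int endInSegment = beginInSegment + "aspirin".length();

        // Adding offsetAdjustment makes them valid against the complete text.
        int begin = toDocumentOffset(beginInSegment, offsetAdjustment);
        int end = toDocumentOffset(endInSegment, offsetAdjustment);

        System.out.println(document.substring(begin, end)); // prints "aspirin"
    }
}
```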

tokenize
```
public java.util.List<?> tokenize(java.lang.String text)
```
Tokenize a string that is assumed to be the entire document (or at least to start at 0)

Parameters:
text - the String to tokenize

Returns:
the list of new tokens

lenIfIsNumberContainingComma

private int lenIfIsNumberContainingComma(int currentPosition,
                               java.lang.String text,
                               int nextNonNumericChar)

such as -4,012.67 or 5 or 5.5 or 4,000,153

Parameters:
currentPosition -
text -
nextNonNumericChar -

Returns:
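
The number shapes listed above (-4,012.67, 5, 5.5, 4,000,153) can be recognised with a regular expression. This is only a sketch of one way to do it, not the method's actual logic: the class name, the pattern, the two-argument signature (the `nextNonNumericChar` parameter is omitted), and the choice of -1 for "no number here" are all assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumberWithCommaSketch {

    // Optional minus sign, then either comma-grouped digits or plain digits,
    // then an optional decimal part.
    private static final Pattern NUMBER =
            Pattern.compile("-?(?:\\d{1,3}(?:,\\d{3})+|\\d+)(?:\\.\\d+)?");

    /**
     * Returns the length of the number starting exactly at currentPosition,
     * or -1 (an assumed convention) if no number starts there.
     */
    static int lengthOfNumberAt(String text, int currentPosition) {
        Matcher m = NUMBER.matcher(text);
        m.region(currentPosition, text.length());
        // lookingAt() requires the match to begin at the start of the region.
        return m.lookingAt() ? m.end() - currentPosition : -1;
    }

    public static void main(String[] args) {
        String[] samples = {"-4,012.67", "5", "5.5", "4,000,153"};
        for (String s : samples) {
            System.out.println(s + " -> " + lengthOfNumberAt(s, 0));
        }
    }
}
```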

lenIfIsPostalCode

private int lenIfIsPostalCode(int currentPosition,
                    java.lang.String text,
                    int nextNonPostalCodeChar)

lenIfIsTelephoneNumber

private int lenIfIsTelephoneNumber(int currentPosition,
                         java.lang.String text,
                         int nextNonTelephoneNumberChar)

checkFormat2

private int checkFormat2(java.lang.String s)

isTelephoneNumberChar
```
private boolean isTelephoneNumberChar(char ch)
```
"0123456789-"

Parameters:
ch -

Returns:

isNumericChar
```
private boolean isNumericChar(char ch)
```
",.0123456789"

Parameters:
ch -

Returns:
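
The quoted strings in the descriptions of `isTelephoneNumberChar` ("0123456789-") and `isNumericChar` (",.0123456789") read as the sets of accepted characters. Assuming that reading, a membership test could look like the sketch below; the class and constant names are hypothetical, and the real methods may be implemented differently.

```java
public class CharSetSketch {

    // Character sets copied from the method descriptions.
    private static final String TELEPHONE_CHARS = "0123456789-";
    private static final String NUMERIC_CHARS = ",.0123456789";

    /** True if ch is a digit or a dash. */
    static boolean isTelephoneNumberChar(char ch) {
        return TELEPHONE_CHARS.indexOf(ch) >= 0;
    }

    /** True if ch is a digit, a comma, or a period. */
    static boolean isNumericChar(char ch) {
        return NUMERIC_CHARS.indexOf(ch) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(isTelephoneNumberChar('-')); // prints "true"
        System.out.println(isNumericChar('-'));         // prints "false"
    }
}
```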

getLenToNextNonDigit

private int getLenToNextNonDigit(java.lang.String s,
                       int startingPosition)

wordTokenOrNumToken

private java.lang.Class<? extends BaseToken> wordTokenOrNumToken(java.lang.String lowerCasedText,
                                                       int currentPosition,
                                                       int tokenLen)

containsLetter
```
private boolean containsLetter(java.lang.String lowerCasedText,
                     int currentPosition,
                     int tokenLen)
```
Parameters:
lowerCasedText -
currentPosition -
tokenLen -

Returns:
true if at least one of the characters between currentPosition and currentPosition+tokenLen is a letter
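
A minimal sketch of the check described by the return value, assuming the range runs from `currentPosition` (inclusive) to `currentPosition+tokenLen` (exclusive) and that `Character.isLetter` is an acceptable definition of "letter". The class name and the clamping to the text length are additions for illustration.

```java
public class ContainsLetterSketch {

    /** True if any character in [currentPosition, currentPosition + tokenLen) is a letter. */
    static boolean containsLetter(String text, int currentPosition, int tokenLen) {
        // Clamp so the loop never reads past the end of the text.
        int end = Math.min(text.length(), currentPosition + tokenLen);
        for (int i = currentPosition; i < end; i++) {
            if (Character.isLetter(text.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "12mg 500";
        System.out.println(containsLetter(text, 0, 4)); // prints "true"  ("12mg")
        System.out.println(containsLetter(text, 5, 3)); // prints "false" ("500")
    }
}
```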

isEllipsis

private boolean isEllipsis(int currentPosition,
                 java.lang.String textSegment)

getLengthIfNameStartingWithApostrophe

private int getLengthIfNameStartingWithApostrophe(int currentPosition,
                                        java.lang.String textSegment)

getLengthIfIsNumberThatStartsWithPeriod

private int getLengthIfIsNumberThatStartsWithPeriod(int currentPosition,
                                          java.lang.String textSegment)

lenIfIsAbbreviation
```
private int lenIfIsAbbreviation(int currentPosition,
                      java.lang.String mixedCaseText,
                      int afterEndOfInputToConsider)
```
Assumes no white space between currentPosition and afterEndOfInputToConsider. If the last character of a sentence is a period, the period is not included with the abbreviation; it is counted as punctuation. That way we don't have to differentiate between "mg." being an abbreviation and "me." being simply the end of a sentence.

Parameters:
currentPosition -
mixedCaseText -
afterEndOfInputToConsider -

Returns:
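
The sentence-final period rule can be sketched as follows. This is an illustration of the stated rule only, not the method's real logic: the class, the method name, and the `endsSentence` flag are hypothetical, and how the tokenizer actually decides that a candidate ends a sentence is not described here.

```java
public class TrailingPeriodSketch {

    /**
     * Length of the candidate spanning [currentPosition, afterEnd).
     * If the candidate ends the sentence with a period, the period is left out
     * so that it can be emitted as a separate punctuation token.
     */
    static int lenWithoutSentenceFinalPeriod(String text, int currentPosition,
                                             int afterEnd, boolean endsSentence) {
        int len = afterEnd - currentPosition;
        if (endsSentence && len > 0 && text.charAt(afterEnd - 1) == '.') {
            return len - 1;
        }
        return len;
    }

    public static void main(String[] args) {
        // "mg." and "me." are treated identically at the end of a sentence.
        System.out.println(lenWithoutSentenceFinalPeriod("Take 5 mg.", 7, 10, true));      // prints "2"
        System.out.println(lenWithoutSentenceFinalPeriod("Give it to me.", 11, 14, true)); // prints "2"
    }
}
```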

isPossibleFinalPunctuation

private boolean isPossibleFinalPunctuation(char c)

lenIfIsEmailAddress
```
private int lenIfIsEmailAddress(int currentPosition,
                      java.lang.String lowerCasedText,
                      int endOfInputToConsider)
```
Assumes no white space between currentPosition and endOfInputToConsider

Parameters:
currentPosition -
lowerCasedText -
endOfInputToConsider -

Returns:

lenIfIsUrl

private int lenIfIsUrl(int currentPosition,
             java.lang.String lowerCasedText,
             int endOfInputToConsider)

determineTokenType

private java.lang.Class<? extends BaseToken> determineTokenType(java.lang.String s,
                                                      int begin,
                                                      int end)

isContraction

private boolean isContraction(char c)

verify

private boolean verify(int begin,
             int end,
             int offsetAdjustment)

createToken

private java.lang.Object createToken(java.lang.Class<? extends BaseToken> clas,
                           java.lang.String s,
                           JCas jcas,
                           int begin,
                           int end,
                           int offsetAdjustment)

if clas is null, determine token class for the caller if jcas is null,

See Also:
org.apache.ctakes.core.ae.TokenConverter#convert(org.apache.ctakes.core.nlp.tokenizer.Token, org.apache.uima.jcas.JCas, int)

setNumType

private void setNumType(NumToken nta,
              java.lang.String tokenText)

See Also:
Tokenizer.isNumber(java.lang.String)

setNumPosition

private void setNumPosition(WordToken wta,
                  java.lang.String tokenText)

setCapitalization
```
private void setCapitalization(WordToken wta,
                     java.lang.String tokenText)
```
See Also:
Tokenizer.applyCapitalizationRules(org.apache.ctakes.core.nlp.tokenizer.Token, java.lang.String)

findFirstCharOfNextToken

public int findFirstCharOfNextToken(java.lang.String s,
                           int startPosition)

isEndOfLine
```
private boolean isEndOfLine(char c)
```

main

public static void main(java.lang.String[] args)

runNumberTests
```
static void runNumberTests()
```

runEmailTests
```
static void runEmailTests()
```
