public class HyphenatedPTB
extends java.lang.Object
Modifier and Type | Field and Description |
---|---|
(package private) static java.lang.String[] | contractionsStartingWithApostrophe |
(package private) static java.lang.String[] | hyphenatedPrefixes |
(package private) static java.util.HashSet<java.lang.String> | hyphenatedPrefixesLookup |
(package private) static java.lang.String[] | hyphenatedSuffixes |
(package private) static java.util.HashSet<java.lang.String> | hyphenatedSuffixesLookup |
(package private) static java.lang.String[] | hyphenatedWords |
(package private) static java.util.HashSet<java.lang.String> | hyphenatedWordsLookup |
(package private) static java.lang.String | lettersAfterApostropheForMiddleOfContraction |
(package private) static char | MINUS_OR_HYPHEN |
(package private) static int[] | MultiTokenWordLenToken1 |
(package private) static int[] | MultiTokenWordLenToken2 |
(package private) static java.lang.String[] | MultiTokenWords |
(package private) static java.util.HashMap<java.lang.String,java.lang.Integer> | MultiTokenWordsLookup |
(package private) static java.lang.String[] | possibleContractionEndings |
Constructor and Description |
---|
HyphenatedPTB() |
Modifier and Type | Method and Description |
---|---|
(package private) static boolean | isContractionThatStartsWithApostrophe(int currentPosition, java.lang.String textSegment) |
(package private) static int | lenIfHyphenatedSuffix(java.lang.String lowerCasedString, int position) |
private static int | lenIncludingHyphensToKeep(java.lang.String s, int indexOfFirstHyphen, int numberOfHyphensToConsiderKeeping, int secondBreak, int thirdBreak) |
(package private) static int | lenOfFirstTokenInContraction(java.lang.String s) |
static void | main(java.lang.String[] args) |
static int | tokenLengthCheckingForHyphenatedTerms(java.lang.String lowerCasedString)
Hyphenated words in the fixed list (hyphenatedWordsLookup) are not split.
Words built from known prefixes or suffixes are also kept together, as in these made-up examples:
chronic-itis: 1 suffix
mega-huge: 1 prefix
e-game-fest: 1 prefix and 1 suffix
salon-o-torium: 1 suffix that contains 2 hyphens
urban-esque-wise: 2 suffixes
|
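The made-up examples in the description of tokenLengthCheckingForHyphenatedTerms can be traced with a small sketch. This is not the class's actual implementation, which is not shown on this page. The class name HyphenAffixSketch, the method tokenLength, and the contents of the three sets are hypothetical and chosen only to cover the examples. The sketch assumes that the return value is the number of leading characters to keep as one token, and that an unmatched input is cut just before its first hyphen.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; names and data are assumptions, not the real HyphenatedPTB code.
final class HyphenAffixSketch {
    // Hypothetical sample data standing in for hyphenatedWordsLookup,
    // hyphenatedPrefixesLookup and hyphenatedSuffixesLookup.
    static final Set<String> WORDS = new HashSet<>(Arrays.asList("x-ray"));
    static final Set<String> PREFIXES = new HashSet<>(Arrays.asList("mega", "e"));
    static final Set<String> SUFFIXES =
            new HashSet<>(Arrays.asList("itis", "fest", "esque", "wise", "o-torium"));

    // Returns how many leading characters of the lower-cased input form one token.
    static int tokenLength(String lowerCasedString) {
        if (WORDS.contains(lowerCasedString)) {
            return lowerCasedString.length(); // fixed list: never split
        }
        String[] segs = lowerCasedString.split("-", -1); // -1 keeps empty trailing segments
        int len = segs[0].length();
        int i = 1;
        // A known prefix keeps the hyphen and the segment that follows it.
        if (segs.length > 1 && PREFIXES.contains(segs[0])) {
            len += 1 + segs[1].length();
            i = 2;
        }
        // Keep absorbing suffixes; a suffix may itself span two segments ("o-torium").
        while (i < segs.length) {
            if (SUFFIXES.contains(segs[i])) {
                len += 1 + segs[i].length();
                i++;
            } else if (i + 1 < segs.length
                    && SUFFIXES.contains(segs[i] + "-" + segs[i + 1])) {
                len += 2 + segs[i].length() + segs[i + 1].length();
                i += 2;
            } else {
                break; // unknown segment: the token ends before this hyphen
            }
        }
        return len;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"chronic-itis", "mega-huge", "e-game-fest",
                "salon-o-torium", "urban-esque-wise", "red-green"}) {
            System.out.println(w + " -> " + tokenLength(w));
        }
    }
}
```

With this sample data, each of the five example words is kept whole (the returned length equals the word length), while "red-green" returns 3, the length of "red".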
static java.lang.String[] MultiTokenWords
static int[] MultiTokenWordLenToken1
static int[] MultiTokenWordLenToken2
static java.util.HashMap<java.lang.String,java.lang.Integer> MultiTokenWordsLookup
static java.lang.String[] possibleContractionEndings
static java.lang.String lettersAfterApostropheForMiddleOfContraction
static java.lang.String[] contractionsStartingWithApostrophe
static java.lang.String[] hyphenatedPrefixes
static java.util.HashSet<java.lang.String> hyphenatedPrefixesLookup
static java.lang.String[] hyphenatedSuffixes
static java.util.HashSet<java.lang.String> hyphenatedSuffixesLookup
static java.lang.String[] hyphenatedWords
static java.util.HashSet<java.lang.String> hyphenatedWordsLookup
static char MINUS_OR_HYPHEN
static int lenOfFirstTokenInContraction(java.lang.String s)
Parameters:
s - isMiddleOfContraction
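This page does not describe how lenOfFirstTokenInContraction computes its result. As a rough illustration of the general Penn Treebank convention of splitting a contraction into two tokens ("don't" becomes "do" and "n't"), the sketch below returns the length of the first token. The class name ContractionSketch, the method lenOfFirstToken, and the ENDINGS array are hypothetical. The real contents of possibleContractionEndings are not listed here, and the actual method may behave differently.

```java
// Illustrative sketch only; names and data are assumptions, not the real HyphenatedPTB code.
final class ContractionSketch {
    // Hypothetical stand-in for possibleContractionEndings.
    static final String[] ENDINGS = {"n't", "'ll", "'re", "'ve", "'s", "'d", "'m"};

    // Returns the length of the part of s that precedes a known contraction ending,
    // or s.length() when no ending matches.
    static int lenOfFirstToken(String s) {
        for (String ending : ENDINGS) {
            int start = s.length() - ending.length();
            // start > 0 ensures a non-empty first token; the match ignores case.
            if (start > 0 && s.regionMatches(true, start, ending, 0, ending.length())) {
                return start;
            }
        }
        return s.length();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"don't", "they'll", "dog's", "cat"}) {
            System.out.println(w + " -> " + lenOfFirstToken(w));
        }
    }
}
```

Under these assumptions, "don't" yields 2 and "they'll" yields 4, while a word without a listed ending, such as "cat", yields its full length.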
static boolean isContractionThatStartsWithApostrophe(int currentPosition, java.lang.String textSegment)
public static void main(java.lang.String[] args)
public static int tokenLengthCheckingForHyphenatedTerms(java.lang.String lowerCasedString)
Parameters:
lowerCasedString - because of "-o-torium", input might contain more than 1 hyphen
private static int lenIncludingHyphensToKeep(java.lang.String s, int indexOfFirstHyphen, int numberOfHyphensToConsiderKeeping, int secondBreak, int thirdBreak)
static int lenIfHyphenatedSuffix(java.lang.String lowerCasedString, int position)