org.basex.util
Class Tokenizer

java.lang.Object
  extended by org.basex.util.Tokenizer
All Implemented Interfaces:
IndexToken

public final class Tokenizer
extends Object
implements IndexToken

Full-text tokenizer.

Author:
Workgroup DBIS, University of Konstanz 2005-10, ISC License, Christian Gruen

Nested Class Summary
static class Tokenizer.FTUnit
          Full-text units used for position calculations.
 
Field Summary
 boolean cs
          Case sensitivity flag.
 boolean dc
          Diacritics flag.
 boolean fast
          Fast evaluation flag.
 boolean fz
          Fuzzy flag.
 boolean lc
          Lowercase flag.
 int p
          Current character position.
 int para
          Current paragraph.
 int pm
          Last punctuation mark.
 int pos
          Position of the current token.
 StemDir sd
          Stemming dictionary.
 int sent
          Current sentence.
 boolean st
          Stemming flag.
 byte[] text
          Text.
 boolean uc
          Uppercase flag.
 boolean wc
          Wildcard flag.
 
Constructor Summary
Tokenizer(byte[] txt, FTOpt fto, boolean f, Prop pr)
          Constructor.
Tokenizer(byte[] txt, Prop pr)
          Constructor.
Tokenizer(Prop pr)
          Empty constructor.
 
Method Summary
 int count()
          Counts the number of tokens.
 byte[] get()
          Returns the current token.
 byte[] get(byte[] tok)
          Returns a normalized version of the specified token.
static int[][] getInfo(byte[] t)
          Gets full-text info for the specified token; needed for visualizations.
 void init()
          Initializes the iterator.
 void init(byte[] txt)
          Sets the text.
 boolean more()
          Checks if more tokens are to be returned.
 byte[] orig()
          Returns the original token.
 int pos(int w, Tokenizer.FTUnit u)
          Calculates a position value, dependent on the specified unit.
 String toString()
           
 Data.Type type()
          Returns the index type.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

sd

public StemDir sd
Stemming dictionary.


st

public boolean st
Stemming flag.


dc

public boolean dc
Diacritics flag.


cs

public boolean cs
Case sensitivity flag.


uc

public boolean uc
Uppercase flag.


lc

public boolean lc
Lowercase flag.


wc

public boolean wc
Wildcard flag.


fz

public boolean fz
Fuzzy flag.


fast

public boolean fast
Fast evaluation flag.


sent

public int sent
Current sentence.


para

public int para
Current paragraph.


pos

public int pos
Position of the current token.


text

public byte[] text
Text.


p

public int p
Current character position.


pm

public int pm
Last punctuation mark.

Constructor Detail

Tokenizer

public Tokenizer(Prop pr)
Empty constructor.

Parameters:
pr - (optional) database properties

Tokenizer

public Tokenizer(byte[] txt,
                 Prop pr)
Constructor.

Parameters:
txt - text
pr - (optional) database properties

Tokenizer

public Tokenizer(byte[] txt,
                 FTOpt fto,
                 boolean f,
                 Prop pr)
Constructor.

Parameters:
txt - text
fto - full-text options
f - fast evaluation
pr - database properties

Method Detail

type

public Data.Type type()
Description copied from interface: IndexToken
Returns the index type.

Specified by:
type in interface IndexToken
Returns:
type

init

public void init(byte[] txt)
Sets the text.

Parameters:
txt - text

init

public void init()
Initializes the iterator.


more

public boolean more()
Checks if more tokens are to be returned.

Returns:
true if more tokens are available

get

public byte[] get()
Description copied from interface: IndexToken
Returns the current token.

Specified by:
get in interface IndexToken
Returns:
token
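
The init(), more() and get() methods together form an iterator-style protocol. The following is a minimal sketch of that calling pattern, not the BaseX implementation: IterationSketch is a hypothetical stand-in class, and its rule for splitting text (runs of ASCII letters and digits) is an assumption made only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in that mimics the init()/more()/get() protocol. */
final class IterationSketch {
  private byte[] text = new byte[0];
  private int p;      // current character position
  private int start;  // start offset of the current token
  int pos;            // position of the current token, -1 before the first one

  /** Sets the text and resets the iterator. */
  void init(final byte[] txt) { text = txt; init(); }

  /** Resets the iterator to the start of the text. */
  void init() { p = 0; start = 0; pos = -1; }

  /** Advances to the next token; returns false once the text is exhausted. */
  boolean more() {
    while (p < text.length && !tokenChar(text[p])) p++;  // skip separators
    if (p == text.length) return false;
    start = p;
    while (p < text.length && tokenChar(text[p])) p++;   // consume the token
    pos++;
    return true;
  }

  /** Returns the current token; meaningful only after more() returned true. */
  byte[] get() { return Arrays.copyOfRange(text, start, p); }

  /** Assumption: only ASCII letters and digits belong to a token. */
  private static boolean tokenChar(final byte b) {
    return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z') || (b >= '0' && b <= '9');
  }

  /** Collects all tokens of an ASCII string using the protocol above. */
  static List<String> tokens(final String s) {
    final IterationSketch t = new IterationSketch();
    t.init(s.getBytes(StandardCharsets.US_ASCII));
    final List<String> out = new ArrayList<>();
    while (t.more()) out.add(new String(t.get(), StandardCharsets.US_ASCII));
    return out;
  }

  public static void main(final String[] args) {
    // prints [Full, text, search, in, BaseX]
    System.out.println(tokens("Full-text search, in BaseX!"));
  }
}
```

In this sketch, calling init() again restarts the iteration over the same text, which matches the description "Initializes the iterator" above.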

get

public byte[] get(byte[] tok)
Returns a normalized version of the specified token.

Parameters:
tok - input token
Returns:
normalized token
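
The page does not specify how the flags influence normalization. The sketch below shows one plausible reading, under these assumptions: dc set to true preserves diacritics, uc forces uppercase, and lc or a cleared cs forces lowercase. NormalizeSketch and its parameter order are hypothetical and are not part of the BaseX API; stemming (st, sd) is left out.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical helper; the flag semantics are assumptions, not BaseX behavior. */
final class NormalizeSketch {
  /** Normalizes a UTF-8 encoded token according to the given flags. */
  static byte[] normalize(final byte[] tok, final boolean cs, final boolean lc,
      final boolean uc, final boolean dc) {
    final String s = new String(tok, StandardCharsets.UTF_8);
    return normalize(s, cs, lc, uc, dc).getBytes(StandardCharsets.UTF_8);
  }

  /** String variant of the same logic. */
  static String normalize(final String tok, final boolean cs, final boolean lc,
      final boolean uc, final boolean dc) {
    String s = tok;
    if (!dc) {
      // Assumption: without the diacritics flag, combining marks are removed.
      s = Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }
    if (uc) {
      s = s.toUpperCase(Locale.ROOT);
    } else if (lc || !cs) {
      // Assumption: case-insensitive matching compares lowercase forms.
      s = s.toLowerCase(Locale.ROOT);
    }
    return s;
  }

  public static void main(final String[] args) {
    // prints resume
    System.out.println(normalize("R\u00e9sum\u00e9", false, false, false, false));
  }
}
```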

orig

public byte[] orig()
Returns the original token.

Returns:
original token

count

public int count()
Counts the number of tokens.

Returns:
number of tokens

pos

public int pos(int w,
               Tokenizer.FTUnit u)
Calculates a position value that depends on the specified unit. Values are cached once they have been calculated.

Parameters:
w - word position
u - unit
Returns:
new position
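
The idea of a unit-dependent position with caching can be sketched as follows. This is an illustration under stated assumptions, not the BaseX code: the unit names WORDS, SENTENCES and PARAGRAPHS, the boundary rules ('.', '!' and '?' end a sentence, a newline ends a paragraph) and the class UnitPositionSketch are all hypothetical.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical sketch of a cached, unit-dependent position lookup. */
final class UnitPositionSketch {
  /** Assumed unit names; the real Tokenizer.FTUnit values may differ. */
  enum Unit { WORDS, SENTENCES, PARAGRAPHS }

  private final String text;
  private final Map<Unit, int[]> cache = new EnumMap<>(Unit.class);
  int scans;  // number of full text scans, exposed to observe the caching

  UnitPositionSketch(final String text) { this.text = text; }

  /** Maps word position w to a position in unit u, or -1 if w is out of range. */
  int pos(final int w, final Unit u) {
    if (u == Unit.WORDS) return w;
    final int[] table = cache.computeIfAbsent(u, this::scan);  // scanned once per unit
    return w >= 0 && w < table.length ? table[w] : -1;
  }

  /** Scans the text once and records, for every word, the index of its unit. */
  private int[] scan(final Unit u) {
    scans++;
    final int[] tmp = new int[text.length()];  // upper bound on the word count
    int words = 0, unit = 0;
    boolean inWord = false;
    for (int i = 0; i < text.length(); i++) {
      final char c = text.charAt(i);
      if (Character.isLetterOrDigit(c)) {
        if (!inWord) { tmp[words++] = unit; inWord = true; }
      } else {
        inWord = false;
        final boolean end = u == Unit.SENTENCES
            ? c == '.' || c == '!' || c == '?' : c == '\n';
        if (end) unit++;
      }
    }
    return Arrays.copyOf(tmp, words);
  }

  public static void main(final String[] args) {
    final UnitPositionSketch s = new UnitPositionSketch("One two. Three four!\nFive.");
    System.out.println(s.pos(2, Unit.SENTENCES));   // prints 1
    System.out.println(s.pos(4, Unit.PARAGRAPHS));  // prints 1
  }
}
```

Repeated calls for the same unit reuse the cached table, so the text is scanned at most once per unit.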

getInfo

public static int[][] getInfo(byte[] t)
Gets full-text info for the specified token; needed for visualizations. The returned arrays are:
int[0]: length of each token
int[1]: sentence info, length of each sentence
int[2]: paragraph info, length of each paragraph
int[3]: each token as int[]
int[4]: punctuation marks of each sentence

Parameters:
t - text to be parsed
Returns:
int arrays

toString

public String toString()
Overrides:
toString in class Object