org.basex.util
Class Token

java.lang.Object
  extended by org.basex.util.Token

public final class Token
extends Object

This class provides convenience operations for handling 'tokens'. In this project, a token is simply a UTF-8 encoded string stored in a byte array. To guarantee a consistent string representation, all string conversions should be done via the methods of this class.
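
The token idea described above can be sketched with the standard library alone. This is an illustration, not the class's actual implementation: the names TokenRoundTrip, toToken and fromToken are hypothetical stand-ins for token(String) and string(byte[]), and the use of java.nio.charset.StandardCharsets is an assumption.

```java
import java.nio.charset.StandardCharsets;

final class TokenRoundTrip {
  /** Hypothetical stand-in for token(String): encodes a string as UTF-8 bytes. */
  static byte[] toToken(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  /** Hypothetical stand-in for string(byte[]): decodes UTF-8 bytes to a string. */
  static String fromToken(byte[] token) {
    return new String(token, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // "Gr\u00fcn" has four characters, but the umlaut needs two bytes in UTF-8.
    byte[] t = toToken("Gr\u00fcn");
    System.out.println(t.length);
    System.out.println(fromToken(t));
  }
}
```

The byte length of a token can differ from the character count of the string, which is why positions and lengths in this class refer to bytes unless stated otherwise.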

Author:
Workgroup DBIS, University of Konstanz 2005-10, ISC License, Christian Gruen

Field Summary
static byte[] AMP
          Ampersand Entity.
static byte[] APOS
          Apostrophe Entity.
static byte[] EMPTY
          Empty token.
static byte[] FALSE
          False token.
static byte[] GT
          GreaterThan Entity.
static byte[] INF
          Positive infinity.
static DecimalFormatSymbols LOC
          Decimal format symbols of the US locale.
static byte[] LT
          LessThan Entity.
static int MAXCATS
          Maximum number of categories in statistics.
static int MAXLEN
          Maximum length for hash calculation and index terms.
static byte[] MZERO
          Minus zero token.
static byte[] NAN
          Not-a-number (NaN) token.
static byte[] NINF
          Negative infinity.
static byte[] ONE
          One token.
static byte[] QU
          Quote Entity.
static byte[] SPACE
          Space token.
static byte[] TRUE
          True token.
static String UTF16
          UTF16 encoding string.
static String UTF16BE
          UTF16 encoding string (big-endian).
static String UTF16LE
          UTF16 encoding string (little-endian).
static String UTF8
          UTF8 encoding string.
static String UTF82
          UTF8 encoding string (variant).
static byte[] XML
          XML Token.
static byte[] XMLNS
          XMLNS Token.
static byte[] XMLNSC
          XMLNS Token with colon.
static byte[] ZERO
          Zero token.
 
Method Summary
static boolean ascii(byte[] text)
          Checks if the specified token only consists of ASCII characters.
static byte[] chop(byte[] t, int l)
          Chops a token to the specified length and adds dots.
static byte[] chopNumber(byte[] t)
          Finishes the numeric token, removing trailing zeroes.
static int cl(byte v)
          Returns the expected codepoint length of the specified byte.
static byte[] concat(byte[]... t)
          Concatenates the specified tokens.
static boolean contains(byte[] tok, byte[] sub)
          Checks if the first token contains the second token.
static boolean contains(byte[] tok, int c)
          Checks if the first token contains the specified character.
static int cp(byte[] t, int p)
          Returns the codepoint (Unicode value) of the specified token, starting at the specified position.
static byte[] delete(byte[] t, int c)
          Deletes the specified character out of the token.
static int diff(byte[] tok, byte[] tok2)
          Calculates the difference of two character arrays.
static int diff(byte c1, byte c2)
          Calculates the difference of two characters.
static boolean digit(int c)
          Checks if the specified character is a digit (0 - 9).
static String enc(String enc)
          Returns a unified representation of the specified encoding.
static boolean endsWith(byte[] tok, byte[] sub)
          Checks if the first token ends with the second token.
static boolean endsWith(byte[] tok, int c)
          Checks if the first token ends with the specified character.
static boolean eq(byte[] tok, byte[] tok2)
          Compares two character arrays for equality.
static boolean ftChar(int ch)
          Returns true if the specified character is a full-text letter or digit.
static int hash(byte[] tok)
          Calculates a hash code for the specified token.
static int indexOf(byte[] tok, byte[] sub)
          Returns the position of the specified token or -1.
static int indexOf(byte[] tok, byte[] sub, int p)
          Returns the position of the specified token or -1.
static int indexOf(byte[] tok, int c)
          Returns the position of the specified character or -1.
static boolean isValidUTF8(byte[] text)
          Checks if the specified UTF-8 characters are valid.
static byte[] lc(byte[] t)
          Converts the specified token to lower case.
static int lc(int ch)
          Converts a character to lower case.
static int len(byte[] text)
          Returns the token length.
static boolean letter(int c)
          Checks if the specified character is a computer letter (A - Z, a - z, _).
static boolean letterOrDigit(int c)
          Checks if the specified character is a computer letter or digit.
static byte[] ln(byte[] name)
          Returns the local name of the specified name.
static String md5(String pw)
          Returns an MD5 hash.
static byte[] norm(byte[] tok)
          Normalizes all whitespace occurrences in the specified token.
static int norm(int ch)
          Returns a normalized character without diacritics.
static int numDigits(int x)
          Returns the number of digits of the specified integer.
static byte[] pref(byte[] name)
          Returns the prefix of the specified token.
static byte[] removeNonUTF8(byte[] text, boolean chop)
          Removes invalid characters from the UTF-8 sequence.
static byte[] replace(byte[] t, int s, int r)
          Replaces the specified character and returns the result token.
static byte[][] split(byte[] tok, int sep)
          Splits the token at the specified separator character and returns an array with all resulting tokens.
static boolean startsWith(byte[] tok, byte[] sub)
          Checks if the first token starts with the second token.
static boolean startsWith(byte[] tok, int c)
          Checks if the first token starts with the specified character.
static String string(byte[] text)
          Returns the specified token as string.
static String string(byte[] text, int s, int l)
          Returns the specified token as string.
static byte[] substring(byte[] tok, int s)
          Returns a substring of the specified token.
static byte[] substring(byte[] tok, int s, int e)
          Returns a substring of the specified token.
static double toDouble(byte[] to)
          Converts the specified token into a double value.
static int toInt(byte[] to)
          Converts the specified token into an integer value.
static int toInt(byte[] to, int ts, int te)
          Converts the specified token into an integer value.
static int toInt(String to)
          Converts the specified string into an integer value.
static byte[] token(boolean b)
          Creates a byte array representation of the specified boolean value.
static byte[] token(double d)
          Creates a byte array representation from the specified double value; inspired by Xavier Franc's Qizx.
static byte[] token(float f)
          Creates a byte array representation from the specified float value.
static byte[] token(int i)
          Creates a byte array representation of the specified integer value.
static byte[] token(long i)
          Creates a byte array representation from the specified long value, using Java's standard method.
static byte[] token(String s)
          Converts a string to a byte array.
static long toLong(byte[] to)
          Converts the specified token into a long value.
static long toLong(byte[] to, int ts, int te)
          Converts the specified token into a long value.
static long toLong(String to)
          Converts the specified string into a long value.
static int toSimpleInt(byte[] to)
          Converts the specified token into a positive integer value.
static byte[] trim(byte[] t)
          Removes leading and trailing whitespaces from the specified token.
static byte[] uc(byte[] t)
          Converts the specified token to upper case.
static int uc(int ch)
          Converts a character to upper case.
static String utf8(byte[] text, int s, int l)
          Returns a string of the specified UTF8 token.
static byte[] utf8(byte[] s, String enc)
          Converts a token from the input encoding to UTF8.
static boolean ws(byte[] tok)
          Checks if the specified token has only whitespaces.
static boolean ws(int ch)
          Checks if the specified character is a whitespace.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MAXLEN

public static final int MAXLEN
Maximum length for hash calculation and index terms.

See Also:
Constant Field Values

MAXCATS

public static final int MAXCATS
Maximum number of categories in statistics.

See Also:
Constant Field Values

EMPTY

public static final byte[] EMPTY
Empty token.


XML

public static final byte[] XML
XML Token.


XMLNS

public static final byte[] XMLNS
XMLNS Token.


XMLNSC

public static final byte[] XMLNSC
XMLNS Token with colon.


TRUE

public static final byte[] TRUE
True token.


FALSE

public static final byte[] FALSE
False token.


NAN

public static final byte[] NAN
Not-a-number (NaN) token.


INF

public static final byte[] INF
Positive infinity.


NINF

public static final byte[] NINF
Negative infinity.


SPACE

public static final byte[] SPACE
Space token.


ZERO

public static final byte[] ZERO
Zero token.


MZERO

public static final byte[] MZERO
Minus zero token.


ONE

public static final byte[] ONE
One token.


QU

public static final byte[] QU
Quote Entity.


AMP

public static final byte[] AMP
Ampersand Entity.


APOS

public static final byte[] APOS
Apostrophe Entity.


GT

public static final byte[] GT
GreaterThan Entity.


LT

public static final byte[] LT
LessThan Entity.


UTF8

public static final String UTF8
UTF8 encoding string.

See Also:
Constant Field Values

UTF82

public static final String UTF82
UTF8 encoding string (variant).

See Also:
Constant Field Values

UTF16

public static final String UTF16
UTF16 encoding string.

See Also:
Constant Field Values

UTF16LE

public static final String UTF16LE
UTF16 encoding string (little-endian).

See Also:
Constant Field Values

UTF16BE

public static final String UTF16BE
UTF16 encoding string (big-endian).

See Also:
Constant Field Values

LOC

public static final DecimalFormatSymbols LOC
Decimal format symbols of the US locale.

Method Detail

string

public static String string(byte[] text)
Returns the specified token as string.

Parameters:
text - token
Returns:
string

string

public static String string(byte[] text,
                            int s,
                            int l)
Returns the specified token as string.

Parameters:
text - token
s - start position
l - length
Returns:
string

utf8

public static String utf8(byte[] text,
                          int s,
                          int l)
Returns a string of the specified UTF8 token.

Parameters:
text - token
s - start position
l - length
Returns:
string

ascii

public static boolean ascii(byte[] text)
Checks if the specified token only consists of ASCII characters.

Parameters:
text - token
Returns:
result of check

isValidUTF8

public static boolean isValidUTF8(byte[] text)
Checks if the specified UTF-8 characters are valid.

Parameters:
text - UTF-8 characters
Returns:
result of check

removeNonUTF8

public static byte[] removeNonUTF8(byte[] text,
                                   boolean chop)
Removes invalid characters from the UTF-8 sequence.

Parameters:
text - the UTF-8 sequence to remove the invalid chars from
chop - if true, all leading and trailing whitespaces are removed
Returns:
the cleaned UTF-8 sequence

token

public static byte[] token(String s)
Converts a string to a byte array. All strings should be converted by this function to guarantee a consistent character conversion.

Parameters:
s - string to be converted
Returns:
byte array

utf8

public static byte[] utf8(byte[] s,
                          String enc)
Converts a token from the input encoding to UTF8.

Parameters:
s - token to be converted
enc - input encoding
Returns:
byte array

enc

public static String enc(String enc)
Returns a unified representation of the specified encoding.

Parameters:
enc - input encoding
Returns:
encoding

cp

public static int cp(byte[] t,
                     int p)
Returns the codepoint (Unicode value) of the specified token, starting at the specified position.

Parameters:
t - token
p - character position
Returns:
current character

cl

public static int cl(byte v)
Returns the expected codepoint length of the specified byte.

Parameters:
v - first character byte
Returns:
character length
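
The relationship between cl and cp can be illustrated with the standard UTF-8 lead-byte rules. The sketch below is an assumption about how such helpers could work, not the class's own code; Utf8Lengths, expectedLength and codepointAt are hypothetical names, and returning 1 for continuation or invalid bytes is a choice made here because the documentation does not specify that case.

```java
final class Utf8Lengths {
  /** Expected byte length of a UTF-8 sequence, judged by its first byte. */
  static int expectedLength(byte first) {
    int v = first & 0xFF;
    if (v < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((v & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((v & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((v & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                          // continuation or invalid byte (assumption)
  }

  /** Decodes the codepoint starting at byte position p; assumes well-formed input. */
  static int codepointAt(byte[] t, int p) {
    int len = expectedLength(t[p]);
    if (len == 1) return t[p] & 0xFF;
    // Keep only the payload bits of the lead byte, then append 6 bits per continuation byte.
    int cp = (t[p] & 0xFF) & (0xFF >> (len + 1));
    for (int i = 1; i < len; i++) cp = (cp << 6) | (t[p + i] & 0x3F);
    return cp;
  }
}
```

A caller that walks a token codepoint by codepoint would advance its position by the expected length after each decoded value.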

len

public static int len(byte[] text)
Returns the token length.

Parameters:
text - token
Returns:
length

token

public static byte[] token(boolean b)
Creates a byte array representation of the specified boolean value.

Parameters:
b - boolean value to be converted
Returns:
boolean value in byte array

token

public static byte[] token(int i)
Creates a byte array representation of the specified integer value.

Parameters:
i - int value to be converted
Returns:
integer value in byte array

numDigits

public static int numDigits(int x)
Returns the number of digits of the specified integer.

Parameters:
x - number to be checked
Returns:
number of digits

token

public static byte[] token(long i)
Creates a byte array representation from the specified long value, using Java's standard method.

Parameters:
i - long value to be converted
Returns:
byte array

token

public static byte[] token(double d)
Creates a byte array representation from the specified double value; inspired by Xavier Franc's Qizx.

Parameters:
d - double value to be converted
Returns:
byte array

token

public static byte[] token(float f)
Creates a byte array representation from the specified float value.

Parameters:
f - float value to be converted
Returns:
byte array

chopNumber

public static byte[] chopNumber(byte[] t)
Finishes the numeric token, removing trailing zeroes.

Parameters:
t - token to be modified
Returns:
token

toDouble

public static double toDouble(byte[] to)
Converts the specified token into a double value. Double.NaN is returned if the input is invalid.

Parameters:
to - character array to be converted
Returns:
converted double value

toLong

public static long toLong(String to)
Converts the specified string into a long value. Long.MIN_VALUE is returned when the input is invalid.

Parameters:
to - string to be converted
Returns:
converted long value

toLong

public static long toLong(byte[] to)
Converts the specified token into a long value. Long.MIN_VALUE is returned when the input is invalid.

Parameters:
to - character array to be converted
Returns:
converted long value

toLong

public static long toLong(byte[] to,
                          int ts,
                          int te)
Converts the specified token into a long value. Long.MIN_VALUE is returned when the input is invalid.

Parameters:
to - character array to be converted
ts - first byte to be parsed
te - last byte to be parsed (exclusive)
Returns:
converted long value

toInt

public static int toInt(String to)
Converts the specified string into an integer value. Integer.MIN_VALUE is returned when the input is invalid.

Parameters:
to - string to be converted
Returns:
converted integer value

toInt

public static int toInt(byte[] to)
Converts the specified token into an integer value. Integer.MIN_VALUE is returned when the input is invalid.

Parameters:
to - character array to be converted
Returns:
converted integer value

toInt

public static int toInt(byte[] to,
                        int ts,
                        int te)
Converts the specified token into an integer value. Integer.MIN_VALUE is returned when the input is invalid.

Parameters:
to - character array to be converted
ts - first byte to be parsed
te - last byte to be parsed (exclusive)
Returns:
converted integer value

toSimpleInt

public static int toSimpleInt(byte[] to)
Converts the specified token into a positive integer value. Integer.MIN_VALUE is returned if non-digits are found or if the input is longer than nine characters.

Parameters:
to - character array to be converted
Returns:
converted integer value
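
The contract stated for toSimpleInt (digits only, at most nine characters, Integer.MIN_VALUE on failure) can be sketched as follows. SimpleIntParser and parse are hypothetical names, and treating an empty token as invalid is an assumption, since the documentation does not cover that case.

```java
final class SimpleIntParser {
  /** Parses an unsigned decimal token; returns Integer.MIN_VALUE if it is not valid. */
  static int parse(byte[] to) {
    // Nine digits always fit into an int, so no overflow check is needed below.
    if (to.length == 0 || to.length > 9) return Integer.MIN_VALUE; // empty case: assumption
    int v = 0;
    for (byte b : to) {
      if (b < '0' || b > '9') return Integer.MIN_VALUE; // non-digit found
      v = v * 10 + (b - '0');
    }
    return v;
  }
}
```

Because Integer.MIN_VALUE doubles as the error marker, callers need to compare the result against it before using the value.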

hash

public static int hash(byte[] tok)
Calculates a hash code for the specified token.

Parameters:
tok - specified token
Returns:
hash code

eq

public static boolean eq(byte[] tok,
                         byte[] tok2)
Compares two character arrays for equality.

Parameters:
tok - token to be compared
tok2 - second token to be compared
Returns:
true if the arrays are equal

diff

public static int diff(byte[] tok,
                       byte[] tok2)
Calculates the difference of two character arrays.

Parameters:
tok - token to be compared
tok2 - second token to be compared
Returns:
0 if tokens are equal, negative if first token is smaller, positive if first token is bigger
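
A comparison with this return contract can be sketched as a lexicographic byte comparison. TokenDiff is a hypothetical name, and comparing the bytes as unsigned values is an assumption made here (it keeps UTF-8 tokens in codepoint order); the documentation does not say whether the class compares signed or unsigned bytes.

```java
final class TokenDiff {
  /** Returns 0 for equal tokens, a negative value if a sorts first, a positive value otherwise. */
  static int diff(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      // Unsigned comparison of the first differing byte (assumption).
      int d = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (d != 0) return d;
    }
    // One token is a prefix of the other: the shorter one sorts first.
    return a.length - b.length;
  }
}
```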

diff

public static int diff(byte c1,
                       byte c2)
Calculates the difference of two characters.

Parameters:
c1 - first character to be compared
c2 - second character to be compared
Returns:
0 if characters are equal, negative if first character is smaller, positive if first character is bigger

contains

public static boolean contains(byte[] tok,
                               byte[] sub)
Checks if the first token contains the second token.

Parameters:
tok - first token
sub - second token
Returns:
result of test

contains

public static boolean contains(byte[] tok,
                               int c)
Checks if the first token contains the specified character.

Parameters:
tok - first token
c - character
Returns:
result of test

indexOf

public static int indexOf(byte[] tok,
                          int c)
Returns the position of the specified character or -1.

Parameters:
tok - first token
c - character
Returns:
position of the character, or -1 if it is not found

indexOf

public static int indexOf(byte[] tok,
                          byte[] sub)
Returns the position of the specified token or -1.

Parameters:
tok - first token
sub - second token
Returns:
position of the second token, or -1 if it is not found

indexOf

public static int indexOf(byte[] tok,
                          byte[] sub,
                          int p)
Returns the position of the specified token or -1.

Parameters:
tok - first token
sub - second token
p - start position
Returns:
position of the second token, or -1 if it is not found

startsWith

public static boolean startsWith(byte[] tok,
                                 int c)
Checks if the first token starts with the specified character.

Parameters:
tok - first token
c - character
Returns:
result of test

startsWith

public static boolean startsWith(byte[] tok,
                                 byte[] sub)
Checks if the first token starts with the second token.

Parameters:
tok - first token
sub - second token
Returns:
result of test

endsWith

public static boolean endsWith(byte[] tok,
                               int c)
Checks if the first token ends with the specified character.

Parameters:
tok - first token
c - character
Returns:
result of test

endsWith

public static boolean endsWith(byte[] tok,
                               byte[] sub)
Checks if the first token ends with the second token.

Parameters:
tok - first token
sub - second token
Returns:
result of test

substring

public static byte[] substring(byte[] tok,
                               int s)
Returns a substring of the specified token.

Parameters:
tok - token
s - start position
Returns:
substring

substring

public static byte[] substring(byte[] tok,
                               int s,
                               int e)
Returns a substring of the specified token.

Parameters:
tok - token
s - start position
e - end position
Returns:
substring

split

public static byte[][] split(byte[] tok,
                             int sep)
Splits the token at the specified separator character and returns an array with all resulting tokens.

Parameters:
tok - token to be split
sep - separation character
Returns:
array

ws

public static boolean ws(byte[] tok)
Checks if the specified token has only whitespaces.

Parameters:
tok - token
Returns:
true if all characters are whitespaces

replace

public static byte[] replace(byte[] t,
                             int s,
                             int r)
Replaces the specified character and returns the result token.

Parameters:
t - token to be checked
s - the character to be replaced
r - the new character
Returns:
resulting token

trim

public static byte[] trim(byte[] t)
Removes leading and trailing whitespaces from the specified token.

Parameters:
t - token to be trimmed
Returns:
trimmed token

chop

public static byte[] chop(byte[] t,
                          int l)
Chops a token to the specified length and adds dots.

Parameters:
t - token to be chopped
l - maximum length
Returns:
chopped token

concat

public static byte[] concat(byte[]... t)
Concatenates the specified tokens.

Parameters:
t - tokens
Returns:
resulting array

delete

public static byte[] delete(byte[] t,
                            int c)
Deletes the specified character out of the token.

Parameters:
t - token to be checked
c - character to be removed
Returns:
new instance

norm

public static byte[] norm(byte[] tok)
Normalizes all whitespace occurrences in the specified token.

Parameters:
tok - token
Returns:
normalized token

ws

public static boolean ws(int ch)
Checks if the specified character is a whitespace.

Parameters:
ch - the character to be checked
Returns:
result of comparison

letter

public static boolean letter(int c)
Checks if the specified character is a computer letter (A - Z, a - z, _).

Parameters:
c - the character to be checked
Returns:
result of comparison

digit

public static boolean digit(int c)
Checks if the specified character is a digit (0 - 9).

Parameters:
c - the character to be checked
Returns:
result of comparison

letterOrDigit

public static boolean letterOrDigit(int c)
Checks if the specified character is a computer letter or digit.

Parameters:
c - the character to be checked
Returns:
result of comparison

ftChar

public static boolean ftChar(int ch)
Returns true if the specified character is a full-text letter or digit.

Parameters:
ch - character to be tested
Returns:
result of check

uc

public static byte[] uc(byte[] t)
Converts the specified token to upper case.

Parameters:
t - token to be converted
Returns:
the converted token

uc

public static int uc(int ch)
Converts a character to upper case.

Parameters:
ch - character to be converted
Returns:
converted character

lc

public static byte[] lc(byte[] t)
Converts the specified token to lower case.

Parameters:
t - token to be converted
Returns:
the converted token

lc

public static int lc(int ch)
Converts a character to lower case.

Parameters:
ch - character to be converted
Returns:
converted character

pref

public static byte[] pref(byte[] name)
Returns the prefix of the specified token.

Parameters:
name - name
Returns:
prefix or empty token if no prefix exists

md5

public static String md5(String pw)
Returns an MD5 hash.

Parameters:
pw - string to be hashed
Returns:
String
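
Computing an MD5 hash of a string is commonly done with java.security.MessageDigest. The sketch below shows that approach under stated assumptions: Md5Sketch and md5Hex are hypothetical names, and the UTF-8 input encoding and lowercase hexadecimal output are choices made here, since the documentation does not specify the format of the returned string.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Sketch {
  /** Returns the MD5 digest of the string as lowercase hex (output format is an assumption). */
  static String md5Hex(String pw) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      byte[] digest = md.digest(pw.getBytes(StandardCharsets.UTF_8));
      StringBuilder sb = new StringBuilder(digest.length * 2);
      // Mask each byte to 0..255 so negative bytes still print as two hex digits.
      for (byte b : digest) sb.append(String.format("%02x", b & 0xFF));
      return sb.toString();
    } catch (NoSuchAlgorithmException ex) {
      // MD5 is a required algorithm on standard Java platforms.
      throw new IllegalStateException(ex);
    }
  }
}
```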

ln

public static byte[] ln(byte[] name)
Returns the local name of the specified name.

Parameters:
name - name
Returns:
local name
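
The pref and ln methods split a qualified name into its prefix and local name. A minimal sketch of that split, assuming the separator is the first colon: QNameParts, prefix and local are hypothetical names, and returning the whole name as the local name when no colon is present is an assumption (the documentation only states that pref returns an empty token in that case).

```java
import java.util.Arrays;

final class QNameParts {
  /** Position of the first colon, or -1 if the name has none. */
  static int colon(byte[] name) {
    for (int i = 0; i < name.length; i++) {
      if (name[i] == ':') return i;
    }
    return -1;
  }

  /** Prefix before the colon, or an empty token if no prefix exists. */
  static byte[] prefix(byte[] name) {
    int i = colon(name);
    return i == -1 ? new byte[0] : Arrays.copyOf(name, i);
  }

  /** Local name after the colon; the unchanged input if there is no colon (assumption). */
  static byte[] local(byte[] name) {
    int i = colon(name);
    return i == -1 ? name : Arrays.copyOfRange(name, i + 1, name.length);
  }
}
```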

norm

public static int norm(int ch)
Returns a normalized character without diacritics. This method supports all Latin-1 characters, including supplements.

Parameters:
ch - character to be converted
Returns:
normalized character