Alphabetic Tokenizer
- class py_stringmatching.tokenizer.alphabetic_tokenizer.AlphabeticTokenizer(return_set=False)
Returns tokens that are maximal sequences of consecutive alphabetical characters.
- Parameters
return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens (defaults to False).
- return_set
An attribute that stores the value for the flag return_set.
- Type
boolean
- get_return_set()
Gets the value of the return_set flag.
- Returns
The boolean value of the return_set flag.
- set_return_set(return_set)
Sets the value of the return_set flag.
- Parameters
return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens.
- tokenize(input_string)
Tokenizes the input string into alphabetical tokens.
- Parameters
input_string (str) – The string to be tokenized.
- Returns
A Python list, which represents a set of tokens if the flag return_set is True, and a bag of tokens otherwise.
- Raises
TypeError – If the input is not a string.
Examples
>>> al_tok = AlphabeticTokenizer()
>>> al_tok.tokenize('data99science, data#integration.')
['data', 'science', 'data', 'integration']
>>> al_tok.tokenize('99')
[]
>>> al_tok = AlphabeticTokenizer(return_set=True)
>>> al_tok.tokenize('data99science, data#integration.')
['data', 'science', 'integration']
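The documented behavior can be approximated in a few lines of standard-library Python. The sketch below is illustrative only and is not presented as the library's implementation: the function name alphabetic_tokens is hypothetical, and it assumes "alphabetical" means the ASCII letters a-z and A-Z, which this page does not specify.

```python
import re

# Assumption: "alphabetical" is taken to mean ASCII letters only.
_ALPHA_RUN = re.compile(r"[a-zA-Z]+")


def alphabetic_tokens(input_string, return_set=False):
    """Hypothetical stand-in that mimics the documented tokenize() behavior."""
    if not isinstance(input_string, str):
        raise TypeError("input_string must be a string")
    # findall returns each maximal run of consecutive letters, left to right.
    tokens = _ALPHA_RUN.findall(input_string)
    if return_set:
        # dict.fromkeys drops duplicates while keeping first-seen order,
        # so the "set" is still returned as a list.
        tokens = list(dict.fromkeys(tokens))
    return tokens


print(alphabetic_tokens('data99science, data#integration.'))
print(alphabetic_tokens('data99science, data#integration.', return_set=True))
```

With return_set=True the duplicate 'data' is dropped and the remaining tokens keep their order of first appearance, matching the example output shown above.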