contextpro.tokenization module
This module contains functions used for text data tokenization.
-
contextpro.tokenization.batch_tokenize_text(documents: List[str], tokenizer_method: Optional[str] = 'nltk_word_tokenizer', num_workers: Optional[int] = None, **kwargs) → List[List[str]]
Tokenizes sentences concurrently.
- Parameters
documents (List[str]) – list of sentences to tokenize
tokenizer_method (Optional[str]) –
tokenization method used to tokenize the sentences, by default “nltk_word_tokenizer”.
- Allowed values:
nltk_word_tokenizer
nltk_regexp_tokenizer
num_workers (Optional[int], optional) – number of logical processors to use, by default None (all)
- Other Parameters
**kwargs – additional keyword arguments of the methods below:
nltk.word_tokenize()
nltk.regexp_tokenize()
- Returns
nested lists containing tokens
- Return type
List[List[str]]
- Raises
ValueError – if ‘documents’ is not a list of strings
Examples
>>> from contextpro.tokenization import batch_tokenize_text
>>> corpus = [
...     "My name is Dr. Jekyll.",
...     "His name is Mr. Hyde",
...     "This guy's name is Edward Scissorhands",
...     "And this is Tom Parker"
... ]
>>> batch_tokenize_text(
...     corpus,
...     tokenizer_method="nltk_regexp_tokenizer",
...     pattern=r"\b[^\d\W]+\b",
...     gaps=False,
...     num_workers=2
... )
[['My', 'name', 'is', 'Dr', 'Jekyll'],
 ['His', 'name', 'is', 'Mr', 'Hyde'],
 ['This', 'guy', 's', 'name', 'is', 'Edward', 'Scissorhands'],
 ['And', 'this', 'is', 'Tom', 'Parker']]
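The concurrent behaviour and the ValueError check described above can be illustrated with a standard-library sketch. This is not contextpro's implementation: the names sketch_tokenize and sketch_batch_tokenize are hypothetical, a thread pool stands in for whatever worker mechanism the library uses, and re.findall stands in for the NLTK regexp tokenizer.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional

# Same pattern as in the example above: runs of word characters without digits.
WORD_PATTERN = re.compile(r"\b[^\d\W]+\b")


def sketch_tokenize(document: str) -> List[str]:
    """Hypothetical stand-in for a single-document tokenizer."""
    return WORD_PATTERN.findall(document)


def sketch_batch_tokenize(
    documents: List[str], num_workers: Optional[int] = None
) -> List[List[str]]:
    """Hypothetical sketch: validate the input, then tokenize with a worker pool."""
    # Assumption: mirrors the documented ValueError for non-list-of-strings input.
    if not isinstance(documents, list) or not all(
        isinstance(doc, str) for doc in documents
    ):
        raise ValueError("'documents' must be a list of strings")
    # max_workers=None lets the executor choose its own default pool size.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() yields results in the same order as the input documents.
        return list(pool.map(sketch_tokenize, documents))


corpus = ["My name is Dr. Jekyll.", "His name is Mr. Hyde"]
tokens = sketch_batch_tokenize(corpus, num_workers=2)
print(tokens)
```

Because map() preserves input order, the nested lists line up with the original documents regardless of which worker finishes first.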
-
contextpro.tokenization.tokenize_text(document: str, tokenizer_method: Optional[str] = 'nltk_word_tokenizer', **kwargs) → List[str]
Converts a sentence into a list of tokens.
- Parameters
document (str) – sentence to tokenize
tokenizer_method (Optional[str]) –
tokenization method used to tokenize the sentence, by default “nltk_word_tokenizer”.
- Allowed values:
nltk_word_tokenizer
nltk_regexp_tokenizer
- Other Parameters
**kwargs – additional keyword arguments of the methods below:
nltk.word_tokenize()
nltk.regexp_tokenize()
- Returns
list of tokens
- Return type
List[str]
Examples
>>> from contextpro.tokenization import tokenize_text
>>> sentence = "My name is Dr. Jekyll."
>>> tokenize_text(
...     sentence,
...     tokenizer_method="nltk_regexp_tokenizer",
...     pattern=r"\b[^\d\W]+\b",
...     gaps=False,
... )
['My', 'name', 'is', 'Dr', 'Jekyll']
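The pattern and gaps keywords in the example are presumably forwarded to nltk.regexp_tokenize(), as the Other Parameters section suggests. The sketch below approximates what gaps changes, using only the standard library: with gaps=False the pattern describes the tokens, with gaps=True it describes the separators between them. The helper sketch_regexp_tokenize is hypothetical and is not part of contextpro or NLTK.

```python
import re
from typing import List


def sketch_regexp_tokenize(text: str, pattern: str, gaps: bool = False) -> List[str]:
    """Hypothetical approximation of regexp tokenization with a `gaps` switch."""
    compiled = re.compile(pattern)
    if gaps:
        # The pattern matches separators: split on it and drop empty strings.
        return [token for token in compiled.split(text) if token]
    # The pattern matches the tokens themselves.
    # Note: patterns with capturing groups would change what findall returns.
    return compiled.findall(text)


sentence = "My name is Dr. Jekyll."

# Pattern describes tokens: punctuation is left out.
by_tokens = sketch_regexp_tokenize(sentence, r"\b[^\d\W]+\b", gaps=False)

# Pattern describes gaps (whitespace): punctuation stays attached to words.
by_gaps = sketch_regexp_tokenize(sentence, r"\s+", gaps=True)

print(by_tokens)
print(by_gaps)
```

Matching tokens directly is convenient when punctuation should be discarded, while matching gaps keeps everything between separators intact.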