contextpro.tokenization module

This module contains functions used for text data tokenization.

contextpro.tokenization.batch_tokenize_text(documents: List[str], tokenizer_method: Optional[str] = 'nltk_word_tokenizer', num_workers: Optional[int] = None, **kwargs) → List[List[str]]

Tokenizes a list of sentences concurrently.

Parameters

documents (List[str]) – list of sentences to tokenize
tokenizer_method (Optional[str]) –
tokenization method used to tokenize the sentences, by default “nltk_word_tokenizer”.
Allowed values:
- nltk_word_tokenizer
- nltk_regexp_tokenizer
num_workers (Optional[int], optional) – number of logical processors to use, by default None (all)

Other Parameters

**kwargs – additional keyword arguments passed through to the selected method below:

nltk.word_tokenize()
nltk.regexp_tokenize()

Returns

nested lists containing tokens

Return type

List[List[str]]

Raises

ValueError – if ‘documents’ is not a list of strings

Examples

>>> from contextpro.tokenization import batch_tokenize_text
>>> corpus = [
...     "My name is Dr. Jekyll.",
...     "His name is Mr. Hyde",
...     "This guy's name is Edward Scissorhands",
...     "And this is Tom Parker"
... ]
>>> batch_tokenize_text(
...     corpus,
...     tokenizer_method="nltk_regexp_tokenizer",
...     pattern=r"\b[^\d\W]+\b",
...     gaps=False,
...     num_workers=2
... )
[['My', 'name', 'is', 'Dr', 'Jekyll'],
 ['His', 'name', 'is', 'Mr', 'Hyde'],
 ['This', 'guy', 's', 'name', 'is', 'Edward', 'Scissorhands'],
 ['And', 'this', 'is', 'Tom', 'Parker']]
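
To illustrate the idea behind the call above, here is a minimal standard-library sketch of concurrent regexp tokenization. It is not contextpro's implementation: the names `regexp_tokens` and `batch_tokenize_sketch` are hypothetical, `re.findall` stands in for `nltk.regexp_tokenize()` with `gaps=False`, and a thread pool is assumed for simplicity, whereas the library may distribute work across processes.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional


def regexp_tokens(document: str, pattern: str) -> List[str]:
    """Return every non-overlapping match of `pattern` in `document`."""
    return re.findall(pattern, document)


def batch_tokenize_sketch(
    documents: List[str],
    pattern: str = r"\b[^\d\W]+\b",
    num_workers: Optional[int] = None,
) -> List[List[str]]:
    """Hypothetical helper (not part of contextpro): tokenize documents concurrently."""
    # Mirror the documented contract: reject anything that is not a list of strings.
    if not isinstance(documents, list) or not all(
        isinstance(doc, str) for doc in documents
    ):
        raise ValueError("'documents' must be a list of strings")

    # Assumption: a thread pool; max_workers=None lets the executor pick a default.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Executor.map() yields results in the same order as the input documents.
        return list(pool.map(lambda doc: regexp_tokens(doc, pattern), documents))


tokens = batch_tokenize_sketch(
    ["My name is Dr. Jekyll.", "His name is Mr. Hyde"],
    num_workers=2,
)
print(tokens)
# [['My', 'name', 'is', 'Dr', 'Jekyll'], ['His', 'name', 'is', 'Mr', 'Hyde']]
```

Because the results come back in input order, each inner list lines up with the document at the same index, matching the nested-list return shape documented above.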

contextpro.tokenization.tokenize_text(document: str, tokenizer_method: Optional[str] = 'nltk_word_tokenizer', **kwargs) → List[str]

Converts a sentence into a list of tokens.