contextpro.tokenization module

This module contains functions used for text data tokenization.

contextpro.tokenization.batch_tokenize_text(documents: List[str], tokenizer_method: Optional[str] = 'nltk_word_tokenizer', num_workers: Optional[int] = None, **kwargs) → List[List[str]]

Tokenizes a list of sentences concurrently.

Parameters
  • documents (List[str]) – list of sentences to tokenize

  • tokenizer_method (Optional[str]) –

    tokenization method used to tokenize the sentences, by default “nltk_word_tokenizer”.

    Allowed values:
    • nltk_word_tokenizer

    • nltk_regexp_tokenizer

  • num_workers (Optional[int], optional) – number of logical processors to use, by default None (all)

Other Parameters

**kwargs – additional keyword arguments forwarded to one of the methods below, depending on tokenizer_method:

  • nltk.word_tokenize()

  • nltk.regexp_tokenize()

Returns

nested lists containing tokens

Return type

List[List[str]]

Raises

ValueError – if ‘documents’ is not a list of strings
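
The check behind this error can be pictured with a minimal sketch. The helper name validate_documents and the error message are hypothetical and are not part of contextpro; the sketch only illustrates the kind of input that would be rejected.

```python
from typing import Any, List


def validate_documents(documents: Any) -> List[str]:
    """Hypothetical helper: reject anything that is not a list of strings."""
    if not isinstance(documents, list) or not all(
        isinstance(doc, str) for doc in documents
    ):
        # The message text is an assumption, not the library's wording.
        raise ValueError("'documents' must be a list of strings")
    return documents


# A well-formed corpus passes through unchanged.
validate_documents(["My name is Dr. Jekyll.", "His name is Mr. Hyde"])

# A bare string (or a list containing non-string items) is rejected.
try:
    validate_documents("My name is Dr. Jekyll.")
except ValueError as error:
    print(error)
```

In particular, a single sentence must be wrapped in a list before it is passed as ‘documents’.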

Examples

>>> from contextpro.tokenization import batch_tokenize_text
>>> corpus = [
...     "My name is Dr. Jekyll.",
...     "His name is Mr. Hyde",
...     "This guy's name is Edward Scissorhands",
...     "And this is Tom Parker"
... ]
>>> batch_tokenize_text(
...     corpus,
...     tokenizer_method="nltk_regexp_tokenizer",
...     pattern=r"\b[^\d\W]+\b",
...     gaps=False,
...     num_workers=2
... )
[['My', 'name', 'is', 'Dr', 'Jekyll'],
 ['His', 'name', 'is', 'Mr', 'Hyde'],
 ['This', 'guy', 's', 'name', 'is', 'Edward', 'Scissorhands'],
 ['And', 'this', 'is', 'Tom', 'Parker']]
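
To illustrate what concurrent tokenization involves, here is a minimal standard-library sketch. It is not the contextpro implementation: the names regexp_tokenize_stdlib and batch_tokenize_sketch are hypothetical, re.findall stands in for the regexp tokenizer with gaps=False, and a thread pool is assumed only to keep the example short (the library may well use processes, given that num_workers counts logical processors).

```python
import re
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from typing import List, Optional


def regexp_tokenize_stdlib(document: str, pattern: str) -> List[str]:
    # Stand-in for a regexp tokenizer with gaps=False: return every match.
    return re.findall(pattern, document)


def batch_tokenize_sketch(
    documents: List[str], pattern: str, num_workers: Optional[int] = None
) -> List[List[str]]:
    # Bind the pattern so each worker only needs the document itself.
    tokenize = partial(regexp_tokenize_stdlib, pattern=pattern)
    # max_workers=None lets the executor pick its own default pool size.
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # executor.map preserves input order, so results line up with documents.
        return list(executor.map(tokenize, documents))


corpus = [
    "My name is Dr. Jekyll.",
    "His name is Mr. Hyde",
    "This guy's name is Edward Scissorhands",
    "And this is Tom Parker",
]
tokens = batch_tokenize_sketch(corpus, pattern=r"\b[^\d\W]+\b", num_workers=2)
```

Because the results are collected in input order, tokens[i] always corresponds to corpus[i], whichever worker finishes first.
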
contextpro.tokenization.tokenize_text(document: str, tokenizer_method: Optional[str] = 'nltk_word_tokenizer', **kwargs) → List[str]

Converts a sentence into a list of tokens.

Parameters
  • document (str) – sentence to tokenize

  • tokenizer_method (Optional[str]) –

    tokenization method used to tokenize the sentence, by default “nltk_word_tokenizer”.

    Allowed values:
    • nltk_word_tokenizer

    • nltk_regexp_tokenizer

Other Parameters

**kwargs – additional keyword arguments forwarded to one of the methods below, depending on tokenizer_method:

  • nltk.word_tokenize()

  • nltk.regexp_tokenize()

Returns

list of tokens

Return type

List[str]

Examples

>>> from contextpro.tokenization import tokenize_text
>>> sentence = "My name is Dr. Jekyll."
>>> tokenize_text(
...     sentence,
...     tokenizer_method="nltk_regexp_tokenizer",
...     pattern=r"\b[^\d\W]+\b",
...     gaps=False,
... )
['My', 'name', 'is', 'Dr', 'Jekyll']
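
The relationship between tokenizer_method and **kwargs can be sketched as a small dispatch table. This is an illustration under stated assumptions, not the library's source: TOKENIZERS, tokenize_text_sketch and the two private wrappers are hypothetical names, and raising ValueError for an unknown method is an assumption, since the behaviour for unsupported values is not documented here. Only nltk.word_tokenize() and nltk.regexp_tokenize() are real NLTK functions.

```python
from typing import Any, Callable, Dict, List


def _word_tokenizer(document: str, **kwargs: Any) -> List[str]:
    import nltk  # imported lazily; needs nltk and its Punkt tokenizer data

    return nltk.word_tokenize(document, **kwargs)


def _regexp_tokenizer(document: str, **kwargs: Any) -> List[str]:
    import nltk  # imported lazily; needs nltk to be installed

    return nltk.regexp_tokenize(document, **kwargs)


# Hypothetical mapping from the documented method names to NLTK calls.
TOKENIZERS: Dict[str, Callable[..., List[str]]] = {
    "nltk_word_tokenizer": _word_tokenizer,
    "nltk_regexp_tokenizer": _regexp_tokenizer,
}


def tokenize_text_sketch(
    document: str, tokenizer_method: str = "nltk_word_tokenizer", **kwargs: Any
) -> List[str]:
    try:
        tokenizer = TOKENIZERS[tokenizer_method]
    except KeyError:
        # Assumed behaviour: fail loudly on an unsupported method name.
        allowed = ", ".join(sorted(TOKENIZERS))
        raise ValueError(
            f"Unknown tokenizer_method {tokenizer_method!r}; allowed: {allowed}"
        ) from None
    # Every extra keyword argument goes to the selected NLTK function unchanged.
    return tokenizer(document, **kwargs)
```

With this layout, keyword arguments such as pattern and gaps are only meaningful together with “nltk_regexp_tokenizer”, because nltk.word_tokenize() does not accept them.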