contextpro.feature_extraction module

contextpro.feature_extraction.batch_get_ngrams(tokens: List[List[str]], ngram_size: int = 1) → List[List[str]]

Prepare n-grams from the provided list of token lists.

Parameters

tokens (List[List[str]]) – list of token lists, each representing a single document
ngram_size (int) – size of the n-grams to return, by default 1 (unigrams)

Returns

list of n-gram lists, one per input document

Return type

List[List[str]]

Raises

ValueError – if the provided ‘tokens’ is not a list of lists of strings

Examples

>>> from contextpro.feature_extraction import batch_get_ngrams
>>> tokens = [
...     ["my", "name", "is", "spiderman"],
...     ["she", "lives", "in", "australia"],
... ]
>>> batch_get_ngrams(tokens, ngram_size=2)
[
    ["my name", "name is", "is spiderman"],
    ["she lives", "lives in", "in australia"],
]
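
The output above is consistent with sliding a window of ngram_size tokens over each document and joining each window with a single space. The following is a minimal sketch of that idea in plain Python, not the library's actual implementation. The names sketch_get_ngrams and sketch_batch_get_ngrams are hypothetical, and the handling of ngram_size values below 1 and of documents shorter than ngram_size is an assumption.

```python
from typing import List


def sketch_get_ngrams(tokens: List[str], ngram_size: int = 1) -> List[str]:
    """Hypothetical helper: join each sliding window of tokens with a space."""
    # Assumption: non-positive sizes are rejected; the library may behave differently.
    if ngram_size < 1:
        raise ValueError("ngram_size must be a positive integer")
    # A document shorter than ngram_size yields an empty range, hence an empty list.
    return [
        " ".join(tokens[i : i + ngram_size])
        for i in range(len(tokens) - ngram_size + 1)
    ]


def sketch_batch_get_ngrams(
    tokens: List[List[str]], ngram_size: int = 1
) -> List[List[str]]:
    """Hypothetical helper: apply the sliding window to every document."""
    # Mirror the documented ValueError for input that is not a list of string lists.
    is_nested = isinstance(tokens, list) and all(
        isinstance(doc, list) and all(isinstance(tok, str) for tok in doc)
        for doc in tokens
    )
    if not is_nested:
        raise ValueError("tokens must be a list of lists of strings")
    return [sketch_get_ngrams(doc, ngram_size) for doc in tokens]


docs = [
    ["my", "name", "is", "spiderman"],
    ["she", "lives", "in", "australia"],
]
bigrams = sketch_batch_get_ngrams(docs, ngram_size=2)
```

Under these assumptions, bigrams holds the same two lists as the documented example, and a document with fewer tokens than ngram_size contributes an empty list.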

contextpro.feature_extraction.get_ngrams(tokens: List[str], ngram_size: int = 1) → List[str]

Prepare n-grams from the provided list of tokens.