contextpro.feature_extraction module

contextpro.feature_extraction.batch_get_ngrams(tokens: List[List[str]], ngram_size: int = 1) → List[List[str]]

Prepare n-grams from the provided list of token lists.

Parameters
  • tokens (List[List[str]]) – list of token lists, each representing a single document

  • ngram_size (int) – size of ngrams to return, by default 1 (unigrams)

Returns

list of nested ngram lists

Return type

List[List[str]]

Raises

ValueError – if the provided ‘tokens’ is not a list of lists of strings

Examples

>>> from contextpro.feature_extraction import batch_get_ngrams
>>> tokens = [
...     ["my", "name", "is", "spiderman"],
...     ["she", "lives", "in", "australia"],
... ]
>>> batch_get_ngrams(tokens, ngram_size=2)
[
    ["my name", "name is", "is spiderman"],
    ["she lives", "lives in", "in australia"],
]
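
The batch behaviour documented above can be approximated with a short sketch. This is not the library's implementation: the name sketch_batch_get_ngrams is hypothetical, and joining tokens with a single space and returning an empty list for documents shorter than ngram_size are assumptions inferred from the example output. It also assumes ngram_size is at least 1.

```python
from typing import List


def sketch_batch_get_ngrams(
    tokens: List[List[str]], ngram_size: int = 1
) -> List[List[str]]:
    """Hypothetical stand-in for batch_get_ngrams; not the library's code."""
    # Mirror the documented ValueError for input that is not a list of
    # lists of strings.
    valid = isinstance(tokens, list) and all(
        isinstance(doc, list) and all(isinstance(t, str) for t in doc)
        for doc in tokens
    )
    if not valid:
        raise ValueError("tokens must be a list of lists of strings")
    # For each document, slide a window of ngram_size tokens and join each
    # window with a space (assumes ngram_size >= 1).
    return [
        [
            " ".join(doc[i : i + ngram_size])
            for i in range(len(doc) - ngram_size + 1)
        ]
        for doc in tokens
    ]


docs = [
    ["my", "name", "is", "spiderman"],
    ["she", "lives", "in", "australia"],
]
print(sketch_batch_get_ngrams(docs, ngram_size=2))
```

Passing a flat list of strings such as ["my", "name"] to this sketch raises ValueError, which is how the Raises entry above is read here.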

contextpro.feature_extraction.get_ngrams(tokens: List[str], ngram_size: int = 1) → List[str]

Prepare n-grams from the provided list of tokens.

Parameters
  • tokens (List[str]) – list of tokens

  • ngram_size (int) – size of ngrams to return, by default 1 (unigrams)

Returns

list of ngrams

Return type

List[str]

Raises

ValueError – if the provided ‘tokens’ is not a list of strings

Examples

>>> from contextpro.feature_extraction import get_ngrams
>>> tokens = ["my", "name", "is", "dr", "jekyll"]
>>> get_ngrams(tokens, ngram_size=2)
["my name", "name is", "is dr", "dr jekyll"]
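
For readers who want to see how such n-grams can be built, here is a minimal sliding-window sketch. The function name sketch_get_ngrams is hypothetical and this is not the library's actual code. It assumes that n-grams are joined with a single space, as in the example above, and that ngram_size is at least 1.

```python
from typing import List


def sketch_get_ngrams(tokens: List[str], ngram_size: int = 1) -> List[str]:
    """Hypothetical re-implementation; not the library's actual code."""
    # Mirror the documented ValueError for input that is not a list of strings.
    if not isinstance(tokens, list) or not all(
        isinstance(t, str) for t in tokens
    ):
        raise ValueError("tokens must be a list of strings")
    # Slide a window of ngram_size tokens over the list and join each window
    # with a space (assumes ngram_size >= 1).
    return [
        " ".join(tokens[i : i + ngram_size])
        for i in range(len(tokens) - ngram_size + 1)
    ]


print(sketch_get_ngrams(["my", "name", "is", "dr", "jekyll"], ngram_size=2))
# ['my name', 'name is', 'is dr', 'dr jekyll']
```

A list of k tokens yields k - ngram_size + 1 n-grams, so in this sketch a list shorter than ngram_size produces an empty list. Whether the library behaves the same way in that case is not stated in the documentation above.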