contextpro.statistics module

This module contains functions for calculating statistics on text data.

contextpro.statistics.batch_calculate_corpus_statistics(documents: List[str], lowercase: bool = False, remove_stopwords: bool = False, tokenizer_pattern: str = '\\b[^\\d\\W]+\\b', custom_stopwords: List[str] = [], num_workers: Optional[int] = None) → pandas.core.frame.DataFrame

Concurrently calculates the following statistics for each document in the corpus:

  • Number of characters

  • Number of tokens

  • Number of punctuation characters

  • Number of digits

  • Number of whitespace characters

  • Number of non-ascii characters

  • Sentiment score

  • Subjectivity score

Parameters
  • documents (List[str]) – list of strings

  • lowercase (bool, optional) – convert all characters to lowercase before calculating statistics, by default False

  • remove_stopwords (bool, optional) – remove stopwords before calculating statistics. Uses English stopwords from the NLTK library if the ‘custom_stopwords’ list is not provided, by default False

  • tokenizer_pattern (str, optional) – regex pattern used by the underlying NLTK RegexpTokenizer to tokenize the documents, by default r"\b[^\d\W]+\b"

  • custom_stopwords (List[str], optional) – custom stopwords to use for token filtering, by default []

  • num_workers (Optional[int], optional) – number of logical processors to use, by default None (all)

Returns

DataFrame with statistics for each document in the provided corpus

Return type

pd.DataFrame

Raises

ValueError – if ‘documents’ provided are not a list of strings

Examples

>>> from contextpro.statistics import batch_calculate_corpus_statistics
>>> corpus = [
...     "My name is Dr. Jekyll.",
...     "His name is Mr. Hyde",
...     "This guy's name is Edward Scissorhands",
...     "And this is Tom Parker"
... ]
>>> batch_calculate_corpus_statistics(
...     corpus,
...     lowercase=False,
...     remove_stopwords=False,
...     num_workers=2,
... )
    characters  tokens  punctuation_characters  digits  whitespace_characters  \
0          22       5                       2       0                      4
1          20       5                       1       0                      4
2          38       7                       1       0                      5
3          22       5                       0       0                      4

   ascii_characters  sentiment_score  subjectivity_score
0                22              0.0                 0.0
1                20              0.0                 0.0
2                38              0.0                 0.0
3                22              0.0                 0.0
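
The count-based columns can be approximated with the standard library alone. The sketch below is not the library's implementation: the helper name sketch_document_statistics is hypothetical, and it assumes that tokens are the matches of the default tokenizer pattern and that punctuation means the characters in string.punctuation. The sentiment and subjectivity columns are left out.

```python
import re
import string

# Default pattern from the signature above: runs of word characters without digits.
TOKEN_PATTERN = re.compile(r"\b[^\d\W]+\b")


def sketch_document_statistics(document: str) -> dict:
    """Hypothetical helper: approximate the count-based columns for one document."""
    return {
        "characters": len(document),
        "tokens": len(TOKEN_PATTERN.findall(document)),
        # Assumption: "punctuation" means the ASCII set in string.punctuation.
        "punctuation_characters": sum(ch in string.punctuation for ch in document),
        "digits": sum(ch.isdigit() for ch in document),
        "whitespace_characters": sum(ch.isspace() for ch in document),
        "ascii_characters": sum(ch.isascii() for ch in document),
    }


stats = sketch_document_statistics("This guy's name is Edward Scissorhands")
print(stats)
```

Under these assumptions the sketch gives the same counts as row 2 of the example output. Note that the apostrophe splits "guy's" into two tokens.
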

contextpro.statistics.batch_calculate_sentiment_scores(documents: List[str], num_workers: Optional[int] = None) → List[float]

Calculate sentiment scores for sentences in a concurrent manner.

Parameters
  • documents (List[str]) – list of sentences whose sentiment scores are to be calculated

  • num_workers (Optional[int]) – number of logical processors to use, by default None (all)

Returns

list of floats within [-1.0, 1.0] range representing sentiment scores for the sentences where -1.0 means negative and 1.0 positive

Return type

List[float]

Raises

ValueError – if ‘documents’ provided are not a list of strings

Examples

>>> from contextpro.statistics import batch_calculate_sentiment_scores
>>> corpus = [
...     "I don't like you.",
...     "I love the Spiderman movie",
...     "In my opinion this movie was rather boring than exciting",
...     "This is the worst movie I've ever seen"
... ]
>>> batch_calculate_sentiment_scores(
...     corpus,
...     num_workers=2
... )
[0.0, 0.5, -0.35, -1.0]
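
The validate-then-map pattern behind the batch functions can be sketched as follows. This is an illustration only: sketch_batch_scores and exclamation_ratio are hypothetical names, the scorer is a stand-in rather than a sentiment model, and a thread pool is swapped in for whatever worker mechanism the library actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def sketch_batch_scores(
    documents: List[str],
    score_fn: Callable[[str], float],
    num_workers: Optional[int] = None,
) -> List[float]:
    """Hypothetical helper: validate the input, then score documents in parallel."""
    if not isinstance(documents, list) or not all(
        isinstance(doc, str) for doc in documents
    ):
        raise ValueError("'documents' must be a list of strings")
    # max_workers=None lets the executor choose its own default worker count.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() preserves input order, so scores line up with documents.
        return list(pool.map(score_fn, documents))


def exclamation_ratio(text: str) -> float:
    """Stand-in scorer (not a sentiment model): share of '!' characters."""
    return text.count("!") / len(text) if text else 0.0


scores = sketch_batch_scores(["Great!!", "fine", ""], exclamation_ratio, num_workers=2)
print(scores)
```

Because map() returns results in input order, each score can be matched to its document by index, as in the example output above.
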

contextpro.statistics.batch_calculate_subjectivity_scores(documents: List[str], num_workers: Optional[int] = None) → List[float]

Calculate subjectivity scores for sentences in a concurrent manner.

Parameters
  • documents (List[str]) – list of sentences whose subjectivity scores are to be calculated

  • num_workers (Optional[int]) – number of logical processors to use, by default None (all)

Returns

list of floats within [0.0, 1.0] range representing subjectivity scores for the sentences where 0.0 means very objective and 1.0 very subjective

Return type

List[float]

Raises

ValueError – if ‘documents’ provided are not a list of strings

Examples

>>> from contextpro.statistics import batch_calculate_subjectivity_scores
>>> corpus = [
...     "I don't like you.",
...     "I love the Spiderman movie",
...     "In my opinion this movie was rather boring than exciting",
...     "This is the worst movie I've ever seen"
... ]
>>> batch_calculate_subjectivity_scores(
...     corpus,
...     num_workers=2
... )
[0.0, 0.6, 0.9, 1.0]
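
Since both batch functions return one score per input sentence, their results can be zipped together. The sketch below hard-codes the scores from the two examples above and keeps only sentences at or above a subjectivity cutoff. The helper name sketch_opinionated and the 0.5 cutoff are assumptions chosen for illustration.

```python
from typing import List, Tuple


def sketch_opinionated(
    sentences: List[str],
    sentiments: List[float],
    subjectivities: List[float],
    min_subjectivity: float = 0.5,  # hypothetical cutoff, not a library default
) -> List[Tuple[str, float]]:
    """Hypothetical helper: keep (sentence, sentiment) pairs that are subjective enough."""
    return [
        (sentence, sentiment)
        for sentence, sentiment, subjectivity in zip(sentences, sentiments, subjectivities)
        if subjectivity >= min_subjectivity
    ]


corpus = [
    "I don't like you.",
    "I love the Spiderman movie",
    "In my opinion this movie was rather boring than exciting",
    "This is the worst movie I've ever seen",
]
sentiments = [0.0, 0.5, -0.35, -1.0]      # scores copied from the sentiment example
subjectivities = [0.0, 0.6, 0.9, 1.0]     # scores copied from the subjectivity example

opinionated = sketch_opinionated(corpus, sentiments, subjectivities)
print(opinionated)
```
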

contextpro.statistics.batch_get_ngram_counts(tokens: List[List[str]], ngram_size: int = 1) → Dict[str, int]

Calculate ngram counts across the corpus of tokenized documents.

Parameters
  • tokens (List[List[str]]) – list of token lists, one per document

  • ngram_size (int, optional) – size of ngrams to calculate, by default 1 (unigrams)

Returns

mapping from ngram to the number of occurrences in a corpus of tokenized documents

Return type

Dict[str, int]

Raises

ValueError – if ‘tokens’ provided is not a list of nested token lists

Examples

>>> from contextpro.statistics import batch_get_ngram_counts
>>> corpus = [
...     ["my", "name", "is", "dr", "jekyll"],
...     ["his", "name", "is", "mr", "hyde"],
...     ["this", "guy", "name", "is", "edward", "scissorhands"],
...     ["and", "this", "is", "tom", "parker"],
... ]
>>> batch_get_ngram_counts(corpus, ngram_size=2)
{
    "my name": 1, "name is": 3, "is dr": 1, "dr jekyll": 1,
    "his name": 1, "is mr": 1, "mr hyde": 1, "this guy": 1,
    "guy name": 1, "is edward": 1, "edward scissorhands": 1,
    "and this": 1, "this is": 1, "is tom": 1, "tom parker": 1
}
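
The returned mapping is a plain dict, so it can be ranked with collections.Counter. The sketch below hard-codes the counts from the example above instead of calling the library, and the variable names are illustrative.

```python
from collections import Counter

# Bigram counts copied from the example output above.
ngram_counts = {
    "my name": 1, "name is": 3, "is dr": 1, "dr jekyll": 1,
    "his name": 1, "is mr": 1, "mr hyde": 1, "this guy": 1,
    "guy name": 1, "is edward": 1, "edward scissorhands": 1,
    "and this": 1, "this is": 1, "is tom": 1, "tom parker": 1,
}

# Most frequent bigram with its count.
top_ngram = Counter(ngram_counts).most_common(1)

# Bigrams that occur more than once in the corpus.
repeated = {ngram: count for ngram, count in ngram_counts.items() if count > 1}

print(top_ngram)
print(repeated)
```
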

contextpro.statistics.calculate_sentiment_score(document: str) → float

Calculate the sentiment score for a sentence using a TextBlob object.

Parameters

document (str) – sentence whose sentiment score is to be calculated

Returns

float within [-1.0, 1.0] range representing sentiment score for the sentence, where -1.0 means negative and 1.0 positive

Return type

float

Examples

>>> from contextpro.statistics import calculate_sentiment_score
>>> sentence = "I love the Spiderman movie"
>>> calculate_sentiment_score(sentence)
0.5
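
A score in the [-1.0, 1.0] range is often easier to read as a label. The sketch below shows one possible mapping. The helper name sketch_polarity_label and the 0.1 neutral band are assumptions for illustration, not part of contextpro.

```python
def sketch_polarity_label(score: float, neutral_band: float = 0.1) -> str:
    """Hypothetical helper: map a sentiment score to a coarse label."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("sentiment score must lie within [-1.0, 1.0]")
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    # Scores close to zero are treated as neutral (assumed band, tune as needed).
    return "neutral"


# Scores copied from the batch_calculate_sentiment_scores example.
labels = [sketch_polarity_label(score) for score in [0.0, 0.5, -0.35, -1.0]]
print(labels)
```
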

contextpro.statistics.calculate_subjectivity_score(document: str) → float

Calculate the subjectivity score for a sentence using a TextBlob object.

Parameters

document (str) – sentence whose subjectivity score is to be calculated

Returns

float within [0.0, 1.0] range representing subjectivity score for the sentence, where 0.0 means very objective and 1.0 very subjective

Return type

float

Examples

>>> from contextpro.statistics import calculate_subjectivity_score
>>> sentence = "I love the Spiderman movie"
>>> calculate_subjectivity_score(sentence)
0.6

contextpro.statistics.get_ngram_counts(tokens: List[str], ngram_size: int = 1) → Dict[str, int]

Calculate ngram counts in a tokenized document.

Parameters
  • tokens (List[str]) – list of tokens

  • ngram_size (int, optional) – size of ngrams to calculate, by default 1 (unigrams)

Returns

mapping from ngram to the number of occurrences in a document

Return type

Dict[str, int]

Raises

ValueError – if ‘tokens’ provided is not a list of strings

Examples

>>> from contextpro.statistics import get_ngram_counts
>>> tokens = ["my", "name", "is", "dr", "jekyll"]
>>> get_ngram_counts(tokens, ngram_size=2)
{'my name': 1, 'name is': 1, 'is dr': 1, 'dr jekyll': 1}
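
For readers who want to see the counting logic, an n-gram count over a token list can be sketched with a sliding window and collections.Counter. This is a minimal sketch under the assumption that n-grams are joined with a single space, as in the output above. The helper name sketch_ngram_counts is hypothetical and the library's own implementation may differ.

```python
from collections import Counter
from typing import Dict, List


def sketch_ngram_counts(tokens: List[str], ngram_size: int = 1) -> Dict[str, int]:
    """Hypothetical helper: count space-joined n-grams in one tokenized document."""
    if not isinstance(tokens, list) or not all(isinstance(t, str) for t in tokens):
        raise ValueError("'tokens' must be a list of strings")
    # Zipping the token list against shifted copies of itself yields the
    # sliding windows: (t0, t1), (t1, t2), ... for ngram_size=2.
    windows = zip(*(tokens[offset:] for offset in range(ngram_size)))
    return dict(Counter(" ".join(window) for window in windows))


tokens = ["my", "name", "is", "dr", "jekyll"]
bigrams = sketch_ngram_counts(tokens, ngram_size=2)
print(bigrams)
```
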