contextpro.statistics module

This module contains functions for calculating statistics on text data.

contextpro.statistics.batch_calculate_corpus_statistics(documents: List[str], lowercase: bool = False, remove_stopwords: bool = False, tokenizer_pattern: str = '\\b[^\\d\\W]+\\b', custom_stopwords: List[str] = [], num_workers: Optional[int] = None) → pandas.core.frame.DataFrame

Concurrently calculates the following statistics for each document in the corpus:

Number of characters

Number of tokens

Number of punctuation characters

Number of digits

Number of whitespace characters

Number of non-ascii characters

Sentiment score

Subjectivity score

Parameters

documents (List[str]) – list of strings
lowercase (bool, optional) – convert all characters to lowercase before calculating statistics, by default False
remove_stopwords (bool, optional) – remove stopwords before calculating statistics; uses English stopwords from the NLTK library if no ‘custom_stopwords’ list is provided, by default False
tokenizer_pattern (str, optional) – regex pattern used by the underlying NLTK Regexp Tokenizer to tokenize the documents, by default r"\b[^\d\W]+\b"
custom_stopwords (List[str], optional) – custom stopwords to use for token filtering, by default []
num_workers (Optional[int], optional) – number of logical processors to use, by default None (all)

Returns

DataFrame with statistics for each document in the provided corpus

Return type

pd.DataFrame

Raises

ValueError – if ‘documents’ is not a list of strings

Examples

>>> from contextpro.statistics import batch_calculate_corpus_statistics
>>> corpus = [
...     "My name is Dr. Jekyll.",
...     "His name is Mr. Hyde",
...     "This guy's name is Edward Scissorhands",
...     "And this is Tom Parker"
... ]
>>> batch_calculate_corpus_statistics(
...     corpus,
...     lowercase=False,
...     remove_stopwords=False,
...     num_workers=2,
... )
    characters  tokens  punctuation_characters  digits  whitespace_characters  \
0          22       5                       2       0                      4
1          20       5                       1       0                      4
2          38       7                       1       0                      5
3          22       5                       0       0                      4

   ascii_characters  sentiment_score  subjectivity_score
0                22              0.0                 0.0
1                20              0.0                 0.0
2                38              0.0                 0.0
3                22              0.0                 0.0
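
The character-level columns in the output above can be approximated with the standard library alone. The following is a minimal sketch, not the library's implementation: `sketch_document_statistics` is a hypothetical helper, and applying the default `tokenizer_pattern` with `re.findall` is an assumption about how the underlying tokenizer behaves. It leaves out the sentiment, subjectivity, and ASCII columns.

```python
import re
import string

# Default pattern from the signature above; applying it with re.findall is an
# assumption about how the underlying tokenizer behaves.
TOKENIZER_PATTERN = r"\b[^\d\W]+\b"


def sketch_document_statistics(document: str) -> dict:
    """Hypothetical helper: approximate the character-level columns."""
    return {
        "characters": len(document),
        "tokens": len(re.findall(TOKENIZER_PATTERN, document)),
        "punctuation_characters": sum(ch in string.punctuation for ch in document),
        "digits": sum(ch.isdigit() for ch in document),
        "whitespace_characters": sum(ch.isspace() for ch in document),
    }


stats = sketch_document_statistics("My name is Dr. Jekyll.")
print(stats)
```

For the first document of the example corpus this sketch yields 22 characters, 5 tokens, 2 punctuation characters, 0 digits, and 4 whitespace characters, which agrees with row 0 of the table.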

contextpro.statistics.batch_calculate_sentiment_scores(documents: List[str], num_workers: Optional[int] = None) → List[float]

Calculate sentiment scores for sentences concurrently.

Parameters

documents (List[str]) – list of sentences whose sentiment scores are to be calculated
num_workers (Optional[int]) – number of logical processors to use, by default None (all)

Returns

list of floats within [-1.0, 1.0] range representing sentiment scores for the sentences where -1.0 means negative and 1.0 positive

Return type

List[float]

Raises

ValueError – if ‘documents’ is not a list of strings

Examples

>>> from contextpro.statistics import batch_calculate_sentiment_scores
>>> corpus = [
...     "I don't like you.",
...     "I love the Spiderman movie",
...     "In my opinion this movie was rather boring than exciting",
...     "This is the worst movie I've ever seen"
... ]
>>> batch_calculate_sentiment_scores(
...     corpus,
...     num_workers=2
... )
[0.0, 0.5, -0.35, -1.0]
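
The batch functions share one pattern: validate the input, then apply a per-document function across a pool of workers. The sketch below illustrates that pattern under stated assumptions. `sketch_batch_apply` and `toy_score` are hypothetical names, the toy scorer is not a sentiment model, and a thread pool is used only to keep the sketch runnable anywhere; the kind of executor the library actually uses is not documented here.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def sketch_batch_apply(
    documents: List[str],
    score: Callable[[str], float],
    num_workers: Optional[int] = None,
) -> List[float]:
    """Hypothetical helper: apply `score` to each document on a worker pool."""
    # Mirror the documented ValueError for input that is not a list of strings.
    if not isinstance(documents, list) or not all(
        isinstance(doc, str) for doc in documents
    ):
        raise ValueError("'documents' must be a list of strings")
    # None means "use every logical processor", as in the parameter description.
    workers = num_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, whatever order tasks finish in.
        return list(pool.map(score, documents))


def toy_score(document: str) -> float:
    """Stand-in scorer (assumption, not a sentiment model): share of '!' characters."""
    return document.count("!") / len(document) if document else 0.0


scores = sketch_batch_apply(["Great!!", "ok", ""], toy_score, num_workers=2)
print(scores)
```

Because results come back in input order, the i-th score always belongs to the i-th document, which is the property the examples in this module rely on.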

contextpro.statistics.batch_calculate_subjectivity_scores(documents: List[str], num_workers: Optional[int] = None) → List[float]

Calculate subjectivity scores for sentences concurrently.

Parameters

documents (List[str]) – list of sentences whose subjectivity scores are to be calculated
num_workers (Optional[int]) – number of logical processors to use, by default None (all)

Returns

list of floats within [0.0, 1.0] range representing subjectivity scores for the sentences where 0.0 means very objective and 1.0 very subjective

Return type

List[float]

Raises

ValueError – if ‘documents’ is not a list of strings

Examples

>>> from contextpro.statistics import batch_calculate_subjectivity_scores
>>> corpus = [
...     "I don't like you.",
...     "I love the Spiderman movie",
...     "In my opinion this movie was rather boring than exciting",
...     "This is the worst movie I've ever seen"
... ]
>>> batch_calculate_subjectivity_scores(
...     corpus,
...     num_workers=2
... )
[0.0, 0.6, 0.9, 1.0]

contextpro.statistics.batch_get_ngram_counts(tokens: List[List[str]], ngram_size: int = 1) → Dict[str, int]

Calculate ngram counts across the corpus of tokenized documents.

Parameters

tokens (List[List[str]]) – list of token lists, one per document
ngram_size (int, optional) – size of ngrams to calculate, by default 1 (unigrams)

Returns

mapping from ngram to the number of occurrences in a corpus of tokenized documents

Return type

Dict[str, int]

Raises

ValueError – if ‘tokens’ provided is not a list of nested token lists

Examples

>>> from contextpro.statistics import batch_get_ngram_counts
>>> corpus = [
...     ["my", "name", "is", "dr", "jekyll"],
...     ["his", "name", "is", "mr", "hyde"],
...     ["this", "guy", "name", "is", "edward", "scissorhands"],
...     ["and", "this", "is", "tom", "parker"],
... ]
>>> batch_get_ngram_counts(corpus, ngram_size=2)
{
    "my name": 1, "name is": 3, "is dr": 1, "dr jekyll": 1,
    "his name": 1, "is mr": 1, "mr hyde": 1, "this guy": 1,
    "guy name": 1, "is edward": 1, "edward scissorhands": 1,
    "and this": 1, "this is": 1, "is tom": 1, "tom parker": 1
}
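
The counts above are consistent with sliding a window of `ngram_size` tokens over each document separately and summing the results. The following is a minimal standard-library sketch of that idea, not the library's code: `sketch_ngram_counts` and `sketch_batch_ngram_counts` are hypothetical helpers, and joining tokens with a single space is inferred from the keys shown in the example output.

```python
from collections import Counter
from typing import Dict, List


def sketch_ngram_counts(tokens: List[str], ngram_size: int = 1) -> Counter:
    """Hypothetical helper: count space-joined n-grams in one token list."""
    # zip over shifted copies of the list yields each window of ngram_size tokens.
    windows = zip(*(tokens[i:] for i in range(ngram_size)))
    return Counter(" ".join(window) for window in windows)


def sketch_batch_ngram_counts(
    corpus: List[List[str]], ngram_size: int = 1
) -> Dict[str, int]:
    """Hypothetical helper: sum per-document counts across the corpus."""
    total: Counter = Counter()
    for tokens in corpus:
        # Counting per document keeps n-grams from spanning document boundaries.
        total.update(sketch_ngram_counts(tokens, ngram_size))
    return dict(total)


corpus = [
    ["my", "name", "is", "dr", "jekyll"],
    ["his", "name", "is", "mr", "hyde"],
    ["this", "guy", "name", "is", "edward", "scissorhands"],
    ["and", "this", "is", "tom", "parker"],
]
bigram_counts = sketch_batch_ngram_counts(corpus, ngram_size=2)
print(bigram_counts["name is"])
```

Since each document is counted on its own, a pair such as "jekyll his" never appears, even though those tokens are adjacent once the corpus is flattened.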

contextpro.statistics.calculate_sentiment_score(document: str) → float

Calculate the sentiment score for a sentence using a TextBlob object.

Parameters: document (str) – sentence whose sentiment score is to be calculated
Returns: float within [-1.0, 1.0] range representing sentiment score for the sentence, where -1.0 means negative and 1.0 positive
Return type: float

Examples

>>> from contextpro.statistics import calculate_sentiment_score
>>> sentence = "I love the Spiderman movie"
>>> calculate_sentiment_score(sentence)
0.5

contextpro.statistics.calculate_subjectivity_score(document: str) → float

Calculate the subjectivity score for a sentence using a TextBlob object.

Parameters: document (str) – sentence whose subjectivity score is to be calculated
Returns: float within [0.0, 1.0] range representing subjectivity score for the sentence, where 0.0 means very objective and 1.0 very subjective
Return type: float

Examples

>>> from contextpro.statistics import calculate_subjectivity_score
>>> sentence = "I love the Spiderman movie"
>>> calculate_subjectivity_score(sentence)
0.6

contextpro.statistics.get_ngram_counts(tokens: List[str], ngram_size: int = 1) → Dict[str, int]

Calculate ngram counts in a tokenized document.

Parameters

tokens (List[str]) – list of tokens
ngram_size (int, optional) – size of ngrams to calculate, by default 1 (unigrams)

Returns

mapping from ngram to the number of occurrences in a document

Return type

Dict[str, int]

Raises

ValueError – if ‘tokens’ provided is not a list of strings

Examples

>>> from contextpro.statistics import get_ngram_counts
>>> tokens = ["my", "name", "is", "dr", "jekyll"]
>>> get_ngram_counts(tokens, ngram_size=2)
{'my name': 1, 'name is': 1, 'is dr': 1, 'dr jekyll': 1}