contextpro.statistics module
This module contains functions for calculating statistics on text data.
-
contextpro.statistics.batch_calculate_corpus_statistics(documents: List[str], lowercase: bool = False, remove_stopwords: bool = False, tokenizer_pattern: str = '\\b[^\\d\\W]+\\b', custom_stopwords: List[str] = [], num_workers: Optional[int] = None) → pandas.core.frame.DataFrame
Calculates the following statistics for each document in the corpus concurrently:
Number of characters
Number of tokens
Number of punctuation characters
Number of digits
Number of whitespace characters
Number of non-ascii characters
Sentiment score
Subjectivity score
- Parameters
documents (List[str]) – list of strings
lowercase (bool, optional) – convert all characters to lowercase before calculating statistics, by default False
remove_stopwords (bool, optional) – remove stopwords before calculating statistics. Uses English stopwords from the NLTK library if the ‘custom_stopwords’ list is not provided, by default False
tokenizer_pattern (str, optional) – regex pattern used by the underlying NLTK Regexp Tokenizer to tokenize the documents, by default r"\b[^\d\W]+\b"
custom_stopwords (List[str], optional) – custom stopwords to use for token filtering, by default []
num_workers (Optional[int], optional) – number of logical processors to use, by default None (all)
- Returns
DataFrame with statistics for each document in the provided corpus
- Return type
pd.DataFrame
- Raises
ValueError – if the provided ‘documents’ is not a list of strings
Examples
>>> from contextpro.statistics import batch_calculate_corpus_statistics
>>> corpus = [
...     "My name is Dr. Jekyll.",
...     "His name is Mr. Hyde",
...     "This guy's name is Edward Scissorhands",
...     "And this is Tom Parker"
... ]
>>> batch_calculate_corpus_statistics(
...     corpus,
...     lowercase=False,
...     remove_stopwords=False,
...     num_workers=2,
... )
   characters  tokens  punctuation_characters  digits  whitespace_characters  \
0          22       5                       2       0                      4
1          20       5                       1       0                      4
2          38       7                       1       0                      5
3          22       5                       0       0                      4

   ascii_characters  sentiment_score  subjectivity_score
0                22              0.0                 0.0
1                20              0.0                 0.0
2                38              0.0                 0.0
3                22              0.0                 0.0
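The character-level columns in the example above can be approximated with the standard library alone. The following is a minimal sketch, not the library's implementation: `document_statistics` and its dictionary keys are hypothetical names, `re.findall` stands in for the NLTK Regexp Tokenizer, and "punctuation" is assumed to mean the characters in `string.punctuation`.

```python
import re
import string
from typing import Dict

# Same default as the documented `tokenizer_pattern` parameter.
TOKEN_PATTERN = r"\b[^\d\W]+\b"


def document_statistics(document: str, lowercase: bool = False) -> Dict[str, int]:
    """Hypothetical helper: count basic character classes in one document."""
    if lowercase:
        document = document.lower()
    return {
        "characters": len(document),
        # Approximation: re.findall instead of the NLTK Regexp Tokenizer.
        "tokens": len(re.findall(TOKEN_PATTERN, document)),
        # Assumption: punctuation means ASCII punctuation characters.
        "punctuation_characters": sum(ch in string.punctuation for ch in document),
        "digits": sum(ch.isdigit() for ch in document),
        "whitespace_characters": sum(ch.isspace() for ch in document),
        "non_ascii_characters": sum(ord(ch) > 127 for ch in document),
    }


print(document_statistics("My name is Dr. Jekyll."))
```

Under these assumptions, the sketch reproduces the character, token, punctuation, digit, and whitespace counts shown for the first and third documents in the example; the sentiment and subjectivity columns are not covered here.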
-
contextpro.statistics.batch_calculate_sentiment_scores(documents: List[str], num_workers: Optional[int] = None) → List[float]
Calculate sentiment scores for sentences concurrently.
- Parameters
documents (List[str]) – list of sentences whose sentiment scores are to be calculated
num_workers (Optional[int]) – number of logical processors to use, by default None (all)
- Returns
list of floats in the range [-1.0, 1.0] representing the sentiment score of each sentence, where -1.0 means negative and 1.0 means positive
- Return type
List[float]
- Raises
ValueError – if the provided ‘documents’ is not a list of strings
Examples
>>> from contextpro.statistics import batch_calculate_sentiment_scores
>>> corpus = [
...     "I don't like you.",
...     "I love the Spiderman movie",
...     "In my opinion this movie was rather boring than exciting",
...     "This is the worst movie I've ever seen"
... ]
>>> batch_calculate_sentiment_scores(
...     corpus,
...     num_workers=2
... )
[0.0, 0.5, -0.35, -1.0]
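The batch functions share one pattern: validate that the input is a list of strings, then map a per-document scorer over it with a pool of workers. The sketch below illustrates that pattern only; `batch_apply` and `word_count_score` are hypothetical names, and a thread pool is used for simplicity, whereas the `num_workers` description ("logical processors") suggests the library may use processes instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def batch_apply(
    score: Callable[[str], float],
    documents: List[str],
    num_workers: Optional[int] = None,
) -> List[float]:
    """Hypothetical helper: apply `score` to every document using a worker pool."""
    if not isinstance(documents, list) or not all(
        isinstance(doc, str) for doc in documents
    ):
        raise ValueError("'documents' must be a list of strings")
    # max_workers=None lets the executor choose a default pool size.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Executor.map returns results in input order.
        return list(pool.map(score, documents))


def word_count_score(document: str) -> float:
    """Placeholder scorer (not a sentiment model): number of whitespace-separated words."""
    return float(len(document.split()))


print(batch_apply(word_count_score, ["I don't like you.", "I love it"], num_workers=2))
```

Because the results come back in input order, the i-th score always corresponds to the i-th document, which matches how the documented examples pair inputs with outputs.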
-
contextpro.statistics.batch_calculate_subjectivity_scores(documents: List[str], num_workers: Optional[int] = None) → List[float]
Calculate subjectivity scores for sentences concurrently.
- Parameters
documents (List[str]) – list of sentences whose subjectivity scores are to be calculated
num_workers (Optional[int]) – number of logical processors to use, by default None (all)
- Returns
list of floats in the range [0.0, 1.0] representing the subjectivity score of each sentence, where 0.0 means very objective and 1.0 means very subjective
- Return type
List[float]
- Raises
ValueError – if the provided ‘documents’ is not a list of strings
Examples
>>> from contextpro.statistics import batch_calculate_subjectivity_scores
>>> corpus = [
...     "I don't like you.",
...     "I love the Spiderman movie",
...     "In my opinion this movie was rather boring than exciting",
...     "This is the worst movie I've ever seen"
... ]
>>> batch_calculate_subjectivity_scores(
...     corpus,
...     num_workers=2
... )
[0.0, 0.6, 0.9, 1.0]
-
contextpro.statistics.batch_get_ngram_counts(tokens: List[List[str]], ngram_size: int = 1) → Dict[str, int]
Calculate ngram counts across the corpus of tokenized documents.
- Parameters
tokens (List[List[str]]) – list of token lists, one per document
ngram_size (int, optional) – size of ngrams to calculate, by default 1 (unigrams)
- Returns
mapping from ngram to the number of occurrences in a corpus of tokenized documents
- Return type
Dict[str, int]
- Raises
ValueError – if the provided ‘tokens’ is not a list of token lists
Examples
>>> from contextpro.statistics import batch_get_ngram_counts
>>> corpus = [
...     ["my", "name", "is", "dr", "jekyll"],
...     ["his", "name", "is", "mr", "hyde"],
...     ["this", "guy", "name", "is", "edward", "scissorhands"],
...     ["and", "this", "is", "tom", "parker"],
... ]
>>> batch_get_ngram_counts(corpus, ngram_size=2)
{
    "my name": 1,
    "name is": 3,
    "is dr": 1,
    "dr jekyll": 1,
    "his name": 1,
    "is mr": 1,
    "mr hyde": 1,
    "this guy": 1,
    "guy name": 1,
    "is edward": 1,
    "edward scissorhands": 1,
    "and this": 1,
    "this is": 1,
    "is tom": 1,
    "tom parker": 1
}
-
contextpro.statistics.calculate_sentiment_score(document: str) → float
Calculate the sentiment score for a sentence using a TextBlob object.
- Parameters
document (str) – sentence whose sentiment score is to be calculated
- Returns
float in the range [-1.0, 1.0] representing the sentiment score of the sentence, where -1.0 means negative and 1.0 means positive
- Return type
float
Examples
>>> from contextpro.statistics import calculate_sentiment_score
>>> sentence = "I love the Spiderman movie"
>>> calculate_sentiment_score(sentence)
0.5
-
contextpro.statistics.calculate_subjectivity_score(document: str) → float
Calculate the subjectivity score for a sentence using a TextBlob object.
- Parameters
document (str) – sentence whose subjectivity score is to be calculated
- Returns
float in the range [0.0, 1.0] representing the subjectivity score of the sentence, where 0.0 means very objective and 1.0 means very subjective
- Return type
float
Examples
>>> from contextpro.statistics import calculate_subjectivity_score
>>> sentence = "I love the Spiderman movie"
>>> calculate_subjectivity_score(sentence)
0.6
-
contextpro.statistics.get_ngram_counts(tokens: List[str], ngram_size: int = 1) → Dict[str, int]
Calculate ngram counts in a tokenized document.
- Parameters
tokens (List[str]) – list of tokens
ngram_size (int, optional) – size of ngrams to calculate, by default 1 (unigrams)
- Returns
mapping from ngram to the number of occurrences in a document
- Return type
Dict[str, int]
- Raises
ValueError – if the provided ‘tokens’ is not a list of strings
Examples
>>> from contextpro.statistics import get_ngram_counts
>>> tokens = ["my", "name", "is", "dr", "jekyll"]
>>> get_ngram_counts(tokens, ngram_size=2)
{'my name': 1, 'name is': 1, 'is dr': 1, 'dr jekyll': 1}
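For readers who want to see how such counts arise, here is a minimal sketch of per-document and corpus-wide ngram counting using `collections.Counter`. It is an illustration under assumptions, not the library's code: `ngram_counts` and `batch_ngram_counts` are hypothetical names, and ngrams are assumed to be joined with single spaces, as the documented outputs suggest.

```python
from collections import Counter
from typing import Dict, List


def ngram_counts(tokens: List[str], ngram_size: int = 1) -> Dict[str, int]:
    """Hypothetical helper: count space-joined ngrams in one tokenized document."""
    if not isinstance(tokens, list) or not all(isinstance(t, str) for t in tokens):
        raise ValueError("'tokens' must be a list of strings")
    # Slide a window of `ngram_size` tokens over the document.
    ngrams = (
        " ".join(tokens[i : i + ngram_size])
        for i in range(len(tokens) - ngram_size + 1)
    )
    return dict(Counter(ngrams))


def batch_ngram_counts(documents: List[List[str]], ngram_size: int = 1) -> Dict[str, int]:
    """Hypothetical helper: sum per-document ngram counts over a corpus."""
    totals: Counter = Counter()
    for tokens in documents:
        # Counter.update adds counts rather than replacing them.
        totals.update(ngram_counts(tokens, ngram_size))
    return dict(totals)


print(ngram_counts(["my", "name", "is", "dr", "jekyll"], ngram_size=2))
```

Counting per document and then summing means ngrams never span document boundaries, which is consistent with the corpus example: "name is" occurs once in each of three documents and is reported with a count of 3.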