contextpro package
-
contextpro.batch_calculate_corpus_statistics(documents: List[str], lowercase: bool = False, remove_stopwords: bool = False, tokenizer_pattern: str = '\\b[^\\d\\W]+\\b', custom_stopwords: List[str] = [], num_workers: Optional[int] = None) → pandas.core.frame.DataFrame
Calculates the following statistics for each document in the corpus concurrently:
Number of characters
Number of tokens
Number of punctuation characters
Number of digits
Number of whitespace characters
Number of non-ascii characters
Sentiment score
Subjectivity score
- Parameters
documents (List[str]) – list of strings
lowercase (bool, optional) – convert all characters to lowercase before calculating statistics, by default False
remove_stopwords (bool, optional) – remove stopwords before calculating statistics. Uses English stopwords from the NLTK library if the ‘custom_stopwords’ list is not provided, by default False
tokenizer_pattern (str, optional) – regex pattern used by the underlying NLTK Regexp Tokenizer to tokenize the documents, by default r"\b[^\d\W]+\b"
custom_stopwords (List[str], optional) – custom stopwords to use for token filtering, by default []
num_workers (Optional[int], optional) – number of logical processors to use, by default None (all)
- Returns
DataFrame with statistics for each document in the provided corpus
- Return type
pd.DataFrame
- Raises
ValueError – if ‘documents’ provided are not a list of strings
Examples
>>> from contextpro.statistics import batch_calculate_corpus_statistics
>>> corpus = [
...     "My name is Dr. Jekyll.",
...     "His name is Mr. Hyde",
...     "This guy's name is Edward Scissorhands",
...     "And this is Tom Parker"
... ]
>>> batch_calculate_corpus_statistics(
...     corpus,
...     lowercase=False,
...     remove_stopwords=False,
...     num_workers=2,
... )
   characters  tokens  punctuation_characters  digits  whitespace_characters  \
0          22       5                       2       0                      4
1          20       5                       1       0                      4
2          38       7                       1       0                      5
3          22       5                       0       0                      4

   ascii_characters  sentiment_score  subjectivity_score
0                22              0.0                 0.0
1                20              0.0                 0.0
2                38              0.0                 0.0
3                22              0.0                 0.0
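For intuition, several of the per-document counts can be approximated with the standard library alone. The sketch below is not contextpro's implementation: the helper name `count_basic_statistics` is hypothetical, punctuation is assumed to mean the characters in `string.punctuation`, and tokens are counted with the default `tokenizer_pattern` from the signature.

```python
import re
import string
from typing import Dict

# Default tokenizer pattern taken from the documented signature.
TOKEN_PATTERN = re.compile(r"\b[^\d\W]+\b")


def count_basic_statistics(document: str) -> Dict[str, int]:
    """Hypothetical helper: count simple character-level statistics."""
    return {
        "characters": len(document),
        "tokens": len(TOKEN_PATTERN.findall(document)),
        # Assumption: "punctuation" means the ASCII set in string.punctuation.
        "punctuation_characters": sum(ch in string.punctuation for ch in document),
        "digits": sum(ch.isdigit() for ch in document),
        "whitespace_characters": sum(ch.isspace() for ch in document),
    }


stats = count_basic_statistics("My name is Dr. Jekyll.")
print(stats)
```

For the first document this yields 22 characters, 5 tokens, 2 punctuation characters, 0 digits, and 4 whitespace characters, which agrees with row 0 of the documented output.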
-
contextpro.batch_calculate_sentiment_scores(documents: List[str], num_workers: Optional[int] = None) → List[float]
Calculate sentiment scores for sentences in a concurrent manner.
- Parameters
documents (List[str]) – list of sentences whose sentiment scores are to be calculated
num_workers (Optional[int]) – number of logical processors to use, by default None (all)
- Returns
list of floats within [-1.0, 1.0] range representing sentiment scores for the sentences where -1.0 means negative and 1.0 positive
- Return type
List[float]
- Raises
ValueError – if ‘documents’ provided are not a list of strings
Examples
>>> from contextpro.statistics import batch_calculate_sentiment_scores
>>> corpus = [
...     "I don't like you.",
...     "I love the Spiderman movie",
...     "In my opinion this movie was rather boring than exciting",
...     "This is the worst movie I've ever seen"
... ]
>>> batch_calculate_sentiment_scores(
...     corpus,
...     num_workers=2
... )
[0.0, 0.5, -0.35, -1.0]
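The batch functions share a pattern: validate that the input is a list of strings, then map a per-document function over the list with `num_workers` workers. A minimal sketch of that pattern follows. The helper `batch_apply` is hypothetical, and a thread pool is used only to keep the example self-contained; contextpro's actual worker mechanism is not described in this reference.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def batch_apply(
    func: Callable[[str], T],
    documents: List[str],
    num_workers: Optional[int] = None,
) -> List[T]:
    """Hypothetical helper: apply `func` to every document using a worker pool."""
    # Mirror the documented contract: reject anything but a list of strings.
    if not isinstance(documents, list) or not all(
        isinstance(doc, str) for doc in documents
    ):
        raise ValueError("'documents' must be a list of strings")
    # Executor.map returns results in input order, whatever the completion order.
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        return list(executor.map(func, documents))


lengths = batch_apply(
    len,
    ["I don't like you.", "I love the Spiderman movie"],
    num_workers=2,
)
print(lengths)
```

Passing `num_workers=None` lets the executor pick its own default pool size, which parallels the "None (all)" default documented for these functions.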
-
contextpro.batch_calculate_subjectivity_scores(documents: List[str], num_workers: Optional[int] = None) → List[float]
Calculate subjectivity scores for sentences in a concurrent manner.
- Parameters
documents (List[str]) – list of sentences whose subjectivity scores are to be calculated
num_workers (Optional[int]) – number of logical processors to use, by default None (all)
- Returns
list of floats within [0.0, 1.0] range representing subjectivity scores for the sentences where 0.0 means very objective and 1.0 very subjective
- Return type
List[float]
- Raises
ValueError – if ‘documents’ provided are not a list of strings
Examples
>>> from contextpro.statistics import batch_calculate_subjectivity_scores
>>> corpus = [
...     "I don't like you.",
...     "I love the Spiderman movie",
...     "In my opinion this movie was rather boring than exciting",
...     "This is the worst movie I've ever seen"
... ]
>>> batch_calculate_subjectivity_scores(
...     corpus,
...     num_workers=2
... )
[0.0, 0.6, 0.9, 1.0]
-
contextpro.batch_convert_numerals_to_numbers(documents: List[str], num_workers: Optional[int] = None) → List[str]
Replaces numerals with numbers in all sentences in a concurrent manner.
- Parameters
documents (List[str]) – list of sentences which contain numerals
num_workers (Optional[int]) – number of logical processors to use, by default None (all)
- Returns
list of sentences with numerals converted to numbers
- Return type
List[str]
- Raises
ValueError – if ‘documents’ provided are not a list of strings
Examples
>>> from contextpro.normalization import batch_convert_numerals_to_numbers
>>> corpus = [
...     "A bunch of five",
...     "A picture is worth a thousand words",
...     "A stitch in time saves nine",
...     "Back to square one",
...     "Behind the eight ball",
...     "Between two stools",
... ]
>>> batch_convert_numerals_to_numbers(corpus, num_workers=2)
[
    'A bunch of 5',
    'A picture is worth a 1000 words',
    'A stitch in time saves 9',
    'Back to square 1',
    'Behind the 8 ball',
    'Between 2 stools',
]
-
contextpro.batch_get_ngram_counts(tokens: List[List[str]], ngram_size: int = 1) → Dict[str, int]
Calculate ngram counts across the corpus of tokenized documents.
- Parameters
tokens (List[List[str]]) – list of nested token lists
ngram_size (int, optional) – size of ngrams to calculate, by default 1 (unigrams)
- Returns
mapping from ngram to the number of occurrences in a corpus of tokenized documents
- Return type
Dict[str, int]
- Raises
ValueError – if ‘tokens’ provided is not a list of nested token lists
Examples
>>> from contextpro.statistics import batch_get_ngram_counts
>>> corpus = [
...     ["my", "name", "is", "dr", "jekyll"],
...     ["his", "name", "is", "mr", "hyde"],
...     ["this", "guy", "name", "is", "edward", "scissorhands"],
...     ["and", "this", "is", "tom", "parker"],
... ]
>>> batch_get_ngram_counts(corpus, ngram_size=2)
{
    "my name": 1,
    "name is": 3,
    "is dr": 1,
    "dr jekyll": 1,
    "his name": 1,
    "is mr": 1,
    "mr hyde": 1,
    "this guy": 1,
    "guy name": 1,
    "is edward": 1,
    "edward scissorhands": 1,
    "and this": 1,
    "this is": 1,
    "is tom": 1,
    "tom parker": 1
}
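Corpus-level counting can be pictured as summing per-document n-gram counts, with each document windowed separately so that no n-gram crosses a document boundary. The sketch below uses `collections.Counter` to illustrate the idea; `count_ngrams_in_corpus` is a hypothetical name, and joining tokens with a single space is an assumption based on the keys shown in the example.

```python
from collections import Counter
from typing import Dict, List


def count_ngrams_in_corpus(
    tokens: List[List[str]], ngram_size: int = 1
) -> Dict[str, int]:
    """Hypothetical helper: count space-joined n-grams over all documents."""
    counts: Counter = Counter()
    for document in tokens:
        # Slide a window of width ngram_size over each document separately.
        for start in range(len(document) - ngram_size + 1):
            counts[" ".join(document[start:start + ngram_size])] += 1
    return dict(counts)


corpus = [
    ["my", "name", "is", "dr", "jekyll"],
    ["his", "name", "is", "mr", "hyde"],
]
bigram_counts = count_ngrams_in_corpus(corpus, ngram_size=2)
print(bigram_counts["name is"])
```

A document shorter than `ngram_size` contributes nothing, because the window range is empty.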
-
contextpro.batch_get_ngrams(tokens: List[List[str]], ngram_size: int = 1) → List[List[str]]
Prepare n-grams from the provided list of token lists.
- Parameters
tokens (List[List[str]]) – list of token lists, each representing single document
ngram_size (int) – size of ngrams to return, by default 1 (unigrams)
- Returns
list of nested ngram lists
- Return type
List[List[str]]
- Raises
ValueError – if ‘tokens’ provided are not a list of nested string lists
Examples
>>> from contextpro.feature_extraction import batch_get_ngrams
>>> tokens = [
...     ["my", "name", "is", "spiderman"],
...     ["she", "lives", "in", "australia"],
... ]
>>> batch_get_ngrams(tokens, ngram_size=2)
[
    ["my name", "name is", "is spiderman"],
    ["she lives", "lives in", "in australia"],
]
-
contextpro.batch_lemmatize(tokens: List[List[str]], num_workers: Optional[int] = None, **kwargs: Any) → List[List[str]]
Lemmatizes tokens in lists of tokens in a concurrent manner using NLTK WordNetLemmatizer.
- Parameters
tokens (List[List[str]]) – nested token lists containing tokens with various inflectional forms
num_workers (Optional[int], optional) – number of logical processors to use, by default None (all)
- Other Parameters
**kwargs (Any) –
- additional properties of the below lemmatizer:
nltk.WordNetLemmatizer
- Returns
nested token lists with lemmatized tokens
- Return type
List[List[str]]
- Raises
ValueError – if ‘tokens’ provided are not a list of nested string lists
Examples
>>> from contextpro.normalization import batch_lemmatize
>>> corpus = [
...     ["I", "like", "driving", "a", "car"],
...     ["I", "am", "going", "for", "a", "walk"],
...     ["What", "are", "you", "doing"],
...     ["Where", "are", "you", "coming", "from"]
... ]
>>> batch_lemmatize(corpus, num_workers=2, pos="v")
[
    ['I', 'like', 'drive', 'a', 'car'],
    ['I', 'be', 'go', 'for', 'a', 'walk'],
    ['What', 'be', 'you', 'do'],
    ['Where', 'be', 'you', 'come', 'from']
]
-
contextpro.batch_remove_non_ascii_characters(documents: List[str]) → List[str]
Removes non-ascii characters from sentences.
- Parameters
documents (List[str]) – list of sentences with non-ascii characters
- Returns
list of sentences with removed non-ascii characters
- Return type
List[str]
- Raises
ValueError – if ‘documents’ provided are not a list of strings
Examples
>>> from contextpro.normalization import batch_remove_non_ascii_characters
>>> corpus = [
...     "https://sitebulb.com/Folder/øê.html?大学",
...     "Jöreskog bißchen Zürcher",
...     "This is a © but not a ®",
...     "fractions ¼, ½, ¾"
... ]
>>> batch_remove_non_ascii_characters(corpus)
[
    'https://sitebulb.com/Folder/.html?',
    'Jreskog bichen Zrcher',
    'This is a but not a ',
    'fractions , , '
]
-
contextpro.batch_remove_numbers(documents: List[str]) → List[str]
Removes numbers from all sentences.
- Parameters
documents (List[str]) – list of sentences which contain numbers
- Returns
list of sentences without numbers
- Return type
List[str]
- Raises
ValueError – if ‘documents’ provided are not a list of strings
Examples
>>> from contextpro.normalization import batch_remove_numbers
>>> corpus = [
...     "He is 12 years old.",
...     "His father has 3 cars",
...     "I have 3 computers",
...     "He earns 1000$ daily"
... ]
>>> batch_remove_numbers(corpus)
[
    'He is years old.',
    'His father has cars',
    'I have computers',
    'He earns $ daily'
]
-
contextpro.batch_remove_punctuation(documents: List[str]) → List[str]
Removes punctuation characters from all sentences.
- Parameters
documents (List[str]) – list of sentences which contain punctuation characters
- Returns
list of sentences without punctuation characters
- Return type
List[str]
- Raises
ValueError – if ‘documents’ provided are not a list of strings
Examples
>>> from contextpro.normalization import batch_remove_punctuation
>>> corpus = [
...     "My name is Dr. Jekyll.",
...     "His name is Mr. Hyde!",
...     "Is his name Edward Scissorhands?",
...     "This is Tom-Parker!"
... ]
>>> batch_remove_punctuation(corpus)
[
    'My name is Dr Jekyll',
    'His name is Mr Hyde',
    'Is his name Edward Scissorhands',
    'This is TomParker'
]
-
contextpro.batch_remove_stopwords(tokens: List[List[str]], custom_stopwords: List[str] = []) → List[List[str]]
Removes stopwords from nested token lists.
- Parameters
tokens (List[List[str]]) – nested token lists with stopwords included
custom_stopwords (List[str], optional) – list of stopwords to remove from sentences, by default []
- Returns
nested token lists without stopwords
- Return type
List[List[str]]
- Raises
ValueError – if ‘tokens’ provided are not a list of nested string lists
Examples
>>> from contextpro.normalization import batch_remove_stopwords
>>> corpus = [
...     ['My', 'name', 'is', 'Dr', 'Jekyll'],
...     ['His', 'name', 'is', 'Mr', 'Hyde'],
...     ['This', 'guy', 's', 'name', 'is', 'Edward', 'Scissorhands'],
...     ['And', 'this', 'is', 'Tom', 'Parker']
... ]
>>> batch_remove_stopwords(corpus)
[
    ['My', 'name', 'Dr', 'Jekyll'],
    ['His', 'name', 'Mr', 'Hyde'],
    ['This', 'guy', 'name', 'Edward', 'Scissorhands'],
    ['And', 'Tom', 'Parker']
]
-
contextpro.batch_remove_whitespace(documents: List[str]) → List[str]
Removes whitespace characters from all sentences.
- Parameters
documents (List[str]) – list of sentences which contain whitespace characters
- Returns
list of sentences without whitespace characters
- Return type
List[str]
- Raises
ValueError – if ‘documents’ provided are not a list of strings
Examples
>>> from contextpro.normalization import batch_remove_whitespace
>>> corpus = [
...     "He Has Not Been in Touch for over a Month.",
...     "I Will See \r\nYou next Week",
...     "I Am Hungry - Can We\t Eat Now, Please?",
...     "It Is \r\nFreezing Outside!"
... ]
>>> batch_remove_whitespace(corpus)
[
    'He Has Not Been in Touch for over a Month.',
    'I Will See You next Week',
    'I Am Hungry - Can We Eat Now, Please?',
    'It Is Freezing Outside!',
]
-
contextpro.batch_replace_contractions(documents: List[str], **kwargs: bool) → List[str]
Expands contractions in sentences.
- Parameters
documents (List[str]) – list of sentences with contracted words
- Other Parameters
**kwargs (bool) –
- additional properties of the below method:
contractions.fix()
- Returns
list of sentences without contractions
- Return type
List[str]
- Raises
ValueError – if ‘documents’ provided are not a list of strings
Examples
>>> from contextpro.normalization import batch_replace_contractions
>>> corpus = [
...     "I don't want to be rude, but you shouldn't do this",
...     "Do you think he'll pass his driving test?",
...     "I'll see you next week",
...     "I'm going for a walk"
... ]
>>> batch_replace_contractions(corpus)
[
    'I do not want to be rude, but you should not do this',
    'Do you think he will pass his driving test?',
    'I will see you next week',
    'I am going for a walk',
]
-
contextpro.batch_stem(tokens: List[List[str]], stemmer_type: Optional[str] = 'nltk_porter_stemmer', num_workers: Optional[int] = None, **kwargs: Any) → List[List[str]]
Stems tokens in lists of tokens to their root (base) form in a concurrent manner.
- Parameters
tokens (List[List[str]]) – list of lists of tokens containing words with various inflectional forms
stemmer_type (Optional[str], optional) –
stemmer type which will be used to stem the tokens, by default “nltk_porter_stemmer”
- Allowed values:
nltk_porter_stemmer
nltk_lancaster_stemmer
nltk_regexp_stemmer
nltk_snowball_stemmer
num_workers (Optional[int], optional) – number of processors to use, by default None (all processors)
- Other Parameters
**kwargs (Any) –
- additional properties of the below stemmers:
nltk.PorterStemmer
nltk.LancasterStemmer
nltk.RegexpStemmer
nltk.SnowballStemmer
- Returns
list of lists of stemmed tokens
- Return type
List[List[str]]
- Raises
ValueError – if ‘tokens’ provided are not a list of nested string lists
Examples
>>> from contextpro.normalization import batch_stem
>>> corpus = [
...     ["I", "like", "driving", "a", "car"],
...     ["I", "am", "going", "for", "a", "walk"],
...     ["Do", "you", "think", "this", "is", "doable"],
...     ["I", "have", "three", "bikes", "in", "two", "garages"]
... ]
>>> batch_stem(
...     corpus,
...     stemmer_type="nltk_porter_stemmer",
...     num_workers=2
... )
[
    ['I', 'like', 'drive', 'a', 'car'],
    ['I', 'am', 'go', 'for', 'a', 'walk'],
    ['Do', 'you', 'think', 'thi', 'is', 'doabl'],
    ['I', 'have', 'three', 'bike', 'in', 'two', 'garag']
]
-
contextpro.batch_tokenize_text(documents: List[str], tokenizer_method: Optional[str] = 'nltk_word_tokenizer', num_workers: Optional[int] = None, **kwargs) → List[List[str]]
Tokenizes sentences in a concurrent manner.
- Parameters
documents (List[str]) – list of sentences to tokenize
tokenizer_method (Optional[str]) –
tokenization method which will be used to tokenize the sentences, by default “nltk_word_tokenizer”.
- Allowed values:
nltk_word_tokenizer
nltk_regexp_tokenizer
num_workers (Optional[int], optional) – number of logical processors to use, by default None (all)
- Other Parameters
**kwargs –
- additional properties of the below methods:
nltk.word_tokenize()
nltk.regexp_tokenize()
- Returns
nested lists containing tokens
- Return type
List[List[str]]
- Raises
ValueError – if ‘documents’ provided are not a list of strings
Examples
>>> from contextpro.tokenization import batch_tokenize_text
>>> corpus = [
...     "My name is Dr. Jekyll.",
...     "His name is Mr. Hyde",
...     "This guy's name is Edward Scissorhands",
...     "And this is Tom Parker"
... ]
>>> batch_tokenize_text(
...     corpus,
...     tokenizer_method="nltk_regexp_tokenizer",
...     pattern=r"\b[^\d\W]+\b",
...     gaps=False,
...     num_workers=2
... )
[['My', 'name', 'is', 'Dr', 'Jekyll'],
 ['His', 'name', 'is', 'Mr', 'Hyde'],
 ['This', 'guy', 's', 'name', 'is', 'Edward', 'Scissorhands'],
 ['And', 'this', 'is', 'Tom', 'Parker']]
-
contextpro.calculate_sentiment_score(document: str) → float
Calculate sentiment score for the sentence using TextBlob object.
- Parameters
document (str) – sentence whose sentiment score is to be calculated
- Returns
float within [-1.0, 1.0] range representing sentiment score for the sentence, where -1.0 means negative and 1.0 positive
- Return type
float
Examples
>>> from contextpro.statistics import calculate_sentiment_score
>>> sentence = "I love the Spiderman movie"
>>> calculate_sentiment_score(sentence)
0.5
-
contextpro.calculate_subjectivity_score(document: str) → float
Calculate subjectivity score for the sentence using TextBlob object.
- Parameters
document (str) – sentence whose subjectivity score is to be calculated
- Returns
float within [0.0, 1.0] range representing subjectivity score for the sentence, where 0.0 means very objective and 1.0 very subjective
- Return type
float
Examples
>>> from contextpro.statistics import calculate_subjectivity_score
>>> sentence = "I love the Spiderman movie"
>>> calculate_subjectivity_score(sentence)
0.6
-
contextpro.convert_numerals_to_numbers(sentence: str) → str
Replaces numerals with numbers in a sentence.
- Parameters
sentence (str) – sentence containing numerals
- Returns
sentence with numerals replaced with numbers
- Return type
str
Examples
>>> from contextpro.normalization import convert_numerals_to_numbers
>>> sentence = "A bunch of five"
>>> convert_numerals_to_numbers(sentence)
'A bunch of 5'
-
contextpro.get_ngram_counts(tokens: List[str], ngram_size: int = 1) → Dict[str, int]
Calculate ngram counts in a tokenized document.
- Parameters
tokens (List[str]) – list of tokens
ngram_size (int, optional) – size of ngrams to calculate, by default 1 (unigrams)
- Returns
mapping from ngram to the number of occurrences in a document
- Return type
Dict[str, int]
- Raises
ValueError – if ‘tokens’ provided is not a list of strings
Examples
>>> from contextpro.statistics import get_ngram_counts
>>> tokens = ["my", "name", "is", "dr", "jekyll"]
>>> get_ngram_counts(tokens, ngram_size=2)
{'my name': 1, 'name is': 1, 'is dr': 1, 'dr jekyll': 1}
-
contextpro.get_ngrams(tokens: List[str], ngram_size: int = 1) → List[str]
Prepare n-grams from the provided list of tokens.
- Parameters
tokens (List[str]) – list of tokens
ngram_size (int) – size of ngrams to return, by default 1 (unigrams)
- Returns
list of ngrams
- Return type
List[str]
- Raises
ValueError – if ‘tokens’ provided is not a list of strings
Examples
>>> from contextpro.feature_extraction import get_ngrams
>>> tokens = ["my", "name", "is", "dr", "jekyll"]
>>> get_ngrams(tokens, ngram_size=2)
["my name", "name is", "is dr", "dr jekyll"]
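An n-gram list is a sliding window over the tokens: a document of k tokens yields k - n + 1 n-grams of size n. The sketch below shows one common way to build them with `zip`; `make_ngrams` is a hypothetical name, and this is an illustration rather than contextpro's own code.

```python
from typing import List


def make_ngrams(tokens: List[str], ngram_size: int = 1) -> List[str]:
    """Hypothetical helper: build space-joined n-grams with a sliding window."""
    # Zip over ngram_size shifted views of the token list. zip stops at the
    # shortest view, which yields exactly len(tokens) - ngram_size + 1 windows.
    shifted = [tokens[offset:] for offset in range(ngram_size)]
    return [" ".join(window) for window in zip(*shifted)]


bigrams = make_ngrams(["my", "name", "is", "dr", "jekyll"], ngram_size=2)
print(bigrams)
```

With `ngram_size=1` the helper returns the tokens unchanged, which corresponds to the documented default of unigrams.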
-
contextpro.lemmatize(tokens: List[str], **kwargs: Any) → List[str]
Lemmatizes tokens using NLTK’s WordNetLemmatizer.
- Parameters
tokens (List[str]) – list of tokens containing words with various inflectional forms
- Other Parameters
**kwargs (Any) –
- additional properties of the below lemmatizer:
nltk.WordNetLemmatizer
- Returns
list of lemmatized tokens
- Return type
List[str]
Examples
>>> from contextpro.normalization import lemmatize
>>> tokens = ["I", "like", "driving", "a", "car"]
>>> lemmatize(tokens, pos="v")
['I', 'like', 'drive', 'a', 'car']
-
contextpro.remove_non_ascii_characters(sentence: str) → str
Removes non-ascii characters from the provided sentence.
- Parameters
sentence (str) – sentence containing non-ascii characters
- Returns
sentence without non-ascii characters
- Return type
str
Examples
>>> from contextpro.normalization import remove_non_ascii_characters
>>> sentence = "https://sitebulb.com/Folder/øê.html?大学"
>>> remove_non_ascii_characters(sentence)
'https://sitebulb.com/Folder/.html?'
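One standard-library way to get the same effect is to encode to ASCII while ignoring unencodable characters. This is a sketch under that assumption, with the hypothetical name `strip_non_ascii`; the reference does not state how contextpro implements the removal.

```python
def strip_non_ascii(sentence: str) -> str:
    """Hypothetical helper: drop every character outside the ASCII range."""
    # errors="ignore" silently discards characters that cannot be encoded.
    return sentence.encode("ascii", errors="ignore").decode("ascii")


# Escapes are used so the input is unambiguous: ö, ß and ü as single code points.
cleaned = strip_non_ascii("J\u00f6reskog bi\u00dfchen Z\u00fcrcher")
print(cleaned)
```

Characters are dropped rather than transliterated, so "ö" does not become "o"; this matches the 'Jreskog bichen Zrcher' output shown for the batch variant.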
-
contextpro.remove_numbers(sentence: str) → str
Removes numbers from the sentence.
- Parameters
sentence (str) – sentence containing number characters
- Returns
sentence without number characters
- Return type
str
Examples
>>> from contextpro.normalization import remove_numbers
>>> sentence = "He is 12 years old."
>>> remove_numbers(sentence)
'He is years old.'
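A regular expression that deletes runs of digits gives a rough picture of this operation. The helper `strip_numbers` below is hypothetical and is only a sketch; in particular, how contextpro treats the whitespace left behind is not stated in this reference.

```python
import re


def strip_numbers(sentence: str) -> str:
    """Hypothetical helper: delete every run of digits."""
    return re.sub(r"\d+", "", sentence)


print(strip_numbers("He earns 1000$ daily"))
```

Under this sketch the spaces on either side of a removed number both survive, so "He is 12 years old." becomes "He is  years old." with two consecutive spaces.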
-
contextpro.remove_punctuation(sentence: str) → str
Removes punctuation characters from the sentence.
- Parameters
sentence (str) – sentence containing punctuation characters
- Returns
sentence without punctuation characters
- Return type
str
Examples
>>> from contextpro.normalization import remove_punctuation
>>> sentence = "My name is Dr. Jekyll."
>>> remove_punctuation(sentence)
'My name is Dr Jekyll'
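A common way to strip punctuation in plain Python is a translation table built from `string.punctuation`. The sketch below assumes that ASCII punctuation set and uses the hypothetical name `strip_punctuation`; contextpro may define punctuation differently.

```python
import string

# Translation table that maps every ASCII punctuation character to None.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(sentence: str) -> str:
    """Hypothetical helper: delete ASCII punctuation characters."""
    return sentence.translate(_PUNCTUATION_TABLE)


print(strip_punctuation("My name is Dr. Jekyll."))
```

Punctuation is deleted rather than replaced with a space, so a hyphenated name such as "Tom-Parker" collapses to "TomParker", as in the batch example.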
-
contextpro.remove_stopwords(tokens: List[str], custom_stopwords: List[str] = []) → List[str]
Remove stopwords from the provided list of tokens.
- Parameters
tokens (List[str]) – list of tokens, including stopwords
custom_stopwords (List[str], optional) – custom stopwords to remove from the list of tokens, by default []
- Returns
list of tokens without stopwords
- Return type
List[str]
Examples
>>> from contextpro.normalization import remove_stopwords
>>> tokens = ['My', 'name', 'is', 'Dr', 'Jekyll']
>>> remove_stopwords(tokens)
['My', 'name', 'Dr', 'Jekyll']
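The documented outputs keep capitalised tokens such as 'My' and 'And' while dropping 'is' and 'this', which suggests a case-sensitive membership test. The sketch below assumes that behaviour; `filter_stopwords` and the tiny `EXAMPLE_STOPWORDS` set are hypothetical stand-ins for the NLTK list.

```python
from typing import Iterable, List

# Hypothetical, deliberately tiny stopword list for illustration only.
EXAMPLE_STOPWORDS = {"is", "this", "s", "and"}


def filter_stopwords(
    tokens: List[str], stopwords: Iterable[str] = EXAMPLE_STOPWORDS
) -> List[str]:
    """Hypothetical helper: drop tokens that appear in the stopword set."""
    stopword_set = set(stopwords)
    # Case-sensitive comparison: "And" is kept while "and" would be dropped.
    return [token for token in tokens if token not in stopword_set]


print(filter_stopwords(["My", "name", "is", "Dr", "Jekyll"]))
```

If case-insensitive filtering is wanted, lowercasing the tokens beforehand (or comparing `token.lower()`) would be the usual adjustment.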
-
contextpro.remove_whitespace(sentence: str) → str
Removes whitespace characters from the sentence.
- Parameters
sentence (str) – sentence containing whitespace characters
- Returns
sentence without whitespace characters
- Return type
str
Examples
>>> from contextpro.normalization import remove_whitespace
>>> sentence = "He Has Not Been in Touch for over a Month."
>>> remove_whitespace(sentence)
'He Has Not Been in Touch for over a Month.'
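The batch example in this reference shows runs of whitespace (including "\r\n" and "\t") reduced to single spaces rather than removed altogether. A minimal sketch of that normalisation, assuming this reading of the examples and using the hypothetical name `normalize_whitespace`:

```python
def normalize_whitespace(sentence: str) -> str:
    """Hypothetical helper: collapse whitespace runs into single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, \r, \n) and drops leading and trailing whitespace.
    return " ".join(sentence.split())


print(normalize_whitespace("I Will See \r\nYou next Week"))
```

Because `split()` discards leading and trailing whitespace, the result is also trimmed at both ends.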
-
contextpro.replace_contractions(sentence: str, **kwargs: bool) → str
Expands contractions in a sentence.
- Parameters
sentence (str) – sentence containing contracted words
- Other Parameters
**kwargs (bool) –
- additional properties of the below method:
contractions.fix()
- Returns
sentence without contracted words
- Return type
str
Examples
>>> from contextpro.normalization import replace_contractions
>>> sentence = "I don't want to be rude, but you shouldn't do this"
>>> replace_contractions(sentence)
'I do not want to be rude, but you should not do this'
-
contextpro.stem(tokens: List[str], stemmer_type: Optional[str] = 'nltk_porter_stemmer', **kwargs: Any) → List[str]
Reduces tokens to their root (base) form.
- Parameters
tokens (List[str]) – list of tokens containing words with various inflectional forms
stemmer_type (Optional[str], optional) –
stemmer type which will be used to stem the tokens, by default “nltk_porter_stemmer”
- Allowed values:
nltk_porter_stemmer
nltk_lancaster_stemmer
nltk_regexp_stemmer
nltk_snowball_stemmer
- Other Parameters
**kwargs (Any) –
- additional properties of the below stemmers:
nltk.PorterStemmer
nltk.LancasterStemmer
nltk.RegexpStemmer
nltk.SnowballStemmer
- Returns
list of stemmed tokens
- Return type
List[str]
Examples
>>> from contextpro.normalization import stem
>>> tokens = ["I", "like", "driving", "a", "car"]
>>> stem(tokens, stemmer_type="nltk_porter_stemmer")
['I', 'like', 'drive', 'a', 'car']
-
contextpro.tokenize_text(document: str, tokenizer_method: Optional[str] = 'nltk_word_tokenizer', **kwargs) → List[str]
Convert sentence into a list of tokens.
- Parameters
document (str) – sentence to tokenize
tokenizer_method (Optional[str]) –
tokenization method which will be used to tokenize the sentence, by default “nltk_word_tokenizer”.
- Allowed values:
nltk_word_tokenizer
nltk_regexp_tokenizer
- Other Parameters
**kwargs –
- additional properties of the below methods:
nltk.word_tokenize()
nltk.regexp_tokenize()
- Returns
list of tokens
- Return type
List[str]
Examples
>>> from contextpro.tokenization import tokenize_text
>>> sentence = "My name is Dr. Jekyll."
>>> tokenize_text(
...     sentence,
...     tokenizer_method="nltk_regexp_tokenizer",
...     pattern=r"\b[^\d\W]+\b",
...     gaps=False,
... )
['My', 'name', 'is', 'Dr', 'Jekyll']
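The pattern `\b[^\d\W]+\b` matches runs of word characters that are not digits, so punctuation and numbers never appear as tokens. The sketch below reproduces that matching with the standard `re` module to show what the pattern selects; `regexp_tokenize_sketch` is a hypothetical name, and the NLTK tokenizer that contextpro delegates to is not used here.

```python
import re
from typing import List


def regexp_tokenize_sketch(
    document: str, pattern: str = r"\b[^\d\W]+\b"
) -> List[str]:
    """Hypothetical helper: return every non-overlapping match of `pattern`."""
    # The pattern has no capture groups, so findall returns whole matches.
    return re.findall(pattern, document)


print(regexp_tokenize_sketch("My name is Dr. Jekyll."))
```

An apostrophe is not a word character, so "guy's" is split into 'guy' and 's', which is why the batch example contains a standalone 's' token.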