contextpro

contextpro is a Python library for concurrent text preprocessing. It wraps functions from well-known NLP packages, including NLTK, spaCy and TextBlob.
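
As a rough illustration of what "concurrent" means for a batch function, the sketch below spreads a per-document step across a pool of workers. It is not contextpro's implementation: `batch_apply` and `lowercase_tokens` are hypothetical names, and the use of threads rather than processes is an assumption made to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batch_apply(
    func: Callable[[T], R],
    corpus: Sequence[T],
    num_workers: int = 2,
) -> List[R]:
    """Apply func to every document using a pool of worker threads.

    Hypothetical helper for illustration only; not part of contextpro.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # executor.map returns results in the same order as the inputs.
        return list(executor.map(func, corpus))


def lowercase_tokens(tokens: List[str]) -> List[str]:
    """Toy per-document step standing in for a real NLP function."""
    return [token.lower() for token in tokens]


corpus = [
    ["My", "name", "is", "Dr", "Jekyll"],
    ["His", "name", "is", "Mr", "Hyde"],
]

result = batch_apply(lowercase_tokens, corpus, num_workers=2)
print(result)
```

The `num_workers` parameter in this sketch is meant to mirror the argument of the same name in the usage examples: it caps how many documents are handled at once, while the results keep the order of the input corpus.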

  • Documentation: https://contextpro.readthedocs.io/en/latest/

  • Source code: https://gitlab.com/elzawie/contextpro

Installation

Windows / OS X / Linux:

  • Installation with pip

    pip install contextpro
    python -m spacy download en_core_web_sm
    
  • Installation with poetry

    poetry add contextpro
    python -m spacy download en_core_web_sm
    

Configuration

  • Before using the package, run the following Python code in your virtual environment to download the required NLTK data:

    import nltk
    
    nltk.download("punkt")
    nltk.download("stopwords")
    nltk.download("wordnet")
    

Usage examples

from contextpro.normalization import batch_replace_contractions

corpus = [
    "I don't want to be rude, but you shouldn't do this",
    "Do you think he'll pass his driving test?",
    "I'll see you next week",
    "I'm going for a walk"
]

batch_replace_contractions(corpus)

[
    "I do not want to be rude, but you should not do this",
    "Do you think he will pass his driving test?",
    "I will see you next week",
    "I am going for a walk",
]

from contextpro.normalization import batch_remove_stopwords

corpus = [
    ['My', 'name', 'is', 'Dr', 'Jekyll'],
    ['His', 'name', 'is', 'Mr', 'Hyde'],
    ['This', 'guy', 's', 'name', 'is', 'Edward', 'Scissorhands'],
    ['And', 'this', 'is', 'Tom', 'Parker']
]

batch_remove_stopwords(corpus)

[
    ['My', 'name', 'Dr', 'Jekyll'],
    ['His', 'name', 'Mr', 'Hyde'],
    ['This', 'guy', 'name', 'Edward', 'Scissorhands'],
    ['And', 'Tom', 'Parker']
]

from contextpro.normalization import batch_lemmatize

corpus = [
    ["I", "like", "driving", "a", "car"],
    ["I", "am", "going", "for", "a", "walk"],
    ["What", "are", "you", "doing"],
    ["Where", "are", "you", "coming", "from"]
]

batch_lemmatize(corpus, num_workers=2, pos="v")

[
    ['I', 'like', 'drive', 'a', 'car'],
    ['I', 'be', 'go', 'for', 'a', 'walk'],
    ['What', 'be', 'you', 'do'],
    ['Where', 'be', 'you', 'come', 'from']
]

from contextpro.normalization import batch_convert_numerals_to_numbers

corpus = [
    "A bunch of five",
    "A picture is worth a thousand words",
    "A stitch in time saves nine",
    "Back to square one",
    "Behind the eight ball",
    "Between two stools",
]

batch_convert_numerals_to_numbers(corpus, num_workers=2)

[
    'A bunch of 5',
    'A picture is worth a 1000 words',
    'A stitch in time saves 9',
    'Back to square 1',
    'Behind the 8 ball',
    'Between 2 stools',
]

from contextpro.statistics import batch_calculate_corpus_statistics

corpus = [
    "My name is Dr. Jekyll.",
    "His name is Mr. Hyde",
    "This guy's name is Edward Scissorhands",
    "And this is Tom Parker"
]

batch_calculate_corpus_statistics(
    corpus,
    lowercase=False,
    remove_stopwords=False,
    num_workers=2,
)

    characters  tokens  punctuation_characters  digits  whitespace_characters  \
0          22       5                       2       0                      4
1          20       5                       1       0                      4
2          38       7                       1       0                      5
3          22       5                       0       0                      4

        ascii_characters  sentiment_score  subjectivity_score
0                22              0.0                 0.0
1                20              0.0                 0.0
2                38              0.0                 0.0
3                22              0.0                 0.0
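
To make the character-level columns above easier to interpret, here is a minimal sketch that recomputes them with the standard library alone. `basic_statistics` is a hypothetical helper, not part of contextpro, and the `\w+` token rule is an assumption that happens to match the token counts shown above. The sentiment and subjectivity scores are left out.

```python
import re
import string
from typing import Dict


def basic_statistics(text: str) -> Dict[str, int]:
    """Count simple character-level features of a single document.

    Hypothetical helper for illustration only; not part of contextpro.
    """
    return {
        "characters": len(text),
        # Assumed tokenization: runs of word characters.
        "tokens": len(re.findall(r"\w+", text)),
        "punctuation_characters": sum(ch in string.punctuation for ch in text),
        "digits": sum(ch.isdigit() for ch in text),
        "whitespace_characters": sum(ch.isspace() for ch in text),
        "ascii_characters": sum(ch.isascii() for ch in text),
    }


stats = basic_statistics("My name is Dr. Jekyll.")
print(stats)
# {'characters': 22, 'tokens': 5, 'punctuation_characters': 2,
#  'digits': 0, 'whitespace_characters': 4, 'ascii_characters': 22}
```

For the first document this gives the same values as row 0 of the table: 22 characters, 5 tokens, 2 punctuation characters, 0 digits, 4 whitespace characters and 22 ASCII characters.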

Release History

  • https://gitlab.com/elzawie/contextpro/-/releases

Meta

Łukasz Zawieska – zawieskal@yahoo.com

GitLab account

GitHub account

Distributed under the MIT license. See LICENSE for more information.