Natural Language Processing (NLP) tools

Slug: nlp

#Digital Methods and Tools

#Text Analysis

  • Description:

  • Example Projects:
  • Methods:
    • Topic Modeling
    • Information Retrieval
    • Text Classification
    • Sentiment Analysis
    • Word Frequency Analysis
    • Concordancing
    • Named Entity Recognition
    • Collocation
    • Word Embeddings
    • Transformer Models
  • Popular Tools:
    • Voyant Tools: web based reading and analysis environment for digital texts
    • Mallet: a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text
    • WordSeer 4: a text analysis environment that combines visualization, information retrieval, sensemaking and natural language processing to make the contents of text navigable, accessible, and useful
    • AntConc: a freeware corpus analysis toolkit for concordancing and text analysis
  • Popular Programming Languages and Packages:
    • Python:
      • NLTK (Natural Language Toolkit): a leading platform for building Python programs to work with human language data
      • spaCy: industrial-strength NLP written in Cython for speed
      • Gensim: free open-source Python library for representing documents as semantic vectors
    • R:
      • Quanteda: a package for the Quantitative Analysis of Textual Data
      • tidytext: text mining with tidy data principles, which keeps text analysis consistent with tools already in wide use
      • spacyr: an R wrapper for the Python spaCy library
  • Resources:
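
Two of the methods listed above, word frequency analysis and concordancing, can be sketched with the Python standard library alone. This is a minimal illustration under simple assumptions (English text, a naive regex tokenizer); the function names and the sample sentence are hypothetical, and the tools listed above handle tokenization, stop words, and large corpora far more carefully.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text, top_n=5):
    """Return the top_n most common tokens with their counts."""
    return Counter(tokenize(text)).most_common(top_n)


def concordance(text, keyword, width=2):
    """Keyword-in-context lines: up to `width` tokens on each side of a hit."""
    tokens = tokenize(text)
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Hypothetical sample text, used only for illustration.
sample = "The whale surfaced. The crew watched the whale dive again."
print(word_frequencies(sample, top_n=2))
print(concordance(sample, "whale"))
```

The concordance view shows each occurrence of the keyword with its immediate neighbours, which is the core idea behind the concordancers listed above.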

#Visual Presentation and Analysis

  • Description

  • Example Projects:
  • Popular Tools, Platforms, and Standards:
    • IIIF, the International Image Interoperability Framework, provides a series of API specifications which various image servers and viewers implement
      • IIIF Awesome Resources
      • Harvard IIIF Website
      • Image servers:
        • Loris
        • Cantaloupe
      • Image viewers:
        • Mirador 3
        • Universal Viewer
      • Annotation servers:
        • CatchPy
  • Popular Software Packages

#Spatial Analysis and Web Mapping

  • Description

  • Example Projects:
  • Popular Tools and Platforms:
    • Esri ArcGIS: powerful desktop GIS software suite for mapping, geoprocessing, cartography, and spatial analysis
    • QGIS: FOSS alternative to ArcGIS
    • Palladio: a Stanford Humanities + Design Lab online tool for visualizing complex historical data
    • CARTO (previously CartoDB): paid Software as a Service platform for spatial analysis and GIS
    • Esri ArcGIS Online: cloud-based GIS platform for creating and sharing interactive maps and analyzing spatial data
    • Neatline: a suite of add-on tools for Omeka designed to help tell stories with maps, images, and timelines
  • Popular Software Packages:
    • Leaflet: open-source JavaScript library for interactive web maps
    • D3.js: open-source JavaScript library for general data visualization, including web mapping
    • Google Maps API: one of the most popular general purpose mapping libraries
    • OpenLayers: JavaScript library for displaying maps and analyzing geographical data
    • Mapbox GL: SDK for web maps powered by Mapbox, which allows users to design and publish beautiful maps

#Network Analysis

  • Description

  • Example Projects:
  • Popular Tools:
    • Gephi: a FOSS interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs
    • Palladio: a Stanford Humanities + Design Lab online tool for visualizing complex historical data
    • Cytoscape: originally designed for biological research, now a general platform for network analysis and visualization
    • NodeXL: a network analysis and visualization plugin for Excel
  • Popular Software Packages:
    • NetworkX (Python): a package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks
    • igraph (R and Python): a collection of network analysis tools with an emphasis on efficiency, portability, and ease of use
    • visNetwork (R): an R package for network visualization using vis.js
    • D3.js (JavaScript): more for network visualization and presentation than network analysis
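
As a rough idea of what a network analysis package computes, the sketch below calculates degree centrality from a plain edge list without any third-party library. It is an illustrative assumption, not the API of any package listed above; the function name and the correspondence data are hypothetical.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Degree centrality for an undirected graph given as (u, v) pairs.

    Each node's score is its number of distinct neighbours divided by
    (n - 1), the most neighbours it could have in a graph of n nodes.
    Assumes the edge list contains no self-loops.
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


# Hypothetical correspondence network: who exchanged letters with whom.
letters = [("Ada", "Ben"), ("Ada", "Cy"), ("Ada", "Dee"), ("Ben", "Cy")]
centrality = degree_centrality(letters)
for person, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{person}: {score:.2f}")
```

For real projects, the packages above add many more measures (betweenness, communities, layouts) and scale to much larger graphs.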

#Timelines and Temporal Analysis

  • Popular Timeline Creation Tools:
    • TimelineJS: an open-source tool that enables anyone to build visually rich, interactive timelines using nothing more than a Google Sheet
    • Chronos Timeline: designed for needs in humanities and social sciences to represent time-based data
    • Neatline: a suite of add-on tools for Omeka designed to tell stories with maps, images, and timelines

#Machine Learning

  • Example Projects

  • Popular Platforms:
    • AWS SageMaker
    • IBM Watson Studio
    • Google Cloud AI
    • H2O.ai
    • KNIME
  • Popular Software Packages:
    • Python:
      • Keras: deep learning and neural networks
      • PyTorch: deep learning, computer vision
      • Scikit-Learn: data preprocessing, text vectorization, classification, clustering
      • TensorFlow: deep learning
    • R:
      • caret: functions that streamline the process for creating predictive models
  • Resources:

#Database Development

  • Popular Databases:
    • Relational:
      • PostgreSQL
      • MySQL
    • Document / NoSQL:
      • MongoDB
      • Elasticsearch
      • Solr
    • Key-value Store:
      • Redis
      • AWS DynamoDB
    • Graph:
      • Neo4j
  • Popular Database Tools and Database Management Systems (DBMS):
    • DbVisualizer
    • DataGrip
    • Postico
    • SQL Server Management Studio

#Data Cleaning

  • Popular Software:
    • Python:
    • R:
      • Tidyverse: an opinionated collection of R packages designed for data science
    • Language-agnostic (mostly): Regular Expressions
  • Popular Tools:
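
Regular expressions, listed above as a mostly language-agnostic cleaning tool, can be sketched in Python with the built-in `re` module. The patterns below are illustrative assumptions for a small scraped-text example, not a general-purpose cleaner; `clean_text` and the sample string are hypothetical.

```python
import re


def clean_text(raw):
    """Apply a few common regex clean-up steps to scraped text."""
    # Replace HTML-like tags with a space.
    text = re.sub(r"<[^>]+>", " ", raw)
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse any run of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


# Hypothetical scraped snippet.
raw = "<p>Digital  human-\nities</p>\n<p>text   mining</p>"
print(clean_text(raw))
```

Note that the hyphen rule also joins genuine compounds that happen to break at a line end, so patterns like these are best checked against a sample of the actual data.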

#Research Data Management

  • Popular Software Packages:
    • Git: distributed version control system
  • Popular Tools and Platforms:
    • Dataverse: an open-source research data repository
    • Git desktop clients, such as GitHub Desktop or GitKraken
    • Tropy: tool to organize and describe photographs of research material

#Project Management

  • Popular Tools and Platforms:
    • Trello
    • Jira
    • GitHub Projects
    • Asana

#Citation Management

  • Popular Tools and Platforms:
    • Zotero
    • EndNote
    • Mendeley

#Digital Collections

#Phase: Data

#Data Annotation

Category Tool Remarks
Audio audio-annotator, audiono  
General superintendent, pigeon Annotate in notebooks
  labelstudio Open Source Data Labeling Tool
  awesome-data-labeling  
Image makesense.ai, labelimg, via, cvat  
Text doccano, brat  
  chatito Generate text datasets using DSL
  prodigy Paid
Inter-rater agreement disagree  
  simpledorff Krippendorff’s Alpha
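
To make the inter-rater agreement row more concrete, here is a minimal sketch of Krippendorff's alpha for nominal labels, written from the standard coincidence-matrix definition. It is not the implementation used by simpledorff or any other listed tool; the function name and the example labels are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists: the labels each item received from its
    annotators. Items with fewer than two labels are ignored.
    """
    coincidences = Counter()  # weighted counts of ordered label pairs
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()  # marginal total per label
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1 - observed / expected


# Hypothetical annotations: two annotators, four items.
ratings = [["pos", "pos"], ["pos", "neg"], ["neg", "neg"], ["neg", "neg"]]
print(round(krippendorff_alpha_nominal(ratings), 3))
```

A value of 1 means perfect agreement and values near 0 mean agreement no better than chance; because each item is a list of labels, items with a missing annotation are handled naturally.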

#Data Collection

Category Tool Remarks
Curations datasetlist, UCI, Google Dataset Search, fastai-datasets, public-apis, awesome-public-datasets, aws opendata, penn-ml-benchmark  
  huggingface-datasets, The Big Bad NLP Database, nlp-datasets, nlp corpora NLP Datasets
  bifrost, VisualData, roboflow Computer Vision Datasets
Words curse-words, badwords, LDNOOBW, 10K most common words, common-misspellings  
  profanity Profane words
  wordlists Words organized by topic
  english-words A text file containing over 466k English words
  tf-idf-iif-top-100-wordlists Top 100 distinctive words for each language
  freeling Dictionary of words grouped by POS
  catvar, lemmatization-lists, unimorph-eng Word variants
Text Corpus project gutenberg, nlp-datasets, 1 trillion n-grams, litbank, BookCorpus, south-asian text corpus  
  opus, oscar (big multilingual corpus) Translation Parallel Text
  pile 825GB text corpus
  freebase Relation triples
  opensubtitles Movie subtitles parallel corpus
  lti-langid Language Identification Corpus for 1152 languages
  fandom-transcripts Movie and Series Transcripts
  cognet Cognates for 338 languages
  wold Loan words
Sentiment SST2, Amazon Reviews, Yelp Reviews, Movie Reviews, Food Reviews, Twitter Airline, GOP Debate, Sentiment Lexicons for 81 languages, SentiWordNet, Opinion Lexicon, Wordstat words, Emoticon Sentiment, socialsent  
Emotion NRC-Emotion-Lexicon-Wordlevel, ISEAR(17K), HappyDB, emotion-to-emoji-mapping  
  EmoTag1200 Emoji-Emotion scores
NLU Intents rasa-nlu-training-data  
N-grams google-book-ngrams, norvig-processed-ngrams  
Word Frequency wordfreq, key-nbc  
Summarization curation-corpus  
Conversations conversational-datasets, cornell-movie-dialog-corpus, persona-chat, DialogDatasets  
Semantic Parsing wikisql, spider Text to SQL
  WebQuestions, ComplexWebQuestions Text to Knowledge Graph
  CoNaLa, CONCODE Text to program
  amrlib Parse AMR data
Image 1 million fake faces, flickr-faces, objectnet, YFCC100m, USPS, Animal Faces-HQ dataset (AFHQ)  
  tiny-images, SVHN, STL-10, imagenette, CIFAR-10 Small image datasets for quick experimentation
  omniglot, mini-imagenet One Shot Learning
Paraphrasing PPDB  
Audio audioset YouTube audio with labels
Speech voxforge, openslr, cmu wilderness, commonvoice  
Speech synthesis CMU Artic  
Graphs Social Networks (GitHub, Facebook, Reddit)  
Handwriting iam-handwriting  
  text_renderer Generate synthetic OCR text

#Importing Data

Category Tool Remarks
Prebuilt openml, lineflow  
  rs_datasets Recommendation Datasets
  nlp Python interface to NLP datasets
  tensorflow_datasets Access datasets in Tensorflow
  hub Prebuilt datasets for PyTorch and Tensorflow
  pydataset  
  ir_datasets Information Retrieval Datasets
App Store google-play-scraper  
Arxiv pyarxiv Programmatic access to arxiv.org
Audio pydub  
Crawling MechanicalSoup, libextract  
  pyppeteer Chrome Automation
  trafilatura Extract text sections from HTML
  justext Remove boilerplate from scraped HTML
  hext DSL for extracting data from HTML
  ratelimit API rate limit decorator
  backoff Exponential backoff and jitter
  asks Async version of requests
  requests-cache Cached version of requests
  html2text Convert HTML to markdown-formatted plain text
Database blaze Pandas and Numpy interface to databases
Email talon  
Excel openpyxl  
Google Drive gdown, pydrive  
Google Maps geo-heatmap  
Google Search googlesearch Parse google search results
Google Sheets gspread  
Google Ngrams google-ngram-downloader  
HTML python-readability, html-text HTML to Text
Image py-image-dataset-generator, idt, jmd-imagescraper Auto fetch images from web for certain search
Video moviepy Edit Videos
  pytube Download YouTube videos
Lyrics lyricsgenius  
Machine Translation Corpus mtdata  
News news-please, news-catcher Scrape news
  pygooglenews Google News
Network Packet dpkt, scapy  
PDF camelot, tabula-py, parsr, pdftotext, pdfplumber, pymupdf  
  grobid Parse PDF into structured XML
  PyPDF2 Read and write PDF in Python
  pdf2image Convert PDF to image
Remote file smart_open  
Text to Speech gtts  
Twitter twint, tweepy, twarc Scrape Twitter
Wikipedia wikipedia, wikitextparser Access data from wikipedia
  wikitables Import table from wikipedia article
Wikidata wikidata Python API to wikidata
XML xmltodict Parse XML as python dictionary
YouTube scrapetube Scrape video metadata from channel
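
The crawling rows above mention rate limiting and exponential backoff with jitter. The sketch below shows the backoff idea using only the standard library. It is an assumption-laden illustration, not the API of the `backoff` or `ratelimit` packages; `retry_with_backoff` and `flaky_fetch` are hypothetical names.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Call func(), retrying on any exception with exponential backoff.

    The wait ceiling doubles after each failure (capped at max_delay) and
    the actual wait is drawn uniformly below it ("full jitter"), so that
    many clients do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))


# Hypothetical flaky request that fails twice before succeeding.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "payload"


waits = []  # record the delays instead of sleeping, for demonstration
result = retry_with_backoff(flaky_fetch, sleep=waits.append)
print(result, len(waits))
```

Passing `sleep=waits.append` keeps the demonstration instant; in a real crawler the default `time.sleep` would be used, and only errors that are worth retrying would be caught.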

#Data Augmentation

Category Tool Remarks
Audio audiomentations, muda  
Image imgaug, albumentations, augmentor, solt  
  deepaugment Automatic augmentation
  TextRecognitionDataGenerator, genalog OCR
Tabular data deltapy  
  mockaroo Generate synthetic user details
Text nlpaug, noisemix, textattack, textaugment, niacin, SeaQuBe, DataAug4NLP, NL-Augmenter  
  fastent Expand NER entity list

#Phase: Exploration

#Data Preparation

Category Tool Remarks
Class Imbalance imblearn  
Categorical encoding category_encoders  
  dirty_cat Encode categories with typos
Dataframe cudf Pandas on GPU
Data Validation pandera, pandas-profiling Pandas
Data Cleaning pyjanitor Janitor ported to python
Graph Sampling little ball of fur  
Missing values missingno  
Parallelize pandarallel, swifter, modin Parallelize pandas
  vaex Pandas on huge data
  numba Parallelize numpy
Parsing pyparsing, parse  
Split images into train/validation/test split-folders  
Submodular Optimization twinning, apricot  
Weak Supervision snorkel  
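
For the class imbalance row, the simplest remedy is random oversampling: duplicating minority-class rows until every class is as large as the biggest one. The sketch below is a plain-Python illustration under that assumption, not the imblearn API; the function name and toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until classes are balanced."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra copies (with replacement) to reach the target size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical toy data: 2 "spam" rows versus 6 "ham" rows.
rows = [[i] for i in range(8)]
labels = ["spam"] * 2 + ["ham"] * 6
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Oversampling is normally applied to the training split only, so that duplicated rows do not leak into validation or test data.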

#Data Exploration

Category Tool Remarks
Explore Data sweetviz, dataprep, quickda, vizidata Generate quick visualizations of data
  ipyplot Plot images
Notebook Tools nbdime View Jupyter notebooks through CLI
  papermill Parametrize notebooks
  nbformat Access notebooks programmatically
  nbconvert Convert notebooks to other formats
  ipyleaflet Maps in notebooks
  ipycanvas Draw diagrams in notebook
  fastdoc Convert notebook to PDF book
Relationship ppscore Predictive Power Score
  pdpbox Partial Dependence Plot

#Feature Generation

Category Tool Remarks
Automatic feature engineering featuretools, autopandas  
  tsfresh Automatic feature engineering for time series
DAG based dataset generation DFFML  
Dimensionality reduction fbpca, fitsne, trimap  
Metric learning metric-learn, pytorch-metric-learning  
Time series python-holidays List of holidays
  skits Transformation for time-series data
  catch22 Pre-built features for time-series data

#Phase: Modeling

#Model Selection

Category Tool Remarks
Project Structure cookiecutter-data-science  
Find SOTA models sotawhat, papers-with-code, codalab, nlpprogress, evalai, collectiveknowledge, sotabench Benchmarks
  bert-related-papers BERT Papers
  survey-papers Collection of survey papers
Pretrained models modeldepot, pytorch-hub General
  pretrained-models.pytorch, pytorchcv Pre-trained ConvNets
  pytorch-image-models 200+ pretrained ConvNet backbones
  huggingface-models, huggingface-pretrained Transformer Models
  awesome-models Pretrained CoreML models
  huggingface-languages Multi-lingual Models
  model-forge, The Super Duper NLP Repo Pre-trained NLP models by usecase
AutoML auto-sklearn, mljar-supervised, automl-gs, pycaret, evalml  
  lazypredict Run all sklearn models at once
  tpot Genetic AutoML
  autocat Auto-generate text classification models in spacy
  mindsdb, ludwig Autogenerate ML code
Active Learning modal  
Anomaly detection adtk  
Contrastive Learning contrastive-learner  
Deep Clustering deep-clustering-toolbox  
Few Shot Learning keras-fewshotlearning  
Fuzzy Learning fylearn, scikit-fuzzy  
Genetic Programming gplearn  
Gradient Boosting catboost, xgboost, ngboost  
  lightgbm, thundergbm GPU Capable
Graph Neural Networks spektral GNN for Keras
Graph Embedding and Community Detection karateclub, python-louvain, communities  
Hidden Markov Models hmmlearn  
Interpretable Models imodels Models that show rules
Multi-view Learning mvlearn  
Noisy Label Learning cleanlab  
Optimization nevergrad Gradient Free Optimization
  cvxpy Convex Optimization
Optimal Transport pot, geomloss  
Probabilistic modeling pomegranate, pymc3  
Rule based classifier sklearn-expertsys  
Self-Supervised Learning lightly, vissl, solo-learn Implementations of SSL models
  self_supervised Self-supervised models in Fast.AI
Spiking Neural Network norse  
Support Vector Machines thundersvm Run SVM on GPU
Survival Analysis lifelines  

#Frameworks

Category Tool Remarks
Addons mlxtend Extra utilities not present in frameworks
  tensor-sensor Visualize tensors
Pytorch pytorch-summary Keras-like summary
  torchtyping, tsalib Type annotation for tensors
  einops Einstein Notation
  kornia Computer Vision Methods
  nonechucks Drop corrupt data automatically in DataLoader
  pytorch-optimizer Collection of optimizers
  pytorch-block-sparse Sparse matrix replacement for nn.Linear
  pytorch-forecasting Time series forecasting in PyTorch lightning
  pytorch-lightning Lightweight wrapper for PyTorch
  skorch Wrap pytorch in scikit-learn compatible API
  torchcontrib SOTA Building Blocks in PyTorch
  bitsandbytes 8-bit optimizers for PyTorch
Scikit-learn scikit-lego, iterative-stratification  
  iterstrat Cross-validation for multi-label data
  scikit-multilearn Multi-label classification
  tscv Time-series cross-validation
Sparsification sparseml Apply sparsification to any framework
Tensorflow tensorflow-addons  
  keras-radam RADAM optimizer
  ktrain FastAI like interface for keras
  larq Binarized neural networks
  scikeras Scikit-learn Wrapper for Keras
  tavolo Kaggle Tricks as Keras Layers
  tensorflow-text Addons for NLP
  tensorflow-wheels Optimized wheels for Tensorflow
  tf-sha-rnn  

#Natural Language Processing

Category Tool Remarks
Libraries spacy, nltk, corenlp, deeppavlov, kashgari, transformers, ernie, stanza, nlp-architect, spark-nlp, pytext, FARM  
  headliner, txt2txt Sequence to sequence models
  Nvidia NeMo Toolkit for ASR, NLP and TTS
  nlu 1-line models for NLP
  pyconverse Conversational Text Analysis
  booknlp NLP for Books
  fast-bert, simpletransformers Wrappers
  finetune Scikit-learn like API for transformers
  compromise JavaScript NLP
CPU-optimizations turbo_transformers, onnx_transformers  
  fastT5 Generate optimized T5 model
Preprocessing textacy, texthero, textpipe, nlpretext  
  JamSpell, pyhunspell, pyspellchecker, cython_hunspell, hunspell-dictionaries, autocorrect (can add more languages), symspellpy, spello (train your own spelling correction), contextualSpellCheck, neuspell, nlprule, spylls Spelling Correction
  gramformer Grammar Checker
  language-tool-python, gingerit Grammatical Error Correction
  ekphrasis Pre-processing for social media texts
  editop Compute edit-operations for text normalization
  contractions, pycontractions Contraction Mapping
  truecase Fix casing
  nnsplit, deepsegment, sentence-doctor, pysbd, sentence-splitter Sentence Segmentation
  wordninja Probabilistic Word Segmentation
  punctuator2 Punctuation Restoration
  stopwords-iso Stopwords for all languages
  language-check, langdetect, polyglot, pycld2, cld2, cld3, langid, lumi_language_id Language Identification
  langcodes Get language from language code
  neuralcoref Coreference Resolution
  inflect, lemminflect, pyinflect Inflections
  scrubadub PII removal
  ftfy, clean-text, text-unidecode Fix Unicode Issues
  fastpunct Punctuation Restoration
  pyphen Hyphenate words into syllables
  pypostal, mordecai, usaddress, libpostal Parse Street Addresses
  geopy, geocoder, nominatim, pelias, photon, lieu Geocoding
  probablepeople, python-nameparser Parse person name
  python-phonenumbers Parse phone numbers
  numerizer, word2number Parse natural language number
  dateparser Parse natural dates
  ctparse Parse natural language time
  daterangeparser Parse date ranges in natural language
  emoji Handle emoji
  pyarabic multilingual
Tokenization sentencepiece, youtokentome, subword-nmt  
  sacremoses Rule-based
  jieba, pkuseg Chinese Word Segmentation
  kytea Japanese word segmentation
Clustering kmodes, star-clustering, genieclust  
  spherecluster K-means with cosine distance
  sib Sequential Information Bottleneck
  kneed Automatically find number of clusters from elbow curve
  OptimalCluster Automatically find optimal number of clusters
  gsdmm Short-text clustering
Code Switching codeswitch  
Constituency Parsing benepar, allennlp, chunk-english-fast  
Compact Models mobilebert, distilbert, tinybert, BERT-of-Theseus-MNLI, MiniLM  
Cross-lingual Embeddings muse, laserembeddings, xlm, LaBSE  
  transvec, vecmap Train mapping between monolingual embeddings
  MuRIL Embeddings for 17 indic languages with transliteration
  BPEmb Subword Embeddings in 275 Languages
  piecelearn Train own sub-word embeddings
Dictionary vocabulary  
Domain-specific codebert Code
  clinicalbert-mimicnotes, clinicalbert-discharge-summary Clinical Domain
  twitter-roberta-base twitter
  scispacy bio-medical data
  blackstone Legal text
Entity Linking dbpedia-spotlight, GENRE  
Entity Matching py_entitymatching, deepmatcher  
Embeddings InferSent, embedding-as-service, bert-as-service, sent2vec, sense2vec, glove-python, fse  
  counterix Train custom Count-based DSM
  embeddix Convert word vectors format
  wiki2vec Word2Vec trained on DBPedia Entities
  chars2vec Character-embeddings for handling typo and slangs
  rank_bm25, BM25Transformer BM25
  sentence-transformers, DeCLUTR BERT sentence embeddings
  conceptnet-numberbatch Word embeddings trained with common-sense knowledge graph
  word2vec-twitter Word2vec trained on twitter
  pymagnitude Access word-embeddings programmatically
  chakin Download pre-trained word vectors
  zeugma Pretrained-word embeddings as scikit-learn transformers
  starspace Learn embeddings for anything
  svd2vec Learn embeddings from co-occurrence
  all-but-the-top Post-processing for word vectors
  entity-embed Train custom embeddings for named entities
Emotion Classification goemotion-pytorch, text2emotion  
  emosent-py Sentiment scores for Emojis
Feature Generation homer, textstat Readability scores
  LexicalRichness Lexical Richness Measure
Fill mask fitbert  
Finite State Transducer OpenFST  
Gibberish Detection nostril, gibberish-detector  
Grammar Induction gitta, grasp Generate CFG from sentences
Information Extraction claucy  
  GiveMe5W1H Extract 5W1H (who, what, when, where, why, how) phrases from news
  spikex Spacy pipeline for knowledge extraction
Keyword extraction rake, multi-rake, pke, phrasemachine, keybert, word2phrase  
  pyate Automated Term Extraction
Knowledge conceptnet-lite  
  stanford-openie Knowledge Graphs
  verbnet-parser VerbNet parser
Knowledge Distillation textbrewer, aquvitae  
Language Model Scoring lm-scorer, bertscore, kenlm, spacy_kenlm, mlm-scoring  
Lexical Simplification easee Evaluation metric
Metrics seqeval NER, POS tagging
  ranking-metrics, cute_ranking Metrics for Information Retrieval
  mir_eval Music Information Retrieval Metrics
Morphology unimorph Morphology data for many languages
Multilingual support polyglot, trankit  
  inltk, indic_nlp Indic Languages
  cltk NLP for latin and classic languages
  langrank Auto-select optimal transfer language
Named Entity Recognition (NER) spaCy, Stanford NER, sklearn-crfsuite  
  med7 Spacy NER for medical records
Nearest neighbor faiss, sparse_dot_topn, n2, autofaiss  
NLU snips-nlu  
  ParlAI Dialogue System
Paraphrasing parrot  
  pegasus Question Paraphrasing
  paraphrase_diversity_ranker Rank paraphrases of sentence
  sentaugment Paraphrase mining
Phonetics epitran Transliterate text into IPA
  allosaurus Recognize phone for 2000 languages
Phonology panphon Generate phonological feature representations
  phoible Database of segment inventories for 2186 languages
Probabilistic parsing parserator Create domain-specific parser for address, name etc.
Profanity detection profanity-check  
Pronunciation pronouncing  
Question Answering haystack Build end-to-end QA system
  mcQA Multiple Choice Question Answering
  TAPAS Table Question Answering
Question Generation question-generation, questiongen.ai Question Generation Pipeline for Transformers
Ranking transformer-rankers  
Relation Extraction OpenNRE  
Search elasticsearch-dsl Wrapper for Elasticsearch
  jina Production-level neural semantic search
  meilisearch-python  
Semantic parsing quepy  
Sentiment vaderSentiment, afinn Rule based
  absa Aspect Based Sentiment Analysis
  xlm-t Models
Spacy Extensions spacy-pattern-builder Generate dependency matcher patterns automatically
  spacy_grammar Rule-based grammar error detection
  role-pattern-builder Pattern based SRL
  textpipeliner Extract RDF triples
  tenseflow Convert tense of sentence
  camphr Wrapper to transformers, elmo, udify
  spleno Domain-specific lemmatization
  spacy-udpipe Use UDPipe from Spacy
  spacymoji Add emoji metadata to spacy docs
String match phrase-seeker, textsearch  
  jellyfish, fuzzy, doublemetaphone Perform string and phonetic comparison
  clavier Edit distance based on keyboard layout
  flashtext Super-fast extract and replace keywords
  pythonverbalexpressions Verbally describe regex
  commonregex Ready-made regex for email/phone etc.
  textdistance, editdistance, word-mover-distance, edlib Text distances
  wmd-relax Word mover distance for spacy
  fuzzywuzzy, spaczz, PolyFuzz, rapidfuzz, fuzzymatcher Fuzzy Search
  deduplipy, dedupe Active-Learning based fuzzy matching
  recordlinkage Record Linkage
Summarization textrank, pytldr, bert-extractive-summarizer, sumy, fast-pagerank, sumeval  
  doc2query Summarize document with queries
  summarizers Controllable summarization
  insight_extractor Extract insightful sentences from docs
Text Extraction textract (Image, Audio, PDF)  
Text Generation gpt2client, textgenrnn, gpt-2-simple, aitextgen GPT-2
  markovify Markov chains
  accelerated-text Template-based generation
  keytotext Keyword to Sentence Generation
Transliteration wiktra  
Machine Translation MarianMT, Opus-MT, joeynmt, OpenNMT, EasyNMT, argos-translate, dl-translate  
  googletrans, word2word, translate-python, deep_translator Translation libraries
  mosesdecoder Statistical MT
  apertium RBMT
  translators Free calls to multiple translation APIs
  giza++, fastalign, simalign, eflomal, awesome-align Word Alignment
Thesaurus python-datamuse  
Toxicity Detection detoxify  
Topic Modeling gensim, guidedlda, enstop, top2vec, contextualized-topic-models, corex_topic, lda2vec, bertopic, tomotopy, ToModAPI  
  zeroshot_topics Zero-shot topic modeling
  octis Evaluate topic models
Typology lang2vec Compare typological features of languages
Visualization stylecloud Word Clouds
  scattertext Compare word usage across segments
  picture-text Interactive tree-maps for hierarchical clustering
  ipymarkup Visualize NER and syntax
Verb Conjugation nodebox_linguistics_extended, mlconj3  
Word Sense Disambiguation pywsd, ewiser, supwsd  
  frame-english-fast Verb Disambiguation
Zero Shot Learning setfit  
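
Several String match rows above concern edit distance and fuzzy matching. As a minimal sketch of the underlying idea, the code below computes Levenshtein distance with the usual dynamic-programming recurrence and turns it into a similarity score. It is an illustration only; the listed libraries are much faster and offer more metrics, and the `similarity` normalization is one hypothetical choice among several.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale edit distance to 0..1, where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(round(similarity("colour", "color"), 3))
```

Keeping only two rows of the table at a time keeps memory proportional to the length of the second string rather than to the product of both lengths.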

#Computer Vision

Category Tool Remarks
Face recognition face_recognition, mtcnn, insightface, face-detection  
  face-alignment Find facial landmarks
  Facial-Expression-Recognition.Pytorch Face Emotion
Face swapping faceit, faceit-live, avatarify  
GANS mimicry, imaginaire, pytorch-lightning-gans  
High-level libraries terran Face detection, recognition, pose estimation
Image Hashing ImageHash, imagededup  
Image Inpainting GAN Image Inpainting  
Image Processing scikit-image, imutils, opencv-wrapper, opencv-python  
  torchio Medical Images
Object detection luminoth, detectron2, mmdetection, icevision  
OCR keras-ocr, pytesseract, keras-craft, ocropy, doc2text  
  easyocr, kraken, PaddleOCR Multilingual OCR
  layout-parser, pdftabextract OCR tables from document
Segmentation segmentation_models Keras
  segmentation_models.pytorch Segmentation models in PyTorch
Semantic Search scoper Video
Video summarization videodigest  

#Speech

Category Tool Remarks
Diarization resemblyzer  
Feature Engineering python_speech_features Convert raw audio to features
Libraries speechbrain, pyannote, librosa, espnet  
  silero-models Pre-trained models
Source Separation spleeter, nussl, open-unmix-pytorch, asteroid  
Speech Recognition kaldi, speech_recognition, delta, pocketsphinx-python, deepspeech, stt, vosk  
Speech Synthesis festvox, cmuflite, tts  

#Recommendation System

Category Tool Remarks
Apriori algorithm apyori  
Collaborative Filtering implicit  
Libraries xlearn, DeepCTR, RankFM Factorization machines (FM), and field-aware factorization machines (FFM)
  libmf-python Matrix Factorization
  lightfm, spotlight Popular Recsys algos
  tensorflow_recommenders Recommendation System in Tensorflow
Metrics rs_metrics  
Recommendation System in Pytorch CaseRecommender  
Scikit-learn like API surprise  

#Timeseries

Category Tool Remarks
Libraries prophet, tslearn, pyts, seglearn, cesium, stumpy, darts, gluon-ts, stldecompose  
  sktime Scikit-learn like API
  atspy Automated time-series models
Anomaly Detection orion, luminaire Unsupervised time-series anomaly detection
ARIMA models pmdarima  

#Hyperparameter Optimization

Category Tool Remarks
General hyperopt, optuna, evol, talos  
Keras keras-tuner  
Parameter optimization ParameterImportance  
Scikit-learn hyperopt-sklearn, scikit-optimize Bayesian Optimization
  sklearn-deap, sklearn-generic-opt Evolutionary algorithm

#Phase: Validation

#Experiment Monitoring

Category Tool Remarks
Experiment tracking tensorboard, mlflow  
  lrcurve, livelossplot Plot realtime learning curve in Keras
GPU Usage gpumonitor, nvtop  
  jupyterlab-nvdashboard See GPU Usage in jupyterlab
MLOps clearml, wandb, neptune.ai, replicate.ai  
Notification knockknock Get notified by slack/email
  jupyter-notify Notify when task is completed in jupyter
  apprise Notify to any platform
  pynotifier Generate desktop notification

#Interpretability

Category Tool Remarks
Adversarial Attack cleverhans General
  foolbox Image
  triggers NLP
Interpret models eli5, lime, shap, alibi, tf-explain, treeinterpreter, pybreakdown, xai, lofo-importance, interpretML, shapash  
  exbert Interpret BERT
  bertviz Explore self-attention in BERT
NLP word2viz, whatlies word-vectors
  Language Interpretability Tool, transformers-interpret  

#Visualization

Category Tool Remarks
Diagrams dl-visuals, ml-visuals  
  chalk Declarative drawing API
Libraries matplotlib, seaborn, pygal, plotly, plotnine  
  yellowbrick, scikit-plot Visualization for scikit-learn
  pyldavis Visualize topics models
  dtreeviz Visualize decision tree
  txtmarker Highlight text in PDF
  metriculous Visualize model performance
Animated charts bar_chart_race Bar chart race animation
  pandas_alive Animated charts in pandas
High dimensional visualization umap  
  ivis Ivis Algorithm
Interactive charts bokeh  
  flourish-studio Create interactive charts online
  mpld3 Matplotlib to D3 Converter
Model Visualization netron, nn-svg Architecture
  keract Activation maps for keras
  keras-vis Visualize keras models
  PlotNeuralNet Latex code for drawing neural network
  loss-landscape-anim Generate loss landscape of optimizer
Styling open-color Color Schemes
  mplcyberpunk Cyberpunk style for matplotlib
  chart.xkcd XKCD like charts
  adjustText Prevent overlap when plotting point text label
Generate graphs using markdown mermaid  
Tree-map chart squarify  
3D charts babyplots  

#Phase: Production

#Model Export

Category Tool Remarks
Benchmarking torchprof Profile pytorch layers
  scalene, pyinstrument Profile python code
  k6 Load test API
  ai-benchmark Benchmark VM on 19 different models
Cloud Storage Zenodo, GitHub Releases, OneDrive, Google Drive, Dropbox, S3, mega, DAGsHub, huggingface-hub  
Data Pipeline pypeln  
Dependencies pip-chill pip freeze without dependencies
  pipreqs Generate requirements.txt based on imports
  conda-pack Export conda for offline use
Distributed training horovod  
Model Store modelstore  
Optimization nn_pruning Movement Pruning
  aimet, tensorflow-lite Quantization
Serialization sklearn-porter, m2cgen Transpile sklearn model to C, Java, JavaScript and others
  onnxmltools Classic ML models to onnx format
  hummingbird Convert ML models to PyTorch
  cloudpickle, jsonpickle Pickle extensions

#Inference

Category Tool Remarks
Authentication pyjwt (JWT), auth0, okta, cognito  
Batch Jobs airflow, luigi, dagster, oozie, prefect, kubernetes-cron-jobs, argo  
  rq, schedule, huey Task Queue
  mlq Queue ML Tasks in Flask
Caching cachetools, cachew (cache to local sqlite)  
  redis-py, pymemcache  
Cloud Monitoring datadog  
Configuration Management config, python-decouple, python-dotenv, dynaconf  
CORS flask-cors CORS in Flask
Database flask-sqlalchemy, tinydb, flask-pymongo, odmantic  
  tortoise-orm Asyncio ORM similar to Django
Monitoring whylogs Data Logging
  grafana, prometheus Metric
  sentry, honeybadger Error Reporting
Data Validation schema, jsonschema, cerberus, pydantic, marshmallow, validators  
Dashboard streamlit Generate frontend with python
  gradio Fast UI generation for prototyping
  dash React Dashboard using Python
  voila Convert Jupyter notebooks into dashboard
  streamlit-drawable-canvas Drawable Canvas for Streamlit
  streamlit-terran-timeline Show timeline of faces in videos
  streamlit components Collection of streamlit components
Deployment Checklist ml-checklist  
Documentation mkdocs, pdoc  
Drift Detection alibi-detect, torchdrift, boxkite Outlier and drift detection
Edge Deployment Tensorflow Lite, coreml, Tensorflow.js  
Logging loguru  
Model Serving cortex, torchserve, ray-serve, bentoml, seldon-core Serving Framework
  flask, fastapi API Frameworks
Processing pyspark, hive  
Serverless mangum Use FastAPI in Lambda
Server-Side Events sse-starlette Server-side events for FastAPI
Stream Processing flink, kafka, apache beam  
Testing schemathesis Automatic test generation from Swagger
  pytest-benchmark Profile time in pytest
  exdown Extract code from markdown files
  mktestdocs Test code present in markdown files
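
The Caching row above is about avoiding repeated work at inference time. A minimal sketch of the idea is a time-to-live (TTL) cache decorator, shown here with the standard library only. It is not the API of cachetools or cachew; `ttl_cache`, `predict`, and the fake clock are hypothetical names used for illustration.

```python
import time
from functools import wraps


def ttl_cache(ttl_seconds, clock=time.monotonic):
    """Cache results per positional-argument tuple for ttl_seconds."""
    def decorator(func):
        store = {}  # args -> (expiry_time, value)

        @wraps(func)
        def wrapper(*args):
            now = clock()
            hit = store.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]  # still fresh: reuse the cached value
            value = func(*args)
            store[args] = (now + ttl_seconds, value)
            return value
        return wrapper
    return decorator


# Demonstration with a fake clock so no real waiting is needed.
fake_now = [0.0]
calls = []


@ttl_cache(ttl_seconds=60, clock=lambda: fake_now[0])
def predict(text):
    calls.append(text)  # stands in for an expensive model call
    return len(text)


predict("hello")
predict("hello")    # served from the cache
fake_now[0] = 61.0  # pretend 61 seconds have passed
predict("hello")    # entry expired, so the function runs again
print(len(calls))
```

This sketch never evicts old entries and is not thread-safe, which is where the dedicated caching libraries listed above are the better choice.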

#Python libraries

Category Tool Remarks
Async tomorrow  
Audio simpleaudio Play audio using python
Automation pyuserinput, pyautogui, pynput Control mouse and keyboard
bloom filter python-bloomfilter  
CLI Formatting rich  
Concurrent database pickleshare  
Code to Maths latexify-py, handcalcs  
Create interactive prompts prompt-toolkit  
Collections bidict Bidirectional dictionary
  sortedcontainers Sorted list, set and dict
  munch Dictionary with dot access
Correlation Metric xicor  
Date and Time pendulum  
Decorators retrying (retry some function)  
Debugging PySnooper  
Improved doctest xdoctest  
Linting pylint, pycodestyle Code Formatting
  pydocstyle Check docstring
  safety, bandit, shellcheck Check vulnerabilities
  mypy Check types
  black Automated Formatting
Leaflet maps from python folium  
Multiprocessing filelock Lock files during access from multiple process
Path-like interface to remote files pathy  
Pretty print tables in CLI tabulate  
Progress bar fastprogress, tqdm  
Run python libraries in sandbox pipx  
Shell commands as functions sh  
Standard Library Extension ubelt  
Subprocess delegator.py  
Testing crosshair (find failure cases for functions)  
Virtual webcam pyfakewebcam  
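
The bloom filter row above refers to a compact, probabilistic set: membership tests may return false positives but never false negatives. The sketch below shows the mechanism with the standard library. It is an illustrative assumption, not the python-bloomfilter API, and the class name, sizes, and hashing scheme are hypothetical choices.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter backed by a bytearray of bits."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a counter.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every derived bit is set; may be a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
for url in ["https://example.org/a", "https://example.org/b"]:
    seen.add(url)
print("https://example.org/a" in seen)
```

A typical use is skipping URLs a crawler has already visited: a rare false positive only costs one skipped page, while memory use stays fixed regardless of how many items are added.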

#Utilities

Category Tool Remarks
Colab colab-cli Manage Colab notebooks from the command line
Drive drive-cli Use google drive similar to git
Database mlab Free 500 MB MongoDB
Data Visualization flourish-studio  
Git gitjk Undo what you just did in git
Linux ripgrep  
Trade-off tools egograph Find alternatives to anything
URL: https://ib.bsb.br/nlp
Reference: https://digitalhumanities.fas.harvard.edu/resources/choosing-digital-methods-and-tools/