Description:
Example Projects:
- Visualizing Russian (Harvard)
- Hedera (Harvard)
- Concept Search / Open Books (Harvard)
- China Biographical Database Project (Harvard)
Methods:
- Topic Modeling
- Information Retrieval
- Text Classification
- Sentiment Analysis
- Word Frequency Analysis
- Concordancing
- Named Entity Recognition
- Collocation
- Word Embeddings
- Transformer Models
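
Two of the simpler methods above, word frequency analysis and concordancing, can be sketched with only the Python standard library. This is a minimal illustration, not how any of the tools listed below implement them; the function names, the tokenization rule, and the sample sentence are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text, top_n=5):
    """Return the top_n most frequent tokens with their counts."""
    return Counter(tokenize(text)).most_common(top_n)


def concordance(text, keyword, width=2):
    """Return (left context, keyword, right context) for each hit (KWIC)."""
    tokens = tokenize(text)
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Hypothetical sample text, used only for illustration.
sample = "The whale surfaced. The crew watched the whale dive again."
print(word_frequencies(sample, top_n=2))
print(concordance(sample, "whale"))
```

A real project would normally use a proper tokenizer and stopword handling, for example from one of the packages listed below.
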
Popular Tools:
- Voyant Tools: web-based reading and analysis environment for digital texts
- MALLET: a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text
- WordSeer 4: a text analysis environment that combines visualization, information retrieval, sensemaking and natural language processing to make the contents of text navigable, accessible, and useful
- AntConc: a freeware corpus analysis toolkit for concordancing and text analysis
Popular Programming Languages and Packages:
- Python:
  - NLTK (Natural Language Toolkit): a leading platform for building Python programs to work with human language data
  - spaCy: industrial-strength NLP written in Cython for speed
  - Gensim: free open-source Python library for representing documents as semantic vectors
- R:
  - Quanteda: a package for the Quantitative Analysis of Textual Data
  - Tidytext: using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use
  - spaCyR: an R wrapper for the Python spaCy library
Resources:
- TAPoR (Text Analysis Portal for Research): a gateway to the tools used in sophisticated text analysis and retrieval

#Visual Presentation and Analysis

Description
Example Projects:
- Mapping Color in History (Harvard)
Popular Tools, Platforms, and Standards:
- IIIF, the International Image Interoperability Framework, provides a series of API specifications which various image servers and viewers implement
  - IIIF Awesome Resources
  - Harvard IIIF Website
  - Image servers:
    - Loris
    - Cantaloupe
  - Image viewers:
    - Mirador 3
    - Universal Viewer
  - Annotation servers:
    - CatchPy
Popular Software Packages

#Spatial Analysis and Web Mapping

Description
Example Projects:
- Slave Revolt in Jamaica (Harvard)
Popular Tools and Platforms:
- Esri ArcGIS: powerful desktop GIS software suite for mapping, geoprocessing, cartography, and spatial analysis
- QGIS: free and open-source (FOSS) desktop GIS, an alternative to ArcGIS
- Palladio: a Stanford Humanities + Design Lab online tool for visualizing complex historical data
- CARTO (previously CartoDB): paid Software as a Service platform for spatial analysis and GIS
- Esri ArcGIS Online: cloud-based GIS platform for creating and sharing interactive maps and analyzing spatial data
- Neatline: a suite of add-on tools for Omeka designed to help tell stories with maps, images, and timelines
Popular Software Packages:
- Leaflet: open-source JavaScript library for interactive web maps
- D3.js: open-source JavaScript library for general data visualization, including web mapping
- Google Maps API: one of the most popular general purpose mapping libraries
- OpenLayers: JavaScript library for displaying maps and analyzing geographical data
- Mapbox GL: SDK for web maps powered by Mapbox, which allows users to design and publish beautiful maps

#Network Analysis

Description
Example Projects:
- Visualizing Broadway (Harvard)
Popular Tools:
- Gephi: a FOSS interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs
- Palladio: a Stanford Humanities + Design Lab online tool for visualizing complex historical data
- Cytoscape: originally designed for biological research, now a general platform for network analysis and visualization
- NodeXL: a network analysis and visualization plugin for Excel
Popular Software Packages:
- NetworkX (Python): a package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks
- igraph (R and Python): a collection of network analysis tools with an emphasis on efficiency, portability, and ease of use
- visNetwork (R): an R package for network visualization using vis.js
- D3.js (JavaScript): better suited to network visualization and presentation than to network analysis
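
As a rough idea of what a network analysis package computes, the sketch below builds an undirected graph from an edge list and calculates degree centrality (each node's degree divided by the number of other nodes). It uses only the standard library. The helper names and the tiny edge list are hypothetical, and packages such as NetworkX or igraph offer far more complete implementations.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Degree of each node divided by the number of other nodes."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


# Hypothetical collaboration edges, used only for illustration.
edges = [("Ada", "Ben"), ("Ada", "Cy"), ("Ada", "Dee"), ("Ben", "Cy")]
centrality = degree_centrality(build_graph(edges))
for node, score in sorted(centrality.items(), key=lambda item: -item[1]):
    print(f"{node}: {score:.2f}")
```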

#Timelines and Temporal Analysis

Popular Timeline Creation Tools:
- TimelineJS: an open-source tool that enables anyone to build visually rich, interactive timelines using nothing more than a Google Sheet
- Chronos Timeline: designed for needs in humanities and social sciences to represent time-based data
- Neatline: a suite of add-on tools for Omeka designed to tell stories with maps, images, and timelines

#Machine Learning

Example Projects
Popular Platforms:
- Amazon SageMaker
- IBM Watson Studio
- Google Cloud AI
- H2O.ai
- KNIME
Popular Software Packages:
- Python:
  - Keras: deep learning and neural networks
  - PyTorch: deep learning, computer vision
  - Scikit-Learn: data preprocessing, text vectorization, classification, clustering
  - TensorFlow: deep learning
- R:
  - caret: functions that streamline the process for creating predictive models
Resources:
- Machine Learning for the Humanities

#Database Development

Popular Databases:
- Relational:
  - PostgreSQL
  - MySQL
- Document / NoSQL:
  - MongoDB
  - Elasticsearch
  - Solr
- Key-value Store:
  - Redis
  - AWS DynamoDB
- Graph:
  - Neo4j
Popular Database Tools and Database Management Systems (DBMS):
- DbVisualizer
- DataGrip
- Postico
- SQL Server Management Studio

#Data Cleaning

Popular Software:
- Python:
  - Pandas: data analysis and manipulation tool
  - NumPy: scientific computing with Python
  - Jupyter Notebook / JupyterLab: web-based interactive development environment
- R:
  - Tidyverse: an opinionated collection of R packages designed for data science
- Language-agnostic (mostly): Regular Expressions
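
A small sketch of how regular expressions can help with cleaning, using Python's built-in `re` module. The two helper functions and the sample values are assumptions made for illustration; real data usually needs patterns tuned to its own quirks.

```python
import re


def clean_value(raw):
    """Trim the value and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", raw.strip())


def extract_year(raw):
    """Return the first four-digit year between 1000 and 2099, or None."""
    match = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", raw)
    return int(match.group(1)) if match else None


# Hypothetical messy values, used only for illustration.
print(clean_value("  Cambridge,\tMass.  \n"))
print(extract_year("c. 1850 (approx.)"))
print(extract_year("undated"))
```
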
Popular Tools:
- Google Sheets
- OpenRefine: a powerful tool for working with messy data

#Research Data Management

Popular Software Packages:
- Git: distributed version control system
Popular Tools and Platforms:
- Dataverse: an open-source research data repository
- Git desktop clients, such as GitHub Desktop or GitKraken
- Tropy: tool to organize and describe photographs of research material

#Project Management

Popular Tools and Platforms:
- Trello
- Jira
- GitHub Projects
- Asana

#Citation Management

Popular Tools and Platforms:
- Zotero
- EndNote
- Mendeley

#Digital Collections

Description
Example Projects:
- Eileen Southern and the Music of Black Americans (Harvard)
- Imperiia (Harvard)
Popular Platforms and Frameworks:
- Omeka: open-source web publishing platform
- Scalar: authoring and publishing platform
- Drupal: content management system
- WordPress: content management system

#Phase: Data

#Data Annotation

Category	Tool	Remarks
Audio	audio-annotator, audiono
General	superintendent, pigeon	Annotate in notebooks
	labelstudio	Open Source Data Labeling Tool
	awesome-data-labeling
Image	makesense.ai, labelimg, via, cvat
Text	doccano, brat
	chatito	Generate text datasets using DSL
	prodigy	Paid
Inter-rater agreement	disagree
	simpledorff	Krippendorff’s Alpha
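
To make the inter-rater agreement row more concrete, here is a minimal sketch of Cohen's kappa for two annotators. Note that this is a simpler, related statistic and not Krippendorff's Alpha, which also handles more than two raters and missing labels. The function name and the toy labels are assumptions for illustration only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (nominal labels)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Share of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations, used only for illustration.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```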

#Data Collection

Category	Tool	Remarks
Curations	datasetlist, UCI, Google Dataset Search, fastai-datasets, public-apis, awesome-public-datasets, aws opendata, penn-ml-benchmark
	huggingface-datasets, The Big Bad NLP Database, nlp-datasets, nlp corpora	NLP Datasets
	bifrost, VisualData, roboflow	Computer Vision Datasets
Words	curse-words, badwords, LDNOOBW, 10K most common words, common-misspellings
	profanity	Profane words
	wordlists	Words organized by topic
	english-words	A text file containing over 466k English words
	tf-idf-iif-top-100-wordlists	Top 100 distinctive words for each language
	freeling	Dictionary of words grouped by POS
	catvar, lemmatization-lists, unimorph-eng	Word variants
Text Corpus	project gutenberg, nlp-datasets, 1 trillion n-grams, litbank, BookCorpus, south-asian text corpus
	opus, oscar (big multilingual corpus)	Translation Parallel Text
	pile	825GB text corpus
	freebase	Relation triples
	opensubtitles	Movie subtitles parallel corpus
	lti-langid	Language Identification Corpus for 1152 languages
	fandom-transcripts	Movie and Series Transcripts
	cognet	Cognates for 338 languages
	wold	Loan words
Sentiment	SST2, Amazon Reviews, Yelp Reviews, Movie Reviews, Food Reviews, Twitter Airline, GOP Debate, Sentiment Lexicons for 81 languages, SentiWordNet, Opinion Lexicon, Wordstat words, Emoticon Sentiment, socialsent
Emotion	NRC-Emotion-Lexicon-Wordlevel, ISEAR(17K), HappyDB, emotion-to-emoji-mapping
	EmoTag1200	Emoji-Emotion scores
NLU Intents	rasa-nlu-training-data
N-grams	google-book-ngrams, norvig-processed-ngrams
Word Frequency	wordfreq, key-nbc
Summarization	curation-corpus
Conversations	conversational-datasets, cornell-movie-dialog-corpus, persona-chat, DialogDatasets
Semantic Parsing	wikisql, spider	Text to SQL
	WebQuestions, ComplexWebQuestions	Text to Knowledge Graph
	CoNaLa, CONCODE	Text to program
	amrlib	Parse AMR data
Image	1 million fake faces, flickr-faces, objectnet, YFCC100m, USPS, Animal Faces-HQ dataset (AFHQ)
	tiny-images, SVHN, STL-10, imagenette, CIFAR-10	Small image datasets for quick experimentation
	omniglot, mini-imagenet	One Shot Learning
Paraphrasing	PPDB
Audio	audioset	YouTube audio with labels
Speech	voxforge, openslr, cmu wilderness, commonvoice
Speech synthesis	CMU Arctic
Graphs	Social Networks (Github, Facebook, Reddit)
Handwriting	iam-handwriting
	text_renderer	Generate synthetic OCR text

#Importing Data

Category	Tool	Remarks
Prebuilt	openml, lineflow
	rs_datasets	Recommendation Datasets
	nlp	Python interface to NLP datasets
	tensorflow_datasets	Access datasets in TensorFlow
	hub	Prebuilt datasets for PyTorch and TensorFlow
	pydataset
	ir_datasets	Information Retrieval Datasets
App Store	google-play-scraper
Arxiv	pyarxiv	Programmatic access to arxiv.org
Audio	pydub
Crawling	MechanicalSoup, libextract
	pyppeteer	Chrome Automation
	trafilatura	Extract text sections from HTML
	justext	Remove boilerplate from scraped HTML
	hext	DSL for extracting data from HTML
	ratelimit	API rate limit decorator
	backoff	Exponential backoff and jitter
	asks	Async version of requests
	requests-cache	Cached version of requests
	html2text	Convert HTML to markdown-formatted plain text
Database	blaze	Pandas and Numpy interface to databases
Email	talon
Excel	openpyxl
Google Drive	gdown, pydrive
Google Maps	geo-heatmap
Google Search	googlesearch	Parse google search results
Google Sheets	gspread
Google Ngrams	google-ngram-downloader
HTML	python-readability, html-text	HTML to Text
Image	py-image-dataset-generator, idt, jmd-imagescraper	Auto fetch images from web for certain search
Video	moviepy	Edit Videos
	pytube	Download YouTube videos
Lyrics	lyricsgenius
Machine Translation Corpus	mtdata
News	news-please, news-catcher	Scrape news
	pygooglenews	Google News
Network Packet	dpkt, scapy
PDF	camelot, tabula-py, parsr, pdftotext, pdfplumber, pymupdf
	grobid	Parse PDF into structured XML
	PyPDF2	Read and write PDF in Python
	pdf2image	Convert PDF to image
Remote file	smart_open
Text to Speech	gtts
Twitter	twint, tweepy, twarc	Scrape Twitter
Wikipedia	wikipedia, wikitextparser	Access data from wikipedia
	wikitables	Import table from wikipedia article
Wikidata	wikidata	Python API to wikidata
XML	xmltodict	Parse XML as python dictionary
YouTube	scrapetube	Scrape video metadata from channel
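
The "Exponential backoff and jitter" remark in the Crawling rows refers to a retry pattern that can be sketched with the standard library alone. The version below uses the "full jitter" variant, where each wait is drawn uniformly between zero and a capped, doubling limit. It is not the API of the `backoff` or `ratelimit` packages; every name and default value here is an assumption chosen for illustration.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Full-jitter delay: uniform between 0 and min(cap, base * 2**attempt)."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(func, max_tries=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Call func, retrying on any exception with growing, randomized waits."""
    for attempt in range(max_tries):
        try:
            return func()
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of retries: surface the last error
            sleep(backoff_delay(attempt, base, cap))


# Hypothetical flaky call that fails twice before succeeding.
attempts = {"count": 0}


def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits = []  # record the waits instead of sleeping, to keep the demo fast
result = call_with_retries(flaky_fetch, sleep=waits.append)
print(result, attempts["count"], len(waits))
```

Passing `sleep` as a parameter keeps the sketch testable. In a real scraper it would stay at its default of `time.sleep`, and the `except` clause would be narrowed to the errors worth retrying.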

#Data Augmentation

Category	Tool	Remarks
Audio	audiomentations, muda
Image	imgaug, albumentations, augmentor, solt
	deepaugment	Automatic augmentation
	TextRecognitionDataGenerator, genalog	OCR
Tabular data	deltapy
	mockaroo	Generate synthetic user details
Text	nlpaug, noisemix, textattack, textaugment, niacin, SeaQuBe, DataAug4NLP, NL-Augmenter
	fastent	Expand NER entity list

#Phase: Exploration

#Data Preparation

Category	Tool	Remarks
Class Imbalance	imblearn
Categorical encoding	category_encoders
	dirty_cat	Encode categories with typos
Dataframe	cudf	Pandas on GPU
Data Validation	pandera, pandas-profiling	Pandas
Data Cleaning	pyjanitor	Janitor ported to python
Graph Sampling	little ball of fur
Missing values	missingno
Parallelize	pandarallel, swifter, modin	Parallelize pandas
	vaex	Pandas on huge data
	numba	Parallelize numpy
Parsing	pyparsing, parse
Split images into train/validation/test	split-folders
Submodular Optimization	twinning, apricot
Weak Supervision	snorkel

#Data Exploration

Category	Tool	Remarks
Explore Data	sweetviz, dataprep, quickda, vizidata	Generate quick visualizations of data
	ipyplot	Plot images
Notebook Tools	nbdime	View Jupyter notebooks through CLI
	papermill	Parametrize notebooks
	nbformat	Access notebooks programmatically
	nbconvert	Convert notebooks to other formats
	ipyleaflet	Maps in notebooks
	ipycanvas	Draw diagrams in notebook
	fastdoc	Convert notebook to PDF book
Relationship	ppscore	Predictive Power Score
	pdpbox	Partial Dependence Plot

#Feature Generation

Category	Tool	Remarks
Automatic feature engineering	featuretools, autopandas
	tsfresh	Automatic feature engineering for time series
DAG based dataset generation	DFFML
Dimensionality reduction	fbpca, fitsne, trimap
Metric learning	metric-learn, pytorch-metric-learning
Time series	python-holidays	List of holidays
	skits	Transformation for time-series data
	catch22	Pre-built features for time-series data

#Phase: Modeling

#Model Selection

Category	Tool	Remarks
Project Structure	cookiecutter-data-science
Find SOTA models	sotawhat, papers-with-code, codalab, nlpprogress, evalai, collectiveknowledge, sotabench	Benchmarks
	bert-related-papers	BERT Papers
	survey-papers	Collection of survey papers
Pretrained models	modeldepot, pytorch-hub	General
	pretrained-models.pytorch, pytorchcv	Pre-trained ConvNets
	pytorch-image-models	200+ pretrained ConvNet backbones
	huggingface-models, huggingface-pretrained	Transformer Models
	awesome-models	Pretrained CoreML models
	huggingface-languages	Multi-lingual Models
	model-forge, The Super Duper NLP Repo	Pre-trained NLP models by usecase
AutoML	auto-sklearn, mljar-supervised, automl-gs, pycaret, evalml
	lazypredict	Run all sklearn models at once
	tpot	Genetic AutoML
	autocat	Auto-generate text classification models in spacy
	mindsdb, ludwig	Autogenerate ML code
Active Learning	modal
Anomaly detection	adtk
Contrastive Learning	contrastive-learner
Deep Clustering	deep-clustering-toolbox
Few Shot Learning	keras-fewshotlearning
Fuzzy Learning	fylearn, scikit-fuzzy
Genetic Programming	gplearn
Gradient Boosting	catboost, xgboost, ngboost
	lightgbm, thundergbm	GPU Capable
Graph Neural Networks	spektral	GNN for Keras
Graph Embedding and Community Detection	karateclub, python-louvain, communities
Hidden Markov Models	hmmlearn
Interpretable Models	imodels	Models that show rules
Multi-view Learning	mvlearn
Noisy Label Learning	cleanlab
Optimization	nevergrad	Gradient Free Optimization
	cvxpy	Convex Optimization
Optimal Transport	pot, geomloss
Probabilistic modeling	pomegranate, pymc3
Rule based classifier	sklearn-expertsys
Self-Supervised Learning	lightly, vissl, solo-learn	Implementations of SSL models
	self_supervised	Self-supervised models in Fast.AI
Spiking Neural Network	norse
Support Vector Machines	thundersvm	Run SVM on GPU
Survival Analysis	lifelines

#Frameworks

Category	Tool	Remarks
Addons	mlxtend	Extra utilities not present in frameworks
	tensor-sensor	Visualize tensors
Pytorch	pytorch-summary	Keras-like summary
	torchtyping, tsalib	Type annotation for tensors
	einops	Einstein Notation
	kornia	Computer Vision Methods
	nonechucks	Drop corrupt data automatically in DataLoader
	pytorch-optimizer	Collection of optimizers
	pytorch-block-sparse	Sparse matrix replacement for nn.Linear
	pytorch-forecasting	Time series forecasting in PyTorch Lightning
	pytorch-lightning	Lightweight wrapper for PyTorch
	skorch	Wrap PyTorch in scikit-learn compatible API
	torchcontrib	SOTA Building Blocks in PyTorch
	bitsandbytes	8-bit optimizers for PyTorch
Scikit-learn	scikit-lego, iterative-stratification
	iterstrat	Cross-validation for multi-label data
	scikit-multilearn	Multi-label classification
	tscv	Time-series cross-validation
Sparsification	sparseml	Apply sparsification to any framework
Tensorflow	tensorflow-addons
	keras-radam	RADAM optimizer
	ktrain	FastAI like interface for keras
	larq	Binarized neural networks
	scikeras	Scikit-learn Wrapper for Keras
	tavolo	Kaggle Tricks as Keras Layers
	tensorflow-text	Addons for NLP
	tensorflow-wheels	Optimized wheels for Tensorflow
	tf-sha-rnn
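
The "Time-series cross-validation" row above refers to splitting schemes in which training data always precedes test data in time. A minimal expanding-window version can be sketched as follows. This does not reproduce the API of `tscv` or scikit-learn; the function name and parameters are assumptions for illustration.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) with train strictly before test."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        test_end = test_start + test_size
        # The training window grows with each split; the test window slides.
        yield list(range(test_start)), list(range(test_start, test_end))


splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train, test in splits:
    print(train, test)
```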

#Natural Language Processing

Category	Tool	Remarks
Libraries	spacy, nltk, corenlp, deeppavlov, kashgari, transformers, ernie, stanza, nlp-architect, spark-nlp, pytext, FARM
	headliner, txt2txt	Sequence to sequence models
	Nvidia NeMo	Toolkit for ASR, NLP and TTS
	nlu	1-line models for NLP
	pyconverse	Conversational Text Analysis
	booknlp	NLP for Books
	fast-bert, simpletransformers	Wrappers
	finetune	Scikit-learn like API for transformers
	compromise	Javascript NLP
CPU-optimizations	turbo_transformers, onnx_transformers
	fastT5	Generate optimized T5 model
Preprocessing	textacy, texthero, textpipe, nlpretext
	JamSpell, pyhunspell, pyspellchecker, cython_hunspell, hunspell-dictionaries, autocorrect (can add more languages), symspellpy, spello (train your own spelling correction), contextualSpellCheck, neuspell, nlprule, spylls	Spelling Correction
	gramformer	Grammar Checker
	language-tool-python, gingerit	Grammatical Error Correction
	ekphrasis	Pre-processing for social media texts
	editop	Compute edit-operations for text normalization
	contractions, pycontractions	Contraction Mapping
	truecase	Fix casing
	nnsplit, deepsegment, sentence-doctor, pysbd, sentence-splitter	Sentence Segmentation
	wordninja	Probabilistic Word Segmentation
	punctuator2	Punctuation Restoration
	stopwords-iso	Stopwords for all languages
	language-check, langdetect, polyglot, pycld2, cld2, cld3, langid, lumi_language_id	Language Identification
	langcodes	Get language from language code
	neuralcoref	Coreference Resolution
	inflect, lemminflect, pyinflect	Inflections
	scrubadub	PII removal
	ftfy, clean-text, text-unidecode	Fix Unicode Issues
	fastpunct	Punctuation Restoration
	pyphen	Hyphenate words into syllables
	pypostal, mordecai, usaddress, libpostal	Parse Street Addresses
	geopy, geocoder, nominatim, pelias, photon, lieu	Geocoding
	probablepeople, python-nameparser	Parse person name
	python-phonenumbers	Parse phone numbers
	numerizer, word2number	Parse natural language number
	dateparser	Parse natural dates
	ctparse	Parse natural language time
	daterangeparser	Parse date ranges in natural language
	emoji	Handle emoji
	pyarabic	multilingual
Tokenization	sentencepiece, youtokentome, subword-nmt
	sacremoses	Rule-based
	jieba, pkuseg	Chinese Word Segmentation
	kytea	Japanese word segmentation
Clustering	kmodes, star-clustering, genieclust
	spherecluster	K-means with cosine distance
	sib	Sequential Information Bottleneck
	kneed	Automatically find number of clusters from elbow curve
	OptimalCluster	Automatically find optimal number of clusters
	gsdmm	Short-text clustering
Code Switching	codeswitch
Constituency Parsing	benepar, allennlp, chunk-english-fast
Compact Models	mobilebert, distilbert, tinybert, BERT-of-Theseus-MNLI, MiniLM
Cross-lingual Embeddings	muse, laserembeddings, xlm, LaBSE
	transvec, vecmap	Train mapping between monolingual embeddings
	MuRIL	Embeddings for 17 indic languages with transliteration
	BPEmb	Subword Embeddings in 275 Languages
	piecelearn	Train own sub-word embeddings
Dictionary	vocabulary
Domain-specific	codebert	Code
	clinicalbert-mimicnotes, clinicalbert-discharge-summary	Clinical Domain
	twitter-roberta-base	twitter
	scispacy	bio-medical data
	blackstone	Legal text
Entity Linking	dbpedia-spotlight, GENRE
Entity Matching	py_entitymatching, deepmatcher
Embeddings	InferSent, embedding-as-service, bert-as-service, sent2vec, sense2vec, glove-python, fse
	counterix	Train custom Count-based DSM
	embeddix	Convert word vectors format
	wiki2vec	Word2Vec trained on DBPedia Entities
	chars2vec	Character-embeddings for handling typo and slangs
	rank_bm25, BM25Transformer	BM25
	sentence-transformers, DeCLUTR	BERT sentence embeddings
	conceptnet-numberbatch	Word embeddings trained with common-sense knowledge graph
	word2vec-twitter	Word2vec trained on twitter
	pymagnitude	Access word-embeddings programmatically
	chakin	Download pre-trained word vectors
	zeugma	Pretrained-word embeddings as scikit-learn transformers
	starspace	Learn embeddings for anything
	svd2vec	Learn embeddings from co-occurrence
	all-but-the-top	Post-processing for word vectors
	entity-embed	Train custom embeddings for named entities
Emotion Classification	goemotion-pytorch, text2emotion
	emosent-py	Sentiment scores for Emojis
Feature Generation	homer, textstat	Readability scores
	LexicalRichness	Lexical Richness Measure
Fill mask	fitbert
Finite State Transducer	OpenFST
Gibberish Detection	nostril, gibberish-detector
Grammar Induction	gitta, grasp	Generate CFG from sentences
Information Extraction	claucy
	GiveMe5W1H	Extract 5W1H (who, what, when, where, why, how) phrases from news
	spikex	Spacy pipeline for knowledge extraction
Keyword extraction	rake, multi-rake, pke, phrasemachine, keybert, word2phrase
	pyate	Automated Term Extraction
Knowledge	conceptnet-lite
	stanford-openie	Knowledge Graphs
	verbnet-parser	VerbNet parser
Knowledge Distillation	textbrewer, aquvitae
Language Model Scoring	lm-scorer, bertscore, kenlm, spacy_kenlm, mlm-scoring
Lexical Simplification	easee	Evaluation metric
Metrics	seqeval	NER, POS tagging
	ranking-metrics, cute_ranking	Metrics for Information Retrieval
	mir_eval	Music Information Retrieval Metrics
Morphology	unimorph	Morphology data for many languages
Multilingual support	polyglot, trankit
	inltk, indic_nlp	Indic Languages
	cltk	NLP for Latin and classical languages
	langrank	Auto-select optimal transfer language
Named Entity Recognition (NER)	spaCy, Stanford NER, sklearn-crfsuite
	med7	Spacy NER for medical records
Nearest neighbor	faiss, sparse_dot_topn, n2, autofaiss
NLU	snips-nlu
	ParlAI	Dialogue System
Paraphrasing	parrot
	pegasus	Question Paraphrasing
	paraphrase_diversity_ranker	Rank paraphrases of sentence
	sentaugment	Paraphrase mining
Phonetics	epitran	Transliterate text into IPA
	allosaurus	Recognize phones for 2000 languages
Phonology	panphon	Generate phonological feature representations
	phoible	Database of segment inventories for 2186 languages
Probabilistic parsing	parserator	Create domain-specific parser for address, name etc.
Profanity detection	profanity-check
Pronunciation	pronouncing
Question Answering	haystack	Build end-to-end QA system
	mcQA	Multiple Choice Question Answering
	TAPAS	Table Question Answering
Question Generation	question-generation, questiongen.ai	Question Generation Pipeline for Transformers
Ranking	transformer-rankers
Relation Extraction	OpenNRE
Search	elasticsearch-dsl	Wrapper for Elasticsearch
	jina	production-level neural semantic search
	meilisearch-python
Semantic parsing	quepy
Sentiment	vaderSentiment, afinn	Rule based
	absa	Aspect Based Sentiment Analysis
	xlm-t	Models
Spacy Extensions	spacy-pattern-builder	Generate dependency matcher patterns automatically
	spacy_grammar	Rule-based grammar error detection
	role-pattern-builder	Pattern based SRL
	textpipeliner	Extract RDF triples
	tenseflow	Convert tense of sentence
	camphr	Wrapper to transformers, elmo, udify
	spleno	Domain-specific lemmatization
	spacy-udpipe	Use UDPipe from Spacy
	spacymoji	Add emoji metadata to spacy docs
String match	phrase-seeker, textsearch
	jellyfish, fuzzy, doublemetaphone	Perform string and phonetic comparison
	clavier	Edit distance based on keyboard layout
	flashtext	Super-fast extract and replace keywords
	pythonverbalexpressions	Verbally describe regex
	commonregex	Ready-made regex for email/phone etc.
	textdistance, editdistance, word-mover-distance, edlib	Text distances
	wmd-relax	Word mover distance for spacy
	fuzzywuzzy, spaczz, PolyFuzz, rapidfuzz, fuzzymatcher	Fuzzy Search
	deduplipy, dedupe	Active-Learning based fuzzy matching
	recordlinkage	Record Linkage
Summarization	textrank, pytldr, bert-extractive-summarizer, sumy, fast-pagerank, sumeval
	doc2query	Summarize document with queries
	summarizers	Controllable summarization
	insight_extractor	Extract insightful sentences from docs
Text Extraction	textract (Image, Audio, PDF)
Text Generation	gpt2client, textgenrnn, gpt-2-simple, aitextgen	GPT-2
	markovify	Markov chains
	accelerated-text	Template-based generation
	keytotext	Keyword to Sentence Generation
Transliteration	wiktra
Machine Translation	MarianMT, Opus-MT, joeynmt, OpenNMT, EasyNMT, argos-translate, dl-translate
	googletrans, word2word, translate-python, deep_translator	Translation libraries
	mosesdecoder	Statistical MT
	apertium	RBMT
	translators	Free calls to multiple translation APIs
	giza++, fastalign, simalign, eflomal, awesome-align	Word Alignment
Thesaurus	python-datamuse
Toxicity Detection	detoxify
Topic Modeling	gensim, guidedlda, enstop, top2vec, contextualized-topic-models, corex_topic, lda2vec, bertopic, tomotopy, ToModAPI
	zeroshot_topics	Zero-shot topic modeling
	octis	Evaluate topic models
Typology	lang2vec	Compare typological features of languages
Visualization	stylecloud	Word Clouds
	scattertext	Compare word usage across segments
	picture-text	Interactive tree-maps for hierarchical clustering
	ipymarkup	Visualize NER and syntax
Verb Conjugation	nodebox_linguistics_extended, mlconj3
Word Sense Disambiguation	pywsd, ewiser, supwsd
	frame-english-fast	Verb Disambiguation
Zero Shot Learning	setfit
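
Several rows in the String match category build on edit distance. As a minimal sketch of the underlying idea, the function below computes the Levenshtein distance with the classic dynamic-programming recurrence, keeping only two rows in memory. It is an illustration and not the implementation used by any listed package; the function name is an assumption.

```python
def edit_distance(source, target):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # previous[j] holds the distance between the processed prefix of source
    # and target[:j]; start with the empty prefix of source.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # distance from source[:i] to the empty string
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))
print(edit_distance("flaw", "lawn"))
```

For large vocabularies, the optimized libraries listed above (for example rapidfuzz or edlib) are a better fit than a pure-Python loop.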

#Computer Vision

Category	Tool	Remarks
Face recognition	face_recognition, mtcnn, insightface, face-detection
	face-alignment	Find facial landmarks
	Facial-Expression-Recognition.Pytorch	Face Emotion
Face swapping	faceit, faceit-live, avatarify
GANs	mimicry, imaginaire, pytorch-lightning-gans
High-level libraries	terran	Face detection, recognition, pose estimation
Image Hashing	ImageHash, imagededup
Image Inpainting	GAN Image Inpainting
Image Processing	scikit-image, imutils, opencv-wrapper, opencv-python
	torchio	Medical Images
Object detection	luminoth, detectron2, mmdetection, icevision
OCR	keras-ocr, pytesseract, keras-craft, ocropy, doc2text
	easyocr, kraken, PaddleOCR	Multilingual OCR
	layout-parser, pdftabextract	OCR tables from document
Segmentation	segmentation_models	Keras
	segmentation_models.pytorch	Segmentation models in PyTorch
Semantic Search	scoper	Video
Video summarization	videodigest

#Speech

Category	Tool	Remarks
Diarization	resemblyzer
Feature Engineering	python_speech_features	Convert raw audio to features
Libraries	speechbrain, pyannote, librosa, espnet
	silero-models	Pre-trained models
Source Separation	spleeter, nussl, open-unmix-pytorch, asteroid
Speech Recognition	kaldi, speech_recognition, delta, pocketsphinx-python, deepspeech, stt, vosk
Speech Synthesis	festvox, cmuflite, tts

#Recommendation System

Category	Tool	Remarks
Apriori algorithm	apyori
Collaborative Filtering	implicit
Libraries	xlearn, DeepCTR, RankFM	Factorization machines (FM), and field-aware factorization machines (FFM)
	libmf-python	Matrix Factorization
	lightfm, spotlight	Popular Recsys algos
	tensorflow_recommenders	Recommendation System in Tensorflow
Metrics	rs_metrics
Recommendation System in Pytorch	CaseRecommender
Scikit-learn like API	surprise

#Timeseries

Category	Tool	Remarks
Libraries	prophet, tslearn, pyts, seglearn, cesium, stumpy, darts, gluon-ts, stldecompose
	sktime	Scikit-learn like API
	atspy	Automated time-series models
Anomaly Detection	orion, luminaire	Unsupervised time-series anomaly detection
ARIMA models	pmdarima

#Hyperparameter Optimization

Category	Tool	Remarks
General	hyperopt, optuna, evol, talos
Keras	keras-tuner
Parameter optimization	ParameterImportance
Scikit-learn	hyperopt-sklearn, scikit-optimize	Bayesian Optimization
	sklearn-deap, sklearn-generic-opt	Evolutionary algorithm

#Phase: Validation

#Experiment Monitoring

Category	Tool	Remarks
Experiment tracking	tensorboard, mlflow
	lrcurve, livelossplot	Plot realtime learning curve in Keras
GPU Usage	gpumonitor, nvtop
	jupyterlab-nvdashboard	See GPU Usage in jupyterlab
MLOps	clearml, wandb, neptune.ai, replicate.ai
Notification	knockknock	Get notified by slack/email
	jupyter-notify	Notify when task is completed in jupyter
	apprise	Notify to any platform
	pynotifier	Generate desktop notification

#Interpretability

Category	Tool	Remarks
Adversarial Attack	cleverhans	General
	foolbox	Image
	triggers	NLP
Interpret models	eli5, lime, shap, alibi, tf-explain, treeinterpreter, pybreakdown, xai, lofo-importance, interpretML, shapash
	exbert	Interpret BERT
	bertviz	Explore self-attention in BERT
NLP	word2viz, whatlies	word-vectors
	Language Interpretability Tool, transformers-interpret

#Visualization

Category	Tool	Remarks
Diagrams	dl-visuals, ml-visuals
	chalk	Declarative drawing API
Libraries	matplotlib, seaborn, pygal, plotly, plotnine
	yellowbrick, scikit-plot	Visualization for scikit-learn
	pyldavis	Visualize topic models
	dtreeviz	Visualize decision tree
	txtmarker	Highlight text in PDF
	metriculous	Visualize model performance
Animated charts	bar_chart_race	Bar chart race animation
	pandas_alive	Animated charts in pandas
High dimensional visualization	umap
	ivis	Ivis Algorithm
Interactive charts	bokeh
	flourish-studio	Create interactive charts online
	mpld3	Matplotlib to D3 Converter
Model Visualization	netron, nn-svg	Architecture
	keract	Activation maps for keras
	keras-vis	Visualize keras models
	PlotNeuralNet	LaTeX code for drawing neural networks
	loss-landscape-anim	Generate loss landscape of optimizer
Styling	open-color	Color Schemes
	mplcyberpunk	Cyberpunk style for matplotlib
	chart.xkcd	XKCD like charts
	adjustText	Prevent overlap when plotting point text label
Generate graphs using markdown	mermaid
Tree-map chart	squarify
3D charts	babyplots

#Phase: Production

#Model Export

Category	Tool	Remarks
Benchmarking	torchprof	Profile pytorch layers
	scalene, pyinstrument	Profile python code
	k6	Load test API
	ai-benchmark	Benchmark VM on 19 different models
Cloud Storage	Zenodo, Github Releases, OneDrive, Google Drive, Dropbox, S3, mega, DAGsHub, huggingface-hub
Data Pipeline	pypeln
Dependencies	pip-chill	pip freeze without dependencies
	pipreqs	Generate requirements.txt based on imports
	conda-pack	Export conda for offline use
Distributed training	horovod
Model Store	modelstore
Optimization	nn_pruning	Movement Pruning
	aimet, tensorflow-lite	Quantization
Serialization	sklearn-porter, m2cgen	Transpile sklearn model to C, Java, JavaScript and others
	onnxmltools	Classic ML models to onnx format
	hummingbird	Convert ML models to PyTorch
	cloudpickle, jsonpickle	Pickle extensions

#Inference

Category	Tool	Remarks
Authentication	pyjwt (JWT), auth0, okta, cognito
Batch Jobs	airflow, luigi, dagster, oozie, prefect, kubernetes-cron-jobs, argo
	rq, schedule, huey	Task Queue
	mlq	Queue ML Tasks in Flask
Caching	cachetools, cachew (cache to local sqlite)
	redis-py, pymemcache
Cloud Monitoring	datadog
Configuration Management	config, python-decouple, python-dotenv, dynaconf
CORS	flask-cors	CORS in Flask
Database	flask-sqlalchemy, tinydb, flask-pymongo, odmantic
	tortoise-orm	Asyncio ORM similar to Django
Monitoring	whylogs	Data Logging
	grafana, prometheus	Metric
	sentry, honeybadger	Error Reporting
Data Validation	schema, jsonschema, cerberus, pydantic, marshmallow, validators
Dashboard	streamlit	Generate frontend with python
	gradio	Fast UI generation for prototyping
	dash	React Dashboard using Python
	voila	Convert Jupyter notebooks into dashboard
	streamlit-drawable-canvas	Drawable Canvas for Streamlit
	streamlit-terran-timeline	Show timeline of faces in videos
	streamlit components	Collection of streamlit components
Deployment Checklist	ml-checklist
Documentation	mkdocs, pdoc
Drift Detection	alibi-detect, torchdrift, boxkite	Outlier and drift detection
Edge Deployment	TensorFlow Lite, coreml, TensorFlow.js
Logging	loguru
Model Serving	cortex, torchserve, ray-serve, bentoml, seldon-core	Serving Framework
	flask, fastapi	API Frameworks
Processing	pyspark, hive
Serverless	mangum	Use FastAPI in AWS Lambda
Server-Sent Events	sse-starlette	Server-sent events for FastAPI
Stream Processing	flink, kafka, apache beam
Testing	schemathesis	Automatic test generation from Swagger
	pytest-benchmark	Profile time in pytest
	exdown	Extract code from markdown files
	mktestdocs	Test code present in markdown files

#Python libraries

Category	Tool	Remarks
Async	tomorrow
Audio	simpleaudio	Play audio using python
Automation	pyuserinput, pyautogui, pynput	Control mouse and keyboard
bloom filter	python-bloomfilter
CLI Formatting	rich
Concurrent database	pickleshare
Code to Maths	latexify-py, handcalcs
Create interactive prompts	prompt-toolkit
Collections	bidict	Bidirectional dictionary
	sortedcontainers	Sorted list, set and dict
	munch	Dictionary with dot access
Correlation Metric	xicor
Date and Time	pendulum
Decorators	retrying	Retry a failing function
Debugging	PySnooper
Improved doctest	xdoctest
Linting	pylint, pycodestyle	Code Formatting
	pydocstyle	Check docstring
	safety, bandit, shellcheck	Check vulnerabilities
	mypy	Check types
	black	Automated Formatting
Leaflet maps from python	folium
Multiprocessing	filelock	Lock files during access from multiple process
Path-like interface to remote files	pathy
Pretty print tables in CLI	tabulate
Progress bar	fastprogress, tqdm
Run python libraries in sandbox	pipx
Shell commands as functions	sh
Standard Library Extension	ubelt
Subprocess	delegator.py
Testing	crosshair	Find failure cases for functions
Virtual webcam	pyfakewebcam

#Utilities

Category	Tool	Remarks
Colab	colab-cli	Manage Colab notebooks from the command line
Drive	drive-cli	Use google drive similar to git
Database	mlab	Free 500 MB MongoDB
Data Visualization	flourish-studio
Git	gitjk	Undo what you just did in git
Linux	ripgrep
Trade-off tools	egograph	Find alternatives to anything

URL: https://ib.bsb.br/nlp