- Digital Methods and Tools
- Text Analysis
- Visual Presentation and Analysis
- Spatial Analysis and Web Mapping
- Network Analysis
- Timelines and Temporal Analysis
- Machine Learning
- Database Development
- Data Cleaning
- Research Data Management
- Project Management
- Citation Management
- Digital Collections
- Phase: Data
- Phase: Exploration
- Phase: Modeling
- Phase: Validation
- Phase: Production
#Digital Methods and Tools
#Text Analysis
- Description:
- Example Projects:
- Visualizing Russian (Harvard)
- Hedera (Harvard)
- Concept Search / Open Books (Harvard)
- China Biographical Database Project (Harvard)
- Methods:
- Topic Modeling
- Information Retrieval
- Text Classification
- Sentiment Analysis
- Word Frequency Analysis
- Concordancing
- Named Entity Recognition
- Collocation
- Word Embeddings
- Transformer Models
- Popular Tools:
- Voyant Tools: web-based reading and analysis environment for digital texts
- MALLET: a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text
- WordSeer 4: a text analysis environment that combines visualization, information retrieval, sensemaking and natural language processing to make the contents of text navigable, accessible, and useful
- AntConc: a freeware corpus analysis toolkit for concordancing and text analysis
- Popular Programming Languages and Packages:
- Python:
- NLTK (Natural Language Toolkit): a leading platform for building Python programs to work with human language data
- spaCy: industrial-strength NLP written in Cython for speed
- Gensim: free open-source Python library for representing documents as semantic vectors
- R:
- Resources:
- TAPoR (Text Analysis Portal for Research): a gateway to the tools used in sophisticated text analysis and retrieval
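Two of the methods listed above, word frequency analysis and concordancing, can be sketched with the Python standard library alone. This is a minimal illustration, not how any of the listed tools work internally; the function names and the sample sentence are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text, top_n=5):
    """Return the top_n most common tokens with their counts."""
    return Counter(tokenize(text)).most_common(top_n)


def concordance(text, keyword, width=2):
    """Return keyword-in-context lines: `width` tokens on each side of a hit."""
    tokens = tokenize(text)
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Invented sample text, used only for illustration.
sample = "The whale surfaced. The crew saw the whale and the whale saw the crew."
print(word_frequencies(sample, top_n=2))
print(concordance(sample, "whale"))
```

For real corpora, a dedicated tokenizer (for example from NLTK or spaCy) would likely handle punctuation, hyphenation, and non-English text better than this regular expression.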
#Visual Presentation and Analysis
- Description
- Example Projects:
- Mapping Color in History (Harvard)
- Popular Tools, Platforms, and Standards:
- IIIF, the International Image Interoperability Framework, provides a series of API specifications which various image servers and viewers implement
- IIIF Awesome Resources
- Harvard IIIF Website
- Image servers:
- Loris
- Cantaloupe
- Image viewers:
- Mirador 3
- Universal Viewer
- Annotation servers:
- CatchPy
- Popular Software Packages
#Spatial Analysis and Web Mapping
- Description
- Example Projects:
- Slave Revolt in Jamaica (Harvard)
- Popular Tools and Platforms:
- Esri ArcGIS: powerful desktop GIS software suite for mapping, geoprocessing, cartography, and spatial analysis
- QGIS: FOSS alternative to ArcGIS
- Palladio: a Stanford Humanities + Design Lab online tool for visualizing complex historical data
- CARTO (previously CartoDB): paid Software as a Service platform for spatial analysis and GIS
- Esri ArcGIS Online: cloud-based GIS platform for creating and sharing interactive maps and analyzing spatial data
- Neatline: a suite of add-on tools for Omeka designed to help tell stories with maps, images, and timelines
- Popular Software Packages:
- Leaflet: open-source JavaScript library for interactive web maps
- D3.js: open-source JavaScript library for general data visualization, including web mapping
- Google Maps API: one of the most popular general purpose mapping libraries
- OpenLayers: JavaScript library for displaying maps and analyzing geographical data
- Mapbox GL: SDK for web maps powered by Mapbox, which allows users to design and publish beautiful maps
#Network Analysis
- Description
- Example Projects:
- Visualizing Broadway (Harvard)
- Popular Tools:
- Gephi: a FOSS interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs
- Palladio: a Stanford Humanities + Design Lab online tool for visualizing complex historical data
- Cytoscape: originally designed for biological research, now a general platform for network analysis and visualization
- NodeXL: a network analysis and visualization plugin for Excel
- Popular Software Packages:
- NetworkX (Python): a package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks
- igraph (R and Python): a collection of network analysis tools with an emphasis on efficiency, portability, and ease of use
- visNetwork (R): an R package for network visualization using vis.js
- D3.js (JavaScript): better suited to network visualization and presentation than to network analysis
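As a small illustration of what network analysis packages compute, the sketch below builds an undirected graph from an edge list and calculates normalized degree centrality in plain Python. It does not use the API of NetworkX or igraph; the function names and the collaboration data are hypothetical.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


def degree_centrality(adjacency):
    """Share of the other nodes that each node is directly connected to."""
    n = len(adjacency)
    if n < 2:
        return {node: 0.0 for node in adjacency}
    return {node: len(neighbors) / (n - 1) for node, neighbors in adjacency.items()}


# Invented collaboration pairs, used only for illustration.
edges = [("Ada", "Ben"), ("Ada", "Cy"), ("Ada", "Dee"), ("Ben", "Cy")]
graph = build_graph(edges)
centrality = degree_centrality(graph)

for node, score in sorted(centrality.items(), key=lambda item: -item[1]):
    print(f"{node}: {score:.2f}")
```

The packages listed above provide this and many other measures (betweenness, closeness, community detection) with far better performance on large graphs.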
#Timelines and Temporal Analysis
- Popular Timeline Creation Tools:
- TimelineJS: an open-source tool that enables anyone to build visually rich, interactive timelines using nothing more than a Google Sheet
- Chronos Timeline: designed for needs in humanities and social sciences to represent time-based data
- Neatline: a suite of add-on tools for Omeka designed to tell stories with maps, images, and timelines
#Machine Learning
- Example Projects
- Popular Platforms:
- Amazon SageMaker
- IBM Watson Studio
- Google Cloud AI
- H2O.ai
- KNIME
- Popular Software Packages:
- Python:
- Keras: deep learning and neural networks
- PyTorch: deep learning, computer vision
- Scikit-Learn: data preprocessing, text vectorization, classification, clustering
- TensorFlow: deep learning
- R:
- caret: functions that streamline the process for creating predictive models
- Resources:
#Database Development
- Popular Databases:
- Relational:
- PostgreSQL
- MySQL
- Document / NoSQL:
- MongoDB
- Elasticsearch
- Solr
- Key-value Store:
- Redis
- AWS DynamoDB
- Graph:
- Neo4j
- Popular Database Tools and Database Management Systems (DBMS):
- DbVisualizer
- DataGrip
- Postico
- SQL Server Management Studio
#Data Cleaning
- Popular Software:
- Python:
- Pandas: data analysis and manipulation tool
- NumPy: scientific computing with Python
- Jupyter Notebooks / Jupyter Lab: web-based interactive development environment
- R:
- Tidyverse: an opinionated collection of R packages designed for data science
- Language-agnostic (mostly): Regular Expressions
- Popular Tools:
- Google Sheets
- OpenRefine: a powerful tool for working with messy data
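Regular expressions, listed above as a mostly language-agnostic option, cover many routine cleaning steps. The sketch below uses Python's standard `re` module to normalize whitespace and punctuation and to pull a year out of free text. The helper names and the sample records are hypothetical, and the year pattern (1000 to 2099) is an assumption that would need adjusting for other date ranges.

```python
import re


def clean_value(raw):
    """Collapse whitespace, remove spaces before punctuation, drop trailing punctuation."""
    value = re.sub(r"\s+", " ", raw).strip()
    value = re.sub(r"\s+([.,;:])", r"\1", value)
    return re.sub(r"[.,;:]+$", "", value)


def extract_year(raw):
    """Return the first four-digit year between 1000 and 2099, or None."""
    match = re.search(r"\b(1\d{3}|20\d{2})\b", raw)
    return int(match.group(1)) if match else None


# Invented messy records, used only for illustration.
records = ["  Boston ,  Mass.  ", "London;", "c. 1848 (printed)", "undated"]
for record in records:
    print(repr(clean_value(record)), extract_year(record))
```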
#Research Data Management
- Popular Software Packages:
- Git: distributed version control system
- Popular Tools and Platforms:
#Project Management
- Popular Tools and Platforms:
- Trello
- Jira
- GitHub Projects
- Asana
#Citation Management
- Popular Tools and Platforms:
- Zotero
- EndNote
- Mendeley
#Digital Collections
- Description
- Example Projects:
- Eileen Southern and the Music of Black Americans (Harvard)
- Imperiia (Harvard)
- Popular Platforms and Frameworks:
#Phase: Data
#Data Annotation
| Category | Tool | Remarks |
|---|---|---|
| Audio | audio-annotator, audiono | |
| General | superintendent, pigeon | Annotate in notebooks |
| | labelstudio | Open Source Data Labeling Tool |
| | awesome-data-labeling | |
| Image | makesense.ai, labelimg, via, cvat | |
| Text | doccano, brat | |
| | chatito | Generate text datasets using DSL |
| | prodigy | Paid |
| Inter-rater agreement | disagree | |
| | simpledorff | Krippendorff’s Alpha |
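To give a sense of what an inter-rater agreement score measures, here is a minimal sketch of Cohen's kappa for two annotators. Note that this is a simpler, related statistic and not Krippendorff's alpha as provided by simpledorff; the function name and the labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labeled at random with their own label rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used a single identical label throughout.
    return (observed - expected) / (1 - expected)


# Invented annotations, used only for illustration.
rater_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
rater_2 = ["pos", "neg", "neg", "neg", "pos", "pos"]
kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 3))
```

A value near 1 indicates strong agreement, a value near 0 indicates agreement no better than chance. Krippendorff's alpha generalizes the idea to more raters, missing labels, and other measurement levels.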
#Data Collection
#Importing Data
| Category | Tool | Remarks |
|---|---|---|
| Prebuilt | openml, lineflow | |
| | rs_datasets | Recommendation Datasets |
| | nlp | Python interface to NLP datasets |
| | tensorflow_datasets | Access datasets in TensorFlow |
| | hub | Prebuilt datasets for PyTorch and TensorFlow |
| | pydataset | |
| | ir_datasets | Information Retrieval Datasets |
| App Store | google-play-scraper | |
| Arxiv | pyarxiv | Programmatic access to arxiv.org |
| Audio | pydub | |
| Crawling | MechanicalSoup, libextract | |
| | pyppeteer | Chrome Automation |
| | trafilatura | Extract text sections from HTML |
| | justext | Remove boilerplate from scraped HTML |
| | hext | DSL for extracting data from HTML |
| | ratelimit | API rate limit decorator |
| | backoff | Exponential backoff and jitter |
| | asks | Async version of requests |
| | requests-cache | Cached version of requests |
| | html2text | Convert HTML to markdown-formatted plain text |
| Database | blaze | Pandas and NumPy interface to databases |
| Email | talon | |
| Excel | openpyxl | |
| Google Drive | gdown, pydrive | |
| Google Maps | geo-heatmap | |
| Google Search | googlesearch | Parse Google search results |
| Google Sheets | gspread | |
| Google Ngrams | google-ngram-downloader | |
| HTML | python-readability, html-text | HTML to Text |
| Image | py-image-dataset-generator, idt, jmd-imagescraper | Automatically fetch images from the web for a given search |
| Video | moviepy | Edit Videos |
| | pytube | Download YouTube videos |
| Lyrics | lyricsgenius | |
| Machine Translation Corpus | mtdata | |
| News | news-please, news-catcher | Scrape news |
| | pygooglenews | Google News |
| Network Packet | dpkt, scapy | |
| PDF | camelot, tabula-py, parsr, pdftotext, pdfplumber, pymupdf | |
| | grobid | Parse PDF into structured XML |
| | PyPDF2 | Read and write PDF in Python |
| | pdf2image | Convert PDF to image |
| Remote file | smart_open | |
| Text to Speech | gtts | |
| Twitter | twint, tweepy, twarc | Scrape Twitter |
| Wikipedia | wikipedia, wikitextparser | Access data from Wikipedia |
| | wikitables | Import table from Wikipedia article |
| Wikidata | wikidata | Python API to Wikidata |
| XML | xmltodict | Parse XML as Python dictionary |
| YouTube | scrapetube | Scrape video metadata from channel |
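The "exponential backoff and jitter" pattern mentioned in the Crawling rows can be sketched with the standard library. This is not the API of the `backoff` package; `retry_with_backoff`, its parameters, and `flaky_fetch` are hypothetical names chosen for illustration.

```python
import random
import time


def retry_with_backoff(func, max_tries=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call func(), retrying on any exception with exponentially growing, jittered waits."""
    for attempt in range(max_tries):
        try:
            return func()
        except Exception:
            if attempt == max_tries - 1:
                raise  # Out of attempts: surface the last error.
            # "Full jitter": wait a random time up to the capped exponential delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))


# Hypothetical stand-in for a network request that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# The sleep function is replaced here so the demonstration does not actually wait.
result = retry_with_backoff(flaky_fetch, sleep=lambda seconds: None)
print(result, calls["count"])
```

Randomizing the wait keeps many clients from retrying in lockstep after a shared outage. In a real crawler it would also be sensible to retry only on specific, transient error types rather than on every exception.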
#Data Augmentation
| Category | Tool | Remarks |
|---|---|---|
| Audio | audiomentations, muda | |
| Image | imgaug, albumentations, augmentor, solt | |
| | deepaugment | Automatic augmentation |
| | TextRecognitionDataGenerator, genalog | OCR |
| Tabular data | deltapy | |
| | mockaroo | Generate synthetic user details |
| Text | nlpaug, noisemix, textattack, textaugment, niacin, SeaQuBe, DataAug4NLP, NL-Augmenter | |
| | fastent | Expand NER entity list |
#Phase: Exploration
#Data Preparation
| Category | Tool | Remarks |
|---|---|---|
| Class Imbalance | imblearn | |
| Categorical encoding | category_encoders | |
| | dirty_cat | Encode categories with typos |
| Dataframe | cudf | Pandas on GPU |
| Data Validation | pandera, pandas-profiling | Pandas |
| Data Cleaning | pyjanitor | Janitor ported to Python |
| Graph Sampling | little ball of fur | |
| Missing values | missingno | |
| Parallelize | pandarallel, swifter, modin | Parallelize pandas |
| | vaex | Pandas on huge data |
| | numba | Parallelize NumPy |
| Parsing | pyparsing, parse | |
| Split images into train/validation/test | split-folders | |
| Submodular Optimization | twinning, apricot | |
| Weak Supervision | snorkel | |
#Data Exploration
| Category | Tool | Remarks |
|---|---|---|
| Explore Data | sweetviz, dataprep, quickda, vizidata | Generate quick visualizations of data |
| | ipyplot | Plot images |
| Notebook Tools | nbdime | View Jupyter notebooks through CLI |
| | papermill | Parametrize notebooks |
| | nbformat | Access notebooks programmatically |
| | nbconvert | Convert notebooks to other formats |
| | ipyleaflet | Maps in notebooks |
| | ipycanvas | Draw diagrams in notebook |
| | fastdoc | Convert notebook to PDF book |
| Relationship | ppscore | Predictive Power Score |
| | pdpbox | Partial Dependence Plot |
#Feature Generation
| Category | Tool | Remarks |
|---|---|---|
| Automatic feature engineering | featuretools, autopandas | |
| | tsfresh | Automatic feature engineering for time series |
| DAG based dataset generation | DFFML | |
| Dimensionality reduction | fbpca, fitsne, trimap | |
| Metric learning | metric-learn, pytorch-metric-learning | |
| Time series | python-holidays | List of holidays |
| | skits | Transformation for time-series data |
| | catch22 | Pre-built features for time-series data |
#Phase: Modeling
#Model Selection
| Category | Tool | Remarks |
|---|---|---|
| Project Structure | cookiecutter-data-science | |
| Find SOTA models | sotawhat, papers-with-code, codalab, nlpprogress, evalai, collectiveknowledge, sotabench | Benchmarks |
| | bert-related-papers | BERT Papers |
| | survey-papers | Collection of survey papers |
| Pretrained models | modeldepot, pytorch-hub | General |
| | pretrained-models.pytorch, pytorchcv | Pre-trained ConvNets |
| | pytorch-image-models | 200+ pretrained ConvNet backbones |
| | huggingface-models, huggingface-pretrained | Transformer Models |
| | awesome-models | Pretrained CoreML models |
| | huggingface-languages | Multi-lingual Models |
| | model-forge, The Super Duper NLP Repo | Pre-trained NLP models by use case |
| AutoML | auto-sklearn, mljar-supervised, automl-gs, pycaret, evalml | |
| | lazypredict | Run all sklearn models at once |
| | tpot | Genetic AutoML |
| | autocat | Auto-generate text classification models in spaCy |
| | mindsdb, ludwig | Autogenerate ML code |
| Active Learning | modal | |
| Anomaly detection | adtk | |
| Contrastive Learning | contrastive-learner | |
| Deep Clustering | deep-clustering-toolbox | |
| Few Shot Learning | keras-fewshotlearning | |
| Fuzzy Learning | fylearn, scikit-fuzzy | |
| Genetic Programming | gplearn | |
| Gradient Boosting | catboost, xgboost, ngboost | |
| | lightgbm, thundergbm | GPU Capable |
| Graph Neural Networks | spektral | GNN for Keras |
| Graph Embedding and Community Detection | karateclub, python-louvain, communities | |
| Hidden Markov Models | hmmlearn | |
| Interpretable Models | imodels | Models that show rules |
| Multi-view Learning | mvlearn | |
| Noisy Label Learning | cleanlab | |
| Optimization | nevergrad | Gradient Free Optimization |
| | cvxpy | Convex Optimization |
| Optimal Transport | pot, geomloss | |
| Probabilistic modeling | pomegranate, pymc3 | |
| Rule based classifier | sklearn-expertsys | |
| Self-Supervised Learning | lightly, vissl, solo-learn | Implementations of SSL models |
| | self_supervised | Self-supervised models in Fast.AI |
| Spiking Neural Network | norse | |
| Support Vector Machines | thundersvm | Run SVM on GPU |
| Survival Analysis | lifelines | |
#Frameworks
| Category | Tool | Remarks |
|---|---|---|
| Addons | mlxtend | Extra utilities not present in frameworks |
| | tensor-sensor | Visualize tensors |
| PyTorch | pytorch-summary | Keras-like summary |
| | torchtyping, tsalib | Type annotation for tensors |
| | einops | Einstein Notation |
| | kornia | Computer Vision Methods |
| | nonechucks | Drop corrupt data automatically in DataLoader |
| | pytorch-optimizer | Collection of optimizers |
| | pytorch-block-sparse | Sparse matrix replacement for nn.Linear |
| | pytorch-forecasting | Time series forecasting in PyTorch Lightning |
| | pytorch-lightning | Lightweight wrapper for PyTorch |
| | skorch | Wrap PyTorch in scikit-learn compatible API |
| | torchcontrib | SOTA Building Blocks in PyTorch |
| | bitsandbytes | 8-bit optimizers for PyTorch |
| Scikit-learn | scikit-lego, iterative-stratification | |
| | iterstrat | Cross-validation for multi-label data |
| | scikit-multilearn | Multi-label classification |
| | tscv | Time-series cross-validation |
| Sparsification | sparseml | Apply sparsification to any framework |
| TensorFlow | tensorflow-addons | |
| | keras-radam | RADAM optimizer |
| | ktrain | FastAI like interface for Keras |
| | larq | Binarized neural networks |
| | scikeras | Scikit-learn Wrapper for Keras |
| | tavolo | Kaggle Tricks as Keras Layers |
| | tensorflow-text | Addons for NLP |
| | tensorflow-wheels | Optimized wheels for TensorFlow |
| | tf-sha-rnn | |
#Natural Language Processing
| Category | Tool | Remarks |
|---|---|---|
| Libraries | spacy, nltk, corenlp, deeppavlov, kashgari, transformers, ernie, stanza, nlp-architect, spark-nlp, pytext, FARM | |
| | headliner, txt2txt | Sequence to sequence models |
| | Nvidia NeMo | Toolkit for ASR, NLP and TTS |
| | nlu | 1-line models for NLP |
| | pyconverse | Conversational Text Analysis |
| | booknlp | NLP for Books |
| | fast-bert, simpletransformers | Wrappers |
| | finetune | Scikit-learn like API for transformers |
| | compromise | JavaScript NLP |
| CPU-optimizations | turbo_transformers, onnx_transformers | |
| | fastT5 | Generate optimized T5 model |
| Preprocessing | textacy, texthero, textpipe, nlpretext | |
| | JamSpell, pyhunspell, pyspellchecker, cython_hunspell, hunspell-dictionaries, autocorrect (can add more languages), symspellpy, spello (train your own spelling correction), contextualSpellCheck, neuspell, nlprule, spylls | Spelling Correction |
| | gramformer | Grammar Checker |
| | language-tool-python, gingerit | Grammatical Error Correction |
| | ekphrasis | Pre-processing for social media texts |
| | editop | Compute edit-operations for text normalization |
| | contractions, pycontractions | Contraction Mapping |
| | truecase | Fix casing |
| | nnsplit, deepsegment, sentence-doctor, pysbd, sentence-splitter | Sentence Segmentation |
| | wordninja | Probabilistic Word Segmentation |
| | punctuator2 | Punctuation Restoration |
| | stopwords-iso | Stopwords for all languages |
| | language-check, langdetect, polyglot, pycld2, cld2, cld3, langid, lumi_language_id | Language Identification |
| | langcodes | Get language from language code |
| | neuralcoref | Coreference Resolution |
| | inflect, lemminflect, pyinflect | Inflections |
| | scrubadub | PII removal |
| | ftfy, clean-text, text-unidecode | Fix Unicode Issues |
| | fastpunct | Punctuation Restoration |
| | pyphen | Hyphenate words into syllables |
| | pypostal, mordecai, usaddress, libpostal | Parse Street Addresses |
| | geopy, geocoder, nominatim, pelias, photon, lieu | Geocoding |
| | probablepeople, python-nameparser | Parse person name |
| | python-phonenumbers | Parse phone numbers |
| | numerizer, word2number | Parse natural language number |
| | dateparser | Parse natural dates |
| | ctparse | Parse natural language time |
| | daterangeparser | Parse date ranges in natural language |
| | emoji | Handle emoji |
| | pyarabic | multilingual |
| Tokenization | sentencepiece, youtokentome, subword-nmt | |
| | sacremoses | Rule-based |
| | jieba, pkuseg | Chinese Word Segmentation |
| | kytea | Japanese word segmentation |
| Clustering | kmodes, star-clustering, genieclust | |
| | spherecluster | K-means with cosine distance |
| | sib | Sequential Information Bottleneck |
| | kneed | Automatically find number of clusters from elbow curve |
| | OptimalCluster | Automatically find optimal number of clusters |
| | gsdmm | Short-text clustering |
| Code Switching | codeswitch | |
| Constituency Parsing | benepar, allennlp, chunk-english-fast | |
| Compact Models | mobilebert, distilbert, tinybert, BERT-of-Theseus-MNLI, MiniLM | |
| Cross-lingual Embeddings | muse, laserembeddings, xlm, LaBSE | |
| | transvec, vecmap | Train mapping between monolingual embeddings |
| | MuRIL | Embeddings for 17 Indic languages with transliteration |
| | BPEmb | Subword Embeddings in 275 Languages |
| | piecelearn | Train own sub-word embeddings |
| Dictionary | vocabulary | |
| Domain-specific | codebert | Code |
| | clinicalbert-mimicnotes, clinicalbert-discharge-summary | Clinical Domain |
| | twitter-roberta-base | |
| | scispacy | bio-medical data |
| | blackstone | Legal text |
| Entity Linking | dbpedia-spotlight, GENRE | |
| Entity Matching | py_entitymatching, deepmatcher | |
| Embeddings | InferSent, embedding-as-service, bert-as-service, sent2vec, sense2vec, glove-python, fse | |
| | counterix | Train custom Count-based DSM |
| | embeddix | Convert word vectors format |
| | wiki2vec | Word2Vec trained on DBPedia Entities |
| | chars2vec | Character-embeddings for handling typos and slang |
| | rank_bm25, BM25Transformer | BM25 |
| | sentence-transformers, DeCLUTR | BERT sentence embeddings |
| | conceptnet-numberbatch | Word embeddings trained with common-sense knowledge graph |
| | word2vec-twitter | Word2vec trained on Twitter |
| | pymagnitude | Access word-embeddings programmatically |
| | chakin | Download pre-trained word vectors |
| | zeugma | Pretrained-word embeddings as scikit-learn transformers |
| | starspace | Learn embeddings for anything |
| | svd2vec | Learn embeddings from co-occurrence |
| | all-but-the-top | Post-processing for word vectors |
| | entity-embed | Train custom embeddings for named entities |
| Emotion Classification | goemotion-pytorch, text2emotion | |
| | emosent-py | Sentiment scores for Emojis |
| Feature Generation | homer, textstat | Readability scores |
| | LexicalRichness | Lexical Richness Measure |
| Fill mask | fitbert | |
| Finite State Transducer | OpenFST | |
| Gibberish Detection | nostril, gibberish-detector | |
| Grammar Induction | gitta, grasp | Generate CFG from sentences |
| Information Extraction | claucy | |
| | GiveMe5W1H | Extract 5-why 1-how phrases from news |
| | spikex | spaCy pipeline for knowledge extraction |
| Keyword extraction | rake, multi-rake, pke, phrasemachine, keybert, word2phrase | |
| | pyate | Automated Term Extraction |
| Knowledge | conceptnet-lite | |
| | stanford-openie | Knowledge Graphs |
| | verbnet-parser | VerbNet parser |
| Knowledge Distillation | textbrewer, aquvitae | |
| Language Model Scoring | lm-scorer, bertscore, kenlm, spacy_kenlm, mlm-scoring | |
| Lexical Simplification | easee | Evaluation metric |
| Metrics | seqeval | NER, POS tagging |
| | ranking-metrics, cute_ranking | Metrics for Information Retrieval |
| | mir_eval | Music Information Retrieval Metrics |
| Morphology | unimorph | Morphology data for many languages |
| Multilingual support | polyglot, trankit | |
| | inltk, indic_nlp | Indic Languages |
| | cltk | NLP for Latin and classical languages |
| | langrank | Auto-select optimal transfer language |
| Named Entity Recognition (NER) | spaCy, Stanford NER, sklearn-crfsuite | |
| | med7 | spaCy NER for medical records |
| Nearest neighbor | faiss, sparse_dot_topn, n2, autofaiss | |
| NLU | snips-nlu | |
| | ParlAI | Dialogue System |
| Paraphrasing | parrot | |
| | pegasus | Question Paraphrasing |
| | paraphrase_diversity_ranker | Rank paraphrases of sentence |
| | sentaugment | Paraphrase mining |
| Phonetics | epitran | Transliterate text into IPA |
| | allosaurus | Recognize phones for 2000 languages |
| Phonology | panphon | Generate phonological feature representations |
| | phoible | Database of segment inventories for 2186 languages |
| Probabilistic parsing | parserator | Create domain-specific parser for address, name etc. |
| Profanity detection | profanity-check | |
| Pronunciation | pronouncing | |
| Question Answering | haystack | Build end-to-end QA system |
| | mcQA | Multiple Choice Question Answering |
| | TAPAS | Table Question Answering |
| Question Generation | question-generation, questiongen.ai | Question Generation Pipeline for Transformers |
| Ranking | transformer-rankers | |
| Relation Extraction | OpenNRE | |
| Search | elasticsearch-dsl | Wrapper for Elasticsearch |
| | jina | production-level neural semantic search |
| | meilisearch-python | |
| Semantic parsing | quepy | |
| Sentiment | vaderSentiment, afinn | Rule based |
| | absa | Aspect Based Sentiment Analysis |
| | xlm-t | Models |
| spaCy Extensions | spacy-pattern-builder | Generate dependency matcher patterns automatically |
| | spacy_grammar | Rule-based grammar error detection |
| | role-pattern-builder | Pattern based SRL |
| | textpipeliner | Extract RDF triples |
| | tenseflow | Convert tense of sentence |
| | camphr | Wrapper to transformers, elmo, udify |
| | spleno | Domain-specific lemmatization |
| | spacy-udpipe | Use UDPipe from spaCy |
| | spacymoji | Add emoji metadata to spaCy docs |
| String match | phrase-seeker, textsearch | |
| | jellyfish, fuzzy, doublemetaphone | Perform string and phonetic comparison |
| | clavier | Edit distance based on keyboard layout |
| | flashtext | Super-fast extract and replace keywords |
| | pythonverbalexpressions | Verbally describe regex |
| | commonregex | Ready-made regex for email/phone etc. |
| | textdistance, editdistance, word-mover-distance, edlib | Text distances |
| | wmd-relax | Word mover distance for spaCy |
| | fuzzywuzzy, spaczz, PolyFuzz, rapidfuzz, fuzzymatcher | Fuzzy Search |
| | deduplipy, dedupe | Active-Learning based fuzzy matching |
| | recordlinkage | Record Linkage |
| Summarization | textrank, pytldr, bert-extractive-summarizer, sumy, fast-pagerank, sumeval | |
| | doc2query | Summarize document with queries |
| | summarizers | Controllable summarization |
| | insight_extractor | Extract insightful sentences from docs |
| Text Extraction | textract (Image, Audio, PDF) | |
| Text Generation | gpt2client, textgenrnn, gpt-2-simple, aitextgen | GPT-2 |
| | markovify | Markov chains |
| | accelerated-text | Template-based generation |
| | keytotext | Keyword to Sentence Generation |
| Transliteration | wiktra | |
| Machine Translation | MarianMT, Opus-MT, joeynmt, OpenNMT, EasyNMT, argos-translate, dl-translate | |
| | googletrans, word2word, translate-python, deep_translator | Translation libraries |
| | mosesdecoder | Statistical MT |
| | apertium | RBMT |
| | translators | Free calls to multiple translation APIs |
| | giza++, fastalign, simalign, eflomal, awesome-align | Word Alignment |
| Thesaurus | python-datamuse | |
| Toxicity Detection | detoxify | |
| Topic Modeling | gensim, guidedlda, enstop, top2vec, contextualized-topic-models, corex_topic, lda2vec, bertopic, tomotopy, ToModAPI | |
| | zeroshot_topics | Zero-shot topic modeling |
| | octis | Evaluate topic models |
| Typology | lang2vec | Compare typological features of languages |
| Visualization | stylecloud | Word Clouds |
| | scattertext | Compare word usage across segments |
| | picture-text | Interactive tree-maps for hierarchical clustering |
| | ipymarkup | Visualize NER and syntax |
| Verb Conjugation | nodebox_linguistics_extended, mlconj3 | |
| Word Sense Disambiguation | pywsd, ewiser, supwsd | |
| | frame-english-fast | Verb Disambiguation |
| Zero Shot Learning | setfit | |
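Several of the string-matching tools in the table above are built around edit distance. The sketch below implements the classic Levenshtein distance with dynamic programming and uses it for a naive closest-match lookup. It is not the implementation used by any listed library; the function names and the place names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # Keep the shorter string in the inner loop to save memory.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion from a
                current[j - 1] + 1,      # insertion into a
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def closest_match(query, candidates):
    """Return the candidate with the smallest edit distance to the query."""
    return min(candidates, key=lambda candidate: levenshtein(query, candidate))


# Invented misspelling and authority list, used only for illustration.
places = ["Cambridge", "Canterbury", "Cambrai"]
print(levenshtein("kitten", "sitting"))
print(closest_match("Cambrige", places))
```

Dedicated libraries such as those listed are typically much faster and offer normalized scores, which matter when matching thousands of records.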
#Computer Vision
| Category | Tool | Remarks |
|---|---|---|
| Face recognition | face_recognition, mtcnn, insightface, face-detection | |
| | face-alignment | Find facial landmarks |
| | Facial-Expression-Recognition.Pytorch | Face Emotion |
| Face swapping | faceit, faceit-live, avatarify | |
| GANs | mimicry, imaginaire, pytorch-lightning-gans | |
| High-level libraries | terran | Face detection, recognition, pose estimation |
| Image Hashing | ImageHash, imagededup | |
| Image Inpainting | GAN Image Inpainting | |
| Image Processing | scikit-image, imutils, opencv-wrapper, opencv-python | |
| | torchio | Medical Images |
| Object detection | luminoth, detectron2, mmdetection, icevision | |
| OCR | keras-ocr, pytesseract, keras-craft, ocropy, doc2text | |
| | easyocr, kraken, PaddleOCR | Multilingual OCR |
| | layout-parser, pdftabextract | OCR tables from document |
| Segmentation | segmentation_models | Keras |
| | segmentation_models.pytorch | Segmentation models in PyTorch |
| Semantic Search | scoper | Video |
| Video summarization | videodigest | |
#Speech
| Category | Tool | Remarks |
|---|---|---|
| Diarization | resemblyzer | |
| Feature Engineering | python_speech_features | Convert raw audio to features |
| Libraries | speechbrain, pyannotate, librosa, espnet | |
| | silero-models | Pre-trained models |
| Source Separation | spleeter, nussl, open-unmix-pytorch, asteroid | |
| Speech Recognition | kaldi, speech_recognition, delta, pocketsphinx-python, deepspeech, stt, vosk | |
| Speech Synthesis | festvox, cmuflite, tts | |
#Recommendation System
| Category | Tool | Remarks |
|---|---|---|
| Apriori algorithm | apyori | |
| Collaborative Filtering | implicit | |
| Libraries | xlearn, DeepCTR, RankFM | Factorization machines (FM) and field-aware factorization machines (FFM) |
| | libmf-python | Matrix Factorization |
| | lightfm, spotlight | Popular Recsys algos |
| | tensorflow_recommenders | Recommendation System in TensorFlow |
| Metrics | rs_metrics | |
| Recommendation System in PyTorch | CaseRecommender | |
| Scikit-learn like API | surprise | |
#Timeseries
| Category | Tool | Remarks |
|---|---|---|
| Libraries | prophet, tslearn, pyts, seglearn, cesium, stumpy, darts, gluon-ts, stldecompose | |
| | sktime | Scikit-learn like API |
| | atspy | Automated time-series models |
| Anomaly Detection | orion, luminaire | Unsupervised time-series anomaly detection |
| ARIMA models | pmdarima | |
#Hyperparameter Optimization
| Category | Tool | Remarks |
|---|---|---|
| General | hyperopt, optuna, evol, talos | |
| Keras | keras-tuner | |
| Parameter optimization | ParameterImportance | |
| Scikit-learn | hyperopt-sklearn, scikit-optimize | Bayesian Optimization |
| | sklearn-deap, sklearn-generic-opt | Evolutionary algorithm |
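The tools above automate the search for good hyperparameters, usually with smarter strategies such as Bayesian or evolutionary optimization. As a baseline for comparison, plain random search can be sketched in a few lines. This is not the API of any listed library; `random_search`, `toy_loss`, and the search space are hypothetical, and the toy loss stands in for a real validation score.

```python
import random


def random_search(objective, space, n_trials=50, seed=0):
    """Sample parameters uniformly from `space` and keep the lowest-scoring set."""
    rng = random.Random(seed)  # Seeded so repeated runs give the same result.
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_loss(params):
    """Hypothetical loss with its minimum at learning_rate=0.1, dropout=0.3."""
    return (params["learning_rate"] - 0.1) ** 2 + (params["dropout"] - 0.3) ** 2


# Each entry maps a parameter name to an assumed (low, high) range.
space = {"learning_rate": (0.0, 1.0), "dropout": (0.0, 0.5)}
best_params, best_score = random_search(toy_loss, space, n_trials=200)
print(best_params, best_score)
```

In practice the objective would train and validate a model, so each trial is expensive; that cost is the main reason to prefer the guided search methods in the table.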
#Phase: Validation
#Experiment Monitoring
| Category | Tool | Remarks |
|---|---|---|
| Experiment tracking | tensorboard, mlflow | |
| | lrcurve, livelossplot | Plot realtime learning curve in Keras |
| GPU Usage | gpumonitor, nvtop | |
| | jupyterlab-nvdashboard | See GPU usage in JupyterLab |
| MLOps | clearml, wandb, neptune.ai, replicate.ai | |
| Notification | knockknock | Get notified by Slack/email |
| | jupyter-notify | Notify when task is completed in Jupyter |
| | apprise | Notify to any platform |
| | pynotifier | Generate desktop notification |
#Interpretability
| Category | Tool | Remarks |
|---|---|---|
| Adversarial Attack | cleverhans | General |
| | foolbox | Image |
| | triggers | NLP |
| Interpret models | eli5, lime, shap, alibi, tf-explain, treeinterpreter, pybreakdown, xai, lofo-importance, interpretML, shapash | |
| | exbert | Interpret BERT |
| | bertviz | Explore self-attention in BERT |
| NLP | word2viz, whatlies | word-vectors |
| | Language Interpretability Tool, transformers-interpret | |
#Visualization
| Category | Tool | Remarks |
|---|---|---|
| Diagrams | dl-visuals, ml-visuals | |
| | chalk | Declarative drawing API |
| Libraries | matplotlib, seaborn, pygal, plotly, plotnine | |
| | yellowbrick, scikit-plot | Visualization for scikit-learn |
| | pyldavis | Visualize topic models |
| | dtreeviz | Visualize decision tree |
| | txtmarker | Highlight text in PDF |
| | metriculous | Visualize model performance |
| Animated charts | bar_chart_race | Bar chart race animation |
| | pandas_alive | Animated charts in pandas |
| High dimensional visualization | umap | |
| | ivis | Ivis Algorithm |
| Interactive charts | bokeh | |
| | flourish-studio | Create interactive charts online |
| | mpld3 | Matplotlib to D3 Converter |
| Model Visualization | netron, nn-svg | Architecture |
| | keract | Activation maps for Keras |
| | keras-vis | Visualize Keras models |
| | PlotNeuralNet | LaTeX code for drawing neural networks |
| | loss-landscape-anim | Generate loss landscape of optimizer |
| Styling | open-color | Color Schemes |
| | mplcyberpunk | Cyberpunk style for matplotlib |
| | chart.xkcd | XKCD like charts |
| | adjustText | Prevent overlap when plotting point text label |
| Generate graphs using markdown | mermaid | |
| Tree-map chart | squarify | |
| 3D charts | babyplots | |
#Phase: Production
#Model Export
| Category | Tool | Remarks |
|---|---|---|
| Benchmarking | torchprof | Profile PyTorch layers |
| | scalene, pyinstrument | Profile Python code |
| | k6 | Load test API |
| | ai-benchmark | Benchmark VM on 19 different models |
| Cloud Storage | Zenodo, GitHub Releases, OneDrive, Google Drive, Dropbox, S3, mega, DAGsHub, huggingface-hub | |
| Data Pipeline | pypeln | |
| Dependencies | pip-chill | pip freeze without dependencies |
| | pipreqs | Generate requirements.txt based on imports |
| | conda-pack | Export conda for offline use |
| Distributed training | horovod | |
| Model Store | modelstore | |
| Optimization | nn_pruning | Movement Pruning |
| | aimet, tensorflow-lite | Quantization |
| Serialization | sklearn-porter, m2cgen | Transpile sklearn model to C, Java, JavaScript and others |
| | onnxmltools | Classic ML models to ONNX format |
| | hummingbird | Convert ML models to PyTorch |
| | cloudpickle, jsonpickle | Pickle extensions |
#Inference
| Category | Tool | Remarks |
|---|---|---|
| Authentication | pyjwt (JWT), auth0, okta, cognito | |
| Batch Jobs | airflow, luigi, dagster, oozie, prefect, kubernetes-cron-jobs, argo | |
| | rq, schedule, huey | Task Queue |
| | mlq | Queue ML Tasks in Flask |
| Caching | cachetools, cachew (cache to local sqlite) | |
| | redis-py, pymemcache | |
| Cloud Monitoring | datadog | |
| Configuration Management | config, python-decouple, python-dotenv, dynaconf | |
| CORS | flask-cors | CORS in Flask |
| Database | flask-sqlalchemy, tinydb, flask-pymongo, odmantic | |
| | tortoise-orm | Asyncio ORM similar to Django |
| Monitoring | whylogs | Data Logging |
| | grafana, prometheus | Metric |
| | sentry, honeybadger | Error Reporting |
| Data Validation | schema, jsonschema, cerberus, pydantic, marshmallow, validators | |
| Dashboard | streamlit | Generate frontend with Python |
| | gradio | Fast UI generation for prototyping |
| | dash | React Dashboard using Python |
| | voila | Convert Jupyter notebooks into dashboard |
| | streamlit-drawable-canvas | Drawable Canvas for Streamlit |
| | streamlit-terran-timeline | Show timeline of faces in videos |
| | streamlit components | Collection of Streamlit components |
| Deployment Checklist | ml-checklist | |
| Documentation | mkdocs, pdoc | |
| Drift Detection | alibi-detect, torchdrift, boxkite | Outlier and drift detection |
| Edge Deployment | TensorFlow Lite, coreml, TensorFlow.js | |
| Logging | loguru | |
| Model Serving | cortex, torchserve, ray-serve, bentoml, seldon-core | Serving Framework |
| | flask, fastapi | API Frameworks |
| Processing | pyspark, hive | |
| Serverless | mangum | Use FastAPI in Lambda |
| Server-Side Events | sse-starlette | Server-side events for FastAPI |
| Stream Processing | flink, kafka, apache beam | |
| Testing | schemathesis | Automatic test generation from Swagger |
| | pytest-benchmark | Profile time in pytest |
| | exdown | Extract code from markdown files |
| | mktestdocs | Test code present in markdown files |
#Python libraries
| Category | Tool | Remarks |
|---|---|---|
| Async | tomorrow | |
| Audio | simpleaudio | Play audio using Python |
| Automation | pyuserinput, pyautogui, pynput | Control mouse and keyboard |
| Bloom filter | python-bloomfilter | |
| CLI Formatting | rich | |
| Concurrent database | pickleshare | |
| Code to Maths | latexify-py, handcalcs | |
| Create interactive prompts | prompt-toolkit | |
| Collections | bidict | Bidirectional dictionary |
| | sortedcontainers | Sorted list, set and dict |
| | munch | Dictionary with dot access |
| Correlation Metric | xicor | |
| Date and Time | pendulum | |
| Decorators | retrying (retry some function) | |
| Debugging | PySnooper | |
| Improved doctest | xdoctest | |
| Linting | pylint, pycodestyle | Code Formatting |
| | pydocstyle | Check docstring |
| | safety, bandit, shellcheck | Check vulnerabilities |
| | mypy | Check types |
| | black | Automated Formatting |
| Leaflet maps from Python | folium | |
| Multiprocessing | filelock | Lock files during access from multiple processes |
| Path-like interface to remote files | pathy | |
| Pretty print tables in CLI | tabulate | |
| Progress bar | fastprogress, tqdm | |
| Run Python libraries in sandbox | pipx | |
| Shell commands as functions | sh | |
| Standard Library Extension | ubelt | |
| Subprocess | delegator.py | |
| Testing | crosshair (find failure cases for functions) | |
| Virtual webcam | pyfakewebcam | |
#Utilities
| Category | Tool | Remarks |
|---|---|---|
| Colab | colab-cli | Manage Colab notebooks from the command line |
| Drive | drive-cli | Use Google Drive similar to git |
| Database | mlab | Free 500 MB MongoDB |
| Data Visualization | flourish-studio | |
| Git | gitjk | Undo what you just did in git |
| Linux | ripgrep | |
| Trade-off tools | egograph | Find alternatives to anything |