Implements traditional NLP techniques before transformers. Covers text vectorization (TF-IDF, Bag-of-Words), word embeddings (Word2Vec, FastText, GloVe, Doc2Vec), topic modeling (LDA, Gensim), and text similarity (Jaccard, Cosine, FuzzyWuzzy, record linkage). Use when building text classifiers without deep learning, doing topic extraction, entity matching, or when user mentions 'TF-IDF', 'Word2Vec', 'topic modeling', 'LDA', 'text similarity', 'n-grams', 'document clustering', 'GloVe', 'Doc2Vec', 'FuzzyWuzzy', or 'record linkage'.
Install
Install via the SkillsCat registry:
npx skillscat add levy-n/claude-useful-skills/nlp-classical
NLP Classical - Traditional Text Processing
Classical text processing: vectorization, embeddings, and topic modeling.
Quick Start - TF-IDF Classification
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
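# Train/test split on the raw texts, so the vectorizer never sees test data
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
# Vectorize text: fit on the training split only, then reuse the fitted vocabulary
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(texts_train)
X_test = vectorizer.transform(texts_test)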
# Train classifier
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Evaluate
print(f"Accuracy: {model.score(X_test, y_test):.3f}")
Quick Start - Word2Vec Similarity
from gensim.models import Word2Vec
# Train Word2Vec
sentences = [text.split() for text in texts] # Tokenized sentences
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, workers=4)
# Find similar words
similar = model.wv.most_similar('python', topn=5)
print(similar)
# Word vector
vector = model.wv['python']
When This Skill Activates
Use this skill when:
- Building text classifiers without deep learning
- Creating document vectors for similarity search
- Extracting topics from a corpus
- Matching records across datasets (entity resolution)
- Working with limited compute resources
- Needing interpretable text features
Core Patterns
Pattern 1: Text Vectorization Methods
| Method | When to Use | Captures |
|---|---|---|
| Binary | Document presence | Has word? |
| Bag-of-Words | Word counts matter | Frequency |
| TF-IDF | Most cases | Importance |
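To make the first two rows of the table concrete, here is a minimal dependency-free sketch. The helper names `bag_of_words` and `binary_vector`, the fixed vocabulary, and the sample sentence are illustrative assumptions. The scikit-learn vectorizers used in this pattern learn the vocabulary from the corpus instead.

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary term occurs in the token list."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def binary_vector(tokens, vocabulary):
    """Record only whether each vocabulary term is present (1) or absent (0)."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

# Hypothetical toy input; a real pipeline would learn the vocabulary from a corpus.
vocabulary = ["cat", "dog", "sat"]
tokens = "the cat sat near the other cat".split()

print(bag_of_words(tokens, vocabulary))   # [2, 0, 1]
print(binary_vector(tokens, vocabulary))  # [1, 0, 1]
```

The two vectors differ only where a term repeats. This is why the binary form suits presence checks and the count form suits cases where frequency matters.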
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfVectorizer
)
# Binary (presence/absence)
binary = CountVectorizer(binary=True)
X_binary = binary.fit_transform(texts)
# Bag-of-Words (word counts)
bow = CountVectorizer()
X_bow = bow.fit_transform(texts)
# TF-IDF (term importance)
tfidf = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),  # Unigrams and bigrams
    min_df=2,            # Ignore rare terms
    max_df=0.95          # Ignore too common terms
)
X_tfidf = tfidf.fit_transform(texts)
Pattern 2: Word2Vec Training
from gensim.models import Word2Vec
# Prepare data (list of tokenized sentences)
sentences = [
    ['machine', 'learning', 'is', 'fun'],
    ['deep', 'learning', 'uses', 'neural', 'networks'],
    # ...
]
# Train Word2Vec
model = Word2Vec(
    sentences,
    vector_size=100,  # Embedding dimension
    window=5,         # Context window size
    min_count=2,      # Ignore words appearing < 2 times
    workers=4,        # Parallel threads
    sg=1              # 1 = Skip-gram, 0 = CBOW
)
# Save/Load
model.save('word2vec.model')
model = Word2Vec.load('word2vec.model')
# Get word vector
vector = model.wv['machine']
# Similar words
model.wv.most_similar('machine', topn=10)
# Word arithmetic (only works if the training corpus contains all three words)
# king - man + woman ≈ queen
model.wv.most_similar(positive=['king', 'woman'], negative=['man'])
Pattern 3: GloVe with Pickle Caching
import pickle
import numpy as np
def load_glove(glove_path, cache_path='glove_cache.pkl'):
    """Load GloVe with caching for faster subsequent loads."""
    try:
        # Try loading from cache
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        # Load from text file
        embeddings = {}
        with open(glove_path, 'r', encoding='utf-8') as f:
            for line in f:
                values = line.split()
                word = values[0]
                vector = np.array(values[1:], dtype='float32')
                embeddings[word] = vector
        # Cache for next time
        with open(cache_path, 'wb') as f:
            pickle.dump(embeddings, f)
        return embeddings
glove = load_glove('glove.6B.100d.txt')
print(f"Loaded {len(glove)} word vectors")
Pattern 4: Doc2Vec for Document Embeddings
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# Prepare tagged documents
documents = [
    TaggedDocument(words=text.split(), tags=[str(i)])
    for i, text in enumerate(texts)
]
# Train Doc2Vec
model = Doc2Vec(
    documents,
    vector_size=100,
    window=5,
    min_count=2,
    workers=4,
    dm=1,  # 1 = DM (paragraph vector), 0 = DBOW
    epochs=20
)
# Get document vector
doc_vector = model.dv['0'] # By tag
# Infer vector for new document
new_vector = model.infer_vector(['new', 'document', 'text'])
# Find similar documents
similar_docs = model.dv.most_similar('0', topn=5)
Pattern 5: LDA Topic Modeling
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
# Vectorize (use counts for LDA, not TF-IDF)
vectorizer = CountVectorizer(max_features=5000, max_df=0.95, min_df=2)
X = vectorizer.fit_transform(texts)
# Fit LDA
lda = LatentDirichletAllocation(
    n_components=10,  # Number of topics
    random_state=42,
    max_iter=20
)
lda.fit(X)
# Print top 10 words per topic (argsort is ascending, so reverse before slicing)
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words = [feature_names[i] for i in topic.argsort()[::-1][:10]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")
# Transform documents to topic distribution
doc_topics = lda.transform(X)
Pattern 6: Text Similarity
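Besides the cosine approach in this pattern, the skill description also lists Jaccard similarity. Here is a minimal dependency-free sketch, assuming lowercase whitespace tokenization. The `jaccard_similarity` helper is illustrative and not part of any library, and scoring two empty texts as 1.0 is a convention chosen here.

```python
def jaccard_similarity(text_a, text_b):
    """Jaccard similarity of the lowercase whitespace-token sets of two texts."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # Assumed convention: two empty texts count as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# 3 shared tokens out of 5 distinct tokens overall
print(jaccard_similarity("machine learning is fun", "deep learning is fun"))  # 0.6
```

Jaccard ignores term frequency and weighting, so it tends to suit short strings or tag sets. The TF-IDF cosine code that follows is usually the better default for longer documents.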
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Cosine similarity between TF-IDF vectors
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)
# Similarity matrix
sim_matrix = cosine_similarity(X)
# Similarity between two documents
sim = cosine_similarity(X[0:1], X[1:2])[0][0]
print(f"Similarity: {sim:.3f}")
Pattern 7: FuzzyWuzzy for String Matching
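# Note: fuzzywuzzy has been renamed to thefuzz; `from thefuzz import fuzz, process` offers the same functions
from fuzzywuzzy import fuzz, process
# Simple ratio
fuzz.ratio("hello world", "hello word") # 95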
# Partial ratio (substring matching)
fuzz.partial_ratio("hello world", "hello") # 100
# Token sort (order-independent)
fuzz.token_sort_ratio("hello world", "world hello") # 100
# Find best match in list
choices = ["apple inc", "apple computer", "microsoft"]
best = process.extractOne("apple", choices)
print(best) # ('apple inc', 90)
# Get top N matches
matches = process.extract("apple", choices, limit=3)
Pattern 8: Record Linkage with Character N-grams
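The idea behind this pattern is that character n-grams tolerate typos better than whole-word tokens. A toy pair of names shows this. The `char_ngrams` helper and the sample names are illustrative assumptions, and the sketch uses plain sets rather than the TF-IDF weighting applied in the pattern itself.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of length n in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

a, b = "jonathan smith", "jonathon smith"

# Word level: only 'smith' matches; the misspelled first name shares nothing.
shared_words = set(a.split()) & set(b.split())
print(shared_words)  # {'smith'}

# Character level: most trigrams survive the one-letter typo.
grams_a, grams_b = char_ngrams(a), char_ngrams(b)
print(len(grams_a & grams_b), "of", len(grams_a | grams_b), "trigrams shared")  # 9 of 15 trigrams shared
```

A single wrong letter only disturbs the few n-grams that overlap it, so the similarity degrades gradually instead of dropping to zero for the affected word.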
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Character n-grams capture typos better than word n-grams
vectorizer = TfidfVectorizer(
    analyzer='char',     # Character-level
    ngram_range=(2, 4),  # 2-4 character n-grams
)
# Fit on all names from both datasets
# (dataset1 and dataset2 are assumed to be pandas DataFrames with a 'name' column)
all_names = list(dataset1['name']) + list(dataset2['name'])
vectorizer.fit(all_names)
# Transform
X1 = vectorizer.transform(dataset1['name'])
X2 = vectorizer.transform(dataset2['name'])
# Compute similarity matrix
sim_matrix = cosine_similarity(X1, X2)
# Find matches above threshold
threshold = 0.8
matches = []
for i, row in enumerate(sim_matrix):
    best_match = row.argmax()
    if row[best_match] >= threshold:
        matches.append((i, best_match, row[best_match]))
Reference Navigation
For detailed content, see:
- Text Vectorization: reference/text_vectorization.md - Binary, BoW, TF-IDF, n-grams
- Word Embeddings: reference/word_embeddings.md - Word2Vec, FastText, GloVe, phrase_to_vector
- Topic Modeling: reference/topic_modeling.md - LDA, Gensim, MovieLens examples
- Text Similarity: reference/text_similarity.md - Cosine, Jaccard, record linkage
- Text Preprocessing: reference/text_preprocessing.md - Cleaning, tokenization, FuzzyWuzzy
Common Mistakes to Avoid
1. Using TF-IDF for LDA
# WRONG: LDA expects raw counts
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)
lda.fit(X) # Bad!
# CORRECT: Use CountVectorizer
count = CountVectorizer()
X = count.fit_transform(texts)
lda.fit(X)
2. Not Handling OOV Words
# WRONG: KeyError for unknown words
vector = model.wv['unknown_word'] # Crashes!
# CORRECT: Check first
if 'word' in model.wv:
    vector = model.wv['word']
else:
    vector = np.zeros(model.vector_size) # Or skip
3. Averaging Empty Lists
# WRONG: NaN from empty list
def doc_to_vector(doc, model):
    vectors = [model.wv[w] for w in doc if w in model.wv]
    return np.mean(vectors, axis=0) # NaN if vectors is empty!
# CORRECT: Handle empty case
def doc_to_vector(doc, model):
    vectors = [model.wv[w] for w in doc if w in model.wv]
    if len(vectors) == 0:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)
4. Not Lowercasing Consistently
# WRONG: Case mismatch when lowercasing is disabled
vectorizer = TfidfVectorizer(lowercase=False)
vectorizer.fit(["Hello World"])
vectorizer.transform(["hello world"]) # Won't match: 'hello' != 'Hello'
# CORRECT: Lowercase everything
vectorizer = TfidfVectorizer(lowercase=True) # Default is True
Teaching Mode
When explaining NLP concepts:
- TF-IDF intuition: "Words that appear often in one document but rarely across all documents are important"
- Word2Vec intuition: "Words in similar contexts have similar meanings - represented by nearby vectors"
- LDA intuition: "Documents are mixtures of topics, topics are distributions over words"
- Cosine similarity: "Angle between vectors - ignores magnitude, focuses on direction"
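The cosine intuition above ("ignores magnitude, focuses on direction") can be checked with a few lines of standard-library Python. The `cosine` helper and the toy count vectors are illustrative assumptions. In practice `sklearn.metrics.pairwise.cosine_similarity` does the same on sparse matrices.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

short_doc = [1, 2, 0]
long_doc = [10, 20, 0]   # Same direction, ten times the magnitude
other_doc = [0, 0, 3]    # Shares no terms with the others

print(round(cosine(short_doc, long_doc), 6))   # 1.0
print(round(cosine(short_doc, other_doc), 6))  # 0.0
```

A document repeated ten times over scores as identical to the original, which is why cosine is a common choice when document lengths vary widely.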
TF-IDF Formula Visual
TF-IDF(term, doc) = TF(term, doc) × IDF(term)
Where:
TF = count(term in doc) / total_terms_in_doc
IDF = log(total_docs / docs_containing_term)
High TF-IDF = appears often in this doc, rarely in others
= likely important for this document!
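The formula above can be worked through by hand on a tiny corpus. This is a minimal sketch that follows the textbook definition exactly as written. The three sample documents and the helper names `tf`, `idf`, and `tf_idf` are assumptions made for illustration.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: share of the document's tokens equal to term."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency: log(total_docs / docs_containing_term)."""
    docs_with_term = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / docs_with_term)  # Assumes the term occurs in at least one doc

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF(term, doc) = TF(term, doc) x IDF(term)."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical three-document corpus, already tokenized.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

print(round(tf_idf("the", corpus[0], corpus), 3))  # 0.0   (appears in every document)
print(round(tf_idf("mat", corpus[0], corpus), 3))  # 0.183 (appears only in this document)
```

scikit-learn's `TfidfVectorizer` will not reproduce these exact numbers. By default it uses raw counts for TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2-normalizes each row. The ranking intuition is the same.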