levy-n

nlp-classical

Implements traditional, pre-transformer NLP techniques. Covers text vectorization (TF-IDF, Bag-of-Words), word embeddings (Word2Vec, FastText, GloVe, Doc2Vec), topic modeling (LDA, Gensim), and text similarity (Jaccard, Cosine, FuzzyWuzzy, record linkage). Use when building text classifiers without deep learning, extracting topics, matching entities, or when the user mentions 'TF-IDF', 'Word2Vec', 'topic modeling', 'LDA', 'text similarity', 'n-grams', 'document clustering', 'GloVe', 'Doc2Vec', 'FuzzyWuzzy', or 'record linkage'.

levy-n 10 1 Updated 4mo ago

Install

npx skillscat add levy-n/claude-useful-skills/nlp-classical

Install via the SkillsCat registry.

SKILL.md

NLP Classical - Traditional Text Processing

Classical text processing: vectorization, embeddings, and topic modeling.

Quick Start - TF-IDF Classification

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Vectorize text
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42
)

# Train classifier
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate
print(f"Accuracy: {model.score(X_test, y_test):.3f}")

Quick Start - Word2Vec Similarity

from gensim.models import Word2Vec

# Train Word2Vec
sentences = [text.split() for text in texts]  # Tokenized sentences
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, workers=4)

# Find similar words
similar = model.wv.most_similar('python', topn=5)
print(similar)

# Word vector
vector = model.wv['python']

When This Skill Activates

Use this skill when:

  • Building text classifiers without deep learning
  • Creating document vectors for similarity search
  • Extracting topics from a corpus
  • Matching records across datasets (entity resolution)
  • Working with limited compute resources
  • Needing interpretable text features

Core Patterns

Pattern 1: Text Vectorization Methods

| Method | When to Use | Captures |
|---|---|---|
| Binary | Document presence | Has word? |
| Bag-of-Words | Word counts matter | Frequency |
| TF-IDF | Most cases | Importance |

from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfVectorizer
)

# Binary (presence/absence)
binary = CountVectorizer(binary=True)
X_binary = binary.fit_transform(texts)

# Bag-of-Words (word counts)
bow = CountVectorizer()
X_bow = bow.fit_transform(texts)

# TF-IDF (term importance)
tfidf = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),  # Unigrams and bigrams
    min_df=2,            # Ignore rare terms
    max_df=0.95          # Ignore too common terms
)
X_tfidf = tfidf.fit_transform(texts)

Pattern 2: Word2Vec Training

from gensim.models import Word2Vec

# Prepare data (list of tokenized sentences)
sentences = [
    ['machine', 'learning', 'is', 'fun'],
    ['deep', 'learning', 'uses', 'neural', 'networks'],
    # ...
]

# Train Word2Vec
model = Word2Vec(
    sentences,
    vector_size=100,    # Embedding dimension
    window=5,           # Context window size
    min_count=2,        # Ignore words appearing < 2 times
    workers=4,          # Parallel threads
    sg=1                # 1 = Skip-gram, 0 = CBOW
)

# Save/Load
model.save('word2vec.model')
model = Word2Vec.load('word2vec.model')

# Get word vector
vector = model.wv['machine']

# Similar words
model.wv.most_similar('machine', topn=10)

# Word arithmetic
# king - man + woman ≈ queen
model.wv.most_similar(positive=['king', 'woman'], negative=['man'])

Pattern 3: GloVe with Pickle Caching

import pickle
import numpy as np

def load_glove(glove_path, cache_path='glove_cache.pkl'):
    """Load GloVe with caching for faster subsequent loads."""
    try:
        # Try loading from cache
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        # Load from text file
        embeddings = {}
        with open(glove_path, 'r', encoding='utf-8') as f:
            for line in f:
                values = line.split()
                word = values[0]
                vector = np.array(values[1:], dtype='float32')
                embeddings[word] = vector

        # Cache for next time
        with open(cache_path, 'wb') as f:
            pickle.dump(embeddings, f)

        return embeddings

glove = load_glove('glove.6B.100d.txt')
print(f"Loaded {len(glove)} word vectors")

Pattern 4: Doc2Vec for Document Embeddings

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Prepare tagged documents
documents = [
    TaggedDocument(words=text.split(), tags=[str(i)])
    for i, text in enumerate(texts)
]

# Train Doc2Vec
model = Doc2Vec(
    documents,
    vector_size=100,
    window=5,
    min_count=2,
    workers=4,
    dm=1,              # 1 = PV-DM (distributed memory), 0 = PV-DBOW (distributed bag of words)
    epochs=20
)

# Get document vector
doc_vector = model.dv['0']  # By tag

# Infer vector for new document
new_vector = model.infer_vector(['new', 'document', 'text'])

# Find similar documents
similar_docs = model.dv.most_similar('0', topn=5)

Pattern 5: LDA Topic Modeling

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Vectorize (use counts for LDA, not TF-IDF)
vectorizer = CountVectorizer(max_features=5000, max_df=0.95, min_df=2)
X = vectorizer.fit_transform(texts)

# Fit LDA
lda = LatentDirichletAllocation(
    n_components=10,    # Number of topics
    random_state=42,
    max_iter=20
)
lda.fit(X)

# Print top words per topic
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    # argsort is ascending, so reverse it and keep the 10 highest-weighted words
    top_words = [feature_names[i] for i in topic.argsort()[::-1][:10]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")

# Transform documents to topic distribution
doc_topics = lda.transform(X)

Pattern 6: Text Similarity

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Cosine similarity between TF-IDF vectors
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)

# Similarity matrix
sim_matrix = cosine_similarity(X)

# Similarity between two documents
sim = cosine_similarity(X[0:1], X[1:2])[0][0]
print(f"Similarity: {sim:.3f}")
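
Jaccard similarity is listed in this skill's scope next to cosine similarity, but no snippet above shows it. The following is a minimal sketch, assuming whitespace tokenization and lowercasing; the helper name `jaccard_similarity` is illustrative and does not come from any library.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity of the two texts' lowercase word sets: |A ∩ B| / |A ∪ B|."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        # Both texts are empty; returning 0.0 here is an assumption, not a standard
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


score = jaccard_similarity("machine learning is fun", "deep learning is fun")
print(f"Jaccard: {score:.3f}")
```

Unlike cosine similarity over TF-IDF vectors, this measure ignores word frequency and weighting, so it tends to suit short strings or tag sets better than long documents.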

Pattern 7: FuzzyWuzzy for String Matching

# Note: fuzzywuzzy has been renamed to thefuzz (`from thefuzz import fuzz, process`)
from fuzzywuzzy import fuzz, process

# Simple ratio
fuzz.ratio("hello world", "hello word")  # 95

# Partial ratio (substring matching)
fuzz.partial_ratio("hello world", "hello")  # 100

# Token sort (order-independent)
fuzz.token_sort_ratio("hello world", "world hello")  # 100

# Find best match in list
choices = ["apple inc", "apple computer", "microsoft"]
best = process.extractOne("apple", choices)
print(best)  # ('apple inc', 90)

# Get top N matches
matches = process.extract("apple", choices, limit=3)
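
For intuition about what these scores measure, the sketch below approximates `fuzz.ratio` and `fuzz.token_sort_ratio` with the standard library's `difflib.SequenceMatcher`. The helper names are hypothetical, and FuzzyWuzzy applies its own preprocessing and may use a different matching backend, so treat this as an approximation whose scores can differ on other inputs.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity on a 0-100 scale: 2 * matched_chars / total_chars."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_ratio(a: str, b: str) -> int:
    """Sort the lowercase tokens first so that word order does not matter."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return simple_ratio(sorted_a, sorted_b)


print(simple_ratio("hello world", "hello word"))
print(token_sort_ratio("hello world", "world hello"))
```

For the first pair, 10 of the 21 total characters match on each side, which gives 2 × 10 / 21 ≈ 0.95.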

Pattern 8: Record Linkage with Character N-grams

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Character n-grams capture typos better than word n-grams
vectorizer = TfidfVectorizer(
    analyzer='char',      # Character-level
    ngram_range=(2, 4),   # 2-4 character n-grams
)

# Fit on all names from both datasets
all_names = list(dataset1['name']) + list(dataset2['name'])
vectorizer.fit(all_names)

# Transform
X1 = vectorizer.transform(dataset1['name'])
X2 = vectorizer.transform(dataset2['name'])

# Compute similarity matrix
sim_matrix = cosine_similarity(X1, X2)

# Find matches above threshold
threshold = 0.8
matches = []
for i, row in enumerate(sim_matrix):
    best_match = row.argmax()
    if row[best_match] >= threshold:
        matches.append((i, best_match, row[best_match]))
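
To see why character n-grams tolerate typos, the sketch below compares word-level and character-trigram overlap for a misspelled name. It uses plain set overlap (Jaccard) rather than the TF-IDF cosine similarity above, and `char_ngrams` and `overlap` are illustrative helpers, not library functions.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Set of all length-n character substrings of the lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def overlap(set_a: set, set_b: set) -> float:
    """Jaccard overlap between two sets (0.0 when both are empty)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


clean, typo = "microsoft", "microsft"

# As whole words the two strings are simply different tokens
word_score = overlap({clean}, {typo})
# As trigrams, every n-gram that does not touch the typo is still shared
char_score = overlap(char_ngrams(clean), char_ngrams(typo))

print(f"word-level: {word_score:.2f}, char 3-grams: {char_score:.2f}")
```

A single dropped letter removes the word-level match entirely but only disturbs the few n-grams that span the typo, which is why a threshold on character-level similarity can still link the two records.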

Reference Navigation

For detailed content, see:

  • Text Vectorization: reference/text_vectorization.md - Binary, BoW, TF-IDF, n-grams
  • Word Embeddings: reference/word_embeddings.md - Word2Vec, FastText, GloVe, phrase_to_vector
  • Topic Modeling: reference/topic_modeling.md - LDA, Gensim, MovieLens examples
  • Text Similarity: reference/text_similarity.md - Cosine, Jaccard, record linkage
  • Text Preprocessing: reference/text_preprocessing.md - Cleaning, tokenization, FuzzyWuzzy

Common Mistakes to Avoid

1. Using TF-IDF for LDA

# WRONG: LDA expects raw counts
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)
lda.fit(X)  # Bad!

# CORRECT: Use CountVectorizer
count = CountVectorizer()
X = count.fit_transform(texts)
lda.fit(X)

2. Not Handling OOV Words

# WRONG: KeyError for unknown words
vector = model.wv['unknown_word']  # Raises KeyError if the word is out of vocabulary

# CORRECT: Check membership first
import numpy as np

word = 'unknown_word'
if word in model.wv:
    vector = model.wv[word]
else:
    vector = np.zeros(model.vector_size)  # Or skip the word

3. Averaging Empty Lists

# WRONG: NaN from empty list
def doc_to_vector(doc, model):
    vectors = [model.wv[w] for w in doc if w in model.wv]
    return np.mean(vectors, axis=0)  # NaN if vectors is empty!

# CORRECT: Handle empty case
def doc_to_vector(doc, model):
    vectors = [model.wv[w] for w in doc if w in model.wv]
    if len(vectors) == 0:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

4. Not Lowercasing Consistently

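# WRONG: Lowercasing disabled, so "Hello" and "hello" are different features
vectorizer = TfidfVectorizer(lowercase=False)
vectorizer.fit(["Hello World"])
vectorizer.transform(["hello world"])  # All zeros: nothing matches the vocabulary

# CORRECT: Keep lowercase=True (the default), and use the same casing
# for embedding lookups as was used when the embeddings were trained
vectorizer = TfidfVectorizer(lowercase=True)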

Teaching Mode

When explaining NLP concepts:

  1. TF-IDF intuition: "Words that appear often in one document but rarely across all documents are important"
  2. Word2Vec intuition: "Words in similar contexts have similar meanings - represented by nearby vectors"
  3. LDA intuition: "Documents are mixtures of topics, topics are distributions over words"
  4. Cosine similarity: "Angle between vectors - ignores magnitude, focuses on direction"
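
The cosine intuition in item 4 can be checked with a few lines of standard-library Python. This is a sketch with made-up term-count vectors; `cosine` is an illustrative helper and assumes neither vector is all zeros.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


short_doc = [1, 2, 0]    # hypothetical term counts
long_doc = [10, 20, 0]   # same proportions, ten times longer
other_doc = [0, 0, 5]    # shares no terms with the other two

print(cosine(short_doc, long_doc))   # same direction despite different lengths
print(cosine(short_doc, other_doc))  # orthogonal: no shared terms
```

Scaling a vector changes its length but not its direction, which is why a short and a long document on the same subject can still score as near-identical.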

TF-IDF Formula Visual

TF-IDF(term, doc) = TF(term, doc) × IDF(term)

Where:
  TF  = count(term in doc) / total_terms_in_doc
  IDF = log(total_docs / docs_containing_term)

High TF-IDF = appears often in this doc, rarely in others
            = likely important for this document!
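
The formula above can be worked through by hand on a toy corpus. The sketch below implements it literally; the corpus and the helper names `tf`, `idf`, and `tf_idf` are made up for illustration, and giving unseen terms a weight of zero is an assumption.

```python
import math


def tf(term, doc):
    """Term frequency: occurrences of term divided by the document length."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Inverse document frequency: log(total_docs / docs_containing_term)."""
    containing = sum(1 for doc in docs if term in doc)
    if containing == 0:
        return 0.0  # assumption: terms seen in no document get zero weight
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    """TF-IDF(term, doc) = TF(term, doc) * IDF(term)."""
    return tf(term, doc) * idf(term, docs)


# Toy corpus of pre-tokenized documents
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ate", "the", "fish"],
]

# Score three terms against the last document
for term in ("the", "cat", "fish"):
    print(f"{term}: {tf_idf(term, docs[2], docs):.3f}")
```

Here "the" scores zero because it appears in every document, while "fish" scores highest because it appears only in the last one. Note that scikit-learn's `TfidfVectorizer` will not reproduce these numbers with its defaults: it uses raw counts for TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2-normalizes each row.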