Comprehensive data science, machine learning, and AI guide covering Python, deep learning, NLP, LLMs, prompt engineering, and MLOps. Use when building AI models, data pipelines, or machine learning systems.
Install
Install via the SkillsCat registry:
npx skillscat add pluginagentmarketplace/custom-plugin-design-system/data-ai-guide
Data Science & AI Guide
Master data science, machine learning, generative AI, and modern AI engineering practices.
Quick Start
Python Data Science Stack
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Load and prepare data
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
# Evaluate
accuracy = model.score(X_test, y_test)
Deep Learning with PyTorch
import torch
import torch.nn as nn
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(784, 128)
        self.linear2 = nn.Linear(128, 10)
    def forward(self, x):
        x = torch.relu(self.linear1(x))
        return self.linear2(x)
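# Training setup
model = SimpleNN()
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()
# One training step on a random placeholder batch; swap in a real DataLoader
inputs = torch.randn(32, 784)
targets = torch.randint(0, 10, (32,))
optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()
LLM Prompt Engineering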
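from openai import OpenAI
# OpenAI() reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ],
    temperature=0.7
)
# The generated reply is on the first choice
print(response.choices[0].message.content)
Data Science Path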
Fundamentals
- Mathematics: Statistics, linear algebra, calculus
- Python: Libraries (Pandas, NumPy, Scikit-learn)
- Data Analysis: Exploratory analysis, visualization
- SQL: Querying and data manipulation
Machine Learning
- Supervised Learning: Regression, classification
- Unsupervised Learning: Clustering, dimensionality reduction
- Model Evaluation: Cross-validation, metrics
- Hyperparameter Tuning: Grid search, Bayesian optimization
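To make the cross-validation item above concrete, here is a minimal pure-Python sketch of k-fold evaluation. The helper names (`k_fold_indices`, `cross_validate`, `fit_majority`) and the toy labels are illustrative assumptions; in practice scikit-learn's `KFold`, `cross_val_score`, and `GridSearchCV` cover the same ground.

```python
from statistics import mean

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

def cross_validate(xs, ys, k, fit, score):
    """Average a score over k held-out folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return mean(scores)

# Toy "model" (hypothetical): always predict the majority class seen in training
def fit_majority(xs, ys):
    return max(set(ys), key=ys.count)

def accuracy(model, xs, ys):
    return sum(1 for y in ys if y == model) / len(ys)

labels = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
features = list(range(len(labels)))
cv_accuracy = cross_validate(features, labels, k=5, fit=fit_majority, score=accuracy)
print(cv_accuracy)
```

Each sample is held out exactly once, so the averaged score is a less optimistic estimate than a single train/test split. Real data should usually be shuffled (or stratified) before folding.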
Deep Learning
- Neural Networks: Architecture, training
- CNNs: Computer vision tasks
- RNNs: Sequence modeling
- Transformers: Modern architecture for NLP/Vision
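The core operation behind the Transformers item above is scaled dot-product attention. The sketch below writes it in plain Python lists for readability, assuming a single head with no masking or learned projections; the function names are illustrative, and a real model would use PyTorch tensors.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Identical keys give uniform weights, so the output is the mean of the values
result = attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 1.0], [1.0, 1.0]],
    values=[[0.0, 2.0], [4.0, 6.0]],
)
print(result)
```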
Natural Language Processing
- Text Processing: Tokenization, embeddings
- Word Embeddings: Word2Vec, GloVe, FastText
- BERT: Contextual embeddings
- Transformers: GPT, BERT for various NLP tasks
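As a rough illustration of the tokenization and embedding items above, the sketch below tokenizes text and compares documents by cosine similarity over word counts. Word counts are a stand-in assumption here: Word2Vec, GloVe, and BERT produce dense learned vectors, but the similarity computation has the same shape.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(text):
    """Sparse count vector: token -> number of occurrences."""
    return Counter(tokenize(text))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc_a = bag_of_words("Machine learning models learn from data.")
doc_b = bag_of_words("Deep learning models need a lot of data.")
doc_c = bag_of_words("The weather is sunny today.")
print(cosine_similarity(doc_a, doc_b), cosine_similarity(doc_a, doc_c))
```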
Generative AI & LLMs
Large Language Models
- GPT Family: GPT-3.5, GPT-4 for text generation
- Claude: Anthropic's models, trained with Constitutional AI
- Open Source: Llama, Mistral, Zephyr
- Fine-tuning: Adapting models for specific tasks
Prompt Engineering
- Role-based Prompting: Setting context and expertise
- Few-shot Learning: Examples in prompt
- Chain-of-Thought: Step-by-step reasoning
- Retrieval Augmented Generation (RAG): Knowledge augmentation
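Role-based, few-shot, and chain-of-thought prompting from the list above can be combined when assembling the message list. The helper below is a hypothetical sketch (the function name, example texts, and step-by-step wording are assumptions); it only builds messages in the chat format used in the Quick Start and does not call any API.

```python
def build_few_shot_messages(system_role, examples, question, chain_of_thought=False):
    """Build a chat message list with a role, worked examples, and the question."""
    # Role-based prompting: the system message sets context and expertise
    messages = [{"role": "system", "content": system_role}]
    # Few-shot learning: each example is a user turn followed by the ideal reply
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    # Chain-of-thought: ask for intermediate reasoning before the answer
    if chain_of_thought:
        question = f"{question}\nThink through the problem step by step before giving the final answer."
    messages.append({"role": "user", "content": question})
    return messages

messages = build_few_shot_messages(
    system_role="You are a sentiment classifier. Answer 'positive' or 'negative'.",
    examples=[("I loved it", "positive"), ("Terrible service", "negative")],
    question="Not bad at all",
    chain_of_thought=True,
)
for message in messages:
    print(message["role"], ":", message["content"])
```

The resulting list can be passed as the `messages` argument of a chat completion call.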
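# RAG Example
# Import paths follow older LangChain releases; newer releases move these
# classes to packages such as langchain_community and langchain_openai
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
embeddings = OpenAIEmbeddings()
# Empty store; add documents with vectorstore.add_texts([...]) before querying
vectorstore = Chroma(embedding_function=embeddings)
# LLM that answers from the retrieved documents (model choice is an assumption)
llm = ChatOpenAI(model_name="gpt-4")
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)
AI Agents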
- Tool Use: Agents calling external tools
- Planning: Multi-step task execution
- Memory: Conversation history, context
- Evaluation: Assessing agent performance
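The tool-use, planning, and memory items above fit together in a simple loop. In the sketch below, the plan is a hard-coded list standing in for what an LLM would emit, and the tools (`add`, `word_count`) are hypothetical; a real agent would generate each step from the model's output and feed results back into the next prompt.

```python
def add(a, b):
    return a + b

def word_count(text):
    return len(text.split())

# Tool registry: the names the "model" is allowed to call
TOOLS = {"add": add, "word_count": word_count}

def run_agent(plan):
    """Execute a list of tool calls, recording each step in memory.

    `plan` is a list of {"tool": name, "args": {...}} dicts.
    """
    memory = []
    for step in plan:
        tool = TOOLS.get(step["tool"])
        if tool is None:
            # Unknown tools are recorded rather than raised, so the run continues
            memory.append({"step": step, "error": f"unknown tool: {step['tool']}"})
            continue
        memory.append({"step": step, "result": tool(**step["args"])})
    return memory

plan = [
    {"tool": "add", "args": {"a": 2, "b": 3}},
    {"tool": "word_count", "args": {"text": "agents call external tools"}},
    {"tool": "search", "args": {"query": "not registered"}},
]
trace = run_agent(plan)
for entry in trace:
    print(entry)
```

The recorded trace doubles as material for evaluation: it shows which tools were called, with what arguments, and where a step failed.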
Data Engineering
ETL Pipelines
- Apache Airflow: Workflow orchestration
- dbt: Data transformation
- Kafka: Stream processing
- Spark: Distributed processing
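Workflow orchestrators such as Airflow run tasks in dependency order. The sketch below shows that idea with the standard library's `graphlib` (Python 3.9+); the task functions and sample data are made up for illustration, and this is not Airflow's API.

```python
from graphlib import TopologicalSorter

def extract(ctx):
    # Placeholder source data; a real task would read from a file or an API
    ctx["raw"] = ["3", "5", "bad", "7"]

def transform(ctx):
    # Keep only rows that parse as non-negative integers
    ctx["clean"] = [int(v) for v in ctx["raw"] if v.isdigit()]

def load(ctx):
    # Stand-in for writing to a warehouse table
    ctx["warehouse"] = {"row_count": len(ctx["clean"]), "total": sum(ctx["clean"])}

TASKS = {"extract": extract, "transform": transform, "load": load}
# task -> set of tasks that must finish first
DEPENDENCIES = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}

def run_pipeline(tasks, dependencies):
    """Run every task once, after all of its dependencies."""
    ctx = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name](ctx)
    return order, ctx

order, ctx = run_pipeline(TASKS, DEPENDENCIES)
print(order, ctx["warehouse"])
```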
Big Data
- Hadoop: Distributed storage and processing
- Spark: In-memory processing framework
- Scala: Spark's native language
- Distributed Systems: Understanding CAP theorem
Data Warehousing
- Snowflake: Cloud data warehouse
- BigQuery: Google's data warehouse
- Redshift: AWS data warehouse
- Star Schema: Dimensional modeling
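A star schema keeps measurements in a central fact table and descriptive attributes in dimension tables joined by key. The sketch below uses an in-memory SQLite database so it runs anywhere; the table and column names are illustrative assumptions, and the same query pattern applies to Snowflake, BigQuery, or Redshift.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO dim_date VALUES (10, 2023), (11, 2024);
INSERT INTO fact_sales VALUES (1, 10, 20.0), (1, 11, 30.0), (2, 11, 50.0);
""")

# Typical star-schema query: join the fact table to its dimensions, then aggregate
rows = conn.execute("""
SELECT p.category, d.year, SUM(f.amount) AS total
FROM fact_sales f
JOIN dim_product p ON p.product_id = f.product_id
JOIN dim_date d ON d.date_id = f.date_id
GROUP BY p.category, d.year
ORDER BY p.category, d.year
""").fetchall()

for category, year, total in rows:
    print(category, year, total)
```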
MLOps
Model Management
- Model Versioning: Tracking model versions
- Model Registry: MLflow, Weights & Biases
- Experiment Tracking: Monitoring training runs
- Model Cards: Documenting model capabilities
Deployment
- Model Serving: FastAPI, TensorFlow Serving
- Containerization: Docker for models
- Kubernetes: Production ML deployment
- API Monitoring: Performance and data drift
Monitoring
- Data Drift: Detecting distribution changes
- Model Drift: Performance degradation
- Feature Store: Consistent feature serving
- Observability: Logging and metrics
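One common way to detect the distribution changes mentioned above is the Population Stability Index (PSI), which compares binned frequencies of a feature between a reference sample and live data. The sketch below is a minimal pure-Python version; the function name, bin count, and sample data are assumptions. The thresholds often quoted (below 0.1 stable, above 0.25 significant shift) are a rule of thumb, not a standard.

```python
import math

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a current sample of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # constant reference feature: fall back to unit-width bins

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) // width)
            # Values outside the reference range go into the edge bins
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

training_sample = list(range(100))
live_same = list(range(100))
live_shifted = [v + 50 for v in training_sample]

psi_same = population_stability_index(training_sample, live_same)
psi_shifted = population_stability_index(training_sample, live_shifted)
print(psi_same, psi_shifted)
```

Running this per feature on a schedule and logging the values gives a simple drift signal to alert on.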
Technology Stack
Core Libraries
- Pandas: Data manipulation
- NumPy: Numerical computing
- Scikit-learn: Machine learning
- Matplotlib/Seaborn: Visualization
- Plotly: Interactive plots
Deep Learning
- TensorFlow: Keras API, distributed training
- PyTorch: Dynamic graphs, research-friendly
- JAX: Functional programming for ML
LLM Frameworks
- LangChain: Building LLM applications
- LlamaIndex: RAG and indexing
- OpenAI API: GPT models access
- Hugging Face: Model hub and transformers
Learning Path
Fundamentals (3 months)
- Python programming
- Statistics and mathematics
- Data manipulation with Pandas
Machine Learning (3 months)
- Supervised learning
- Model evaluation
- Feature engineering
Deep Learning (2 months)
- Neural networks
- CNNs and RNNs
- Transformers
Specialization (ongoing)
- NLP / Computer Vision / Tabular Data
- LLMs and generative AI
- MLOps and production
Projects
- Iris Classification - Classic ML project
- Housing Price Prediction - Regression
- Sentiment Analysis - NLP with transformers
- Image Classification - CNN with deep learning
- LLM Chatbot - Using prompt engineering
- RAG System - Knowledge-augmented AI
- Time Series Forecasting - Stock predictions
Resources
Learning Platforms
- Coursera: Andrew Ng's ML course
- Fast.ai: Practical deep learning
- DataCamp: Interactive data science
- Kaggle: Competitions and datasets
Documentation
Roadmap.sh Reference: https://roadmap.sh/ai-engineer
Status: ✅ Production Ready | SASMP: v1.3.0 | Bonded Agent: 04-data-ai-specialist