legout
@legout
Public Skills
data-engineering-catalogs
by legout
"Data catalogs: Iceberg catalogs (Hive Metastore, AWS Glue, Tabular), using DuckDB as a lightweight multi-source catalog, comparisons of Amundsen/DataHub/OpenMetadata, and patterns for unified data access."
data-engineering-ai-ml
by legout
"AI/ML data pipelines: embedding generation, vector databases, RAG patterns, LLM monitoring, and batch inference workflows."
data-engineering-storage-lakehouse
by legout
"Lakehouse table formats: Delta Lake, Apache Iceberg, and Apache Hudi for ACID transactions, schema evolution, and time travel on data lakes."
data-engineering-quality
by legout
"Data quality testing and validation with Great Expectations and Pandera. Covers schema validation, data quality tests, profiling, and automated validation in pipelines."
data-engineering-core
by legout
"Core Python data engineering: Polars, DuckDB, PyArrow, PostgreSQL, ETL patterns, performance tuning, and resilient pipeline construction. Use when building or reviewing batch ETL/dataframe/SQL pipelines in Python."
data-engineering-storage-remote-access-integrations-pyarrow
by legout
"Using PyArrow's parquet and dataset modules with remote filesystems (S3, GCS, Azure). Covers native filesystems, fsspec bridge, and obstore wrapper."
data-engineering-storage-remote-access-libraries-fsspec
by legout
"Comprehensive guide to fsspec: the universal filesystem interface for Python. Covers S3, GCS, Azure via s3fs, gcsfs, adlfs; protocol chaining, caching, async operations, and integration with the data ecosystem."
data-engineering-storage-remote-access-libraries-pyarrow-fs
by legout
"Native Arrow filesystem integration with PyArrow. Optimized for Parquet workflows, zero-copy data transfer, predicate pushdown, and column pruning. Covers S3, GCS, HDFS with PyArrow datasets."
data-engineering-storage-remote-access-integrations-delta-lake
by legout
"Delta Lake integration with cloud storage (S3, GCS, Azure). Covers storage_options, PyArrow filesystem, time travel, and partitioned writes."
data-engineering-storage-remote-access-libraries-obstore
by legout
"obstore: a high-performance, Rust-based remote filesystem library. Covers store creation, basic operations, the async API, streaming uploads, Arrow integration, and the fsspec compatibility wrapper."
data-engineering-storage-authentication
by legout
"Cloud storage authentication patterns: AWS, GCP, Azure credentials, IAM roles, service principals, secret management, and secure credential handling for data pipelines."
data-engineering-storage-remote-access-integrations-polars
by legout
"Integrating Polars with remote filesystems (S3, GCS, Azure). Covers native cloud support, fsspec integration, PyArrow dataset scanning, and partitioned writes."
data-engineering-streaming
by legout
"Real-time data pipelines with Apache Kafka, MQTT (IoT), and NATS JetStream. Covers producers, consumers, streaming patterns, and integration with data platforms."
data-engineering-storage-remote-access
by legout
"Cloud storage access in Python: fsspec, pyarrow.fs, obstore libraries, plus integrations with Polars, DuckDB, PyArrow, Delta Lake, and Iceberg."
data-science-feature-engineering
by legout
"Feature engineering for machine learning: encoding, scaling, transformations, datetime features, text features, and feature selection. Use when preparing data for modeling or improving model performance through better representations."
data-engineering-storage-remote-access-integrations-duckdb
by legout
"Using DuckDB with remote cloud storage via the HTTPFS extension, fsspec, and Delta Lake integration. Covers S3, GCS, Azure, and S3-compatible endpoints."
data-engineering-best-practices
by legout
"Data engineering best practices: medallion architecture, dataset lifecycle, partitioning, file sizing, schema evolution, and append/overwrite/merge patterns across Polars, PyArrow, DuckDB, Delta Lake, and Iceberg. Use when designing production data pipelines or reviewing data platform decisions."
data-engineering-observability
by legout
"Observability and monitoring for data pipelines using OpenTelemetry (traces) and Prometheus (metrics). Covers instrumentation, dashboards, and alerting."
data-engineering
by legout
"Comprehensive data engineering skill suite covering core libraries (Polars, DuckDB, PyArrow), lakehouse formats, cloud storage, orchestration, streaming, quality, observability, and AI/ML pipelines."
data-engineering-storage-formats
by legout
"Modern data serialization formats: Parquet, Apache Arrow (Feather/IPC), Lance (ML-native), Zarr (chunked arrays), Avro, and ORC. Covers compression, partitioning, and format selection."
data-science-eda
by legout
"Exploratory Data Analysis (EDA): profiling, visualization, correlation analysis, and data quality checks. Use when understanding dataset structure, distributions, relationships, or preparing for feature engineering and modeling."
data-engineering-storage-remote-access-integrations-iceberg
by legout
"Apache Iceberg catalog configuration for cloud storage (S3, GCS, Azure). Covers AWS Glue and REST catalogs, table scanning, and append/overwrite operations."
data-science-interactive-apps
by legout
"Interactive web apps for data science: Streamlit, Panel, and Gradio. Use for prototyping ML models, creating data exploration dashboards, and sharing insights with non-technical stakeholders."
data-engineering-storage-remote-access-integrations-pandas
by legout
"Reading and writing data with Pandas from/to cloud storage (S3, GCS, Azure) using fsspec and PyArrow filesystems."
data-engineering-orchestration
by legout
"Pipeline orchestration and workflow management with Prefect, Dagster, and dbt. Covers scheduling, dependency management, retries, and integration patterns."
data-science-model-evaluation
by legout
"Model evaluation and validation: cross-validation, metrics, hyperparameter tuning, and model comparison. Use when assessing model performance, selecting models, or diagnosing modeling issues."
flowerpower
by legout
"Create and manage data pipelines using the FlowerPower framework with Hamilton DAGs and uv. Lightweight orchestration for batch ETL, data transformation, and ML pipelines. Integrates with Delta Lake, DuckDB, Polars, and cloud storage."
data-science-notebooks
by legout
"Interactive notebooks for data science: Jupyter, JupyterLab, and marimo. Use for exploratory analysis, reproducible research, documentation, and sharing insights with stakeholders."
data-science-visualization
by legout
"Data visualization for Python: Matplotlib, Seaborn, Plotly, Altair, hvPlot/HoloViz, and Bokeh. Use when creating exploratory charts, interactive dashboards, publication-quality figures, or choosing the right library for your data and audience."