legout

@legout

29 Skills

0 Total Stars

Joined February 2026

Public Skills

data-engineering-catalogs

by legout

"Data catalogs: Iceberg catalogs (Hive Metastore, AWS Glue, Tabular), using DuckDB as a lightweight multi-source catalog, comparisons of Amundsen/DataHub/OpenMetadata, and patterns for unified data access."

Cloud 0 5mo ago

data-engineering-ai-ml

by legout

"AI/ML data pipelines: embedding generation, vector databases, RAG patterns, LLM monitoring, and batch inference workflows."

Processing 0 5mo ago

data-engineering-storage-lakehouse

by legout

"Lakehouse table formats: Delta Lake, Apache Iceberg, and Apache Hudi for ACID transactions, schema evolution, and time travel on data lakes."

Processing 0 5mo ago

data-engineering-quality

by legout

"Data quality testing and validation with Great Expectations and Pandera. Schema validation, data quality tests, profiling, and automated validation in pipelines."

Processing 0 5mo ago

data-engineering-core

by legout

"Core Python data engineering: Polars, DuckDB, PyArrow, PostgreSQL, ETL patterns, performance tuning, and resilient pipeline construction. Use when building or reviewing batch ETL/dataframe/SQL pipelines in Python."

Automation 0 5mo ago

data-engineering-storage-remote-access-integrations-pyarrow

by legout

"Using PyArrow's parquet and dataset modules with remote filesystems (S3, GCS, Azure). Covers native filesystems, fsspec bridge, and obstore wrapper."

Automation 0 5mo ago

data-engineering-storage-remote-access-libraries-fsspec

by legout

"Comprehensive guide to fsspec: the universal filesystem interface for Python. Covers S3, GCS, Azure via s3fs, gcsfs, adlfs; protocol chaining, caching, async operations, and integration with the data ecosystem."

Auth 0 5mo ago

data-engineering-storage-remote-access-libraries-pyarrow-fs

by legout

"Native Arrow filesystem integration with PyArrow. Optimized for Parquet workflows, zero-copy data transfer, predicate pushdown, and column pruning. Covers S3, GCS, HDFS with PyArrow datasets."

Processing 0 5mo ago

data-engineering-storage-remote-access-integrations-delta-lake

by legout

"Delta Lake integration with cloud storage (S3, GCS, Azure). Covers storage_options, PyArrow filesystem, time travel, and partitioned writes."

Auth 0 5mo ago

data-engineering-storage-remote-access-libraries-obstore

by legout

"High-performance Rust-based remote filesystem library. Covers store creation, basic operations, async API, streaming uploads, Arrow integration, and fsspec compatibility wrapper."

Processing 0 5mo ago

data-engineering-storage-authentication

by legout

"Cloud storage authentication patterns: AWS, GCP, Azure credentials, IAM roles, service principals, secret management, and secure credential handling for data pipelines."

Auth 0 5mo ago

data-engineering-storage-remote-access-integrations-polars

by legout

"Integrating Polars with remote filesystems (S3, GCS, Azure). Covers native cloud support, fsspec integration, PyArrow dataset scanning, and partitioned writes."

Cloud 0 5mo ago

data-engineering-streaming

by legout

"Real-time data pipelines with Apache Kafka, MQTT (IoT), and NATS JetStream. Covers producers, consumers, streaming patterns, and integration with data platforms."

Automation 0 5mo ago

data-engineering-storage-remote-access

by legout

"Cloud storage access in Python: fsspec, pyarrow.fs, obstore libraries, plus integrations with Polars, DuckDB, PyArrow, Delta Lake, and Iceberg."

Cloud 0 5mo ago

data-science-feature-engineering

by legout

"Feature engineering for machine learning: encoding, scaling, transformations, datetime features, text features, and feature selection. Use when preparing data for modeling or improving model performance through better representations."

CI/CD 0 5mo ago

data-engineering-storage-remote-access-integrations-duckdb

by legout

"Using DuckDB with remote cloud storage via HTTPFS extension, fsspec, and Delta Lake integration. Covers S3, GCS, Azure, and S3-compatible endpoints."

Processing 0 5mo ago

data-engineering-best-practices

by legout

"Data engineering best practices: medallion architecture, dataset lifecycle, partitioning, file sizing, schema evolution, and append/overwrite/merge patterns across Polars, PyArrow, DuckDB, Delta Lake, and Iceberg. Use when designing production data pipelines or reviewing data platform decisions."

Processing 0 5mo ago

data-engineering-observability

by legout

"Observability and monitoring for data pipelines using OpenTelemetry (traces) and Prometheus (metrics). Covers instrumentation, dashboards, and alerting."

Processing 0 5mo ago

data-engineering

by legout

"Comprehensive data engineering skill suite covering core libraries (Polars, DuckDB, PyArrow), lakehouse formats, cloud storage, orchestration, streaming, quality, observability, and AI/ML pipelines."

Processing 0 5mo ago

data-engineering-storage-formats

by legout

"Modern data serialization formats: Parquet, Apache Arrow (Feather/IPC), Lance (ML-native), Zarr (chunked arrays), Avro, and ORC. Covers compression, partitioning, and format selection."

Processing 0 5mo ago

data-science-eda

by legout

"Exploratory Data Analysis (EDA): profiling, visualization, correlation analysis, and data quality checks. Use when understanding dataset structure, distributions, relationships, or preparing for feature engineering and modeling."

Analytics 0 5mo ago

data-engineering-storage-remote-access-integrations-iceberg

by legout

"Apache Iceberg catalog configuration for cloud storage (S3, GCS, Azure). Covers AWS Glue and REST catalogs, table scanning, and append/overwrite operations."

API Dev 0 5mo ago

data-science-interactive-apps

by legout

"Interactive web apps for data science: Streamlit, Panel, and Gradio. Use for prototyping ML models, creating data exploration dashboards, and sharing insights with non-technical stakeholders."

Processing 0 5mo ago

data-engineering-storage-remote-access-integrations-pandas

by legout

"Reading and writing data with Pandas from/to cloud storage (S3, GCS, Azure) using fsspec and PyArrow filesystems."

Auth 0 5mo ago

data-engineering-orchestration

by legout

"Pipeline orchestration and workflow management with Prefect, Dagster, and dbt. Covers scheduling, dependency management, retries, and integration patterns."

Automation 0 5mo ago

data-science-model-evaluation

by legout

"Model evaluation and validation: cross-validation, metrics, hyperparameter tuning, and model comparison. Use when assessing model performance, selecting models, or diagnosing modeling issues."

Processing 0 5mo ago

flowerpower

by legout

"Create and manage data pipelines using the FlowerPower framework with Hamilton DAGs and uv. Lightweight orchestration for batch ETL, data transformation, and ML pipelines. Integrates with Delta Lake, DuckDB, Polars, and cloud storage."

Processing 0 5mo ago

data-science-notebooks

by legout

"Interactive notebooks for data science: Jupyter, JupyterLab, and marimo. Use for exploratory analysis, reproducible research, documentation, and sharing insights with stakeholders."

Processing 0 5mo ago

data-science-visualization

by legout

"Data visualization for Python: Matplotlib, Seaborn, Plotly, Altair, hvPlot/HoloViz, and Bokeh. Use when creating exploratory charts, interactive dashboards, publication-quality figures, or choosing the right library for your data and audience."

Analytics 0 5mo ago