Dagster
Dagster is a data orchestration tool: a Python framework for writing data pipelines.
It is intended (among other things) for MLOps.
Key concepts
Assets (@asset)
An asset: a DB table, CSV file, ML model, DataFrame, …
Software-defined asset: a Python function that produces an asset (and is decorated with @asset).
In Dagster, assets are materialized.
Dependencies between assets are defined in code: a downstream asset names its upstream assets as function arguments.
Dagster keeps track of the lineage of assets using code versions and data versions (similar to GNU make).
Ops (@op)
Ops are the core unit of computation in Dagster. Ops should be relatively simple functions. Under the hood, every asset is backed by an op (or a graph of ops).
A collection of ops can be assembled into a graph (graph-based asset, @graph_asset).
Jobs (@job)
Jobs are the main unit of execution in Dagster.
Jobs can be run manually, be scheduled (similar to cron) or triggered by sensors (which you define in code).
When an asset/op is combined with a specific set of parameters it is called a job.
When it is run, it is called a run. E.g. logs are stored for each run.
Partitions
Assets and ops can be partitioned, e.g. by date. It helps with parallelization: Dagster will manage that for you. You can limit the number of concurrent runs, ops, etc.
Resources
Clients for external resources (DBs, cloud storage, …). Wrappers around external services, useful for testing and abstracting away the details of the service.
Resources can be configured at runtime with Config.
IO Managers
I/O managers are special types of Resources: they are used to read and write data from/to external sources. Also useful for persisting intermediate results between Dagster restarts.
Schedule (example)
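from dagster import Definitions, ScheduleDefinition, define_asset_job

defs = Definitions(
    ...
    schedules=[
        ScheduleDefinition(
            # Job that materializes all assets ("*" selects every asset).
            job=define_asset_job(name="my_asset_job", selection="*"),
            # Run once a day.
            cron_schedule="@daily",
        )
    ],
)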
Versioning
By default, every run gets a new code version, since Dagster assumes that your code changes between runs.
You can version code explicitly (in the asset decorator, with the code_version parameter).
Dagster auto-generates data versions by hashing the code version together with input data versions.
This can be overridden in Output with data_version=DataVersion(<string_value>).
Running, deployment
dagster dev is for local development, dagit (renamed to dagster-webserver in newer releases) is for running a production service.
Configuration files
workspace.yaml defines where to find code locations, basically where to find assets, ops, jobs, … (see Workspaces).
dagster.yaml defines runners, number of workers, …
Links
- The core concepts,
- a crash course and
- the accompanying article.
- Documentation.
last modified: 2024-04-17
https://vit.baisa.cz/notes/code/dagster/