Dagster

Dagster is a data orchestration tool or a Python framework for writing data pipelines.

It is intended (among other things) for MLOps.

Key concepts

Assets (`@asset`)

An asset: a DB table, CSV file, ML model, DataFrame, …
Software defined asset: a Python function that produces an asset (and is decorated with @asset).

In Dagster, assets are materialized.

Dependencies between assets are defined in code by passing them as arguments to other assets.

Dagster keeps track of the lineage of assets using code versions and data versions (similar to GNU make).

Ops (`@op`)

Ops are the core unit of computation in Dagster. Operations should be relatively simple functions. All assets are ops (since they are functions).

A collection of ops can be assembled into a graph (graph-based asset, @graph_asset).

Jobs (`@job`)

Jobs are the main unit of execution in Dagster.

Jobs can be run manually, be scheduled (similar to cron) or triggered by sensors (which you define in code).

When an asset/op is combined with a specific set of parameters it is called a job.

When it is run, it is called a run. E.g. logs are stored for each run.

Partitions

Assets and ops can be partitioned, e.g. by date. It helps with parallelization: Dagster will manage that for you. You can limit the number of concurrent runs, ops, etc.

Resources

Clients for external resources (DBs, cloud storage, …). Wrappers around external services, useful for testing and abstracting away the details of the service.

Resources can be configured at runtime with Config.

IO Managers

I/O managers are special types of Resources: they are used to read and write data from/to external sources. Also useful for persisting intermediate results between Dagster restarts.

Shedule (example)

defs = Definitions(
    ...
    schedules=[
        ScheduleDefinition(
            job=define_asset_job(name="my_asset_job", selection="*"),
            cron_schedule="@daily",
        )
    ],
)

Versioning

By default, every run has a new code version, Dagster expects that your code changes between runs.

You can version code (in asset decorator with code_version parameter).

Dagster auto-generates data versions by hashing the code version together with input data versions. It can be overwritten in Output with data_version=DataVersion(<string_value>).

Running, deployment

dagster dev is for local development,
dagit is for running a production service.

Configuration files

workspace.yaml defines where to find code locations, basically where to find assets, ops, jobs, … (see Workspaces)
dagster.yaml defines runners, number of workers, …