A great feature of Airflow is the fact that you can automate the execution of processes and also deal with possible failures without the need to sit in front of your computer, staring at the screen until the processes finish or until an error comes up. Instead, you can use SLAs and callback functions to automatically restart the process, warn the developers via Slack or Teams or send an email to the CEO if a DAG/task takes longer than you expect it to, or if it completely fails.
In I.T., a service-level agreement (SLA) defines the level of service you expect from a given vendor or process. SLAs can exist between companies and external suppliers, or between two departments within a company. In the Airflow context, the SLA is closely linked with the execution time of a process and its final state (Success, Failure, Skipped…)
A valuable part of process monitoring is to act upon unwanted states of a given task, or across all tasks in a given DAG. For example, you may wish to alert managers when certain tasks are taking longer than expected to finish, or have the last task in your DAG invoke a callback when it succeeds.
Currently, there are four types of callbacks:
on_success_callback | Invoked when the task succeeds |
on_failure_callback | Invoked when the task fails |
sla_miss_callback | Invoked when a task misses its defined SLA |
on_retry_callback | Invoked when the task is up for retry |
The callback function needs to be defined on the DAG level and will be called if any task gets into a state that matches the callback trigger state. For this example if any task on the DAG fails, sla_missed!!!! will be printed:
In Airflow’s context, SLA can be seen as “for how long your DAG can run before you need to do something about it”. You can set any datetime interval you want, and if the execution time of your DAG exceeds that, Airflow will automatically trigger a callback function defined by you. However, Airflow`s SLA has some issues:
You would expect Airflow would not trigger any SLA for this DAG, as the duration of each task did not exceed their respective SLAs, as shown is figure 1:
But this is what really happens:
The solve all these problems, we need to:
Once per minute, by default, the scheduler collects DAG parsing results and checks whether any active tasks can be triggered. It also monitors states and runs the manage_slas subprocess responsible for checking DAGs and tasks SLAs. Because it runs on an interval, quick tasks (less than a minute with default settings) are prone to omit SLA misses, this has to be taken into consideration when dealing with fast and time sensitive tasks.
In this Blogpost we did not get too deep into coding the modifications made to the Airflow library, The patch code that changes the SLA behavior, in order to reflect this more coherent treatment of SLAs in the Airflow DAGS and tasks can be found in our Githubs for Airflow versions 1.10.15 and 2.3.0.