ETL for IATI
Using Apache Hop to process IATI data.
The code processes IATI (International Aid Transparency Initiative) XML data and loads it into a Postgres or Azure SQL database. The data is stored in a dimensional model, so that different analytics platforms can present it in dashboards and reports.
The code uses a spreadsheet with "seed activities" as a starting point. It currently assumes a two-tier approach, with programme activities at the top level and project activities at the bottom level.
The same spreadsheet is also used to define IATI result indicators to show. In addition, it contains a mapping table for disaggregated data. These tables help standardise the presentation of data.
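As an illustration of the two-tier approach, the sketch below starts from a list of seed activity identifiers and collects the project activities linked to them through IATI `related-activity` elements. It is written in Python for readability and is not the Hop pipeline code in this repository. The sample XML, the identifiers and the function name `two_tier` are hypothetical. The type codes 1 (parent) and 2 (child) come from the IATI RelatedActivityType codelist. Links via financial transactions are not covered here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; real IATI files contain far more elements per activity.
SAMPLE_XML = """<iati-activities version="2.03">
  <iati-activity>
    <iati-identifier>XM-EXAMPLE-PROG</iati-identifier>
    <related-activity ref="XM-EXAMPLE-PROJ-1" type="2"/>
    <related-activity ref="XM-EXAMPLE-PROJ-2" type="2"/>
  </iati-activity>
  <iati-activity>
    <iati-identifier>XM-EXAMPLE-PROJ-1</iati-identifier>
    <related-activity ref="XM-EXAMPLE-PROG" type="1"/>
  </iati-activity>
  <iati-activity>
    <iati-identifier>XM-EXAMPLE-OTHER</iati-identifier>
  </iati-activity>
</iati-activities>"""

# IATI RelatedActivityType codelist values used in this sketch.
PARENT = "1"
CHILD = "2"


def two_tier(xml_text, seed_ids):
    """Return {programme id: sorted project ids} for the given seed activities."""
    root = ET.fromstring(xml_text)
    tiers = {seed: set() for seed in seed_ids}
    for activity in root.iter("iati-activity"):
        identifier = (activity.findtext("iati-identifier") or "").strip()
        for rel in activity.iter("related-activity"):
            ref, rel_type = rel.get("ref"), rel.get("type")
            if not ref:
                continue  # skip links without a target identifier
            if identifier in tiers and rel_type == CHILD:
                # The programme activity declares this project as its child.
                tiers[identifier].add(ref)
            elif ref in tiers and rel_type == PARENT:
                # The project activity declares the programme as its parent.
                tiers[ref].add(identifier)
    return {seed: sorted(children) for seed, children in tiers.items()}


tiers = two_tier(SAMPLE_XML, ["XM-EXAMPLE-PROG"])
print(tiers)
```

Reading the relation from both sides means a project is still found when only one of the two publishers has declared the link.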
In addition to IATI data, the following data is used:
- Programme name
A programme (or partnership, joint response) consists of a group of organisations that each publish one or more activities, and that should be presented together.
The programme has a name, and typically a tree structure with a single top-level activity, with links via parent-child relations and financial transactions.
With sufficient data quality, only the top-level activity identifier needs to be specified. If needed, more activities can be explicitly added to the list.
- Results framework
The programme has a results framework it wants to use in its dashboards and reports.
IATI results are presented based on a set of indicator reference codes.
Results are further grouped in "themes", each with one or more results and underlying indicators.
Results and indicators have "official titles" in dashboards and reports, independent of the actual texts published by each organisation.
Dimensions are processed based on a set of standard dimension names and values. Since these are text strings, common variations are translated into an "official text".
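The translation of dimension text variations into an "official text" can be pictured as a lookup table. The sketch below is a minimal Python illustration, not the Hop implementation. The mapping rows, the column order and the names `MAPPING_ROWS` and `standardise` are assumptions made for this example; the actual mapping table lives in the spreadsheet.

```python
# Hypothetical mapping rows:
# (published name, published value, official name, official value)
MAPPING_ROWS = [
    ("sex", "female", "Sex", "Female"),
    ("sex", "f", "Sex", "Female"),
    ("gender", "women", "Sex", "Female"),
    ("sex", "male", "Sex", "Male"),
    ("age", "0-17", "Age", "Under 18"),
]


def _key(text):
    """Normalise a text string: collapse whitespace and ignore case."""
    return " ".join(text.split()).casefold()


# Build the lookup once, keyed on the normalised (name, value) pair.
LOOKUP = {
    (_key(name), _key(value)): (official_name, official_value)
    for name, value, official_name, official_value in MAPPING_ROWS
}


def standardise(name, value):
    """Return the official (name, value) pair, or None when no mapping exists."""
    return LOOKUP.get((_key(name), _key(value)))


print(standardise(" Gender", "WOMEN"))
print(standardise("sex", "unknown"))
```

Returning `None` for unmapped pairs, rather than passing the published text through, makes it easy to list the variations that still need a row in the mapping table.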
This project was originally built on Pentaho Data Integration (PDI).
Kettle was released in 2005 as an open source ETL tool with a visual programming interface. It later became Pentaho Data Integration.
In 2019, the original authors created a fork to allow experimentation and further development. The project was brought under the stewardship of the Apache Software Foundation and named Hop.
To benefit from the more rigorous licensing, as well as the much-improved software setup, we migrated our PDI code to Hop.
Apache Hop code can be run on a local machine, and also be deployed to cloud services via Apache Beam.