Edit log:###
2018.09.25
This article is now expanded to an article series, where we have more detailed discussion and open-source code, check them out!
2018.08.26 - Updated the data schema:
- Have an unified format that covers both lightweight format and standard format, but more flexible and self-explained.
- Specified mandatory fields and optional field in the format. For example, Timestamp is now an optional field.
Introduction
If we say “Data is the new oil”, then data lineage is an issue that we must to solve. Various data sets are generated (most likely by sensors), transferred, processed, aggregated and flowed from upstream to downstream.
The goal of data lineage is to track data over its entire lifecycle, to gain a better understanding of what happens to data as it moves through the course of its life. It increases trust and acceptance of result of data process. It also helps to trace errors back to the root cause, and comply with laws and regulations.

You can easily compare this with the traditional supply chain of raw materials in manufacturing industry and/or logistic industry. However, compares to the traditional industries, data lineage are facing new challenges.