Santona Tuli: Data quality in streaming data workflows.

Santona Tuli, PhD, began her data journey in fundamental physics, searching massive event datasets from particle collisions at CERN for rare particles. She has since extended her machine learning engineering to natural language processing, before shifting focus to product and data engineering for data-workflow authoring frameworks. At Astronomer, she helped improve the development experience for data science and machine learning pipelines in Airflow, the popular data orchestration tool. Currently at Upsolver, she leads data engineering and science, driving developer research and product strategy for declarative workflow authoring in SQL. Dr. Tuli is passionate about building end-to-end data and ML pipelines at scale, and about empowering others to do the same.

Data quality in streaming data workflows

Streaming data sources can be challenging to incorporate into data workflows because it is difficult to check for quality issues and inconsistencies while data are in motion. Upstream data validation helps minimize the detrimental effects of creeping quality issues on both finances and customer experience. Many consistency guarantees, such as exactly-once processing and total ordering, require holding state well beyond the current record being processed, and every streaming engine must trade delivery speed against the state it keeps in memory. In most cases, though, and particularly for analytics use cases, the benefit of enforcing quality at ingestion far outweighs the cost of slightly slower delivery. I'll share how implementing event-time-based one-minute microbatching, combined with continuous delivery, helped us ensure exactly-once consistency and strong ordering of data, while keeping spending predictable with auto-scaling compute.
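To make the idea concrete, here is a minimal sketch (not Upsolver's implementation) of event-time one-minute microbatching: records are bucketed into windows by event time, deduplicated by record id within each window for exactly-once behavior, and each window is emitted sorted by event time once the watermark has passed it. The `MicrobatchBuffer` class and its record shape (`id`, `event_time`) are illustrative assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # one-minute microbatches keyed by event time


class MicrobatchBuffer:
    """Illustrative sketch: buffers records into event-time windows,
    deduplicates by record id (exactly-once within the buffer), and
    emits each window sorted by event time once the watermark
    (max event time observed) has moved past the window's end."""

    def __init__(self, allowed_lateness=0):
        self.windows = defaultdict(dict)   # window_start -> {record_id: record}
        self.watermark = 0                 # highest event time seen so far
        self.allowed_lateness = allowed_lateness

    def ingest(self, record):
        # record: dict with "id" and "event_time" (epoch seconds)
        window = record["event_time"] // WINDOW_SECONDS * WINDOW_SECONDS
        # idempotent upsert: a redelivered record overwrites itself (dedupe)
        self.windows[window][record["id"]] = record
        self.watermark = max(self.watermark, record["event_time"])

    def flush_ready(self):
        """Yield (window_start, batch) for every window whose end, plus
        allowed lateness, precedes the watermark, in window order."""
        ready = sorted(
            w for w in self.windows
            if w + WINDOW_SECONDS + self.allowed_lateness <= self.watermark
        )
        for w in ready:
            batch = sorted(self.windows.pop(w).values(),
                           key=lambda r: r["event_time"])
            yield w, batch


buf = MicrobatchBuffer()
buf.ingest({"id": "a", "event_time": 5})
buf.ingest({"id": "a", "event_time": 5})   # duplicate delivery, collapsed
buf.ingest({"id": "b", "event_time": 30})
buf.ingest({"id": "c", "event_time": 65})  # advances watermark past window 0
for start, batch in buf.flush_ready():
    print(start, [r["id"] for r in batch])  # window 0 emits a, b in order
```

The key design choice this illustrates is trading a bounded amount of in-memory state (at most a window plus allowed lateness) for the ordering and deduplication guarantees; the watermark decides when a batch is complete enough to deliver.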
