
THE SOLUTION

We designed and implemented an automated data platform that ingests global data, generates daily forecasts, simulates risk scenarios and evaluates selling strategies under real producer constraints.

An End-to-End Data Platform

The system is built to answer a simple question:

“If I am a coffee producer, is today a good day to sell?”


To do that, it must connect multiple data sources, produce robust forecasts and translate them into actionable strategies.

01

Automated global data ingestion

02

Unified hierarchical dataset across regions and commodities

03

14-day forecasting engine

04

Strategy simulator that reflects real producer constraints

05

Daily automation across AWS and Databricks

What We Built

A robust, scalable infrastructure designed to handle data just like a top-tier financial institution would.

  • Our ingestion layer runs on AWS:

    • Lambda functions perform scheduled API calls

    • EventBridge orchestrates daily jobs

    • S3 acts as our central data lake

     

    In Databricks, we:

    • Run scheduled ETL pipelines

    • Use Delta Live Tables for incremental processing

    • Track experiments and models with MLflow
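
    The ingestion step described above can be sketched as a single Lambda handler. This is a minimal illustration, not the project's actual code: the event fields (`api_url`, `source_name`, `bucket`), the bucket name and the key layout are all assumptions. It assumes an EventBridge schedule invokes the function with a small JSON input, and that the raw response is stored in S3 under a date-partitioned key.

```python
import json
from datetime import date, datetime, timezone


def build_object_key(source: str, run_date: date) -> str:
    """Date-partitioned key so each daily run lands under its own prefix."""
    return f"raw/{source}/dt={run_date.isoformat()}/payload.json"


def handler(event, context, fetch=None, s3_client=None):
    """Entry point for a scheduled invocation.

    `fetch` and `s3_client` are injectable so the function can be exercised
    without network access; inside Lambda they fall back to real clients.
    """
    source = event.get("source_name", "coffee_futures")  # hypothetical field
    bucket = event.get("bucket", "example-data-lake")    # hypothetical bucket

    if fetch is None:
        from urllib.request import urlopen

        def fetch(url):
            with urlopen(url, timeout=30) as resp:
                return json.loads(resp.read())

    if s3_client is None:
        import boto3  # bundled with the AWS Lambda Python runtime

        s3_client = boto3.client("s3")

    payload = fetch(event["api_url"])
    key = build_object_key(source, datetime.now(timezone.utc).date())
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(payload).encode("utf-8"),
    )
    return {"bucket": bucket, "key": key, "records": len(payload)}
```

    Partitioning keys by date is one common way to let downstream incremental jobs pick up only the newest files; the real layout may differ.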

  • Multimodal Data and a Unified Hierarchical Model. The platform integrates:

    • Climate data (temperature, rainfall, humidity)

    • Coffee and sugar futures (OHLCV)

    • Volatility indices such as VIX

    • FX rates

    • Global news and sentiment via GDELT

    • 29 coffee regions and 38 sugar regions

    These inputs are organized into a hierarchical model, so each forecasting configuration can select the features that best fit its needs, whether global averages or region-level signals.
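
    To make the two levels of the hierarchy concrete, the sketch below derives a global average and a region-level signal from the same long-format records. The row layout, region names and the `rainfall_mm` field are hypothetical; the platform's actual schema is not shown here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-format rows: one record per commodity, region and date.
ROWS = [
    {"commodity": "coffee", "region": "region_a", "date": "2025-01-01", "rainfall_mm": 4.0},
    {"commodity": "coffee", "region": "region_b", "date": "2025-01-01", "rainfall_mm": 8.0},
    {"commodity": "coffee", "region": "region_a", "date": "2025-01-02", "rainfall_mm": 0.0},
    {"commodity": "coffee", "region": "region_b", "date": "2025-01-02", "rainfall_mm": 2.0},
    {"commodity": "sugar", "region": "region_c", "date": "2025-01-01", "rainfall_mm": 10.0},
]


def global_average(rows, commodity, feature):
    """Average a feature across all regions of one commodity, per date."""
    by_date = defaultdict(list)
    for row in rows:
        if row["commodity"] == commodity:
            by_date[row["date"]].append(row[feature])
    return {day: mean(values) for day, values in sorted(by_date.items())}


def region_signal(rows, commodity, region, feature):
    """Return the raw feature series for a single region."""
    return {
        row["date"]: row[feature]
        for row in rows
        if row["commodity"] == commodity and row["region"] == region
    }


coffee_global = global_average(ROWS, "coffee", "rainfall_mm")
coffee_region_b = region_signal(ROWS, "coffee", "region_b", "rainfall_mm")
```

    A forecasting configuration could then request either series by name, without needing to know how the underlying rows are stored.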

  • Configurable Model Registry and Automated ML Workflow:

    Instead of hardcoding a single model, we use a configuration-based model registry. Each model is stored as data, including its type, parameters, feature functions and forecast horizon.

     

    This enables:

    • Rapid experimentation across many configurations

    • Parallel training and evaluation

    • Clean, maintainable architecture
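
    One way to store models as data is sketched below. The field names mirror the description above (type, parameters, feature functions, forecast horizon), but the class, the example configurations and the feature functions are illustrative assumptions rather than the project's actual registry.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Any, Callable, Dict, List, Sequence

# Feature functions are referenced by name so configs stay plain data.
FEATURE_FUNCTIONS: Dict[str, Callable[[Sequence[float]], float]] = {
    "last_value": lambda series: series[-1],
    "mean_7d": lambda series: mean(series[-7:]),
}


@dataclass
class ModelConfig:
    name: str
    model_type: str
    params: Dict[str, Any]
    feature_fns: List[str]
    horizon_days: int


# Hypothetical entries; a real registry could load these from JSON or a table.
RAW_CONFIGS = [
    {"name": "naive_last", "model_type": "naive", "params": {},
     "feature_fns": ["last_value"], "horizon_days": 14},
    {"name": "sarimax_weather", "model_type": "sarimax",
     "params": {"order": [1, 1, 1]},
     "feature_fns": ["last_value", "mean_7d"], "horizon_days": 14},
]


def load_registry(raw_configs):
    """Turn plain dictionaries into typed configs keyed by model name."""
    return {cfg["name"]: ModelConfig(**cfg) for cfg in raw_configs}


def build_features(config, series):
    """Resolve the feature-function names listed in a config and apply them."""
    return {name: FEATURE_FUNCTIONS[name](series) for name in config.feature_fns}


registry = load_registry(RAW_CONFIGS)
features = build_features(registry["sarimax_weather"], [1.0, 2.0, 3.0])
```

    Because each entry is independent data, adding a configuration means adding a record, and the entries can be trained and evaluated in parallel.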
       

    Pipeline:

    1. Train – Fit SARIMAX using weather, market and FX features

    2. Backtest – Walk-forward validation across rolling windows

    3. Publish – Generate daily forecasts and 2,000 sample paths
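
    The backtest step can be illustrated with a small walk-forward loop. To keep the sketch self-contained it uses a naive last-value forecaster as a stand-in for SARIMAX, and the window sizes are arbitrary assumptions; only the rolling train/test structure is the point.

```python
from statistics import mean


def walk_forward_splits(n_obs, train_size, horizon, step):
    """Yield (train_indices, test_indices) for a rolling fixed-length window."""
    start = 0
    while start + train_size + horizon <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + horizon)
        yield train, test
        start += step


def naive_forecast(history, horizon):
    """Placeholder model: repeat the last observed value."""
    return [history[-1]] * horizon


def backtest(series, train_size, horizon, step, forecaster=naive_forecast):
    """Return the mean absolute error of each walk-forward fold."""
    errors = []
    for train_idx, test_idx in walk_forward_splits(len(series), train_size, horizon, step):
        history = [series[i] for i in train_idx]
        actual = [series[i] for i in test_idx]
        predicted = forecaster(history, horizon)
        errors.append(mean(abs(a - p) for a, p in zip(actual, predicted)))
    return errors


fold_errors = backtest([float(i) for i in range(10)], train_size=4, horizon=2, step=2)
```

    Each fold trains only on data that precedes its test window, so the error estimates do not use future information.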

  • Forecasts, Risk Distributions and Scenario Analysis

    Each model generates approximately 2,000 simulated price paths over a 14-day horizon. From these paths, we derive:

    • Good, bad and extreme market scenarios

    • Expected price levels

    • Volatility and dispersion

    • Risk metrics such as VaR and CVaR
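
    As a rough illustration of how VaR and CVaR can be read off simulated paths, the sketch below generates 2,000 paths with a simple zero-drift random walk and measures the loss from waiting 14 days instead of selling today. The path generator, the volatility value and the 95% level are assumptions chosen for the example; the platform's paths come from its forecasting models.

```python
import math
import random
from statistics import mean


def simulate_paths(start_price, n_paths=2000, horizon=14, daily_vol=0.02, seed=0):
    """Stand-in path generator: log-normal random walk with no drift."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        price, path = start_price, []
        for _ in range(horizon):
            price *= math.exp(rng.gauss(0.0, daily_vol))
            path.append(price)
        paths.append(path)
    return paths


def var_cvar(paths, start_price, alpha=0.95):
    """Return (VaR, CVaR) of the end-of-horizon loss from waiting.

    Loss is start_price minus terminal price, so positive values mean the
    price fell. VaR is the alpha-quantile of losses; CVaR is the mean of
    the losses at or beyond that quantile.
    """
    losses = sorted(start_price - path[-1] for path in paths)
    cutoff = min(int(alpha * len(losses)), len(losses) - 1)
    tail = losses[cutoff:]
    return tail[0], mean(tail)


paths = simulate_paths(start_price=100.0)
expected_price = mean(path[-1] for path in paths)
var_95, cvar_95 = var_cvar(paths, start_price=100.0)
```

    CVaR is always at least as large as VaR because it averages only the worst outcomes, which is why it is often preferred when extreme scenarios matter.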
       

    For a producer, this translates into a simple but powerful question:
    Is it statistically better to sell now, or to wait?

  • From Forecasts to Decisions: Strategy Simulator

    The strategy simulator recreates the life of a coffee producer within the model:

    • Harvested coffee enters inventory gradually

    • Storage costs increase over time

    • Coffee must be sold before the next harvest

    • Nine strategies are tested: four baselines and five forecast-enhanced
       

    We run these strategies over eight years of historical data to evaluate whether forecasts lead to materially better income.
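
    The constraints listed above can be expressed in a few lines. This is a simplified sketch with made-up prices, forecasts and costs, and with two toy strategies rather than the nine that were tested: harvest enters inventory gradually, storage is charged daily on unsold stock, and anything left on the last day is sold.

```python
def simulate_season(prices, expected_prices, harvest_per_day, harvest_days,
                    storage_cost, strategy):
    """Net income for one season under a given selling strategy.

    `strategy(price, expected_price)` returns the fraction of inventory to
    sell today. Remaining inventory is force-sold on the final day.
    """
    inventory, income = 0.0, 0.0
    last_day = len(prices) - 1
    for day, price in enumerate(prices):
        if day < harvest_days:
            inventory += harvest_per_day      # harvest arrives gradually
        fraction = 1.0 if day == last_day else strategy(price, expected_prices[day])
        sold = fraction * inventory
        income += sold * price
        inventory -= sold
        income -= storage_cost * inventory    # pay storage on what is left
    return income


def sell_immediately(price, expected_price):
    """Baseline: sell everything as soon as it is harvested."""
    return 1.0


def wait_if_forecast_higher(price, expected_price):
    """Forecast-aware: hold while the forecast exceeds today's price."""
    return 0.0 if expected_price > price else 1.0


# Illustrative numbers only, not project data.
prices = [100.0, 100.0, 110.0, 120.0]
expected = [110.0, 120.0, 105.0, 120.0]

baseline_income = simulate_season(prices, expected, 10.0, 2, 1.0, sell_immediately)
forecast_income = simulate_season(prices, expected, 10.0, 2, 1.0, wait_if_forecast_higher)
```

    Running every strategy through the same function keeps the comparison fair: all of them face identical harvest timing, storage costs and the end-of-season deadline.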

Current Results

t-statistic: -0.26

p-value: 0.80

Effect size: Negligible

Our current models do not yet outperform the best baseline strategy in a statistically significant way (p-value ≈ 0.80, negligible effect size). This is expected at an early stage and highlights the need for more expressive models and richer features.
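
For readers unfamiliar with these statistics, the sketch below shows how a t-statistic and an effect size can be computed when two strategies are compared over the same periods. The page does not state which test was used, so the paired design, the Cohen's d formula and the income figures are all assumptions for illustration. The p-value needs the t distribution, which the Python standard library does not provide; `scipy.stats.ttest_rel` is one option.

```python
import math
from statistics import mean, stdev


def paired_t_and_effect(forecast_income, baseline_income):
    """Return (t_statistic, cohen_d) for paired per-period incomes."""
    diffs = [f - b for f, b in zip(forecast_income, baseline_income)]
    n = len(diffs)
    sd = stdev(diffs)                        # sample standard deviation
    t_stat = mean(diffs) / (sd / math.sqrt(n))
    cohen_d = mean(diffs) / sd               # effect size for paired data
    return t_stat, cohen_d


# Made-up yearly incomes, purely illustrative.
forecast = [101.0, 98.0, 103.0, 99.0]
baseline = [100.0, 98.0, 102.0, 100.0]
t_stat, cohen_d = paired_t_and_effect(forecast, baseline)
```

A t-statistic close to zero together with a small effect size means the average income difference is small relative to its year-to-year variation.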

Limitations

  • SARIMAX is sensitive to extreme shocks
     

  • Climate data has limited regional resolution
     

  • Sentiment features are not fully integrated
     

  • Forecast horizon is restricted to 14 days
     

  • Producers still lack direct, low-bandwidth access (e.g., WhatsApp)

Road Map

  • Integrate global sentiment from GDELT into production models
     

  • Experiment with LSTM and TimesFM for sequence forecasting
     

  • Explore ensemble forecasting across multiple model families
     

  • Automate end-to-end daily runs
     

  • Prototype a WhatsApp interface for producers with limited internet access


SOCIAL IMPACT

This system is not only about data and infrastructure. It is about economic dignity and fairness. By making market intelligence accessible to small producers, we move one step closer to a future where the value of their work is protected with the same tools used by global traders.

Who Built This Project

Every model, dataset and decision in this project was shaped by a team committed to supporting the livelihoods of coffee-growing families. Meet the people who brought this mission to life.
