In-Depth Data Handling: Missing Files & Data Validation

For a deeper technical insight, this section explains how to handle missing files and manage Data Validation. Practical examples are provided to guide you through these processes effectively.

💡

Basic Definition

Check Data Exists is a checkpoint in the ETL that makes sure the files a client has sent have the expected amount of data (in terms of row count). The client decides the criteria for Data Validation during onboarding.

Intro

At the end of each ETL, Optimove checks all client-selected tables to confirm a reliable data load for the day, using specific parameters to detect any faults. Clients can request parameter adjustments as needed.

Changeable ETL Settings

  • CompareAvg
    Determines whether to compare the daily number of rows to the average. If false, the component will only check that data with yesterday’s date was received in the table (at least 1 row). The default is True.
  • DaysBackwards
    How many days backward should be used to calculate the average number of records? The default value is 180.
  • IsFail
    Determine whether an unreliable number of rows in the table should fail the process or not. If the isFail is true, then the process should fail, and if false it should only notify relevant stakeholders and continue with the daily process. The default value is True.
  • MultFactor
    Threshold, decimal number between 0 and 1. The default value is 0.2 (20%). For example, if the daily number of records is X and the average number of records is Y the procedure will fail the process if X<0.2*Y.
  • DaysToIgnore
    List of weekdays in which the table is excluded from this check procedure. The default value is NULL (no exclusion). Optional values: 1,2,3,4,5,6,7 (where Sunday =1, Monday =2 etc.) A common use case: ‘1,7’ ->Sunday, Saturday
  • ISMEDIAN
    If the client wants to use Median instead of average as a check. We usually offer it to clients with outliers. Some clients will send a lot of data in 1 day, which will affect the average, but the median will have less effect.

Use Cases

  1. GAMES is an incremental table so the client configured it with CompareAvg =1 and MultFactor = 0.2
  2. Bonus is an incremental table, but not every day has new rows, so the client configured it CompareAvg =0, in addition, the client doesn't want to fail the process if Optimove doesn't receive new rows, he just wants to get a notification on that issue.
  3. Since Game_Types is a full table, the Optimove data team configured it to check if the file received contains at least one row. In addition the client does not upload a new file during the weekend, so he configured the DaysToIgnore = {1,7}.

On 2024-03-03 the client sends the following amount on each file.

Games: 50 rows

Bonus : 0 rows

Game_Types : 0 rows

For each table, we will now check whether the daily process of the Client will fail:

  • Games table: Calculation that determines whether a process will fail or not:

300 (client's average row per day is 300 rows)*0.2(MultFactor ) > 50(new rows received today), on this case the daily process will fail and the client will get a notification about it.

  • Bonus table: We received 0 rows however the client configured to not fail the daily if the file is empty. The client's daily process will proceed and the client will only receive a notification.
  • Game_Types table: We received 0 rows however, the weekday is Sunday. The client's daily process will proceed and the client will only receive a notification