S3 lifecycle policies and scheduled RDBMS jobs are the low-hanging fruit here.
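The S3 side of that can be as small as one tag-scoped lifecycle rule. A rough sketch (the bucket name, the `contains_pii` tag, and the 90-day window are all invented for illustration, not details from any real setup):

```python
import json

# Sketch of a tag-scoped S3 lifecycle rule: expire only objects that
# carry a specific tag, after a fixed retention window.
lifecycle = {
    "Rules": [
        {
            "ID": "expire-tagged-pii",
            "Status": "Enabled",
            # Only objects tagged as PII are touched; everything else is left alone.
            "Filter": {"Tag": {"Key": "contains_pii", "Value": "true"}},
            "Expiration": {"Days": 90},
        }
    ]
}

# Applied with something like:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="analytics-bucket", LifecycleConfiguration=lifecycle)
print(json.dumps(lifecycle, indent=2))
```

The point of filtering on a tag rather than a prefix is that retention follows the data classification, not the directory layout.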
I used to work on a data platform team and built a cleaning service that used tags and object-hierarchy trees to find and clean old PII data. Not an easy thing to do, as our data analytics bucket held over 7 PiB of data.
Overall the architecture was based on three components: detector, enforcer, cleaner. The detector sifted through the data lake to find PII datasets (LLM-based), the enforcer tracked down the ETL for those datasets in our VCS to set the appropriate tags/metadata (a custom coding agent), and finally the cleaner used search to find and clean the data based on that metadata.
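A toy version of the cleaner's decision step might look like this; the `data_class` tag name, the 90-day PII retention, and the object shape are all assumptions for the sketch, not details of the actual system:

```python
from datetime import datetime, timedelta, timezone

# Toy sketch of the cleaner stage: given object metadata as the enforcer
# would have tagged it, pick the keys whose retention window has passed.
RETENTION = {"pii": timedelta(days=90)}  # assumed policy

def is_expired(obj, now):
    """obj is a dict with 'key', 'tags', and 'last_modified'."""
    policy = RETENTION.get(obj["tags"].get("data_class"))
    return policy is not None and now - obj["last_modified"] > policy

def plan_deletions(objects, now=None):
    now = now or datetime.now(timezone.utc)
    return [o["key"] for o in objects if is_expired(o, now)]

now = datetime(2025, 1, 1, tzinfo=timezone.utc)
objs = [
    {"key": "raw/users.parquet", "tags": {"data_class": "pii"},
     "last_modified": now - timedelta(days=200)},
    {"key": "agg/daily.parquet", "tags": {},
     "last_modified": now - timedelta(days=400)},  # untagged: kept
]
print(plan_deletions(objs, now))  # only the tagged, stale key
```

Splitting "decide what to delete" from "actually delete it" also gives you a natural place for audit logging and a human-approval step before anything is destroyed.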
I feel like this should be a service in itself, similar to Heroku or Supabase. Just tick which laws you want to adhere to and upload files to their buckets. Tick another box for audit logs and it'll ask where you need a human in the loop and which buttons those humans need to press. A bit like Carta or Deel in that sense.
I've had some big enterprise deals fall through because of something like this - military, insurance, fintech, etc.
Mostly cron jobs and lifecycle rules in my experience; it's rarely clean. S3 lifecycle policies handle the easy stuff, but anything touching multiple systems usually ends up as a scheduled job that someone wrote once and nobody fully trusts.
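The scheduled-job half usually amounts to a batched DELETE run from cron. A minimal sketch against SQLite (the table name, column, and retention window are invented; batching keeps each run short and makes reruns safe):

```python
import sqlite3

def purge_expired(conn, days=90, batch=1000):
    """Delete rows older than `days` in batches; returns total deleted.
    Schema is hypothetical: an 'events' table with a 'created_at' timestamp."""
    deleted = 0
    while True:
        cur = conn.execute(
            "DELETE FROM events WHERE rowid IN ("
            "  SELECT rowid FROM events"
            "  WHERE created_at < datetime('now', ?)"
            "  LIMIT ?)",
            (f"-{days} days", batch),
        )
        conn.commit()
        if cur.rowcount == 0:
            return deleted
        deleted += cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (created_at TEXT)")
conn.execute("INSERT INTO events VALUES (datetime('now', '-200 days'))")
conn.execute("INSERT INTO events VALUES (datetime('now'))")
print(purge_expired(conn))  # deletes only the stale row
```

Batching with a LIMITed subquery avoids holding one giant transaction open, which is exactly the kind of detail that gets lost when "someone wrote it once" and nobody revisits it.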