5 comments

  • PaulShin a day ago

    This is a fantastic question that gets to the heart of a huge pain point in applied AI. You're not missing something obvious; you've just discovered a gap in the market that we're obsessed with solving at my startup, Markhub.

    Your problem of cleaning a dataset ("remove all records without foul language") is functionally identical to the problem our users face every day: cleaning up messy team conversations ("turn this chaotic chat into a clear list of tasks").

    Our approach has been to build an AI agent, MAKi, that acts as an interface layer on top of unstructured data. Instead of writing complex scripts, our users simply talk to MAKi.

    For example, they can highlight a long conversation and give it a prompt like, "Extract all action items from this, assign them to the relevant person, and set due dates for next Friday."

    MAKi parses the request, understands the context, and generates structured To-Do items, effectively "cleaning" the conversational data into an actionable format. We call this a "Conversation-Driven Workflow."

    While our use case is collaboration, the underlying technology to "interact with a dataset via prompts" is exactly what you're looking for. It seems like the next wave of AI tools won't just be models, but intuitive interfaces for manipulating data with natural language.

  • jonahbenton 3 days ago

    +1. Have used AI to write code for me for various cleaning steps, but iterating over data cleaning is usually a process of discovery, both about the data and about the requirements, especially once problems surface. Have not found a conversational workflow tool capable of working at the right level of abstraction to be useful. Curious if any folks have.

  • constantinum 3 days ago

    There is https://www.visitran.com/, which is still in closed beta.

  • kevinherron 3 days ago

    Submit for batch processing using the OpenAI batch API?
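
    A minimal sketch of what that could look like with the `openai` Python SDK, assuming one JSONL request line per record; the model name, file paths, and the exact `{english, foul}` prompt are illustrative, not from the thread:

    ```python
    # Rough sketch: one classification request per record, written to a JSONL
    # file and submitted via the OpenAI Batch API.
    # Assumes the `openai` Python SDK, OPENAI_API_KEY in the environment, and a
    # records.jsonl with a "text" field per line; the model name is a placeholder.
    import json
    from openai import OpenAI

    client = OpenAI()

    SYSTEM = (
        "Classify the record. Reply with JSON only, shaped like "
        '{"english": true, "foul": false}.'
    )

    # Build the batch input file: one request object per record.
    with open("records.jsonl") as src, open("batch_input.jsonl", "w") as dst:
        for i, line in enumerate(src):
            record = json.loads(line)
            dst.write(json.dumps({
                "custom_id": f"record-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",
                    "messages": [
                        {"role": "system", "content": SYSTEM},
                        {"role": "user", "content": record["text"]},
                    ],
                    "response_format": {"type": "json_object"},
                },
            }) + "\n")

    # Upload the file and start the batch; results land in an output file once
    # the batch finishes (anywhere up to the 24h completion window).
    batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
    print(batch.id, batch.status)
    ```

    The output file is keyed by `custom_id`, so the foul/not-foul verdicts can be joined back onto the original records when the batch completes.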

    • sprobertson 3 days ago

      This, in combination with making the tool call / JSON response itself "batched", is a good pattern. Instead of returning a single `{english, foul}` object per record, pass in an array of records and have it return an array of `{english, foul}` objects. Adjust the inner batch size to your record size and spread the rest over the Batch API.
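
      A rough sketch of that inner-batch shape, assuming the `openai` Python SDK and plain JSON mode; the chunk size, model name, and `classify_chunk` helper are illustrative. In practice each chunk's request would become one line in the Batch API input file rather than a synchronous call:

      ```python
      # Sketch of the inner-batch pattern: N records per request, one verdict
      # per record back. Chunk size, model name, and file paths are placeholders.
      import json
      from openai import OpenAI

      client = OpenAI()
      CHUNK_SIZE = 20  # tune to record length vs. context window

      SYSTEM = (
          "For each numbered record, return JSON shaped like "
          '{"results": [{"english": true, "foul": false}, ...]} '
          "with exactly one entry per record, in the same order."
      )

      def classify_chunk(texts: list[str]) -> list[dict]:
          numbered = "\n".join(f"{i}: {t}" for i, t in enumerate(texts))
          resp = client.chat.completions.create(
              model="gpt-4o-mini",
              messages=[
                  {"role": "system", "content": SYSTEM},
                  {"role": "user", "content": numbered},
              ],
              response_format={"type": "json_object"},
          )
          results = json.loads(resp.choices[0].message.content)["results"]
          assert len(results) == len(texts), "model dropped or added entries"
          return results

      # Keep only the records the model flags as containing foul language.
      records = [json.loads(line)["text"] for line in open("records.jsonl")]
      kept = []
      for start in range(0, len(records), CHUNK_SIZE):
          chunk = records[start:start + CHUNK_SIZE]
          for text, verdict in zip(chunk, classify_chunk(chunk)):
              if verdict["foul"]:
                  kept.append(text)
      ```

      The count assertion matters: on longer chunks the model can occasionally merge or drop entries, which is usually the signal to shrink the chunk size.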