2 comments

  • borxtrk a day ago

    I got tired of missing material corporate events buried in SEC filings, so I built SEC Whisperer - a system that monitors, downloads, and summarizes 8-K filings using Gemini 2.5 Flash.

      Technical Stack:
      - Python pipeline polling the SEC EDGAR API every 2 hours (sketch below)
      - Cloud Run jobs for serverless processing (avoiding cold starts with batch processing)
      - 98% noise reduction on HTML filings before LLM analysis
      - Firebase for real-time publishing to Next.js frontend
      - Gemini with structured JSON output + post-processing to prevent hallucination
    
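      A minimal sketch of what the polling step could look like, assuming the
      public EDGAR "current filings" Atom feed (the endpoint, schedule, and
      User-Agent string here are illustrative; the real pipeline isn't public):

        import time
        import xml.etree.ElementTree as ET

        import requests

        ATOM_NS = "{http://www.w3.org/2005/Atom}"
        FEED_URL = (
            "https://www.sec.gov/cgi-bin/browse-edgar"
            "?action=getcurrent&type=8-K&count=100&output=atom"
        )
        # SEC asks for a descriptive User-Agent on automated requests.
        HEADERS = {"User-Agent": "sec-whisperer-example contact@example.com"}

        def fetch_recent_8ks():
            """Return (title, link, updated) tuples for the latest 8-K filings."""
            resp = requests.get(FEED_URL, headers=HEADERS, timeout=30)
            resp.raise_for_status()
            root = ET.fromstring(resp.content)
            filings = []
            for entry in root.findall(f"{ATOM_NS}entry"):
                title = entry.findtext(f"{ATOM_NS}title", default="")
                updated = entry.findtext(f"{ATOM_NS}updated", default="")
                link_el = entry.find(f"{ATOM_NS}link")
                link = link_el.get("href") if link_el is not None else ""
                filings.append((title, link, updated))
            return filings

        if __name__ == "__main__":
            seen = set()
            while True:
                for title, link, updated in fetch_recent_8ks():
                    if link not in seen:
                        seen.add(link)
                        print(updated, title, link)
                time.sleep(2 * 60 * 60)  # poll every 2 hours, as above
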
      The interesting technical challenges:
      1. SEC filings are massive (40KB+ exhibits). Had to build a sectionizer that
         identifies item boundaries and caps exhibit text at 5KB (770x speedup);
         see the sketch after this list
      2. LLMs hallucinate quarters and M&A tags. Solution: deterministic post-processing
         that strips anything not present in the source text (sketch after this list)
      3. Filing amendments create tricky supersedes/superseded_by relationships in Firestore
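
      A minimal sketch of the sectionizer idea from point 1, using BeautifulSoup
      to strip markup and a regex for "Item N.NN" boundaries (the cap and the
      boundary pattern are illustrative, not the production code):

        import re

        from bs4 import BeautifulSoup  # pip install beautifulsoup4

        SECTION_CAP = 5_000  # ~5KB per section before it goes to the LLM
        ITEM_RE = re.compile(r"item\s+\d+\.\d+", re.IGNORECASE)

        def html_to_text(html: str) -> str:
            """Drop tags, scripts, and styles - the bulk of the HTML noise."""
            soup = BeautifulSoup(html, "html.parser")
            for tag in soup(["script", "style"]):
                tag.decompose()
            return " ".join(soup.get_text(" ").split())

        def sectionize(text: str) -> dict[str, str]:
            """Split plain text on 'Item N.NN' headings and cap each section."""
            matches = list(ITEM_RE.finditer(text))
            sections = {}
            for i, m in enumerate(matches):
                start = m.end()
                end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
                sections[m.group(0)] = text[start:end][:SECTION_CAP]
            return sections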
    
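      A minimal sketch of the post-processing guard from point 2: anything the
      model emits that cannot be found verbatim in the source text gets dropped.
      Field names and the tag list are illustrative, not the actual schema:

        def ground_summary(summary: dict, source_text: str) -> dict:
            """Strip model-produced values that don't appear in the filing."""
            lowered = source_text.lower()

            # Keep the quarter only if the filing literally mentions it, e.g. "Q3 2025".
            quarter = summary.get("quarter")
            if quarter and quarter.lower() not in lowered:
                summary["quarter"] = None

            # Keep only tags whose keyword actually occurs in the source.
            tag_keywords = {"merger": "merger", "acquisition": "acquisition",
                            "divestiture": "divestiture"}
            summary["tags"] = [
                t for t in summary.get("tags", [])
                if tag_keywords.get(t, t).lower() in lowered
            ]
            return summary
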
      Live site: https://secwhisperer.com
      Code: Not open source yet, but happy to discuss architecture
    
      Example output: The site caught Nvidia's $5B Intel deal within minutes of the 
      8-K filing and had AI analysis published before most financial news sites.
    
      Would love feedback from the HN community - especially on the LLM hallucination 
      prevention patterns. What other techniques are you all using?
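
      For the "structured JSON output" point above, a minimal sketch of how the
      Gemini call could be constrained to a schema, assuming the google-genai
      Python SDK and a Pydantic model (field names and prompt are illustrative):

        from google import genai
        from pydantic import BaseModel

        class FilingSummary(BaseModel):
            headline: str
            items: list[str]
            quarter: str | None
            tags: list[str]

        client = genai.Client()  # reads GEMINI_API_KEY from the environment

        def summarize(filing_text: str) -> FilingSummary:
            resp = client.models.generate_content(
                model="gemini-2.5-flash",
                contents="Summarize this 8-K using only facts in the text:\n"
                         + filing_text,
                config={
                    "response_mime_type": "application/json",
                    "response_schema": FilingSummary,
                },
            )
            # The schema constrains the shape, not the facts - hence the
            # deterministic post-processing pass described above.
            return FilingSummary.model_validate_json(resp.text)
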
  • golden-face 20 hours ago

    Can you share any details or samples of the code/prompts especially with regards to "Gemini with structured JSON output" and "LLMs hallucinate quarters and M&A tags. Solution: deterministic post-processing"?

    I recently started using Gemini to perform classification tasks and I have been struggling with 2 things:

    1) Documentation on the input/prompt schema when you want to require structured output

    2) How to enforce outputs like "this key-value output must come from the supplied list of key-values"

    It is really fascinating to work with this tool, if only because it works well on 90% of tasks and then decides to go full stream of consciousness: "hello good day, I know this is a horse race on TV but I cannot find the horse race in the list of car manufacturers you supplied", complete with a random schema for the output.