2 comments

  • borxtrk a day ago

    I got tired of missing material corporate events buried in SEC filings, so I built SEC Whisperer - a system that monitors, downloads, and summarizes 8-K filings using Gemini 2.5 Flash.

      Technical Stack:
      - Python pipeline polling the SEC EDGAR API every 2 hours (sketch below)
      - Cloud Run jobs for serverless processing (avoiding cold starts with batch processing)
      - 98% noise reduction on HTML filings before LLM analysis
      - Firebase for real-time publishing to Next.js frontend
      - Gemini with structured JSON output + post-processing to prevent hallucination
    
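      A minimal sketch of what the polling step could look like, assuming the
      public EDGAR "current filings" Atom feed (the endpoint, schedule, and
      User-Agent string here are illustrative; the real pipeline isn't public):

        import time
        import xml.etree.ElementTree as ET

        import requests

        ATOM_NS = "{http://www.w3.org/2005/Atom}"
        FEED_URL = (
            "https://www.sec.gov/cgi-bin/browse-edgar"
            "?action=getcurrent&type=8-K&count=100&output=atom"
        )
        # SEC asks for a descriptive User-Agent on automated requests.
        HEADERS = {"User-Agent": "sec-whisperer-example contact@example.com"}

        def fetch_recent_8ks():
            """Return (title, link, updated) tuples for the latest 8-K filings."""
            resp = requests.get(FEED_URL, headers=HEADERS, timeout=30)
            resp.raise_for_status()
            root = ET.fromstring(resp.content)
            filings = []
            for entry in root.findall(f"{ATOM_NS}entry"):
                title = entry.findtext(f"{ATOM_NS}title", default="")
                updated = entry.findtext(f"{ATOM_NS}updated", default="")
                link_el = entry.find(f"{ATOM_NS}link")
                link = link_el.get("href") if link_el is not None else ""
                filings.append((title, link, updated))
            return filings

        if __name__ == "__main__":
            seen = set()
            while True:
                for title, link, updated in fetch_recent_8ks():
                    if link not in seen:
                        seen.add(link)
                        print(updated, title, link)
                time.sleep(2 * 60 * 60)  # poll every 2 hours, as above
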
      The interesting technical challenges:
      1. SEC filings are massive (40KB+ exhibits). Had to build a sectionizer that
         identifies item boundaries and caps exhibit text at 5KB (770x speedup);
         see the sketch after this list
      2. LLMs hallucinate quarters and M&A tags. Solution: deterministic post-processing
         that strips anything not present in the source text (sketch after this list)
      3. Filing amendments create tricky supersedes/superseded_by relationships in Firestore
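
      A minimal sketch of the sectionizer idea from point 1, using BeautifulSoup
      to strip markup and a regex for "Item N.NN" boundaries (the cap and the
      boundary pattern are illustrative, not the production code):

        import re

        from bs4 import BeautifulSoup  # pip install beautifulsoup4

        SECTION_CAP = 5_000  # ~5KB per section before it goes to the LLM
        ITEM_RE = re.compile(r"item\s+\d+\.\d+", re.IGNORECASE)

        def html_to_text(html: str) -> str:
            """Drop tags, scripts, and styles - the bulk of the HTML noise."""
            soup = BeautifulSoup(html, "html.parser")
            for tag in soup(["script", "style"]):
                tag.decompose()
            return " ".join(soup.get_text(" ").split())

        def sectionize(text: str) -> dict[str, str]:
            """Split plain text on 'Item N.NN' headings and cap each section."""
            matches = list(ITEM_RE.finditer(text))
            sections = {}
            for i, m in enumerate(matches):
                start = m.end()
                end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
                sections[m.group(0)] = text[start:end][:SECTION_CAP]
            return sections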
    
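      A minimal sketch of the post-processing guard from point 2: anything the
      model emits that cannot be found verbatim in the source text gets dropped.
      Field names and the tag list are illustrative, not the actual schema:

        def ground_summary(summary: dict, source_text: str) -> dict:
            """Strip model-produced values that don't appear in the filing."""
            lowered = source_text.lower()

            # Keep the quarter only if the filing literally mentions it, e.g. "Q3 2025".
            quarter = summary.get("quarter")
            if quarter and quarter.lower() not in lowered:
                summary["quarter"] = None

            # Keep only tags whose keyword actually occurs in the source.
            tag_keywords = {"merger": "merger", "acquisition": "acquisition",
                            "divestiture": "divestiture"}
            summary["tags"] = [
                t for t in summary.get("tags", [])
                if tag_keywords.get(t, t).lower() in lowered
            ]
            return summary
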
      Live site: https://secwhisperer.com
      Code: Not open source yet, but happy to discuss architecture
    
      Example output: The site caught Nvidia's $5B Intel deal within minutes of the 
      8-K filing and had AI analysis published before most financial news sites.
    
      Would love feedback from the HN community - especially on the LLM hallucination 
      prevention patterns. What other techniques are you all using?
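
      For the "structured JSON output" point above, a minimal sketch of how the
      Gemini call could be constrained to a schema, assuming the google-genai
      Python SDK and a Pydantic model (field names and prompt are illustrative):

        from google import genai
        from pydantic import BaseModel

        class FilingSummary(BaseModel):
            headline: str
            items: list[str]
            quarter: str | None
            tags: list[str]

        client = genai.Client()  # reads GEMINI_API_KEY from the environment

        def summarize(filing_text: str) -> FilingSummary:
            resp = client.models.generate_content(
                model="gemini-2.5-flash",
                contents="Summarize this 8-K using only facts in the text:\n"
                         + filing_text,
                config={
                    "response_mime_type": "application/json",
                    "response_schema": FilingSummary,
                },
            )
            # The schema constrains the shape, not the facts - hence the
            # deterministic post-processing pass described above.
            return FilingSummary.model_validate_json(resp.text)
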
  • golden-face 20 hours ago

    Can you share any details or samples of the code/prompts especially with regards to "Gemini with structured JSON output" and "LLMs hallucinate quarters and M&A tags. Solution: deterministic post-processing"?

    I recently started using Gemini to perform classification tasks and I have been struggling with 2 things:

    1) Documentation on the input/prompt schema when you want to require structured output

    2) How to enforce outputs like "this key-value output must come from the supplied list of key-values"

    It is really fascinating to work with this tool, if only because it works well on 90% of tasks and then decides to go full stream of consciousness: "hello good day, I know this is a horse race on TV but I cannot find the horse race in the list of car manufacturers you supplied", complete with a random schema for the output.