The #1 thing you need to think about is training data. (Going forward you are going to do a huge amount of manual work, like it or not; the important thing is that you do it efficiently.)
My take is that the PhraseMatcher is about the best thing you will find in spaCy, and I don't think any of the old-style embeddings will really help you. (Word vectors are a complete waste of time! Don't believe me? Fire up scikit-learn and see if you can train a classifier that identifies color words or emotionally charged words: related words are closer in the embedding space than chance, but that doesn't mean you've got useful knowledge there.) Look to the smallest and most efficient side of LLMs, maybe even BERT-based models. I do a lot of simple clustering and classification tasks with these sorts of models:
https://sbert.net/
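If you want to run that word-vector sanity check from the parenthetical above yourself, here's a minimal sketch using spaCy's static vectors and scikit-learn; the word lists and model name are my own illustrative picks:

```python
# Can a classifier pick out color words from static word vectors?
# Accuracy lands above chance, which is the point being made above:
# "better than chance" still isn't useful knowledge.
import spacy
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

nlp = spacy.load("en_core_web_md")  # medium model ships with word vectors

color_words = ["red", "blue", "green", "yellow", "purple", "orange"]
other_words = ["table", "run", "idea", "seven", "quickly", "cloud"]

X = [nlp.vocab[w].vector for w in color_words + other_words]
y = [1] * len(color_words) + [0] * len(other_words)

print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3).mean())
```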
I can train a calibrated model on 10,000 or so documents in three minutes using stuff from sk-learn.
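Roughly, that pipeline looks like this; the model name, toy documents, and labels below are illustrative placeholders, not my actual data:

```python
from sentence_transformers import SentenceTransformer
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# Toy stand-in data: a real run would use ~10k labeled documents.
docs = [
    "Looking for an analyst with strong Excel and pivot table skills",
    "Backend engineer: Python, SQL, and AWS experience required",
    "Data scientist role, proficiency in spreadsheets a plus",
    "Must know C#, .NET, and relational databases",
    "Hiring a barista for weekend shifts",
    "Warehouse associate needed, forklift certification preferred",
    "Friendly receptionist wanted for a dental office",
    "Dog walker needed, afternoons only",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = technical skills mentioned

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast SBERT model
X = encoder.encode(docs)  # one fixed-size vector per document

# LinearSVC is quick but has no predict_proba; the calibration wrapper
# adds probability estimates (cv=2 only because the toy set is tiny).
clf = CalibratedClassifierCV(LinearSVC(), cv=2)
clf.fit(X, labels)
print(clf.predict_proba(encoder.encode(["needs heavy Excel work"])))
```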
Another approach is to treat the problem as segmentation: you want to pick out a list of phrases like "proficiency in spreadsheets" and then feed those phrases through a classifier that turns them into "Excel". Personally I'm interested in running something like BERT first and then training an RNN to light up phrases of that sort or do other classification. The BERT models don't have a good grasp on the order of words in the document, but the RNN fills that gap.
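Here is a rough, untrained sketch of that BERT-then-RNN idea; the tiny BERT checkpoint and the BIO tag scheme are placeholder choices, and the GRU outputs are meaningless until you actually train it on tagged phrases:

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny")
bert = AutoModel.from_pretrained("prajjwal1/bert-tiny")
for p in bert.parameters():
    p.requires_grad = False  # use BERT as a frozen feature extractor

class PhraseTagger(nn.Module):
    """Tags each token as O / B-SKILL / I-SKILL to 'light up' phrases."""
    def __init__(self, hidden=64, n_tags=3):
        super().__init__()
        self.rnn = nn.GRU(bert.config.hidden_size, hidden,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, input_ids, attention_mask):
        feats = bert(input_ids=input_ids,
                     attention_mask=attention_mask).last_hidden_state
        rnn_out, _ = self.rnn(feats)  # RNN adds the sequential view
        return self.out(rnn_out)      # per-token tag logits

enc = tokenizer("Proficiency in spreadsheets and SQL required.",
                return_tensors="pt")
logits = PhraseTagger()(enc["input_ids"], enc["attention_mask"])
print(logits.shape)  # (1, seq_len, 3)
```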
I’ve been sticking with PhraseMatcher because it’s simple, fast, and predictable—but your suggestion about using smaller BERT-based models or embeddings like SBERT (sentence-transformers) is intriguing. I’ve avoided LLMs so far because of the computational overhead, but it sounds like even lightweight models can provide significant value.
Out of curiosity, when training models like SBERT or even smaller BERT versions, do you see diminishing returns with smaller training sets (e.g., a few thousand annotated job descriptions)? My current dataset isn’t huge yet (around 10k), so I wonder where that line starts to appear.
I’ll definitely look more into SBERT and segmentation approaches—thanks for sharing those!
I have a content-based recommender built on SBERT + SVM; it starts to learn with around 500 examples, and I don't think it benefits from having more than about 10,000.
I have also tried fine-tuning BERT models to do the same. It takes at least 30 minutes to make one model (and that's without all the model selection I do with the sk-learn based models), and I never developed a training protocol that reliably did better than my SVM-based model. My impression is that the small BERT models don't have a lot of learning capacity and don't really benefit from 5,000+ documents. Then again, really high accuracy isn't possible with my problem (predicting my own fickle judgements); I feel like I am doing great with an AUC-ROC of 0.78 or so.
Do you think SBERT + SVM is a good fit for handling ambiguous or less common phrases, or do you still end up needing some post-processing rules for edge cases?
I haven't tried classifying anything as small as a phrase (assuming you've already extracted it) with SBERT + SVM, so I really don't know.
Another thing to consider is a T5 model. A T5 model maps strings to strings, so it can be trained to take an input like
"Extract the skills from this resume: ..."
with the output like
"Excel, Pandas, Python, Cold Fusion, C#, ..."
and it will try to do the same. You'll probably still find it makes some mistakes that drive you up the wall and need some pre- or post-processing.
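A bare-bones version of that with Hugging Face transformers might look like the following; "t5-small" and the prompt wording are placeholders, and out of the box the model won't know the task, so you'd fine-tune it on (posting text, skill list) string pairs first:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

prompt = ("Extract the skills from this resume: "
          "5 years building dashboards in Excel and ETL jobs in Python.")
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
# After fine-tuning, the target output would look like:
# "Excel, Pandas, Python, ..."
```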
Have you considered using an LLM API? You could just send it the posting text and come up with a good prompt to extract the most likely required skills.
I’ve been considering an LLM API, and it definitely sounds promising. My main concern is the cost—I'm processing around 300 job offers per day and plan to scale up further. Do you have any go-to APIs you’d recommend that balance performance and pricing?
A while ago I built a prototype app to extract songs and artists from Reddit posts with thousands of comments. I used the Anthropic API, and it was pretty easy to set up and wasn't very expensive. I think you can get a pretty good estimation of how much it would cost beforehand or after a few test runs.
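The core of it was just a call along these lines (the model alias, prompt, and sample posting here are illustrative, not exactly what I used):

```python
# Minimal extraction call with the Anthropic Python SDK.
# Expects ANTHROPIC_API_KEY to be set in the environment.
import anthropic

client = anthropic.Anthropic()

posting_text = "Seeking a data analyst; must know Excel, SQL, and Tableau."
message = client.messages.create(
    model="claude-3-5-haiku-latest",  # pick a cheap model for bulk runs
    max_tokens=200,
    messages=[{
        "role": "user",
        "content": "List the required skills in this job posting as a "
                   "comma-separated list:\n\n" + posting_text,
    }],
)
print(message.content[0].text)
```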
I’ll take a look at that. Good to hear it’s affordable for smaller tasks.