From Data Science

Retraining strategy with evolving classes + imbalanced labels?

Hi all — I’m looking for advice on the best retraining strategy for a multi-class classifier in a setting where the label space can evolve. Right now I have about 6 labels, but I don’t know how many will show up over time, and some labels appear inconsistently or disappear for long stretches. My initial labeled dataset is ~6,000 rows and it’s extremely imbalanced: one class dominates and the smallest class has only a single example. New data keeps coming in, and my boss wants us to retrain using the model’s inferences plus the human corrections made afterward by someone with domain knowledge. I have concerns about retraining on inferences, but that's a different story.

Given this setup, should retraining typically use all accumulated labeled data, a sliding window of recent data, or something like a recent window plus a replay buffer for rare but important classes? Would incremental/online learning (e.g., partial_fit style updates or stream-learning libraries) help here, or is periodic full retraining generally safer with this kind of label churn and imbalance? I’d really appreciate any recommendations on a robust policy that won’t collapse into the dominant class, plus how you’d evaluate it (e.g., fixed “golden” test set vs rolling test, per-class metrics) when new labels can appear.
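
To make the "recent window plus replay buffer" option and the per-class evaluation concrete, here is a rough sketch of what I have in mind. It is plain Python under assumptions of my own: the row keys `ts` and `label`, the function names, and the defaults `window_days=90` and `min_per_class=50` are all hypothetical placeholders, not a tested policy.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def build_training_set(rows, now, window_days=90, min_per_class=50):
    """Recent window, topped up with older rows for under-represented classes.

    rows: list of dicts with hypothetical keys "ts" (datetime) and "label".
    """
    cutoff = now - timedelta(days=window_days)
    recent = [r for r in rows if r["ts"] >= cutoff]
    older = [r for r in rows if r["ts"] < cutoff]

    # Count what the recent window already provides for each class.
    recent_counts = defaultdict(int)
    for r in recent:
        recent_counts[r["label"]] += 1

    # Replay buffer: walk older rows newest-first and keep a row only
    # while its class is still below the per-class floor.
    older.sort(key=lambda r: r["ts"], reverse=True)
    replay = []
    added = defaultdict(int)
    for r in older:
        label = r["label"]
        if recent_counts[label] + added[label] < min_per_class:
            replay.append(r)
            added[label] += 1
    return recent + replay


def per_class_recall(y_true, y_pred):
    """Recall per label, so a collapse into the dominant class is visible."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return {label: hits[label] / totals[label] for label in totals}
```

The idea is that rare classes that vanished from the recent window still reach the model through the replay rows, while the dominant class is limited to what the window contains. Averaging the `per_class_recall` values gives a macro recall that does not reward predicting only the dominant class.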

submitted by /u/fleeced-artichoke