New version of our Open STT dataset (0.5) is now in beta
Please share and repost!
What is new?
- A new domain - radio (1000+ new hours);
- A larger YouTube dataset with 1000+ additional hours;
- A small (300 hours) YouTube dataset downloaded in maximum quality;
- Manually annotated ground-truth validation sets for YouTube / books / public calls;
- From now on we will focus on actually cleaning and distilling the dataset. We have published a second list of "bad" data.
I'm back from vacation)