Common Voice | User-friendly Deep Learning: Datasets

Speech datasets obtained from the Common Voice project:

Dutch
- Common Voice (10.0-2022-07-04) (2.7GB)
Indonesian
- Common Voice (10.0-2022-07-04) (1.3GB)
- Coqui STT (10.0-2022-07-04) (1.4GB), conversion
Japanese
- Common Voice (10.0-2022-07-04) (1.1GB)
- Coqui STT (10.0-2022-07-04) (1.8GB), conversion
Norwegian Nynorsk
- Common Voice (10.0-2022-07-04) (18.5MB)
- Coqui STT (10.0-2022-07-04) (61.5MB), conversion

Relevant files to use from the archives:

Common Voice
- train.tsv - the training set
- dev.tsv - the validation set
- test.tsv - the test set
Coqui STT
- train.csv - the training set
- dev.csv - the validation set
- test.csv - the test set
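
The split files listed above can be read with Python's standard csv module. The following is a minimal sketch, not part of the original dataset documentation. It assumes the Common Voice TSV files have `path` and `sentence` columns and the Coqui STT CSV files have `wav_filename` and `transcript` columns; check the header row of the downloaded files before relying on these names. The `load_split` helper is a hypothetical name introduced for illustration.

```python
import csv
from pathlib import Path

# Assumed column names per file type (audio file column, transcript column).
# These are assumptions; verify them against the header row of your files.
COLUMNS = {
    ".tsv": ("path", "sentence"),            # Common Voice
    ".csv": ("wav_filename", "transcript"),  # Coqui STT
}


def load_split(split_file):
    """Return a list of (audio_file, transcript) pairs from a split file."""
    split_file = Path(split_file)
    audio_col, text_col = COLUMNS[split_file.suffix]
    is_tsv = split_file.suffix == ".tsv"
    with split_file.open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(
            f,
            delimiter="\t" if is_tsv else ",",
            # Treat quote characters in TSV sentences as ordinary text.
            quoting=csv.QUOTE_NONE if is_tsv else csv.QUOTE_MINIMAL,
        )
        return [(row[audio_col], row[text_col]) for row in reader]
```

Calling `load_split("train.tsv")` or `load_split("train.csv")` on an extracted archive then yields the training pairs; the same call works for the dev and test files.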

Conversion into other formats can be achieved with the wai.annotations library.

License