Speech datasets obtained from the Common Voice project:
- Dutch
  - Common Voice (10.0-2022-07-04) (2.7GB)
- Indonesian
  - Common Voice (10.0-2022-07-04) (1.3GB)
  - Coqui STT (10.0-2022-07-04) (1.4GB), conversion
- Japanese
  - Common Voice (10.0-2022-07-04) (1.1GB)
  - Coqui STT (10.0-2022-07-04) (1.8GB), conversion
- Norwegian Nynorsk
  - Common Voice (10.0-2022-07-04) (18.5MB)
  - Coqui STT (10.0-2022-07-04) (61.5MB), conversion
Relevant datasets to use from the archives:
- Common Voice
  - train.tsv - the training set
  - dev.tsv - the validation set
  - test.tsv - the test set
- Coqui STT
  - train.csv - the training set
  - dev.csv - the validation set
  - test.csv - the test set
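The split files listed above can be read with Python's standard `csv` module. The following is a minimal sketch, not part of the datasets themselves: the helper name `load_split` and the format labels `"common-voice"` and `"coqui-stt"` are hypothetical, and it assumes the archive has already been extracted so that the split files sit directly in one directory. It relies only on each file having a header row, so no particular column names are assumed.

```python
import csv
from pathlib import Path

# Split names as listed above; the file extension depends on the archive format.
SPLITS = ("train", "dev", "test")


def load_split(archive_dir, split, fmt="common-voice"):
    """Return the rows of one split as a list of dicts keyed by column name.

    Hypothetical helper: `fmt` is "common-voice" (tab-separated .tsv)
    or "coqui-stt" (comma-separated .csv).
    """
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    if fmt == "common-voice":
        # Sentences may contain literal quote characters, so quoting is disabled.
        suffix, options = ".tsv", {"delimiter": "\t", "quoting": csv.QUOTE_NONE}
    elif fmt == "coqui-stt":
        suffix, options = ".csv", {"delimiter": ","}
    else:
        raise ValueError(f"unknown format: {fmt!r}")
    path = Path(archive_dir) / f"{split}{suffix}"
    with open(path, newline="", encoding="utf-8") as handle:
        # The first row of the file is used as the column names.
        return list(csv.DictReader(handle, **options))
```

For example, `load_split("extracted/archive", "dev", fmt="coqui-stt")` would read `dev.csv` from the placeholder directory `extracted/archive`.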
Conversion into other formats can be achieved with the wai.annotations library.
License