Good ways to obtain a database for a new language are:
- Manually segment audio recordings that have an existing transcription (podcasts, news, etc.)
- Record your friends, family, and colleagues
- Set up automated collection on Voxforge
You have to design database prompts and postprocess the results to ensure that the audio actually corresponds to the prompts. The file structure for the database is:
- etc
  - your_db.dic - Phonetic dictionary
  - your_db.phone - Phoneset file
  - your_db.lm.DMP - Language model
  - your_db.filler - List of fillers
  - your_db_train.fileids - List of files for training
  - your_db_train.transcription - Transcription for training
  - your_db_test.fileids - List of files for testing
  - your_db_test.transcription - Transcription for testing
- wav
  - speaker_1
    - file_1.wav - Recording of speech utterance
  - speaker_2
    - file_2.wav
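
Before starting training it can help to confirm that the database directory matches this layout. The following is a minimal Python sketch, not part of the training tools: the function name `missing_entries` and the constant `ETC_SUFFIXES` are hypothetical, and the check assumes only what the listing shows (eight files under `etc` named after the database, and at least one `.wav` file inside a speaker directory under `wav`).

```python
from pathlib import Path

# Suffixes of the files that the layout places under etc/.
# The database name (for example "your_db") is prepended to each one.
ETC_SUFFIXES = [
    ".dic",
    ".phone",
    ".lm.DMP",
    ".filler",
    "_train.fileids",
    "_train.transcription",
    "_test.fileids",
    "_test.transcription",
]


def missing_entries(root, db_name):
    """Return the expected paths (relative to root) that are not present."""
    root = Path(root)
    missing = []

    # Every file listed under etc/ must exist as a regular file.
    for suffix in ETC_SUFFIXES:
        rel = Path("etc") / f"{db_name}{suffix}"
        if not (root / rel).is_file():
            missing.append(rel.as_posix())

    # Require at least one recording of the form wav/<speaker>/<file>.wav.
    wav_dir = root / "wav"
    if not (wav_dir.is_dir() and any(wav_dir.glob("*/*.wav"))):
        missing.append("wav/<speaker>/<file>.wav")

    return missing


if __name__ == "__main__":
    # Check the current directory for a database called "your_db".
    for entry in missing_entries(".", "your_db"):
        print("missing:", entry)
```

Run from the database root, the script prints one line per missing entry and nothing when the layout is complete. It only checks that the files exist; it does not verify that the recordings match their transcriptions.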