Mozilla has publicly released its Common Voice collection of voice data. With 1361 hours of transcribed audio, which is almost two months' worth, it is, according to Mozilla, the largest freely accessible collection of its kind in the world. Just as important as its size is the diversity of the samples: 42,000 speakers contributed to it, recording short texts in 18 different languages.
The release is available to the public under the CC0 license, the most liberal Creative Commons variant ("No rights reserved"). The main objective of the collection is to create high-quality, free voice data sets for training speech recognition systems, an area that has so far been dominated by the cloud applications of large corporations with huge collections of voice data. With DeepSpeech, Mozilla is developing its own open source speech recognition engine, which is already being used or tested in products such as Mycroft or Leon.
The project began in mid-2017 with a collection of English texts; a year later, Common Voice opened up to other languages. For English, Mozilla has recorded 685 hours from almost 36,000 speakers; German follows in second place with 254 hours, contributed by about 4000 volunteers.
Kabyle, Tatar, Welsh
While commercial providers focus on the key markets, Common Voice also includes many languages that are rarely represented on the internet, such as Kabyle (a Berber language) or Tatar. Often, just a few enthusiasts drive the project forward. Mozilla has also recently begun collaborating with the German Society for International Cooperation, for example to reach speakers in the African country of Rwanda. By contrast, some of the world's major languages, such as Spanish, Arabic or Russian, are still lagging behind.
After the release was finalized, the number of languages in the recording phase rose to 22, and almost 200 hours of recordings were added. A further 70 languages are in the preparatory phase, in which volunteers collect sentences and translate the website.
Although German is well represented in Common Voice, the project continues to seek speakers: the goal is to collect 1200 hours of material for each language. Participation does not require special knowledge and only takes a few minutes.