Playing with TIMIT corpus in NLTK - Part - I

Written by Jaganadh Gopinadhan

The corpus collection in NLTK contains TIMIT Corpus Sample. I just explored the TMIT corpus and TimitCorpusReader function. If you are interested to know more about TIMIT please check http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1.

Some description available in NLTK code is given below. “This corpus contains selected portion of the TIMIT corpus.

16 speakers from 8 dialect regions
1 male and 1 female from each dialect region
total 130 sentences (10 sentences per speaker. Note that some sentences are shared among other speakers, especially sa1 and sa2 are spoken by all speakers.)
total 160 recording of sentences (10 recordings per speaker)
audio format: NIST Sphere, single channel, 16kHz sampling, 16 bit sample, PCM encoding”

If you explore the TMIT directory in nltk data you will find the following items in it

allfilelist.txt allsenlist.txt dr2-faem0 dr4-falr0 dr6-fapb0 dr8-fbcg1 prompts.txt sentences.pos spkrsent.txt wrdalign.timit allphonedur.txt allsentime.txt dr2-marc0 dr4-maeb0 dr6-mbma1 dr8-mbcg0 README sentences.ppp testset.doc allphonelist.txt dr1-fvmh0 dr3-falk0 dr5-ftlg0 dr7-fblv0 l readme.doc sentences.tags timitdic.doc allphonetime.txt dr1-mcpm0 dr3-madc0 dr5-mbgt0 dr7-madd0 phoncode.doc sentences spkrinfo.txt timitdic.txt

The items having ‘dr’ in the name are directories which contains spoken corpora files. In each directory some .wav files and associated transcriptions are located. We can acces and explore these files with the corpus api of NLTK.

Let’s see how to explore the data.

Importing the timit corpus from NLTK

`>>> from nltk.corpus import timit`

Now the corpus is imported

To view the fileid in the corpus

`>>> timit.fileids()`

It will print a long list like

    ......................
    'dr8-mbcg0/sx417.wav',
     'dr8-mbcg0/sx417.wrd',
     'dr8-mbcg0/sx57.phn',
     'dr8-mbcg0/sx57.txt',
     'dr8-mbcg0/sx57.wav',
     'dr8-mbcg0/sx57.wrd',
     'spkrinfo.txt',
     'timitdic.txt']

You might have noticed that there is .wav file, .wrd file, .phn file and .txt files are there. The .wav file contain utterances, .wrd file contains time marked word file, .phn file contains time marked phone list ….. For each utterance file there will be a .ph, .wrd and .txt file.

From the file ids we can get list of utterance in the NLTK TIMIT corpus.

>>> timit.utteranceids()

It will print a long list like

    .....................
     'dr8-mbcg0/sa1',
     'dr8-mbcg0/sa2',
     'dr8-mbcg0/si2217',
     'dr8-mbcg0/si486',
     'dr8-mbcg0/si957',
     'dr8-mbcg0/sx147',
     'dr8-mbcg0/sx237',
     'dr8-mbcg0/sx327',
     'dr8-mbcg0/sx417',
     'dr8-mbcg0/sx57']

We can store the utterance ids to a variable for future use.

`>>> utid = timit.utteranceids()`

Now the list ‘utid’ contains all the utterance ids in the TIMIT corpora. The utterance id is in a specified pattern. This pattern helps the corpus reader module to identify information regarding the utterances. Letus see how this is happening.

To get the speaker id from an utterance id. Let us take the first utterance id from the ‘utid’ list.

    >>> timit.spkrid(utid[0])
    'dr1-fvmh0'

For future use lets store the id in a variable

    >>> sid = timit.spkrid(utid[0])

For the selected utterance we can get the sentence id also.

    >>> timit.sentid(utid[0])
    'sa1'

‘sa1’ is the sentence id for the given utterence.

    >>> sentid = timit.sentid(utid[0])

Now I am storing the sentence id to a variable for future use.

I think from the above example you might have got an idea about the file name pattern of the timit corpus in NLTK.

Now we can try to get a list of utterance with an utterance is and sentence id.

>>> timit.utterance(utid, sentid)

Now you will get a long list like this “[‘dr1-fvmh0/sa1’, ‘dr1-fvmh0/sa2’, ‘dr1-fvmh0/si1466’, ‘dr1-fvmh0/si2096’, ‘dr1-fvmh0/si836’, ‘dr1-fvmh0/sx116’, ‘dr1-fvmh0/sx206’, ‘dr1-fvmh0/sx26’, ‘dr1-fvmh0/sx296’, ……………….

Now I am going to store the list of utterances to a list.

` »> utt = timit.utterance(utid, sentid)`

We can access the utterance based on the speaker id .

`>>> sputid = timit.spkrutteranceids(sid)`

It will give all the utterance associated with the id ‘sid’. Those ids are

    ['dr1-fvmh0/sa1',
     'dr1-fvmh0/sa2',
     'dr1-fvmh0/si1466',
     'dr1-fvmh0/si2096',
     'dr1-fvmh0/si836',
     'dr1-fvmh0/sx116',
     'dr1-fvmh0/sx206',
     'dr1-fvmh0/sx26',
     'dr1-fvmh0/sx296',
     'dr1-fvmh0/sx386']

I stored all he ids to a list ‘sputid’.

Let’s try to get details about the speaker having the id ‘dr1-fvmh0’. This id is stored in the variable ‘sid’ earlier.

    >>> timit.spkrinfo(sid)
    SpeakerInfo(id='VMH0', sex='F', dr='1', use='TRN', recdate='03/11/86', birthdate='01/08/60', ht='5\'05"', race='WHT', edu='BS', comments='BEST NEW ENGLAND ACCENT SO FAR')

Now it is giving the details about the speaker such as gender, birth date, and race etc…….

Continued …………..

Happy Hacking !!!!!!!!1

Migrated from my old blog jaganadhg.freeflux.net

Written on January 6, 2010

The Opinions Expressed In This Post Are My Own And Not Necessarily Those Of My Employer.

[ Python Speech Processing Natural Language Processing Copus Processing ]