Conferences
Top-Level Goals
Other Directions for Research
Scenario #4:
Starting point: audio files
Focus: Reliable broad-class phone recognition for all languages, where broad-class phone LMs are trained by semi-supervised learning
Resources
Possible Paper
What is the latest system that performs best using phonetic transcription as an intermediate representation?
Contact the builders of the baseline system to ask if this line of research would be interesting to them (see below)
Implement this technique as a strong baseline on true transcripts
Beat it using feature engineering, ensembles, maxent, etc.
Part 2: grab real data from somebody else
Action List
General
Get things working under Cygwin: this will wait until Robbie's new detware implementation gets committed.
Reengineer seg2xml3.pl: can Praat be run by the make system, with seg2xml3.pl run on Praat's output?
Add comments to the CMakeLists files
Reproduce Pedro's and Bruce's results using the Feature Engineering Console: re-run the best experiments from each, with an eye specifically on the impact on (for example) Mandarin performance.
Try comparing Bruce's results to a simpler baseline (e.g., trigram, 5-gram) rather than to Pedro's results.
<s>Reconcile Pedro's SE_*.def.xml files with other files in the repository. Keep only the unique ones.</s> - presumably completely DONE in r378
<s>Purge the duplicates.</s> - DONE in r245 and r252
Feature engineering on both pitch and F0.
<s>with quantization (may need different quantiles for each approach)</s> - DONE in r382
<s>with linear regression</s> - DONE in r375 and MERGED TO HEAD in r382
with _quadratic_ regression - quadratic regression for pitch checked in in r402
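As a reference for the quadratic-regression pitch features, here is a minimal sketch (not the checked-in r402 code) that fits F0(t) ≈ at² + bt + c over a segment with NumPy; the function name and the synthetic contour are illustrative only:

```python
import numpy as np

def quadratic_pitch_features(times, f0):
    """Fit F0(t) ~ a*t^2 + b*t + c over one segment; return (a, b, c).
    These coefficients can then be quantized or used directly as features."""
    # polyfit returns coefficients highest order first
    a, b, c = np.polyfit(np.asarray(times), np.asarray(f0), deg=2)
    return a, b, c

# Illustrative example: a synthetic rise-fall contour peaking at 120 Hz
t = np.linspace(0.0, 1.0, 50)
f0 = -40.0 * (t - 0.5) ** 2 + 120.0
a, b, c = quadratic_pitch_features(t, f0)
```

The curvature coefficient `a` distinguishes rise-fall from fall-rise contours, which a linear fit cannot capture.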
<s>Parallelize seg2xml3.pl to take advantage of multiple processors/processor cores.</s> - DONE in r388, though a bug remains and it isn't currently enabled.
Robbie: provide Eric with final normalization proof paper for tech report.
Feature Engineering Console Module
Tasks related to the Language-ID implementation of the Edu.byu.nlp.experimentation API, located at Language-ID-Experimentation-Module
Can we enable multiple simultaneous jobs in cmake? What sort of locking will this require? (slx2 and ling files, etc.)
Show accurate durations for the wav files. This will be accomplished somewhere in or around the SLIDTrial class's getOtherInfo method. The durations are taken from the relevant .result file, so perhaps they're hardcoded in resultbuilder.pl?
Wider legend colors
Tag or label on the chart to identify which language is which
<s>Outcome isn't showing up in trial list</s> - FIXED
Check fivegram for regression?
Check title of file feature weights?
Outline view of features and experiments?
'done' or 'status' file to show where an experiment run terminated
Feature Engineering with True Transcripts
<br>
Robbie:
Log results of the full set of n-gram LM & maxent experiments on new .slx2 file set on the wiki.
MaxEnt Optimization: Start with feature weights from prior iteration
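A minimal sketch of the warm-start idea, assuming a plain NumPy multinomial-logistic (MaxEnt) trainer rather than our actual toolkit; passing `w_init` lets a new run begin from the prior iteration's weights instead of zeros:

```python
import numpy as np

def train_maxent(X, y, n_classes, w_init=None, lr=0.1, iters=200):
    """Multinomial logistic regression by batch gradient ascent.
    w_init: optional (n_classes, n_features) weights from a prior run."""
    n, d = X.shape
    W = np.zeros((n_classes, d)) if w_init is None else w_init.copy()
    Y = np.eye(n_classes)[y]                      # one-hot targets
    for _ in range(iters):
        logits = X @ W.T
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        W += lr * (Y - P).T @ X / n               # ascent on log-likelihood
    return W
```

A follow-up iteration can then call `train_maxent(X, y, k, w_init=W_prev, iters=...)` and should need far fewer iterations to reconverge.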
Replace NIST DETware with our own DET-curve software
In the new DET-curve software, automatically calculate the following:
the aggregate DCF
the aggregate EER
the aggregate operating point
per-language statistics (DCF, EER, operating point)
Reconcile: EER in .csv result file and plot-*/global/eer.txt
Understand the relationship with plot-*/avgeer.txt
Sweep out one threshold per binary classifier. Equivalent to normalization?
Optimize theta sweep.
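A sketch of what the replacement DET software needs to compute per threshold sweep. The cost parameters below are placeholders, not the NIST evaluation values, and the function name is illustrative:

```python
import numpy as np

def sweep_det(tgt_scores, non_scores, c_miss=1.0, c_fa=1.0, p_tgt=0.5):
    """Sweep one decision threshold over the pooled scores of target and
    non-target trials; return (EER, minimum DCF).  DCF here is
    C_miss*P_miss*P_tgt + C_fa*P_fa*(1 - P_tgt)."""
    thresholds = np.unique(np.concatenate([tgt_scores, non_scores]))
    best_dcf, best_gap, eer = np.inf, np.inf, None
    for th in thresholds:
        p_miss = np.mean(tgt_scores < th)        # targets rejected
        p_fa = np.mean(non_scores >= th)         # non-targets accepted
        dcf = c_miss * p_miss * p_tgt + c_fa * p_fa * (1 - p_tgt)
        best_dcf = min(best_dcf, dcf)
        if abs(p_miss - p_fa) < best_gap:        # EER: where the rates cross
            best_gap, eer = abs(p_miss - p_fa), (p_miss + p_fa) / 2
    return eer, best_dcf
```

Per-language statistics fall out by calling this once per language's trial subset; the aggregate versions pool all trials.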
<br>
Eric:
Investigate whether output from operating point selection on the training set is overwriting, or being overwritten by, output produced at test time.
Is the training sweep .csv file overwriting the test sweep .csv file?
Feature Selection:
Count-cut-offs with MaxEnt
Mutual information based feature selector
Berger's feature selection procedure with learning in the loop
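The mutual-information selector could look roughly like this sketch (discrete feature values are assumed, and all names are illustrative):

```python
import math
from collections import Counter

def mutual_information(feature_vals, labels):
    """MI (in nats) between a discrete feature and the class label."""
    n = len(labels)
    joint = Counter(zip(feature_vals, labels))
    pf = Counter(feature_vals)
    pl = Counter(labels)
    mi = 0.0
    for (f, l), c in joint.items():
        p_fl = c / n
        # p(f,l) * log( p(f,l) / (p(f) p(l)) ), counts rewritten over n
        mi += p_fl * math.log(p_fl * n * n / (pf[f] * pl[l]))
    return mi

def select_top_k(features, labels, k):
    """features: dict name -> list of values aligned with labels."""
    ranked = sorted(features,
                    key=lambda f: mutual_information(features[f], labels),
                    reverse=True)
    return ranked[:k]
```

Unlike count cut-offs, this ranks features by how informative they are about the language label rather than by raw frequency.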
Cite Chen & Maison - refer to text lang ID prior work.
Re-run n-gram experiments with Kneser-Ney instead of Katz-style Good-Turing
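For reference, interpolated Kneser-Ney for a bigram LM can be sketched as follows (a toy estimator with a single fixed discount, not the toolkit's implementation):

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, discount=0.75):
    """Interpolated Kneser-Ney over a token sequence; returns p(w | v)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    hist = Counter(tokens[:-1])                  # history counts c(v)
    followers = defaultdict(set)                 # distinct types after v
    preceders = defaultdict(set)                 # distinct types before w
    for v, w in bigrams:
        followers[v].add(w)
        preceders[w].add(v)
    n_types = len(bigrams)
    def p(w, v):
        cont = len(preceders[w]) / n_types       # continuation probability
        if hist[v] == 0:
            return cont                          # unseen history: back off
        lam = discount * len(followers[v]) / hist[v]
        return max(bigrams[(v, w)] - discount, 0) / hist[v] + lam * cont
    return p
```

The continuation probability rewards words that follow many different histories, which is the key difference from Katz-style Good-Turing backoff.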
Re-examine n-gram vs. n-gram-all models for MaxEnt in wake of “null” bug removal
Implement multi-class classifier and track accuracy (in addition to the aggregate DCF, DET, etc. for the binary one-v-rest classifiers) – the decision is not blind to evidence for other languages.
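A sketch of the multi-class decision rule: rather than thresholding each one-v-rest detector blindly, take the argmax over every language's score, so the decision sees the evidence for all languages at once (names illustrative):

```python
def multiclass_decision(scores_by_lang):
    """scores_by_lang: dict language -> detection score from that
    language's one-v-rest classifier.  Pick the highest-scoring language."""
    return max(scores_by_lang, key=scores_by_lang.get)

def multiclass_accuracy(trials):
    """trials: list of (scores_by_lang, true_language) pairs."""
    correct = sum(multiclass_decision(s) == t for s, t in trials)
    return correct / len(trials)
```

This accuracy number would be tracked alongside the per-detector DCF/DET statistics, since the two can disagree about which system is better.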
<br>
Contact Audrey Le (audrey.le@nist.gov): inquire about multi-dimensional (one theta per language) DETware.
Also ask Audrey about the normal deviate scale
Use answer key files, rather than reading the answer out of the filename
Debug: operating point is sometimes off the DET curve (e.g., Pedro's English v. Spanish curves)
Robbie: incorporate Richard Arthur's confusability matrix code into Spoken LID and 401R Codebase. Adapt for Maxent feature weights.
Speech Reco. for languages with broad-class phone LMs (trained by supervised learning)
Re: quality of the segment labels and the segment endpoints. 8/7/06: We noticed a trend in endpoint position discrepancy with the truth. Most egregiously, the final segment always had erroneous start-/end-time stamps. Debug this code, and take another look at a couple of utterances in order to see where we stand.
<br>
Phone reco. in Makefile
Optimization of SR parameters
NIST datasets: See Singer; OGI_TS, CallFriend
Verify that content of NIST 2005 dev. dataset <math>\supseteq</math> 2003 <math>\supseteq</math> 1996
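The superset check can be automated once utterance-ID inventories exist for each year; a sketch, assuming IDs are comparable across releases:

```python
def verify_nesting(datasets):
    """datasets: list of (name, set_of_utterance_ids), ordered newest to
    oldest.  Check each release contains the next (2005 ⊇ 2003 ⊇ 1996);
    return a list of (newer, older, missing_ids) for any violations."""
    problems = []
    for (new_name, new_ids), (old_name, old_ids) in zip(datasets, datasets[1:]):
        missing = old_ids - new_ids
        if missing:
            problems.append((new_name, old_name, sorted(missing)))
    return problems
```

An empty return value confirms the nesting; otherwise the missing IDs point at exactly which older utterances were dropped.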
Inventory the resources at our disposal that we can use to train phone LMs. This can be done either by (in order of preference): (a) using the OGI-generated phoneme annotations directly; (b) using phoneme-level (or “close”) transcriptions and converting them to phoneme classes; (c) using straight text, applying text-to-speech tools (aka phonemicizers) to convert the text to phonemic form, and then converting the phonemes to phoneme classes.
The original OGI corpus (http://cslu.cse.ogi.edu/corpora/mlts/), which is annotated with broad-class phones for the following languages: English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, Vietnamese.
The expanded OGI corpus (http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S26), which has orthographic transcriptions for only some 19,758 utterances. Languages include: Arabic, Chinese, Czech, English, Farsi, German, Hindi, Hungarian, Italian, Japanese, Korean, Malay, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tamil, Vietnamese.
SR with all languages, where LMs are trained by semi-supervised learning
Other
4: Idea from a discussion with Hal: hierarchical linguistic similarity – leverage similarity between languages to pool data and try to get better language-ID rates. Combine with error-correcting-code-style multi-class classification.
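A sketch of the error-correcting-code-style decoding step: each language gets a codeword, one binary classifier is trained per bit, and decoding picks the nearest codeword by Hamming distance (the toy codebook below is illustrative):

```python
def ecoc_decode(bit_predictions, codebook):
    """bit_predictions: list of 0/1 outputs, one per binary classifier.
    codebook: dict language -> codeword (bit list of the same length).
    Returns the language whose codeword is closest in Hamming distance,
    so a few flipped classifier outputs can still be corrected."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(codebook, key=lambda lang: hamming(codebook[lang], bit_predictions))
```

With well-separated codewords this tolerates individual classifier errors, and similar languages could deliberately share code bits to pool their training data.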
5: Add Farsi phoneme-aligned data
Low Priority
Split makefile: MaxEnt v. n-gram
-
Subversion repository organization
Globally rename SEGLOLA to derivative of Broad-class phones (BRDCLSPHONES???)
Move scripts down a level
Can Eclipse still check out projects the way we want?
3 separate repositories (one per Eclipse project)?
Try the other pitch tracker from Mark Liberman
Try ESPS (currently stored as tar ball in entropy:/home/tools )
Consider Hal's maxent toolkit. Use for multi-class classification.
Bugs
Processing /home/pep6/workspace/experiments/data/seg/en013num.seg ..
CMD1: c:/Program\ Files/Praat/praatcon.exe Language-ID/scripts/getpitch3.praat /home/pep6/workspace/experiments/data/wav/en013num.wav
Error: Cannot open file "C:\cygwin\home\pep6\workspace\Language-ID\scripts\/home/pep6/workspace/experiments/data/wav/en013num.wav".
No object was put into the list.
The mixed path in the error suggests the Windows Praat build resolves the POSIX-style argument relative to its script directory; converting the .wav path to a Windows path (e.g., with cygpath -w) before invoking praatcon.exe should avoid this.
Paper Ideas
Done