Outstanding Issues

  • Cygwin
  • Eliminate Ruby dependency?
  • Paths should be converted back to relative rather than absolute.
  • Graphviz output has proven less useful than originally thought, since the only nodes placed on the graph are the compiled binaries, such as detware.

Notable SVN Commits

SVN Revision 292


CMake utilizes a more verbose, structured syntax than gmake. Hopefully this will improve maintainability of our fairly brittle build system in the future.

Using the New System

CMake works by generating Makefiles for regular make to process. The main configuration file is called


. Other support scripts are contained in


. To generate the Makefiles, cd into HEAD and type:

cmake .

Assuming this succeeds, proceed to invoke make. First do so without specifying a target; this builds all needed utilities (DETware and Sphinx). Next, invoke a specific build target:

make detcurve

This will run the default experiments and produce detcurves using gnuplot.

To specify a particular experiment, a special --norm value for resultbuilder.pl, or a different result name (nist-result-file, etc.), use a command like the following prior to running make detcurve:


Some Things You Need to Know

With this commit, Cygwin continues to be a broken platform. Run experiments on entropy or on any other Linux workstation that meets the minimum requirements. Spoken Language-ID now depends on CMake 2.4 and Ruby 1.8 in addition to previous dependencies. Both of these packages are already installed on entropy and are available for installation under Cygwin installations, in the event that Cygwin starts working again ;-)

Files Affected

The cmake support folder is being moved to Language-ID/scripts/cmake. We're dropping some no-longer-used CMake modules, scripts, and documentation in the process. The FindJava.cmake module provides a special workaround for entropy having GCJ set as the default Java implementation, so we can use Sun's Java 5. A new get_wav_files.rb script replaces the old get_wav_files.sh.cmake script, and get_seg_files.pl and mkdir_if_missing.pl are dropped in favor of their already-existing Ruby counterparts, so that all of the scripts in cmake/Scripts are Ruby-based rather than being a hodgepodge. This allowed some improvements in indicating how much progress has been made towards copying the large seg and wav file datasets:

D      cmake
D      cmake/Scripts
D      cmake/Scripts/get_wav_files.sh.cmake
D      cmake/Scripts/get_seg_files.sh.cmake
D      cmake/Docs
D      cmake/Docs/todo.txt
D      cmake/Docs/cmake notes.txt
D      cmake/Modules
D      cmake/Modules/MacroStripFileExtension.cmake
D      cmake/Modules/MacroAddPrefix+AddSuffix.cmake
D      cmake/Modules/MacroGetCygpath.cmake
D      cmake/Modules/MacroMakeLangDirs.cmake
D      cmake/Modules/MacroMakeDirectory.cmake
A  +   Language-ID/scripts/cmake
A  +   Language-ID/scripts/cmake/Scripts
A  +   Language-ID/scripts/cmake/Scripts/mkdir_if_missing.rb
A  +   Language-ID/scripts/cmake/Scripts/get_seg_files.rb
A      Language-ID/scripts/cmake/Scripts/get_wav_files.rb
A  +   Language-ID/scripts/cmake/Modules
A  +   Language-ID/scripts/cmake/Modules/MacroAddPrefix+AddSuffix.cmake
A  +   Language-ID/scripts/cmake/Modules/FindJava.cmake
A  +   Language-ID/scripts/cmake/Modules/MacroLoadProperty.cmake

Make some final core changes to the CMakeLists.txt files and move the old Makefile to Language-ID/scripts/Makefile-original so there will be no conflict with the CMake-generated output:

M      CMakeLists.txt
M      Sphinx4-1.0beta/CMakeLists.txt
M      Statistical-NLP/CMakeLists.txt
D      Language-ID/CMakeLists.txt
M      Language-ID/scripts/CMakeLists.txt
M      Language-ID/scripts/detware/bin/CMakeLists.txt
D      Language-ID/scripts/Makefile
R  +   Language-ID/scripts/Makefile-original

Reduce the verbosity of this class's output so we can see what else is going on around it:

M      Statistical-NLP/src/edu/berkeley/nlp/math/LBFGSMinimizer.java

Create a cmake-ified version of blddetcurve.sh as well as a fresh new Ruby implementation. The Ruby script no longer stores temporary data in an external file called


. More is done within the script itself rather than delegated to helpers like awk, which helps guarantee that the data flows through a sequential pipeline. The Ruby version is currently being invoked by the build system and may eventually be allowed to fully supersede the shell script:

D      Language-ID/scripts/blddetcurve.sh
M      Language-ID/scripts/blddetcurve.sh.cmake
A  +   Language-ID/scripts/blddetcurve.rb.cmake

Also create a cmake-ified version of regressiontest.sh along with a new Ruby implementation. The Ruby implementation allows lid-console.rb to run regression tests very conveniently, but can also be invoked independently from the command line and may eventually supersede regressiontest.sh[.cmake]:

D      Language-ID/scripts/regressiontest.sh
A  +   Language-ID/scripts/regressiontest.sh.cmake
A  +   Language-ID/scripts/regression_test.rb

Fix a small syntax error in the gnu_det.sh script. This script could be considered deprecated in favor of using generate_gnuplot_script.rb:

M      Language-ID/scripts/detware/scripts/gnu_det.sh

Remove the detware plot binary. This binary caused a problem by not having the executable bit set, preventing detcurves from being built. By removing it, we force it to be rebuilt on each system, which removes the need for 64-bit systems to run 32-bit code, since the repository's copy was 32-bit:

D      Language-ID/scripts/detware/bin/plot

Then, create a successor implementation of gnu_det.sh. This is implemented in Ruby and facilitates generation of plots by the lid-console.rb script as well as the blddetcurve.rb script:

A  +   Language-ID/scripts/generate_gnuplot_script.rb
A  +   Language-ID/scripts/plot.rb
A  +   Language-ID/scripts/plot_line.rb

Take the 'norm' option as an integer rather than as a string; fix a typo:

M      Language-ID/scripts/thetasweep.pl

Add a 'lang' command line option that specifies what language is being operated on, rather than having that be inferred from the filename. The format of the outcome file names has changed in the past, and this flag should prevent the script from breaking if we ever change the filenames again:

M      Language-ID/scripts/thetasweep2.pl

Correctly infer languages from filenames. It's not possible to use a 'lang' flag as in thetasweep2.pl because this script operates on more than one language at a time. Also make some variable names more descriptive:

M      Language-ID/scripts/resultbuilder.pl

Adjust to absolute paths being used in cmake. Take OS and architecture command line options so we can avoid quadratic regression on 64-bit and cygwin setups; check if corresponding output files already exist and don't re-process the wav/seg files if so; clean up output so it doesn't flood the terminal:

M      Language-ID/scripts/seg2xml3.pl
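
The "skip if the output already exists" check can be sketched roughly as follows. This is not the code in seg2xml3.pl (which is Perl); it is a minimal Ruby illustration, and the pending_inputs name, the ".xml" output naming, and the flat output directory are all assumptions made for the example:

```ruby
require "tmpdir"
require "fileutils"

# Returns the inputs that still need processing: those whose expected
# output file does not exist yet. The ".xml" naming and the single
# output_dir are assumptions for this sketch, not the real layout.
def pending_inputs(input_paths, output_dir)
  input_paths.reject do |input|
    output = File.join(output_dir, File.basename(input, ".*") + ".xml")
    File.exist?(output)
  end
end

Dir.mktmpdir do |dir|
  # Pretend call_001 was already converted on an earlier run.
  FileUtils.touch(File.join(dir, "call_001.xml"))
  todo = pending_inputs(["seg/call_001.seg", "seg/call_002.seg"], dir)
  puts todo.inspect  # prints ["seg/call_002.seg"]
end
```

Checking only for existence is cheap but will not notice a stale or truncated output file; comparing modification times would be a stricter variant.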

Introduce the Feature Engineering Console prototype script:

A  +   lid-console.rb

Add a property file system for use by both lid-console.rb and cmake. This allows certain settings to be persistent and shared between cmake and Ruby scripts:

A  +   Language-ID/scripts/properties.rb
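
To make the idea of a shared property file concrete, here is a minimal sketch of reading and writing one from Ruby. The actual format used by properties.rb is not documented here, so the KEY=value layout, the method names, and the sample keys below are all hypothetical:

```ruby
# Parses "KEY=value" lines into a Hash, skipping blank lines and "#"
# comments. The file format is an assumption made for this sketch.
def parse_properties(text)
  text.each_line.with_object({}) do |line, props|
    line = line.strip
    next if line.empty? || line.start_with?("#")
    key, value = line.split("=", 2)
    next if value.nil?  # ignore lines without an "=" separator
    props[key.strip] = value.strip
  end
end

# Serializes a Hash back to the same format so either side can rewrite it.
def dump_properties(props)
  props.map { |key, value| "#{key}=#{value}" }.join("\n") + "\n"
end

sample = "# hypothetical settings\nNORM=2\nRESULT_NAME=nist-result-file\n"
props = parse_properties(sample)
puts props["NORM"]  # prints "2"
puts dump_properties(props)
```

A flat text format like this is easy for both a Ruby script and a build system to read, which is the main reason to prefer it over something Ruby-specific such as Marshal.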

Add a standalone script for a Perl one-liner that didn't work inside of CMake due to escaping problems. Of course, we could also probably use awk, but this works for now:

A  +   Language-ID/scripts/printfirstcolumn.pl
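
For comparison, the same job in the Ruby style used by the other new helpers might look like the sketch below. It is not a translation of printfirstcolumn.pl (whose source isn't shown here); the first_column name and the sample data are made up for illustration:

```ruby
# Returns the first whitespace-separated field of each non-blank line.
def first_column(text)
  text.each_line.map { |line| line.split.first }.compact
end

# Made-up sample input; real input would come from a file or a pipe.
puts first_column("en 0.12 0.88\nfa 0.40 0.60\n")
```

Keeping the logic in its own file, as the commit does, means the build system only has to pass a script path rather than a quoted program, which is what avoids the escaping trouble.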

Enable filtering of what languages are used in training, etc. This helper script is used by the build system:

A  +   Language-ID/scripts/filter_langs.rb
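
The filtering itself can be sketched in a few lines. This is only an illustration of the idea, not the contents of filter_langs.rb: the method signature and the error handling are assumptions, and the language codes are taken from the LANGUAGE_LIST example in the conversion notes:

```ruby
# Codes as they appear in the LANGUAGE_LIST example in the conversion notes.
ALL_LANGUAGES = %w[en fa fr ge ja ko ma sp ta vi]

# Keeps only the requested languages, in the canonical list order,
# and rejects codes that are not in the full list.
def filter_langs(requested, all = ALL_LANGUAGES)
  unknown = requested - all
  raise ArgumentError, "unknown language code(s): #{unknown.join(', ')}" unless unknown.empty?
  all & requested
end

puts filter_langs(%w[vi en fr]).join(" ")  # prints "en fr vi"
```

Failing loudly on an unknown code is a design choice for the sketch; silently dropping it would also work but makes typos harder to spot.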

Delete old, defunct detcurve stuff I discovered during the process:

D      Language-ID/scripts/preamble.gp
D      Language-ID/scripts/multiplot.pl
D      Language-ID/scripts/detcurve.gp

Known Issues

  • The CMake build files currently force the use of absolute paths, which seems to cause problems in seg2xml3.pl under Cygwin. It's also possible that this is simply a problem with Cygwin in general, since Kevin has encountered essentially the same problem using the regular make setup.
  • CMake creates a number of ugly-looking directories in many places. We need to set svn:ignore properly to hide these.

Old Notes File

You can successfully use the $@ variable as in regular make, but I suggest that you don't. Because this variable is left uninterpreted by cmake and is only resolved at the gmake level, it's difficult to tell exactly what file it will point to in the end.

FindJava: Find Java. This module finds if Java is installed and determines where the include files and libraries are. This code sets the following variables:

 JAVA_RUNTIME    = the full path to the Java runtime
 JAVA_COMPILE    = the full path to the Java compiler
 JAVA_ARCHIVE    = the full path to the Java archiver

FindPerl: Find Perl. This module looks for Perl and sets the following variables:

 PERL_EXECUTABLE - the full path to perl
 PERL_FOUND      - If false, don't attempt to use perl.

Source data: /home/data/langid/OGI_TS/SEGLOLA/MANDARIN seems to be a symlink, but it should be the actual directory. Do mv /home/data/langid/OGI_TS/SEGLOLA/mandarin /home/data/langid/OGI_TS/SEGLOLA/MANDARIN

LID Project Prerequisites (Ubuntu Package Names):

  • cmake – the meta-build system
  • make – the low-level build system
  • pdl – perl modules required at least by seg2xml3.pl
  • perl
  • praat – for phonetic processing
  • sun-java5-jdk OR sun-java6-jdk
  • ant

How the Conversion Was Done

My original heading: I have been attempting to move the Language-ID build / testrun system from (g)make to cmake. This has been rather tricky since I knew nothing about cmake at first, but it's coming together now and I want to document how I've accomplished the conversion.

Create CMakeLists.txt files in HEAD/, HEAD/Language-ID, etc. CMakeLists.txt is the main file upon which cmake operates. Language-ID/CMakeLists.txt was originally copied from Language-ID/scripts/Makefile.

Convert gmake syntax to cmake syntax

  • Variable declarations:
      gmake: VARIABLE_NAME = some sort of data
      cmake: set(VARIABLE_NAME "some sort of data")
  • Variable dereferences:
      gmake: $(VARIABLE_NAME)
      cmake: ${VARIABLE_NAME}
  • List variables:
      gmake: LANGUAGE_LIST := en fa fr ge ja ko ma sp ta vi
      cmake: set(LANGUAGE_LIST "en" "fa" "fr" "ge" "ja" "ko" "ma" "sp" "ta" "vi")
    Each list element must be passed as its own argument. Quoting the whole list as a single string would produce one element containing spaces; the per-element quotes are optional for simple values like these.
  • Adjust path declarations. For now I've made everything relative to ${CMAKE_SOURCE_DIR} with the assumption that cmake will be run in the directory containing the Language-ID, Sphinx4-1.0beta, and Statistical-NLP directories. This is the HEAD directory in subversion.

Platform checks

  • Gather platform data the cmake way when possible:
      gmake: HOSTNAME = $(shell hostname)
      cmake: site_name(HOSTNAME)
      gmake: HOSTTYPE = $(shell uname)
      cmake: set(HOSTTYPE ${CMAKE_SYSTEM})
    We might migrate away from using HOSTTYPE and just use CMAKE_SYSTEM directly. Note that CMAKE_SYSTEM combines the system name and version; CMAKE_SYSTEM_NAME is the closer match for plain uname output.
  • Use builtin platform checks:
      gmake: ifeq ($(HOSTTYPE),Linux) … else … endif
      cmake: if(UNIX) … if(CYGWIN) … else(CYGWIN) … endif(CYGWIN) … endif(UNIX)
    CMake is even aware of the type of processor the system is running on.
  • Change all invocations of 'java' and 'perl' to use the auto-discovered JAVA_RUNTIME and PERL_EXECUTABLE variables.

Convert macros

  • Prefix macros:
      gmake: THE_PREFIXED_LIST = $(addprefix theprefix/, $(SOME_LIST_OF_VALUES))
      cmake: set(THE_PREFIXED_LIST ${SOME_LIST_OF_VALUES})
             add_prefix(THE_PREFIXED_LIST "theprefix/")
  • Make all targets explicit:
      gmake: A_TARGET_NAME : ANOTHER_TARGET somefile/that-should-exist
      cmake: add_custom_target(A_TARGET_NAME DEPENDS ANOTHER_TARGET DEPENDS somefile/that-should-exist)
      gmake: somefile/that-should-exist :
             |–tab–| command_to_create_file arg1 arg2 …
      cmake: add_custom_command(OUTPUT somefile/that-should-exist COMMAND command_to_create_file arg1 arg2)
  • Path helper expressions:
      gmake: $(word $(words $(subst /, ,$(dir $@))),$(subst /, ,$(dir $@)))
    This just finds the last directory name in a path: /a/path/example/file.extension would yield example.
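
The nested gmake expression for the last directory name is hard to read, so here is the same computation as a short Ruby sketch (Ruby being the language of the newer helper scripts). The last_directory_name method is a hypothetical name used only for this illustration:

```ruby
# Returns the name of the directory that directly contains the given
# file, mirroring what the gmake word/words/subst expression computes.
def last_directory_name(path)
  File.basename(File.dirname(path))
end

puts last_directory_name("/a/path/example/file.extension")  # prints "example"
```

File.dirname drops the file name and File.basename then keeps only the final path component, which together do the work of splitting on "/" and picking the last word.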

Spoken Language ID

nlp-private/cmake.txt · Last modified: 2015/04/22 15:10 by ryancha
CC Attribution-Share Alike 4.0 International