PocketSphinx is a modification of CMU's Sphinx-II speech recognition system. “PocketSphinx is CMU's fastest speech recognition system. It uses Hidden Markov Models (HMM) with semi-continuous output probability density functions (PDF). Even though it is not as accurate as Sphinx-3 or Sphinx-4, it runs at real time, and therefore it is a good choice for live applications.” (from the CMU Speech group's comparison documentation).
There are three major components;
The N800/N810 uses ALSA for audio input and output. This means that audio can be accessed via the elements of /dev/dsptask or via /dev/snd. Scratchbox, however, uses the host computer's I/O; in my case, this means that Scratchbox uses OSS. Because of this, the sphinxbase libraries can work with one or the other, but not both. There is a way to specify the audio input device (using -adcdev), but until I can figure out how to simulate ALSA at a device level, I cannot offer a way to have Scratchbox behave like the device.
Additionally, the N800/N810 only allows sampling at up to 8000 Hz, whereas the example programs default to 16000 Hz, since that is a more reliable sampling rate. This means that if you are going to use any of the live sampling examples, you will need to pass them something like -samprate 8000 explicitly.
You will notice that there are 9 sphinx packages.
First you need to install sphinxbase, which provides the libraries upon which PocketSphinx depends:
Once sphinxbase is installed, you may install the elements of PocketSphinx that you need. There are two types of packages available. First, there are the core and development packages, which allow you to use and access PocketSphinx. Second, there are the sample model packages. These give you access to some simple acoustic and language models that are used by the example scripts and that you may use in your own recognition programs.
The rest of this document will assume that you have installed all of the packages.
Four test scripts will have been placed in /usr/bin:
You may run any of these to test your installation in Scratchbox, but at present I recommend using only pocketsphinx_numptt. The others use the program pocketsphinx_continuous, which causes sampling problems on the N800. pocketsphinx_continuous is an example application that regularly checks the audio input; when it detects something loud enough to be non-silence, it begins recording the input. When it detects silence again, it decodes the recorded audio and displays the result on the screen when it is done. pocketsphinx_ptt is similar but uses push-to-talk functionality.
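The start-on-sound, stop-on-silence behaviour described for pocketsphinx_continuous can be illustrated with a small, self-contained sketch. This is not the code the program actually uses, and it does not call sphinxbase at all; the names frame_level, find_utterance and utt_span_t, the fixed amplitude threshold and the "number of quiet frames" rule are all assumptions made purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: mean absolute amplitude of one frame of samples. */
static int32_t frame_level(const int16_t *frame, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += abs((int)frame[i]);
    return n ? (int32_t)(sum / (int64_t)n) : 0;
}

/* Result of scanning a buffer: sample offsets of the detected utterance. */
typedef struct {
    int found;      /* 1 if any frame was loud enough, else 0 */
    size_t start;   /* first sample of the first loud frame */
    size_t end;     /* one past the last sample of the last loud frame */
} utt_span_t;

/*
 * Scan buf one frame at a time. The utterance starts at the first frame
 * whose level exceeds threshold, and ends once trailing_silence
 * consecutive quiet frames follow it (or the buffer runs out).
 */
static utt_span_t find_utterance(const int16_t *buf, size_t nsamples,
                                 size_t frame_len, int32_t threshold,
                                 size_t trailing_silence)
{
    utt_span_t span = {0, 0, 0};
    size_t quiet = 0;

    if (frame_len == 0)
        return span;

    for (size_t off = 0; off + frame_len <= nsamples; off += frame_len) {
        int loud = frame_level(buf + off, frame_len) > threshold;

        if (!span.found) {
            if (loud) {                 /* non-silence: start "recording" */
                span.found = 1;
                span.start = off;
                span.end = off + frame_len;
            }
        } else if (loud) {              /* still speech: extend the span */
            quiet = 0;
            span.end = off + frame_len;
        } else if (++quiet >= trailing_silence) {
            break;                      /* enough silence: utterance is over */
        }
    }
    return span;
}
```

In a live program the samples between start and end would then be handed to the decoder; here the function only reports where the utterance lies in a buffer that has already been captured.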
You can test digit recognition using:
pocketsphinx_numptt
Note: the continuous test programs are designed to terminate when you say “goodbye”, but since that word is not part of the TIDIGITS dictionary, you will need to kill the process with Ctrl+C.
For examples of these functions in context, check out the source code for the example programs (available from the Mercurial repository). src/programs/tty_ptt.c and src/programs/tty_continuous.c are probably the best places to start.
Includes:
/* Access fbs8 audio library */
#include <fbs.h>
/* For printing messages */
#include <err.h>
/* Audio device functions */
#include <ad.h>
Useful functions:
/* Start audio recording. Return value: 0 if successful, <0 otherwise */
SPHINXBASE_EXPORT
int32 ad_start_rec (ad_rec_t *);

/*
 * Read next block of audio samples while recording; read up to max samples into buf.
 * Return value: # samples actually read (could be 0 since non-blocking); -1 if not
 * recording and no more samples remaining to be read from most recent recording.
 */
SPHINXBASE_EXPORT
int32 ad_read (ad_rec_t *, int16 *buf, int32 max);

/* Stop audio recording. Return value: 0 if successful, <0 otherwise */
SPHINXBASE_EXPORT
int32 ad_stop_rec (ad_rec_t *);

/**
 * Begins processing an utterance.
 *
 * @param uttid A string identifying the utterance; utterance data
 *              (raw or mfc files, if any) logged under this name.
 *              The recognition result in the "match" file also
 *              identified with this id. If uttid is NULL, an
 *              automatically generated running sequence number (of
 *              the form %08d) is used instead.
 *
 * @return 0 if successful, else -1.
 */
POCKETSPHINX_EXPORT
int32 uttproc_begin_utt(char const *uttid);

/* process the data */

/**
 * Decode the next block of input samples in the current utterance.
 *
 * The "block" argument specifies whether the decoder should block
 * until all pending data have been processed. If 0, it is
 * "non-blocking". That is, the decoder steps through only a few
 * pending frames (at least 1), and the remaining input data is queued
 * up internally for later processing. In particular, this function
 * can be called with 0-length data to simply process internally
 * queued up frames.
 *
 * @note The decoder will not actually process the input data if any
 * of the processing depends on the entire utterance. (For example,
 * if CMN/AGC is based on entire current utterance.) The data are
 * queued up internally for processing after uttproc_end_utt is
 * called. Also, one cannot combine uttproc_rawdata and
 * uttproc_cepdata within the same utterance.
 *
 * @param raw Block of samples in native-endian linear PCM format
 * @param nsample Number of samples in raw, can be 0.
 * @param block Whether to process data immediately.
 * @return Number of frames internally queued up and remaining to be
 * decoded; -1 if any error occurs.
 */
POCKETSPHINX_EXPORT
int32 uttproc_rawdata(int16 *raw, int32 nsample, int32 block);

/**
 * Finish processing an utterance.
 *
 * For bookkeeping purposes, this function marks that no more data is
 * forthcoming in the current utterance. It should be followed by
 * uttproc_result to obtain the final recognition result.
 *
 * @return 0 if successful, else -1.
 */
POCKETSPHINX_EXPORT
int32 uttproc_end_utt(void);

/**
 * Obtain recognition result for utterance after uttproc_end_utt() has
 * been called.
 *
 * In the blocking form (i.e. if the block parameter is
 * non-zero), all queued up data is processed and final result
 * returned. In the non-blocking version, only a few pending frames
 * (at least 1) are processed. In the latter case, the function can
 * be called repeatedly to allow the decoding to complete.
 *
 * @param out_frm (output) pointer to number of frames in utterance.
 * @param out_hyp (output) pointer to READ_ONLY recognition string.
 *                The contents of this string will be clobbered by the
 *                next call to uttproc_result() or
 *                uttproc_partial_result().
 * @param block If non-zero, process all data and return final result.
 * @return Number of frames remaining to be processed. If non-zero
 * (non-blocking mode) the final result is not yet available.
 * If 0, frm and hyp contain the final recognition result. If
 * there is any error, the function returns -1.
 */
POCKETSPHINX_EXPORT
int32 uttproc_result(int32 *out_frm, char **out_hyp, int32 block);
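The declarations above suggest a natural call order for decoding a single utterance: start recording, begin the utterance, feed blocks of samples, stop recording, end the utterance, then fetch the result. The following is a minimal sketch of that order only. So that it compiles without sphinxbase or PocketSphinx installed, every library function is replaced by a trivial stand-in with the same signature as shown above; the ad_rec_t layout, the "stub hypothesis" string and the helper name decode_one_utterance are hypothetical. It also assumes the decoder has been initialised and the audio device opened beforehand, which is not shown.

```c
#include <stdint.h>
#include <stddef.h>

/* ---- Stand-ins (assumptions) so the sketch builds on its own. -------- */
/* In a real build, remove this section and include <fbs.h> and <ad.h>.   */
typedef int32_t int32;
typedef int16_t int16;
typedef struct { int recording; } ad_rec_t;   /* hypothetical layout */

static int32 ad_start_rec(ad_rec_t *ad) { ad->recording = 1; return 0; }
static int32 ad_stop_rec(ad_rec_t *ad)  { ad->recording = 0; return 0; }
static int32 ad_read(ad_rec_t *ad, int16 *buf, int32 max)
{
    if (!ad->recording)
        return -1;                    /* not recording, nothing left */
    for (int32 i = 0; i < max; i++)
        buf[i] = 0;                   /* the stand-in delivers silence */
    return max;
}
static int32 uttproc_begin_utt(char const *uttid) { (void)uttid; return 0; }
static int32 uttproc_rawdata(int16 *raw, int32 nsample, int32 block)
{
    (void)raw; (void)nsample; (void)block;
    return 0;
}
static int32 uttproc_end_utt(void) { return 0; }
static int32 uttproc_result(int32 *out_frm, char **out_hyp, int32 block)
{
    static char hyp[] = "stub hypothesis";    /* placeholder, not real output */
    (void)block;
    *out_frm = 0;
    *out_hyp = hyp;
    return 0;
}
/* ---- End of stand-ins. ----------------------------------------------- */

#define BLOCK_SAMPLES 4096

/*
 * Record roughly total_samples samples and decode them as one utterance.
 * Returns the read-only hypothesis string, or NULL on error.
 */
static const char *decode_one_utterance(ad_rec_t *ad, int32 total_samples)
{
    int16 buf[BLOCK_SAMPLES];
    int32 got = 0, frames = 0, k;
    char *hyp = NULL;

    if (ad_start_rec(ad) < 0)
        return NULL;
    if (uttproc_begin_utt(NULL) < 0) {        /* NULL: auto-generated id */
        ad_stop_rec(ad);
        return NULL;
    }

    /* ad_read is non-blocking, so 0 samples is a normal return value. */
    while (got < total_samples) {
        k = ad_read(ad, buf, BLOCK_SAMPLES);
        if (k < 0)
            break;
        if (k > 0) {
            uttproc_rawdata(buf, k, 0);       /* 0: do not block */
            got += k;
        }
    }

    ad_stop_rec(ad);
    /* Feed whatever was still buffered when recording stopped. */
    while ((k = ad_read(ad, buf, BLOCK_SAMPLES)) >= 0) {
        if (k > 0)
            uttproc_rawdata(buf, k, 1);
    }

    if (uttproc_end_utt() < 0)
        return NULL;
    if (uttproc_result(&frames, &hyp, 1) < 0) /* 1: wait for final result */
        return NULL;
    return hyp;
}
```

On the N800/N810, total_samples would be counted at 8000 samples per second, so a value of 8000 corresponds to roughly one second of audio. The returned string belongs to the decoder and, as the comment on uttproc_result notes, is overwritten by the next result call, so copy it if you need to keep it.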