Audio processing pipeline
Introduction
Red Hen Lab's Summer of Code 2015 students worked mainly on audio. Graduate student Owen He has now assembled several of their contributions into an integrated audio processing pipeline that can process the entire NewsScape dataset. This page describes the current pipeline along with some design instructions; for the code itself, see our GitHub account.
Related resources
- Audio annotation with Praat
- How to annotate with ELAN
- Current state of text tagging
- Fake laughter detection (Greg Bryant)
- How to install software on the Case HPC
- Machine learning
Red Hen processing pipelines
Red Hen is developing the following automated processing pipelines:
- Capture and text extraction (multiple locations around the world)
- OCR and video compression (Hoffman2)
- Text annotation (Red Hen server, UCLA)
- Audio parsing (Case HPC)
- Video parsing (Hoffman2)
The first three are in production; the task is to create the fourth. Dr. Jungseock Joo and graduate student Weixin Li have started on the fifth. The new audio pipeline is largely based on the audio parsing work done over the summer, but Owen He has also added some new code. The pipeline may be extended to allow video analysis and text to contribute to the results. The data is multimodal, so we're aiming to eventually develop fully multimodal pipelines.
Audio pipeline design
Candidate extensions
- Temporal windows in gentle -- see https://github.com/lowerquality/gentle/issues/103
- Speech to text -- cf. https://github.com/gooofy/py-kaldi-simple
- Acoustic and language models at CMU Sphinx
Temporal windows could usefully be combined with speech to text. For some of our video files, especially those digitized from tapes, the transcript is very poor. We could use a dictionary to count the proportion of valid words, and run speech to text on passages where the proportion falls below a certain level.
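As a rough illustration of that check, the sketch below computes the proportion of dictionary words in each transcript passage and flags the passages that fall under a threshold; the system word list, the passage granularity, and the 0.7 cutoff are assumptions for the example, not settled design choices.
# Sketch: flag transcript passages whose proportion of valid dictionary words
# is low, so they can be routed to speech to text instead of relying on the
# existing transcript. The word list path and the 0.7 threshold are assumptions.
import re

def load_dictionary(path="/usr/share/dict/words"):
    with open(path) as f:
        return set(word.strip().lower() for word in f)

def valid_word_ratio(text, dictionary):
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / float(len(words))

def flag_poor_passages(passages, dictionary, threshold=0.7):
    # Returns the indices of passages whose transcript looks too noisy.
    return [i for i, passage in enumerate(passages)
            if valid_word_ratio(passage, dictionary) < threshold]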
Current implementation by Owen He
Below I briefly describe each component of the pipeline. You can find detailed documentation as an executable main script at https://github.com/RedHenLab/Audio/blob/master/Pipeline/Main.ipynb
1. Python Wrapper: All the audio processing tools from last summer are wrapped into a Python module called "AudioPipe". In addition, I fixed a subtle bug in the diarization code, which made it run roughly three times faster.
2. Shared Preprocessing: the preprocessing part of the pipeline (media format conversion, feature extraction, etc.) is also wrapped as Python modules (features, utils) in "AudioPipe".
3. Data Storage: Data output is stored in this folder (https://github.com/RedHenLab/Audio/tree/master/Pipeline/Data), where you can also find the results from testing the pipeline on a sample video (media files are .gitignored since they are too large). Note that the speaker recognition algorithm is now able to detect impostors (tagged as "Others"). The subfolder "Model" is the Model Zoo, where future machine learning algorithms should store their model configurations and README files. The result data are stored in Red Hen format (the metadata for computation are in .json).
4. Data Management: To manipulate the data, we use the abstract syntax specified in the data management module. The places where data are stored are abstracted as Nodes, and the computational processes are abstracted as Flows from one Node to another.
5. Main Script: As you can see in the main script (https://github.com/RedHenLab/Audio/blob/master/Pipeline/Main.ipynb), by deploying the data management module, the syntax becomes so concise that every step in the pipeline boils down to only two lines of code. This makes the audio pipeline convenient for non-developers to use.
Design targets
The new audio processing pipeline will be implemented on Case Western Reserve University's High-Performance Computing Cluster. Design elements:
- Core pipeline is automated, processing all NewsScape videos via GridFTP from UCLA's Hoffman2 cluster
- Incoming videos -- around 120 a day, or 100 hours
- Archived videos -- around 330,000, or 250,000 hours (will take months to complete)
- Extensible architecture that facilitates the addition of new functions, perhaps in the form of conceptors and classifiers
The pipeline should have a very clear design, with an overall functional structure that emphasizes core shared functions and a set of discrete modules. For instance, we could build a core system that ingests the videos and extracts the features needed by the different modules, or take a 'digestive system' approach in which each stage feeds into the next.
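To make the 'core system plus modules' idea concrete, here is a minimal sketch of what such a structure could look like in Python; the class names and the dummy modules are placeholders, not the actual AudioPipe interfaces.
# Sketch of the shared-core idea: one ingest and feature extraction step is
# run once per video, and each discrete analysis module consumes the result.
# All names here are illustrative; the real pipeline uses the AudioPipe modules.

class Module(object):
    # A discrete analysis module that consumes shared features.
    def __init__(self, name, fn):
        self.name, self.fn = name, fn
    def run(self, features):
        return self.fn(features)

def shared_preprocess(video_path):
    # Placeholder for format conversion and feature extraction
    # (in practice: ffmpeg plus AudioPipe.features.mfcc).
    return {"video": video_path, "mfcc": []}

def process_video(video_path, modules):
    features = shared_preprocess(video_path)  # done once, shared by all modules
    return dict((m.name, m.run(features)) for m in modules)

# Example with dummy modules standing in for diarization and gender detection:
results = process_video("sample.mp4",
                        [Module("diarization", lambda f: "rttm"),
                         Module("gender", lambda f: "gen")])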
The primary focus for the first version of the pipeline is an automated system that ingests all of our videos and texts and processes them in ways that yield acceptable quality output with no further training or user feedback. There shouldn't be any major problems completing this core task, as the code is largely written and it's a matter of creating a good processing architecture.
Audio pipeline modules
The audio processing pipeline should tentatively have at least these modules, using the code from our GSoC 2015:
- Forced alignment (Gentle, using Kaldi)
- Speaker diarization (Karan Singla)
- Gender detection (Owen He)
- Speaker identification (Owen He -- a pilot sample and a clear procedure for adding more people)
- Paralinguistic signal detection (Sri Harsha -- two or three examples)
- Emotion detection and identification (pilot sample of a few very clear emotions)
- Acoustic fingerprinting (Mattia Cerrato -- a pilot sample of recurring audio clips)
The last four modules should be implemented with a small number of examples, as a proof of concept and to provide basic functionality open to expansion.
Audio pipeline output
The output is a series of annotations in JSON-Lines and also in Red Hen's data format, with timestamps, primary tags indicating data type, and field=value pairs.
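For concreteness, a single JSON-Lines record for a gender annotation might look like the line below; the field names are purely illustrative rather than a fixed schema, and the corresponding Red Hen format lines are shown in the samples in the next section.
{"start": "20150807005002.000", "end": "20150807005009.500", "tag": "GEN_01", "Gender": "Male", "Log_Likelihood": -19.66}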
Output samples
Here are the outputs from running the pipeline on the sample video 2015-08-07_0050_US_FOX-News_US_Presidential_Politics.mp4, in the sections below.
Diarization
Speaker diarization using Karan Singla's code:
SPEAKER 2015-08-07_0050_US_FOX-News_US_Presidential_Politics 1 0.0 7.51 <NA> <NA> speaker_1.0 <NA>
SPEAKER 2015-08-07_0050_US_FOX-News_US_Presidential_Politics 1 7.5 2.51 <NA> <NA> speaker_0.0 <NA>
SPEAKER 2015-08-07_0050_US_FOX-News_US_Presidential_Politics 1 10.0 5.01 <NA> <NA> speaker_1.0 <NA>
Gender identification
Gender Identification based on Speaker Diarization results:
GEN_01|2016-03-28 16:38|Source_Program=SpeakerRec.py Data/Model/Gender/gender.model|Source_Person=He Xu
20150807005002.000|20150807005009.500|GEN_01|Gender=Male|Log Likelihood=-19.6638865771
20150807005009.500|20150807005012.000|GEN_01|Gender=Male|Log Likelihood=-21.5807774474
Gender identification can also be run without speaker diarization, using fixed 5-second segments instead.
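A minimal sketch of that fixed-window segmentation, assuming a wav file readable by scipy, might look like this:
# Sketch: split a wav file into fixed 5-second segments for per-segment
# gender or speaker classification when no diarization is available.
# The 5-second window simply mirrors the examples above.
import scipy.io.wavfile as wav

def fixed_segments(wav_path, seconds=5):
    rate, signal = wav.read(wav_path)
    step = rate * seconds
    return [(start / float(rate), signal[start:start + step])
            for start in range(0, len(signal), step)]

# Each (start_time, samples) pair can then be scored by the gender or speaker
# model in the same way as a diarization-derived segment.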
Speaker recognition
Speaker Recognition (of Donald Trump) based on Speaker Diarization results:
SPK_01|2016-03-28 16:57|Source_Program=SpeakerID.py Data/Model/Speaker/speaker.model|Source_Person=He Xu
20150807005002.000|20150807005009.500|SPK_01|Name=Other|Log Likelihood=-19.9594622598
20150807005009.500|20150807005012.000|SPK_01|Name=Other|Log Likelihood=-20.9657984337
20150807005012.000|20150807005017.000|SPK_01|Name=Other|Log Likelihood=-20.7527012621
Speaker recognition can likewise be run without speaker diarization, using the same fixed 5-second segmentation sketched above.
Acoustic fingerprinting
Acoustic Fingerprinting using Mattia Cerrato's code (as Panako database files):
https://github.com/RedHenLab/Audio/tree/master/Pipeline/Data/Fingerprint/dbs
The video, audio and feature files are not pushed to GitHub due to their large sizes, but they are part of the pipeline outputs as well.
An alternative tool for audio fingerprinting is the open-source tool dejavu on GitHub.
Training
We may have to do some training to complete the sample modules. It would be very useful if you could identify what is still needed to complete a small number of classifiers for modules 4-6, so that we can recruit students to generate the datasets. We can use ELAN, the video coding interface developed at the MPI in Nijmegen, to code some emotions (see Red Hen's integrated research workflow).
We have several thousand tpt files, and I suggest we use them to build a library of trained models for recurring speakers. The tpt files must first be aligned; they inherit their timestamps from the txt files, so they are inaccurate. We can then
- read the tpt file for boundaries
- extract the speech segments for every speaker
- concatenate the segments from the same speaker, so that we have at least 2 minutes of training data for each speaker
- feed these training data to the speaker recognition algorithm to get the models we want
This way, the entire training process can be automated.
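A rough sketch of the extraction and concatenation steps is given below; it assumes the aligned tpt file has already been parsed into (speaker, start_seconds, end_seconds) triples, and the two-minute threshold simply mirrors the target above.
# Sketch of the automated training-data extraction described above.
# The tpt parsing is assumed to be done already; the triples come from the
# aligned boundaries, and the threshold is 2 minutes of speech per speaker.
import numpy as np
import scipy.io.wavfile as wav
from collections import defaultdict

MIN_TRAINING_SECONDS = 120  # at least 2 minutes per speaker

def training_audio_per_speaker(audio_path, segments):
    # segments: list of (speaker, start_seconds, end_seconds) triples
    rate, signal = wav.read(audio_path)
    by_speaker = defaultdict(list)
    for speaker, start, end in segments:
        by_speaker[speaker].append(signal[int(start * rate):int(end * rate)])
    training = {}
    for speaker, chunks in by_speaker.items():
        joined = np.concatenate(chunks)
        if len(joined) >= MIN_TRAINING_SECONDS * rate:
            training[speaker] = (rate, joined)  # ready for the speaker recognizer
    return training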
A simple, automated method to select which speakers to train for would be to extract the unique speakers from each tpt file and then count how often they recur. I did this in the script cartago:/usr/local/bin/speaker-list; it generates this output:
tna@cartago:/tmp$ l *tpt
-rw-r--r-- 1 tna tna 125138 Apr 7 08:39 2006-Recurring-Speakers.tpt
-rw-r--r-- 1 tna tna 305251 Apr 7 08:36 2006-Speakers.tpt
-rw-r--r-- 1 tna tna 468137 Apr 7 08:40 2007-Recurring-Speakers.tpt
-rw-r--r-- 1 tna tna 1352569 Apr 7 08:28 2007-Speakers.tpt
-rw-r--r-- 1 tna tna 403465 Apr 7 08:40 2008-Recurring-Speakers.tpt
-rw-r--r-- 1 tna tna 1370985 Apr 7 08:30 2008-Speakers.tpt
-rw-r--r-- 1 tna tna 442405 Apr 7 08:40 2009-Recurring-Speakers.tpt
-rw-r--r-- 1 tna tna 1294787 Apr 7 08:31 2009-Speakers.tpt
-rw-r--r-- 1 tna tna 375668 Apr 7 08:40 2010-Recurring-Speakers.tpt
-rw-r--r-- 1 tna tna 1015746 Apr 7 08:32 2010-Speakers.tpt
-rw-r--r-- 1 tna tna 336958 Apr 7 08:40 2011-Recurring-Speakers.tpt
-rw-r--r-- 1 tna tna 1024899 Apr 7 08:33 2011-Speakers.tpt
-rw-r--r-- 1 tna tna 277164 Apr 7 08:40 2012-Recurring-Speakers.tpt
-rw-r--r-- 1 tna tna 833007 Apr 7 08:34 2012-Speakers.tpt
-rw-r--r-- 1 tna tna 342556 Apr 7 08:40 2013-Recurring-Speakers.tpt
-rw-r--r-- 1 tna tna 940021 Apr 7 08:35 2013-Speakers.tpt
-rw-r--r-- 1 tna tna 328208 Apr 7 08:40 2014-Recurring-Speakers.tpt
-rw-r--r-- 1 tna tna 1106093 Apr 7 08:37 2014-Speakers.tpt
-rw-r--r-- 1 tna tna 283859 Apr 7 08:40 2015-Recurring-Speakers.tpt
-rw-r--r-- 1 tna tna 980963 Apr 7 08:38 2015-Speakers.tpt
-rw-r--r-- 1 tna tna 75845 Apr 7 08:40 2016-Recurring-Speakers.tpt
-rw-r--r-- 1 tna tna 242861 Apr 7 08:38 2016-Speakers.tpt
So the /tmp/$YEAR-Recurring-Speakers.tpt files list how many shows a person appears in, by year. If we want more granularity, we could run this by month instead, to track who moves in and out of the news. The script tries to clean up the output a bit, though we may want to do more.
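For reference, the counting step itself is simple; the sketch below is not the speaker-list script, just an illustration, and it assumes a helper speakers_in_file() that returns the speaker names found in one tpt file.
# Illustration of the recurrence count (not the cartago speaker-list script):
# given the tpt files for one year or month, count in how many shows each
# speaker appears. speakers_in_file() is an assumed helper.
from collections import Counter

def recurring_speakers(tpt_paths, speakers_in_file):
    counts = Counter()
    for path in tpt_paths:
        counts.update(set(speakers_in_file(path)))  # one count per show
    return counts.most_common()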
If you look at the top speakers so far in 2016, it's a fascinating list:
32 Alison Kosik
32 Margaret Hoover
32 Nic Robertson
32 Polo Sandoval
32 Sen. Rand Paul
33 Jean Casarez
33 Nima Elbagir
33 Sarah Palin
33 Victor Blackwell
34 CAMEROTA
34 CLINTON
34 Sara Sidner
34 Sen. Lindsey Graham
35 CUOMO
35 Kate Bolduan
35 QUESTION
36 Chad Myers
36 Sen. Bernie Sanders (Vt-i)
37 Andy Scholes
37 Katrina Pierson
38 Jeffrey Toobin
38 Nick Paton Walsh
39 Brian Todd
39 Frederik Pleitgen
39 Hillary Rodham Clinton
39 Nancy Grace
39 Sara Ganim
40 Matt Lewis
40 Michelle Kosinski
40 Paul Cruickshank
42 Bakari Sellers
42 Bill Clinton
42 PEREIRA
43 Ana Navarro
43 Clarissa Ward
43 S.E. Cupp
43 Van Jones
44 Ben Ferguson
44 Jason Carroll
45 CLIENT
46 AVO
46 Evan Perez
46 Maeve Reston
47 David Chalian
47 David Gergen
47 Errol Louis
47 Ron Brownstein
48 SFX
49 Dr. Sanjay Gupta
50 CRUZ
51 Amanda Carpenter
51 John King
52 Miguel Marquez
53 FLO
54 Kayleigh Mcenany
55 Poppy Harlow
58 Athena Jones
59 Bernie Sanders
59 Coy Wire
60 FARMER
61 Don Lemon
61 Nick Valencia
62 Barbara Starr
63 Brian Stelter
65 Jim Sciutto
67 Gov. Chris Christie
68 VO
70 Erin Burnett
71 Carol Costello
72 Joe Johns
72 Pamela Brown
74 Ashleigh Banfield
74 Jeffrey Lord
75 Manu Raju
76 Chris Frates
81 Phil Mattingly
88 Christine Romans
90 TRUMP
93 Mark Preston
95 Brooke Baldwin
96 Gloria Borger
101 Jake Tapper
103 Jeb Bush
107 SANDERS
112 Jim Acosta
122 Michaela Pereira
127 PRESIDENTIAL CANDIDATE
131 Brianna Keilar
133 Gov. John Kasich
134 Sunlen Serfaty
135 Anderson Cooper
143 Dana Bash
144 Sen. Bernie Sanders (I-vt)
149 Sara Murray
150 John Berman
157 Wolf Blitzer
160 Barack Obama
163 Alisyn Camerota
166 Sen. Bernie Sanders
168 Jeff Zeleny
180 Chris Cuomo
258 Sen. Marco Rubio
326 Hillary Clinton
382 Sen. Ted Cruz
522 Donald Trump
You see Sanders as Sen. Bernie Sanders (I-vt), Sen. Bernie Sanders (Vt-i), Sen. Bernie Sanders, and SANDERS, and Trump as Donald Trump and TRUMP, so we should include multiple names for the same person when extracting the training data.
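One way to fold the name variants together before extracting training data is a small alias table mapping transcript names onto a canonical speaker; the entries below are taken from the list above, and such a table would have to be curated by hand or seeded from a disambiguation service like the DBpedia Spotlight setup discussed below.
# Sketch: map transcript name variants onto one canonical speaker so that
# their segments are pooled when training speaker models. Ambiguous all-caps
# names such as CLINTON (Bill vs. Hillary) need extra context and are left out.
ALIASES = {
    "SANDERS": "Bernie Sanders",
    "Sen. Bernie Sanders": "Bernie Sanders",
    "Sen. Bernie Sanders (I-vt)": "Bernie Sanders",
    "Sen. Bernie Sanders (Vt-i)": "Bernie Sanders",
    "TRUMP": "Donald Trump",
    "Hillary Rodham Clinton": "Hillary Clinton",
}

def canonical(name):
    return ALIASES.get(name.strip(), name.strip())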
For systematic disambiguation, it may be possible to use the Library of Congress Name Authority File (LCNAF). It contains 8.2 million name authority records (6 million personal, 1.4 million corporate, 180,000 meeting, and 120,000 geographic names, plus 0.5 million titles). As a publicly supported U.S. Government institution, the Library generally does not own rights in its collections or in what is posted on its website. Note, however: "Current guidelines recommend that software programs submit a total of no more than 10 requests per minute to Library applications, regardless of the number of machines used to submit requests. The Library also reserves the right to terminate programs that require more than 24 hours to complete." For an example record, see:
The virtue of using this database is that it is likely accurate. However, the records are impoverished relative to Wikipedia; arguably, it is Wikipedia that should be linking into LCNAF. It is also unclear whether the LCNAF has an API that facilitates machine searches; see LoC SRW for leads.
Pete Broadwell writes on 17 April 2016,
In brief, I think the best way to disambiguate named persons would be to set up our own local DBpedia Spotlight service:
https://github.com/dbpedia-spotlight/dbpedia-spotlight
We’ve discussed Spotlight briefly in the past; it’s trivial to set up a basic local install via apt, but (similar to Gisgraphy), I think more work will be necessary to download and integrate the larger data sets that would let us tap into the full potential of the software.
In any case, this is something Martin and I have planned to do for the library for quite some time now. I suggest that we first try installing it on babylon, with the data set and index files stored on the Isilon (which is what we do for Gisgraphy) — we could move it somewhere else if babylon is unable to handle the load. We can also see how well it does matching organizations and places (the latter could help us refine the Gisgraphy matches), though of course places and organizations don’t speak.
I share your suspicion that the LCNAF isn’t necessarily any more extensive, accurate or up-to-date than DBpedia/Wikipedia, especially for people who are in the news. It also doesn’t have its own API as far as I can tell; the suggested approach is to download the entire file as RDF triples and set up our own Apache Jena service (http://jena.apache.org/) to index them. Installing Spotlight likely would be a better use of our time.
If we set the cutoff at speakers who have appeared in at least 32 shows, we would get a list of a hundred common speakers. But it may be useful to go much further. Even people who appear in a couple of shows could be of interest; I recognize a lot of the names. That would give us thousands of speakers:
tna@cartago:/tmp$ for YEAR in {2006..2016} ; do echo -en "$YEAR: \t" ; grep -v '^ 1 ' $YEAR-Recurring-Speakers.tpt | wc -l ; done
2006: 2121
2007: 8441
2008: 6922
2009: 7360
2010: 6112
2011: 5442
2012: 4438
2013: 5416
2014: 5189
2015: 4434
2016: 1149
It's likely 2007 is high simply because we have a lot of tpt files from that year. Give some thought to this; the first step is to get the alignment going. Once we have that, we should have a large database of recurring speakers we can train with.
Efficiency coding
The PyCASP project makes an interesting distinction between efficiency coding and application coding. We have a number of applications; your task is to integrate them and make them run efficiently in an HPC environment. PyCASP is installed at the Case HPC if you would like to use it. The Berkeley team at ICSI who developed it also have some related projects, and we have good contacts with this team.
Please assess whether the efficiency coding framework could be useful in the pipeline design. It's important to bear in mind that we want this design to be clear, transparent, and easy to maintain; if introducing the PyCASP infrastructure would make the pipeline more difficult to extend, we should not use it.
Integrating the training stage
To the extent there's time, I'd also like us to consider a somewhat more ambitious project that integrates the training stage. Could you for instance sketch an outline of how we might create a processing architecture that integrates deep learning for some tasks and conceptors for others? There are a lot of machine learning tools out there; RHSoC2015 used SciPy and Kaldi. Consider Google's project TensorFlow -- this is a candidate deep learning approach for integrated multimodal data. We see this as a longer-term project.
Red Hen Audio Processing Pipeline Guide
Dependencies:
In order to run the main processing script, the following modules should be loaded on the HPC cluster:
module load boost/1_58_0
module load cuda/7.0.28
module load pycasp
module load hdf5
module load ffmpeg
The Python-related modules are installed in a virtual environment, which can be activated with the following command:
. /home/hxx124/myPython/virtualenv-1.9/ENV/bin/activate
Python Wrapper:
Although we welcome audio processing tools implemented in any language, to make it easy to integrate several audio tools into one unified pipeline, we strongly recommend that developers wrap their code as a Python module, so that the pipeline can include their work by simply importing the corresponding module. All audio-related work from GSoC 2015 has been wrapped into a Python module called "AudioPipe". See below for some examples:
# the AudioPipe Python Module
import AudioPipe.speaker.recognition as SR # Speaker Recognition Module
import AudioPipe.fingerprint.panako as FP # Acoustic Fingerprinting Module
from AudioPipe.speaker.silence import remove_silence # tool for removing silence from the audio, not needed here
import numpy as np
from AudioPipe.features import mfcc # Feature Extraction Module, part of the shared preprocessing
import scipy.io.wavfile as wav
from AudioPipe.speaker.rec import dia2spk, getspk # Speaker Recognition using diarization results
from AudioPipe.utils.utils import video2audio # Format conversion module, part of the shared preprocessing
import commands, os
from AudioPipe.diarization.diarization import Diarization # Speaker Diarization Module
import AudioPipe.data.manage as DM # Data Management Module
Pipeline Abstract Syntax:
To specify a new pipeline, one can use the abstract syntax defined in the Data Management Module, where a Node is a place to store data and a Flow is a computational process that transforms input data to output data.
For instance, one can create a Node for videos as follows:
# Select the video file to be processed
Video_node = DM.Node("Data/Video/",".mp4")
name = "2015-08-07_0050_US_FOX-News_US_Presidential_Politics"
where DM.Node(dir, ext) is the Node constructor, which takes two arguments: a directory (dir) and a file extension (ext).
And to construct a small pipeline that converts video to audio, one can do the following:
# Convert the video to audio
Audio_node = DM.Node("Data/Audio/", ".wav")
audio = Video_node.Flow(video2audio, name, Audio_node, [Audio_node.ext])
which first creates a node for audio outputs and then flows a specific file from the video node to the audio node through the computational process video2audio. The name argument specifies exactly which file is going to be processed, and it is also used as the output file name. The last argument of .Flow() is a list of the arguments required by the computational process (video2audio in this case).
The following code gives a more complex pipeline example:
# Select the file for the meta information
Meta_node = DM.Node("Data/RedHen/",".seg")
meta = Meta_node.Pick(name)
# Store the fingerprint of the video
FP_node= DM.Node("Data/Fingerprint/")
output, err, exitcode = Video_node.Flow(FP.Store, name, FP_node, [])
# Run speaker diarization on the audio
Dia_node = DM.Node("Data/Diarization/", ".rttm")
args = dict(init_cluster=20, dest_mfcc='Data/MFCC', dest_cfg="Data/Model/DiaCfg")
dia = Audio_node.Flow(Diarization, name, Dia_node, args)
# Gender Identification based on Speaker Diarization
Gen_node = DM.Node("Data/Gender/",".gen")
gen = Audio_node.Flow(dia2spk, name, Gen_node, [model_gender, dia, meta, Gen_node.ext])
# Speaker Recognition based on Speaker Diarization
Spk_node = DM.Node("Data/Speaker/",".spk")
spk = Audio_node.Flow(dia2spk, name, Spk_node, [model_speaker, dia, meta, Spk_node.ext])
In outline, this pipeline does the following:
- takes the video and stores its acoustic fingerprints;
- takes the audio and produces diarization results;
- identifies the gender of each speaker from the audio, using the boundary information provided by the diarization results;
- similarly recognizes the speakers from the audio.