Title of Invention

APPARATUS AND METHOD FOR EXTRACTING AN AMBIENT SIGNAL IN AN APPARATUS AND METHOD FOR OBTAINING WEIGHTING COEFFICIENTS FOR EXTRACTING AN AMBIENT SIGNAL AND COMPUTER PROGRAM

Abstract An apparatus for extracting an ambient signal from an input audio signal comprises a gain-value determinator configured to determine a sequence of time-varying ambient signal gain values for a given frequency band of the time-frequency distribution of the input audio signal in dependence on the input audio signal. The apparatus comprises a weighter configured to weight one of the sub-band signals representing the given frequency band of the time-frequency-domain representation with the time-varying gain values, to obtain a weighted sub-band signal. The gain-value determinator is configured to obtain one or more quantitative feature-values describing one or more features of the input audio signal and to provide the gain-value as a function of the one or more quantitative feature values such that the gain values are quantitatively dependent on the quantitative values. The gain value determinator is configured to determine the gain values such that ambience components are emphasized over non-ambience components in the weighted sub-band signal.
Full Text Apparatus and Method for Extracting an Ambient Signal in an
Apparatus and Method for Obtaining Weighting Coefficients
for Extracting an Ambient Signal and Computer Program
Description
Technical Field
Embodiments according to the invention relate to an
apparatus for extracting an ambient signal and to an
apparatus for obtaining weighting coefficients for
extracting an ambient signal.
Some embodiments according to the invention are related to
methods for extracting an ambient signal and to methods for
obtaining weighting coefficients.
Some embodiments according to the invention are directed to
a low-complexity extraction of a front signal and an
ambient signal from an audio signal for upmixing.
Background
In the following, an introduction will be given.
1 Introduction
Multi-channel audio material is becoming more and more
popular also in the consumer home environment. This is
mainly due to the fact that movies on DVD offer 5.1 multi-
channel sounds and therefore even home users frequently
install audio playback systems, which are capable of
reproducing multi-channel audio.
Such a setup may e.g. consist of three speakers (L, C, R)
in the front, two speakers (Ls, Rs) in the back and one low
frequency effects channel (LFE). For convenience, the given
explanations are related to 5.1 systems. They apply to any
other multi-channel systems with minor modifications.
Multi-channel systems provide several well-known advantages
over two-channel stereo reproduction, e.g.:
• Advantage 1: Improved front image stability even off
the optimal (central) listening position. Due to the
center channel the "sweet-spot" is enlarged. The term
"sweet-spot" denotes the area of listening positions
where an optimal sound impression is perceived.
• Advantage 2: An increased experience of "envelopment"
and spaciousness is created by the rear channel
speakers.
Nevertheless, there exists a huge amount of legacy audio
content with two audio channels ("stereo") or even only one
("mono"), e.g. old movies and television series.
Recently, various methods for generating a multi-channel
signal from an audio signal with fewer channels have been
developed (see Section 2 for an overview of the related
conventional concepts). The process of generating a multi-
channel signal from an audio signal with fewer channels is
called "upmixing".
Two concepts of upmixing are widely known.
1. Upmixing with additional information guiding the upmix
process. The additional information may be either
"encoded" in a specific way in the input signal or may be
stored additionally. This concept is frequently called
"guided upmix".
2. The "blind upmix", in which a multi-channel signal is
obtained exclusively from the audio signal, without any
additional information.
Embodiments according to the present invention are related
to the latter, i.e. the blind upmix process.
In the literature, an alternative taxonomy for upmix
processes is reported. Upmix processes may follow either
the Direct/Ambient-Concept or the "In-the-band"-Concept or
a mixture of both. These two concepts are described in the
following.
A. Direct/Ambient-Concept
The "direct sound sources" are reproduced through the three
front channels in a way that they are perceived at the same
position as in the original two-channel version. The term
"direct sound source" is used to describe a sound coming
solely and directly from one discrete sound source (e.g. an
instrument), with few or no additional sounds,
e.g. due to reflections from the walls.
The rear speakers are fed with ambient sounds (ambience-
like sounds). Ambient sounds are those forming an
impression of a (virtual) listening environment, including
room reverberation, audience sounds (e.g. applause),
environmental sounds (e.g. rain), artistically intended
effect sounds (e.g. vinyl crackling) and background noise.
Figure 23 illustrates the sound image of the original two-
channel version and Figure 24 shows the same for an upmix
following the Direct/Ambient-Concept.
B. "In-the-band"-Concept
Following the "In-the-band"-Concept, every sound, or at
least some sounds (direct sound as well as ambient sounds)
may be positioned all around the listener. The position of
a sound is independent of its characteristics (i.e. whether
it is a direct sound or an ambient sound) and only
dependent on the specific design of the algorithm and its
parameter settings. Figure 25 illustrates the sound image
of the "In-the-band"-Concept.
Apparatus and methods according to the invention relate to
the direct/ambient concept. The following section gives an
overview of conventional concepts in the context of
upmixing an audio signal with m channels to an audio signal
with n channels, with m < n.
2 Conventional concepts in blind upmixing
2.1 Upmixing of mono recordings
2.1.1 Pseudo-stereophonic processing
Most of the techniques to produce a so-called "pseudo-
stereophonic" signal are not signal adaptive. This means
that they process any mono signal in the same way, no
matter what the content is. Those systems often work with
simple filter structures and/or time delays to decorrelate
the output signals, e.g. by processing two copies of the
one-channel input signal by a pair of complementary comb
filters [Sch57]. A comprehensive overview of such systems
can be found in [Fal05].
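As an illustration of such a non-adaptive structure, the
following minimal Python sketch feeds one mono signal
through a pair of complementary comb filters. The function
name pseudo_stereo, the delay and the gain are assumptions
chosen for illustration; the sketch is not taken from
[Sch57].

```python
def pseudo_stereo(mono, delay=8, gain=0.7):
    """Feed one mono signal through two complementary FIR comb filters.

    left[n]  = x[n] + gain * x[n - delay]
    right[n] = x[n] - gain * x[n - delay]
    The same processing is applied whatever the content is.
    """
    left, right = [], []
    for n, x in enumerate(mono):
        # Samples before the start of the signal are treated as silence.
        delayed = mono[n - delay] if n >= delay else 0.0
        left.append(x + gain * delayed)
        right.append(x - gain * delayed)
    return left, right


# Impulse response of both filters for a short delay.
impulse = [1.0] + [0.0] * 9
left, right = pseudo_stereo(impulse, delay=3, gain=0.5)
```

Since the two filters are complementary, the sum of both
output channels equals twice the input signal in this
sketch.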
2.1.2 Semi-automatic mono to stereo upmixing using
sound source formation
The authors propose an algorithm to identify signal
components (e.g. time-frequency bins of a spectrogram)
which belong to the same sound source and should therefore
be panned together [LMT07]. The sound source formation
algorithm considers principles of stream segregation
(derived from the Gestalt principles): continuity in time,
harmonic relations in frequency and amplitude similarity.
Sound sources are identified using clustering methods
(unsupervised learning). The derived "time-frequency-
clusters" are further grouped into larger sound streams
using (a) information on the frequency range of the objects
and (b) timbral similarities. The authors report the use of
a sinusoidal modeling algorithm (i.e. the identification of
sinusoidal components of a signal) as a front end.
After the sound source formation, the user selects sound
sources and applies panning weights to them. It should be
noted that (according to some conventional concepts) many
of the proposed methods (sinusoidal modeling, stream
segregation) do not perform reliably when processing real-
world signals of average complexity.
2.1.3 Ambience extraction using Non-negative Matrix
Factorization
A time-frequency distribution (TFD) of the input signal is
computed, e.g. by means of Short-term Fourier Transform. An
estimate of the TFD of the direct signal components is
derived by means of the numerical optimization method of
Non-negative Matrix Factorization. An estimate of the TFD
of the ambient signal is obtained by computing the
difference of the TFD of the input signal and the estimate
of the TFD of the direct signal (i.e. the approximation
residual). The re-synthesis of the time signal of the
ambient signal is carried out using the phase spectrogram
of the input signal. Additional post-processing is
optionally applied in order to improve the listening
experience of the derived multi-channel signal [UWHH07].
2.1.4 Adaptive spectral panoramization (ASP)
A method for the panoramization of a mono signal for
playback using a stereo sound system is described in
[VZA06]. The processing incorporates an STFT, the weighting
of the frequency bins used for the re-synthesis of the left
and right channel signal, and the inverse STFT. The time-
varying weighting factors are derived from low-level
features computed from the spectrogram of the input signal
in sub-bands.
2.2 Upmixing of stereo recordings
2.2.1 Matrix decoders
Passive matrix decoders compute a multi-channel signal
using a time-invariant linear combination of the input
channel signals.
Active matrix decoders (e.g. Dolby Pro Logic II [Dre00],
DTS NEO:6 [DTS] or HarmanKardon/Lexicon Logic 7 [Kar])
apply an analysis of the input signal and perform signal-
dependent adaptation of the matrix elements (i.e. the
weights for the linear combination). These decoders use
inter-channel differences and signal adaptive steering
mechanisms to produce multi-channel output signals. Matrix
steering methods aim at detecting prominent sources (e.g.
dialogues). The processing is performed in the time domain.
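The passive case described above can be sketched in a few
lines of Python. The matrix coefficients below are
illustrative assumptions (a sum signal for the center and a
difference signal for the surround channel) and do not
reproduce any of the commercial decoders named above.

```python
import math

INV_SQRT2 = 1.0 / math.sqrt(2.0)

# Assumed, time-invariant coefficients: output name -> (weight of L, weight of R).
PASSIVE_MATRIX = {
    "L": (1.0, 0.0),
    "R": (0.0, 1.0),
    "C": (INV_SQRT2, INV_SQRT2),    # sum signal
    "S": (INV_SQRT2, -INV_SQRT2),   # difference signal
}


def passive_decode(left, right, matrix=PASSIVE_MATRIX):
    """Apply the same linear combination to every sample pair."""
    return {
        name: [a * l + b * r for l, r in zip(left, right)]
        for name, (a, b) in matrix.items()
    }


# A signal that is identical in both input channels.
out = passive_decode([1.0, 0.5, -0.25], [1.0, 0.5, -0.25])
```

With identical input channels, the difference signal "S" of
this sketch is zero. An active decoder would, in contrast,
change the matrix coefficients over time in dependence on
the input signal.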
2.2.2 A method to convert stereo to multi-channel sound
Irwan and Aarts present a method to convert a signal from
stereo to multichannel [IA01]. The signal for the surround
channels is calculated by using a cross-correlation
technique (an iterative estimation of the correlation
coefficient is proposed in order to reduce the
computational load).
The mixing coefficients for the center channel are obtained
using Principal Component Analysis (PCA). PCA is applied to
calculate a vector, which indicates the direction of the
dominant signal. Only one dominant signal can be detected
at a time. The PCA is performed using an iterative gradient
descent method (which is less demanding with respect to
computational load compared to the standard PCA using an
eigenvalue decomposition of the covariance matrix of the
observation). The computed vector of direction is similar
to the output of a goniometer if all decorrelated signal
components are neglected. The direction is then mapped from
a two- to a three-channel representation to create the 3
front channels.
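The idea of estimating the dominant direction iteratively,
without an eigenvalue decomposition, may be illustrated by
the following Python sketch. It uses Oja's rule, a well-known
stochastic-gradient update for the first principal
component, as a stand-in; the exact update used in [IA01] is
not reproduced here, and the function name and step size are
assumptions.

```python
import math


def first_principal_direction(left, right, step=0.05):
    """Estimate the dominant direction of a stereo signal sample by
    sample using Oja's rule (a stochastic-gradient PCA update)."""
    w = [1.0, 0.0]  # arbitrary unit-length start vector
    for l, r in zip(left, right):
        y = w[0] * l + w[1] * r            # projection onto current estimate
        w[0] += step * y * (l - y * w[0])  # move towards the input sample,
        w[1] += step * y * (r - y * w[1])  # keeping the norm close to 1
    return w


# A source that is equally loud in both channels (panned to the center).
signal = [math.sin(0.1 * n) for n in range(5000)]
direction = first_principal_direction(signal, signal)
angle_deg = math.degrees(math.atan2(direction[1], direction[0]))
```

For this centered test signal the estimated direction lies
at an angle of about 45 degrees between the two channel
axes, which is what a goniometer would display for a fully
correlated signal.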
2.2.3 An unsupervised adaptive filtering approach of 2-
to-5 channel upmix
The authors propose an improved algorithm compared to the
method by Irwan and Aarts. The originally proposed method
is applied to each sub-band [LD05]. The authors assume W-
disjoint orthogonality of the dominant signals. The
frequency decomposition is carried out using either a
Pseudo Quadrature Mirror Filterbank or a wavelet-based
octave filter-bank. A further extension to the method by
Irwan and Aarts is the use of an adaptive step size for the
iterative computation of the (first) principal component.
2.2.4 Ambience Extraction and Synthesis from Stereo
Signals for Multi-channel Audio Upmix
Avendano and Jot propose a frequency-domain technique to
identify and extract the ambience information in stereo
audio signals [AJ02].
The method is based on the computation of an inter-channel
coherence index and a non-linear mapping function that
allows for the determination of the time-frequency regions
that consist mostly of ambience components. Ambient signals
are subsequently synthesized and used to feed the surround
channels of the multi-channel playback system.
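A minimal Python sketch of such a coherence-based analysis
for a single sub-band is given below. The recursive
averaging with a forgetting factor and, in particular, the
tanh-shaped mapping function with its threshold and slope
are assumptions made for illustration; they are not the
parameters of [AJ02].

```python
import math


def coherence_index(xl, xr, forget=0.9, eps=1e-12):
    """Track the inter-channel coherence of one sub-band over time.

    xl, xr: sequences of (complex) sub-band coefficients, one per frame.
    Returns one value between 0 and 1 per frame.
    """
    p_ll = p_rr = 0.0
    p_lr = 0j
    track = []
    for a, b in zip(xl, xr):
        # Recursively smoothed auto- and cross-spectra.
        p_ll = forget * p_ll + (1.0 - forget) * abs(a) ** 2
        p_rr = forget * p_rr + (1.0 - forget) * abs(b) ** 2
        p_lr = forget * p_lr + (1.0 - forget) * a * b.conjugate()
        track.append(abs(p_lr) / math.sqrt(p_ll * p_rr + eps))
    return track


def ambience_index(coherence, threshold=0.5, slope=8.0):
    """Hypothetical non-linear mapping: low coherence -> value near 1."""
    return 0.5 * (1.0 - math.tanh(slope * (coherence - threshold)))


pattern = [1 + 0j, 1j, -1 + 0j, -1j] * 50
mirrored = [z.conjugate() for z in pattern]
coherent = coherence_index(pattern, pattern)[-1]      # identical channels
incoherent = coherence_index(pattern, mirrored)[-1]   # alternating phase
```

Time-frequency regions with a low coherence value receive an
ambience index close to one and would therefore be treated
as consisting mostly of ambience components.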
2.2.5 Descriptor based spatialization
The authors describe a method for one-to-n upmixing, which
can be controlled by an automated classification of the
signal [MPA+05]. The paper contains some errors; therefore
it might be that the authors aimed at different goals than
described in the paper.
The upmix process uses three processing blocks: the "upmix
tool", artificial reverberation and equalization. The
"upmix tool" consists of various processing blocks,
including the extraction of an ambient signal. The method
for the extraction of an ambient signal ("spatial
discriminator") is based on the comparison of the left and
right signal of a stereo recording in the spectral domain.
For upmixing mono-signals, artificial reverberation is
used.
The authors describe 3 applications: 1-to-2 upmixing, 2-to-
5 upmixing, and 1-to-5 upmixing.
Classification of the audio signal: The classification
process uses a supervised learning approach: Low-level
features are extracted from the audio signal and a
classifier is applied to classify the audio signal into one
of three classes: music, voices or any other sounds.
A particularity of the classification process is the use of
a genetic programming method to find
• optimal features (as compositions of different
operations)
• optimal combination of the obtained low-level features
• the best classifier from a set of available
classifiers
• the best parameter setting for the chosen classifier
1-to-2 upmixing: The upmix is done using reverberation
and equalization. If the signal contains voice, the
equalization is enabled and reverberation is disabled.
Otherwise, the equalization is disabled and reverberation
is enabled. No dedicated processing aiming at the
suppression of speech in the rear channels is incorporated.
2-to-5 upmixing: The authors aim at building a multi-
channel soundtrack in which detected voices are attenuated
by muting the center channel.
1-to-5 upmixing: The multi-channel signal is generated
using reverberation, equalization and the "upmix tool"
(which generates a 5.1 signal from a stereo signal; the
stereo signal is the output of the reverberation and the
input to the "upmix tool"). Different presets are used for
music, voices and all other sounds. By controlling
reverberation and equalization, a multi-channel soundtrack
is built that keeps voices in the center channel and has
music and other sounds in all channels.
If the signal contains voice, the reverberation is
disabled. Otherwise, reverberation is enabled. Since the
extraction of the rear-channel signal relies on a stereo
signal, no rear-channel signal is generated when
reverberation is disabled (which is the case for voices).
2.2.6 Ambience-based upmixing
Soulodre presents a system, which creates a multi-channel
signal from a stereo signal [Sou04]. The signal is
decomposed into so-called "individual source streams" and
"ambience streams". Based on these streams a so-called
"Aesthetic Engine" synthesizes the multi-channel output. No
further technical details of the decomposition and the
synthesis steps are given.
2.3 Upmixing of audio signals with arbitrary number
of channels
2.3.1 Multichannel surround format conversion and
generalized up-mix
The authors describe a method based on spatial audio coding
using an intermediate mono downmix and introduce an
improved method without the intermediate downmix. The
improved method comprises passive matrix upmixing and
principles known from Spatial Audio Coding. The
improvements are gained at the expense of increased data
rate of the intermediate audio [GJ07a].
2.3.2 Primary-ambient signal decomposition and vector-
based localization for spatial audio coding and
enhancement
The authors propose a separation of the input signal into a
primary (direct) signal and an ambient signal using
Principal Component Analysis (PCA) [GJ07b].
The input signal is modeled as the sum of a primary
(direct) signal and an ambient signal. It is assumed that
the direct signals have substantially more energy than the
ambient signal and both signals are uncorrelated.
The processing is carried out in the frequency domain. The
STFT coefficients of the direct signal are obtained from
the projection of the STFT coefficients of the input signal
onto the first principal component. The STFT coefficients
of the ambient signal are computed from the difference of
the STFT coefficients of the input signal and the direct
signal.
Since only the (first) principal component (i.e. the
eigenvector of the covariance matrix corresponding to the
largest eigenvalue) is needed, a computationally efficient
alternative for the eigenvalue decomposition used in
standard PCA is applied (which is an iterative
approximation). The cross-correlation needed for the PCA
decomposition is also estimated iteratively. The direct and
ambient signal add up to the original, i.e. no information
is lost in the decomposition.
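The projection and the residual described above can be
illustrated for one frequency band by the following Python
sketch. For brevity it computes the first principal
component of the 2x2 covariance matrix in closed form rather
than by the iterative approximation mentioned above; the
function name and the block-wise averaging are assumptions.

```python
import math


def primary_ambient_split(xl, xr):
    """Split stereo coefficients of one band into direct and ambient parts.

    The direct part is the projection onto the first principal component
    of the 2x2 covariance matrix; the ambient part is the residual.
    """
    n = len(xl)
    r_ll = sum(abs(a) ** 2 for a in xl) / n
    r_rr = sum(abs(b) ** 2 for b in xr) / n
    r_lr = sum(a * b.conjugate() for a, b in zip(xl, xr)) / n

    # Largest eigenvalue of [[r_ll, r_lr], [conj(r_lr), r_rr]].
    half_diff = 0.5 * (r_ll - r_rr)
    lam = 0.5 * (r_ll + r_rr) + math.sqrt(half_diff ** 2 + abs(r_lr) ** 2)

    # Matching eigenvector; fall back to the louder channel if it vanishes.
    v = [r_lr, lam - r_ll]
    norm = math.sqrt(abs(v[0]) ** 2 + abs(v[1]) ** 2)
    if norm < 1e-12:
        v, norm = ([1.0, 0.0] if r_ll >= r_rr else [0.0, 1.0]), 1.0
    v = [v[0] / norm, v[1] / norm]

    direct, ambient = [], []
    for a, b in zip(xl, xr):
        coeff = v[0].conjugate() * a + v[1].conjugate() * b  # v^H x
        d = (v[0] * coeff, v[1] * coeff)
        direct.append(d)
        ambient.append((a - d[0], b - d[1]))  # residual
    return direct, ambient


# Fully correlated channels: everything is assigned to the direct part.
band = [1 + 1j, 2 - 1j, -0.5j]
direct, ambient = primary_ambient_split(band, band)
```

Because the ambient part is computed as the difference
between the input and the direct part, both parts add up to
the input again.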
Summary
In view of the above, there is a need for a low-complexity
extraction of an ambient signal from an input audio signal.
Some embodiments according to the invention create an
apparatus for extracting an ambient signal on the basis of
a time-frequency-domain representation of an input audio
signal, the time-frequency-domain representation
representing the input audio signal in terms of a plurality
of sub-band signals describing a plurality of frequency
bands. The apparatus comprises a gain-value determinator
configured to determine a sequence of time-varying ambient
signal gain values for a given frequency band of the time-
frequency-domain representation of the input audio signal
in dependence on the input audio signal. The apparatus
comprises a weighter configured to weight one of the sub-
band signals representing the given frequency band of the
time-frequency-domain representation with the time-varying
gain values to obtain a weighted sub-band signal. The gain-
value determinator is configured to obtain one or more
quantitative feature values describing one or more features
or characteristics of the input audio signal, and to
provide the gain-values as a function of the one or more
quantitative feature values, such that the gain values are
quantitatively dependent on the quantitative feature
values. The gain-value determinator is configured to
provide the gain-values such that ambient components are
emphasized over non-ambient components in the weighted sub-
band signal.
Some embodiments according to the invention provide an
apparatus for obtaining weighting coefficients for
extracting an ambient signal from an input audio signal.
The apparatus comprises a weighting coefficient
determinator configured to determine the weighting
coefficients such that gain values obtained on the basis
of a weighted combination, using the weighting coefficients
(or defined by the weighting coefficients), of a plurality
of quantitative feature values describing a plurality of
features of a coefficient-determination input audio signal
approximate expected gain-values associated with the
coefficient-determination input audio signal.
Some embodiments according to the invention provide methods
for extracting an ambient signal and for obtaining
weighting coefficients.
Some embodiments according to the invention are based on
the finding that an ambient signal can be extracted from an
input audio signal in a particularly efficient and flexible
manner by determining quantitative feature values, for
example a sequence of quantitative feature values
describing one or more features of the input audio signal,
as such quantitative feature values can be provided with
limited computational effort and can be translated into
gain-values efficiently and flexibly. By describing one or
more features in terms of one or more sequences of
quantitative feature values, gain values can easily be
obtained, which are quantitatively dependent on the
quantitative feature values. For example, simple
mathematical mappings can be used to derive the gain-values
from the feature-values. In addition, by providing the
gain-values such that the gain-values are quantitatively
dependent on the feature values, a fine-tuned extraction of
the ambient components from the input audio signal can be
obtained. Rather than making a hard decision as to which
components of the input audio signal are the ambient
components and which components of the input audio signal
are non-ambient components, a gradual extraction of the
ambient components can be performed.
In addition, the usage of quantitative feature values
allows for a particularly efficient and precise combination
of feature values describing different features.
Quantitative feature values can, for example, be scaled or
processed in a linear or a non-linear way according to
mathematical processing rules.
In some embodiments in which multiple feature values are
combined to obtain a gain value, details regarding the
combination (for example, details regarding a scaling of
different feature values) can be adjusted easily, for
example by adjusting respective coefficients.
To summarize the above, a concept for extracting an ambient
signal comprising a determination of quantitative feature
values and also comprising a determination of gain values
on the basis of the quantitative feature values may
constitute an efficient and low-complexity concept of
extracting an ambient signal from an input audio signal.
In some embodiments according to the invention, it has been
shown to be particularly efficient to weight one or more of
the sub-band signals of the time-frequency-domain
representation of the input audio signal. By weighting one
or more of the sub-band signals of the time-frequency-
domain representation, a frequency-selective or specific
extraction of ambient signal components from the input
audio signal can be achieved.
Some embodiments according to the invention create an
apparatus for obtaining weighting coefficients for
extracting an ambient signal from an input audio signal.
Some of these embodiments are based on the finding that
coefficients for an extraction of an ambient signal can be
obtained on the basis of a coefficient-determination-input-
audio-signal, which can be considered as a "calibration
signal" or "reference signal" in some embodiments. By using
such a coefficient-determination input audio signal,
expected gain values of which are for example known or can
be obtained with moderate effort, coefficients defining a
combination of quantitative feature values can be obtained,
such that the combination of quantitative feature values
results in gain values which approximate the expected gain
values.
According to said concept, it is possible to obtain a set
of appropriate weighting coefficients, such that an ambient
signal extractor configured with these coefficients may
perform a sufficiently good extraction of ambient signals
(or ambient components) from input audio signals, which are
similar to the coefficient-determination-input-audio-
signal.
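One possible way of obtaining such coefficients is sketched
below in Python. The sketch assumes a purely linear
combination of the feature values and a least-squares
criterion for "approximating" the expected gain values;
neither assumption is stated at this point of the
description, and the function name is hypothetical.

```python
def fit_weighting_coefficients(features, expected_gains):
    """Least-squares fit: find w so that sum_k w[k] * features[t][k]
    approximates expected_gains[t] for all frames t."""
    k = len(features[0])
    # Normal equations (F^T F) w = F^T g, stored as an augmented matrix.
    m = [[sum(row[i] * row[j] for row in features) for j in range(k)]
         + [sum(row[i] * g for row, g in zip(features, expected_gains))]
         for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    w = [0.0] * k
    for i in reversed(range(k)):
        rest = sum(m[i][j] * w[j] for j in range(i + 1, k))
        w[i] = (m[i][k] - rest) / m[i][i]
    return w


# Two feature values per frame of a (made-up) calibration signal and the
# gain values expected for these frames.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]]
expected = [0.3, 0.5, 0.8, 0.25]
weights = fit_weighting_coefficients(features, expected)
```

The sketch does not handle linearly dependent features, for
which the normal equations become singular.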
In some embodiments according to the invention, the
apparatus for obtaining weighting coefficients allows for
an efficient adaptation of an apparatus for extracting an
ambient signal to different types of input audio signals.
For example, on the basis of a "training signal", i.e. a
given audio signal which serves as the coefficient-
determination-input-audio-signal, and which may be adapted
to the listening preferences of a user of an ambient signal
extractor, an appropriate set of weighting coefficients can
be obtained. In addition, by providing the weighting
coefficients, optimal usage can be made of the available
quantitative feature values describing different features.
Further details, effects and advantages of embodiments
according to the invention will be described subsequently.
Brief Description of the Drawings
Embodiments according to the invention will subsequently be
described with reference to the enclosed figures, in which:
Fig. 1 shows a block schematic diagram of an apparatus for
extracting an ambient signal, according to an embodiment
according to the invention;
Fig. 2 shows a detailed block schematic diagram of an
apparatus for extracting an ambient signal from an input
audio signal, according to an embodiment according to the
invention;
Fig. 3 shows a detailed block schematic diagram of an
apparatus for extracting an ambient signal from an input
audio signal, according to an embodiment according to the
invention;
Fig. 4 shows a block schematic diagram of an apparatus for
extracting an ambient signal from an input audio signal,
according to an embodiment according to the invention;
Fig. 5 shows a block schematic diagram of a gain value
determinator, according to an embodiment according to the
invention;
Fig. 6 shows a block schematic diagram of a weighter,
according to an embodiment according to the invention;
Fig. 7 shows a block schematic diagram of a post processor,
according to an embodiment according to the invention;
Figs. 8a and 8b show extracts from a block schematic
diagram of an apparatus for extracting an ambient signal,
according to embodiments according to the invention;
Fig. 9 shows a graphical representation of the concept of
extracting feature values from a time-frequency-domain
representation;
Fig. 10 shows a block diagram of an apparatus or a method
for performing a 1-to-5 upmixing, according to an
embodiment according to the invention;
Fig. 11 shows a block diagram of an apparatus or of a
method for extracting an ambient signal, according to an
embodiment according to the invention;
Fig. 12 shows a block diagram of an apparatus or a method
for performing a gain computation, according to an
embodiment according to the invention;
Fig. 13 shows a block schematic diagram of an apparatus for
obtaining weighting coefficients, according to an
embodiment according to the invention;
Fig. 14 shows a block schematic diagram of another
apparatus for obtaining weighting coefficients, according
to an embodiment according to the invention;
Figs. 15a and 15b show block schematic diagrams of apparatus
for obtaining weighting coefficients, according to
embodiments according to the invention;
Fig. 16 shows a block schematic diagram of an apparatus for
obtaining weighting coefficients, according to an
embodiment according to the invention;
Fig. 17 shows an extract of a block schematic diagram of an
apparatus for obtaining weighting coefficients, according
to an embodiment according to the invention;
Figs. 18a and 18b show block schematic diagrams of
coefficient determination signal generators, according to
embodiments according to the invention;
Fig. 19 shows a block schematic diagram of a coefficient-
determination signal generator, according to an embodiment
according to the invention;
Fig. 20 shows a block schematic diagram of a coefficient-
determination signal generator, according to an embodiment
according to the invention;
Fig. 21 shows a flow chart of a method for extracting an
ambient signal from an input audio signal, according to an
embodiment according to the invention;
Fig. 22 shows a flow chart of a method for determining
weighting coefficients, according to an embodiment
according to the invention;
Fig. 23 shows a graphical representation illustrating a
stereo playback;
Fig. 24 shows a graphical representation illustrating a
direct/ambient concept; and
Fig. 25 shows a graphical representation illustrating an
in-the-band-concept.
Detailed Description of the Embodiments
Apparatus for extracting an ambient signal - first
embodiment
Fig. 1 shows a block schematic diagram of an apparatus for
extracting an ambient signal from an input audio signal.
The apparatus shown in Fig. 1 is designated in its entirety
with 100. The apparatus 100 is configured to receive an
input audio signal 110 and to provide at least one weighted
sub-band signal on the basis of the input audio signal such
that ambience components are emphasized over non-ambience
components in the weighted sub-band signal. The apparatus
100 comprises a gain value determinator 120. The gain value
determinator 120 is configured to receive the input audio
signal 110 and to provide a sequence of time varying
ambient signal gain values 122 (also briefly designated as
gain-values) in dependence on the input audio signal 110.
The apparatus 100 also comprises a weighter 130.
The weighter 130 is configured to receive a time-frequency-
domain representation of the input audio signal or at least
one sub-band signal thereof. The sub-band signal may
describe one frequency band or one frequency sub-band of
the input audio signal. The weighter 130 is further
configured to provide the weighted sub-band signal 112 in
dependence on the sub-band signal 132, and also in
dependence on the sequence of time-varying ambient signal
gain values 122.
Based on the above structural description, the
functionality of the apparatus 100 will be described in the
following. The gain-value determinator 120 is configured to
receive the input audio signal 110 and to obtain one or
more quantitative feature values describing one or more
features or characteristics of the input audio signal. In
other words, the gain value determinator 120 may, for
example, be configured to obtain quantitative information
characterizing one feature or characteristic of the input
audio signal. Alternatively, the gain-value determinator
120 may be configured to obtain a plurality of quantitative
feature values (or sequences thereof) describing a
plurality of features of the input audio signal. Thus,
certain characteristics of the input audio signal, also
designated as features (or, in some embodiments, as "low-
level features") may be evaluated for providing the
sequence of gain-values. The gain-value determinator 120 is
further configured to provide the sequence 122 of time-
varying ambient signal gain-values as a function of the one
or more quantitative feature values (or the sequences
thereof).
In the following, the term "feature" will sometimes be used
to designate a feature or a characteristic in order to
shorten the description.
In some embodiments, the gain-value determinator 120 is
configured to provide the time-varying ambient signal gain-
values such that the gain-values are quantitatively
dependent on the quantitative feature values. In other
words, in some embodiments the feature values may take
multiple values (in some cases more than two values, and in
some cases even more than ten values, and in some cases
even a quasi-continuous number of values), and the
corresponding ambient signal gain-values may follow (at
least over a certain range of feature values) the feature
values in a linear or non-linear way. Thus, in some
embodiments, a gain-value may increase monotonically with
an increase of one of the one or more corresponding
quantitative feature-values. In another embodiment, the
gain-value may decrease monotonically with an increase of
one of the one or more corresponding quantitative feature-values.
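A monotonic mapping of this kind may, for example, look as
in the following Python sketch. The piecewise-linear shape
and the parameters lo and hi are assumptions chosen for
illustration; a hard decision is included only for contrast.

```python
def gain_from_feature(value, lo=0.2, hi=0.8, increasing=True):
    """Map one quantitative feature value to a gain in [0, 1].

    Piecewise-linear and monotonic: 0 below `lo`, 1 above `hi`
    (or the reverse when increasing=False).
    """
    position = (value - lo) / (hi - lo)
    gain = min(1.0, max(0.0, position))
    return gain if increasing else 1.0 - gain


def hard_decision(value, threshold=0.5):
    """Binary alternative, shown only for contrast."""
    return 1.0 if value >= threshold else 0.0


feature_values = [v / 10.0 for v in range(11)]
gradual = [gain_from_feature(v) for v in feature_values]
binary = [hard_decision(v) for v in feature_values]
```

In the gradual mapping, the gain value follows the feature
value over the range between lo and hi, whereas the hard
decision only takes two values.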
In some embodiments, the gain-value determinator may be
configured to generate a sequence of quantitative feature
values describing a temporal evolution of a first feature.
Accordingly, the gain-value determinator may, for example,
be configured to map the sequence of feature-values
describing the first feature on a sequence of gain-values.
In some other embodiments, the gain value determinator may
be configured to provide or calculate a plurality of
sequences of feature-values describing a temporal evolution
of a plurality of different features of the input audio
signal 110. Accordingly, the plurality of sequences of
quantitative feature-values may be mapped to a sequence of
gain-values.
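The mapping of several feature sequences to one sequence of
gain-values can be sketched in Python as follows. The
weighted sum, the limitation to the range from 0 to 1 and
all names and numbers are assumptions made for illustration
only.

```python
def gains_from_features(feature_sequences, weights, offset=0.0):
    """Map several feature sequences to one gain sequence.

    feature_sequences[k][t] is the value of feature k in frame t.
    The gain of frame t is a weighted sum of the feature values,
    limited to the range [0, 1].
    """
    n_frames = len(feature_sequences[0])
    gains = []
    for t in range(n_frames):
        combined = offset + sum(
            w * seq[t] for w, seq in zip(weights, feature_sequences))
        gains.append(min(1.0, max(0.0, combined)))
    return gains


# Two hypothetical features observed over three frames.
feature_a = [0.1, 0.5, 0.9]
feature_b = [0.8, 0.4, 0.0]
gains = gains_from_features([feature_a, feature_b], weights=[1.0, -0.5])
```

In this example the first feature raises the gain value and
the second feature lowers it, so that each frame receives a
single gain value reflecting both features.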
To summarize the above, the gain-value determinator may
evaluate one or more features of the input audio signal in
a quantitative way and may provide the gain values based
thereon.
The weighter 130 is configured to weight a portion of a
frequency spectrum of the input audio signal 110 (or even
the complete frequency spectrum) in dependence on the
sequence of time-varying ambient signal gain-values 122.
For this purpose, the weighter receives at least one sub-
band signal 132 (or a plurality of sub-band signals) of a
time-frequency-domain representation of the input audio
signal.
The gain-value determinator 120 may be configured to
receive the input audio signal either in a time-domain
representation or in a time-frequency-domain
representation. However, it has been found that the process
of extracting the ambient signal can be performed in a
particularly efficient manner if the weighting of the input
signal is performed by the weighter using a time-frequency-
domain representation of the input audio signal 110. The weighter 130 is
configured to weight the at least one sub-band signal 132
of the input audio signal in dependence on the gain values
122. The weighter 130 is configured to apply the gain
values of the sequence of gain values to the one or more
sub-band signals 132 to scale the sub-band signals, to
obtain one or more weighted sub-band signals 112.
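The scaling performed by the weighter 130 can be sketched,
under simplifying assumptions, as follows. The sketch
assumes a complex-valued sub-band signal with one
coefficient per temporal segment and one gain value per
temporal segment; the function name weight_sub_band is
hypothetical.

```python
def weight_sub_band(sub_band, gains):
    """Scale each temporal segment of one sub-band signal by its gain."""
    if len(sub_band) != len(gains):
        raise ValueError("one gain value per temporal segment is expected")
    return [g * x for g, x in zip(gains, sub_band)]


# Sub-band signal 132 and gain values 122 (made-up numbers).
sub_band_132 = [1 + 1j, 0.5 - 2j, -3 + 0j]
gains_122 = [1.0, 0.5, 0.0]
weighted_112 = weight_sub_band(sub_band_132, gains_122)
```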
In some embodiments, the gain-value determinator 120 is
configured such that features of the input audio signal are
evaluated, which characterize (or at least provide an
indication) whether the input audio signal 110 or a sub-
band thereof (represented by a sub-band signal 132) is
likely to represent an ambient component or a non-ambient
component of an audio signal. However, the feature values
processed by the gain value determinator may be chosen to
provide a quantitative information regarding a relationship
between ambient components and non-ambient components
within the input audio signal 110. For example, the feature
values may carry an information (or at least an indication)
regarding a relationship between ambient components and
non-ambient components in the input audio signal 110, or at
least an information describing an estimate thereof.
Accordingly, the gain-value determinator 120 may be
configured to generate the sequence of gain-values such
that ambience components are emphasized with respect to
non-ambience components in the weighted sub-band signal
112, weighted in accordance with the gain-values 122.
To summarize the above, the functionality of the apparatus
100 is based on a determination of a sequence of gain-
values on the basis of one or more sequences of
quantitative feature-values describing features of the
input audio signal 110. The sequence of gain-values is
generated such that the sub-band signal 132 representing a
frequency band of the input audio signal 110 is scaled with
a large gain value if the feature-values indicate a
comparatively large "ambience-likeliness" of the respective
time-frequency bin and such that the frequency band of the
input audio signal 110 is scaled with a comparatively small
gain-value if the one or more features considered by the
gain-value determinator indicate a comparatively low
"ambience-likeliness" of the respective time-frequency bin.
Apparatus for extracting an ambient signal - second
embodiment
Taking reference now to Fig. 2, an optional extension of
the apparatus 100 shown in Fig. 1 will be described. Fig. 2
shows a detailed block schematic diagram of an apparatus
for extracting an ambient signal from an input audio
signal. The apparatus shown in Fig. 2 is designated in its
entirety with 200.
The apparatus 200 is configured to receive an input audio
signal 210 and to provide a plurality of output sub-band
signals 212a to 212d, some of which may be weighted.
The apparatus 200 may, for example, comprise an analysis
filterbank 216, which may be considered as optional. The
analysis filterbank 216 may, for example, be configured to
receive the input audio signal content 210 in a time-domain
representation and to provide a time-frequency-domain
representation of the input audio signal. The time-
frequency-domain representation of the input audio signal
may, for example, describe the input audio signal in terms
of a plurality of sub-band signals 218a to 218d. The sub-
band signals 218a to 218d may, for example, represent a
temporal evolution of an energy, which is present in
different sub-bands or frequency bands of the input audio
signal 210. For example, the sub-band signals 218a to 218d
may represent a sequence of Fast Fourier transform
coefficients for subsequent (temporal) portions of the
input audio signal 210. For example, the first sub-band
signal 218a may describe a temporal evolution of an energy,
which is present in a given frequency sub-band of the input
audio signal in subsequent temporal segments, which may be
overlapping or non-overlapping. Similarly, the other sub-
band signals 218b to 218d may describe a temporal evolution
of energies present in other sub-bands.
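A minimal sketch of such an analysis filterbank is given
below. It assumes a plain discrete Fourier transform of
overlapping temporal segments without a window function; the
names dft and analysis_filterbank and the parameters
frame_len and hop are hypothetical. A practical filterbank
would typically use a fast transform and a window.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one temporal segment."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def analysis_filterbank(signal, frame_len=8, hop=4):
    """Return sub_bands[k][m]: coefficient of band k in segment m."""
    frames = [
        signal[start:start + frame_len]
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]
    spectra = [dft(frame) for frame in frames]
    num_bands = frame_len // 2 + 1  # non-redundant bands of a real signal
    return [[spectrum[k] for spectrum in spectra] for k in range(num_bands)]


# A cosine with two periods per segment falls into band index 2.
signal = [math.cos(2 * math.pi * 2 * t / 8) for t in range(16)]
sub_bands = analysis_filterbank(signal)
energy_band_2 = [abs(c) ** 2 for c in sub_bands[2]]  # temporal evolution
```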
The gain-value determinator may (optionally) comprise a
plurality of quantitative feature value determinators 250,
252, 254. The quantitative feature value determinators 250,
252, 254 may, in some embodiments, be part of the gain-
value determinator 220. However, in other embodiments, the
quantitative feature value determinators 250, 252, 254 may
be external to the gain-value determinator 220. In this
case, the gain-value determinator 220 may be configured to
receive quantitative feature values from external
quantitative feature value determinators. Both receiving
externally generated quantitative feature values and
internally generating quantitative feature values will be
considered as "obtaining" quantitative feature values.
The quantitative feature value determinators 250, 252, 254
may, for example, be configured to receive an information
about the input audio signal and to provide quantitative
feature values 250a, 252a, 254a describing, in a
quantitative manner, different features of the input audio
signal.
In some embodiments, the quantitative feature value
determinators 250, 252, 254 are chosen to describe, in
terms of corresponding quantitative feature values 250a,
252a, 254a, features of the input audio signal 210, which
provide an indication with respect to an ambience-
component-content of the input audio signal 210 or with
respect to a relationship between an ambience-component-
content and a non-ambience-component-content of the input
audio signal 210.
The gain value determinator 220 further comprises a
weighting combiner 260. The weighting combiner 260 may be
configured to receive the quantitative feature values 250a,
252a, 254a and to provide, on the basis thereof, a gain-
value 222 (or a sequence of gain values). The gain value
222 (or the sequence of gain values) may be used by a
weighter unit to weight one or more of the sub-band signals
218a, 218b, 218c, 218d. For example, the weighter unit
(also sometimes designated briefly as "weighter") may
comprise, for example, a plurality of individual scalers or
individual weighters 270a, 270b, 270c. For example, a first
individual weighter 270a may be configured to weight a
first sub-band signal 218a in dependence on the gain value
(or sequence of gain values) 222. Thus, the first weighted
sub-band signal 212a is obtained. In some embodiments, the
gain value (or sequence of gain values) 222 may be used to
weight additional sub-band signals. In an embodiment, an
optional second individual weighter 270b may be configured
to weight the second sub-band signal 218b to obtain the
second weighted sub-band signal 212b. Further, a third
individual weighter 270c may be used to weight the third
sub-band signal 218c to obtain the third weighted sub-band
signal 212c. It can be seen from the above discussion that
the gain value (or the sequence of gain values) 222 can be
used to weight one or more of the sub-band signals 218a,
218b, 218c, 218d representing the input audio signal in the
form of a time-frequency-domain representation.
Quantitative-feature-value determinators
In the following, various details regarding the
quantitative-feature-value determinators 250, 252, 254 will
be described.
The quantitative feature value determinators 250, 252, 254
may be configured to use different types of input
information. For example, the first quantitative feature
value determinator 250 may be configured to receive, as an
input information, a time-domain representation of the
input audio signal, as shown in Fig. 2. Alternatively, the
first quantitative feature value determinator 250 may be
configured to receive an input information describing the
overall spectrum of the input audio signal. Thus, in some
embodiments, at least one quantitative feature value 250a
may (optionally) be calculated on the basis of the time-
domain representation of the input audio signal or on the
basis of another representation describing the input audio
signal in its entirety (at least for a given period in
time).
The second quantitative feature value determinator 252 is
configured to receive, as an input information, a single
sub-band signal, for example, the first sub-band signal
218a. Thus, the second quantitative-feature-value
determinator may, for example, be configured to provide the
corresponding quantitative-feature-value 252a on the basis
of a single sub-band signal. In an embodiment in which the
gain value 222 (or the sequence thereof) is applied only to
a single sub-band signal, the sub-band signal to which the
gain value 222 is applied, may then be identical to the
sub-band signal used by the second quantitative feature
value determinator 252.
The third quantitative feature value determinator 254 may,
for example, be configured to receive, as an input
information, a plurality of sub-band signals. For example,
the third quantitative feature value determinator 254 is
configured to receive, as an input information, the first
sub-band signal 218a, the second sub-band signal 218b and
the third sub-band signal 218c. Thus, the quantitative
feature value determinator 254 is configured to provide the
quantitative feature value 254a on the basis of a plurality
of sub-band signals. In an embodiment in which the gain
value 222 (or a sequence thereof) is applied to weight a
plurality of sub-band signals (for example, the sub-band
signals 218a, 218b, 218c), the sub-band signals to which
the gain value 222 is applied, may be identical to the sub-
band signals evaluated by the third quantitative feature
value determinator 254.
To summarize the above, the gain value determinator 220
may, in some embodiments, comprise a plurality of different
quantitative feature value determinators configured to
evaluate different input information in order to obtain a
plurality of different feature values 250a, 252a, 254a. In
some embodiments, one or more of the feature value
determinators may be configured to evaluate features on the
basis of a broadband representation of the input audio
signal (for example, on the basis of the time-domain
representation of the input audio signal), while other
feature value determinators may be configured to evaluate
only a portion of a frequency spectrum of the input audio
signal 210, or even only a single frequency band or
frequency sub-band.
Weighting
In the following, some details regarding the weighting of
the quantitative feature values, which is performed, for
example, by the weighting combiner 260, will be described.
The weighting combiner 260 is configured to obtain, on the
basis of the quantitative feature values 250a, 252a, 254a
provided by the quantitative feature value determinators
250, 252, 254, the gain values 222. The weighting combiner
may, for example, be configured to linearly scale the
quantitative feature values provided by the quantitative
feature value determinators. In some embodiments, the
weighting combiner may be considered to form a linear
combination of the quantitative feature values, wherein
different weights (which may, for example, be described by
respective weighting coefficients) may be associated to the
quantitative feature values. In some embodiments, the
weighting combiner may also be configured to process the
feature values provided by the quantitative feature value
determinators in a non-linear way. The non-linear
processing may, for example, be performed prior to the
combination or as an integral part of the combination.
In some embodiments, the weighting combiner 260 may be
configured to be adjustable. In other words, in some
embodiments, the weighting combiner may be configured such
that weights associated with the quantitative feature
values of the different quantitative feature value
determinators are adjustable. For example, the weighting
combiner 260 may be configured to receive a set of
weighting coefficients, which may, for example, have an
impact on a non-linear processing of the quantitative
feature values 250a, 252a, 254a and/or on a linear scaling
of the quantitative feature values 250a, 252a, 254a.
Details regarding the weighting process will be
subsequently described.
In some embodiments, the gain value determinator 220 may
comprise an optional weight adjuster 270. The optional
weight adjuster 270 may be configured to adjust the
weighting of the quantitative feature values 250a, 252a,
254a performed by the weighting combiner 260. Details
regarding the determination of the weighting coefficients
for the weighting of the quantitative feature values will
be subsequently described, for example, taking reference to
Figs. 14 to 20. Said determination of the weighting
coefficients may for example be performed by a separate
apparatus or by the weight adjuster 270.
Apparatus for extracting an ambient signal - third
embodiment
In the following, another embodiment according to the
invention will be described. Fig. 3 shows a detailed block
schematic diagram of an apparatus for extracting an ambient
signal from an input audio signal. The apparatus shown in
Fig. 3 is designated in its entirety with 300.
However, it should be noted that throughout the present
description, identical reference numerals are chosen to
designate identical means, signals or functionalities.
The apparatus 300 is very similar to the apparatus 200.
However, the apparatus 300 comprises a particularly
efficient set of feature value determinators.
As can be seen from Fig. 3, a gain value determinator 320,
which takes the place of the gain value determinator 220
shown in Fig. 2, comprises, as a first quantitative feature
value determinator, a tonality feature value determinator
350. The tonality feature value determinator 350 may, for
example, be configured to provide, as a first quantitative
feature value, a quantitative tonality feature value 350a.
Moreover, the gain value determinator 320 comprises, as a
second quantitative feature value determinator, an energy
feature value determinator 352, which is configured to
provide, as a second quantitative feature value, an energy
feature value 352a.
Furthermore, the gain value determinator 320 may comprise,
as a third quantitative feature value determinator, a
spectral centroid feature value determinator 354. The
spectral centroid feature value determinator may be
configured to provide, as a third quantitative feature
value, a spectral centroid feature value describing a
centroid of a frequency spectrum of the input audio signal
or of a portion of the frequency spectrum of the input
audio signal 210.
Accordingly, the weighting combiner 260 may be configured
to combine, in a linearly and/or non-linearly weighted
manner, the tonality feature value 350a (or a sequence
thereof), the energy feature value 352a (or a sequence
thereof) and the spectral centroid feature value 354a (or a
sequence thereof) to obtain the gain value 222 for
weighting the sub-band signals 218a, 218b, 218c, 218d (or,
at least, one of the sub-band signals).
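For illustration, the three feature values of the apparatus
300 could be computed from the magnitudes of one temporal
segment as sketched below. The sketch is an assumption and
not the definition used by the embodiments: the energy is
taken as the sum of squared magnitudes, the spectral
centroid as the magnitude-weighted mean bin index, and the
spectral flatness (geometric mean over arithmetic mean of
the power) is used as a hypothetical stand-in for a tonality
measure. All function names are hypothetical.

```python
import math


def energy(magnitudes):
    """Sum of squared magnitudes of one temporal segment."""
    return sum(m * m for m in magnitudes)


def spectral_centroid(magnitudes):
    """Magnitude-weighted mean bin index (0.0 for a silent segment)."""
    total = sum(magnitudes)
    if total == 0.0:
        return 0.0
    return sum(k * m for k, m in enumerate(magnitudes)) / total


def spectral_flatness(magnitudes, eps=1e-12):
    """Geometric over arithmetic mean of the power spectrum.

    Close to 1 for noise-like segments, close to 0 for tonal segments.
    """
    power = [m * m + eps for m in magnitudes]
    log_mean = sum(math.log(p) for p in power) / len(power)
    return math.exp(log_mean) / (sum(power) / len(power))


tonal_segment = [0.0, 0.0, 1.0, 0.0]   # single spectral peak
noisy_segment = [1.0, 1.0, 1.0, 1.0]   # flat spectrum
features = (
    spectral_flatness(tonal_segment),
    energy(tonal_segment),
    spectral_centroid(tonal_segment),
)
```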
Apparatus for extracting an ambient signal - fourth
embodiment
In the following, a possible extension of the apparatus 300
will be discussed, taking reference to Fig. 4. However, the
concepts described with reference to Fig. 4 can also be
used independently of the configuration shown in Fig. 3.
Fig. 4 shows a block schematic diagram of an apparatus for
extracting an ambient signal. The apparatus shown in Fig. 4
is designated in its entirety with 400. The apparatus 400
is configured to receive, as an input signal, a multi-
channel input audio signal 410. In addition, the apparatus
400 is configured to provide at least one weighted sub-band
signal 412 on the basis of the multi-channel input audio
signal 410.
The apparatus 400 comprises a gain value determinator 420.
The gain value determinator 420 is configured to receive an
information describing a first channel 410a and a second
channel 410b of the multi-channel input audio signal.
Moreover, the gain value determinator 420 is configured to
provide, on the basis of an information describing the
first channel 410a and the second channel 410b of the
multi-channel input audio signal, a sequence of time-
varying ambient signal gain values 422. The time varying
ambient signal gain values 422 may, for example, be
equivalent to the time-varying gain values 222.
Moreover, the apparatus 400 comprises a weighter 430
configured to weight at least one sub-band signal
describing the multi-channel input audio signal 410 in
dependence on the time-varying ambient signal gain values
422.
The weighter 430 may, for example, comprise the
functionality of the weighter 130 or of the individual
weighters 270a, 270b, 270c.
Taking reference now to the gain value determinator 420,
the gain value determinator 420 may be extended, for
example, with reference to the gain value determinator 120,
the gain value determinator 220 or the gain value
determinator 320, in that the gain value determinator 420
is configured to obtain one or more quantitative channel-
relationship feature values. In other words, the gain value
determinator 420 may be configured to obtain one or more
quantitative feature values describing a relationship
between two or more of the channels of the multi-channel
input signal 410.
For example, the gain value determinator 420 may be
configured to obtain an information describing a
correlation between two of the channels of the multi-
channel input audio signal 410. Alternatively, or in
addition, the gain value determinator 420 may be configured
to obtain a quantitative feature value describing a
relationship between intensities of signals of a first
channel of the multi-channel input audio signal 410 and of
a second channel of the input audio signal 410.
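The two channel-relationship features mentioned above can be
sketched as follows. The sketch assumes real-valued segments
of equal length for the first channel and the second
channel; the normalized correlation coefficient and the
level difference in decibels are chosen here as plausible
examples only, and the function names are hypothetical.

```python
import math


def normalized_correlation(first, second):
    """Correlation coefficient of two equally long segments, in [-1, 1]."""
    numerator = sum(a * b for a, b in zip(first, second))
    denominator = math.sqrt(
        sum(a * a for a in first) * sum(b * b for b in second)
    )
    return numerator / denominator if denominator > 0.0 else 0.0


def intensity_ratio_db(first, second, eps=1e-12):
    """Level difference between the two channels in decibels."""
    energy_first = sum(a * a for a in first) + eps
    energy_second = sum(b * b for b in second) + eps
    return 10.0 * math.log10(energy_first / energy_second)


channel_410a = [1.0, 2.0, 3.0]
channel_410b = [0.5, 1.0, 1.5]   # same content at half the amplitude
correlation = normalized_correlation(channel_410a, channel_410b)
level_difference = intensity_ratio_db(channel_410a, channel_410b)
```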
In some embodiments, the gain value determinator 420 may
comprise one or more channel-relationship feature value
determinators configured to provide one or more feature
values (or sequences of feature values) describing one or
more channel-relationship features. In some other
embodiments, the channel-relationship feature value
determinators may be external to the gain value
determinator 420.
In some embodiments, the gain value determinator may be
configured to determine the gain values by combining, for
example in a weighted manner, one or more quantitative
channel relationship feature values describing different
channel relationship features. In some embodiments, the
gain value determinator 420 may be configured to determine
the sequence of time-varying ambient signal gain values 422
only on the basis of one or more quantitative channel
relationship feature values, for example, without considering
quantitative single-channel feature values. However, in
some other embodiments, the gain value determinator 420 is
configured to combine, for example in a weighted manner,
one or more quantitative channel relationship feature
values (describing one or more different channel-
relationship features) and one or more quantitative single
channel feature values (describing one or more single
channel features). Thus, in some embodiments, both single
channel features, which are based on a single channel of
the multi-channel input audio signal 410, and channel
relationship features, which describe a relationship
between two or more channels of the multi-channel input
audio signal 410, can be considered to determine the time-
varying ambient signal gain values.
Thus, in some embodiments according to the invention, a
particularly meaningful sequence of time varying ambient
signal gain values can be obtained by taking into
consideration both single channel features and channel
relationship features. Accordingly, the time-varying
ambient signal gain values can be adapted to the audio
signal channel to be weighted with said gain values, while
still taking into consideration valuable information, which
can be obtained from evaluating a relationship between
multiple channels.
Gain value determinator details
In the following, details regarding the gain value
determinator will be described taking reference to Fig. 5.
Fig. 5 shows a detailed block schematic diagram of a gain
value determinator. The gain value determinator shown in
Fig. 5 is designated in its entirety with 500. The gain
value determinator 500 may, for example, take over the
functionality of the gain value determinators 120, 220,
320, 420 described herein.
Non-linear Preprocessor
The gain value determinator 500 comprises an (optional)
non-linear pre-processor 510. The non-linear pre-processor
510 may be configured to receive a representation of one or
more input audio signals. For example, the non-linear pre-
processor 510 may be configured to receive a time-
frequency-domain representation of an input audio signal.
However, in some embodiments, the non-linear pre-processor
510 may be configured to receive, alternatively or
additionally, a time-domain representation of the input
audio signal. In some further embodiments, the non-linear
pre-processor may be configured to receive a representation
of a first channel of an input audio signal (for example, a
time-domain representation or a time-frequency-domain
representation) and a representation of a second channel of
the input audio signal. The non-linear pre-processor may
further be configured to provide a pre-processed
representation of one or more channels of the input audio
signal or at least a portion (for example, a spectral
portion) of the pre-processed representation to a first
quantitative feature value determinator 520. Moreover, the
non-linear pre-processor may be configured to provide
another pre-processed representation of the input audio
signal (or a portion thereof) to a second quantitative
feature value determinator 522. The representation of the
input audio signal provided to the first quantitative
feature value determinator 520 may be identical to, or
different from, the representation of the input audio
signal provided to the second quantitative feature value
determinator 522.
However, it should be noted that the first quantitative
feature value determinator 520 and the second quantitative
feature value determinator 522 may be considered as
representing two or more feature value determinators, for
example K feature value determinators, with K>=1 or K>=2.
In other words, the gain value determinator 500 shown in
Fig. 5 can be extended by further quantitative feature
value determinators, as desired and described herein.
Details regarding the functionality of the non-linear
preprocessor will be described below. However, it should be
noted that the preprocessing may comprise a determination
of magnitude values, energy values, logarithmic magnitude
values, logarithmic energy values of the input audio signal
or a spectral representation thereof or other nonlinear
preprocessing of the input audio signal or a spectral
representation thereof.
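The preprocessing alternatives listed above can be sketched
as follows for the complex coefficients of one temporal
segment. The function name preprocess, the mode strings and
the small constant eps (which avoids a logarithm of zero)
are hypothetical and serve illustration only.

```python
import math


def preprocess(coefficients, mode="log_energy", eps=1e-12):
    """Non-linear preprocessing of spectral coefficients (sketch)."""
    magnitudes = [abs(c) for c in coefficients]
    if mode == "magnitude":
        return magnitudes
    if mode == "energy":
        return [m * m for m in magnitudes]
    if mode == "log_magnitude":
        return [math.log(m + eps) for m in magnitudes]
    if mode == "log_energy":
        return [math.log(m * m + eps) for m in magnitudes]
    raise ValueError("unknown preprocessing mode: " + mode)


segment = [3 + 4j, 0j, -1 + 0j]
magnitude_values = preprocess(segment, mode="magnitude")
log_energy_values = preprocess(segment, mode="log_energy")
```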
Feature value postprocessors
The gain value determinator 500 comprises a first feature
value post-processor 530 configured to receive a first
feature value (or a sequence of first feature values) from
the first quantitative feature value determinator 520.
Moreover, a second feature value post-processor 532 may be
coupled to the second quantitative feature value
determinator 522 to receive from the second quantitative
feature value determinator 522 a second quantitative
feature value (or a sequence of second quantitative feature
values). The first feature value post-processor 530 and the
second feature value post-processor 532 may, for example,
be configured to provide respective post-processed
quantitative feature values.
For example, the feature value post-processors may be
configured to process the respective quantitative feature
values such that a range of values of the post-processed
feature values is limited.
Weighting Combiner
The gain value determinator 500 further comprises a
weighting combiner 540. The weighting combiner 540 is
configured to receive the post-processed feature values
from the feature value post-processors 530, 532 and to
provide, on the basis thereof, a gain value 560 (or a
sequence of gain values). The gain value 560 may be
equivalent to the gain value 122, the gain value 222, the
gain value 322 or to the gain value 422.
In the following, some details regarding the weighting
combiner 540 will be discussed. In some embodiments, the
weighting combiner 540 may, for example, comprise a first
non-linear processor 542. The first non-linear processor
542 may, for example, be configured to receive the first
post-processed quantitative feature value and to apply a
non-linear mapping to the post-processed first feature
value, to provide non-linearly processed feature values
542a. Moreover, the weighting combiner 540 may comprise a
second non-linear processor 544, which may be configured to
be similar to the first non-linear processor 542. The
second non-linear processor 544 may be configured to non-
linearly map the post-processed second feature value to a
non-linearly processed feature value 544a. In some
embodiments, parameters of non-linear mappings performed by
the non-linear processors 542, 544 may be adjusted in
accordance with respective coefficients. For example, a
first non-linear weighting coefficient may be used to
determine the mapping of the first non-linear processor 542
and the second non-linear weighting coefficient may be used
to determine the mapping performed by the second non-linear
processor 544.
In some embodiments, one or more of the feature value
post-processors 530, 532 may be omitted. In other
embodiments, one or all of the non-linear processors 542,
544 may be omitted. In addition, in some embodiments, the
functionalities of the corresponding feature value post-
processors 530, 532 and non-linear processors 542, 544 may
be merged into one unit.
The weighting combiner 540 further comprises a first
weighter or scaler 550. The first weighter 550 is
configured to receive the first non-linearly processed
quantitative feature value (or, in cases where the non-
linear processing is omitted, the first quantitative
feature value) 542a and to scale the first non-linearly
processed quantitative value in accordance with a first
linear weighting coefficient to obtain a first linearly
scaled quantitative feature value 550a. The weighting
combiner 540 further comprises a second weighter or scaler
552. The second weighter 552 is configured to receive the
second non-linearly processed quantitative feature value
544a (or, in cases where the non-linear processing is
omitted, the second quantitative feature value) and to
scale said value in accordance with a second linear
weighting coefficient to obtain a second linearly scaled
quantitative feature value 552a.
The weighting combiner 540 further comprises a combiner
556. The combiner 556 is configured to receive the first
linearly scaled quantitative feature value 550a and the
second linearly scaled quantitative feature value 552a. The
combiner 556 is configured to provide, on the basis of said
values, the gain value 560. For example, the combiner 556
may be configured to perform a linear combination (for
example, a summation or an averaging operation) of the
first linearly scaled quantitative feature value 550a and
of the second linearly scaled quantitative feature value
552a.
To summarize the above, the gain value determinator 500 may
be configured to provide a linear combination of
quantitative feature values determined by a plurality of
quantitative feature value determinators 520, 522. Prior to
the weighted linear combination, one or more non-linear
post-processing steps may be performed on the quantitative
feature values, for example to limit a range of values
and/or to modify a relative weighting of small values and
large values.
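The processing chain summarized above (range limitation,
non-linear mapping, linear scaling and summation) can be
sketched as follows. The sketch assumes that the range is
limited to the interval from 0 to 1 and that the non-linear
mapping is a power law; both choices, as well as the names
limit_range, nonlinear_map, gain_value, alphas and betas,
are hypothetical and are not taken from the embodiments.

```python
def limit_range(value, low=0.0, high=1.0):
    """Feature value post-processing: limit the range of values."""
    return min(high, max(low, value))


def nonlinear_map(value, beta):
    """Power-law mapping; beta acts as a non-linear weighting coefficient."""
    return value ** beta


def gain_value(features, alphas, betas):
    """Weighted linear combination of non-linearly mapped feature values."""
    total = 0.0
    for feature, alpha, beta in zip(features, alphas, betas):
        total += alpha * nonlinear_map(limit_range(feature), beta)
    return total


# Two feature values for one time-frequency bin (made-up numbers).
gain_560 = gain_value(
    features=[0.25, 2.0],   # the second value is limited to 1.0
    alphas=[0.5, 0.5],      # linear weighting coefficients
    betas=[0.5, 1.0],       # non-linear weighting coefficients
)
```

An exponent below 1 raises small feature values relative to
large ones, which is one way to modify the relative
weighting of small values and large values.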
It should be noted that the structure of the gain value
determinator 500 shown in Fig. 5 should be considered
exemplary only in order to facilitate the understanding.
However, any of the functionalities of the blocks of the
gain value determinator 500 could be implemented in a
different circuit structure. For example, some of the
functionalities could be combined into a single unit. In
addition, the functionalities described with reference to
Fig. 5 could be performed by shared units. For example, a
single feature value post-processor could be used to
perform, for example in a time-sharing manner, the post-
processing of the feature values provided by a plurality of
quantitative feature value determinators. Similarly, the
functionality of the non-linear processors 542, 544 could
be performed, in a time-sharing manner, by a single non-
linear processor. In addition, a single weighter could be
used to fulfill the functionality of the weighters 550,
552.
In some embodiments, the functionalities described with
reference to Fig. 5 could be performed by a single-tasking
or multi-tasking computer program. In other words, in some
embodiments, a completely different circuit topology can be
chosen to implement the gain value determinator, as long as
the desired functionality is obtained.
Direct signal Extraction
In the following, some further details will be described
with respect to an efficient extraction of both an ambient
signal and a front signal (also designated as "direct
signal") from an input audio signal. For this purpose, Fig.
6 shows a block schematic diagram of a weighter or weighter
unit according to an embodiment according to the invention.
The weighter or weighter unit shown in Fig. 6 is designated
in its entirety with 600.
The weighter or weighter unit 600 may, for example, take
the place of the weighter 130, of the individual weighters
270a, 270b, 270c or of the weighter 430.
The weighter 600 is configured to receive a representation
of the input audio signal 610 and to provide both a
representation of an ambient signal 620 and of a front
signal or a non-ambient signal or a "direct signal" 630. It
should be noted that in some embodiments, the weighter 600
may be configured to receive a time-frequency-domain
representation of the input audio signal 610 and to provide
a time-frequency-domain representation of the ambient
signal 620 and of the front signal or non-ambient signal
630.
However, naturally, the weighter 600 may also comprise, if
desired, a time-domain to time-frequency-domain converter
for converting a time-domain input audio signal into a
time-frequency-domain representation and/or one or more
time-frequency-domain to time-domain converters to provide
time-domain output signals.
The weighter 600 may, for example, comprise an ambient
signal weighter 640 configured to provide a representation
of the ambient signal 620 on the basis of a representation
of the input audio signal 610. In addition, the weighter
600 may comprise a front signal weighter 650 configured to
provide a representation of the front signal 630 on the
basis of a representation of the input audio signal 610.
The weighter 600 is configured to receive a sequence of
ambient signal gain values 660. Optionally, the weighter
600 may be configured to also receive a sequence of front
signal gain values. However, in some embodiments, the
weighter 600 may be configured to derive the sequence of
front signal gain values from the sequence of ambient
signal gain values, as will be discussed in the following.
The ambient signal weighter 640 is configured to weight one
or more frequency bands (which may, for example, be
represented by one or more sub-band signals) of the input
audio signal in accordance with the ambient signal gain
values to obtain the representation of the ambient signal
620, for example in the form of one or more weighted sub-
band signals. Similarly, the front signal weighter 650 is
configured to weight one or more frequency bands or
frequency sub-bands of the input audio signal 610, which
may, for example, be represented in terms of one or more
sub-band signals, to obtain a representation of the front
signal 630, for example, in the form of one or more
weighted sub-band signals.
However, in some embodiments, the ambient signal weighter
640 and the front signal weighter 650 may be configured to
weight a given frequency band or frequency sub-band
(represented, for example, by a sub-band signal) in a
complementary way to generate the representation of the
ambient signal 620 and the representation of the front
signal 630. For example, if an ambient signal gain value
for a specific frequency band indicates that the specific
frequency band should be given a comparatively high weight
in the ambient signal, the specific frequency band is
weighted comparatively high when deriving the
representation of the ambient signal 620 from the
representation of the input audio signal 610, and the
specific frequency band is weighted comparatively low when
deriving the representation of the front signal 630 from
the representation of the input audio signal 610.
Similarly, if the ambient signal gain value indicates that
the specific frequency band should be given a comparatively
low weight in the ambient signal, the specific frequency
band is given a low weight when deriving the representation
of the ambient signal 620 from the representation of the
input audio signal 610, and the specific frequency band is
given a comparatively high weight when deriving the
representation of the front signal 630 from the
representation of the input audio signal 610.
In some embodiments, the weighter 600 may thus be
configured to obtain, on the basis of the ambient signal
gain values 660, the front signal gain values 652 for the
front signal weighter 650, such that the front signal gain
values 652 increase with decreasing ambient signal gain
values 660 and vice-versa.
Accordingly, in some embodiments, the ambient signal 620
and the front signal 630 may be generated such that a sum
of energies of the ambient signal 620 and of the front
signal 630 is equivalent to (or proportional to) an energy
of the input audio signal 610.
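The complementary weighting described above can be illustrated with a
minimal sketch. The text only requires that the front signal gain values
increase with decreasing ambient signal gain values; the square-root rule
used here is one assumed choice that makes the energies of the two output
signals sum to the energy of the input. The function names and the
clamping of gains to the range 0 to 1 are hypothetical and not taken from
the description.

```python
import math


def complementary_front_gain(ambient_gain):
    """Derive a front gain from an ambient gain (assumed energy-preserving rule)."""
    g = min(max(ambient_gain, 0.0), 1.0)  # clamping to [0, 1] is an assumption
    return math.sqrt(1.0 - g * g)


def split_subband(subband, ambient_gains):
    """Weight one sub-band signal into an ambient part and a front part.

    subband: sequence of (possibly complex) sub-band samples, one per frame.
    ambient_gains: time-varying ambient gain values, one per frame.
    """
    ambient, front = [], []
    for x, g in zip(subband, ambient_gains):
        g = min(max(g, 0.0), 1.0)
        ambient.append(g * x)
        front.append(complementary_front_gain(g) * x)
    return ambient, front
```

With this rule, a frame weighted with an ambient gain of 1 contributes
nothing to the front signal, and a frame weighted with an ambient gain of
0 is passed to the front signal unchanged.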
frequency-domain to time-domain conversion 1150, which may,
for example, be effected using a synthesis filterbank.
Thus, a time-domain representation y of the ambient
components of the input audio signal x is obtained on the
basis of the time-frequency-domain representation Yi to Yn
of the ambient components of the input audio signal.
However, it should be noted that the weighted sub-band
signals provided by the multiplication 1130, 1132 may also
serve as an output signal of the process shown in Fig. 11.
Gain value determination
In the following, the gain computation process will be
described taking reference to Fig. 12. Fig. 12 shows a
block diagram of a gain computation process for one sub-
band of the ambient signal extraction process and of the
front signal extraction process using low-level features
extraction. Different low-level features are computed (for
example designated with LLF1 to LLFn) from the input
signal x. The gain factor (for example, designated with g)
is computed as a function of the low-level features (for
example, using a combiner).
Taking reference to Fig. 12, a plurality of low-level
feature computations is shown. For example, a first low-
level feature computation 1210 and a n-th low-level feature
computation 1212 are used in the embodiment shown in Fig.
12. The low-level feature computation 1210, 1212 is
performed on the basis of the input signal x. For example,
the calculation or determination of the low-level features
may be performed on the basis of the time-domain input
audio signal. However, alternatively, the computation or
determination of the low-level features may be performed on
the basis of one or more sub-band signals Xi to Xn.
Moreover, feature values (for example, quantitative feature
values) obtained from the computation or determination
1210, 1212 of the low-level features may be combined, for
example, using a combiner 1220 (which may for example be a
weighting combiner). Thus, the gain value g may be obtained
on the basis of a combination of the results of the low-
level feature determination or a low-level feature
calculation 1210, 1212.
Concept for determining weighting coefficients
In the following, a concept for obtaining weighting
coefficients for weighting a plurality of feature values,
to obtain a gain value as a weighted combination of the
feature values, will be described.
Apparatus for determining weighting coefficients - first
embodiment
Fig. 13 shows a block schematic diagram of an apparatus for
obtaining weighting coefficients. The apparatus shown in
Fig. 13 is designated in its entirety with 1300.
The apparatus 1300 comprises a coefficient determination
signal generator 1310, which is configured to receive a
basis signal 1312 and to provide, on the basis thereof, a
coefficient determination signal 1314. The coefficient
determination signal generator 1310 is configured to
provide the coefficient determination signal 1314 such that
characteristics of the coefficient determination signal
1314 with respect to ambience components and/or with
respect to non-ambience components and/or a relationship
between ambience components and non-ambience components are
known. In some embodiments, it is sufficient if an estimate
of such an information related to ambience components or
non-ambience components is known.
For example, the coefficient determination signal generator
1310 may be configured to provide, in addition to the
coefficient determination signal 1314, an expected gain
value information 1316. The expected gain value information
1316 describes, for example directly or indirectly, a
relationship between ambience components and non-ambience
components of the coefficient determination signal 1314. In
other words, the expected gain value information 1316 can
be considered as a side information describing ambience-
component related characteristics of the coefficient
determination signal. For example, the expected gain value
information may describe an intensity of ambience
components in the coefficient determination audio signal
(for example for a plurality of time-frequency bins of the
coefficient determination audio signal). Alternatively, the
expected gain value information may describe an intensity
of non-ambience components in the coefficient determination
audio signal. In some embodiments, the expected gain value
information may describe a ratio between intensities of
ambience components and non-ambience components. In some
other embodiments, the expected gain value information may
describe a relationship between an intensity of an ambience
component and a total signal intensity (ambience and non-
ambience components) or a relationship between an intensity
of a non-ambience component and a total signal intensity.
However, other information derived from the above mentioned
information may be provided as the expected gain value
information. For example, an estimate of R_AD(m, k) defined
below or an estimate of G(m,k) may be obtained as the
expected gain value information.
The apparatus 1300 further comprises a quantitative feature
value determinator 1320 configured to provide a plurality
of quantitative feature values 1322, 1324 describing, in a
quantitative way, features of the coefficient determination
signal 1314.
The apparatus 1300 further comprises a weighting
coefficient determinator 1330, which may, for example, be
configured to receive the expected gain value information
1316 and the plurality of quantitative feature values 1322,
1324 provided by the quantitative feature value
determinator 1320.
The weighting coefficient determinator 1330 is configured
to provide a set of weighting coefficients 1332 on the
basis of the expected gain value information 1316 and the
quantitative feature values 1322, 1324, as will be
described in detail in the following.
Weighting coefficient determinator, first embodiment
Fig. 14 shows a block schematic diagram of a weighting
coefficient determinator according to an embodiment
according to the invention.
The weighting coefficient determinator 1330 is configured
to receive the expected gain value information 1316 and the
plurality of quantitative feature values 1322, 1324.
However, in some embodiments, the quantitative feature
value determinator 1320 may be a part of the weighting
coefficient determinator 1330. Moreover, the weighting
coefficient determinator 1330 is configured to provide the
weighting coefficients 1332.
Regarding the functionality of the weighting coefficient
determinator 1330, it can generally be said that the
weighting coefficient determinator 1330 is configured to
determine the weighting coefficients 1332 such that gain
values obtained, using the weighting coefficients 1332, on
the basis of a weighted combination of the plurality of
quantitative feature values 1322, 1324 (describing a
plurality of features of the coefficient determination
signal 1314, which can be considered as an input audio
signal) approximate expected gain values associated with the
coefficient determination audio signal. The expected gain
values may, for example, be derived from the expected gain
value information 1316.
In other words, the weighting coefficient determinator may,
for example, be configured to determine which weighting
coefficients are required to weight the quantitative
feature values 1322, 1324 such that the result of the
weighting approximates the expected gain values described
by the expected gain value information 1316.
In other words, the weighting coefficient determinator may,
for example, be configured to determine the weighting
coefficients 1332 such that a gain value determinator
configured according to the weighting coefficients 1332
provides a gain value, which deviates from an expected gain
value described by the expected gain value information 1316
by no more than a predetermined maximum allowable
deviation.
Weighting coefficient determinator, second embodiment
In the following, some specific possibilities for
implementing the weighting coefficient determinator 1330
will be described.
Fig. 15a shows a block schematic diagram of a weighting
coefficient determinator according to an embodiment
according to the invention. The weighting coefficient
determinator shown in Fig. 15a is designated in its
entirety with 1500.
The weighting coefficient determinator 1500 comprises, for
example, a weighting combiner 1510. The weighting combiner
1510 may, for example, be configured to receive the
plurality of quantitative feature values 1322, 1324 and a
set of weighting coefficients 1332. Moreover, the weighting
combiner 1510 may, for example, be configured to provide a
gain value 1512 (or a sequence thereof) by combining the
quantitative feature values 1322, 1324 in accordance with
the weighting coefficients 1332. For example, the weighting
combiner 1510 may be configured to perform a similar or
identical weighting, like the weighting combiner 260. In
some embodiments, the weighting combiner 260 may even be
used to implement the weighting combiner 1510. Thus, the
weighting combiner 1510 is configured to provide a gain
value 1512 (or a sequence thereof).
The weighting coefficient determinator 1500 further
comprises a similarity determinator or difference
determinator 1520. The similarity determinator or
difference determinator 1520 may, for example, be
configured to receive the expected gain value information
1316 describing expected gain values and the gain values
1512 provided by the weighting combiner 1510. The
similarity determinator/difference determinator 1520 may,
for example, be configured to determine a similarity
measure 1522 describing, for example in a qualitative or
quantitative manner, the similarity between the expected
gain values described by the information 1316 and the gain
values 1512 provided by the weighting combiner 1510.
Alternatively, the similarity determinator/difference
determinator 1520 may be configured to provide a deviation
measure describing a deviation therebetween.
The weighting coefficient determinator 1500 comprises a
weighting coefficient adjuster 1530, which is configured to
receive the similarity information 1522 and to determine,
on the basis thereof, whether it is required to change the
weighting coefficients 1332 or whether the weighting
coefficients 1332 should be kept constant. For example, if
the similarity information 1522 provided by the similarity
determinator/difference determinator 1520 indicates that a
difference or deviation between the gain values 1512 and
solver 1560. The equation system solver or optimization
problem solver 1560 is configured to receive an information
1316 describing expected gain values, which may be
designated with $g_{\mathrm{expected}}$. The equation system
solver/optimization problem solver 1560 may further be
configured to receive a plurality of quantitative feature
values 1322, 1324. The equation system solver/optimization
problem solver 1560 may be configured to provide a set of
weighting coefficients 1332.
Assuming that the quantitative feature values received by
the equation system solver 1560 are designated with $m_i$ and
further assuming that weighting coefficients are, for
example, designated with $\alpha_i$ and $\beta_i$, the equation system
solver may, for example, be configured to solve a non-
linear system of equations of the form:
$$g_{\mathrm{expected},l} = \sum_{i=1}^{K} \alpha_i \, m_{l,i}^{\beta_i}, \qquad l = 1, \ldots, L$$
$g_{\mathrm{expected},l}$ may designate an expected gain value for a time-
frequency bin having index $l$. $m_{l,i}$ designates an i-th
feature value for the time-frequency bin having index $l$, and
$K$ designates the number of feature values. A
plurality of $L$ time-frequency bins may be considered for
solving the system of equations.
Accordingly, linear weighting coefficients $\alpha_i$ and non-
linear weighting coefficients (or exponent weighting
coefficients) $\beta_i$ can be determined by solving a system of
equations.
In an alternative embodiment, an optimization can be
performed. For example, a value determined by
$$\left\| \left( g_{\mathrm{expected},l} - \sum_{i=1}^{K} \alpha_i \, m_{l,i}^{\beta_i} \right)_{l=1,\ldots,L} \right\|$$
can be minimized by determining a set of appropriate
weighting coefficients $\alpha_i$, $\beta_i$. Here, $(\cdot)$ designates a vector
of differences between expected gain values and gain values
obtained by weighting feature values $m_{l,i}$. The entries of
the vector of differences may relate to different time-
frequency bins, designated with index $l = 1 \ldots L$. $\|\cdot\|$
designates a mathematical distance measure, for example a
mathematical vector norm.
In other words, the weighting coefficients may be
determined such that the difference between the expected
gain values and the gain value obtained from a weighted
combination of the quantitative feature values 1322, 1324
is minimized. However, it should be noted that the term
"minimized" should not be considered here in a very strict
way. Rather, the term minimizing expresses that the
difference is brought below a certain threshold.
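The optimization described above can be sketched for a simplified case.
The following example assumes that the gain is modeled as a sum of
feature values raised to exponent coefficients and scaled by linear
coefficients, and that the exponent coefficients are fixed in advance, so
that only the linear coefficients remain to be determined. Under that
assumption the problem reduces to an ordinary linear least-squares fit,
solved here through the normal equations. This is a swapped-in
simplification and not the solver 1560 itself; all function names are
hypothetical.

```python
def solve_linear(a, b):
    """Solve the square system a * x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def fit_linear_coefficients(features, expected_gains, betas):
    """Least-squares fit of the linear coefficients for fixed exponents.

    features[l][i] is the i-th (non-negative) feature value of bin l,
    expected_gains[l] is the expected gain value of bin l.
    """
    k = len(betas)
    powered = [[m ** b for m, b in zip(row, betas)] for row in features]
    # Normal equations: (P^T P) alpha = P^T g
    ptp = [[sum(p[i] * p[j] for p in powered) for j in range(k)] for i in range(k)]
    ptg = [sum(p[i] * g for p, g in zip(powered, expected_gains)) for i in range(k)]
    return solve_linear(ptp, ptg)


def residual_norm(features, expected_gains, alphas, betas):
    """Euclidean norm of the vector of differences over all bins."""
    diffs = [
        g - sum(a * m ** b for a, m, b in zip(alphas, row, betas))
        for row, g in zip(features, expected_gains)
    ]
    return sum(d * d for d in diffs) ** 0.5
```

If the exponent coefficients are also to be determined, the fit could be
repeated for several candidate exponent sets and the set with the
smallest residual norm kept, or a general non-linear optimizer could be
used instead.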
Weighting coefficient determinator, fourth embodiment
Fig. 16 shows a block schematic diagram of another
weighting coefficient determinator, according to an
embodiment according to the invention. The weighting
coefficient determinator shown in Fig. 16 is designated in
its entirety with 1600.
The weighting coefficient determinator 1600 comprises a
neural net 1610. The neural net 1610 may, for example, be
configured to receive the information 1316 describing the
expected gain values as well as a plurality of quantitative
feature values 1322, 1324. Moreover, the neural net 1610
may, for example, be configured to provide the weighting
coefficients 1332. For example, the neural net 1610 may be
configured to learn weighting coefficients, which result,
when applied to weight the quantitative feature values
1322, 1324, in a gain value, which is sufficiently similar
to an expected gain value described by the expected gain
value information 1316.
Further details will subsequently be described.
Apparatus for determining weighting coefficients - second
embodiment
Fig. 17 shows a block schematic diagram of an apparatus for
determining weighting coefficients according to an
embodiment according to the invention. The apparatus shown
in Fig. 17 is similar to the apparatus shown in Fig. 13.
Accordingly, identical means and signals are designated
with identical reference numerals.
The apparatus 1700 shown in Fig. 17 comprises a coefficient
determination signal generator 1310, which may be
configured to receive a basis signal 1312. In an
embodiment, the coefficient determination signal generator
1310 may be configured to add an ambient signal to the
basis signal 1312 to obtain the coefficient determination
signal 1314. The coefficient determination signal 1314 may,
for example, be provided in a time-domain representation or
in a time-frequency-domain representation.
The coefficient determination signal generator may further
be configured to provide the expected gain value
information 1316 describing expected gain values. For
example, the coefficient determination signal generator
1310 may be configured to provide the expected gain value
information on the basis of internal knowledge regarding an
addition of the ambient signal to the basis signal.
Optionally, the apparatus 1700 may further comprise a time-
domain to time-frequency-domain converter 1316, which may
be configured to provide the coefficient determination
signal 1318 in a time-frequency-domain representation.
Moreover, the apparatus 1700 comprises a quantitative
feature value determinator 1320, which may, for example,
comprise a first quantitative feature value determinator
1320a and a second quantitative feature value determinator
1320b. Thus, the quantitative feature value determinator
1320 is configured to provide a plurality of quantitative
feature values 1322, 1324.
Coefficient determination signal generator - first
embodiment
In the following, different concepts of providing the
coefficient determination signal 1314 will be described.
The concepts described with reference to Figs. 18a, 18b, 19
and 20 are applicable both to a time-domain representation
and to a time-frequency-domain representation of the
signal.
Fig. 18a shows a block schematic diagram of a coefficient
determination signal generator. The coefficient
determination signal generator shown in Fig. 18a is
designated in its entirety with 1800. The coefficient
determination signal generator 1800 is configured to
receive, as an input signal 1810, an audio signal with
negligible ambient signal components.
Moreover, the coefficient determination signal generator
1800 may comprise an artificial-ambient-signal generator
1820 configured to provide an artificial ambient signal on
the basis of the audio signal 1810. The coefficient-
determination-signal generator 1800 also comprises an
ambient signal adder 1830 configured to receive the audio
signal 1810 and the artificial ambient signal 1822 and to
add the artificial ambient signal 1822 to the audio signal
1810 to obtain the coefficient determination signal 1832.
Moreover, the coefficient determination signal generator
1800 may be configured to provide, for example, on the
basis of parameters used for generating the artificial
ambient signal 1822 or used for combining the audio signal
1810 with the artificial ambient signal 1822, an
information about the expected gain value. In other words,
the knowledge regarding modalities of the generation of the
artificial ambient signal and/or about the combination of
the artificial ambient signal with the audio signal 1810 is
used to obtain the expected gain value information 1834.
The artificial-ambient-signal generator 1820 may, for
example, be configured to provide, as the artificial
ambient signal 1822, a reverberation signal based on the
audio signal 1810.
Coefficient determination signal generator - second
embodiment
Fig. 18b shows a block schematic diagram of a coefficient
determination signal generator according to another
embodiment according to the invention. The coefficient
determination signal generator shown in Fig. 18b is
designated in its entirety with 1850.
The coefficient determination signal generator 1850 is
configured to receive an audio signal 1860 with negligible
ambient signal components and, in addition, an ambient
signal 1862. The coefficient determination signal generator
1850 also comprises an ambient signal adder 1870 configured
to combine the audio signal 1860 (having negligible ambient
signal components) with the ambient signal 1862. The
ambient signal adder 1870 is configured to provide the
coefficient determination signal 1872.
Moreover, as the audio signal with negligible ambient
signal components and the ambient signal are available in
an isolated form in the coefficient determination signal
generator 1850, an expected gain value information 1874 can
be derived therefrom.
For example, the expected gain value information 1874 may
be derived such that the expected gain value information is
descriptive of a ratio of magnitudes of the audio signal
and the ambient signal. For example, the expected gain
value information may describe such ratios of intensities
for a plurality of time-frequency bins of a time-frequency-
domain representation of the coefficient determination
signal 1872 (or of the audio signal 1860). Alternatively,
the expected gain value information 1874 may comprise an
information about intensities of the ambient signal 1862
for a plurality of time-frequency bins.
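The derivation of the expected gain value information from separately
available signals can be sketched as follows. This example assumes
time-domain frames instead of time-frequency bins to keep it short, and
it describes the expected gain as the ratio of the ambient energy to the
sum of the ambient energy and the energy of the audio signal with
negligible ambient components. Both choices, the frame length and the
function name are assumptions made for illustration only.

```python
def mix_and_expected_gain(dry, ambient, frame_len=4):
    """Add an ambient signal to a dry signal and derive per-frame expected gains.

    dry: audio signal with negligible ambient components (list of floats).
    ambient: ambient signal of the same length.
    Returns the coefficient determination signal and one expected gain per frame.
    """
    if len(dry) != len(ambient):
        raise ValueError("dry and ambient must have the same length")
    mixture = [d + a for d, a in zip(dry, ambient)]
    expected = []
    for start in range(0, len(dry), frame_len):
        e_dry = sum(d * d for d in dry[start:start + frame_len])
        e_amb = sum(a * a for a in ambient[start:start + frame_len])
        total = e_dry + e_amb  # ignores cross terms (assumption)
        expected.append(e_amb / total if total > 0.0 else 0.0)
    return mixture, expected
```

A frame that contains only the ambient signal yields an expected gain of
1, and a frame that contains only the dry signal yields an expected gain
of 0.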
Coefficient determination signal generator - third
embodiment
Taking reference now to Figs. 19 and 20, another approach
for determining the expected gain value information will be
discussed. Fig. 19 shows a block schematic diagram of a
coefficient determination signal generator according to an
embodiment according to the invention. The coefficient
determination signal generator shown in Fig. 19 is
designated in its entirety with 1900.
The coefficient determination signal generator 1900 is
configured to receive a multi-channel audio signal. For
example, the coefficient determination signal generator
1900 may be configured to receive a first channel 1910 and
a second channel 1912 of the multi-channel audio signal.
Moreover, the coefficient determination signal generator
1900 may comprise a channel-relationship based feature-
value determinator, for example, a correlation-based
feature-value determinator 1920. The channel relationship-
based feature value determinator 1920 may be configured to
provide a feature value, which is based on a relationship
between two or more of the channels of the multi-channel
audio signal.
In some embodiments, such a channel-relationship-based
feature-value may provide a sufficiently reliable
information regarding an ambience-component content of the
multi-channel audio signal without requiring additional
pre-knowledge. Thus, the information describing the
relationship between two or more channels of the multi-
channel audio signal obtained by the channel-relationship-
based feature-value determinator 1920 may serve as an
expected-gain-value information 1922. Moreover, in some
embodiments, a single audio channel of the multi-channel
audio signal may be used as a coefficient determination
signal 1924.
Coefficient determination signal generator - fourth
embodiment
A similar concept will be subsequently described with
reference to Fig. 20. Fig. 20 shows a block schematic
diagram of a coefficient determination signal generator
according to an embodiment according to the invention. The
coefficient determination signal generator shown in Fig. 20
is designated in its entirety with 2000.
The coefficient determination signal generator 2000 is
similar to the coefficient determination signal generator
1900 such that identical signals are designated with
identical reference numerals.
However, the coefficient determination signal generator
2000 comprises a multi-channel to single-channel combiner
2010 configured to combine the first channel 1910 and the
second channel 1912 (which are used for determining the
channel-relationship-based feature value by the channel-
relationship-based feature value determinator 1920) to
obtain the coefficient determination signal 1924. In other
words, rather than using a single channel signal of the
multi-channel audio signal, a combination of the channel
signals is used to obtain the coefficient determination
signal 1924.
Taking reference to the concept described with respect to
Figs. 19 and 20, it can be noted that a multi-channel audio
signal can be used to obtain the coefficient determination
signal. In typical multi-channel audio signals, a
relationship between the individual channels provides an
information with respect to an ambience-component content
of the multi-channel audio signal. Accordingly, a multi-
channel audio signal can be used for obtaining the
coefficient determination signal and for providing an
expected gain value information characterizing the
coefficient determination signal. Therefore, a gain value
determinator, which operates on the basis of a single
channel of an audio signal, can be calibrated (for example,
by determining respective coefficients) making use of a
stereo signal or a different type of multi-channel audio
signal. Thus, by using a stereo signal or a different type
of multi-channel audio signal, coefficients for an ambient
extractor can be obtained, which coefficients may be
applied (for example after obtaining the coefficients) for
the processing of a single channel audio signal.
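The use of a stereo signal for calibration can be sketched as follows.
The example assumes a frame-wise normalized correlation between the two
channels as the channel-relationship-based feature value, and it maps a
low correlation to a high expected ambience content by taking one minus
the magnitude of the correlation. That mapping, the equal-weight downmix,
the handling of silent frames and the function names are assumptions for
illustration and are not specified in the text.

```python
import math


def frame_correlation(left, right):
    """Normalized correlation of two frames; silent frames count as correlated."""
    num = sum(l * r for l, r in zip(left, right))
    den = math.sqrt(sum(l * l for l in left) * sum(r * r for r in right))
    return num / den if den > 0.0 else 1.0


def stereo_training_pair(left, right, frame_len=4):
    """Derive a single-channel training signal and expected gains from stereo.

    Returns the mono downmix (as in the combiner of Fig. 20) and one
    expected gain value per frame.
    """
    mono = [0.5 * (l + r) for l, r in zip(left, right)]
    expected = []
    for start in range(0, len(left), frame_len):
        rho = frame_correlation(left[start:start + frame_len],
                                right[start:start + frame_len])
        expected.append(1.0 - abs(rho))  # hypothetical mapping
    return mono, expected
```

The resulting pairs of mono frames and expected gain values could then
be used to determine coefficients that are later applied to single
channel audio signals.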
Method for extracting an ambient signal
Fig. 21 shows a flowchart of a method for extracting an
ambient signal on the basis of a time-frequency-domain
representation of an input audio signal, the representation
representing the input audio signal in terms of a plurality
of sub-band signals describing a plurality of frequency
bands. The method shown in Fig. 21 is designated in its
entirety with 2100.
The method 2100 comprises obtaining 2110 one or more
quantitative feature values describing one or more features
of the input audio signal.
The method 2100 further comprises determining 2120 a
sequence of time-varying ambient signal gain values for a
given frequency band of a time-frequency-domain
representation of the input audio signal as a function of
the one or more quantitative feature values, such that the
gain values are quantitatively dependent on the
quantitative feature values.
The method 2100 further comprises weighting 2130 a sub-band
signal representing the given frequency band of the time-
frequency-domain representation with the time-varying gain
values.
In some embodiments, the method 2100 may be operational to
perform the functionality of the apparatus described
herein.
Method for obtaining weighting coefficients
Fig. 22 shows a flowchart of a method for obtaining
weighting coefficients for parameterizing a gain value
determinator for extracting an ambient signal from an input
audio signal. The method shown in Fig. 22 is designated in
its entirety with 2200.
The method 2200 comprises obtaining 2210 a coefficient
determination input audio signal, such that an information
about ambience components present in the input audio signal
or an information describing a relationship between
ambience components and non-ambience components is known.
The method 2200 further comprises determining 2220
weighting coefficients such that gain values obtained on
the basis of a weighted combination, according to the
weighting coefficients, of a plurality of quantitative
feature values describing a plurality of features of the
coefficient determination input audio signal approximate
expected gain values associated with the coefficient
determination input audio signal.
The methods described herein may be supplemented by any of
the features and functionalities described also with
respect to the inventive apparatus.
Computer Programs
Depending on certain implementation requirements of the
inventive methods, the inventive methods can be implemented
in hardware or in software. The implementation can be
performed using a digital storage medium, for example a
floppy disk, a DVD, a CD, a ROM, a PROM, an EPROM, an
EEPROM or a FLASH memory, having electronically readable
control signals stored thereon, which cooperate with a
programmable computer system such that the inventive method
is performed. Generally, the present invention is,
therefore, a computer program product with a program code
stored on a machine readable carrier, the program code
being operative for performing the inventive method when
the computer program product runs on a computer. In other
words, the inventive method is, therefore, a computer
program having a program code for performing the inventive
method when the computer program runs on a computer.
3 Description of a method according to another
embodiment.
3.1 Problem description
A method according to an embodiment aims at the extraction
of a front signal and an ambient signal suited for blind
upmixing of audio signals. The multi-channel surround sound
signal may be obtained by feeding the front channels with
the front signal and by feeding the rear channels with the
ambient signal.
Various methods for the extraction of an ambient signal
already exist:
1. using NMF (see Section 2.1.3)
2. using a time-frequency mask depending on the
correlation of the left and right input signal (see
Section 2.2.4)
3. using PCA and a multi-channel input signal (see
Section 2.3.2)
Method 1 relies on an iterative numeric optimization
technique whereas a segment of a few seconds length (e.g.
2...4 seconds) is processed at a time. Consequently, the
method is of high computational complexity and has an
algorithmic delay of at least the aforementioned segment
length. In contrast, the inventive method is of low
computational complexity and has a low algorithmic delay
compared to Method 1.
Methods 2 and 3 rely on distinct differences between the
input channel signals, i.e. they do not produce an
appropriate ambience signal if all input channel signals
are identical or nearly identical. In contrast, the
inventive method is able to process mono signals or multi-
channel signals which are identical or nearly identical.
In summary, the advantages of the proposed method are as
follows:
• Low complexity
• Low delay
• Works for monophonic and nearly monophonic input
signals as well as for stereophonic input signals
3.2 Method description
A multi-channel surround signal (e.g. in 5.1 or 7.1 format)
is obtained by extracting an ambient signal and a front
signal from the input signal. The ambient signal is fed
into the rear channels. The center channel is used to
enlarge the sweet spot and plays back the front signal or
the original input signal. The other front channels play
back the front signal or the original input signal (i.e.
the left front channel plays back the original left front
signal or a processed version of the original left front
signal). Figure 10 shows a block diagram of the upmix
process.
The extraction of the ambient signal is carried out in the
time-frequency domain. The inventive method computes time-
varying weights (also designated as gain values) for each
sub-band signal using low-level features (also designated
as quantitative feature values) measuring the "ambience-
likeliness" of each subband signal. These weights are
applied prior to the re-synthesis to compute the ambient
signal. Complementary weights are computed for the front
signal.
Examples for typical characteristics of ambience are:
• Ambient sounds are rather quiet sounds compared to
direct sounds.
• Ambient sounds are less tonal than direct sounds.
Appropriate low-level features for the detection of such
characteristics are described in Section 3.3:
• Energy features measure the quietness of a signal
component
• Tonality features measure the noisiness of a signal
component
The time-varying gain factors g(ω,τ) with sub-band index ω
and time index τ are derived from the computed features
mi(ω,τ) using for instance Equation 1
with K being the number of features and the parameters αi
and βi used for the weighting of the different features.
Figure 11 illustrates a block diagram of the ambience
extraction process using low-level feature extraction. The
input signal x is a one-channel audio signal. For the
processing of signals with more channels, the processing
may be applied to each channel separately. The analysis
filter-bank separates the input signal into N frequency
bands (N > 1), for instance using an STFT (Short-Term
Fourier Transform) or digital filters. The outputs of the
analysis filter-bank are N sub-band signals Xi, 1 ≤ i ≤ N.
The gain factors gi, 1 ≤ i ≤ N, are obtained by computing
one or more low-level features from the sub-band signals Xi
and combining the feature values, as illustrated in Figure
11. Each sub-band signal Xi is then weighted using the gain
factor gi.
A preferred extension to the described process is the use
of groups of sub-band signals instead of single sub-band
signals: Sub-band signals can be grouped to form groups of
sub-band signals. The processing described here can be
carried out using groups of sub-band signals, i.e. low-
level features are computed from one or more groups of sub-
band signals (whereas each group contains one or more sub-
band signals) and the derived weighting factors are applied
to the corresponding sub-band signals (i.e. to all sub-
bands belonging to the particular group).
An estimate for a spectral representation of the ambience
signal is obtained by weighting one or more of the sub-
bands with the corresponding weight gi. The signal which
will feed the front channels of the multi-channel surround
signal is processed in a similar way with complementary
weights as used for the ambient signal.
The additional play-back of the ambient signal results in
more ambient signal components (compared to the original
input signal). The weights for the computation of the front
signal are computed as being in an inverse proportion to
the weights for the computation of the ambient signal.
Consequently, each resulting front signal contains less
ambient signal components and more direct signal components
compared to the corresponding original input signal.
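For illustration only, the relation between the two sets of weights may be sketched as follows in Python. The sketch assumes that the front-signal weight is chosen as 1 - g for an ambience weight g in the range [0, 1]; this particular choice of complementary weights and the function name split_subband are assumptions made for the example and are not prescribed by the description above.

```python
def split_subband(subband, gains):
    """Split one sub-band signal into an ambient part and a front part.

    subband: complex (or real) sub-band samples, one per time frame.
    gains:   ambience weights in [0, 1], one per time frame.
    The front weight 1 - g is an assumed form of "complementary" weight.
    """
    if len(subband) != len(gains):
        raise ValueError("need one gain value per sub-band sample")
    if any(g < 0.0 or g > 1.0 for g in gains):
        raise ValueError("gain values must lie in [0, 1]")
    ambient = [g * x for g, x in zip(gains, subband)]
    front = [(1.0 - g) * x for g, x in zip(gains, subband)]
    return ambient, front
```

With this choice, the ambient part and the front part of each sample add up to the original sub-band sample.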
The ambient signal is (optionally) further enhanced (with
respect to the perceived quality of the resulting surround
sound signal) using additional post-processing in the
spectral domain and resynthesized using the inverse process
of the analysis filter-bank (i.e. the synthesis filter-
bank), as shown in Figure 11.
The post-processing is detailed in Section 7. It should be
noted that some post-processing algorithms can be carried
out in either the spectral domain or the temporal domain.
Figure 12 shows a block diagram of the gain computation
process for one sub-band (or one group of sub-band signals)
based on the extraction of low-level features. Various low-
level features are computed and combined, yielding the gain
factor.
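The combination step of Figure 12 may be sketched, for illustration only, as follows. The sketch assumes that the gain is a weighted sum of exponentiated feature values, i.e. the sum over i of αi·mi^βi, which is one possible reading of the weighting parameters mentioned for Equation 1; the function name combine_features and the final limiting to [0, 1] are assumptions made for the example.

```python
def combine_features(features, alphas, betas):
    """Combine K feature values of one sub-band and time frame into one gain.

    Assumed form: gain = sum_i alphas[i] * features[i] ** betas[i].
    """
    if not (len(features) == len(alphas) == len(betas)):
        raise ValueError("need one alpha and one beta per feature")
    gain = sum(a * (m ** b) for m, a, b in zip(features, alphas, betas))
    # Limit the result to [0, 1] so that it can serve as a spectral weight.
    return min(1.0, max(0.0, gain))
```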
The resulting gains can be further post-processed using
dynamic compression and low-pass filtering (both in time
and in frequency).
3.3 Features
The following section describes features that are suitable
for characterizing ambience-like signal quality. In
general, the features characterize an audio signal (broad-
band) or a particular frequency region (i.e. a sub-band) or
a group of sub-bands of an audio signal. The computation of
features in sub-bands requires the use of a filter-bank or
time-frequency transform.
The computation is explained here using a spectral
representation X(ω,τ) of the audio signal x[k], with ω
being the sub-band index and τ the time index. A spectrum (or
one range of a spectrum) is denoted by Sk, with k being the
frequency index.
Feature computation using the signal spectrum may process
different representations of the spectrum, i.e. magnitudes,
energy, logarithmic magnitudes or energy or any other
non-linearly processed spectrum (e.g. X^0.23). If not noted
otherwise, the spectral representation is assumed to be
real-valued.
Features computed in adjacent sub-bands can be subsumed to
characterize a group of sub-bands, e.g. by averaging the
feature values of the sub-bands. Consequently, the tonality
for a spectrum can be computed from the tonality values for
each spectral coefficient of the spectrum, e.g. by
computing their mean value.
It is desired that the value range of the computed features is
[0, 1] or a different predetermined interval. Some feature
computations described below do not result in values within
that range. In these cases, appropriate mapping functions
are applied, for example to map values describing a feature
to a predetermined interval. A simple example for a mapping
function is given in Equation 2.
The mapping can for example be performed using the post-
processor 530, 532.
3.3.1 Tonality Features
The term Tonality as used here describes "a feature
distinguishing noise versus tone quality of sounds".
Tonal signals are characterized by a non-flat signal
spectrum, whereas noisy signals have a flat spectrum.
Consequently, tonal signals are more periodic than noisy
signals, whereas noisy signals are more random than tonal
signals. Therefore, tonal signals are predictable from preceding
signal values with a small prediction error, whereas noisy
signals are not well predictable.
In the following, a plurality of features will be described
which can be used to quantitatively describe a tonality. In
other words, the features described here can be used to
determine a quantitative feature value, or can serve as a
quantitative feature value.
Spectral Flatness Measure: Spectral Flatness Measure
(SFM) is computed as the ratio of the geometric mean value
and the arithmetic mean value of the spectrum S.
Alternatively, Equation 4 can be used, yielding the
identical result.
(4)
A feature value may be derived from SFM(S).
Spectral Crest Factor: The Spectral Crest Factor is
computed as the ratio of the maximum value and the mean
value of the spectrum X (or S).
A quantitative feature value may be derived from SCF(S).
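For illustration only, the two measures described above may be sketched as follows. The geometric mean is evaluated in the logarithmic domain, which is one numerically convenient way of obtaining the same ratio; the function names and the requirement of a strictly positive spectrum are assumptions made for the example.

```python
import math


def spectral_flatness(spectrum):
    """Ratio of the geometric mean to the arithmetic mean of a spectrum."""
    n = len(spectrum)
    if n == 0 or any(s <= 0.0 for s in spectrum):
        raise ValueError("spectrum must be non-empty and strictly positive")
    # Geometric mean via the mean of the logarithms (avoids large products).
    geometric_mean = math.exp(sum(math.log(s) for s in spectrum) / n)
    arithmetic_mean = sum(spectrum) / n
    return geometric_mean / arithmetic_mean


def spectral_crest_factor(spectrum):
    """Ratio of the maximum value to the mean value of a spectrum."""
    if len(spectrum) == 0:
        raise ValueError("spectrum must be non-empty")
    mean = sum(spectrum) / len(spectrum)
    if mean <= 0.0:
        raise ValueError("mean of the spectrum must be positive")
    return max(spectrum) / mean
```

A flat spectrum yields a flatness of 1 and a crest factor of 1, whereas a spectrum dominated by a single peak yields a flatness close to 0 and a large crest factor.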
Tonality computation using peak detection: In ISO/IEC
11172-3 MPEG-1 Psychoacoustic Model 1 (recommended for
Layers 1 and 2) [IS093] a method is described to
discriminate between tonal and non-tonal components, which
is used to determine the masking threshold for
perceptual audio coding. The tonality of a spectral
coefficient Si is determined by examining the levels of
spectral values within a frequency range Δf surrounding the
frequency corresponding to Si. Peaks (i.e. local maxima)
are detected if the energy of Si exceeds the energies of
its surrounding values Si+k, with e.g. k ∈ [-4, -3, -2, 2,
3, 4]. If the local maximum exceeds its surrounding values
by 7 dB or more, it is classified as tonal. Otherwise, the
local maximum may be classified as not tonal.
A feature value can be derived describing whether a maximum
is tonal or not. Also, a feature value may be derived
describing, for example, how many tonal time-frequency bins
are present within a given neighbourhood.
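The peak-based classification may be sketched, for illustration only, as follows. The sketch assumes that a local maximum is a value larger than its left neighbour and not smaller than its right neighbour, that all energies are strictly positive, and that neighbours falling outside the spectrum are ignored; the function name tonal_peaks is likewise an assumption.

```python
import math


def tonal_peaks(energies, offsets=(-4, -3, -2, 2, 3, 4), threshold_db=7.0):
    """Return the indices of local maxima that are classified as tonal."""
    if any(e <= 0.0 for e in energies):
        raise ValueError("energies must be strictly positive")
    tonal = []
    for i in range(1, len(energies) - 1):
        # Assumed local-maximum test against the two direct neighbours.
        if not (energies[i] > energies[i - 1] and energies[i] >= energies[i + 1]):
            continue
        level_db = 10.0 * math.log10(energies[i])
        neighbours = [energies[i + k] for k in offsets
                      if 0 <= i + k < len(energies)]
        # Tonal if the peak exceeds every examined neighbour by the threshold.
        if neighbours and all(level_db - 10.0 * math.log10(e) >= threshold_db
                              for e in neighbours):
            tonal.append(i)
    return tonal
```

A feature value may then be obtained, for example, by counting the returned indices within a given neighbourhood.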
Tonality computation using the ratio of nonlinearly
processed copies: The non-flatness of a vector is
measured as the ratio of two nonlinearly processed copies of
the spectrum S as shown in Equation 6 with α > β.
Two particular implementations are shown in Equations 7 and 8.
A quantitative feature value may be derived from F(S).
Tonality computation using the ratio of differently
filtered spectra: The following tonality measure is
described in US-Patent 5,918,203 [HEG+99].
The tonality of a spectral coefficient Sk for frequency
line k is computed from the ratio Θ of two filtered copies
of the spectrum S, where the first filter function H has
a differentiating characteristic and the second filter
function G has an integrating characteristic or a
characteristic which is less strongly differentiating than
the first filter, and c and d are integer constants which,
depending on the filter parameters, are chosen such that
the delays of the filters are compensated for in each case.
(9)
A particular implementation is shown in Equation 10, where
H is the transfer function of a differentiating filter.
Θ(k) = H(Sk+c) (10)
A quantitative feature value can be derived from Θk or from
Θ(k).
Tonality computation using periodicity functions: The
aforementioned tonality measures use the spectrum of the
input signal and derive a measure of tonality from the non-
flatness of the spectrum. The tonality measures (from which
a feature value can be derived) can also be computed using
a periodicity function of the input time signal instead of
its spectrum. A periodicity function is derived from the
comparison of a signal with its delayed copy.
The similarity or difference of both is given as a
function of the lag (i.e. the time delay between both
signals). A high degree of similarity (or a low difference)
between a signal and its copy delayed by the lag τ indicates
a strong periodicity of the signal with period τ.
Examples for periodicity functions are the autocorrelation
function and the Average Magnitude Difference Function
[dCK03]. The autocorrelation function rxx(τ) of a signal x
is shown in Equation 11, with integration window size W.
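For illustration only, an autocorrelation-based periodicity function may be sketched as follows. The sketch assumes the common unnormalized form, in which the products of the signal and its delayed copy are summed over the integration window; the function name and the handling of the signal boundaries are assumptions made for the example.

```python
def autocorrelation(x, lag, window):
    """Sum of x[k] * x[k + lag] for k = 0 .. window - 1 (assumed form)."""
    if lag < 0 or window < 1 or window + lag > len(x):
        raise ValueError("signal too short for this lag and window size")
    return sum(x[k] * x[k + lag] for k in range(window))
```

For a signal with a period of 4 samples, the value at lag 4 equals the value at lag 0, which indicates a strong periodicity, whereas the value at lag 2 is strongly negative for an alternating waveform.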
Tonality computation using the prediction of spectral
coefficients: The tonality estimation using the prediction
of the complex spectral coefficients Xi from the preceding
coefficients Xi-1 and Xi-2 is described in ISO/IEC
11172-3 MPEG-1 Psychoacoustic Model 2 (recommended for
Layer 3).
The current values for the magnitude X0(ω,τ) and phase
φ(ω,τ) of the complex spectral coefficient X(ω,τ) =
X0(ω,τ)·e^{jφ(ω,τ)} can be estimated from the previous values
according to Equations 12 and 13.


The normalized Euclidean distance between the estimated and
actually measured values (as shown in Equation 14) is a
measure for the tonality, and can be used to derive a
quantitative feature value.
The tonality for one spectral coefficient can also be
computed from the prediction error P(ω,τ) (see Equation 15,
with X(ω,τ) being complex-valued) such that large
prediction errors result in small tonality values.
P(ω,τ) = X(ω,τ) - 2X(ω,τ - 1) + X(ω,τ - 2) (15)
Tonality computation using prediction in the time domain:
The signal x[k] at time index k can be predicted from
preceding samples using Linear Prediction, whereas the
prediction error is small for periodic signals and large
for random signals. Consequently, the prediction error is
in inverse proportion to the tonality of the signal.
Accordingly, a quantitative feature value can be derived
from the prediction error.
3.3.2 Energy features
Energy features measure the instantaneous energy within a
sub-band. The weighting factor for the ambience extraction
of a particular frequency band will be lower at times when
the energy content of the frequency band is high, i.e. the
particular time-frequency tile is very likely to be a
direct signal component.
Additionally, energy features can also be computed from
adjacent (with respect to time) sub-band samples of the
same sub-band. Similar weighting is applied if the sub-band
signal features high energy in the near past or future. An
example is shown in Equation 16. The feature M(ω,τ) is
computed from the maximum value of adjacent sub-band
samples within the interval from τ - k to τ + k, with k
determining the observation window size.
M(ω,τ) = max([X(ω,τ - k) ... X(ω,τ + k)]) (16)
Both, the instantaneous sub-band energy and the maximum of
the sub-band energy measured in the near past or future are
treated as separate features (i.e. different parameters for
the combination as described in Equation 1 are used).
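For illustration only, the maximum-based energy feature may be sketched as follows for one sub-band. The clamping of the observation window at the beginning and at the end of the signal, and the function name, are assumptions made for the example.

```python
def max_energy_feature(energies, tau, k):
    """Maximum sub-band energy within the frames tau - k .. tau + k.

    energies: energy values of one sub-band, one per time frame.
    The window is clamped at the signal boundaries (assumed behaviour).
    """
    if not 0 <= tau < len(energies) or k < 0:
        raise ValueError("invalid frame index or window size")
    first = max(0, tau - k)
    last = min(len(energies), tau + k + 1)
    return max(energies[first:last])
```

It should be noted that a window reaching into the future (frames after τ) implies a corresponding look-ahead delay in a real-time implementation.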
In the following, some extensions to a low-complexity
extraction of a front signal and an ambient signal from an
audio signal for upmixing will be described.
The extensions concern the feature extraction, the post-
processing of the features and the method of the derivation
of the spectral weights from the features.
3.3.3 Extensions to the feature set
In the following, optional extensions of the above
described feature set will be described.
The above description describes the usage of tonality
features and energy features. The features are computed
(for example) in the Short-term Fourier transform (STFT)
domain and are functions of time index m and frequency
index k. The representation in the time-frequency domain
(as obtained e.g. by means of the STFT) of a signal x[n] is
written as X(m, k). In the case of processing stereo
signals, the left channel signal is termed x1[k] and the
right channel signal is x2[k]. The superscript "*" denotes
complex conjugation.
One or more of the following features may optionally be
used:
3.3.3.1 Features evaluating the inter-channel coherence
or correlation
Definition of coherence: Two signals are coherent if
they are equal with possibly a different scaling and delay,
i.e. their phase difference is constant.
Definition of correlation: Two signals are correlated if
they are equal with possibly a different scaling.
Correlation between two signals of length N each is often
measured by means of the normalized cross-correlation
coefficient r
where x̄ denotes the mean value of x[k]. To track the
changes of the signal characteristic over time, the sum
operator is often replaced by a first order recursive
filter in practice, e.g. the computation of
z[k] = Σj x[j] can be approximated by
z[k] = λz[k - 1] + (1 - λ)x[k] (21)
with "forgetting factor" λ. This computation is in the
following termed "moving average estimation (MAE)", fMAE(z).
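The recursion of Equation 21 may be sketched, for illustration only, as follows. The initial state of zero and the function name are assumptions made for the example.

```python
def moving_average_estimate(x, forgetting_factor, initial=0.0):
    """First-order recursive average: z[k] = f * z[k-1] + (1 - f) * x[k]."""
    if not 0.0 <= forgetting_factor < 1.0:
        raise ValueError("forgetting factor must lie in [0, 1)")
    z = initial
    out = []
    for sample in x:
        z = forgetting_factor * z + (1.0 - forgetting_factor) * sample
        out.append(z)
    return out
```

A forgetting factor close to 1 yields a slowly varying estimate, whereas a factor close to 0 follows the input almost immediately.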
Ambient signal components in the left and right channel of
a stereo recording are in general weakly correlated. When
recording a sound source in a reverberant room with a
stereo microphone technique, both microphone signals are
different because the paths from the sound source to the
microphones are different (mainly because of the
differences in the reflection patterns). In artificial
recordings the decorrelation is introduced by means of
artificial stereo reverberation. Consequently, an
appropriate feature for ambience extraction measures the
correlation or coherence between the left and right channel
signals.
The inter-channel short-time coherence (ICSTC) function
described in [AJ02] is a suitable feature. The ICSTC Φ is
computed from the MAE of the cross-correlation Φ12 between
the left and right channel signals and the MAE of the
energies Φ11 of the left signal and Φ22 of the right signal.
(22)
with
(23)
In fact, the formula of the ICSTC described in [AJ02] is
nearly identical to the normalized cross-correlation
coefficient, where the only difference is that no centering
of the data is applied (centering means removing the mean
as shown in Equation 20:
x_centered = x - x̄).
In [AJ02], an ambience index (that is a feature indicating
the degree of "ambience-likeness") is computed from the
ICSTC by non-linear mapping, e.g. using the hyperbolic
tangent.
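For illustration only, a coherence feature of this kind may be sketched as follows for one frequency bin. The sketch assumes that the coherence is the magnitude of the recursively averaged cross term divided by the square root of the product of the recursively averaged energies, in line with the un-centered normalized cross-correlation discussed above; the exact form of Equations 22 and 23 is not reproduced here, and the function name and the default forgetting factor are assumptions.

```python
import math


def short_time_coherence(x1_frames, x2_frames, forgetting_factor=0.9):
    """Track the inter-channel coherence of one frequency bin over time.

    x1_frames, x2_frames: complex STFT coefficients of the left and right
    channel for successive frames. Returns one value in [0, 1] per frame.
    """
    lam = forgetting_factor
    phi12, phi11, phi22 = 0j, 0.0, 0.0
    coherence = []
    for a, b in zip(x1_frames, x2_frames):
        # Recursive averages of the cross term and of the two energies.
        phi12 = lam * phi12 + (1.0 - lam) * a * b.conjugate()
        phi11 = lam * phi11 + (1.0 - lam) * abs(a) ** 2
        phi22 = lam * phi22 + (1.0 - lam) * abs(b) ** 2
        denominator = math.sqrt(phi11 * phi22)
        coherence.append(abs(phi12) / denominator if denominator > 0.0 else 0.0)
    return coherence
```

Channels that differ only by a scaling factor yield a coherence of 1, whereas weakly correlated (ambience-like) channels yield smaller values.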
3.3.3.2 Inter-channel level difference
Features based on the inter-channel level differences
(ICLD) are used to determine the prominent position of a
sound source within the stereo image (panorama). A source
s[k] is amplitude-panned to a particular direction by
applying a panning coefficient α to weight the magnitude of
s[k] in x1[k] and x2[k] according to
When computed for a time-frequency bin, the ICLD-based
features deliver a cue to determine the position (and the
panning coefficient α) of the sound source which dominates
the particular time-frequency bin.
One ICLD-based feature is the panning index Ψ(m,k) as
described in [AJ04].
A computationally more efficient alternative to the panning
index as described above is computed using
The additional advantage of S(m, k) compared to Ψ(m,k) is
that it is identical to the panning coefficient α, whereas
Ψ(m, k) only approximates α. The formula in Equation 27 is
inspired by the computation of the centroid (center of
gravity) of a function f(x) of the discrete variable x ∈
{-1, 1} and f(-1) = |X1(m,k)| and f(1) = |X2(m,k)|.
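The centroid computation mentioned above may be sketched, for illustration only, as follows. The sketch evaluates the center of gravity of a function taking the left-channel magnitude at -1 and the right-channel magnitude at 1; whether Equation 27 uses exactly this scaling is an assumption, as is the function name.

```python
def level_difference_centroid(x1, x2):
    """Centroid of f(-1) = |x1| and f(1) = |x2| for one time-frequency bin.

    Returns -1 for a source present only in the left channel, +1 for a
    source present only in the right channel and 0 for equal magnitudes.
    """
    m1, m2 = abs(x1), abs(x2)
    total = m1 + m2
    if total == 0.0:
        return 0.0  # silent bin: no level difference can be measured
    return (m2 - m1) / total
```

Only two magnitudes, one addition, one subtraction and one division are needed per time-frequency bin, which illustrates the low computational cost of such a feature.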
3.3.3.3 Spectral centroid
The spectral centroid T of a magnitude spectrum or a range
of a magnitude spectrum |Sk| of length N is computed
according to
The spectral centroid is a low-level feature that
correlates (when computed over the whole frequency range of
a spectrum) to the perceived brightness of a sound. The
spectral centroid is measured in Hz or dimensionless when
normalized to the maximum of the frequency range.
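For illustration only, the spectral centroid may be sketched as follows in its usual form, i.e. the magnitude-weighted mean of the frequency index. The zero-based frequency index, the normalization to the highest index and the function name are assumptions made for the example.

```python
def spectral_centroid(magnitudes, normalize=True):
    """Magnitude-weighted mean frequency index of a magnitude spectrum.

    With normalize=True the result is dimensionless and lies in [0, 1].
    """
    total = sum(magnitudes)
    if total == 0.0:
        return 0.0  # silent spectrum: centroid is undefined, return 0
    centroid = sum(k * s for k, s in enumerate(magnitudes)) / total
    if normalize and len(magnitudes) > 1:
        centroid /= len(magnitudes) - 1
    return centroid
```

A result in Hz is obtained by multiplying the non-normalized index by the frequency spacing of the spectral coefficients.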
4 Feature grouping
Feature grouping is motivated by the desire to reduce the
computational load of the further processing of the
features and/or to evaluate the progression of the features
over time.
The described features are computed for each block of data
(from which the Discrete Fourier transform is computed) and
for each frequency bin or set of adjacent frequency bins.
Feature values computed from adjacent blocks (which usually
overlap) might be grouped together and represented by one
or more of the following functions f(x), whereas the
feature values computed over a group of adjacent frames (a
"super-frame") are taken as arguments x:
• variance or standard deviation
• filtering (e.g. first or higher order differences,
weighted mean value or other low-pass filtering)
• Fourier transform coefficients
The feature grouping may for example be performed by one of
the combiners 930, 940.
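The grouping functions listed above may be sketched, for illustration only, as follows for the feature values of one super-frame. The selection of three summary values, the dictionary keys and the function name are assumptions made for the example.

```python
import statistics


def group_features(values):
    """Summarise the feature values of one super-frame.

    values: feature values of adjacent frames (at least two).
    """
    if len(values) < 2:
        raise ValueError("a super-frame needs at least two frames")
    differences = [abs(b - a) for a, b in zip(values, values[1:])]
    return {
        "mean": statistics.fmean(values),
        "variance": statistics.pvariance(values),
        # Mean magnitude of the first-order differences between frames.
        "mean_abs_difference": statistics.fmean(differences),
    }
```

In this way, several frames are represented by a few numbers, which reduces the amount of data passed to the subsequent processing while retaining information about the progression of the feature over time.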
5 Computation of the spectral weights using supervised
regression or classification
In the following, we assume that an audio signal x[n] is
additively composed of a direct signal component d[n] and
an ambient signal component a[n]
x[n] = d[n] + a[n] (29)
The present application describes the computation of the
spectral weights as a combination of the feature values
with parameters, which may for example be heuristically
determined parameters (confer, for example, section 3.2).
Alternatively, the spectral weights may be determined from
an estimate of the ratio of the magnitude of the ambient
signal components to the magnitude of the direct signal
components. We define the magnitude ratio of ambient signal
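to direct signal as
RAD(m,k) = |A(m,k)| / |D(m,k)| (30)
where A(m,k) and D(m,k) denote the time-frequency
representations of a[n] and d[n], respectively.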
The ambient signal is computed using an estimate of the
magnitude ratio of ambient signal to direct signal
RAD(m,k). Spectral weights G(m,k) for the ambience
extraction are computed using
(31)
and the magnitude spectrogram of the ambient signal is
derived by spectral weighting
|A(m,k)| = G(m,k) |X(m,k)| (32)
This approach is similar to the spectral weighting (or
short-term spectral attenuation) for noise reduction of
speech signals, whereas the spectral weights are computed
from estimates of the time-varying SNR in sub-bands, see
e.g. [Sch04].
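For illustration only, the spectral weighting may be sketched as follows. Since Equation 31 is not reproduced here, the sketch assumes the weight G = RAD / (1 + RAD), which follows if the magnitudes of the ambient and direct components are taken to add up to the magnitude of the input signal; this assumption and the function names are made for the example only.

```python
def ambience_weight(ratio_ad):
    """Map an ambient-to-direct magnitude ratio to a weight in [0, 1).

    Assumed form: G = R / (1 + R).
    """
    if ratio_ad < 0.0:
        raise ValueError("magnitude ratio must be non-negative")
    return ratio_ad / (1.0 + ratio_ad)


def weight_spectrum(magnitudes, ratios):
    """Apply the weights to one frame of a magnitude spectrogram."""
    if len(magnitudes) != len(ratios):
        raise ValueError("need one ratio per frequency bin")
    return [ambience_weight(r) * m for m, r in zip(magnitudes, ratios)]
```

A ratio of 1 (equal ambient and direct magnitudes) yields a weight of 0.5, and the weight approaches 1 as the ambient component dominates.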
The main issue is the estimation of RAD(m,k). Two possible
approaches are described in the following: (1) supervised
regression and (2) supervised classification.
It should be noted that these approaches are able to
process features computed from frequency bins and from sub-
bands (i.e. groups of frequency bins) together.
For example: The ambience index and the panning index are
computed per frequency bin. The spectral centroid, spectral
flatness and energy are computed for Bark bands. Although
these features are computed using different frequency
resolutions, they are processed together using the same
classifier / regression method.
5.1 Regression
A neural net (multi-layer perceptron) is applied to the
estimation of RAD(m,k). There are two options: to estimate
RAD(m,k) for all frequency bins using one neural net, or to
use several neural nets, where each neural net estimates
RAD(m,k) for one or more frequency bins.
Each feature is fed into one input neuron. The training of
the net is described in Section 6. Each output neuron is
assigned to the RAD(m,k) of one frequency bin.
5.2 Classification
Similar to the regression approach, the estimation of
RAD(m,k) using the classification approach is done by means
of neural nets. The reference values for the training are
quantized into intervals of arbitrary size, whereas each
interval represents one class (e.g., one class could
include all RAD(m,k) in the interval [0.2, 0.3)). With n
being the number of intervals, the number of output neurons
is n-times larger compared to the regression approach.
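The quantization of the reference values into classes may be sketched, for illustration only, as follows. The interval boundaries, the one-hot target encoding (one output neuron per class) and the function names are assumptions made for the example.

```python
import bisect


def class_index(value, boundaries):
    """Index of the half-open interval [lower, upper) that contains value.

    boundaries: ascending interval edges; values below the first edge map
    to class 0, values at or above the last edge map to the last class.
    """
    return bisect.bisect_right(boundaries, value)


def one_hot(index, n_classes):
    """Target vector for n_classes output neurons of one frequency bin."""
    if not 0 <= index < n_classes:
        raise ValueError("class index out of range")
    return [1.0 if i == index else 0.0 for i in range(n_classes)]
```

With the edges 0.1, 0.2 and 0.3, a reference value of 0.25 falls into the interval [0.2, 0.3) and is therefore assigned to the class with index 2.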
6. Training
The main issue for the training is the proper choice of
reference values RAD(m,k). We propose two options (whereas
the first option is the preferred one):
1. using reference values measured from signals where the
direct signal and the ambient signal are separately
available
2. using correlation-based features computed from stereo
signals as reference values for the processing of mono
signals
6.1 Option 1
This option requires audio signals with prominent direct
signal components and negligible ambient signal components
(x[n] ≈ d[n]), e.g. signals recorded in a dry
environment.
For example, the audio signals 1810, 1860 may be considered
as such signals with dominant direct components.
An artificial reverberation signal a[n] is generated by
means of a reverberation processor or by convolution with a
room impulse response (RIR), which might be sampled in a
real room. Alternatively, other ambient signals can be
used, e.g. recordings of applause, wind, rain, or other
environmental noises.
The reference values used for the training are then
obtained from the STFT representation of d[n] and a[n]
using Equation 30.
In some embodiments, based on a knowledge of the direct
signal component and of the ambient signal component the
magnitude ratio can be determined according to equation 30.
Subsequently, an expected gain value can be obtained on the
basis of the magnitude ratio, for example using equation
31. This expected gain value can be used as the expected
gain value information 1316, 1834.
6.2 Option 2
The features based on the correlation between the left and
right channel of a stereo recording deliver powerful cues
for the ambience extraction processing. However, when
processing mono signals, these cues are not available. The
presented approach is able to process mono signals.
A valid option for choosing the reference values for
training is to use stereo signals, from which the
correlation-based features are computed and used as
reference values (for example for obtaining expected gain
values).
The reference values may for example be described by the
expected gain value information 1920, or the expected gain
value information 1920 may be derived from the reference
values.
The stereo recordings may then be down-mixed to mono for
the extraction of the other low-level features, or the low-
level features may be computed from the left and right
channel signals separately.
Some embodiments applying the concept described in this
section are shown in Figs. 19 and 20.
An alternative solution is to compute the weights G(m,k)
from the reference values RAD(m,k) according to Equation 31
and to use G(m,k) as reference values for the training. In
this case, the classifier / regression method outputs the
estimates for the spectral weights G(m,k).
7. Post-processing of the ambient signal
The following section describes appropriate post-processing
methods for the enhancement of the perceived quality of the
ambient signal.
In some embodiments, the post processing may be performed
by the post processor 700.
7.1 Nonlinear processing of sub-band signals
The derived ambient signal (for example represented by
weighted sub-band signals) does not contain ambience
components only, but also direct signal components (i.e.
the separation of ambience and direct signal components is
not perfect). The ambient signal is post-processed in order
to enhance its ambient-to-direct ratio, i.e. the ratio of
the amount of ambient components to direct components. The
applied post-processing is motivated by the observation,
that ambient sounds are rather quiet compared to direct
sounds. A simple method for attenuating loud sounds while
preserving quiet sound is to apply a non-linear compression
curve to the coefficients of the spectrogram (e.g. to the
weighted sub-band signals).
An example for an appropriate compression curve is given in
Equation 17, where c is a threshold and the parameter p
determines the degree of compression, with 0 < p < 1.
Another example for a nonlinear modification is y = x^p,
with 0 < p < 1, which increases small values relatively more
than large values. One example for this function is y =
√x, wherein x may for example represent values of the
weighted sub-band signals and y may for example represent
values of the post-processed weighted sub-band signals.
In some embodiments, the nonlinear processing of the sub-
band signals described in this section may be performed by
the nonlinear compressor 732.
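For illustration only, a power-law modification of the magnitudes of the weighted sub-band signals may be sketched as follows; an exponent of 0.5 corresponds to the square-root example. The restriction to non-negative magnitudes and the function name are assumptions made for the example.

```python
def compress_magnitudes(values, exponent=0.5):
    """Apply y = x ** exponent to non-negative magnitude values."""
    if not 0.0 < exponent < 1.0:
        raise ValueError("exponent must lie strictly between 0 and 1")
    if any(v < 0.0 for v in values):
        raise ValueError("magnitudes must be non-negative")
    return [v ** exponent for v in values]
```

With the exponent 0.5, a magnitude of 0.25 is mapped to 0.5 (doubled), whereas a magnitude of 4 is mapped to 2 (halved), so that the level difference between loud and quiet components is reduced.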
7.2 Introduction of a time delay
A few milliseconds (e.g. 14 ms) delay is introduced into
the ambient signal (for example compared to the front
signal or direct signal) to improve the stability of the
front image. This is a result of the precedence effect,
which occurs if two identical sounds are presented such
that the onset of one sound A is delayed relative to the
onset of the other sound B and both are presented at
different directions (with respect to the listener). As
long as the delay is within an appropriate range, the sound
is perceived as coming from the direction from where sound
B is presented [LCYG99].
By introducing the delay to the ambient signal, the direct
sound sources are better localized in the front of the
listener even if some direct signal components are
contained in the ambient signal.
In some embodiments, the introduction of a time delay
described in this section may be performed by the delayer
734.
7.3 Signal adaptive equalization
To minimize the timbral coloration of the surround sound
signal, the ambient signal (for example represented in
terms of weighted sub-band signals) is equalized to adapt
its long-term power spectral density (PSD) to the input
signal. This is carried out in a two-stage process.
The PSDs of both the input signal x[k] and the ambience
signal a[k] are estimated using the Welch method, yielding
I_xx^W(ω) and I_aa^W(ω), respectively. The frequency bins of
|A(ω,τ)| are weighted prior to the resynthesis using the factors
The signal adaptive equalization is motivated by the
observation that the extracted ambient signal tends to
feature a smaller spectral tilt than the input signal, i.e.
the ambient signal may sound brighter than the input
signal. In many recordings, the ambient sounds are mainly
produced by room reverberations. Since many rooms used for
recordings have smaller reverberation time for higher
frequencies than for lower frequencies, it is reasonable to
equalize the ambient signal accordingly. However, informal
listening tests have shown that the equalization to the
long-term PSD of the input signal turns out to be a valid
approach.
In some embodiments, the signal adaptive equalization
described in this section may be performed by the timbral
coloration compensator 736.
7.4 Transient Suppression
The introduction of a time delay into the rear channel
signals (see Section 7.2) evokes the perception of two
separate sounds (similar to an echo) if transient signal
components are present [WNR73] and the time delay exceeds a
signal-dependent value (the echo threshold [LCYG99]). This
echo can be attenuated by suppressing the transient signal
components in the surround sound signal or in the ambient
signal. Additional stabilization of the front image is
achieved by the transient suppression since the appearance
of localizable point sources in the rear channels is
significantly reduced.
Considering that ideal enveloping ambient sounds are
smoothly varying over time, a suitable transient
suppression method reduces transient components without
affecting the continuous character of the ambience signal.
One method that fulfils this requirement has been proposed
in [WUD07] and is described here.
First, time instances where transients occur (for example
in the ambient signal represented in terms of weighted sub-
band signals) are detected. Subsequently, the magnitude
spectrum belonging to a detected transient region is
replaced by an extrapolation of the signal portion
preceding the onset of the transient.
Therefore all values |X(ω,τt)| exceeding the running mean
μ(ω) by more than a defined maximum deviation are replaced
by a random variation of μ(ω) within a defined variation
interval. Here, subscript t indicates frames belonging to a
transient region.
To assure smooth transitions between modified and
unmodified parts, the extrapolated values are cross-faded
with the original values.
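The replacement step may be sketched, for illustration only, as follows for the magnitudes of one sub-band. The sketch assumes a first-order recursive running mean that is updated only by non-transient frames, an additive deviation threshold and a uniformly distributed random variation; the cross-fading with the original values is omitted, and all parameter names and default values are assumptions made for the example.

```python
import random


def suppress_transients(magnitudes, smoothing=0.9, max_deviation=2.0,
                        variation=0.1, seed=0):
    """Replace transient magnitudes by a random variation of the running mean."""
    if not magnitudes:
        return []
    rng = random.Random(seed)  # seeded for reproducible behaviour
    mean = magnitudes[0]
    out = []
    for value in magnitudes:
        if value - mean > max_deviation:
            # Transient frame: substitute a value close to the running mean.
            value = mean + rng.uniform(-variation, variation)
        else:
            # Non-transient frame: keep the value and update the running mean.
            mean = smoothing * mean + (1.0 - smoothing) * value
        out.append(value)
    return out
```

Because the running mean is not updated during a transient, the substituted values continue the signal level preceding the onset of the transient.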
Other transient suppression methods are described in
[WUD07].
In some embodiments, transient suppression described in
this section can be performed by the transient reducer 738.
7.5 Decorrelation
The correlation between the two signals arriving at the
left and right ear influences the perceived width of a
sound source and the ambience impression. To improve the
spaciousness of the impression, the inter-channel
correlation between the front channel signals and/or
between the rear channel signals (e.g. between two rear
channel signals based on the extracted ambient signals) is
decreased.
Various methods for the decorrelation of two signals are
appropriate and are described in the following.
Comb filtering: Two decorrelated signals are obtained by
processing two copies of a one-channel input signal by a
pair of complementary comb filters [Sch57].
Allpass filtering: Two decorrelated signals are obtained
by processing two copies of a one-channel input signal by a
pair of different allpass filters.
Filtering with flat transfer functions: Two decorrelated
signals are obtained by filtering two copies of a one-
channel input signal with two different filters with a flat
transfer function (i.e. impulse response has a white
spectrum).
The flat transfer function ensures that the timbral
coloration of the output signals is small. Appropriate FIR
filters can be constructed by using a white random number
generator and applying a decaying gain factor to each
filter coefficient.
An example is shown in Equation 19, where hk, k < N, are the
filter coefficients, rk are outputs of a white random
process, and a and b are constant parameters determining
the envelope of hk such that b ≥ aN
hk = rk(b - ak) (19)
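For illustration only, the construction of such filter coefficients may be sketched as follows. The sketch assumes uniformly distributed white random numbers in [-1, 1], a filter length N and the condition b ≥ a·N so that the linearly decaying envelope b - a·k does not become negative; the function name and the use of different seeds for the two filters are assumptions made for the example.

```python
import random


def decorrelation_filter(length, a, b, seed):
    """FIR coefficients h[k] = r[k] * (b - a * k) for k = 0 .. length - 1."""
    if length < 1 or a < 0.0 or b < a * length:
        raise ValueError("need length >= 1, a >= 0 and b >= a * length")
    rng = random.Random(seed)
    # r[k]: white random numbers; (b - a * k): linearly decaying envelope.
    return [rng.uniform(-1.0, 1.0) * (b - a * k) for k in range(length)]
```

Two different filters for the two copies of the input signal may be obtained, for example, by calling the function twice with different seeds.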
Adaptive Spectral Panoramization: Two decorrelated signals
are obtained by processing two copies of a one-channel
input signal by ASP [VZA06] (see Section 2.1.4). The
application of ASP for the decorrelation of the rear
channel signals and of the front channel signals is
described in [UWI07].
Delaying the sub-band signals: Two decorrelated signals
are obtained by decomposing two copies of a one-channel
input signal into sub-bands (e.g. using a filter-bank or an
STFT), introducing different time delays to the sub-band
signals and re-synthesizing the time signals from the
processed sub-band signals.
In some embodiments, the decorrelation described in this
section may be performed by the signal decorrelator 740.
In the following, some aspects of embodiments according to
the invention will be briefly summarized.
Embodiments according to the invention create a new method
for the extraction of a front signal and an ambient signal
suited for blind upmixing of audio signals. The advantages
of some embodiments of the method according to the
invention are multi-faceted: Compared to a previous method
for one-to-n upmixing, some methods according to the
invention are of low computational complexity. Compared to
previous methods for two-to-n upmixing, some methods
according to the invention perform successfully even if
both input channel signals are identical (mono) or nearly
identical. Some methods according to the invention do not
depend on the number of input channels and are therefore
well-suited for any configuration of input channels. Some
methods according to the invention are preferred by many
listeners when listening to the resulting surround sound
signal in listening tests.
To summarize, some embodiments are related to a low-
complexity extraction of a front signal and an ambient
signal from an audio signal for upmixing.
8 Glossary
ASP Adaptive Spectral Panoramization
NMF Non-negative Matrix Factorization
PCA Principal Component Analysis
PSD Power spectral density
STFT Short-term Fourier Transform
TFD Time-frequency Distribution
References
[AJ02] Carlos Avendano and Jean-Marc Jot. Ambience
extraction and synthesis from stereo signals for
multi-channel audio upmix. In Proc. of the
ICASSP, 2002.
[AJ04] Carlos Avendano and Jean-Marc Jot. A frequency-
domain approach to multi-channel upmix. J. Audio
Eng. Soc., 52, 2004.
[dCK03] Alain de Cheveigne and Hideki Kawahara. Yin, a
fundamental frequency estimator for speech and
music. Journal of the Acoustical Society of
America, 111 (4):1917-1930, 2003.
[Dre00] R. Dressler. Dolby Surround Pro Logic 2 Decoder:
Principles of operation. Dolby Laboratories
Information, 2000.
[DTS] DTS. An overview of DTS Neo:6 multichannel,
http://www.dts.com/media/uploads/pdfs/DTS%20Neo6%
20Overview.pdf.
[Fal05] C. Faller. Pseudostereophony revisited. In Proc.
of the AES 118th Convention, 2005.
[GJ07a] M. Goodwin and Jean-Marc Jot. Multichannel
surround format conversion and generalized upmix.
In Proc. of the AES 30th Conference, 2007.
[GJ07b] M. Goodwin and Jean-Marc Jot. Primary-ambient
signal decomposition and vector-based
localization for spatial audio coding and
enhancement. In Proc. of the ICASSP, 2007.
[HEG+99] J. Herre, E. Eberlein, B. Grill, K. Brandenburg,
and H. Gerhauser. US-Patent 5,918,203, 1999.
[IA01] R. Irwan and R. M. Aarts. A method to convert
stereo to multichannel sound. In Proc. of the AES
19th Conference, 2001.
[ISO93] ISO/MPEG. ISO/IEC 11172-3 MPEG-1. International
Standard, 1993.
[Kar] Harman Kardon. Logic 7 explained. Technical
report.
[LCYG99] R. Y. Litovsky, H. S. Colburn, W. A. Yost, and S.
J. Guzman. The precedence effect. JAES, 1999.
[LD05] Y. Li and P.F. Driessen. An unsupervised adaptive
filtering approach of 2-to-5 channel upmix. In
Proc. of the AES 119th Convention, 2005.
[LMT07] M. Lagrange, L.G. Martins, and G. Tzanetakis.
Semi-automatic mono to stereo upmixing using
sound source formation. In Proc. of the AES 122nd
Convention, 2007.
[MPA+05] J. Monceaux, F. Pachet, F. Armadu, P. Roy, and A.
Zils. Descriptor based spatialization. In Proc.
of the AES 118th Convention, 2005.
[Sch04] G. Schmidt. Single-channel noise suppression
based on spectral weighting. Eurasip Newsletter,
2004.
[Sch57] M. Schroeder. An artificial stereophonic effect
obtained from using a single signal. JAES, 1957.
[Sou04] G. Soulodre. Ambience-based upmixing. In Workshop
at the AES 117th Convention, 2004.
[UWHH07] C. Uhle, A. Walther, O. Hellmuth, and J. Herre.
Ambience separation from mono recordings using
Non-negative Matrix Factorization. In Proc. of
the AES 30th Conference, 2007.
[UWI07] C. Uhle, A. Walther, and M. Ivertowski. Blind
one-to-n upmixing. In AudioMostly, 2007.
[VZA06] V. Verfaille, U. Zolzer, and D. Arfib. Adaptive
digital audio effects (A-DAFx): A new class of
sound transformations. IEEE Transactions on
Audio, Speech, and Language Processing, 2006.
[WNR73] H. Wallach, E.B. Newman, and M.R. Rosenzweig. The
precedence effect in sound localization. J. Audio
Eng. Soc., 21:817-826, 1973.
[WUD07] A. Walther, C. Uhle, and S. Disch. Using
transient suppression in blind multi-channel
upmix algorithms. In Proc. of the AES 122nd
Convention, 2007.
In the following, some embodiments according to the invention
will be described.
An embodiment according to the invention comprises an
apparatus 100 for extracting an ambient signal 112 on the
basis of a time-frequency-domain representation of an input
audio signal 110, the time-frequency-domain representation
representing the input audio signal 110 in terms of a
plurality of sub-band signals 132 describing a plurality of
frequency bands. The apparatus comprises a gain-value
determinator 120 configured to determine a sequence 122 of
time-varying ambient signal gain-values for a given frequency
band of the time-frequency-domain representation of the input
audio signal 110 in dependence on the input audio signal. The
apparatus also comprises a weighter 130 configured to weight
one of the sub-band signals 132 representing the given
frequency band of the time-frequency-domain representation
with the time-varying ambient signal gain-values 122, to
obtain a weighted sub-band signal 112. The gain value
determinator 120 is configured to obtain one or more
quantitative feature values describing one or more features or
characteristics of the input audio signal 110 and to provide
the gain values 122 as a function of the one or more
quantitative feature values, such that the gain values are
quantitatively dependent on the quantitative feature values,
to allow for a fine-tuned extraction of the ambient components
from the input audio signal. The gain value determinator 120
also is configured to provide the gain values such that
ambience components are emphasized over non-ambience
components in the weighted sub-band signal 112. Furthermore,
the gain value determinator 120 is configured to obtain a
plurality of different quantitative feature values describing
a plurality of different features or characteristics of the
input audio signal and to combine the different quantitative
Description pages containing deleted claims for all countries except EP
feature values to obtain the sequence 122 of time-varying gain
values, such that the gain-values are quantitatively dependent
on the quantitative feature values. The gain value
determinator also is configured to weight the different
quantitative feature values differently according to weighting
coefficients. Moreover, the gain value determinator is
configured to combine at least a tonality feature value
describing a tonality of the input audio signal and an energy
feature value describing an energy within a sub-band of the
input audio signal, to obtain the gain values.
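The interplay of the gain value determinator and the weighter can be sketched as follows. This is a simplified illustration under stated assumptions: the function names, the bias term, the clipping to the range from 0 to 1, the negative weighting coefficients and all numeric values are hypothetical. The sketch assumes that high tonality and high energy indicate direct sound, so that the ambient gain decreases with both features.

```python
def ambient_gains(tonality, energy, w_tonality=-0.5, w_energy=-0.5, bias=1.0):
    """Combine two feature sequences into time-varying gains in [0, 1]."""
    gains = []
    for t, e in zip(tonality, energy):
        # Weighted combination of the features (bias is an assumption).
        g = bias + w_tonality * t + w_energy * e
        gains.append(min(1.0, max(0.0, g)))
    return gains


def weight_subband(subband, gains):
    """Weight one sub-band signal with the time-varying gain values."""
    return [g * x for g, x in zip(gains, subband)]


# Hypothetical feature values and spectral coefficients of one sub-band.
tonality = [0.9, 0.2, 0.1]
energy = [0.8, 0.3, 0.1]
subband = [1 + 1j, 0.5j, 2.0]
gains = ambient_gains(tonality, energy)
ambient = weight_subband(subband, gains)
```

In this sketch the first frame, which is tonal and energetic, receives a small gain, while the last frame passes almost unchanged into the weighted sub-band signal.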
In one embodiment of the apparatus 100, the gain value
determinator is configured to obtain at least one quantitative
feature value describing an ambience-likeliness of the sub-
band signal representing the given frequency band.
In one embodiment of the apparatus 100, the gain value
determinator is configured to scale the different quantitative
feature values in a non-linear way.
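One possible non-linear scaling is a power law with one linear and one exponential weighting coefficient per feature. The following sketch assumes non-negative feature values; the names alphas and betas and the example numbers are chosen for illustration only.

```python
def combine_features(features, alphas, betas):
    """Return sum_i alpha_i * m_i ** beta_i for non-negative features m_i."""
    return sum(a * (m ** b) for m, a, b in zip(features, alphas, betas))


# Two hypothetical feature values for one time-frequency bin.
gain = combine_features([4.0, 0.25], alphas=[0.5, 1.0], betas=[0.5, 2.0])
```

An exponent below one compresses large feature values, and an exponent above one suppresses small ones, so each feature can be given its own characteristic before the weighted sum is formed.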
In one embodiment of the apparatus 100, the gain value
determinator is configured to obtain at least one quantitative
single-channel feature value describing a feature of a single
audio signal channel, to provide the gain values using the
single channel feature value.
In one embodiment of the apparatus 100, the gain value
determinator is configured to provide the gain values on the
basis of a single audio channel.
In one embodiment of the apparatus 100, the gain value
determinator is configured to obtain a multi-band feature
value describing the input audio signal over a frequency range
comprising a plurality of frequency bands.
In one embodiment of the apparatus 100, the gain value
determinator is configured to obtain a narrow-band feature
value describing the input audio signal over a frequency range
comprising a single frequency band.
In one embodiment of the apparatus 100, the gain value
determinator is configured to obtain a broad-band feature
value describing the input audio signal over a frequency range
comprising an entirety of frequency bands of the time-
frequency-domain representation.
In one embodiment of the apparatus 100, the gain value
determinator is configured to combine different feature values
describing portions of the input audio signal having different
bandwidths, to obtain the gain values.
In one embodiment of the apparatus 100, the gain value
determinator is configured to preprocess the time-frequency-
domain representation of the input audio signal in a non-
linear way, and to obtain a quantitative feature value on the
basis of the preprocessed time-frequency-domain
representation.
In one embodiment of the apparatus 100, the gain value
determinator is configured to post process the obtained
feature values in a non-linear way, to limit a range of values
of the feature values, to obtain post processed feature
values.
In one embodiment of the apparatus 100, the gain value
determinator is configured to obtain a quantitative feature
value describing a tonality of the input audio signal, to
determine the gain values.
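A spectral flatness measure is one common quantitative description of tonality. The following sketch computes it as the ratio of the geometric mean to the arithmetic mean of the power spectrum; the function name and the small constant eps, which avoids the logarithm of zero, are assumptions.

```python
import math


def spectral_flatness(magnitudes, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum.

    Values near 1 indicate a flat, noise-like spectrum; values near 0
    indicate a peaky, tonal spectrum.
    """
    powers = [m * m + eps for m in magnitudes]
    log_mean = sum(math.log(p) for p in powers) / len(powers)
    return math.exp(log_mean) / (sum(powers) / len(powers))


flat = spectral_flatness([1.0, 1.0, 1.0, 1.0])
peaky = spectral_flatness([10.0, 0.01, 0.01, 0.01])
```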
In one embodiment of the apparatus 100, the gain value
determinator is configured to obtain one or more quantitative
channel-relationship values describing a relationship between
two or more channels of the input audio signal.
In one embodiment of the apparatus 100, one of the one or more
quantitative channel-relationship values describes a
correlation or a coherence between two channels of the input
audio signal.
In one embodiment of the apparatus 100, one of the one or more
quantitative channel-relationship values describes an inter-
channel short-time coherence.
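An inter-channel short-time coherence can be estimated, for example, from recursively smoothed auto- and cross-power estimates of one sub-band. The sketch below is one such estimator; the first-order recursive smoothing, the smoothing constant and the function name are assumptions and are not taken from the embodiments.

```python
def short_time_coherence(x1, x2, smoothing=0.9):
    """Coherence per frame for two sequences of complex coefficients."""
    p11 = p22 = 0.0
    p12 = 0j
    out = []
    for a, b in zip(x1, x2):
        a, b = complex(a), complex(b)
        # Recursively smoothed auto-power and cross-power estimates.
        p11 = smoothing * p11 + (1.0 - smoothing) * abs(a) ** 2
        p22 = smoothing * p22 + (1.0 - smoothing) * abs(b) ** 2
        p12 = smoothing * p12 + (1.0 - smoothing) * a * b.conjugate()
        denom = p11 * p22
        out.append(abs(p12) ** 2 / denom if denom > 0.0 else 0.0)
    return out


same = short_time_coherence([1 + 1j, 2j, -1.0], [1 + 1j, 2j, -1.0])
diff = short_time_coherence([1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 1.0, -1.0])
```

Identical channels give a coherence of one in every frame, whereas channels whose phase relation changes from frame to frame give smaller values once the smoothing has averaged over several frames.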
In one embodiment of the apparatus 100, one of the one or more
quantitative channel-relationship values describes a position
of a sound source on the basis of two or more channels of the
input audio signal.
In one embodiment of the apparatus 100, one of the one or more
quantitative channel-relationship values describes an inter-
channel level difference between two or more channels of the
input audio signal.
In one embodiment of the apparatus 100, the gain value
determinator is configured to obtain, as one of the one or
more quantitative channel-relationship values, a panning
index.
In one embodiment of the apparatus 100, the gain value
determinator is configured to determine a ratio between a
spectral value difference and a spectral value sum for a given
time-frequency bin, to obtain a panning index for the given
time-frequency bin.
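The ratio can be sketched as follows. The use of spectral magnitudes, the sign convention and the small constant eps in the denominator are assumptions made for this illustration.

```python
def panning_index(x_left, x_right, eps=1e-12):
    """Ratio of the magnitude difference to the magnitude sum for one bin."""
    left, right = abs(x_left), abs(x_right)
    return (left - right) / (left + right + eps)


center = panning_index(1 + 1j, 1 + 1j)  # equal level in both channels
hard_left = panning_index(1.0, 0.0)     # energy only in the left channel
hard_right = panning_index(0.0, 2.0)    # energy only in the right channel
```

With this convention a value near zero indicates a source panned to the center, and values near plus or minus one indicate a source panned fully to one side.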
In one embodiment of the apparatus 100, the gain value
determinator is configured to obtain a spectral-centroid
feature-value describing a spectral centroid of a spectrum of
the input audio signal or of a portion of the spectrum of the
input audio signal.
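A spectral centroid can be computed as the magnitude-weighted mean frequency of the considered bins. The sketch below assumes that the bin center frequencies are known; the function name and the handling of an all-zero spectrum are assumptions.

```python
def spectral_centroid(magnitudes, frequencies):
    """Magnitude-weighted mean frequency of a spectrum or a portion of it."""
    total = sum(magnitudes)
    if total == 0:
        return 0.0  # convention chosen for silent frames
    return sum(f * m for f, m in zip(frequencies, magnitudes)) / total


centroid = spectral_centroid([1.0, 1.0], [100.0, 300.0])
```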
In one embodiment of the apparatus 100, the gain value
determinator is configured to provide a gain value, for
weighting a given one of the sub-band signals, in dependence
on a plurality of sub-band signals represented by the time-
frequency-domain representation.
In one embodiment of the apparatus 100, the weighter is
configured to weight a group of sub-band signals with a common
sequence of time-varying gain-values.
In one embodiment of the apparatus 100, the apparatus further
comprises a signal post processor configured to post process
the weighted sub-band signal or a signal based thereon, to
enhance an ambient-to-direct ratio and to obtain a post
processed signal in which an ambient-to-direct ratio is
enhanced. The signal post processor is configured to attenuate
loud sounds in the weighted sub-band signal or in the signal
based thereon while preserving quiet sounds, to obtain the
post processed signal, or the signal post processor is
configured to apply a non-linear compression to the weighted
sub-band signal or to the signal based thereon.
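A non-linear compression that attenuates loud sounds while preserving quiet sounds can be sketched as follows. The threshold, the exponent and the function name are hypothetical; other compression characteristics are equally possible.

```python
def compress_loud(samples, threshold=0.5, exponent=0.5):
    """Leave values below the threshold unchanged and compress the rest."""
    out = []
    for x in samples:
        mag = abs(x)
        if mag <= threshold:
            out.append(x)  # quiet sounds are preserved
        else:
            # Power-law compression of the magnitude above the threshold.
            target = threshold * (mag / threshold) ** exponent
            out.append(x * (target / mag))
    return out


compressed = compress_loud([0.1, -0.2, 2.0])
```

Because only the magnitude is scaled, the sketch also works for complex spectral coefficients and leaves their phase unchanged.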
In one embodiment of the apparatus 100, the apparatus further
comprises a signal post processor configured to post process
the weighted sub-band signal or a signal based thereon, to
obtain a post processed signal, wherein the signal post
processor is configured to delay the weighted sub-band signal
or the signal based thereon in a range between 2 milliseconds
and 70 milliseconds, to obtain a delay between a front signal
and an ambient signal based on the weighted sub-band signal.
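The delay can be sketched for a time-domain signal as follows. The range check reflects the 2 to 70 milliseconds stated above; the function name, the rounding to whole samples and the truncation to the input length are assumptions.

```python
def delay_signal(samples, delay_ms, sample_rate):
    """Delay a signal by delay_ms milliseconds, keeping its length."""
    if not 2.0 <= delay_ms <= 70.0:
        raise ValueError("delay_ms is expected between 2 and 70")
    n = int(round(delay_ms * sample_rate / 1000.0))
    n = min(n, len(samples))
    return [0.0] * n + list(samples[:len(samples) - n])


# 2 ms at a sample rate of 1000 Hz corresponds to two samples.
delayed = delay_signal([1.0, 2.0, 3.0, 4.0, 5.0], delay_ms=2.0, sample_rate=1000)
```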
In one embodiment of the apparatus 100, the apparatus further
comprises a signal post processor configured to post process
the weighted sub-band signal or a signal based thereon, to
obtain a post processed signal, wherein the post processor is
configured to perform a frequency-dependent equalization with
respect to an ambient signal representation based on the
weighted sub-band signal, to counteract a timbral coloration
of the ambient signal representation.
In one embodiment of the apparatus 100, the post processor is
configured to perform the frequency dependent equalization
with respect to the ambient signal representation based on the
weighted sub-band signal, to obtain, as the post processed
ambient signal representation, an equalized ambient signal
representation, wherein the post processor is configured to
perform the frequency dependent equalization to adapt a long
term power spectral density of the equalized ambient signal
representation to the input audio signal.
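Such an equalization can be sketched with one gain per frequency band, derived from long-term power spectral density estimates. How the estimates are obtained is left open here; the function names and the constant eps are assumptions.

```python
import math


def equalizer_gains(input_psd, ambient_psd, eps=1e-12):
    """Amplitude gain per band that maps the ambient PSD onto the input PSD."""
    return [math.sqrt(p_in / (p_amb + eps))
            for p_in, p_amb in zip(input_psd, ambient_psd)]


def equalize(ambient_bands, gains):
    """Apply one equalizer gain to every frame of the corresponding band."""
    return [[g * x for x in band] for band, g in zip(ambient_bands, gains)]


eq = equalizer_gains([4.0, 1.0], [1.0, 1.0])
equalized = equalize([[1.0, -1.0], [0.5, 0.5]], eq)
```

The square root converts the power ratio into an amplitude gain, so that the power of each equalized band approaches the long-term power of the input signal in that band.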
In one embodiment of the apparatus 100, the apparatus further
comprises a signal post processor configured to post process
the weighted sub-band signal or a signal based thereon, to
obtain a post processed signal, wherein the signal post
processor is configured to reduce transients in the weighted
sub-band signal or in the signal based thereon.
In one embodiment of the apparatus 100, the apparatus further
comprises a signal post processor configured to post process
the weighted sub-band signal or a signal based thereon, to
obtain a post processed signal, wherein the post processor is
configured to obtain, on the basis of the weighted sub-band
signal or the signal based thereon, a left ambient signal and
a right ambient signal, such that the left ambient signal and
the right ambient signal are at least partially de-correlated.
In one embodiment of the apparatus 100, the apparatus is
configured to also provide a front signal on the basis of the
input audio signal, wherein the weighter is configured to
weight one of the sub-band signals representing the given
frequency band of the time-frequency-domain representation
with varying front-signal gain-values, to obtain a weighted
front-signal sub-band signal, wherein the weighter is
configured such that the time-varying front-signal gain-values
decrease with increasing ambient-signal gain-values.
In one embodiment of the apparatus 100, the weighter is
configured to provide the time-varying front-signal gain-
values such that the front-signal gain-values are
complementary to the ambient-signal gain-values.
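Complementary gain values can be sketched, for example, as amplitude-complementary values that sum to one. This choice is an assumption; a power-complementary rule, in which the squared gains sum to one, would be another option.

```python
def front_gains(ambient_gain_values):
    """Amplitude-complementary front gains: g_front = 1 - g_ambient."""
    return [1.0 - g for g in ambient_gain_values]


g_ambient = [0.0, 0.25, 1.0]
g_front = front_gains(g_ambient)
```

With this rule the front-signal gain-values decrease as the ambient-signal gain-values increase, and the weighted front and ambient sub-band signals add up to the original sub-band signal.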
In one embodiment of the apparatus 100, the apparatus
comprises a time-frequency-domain to time-domain converter
configured to provide a time-domain representation of the
ambient signal in dependence on the one or more weighted sub-
band signals.
In one embodiment of the apparatus 100, the apparatus is
configured to extract the ambient signal on the basis of a
mono input audio signal.
An embodiment according to the invention comprises a multi-
channel audio signal generator for providing a multi-channel
audio signal comprising at least one ambient signal on the
basis of one or more input audio signals. The multi-channel
audio signal generator comprises an ambient signal extractor
1010 configured to extract an ambient signal on the basis of a
time-frequency-domain representation of the input audio
signal, the time-frequency-domain representation representing
the input audio signal in terms of a plurality of sub-band
signals describing a plurality of frequency bands. The ambient
signal extractor comprises a gain-value determinator
configured to determine a sequence of time-varying ambient
signal gain-values for a given frequency band of the time-
frequency-domain representation of the input audio signal in
dependence on the input audio signal, and a weighter
configured to weight one of the sub-band signals representing
the given frequency band of the time-frequency-domain
representation with the time-varying gain-values, to obtain a
weighted sub-band signal. The gain value determinator is
configured to obtain one or more quantitative feature values
describing one or more features or characteristics of the
input audio signal and to provide the gain values as a
function of the one or more quantitative feature values, such
that the gain values are quantitatively dependent on the
quantitative feature values to allow for a fine-tuned
extraction of the ambient components from the input audio
signal. The gain value determinator also is configured to
provide the gain values such that ambience components are
emphasized over non-ambience components in the weighted sub-
band signal. Furthermore, the gain value determinator 120 is
configured to obtain a plurality of different quantitative
feature values describing a plurality of different features or
characteristics of the input audio signal and to combine the
different quantitative feature values to obtain the sequence
122 of time-varying gain values, such that the gain-values are
quantitatively dependent on the quantitative feature values.
The gain value determinator also is configured to weight the
different quantitative feature values differently according to
weighting coefficients. Moreover, the gain value determinator
is configured to combine at least a tonality feature value
describing a tonality of the input audio signal and an energy
feature value describing an energy within a sub-band of the
input audio signal, to obtain the gain values. The multi-
channel audio signal generator further comprises an ambient
signal provider 1020 configured to provide the one or more
ambient signals on the basis of the weighted sub-band signal.
In one embodiment of the multi-channel audio signal generator,
the multi-channel audio signal generator is configured to
provide the one or more ambient signals as one or more rear
channel audio signals.
In one embodiment of the multi-channel audio signal generator,
the multi-channel audio signal generator is configured to
provide one or more front channel audio signals on the basis
of the one or more input audio signals.
An embodiment according to the invention comprises an
apparatus 1300 for obtaining, on the basis of a coefficient
determination input audio signal, weighting coefficients for
parameterizing a gain-value determinator for extracting an
ambient signal from an input audio signal. The apparatus 1300
comprises a weighting coefficient determinator 1300 configured
to determine the weighting coefficients such that gain values
obtained on the basis of a weighted combination, using the
weighting coefficients, of a plurality of different
quantitative feature-values 1322, 1324 describing a plurality
of different features or characteristics of the coefficient-
determination input audio signal, the feature values
comprising at least a tonality feature value describing a
tonality of the input audio signal and an energy feature value
describing an energy within a subband of the input audio
signal, approximate expected gain values 1310 associated with
the coefficient determination audio signal, wherein the
expected gain values describe an intensity of ambience
components or of non-ambience components in the coefficient
determination input audio signal, or an information derived
therefrom, for a plurality of time-frequency bins of the
coefficient-determination input audio signal.
In one embodiment of the apparatus 1300, the apparatus
comprises a coefficient-determination-signal generator
configured to provide the coefficient-determination-signal on
the basis of a reference audio signal comprising only
negligible ambient signal components. The coefficient-
determination-signal generator is configured to combine the
reference audio signal with ambient signal components, to
obtain the coefficient determination signal, and to provide an
information describing the ambient signal components or a
relationship between the ambient signal components and direct
signal components of the reference audio signal to the
weighting-coefficient determinator, to describe the expected
gain values.
In one embodiment of the apparatus 1300, the coefficient-
determination-signal generator comprises an artificial
ambient-signal generator configured to provide the ambient
signal components on the basis of the reference audio signal.
In one embodiment of the apparatus 1300, the apparatus
comprises a coefficient-determination-signal generator,
wherein the coefficient-determination-signal generator is
configured to provide the coefficient-determination-signal and
an information describing the expected gain values on the
basis of a multi-channel reference audio signal. The
coefficient-determination-signal generator is configured to
determine an information describing a relationship between two
or more channels of the multi-channel reference audio signal
to provide the information describing the expected gain
values.
In one embodiment of the apparatus 1300, the coefficient-
determination-signal generator is configured to determine a
correlation-based quantitative feature value describing a
correlation between two or more of the channel signals of the
multi-channel reference audio signal to provide the
information describing the expected gain values.
In one embodiment of the apparatus 1300, the coefficient-
determination-signal generator is configured to provide one
channel of the multi-channel reference audio signal as the
coefficient-determination-signal.
In one embodiment of the apparatus 1300, the coefficient
determination signal generator is configured to combine two or
more of the channels of the multi-channel reference audio
signal to obtain the coefficient-determination-signal.
In one embodiment of the apparatus 1300, the weighting
coefficient determinator is configured to determine the
weighting coefficients using a regression method, a
classification method or a neural net, wherein the
coefficient-determination-signal is used as a training signal,
wherein the expected gain values serve as reference values and
wherein the coefficients are determined.
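As an illustration of the regression option, the following sketch determines two linear weighting coefficients by least squares, so that the weighted combination of two feature sequences approximates the expected gain values. The restriction to two features without an offset, the function name and the training values are assumptions made to keep the sketch short.

```python
def fit_two_weights(feature_a, feature_b, expected_gains):
    """Least-squares weights (w_a, w_b) for w_a * a + w_b * b ~ expected."""
    saa = sum(a * a for a in feature_a)
    sbb = sum(b * b for b in feature_b)
    sab = sum(a * b for a, b in zip(feature_a, feature_b))
    say = sum(a * y for a, y in zip(feature_a, expected_gains))
    sby = sum(b * y for b, y in zip(feature_b, expected_gains))
    # Solve the 2x2 normal equations in closed form.
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("features are linearly dependent")
    w_a = (say * sbb - sby * sab) / det
    w_b = (sby * saa - say * sab) / det
    return w_a, w_b


# Hypothetical training data generated with the weights 0.3 and 0.5.
feat_a = [1.0, 0.0, 1.0, 2.0]
feat_b = [0.0, 1.0, 1.0, 1.0]
targets = [0.3 * a + 0.5 * b for a, b in zip(feat_a, feat_b)]
w_a, w_b = fit_two_weights(feat_a, feat_b, targets)
```

In practice the feature values and the expected gain values would be collected over many time-frequency bins of the coefficient-determination-signal before the coefficients are fitted.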
We claim:
1. An apparatus (100) for extracting an ambient signal (112)
on the basis of a time-frequency-domain representation of
an input audio signal (110), the time-frequency-domain
representation representing the input audio signal (110)
in terms of a plurality of sub-band signals (132)
describing a plurality of frequency bands, the apparatus
comprising:
a gain-value determinator (120) configured to determine a
sequence (122) of time-varying ambient signal gain-values
for a given frequency band of the time-frequency-domain
representation of the input audio signal (110) in
dependence on the input audio signal;
a weighter (130) configured to weight one of the sub-band
signals (132) representing the given frequency band of the
time-frequency-domain representation with the time-varying
ambient signal gain-values (122), to obtain a weighted
sub-band signal (112);
wherein the gain value determinator (120) is configured to
obtain one or more quantitative feature values describing
one or more features or characteristics of the input audio
signal (110) and to provide the gain values (122) as a
function of the one or more quantitative feature values,
such that the gain values are quantitatively dependent on
the quantitative feature values, to allow for a fine-tuned
extraction of the ambient components from the input audio
signal; and
wherein the gain value determinator (120) is configured to
provide the gain values such that ambience components are
emphasized over non-ambience components in the weighted
sub-band signal (112);
wherein the gain value determinator (120) is configured
to obtain a plurality of different quantitative feature
values describing a plurality of different features or
characteristics of the input audio signal and to combine
the different quantitative feature values to obtain the
sequence (122) of time-varying gain values, such that the
gain-values are quantitatively dependent on the
quantitative feature values;
wherein the gain value determinator is configured to
weight the different quantitative feature values
differently according to weighting coefficients; and
wherein the gain value determinator is configured to
combine at least a tonality feature value describing a
tonality of the input audio signal and an energy feature
value describing an energy within a sub-band of the input
audio signal, to obtain the gain values.
2. The apparatus according to claim 1, wherein the gain value
determinator is configured to determine the time-varying
gain values on the basis of the time-frequency-domain
representation of the input audio signal.
3. The apparatus according to claim 1 or 2, wherein the gain
value determinator is configured to combine the different
feature values using the relationship
$$g(\omega, \tau) = \sum_{i=1}^{K} \alpha_i \, m_i(\omega, \tau)^{\beta_i}$$
to obtain the gain values,
wherein $\omega$ designates a sub-band index,
wherein $\tau$ designates a time index,
wherein $i$ designates a running variable,
wherein $K$ represents a number of feature values to be
combined,
wherein $m_i(\omega, \tau)$ designates an i-th feature value for a sub-
band having frequency index $\omega$ and a time having time
index $\tau$,
wherein $\alpha_i$ designates a linear weighting coefficient for
the i-th feature value,
wherein $\beta_i$ designates an exponential weighting coefficient
for the i-th feature value,
wherein $g(\omega, \tau)$ designates a gain value for a sub-band
having frequency index $\omega$ and a time having time index $\tau$.
4. The apparatus according to one of claims 1 to 3, wherein
the gain value determinator comprises a weight adjuster
configured to adjust weights of different features to be
combined.
5. The apparatus according to one of claims 1 to 4, wherein
the gain value determinator is configured to combine at
least the tonality feature value, the energy feature value
and a spectral centroid feature value describing a
spectral centroid of a spectrum of the input audio signal
or of a portion of the spectrum of the input audio signal,
to obtain the gain values.
6. The apparatus according to one of claims 1 to 5, wherein
the gain value determinator is configured to combine a
plurality of feature values describing identical features
or characteristics associated with different time-
frequency-bins of the time-frequency domain
representation, to obtain a combined feature value.
7. The apparatus according to claim 6, wherein the gain value
determinator is configured to obtain, as the quantitative
feature value describing the tonality,
a spectral flatness measure, or
a spectral crest factor, or
a ratio of at least two spectral values obtained using
different non-linear processing of copies of a spectrum of
the input audio signal, or
a ratio of at least two spectral values obtained using
different non-linear filtering of copies of a spectrum of
the input signal, or
a value indicating a presence of a spectral peak, or
a similarity value describing a similarity between the
input audio signal and a time-shifted version of the input
audio signal, or
a prediction error value describing a difference between a
predicted spectral coefficient of the time-frequency-
domain representation and an actual spectral coefficient
of the time-frequency-domain representation.
8. The apparatus according to one of claims 1 to 9, wherein
the gain value determinator is configured to obtain at
least one quantitative feature value describing an energy
within a sub-band of the input audio signal, to determine
the gain values.
9. The apparatus according to claim 8, wherein the gain value
determinator is configured to determine the gain values
such that the gain value for a given time-frequency bin of
the time-frequency-domain description decreases with
increasing energy in the given time-frequency bin, or with
increasing energy in a time-frequency bin within an
neighborhood of the given time-frequency bin.
10. The apparatus according to claim 8 or 9, wherein the gain
value determinator is configured to treat an energy in a
given time-frequency bin and a maximum energy or average
energy in a predetermined neighborhood of the given time-
frequency bin as separate features.
11. The apparatus according to claim 10, wherein the gain
value determinator is configured to obtain a first
quantitative feature value describing an energy of the
given time-frequency bin and a second quantitative feature
value describing a maximum energy or an average energy in
a predetermined neighborhood of the given time-frequency
bin, and to combine the first quantitative feature value
and the second quantitative feature value to obtain the
gain value.
12. The apparatus according to one of claims 1 to 11, wherein
the gain value determinator is configured to obtain one or
more quantitative channel-relationship values describing a
relationship between two or more channels of the input
audio signal.
13. The apparatus according to one of claims 1 to 12, wherein
the apparatus is configured to also provide a front signal
on the basis of the input audio signal,
wherein the weighter is configured to weight one of the
sub-band signals representing the given frequency band of
the time-frequency-domain representation with varying
front-signal gain-values, to obtain a weighted front-
signal sub-band signal,
wherein the weighter is configured such that the time-
varying front-signal gain-values decrease with increasing
ambient-signal gain-values.
14. A multi-channel audio signal generator for providing a
multi-channel audio signal comprising at least one ambient
signal on the basis of one or more input audio signals,
the apparatus comprising:
an ambient signal extractor (1010) configured to extract
an ambient signal on the basis of a time-frequency-domain
representation of the input audio signal, the time-
frequency-domain representation representing the input
audio signal in terms of a plurality of sub-band signals
describing a plurality of frequency bands,
the ambient signal extractor comprising:
a gain-value determinator configured to determine a
sequence of time-varying ambient signal gain-values for a
given frequency band of the time-frequency-domain
representation of the input audio signal in dependence on
the input audio signal, and
a weighter configured to weight one of the sub-band
signals representing the given frequency band of the time-
frequency-domain representation with the time-varying
gain-values, to obtain a weighted sub-band signal,
wherein the gain value determinator is configured to
obtain one or more quantitative feature values describing
one or more features or characteristics of the input audio
signal and to provide the gain values as a function of the
one or more quantitative feature values, such that the
gain values are quantitatively dependent on the
quantitative feature values to allow for a fine-tuned
extraction of the ambient components from the input audio
signal, and
wherein the gain value determinator is configured to
provide the gain values such that ambience components are
emphasized over non-ambience components in the weighted
sub-band signal;
wherein the gain value determinator (120) is configured to
obtain a plurality of different quantitative feature
values describing a plurality of different features or
characteristics of the input audio signal and to combine
the different quantitative feature values to obtain the
sequence (122) of time-varying gain values, such that the
gain-values are quantitatively dependent on the
quantitative feature values;
wherein the gain value determinator is configured to
weight the different quantitative feature values
differently according to weighting coefficients; and
wherein the gain value determinator is configured to
combine at least a tonality feature value describing a
tonality of the input audio signal and an energy feature
value describing an energy within a sub-band of the input
audio signal, to obtain the gain values; and
an ambient signal provider (1020) configured to provide
the one or more ambient signals on the basis of the
weighted sub-band signal.
15. An apparatus (1300) for obtaining, on the basis of a
coefficient determination input audio signal, weighting
coefficients for parameterizing a gain-value determinator
for extracting an ambient signal from an input audio
signal, the apparatus comprising:
a weighting coefficient determinator (1300) configured to
determine the weighting coefficients such that gain values
obtained on the basis of a weighted combination, using the
weighting coefficients, of a plurality of different
quantitative feature-values (1322, 1324) describing a
plurality of different features or characteristics of the
coefficient-determination input audio signal, the feature
values comprising at least a tonality feature value
describing a tonality of the input audio signal and an
energy feature value describing an energy within a subband
of the input audio signal, approximate expected gain
values (1310) associated with the coefficient
determination audio signal, wherein the expected gain
values describe an intensity of ambience components or of
non-ambience components in the coefficient determination
input audio signal, or an information derived therefrom,
for a plurality of time-frequency bins of the coefficient-
determination input audio signal.
16. The apparatus according to claim 15, wherein the apparatus
comprises a coefficient-determination-signal generator
configured to provide the coefficient-determination-signal
on the basis of a reference audio signal comprising only
negligible ambient signal components,
wherein the coefficient-determination-signal generator is
configured to combine the reference audio signal with
ambient signal components, to obtain the coefficient
determination signal, and
to provide an information describing the ambient signal
components or a relationship between the ambient signal
components and direct signal components of the reference
audio signal to the weighting-coefficient determinator,
to describe the expected gain values.
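By way of illustration only, the generation of a coefficient-determination signal from a reference audio signal with negligible ambience can be sketched as follows. The per-bin energy ratio used as the expected gain value, the function name `make_training_bins` and the small constant `eps` are assumptions of this sketch and are not taken from the claims.

```python
def make_training_bins(direct_bins, ambient_bins, eps=1e-12):
    """Mix direct and ambient time-frequency bins and derive expected gains.

    direct_bins, ambient_bins: complex spectral coefficients of the same
    time-frequency bins (reference signal and added ambience).
    Returns the mixed bins and, per bin, the share of ambient energy in
    the total energy (an assumed definition of the expected gain value).
    """
    mixed = [d + a for d, a in zip(direct_bins, ambient_bins)]
    expected = []
    for d, a in zip(direct_bins, ambient_bins):
        ambient_energy = abs(a) ** 2
        direct_energy = abs(d) ** 2
        # eps avoids a division by zero for silent bins.
        expected.append(ambient_energy / (ambient_energy + direct_energy + eps))
    return mixed, expected


# Hypothetical bins: purely direct, purely ambient, and a mixture.
direct = [1 + 0j, 0j, 3 + 0j]
ambient = [0j, 2 + 0j, 4j]
mixed, expected = make_training_bins(direct, ambient)
```

Because the ambient components are added deliberately, the expected gain values are known for every bin and can be handed to the weighting-coefficient determinator.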
17. The apparatus according to claim 15 or 16, wherein the
apparatus comprises a coefficient-determination-signal
generator, wherein the coefficient-determination-signal
generator is configured to provide the coefficient-
determination-signal and an information describing the
expected gain values on the basis of a multi-channel
reference audio signal,
wherein the coefficient-determination-signal generator is
configured to determine an information describing a
relationship between two or more channels of the multi-
channel reference audio signal to provide the information
describing the expected gain values.
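By way of illustration only, one conceivable relationship between two channels of a multi-channel reference audio signal is their normalized cross-correlation within a sub-band. The sketch below assumes that weakly correlated channels indicate ambience and maps the correlation to an expected gain value. This mapping and the names `interchannel_coherence` and `expected_ambient_gain` are assumptions of the sketch, not features of the claims.

```python
import math


def interchannel_coherence(left, right, eps=1e-12):
    """Magnitude of the normalized cross-correlation of two channels.

    left, right: complex sub-band samples of the same frames.
    The result is close to 1 for identical channels and 0 for
    orthogonal channels.
    """
    cross = sum(l * r.conjugate() for l, r in zip(left, right))
    power_left = sum(abs(l) ** 2 for l in left)
    power_right = sum(abs(r) ** 2 for r in right)
    return abs(cross) / math.sqrt(power_left * power_right + eps)


def expected_ambient_gain(left, right):
    """Assumed mapping: low inter-channel coherence means high ambience."""
    return 1.0 - min(1.0, interchannel_coherence(left, right))


# Hypothetical sub-band samples.
same = expected_ambient_gain([1 + 0j, 2 + 0j], [1 + 0j, 2 + 0j])
orthogonal = expected_ambient_gain([1 + 0j, 1 + 0j], [1 + 0j, -1 + 0j])
```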
18. A method (2100) for extracting an ambient signal on the
basis of a time-frequency-domain representation of an
input audio signal, the time-frequency-domain
representation representing the input audio signal in
terms of a plurality of sub-band signals describing a
plurality of frequency bands, the method comprising:
obtaining (2110) a plurality of different quantitative
feature-values describing one or more features or
characteristics of the input audio signal;
determining (2120) a sequence of time-varying ambient-
signal gain-values for a given frequency band of the time-
frequency-domain representation of the input audio signal
as a function of the one or more quantitative feature-
values, such that the gain-values are quantitatively
dependent on the quantitative feature-values;
wherein determining the sequence of time-varying ambient-
signal gain-values comprises combining the different
quantitative feature values, wherein the different
quantitative feature values are weighted differently
according to weighting coefficients, and
wherein at least a tonality feature value describing a
tonality of the input audio signal and an energy feature
value describing an energy within a sub-band of the input
audio signal are combined, to obtain the gain values; and
weighting (2130) a sub-band signal representing the given
frequency band of the time-frequency-domain representation
with the time-varying gain-values.
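By way of illustration only, the method steps above can be sketched for a single frequency band as follows. The linear form of the combination, the `offset` term, the clamping of the gain values to the range from 0 to 1 and all numeric values are assumptions of this sketch. The claim itself only requires that the feature values be weighted differently and combined.

```python
def ambient_gains(tonality, energy, w_tonality, w_energy, offset=0.0):
    """Combine per-frame feature values of one sub-band into gain values.

    The linear combination, the offset and the clamping to [0, 1] are
    assumptions made for this sketch.
    """
    gains = []
    for t, e in zip(tonality, energy):
        combined = w_tonality * t + w_energy * e + offset
        gains.append(min(1.0, max(0.0, combined)))
    return gains


def weight_subband(subband, gains):
    """Multiply each complex sub-band sample by its time-varying gain."""
    return [g * x for g, x in zip(gains, subband)]


# Hypothetical feature values for two frames of one frequency band.
tonality = [0.9, 0.1]
energy = [1.0, 0.2]
subband = [2 + 0j, 0 + 1j]
gains = ambient_gains(tonality, energy, w_tonality=-0.8, w_energy=-0.1, offset=1.0)
weighted = weight_subband(subband, gains)
```

With the negative weighting coefficients chosen here, a tonal, high-energy frame receives a small gain value and a noise-like, low-energy frame receives a large one, so that ambience components are emphasized in the weighted sub-band signal.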
19. A method (2200) for obtaining weighting coefficients for
parameterizing a gain value determination for extracting
an ambient signal from an input audio signal, the method
comprising:
obtaining (2210) a coefficient-determination-signal, such
that an information about ambient components present in
the coefficient-determination-signal or an information
describing a relationship between an ambient-component and
a non-ambient component is known; and
determining (2220) the weighting coefficients such that
gain-values obtained on the basis of a weighted
combination, according to the weighting coefficients, of a
plurality of different quantitative feature-values,
describing a plurality of different features or
characteristics of the coefficient-determination-signal,
approximate expected gain-values associated with the
coefficient-determination-signal,
wherein the expected gain values describe an intensity of
the ambient components or of non-ambience components in
the coefficient-determination-signal, or an information
derived therefrom, for a plurality of time-frequency bins
of the coefficient-determination signal, and
wherein the feature values comprise at least a tonality
feature value describing a tonality of the input audio
signal and an energy feature-value describing an energy
within a subband of the input audio signal.
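By way of illustration only, the determination of the weighting coefficients can be sketched as a least-squares fit, in which the weighted combination of the feature values approximates the expected gain values over a set of time-frequency bins. The least-squares criterion, the constant third feature acting as an offset, and the name `fit_weights` are assumptions of this sketch. The claim does not prescribe a particular approximation method.

```python
def fit_weights(features, expected_gains):
    """Least-squares weights w such that sum_i w[i] * row[i] ~ expected gain.

    features: one row of feature values per time-frequency bin.
    Solves the normal equations (A^T A) w = A^T g by Gaussian
    elimination with partial pivoting.
    """
    n = len(features[0])
    ata = [[sum(row[i] * row[j] for row in features) for j in range(n)]
           for i in range(n)]
    atg = [sum(row[i] * g for row, g in zip(features, expected_gains))
           for i in range(n)]
    aug = [ata[i] + [atg[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("feature values are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    weights = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(aug[i][j] * weights[j] for j in range(i + 1, n))
        weights[i] = (aug[i][n] - rest) / aug[i][i]
    return weights


# Hypothetical rows: [tonality, energy, 1.0]; the constant 1.0 is an offset.
rows = [[0.9, 1.0, 1.0], [0.1, 0.2, 1.0], [0.5, 0.5, 1.0], [0.3, 0.8, 1.0]]
true_weights = [-0.8, -0.1, 1.0]
targets = [sum(w * x for w, x in zip(true_weights, row)) for row in rows]
fitted = fit_weights(rows, targets)
```

In this toy example the expected gain values are generated from known weights, so the fit should recover them up to rounding error. With real signals the fit would only approximate the expected gain values.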
20. A computer program for performing a method according to
claim 18 or 19, when the computer program runs on a
computer.



Documents:

http://ipindiaonline.gov.in/patentsearch/GrantedSearch/viewdoc.aspx?id=5Nd41xwPiH3KLOfkN1dc3g==&loc=wDBSZCsAt7zoiVrqcFJsRw==


Patent Number: 270770
Indian Patent Application Number: 1115/KOLNP/2010
PG Journal Number: 04/2016
Publication Date: 22-Jan-2016
Grant Date: 18-Jan-2016
Date of Filing: 26-Mar-2010
Name of Patentee: FRAUNHOFER-GESELLSCHAFT ZUR FÖRDERUNG DER ANGEWANDTEN FORSCHUNG E.V.
Applicant Address: HANSASTRASSE 27C, 80686 MÜNCHEN, GERMANY
Inventors:
# | Inventor's Name | Inventor's Address
1 | JUERGEN HERRE | HALLESTRASSE 24 91054 BUCKENHOF, GERMANY
2 | FALKO RIDDERBUSCH | ADAM-KRAFT-STRASSE 57 90419 NUERNBERG, GERMANY
3 | ANDREAS WALTER | BIRKENGRABEN 14A 96052 BAMBERG, GERMANY
4 | OLIVER MOSER | TENNENLOHERSTRASSE 32A 91058 ERLANGEN, GERMANY
5 | CHRISTIAN UHLE | STINZINGSTRASSE 29 91056 ERLANGEN, GERMANY
6 | STEFAN GEYERSBERGER | OTTO-ROTH-STRASSE 90 97076 WUERZBURG, GERMANY
PCT International Classification Number: H04S 5/00
PCT International Application Number: PCT/EP2008/002385
PCT International Filing date: 2008-03-26
PCT Conventions:
# | PCT Application Number | Date of Convention | Priority Country
1 | 60/975,340 | 2007-09-26 | U.S.A.