A tool to insert audio into a specified audio (i.e. RTP) stream was created in
the August - September 2006 timeframe. The tool is named rtpinsertsound. It 
was tested on a Linux Red Hat Fedora Core 4 platform (Pentium IV, 2.5 GHz),
but it is expected this tool will successfully build and execute on a variety
of Linux distributions. The first distribution of the tool is: v1.1.

v2.0 is an upgrade produced in October 2006 to directly support the input
of certain wave (i.e. .wav) files into the tool as the source of audio
to insert, in addition to the input of audio in the form of a tcpdump
formatted file (i.e. G.711 RTP/UDP/IP/ETHERNET captures) which was
supported in v1.1.

    Copyright (c)  2006  Mark D. Collier/Mark O'Brien
    Permission is granted to copy, distribute and/or modify this document
    under the terms of the GNU Free Documentation License, Version 1.2
    or any later version published by the Free Software Foundation;
    with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
    A copy of the license is included in the section entitled "GNU
    Free Documentation License".
 
Authors:  Mark D. Collier/Mark O'Brien   10/10/2006  v2.0
          Mark D. Collier/Mark O'Brien   08/14/2006  v1.1
          www.securelogix.com - mark.collier@securelogix.com
          www.hackingexposedvoip.com

This tool was produced with honorable intentions, which are:

  o To aid owners of VoIP infrastructure to test, audit, and uncover security
    vulnerabilities in their deployments.

  o To aid 3rd parties to test, audit, and uncover security vulnerabilities
    in the VoIP infrastructure of owners of said infrastructure who contract
    with or otherwise expressly approve said 3rd parties to assess said
    VoIP infrastructure.

  o To aid producers of VoIP infrastructure to test, audit, and uncover security
    vulnerabilities in the VoIP hardware/software/systems they produce.

  o For use in collective educational endeavors or use by individuals for
    their own intellectual curiosity, amusement, or aggrandizement - absent
    nefarious intent.
   
Unlawful use of this tool is strictly prohibited.

The following open-source libraries of special note were used to build
rtpinsertsound:

1) libnet v1.1.2.1 (tool requires at least this version)
2) libpcap v0.9.4  (tool will probably work with some earlier versions)
3) hack_library [ e.g. utility routine - Str2IP( ) ]

    Note: The Makefile for rtpinsertsound presumes
          that hack_library.o and hack_library.h reside in
          a folder at ../hack_library relative to the Makefile
          within the rtpinsertsound directory.

4) a G.711 codec conversion library based upon open-source code
    from SUN published in the early 1990s and updated by
    Borge Lindberg on 12/30/1994.

    Note: The Makefile for the rtpinsertsound tool presumes
          that the original g711.c file has been renamed
          g711conversions.c and a g711conversions.h file
          has been added. The rtpinsertsound tool Makefile
          presumes the header and object for this library
          reside in a folder at ../g711conversions relative
          to the folder where rtpinsertsound is built. The
          following comment is extracted from the source for
          your information:

    /*
     * December 30, 1994:
     * Functions linear2alaw, linear2ulaw have been updated to correctly
     * convert unquantized 16 bit values.
     * Tables for direct u- to A-law and A- to u-law conversions have been
     * corrected.
     * Borge Lindberg, Center for PersonKommunikation, Aalborg University.
     * bli@cpk.auc.dk
     *
     */
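
For orientation only, the kind of conversion such a library performs
can be sketched as follows. This is the commonly published G.711
u-law encoding procedure, not code taken from g711conversions.c; the
function and macro names are hypothetical. The second routine shows
one assumed way to widen unsigned 8-bit wave samples to signed 16-bit
before encoding.

```c
#include <assert.h>
#include <stdint.h>

#define ULAW_BIAS 0x84   /* bias added before the segment search */
#define ULAW_CLIP 32635  /* largest magnitude that survives the bias */

/* Encode one signed 16-bit linear sample as an 8-bit u-law byte. */
uint8_t sketch_linear16_to_ulaw(int16_t pcm)
{
    int sign = (pcm < 0) ? 0x80 : 0x00;
    int magnitude = (pcm < 0) ? -(int)pcm : (int)pcm;
    int exponent = 7;
    int mask;

    if (magnitude > ULAW_CLIP)
        magnitude = ULAW_CLIP;
    magnitude += ULAW_BIAS;

    /* Locate the highest set bit to choose one of 8 segments. */
    for (mask = 0x4000; (magnitude & mask) == 0 && exponent > 0; mask >>= 1)
        exponent--;

    /* u-law bytes are stored bit-inverted. */
    return (uint8_t)~(sign | (exponent << 4) |
                      ((magnitude >> (exponent + 3)) & 0x0F));
}

/* Widen an unsigned 8-bit wave sample (midpoint 128) to signed 16-bit. */
int16_t sketch_unsigned8_to_linear16(uint8_t sample)
{
    assert(sample <= 255);
    return (int16_t)(((int)sample - 128) * 256);
}
```

Silence (a linear value of 0) encodes to 0xFF, which is a quick sanity
check when inspecting a converted payload.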

Install and build the libraries in accordance with their respective
instructions. Then change to the rtpinsertsound_v2.0 directory and simply
type: make

then:

[root@localhost rtpinsertsound_v2.0]# ./rtpinsertsound
 
Error: 6 command line parameters are mandatory
 
rtpinsertsound - Version 2.0
                 October 10, 2006
 Usage:
 Mandatory -
        interface (e.g. eth0)
        source RTP IPv4 addr
        source RTP port
        destination RTP IPv4 addr
        destination RTP port
        pathname of file whose audio is to be mixed into the
            targeted live audio stream. If the file extension is
            .wav, then the file must be a standard Microsoft
            RIFF formatted WAVE file meeting these constraints:
              1) header 'chunks' must be in one of two sequences:
                   RIFF, fmt, fact, data
                     or
                   RIFF, fmt, data
              2) Compression Code = 1 (PCM/Uncompressed)
              3) Number of Channels = 1 (mono)
              4) Sample Rate (Hz) = 8000
              5) Significant Bits/Sample =
                      signed,   linear 16-bit or
                      unsigned, linear  8-bit
            If the file name does not specify a .wav extension,
            then the file is presumed to be a tcpdump formatted
            file with a sequence of, exclusively, G.711 u-law
            RTP/UDP/IP/ETHERNET messages
            Note: Yep, the format is referred to as 'tcpdump'
                  even though this file must contain udp messages
 Optional -
        -f spoof factor - amount by which to:
             a) increment the RTP hdr sequence number obtained
                from the ith legitimate packet to produce the
                RTP hdr sequence number for the ith spoofed packet
             b) multiply the RTP payload length and add that
                product to the RTP hdr timestamp obtained from
                the ith legitimate packet to produce the RTP hdr
                timestamp for the ith spoofed packet
             c) increment the IP hdr ID number obtained from the
                ith legitimate packet to produce the IP hdr ID
                number for the ith spoofed packet
           [ range: +/- 1000, default: 2 ]
        -j jitter factor - the reception of a legitimate RTP
             packet in the target audio stream enables the output
             of the next spoofed packet. This factor determines
             when that spoofed packet is actually transmitted.
             The factor relates how close to the next legitimate
             packet you'd actually like the enabled spoofed packet
             to be transmitted. For example, -j 10 means 10% of
             the codec's transmission interval. If the transmission
             interval = 20,000 usec (i.e. G.711), then delay the
             output of the spoofed RTP packet until the time-of-day
             is within 2000 usec (i.e. 10%) of the time the next
             legitimate RTP packet is expected. In other words,
             delay 100% minus the jitter factor, or 18,000 usec
             in this example. The smaller the jitter factor, the
             greater the risk you run of not outputting the current
             spoofed packet before the next legitimate RTP packet
             is received. Therefore, a factor > 10 is advised.
           [ range: 0 - 80, default: 80 = output spoof ASAP ]
        -h help - print this usage
        -v verbose output mode
 
Note: If you are running the tool from a host with multiple
      ethernet interfaces which are up, be forewarned that
      the order those interfaces appear in your route table
      and the networks accessible from those interfaces might
      compel Linux to output spoofed audio packets to an
      interface different than the one stipulated by you on
      command line. This should not affect the tool unless
      those spoofed packets arrive back at the host through
      the interface you have specified on the command line
      (e.g. the interfaces have connectivity through a hub).
[root@EquinoxLX rtpinsertsound_v2.0]#

This tool does NOT presume it is running as Man-In-The-Middle
(MITM). It does, however, presume that target audio (i.e. RTP) packet
streams of interest can be received by the specified Ethernet interface
in promiscuous mode (e.g. the host running the tool is connected to
a hub through which target audio packet streams are flowing).
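
To illustrate how one direction of a stream might be singled out on a
promiscuous interface, the sketch below builds a libpcap filter
expression from the four mandatory stream parameters. The README does
not say how the tool filters packets, so this is an assumption about
one workable approach; the function name is hypothetical, and only
standard pcap filter primitives are used.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build a pcap filter expression for one direction of an RTP stream.
 * Returns 0 on success, -1 if a port is invalid or the buffer is too
 * small to hold the whole expression. */
int sketch_build_stream_filter(char *out, size_t out_len,
                               const char *src_ip, int src_port,
                               const char *dst_ip, int dst_port)
{
    int n;

    assert(out != NULL && src_ip != NULL && dst_ip != NULL);
    if (src_port < 0 || src_port > 65535 ||
        dst_port < 0 || dst_port > 65535)
        return -1;

    n = snprintf(out, out_len,
                 "udp and src host %s and src port %d "
                 "and dst host %s and dst port %d",
                 src_ip, src_port, dst_ip, dst_port);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

The resulting string could be handed to pcap_compile( ) and
pcap_setfilter( ), or pasted into tcpdump to confirm the stream is
visible on the chosen interface.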

The tool presently supports inserting audio into an audio stream
(i.e. RTP/UDP/IP/Ethernet) bearing G.711 u-law payloads only.
The RTP header of the target audio packets must be a standard
RFC 3550 12-byte RTP header. The tool does NOT automatically
detect and compensate for audio session modifications. The tool
does NOT presently support 802.1q (i.e. layer 2 VLAN/priority
tagging) within the 802.3 IEEE Ethernet header. The tool presumes
it is running on a little-endian platform.
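
The constraints above can be expressed as a frame check. The sketch
below is not the tool's code; it assumes an untagged Ethernet II
frame and uses hypothetical names. It rejects 802.1q-tagged frames,
RTP headers longer than the fixed 12 bytes (extension bit or CSRC
entries present), and payload types other than 0 (PCMU, i.e. G.711
u-law).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN 14  /* untagged Ethernet II header (no 802.1q) */
#define UDP_HDR_LEN  8
#define RTP_HDR_LEN 12  /* fixed RFC 3550 header, no CSRC list */
#define RTP_PT_PCMU  0  /* static payload type for G.711 u-law */

/* Return the offset of the RTP payload within the frame, or -1 if the
 * frame is not untagged IPv4/UDP carrying a plain 12-byte RTP header
 * with a PCMU payload type. */
long sketch_find_pcmu_payload(const uint8_t *frame, size_t len)
{
    size_t ip_off = ETH_HDR_LEN, ip_hdr_len, rtp_off;

    assert(frame != NULL);
    if (len < ETH_HDR_LEN + 20)
        return -1;
    /* EtherType must be IPv4 (0x0800); 0x8100 would be a VLAN tag. */
    if (frame[12] != 0x08 || frame[13] != 0x00)
        return -1;
    /* IP version 4 and protocol 17 (UDP). */
    if ((frame[ip_off] >> 4) != 4 || frame[ip_off + 9] != 17)
        return -1;
    ip_hdr_len = (size_t)(frame[ip_off] & 0x0F) * 4;
    if (ip_hdr_len < 20)
        return -1;
    rtp_off = ip_off + ip_hdr_len + UDP_HDR_LEN;
    if (len < rtp_off + RTP_HDR_LEN)
        return -1;
    if ((frame[rtp_off] >> 6) != 2)        /* RTP version 2 */
        return -1;
    if ((frame[rtp_off] & 0x1F) != 0)      /* extension bit or CSRC count */
        return -1;
    if ((frame[rtp_off + 1] & 0x7F) != RTP_PT_PCMU)
        return -1;
    return (long)(rtp_off + RTP_HDR_LEN);
}
```

For a frame with a 20-byte IP header the payload begins at offset 54
(14 + 20 + 8 + 12).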

Use Ethereal/Wireshark or some appropriate sniffer to determine the 
stream into which you'd like to insert an audio playback. You must
know the source IPv4 address, source UDP port, destination IPv4
address, and destination UDP port of the stream into which you'd like
to insert audio. This tool is unidirectional: audio is inserted only
into the one stream direction specified on the command line. If the
insertion of the audio is successful, the targeted destination will be
persuaded to accept the RTP packets inserted by this tool and reject the
legitimate audio packets that continue to stream from the legitimate
source to the target destination. In other words, audio from the
legitimate source will be muted for the duration of the playback.
Perhaps it is more
technically correct to state that the pre-recorded bogus audio being
played back by the tool is being "interlaced" into the target audio
stream. 

Playback is rather arbitrarily limited to 30 seconds. You may change
the source code if you require a longer playback interval. 

The sound (i.e. audio) to insert into an audio stream must be in
one of two forms as stipulated by the usage printout appearing
above.

If a wave file you'd like to input to the tool does not comply with
the constraints imposed by the tool, you will need to use an audio
conversion utility to massage the file into a form acceptable to the
tool. For example, many wave files on the Net are in this format:

Compression Code:        1
Channels:                1
Sample Rate (Hz):        11025
Avg. Bytes/sec:          11025
Block Align:             1
Significant Bits/sample: 8
 
A sample rate of 11025 is not presently supported by the tool.
The Linux sox command can be used to convert the file to
the required 8000 Hz sample rate. If the file is named
swclear.wav then:

sox -V swclear.wav -r 8000 swclear_resample.wav resample -ql

converts swclear.wav to swclear_resample.wav with the
following format:

Compression Code:        1
Channels:                1
Sample Rate (Hz):        8000
Avg. Bytes/sec:          8000
Block Align:             1
Significant Bits/sample: 8

The sox command can also be used to convert multi-channel
audio to mono, convert different compression codes to the
PCM/uncompressed format required by the tool, and convert
the number of significant bits/sample, among many other
conversions.

Unfortunately, sox does not support the conversion of wave files
from MPEG format to the format required by the tool. If you
attempted a sox command similar to the one above on an
MPEG Layer 3 formatted file, you would get this error:

sox: Failed reading khan.wav: Sorry, this WAV file is in MPEG Layer 3 format.

For tcpdump formatted input files, the file must be composed of
sequential RTP/UDP/IP/Ethernet messages, where the RTP payloads
are encoded using the G.711 u-law codec (i.e. PCMU). Our sound
files were produced using the Asterisk open-source IP PBX. Asterisk
"call files" were used to call a VoIP phone that was configured
with a preference to receive audio processed by the G.711 u-law
codec. The call file stipulated the sound file to play once the
call was connected. The Ethereal/Wireshark network analyzer tool
was used to capture the G.711 packets flowing from the Asterisk
IP PBX to the phone. These were saved into a standard tcpdump
file. There are, no doubt, many other mechanisms to produce such
a file.
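
Before handing a capture to the tool, it can be useful to confirm
that the file really is a classic tcpdump (pcap) file and to count
its packets. The sketch below walks the 24-byte file header and the
16-byte per-packet record headers of a capture already read into
memory. It assumes a file written on a little-endian host with
microsecond timestamps; the names are hypothetical and this is not
the tool's own loader.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCAP_GLOBAL_HDR_LEN 24
#define PCAP_RECORD_HDR_LEN 16
#define PCAP_MAGIC_USEC 0xa1b2c3d4u  /* classic microsecond magic number */

/* Read a 32-bit little-endian value regardless of host byte order. */
static uint32_t sketch_read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Count the packet records in an in-memory capture. Returns -1 on a
 * bad magic number or a truncated record. */
long sketch_count_pcap_records(const uint8_t *buf, size_t len)
{
    size_t off = PCAP_GLOBAL_HDR_LEN;
    long count = 0;

    assert(buf != NULL);
    if (len < PCAP_GLOBAL_HDR_LEN ||
        sketch_read_le32(buf) != PCAP_MAGIC_USEC)
        return -1;

    while (off < len) {
        uint32_t incl_len;

        if (len - off < PCAP_RECORD_HDR_LEN)
            return -1;
        /* Record header: ts_sec, ts_usec, incl_len, orig_len. */
        incl_len = sketch_read_le32(buf + off + 8);
        off += PCAP_RECORD_HDR_LEN;
        if (len - off < incl_len)
            return -1;
        off += incl_len;
        count++;
    }
    return count;
}
```

At one G.711 packet every 20 ms, a capture holds 50 packets for each
second of audio, so the record count gives a rough playback duration.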

Note: For operation of the open-source Asterisk IP PBX and
      an explanation of "call files", see: Asterisk: The Future of
      Telephony, by Jim Van Meggelen, Jared Smith, and Leif Madsen.
      Copyright 2005 O'Reilly Media, Inc., ISBN 0-596-00962-3.

      A softcopy of that book is available on-line as a legitimately
      free download.

A later version of the tool might be capable of inputting a
greater variety of audio file formats.
 
When the tool is executed, it first loads the pre-recorded audio into 
memory. Then it attempts to detect a packet from the audio stream
designated on the command line. The output of bogus audio interlaced
into the legitimate audio stream is close-looped with the reception of
legitimate audio packets.

The optional spoof factor value may be specified on the command line
(default: 2). As reported by the tool's usage printout, the
spoof factor is used to adjust key RTP header and IP header values
in an inserted audio packet relative to those values in the legitimate
audio packet triggering the insertion of that bogus audio packet.
Adjusting those key header values slightly higher (or lower) relative
to the last legitimate packet may persuade the target destination to
accept the inserted packets and reject the legitimate packets it
continues to receive.
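
The three adjustments listed under -f in the usage printout amount to
the following arithmetic. This is a sketch with hypothetical names,
not the tool's source. It assumes the header fields wrap modulo their
width (16 bits for the RTP sequence number and IP ID, 32 bits for the
RTP timestamp), which the README does not state explicitly.

```c
#include <assert.h>
#include <stdint.h>

/* Header values for a spoofed packet derived from its trigger packet. */
struct sketch_spoof_fields {
    uint16_t rtp_seq;
    uint32_t rtp_timestamp;
    uint16_t ip_id;
};

struct sketch_spoof_fields
sketch_apply_spoof_factor(uint16_t legit_seq, uint32_t legit_ts,
                          uint16_t legit_ip_id, uint16_t payload_len,
                          int factor)
{
    struct sketch_spoof_fields out;

    assert(factor >= -1000 && factor <= 1000);  /* range from usage text */

    /* a) sequence number: legitimate value plus the factor */
    out.rtp_seq = (uint16_t)(legit_seq + factor);
    /* b) timestamp: legitimate value plus factor * RTP payload length */
    out.rtp_timestamp =
        legit_ts + (uint32_t)((int32_t)factor * (int32_t)payload_len);
    /* c) IP ID: legitimate value plus the factor */
    out.ip_id = (uint16_t)(legit_ip_id + factor);
    return out;
}
```

With the default factor of 2 and a 160-byte payload, a trigger packet
with sequence number 100 and timestamp 16000 yields a spoofed packet
with sequence number 102 and timestamp 16320.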

The optional jitter factor value may be specified on the command line
(default: 80% = ASAP). The jitter factor determines exactly when
the next bogus audio packet is inserted relative to the received audio
packet triggering the output of the bogus packet. The default value
outputs the bogus packet ASAP. A value less than 80% requires the
bogus packet to be output closer to when the next legitimate packet
is expected. The factor is expressed as a percentage of the ideal
codec transmission interval, which is every 20 ms for G.711 u-law.
So, for G.711:

jitter factor       how close to the next legitimate packet
      %             the bogus packet is transmitted
-------------       -------------------------------------------
      80            close to 20 ms (i.e. ASAP - within a couple
                    of hundred usec after the trigger packet)

      70            14 ms (i.e.  6 ms after the trigger packet)
      60            12 ms (i.e.  8 ms after the trigger packet)
      50            10 ms (i.e. 10 ms after the trigger packet)
      40             8 ms (i.e. 12 ms after the trigger packet)
      30             6 ms (i.e. 14 ms after the trigger packet)
      20             4 ms (i.e. 16 ms after the trigger packet)
      10             2 ms (i.e. 18 ms after the trigger packet)
       5             1 ms (i.e. 19 ms after the trigger packet)

When a jitter factor other than 80 is specified, the execution priority
of the tool is increased to the maximum. You'll probably notice that other
applications and GUIs running on the same platform become less
responsive (e.g. Ethereal). Only one VoIP hard phone model has
been encountered by the authors thus far (out of 8) that requires a jitter
factor other than the default value. The timing is not as precise as
the table might indicate. A jitter factor too close to 0 usually
results in the tool failing, at some point in the playback, to output a
bogus packet before the next legitimate packet is received. The tool
detects that condition and halts with an appropriate error message.
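
The delay implied by a given jitter factor follows directly from the
description in the usage printout: wait 100% minus the jitter factor
of the codec's transmission interval, except that the default of 80
means no deliberate delay. The sketch below shows only that
arithmetic; the function name is hypothetical, and how the tool
actually waits (busy loop, timer, etc.) is not described here.

```c
#include <assert.h>

/* Return the intended delay, in microseconds, between receipt of the
 * trigger packet and transmission of the spoofed packet. */
long sketch_spoof_delay_usec(int jitter_factor, long interval_usec)
{
    assert(jitter_factor >= 0 && jitter_factor <= 80);  /* usage range */
    assert(interval_usec > 0);

    if (jitter_factor == 80)
        return 0;  /* default: output the spoofed packet ASAP */
    return interval_usec * (100 - jitter_factor) / 100;
}
```

For G.711 (interval of 20,000 usec), a jitter factor of 10 gives a
delay of 18,000 usec and a factor of 70 gives 6,000 usec, matching the
table above.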

Example:

./rtpinsertsound eth0 10.1.101.40 39120 10.1.101.60 64006 g711CaptureAlphabetRecitation -f 1 -j 10

In this example, the audio from the tcpdump file named
g711CaptureAlphabetRecitation within the rtpinsertsound folder is
inserted into the G.711 audio stream from the VoIP source at
10.1.101.40:39120 to the VoIP destination at 10.1.101.60:64006.
Each bogus audio packet is transmitted approximately 18 ms after
the prior legitimate audio packet is received by the tool. A spoof
factor of 1 is applied to the key RTP header and IP header values
of each bogus packet, relative to its legitimate trigger packet.

Alternatively, an appropriate wave file could be used:

./rtpinsertsound eth0 10.1.101.40 39120 10.1.101.60 64006 AlphabetRecitation.wav -f 1 -j 10

If the tool pauses for a noticeable interval when initially attempting
to sync to the audio stream, it very likely means one or more of the 
following conditions exist:

a) the stream is not present at the specified Ethernet interface
b) the audio stream does not exist (i.e. the call has ended or changed state)
c) the user has not entered the IPv4 addresses or UDP ports properly

Since the output of bogus audio is close-looped to the reception of the
target audio stream, the tool stalls if the target audio stream ends
or changes state during the playback.

A compilation directive determines whether the object code of the tool is
produced with Ethernet layer spoofing or whether IP layer spoofing is
sufficient. Testing to date has demonstrated that Ethernet layer spoofing
is NOT required. The tool executes faster when it is not required to spoof
at the Ethernet layer. Ethernet layer spoofing is not recommended.

Mark O'Brien (10/11/2006)