AVT                                                           T. Schierl
Internet-Draft                                            Fraunhofer HHI
Intended status: Informational                                 J. Lennox
Expires: April 30, 2009                                            Vidyo
                                                        October 27, 2008

Multi-Session and Multi-Source Transmission in the Real-Time Transport Protocol (RTP)
draft-schierl-avt-rtp-multi-session-transmission-00

Status of this Memo

   By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on April 30, 2009.

Abstract

   In this draft, we discuss problems related to multi-session and multi-source transmission using the Real-Time Transport Protocol (RTP). Most of the input to this draft is taken from email discussion. Multi-session and multi-source transmission is motivated by media data which allows for different transport layer treatment of parts of the media. This is typically the case for layered media. Multi-session transmission is when media data from a single media source is split over multiple RTP sessions.
   Single-session multi-source transmission (from now on just called "multi-source transmission") is when data from a single media source is sent as several RTP streams in the same RTP session. The main problems discussed are the mechanisms used for data alignment and source correlation. This draft further gives an overview of payload formats using multi-session and multi-source transmission and highlights other transport-related issues. The draft concludes with recommendations for the discussed problems.

Schierl & Lennox         Expires April 30, 2009                 [Page 1]

Internet-Draft       RTP Multi-Session Transmission         October 2008

Table of Contents

   1.  Introduction
   2.  Definitions
   3.  Terminology
   4.  Existing Users of Multi-Session and Multi-Source Transmission
     4.1.  Progressive Video with Hybrid (PVH)
     4.2.  H.264 Scalable Video Coding (SVC)
     4.3.  H.264 Multi-View Coding (MVC)
     4.4.  G.718: Embedded Variable Bit-Rate (EV-VBR) Speech/Audio Codec
     4.5.  MPEG Surround
     4.6.  RTP Forward Error Correction
     4.7.  RTP Retransmission
   5.  Topology Overview
   6.  Requirements for multi-session transmission
     6.1.  Requirements on Data Alignment
     6.2.  Requirements on Source Correlation
   7.  Review of techniques for Data Alignment
     7.1.  NTP Timestamp Alignment using RTCP Sender Report (SR) Packets
       7.1.1.  Identified problems
     7.2.  Review of other potential techniques for Data Alignment
       7.2.1.  RTP Timestamp Alignment
       7.2.2.  Initial RTP Timestamp or RTP Timestamp Offset Signaling
       7.2.3.  CCM message - need NTP update
       7.2.4.  Multiple early RTCP SRs
       7.2.5.  Codec-Specific Mechanisms
       7.2.6.  RTP header extension
   8.  Review of techniques for Source Correlation
     8.1.  Source Correlation using CNAME in SDES
     8.2.  Review of other potential techniques for Source Correlation
       8.2.1.  Single SSRC Space
       8.2.2.  SSRC Groups
       8.2.3.  CNAME in Source Attributes
       8.2.4.  Application-specific Inference of Association
   9.  Summary of RTP solution for Data Alignment and Source Correlation
     9.1.  Data Alignment in RTP
     9.2.  Source Correlation in RTP
     9.3.  Dependency signaling
   10. Recommendations
   11. Other transport related issues for multi-session transmission
     11.1.  Inter-session Jitter
     11.2.  Inter-session Interleaving
   12. Security Considerations
   13. IANA Considerations
   14. References
     14.1.  Normative References
     14.2.  Informative References
   Appendix A.  Acknowledgements
   Authors' Addresses
   Intellectual Property and Copyright Statements

1.  Introduction

   Multi-session transmission is when media data from a single media source is split over multiple Real-Time Transport Protocol (RTP) [RFC3550] sessions. This is usually done because different transport layer treatment is desired for different aspects of the media source, e.g., different multicast groups or different traffic classes. If the traffic is being sent using multicast routing, this is often known as "layered multicast."

   Single-session multi-source transmission (from now on just called "multi-source transmission") is when data from a single media source is sent as several RTP streams in the same RTP session. In this case, the streams need to be treated differently by RTP (e.g. with separate RTCP statistics, or selective forwarding by RTP translators) but do not need different transport characteristics. This is often referred to as "SSRC multiplexing", after the synchronization source identifier (SSRC) which distinguishes sources in an RTP session.

   Such techniques are often used for "layered" or "embedded" codecs (the former term is typically used for video, the latter for audio). A lower-bitrate, and often lower-complexity, stream (known as the "base"), often backward-compatible with older codecs, provides basic media quality, while one or more additional streams (known as "enhancements") provide richer media or otherwise provide an enhanced user experience. Various layered and embedded codecs are discussed in Section 4.

   Multi-session and multi-source transmission are also used for stream robustness.
   Both RTP Forward Error Correction [RFC5109] and RTP Retransmission [RFC4588] use multi-session transmission, and the latter can optionally use multi-source transmission as well.

   For both multi-session and multi-source transmission, two issues arise: how streams are correlated, i.e. how receivers determine which base and enhancement streams carry data for the same media source; and how streams are aligned, i.e. how receivers determine which packets of the base stream are associated with which packets of the enhancement stream.

2.  Definitions

   multi-session transmission:  In multi-session transmission, media data from a single media source is split over multiple RTP sessions. The term "layered multicast" is equivalent to multi-session transmission for sessions using multicast addresses.

   multi-source transmission:  In multi-source transmission, data from a single media source is sent as several RTP streams in the same RTP session. The sources contained in an RTP session are identified by their synchronization source identifiers (SSRCs) or, if combined by an RTP mixer, by their contributing source identifiers (CSRCs), as defined in RTP [RFC3550].

   associated multimedia streams:  Associated multimedia streams are independent media sources from the same session participant, e.g. audio and video sources, or multiple cameras from a single participant. Each source can have an independent media clock, reflecting the device that captured the media. For live media, these clocks will often drift relative to each other, over and above their often inherently different clock rates. In RTP, each stream has separate initial RTP timestamps and sequence numbers. Related sources are associated using the RTCP Canonical Name (CNAME) Source Description (SDES) field. A common time base may be computed using NTP timestamps, based on information carried in RTCP Sender Report (SR) packets.
   The sources are typically synchronized ("lip-synced") by receivers when rendered, based on the computed NTP timestamps.

   Data Alignment:  Assembling data of the same media frame which is transferred in different sessions or as different sources in the same session as part of a layered media. The assembly of the media frame must be completed before decoding; otherwise the decoding process typically fails or may only be possible at reduced quality.

   Source Correlation:  The logical association of RTP streams transferred as multiple separate sessions or as multiple sources in the same session with one layered media.

3.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

4.  Existing Users of Multi-Session and Multi-Source Transmission

4.1.  Progressive Video with Hybrid (PVH)

   Progressive Video with Hybrid transform (PVH) [McCa96] was used in the initial demonstration of multi-session transmission. PVH was the initial driver for adding text on layered multicast to the Real-Time Transport Protocol (RTP) [RFC3550]. Data Alignment was done using packets' RTP timestamps.

4.2.  H.264 Scalable Video Coding (SVC)

   H.264 Scalable Video Coding (SVC) [I-D.ietf-avt-rtp-svc] extends the H.264 [RFC3984] video standard to provide spatial, temporal, and quality (signal-to-noise) enhancements. The base layer of SVC is backward-compatible with existing H.264 decoders. A base layer sent separately using the H.264 [RFC3984] payload format can be received and processed by existing devices.

   The Payload Format for SVC uses the multi-session transmission approach.
   Currently two basic methods are defined in the SVC Payload Format for decoding order recovery of media data received from multiple sessions:

   Data Alignment based on NTP timestamps:  This method is used in the NI-T and NI-TC modes defined in [I-D.ietf-avt-rtp-svc]. These modes currently rely on exact NTP timestamp alignment in order to recover the decoding order.

   Cross-Session Decoding Order Number (CS-DON):  This method is used in the NI-C, NI-TC and I-C modes defined in [I-D.ietf-avt-rtp-svc]. These modes rely on a number (CS-DON) which is associated with packets and indicates the decoding order across sessions.

4.3.  H.264 Multi-View Coding (MVC)

   H.264 Multi-View Coding (MVC) [I-D.wang-avt-rtp-mvc] extends the H.264 [RFC3984] video standard to provide multiple views of a video stream, for multi-view and 3D applications. Like SVC, MVC is an extension of H.264 and has a backward-compatible base view, which can also be decoded by existing H.264 receivers. Thus it is possible to provide the base view of a multi-session transmission in a compatible way using the H.264 [RFC3984] payload format. Since the new coding approach is mainly based on exploiting temporal references to other frames of the same view or different views, it is not always necessary to receive the base view in order to decode a desired view.

   The payload format will rely on the same approaches as defined in the RTP Payload Format for SVC video [I-D.ietf-avt-rtp-svc] for decoding order recovery when receiving data from multiple sessions.

4.4.  G.718: Embedded Variable Bit-Rate (EV-VBR) Speech/Audio Codec

   G.718, the Embedded Variable Bit-Rate (EV-VBR) speech/audio codec [I-D.lakaniemi-avt-rtp-evbr] provides an embedded speech-rate encoder. This codec also allows for multi-session transmission. The current draft mandates RTCP SR for Data Alignment in multi-session transmission.

4.5.  MPEG Surround

   MPEG Surround (Spatial Audio Coding, SAC) [I-D.ietf-avt-rtp-mps] enhances MPEG two-channel audio with multi-channel surround sound while maintaining backward compatibility with two-channel receivers. The payload relies on NTP timestamp alignment for multi-session transmission. The audio codec typically has different sampling rates for base and enhancements.

4.6.  RTP Forward Error Correction

   RTP Generic Forward Error Correction [RFC5109] allows a supplemental stream to provide additional data for recovery from packet loss. The repair (FEC) stream is typically sent in a separate RTP session. A special case is when the FEC stream is being sent as a secondary codec in the redundant encoding format. In this case the FEC stream is sent as a separate source in the same session as the redundant codec. Data Alignment is achieved using the sequence numbers of the FEC-protected packets.

   FEC Grouping Issues in Session Description Protocol [I-D.begen-mmusic-fec-grouping-issues] describes a grouping framework for FEC and media streams based on the Grouping of Media Lines in the Session Description Protocol (SDP) [RFC3388] framework. The framework relies on transmitting the FEC streams in separate sessions. Data Alignment is achieved by the FEC Framework and relies on the FEC scheme in use, i.e. there is a scheme-specific solution for associating data of the protected and the protecting packet stream.

4.7.  RTP Retransmission

   RTP Retransmission [RFC4588] allows senders to retransmit RTP packets indicated by the receiver as lost. The re-sent packets are transported in a separate stream, which may be transmitted within a separate RTP session or as a separate source in the same session as the media stream.

   If multi-source (i.e., single-session) transmission is being used, retransmitted packets are sent with a different SSRC.
   Source association is in this case done by the sources' CNAMEs, with the further requirement that a receiver MUST NOT have two outstanding requests for the same packet sequence number in two different original streams before the association is resolved.

5.  Topology Overview

   A number of different RTP Topologies [RFC5117] are relevant for consideration for multi-source and multi-session transmission.

   [Ed. TBD: more text on the relation between the approaches presented in the memo and the mentioned topologies.]

   o  Point-to-point - Two endpoints communicating using unicast.

   o  Point-to-multipoint via multicast - Using a multicast transport mechanism to send packets of one participant to all the other participants in the multicast group.

   o  Point-to-multipoint via RTP translator - Using [RFC3550] translators to send packets of one participant to other participants of a group. Packets of one or more participants may be forwarded to the group.

   o  Point-to-multipoint via RTP mixer - Using [RFC3550] mixers to send packets of one participant to other participants of a group. Packets of one or more participants may be forwarded to the group.

   o  Point-to-multipoint via Video Switching MCUs - Allows for sending packets from one participant to the other participants in a group, but typically only one participant's video data is forwarded at a time to the other participants.

   o  Point-to-multipoint via RTCP-terminating MCUs - Each participant is running a point-to-point session with the MCU. Typically, only one participant's video data is forwarded at a time to the other participants.

   o  Point-to-multipoint without a feedback channel - These channels typically provide IP multicast over a broadcast transmission medium, which naturally does not provide a bi-directional channel. This is the case, e.g.
      for DVB channels using IP over MPE over MPEG-2 Transport Stream, as in DVB-H or the emerging DVB-SH.

6.  Requirements for multi-session transmission

6.1.  Requirements on Data Alignment

   Synchronization of media streams received from multiple sessions is typically used for lip-synchronization of audio and video data. For this case, RTP provides a strong tool, which is the presence of (RTP) timestamps for each media frame, generated from individual clocks for each session. Additionally, RTCP Sender Report packets are sent periodically in each session containing (NTP) timestamps from a wallclock common across all of the sessions, plus a reference to the corresponding (RTP) timestamp that would be generated for a media frame with the signaled wallclock time. The interval between transmission of RTCP SRs is typically in the range of multiple seconds. For a more detailed review of RTP synchronization techniques, see Section 7.1.

   For the reception of layered media, either on multiple sessions or as multiple sources, it is absolutely essential to allow for immediate Data Alignment. That is, the Data Alignment must be applied before the decoding process of the layered media. If Data Alignment is not applied before decoding, the decoder may not be able to decode the media at all, or may only be able to produce a media representation at reduced quality.

6.2.  Requirements on Source Correlation

   For the reception of layered media, whether on multiple sessions or as multiple sources, it is absolutely essential to find out prior to decoding which sessions and sources are correlated. That is, the receiver needs to know, prior to Data Alignment and decoding, the inter-session and the inter-source dependency.
   Notably, for cases in which multiple independent media sources are transmitted as layered media in the same session or set of sessions, miscorrelation of sources could lead to a decoder attempting to use one source's base layer with another source's enhancement layer.

7.  Review of techniques for Data Alignment

7.1.  NTP Timestamp Alignment using RTCP Sender Report (SR) Packets

   The inter-media synchronization mechanism defined in [RFC3550] uses RTP timestamps in the RTP packets and a combination of RTP timestamp and NTP wallclock carried in the RTCP Sender Report (SR) packets. The RTCP SR packet contains an RTP timestamp in the media timescale and, as a reference to absolute wallclock time, the corresponding NTP timestamp.

   The definitions for timestamp generation and synchronization in Sections 5.1 and 6.4.1 of [RFC3550] are summarized in the following list:

   o  The timestamp reflects the sampling instant of the first octet in the RTP data packet.

   o  The sampling instant MUST be derived from a clock that increments monotonically and linearly in time to allow synchronization and jitter calculations (see Section 6.4.1 of [RFC3550]).

   o  The resolution of the clock MUST be sufficient for the desired synchronization accuracy and for measuring packet arrival jitter (one tick per video frame is typically not sufficient).

   o  If RTP packets are generated periodically, the nominal sampling instant as determined from the sampling clock is to be used, not a reading of the system clock.

   o  RTP timestamps from different media streams may advance at different rates and usually have independent, random offsets. Therefore, although these timestamps are sufficient to reconstruct the timing of a single stream, directly comparing RTP timestamps from different media is not effective for synchronization.
      Instead, for each medium the RTP timestamp is related to the sampling instant by pairing it with a timestamp from a reference clock (wallclock) that represents the time when the data corresponding to the RTP timestamp was sampled.

   o  Receivers should expect that the measurement accuracy of the timestamp may be limited to far less than the resolution of the NTP timestamp.

   o  On a system that has no notion of wallclock time but does have some system-specific clock such as "system uptime", a sender MAY use that clock as a reference to calculate relative NTP timestamps.

   o  It is important to choose a commonly used clock so that if separate implementations are used to produce the individual streams of a multimedia session, all implementations will use the same clock.

   o  [Ed.: The RTP timestamp in the SR] corresponds to the same time as the NTP timestamp (above), but in the same units and with the same random offset as the RTP timestamps in data packets.

   o  This correspondence may be used for intra- and inter-media synchronization for sources whose NTP timestamps are synchronized, and may be used by media-independent receivers to estimate the nominal RTP clock frequency.

   o  [Ed.: The RTP timestamp in the SR] MUST be calculated from the corresponding NTP timestamp using the relationship between the RTP timestamp counter and real time as maintained by periodically checking the wallclock time at a sampling instant.

   To summarize the definitions in [RFC3550]: the RTCP SR is used for deriving the media timestamp from the RTP timestamp and the NTP wallclock. If this synchronization mechanism is correctly implemented and there is no clock jitter in either the media clock or the NTP/system clock, so that an RTP timestamp and its NTP wallclock timestamp are always guaranteed to be perfectly aligned, the RTP approach should work fine for Data Alignment.

   [Ed.: need more text for summary / review of text above]

7.1.1.  Identified problems

7.1.1.1.  Synchronization Delay

   Since [RFC3550] specifies that RTCP SRs are sent at intervals of multiple seconds, Data Alignment based on this information may be delayed, which may lead to delayed tune-in for the decoding process. This is typically not the case for decoding media transferred in exactly one session and source, since synchronization is not required for decoding, but only for playout. A delay for playout or lip synchronization does not usually pose a fundamental problem.

7.1.1.2.  Losing synchronization information

   The loss of RTCP SR packets may introduce additional delay to the Data Alignment process; thus a more robust mechanism would be desirable.

7.1.1.3.  Clock Skew

   Clock skew between the NTP/system clock and the media clock will affect the NTP media timestamp generation derived from RTCP SRs and RTP timestamps. That typically results in different NTP timestamps for packets of the same media frame transmitted in the different sessions or transferred as different sources, and leads to misalignment in the Data Alignment process. As far as we know, there is no way to always guarantee the presence of perfect clocks for media and NTP/system clock. From the standardization point of view this may seem to be an implementation issue. However, if this implementation issue puts a burden on the senders, such as requiring perfect clocks for generating timestamps, it needs to be solved in an easy and general way.

   Following the RTP philosophy, clock skew can be estimated by observing several RTCP SRs. The receiver may use the observation to compensate for the clock skew. However, this is only possible if there is no requirement for immediate synchronization of the sort which is essential for Data Alignment of layered codecs.

   Clock skew between media and NTP/system clocks may be overcome by using the same clock instance, e.g.
   the system clock, for RTP as well as NTP timestamp generation. However, this is not compliant with RTP, since [RFC3550] mandates the use of a media clock which is different from the system clock (see definitions in RTP as cited above in Section 7.1). Indeed, for many codecs, notably audio, correct decoding requires that the timestamp difference between subsequent frames exactly correspond to the amount of data sent in each frame.

7.1.1.4.  Accuracy of clocks

   Assuming that we have clocks without skew, there is still the question of accuracy of the clock used for generating the timestamps. Notably, the Windows system clock is only updated on each system clock tick, typically every 10 or 15 milliseconds on Windows XP and Vista. RTP says that a receiver should not make any assumption on this, but an implementation which may have to cope with rounding done in the low-order microsecond cannot simply compare two NTP timestamps for equality. An application may have to compare "ranges" of timestamps in order to get rid of rounding problems. However, in some cases the ranges of NTP timestamps required may indeed be greater than the time interval between consecutive media frames.

7.1.1.5.  Existing RTCP SR implementations

   As far as we know, existing RTCP SR implementations show a wide range of alignment problems when generating exact NTP media timestamps for Data Alignment.

   NTP alignment issues can be modeled for existing RTCP senders by capturing the NTP and RTP timestamps in consecutive SR packets, projecting the NTP timestamp in one SR packet based on the RTP timestamp in that SR packet, the NTP and RTP timestamps in the previous SR packet, and the codec's nominal clock rate, and comparing the projected value with the NTP timestamp actually signaled. Initial experiments have shown NTP timestamp alignment problems on the order of 40-50 milliseconds for several implementations.

7.2.  Review of other potential techniques for Data Alignment

7.2.1.  RTP Timestamp Alignment

   The idea here is to signal the same RTP timestamp for packets containing data of the same media time instance in the different sessions. That is, the same clock and the same RTP random offset would have to be used for the multiple sessions. This method is backward compatible with using NTP timestamps for inter-media synchronization as well as for jitter calculation. Furthermore, to our knowledge this is the only alternative that has been used for layered transmission of media (see Section 4.1).

7.2.1.1.  Identified problems

   Using the same RTP timestamp random offset may lead to weak initialization vectors for the encryption method defined in [RFC3550] if keys are shared across the sessions or streams.

   Additionally, it may be unnatural for some codecs to use the same clock rate for the multiple sessions, for example an audio wideband enhancement layer enhancing a narrow-band base layer.

7.2.2.  Initial RTP Timestamp or RTP Timestamp Offset Signaling

   The idea here is to signal the initial RTP timestamp or the initial offsets as a media-level or source-level attribute in SDP associated with each stream. This could be done, e.g., using [I-D.ietf-mmusic-sdp-source-attributes].

7.2.2.1.  Identified problems

   This may have an implication for implementations, since packet-stream-related information, such as the initial RTP timestamp or the offset between RTP timestamps, must be known when offering a session. This may be a problem for sessions where multiple senders are present: it may not always be possible for an SDP creator to include all initial offsets or timestamps for all participants in sessions with multiple sending parties.

7.2.3.  CCM message - need NTP update

   In this case, a receiver would request immediate synchronization information.
   This method may reduce the initial delay, but only works for topologies with bi-directional channels.

7.2.3.1.  Identified problems

   This method is only feasible for topologies with bidirectional and reasonably rapid communication channels, i.e. unicast or small-group multicast. This method also assumes that the NTP timestamp alignment always works.

7.2.4.  Multiple early RTCP SRs

   In this case, the sender would generate more RTCP SRs than typically required and send them at an early point in the session. This method also works for topologies with uni-directional communication channels.

7.2.4.1.  Identified problems

   This method may exceed the RTCP bandwidth. Increasing the RTCP sender bandwidth may be achieved using SDP bandwidth parameters. This method may require an adjustment of the RTCP bandwidth of the session depending on the number of participants and senders. Further, this approach does not solve the problem for receivers tuning in to the session after it begins ("random entry"). This method also assumes that the NTP timestamp alignment always works.

7.2.5.  Codec-Specific Mechanisms

   This mechanism exploits signaling contained within the payload's data sections in order to allow the Data Alignment. An example is the Cross-Session Decoding Order Number (CS-DON) as defined in [I-D.ietf-avt-rtp-svc] or as proposed in [I-D.hannuksela-avt-rtp-svc], where a timestamp or a timestamp delta of the RTP packet to be aligned is carried by payload-specific means.

7.2.5.1.  Identified problems

   A payload-independent solution for the basic functionality of Data Alignment is desirable.

7.2.6.  RTP header extension

   The RTP header extension may be used to add generic signaling about Data Alignment to RTP packets.

7.2.6.1.  Identified problems

   RTP header extensions are required to be ancillary information which can safely be discarded by receivers which do not understand them.
   Data alignment mechanisms do not satisfy this requirement.

8.  Review of techniques for Source Correlation

8.1.  Source Correlation using CNAME in SDES

   In RTP, associated multimedia streams (e.g., audio and video sources from a single participant) have different SSRCs, and are associated using SDES CNAME fields. While in principle the same technique can be used to associate streams for multi-session or multi-source transmission, several issues arise.

   Startup latency: while slow lipsync convergence of multimedia streams is often tolerable, layered sources have to be associated from the start in order to be decodable, particularly for codec types such as video with inter-frame decoding dependencies.

   If multiple sources are sent from the same participant on the same session or family of sessions, e.g. multiple video cameras, they will have the same CNAME, because they are synchronized with each other and with any other sources for the session. This makes it impossible to definitively associate base and enhancement sources, as there may be more than one of each with the same CNAME.

   This potential for confusion is the reason for RTP retransmission's restriction on multiple outstanding RTP NACKs before stream association has completed, as described in Section 4.7.

8.2.  Review of other potential techniques for Source Correlation

8.2.1.  Single SSRC Space

   Motivated by the problems with CNAME association, RTP [RFC3550] specifies instead a single SSRC space for layered multicast (multiple-session transmission). Furthermore, as described in Section 9.2, it specifies that SSRC collision detection is performed only in the base layer.

   Applying SSRC collision detection in just the base layer in case of using multi-session transmission seems to work for current codec implementations.
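
   As an illustration only, receiver-side correlation under a single SSRC space can be sketched as follows. The packet record, the session labels ("base", "enh1") and both function names are assumptions made for this sketch; they are not defined by [RFC3550] or by any payload format discussed in this memo.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set


class RtpPacket(NamedTuple):
    session: str  # hypothetical session label, e.g. "base" or "enh1"
    ssrc: int     # synchronization source identifier from the RTP header


def correlate_by_ssrc(packets: Iterable[RtpPacket]) -> Dict[int, Set[str]]:
    """Group sessions by SSRC.

    Under a single SSRC space, streams that carry the same SSRC in
    different sessions are taken to belong to the same layered source.
    """
    sessions_by_ssrc: Dict[int, Set[str]] = defaultdict(set)
    for pkt in packets:
        sessions_by_ssrc[pkt.ssrc].add(pkt.session)
    return dict(sessions_by_ssrc)


def sources_with_base(sessions_by_ssrc: Dict[int, Set[str]],
                      base_session: str = "base") -> List[int]:
    """Return the SSRCs for which the base-layer session has been seen."""
    return sorted(ssrc for ssrc, sessions in sessions_by_ssrc.items()
                  if base_session in sessions)
```

   In this sketch a receiver that joined only an enhancement session would find no base-layer entry for a source, which mirrors the open question for MVC discussed next.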
By definition, one of the multiple views possible in MVC media (Section 4.3) is the base view, and this view is backward compatible with H.264. Decoding a view other than the base view may not require the presence of the base view. Although MVC is by its nature a layered codec, it may not always be reasonable to require the reception of the base layer for collision detection, even when it is not required for decoding. Currently, we do not see major relevance for the MVC codec format, due to its lack of coding efficiency; thus we tend not to take MVC as the killer application for new Source Correlation functionality. Without taking MVC into account, the current solution of using the base layer for SSRC collision detection still seems appropriate.

If needed, collision detection could instead be performed across all, or a subset of, the sessions used for multi-session transmission. However, it is not entirely clear how this would work for senders or receivers that are only participating in a subset of the sessions, and this would require further study.

8.2.2. SSRC Groups

The Internet-Draft [I-D.ietf-mmusic-sdp-source-attributes] specifies a mechanism by which related sources can be described as grouped in SDP. For multi-source (single-session) transmission, this can provide an alternative way to establish source association.

Clearly, this will only be effective in topologies and signaling architectures in which the SDP author can know about every source in the session that will be used for multi-source transmission, and in which the SDP can be updated when new sources are added or SSRC collisions occur.

8.2.3. CNAME in Source Attributes

The draft [I-D.ietf-mmusic-sdp-source-attributes] also provides a mechanism for sources' SSRCs to be associated with their CNAMEs in SDP.
This can eliminate the startup latency of stream association for the mechanism described in Section 8.1, though it does not solve the problem of multiple sources sharing a CNAME in a session. It also has the same architectural limitations as Section 8.2.2 in terms of using SDP.

8.2.4. Application-specific Inference of Association

As described in Section 4.7, it is in some cases possible to use mechanisms specific to a particular codec or mechanism to determine stream associations. For retransmission, for instance, a NACK of the packet with sequence number N on SSRC A, followed by a retransmission of the packet with sequence number N on SSRC B, indicates that SSRC B is the retransmission stream for SSRC A. Such techniques are mechanism-specific and cannot easily be generalized.

9. Summary of RTP solution for Data Alignment and Source Correlation

9.1. Data Alignment in RTP

The text on layered multicast in [RFC3550] does not discuss Data Alignment among the media data carried in the different RTP sessions. We assume that the intention of the RTP specification was to use NTP timestamp alignment. However, Vic, the demonstration code for layered multicast using PVH, used RTP timestamp alignment for this purpose.

9.2. Source Correlation in RTP

The text in Section 8.3 of [RFC3550] recommends (at the SHOULD level) that a single SSRC be used across multiple sessions containing data of the same layered media source. Further, the text calls for the detection of SSRC collisions, using the CNAME item in SDES packets, to be carried out in the base layer:

   For layered encodings transmitted on separate RTP sessions (see Section 2.4), a single SSRC identifier space SHOULD be used across the sessions of all layers and the core (base) layer SHOULD be used for SSRC identifier allocation and collision resolution. When a source discovers that it has collided, it transmits an RTCP BYE packet on only the base layer but changes the SSRC identifier to the new value in all layers. ...
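
As a rough illustration of the rule quoted from [RFC3550], the following Python sketch models a layered sender that shares one SSRC across all of its sessions. It is a toy model under stated assumptions, not an RTP implementation: the class LayeredSender, its method on_collision, and the use of plain strings as session names are all hypothetical, and no packets are actually sent.

```python
import random


class LayeredSender:
    """Toy model of a layered source sharing one SSRC across RTP sessions."""

    def __init__(self, sessions, base="base", rng=None):
        # `sessions` is a list of hypothetical session names, one per layer.
        if base not in sessions:
            raise ValueError("base layer must be one of the sessions")
        self.sessions = list(sessions)
        self.base = base
        self.rng = rng or random.Random()
        self.ssrc = self._new_ssrc()
        self.sent = []  # log of (session, packet_type, ssrc) tuples

    def _new_ssrc(self, avoid=()):
        # SSRCs are random 32-bit identifiers; retry until the value is unused.
        while True:
            candidate = self.rng.getrandbits(32)
            if candidate not in avoid:
                return candidate

    def on_collision(self, known_ssrcs):
        """Handle an SSRC collision detected in the base layer."""
        old = self.ssrc
        # The RTCP BYE for the old identifier goes out on the base layer only.
        self.sent.append((self.base, "BYE", old))
        # The new identifier is then used in every layer.
        self.ssrc = self._new_ssrc(avoid=set(known_ssrcs) | {old})
        return {session: self.ssrc for session in self.sessions}
```

Calling on_collision on a sender created with the sessions "base", "enh1" and "enh2" logs a single BYE entry for the base session and returns the same new SSRC for all three sessions.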
9.3. Dependency signaling

For signaling the dependency of data transmitted using layered multicast, SDP [RFC4566] contains rudimentary support, in that it allows for signaling a range of transport addresses in a certain media description. By definition, a higher transport address identifies a higher layer in the one-dimensional hierarchy. A receiver needs only to decode data conveyed over a given transport address and all lower transport addresses to decode the corresponding Operation Point.

When the media data of one source is transmitted in multiple RTP sessions, the mechanism defined in "Signaling media decoding dependency in Session Description Protocol (SDP)" [I-D.ietf-mmusic-decoding-dependency] can also be used to indicate the relationship between the multiple sessions of the same media type. Currently, this mechanism is inherited by the new Payload Formats allowing multi-session transmission: [I-D.ietf-avt-rtp-svc], [I-D.wang-avt-rtp-mvc], [I-D.ietf-avt-rtp-mps], and [I-D.lakaniemi-avt-rtp-evbr]. By definition, the base layer is signaled as the RTP session which does not depend on any other session.

Since [RFC3550] mandates the correlation of one layered media with the same source, there is no mechanism to indicate dependencies of multiple sources.

10. Recommendations

For Data Alignment of media data from the same source, we recommend that the same RTP timestamp be used for packets of the same time instance, as defined in [I-D.lennox-avt-rtp-layered-encoding-timestamps]. This method comes for free and can be implemented in a backward compatible way, since NTP timing for synchronizing different types of media is not affected. It further requires the same timescale to be used in all sessions of a multi-session or multi-source transmission, which is the case anyway if the layered media is identified as a unique source.
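
A minimal receiver-side sketch of this recommendation, in Python, is given below. It assumes that packets from all sessions have already been de-jittered and collected, that a lower layer index denotes a lower layer, and that neither RTP timestamps nor sequence numbers wrap around. The names Packet and align_by_timestamp are hypothetical. Note also that sorting by timestamp yields presentation order, which matches decoding order only for streams without frame reordering.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    layer: int        # 0 = base layer session, higher = enhancement sessions
    seq: int          # RTP sequence number within its own session
    timestamp: int    # RTP timestamp, shared by all layers of a time instance
    payload: bytes


def align_by_timestamp(packets):
    """Group packets from all sessions into time instances by RTP timestamp."""
    groups = defaultdict(list)
    for pkt in packets:
        groups[pkt.timestamp].append(pkt)
    aligned = []
    # Sketch only: timestamps are assumed not to wrap around.
    for ts in sorted(groups):
        # Within one time instance, lower layers come first, then by sequence.
        ordered = sorted(groups[ts], key=lambda p: (p.layer, p.seq))
        aligned.append((ts, ordered))
    return aligned
```

Because all layers carry the same RTP timestamp for a given time instance, the grouping needs no RTCP Sender Report and therefore works from the first received packets.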
Mandating the same timescale for each of the sessions in a multi-session transmission may need to be discussed with respect to the audio codec described in Section 4.5.

For Source Correlation, we suggest keeping the mechanism defined in [RFC3550], i.e. all layers of a layered media source have the same SSRC and the base layer is used for SSRC collision detection. Further, it may be useful to have a signaling mechanism which indicates the RTP session to be used for SSRC collision detection.

11. Other transport related issues for multi-session transmission

11.1. Inter-session Jitter

The transport of media of the same source in different sessions may introduce different jitter behaviors in the different sessions. We call this issue inter-session jitter. Inter-session jitter may be caused by sessions taking different network paths or by any other packet reordering within the network outside the control of the user.

RTP implementations typically use buffers for de-jittering each of the sessions separately. In a simple A/V transmission scenario, de-jittering the audio and the video input queues separately is not problematic, since the synchronization is achieved after the decoder during playout. With multi-session transmission, de-jittering and synchronization (Data Alignment) are required before decoding, rather than after decoding at playout time. Moreover, Data Alignment via NTP timestamps must be exact to the microsecond; otherwise the synchronization fails. This is fundamentally different from synchronizing audio and video for lip-synchronized playout.

11.2. Inter-session Interleaving

Using multi-session transmission allows for data interleaving, while the data transmitted within one session can still be sent in decoding order. Inter-session interleaving may also be realizable using Data Alignment via timestamps.

12. Security Considerations

[Ed. TBD]

13. IANA Considerations

No action by IANA is required.

14. References

14.1. Normative References

[RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", STD 64, RFC 3550, July 2003.

14.2. Informative References

[I-D.begen-mmusic-fec-grouping-issues] Begen, A., "FEC Grouping Issues in Session Description Protocol", draft-begen-mmusic-fec-grouping-issues-00 (work in progress), February 2008.

[I-D.hannuksela-avt-rtp-svc] Hannuksela, M. and Y. Wang, "Session Multiplexing for SVC Video", draft-hannuksela-avt-rtp-svc-01 (work in progress), July 2008.

[I-D.ietf-avt-rtp-mps] Bont, F., Doehla, S., Schmidt, M., and R. Sperschneider, "RTP Payload Format for Elementary Streams with MPEG Surround multi-channel audio", draft-ietf-avt-rtp-mps-01 (work in progress), October 2008.

[I-D.ietf-avt-rtp-svc] Wenger, S., Wang, Y., Schierl, T., and A. Eleftheriadis, "RTP Payload Format for SVC Video", draft-ietf-avt-rtp-svc-14 (work in progress), September 2008.

[I-D.ietf-mmusic-decoding-dependency] Schierl, T. and S. Wenger, "Signaling media decoding dependency in Session Description Protocol (SDP)", draft-ietf-mmusic-decoding-dependency-04 (work in progress), October 2008.

[I-D.ietf-mmusic-sdp-source-attributes] Lennox, J., Ott, J., and T. Schierl, "Source-Specific Media Attributes in the Session Description Protocol (SDP)", draft-ietf-mmusic-sdp-source-attributes-01 (work in progress), February 2008.

[I-D.lakaniemi-avt-rtp-evbr] Lakaniemi, A. and Y. Wang, "RTP payload format for G.718 speech/audio", draft-lakaniemi-avt-rtp-evbr-04 (work in progress), October 2008.

[I-D.lennox-avt-rtp-layered-encoding-timestamps] Lennox, J., Schierl, T., and S. Ganesan, "Real-Time Transport Protocol (RTP) Timestamps for Layered Encodings", draft-lennox-avt-rtp-layered-encoding-timestamps-00 (work in progress), June 2008.

[I-D.wang-avt-rtp-mvc] Wang, Y. and T. Schierl, "RTP Payload Format for MVC Video", draft-wang-avt-rtp-mvc-02 (work in progress), August 2008.

[McCa96] McCanne, S., "Scalable Compression and Transmission of Internet Multicast Video", Ph.D. Dissertation, University of California Berkeley, Report No. UCB/CSD-96-928, December 1996.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC3388] Camarillo, G., Eriksson, G., Holler, J., and H. Schulzrinne, "Grouping of Media Lines in the Session Description Protocol (SDP)", RFC 3388, December 2002.

[RFC3984] Wenger, S., Hannuksela, M., Stockhammer, T., Westerlund, M., and D. Singer, "RTP Payload Format for H.264 Video", RFC 3984, February 2005.

[RFC4566] Handley, M., Jacobson, V., and C. Perkins, "SDP: Session Description Protocol", RFC 4566, July 2006.

[RFC4588] Rey, J., Leon, D., Miyazaki, A., Varsa, V., and R. Hakenberg, "RTP Retransmission Payload Format", RFC 4588, July 2006.

[RFC5109] Li, A., "RTP Payload Format for Generic Forward Error Correction", RFC 5109, December 2007.

[RFC5117] Westerlund, M. and S. Wenger, "RTP Topologies", RFC 5117, January 2008.

Appendix A. Acknowledgements

Funding for the RFC Editor function is provided by the IETF Administrative Support Activity (IASA). Further, the author Thomas Schierl of Fraunhofer HHI is sponsored by the European Commission under the contract number FP7-ICT-214063, project SEA. The authors want to thank Colin Perkins, Ye-Kui Wang, Randell Jesup, Ingemar Johansson, Gerard Babonneau, Alex Eleftheriadis, Stefan Doehla, and Roni Even for their valuable comments on the mailing list.
Authors' Addresses

Thomas Schierl
Fraunhofer HHI
Einsteinufer 37
D-10587 Berlin
Germany

Phone: +49-30-31002-227
Email: mail@thomas-schierl.de

Jonathan Lennox
Vidyo, Inc.
433 Hackensack Avenue
Sixth Floor
Hackensack, NJ 07601
US

Email: jonathan@vidyo.com

Full Copyright Statement

Copyright (C) The IETF Trust (2008).

This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.