Internet Engineering Task Force Yunhong Gu Internet Draft Robert L. Grossman Document: draft-gg-udt-01.txt University of Illinois at Chicago Expires: February 2005 August 2004 UDT: A Transport Protocol for Data Intensive Applications Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC 2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Abstract This document proposes a new UDP based Data Transfer protocol, also known as UDT. UDT can be used in high bandwidth-delay product (BDP) networks to utilize the abundant network resource efficiently and fairly. It is expected to be used in distributed data intensive applications over high-speed wide area networks, such as scientific computing on computational grids. The current most widely used TCP version (TCP NewReno/SACK) does not work well in such environments due to its slow discovery and recovery to the available bandwidth as the network BDP increases. In addition, it exhibits "unfairness" when concurrent TCP flows have different round trip time (RTT). UDT is an application layer solution by introducing new congestion control and reliability control. The protocol if defined upon UDP. The congestion control combines both rate-based and window-based methods and uses bandwidth estimation techniques to dynamically configure the control parameters. Gu, Grossman Expires - February 2005 [Page 1] UDT August 2004 Conventions used in this document The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119. Table of Contents 1. Introduction...................................................2 2. Target Use Scenarios and Design Goals..........................3 3. Protocol Specification.........................................4 3.1 Overview...................................................4 3.2 Packet Structures..........................................4 3.3 Timing.....................................................6 3.4 The Sender's Algorithm.....................................7 3.5 The Receiver's Algorithm...................................8 3.6 Rate Control Algorithm....................................11 3.7 Flow Control Algorithm....................................12 3.8 Connection Setup and shutdown.............................12 3.9 Loss Information Compression Scheme.......................13 4. Efficiency and Fairness Characters............................14 Security Considerations..........................................14 Normative References.............................................15 Informative References...........................................15 Acknowledgments..................................................15 Author's Addresses...............................................16 1. Introduction As the network bandwidth-delay product (BDP) increases, conventional TCP implementations become inefficient. This is because its AIMD (additive increase multiplicative decrease) algorithm reduces the TCP congestion window drastically but fails to recover it to the available bandwidth quickly. Theoretical flow level analysis has shown that TCP becomes more vulnerable to packet loss as the BDP increases higher [LM97]. Additionally, the unfairness of RTT bias inherent in TCP congestion control also becomes more serious in distributed data intensive applications. Concurrent TCP flows with different RTT will tend to unfairly share the available bandwidth. Although small BDP networks will share bandwidth relatively fairly using conventional TCP implementations, networks with large BDP tend to suffer severe unfairness issues under conventional TCP. This RTT bias severely limits the effectiveness of distributed computations based on high- speed wide-area networks as, for example, Grid-based computations in Gu, Grossman Expires - February 2005 [Page 2] UDT August 2004 an Internet2-based context. As of today, improvements on the standard TCP still fails to satisfy the efficiency and fairness (especially the RTT bias problem) objectives in high BDP environments. Such TCP modifications as presented in RFC 1423 (high-performance extensions), RFC 2018 (SACK), RFC 2582 (New Reno), RFC 2883 (D-SACK), and RFC 2988 (RTO calculation) do improve the efficiency more or less, but the fundamental problem of the TCP AIMD algorithm is unsolved. HS TCP (RFC 3649) achieved high utilization of bandwidth in high BDP networks by radically changing the TCP congestion control algorithm, but the fairness problem still remains. Considering the background above, new transport protocol is needed to support the high performance bulk data transfer in high BDP networks. We propose a new application level transport protocol, named UDT, or UDP based Data Transfer Protocol, together with a new congestion control algorithm. This document presents the two orthogonal parts of the UDT protocol and the UDT congestion control algorithm. The application level protocol, layering over UDP, can use other congestion control algorithms; whereas the algorithm defined in this document can be implemented in other protocols, such as TCP. An example implementation of the protocol can be found at [UDT]. Detailed performance analysis of the congestion control algorithm can be found in [GHG04]. 2. Target Use Scenarios and Design Goals UDT is expected to be used in the situations where a small number of bulk sources share abundant bandwidth. Typical use scenario is the computational grids [FKT01] built on wide area optical networks. A few institutes run their data intensive distributed applications on the networks, such as remote access to scientific instruments, distributed data mining, and high-resolution media streams. The main objectives of UDT are efficiency, fairness, and stability. Single, or small number of, UDT flows should utilize all the available bandwidth provided by the high-speed links, even if the bandwidth changes drastically. Meanwhile, all concurrent flows must share the bandwidth fairly, independent of the different bottleneck bandwidth, start time, or RTT. Stability requires that the packet sending rate should always converge to the available bandwidth very quickly, and congestion collapse must be avoided. UDT is NOT used to replace TCP in the Internet where the bottleneck Gu, Grossman Expires - February 2005 [Page 3] UDT August 2004 bandwidth is relatively small and there are large amounts of multiplexed short life flows. However, UDT aims to be TCP friendly - when coexisting with TCP, the bandwidth that UDT allocates should not exceed its fair share according to the max-min rule. (Note that the max-min rule allows UDT to allocate the available bandwidth that TCP fails to utilize in high BDP lossy links.) We feel that TCP friendliness is critical since TCP will always be in use in these high BDP networks, and UDT applications may sometimes run in the Internet. 3. Protocol Specification 3.1 Overview UDT is duplex. Each UDT entity has two parts: the sender and the receiver. The sender sends (and retransmits) application data according to flow control and rate control. The receiver receives both data packets and control packets, and sends out control packets according to the received packets as well. Both the sender and the receiver share one UDP port for packet sending/receiving. The receiver is also responsible for triggering and processing all control events, including congestion control and reliability control, and their related mechanisms such as RTT estimation, bandwidth estimation, acknowledging and retransmission. UDT always tries to pack application data into fixed size packets, unless there is not enough data to be sent. Similar to TCP, this fixed packet size is named MSS (maximum segment size). Since UDT is supposed to be used to transfer bulk data stream, we assume that there is only a very small portion of irregular sized packets in a UDT session. MSS can be set up by applications (see Section 3.8) and the optimal value is the path MTU (including all packet headers). The UDT congestion control algorithm combines rate control and window control (flow control), where the former tunes the packet-sending period and the latter limits the number of unacknowledged packets. The parameters used in rate control are updated by bandwidth estimation technique, which is derived from the receiver based packet pair method [P99], in real time. Meanwhile, the rate control period is constant in order to eliminate RTT bias. The flow control parameters depend on the data arrival speed at the peer side, in addition to the receiver's free buffer size. 3.2 Packet Structures UDT has two kinds of packets: the data packets and the control Gu, Grossman Expires - February 2005 [Page 4] UDT August 2004 packets. They are distinguished by the 1st bit (flag bit) of the packet header. The flag bit of a data packet is 0, whereas it is 1 for a control packet. The data packet structure is shown as below. 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |0| Packet Sequence Number | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | ~ Application Data Payload ~ | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Packet sequence number is the only content in the header of a UDT data packet. It is an unsigned integer using the following 31 bits after the flag bit. UDT uses packet based sequencing, i.e., the sequence number is increased by 1 for each non-retransmitted data packet. The sequence number is wrapped after it is increased to the maximum number (2^31 - 1). Following this 32-bit packet header is the payload for application data. The control packet has the following structure. 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |1|type | Reserved | ACK Seq. No. | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | ~ Control Information Field ~ | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ There are 6 types of control packets in UDT and the type information is put in bit field 1 - 3 of the header. The first 32 bits must exist in the packet header. The control Information Field contains zero (i.e., it does not exist; in this case it is noted as "none" in the specification) or multiple 32-bit unsigned integers, depending on the packet type. UDT uses ACK sub-sequencing. Each ACK/ACK2 packet has a unique 16-bit sequence number, which is independent of the data packet sequence number. It uses bits 16 - 31 in the control packet header. The ACK sequence number ranges from 0 to (2^16 - 1). Bits 16 - 31 are undefined in other control packets. Gu, Grossman Expires - February 2005 [Page 5] UDT August 2004 TYPE 000: Protocol Connection Handshake Control Info: 1) 32 bits: UDT version 2) 32 bits: initial sequence number 3) 32 bits: MSS (bytes) 4) 32 bits: maximum flow window size (bytes) TYPE 001: Keep-alive Control Info: None TYPE 010: Acknowledgement (ACK) bits 16-31: ACK sequence number Control Info: 1) 32 bits: The packet sequence number to which (excluding) all the previous packets have been received 2) 32 bits: RTT (microseconds) 3) 32 bits: RTT variance, or RTTVar (microseconds) 4) 32 bits: Flow window size (number of packets) 5) 32 bits: Estimated link capacity (number of packets per second) TYPE 011: Negative Acknowledgement (NAK) Control Info: 32-bit integer array of compressed loss information (see Section 3.9). TYPE 100: Reserved. This type of control message is reserved for congestion warning that can be sent from the receiver to the sender. A congestion warning can be triggered by ECN, or a measurement of increasing trend in the packet delay. TYPE 101: Shutdown Control Info: None TYPE 110: Acknowledgement of Acknowledgement (ACK2) bits 16-31: ACK sequence number Control Info: None TYPE 111: Explained by bits 4 - 15, reserved for future use. Note that for both data and control packets, the actual packet sizes can be ascertained from UDP header [UDP]. The packet size information can be used to derive the size of the payload in a data packet and the size of control information field in an NAK control packet. 3.3 Timing UDT uses four timers in the receiver to trigger different periodical events, including rate control, acknowledgement, loss report (negative acknowledgement), and retransmission / connection maintenance. Timers in UDT use the system time as origins and can process wrapping if the system time wraps. The UDT receiver actively queries the system time to check whether a timer is expired or not. For a certain Gu, Grossman Expires - February 2005 [Page 6] UDT August 2004 timer T in UDT with a period of TP, suppose variable t is used to record the most recent time when T is set or reset. If T is set or reset at system time t0 (t = t0), then at any time t1, (t1 - t >= TP) is the condition to check if T is expired. The four timers are 1) RC timer, 2) ACK timer, 3) NAK timer, and 4) EXP timer. Their periods (or timeout values) are RCTP, ATP, NTP, and ETP, respectively. RC timer is used to trigger periodical rate control. ACK timer is used to trigger periodical selective acknowledgements (ACK packets). Both RCTP and ATP are constant values: RCTP = ATP = 0.01 seconds. NAK is used to trigger negative acknowledgements (NAK packets). Retransmission timer is used to trigger a data packet's retransmission and maintain connection status. Their periods depend on the estimated RTT value. The ETP value also depends on the number of continuous EXP timeouts. The recommended initial value of RTT in UDT is 0.1 seconds and the values of NTP and ETP are initialized as: NTP = 3 * RTT, ETP = 3 * RTT + ATP. The system time is queried after each time bounded UDP receiving (there will be additional necessary data processing time if a UDP packet is received) to check if any of the four timers is expired. The recommended granularity of their periods is microseconds. The timeout value of UDP receiving is a choice of implementation, depending on the tradeoff between the system burden caused by the looped queries on system time and the accuracy of event periods. The rate control event updates the packet-sending period, STP, which is used to schedule data packet sending in the UDT sender. Suppose a packet is to be sent at time t0, then the next packet sending is scheduled to (t0 + STP). In the other word, if the previous packet sending takes t' time, the sender will wait for (STP - t') for the next sending (if STP - t' < 0, no waiting is needed). This waiting interval needs a high precision implementation. It is recommended that CPU clock cycle granularity be used for this timer. 3.4 The Sender's Algorithm Data Structures and Variables: 1) SND PKT History Window: A circular array that records the departure time of each data packet. 2) Sender's Loss List: The sender's loss list is a linked list used to store the sequence numbers of the lost packets fed back by the receiver in NAK packets. The numbers are stored in increasing order. Data Sending Algorithm: Gu, Grossman Expires - February 2005 [Page 7] UDT August 2004 1) If the sender's loss list is not empty, retransmit the first packet in the list and remove it from the list. Go to 5). 2) Wait until there is application data to be sent. 3) a. If the number of unacknowledged packets exceeds the flow window size, go to 1). b. Pack a new data packet and send it out. 4) If the sequence number of the current packet is 16n, where n is an integer, go to 2). 5) Record the packet sending time in SND PKT History Window. 6) If this is the first packet sent after the last time when the sending rate is decreased, wait SYN time. 7) Wait (STP - t) time, where t is the total time used by step 1 to step 4. Go to 1). 3.5 The Receiver's Algorithm Data Structures and Variables: 1) Receiver's Loss List: It is a linked list of tuple whose values include: the sequence number of a lost data packet, the latest feedback time of the lost packet, and the number of times the packet has been fed back. Values are stored in the increasing order of packet sequence numbers. 2) ACK History Window: A circular array of each sent ACK and the time it is sent out. By "circular" it means that the most recent value will overwrite the oldest one if there is no more free space in the array. 3) RCV PKT History Window: A circular array that records the arrival time of each data packet. 4) Packet Pair Window: A circular array that records the time interval between each probing packet pair. 5) LRSN: A variable to record the largest received data packet sequence number. LRSN is initialized to the initial sequence number (to be sent) minus 1. 6) exp-count: A variable to record number of continuous EXP timeout events and its initial value is 1. Data Receiving Algorithm: 1) Query the system time to check if RC, ACK, NAK, or EXP timer is expired. If there is any, process the event accordingly (as described below in this section) and reset the expired timers. 2) Start time bounded UDP receiving. If no packet arrives, go to 1). 3) Set exp-count to 1 and update ETP: ETP = RTT + 4 * RTTVar + ATP. 4) If all sent data packets have been acknowledged, reset the EXP time variable. 5) Check the flag bit of the packet header. If it is a control packet, process it according to its type and go to 1). 6) If the sequence number of the current data packet is 16n + 1, Gu, Grossman Expires - February 2005 [Page 8] UDT August 2004 where n is an integer, record the time interval between this packet and the last data packet in Packet Pair Window. 7) Record the packet arrival time in PKT History Window. 8) a. If the sequence number of the current data packet is greater than LRSN + 1, put all the sequence numbers between (but excluding) these two values into the receiver's loss list and send them to the sender in an NAK packet. b. If the sequence number is less than LRSN, remove it from the receiver's loss list. 9) Update LRSN. Go to 1). On RC timer expired: Update STP by rate control algorithm (see Section 3.6). On ACK timer expired: 1) Find the sequence number prior to (excluding) which all the packets have been received by the receiver (ACK number) according to the following rule: if the receiver's loss list is empty, the ACK number is LRSN + 1; otherwise it is the smallest sequence number in the receiver's loss list. 2) If (a) the ACK number is not greater than the largest ACK number ever acknowledged by ACK2, or (b) it is equal to the ACK number in the last ACK and the time interval between this two ACK packets is less than (RTT + 4 * RTTVar), stop (do not send this ACK). 3) Assign this ACK a unique increasing ACK sequence number. It is recommended that the ACK sequence number should be increased by 1 for each sent ACK in the sending order (start with 0) and wrapped after reaching the maximum value. 4) Calculate the packet arrival speed according to the following algorithm: Calculate the median value of the last 16 packet arrival intervals (AI) using the values stored in PKT History Window. In these 16 values, remove those either greater than AI*8 or less than AI/8. If more than 8 values are left, calculate the average of the left values AI', and the packet arrival speed is 1/AI' (number of packets per second); Otherwise, return 0. 5) Calculate flow control window for the peer side (W) according to Section 3.7. Then calculate effective flow window size as: max(min(W, available receiver buffer size), 2). 6) Calculate the estimated link capacity according to the following algorithm: If the flow control quick start phase (see Section 3.7) is still ongoing, return 0. Otherwise, Calculate the median value of the last 16 packet pair intervals (PI) using the values in Packet Pair Window, and the link capacity is 1/PI (number of packets per second). 7) Pack the ACK sequence number, ACK number, RTT, RTT variance, effective flow window size, and estimated link capacity into the Gu, Grossman Expires - February 2005 [Page 9] UDT August 2004 ACK packet and send it out. 8) Record the ACK sequence number, ACK number and the departure time of this ACK in the ACK History Window. On NAK timer expired: 1) Search the receiver's loss list, find out all those sequence numbers whose last feedback time is (k * (RTT + 4 * RTTVar)) before, where k is the number of feedback times of this packet plus 1. If there is no loss to be fed back, stop. 2) Compress the sequence numbers obtained in step 1 (according to Section 3.9) and send them back to the sender in an NAK packet. 3) Stop flow control quick start phase if it not. On EXP timer expired: 4) If the sender's loss list is not empty, stop. 5) Put all the unacknowledged packets into the sender's loss list. 6) If (exp-count > 16) and the total time since last time a packet from the peer side is received has exceeded 3 seconds, or this time is greater than 3 minutes, which is supposed to be an indication that the connection is broken, close the UDT connection. 7) If there is no data packet that is not acknowledged, send a keep- alive packet to the peer side; otherwise, put all the sequence numbers of the unacknowledged packets into the sender's loss list. 8) Update exp-count: exp-count = exp-count + 1. 9) Update ETP: ETP = exp-count * (RTT + 4 * RTTVar) + ATP. On ACK packet received: 1) Update the largest acknowledged sequence number. 2) Update RTT and RTTVar by: RTT = rtt RTTVar = rv where rtt and rv are the RTT and RTTVar values carried in the ACK. 3) Update NTP and ETP: NTP = RTT + 4 * RTTVar ETP = exp-count * (RTT + 4 * RTTVar) + ATP 4) Update estimated link capacity: B = (B * 7 + b) / 8, where b is the value carried in the ACK. 5) Update flow window size to the value carried in this ACK. 6) Send back an ACK2 with the same ACK sequence number in this ACK. 7) Reset the EXP timer. On NAK packet received: 1) Add all sequence numbers carried in the NAK into the sender's loss list. 2) Update STP by rate control (see Section 3.6). 3) Reset the EXP timer. On ACK2 packet received: Gu, Grossman Expires - February 2005 [Page 10] UDT August 2004 1) Locate the related ACK in the ACK History Window according to the ACK sequence number in this ACK2. 2) Update the largest ACK number ever been acknowledged. 3) Calculate new rtt according to the ACK2 arrival time and the ACK departure time, and update the RTT and RTTVar values as: RTTVar = (RTTVar * 3 + abs(rtt - RTT)) / 4 RTT = (RTT * 7 + rtt) / 8. The initial values of RTT and RTTVar are 0.1 seconds and 0.05 seconds, respectively. 4) Update NTP and ETP: NTP = RTT, ETP = (exp-count + 1) * RTT + ATP. On Keep-alive packet received: Do nothing. On Handshake/Shutdown packet received: See Section 3.8. 3.6 Rate Control Algorithm Rate Control Quick Start: The STP is initially set to the minimum time precision (e.g., 1 CPU clock cycle or 1 microsecond). This is the quick start phase of UDT and it stops as soon as an ACK is received the estimated bandwidth carried by it is greater than zero. The packet sending period is then set as 1/W, where W is the flow window size carried by this ACK. The quick start occurs only at the beginning of a UDT connection, and it never executed again in the UDT connection life. After quick start, the following algorithms start to work. On RC timer expired: 1) If there is no ACK received during last RCTP, stop. 2) Calculate the loss rate during the last RCTP according to the number of packets sent and the number of total lost packets fed back in NAK. If the loss rate is greater than 0.1%, stop. 3) The number of sent packets to be increased in the next RCTP (inc) is calculated as: if (B <= C) inc = 1/MSS else inc = max(10^(ceil(log10((B-C)*MSS*8))) * Beta/MSS, 1/MSS) where B is the estimated link capacity and C is the current sending speed. Both are counted as packets per second. MSS is counted in bytes. Beta is a constant value of 0.0000015. 4) Update STP: STP = (STP * RCTP) / (STP * inc + RCTP). 5) Compute the real data sending period (rsp) from SND PKT History Window, If (STP < 0.5 * rsp), set STP to (0.5 * rsp). 6) If (STP < 1.0), set STP to 1.0. Gu, Grossman Expires - February 2005 [Page 11] UDT August 2004 On NAK packet received: Data Structures and Variables: 1) LSD: The largest sequence number that has been sent when last rate decrease occurred. 2) NumNAK: Number of NAKs since last time the LSD is updated. 3) AvgNAK: The moving average number of NAKs between two events when the largest sequence number is greater than LSD. 4) DR: A random number that satisfies the uniform distribution on [1, AvgNAK]. The rate control algorithm on NAK received: 1) If the largest lost sequence number in the NAK is greater than LSD: Increase the STP by 1/8: STP = STP * (1 + 1/8). Update AvgNAK: AvgNAK = (AvgNAK * 7 + NumNAK) / 8. Update DR. Reset NumNAK = 0; Record LSD. 2) Otherwise, increase NumNAK by 1, and if NumNAK % DR = 0: Increase the STP by 1/8: STP = STP * (1 + 1/8). Record LSD. 3.7 Flow Control Algorithm The flow control window size (W) is 16 initially. On ACK timer expired: 1) Flow Control Quick Start: If no NAK has been ever generated, or, W has not reached or exceeded 16 packets and AS > 0, the flow window size is updated to the total number of acknowledged packets. 2) Otherwise, if (AS > 0), where AS is the packet arrival speed, W is updated as: W = ceil(W * 0.875 + AS * (RTT + ATP) * 0.125) 3) Limit W to the maximum flow window size of the peer side. 3.8 Connection Setup and shutdown One UDT entity starts first as the server, and its peer side (the client) will send a handshake packet when it wishes to connect. The client should keep on sending the handshake packet every constant interval, (implementations should decide this interval according to the balance between response time and system overhead), until it receives a response handshake from the server or timeout occurs. The handshake packet has the following information: 1) UDT version: this value is for compatibility purpose. The current version is 2. Gu, Grossman Expires - February 2005 [Page 12] UDT August 2004 2) Initial Sequence Number: this is the initial data packet sequence number that the UDT entity that sends this handshake will use to send out data packets. This must be a random value between 1 and (2^31 - 1). In addition, it is recommended that the value should not repeat within a reasonable time history window. 3) MSS: the size of a data packet (measured by IP payload). 4) Maximum Flow Window Size: This is the maximum flow window size the UDT entity that receives this handshake is allowed to have. The window size is generally limited by the sizes of data structures in the receiver. The server, when receiving a handshake packet, compares the MSS value with its own value and set its own values as the smaller one. The resulting value is also sent back to the client by a response handshake packet, together with the server's version, initial sequence number, and maximum flow window size. The version value is used to check the compatibility of the two peers. The initial sequence number and maximum flow window size are used to initialize the parameters of the UDT entity that receives this handshake packet. The server is ready for sending/receiving data right after this step is completed. However, it must send back response packets as long as it receives any further handshakes from the same client. The client can start sending/receiving data once it gets a response handshake packet from the server, sets its own MSS as the value carried in the handshake packet, and initialize its parameters related to initial sequence number and maximum flow window size. Further response handshake messages, if received any, should be omitted. If one of the connected UDT entities is being closed, it will send a shutdown message to the peer side. The peer side, after receiving this message, will also be closed. This shutdown message, delivered using UDP, is only sent once and not guaranteed to be received. If the message is not received, the peer side will be closed according to the timeout mechanism (see Section 3.5). 3.9 Loss Information Compression Scheme The loss information carried in an NAK packet is an array of 32-bit integers. If a number in the array is a normal sequence number (1st bit is 0), it means that the packet with this sequence number is lost; if the 1st bit is 1, it means all the packets starting from (including) this number to (including) the next number in the array (whose 1st bit must be 0) are lost. Gu, Grossman Expires - February 2005 [Page 13] UDT August 2004 For example, the following information carried in an NAK: 0x00000002, 0x80000006, 0x0000000B, 0x0000000E means packets with sequence number 2, 6, 7, 8, 9, 10, 11, and 14 are lost. 4. Efficiency and Fairness Characters UDT can utilize close to the full available bandwidth independent of link capacity, RTT, and background coexisting flows, given the link bit error rate (BER) of the current wired networks. There is a constant time needed for UDT to increase from 0bits/s to 90 percent of the available bandwidth when there is no loss, which is 7.5 seconds. UDT is not suitable for wireless networks. UDT strictly satisfies max-min fairness in single bottleneck network topologies. In multi-bottleneck scenarios, it guarantees that flows over smaller bottleneck links obtain at least half of their fair share according to max-min rule. RTT has little impact on the fairness. When coexisting with concurrent bulk TCP flow, TCP can occupy more bandwidth than UDT does, except in one of the following 3 situations: 1) The network BDP is very large that TCP fails to utilize their fair share bandwidth. In this situation, UDT will occupy the bandwidth that TCP cannot utilize. 2) The link capacity is so small that UDT's bandwidth estimation techniques do not work optimally. Simulation shows that this threshold link capacity is about 100 kb/s. 3) In networks where FIFO queues are used along the paths, if the queue size is greater than BDP, TCP's bandwidth share decreases as the queue size increases. However, to reach about equal sharing with UDT, the queue size needed generally exceeds the amount that a practical router/switch will provide. When short (timewise) web-like TCP flows coexist with a small number of concurrent UDT flows, the effect of UDT on TCP flows is very small. More analysis can be found in [GHG03]. Security Considerations UDT has not its specific security mechanism, whereas it depends on the application to provide authentication and the lower layer to provide security mechanisms, if necessary. Gu, Grossman Expires - February 2005 [Page 14] UDT August 2004 However, UDT implementations should check that all arrived packets are from the expected source, since UDP is connectionless. This is inherently accomplished because in the socket API a "connected" UDP socket only accepts packets from a certain source the UDP socket is "connected" to. Normative References [UDP] J. Postel, RFC 768: User Datagram Protocol, Aug. 1980. Informative References [FKT01] I. Foster, C. Kesselman, and S. Tuecke, "The Anatomy of the Grid: Enabling Scalable Virtual Organizations", International J. Supercomputer Applications, 15(3), 2001. [GHG04] Y. Gu, X. Hong, and R. L. Grossman, "Experiences in Design and Implementation of a High Performance Transport Protocol". SC 2004, Pittsburgh, PA, Nov. 6 -12, 2004. [LM97] T. V. Lakshman and U. Madhow, "The Performance of TCP/IP for Networks with High Bandwidth-Delay Products and Random Loss", IEEE/ACM Trans. on Networking, vol. 5 no 3, July 1997, pp. 336- 350. [P99] V. Paxson, "End-to-End Internet Packet Dynamics", IEEE/ACM Transactions on Networking, Vol.7, No.3, June 1999, pp. 277-292. [UDT] UDT Source Code, URL http://sourceforge.net/projects/dataspace. Acknowledgments The authors thank Dr. Xinwei Hong for his cooperated work in the SABUL protocol, an early prototype of UDT protocol. Dr. Marco Mazzucco and Dr. Sivakumar Harinath also contribute to the earlier idea of SABUL. We appreciate those researchers, developers, and system administrators inside or outside LAC/NCDM for their helpful feedbacks on using UDT. We are grateful to StarLight, SARA, and Canarie for providing us the high-speed networking infrastructures. We are also grateful for the feedback and comments from the following individuals: Guy Almes, Cees de Laat, Freek Dijkstra, Rajkumar Kettimuthu, and Johan Blom. Gu, Grossman Expires - February 2005 [Page 15] UDT August 2004 Author's Addresses Yunhong Gu Laboratory for Advanced Computing University of Illinois at Chicago 700 SEO, M/C 249, 851 S Morgan St Chicago, IL 60607, USA Phone: +1 (312) 996-0305 Email: yunhong@lac.uic.edu Robert Grossman Laboratory for Advanced Computing University of Illinois at Chicago 727 SEO, M/C 249, 851 S Morgan St Chicago, IL 60607, USA Phone: +1 (312) 413-2176 Email: grossman@uic.edu Gu, Grossman Expires - February 2005 [Page 16]