Benchmarking The Session Initiation Protocol (SIP)


Benchmarking the Session Initiation Protocol (SIP)

Yueqing Zhang, Arthur Clouet, Oluseyi S. Awotayo, Carol Davids
Illinois Institute of Technology
Email: ids@iit.edu

Vijay K. Gurbani
Bell Laboratories, Alcatel-Lucent
Email: vkg@bell-labs.com

Abstract—Measuring and comparing the performance of an Internet multimedia signaling protocol across varying vendor implementations is a challenging task. Tests to measure the performance of the protocol as it is exhibited on a vendor device, and the conditions of the laboratory setup used to run the tests, have to be normalized such that no special favour is accorded to a particular vendor implementation. In this paper, we describe a performance benchmark to measure the performance of a device that includes a Session Initiation Protocol (SIP) proxy function. This benchmark is currently being standardized by the Internet Engineering Task Force (IETF). We implemented the algorithm that has been proposed in the IETF to measure the performance of a SIP server. We provide the test results of running the benchmark on Asterisk, the popular open source SIP private branch exchange (PBX).

I. INTRODUCTION

The Session Initiation Protocol (SIP [1]) is an IETF-standardized protocol for initiating, maintaining and disconnecting media sessions. The protocol has been adopted by many sectors of the telecommunications industry: IP Private Branch eXchanges (PBXes) and SIP Session Border Controllers (SBCs) are used for business VoIP phone services as well as in the delivery of Next Generation 911 services over the Emergency Services IP Network (ESINet [2]); LTE mobile phone systems are designed to use the SIP protocol and to replace the circuit-switched telephone service currently provided over cellular networks; and SIP-to-SIP and SIP-PSTN-SIP ecosystems are used to reduce dependence on, and supplement, the traditional long-distance trunk circuits of the Public Switched Telephone Network (PSTN). Many commercial systems and solutions based on SIP are available today to meet the industry's requirements. As a result, there is a strong need for a vendor-neutral benchmarking methodology to allow a meaningful comparison of SIP implementations from different vendors.

The term performance has many meanings in many different contexts. To minimize ambiguity and provide a level field in which to interpret the results, we conduct our work in the Benchmarking Working Group (BMWG, http://datatracker.ietf.org/wg/bmwg/charter/) within the Internet Engineering Task Force (IETF, http://www.ietf.org). BMWG does not attempt to produce benchmarks in live, operational networks for the simple reason that conditions in such networks cannot be controlled as they would be in a laboratory environment. Furthermore, benchmarks produced by BMWG are vendor-neutral and have universal applicability to a given technology class.

The specific work on SIP in the BMWG consists of two Internet-Drafts, a terminology draft [3] and a methodology draft [4]. The methodology draft uses the established terminology from the terminology draft to define concrete test cases and to provide an algorithm to benchmark the test cases uniformly. We note that the latest versions of the terminology and methodology drafts are awaiting publication as Request For Comments (RFC) documents. The work we describe in this paper corresponds to earlier versions of these drafts (more specifically, to version -09 of these drafts [5], [6]).

The remainder of this paper is organized as follows. Section II contains the problem statement. Section III describes related work. Section IV contains a brief background on the SIP protocol.
Section V describes the associated benchmarking algorithm from [6] and the script that we wrote to implement the algorithm. This script is available for public download at the URL indicated in [7]. Section VI contains a description of the test bed. Section VII describes the test harness and the base results that can be obtained without the SIP server in the mix, and Section VIII takes a look at the results when a SIP server is added to the ecosystem. Section IX describes our plans for future work.

II. PROBLEM STATEMENT

Performance can be defined in many ways. Generally the term is used to describe the rate of consumption of system resources. Time, or processing latency, is often the metric used to quantify the resource consumption. The probability that an operation or transaction will succeed is another metric that has been proposed. The metrics are generally collected while the Device Under Test (DUT) is processing a well-defined offered load. The characteristics of the offered load (whether or not media is being used, the specific transport used for testing, the codecs used, and so on) are an important part of the metric definition.

In this study we use a different type of metric and a different type of load. Loads are chosen whose individual calls have a constant arrival rate and a constant duration. The value assigned to a given load is the number of calls per second, also referred to as the session attempt rate. The loads are not necessarily designed to emulate any naturally occurring offered load; rather, they are designed to be easily reproduced and capable of exercising the DUT to the point of failure. We "test to failure", looking for the performance knee or breakpoint of the DUT while applying a well-calibrated, pre-defined load. The measure of performance is the value of the highest offered load that, when applied to the DUT, produces zero failures, the next highest rate attempted having produced at least one failure.

The rationale for such an approach has to do with the nature of SIP servers, the applications they are used to create and the environment in which they are deployed. SIP is used to create a diverse and growing variety of applications. There will be a diverse set of user communities, each with its own characteristic use pattern. The characteristics of the offered loads will differ and are not easily predicted. Testing to zero failure using a standard pre-defined load is a reproducible way to learn how a system behaves under a steady load. The approach resembles the use of statistical measures such as means and standard deviations to predict future behavior. It is assumed that the resulting metric will be useful as a predictor of behavior under more realistic loads.

III. RELATED WORK

There is a growing body of research related to various aspects of the performance of SIP proxy servers. In many of these studies, performance is measured by the rate at which the DUT consumes resources when presented with a given work load. SIPstone [8] describes a benchmark by which to measure the request-handling capacity of a SIP proxy or cluster of proxies. The work outlines a method for creating an offered load that will exercise the various components of the proxy. A mix of registrations and call initiations, both successful and unsuccessful, is described, and the predicted performance is measured by the time it takes the proxy to process the various requests. SPEC SIP [9] is a software benchmark product that seeks to emulate the type of traffic offered to a SIP-enabled Voice over IP service provider's network, on which the servers play the role of SIP proxies and registrars. The metric defined by SPEC SIP is the number of users that complete 99.99% of their transactions successfully. Successful transactions are defined to be those that end before the expiration of the relevant SIP timers and with the appropriate 200, 302 and 487 final responses. Cortes et al. [10] define performance in terms of processing time, memory allocation, CPU usage, thread performance and call setup time.

A common theme in these studies is the creation of a realistic traffic load. In contrast, the work described in this paper defines a way to use simple, easily produced (and reproducible) loads to benchmark the performance of SIP proxies in a controlled environment. Such benchmarks can be used to compare performance across vendor implementations.

IV. SIP BACKGROUND

The base SIP protocol defines five methods, of which four (REGISTER, INVITE, ACK and BYE) are used in our work. The REGISTER method registers a user with the SIP network, while the INVITE method is used to establish a session. A third method, ACK, is used as the last method in a 3-way handshake. BYE is used to tear down a session.

Figure 1: SIP REGISTER and subsequent INVITE methods

A SIP ecosystem consists of SIP user agent clients (UACs) and user agent servers (UASs), registrars and proxies (see Figure 1). A UAS registers with a registrar, and a proxy, possibly co-located with the registrar, routes requests from a UAC towards the UAS when a session is established. Proxies typically rendezvous the UAC with the UAS and then drop out of the chain; subsequent communications go directly between the UAC and UAS.
The protocol, though, has mechanisms that allow all signaling messages to proceed through the proxy. Figure 1 also shows media traversing directly between the UAC and UAS; while this is preferred, in some cases it is necessary for the media to traverse a media gateway or media relay controlled by the proxy (e.g., for conferencing or for upgrading a media codec).

SIP defines six response classes, with the 1xx class being provisional responses and the 2xx to 6xx class responses being considered final. A SIP transaction is defined as an aggregation of a request, one or more provisional responses and a final response (see Figure 1). While transactions are associated on a hop-by-hop basis, dialogues are end-to-end constructs. The dialogue is a signaling relationship necessary to manage the media session. For SIP networks where media traverses a proxy, there will be one dialogue established between the UAC and the proxy and another between the proxy and the UAS.

SIP can be used to create calls or sessions that resemble traditional telephone calls. SIP can also be used to create other services and applications, but the present work only examines its use in creating audio/video sessions, which are also referred to as "calls". The terms "session" and "call" are used interchangeably in this paper, but the term session is more applicable in the context of the Internet and in recognition of the functionality beyond phone calls that SIP offers. One can consider a session to be a three-part vector with a signaling component (SIP), a media component (the Real-time Transport Protocol, or RTP [11]) and a management component (the Real-time Control Protocol, or RTCP [11]). Figure 2 illustrates this interpretation.

Figure 2: SIP session as a 3-part vector

With this interpretation, a traditional "call" is a SIP session with a non-zero component in the media plane. This interpretation enables the study of maximum rates of SIP loads other than INVITEs. A load consisting only of REGISTER requests, for example, will create sessions with components only in the signaling plane. Such rates, particularly registration rates, are of great interest to the SIP community. A load consisting of INVITE requests and including media that traverses the DUT, on the other hand, will require more resources, since the DUT will handle the RTP as well as the SIP messages.
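To make the three-part-vector interpretation concrete, the short Python sketch below models a session by its signaling, media and management components; the class and field names are our own illustration, not terminology from the drafts. A REGISTER-only load then corresponds to a vector whose media and management components are empty, while an INVITE call with media populates all three planes.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class SessionVector:
        """A SIP session viewed as a 3-part vector: signaling, media, management."""
        signaling: List[str]                                  # SIP methods exercised
        media: List[str] = field(default_factory=list)        # RTP streams, if any
        management: List[str] = field(default_factory=list)   # RTCP flows, if any

        def is_call(self) -> bool:
            # A traditional "call" is a session with a non-zero media component.
            return len(self.media) > 0

    # A registration-only load: components only in the signaling plane.
    registration = SessionVector(signaling=["REGISTER"])

    # An INVITE-initiated audio call whose media traverses the DUT: the DUT must
    # then handle RTP (and RTCP) in addition to the SIP messages.
    audio_call = SessionVector(signaling=["INVITE", "ACK", "BYE"],
                               media=["audio/RTP"],
                               management=["RTCP"])

    print(registration.is_call(), audio_call.is_call())   # False True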

V. THE SESSION ESTABLISHMENT RATE ALGORITHM

The algorithm shown in Algorithm 1 was programmed using the Unix Bash shell scripting language [12] and is available for public download at [7]. The algorithm finds the largest session attempt rate at which the DUT can process requests successfully, with zero failures, for a pre-defined, extended period of time. The name given to this session rate is the "Session Establishment Rate" (SER [6]). The period of time is large enough that the DUT can process with zero errors while allowing the system to reach steady state.

The algorithm is defined as an iterative process. A starting rate of r = 100 sessions/second (sps) is used and calls are placed at that rate until n = 5000 calls have been placed. If all n calls are successful, the rate is increased to 150 sps and, again, calls continue at that rate until n = 5000 calls have been placed. The attempt rate is ramped up in this way until a failure is encountered before n = 5000 calls have been placed. A new attempt rate is then calculated, halfway between the last successful rate and the rate at which failures occurred. If this new attempt rate also results in errors, the next attempt rate tried is again halfway between the last successful rate and the most recent failing rate. Continuing in this way, an attempt rate without errors is found. The tester can specify the margin of error using the parameter G, the granularity, which is measured in units of sps (sessions/sec). Any attempt rate that is within an acceptable tolerance of G can be used as the SER.

Algorithm 1: Benchmarking algorithm from [6]

    {Parameters of test; adjust as needed}
    n := 5000    {local maximum; used to figure out the largest value (number of sessions attempted)}
    N := 50000   {global maximum; once the largest session rate has been established, send this many requests before calling the test a success}
    m := {.}     {other attributes affecting testing, media for instance}
    r := 100     {initial session attempt rate (sessions/sec)}
    G := 5       {granularity of results; margin of error in sessions/sec}
    C := 0.05    {calibration amount: how much to back down if we have found candidate s but cannot send at rate s for time T with zero failures}
    {End parameters of test.}
    f := false   {set when test is done}
    c := 0       {set to 1 once an upper limit has been found}
    repeat
        send_traffic(r, m, n)   {send r req/sec with m media characteristics until n requests have been sent}
        if all requests succeeded then
            r' := r   {save candidate value of metric}
            if (c = 0) then
                r := r + (0.5 * r)
            else if (c = 1) and ((r'' - r') > 2G) then
                r := r + (0.5 * (r'' - r))
            else if (c = 1) and ((r'' - r') <= 2G) then
                f := true
            end if
        else   {one or more requests failed}
            c := 1     {found upper bound for metric}
            r'' := r   {save new upper bound}
            r := r - (0.5 * (r - r'))
        end if
    until (f = true)

As an example, assume it is known (through the vendor) that a SIP proxy (the DUT) exhibits an SER of 269 sps. To independently verify this, the benchmark algorithm starts with r = 100 sps and continues ramping this rate until failures are seen. At, say, 400 sps, the DUT will exhibit failures and the algorithm will reduce the next attempted rate to 300 sps, halfway between the rate at which the failures occurred (400) and the last known successful rate (200). The algorithm continues in this fashion until a stable rate, within the chosen granularity, is reached.
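For readers who prefer executable pseudocode, the following Python sketch mirrors the structure of Algorithm 1. It is our own illustration, not the Bash script from [7]: send_traffic is left as a stub that, in a real harness, would drive the SIPp load generator against the DUT and report whether every session attempt succeeded, and the longer confirmation run governed by the parameters N and C is omitted.

    def send_traffic(rate, media, n_calls):
        """Stub: offer n_calls session attempts at `rate` sessions/sec with the
        media characteristics in `media`, returning True only if every attempt
        succeeded (zero failures). In the real harness this wraps SIPp."""
        raise NotImplementedError

    def find_session_establishment_rate(r=100.0, n=5000, G=5.0, media=None):
        """Search for the largest zero-failure session attempt rate (SER),
        following the ramp-and-bisect shape of Algorithm 1."""
        r_ok = None    # r'  : last rate that completed with zero failures
        r_fail = None  # r'' : lowest rate observed so far to produce a failure
        while True:
            if send_traffic(r, media, n):
                r_ok = r
                if r_fail is None:
                    r = r + 0.5 * r              # no upper bound yet: ramp up by 50%
                elif (r_fail - r_ok) > 2 * G:
                    r = r + 0.5 * (r_fail - r)   # move back up toward the upper bound
                else:
                    return r_ok                  # bracketed to within the granularity G
            else:
                r_fail = r                       # found (a new, lower) upper bound
                # back off halfway toward the last success (0 if nothing has succeeded yet)
                r = r - 0.5 * (r - (r_ok if r_ok is not None else 0.0))

In the complete methodology [6], the rate found here becomes the starting rate for the longer run of N = 50,000 requests; if that run cannot complete without failures, the rate is backed down by the calibration amount C and retried.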
VI. EXPERIMENTAL TEST BED

The test bed, shown in Figure 3, consists of five elements: a switch, hosts generating SIP traffic (with and without RTP), a host to act as the DUT and a host to collect traffic using Wireshark. The test bed is on a private LAN isolated from the Internet, and a mirror port is configured on the switch to capture traces of the calls in the load without contributing to the resource consumption of any of the functional elements of the test.

Figure 3: Physical architecture of the testbed

The SIP traffic is generated by SIPp [13], an open source SIP traffic generator. One host acts as the SIP UAC and another host acts as the SIP UAS. A third host runs the open source Asterisk server [14] and acts as the DUT. The UAC is an Intel Core 2 6420 with a 2.13 GHz clock and 4 GB of memory; the UAS and the DUT run on Intel Core 2 6320 machines with a 1.86 GHz clock and 4 GB of memory. The Wireshark trace host is a MacBook Pro with a 2 GHz Intel i7 and 8 GB of memory.

VII. TESTING THE TEST HARNESS

Before testing any DUT, it is necessary to discover the SER of the test harness itself. The harness includes the Bash test script, the SIPp load generator, the platforms upon which these run, and the switch that connects them all. Use of a slow machine to run the SIPp elements, for example, would limit the rate at which calls are generated, so that carrier-grade DUTs that operate at or near line speed might not be testable using a test harness unable to generate the load necessary to produce failures.

Following the methodology in [6], the SER of the test harness was measured with and without RTP associated with the SIP calls. The algorithm of Section V is a two-step process: a test with a short duration is run and a candidate SER is found. That candidate becomes the starting rate for a much longer duration test. The duration of the first test is the length of time it would take to attempt 5000 calls; the second test lasts the length of time it would take to attempt 50,000 calls. SIPp, the software used to implement our tests, offers two ways to end a single test run: in one, the test ends after a certain number of call attempts; in the other, the test ends after a fixed time has elapsed. The methodology described in [6] uses the first method. The tests described here use the second one.
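As a concrete illustration of a single pass, a count-based test run can be delegated to SIPp along the following lines. This Python sketch is ours, not the Bash script from [7]; the SIPp options shown (-sn uac for the built-in UAC scenario, -r for the call rate, -m to stop after a given number of calls, -d for the call hold time in milliseconds) are standard SIPp flags, but the addresses, scenario and success criteria of the actual harness differ, and a time-based run would instead be bounded externally (for example with the coreutils timeout command).

    import subprocess

    def send_traffic(rate, n_calls, dut_addr, hold_time_ms=9000):
        """One pass of the benchmarking loop: ask SIPp to place n_calls calls
        toward dut_addr at `rate` calls/sec, each held for hold_time_ms.
        Returns True if SIPp reports that every call succeeded."""
        cmd = [
            "sipp", dut_addr,
            "-sn", "uac",              # built-in UAC scenario (INVITE, ACK, pause, BYE)
            "-r", str(rate),           # session attempt rate, in calls per second
            "-m", str(n_calls),        # stop once this many calls have been placed
            "-d", str(hold_time_ms),   # call hold time, in milliseconds
        ]
        result = subprocess.run(cmd, capture_output=True, text=True)
        # SIPp exits with status 0 when all calls were successful; we treat any
        # other exit status as "at least one failure" for the purposes of the search.
        return result.returncode == 0

    # Hypothetical example: 100 calls/sec, 5000 calls, DUT listening on 192.0.2.10:5060.
    # ok = send_traffic(100, 5000, "192.0.2.10:5060")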
A. Testing the test harness without RTP

First the test harness was tested using SIP calls with no associated RTP. All SIP calls were set to last for 9 seconds and a granularity of 3 was used (G = 3, cf. Section V). Test results are displayed in Figure 4 and analyzed below.

Figure 4: Test harness performance without RTP (breakpoint on y-axis is SER)

1) Impact of the value of the granularity parameter: Comparing the results in Figure 4 for different granularities, we see that for a 90-second test duration, a granularity of 1 produced an SER of 2,874; a granularity of 2 produced an SER of 2,623; and a granularity of 3, an SER of 2,740. The granularity of 1 produced the highest SER. This is to be expected, because a granularity of 1 produces an SER that is only one sps lower than the rate at which the first failure occurs. Note, however, that the granularity of 2 produced a lower value than the granularity of 3, a result we cannot yet explain.

Tests with a higher granularity find an SER more quickly than those with a granularity of 1. Testing to determine the SER takes a long time. Each individual test loop of the algorithm lasts a fixed time, and the number of iterations for a long test-duration can cause the test process to last more than 24 hours. As an example, if the test-duration is set to 20 minutes and the iterative process is begun using a call rate of 100 calls per second, the first pass through the loop lasts 20 minutes. If no errors occur, the rate is increased to 200 calls per second and the second pass through the loop lasts another 20 minutes. After another 20 minutes, the rate is again increased to 400 calls per second, and so on. It takes many 20-minute periods to reach the 2,855 SER recorded in the table. Even though the tests used a granularity of 3 and a relatively short test-duration, the time to obtain the SER was greater than 6 hours.

2) Impact of longer test-durations: It is expected that longer test-durations will produce lower SERs. This is because the longer the DUT sustains a call load at any rate, the more likely it is that a failure such as a SIP time-out will occur. The DUT's operating system as well as its code and hardware are stressed, and the time to perform memory management and other functions will accumulate, making delays and eventual failures more likely the longer the test runs. The data show that for a granularity of 1 and test durations of 30 s, 60 s and 90 s, the SERs were respectively 3,941 sps, 3,467 sps and 2,874 sps. Testing with granularities of 2 and 3 for the same series of test durations produced a similar steep decline in the value of the SERs. Tests for a granularity of 3 were conducted using an extended series of test-durations: 5-, 10- and 20-minute test durations were attempted and the SERs produced were 2,799, 2,803 and 2,855 respectively. The slight rise in values may be attributed to the non-deterministic nature of the operating systems of the platforms on which the SIPp applications ran.
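To see why the search is time-consuming, the toy calculation below (our own illustration with hypothetical numbers, not measured data) counts how many fixed-length passes the ramp-and-bisect search needs before it brackets an assumed SER to within the granularity G; at 20 minutes per pass, as in the example above, even a modest number of passes amounts to several hours.

    def passes_to_find_ser(true_ser=2855.0, r=100.0, G=3.0):
        """Count loop iterations of an Algorithm 1 style search against an idealized
        DUT that fails at any rate above true_ser (hypothetical value)."""
        r_ok, r_fail, passes = None, None, 0
        while True:
            passes += 1
            if r <= true_ser:                      # this pass completes with zero failures
                r_ok = r
                if r_fail is None:
                    r = r * 1.5                    # still ramping: +50% per pass
                elif (r_fail - r_ok) <= 2 * G:
                    return passes                  # bracketed to within the granularity
                else:
                    r = r + 0.5 * (r_fail - r)
            else:                                  # this pass produces failures
                r_fail = r
                r = r - 0.5 * (r - (r_ok if r_ok is not None else 0.0))

    if __name__ == "__main__":
        p = passes_to_find_ser()
        print(p, "passes; at 20 minutes per pass, roughly", round(p * 20 / 60, 1), "hours")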

The conclusion that can be drawn from these data is that the test harness can deliver a load of around 2,700 sps when there is no associated RTP and when the call duration is set to 9 seconds. This means that a system that can support a call rate higher than 2,700 sps cannot be tested with this harness.

B. Testing the test harness with RTP

The next
