The Lync “Certified” SIP Trunk that wasn’t…

This is a true story of a recent ‘struggle’ I had with a SIP Trunk Provider. It is a re-enactment of my troubleshooting, written in the style of a whodunnit. I’ll try to keep you guessing to the end.

“Ladies and gentlemen: the story you are about to hear is true. Only the names have been changed to protect the innocent.” – Dragnet

I was introduced to a company that was having ‘one-way audio’ issues. (Whatever you are thinking, it’s wrong, but nice try… keep guessing.)

The Lync environment was ruled out, because the problem was seen on multiple deployments with SIP Trunks from the same Service Provider. A Lab with a single Lync Server with its own SIP Trunk was used to reproduce the issue and troubleshoot.

Here’s a run-down of the symptoms.

  • No change to the deployment; it had ‘just started’ recently.
  • One-way Audio on some calls: the remote party (i.e. PSTN caller) always heard the Lync User, but nothing came inbound. Everything worked correctly for the majority of the company’s calls.
  • Affected inbound calls only. ALL outbound calls worked correctly, even to the same ‘affected’ numbers.
  • Placing the call on hold, and then resuming would immediately fix the issue.
  • Or, waiting for anywhere between 10 and 20 seconds (randomly) the audio would start on its own.

The first task was to establish why only certain calls were affected. They only had full details of one confirmed caller to go on: a mobile call from the husband of a member of staff. His calls always exhibited the issue, but other mobile calls worked correctly… It turned out to be ALL calls from the “Three Mobile” network. ALL calls from TalkTalk landlines and from COLT ISDN exhibited the exact same issue too.

Either way, all calls were coming from the same SIP Trunk, bizarre right? I can tell you’re hooked already…

Let’s look at some traffic, starting with the SIP INVITE. Here is a successful call for comparison.

And here’s its evil twin INVITE.

I’ll save you the trouble, they are both ok, and considering the fact that the call rings, and gets picked up in all scenarios… it’s safe to assume there isn’t a problem with signalling.

Let’s now take a look at the incoming SDP, again we’ll start with a successful call for comparison.

The important lines are the connection information c= showing the correct IP Address, and the media descriptor line starting with m=audio, which has a valid port (59404) and protocol (RTP/AVP). The formats offered by the ITSP are 8 = PCMA (aka G.711a), 18 = G.729, 0 = PCMU (aka G.711u), and 101 = telephone-event which is DTMF (RFC 4733 – “RTP Payload for DTMF Digits, Telephony Tones, and Telephony Signals”).

So we’re being offered G.711a as the preferred format, along with G.711u. Both are fully supported RTP audio codecs on the Lync Mediation role, and as expected the call works.
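If you would rather not read SDP by eye, the same lines can be pulled apart in a few lines of Python. This is only a rough sketch: the SDP text below is illustrative rather than the real trace (the IP address is a documentation placeholder and the a=rtpmap line is assumed), and `parse_audio_offer` is a made-up helper name. Only the port and the format list come from the call described above.

```python
# Static RTP payload types that appear in this article (RTP/AVP profile).
STATIC_PAYLOAD_TYPES = {
    0: "PCMU (G.711u)",
    8: "PCMA (G.711a)",
    9: "G.722",
    18: "G.729",
}

# Illustrative SDP only: address and rtpmap line are assumptions,
# the port and format list match the working call described in the text.
EXAMPLE_SDP = """\
v=0
o=- 0 0 IN IP4 192.0.2.10
s=-
c=IN IP4 192.0.2.10
t=0 0
m=audio 59404 RTP/AVP 8 18 0 101
a=rtpmap:101 telephone-event/8000
"""


def parse_audio_offer(sdp):
    """Return the address, port, protocol and ordered formats of the audio offer."""
    address, port, protocol, payload_types = None, None, None, []
    rtpmap = {}
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("c="):
            # c=IN IP4 <address>
            address = line.split()[-1]
        elif line.startswith("m=audio "):
            # m=audio <port> <protocol> <fmt> <fmt> ...
            _, port_text, protocol, *formats = line.split()
            port = int(port_text)
            payload_types = [int(f) for f in formats]
        elif line.startswith("a=rtpmap:"):
            # a=rtpmap:<payload type> <encoding>/<clock rate>
            pt_text, encoding = line[len("a=rtpmap:"):].split(None, 1)
            rtpmap[int(pt_text)] = encoding
    # Prefer an explicit rtpmap entry, fall back to the static table.
    formats = [
        (pt, rtpmap.get(pt) or STATIC_PAYLOAD_TYPES.get(pt, "unknown"))
        for pt in payload_types
    ]
    return {"address": address, "port": port, "protocol": protocol, "formats": formats}


offer = parse_audio_offer(EXAMPLE_SDP)
for pt, name in offer["formats"]:
    print(pt, name)
```

The order of the formats matters, because the first entry on the m=audio line is the sender’s preferred one.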

And now here’s the evil twin again, this is the incoming SDP from the SIP INVITE from a ‘broken’ call.

Visibly different. This time we get an a=rtpmap line for each format; for the static payload types (such as 0, 8 and 9) these are not strictly necessary, although dynamic types such as 113 do need one. Looking again at the m=audio line, we can see that these calls are offering different formats in a different order. Firstly they are offering 9 = G.722, which is nice of them: it offers better audio quality because it’s a wideband codec, compared to G.711 which was designed to only handle the limited frequency range of the old copper POTS lines. The next one is 113 = AMR-WB (aka G.722.2), a wideband codec of similar quality to G.722 but using totally different compression technologies. AMR-WB stands for Adaptive Multirate Wideband. It allows for higher compression and therefore lower bitrates, and can also alter its compression and bitrate to suit varying network congestion.

Basically, calls on the incoming trunk that originate from the offending providers (Three, TalkTalk, COLT) all offer better audio quality first, but that’s no use to us as the Lync Mediation role doesn’t support them.

Could that be the issue? Wrong codec negotiation??? Nope, we still get 8 = PCMA (G.711a), and 0 = PCMU (G.711u) in the media descriptor line of the SDP, so both ends can agree on a common codec. And I prove this later.
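To make the ‘common codec’ point concrete, here is a minimal sketch of the selection logic. It assumes the answering side simply walks the offer in order and picks the first format it supports. `first_common_format` is a hypothetical helper, the supported set only lists the two G.711 variants mentioned in this article, and the offer list is abbreviated to the formats named in the text (their exact order after 113 is an assumption).

```python
# Formats the article says the Lync Mediation role can use on this trunk.
# (Assumption: only the two G.711 variants are listed here.)
SUPPORTED_BY_MEDIATION = {8, 0}  # 8 = PCMA (G.711a), 0 = PCMU (G.711u)


def first_common_format(offered, supported):
    """Return the first offered payload type that is also supported, else None."""
    for payload_type in offered:
        if payload_type in supported:
            return payload_type
    return None


# Abbreviated 'broken' offer: wideband codecs first, G.711 further down.
broken_offer = [9, 113, 8, 0]
chosen = first_common_format(broken_offer, SUPPORTED_BY_MEDIATION)
print("negotiated payload type:", chosen)
```

Even with the wideband codecs at the front of the list, a common format is still found, which is why codec negotiation could be ruled out.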

Hold and Resume

Now let’s look at what happens when the call is put on hold and resumed. The SIP INVITE shows nothing unusual again, and the call does indeed get put on hold and is also resumed. As we said earlier, signalling is working.

Here is the hold SDP from a ‘broken’ call, sent from the Lync side.

Lync reiterates the media type, port, protocol, and format for its current audio stream for this SIP Session on the m=audio line. The important part here is a=inactive: basically, the stream is going to stop. This SDP is saying that while the details of the stream are staying the same, the other end shouldn’t expect to receive any actual RTP packets (and shouldn’t send any either).

Here’s the corresponding SDP from the body of the SIP OK message from the ITSP.

Here we see it’s sending back a=inactive, because there’s no point sending audio whilst the call is on hold. So both sides are confirming that they have temporarily stopped transmitting, and are both not expecting to receive anything.

Now let’s look at the ‘un-hold’ messages. At the point where the call resumes, the audio instantly works. Are there any clues here?

Here we see a=sendrecv, saying that the media stream should now send and receive packets once again. All RTP details including ports and formats are the same as before, nothing new here…

And the ITSP replies with a SIP OK message containing an SDP with the same m=audio line as the original SDP from the very first SIP INVITE… showing the same port, same protocol, and same audio formats (not ALL formats, but the ones that match the previous SDP, because we’re not negotiating any more, there’s no point adding all those other formats and codecs). This now shows a=sendrecv confirming that RTP media flow is being sent, and is also expecting to receive audio too.
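The hold and resume exchange boils down to one attribute changing. As a rough illustration, the snippet below pulls the direction attribute out of an SDP body. The two SDP fragments are hypothetical (the port and formats are made up), and `media_direction` is my own helper name. The fallback to sendrecv when no direction attribute is present follows the SDP specification’s default.

```python
DIRECTION_ATTRIBUTES = ("sendrecv", "sendonly", "recvonly", "inactive")


def media_direction(sdp):
    """Return the first direction attribute found in the SDP text."""
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("a=") and line[2:] in DIRECTION_ATTRIBUTES:
            return line[2:]
    # With no direction attribute at all, sendrecv is the assumed default.
    return "sendrecv"


# Hypothetical fragments: same stream details, only the direction differs.
HOLD_SDP = "m=audio 50000 RTP/AVP 8 101\na=inactive\n"
RESUME_SDP = "m=audio 50000 RTP/AVP 8 101\na=sendrecv\n"

print("hold:  ", media_direction(HOLD_SDP))
print("resume:", media_direction(RESUME_SDP))
```

Nothing else about the stream changes between the two, which matches what the traces showed.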

Which is all correct so far, no issues… It’s also not lost on me that I’ve just made you look at several valid SIP and SDP messages. But it’s all part of the story. It wouldn’t be as fun or interesting if I just told you the culprit right at the top.

So, why does this work AFTER hold-resume, if it’s nothing to do with SIP nor SDP? I needed to dig deeper…

To double check the ITSP was negotiating correctly, and sending the right RTP codec to Lync – Wireshark came to the rescue, it confirmed…

  • RTP Audio is using the correct port in both directions according to the SDP.
  • PCMA is being used in both directions and is flowing from the very start of the call. It was even possible to play back both sides of the conversation.
  • RTP Audio packets stop being transmitted whilst the call is put on hold, and start again when the call is resumed.
  • The same ports and PCMA codec are used when the call is resumed.

Time to ramp up Lync Logging, as no errors were captured from SIPStack or S4 at any level (as expected really, like I’ve said, signalling is correct). So by adding….

  • MediaStack_RTP
  • MediaStack_RTCP

I finally saw something interesting… lots of yellow warnings…

SSRC:0x<number here> PT:8 packet dropped, SSRC in BYE state

[Image: Lync log trace showing the ‘SSRC in BYE state’ warnings]

Every RTP packet received, whilst the call was having the one-way audio issue, was actually getting dropped by the Lync Mediation role… intentionally.

RTP and RTCP

To explain… Each and every RTP packet has a header that describes the payload including format used, and other important details. One of those items is an SSRC (Synchronisation Source), a unique identifier used to identify the stream that each RTP packet belongs to.
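For the curious, the SSRC and payload type sit at fixed positions in the 12-byte RTP header, so they are easy to extract by hand. A small sketch, assuming a plain RTP packet with no special handling for header extensions or CSRC entries; the sample packet is built in the code (the SSRC value and payload bytes are made up) and `parse_rtp_header` is a hypothetical helper.

```python
import struct


def parse_rtp_header(packet):
    """Decode the fixed 12-byte RTP header at the start of a packet."""
    if len(packet) < 12:
        raise ValueError("RTP packet is shorter than the fixed header")
    # byte 0: version/padding/extension/CSRC count, byte 1: marker + payload type,
    # then 16-bit sequence number, 32-bit timestamp, 32-bit SSRC (network order).
    first, second, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": first >> 6,
        "marker": bool(second >> 7),
        "payload_type": second & 0x7F,
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


# Made-up sample: version 2, payload type 8 (PCMA), 160 bytes of audio.
sample_packet = struct.pack("!BBHII", 0x80, 8, 1, 160, 0x1234ABCD) + b"\xd5" * 160
header = parse_rtp_header(sample_packet)
print("PT:%d SSRC:0x%08X" % (header["payload_type"], header["ssrc"]))
```

The PT:8 in the log warning above is this payload type field, i.e. PCMA.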

RTCP accompanies the RTP stream to provide regular updates with information on the media being sent which can be used for diagnostic and reporting purposes. A Receiver Report lets the other end know how many packets it received. A Sender Report lets the other end know how many packets it sent. A Sender Report is also used for a device that is sending AND receiving. So it’s common to see both ends only transmitting Sender Reports to each other.

A particular type of RTCP message is a ‘Goodbye’. This is intended to let the remote party know that the stream referenced has effectively come to an end, and therefore it’s safe to drop the packets that match the SSRC ID, rather than wasting time processing late packets etc.

SSRC in BYE State

Back to Wireshark for the finale… This time filtering only SIP and RTCP messages so you can see the order the conversation happens in.

[Image: Wireshark capture filtered to SIP and RTCP, showing the RTCP Goodbye]

You can see here, caught red-handed, that as soon as the SIP INVITE is dealt with and early media starts, the ITSP sends us an RTCP Packet telling us that the SSRC is no longer active.

[Image: Wireshark detail of the RTCP Goodbye packet]

The final part of the Goodbye is reserved for a ‘reason code’, and it even says in plain text ‘session stopped’. Go figure.
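The Goodbye packet is simple enough to decode without Wireshark. The sketch below follows the RTCP BYE layout from RFC 3550 (packet type 203, a list of SSRCs, then an optional length-prefixed reason). It handles a standalone BYE only, whereas on the wire it normally travels inside a compound RTCP packet. `build_rtcp_bye` and `parse_rtcp_bye` are hypothetical helpers, and the SSRC value is made up.

```python
import struct

RTCP_BYE = 203  # RTCP packet type for Goodbye


def build_rtcp_bye(ssrc, reason):
    """Build a standalone RTCP BYE for one SSRC with a reason string."""
    text = reason.encode("utf-8")
    body = struct.pack("!I", ssrc) + bytes([len(text)]) + text
    body += b"\x00" * (-len(body) % 4)  # pad to a 32-bit boundary
    # 0x81 = version 2, no padding, source count 1.
    # The length field counts 32-bit words minus one (the header word).
    return struct.pack("!BBH", 0x81, RTCP_BYE, len(body) // 4) + body


def parse_rtcp_bye(packet):
    """Return (list of SSRCs, reason text or None) from a standalone BYE."""
    if len(packet) < 4:
        raise ValueError("RTCP packet is too short")
    first, packet_type, length_words = struct.unpack("!BBH", packet[:4])
    if first >> 6 != 2 or packet_type != RTCP_BYE:
        raise ValueError("not an RTCP BYE packet")
    source_count = first & 0x1F
    end = 4 * (length_words + 1)
    offset = 4 + 4 * source_count
    if end > len(packet) or offset > end:
        raise ValueError("RTCP BYE length fields are inconsistent")
    ssrcs = list(struct.unpack("!%dI" % source_count, packet[4:offset]))
    reason = None
    if offset < end:
        reason_length = packet[offset]
        raw = packet[offset + 1:offset + 1 + reason_length]
        reason = raw.decode("utf-8", "replace")
    return ssrcs, reason


bye_packet = build_rtcp_bye(0x1234ABCD, "session stopped")
ssrcs, reason = parse_rtcp_bye(bye_packet)
print(["0x%08X" % s for s in ssrcs], reason)
```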

In a nutshell

So there it is… plain as day. As soon as the call is initiated, and media begins to flow, the very first RTCP packet Lync receives from the ITSP is an RTCP Goodbye Message specifying the SSRC ID, which is the same one it goes on to use for its RTP audio stream. It is basically letting Lync know, in no uncertain terms, that this particular SSRC is no longer active. We can even see that the Source Description (SDES) CNAME item contains the FQDN of the device that created this RTCP Packet.

Why did it work at all?

Basically, any change to the audio stream, i.e. placing the call on hold and resuming, prompts the devices to resend an RTCP Message which contains the SSRC again, but this time no ‘Goodbye’ command, so Lync puts it back into service and starts to accept it, and then all the subsequent RTP packets get processed and make it through to the Lync Client.

The final piece of the puzzle also falls into place: waiting a random time, and doing nothing. RTCP Messages are sent at regular intervals for the duration of the call, but each call picks its timer using an algorithm designed to send enough RTCP packets for both sides to maintain an up-to-date view of how many packets have been sent and received. That lets them work out whether a networking issue may be causing packet loss, so appropriate action can be taken. This interval turns out to be roughly the same as the time it took for the audio to resume on its own on those calls, because that is when the next RTCP Sender Report arrives, saying that the ITSP is sending RTP packets with that SSRC again.
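Putting the pieces together, the behaviour seen in the traces can be modelled with a very small state machine. To be clear, this is a toy model of what the logs suggest, not Lync’s actual implementation: `ToyReceiver` and its methods are invented for illustration, and it assumes a later Sender Report for the same SSRC is what puts the stream back into service.

```python
class ToyReceiver:
    """Toy model: drop RTP for any SSRC that has been marked by an RTCP BYE."""

    def __init__(self):
        self.ssrcs_in_bye_state = set()
        self.delivered = 0
        self.dropped = 0

    def on_rtcp_bye(self, ssrc):
        # Goodbye received: treat the stream as finished.
        self.ssrcs_in_bye_state.add(ssrc)

    def on_rtcp_sender_report(self, ssrc):
        # A fresh report without a Goodbye: accept the stream again.
        self.ssrcs_in_bye_state.discard(ssrc)

    def on_rtp(self, ssrc):
        # Returns True if the packet would reach the listener.
        if ssrc in self.ssrcs_in_bye_state:
            self.dropped += 1
            return False
        self.delivered += 1
        return True


receiver = ToyReceiver()
stream = 0x1234ABCD  # made-up SSRC

receiver.on_rtcp_bye(stream)            # the rogue Goodbye at call start
for _ in range(3):
    receiver.on_rtp(stream)             # audio arrives but is discarded
receiver.on_rtcp_sender_report(stream)  # next report (timer or hold/resume)
for _ in range(2):
    receiver.on_rtp(stream)             # audio now gets through

print("dropped:", receiver.dropped, "delivered:", receiver.delivered)
```

In this model both recovery paths, hold and resume or simply waiting, are the same event: a new RTCP report for the SSRC arriving without a Goodbye.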

This is where the trouble started

Up to now, it didn’t take too long to track down the issue and get the Wireshark and Lync trace evidence together, the bit that took so long was banging my head against my desk whilst dealing with the ITSP, constantly repeating myself to each person I spoke to, and trying to re-explain this issue. Only to be told at every turn, that their SIP captures didn’t show any issues, and that it must be an issue with Lync itself, not their infrastructure. I pointed out that ‘this problem’ was because Lync was following the RFC Standards for RTCP and RTP to the letter.

The whole process hit a low point when the ITSP providing this so-called ‘Lync Certified’ SIP Trunk quite adamantly said… and I quote…

Beyond that there is not a lot we can do. The codecs offered in the SDP are controlled by the network into us, we cannot change these or transcode so we just pass them through

Woah there horsey… You’re providing a Lync Qualified SIP Trunk, and adhering to all the tests and checks that Microsoft stipulates must be followed in order to qualify for full interoperability… yet you have no control over the traffic on your network that’s delivered to Lync… Shocking.

The ITSP went on to explain that they had a ‘SIP interconnect’ directly into BT’s IP Exchange, just like Three, TalkTalk, and COLT. I was told BT operate a sort of dual vendor platform, redundant infrastructures using Genband and Acme Packet devices. Therefore any calls originating from the IP Exchange by any of the directly connected providers on the Genband network took a different route compared to calls that originated on the PSTN and came into BT’s IP Exchange via Acme Packet.

After jumping over the usual hurdles to get to talk directly to BT Global Services, somebody admitted that a recent ‘firmware’ update to the Genband part of their infrastructure had introduced a discrepancy, and that a protocol change had been implemented to resolve it. Of course, this was ‘totally unrelated’ to the issues I saw. A few weeks passed with little to no information from BT. I finally had an update from the ITSP, saying that BT ‘think’ they had found the device(s) at the source of this rogue RTCP Goodbye and would implement a fix that evening. This actually worked: the issue disappeared.

Conclusion

This was possibly the most obscure reason for audio issues I have ever come across.

Whilst waiting for an update from the ITSP, I tested with an Asterisk PBX acting as an SBC in front of Lync, and my colleague tested a trial of TE-Systems’ AnyNode virtual SBC (Lync Certified). Both seemed to ‘ignore’ the RTCP Goodbye message and worked despite the issue from the ITSP.

Don’t have nightmares, and be sure to choose your Lync Certified SIP Trunk Provider carefully…. And always use a Session Border Controller.


About Graham Cropley

Working as a Senior Consultant for Skype for Business, Exchange, and Office 365.

7 Comments

  1. Anthony Caragol

    I had this EXACT issue with a provider in the US. I could never get them to admit fault with the BYE packet and in the end I configured the SBC to only accept G.729 and transcoded it back to G.711 despite the provider “supporting” G.711. This is an excellent write up, thank you!

  2. Hello,

    Firstly thanks for your useful post. We have used SBC and Sip Trunk for 2 years. And everything was ok until last week. We cant hold and transfer calls with Lync server. I catch SIP “413 – Request entity too large” and talk with sip provider. They said to me your sdp content-length sizes are too much for us and reduce them. I don’t know how can i reduce content-length before hold action with lync and sbc. Our sbc is audiocodes sbc 1000.

    thanks

  3. Hi Graham, This is one of the best articles I have ever read, thanks much for sharing such a depth of knowledge and experience.

  4. Great !
    I am having no audio issue with Direct-SIP to twilio
    From my mediation pool behind NAT!
    Wireshark Capture shows destination unreachable (Port unreachable) from twilio media IP!
    Any hint?

  5. Hello Graham,

    Thanks so much for this wonderful write up. I am actually dealing with a different issue where one way audio starts from PSTN to Lync after the call is either transferred or resumed after putting on hold. This is happening on one site only out of around 60 different sites.
    This article has provided an in-depth knowledge on most of the key points when dealing with audio issues. I will post the resolution as well when I find it. Thanks again.

  6. Wow.. I was poking around the Internet for “one-way” or “ghost” calls and stumbled on your nice article. I have troubleshooted my ways through some VoIP issues but never would have considered the RTCP (goodbye in this case) to actual control the “signalling” aspect of a call? That is, I had always thought RTCP was just a companion “reporting” protocol for the session and that it was SIP that would actually control all aspects (i.e. tear a call down) of the calls behavior? Who knew? thanks Keith
