VIP/IP-Pools stops working - ARP issue? 800C HA, A-A, 5.2.13

Author
kallbrandt
Silver Member
  • Total Posts : 86
  • Scores: 18
  • Reward points: 0
  • Joined: 2016/05/21 11:21:05
  • Status: offline
2018/03/09 12:45:08 (permalink)

Hello,
An odd error: a lot of services suddenly went offline yesterday evening at a client's datacenter. Almost nothing involving NAT worked. Most of the VIPs were dead, and the logs are empty. No traffic at all! (There are lots of users, web pages etc., with incoming traffic 24/7.) Failing over to the other firewall makes it work for a while. Same with reboots. Editing a VIP, like changing the public IP and then saving, might make it work for a while. The same goes for IP pools: changing the pool in any way makes it work, for a while. The only outgoing NAT that actually works all the time is the interface address. All virtual addresses are totally unreliable. There is no strange traffic or load of any kind.
 
The ISP has no problems with routing, the prefixes are advertised, and we did a failover to the backup router (VRRP/BGP), which is located in another DC. Same problem. Other vdoms also have internet access and SNAT/DNAT, and they work. Other equipment (VPN concentrator etc.) works flawlessly, so I think the ISP side of things is OK. Switches are OK.
 
execute clear system arp table
 
Did actually work a few times.
 
Any ideas gentlemen? A bit lost with this one...
 
(Will open a high prio case with TAC)

Richie
NSE7
#1

    ericli_FTNT
    Gold Member
    • Total Posts : 125
    • Scores: 4
    • Reward points: 0
    • Joined: 2018/02/08 11:12:27
    • Status: offline
    Re: VIP/IP-Pools stops working - ARP issue? 800C HA, A-A, 5.2.13 2018/03/09 14:16:41 (permalink)
    Hi Richie, a device failure that leaves no logs behind is never good.
     
    Did you double-check the log settings? Do you deploy any central logging device like FortiAnalyzer? Your case is critical for us. Please keep us updated. Thanks!
    #2
    kallbrandt
    Re: VIP/IP-Pools stops working - ARP issue? 800C HA, A-A, 5.2.13 2018/03/09 15:00:18 (permalink)
    Yes, FortiAnalyzer is deployed. We have logs going back some 120 days, but nothing for the VIPs when they go offline. That's why we thought this might be an ISP issue with ARP in the on-premises router. It certainly looks like no traffic is reaching the FortiGate.
    Outgoing NATed traffic shows up as timeouts. It is very weird that it seems to work when you change the IP pool. And it is random: an IP pool that worked earlier might be dead the next time you try, although high numbers in the public /24 we use seem to work better than the low ones. How about that?!
    post edited by kallbrandt - 2018/03/09 15:05:15

    Richie
    NSE7
    #3
    kallbrandt
    Re: VIP/IP-Pools stops working - ARP issue? 800C HA, A-A, 5.2.13 2018/03/09 15:02:00 (permalink)
    Opened a case with TAC. The customer is going to get a bunch of new FortiGates soonish (the 800C cluster is closing in on 5 years), but it would be grand if we could keep the old ones alive for another 4 months or so...
     

    Richie
    NSE7
    #4
    kallbrandt
    Re: VIP/IP-Pools stops working - ARP issue? 800C HA, A-A, 5.2.13 2018/03/11 01:47:05 (permalink)
    Update:
     
    If I set all VIPs to bidirectional NAT (set src-nat-vip enable), they start to work.
    And if I map the non-working IP pools to the interface (set arp-intf xxx), they start to work.
     
    So it is an ARP issue of some sort.

    Richie
    NSE7
    #5
    Antonio Milanese
    Bronze Member
    • Total Posts : 60
    • Scores: 6
    • Reward points: 0
    • Joined: 2012/12/15 06:11:02
    • Status: offline
    Re: VIP/IP-Pools stops working - ARP issue? 800C HA, A-A, 5.2.13 2018/03/11 04:16:58 (permalink)
    Hello Richie,

    I can feel your pain... data plane issues are really nasty, especially when you control the devices on only one side!

    Anyway, from your description it seems to me that the problem lies in MAC table (re)learning and/or ARP going snafu and blackholing your traffic. It may be that:

    1) on the "WAN edge" (switches and indeed the ISP routers) some device does not relearn the FGT interface VMAC for the VIPs outside of the initial GARPs, so when the MAC in the CAM table ages out (typically after 300 s), or the ARP entry ages out (aging time?), you have a blackhole until a new FGT ARP response or a GARP (due to updating the VIP, a failover or a reboot) properly repopulates the MAC and/or ARP tables
    2) the FGT itself is not replying to ARP requests for the VIPs, so on the upstream devices the VIP MAC and ARP entries simply age out
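
To make the aging argument in the two points above concrete, here is a minimal Python sketch of a toy timeline model. It assumes an upstream entry is refreshed only at the given times (by a GARP or an ARP reply) and expires a fixed number of seconds after the last refresh. The function name `blackhole_windows` and every number except the 300 s aging time are illustrative assumptions, not measurements from this cluster.

```python
def blackhole_windows(refresh_times, aging_s, horizon_s):
    """Return (start, end) intervals in which the upstream entry has aged out.

    refresh_times: seconds at which a GARP/ARP reply refreshes the entry
    aging_s:       seconds an entry survives after its last refresh
    horizon_s:     end of the observed timeline, in seconds
    """
    windows = []
    expiry = 0  # the entry is treated as absent until the first refresh
    for t in sorted(refresh_times):
        if t > horizon_s:
            break
        if expiry < t:
            # The entry expired before this refresh arrived: traffic is lost.
            windows.append((expiry, t))
        expiry = t + aging_s
    if expiry < horizon_s:
        windows.append((expiry, horizon_s))
    return windows


# A GARP at boot and another after a config change 1000 s later.
print(blackhole_windows([0, 1000], aging_s=300, horizon_s=1500))
# Periodic refreshes every 200 s keep the entry alive for the whole timeline.
print(blackhole_windows(range(0, 1500, 200), aging_s=300, horizon_s=1500))
```

In this model the outage pattern "works for a while after any change, then dies" corresponds to refreshes that only happen on failover, reboot or VIP edits.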

    This could be for the most disparate reasons:
    1) at the "WAN edge": wrong ARP inspection/check behaviour, continuous premature CAM flushes due to STP TCNs, BUM flooding, proxy ARP, etc.
    2) at the FGT: "internal errors" ^_^ that prevent correct ARP responses for the VIPs

    From your first post I infer that you have a DCI, maybe with stretched VLANs, so here we have another source of potential issues related to the DCI technology (VPLS, VXLAN, OTV, EVPN).

    My humble suggestions are:
    1) take a snapshot (interface packet capture, show arp, show mac) on both sides (i.e. FGT, switches and, if the ISP is cooperative, the edge routers) when things are working and when they are not, and compare the two
    2) try to arp-ping the VIPs when things are not working to see if the problem is on the FGT side
    3) see whether gratuitous-arp-interval != 0 solves/mitigates the problem on the FGT
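
The snapshot comparison in suggestion 1 could be sketched in Python as below. This assumes the ARP tables have already been parsed into `{ip: mac}` dictionaries, since the output format of "show arp" differs per device. The function name `diff_arp` and the 192.0.2.x addresses are hypothetical; the MAC addresses are simply sample values.

```python
def diff_arp(working, broken):
    """Compare two {ip: mac} snapshots and report entries that differ.

    Returns {ip: (mac_when_working, mac_when_broken)}; None marks an entry
    that is missing from one of the snapshots.
    """
    changes = {}
    for ip in sorted(set(working) | set(broken)):
        before = working.get(ip)
        after = broken.get(ip)
        # Compare case-insensitively, as devices print MACs differently.
        if (before or "").lower() != (after or "").lower():
            changes[ip] = (before, after)
    return changes


working = {"192.0.2.10": "00:09:0f:09:64:12", "192.0.2.11": "00:09:0f:09:64:12"}
broken = {"192.0.2.10": "00:09:0F:09:64:12", "192.0.2.11": "00:09:0f:09:64:17"}
print(diff_arp(working, broken))
```

An entry that disappears or changes MAC between the two snapshots points at the device where the table was taken, which narrows down which side to capture on next.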

    Good work and best regards,

    Antonio
    #6
    kallbrandt
    Re: VIP/IP-Pools stops working - ARP issue? 800C HA, A-A, 5.2.13 2018/03/11 05:43:09 (permalink)
    Thank you for your response!
     
    Good suggestions!
     
    The DCI is just regular VLANs, no overlays of any kind. Behaviour is exactly the same on the FortiGates in both DCs.
    One thing we haven't done is restart the core switches that both firewalls are connected to. The core switches are in a virtual-chassis setup, so they behave as one unit. They are L2 only towards the ISP, though. All logs in the core look good, and the rest of the vdoms are working as they should.
     
    Will try to change the gratuitous ARP setting on a few VIPs to see if it changes anything.
     
    Again, thank you!

    Richie
    NSE7
    #7
    Antonio Milanese
    Re: VIP/IP-Pools stops working - ARP issue? 800C HA, A-A, 5.2.13 2018/03/11 07:38:01 (permalink)
    Hello Richie

    I read your update too late..

    so forcing the ARP reply binding to a specific interface seems to resolve the issue..

    the most interesting thing is "set src-nat-vip enable", since it's not directly related to ARP request/response, and it has me scratching my head:
    AFAIK, on the FGT, ARP replies for VIPs are by default sent on the interface the request arrived on, and this is not really enforced by "set extintf"; the IP pool default behaviour is to reply on all interfaces the requests come from. This is handy with hairpinning but easily misleading when you have multiple WANs and a "dumb shared edge segment", or when using SD-WAN on 5.6.x, where I learned the hard way to use "set associated-interface"..
    but on 5.2 the only thing that comes to my mind (well, to my Evernote issues notebook =) is an FGT bug triggered by hairpinning

    http://kb.fortinet.com/kb....do?externalID=FD37124

    ..or for some reason your FGT appears to answer VIP/IP pool ARP requests coming from different interfaces/VLANs, or the upstream devices are learning MAC/ARP entries from interfaces/VLANs other than the expected one, meaning that there is a subtle "BUM flooding leak" under the covers!
    Are you using Cisco gear in the core with VSS and/or vPC at the WAN edge? From time to time I've seen all sorts of strange ARP/MAC bugs (flapping) with VSS/vPC when coupled with non-Cisco LAGs :\

    Just out of curiosity: if on the affected VIPs you revert the "set src-nat-vip enable" and use "set srcintf-filter", do they still work?

    Regards,

    Antonio
    #8
    kallbrandt
    Re: VIP/IP-Pools stops working - ARP issue? 800C HA, A-A, 5.2.13 2018/03/11 13:48:13 (permalink)
    Yes, agreed, src-nat-vip shouldn't really be related to an ARP issue.
     
    No Cisco equipment here, only Alcatel-Lucent 6900/6860 in the core.
     
    Will set a few VIPs/pools back to the original setting during the night and be in very early to test.
     
    Must check out what's going on in the "internet VLAN" with Wireshark first hand.
     
    Again, thank you for your input. Highly appreciated!

    Richie
    NSE7
    #9
    piacas
    New Member
    • Total Posts : 18
    • Scores: 0
    • Reward points: 0
    • Joined: 2012/10/04 18:09:24
    • Status: offline
    Re: VIP/IP-Pools stops working - ARP issue? 800C HA, A-A, 5.2.13 2018/03/12 12:05:39 (permalink)
    Let me know what you find out. I had something similar last night when swapping an ASA for a new VDOM on an A/A pair of 1500Ds. All seemed to work for about 10 minutes, then traffic not on the same IP segment as the inside interface quit accessing the internet.
     
    Could ping everything on the inside, but not the FGT inside IP. The ARP table on the cores looked fine... just couldn't ping the FGT IP. Ended up disconnecting the inside/outside interfaces and putting the ASA back.
     
    Opened a ticket, waiting to hear back. 
    #10
    kallbrandt
    Re: VIP/IP-Pools stops working - ARP issue? 800C HA, A-A, 5.2.13 2018/03/12 13:51:07 (permalink)
    UPDATE:
    Found the fault...
     
    There are several vdoms. The latest one has internet access too, just like the rest. But in this vdom, the VLAN interface answers ARP for EVERYTHING. You can ping just about all the unused addresses in the public /24, and it will answer! Show arp shows nothing. Running arping from a Linux machine on the public subnet shows the same thing: it answers for almost everything!
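
A check like the one described here, finding addresses that answer although nothing is configured on them, could be sketched in Python as follows. It assumes you already have the list of addresses that replied to arping or ping. The function name `unexpected_responders` is hypothetical, and the 192.0.2.0/24 documentation range stands in for the real public /24.

```python
import ipaddress


def unexpected_responders(subnet, configured, responders):
    """Return addresses inside `subnet` that replied but are not configured.

    subnet:     the public range, e.g. "192.0.2.0/24"
    configured: addresses legitimately in use (interface IPs, VIPs, IP pools)
    responders: addresses that answered an ARP probe
    """
    net = ipaddress.ip_network(subnet)
    known = {ipaddress.ip_address(a) for a in configured}
    hits = set()
    for addr in responders:
        ip = ipaddress.ip_address(addr)
        if ip in net and ip not in known:
            hits.add(ip)
    # Sort numerically rather than as strings, so .57 comes before .200.
    return [str(ip) for ip in sorted(hits)]


print(unexpected_responders(
    "192.0.2.0/24",
    configured=["192.0.2.1", "192.0.2.10"],
    responders=["192.0.2.10", "192.0.2.200", "192.0.2.57"],
))
```

A non-empty result means something on the segment is answering ARP for addresses it should not own.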
    I tried to delete the interface, but then the config sync failed... Had to do a factory reset, then delete a bunch of policies in the vdom on the current master, then paste them back in to get sync going again.
     
    But I had to create the interface again due to short maintenance windows, without being able to reboot the master. Back to square one, really: the interface behaves the same way.
     
    The vdom is in heavy use, so I will probably try to set up another physical interface, untagged, instead and see if that works better.

    Richie
    NSE7
    #11
    kallbrandt
    Re: VIP/IP-Pools stops working - ARP issue? 800C HA, A-A, 5.2.13 2018/03/13 08:38:23 (permalink)
    UPDATE: Nothing works...
    Changing physical interface, tagged/untagged... It seems the vdom is fundamentally broken in some way.
    Have an escalated ticket now, but might have to "fix" the issue by deleting the public-facing interface and routing the traffic via another vdom instead.
    post edited by kallbrandt - 2018/03/13 08:43:09

    Richie
    NSE7
    #12
    kallbrandt
    Re: VIP/IP-Pools stops working - ARP issue? 800C HA, A-A, 5.2.13 2018/03/13 13:04:40 (permalink)
    Check this out:
     
    host@nohost:~$ sudo arping 194.xxx.xxx.xxx
    ARPING 194.xxx.xxx.xxx
    60 bytes from 00:09:0f:09:64:17 (194.xxx.xxx.xxx): index=0 time=5.723 msec
    60 bytes from 00:09:0f:09:64:12 (194.xxx.xxx.xxx): index=1 time=5.810 msec
    60 bytes from 00:09:0f:09:64:17 (194.xxx.xxx.xxx): index=2 time=12.204 msec
    60 bytes from 00:09:0f:09:64:12 (194.xxx.xxx.xxx): index=3 time=12.299 msec
     
    :64:12 is the interface with the actual IP address set, :64:17 is the baaad interface. ARP-poisoned by your own FortiGate, basically.
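
Spotting this kind of duplicate answer across many probed addresses can be automated. The Python sketch below parses arping output and reports every IP answered by more than one MAC. It assumes reply lines look like the ones pasted above ("60 bytes from <mac> (<ip>): ..."); other arping implementations print a different format. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches reply lines such as:
# "60 bytes from 00:09:0f:09:64:17 (194.xxx.xxx.xxx): index=0 time=5.723 msec"
REPLY_RE = re.compile(r"bytes from ([0-9A-Fa-f:]{17}) \(([^)]+)\)")


def macs_per_ip(arping_output):
    """Map each probed IP to the sorted list of MACs that answered for it."""
    seen = defaultdict(set)
    for line in arping_output.splitlines():
        match = REPLY_RE.search(line)
        if match:
            mac, ip = match.groups()
            seen[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in seen.items()}


def conflicting_ips(arping_output):
    """Keep only the IPs that more than one MAC claims to own."""
    return {ip: macs for ip, macs in macs_per_ip(arping_output).items()
            if len(macs) > 1}


sample = """ARPING 194.xxx.xxx.xxx
60 bytes from 00:09:0f:09:64:17 (194.xxx.xxx.xxx): index=0 time=5.723 msec
60 bytes from 00:09:0f:09:64:12 (194.xxx.xxx.xxx): index=1 time=5.810 msec
"""
print(conflicting_ips(sample))
```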

    Richie
    NSE7
    #13
    Antonio Milanese
    Re: VIP/IP-Pools stops working - ARP issue? 800C HA, A-A, 5.2.13 2018/03/14 02:18:33 (permalink)
    Hi Richie,
     
    odd to say the least..
     
    Are all vdoms in NAT mode? Maybe an overlooked/missed/unused IP pool is overlapping? This is the most ordinary hypothesis that comes to my mind..
     
    By the way, if it's confirmed as a bug, can you help us understand how it is triggered and what your vdom topology is (using hybrid vdoms, half-numbered, etc.)?
     
    Thanks,
     
    Antonio
     
    #14
    romanr
    Platinum Member
    • Total Posts : 903
    • Scores: 26
    • Reward points: 0
    • Joined: 2004/06/08 08:29:56
    • Location: Vienna/Austria
    • Status: offline
    Re: VIP/IP-Pools stops working - ARP issue? 800C HA, A-A, 5.2.13 2018/03/14 02:59:47 (permalink)
    Hey,
     
    according to those MAC addresses, both ARP replies come from the root VDOM of your cluster.
     
    I'd run a "diagnose debug report" into a text file and look for any run-time references to this IP and these MAC addresses... Maybe this will give you a hint.
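
Searching a saved report for those references could look like the Python sketch below. The report layout is not documented here, so the code makes only one assumption: that MACs appear as six colon- or hyphen-separated hex pairs. The function name `find_references` and the sample lines are hypothetical.

```python
import re

MAC_RE = re.compile(r"(?:[0-9a-f]{2}[:-]){5}[0-9a-f]{2}", re.IGNORECASE)


def _normalize(mac):
    """Lower-case a MAC and drop separators so formats compare equal."""
    return re.sub(r"[^0-9a-f]", "", mac.lower())


def find_references(report_lines, ip, mac):
    """Return (line_number, line) pairs that mention the IP or the MAC."""
    # Lookarounds stop 192.0.2.10 from matching inside 192.0.2.100.
    ip_re = re.compile(r"(?<![\d.])" + re.escape(ip) + r"(?!\.?\d)")
    wanted = _normalize(mac)
    hits = []
    for lineno, line in enumerate(report_lines, start=1):
        macs_here = {_normalize(m) for m in MAC_RE.findall(line)}
        if ip_re.search(line) or wanted in macs_here:
            hits.append((lineno, line.rstrip("\n")))
    return hits


report = [
    "ip=192.0.2.10 mac=00:09:0F:09:64:17 intf=port1",
    "ip=192.0.2.100 mac=00:09:0f:09:64:12",
]
print(find_references(report, "192.0.2.10", "00:09:0f:09:64:17"))
```

For a real report, pass the file object itself as `report_lines`, since iterating over an open text file yields its lines.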
     
    Br,
    Roman
    #15