Long-Awaited Outages Survey Is Here!

Woohoo, the results are in! Yes, they are a year late, and I will spare you all the list of excuses.

For some, responses may have changed since they took this survey. Personally, what matters most to me is transparency: what went wrong and the lessons learned, which serve as great examples of how not to do things. Reading through the summary, what caught my attention was that 58% responded that they almost never perform redundancy checks on their infrastructure. I guess they enjoy living on the bleeding edge ;-)

I understand there are pros and cons to this argument, but what do you think? What would you do? Would you respectfully challenge your management to be more transparent? Or do you have network engineering tasks, on-call duty, lack of sleep & internal politics to worry about rather than participating in the outages mailing list?

Let it out in the comments section, but please be respectful and keep an open mind.

These aren’t issues we will solve immediately. They take time to build and they will ebb & flow. But as you diligently pursue staying on top of them, you will be locking in that legacy you desire for others to

Without further ado, I present to you the survey results.

What topics would you like to see presented and/or discussed during the meet-up?

  1. The massive dearth of tools and documented processes for web operations to troubleshoot or investigate client-side issues.
  2. Issue mitigation, redundancy planning and testing
  3. Outage mitigation techniques, new-world monitoring (IE – some whiz-bang fully automated open-source monitoring solution for dynamic/cloud environments), lessons learned.. all of it sounds good.  ;)
  4. Tools, monitoring specifically related to infrastructure
  5. Philosophies and approaches to event management, information exchange, and perhaps even prevention. 
  6. Emergency comms, backup plans, interaction with local gov
  7. What you listed is a good start.  I would love to be able to tap into the knowledge of others who run Metro networks and not just a few WAN services.
  8. Tools and lessons learned as well as relationship management with ISP and network providers
  9. Network Outages.
  10. Wireless
  11. Tools, monitoring
  12. Monitoring tools both network and server based
  13. Anything – I can learn from it all (but am quite afraid to contribute).
  14. Triage of incidents.. Network monitors.
  15. Notification of outages, standard procedures for avoiding outages ;-)  
  16. Looking glass sharing.
  17. Open source tools & monitors
  18. Let’s all get along and be more usefully transparent? uhh…
  19. Existing tools and resources that others find useful.  Discussions on monitoring tools and techniques.
  20. Information exchange, collaboration on central resources, organization of outages.org and ways to participate.
  21. Any :) Note I chose yes above, but obviously some form of web-based is useful for those not around NA (not ruling out BoF along NANOG as useful also. More the merrier IMHO)
  22. Network monitoring tools and best practices.
  23. Low cost assessment and monitoring tools to help small business IT personnel self address connectivity impact upstream, open source where possible. 
  24. Network status monitoring.  Application monitoring.  Lessons learned sessions would be immensely helpful.
  25. Root cause and avoidance of major network and systems outages.
  26. Dissemination of information / real time communication
  27. Standardized reporting. If outage reports are standardized, then you can start getting info out even faster, with things like Twitter. I’m not talking about complicated XML either. Just fields that can be parsed easily:
    Provider:
    Relevant IPs:
    If we could get that, automatic confirmation of an outage might be possible.
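
To make that last suggestion concrete, here is a minimal sketch of what "fields that can be parsed easily" could look like in practice. Only the `Provider` and `Relevant IPs` field names come from the response above; the comma-separated IP format, the function names and the matching rule (same provider plus overlapping address ranges) are assumptions made up for illustration, not anything outages.org has defined.

```python
import ipaddress


def parse_outage_report(text):
    """Parse 'Field: value' lines into a dict keyed by lowercased field name."""
    report = {}
    for line in text.splitlines():
        # Skip blank lines and anything that is not a 'Field: value' pair.
        if ":" not in line:
            continue
        # Split on the first colon only, so IPv6 values stay intact.
        field, _, value = line.partition(":")
        report[field.strip().lower()] = value.strip()
    return report


def relevant_networks(report):
    """Return the 'Relevant IPs' field (assumed comma-separated) as networks."""
    raw = report.get("relevant ips", "")
    return [
        ipaddress.ip_network(item.strip(), strict=False)
        for item in raw.split(",")
        if item.strip()
    ]


def same_outage(a, b):
    """Hypothetical rule: same provider and at least one overlapping network."""
    if a.get("provider", "").lower() != b.get("provider", "").lower():
        return False
    return any(
        x.version == y.version and x.overlaps(y)
        for x in relevant_networks(a)
        for y in relevant_networks(b)
    )
```

Two independent reports that name the same provider and overlapping prefixes would then count as a rough automatic confirmation; a real scheme would likely also need a timestamp and some protection against bogus reports.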

What works well and what should be improved within OUTAGES.ORG? Do you have a suggestion?

  1. If there was a status page for known and confirmed issues, that would be excellent, or some other aggregation method from the various Internet “health” boards that are linked to from the wiki.
  2. I like the useful nature of posts to the list, but certain things that have very limited business impact such as facebook and twitter should be administratively limited or blocked entirely.  There is no use to seeing 100 posts on this list as well as NANOG that say “OMG I can’t tweet.”  I’m interested in major T1/T2 carrier outages that affect multiple sites, not single consumer applications.  If I want to know if facebook or any other specific site is down, I’ll check downforeveryoneorjustme.com.
  3. “Discussion. People need to STFU and only report facts, I’m sick of hearing about how shitty XO is. I wish that we could get official announcements from 1 place. outages.org should reach out to ISPs to get official notices.”
  4. Clear definitions for new posters regarding “reportable” outage events, e.g. a few T1s down in Podunkville, US is not a reportable event, while all of a tier 1 carrier’s T1s down in a metro area is.
  5. more input from providers would be nice…
  6. More participation from the internet community.  It is a great resource when the information is posted, but many outages are not on the list.
  7. “Separating discussion from the announcements is good.
    Moderation is appreciated. It would be nice to have governmental entity input in regards to other critical infrastructure issues. Get broader involvement — there’s a lot of small to medium-sized operations that have people that don’t care/don’t know.”
  8. It’s been great for me so far – when I notice oddities, I’ll often see an email to outages@ mentioning the same thing, and then a continuing conversation which generally concludes with the root issue. It would be great if we could get more of the conversation of outages off NANOG to Outages though.. but that’s human management. ;)
  9. Maybe separate lists into:
    – Looking for additional info on a problem
    – Confirmed outages, either by multiple sites or major vendor master ticket
    – Discussion about an outage
  10. Leave it as it is, let it grow on its own.  Do not try to govern it.
  11. I find the chatter is mostly US based. Very little European news.
  12. Would be better with more content that’s more up to date. Would probably pay for up-to-date info.
  13. It’s great hearing about the larger outages so that when people ask me why something is slow, I have something to rely on. However, I’d be grateful if it had more people and a much larger scope – encompassing smaller outages as well.
  14. Notices come promptly and moderation is there when needed.  Would help if users would provide more detail about their outage and not so many generalized statements such as ….Is Sprint having a problem LOL
  15. The honest reporting of issues is very helpful.  I think it works well.  I’d like to see more providers posting information.
  16. The focus is mostly on US.  Wider focus would be appreciated
  17. An outages widget would be great, if it would fit in a little corner of the projector screens at my NOC, and I assume at many NOCs too.
  18. Limit excessive chatter on mailing list, especially after hours. Outages.org mailing list goes to my NOC pages, so sometimes it’s annoying when people are sending a lot of email not particularly relating to the outage.
  19. Inclusion of WIRELESS outages
  20. A web GUI would be nice to launch an outage notice, so the emails received have some standardization
  21. What works well for me is low volume, highly valuable content.
  22. Mail postings come too late.  Seems to be a long latency between the time people send a posting, and the time I receive it in my email
  23. I’m not one to rock the boat – I like what I experience on the list. Informative, sometimes chatty, but 99% quality stuff – I can usually get answers here whereas our WAN carriers normally don’t answer questions.
  24. It may be a personal preference, but a little less Chat and more just problems and resolutions on the outages list.
  25. Getting news out quickly.. LOW SNR.
  26. Making it more of an announcement list.  The threads of ‘does this work for you’ and twenty people replying is fairly pointless…
  27. Tell more people/ get more input
  28. There should be a larger effort amongst participants on the mailing list to provide some more useful information about problems seen.  Most commonly people are saying “foo seems down for me” but only a fraction of the time do they provide any information such as what ISP and geographic location they’re trying from, or ping vs application failures, or traceroutes, or anything.  Having some kind of suggested standard of what information should be posted would make the threads much more valuable from the start instead of this information needing to be requested in replies.  This suggestion is intended for the outages mailing list itself, for initial announcements or “down/up for me too” replies only.  Anything else IMO should take place on the outages-discuss mailing list, which seems to be way underused at this point.
  29. I could list annoyances but I have no solutions at the moment (:
  30. Overall outages works well.  I wish there was more information on outages related to connectivity and data center incidents.  Building some network maps (logical or physical) would also be beneficial to me at least.
  31. I do not use the website, just the mailing lists.
  32. Not a lot of local outages posted, but that’s improving over time. In general, a good tool and something that should be continued and will hopefully improve organically as more people join.
  33. perhaps evolve towards the Bugtraq / Full Disc model, where as well as independent researchers, vendors post official advisories. For Outages-l, that would mean getting SPs and utilities to start routinely announcing outages on the list, which might take a long time to happen, and might also be too noisy to be useful, but, well you asked for ideas :)
  34. I make much use of the outage wiki and posted resources when working with my small-business clients in determining Outage and communications issues. If possible perhaps official post-mortem reports from the ISPs would be helpful (though it’s understood these are rarely ever made public)
  35. How about a ‘current known outages’ on your website?
  36. We enjoy the broad overview of outages that could affect the Internet in general. I would like to see more service provider planned maintenance listed.
  37. “Some form of template system.  Reporting outages is good, but automating anything from the emails is hard. It would obviously require a human, but maybe another list that only distributes templated email would be handy.”
  38. Open communication of known issues or queries if potential issues exist.
  39. “There is very little content regarding trans-atlantic, european or basically non-north american outages. The idea is great though.”

What value does OUTAGES.ORG (mailing list & wiki) provide you and your organization?

  1. Confirming anecdotal issues with our own observations
  2. Finding widespread outages before we may have had reports of them
  3. It gives me a broad overview of issues that may affect my customers’ ability to reach the content they are looking for.  (It is another tool for troubleshooting)
  4. Keeps us informed of outages so we can be responsive to our customers
  5. Early warning
  6. Know when major events that affect users are going on, i.e., a major website is down, or circuits are down in an area
  7. While I mostly manage a private worldwide MPLS VPN, we also host many public Internet facing app portals for our partner companies and agencies.  Outages.org lets me correlate known large scale problems to trouble reports I receive.  Also it helps with correlation of web site reachability tickets, i.e. outages.org discussions give me a heads-up when google/facebook/amazon etc. is down so then I know it’s not an issue with my firewalls, Internet routing, etc.
  8. First-responder information for us to consider on behalf of our fleet of remote users. Secondly, supplemental information on the condition of our perimeter communications.
  9. Reduced troubleshooting time for issues outside our network. Ability to respond to customer complaints more effectively.
  10. I have not used the wiki much yet, but the mailing list is great for confirming issues – see my note above.
  11. Lets me know what is happening and where on the net…
  12. Awareness of issues.
  13. It helps me know when there are problems with traffic & routing so I can pass it on to users.
  14. For such a vast user base this has been a great place to keep up with things outside our network that may affect our users
  15. We provide services to our customers via the internet.  The outages list helps me provide the NOC and operations teams with information that helps us determine root cause for customer experience issues.
  16. Quick knowledge of issues to match with customer complaints
  17. Confirmation of issues affecting our customers easing troubleshooting required, as well in some cases a heads-up so our support team is ready.
  18. Short-circuits the troubleshooting process – no need to expend resources tracking down issues if they are caused by an upstream provider outage
  19. So far I have had only one outage occur that we were part of, so at this point the value is TBD, but the potential is there
  20. I know when large outages occur and can act accordingly, cuts down on troubleshooting time.
  21. I get the best information about outages to networks from trusted, knowledgeable sources. I hope to be able to leverage that someday for social good on a global scale. (Only after we build a partnership and start the discussion.) I think that the Outages community could offer value to exercise24.org and other crisis management/crisisdata events. You are technical experts and they need to leverage this knowledge.
  22. The mailing list keeps me informed about current outages and I also consider the wiki to be quite valuable. I am still learning the position that I am working in and the help and guidance from the wiki is invaluable – namely the dashboard page.
  23. It can sometimes provide insight into issues without having to reinvent the wheel. Since we do not support any end users much of the material is of no benefit to us.
  24. Helping customers before they know it. With some of the major outages, it has saved us hours of troubleshooting with some of our international customers. VPN slowdowns, no access to certain applications etc.
  25. We host ad serving software so it is helpful to know what areas of the world are having an outage to minimize troubleshooting time when someone isn’t seeing ads.
  26. Provides me with a good reference for researching issues that our clients bring to us.  We are a small organization and sometimes I am able to find out issues that would be much more difficult for me to research on my own (or nearly impossible to find out about).
  27. Informational only.
  28. It’s a good source of information for large network outages and/or instability that have an effect on our user base.
  29. Allows us to notify our support staff and others of potential US network issues/outages that could affect our users.
  30. Provides confirmation of wide-spread outages
  31. It helps us know the state of the internet in real-time.
  32. I do not have a lot of exterior sensors so it helps me get a feel for what is over my horizon.
  33. Understanding when to give a heads-up to my clients that their services may be affected by third parties they’ve never heard of.
  34. To help with “the internet is broken” tickets.
  35. verification of cable cuts and sms outages that are typically difficult to substantiate when one is affected but not a direct customer — in general really.
  36. Just my own interest in seeing what happens around the world. The really interesting stuff is often so big that it will be on NANOG also.
  37. Massive aid in troubleshooting, and advance knowledge of impending support issues
  38. Allows the corroboration of Internet related events.
  39. Outages provides another data point when a problem is reported with our service.  Correlation can happen if we know of a particular region or provider having issues.
  40. Whenever we get complaints about reachability from our network to another network, we always check [outages] if the destination network has been reported as broken.
  41. Knowledge/awareness of what is going on. Sometimes even events far away have local impacts and having as much information available is useful operationally.
  42. I have (occasionally) been able to point to Outages as evidence that a particular customer’s troubles aren’t our fault.
  43. In case a user reports something broken it’s a metric to check if we can’t find anything here.
  44. Validation that other organizations are seeing the same problems that we are.  Notification of third-party service outages before our customers start calling.
  45. As above, I make much use of the outage wiki and posted resources when working with my small-business clients in determining Outage and communications issues.
  46. Intel into what is impacting and driving customer calls to our NOC.  Customer sources are geographically varied and having visibility to issues in those locations from trained and connected engineering types (often) is pretty useful reference material.
  47. The wiki doesn’t have a lot of value to me, but the mailing list does because I can keep an eye out for events that the members of the projects I support tend to call me about in a panic with “Your server is down!  Fix it!” messages, when in fact it’s an outage local to wherever they happen to be working from.
  48. It’s very helpful for identifying upstream provider outages, and cuts down on troubleshooting time when identifying reachability issues. It’s also helpful for the NOC to be able to tell customers what’s really going on.
  49. We can cut investigation times down immediately, saving time (which is most important) and money.
  50. Gives me a way to check if there’s something occurring out of my control that would correlate to a user-reported issue.
  51. Minimal at present