cdn - How do you disallow crawling on the origin server and yet have robots.txt propagate properly?


I've come across a rather unique issue. If you deal with scaling large sites and work with a company like Akamai, you have origin servers that Akamai talks to. Whatever you serve to Akamai gets propagated across the CDN.

But how do you handle robots.txt? You don't want Google crawling your origin directly; that can be a huge security issue. Think denial-of-service attacks.

But if you serve a robots.txt on the origin that disallows everything, your entire site becomes uncrawlable, because the CDN propagates that same file!
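For reference, the blanket deny-all robots.txt is just two lines, and since the CDN mirrors whatever the origin serves, this exact file is what the whole world would see:

    # Deny-all robots.txt: no crawler may fetch anything
    User-agent: *
    Disallow: /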

The only solution I can think of is to serve a different robots.txt to Akamai than to the rest of the world: disallow everything for the world, allow everything for Akamai. But that's hacky and prone to so many issues that I cringe just thinking about it.
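To make the hack concrete, here's roughly what it would look like with Apache mod_rewrite. This is only a sketch of the idea, not a recommendation: 203.0.113.0/24 stands in for whatever ranges Akamai fetches from (which can change under you), and robots-deny-all.txt is a made-up filename.

    RewriteEngine On
    # Requests NOT coming from the CDN's range get the deny-all file;
    # the crude prefix match below approximates 203.0.113.0/24.
    RewriteCond %{REMOTE_ADDR} !^203\.0\.113\.
    RewriteRule ^/robots\.txt$ /robots-deny-all.txt [L]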

(Of course, origin servers shouldn't be viewable by the public, but I'd venture many are for practical reasons...)

It seems like an issue the protocol should be handling better. Or perhaps search engines could allow a site-specific, hidden robots.txt in their webmaster tools...

Thoughts?

If you want your origins to not be public, use a firewall / access control to restrict access from any host other than Akamai. That's the best way to avoid mistakes, and it's the only way to stop the bots and attackers that scan public IP ranges looking for webservers.
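As a minimal sketch, assuming you can get the list of address ranges your CDN fetches from (203.0.113.0/24 below is a placeholder), the iptables rules would look something like:

    # Allow web traffic only from the CDN's fetch ranges...
    iptables -A INPUT -p tcp -s 203.0.113.0/24 -m multiport --dports 80,443 -j ACCEPT
    # ...and drop everything else that reaches the web ports.
    iptables -A INPUT -p tcp -m multiport --dports 80,443 -j DROP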

That said, if you just want to avoid non-malicious spiders, consider using a redirect on the origin server that sends any request whose Host header doesn't specify your public hostname over to the official name. You want this anyway to avoid confusion or search-rank dilution if there are variations of the canonical hostname floating around. On Apache you can use mod_rewrite, or a simple VirtualHost setup where the default server has RedirectPermanent / http://canonicalname.example.com/.
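Here's a sketch of the VirtualHost variant, keeping canonicalname.example.com from above; the catch-all ServerName and the DocumentRoot are placeholders. Apache treats the first matching vhost for an address as the default, so the catch-all has to come first:

    # Default vhost: catches direct-to-origin hits (bare IP, wrong Host
    # header) and bounces them to the canonical name.
    <VirtualHost *:80>
        ServerName origin-default.example.com
        RedirectPermanent / http://canonicalname.example.com/
    </VirtualHost>

    # The real site only answers to its canonical name.
    <VirtualHost *:80>
        ServerName canonicalname.example.com
        DocumentRoot /var/www/site
    </VirtualHost>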

If you use this approach, either add the production name to your test systems' hosts file when necessary, or create and whitelist an internal-only hostname (e.g. cdn-bypass.mycorp.com) that you can use to access the origin directly when you need to.
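For example (198.51.100.10 is a placeholder for the origin's IP):

    # /etc/hosts on a test box: resolve the public name straight to the
    # origin, bypassing the CDN for that machine only
    198.51.100.10   canonicalname.example.com

    # ...or instead add an internal-only alias to the origin's vhost
    # (and whitelist it wherever access is controlled):
    # ServerAlias cdn-bypass.mycorp.com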

