<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>SoftLayer Blog &#187; site scraping</title>
	<atom:link href="http://blog.softlayer.com/tag/site-scraping/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.softlayer.com</link>
	<description>A Behind the Scenes Look at the Best Hosting Provider in the World</description>
	<lastBuildDate>Thu, 23 May 2013 19:20:38 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.4.1</generator>
		<item>
		<title>Distil: Tech Partner Spotlight</title>
		<link>http://blog.softlayer.com/2012/distil-tech-partners-marketplace/</link>
		<comments>http://blog.softlayer.com/2012/distil-tech-partners-marketplace/#comments</comments>
		<pubDate>Wed, 16 May 2012 16:45:05 +0000</pubDate>
		<dc:creator>Guest Blog</dc:creator>
				<category><![CDATA[Partner Marketplace]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[application]]></category>
		<category><![CDATA[CDN]]></category>
		<category><![CDATA[content protection]]></category>
		<category><![CDATA[duplication]]></category>
		<category><![CDATA[global network]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[interview]]></category>
		<category><![CDATA[partner]]></category>
		<category><![CDATA[partner marketplace]]></category>
		<category><![CDATA[protection]]></category>
		<category><![CDATA[search engine optimization]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[SEM]]></category>
		<category><![CDATA[SEO]]></category>
		<category><![CDATA[site scraping]]></category>
		<category><![CDATA[video]]></category>

		<guid isPermaLink="false">http://blog.softlayer.com/?p=8061</guid>
		<description><![CDATA[This guest blog comes to us from Distil.it, a featured member of the SoftLayer Technology Partners Marketplace. Distil is the first content protection network that helps companies identify and block malicious content scraping and data theft. In this video we talk to Distil CEO Rami Essaid about how the company developed, their participation in the [...]]]></description>
			<content:encoded><![CDATA[<p class="attribution"> This guest blog comes to us from <a href="http://www.distil.it/">Distil.it</a>, a featured member of the SoftLayer Technology Partners Marketplace. Distil is the first content protection network that helps companies identify and block malicious content scraping and data theft. In this video we talk to Distil CEO Rami Essaid about how the company developed, their participation in the TechStars program and most importantly, how they can help you!</p>
<div class="yt560"><iframe src="http://www.youtube.com/embed/F-sUZmkUajI?hd=1" frameborder="0" width="560" height="349"></iframe></div>
<div class="more-info"><strong>Company Website:</strong> <a href="http://www.distil.it/">http://www.distil.it/</a><br />
<strong>Tech Partners Marketplace:</strong> <a href="http://www.softlayer.com/partners/marketplace/distil">http://www.softlayer.com/partners/marketplace/distil</a></div>
<style type="text/css" media="screen">
h4{
font-size:16px;
color: #972F2C;
margin-bottom:0;
padding-bottom:0;
}
</style>
<h3>When Google&#8217;s &#8220;Panda&#8221; Algorithm Collides with Duplicate Content</h3>
<p>If you&#8217;re a Webmaster, it&#8217;s likely you&#8217;ve heard about Google&#8217;s latest search algorithm, &#8220;Panda,&#8221; and all the benefits and implications of this update. Today, we want to highlight what happens when Google Panda collides with duplicate content online. There have been plenty of opinions written about Google Panda and duplicate content, but we want to provide some background and examples to help you better understand how Panda and duplicate content might affect you.</p>
<h4>What is Duplicate Content?</h4>
<p style="margin-top:5px; padding-top:0;">Duplicate content is a term used in the field of search engine optimization to describe content that appears on more than one web page within the same web site. When multiple pages within a web site contain essentially the same content, search engines such as Google can penalize that site or leave it out of relevant search results.</p>
<h4>Should you be Concerned?</h4>
<p style="margin-top:5px; padding-top:0;">When Google released Panda, there was a significant outcry from legitimate businesses and publishers who were either downgraded overnight in their search engine page rank or dropped altogether. For many of these businesses, the Panda algorithm reduced SEO rank and decreased visitors, site revenue and online market awareness. Some websites even experienced damage to their brand, as their customers and prospects questioned whether they were still in business.</p>
<p>We&#8217;ve spoken with <a href="http://www.cultofmac.com/">Cult of Mac</a>, <a href="http://www.digitaltrends.com/">Digital Trends</a> and several Fortune 1000 businesses, and they&#8217;ve all said the same thing: They were penalized and downgraded after the Panda release because of unauthorized duplication of their content. They had done everything to comply with Google in optimizing their SEO configurations, but third-party websites scraping and duplicating their content (outside of their control) caused their page ranks to fall.</p>
<p><span id="more-8061"></span></p>
<p style="margin-bottom:0; padding-bottom:0;"><strong>Google&#8217;s Official Stance on Duplicate Content:</strong></p>
<blockquote style="margin-top:5px;"><p>&#8220;We do a good job of choosing a version of the content to show in our search results.&#8221;</p>
<p>&#8220;In rare situations, our algorithm may select a URL from an external site that is hosting your content without your permission. If you believe that another site is duplicating your content in violation of copyright law, you may contact the site&#8217;s host to request removal. In addition, you can request that Google remove the infringing page from our search results by filing a request under the Digital Millennium Copyright Act.&#8221;</p>
<p><a href="http://support.google.com/webmasters/bin/answer.py?hl=en&#038;answer=66359">http://support.google.com/webmasters/bin/answer.py?hl=en&#038;answer=66359</a></p></blockquote>
<h4>Where is This &#8220;External&#8221; Duplicate Content Coming From?</h4>
<p style="margin-top:5px; padding-top:0;">Sometimes, it&#8217;s not clear how third-party sites obtain copies of legitimate work. Typically, they steal it by scraping the content, either manually or automatically. The scraped content is then republished on their sites with no credit or link to the original work.</p>
<p>What does that look like? It&#8217;s not difficult to find examples, but I tracked one down that seemed particularly ironic. Here&#8217;s an original article by PC World on Google&#8217;s <a href="http://www.pcworld.com/article/239007/googles_war_against_scraper_sites_continues.html">War Against Scraper Sites</a>:</p>
<p><a href="http://cdn.softlayer.com/innerlayer/distilex1.png"><img class="centered" src="http://cdn.softlayer.com/innerlayer/distilex1_s.png" alt="Screen Shot of PC World Article"/></a></p>
<p>Here&#8217;s a duplicate copy of the same story that doesn&#8217;t give any credit to the original PC World article:</p>
<p><a href="http://cdn.softlayer.com/innerlayer/distilex2.png"><img class="centered" src="http://cdn.softlayer.com/innerlayer/distilex2_s.png" alt="Screen Shot of Article on Google's War against Scraper"/></a></p>
<p>It&#8217;s clear that we&#8217;re not looking at a coincidence here. The title, article content and images are all identical. The scraping site didn&#8217;t even attempt to mask their plagiarism with synonym changes. Why would they do that? Just take a look at the ads on the scraper site &#8230; They want to profit from the keywords and traffic driven by PC World&#8217;s content.</p>
<h4>What Can You Do About It?</h4>
<ul>
<li><strong>Listen to Google</strong><br />
Google provides a list of tips for using <a href="http://support.google.com/webmasters/bin/answer.py?hl=en&#038;answer=96569&#038;topic=2371375&#038;ctx=topic">rel="nofollow"</a> and <a href="http://support.google.com/webmasters/bin/answer.py?hl=en&#038;answer=139394&#038;topic=2371375&#038;ctx=topic">canonicalization</a> to help Google identify you as the original author of your content and avoid penalizing or downgrading your business&#8217;s search ranking.</li>
<li><strong>Learn About DMCA and Use It</strong><br />
If your content has already been duplicated by unauthorized publishers, you should learn more about the Digital Millennium Copyright Act (DMCA) and how it can help remove your content from infringing websites. Two helpful resources for learning the law and your rights are <a href="http://support.google.com/bin/answer.py?hl=en&#038;answer=1386831">Google&#8217;s official DMCA policy page</a> and the <a href="http://www.copyright.gov/">United States Copyright Office</a>.</li>
<li><strong>Be Proactive About Stopping Scrapers</strong><br />
We believe the best solution is to implement practices and/or services that proactively prevent people or automated web scrapers from harvesting your content in the first place. Although web scrapers can be difficult to detect, there are tactics and services that can limit certain behaviors on your website(s). Some of the quickest ways to make strides in the right direction are to implement rate limiting rules, to block traffic from blacklisted IP addresses and to use CAPTCHA to help reduce automated web scraping.</li>
</ul>
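<p>As a rough illustration of the rate limiting and IP blocking tactics listed above, here is a minimal Python sketch. The class name <code>ScraperGuard</code>, the limits and the sample addresses are all hypothetical, chosen only for illustration. This is not a description of how Distil&#8217;s platform works, and a production site would normally enforce rules like these at the web server, load balancer or CDN rather than in application code.</p>

```python
import time
from collections import defaultdict, deque

# Hypothetical values, for illustration only.
BLOCKED_IPS = {"203.0.113.7"}  # address from a range reserved for documentation
MAX_REQUESTS = 5               # requests allowed per client per window
WINDOW_SECONDS = 10.0          # length of the sliding window


class ScraperGuard:
    """Sliding-window rate limiter combined with a static IP blocklist."""

    def __init__(self, max_requests=MAX_REQUESTS, window=WINDOW_SECONDS,
                 blocked=BLOCKED_IPS):
        self.max_requests = max_requests
        self.window = window
        self.blocked = set(blocked)
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now=None):
        """Return True if the request from `ip` should be served."""
        if ip in self.blocked:
            return False
        now = time.monotonic() if now is None else now
        recent = self.hits[ip]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            # Over the limit. A real site might show a CAPTCHA here
            # instead of refusing the request outright.
            return False
        recent.append(now)
        return True


if __name__ == "__main__":
    guard = ScraperGuard(max_requests=3)
    for second in range(5):
        print(second, guard.allow("198.51.100.1", now=float(second)))
```

<p>The optional <code>now</code> argument exists only so the behavior can be checked with fixed timestamps. Limits this strict would also throttle legitimate visitors, so any real thresholds should be tuned against your own traffic.</p>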
<p>While none of these tactics is a foolproof way to completely prevent your content from being duplicated, the more barriers to entry you have, the more difficult it will be for web scrapers to repeatedly duplicate your content. <a href="http://distil.it">Distil</a> built an enterprise-ready platform to monitor and prevent site scraping, so if you want some help protecting your content, try our service. Whatever tactics or services you implement, the key is not to forget about your legitimate traffic &#8230; You don&#8217;t want to throw out the baby with the bathwater. Be proactive, but keep your priorities on the user experience and quality of your site(s).</p>
<p>-Sean Harmer, <a href="http://distil.it">Distil</a></p>
<div class="tpm-note">This guest blog series highlights companies in SoftLayer&#8217;s <a href="http://www.softlayer.com/marketplace">Technology Partners Marketplace</a>. <br/>These <a href="http://blog.softlayer.com/partner-marketplace/">Partners</a> have built their businesses on the SoftLayer Platform, and we&#8217;re excited for them to tell their stories. New Partners will be added to the Marketplace each month, so stay tuned for many more to come.</div>
]]></content:encoded>
			<wfw:commentRss>http://blog.softlayer.com/2012/distil-tech-partners-marketplace/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
