Today I learned:
Web Crawlers
Need a web crawler but don’t want to write one?
- http://scrapinghub.com
- http://www.outwit.com/products/hub/
- http://webroots.io
- http://kimonolabs.com
- http://grabby.io
- http://fullcontact.com
- http://emailhunter.co
- http://clearbit.com
- http://toofr.com
- http://import.io
- http://apifier.com
- http://elink.club
- http://www.eliteproxyswitcher.com/
- http://www.uipath.com/
- http://diffbot.com
- http://cloudscrape.com
- http://community.screen-scraper.com
- https://commoncrawl.org/
- http://www.fminer.com/
- https://scraperwiki.com/
- http://nutch.apache.org/
- http://www.ubotstudio.com/index7
- http://mozenda.com
- http://fivefilters.org/
- http://crawly.diffbot.com
Getting pages removed from Google cache
Have an old site that you need to keep live but don’t want it to show up in Google search results? Here are a few things you need to do:
- Change the robots.txt or password protect your site to prevent search engines from indexing.
- Log in to Google Webmaster Tools and submit the site to the URL Removal tool.
- Finish what you need the site up for ASAP and take it offline.
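For the robots.txt step, a quick way to sanity-check the rules before relying on them is Python's standard `urllib.robotparser`. This is only a sketch: the robots.txt contents below (block every crawler from every path) and the `example.com` URL are assumptions for illustration, and `is_blocked` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents: block every crawler from every path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if these robots.txt rules forbid user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URL; substitute a page from the site you are hiding.
    print(is_blocked(ROBOTS_TXT, "Googlebot", "http://example.com/old-page.html"))
```

This only checks how a well-behaved crawler would read the file. It says nothing about what is already in Google's cache, which is what the URL Removal step is for.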
Regex for turning URLs into links in Markdown
This matches the links above:
- Search: ([\w\S]*[mo7b\/])$
- Replace: [\1](\1)
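The same search and replace can be tried outside an editor. Here is a minimal sketch using Python's `re` module, assuming the list is held in a string with one entry per line; `linkify` is a hypothetical helper name. The `re.MULTILINE` flag is what lets `$` match at the end of every line rather than only at the end of the text.

```python
import re

# Same pattern as above: a run of non-whitespace characters that ends the
# line with m, o, 7, b, or / (the last characters of the URLs in the list).
URL_AT_LINE_END = re.compile(r"([\w\S]*[mo7b\/])$", re.MULTILINE)


def linkify(text: str) -> str:
    """Wrap each trailing URL in Markdown link syntax: [url](url)."""
    return URL_AT_LINE_END.sub(r"[\1](\1)", text)


if __name__ == "__main__":
    sample = "\n".join([
        "- http://scrapinghub.com",
        "- https://commoncrawl.org/",
        "- http://www.ubotstudio.com/index7",
    ])
    print(linkify(sample))
```

The pattern is tailored to this particular list, so any line whose last word ends in one of those five characters gets wrapped too, URL or not.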