@datarade I've seen a lot of your epic scraper comments on PH, so looks like you are the scraper king haha.
What are your go-to scrapers nowadays for social networks like LinkedIn + Twitter? Looking for something web based / usable on Mac. Data-miner.io is good, but a bit complex for us non-technicals. Many tools like Portia, Import, etc. don't work on these sites.
I've worked for Scrapinghub for the past two years, happy to answer any questions about Portia, its big brother Scrapy, or any part of our platform... or even web scraping in general!
We and our users are currently crawling 3.5 billion pages per month, or around 80,000 pages per minute. So we know a little bit about scraping. :)
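For anyone who wants to check the per-minute figure, here is a quick back-of-the-envelope sketch. It assumes a 30-day month, which is my assumption and not something stated above.

```python
# Figure quoted above: 3.5 billion pages crawled per month.
PAGES_PER_MONTH = 3.5e9

# Assumption: a 30-day month (30 days * 24 hours * 60 minutes).
MINUTES_PER_MONTH = 30 * 24 * 60

pages_per_minute = PAGES_PER_MONTH / MINUTES_PER_MONTH
print(f"{pages_per_minute:,.0f} pages per minute")
```

Under that assumption the result lands at roughly 81,000 pages per minute, which is consistent with the "around 80,000" quoted above.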
@jpmillions We're open source guys (and gals)… so we're definitely saddened to see users of a closed platform treated this way.
OTOH, we've seen a lot of customers coming to try Portia from Kimono. :) We're actively working on ways to help people port their Kimono crawlers, so we're keen to hear from anyone in this boat! Email me directly (gabriel@scrapinghub.com) or sign up to the mailing list at the bottom of this post (https://goo.gl/CGxsFl).
Both Portia and Scrapy are fully open source, and any crawlers (created or running) in our platform are fully exportable and interoperable with open technologies.
While we are focused on the long-term and so doubt our platform will be shut down any time soon, if that ever happened, all of our users would be able to export their crawlers and use them on their own infrastructure. We've done this ourselves for some of our Professional Services clients who want us to build scrapers but also run things on their own infrastructure.
ScrapingHub crew have been doing this a long time, and deliver good service. Since the untimely demise/acquihire of KimonoLabs, I'll be giving this a try.
@pablohoffman @gpuliatti doesn't look like Portia can currently handle scraping Twitter. LinkedIn is even more secure than Twitter... Am I doing something wrong or is this how it is? Any suggestions for another scraper? Looking for something web based / usable on Mac. Data-miner.io is good, but a bit complex for us non-technicals.
Looks cool. I am curious to learn what the top use cases for Portia are. I understand that people scrape, but what interesting things do they do with the data?
@tribaling A few use-cases from our past client projects:
- Scrape eCommerce sites that sell your products, to check for price violations and review data.
- Build a broad crawler covering thousands of sites to automatically discover contact and profile information for a specific industry.
- Parse all shop locations for a number of big brands to provide a locator for users looking for a specific type of shop.
- Build a database of interesting candidates to hire, by matching various sources of internet profiles with a series of filters which you or the HR team are interested in.
I know people building boutique businesses on basic web scraping… like someone who uses our platform to offer a service that allows people to monitor Amazon Kindle Books pricing, and get alerted when the price drops or the book goes on sale. In effect, bringing Amazon's data "back to the people" to allow them to make better choices.
But of course, most of the $$$ value comes from being a Fortune 500 company and being able to understand a lot more about the world, your industry and your competition.
We help both large and small companies increase their reach and get access to the best technology. :)
@johnalxndr Everything should be back up now! We do have a free plan, it's the big box on http://scrapinghub.com/pricing. You can get shared resources and 1 concurrent crawl for $0 a month.
Cheers Gabriel! I have been looking at developing an app similar to the likes of Flipboard for a while, and I have recently been looking into how they currently scrape for content. Is this a complex system?
@thisdickie For something that needs to grab data from any generic news site like Flipboard, Portia may be a bit limited.
I would recommend using Scrapy, and using one of the many content parsing libraries available for Python. Moz has a great write-up of the available ones on their blog: https://moz.com/devblog/benchmar...
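To give a rough idea of what the content parsing step does, here is a minimal sketch using only Python's standard library. It is a toy stand-in, not one of the libraries from the Moz write-up and not how Scrapy itself works: it simply keeps the page title and the text of `<p>` elements. The names `ParagraphExtractor` and `extract_article` are made up for illustration. In a Scrapy project you would call something like this from a spider's `parse` callback, passing in `response.text`.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Hypothetical helper: collects the <title> text and each <p> element's text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._current = None  # tag currently being captured ("title" or "p")
        self._buffer = []     # text chunks seen inside the captured tag

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        # Text outside <title>/<p> (menus, footers, etc.) is ignored.
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Join the chunks and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            elif text:
                self.paragraphs.append(text)
            self._current = None


def extract_article(html):
    """Return the title and paragraph text found in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "body": "\n\n".join(parser.paragraphs)}


sample = """<html><head><title>Example story</title></head>
<body><nav>Home | About</nav>
<p>First paragraph of the <b>story</b>.</p>
<p>Second paragraph.</p></body></html>"""

article = extract_article(sample)
print(article["title"])
print(article["body"])
```

Real news pages are far messier than this sample (ads, comments, nested markup), which is why a dedicated content extraction library is the better choice for anything beyond a quick experiment.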
I just tried signing up for Portia. They had a very cool introduction called Ben that wanted me to invite my team members. Funny because I don't think of scraping as a team sport. Next I tried creating a spider, but there was no help here at all. I had to go to the docs and I still did not understand what I was doing. I clicked some elements on a page that I wanted scraped, but only afterwards discovered that I had to define the fields I wanted scraped and then annotate the page. On top of that it seems like they are out of capacity because I suddenly got a fatal error from the crawler when pressing "Test". Looks good, but still a long way to go! 👍
@wanderslth Hi Ken, we think Import.io is great… I've personally used their browser tools quite a few times to do something quick.
I'd say the biggest benefit you get with Portia is actually our platform… every project has a great API (http://doc.scrapinghub.com/api/o...) which you can use to schedule and run your crawls, as well as review and download data. By enabling add-ons with a couple of clicks, you can use Crawlera's proxies, send data to various destinations like Amazon S3 (plus images and files separately if you want), or even send it to machine learning services for extra text analysis.
Portia crawlers behave quite similarly to Scrapy crawlers, which means joint-projects are possible… Portia tackling the low-hanging fruit without needing engineers, and Scrapy doing the heavy lifting on the sites that need it. Having an easier option helps lower overall project costs, and Scrapy allows engineers to piece all parts together and crawl the tougher sites.
This has allowed us to handle much bigger projects than we could with only one or the other.
@gpuliatti ah; cool -- I took a poke around the larger site and looks... comprehensive. :) I'll definitely bookmark for a deeper dive here soon. Thanks.