Scraping Hub
Turn web content into useful data
Charlie Irish
Portia — Scrape websites visually
Featured
48 Replies
Dre Durr💡
@datarade sounds like you should make a collection of Scrapers.
Sahil Chaturvedi
@datarade Jeez that's a lot! Are they all similar, or just different use cases?
Kumar Thangudu
@giannidalerta Follow me and check out my blog. My plan is to continue these types of comments. ;)
Nick Kwan
@datarade I've seen a lot of your epic scraper comments on PH, so it looks like you are the scraper king haha. What are your go-to scrapers nowadays for social networks like LinkedIn and Twitter? Looking for something web-based / usable on a Mac. Data-miner.io is good, but a bit complex for us non-technicals. Many tools like Portia, Import, etc. don't work on these sites.
Gabriel Puliatti
I've worked for Scrapinghub for the past two years, happy to answer any questions about Portia, its big brother Scrapy, or any part of our platform... or even web scraping in general! We and our users are currently crawling 3.5 billion pages per month, or around 80,000 pages per minute. So we know a little bit about scraping. :)
Jake Miller
@gpuliatti I'm curious to hear your perspective on Palantir's acquisition and sudden shutdown of Kimono Labs, and how they handled that?
Gabriel Puliatti
@jpmillions We're open source guys (and gals)… so we're definitely saddened to see users of a closed platform treated this way. OTOH, we've seen a lot of customers coming to try Portia from Kimono. :) We're actively working on ways to help people port their Kimono crawlers, so keen on hearing from anyone in this boat! Email me directly (gabriel@scrapinghub.com) or sign up to the mailing list at the bottom of this post (https://goo.gl/CGxsFl).

Both Portia and Scrapy are fully open source, and any crawlers created or running on our platform are fully exportable and interoperable with open technologies. While we are focused on the long term and doubt our platform will be shut down any time soon, if that ever happened, all of our users would be able to export their crawlers and use them on their own infrastructure. We've done this ourselves for some of our Professional Services clients who want us to build scrapers but also run things on their own infrastructure.
Evan Lodge
@gpuliatti I tried scraping a list of 27,000 URLs... the browser crashed. Is there any easier way of adding URLs to the scrape?
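For reference, a URL list this long is usually fed to the crawl from a file rather than pasted into the browser. A minimal sketch in Scrapy (Portia's open-source sibling mentioned elsewhere in this thread), assuming the URLs sit in a local urls.txt with one URL per line; the spider name and parse logic are placeholders, not Portia's actual workflow:

```python
import scrapy


class UrlListSpider(scrapy.Spider):
    name = "url_list"  # hypothetical spider name

    def start_requests(self):
        # urls.txt is assumed to hold one URL per line (e.g. the 27,000 above).
        with open("urls.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Placeholder extraction: just record the page title for each URL.
        yield {"url": response.url, "title": response.css("title::text").get()}
```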
tomkelshaw
ScrapingHub crew have been doing this a long time, and deliver good service. Since the untimely demise/acquihire of KimonoLabs, I'll be giving this a try.
oty
Awesome!!! Curious to know the limits on data processing for Big Data use cases.
Nick Kwan
@pablohoffman @gpuliatti It doesn't look like Portia can currently handle scraping Twitter, and LinkedIn is even more secure than Twitter... Am I doing something wrong, or is this just how it is? Any suggestions for another scraper? Looking for something web-based / usable on a Mac. Data-miner.io is good, but a bit complex for us non-technicals.
dataflowkit
@pablohoffman @gpuliatti @nwkwan I'm sorry for the late response. We've released a new service for data scraping: https://www.producthunt.com/post... It can extract long, infinitely scrolled pages. I would really appreciate your expert review of our DFK service.
Saijo George
Nice product. How do you guys compare to https://www.producthunt.com/tech... ? That's my go-to scraper these days.
Yiğitcan Kutay Güler
Looks great! That's all I can say, since I haven't been able to actually open the dashboard... Is there a problem with the site? @gpuliatti
Gabriel Puliatti
@ykguler We got hit by the Product Hunt bump and our dashboard was having some issues… looks like things are working again.
Elia Morling
Looks cool. I am curious to learn what the top use cases for Portia are. I understand that people scrape, but what interesting things do they do with the data?
Gabriel Puliatti
@tribaling A few use cases from our past client projects:
- Scrape eCommerce sites that sell your products, to check for price violations and review data.
- Build a broad crawler covering thousands of sites to automatically discover contact and profile information for a specific industry.
- Parse all shop locations for a number of big brands to provide a locator for users looking for a specific type of shop.
- Build a database of interesting candidates to hire, by matching various sources of internet profiles against a series of filters which you or the HR team are interested in.

I know people building boutique businesses on basic web scraping… like someone who uses our platform to offer a service that lets people monitor Amazon Kindle book pricing and get alerted when the price drops or the book goes on sale. In effect, bringing Amazon's data "back to the people" to allow them to make better choices. But of course, most of the $$$ value comes from being a Fortune 500 company and being able to understand a lot more about the world, your industry and your competition. We help both large and small increase their reach and get access to the best technology. :)
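As a concrete illustration of the first use case above, a bare-bones Scrapy sketch that flags prices below a minimum advertised price; the URL, CSS selector, and price threshold are all hypothetical, not taken from any real client project:

```python
import scrapy

MIN_ADVERTISED_PRICE = 19.99  # hypothetical minimum advertised price for the product


class PriceCheckSpider(scrapy.Spider):
    name = "price_check"
    # Placeholder product page; a real project would crawl every retailer selling the item.
    start_urls = ["https://retailer.example.com/products/widget"]

    def parse(self, response):
        # Selector is illustrative; each retailer needs its own.
        price_text = response.css("span.price::text").re_first(r"[\d.]+")
        if price_text is None:
            return
        price = float(price_text)
        yield {
            "url": response.url,
            "price": price,
            "below_map": price < MIN_ADVERTISED_PRICE,  # flag a potential price violation
        }
```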
John Alexander
Do you have a free version? Having a hard time figuring out these "plans".
Gabriel Puliatti
@johnalxndr Everything should be back up now! We do have a free plan, it's the big box on http://scrapinghub.com/pricing. You can get shared resources and 1 concurrent crawl for $0 a month.
Sam Dickie
Cheers Gabriel! I have been looking at developing an app similar to the likes of Flipboard for a while, but I have recently been looking into how they currently scrape for content. Is this a complex system?
Gabriel Puliatti
@thisdickie For something that needs to grab data from any generic news site, like Flipboard does, Portia may be a bit limited. I would recommend using Scrapy with one of the many content parsing libraries available for Python. Moz has a great write-up of the available ones on their blog: https://moz.com/devblog/benchmar...
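A minimal sketch of that Scrapy-plus-parsing-library approach, assuming readability-lxml as the content extractor (one of several such libraries; the Moz post above compares the options); the seed URL and link-following logic are placeholders:

```python
import scrapy
from readability import Document  # pip install readability-lxml


class ArticleSpider(scrapy.Spider):
    name = "articles"
    # Hypothetical seed listing page; replace with the news sites you care about.
    start_urls = ["https://example.com/news"]

    def parse(self, response):
        # Follow article links found on the listing page.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        doc = Document(response.text)
        yield {
            "url": response.url,
            "title": doc.title(),
            "content_html": doc.summary(),  # cleaned article body as HTML
        }
```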
Dru Wynings
@thisdickie Hey Sam, you might also want to check out http://www.diffbot.com/ as we power more than a few article reading apps
Kevin Simper
I just tried signing up for Portia. They had a very cool introduction, called Ben, that wanted me to invite my team members. Funny, because I don't think of scraping as a team sport. Next I tried creating a spider, but there was no help here at all. I had to go to the docs, and I still did not understand what I was doing. I clicked some elements on a page that I wanted scraped, but only afterwards discovered that I had to define the fields I wanted scraped and then annotate the page. On top of that, it seems like they are out of capacity, because I suddenly got a fatal error from the crawler when pressing "Test". Looks good, but still a long way to go! 👍
Heather Redman
Really useful: web data for everyone on an on-demand, accessible basis. I would be interested in whether Portia also provides data structuring.
Ken Kaczmarek
Nice to see another option for scraping! Other than the open-source angle, how would you compare Portia to what Import.io offers?
Gabriel Puliatti
@wanderslth Hi Ken, we think Import.io is great… I've personally used their browser tools quite a few times to do something quick. I'd say the biggest benefit you get with Portia is actually our platform: every project has a great API (http://doc.scrapinghub.com/api/o...) which you can use to schedule and run your crawls, as well as review and download data. You can use Crawlera's proxies, send data to various destinations like Amazon S3 (plus images and files separately if you want), or even machine learning services for extra text analysis, by enabling add-ons with a couple of clicks.

Portia crawlers behave quite similarly to Scrapy crawlers, which means joint projects are possible: Portia tackling the low-hanging fruit without needing engineers, and Scrapy doing the heavy lifting on the sites that need it. Having an easier option helps lower overall project costs, and Scrapy allows engineers to piece all the parts together and crawl the tougher sites. This has allowed us to handle much bigger projects than we could with only one or the other.
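A rough sketch of driving that platform API from Python, assuming the python-scrapinghub client library; the API key, project ID, and spider name are placeholders, and the API docs linked above are the authoritative reference:

```python
from scrapinghub import ScrapinghubClient  # pip install scrapinghub

client = ScrapinghubClient("YOUR_API_KEY")   # placeholder API key
project = client.get_project(12345)          # hypothetical project ID

# Schedule a crawl for a spider defined in the project (Portia or Scrapy).
job = project.jobs.run("example_spider")

# Later, once the job has finished, stream the scraped items back down.
for item in job.items.iter():
    print(item)
```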
Ken Kaczmarek
@gpuliatti Ah, cool -- I took a poke around the larger site and it looks... comprehensive. :) I'll definitely bookmark it for a deeper dive here soon. Thanks.
Iverson Dantas
What do you think about tools that monitor updates on social media like Twitter, Facebook, YouTube, and Instagram and also extract data from them?
Samir Doshi
This is a huge improvement on some of those other WYSIWYG macro scrapers out there that just don't work.