For an app that I am building, I needed to extract the main image of an article from a webpage. Searching the net indicated that JSoup was probably "the" library for the job, but even there I found no mention of how to get the main image associated with an article. Any modern webpage has several images in it: advertising banners, the site logo, social media sharing icons, article-related images and yet more advertising. So how do we pick out the main image of an article? I was not willing to think it through and write the code myself, partly because I am lazy and partly because I knew this had been done before and code for it should already be available.
My initial search took me to Mashape. Looking through the APIs there, I found the free Article Analysis API by adlegant. While not required, I also decided to use the Unirest Java library by Mashape to access the API. My initial tests with the API were all positive: a simple call to the API endpoint, with the URL to be analyzed and the credentials provided by Mashape, was all that was required. The API returned the result of its analysis as a JSON response.
It extracts the main image associated with the article perfectly and provides the link to it. All one needs to do is parse the response with the org.json package and read the image's value. Nice and clean!
But... as I was testing the API inside the app, I found that it was unavailable most of the time. I agree that it is free and that I should have no expectations, yet I felt that the downtime was too much. As I type this post, it still seems to fail for a few URLs that are perfectly valid. Mashape provides some basic support even for free APIs, but I did not make use of it.
Instead, I decided to go with parser-based code to get the main image. I found a class called ImageExtractor by Gavin Bisesi (Daenyth), written using JSoup. The code employs the technique used by Google Plus to extract images from shared URLs. It does the job well, and there is no dependency on external API endpoints.
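To give a feel for what a parser-based approach does, here is a minimal sketch of the general idea: look for an image that the page itself declares as its preview image before falling back to an ordinary img tag. This is not the ImageExtractor class, and the exact order of checks (og:image, then link rel="image_src", then the first img) is my assumption rather than a description of that class or of what Google Plus does. To keep it free of dependencies I have used plain java.util.regex instead of JSoup, so it only copes with simple, well-formed tags; the class name MainImageSketch and the method findMainImage are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the ImageExtractor class mentioned above.
class MainImageSketch {

    // Each pattern captures the candidate URL in group 1.
    // Assumption: property/rel appears before content/href inside the tag.
    private static final Pattern OG_IMAGE = Pattern.compile(
        "<meta[^>]*property=[\"']og:image[\"'][^>]*content=[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);
    private static final Pattern IMAGE_SRC = Pattern.compile(
        "<link[^>]*rel=[\"']image_src[\"'][^>]*href=[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);
    private static final Pattern FIRST_IMG = Pattern.compile(
        "<img[^>]*\\ssrc=[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Tries the declared preview images first, then the first <img> in the page.
    static Optional<String> findMainImage(String html) {
        for (Pattern p : new Pattern[] {OG_IMAGE, IMAGE_SRC, FIRST_IMG}) {
            Matcher m = p.matcher(html);
            if (m.find()) {
                return Optional.of(m.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<html><head>"
            + "<meta property=\"og:image\" content=\"http://example.com/lead.jpg\">"
            + "</head><body><img src=\"/logo.png\"></body></html>";
        System.out.println(findMainImage(html).orElse("(no image found)"));
    }
}
```

A real page is much messier than this sample, which is exactly why a proper HTML parser such as JSoup is the better tool: it handles attribute order, relative URLs and broken markup that a regular expression will trip over.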
While there are many options available to scrape webpages and get the images, JSoup will probably be the primary option, with APIs coming next. If you are short on time and willing to spend, paid APIs may be an option. If not, you will have to try out classes such as the one mentioned above to get the job done.