Extracting the Main Image from an Article/Web Page

Sunday, May 15, 2016

For an app that I am building, I have been looking for a way to extract the main image of an article from a web page. A search on the net indicated that JSoup was probably "the" library for the job, but JSoup on its own does not tell you how to get the main image associated with an article. Any modern web page has several images in it: advertising banners, the site logo, social media sharing icons, article-related images and more advertising. How do we pick out the main image of the article? I was not willing to think this through and write the code myself, partly because I am lazy and partly because I knew this had been done before and that code for it should be available.

My initial search took me to Mashape. Searching through the APIs there, I found the free Article Analysis API by adlegant. While not required, I also decided to use the Unirest Java library by Mashape to access the API. My initial tests with the API were all positive: a simple call to the API endpoint, passing the URL to be analyzed and the credentials provided by Mashape, was all that was required. It returned a JSON response similar to the one below:

{
  "author": [
    "Verge Staff"
  ],
  "categories": [
    "economy, business and finance"
  ],
  "cleaned_text": "Black Friday is right around the corner, and there are plenty of great deals to be had. We've been covering the deals for weeks now, but if you want to cut through the mess and just score the best deals you can find, you've come to the right place.\n\nAs to be expected, this year there are lots of deals to be had on TVs small and large, 1080p to 4K. You can also get a great price on last year's iPads, which are still better tablets than pretty much anything save for this year's iPads. If you're in the market for a smartwatch or fitness tracker, you can save some money on some really great options this weekend. And if you want to pick up a laptop or new headphones, there are deals to be found too.\n\nKeep in mind that the best deals won't last long and many of them are limited to Friday itself (or in rare occasions, Thursday too). To win the Black Friday game, you have to be aggressive and quick, there's no time to sleep when deals are to be had. That said, here are the 20 best Black Friday deals this year. Warm up your credit cards and get a good night's sleep, it's time to make the consumer machine work.",
  "date": "2014-11-26T19:15:45Z",
  "entities": [
    "Warm",
    "Black",
    "TVs"
  ],
  "image": "https://cdn1.vox-cdn.com/thumbor/IkFo5ddhDEXA2t_mUfezCdvGbUI=/35x0:606x381/1280x854/cdn0.vox-cdn.com/uploads/chorus_image/image/44236174/black-friday-branding-jc3.0.png",
  "language": "en",
  "link": "http://www.theverge.com/2014/11/26/7292895/best-black-friday-deals",
  "main_body": "<div class=\"m-article__entry\" gravityScore=\"204\" gravityNodes=\"3\"><p>Black Friday is right around the corner, and there are plenty of great deals to be had. We've been&#160;covering the deals for weeks now, but if you want to cut through the mess and just score the best deals you can find, you've come to the right place.</p>&#13;\n<p>As to be expected, this year there are lots of deals to be had on TVs small and large, 1080p to 4K. You can also get a great price on last year's iPads, which are still better tablets than pretty much anything save for this year's iPads. If you're in the market for a smartwatch or fitness tracker, you can save some money on some really great options this weekend. And if you want to pick up a laptop or new headphones, there are deals to be found too.</p>\n \n&#13;\n<p>Keep in mind that the best deals won't last long and many of them are limited to Friday itself (or in rare occasions, Thursday too). To win the Black Friday game, you have to be aggressive and quick, there's no time to sleep when deals are to be had. That said, here are the 20 best Black Friday deals this year. Warm up your credit cards and get a good night's sleep, it's time to make the consumer machine work.</p>&#13;\n &#13;\n &#13;\n &#13;\n &#13;\n &#13;\n &#13;\n \n       \n\n    </div>\n  ",
  "summary": [
    "That said, here are the 20 best Black Friday deals this year.",
    "Black Friday is right around the corner, and there are plenty of great deals to be had.",
    "We've been covering the deals for weeks now, but if you want to cut through the mess and just score the best deals you can find, you've come to the right place.",
    "Keep in mind that the best deals won't last long and many of them are limited to Friday itself (or in rare occasions, Thursday too).",
    "To win the Black Friday game, you have to be aggressive and quick, there's no time to sleep when deals are to be had."
  ],
  "tags": [
    "best deals",
    "laptop",
    "expected year",
    "said 20",
    "sleep",
    "best",
    "nights",
    "20",
    "tracker",
    "black",
    "friday game",
    "time make",
    "save",
    "mess",
    "right corner",
    "friday",
    "really great",
    "save money",
    "year ipads",
    "friday deals",
    "great",
    "headphones",
    "deals",
    "best black",
    "black friday"
  ],
  "text_sentiment": {
    "sentiment": 0.2986327561327561,
    "subjectivity": 0.4551911976911977,
    "word": "positive"
  },
  "title": "The 20 best Black Friday deals"
}

It extracts the main image associated with the article perfectly and provides the link to it. All one needs to do is parse the response with the org.json package and read the value of the image field. Nice and clean!
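As a rough sketch of that last step, here is how the image value can be read out of a response shaped like the one above. With org.json on the classpath, `new JSONObject(body).getString("image")` is the direct route. The version below is only a dependency-free stand-in so that it can be run by itself; the class name `ImageField` and the method `imageUrl` are made up for this illustration, and the regex is an assumption that holds for flat responses like the sample, not a general JSON parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImageField {
    // Matches "image": "<value>", where the value may contain backslash escapes.
    private static final Pattern IMAGE =
            Pattern.compile("\"image\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    /** Returns the value of the "image" field, or empty if the field is absent. */
    public static Optional<String> imageUrl(String json) {
        Matcher m = IMAGE.matcher(json);
        if (!m.find()) {
            return Optional.empty();
        }
        // Some JSON encoders write "/" as "\/"; undo that so the URL is usable.
        return Optional.of(m.group(1).replace("\\/", "/"));
    }

    public static void main(String[] args) {
        // A cut-down, hypothetical response with the same field names as the sample.
        String json = "{ \"language\": \"en\", "
                + "\"image\": \"https://example.com/pic.png\", "
                + "\"title\": \"The 20 best Black Friday deals\" }";
        System.out.println(imageUrl(json).orElse("(no image)"));
    }
}
```

Returning an `Optional` keeps the caller honest about responses that carry no image at all, which matters once the API starts misbehaving.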

But... as I was testing the API inside the app for extracting the images, I found that it was unavailable most of the time. I accept that it is free and that I should not expect any guarantees, yet I felt that the downtime was too much. As I type this post, I see that it also fails for a few URLs that are perfectly valid. While Mashape provides some basic support even for free APIs, I did not make use of it.

Instead, I decided to go with parser-based code to get the main image. I found a class called ImageExtractor by Gavin Bisesi (Daenyth), written using JSoup. The code employs the technique used by Google Plus to extract images from shared URLs. It does the job well and there is no dependency on external API endpoints.
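The class itself is not reproduced here, but the general idea behind this kind of extraction can be sketched. The sketch assumes that the technique comes down to checking well-known metadata hints in the page head in priority order. The two hints used below, the Open Graph `og:image` meta tag and `<link rel="image_src">`, are my assumption and not a description of what ImageExtractor actually checks. To keep it runnable without any library, it scans the tags with regular expressions; with JSoup the same lookups would be selectors such as `doc.select("meta[property=og:image]").attr("content")`. The names `MainImageFinder` and `findMainImage` are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MainImageFinder {
    // Any <meta ...> or <link ...> tag.
    private static final Pattern TAG =
            Pattern.compile("<(meta|link)\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // name="value" or name='value' pairs inside a tag.
    private static final Pattern ATTR =
            Pattern.compile("([\\w:-]+)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    /** Prefers og:image; falls back to the first link rel="image_src". */
    public static Optional<String> findMainImage(String html) {
        String linkImage = null;
        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            Map<String, String> attrs = attributes(tag.group());
            if ("og:image".equals(attrs.get("property")) && attrs.containsKey("content")) {
                return Optional.of(attrs.get("content"));
            }
            if (linkImage == null && "image_src".equals(attrs.get("rel"))) {
                linkImage = attrs.get("href");
            }
        }
        return Optional.ofNullable(linkImage);
    }

    // Collects the quoted attributes of a single tag, with lower-cased names.
    private static Map<String, String> attributes(String tag) {
        Map<String, String> attrs = new HashMap<>();
        Matcher m = ATTR.matcher(tag);
        while (m.find()) {
            String value = m.group(2) != null ? m.group(2) : m.group(3);
            attrs.put(m.group(1).toLowerCase(Locale.ROOT), value);
        }
        return attrs;
    }

    public static void main(String[] args) {
        // Hypothetical page head, for illustration only.
        String html = "<head><link rel=\"image_src\" href=\"http://example.com/a.png\">"
                + "<meta property=\"og:image\" content=\"http://example.com/b.png\"></head>";
        System.out.println(findMainImage(html).orElse("(no image found)"));
    }
}
```

This sketch does not decode HTML entities or resolve relative URLs, and it skips unquoted attributes. A real parser such as JSoup handles all of these, which is a good reason to prefer it in the app itself.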

While there are many options available for scraping web pages and getting the images, JSoup will probably be the primary option, with APIs coming next. If you are short on time and willing to spend, paid APIs may be an option. If not, you will have to try out classes such as the one mentioned above to get the job done.

Danesh

Visit Pleb.in for apps developed by Danesh
