A Little Bit Of Web Scraping In A Hybrid Mobile App

12 January 2015 | JavaScript, AngularJS, Web Scraping

I was working on a mobile app that needed to get data from a website. Since that website didn’t expose an API to get the data, I needed to scrape it. To make it a little bit more complicated, I had to log in first to access the data I needed.

Remember: Web scraping can be illegal, so make sure you won’t get sued if you're using data that is not yours.

Since I was building a hybrid mobile app (which uses HTML/CSS/JavaScript), I figured I’d just try doing it all from JavaScript using AngularJS.

Other Options

The best way to go about web scraping is creating a web service that handles the scraping so that when anything changes on the website you're scraping, you'll only have to update the web service. The app I built was just for myself, so I didn't bother setting up a web service for it.

Before I decided to do the scraping in JavaScript I had actually looked at using import.io to create an API for the website with their tool.

Unfortunately, it didn't work with the website I was using: I kept getting an error when trying to record the login. It may well work for your site, though, so have a look at it before you build the scraping yourself.

Let’s get started

First, we’ll need a website to scrape. For the purpose of this blog post I picked the open source website http://www.nerddinner.com.

We will be writing code to log in to this website and get a list of the dinners created by the logged-in user.

Let’s open up Chrome and browse to http://www.nerddinner.com. Open up the Developer Tools and go to the Network tab. Make sure you tick “Preserve log” at the top. Now log in with username “gonehybrid” and password “password”. After logging in, find the POST request in the list and click on it. It should look like this:

Screenshot of POST request in Dev Tools

Have a look at the Request and Response Headers to get an idea of what is being sent to and from the server.

Now we are going to write code that mimics this POST request to log in to the website. I’m using the AngularJS $http service here, but you can also use jQuery's $.ajax or plain XMLHttpRequest.

<!-- index.html -->
<!DOCTYPE html>
<html>
<head>
    <script src="//ajax.googleapis.com/ajax/libs/angularjs/1.3.4/angular.min.js"></script>
    <script src="//ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
    <script src="main.js"></script>
</head>
<body ng-app="myApp" ng-controller="myController">
    <div>
        <input type="text" ng-model="username" />
        <input type="password" ng-model="password" />
        <button ng-click="doLogin()">Login</button>
    </div>
    <div>
        <ul>
            <li ng-repeat="dinner in dinners">
               {{ dinner.Name }} at {{ dinner.Date }}
            </li>
        </ul>
    </div>
</body>
</html>

// main.js
(function () {

    var dinnerService = function ($http) {

        return {
            login: function (username, password) {
                var request = {
                    method: 'POST',
                    url: 'http://www.nerddinner.com/Account/LogOn',
                    headers: {
                        'Content-Type': 'application/x-www-form-urlencoded'
                    },
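                    // Encode the values so characters like & or = can't break the form body
                    data: 'UserName=' + encodeURIComponent(username) +
                          '&Password=' + encodeURIComponent(password) +
                          '&RememberMe=false'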
                };
                return $http(request);
            },

            getMyDinners: function () {
                // Implemented further down in this post
            }
        };
    };

    var myController = function ($scope, $http, dinnerService) {

        $scope.doLogin = function () {
            var onSuccess = function (response) {
                dinnerService.getMyDinners()
                             .then(function(response) {
                                        $scope.dinners = response;
                                   });
            };

            dinnerService.login($scope.username, $scope.password)
                         .then(onSuccess);
        };
    };

    var myApp = angular.module('myApp', []);

    myApp.factory('dinnerService', ['$http', dinnerService]);
    myApp.controller('myController', ['$scope', '$http', 'dinnerService', myController]);

})();
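
The login body above is a hand-built URL-encoded form string. As a minimal sketch (the helper name toFormBody is my own and not part of AngularJS), the same string can be built from an object, with every key and value passed through encodeURIComponent so that characters such as & or = in a password can't corrupt the body:

```javascript
// Hypothetical helper: turns { UserName: 'a', Password: 'b' } into 'UserName=a&Password=b'.
function toFormBody(fields) {
    return Object.keys(fields)
        .map(function (key) {
            return encodeURIComponent(key) + '=' + encodeURIComponent(fields[key]);
        })
        .join('&');
}

var body = toFormBody({
    UserName: 'gonehybrid',
    Password: 'p&ss=word', // made-up value containing reserved characters
    RememberMe: false
});

console.log(body); // UserName=gonehybrid&Password=p%26ss%3Dword&RememberMe=false
```

In the login function this could be used as data: toFormBody({ UserName: username, Password: password, RememberMe: false }).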

Now, let's run this in Chrome and log in with the same credentials. We'll see the following error in the console:

Screenshot of Error message

Same-Origin Policy

This error means that Chrome does not allow our script to make an HTTP request to another domain and read the response. The browser implements the same-origin policy, which forbids scripts on one site from accessing resources on other sites.
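
To make "same origin" concrete: two URLs share an origin only when their scheme, host, and port all match. The sketch below compares origins with the standard URL class. The localhost address is an assumed development URL for index.html, not something from my actual setup.

```javascript
// Two URLs are same-origin when scheme, host and port are all equal.
function isSameOrigin(urlA, urlB) {
    return new URL(urlA).origin === new URL(urlB).origin;
}

var page = 'http://localhost:8080/index.html'; // assumed dev server address
var login = 'http://www.nerddinner.com/Account/LogOn';

console.log(isSameOrigin(page, login)); // false
console.log(isSameOrigin(login, 'http://www.nerddinner.com/Dinners/My')); // true
```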

To get around this there is something called Cross-Origin Resource Sharing (CORS). With CORS the server can specify which sites are allowed to access its resources by sending the header Access-Control-Allow-Origin back to the browser. The browser then lets scripts from those sites read the response.

This is not going to work for us, because http://www.nerddinner.com does not return that header.
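
As a rough model of the check the browser performs, the function below decides whether a script may read a cross-origin response, looking only at the Access-Control-Allow-Origin value. It is heavily simplified (real CORS also involves preflight requests and further headers), and the function name and sample origins are illustrative assumptions.

```javascript
// Simplified model of the browser's decision, not a full CORS implementation.
// allowOriginHeader is the Access-Control-Allow-Origin response header value,
// or null/undefined when the server did not send one.
function canReadResponse(pageOrigin, allowOriginHeader, withCredentials) {
    if (!allowOriginHeader) {
        return false; // no header: the cross-origin response stays hidden
    }
    if (allowOriginHeader === '*') {
        return !withCredentials; // the wildcard is not accepted for requests with cookies
    }
    // Requests with cookies also need Access-Control-Allow-Credentials: true,
    // which this sketch ignores.
    return allowOriginHeader === pageOrigin;
}

var pageOrigin = 'http://localhost:8080'; // assumed origin of our own page

console.log(canReadResponse(pageOrigin, null, false)); // false (no header is sent, as with nerddinner.com)
console.log(canReadResponse(pageOrigin, '*', false)); // true
console.log(canReadResponse(pageOrigin, 'http://localhost:8080', true)); // true
```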

But, wait, don't go away yet, we can still do this!

First of all, our hybrid mobile app will be loaded into a web view on the mobile device. This web view doesn't implement the same-origin policy so we won't have this problem on the device.

But it's so much easier to use your desktop browser during development and debugging, so how do we get Chrome to bypass that security check? It's actually very easy: you just need to start Chrome with some additional flags to disable the security check.

On OS X:

$ open -a Google\ Chrome --args --disable-web-security

On Windows:

chrome.exe --user-data-dir="C:/Temp/Chrome" --disable-web-security

You'll know that it's started in security-disabled mode when you see a yellow message under the address bar saying: You are using an unsupported command-line flag: --disable-web-security. Stability and security will suffer.

If you don't see this message, make sure you close all other Chrome windows and try again.

OK, so let's try logging in to the nerddinner website again. This time we get no error and we're logged in!

Cookies

Now let's do a GET request to http://www.nerddinner.com/Dinners/My to get the list of dinners created by the logged-in user.

Add the following code to dinnerService:

getMyDinners: function () {

    var parseDinners = function (response) {
    }

    return $http.get('http://www.nerddinner.com/Dinners/My')
                .then(parseDinners);
}

When we log in again and look at the Network tab, we can see that this request is redirected to the login page. What's happening here is that the server doesn't know that we're logged in.

Have a look at the Response Headers from the login POST request. There was a cookie sent back after the login: ASPXAUTH. This is the cookie an ASP.NET website uses to determine whether the user is authenticated.

Now have a look at the Request Headers on our GET request for http://www.nerddinner.com/Dinners/My. As you can see, the cookie is not being sent back to the server in our GET request.

To make sure that the cookie is sent on all requests we include the following code:

myApp.config(function ($httpProvider) {
    $httpProvider.defaults.withCredentials = true;
});
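
Setting the default affects every request the app makes. As an alternative sketch, $http also accepts withCredentials in the config object of a single request, so only the calls to the scraped site carry cookies. The helper name below is hypothetical; it only builds the config object.

```javascript
// Hypothetical helper that builds a per-request $http config object.
function withCookies(method, url) {
    return {
        method: method,
        url: url,
        withCredentials: true // send and accept cookies on this cross-origin request
    };
}

var dinnersRequest = withCookies('GET', 'http://www.nerddinner.com/Dinners/My');
console.log(dinnersRequest);

// Inside dinnerService this would be used as:
//     return $http(dinnersRequest).then(parseDinners);
```

The login POST needs the flag as well, otherwise the browser ignores the cookie in its response.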

Parsing

OK, so now we can actually get the dinners list and parse it. If you have a look at the source of the http://www.nerddinner.com/Dinners/My page, you'll see that the list of dinners has class="upcomingdinners". We'll use this to find the list and parse it into an array of dinners.

getMyDinners: function () {

    var parseDinners = function (response) {

        // Parse the returned HTML in a separate, detached document
        // instead of injecting it into the live page
        var tmp = document.implementation.createHTMLDocument('');
        tmp.body.innerHTML = response.data;

        // Every <li> inside the element with class="upcomingdinners" is one dinner
        var items = $(tmp.body.children).find('.upcomingdinners li');

        var dinners = [];
        for (var i = 0; i < items.length; i++) {
            // The link text holds the dinner's name, the <strong> element its date
            var dinner = {
                Name: $(items[i]).children('a')[0].innerText,
                Date: $(items[i]).children('strong')[0].innerText
            };
            dinners.push(dinner);
        }

        return dinners;
    };

    return $http.get('http://www.nerddinner.com/Dinners/My')
                .then(parseDinners);
}
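
For comparison, here is a sketch of the same extraction without jQuery, using only the standard querySelectorAll and textContent. It assumes the structure the code above relies on: each li under .upcomingdinners holds the name in an a element and the date in a strong element. Unlike children('a'), querySelector also matches nested descendants. The function name extractDinners is my own.

```javascript
// Takes a parsed document (for example the one from createHTMLDocument)
// and returns an array of { Name, Date } objects.
function extractDinners(doc) {
    var items = doc.querySelectorAll('.upcomingdinners li');
    var dinners = [];
    for (var i = 0; i < items.length; i++) {
        var link = items[i].querySelector('a');
        var date = items[i].querySelector('strong');
        if (!link || !date) {
            continue; // skip list items that don't have the expected markup
        }
        dinners.push({
            Name: link.textContent.trim(),
            Date: date.textContent.trim()
        });
    }
    return dinners;
}
```

In parseDinners this would replace the jQuery loop: return extractDinners(tmp);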

Let's load index.html again and log in. Now you can see the two dinners from the My Dinners section loaded into the page!

Mobile App

We haven't actually built a mobile app here yet. I'll do a follow-up post explaining how to use this code in a hybrid mobile app built with the Ionic Framework.

For those of you who already know how to do that, make sure you whitelist the domains you're accessing.

So that's all, folks! Go ahead and create your web scraping hybrid app, and let me know in the comments what kind of kick-ass app you built!

WRITTEN BY
Ashteya Biharisingh

Full stack developer who likes to build mobile apps and read books.