I want to start off by saying that we only scrape our own account, because my company needs data from our own dashboard that we can't get from the MWS APIs. I am very familiar with those APIs.

I've had login/scraping scripts for years, but recently Amazon started presenting captchas. My old approach made cURL requests from PHP to mimic a browser.

My new approach is using PhantomJS and CasperJS to achieve the same effect. Everything was working fine for a day, but I'm getting captcha again.

Now, I happen to know from internal sources that Amazon isn't doing any scrape detection. They do, however, do hacking / DDoS attack detection. So I think something about this CasperJS code is getting flagged as an attack.

I don't think I'm calling the script too often, and I've changed the IP address that the requests come from.

Here is the CasperJS code:

var fs = require('fs');
var casper = require('casper').create({
    pageSettings: {
        loadImages: false,
        loadPlugins: false,
        userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36'
    }
});

// Restore cookies saved by a previous run, if the file exists
var cookieFilename = "cookies/_cookies.txt";
if (fs.exists(cookieFilename)) {
    var data = fs.read(cookieFilename);
    if (data) {
        phantom.cookies = JSON.parse(data);
    }
}

//First step is to open Amazon
casper.start("https://sellercentral.amazon.com/gp/homepage.html", function() {
    console.log("Amazon website opened");
});

casper.wait(1000, function() {
    if(this.exists("form[name=signinWidget]")) {
        console.log("need to login");
        //Now we have to populate username and password, and submit the form
        casper.wait(1000, function(){
            console.log("Login using username and password");
            this.evaluate(function(){
                document.getElementById("username").value="*****";
                document.getElementById("password").value="*****";
                document.querySelector("form[name=signinWidget]").submit();
            });
        });
        // Write the cookies so the next run can reuse the session
        casper.wait(1000, function() {
            var cookies = JSON.stringify(phantom.cookies);
            fs.write(cookieFilename, cookies, 'w');
        });
    } else {
        console.log("already logged in");
    }
});


//Wait to be redirected to the Home page, then check for the login form and dump the page content
casper.wait(1000, function(){
    console.log("is login found?");
    console.log(this.exists("form[name=signinWidget]"));
    this.echo(this.getPageContent());
});

casper.run();

The result of that last line is just a login page with captcha. What gives? This should be a normal browser. When I use the same login on my computer, I get no issues at all.

I've also tried several different user agent strings. Sometimes changing those works temporarily.

Also, when I run all this locally, it works fine. But on the Linux server it gets the captcha. Note that I've changed the IP on the remote Linux server many times. It still gets the captcha.

Sean Clark
  • Have you tried using cookie jar to persist cookies? – Vaviloff Dec 04 '15 at 00:40
  • I've got the --cookies-file option when running casper. A cookie jar doesn't make sense for cURL, since it's just invoking the node script on the other server. That's the one that should retain the cookies. It doesn't seem like it is, even though cookies are being written. – Sean Clark Dec 04 '15 at 04:07
  • "cookie jar doesn't make sense for cURL since it's just invoking the node script on the other server" Sorry, I didn't catch that. My reasoning is that to imitate a real browser as closely as possible, you have to support cookies. Also, if you request more than one page, the script should pause between them as if a human were opening those pages. And if all else fails, there is one last desperate option: outsourcing captcha solving. – Vaviloff Dec 04 '15 at 04:18
  • Unless I'm doing something wrong in the above code, cookies are supported. I'm passing the cookie flag to phantom, and cookies are being written. I'll try doing some waiting to see what happens. – Sean Clark Dec 04 '15 at 04:44
  • It gets even weirder. I installed everything locally. Same exact script (with now working cookies thanks to http://stackoverflow.com/questions/15907800/how-to-persist-cookies-between-different-casperjs-processes/16954187#comment55915536_16954187). The remote one (with a new IP) gets the captcha, but my local Mac version is just fine. What the hell – Sean Clark Dec 04 '15 at 05:23
  • "They do however do hacking / DDOS attack detection". Well, maybe your server IP range is just untrusted? – Vaviloff Dec 04 '15 at 05:42
  • @Vaviloff this was the answer. You should leave it as such so I can mark it – Sean Clark Dec 06 '15 at 20:00
  • Thanks, Sean, I've placed the suggestion as the answer. – Vaviloff Dec 07 '15 at 03:47
  • Sorry to place this here, but maybe one of you can help answer my question? It's very closely related. Thanks! http://stackoverflow.com/questions/35852091/amazon-login-phantom-js – cracka31 Mar 10 '16 at 22:27
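
The pause suggestion from the comments above could be tried with something like the following. This is only a minimal sketch, not code from the original script: `randomDelay` is a hypothetical helper name, and the 2000 to 6000 ms range is an arbitrary assumption. It is written in plain ES5 so it can run under PhantomJS.

```javascript
// Hypothetical helper (not part of CasperJS): pick a whole number of
// milliseconds between minMs and maxMs, inclusive, so that waits between
// page loads are not always exactly the same length.
function randomDelay(minMs, maxMs) {
    return minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
}

// Possible use inside a CasperJS script, assuming `casper` was created
// as in the question:
//
// casper.wait(randomDelay(2000, 6000), function () {
//     this.echo("continuing after a randomized pause");
// });
```

Replacing the fixed `casper.wait(1000, ...)` calls with randomized waits only varies the timing. As the rest of this thread shows, it does nothing about an IP range that is already distrusted.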

1 Answer

As often happens with scraping and automation, the cause of an error is not necessarily an incorrectly written script; it can also be the context, such as the underlying infrastructure.

In this case we determined (in the comments) that the script was challenged with a captcha only when run from a particular server, whose IP address seems to have been put on an untrusted list.

Vaviloff
  • To elaborate for other users: in the past, changing the IP worked. But every IP I tried in the Rackspace data center was blocked, so it was the whole range. I moved the server to a new data center AND made my script much more human-like, so as not to get blocked in the future. – Sean Clark Dec 07 '15 at 07:15
  • In my case, I was logging in every time and not using cookies correctly, since the CasperJS command-line flag doesn't work. I had to save and restore cookies myself using fs.write. – Sean Clark Dec 07 '15 at 07:16
  • Interesting remark about the CasperJS command-line flag not working. Is it a bug in the macOS version, or maybe an old Casper version? – Vaviloff Dec 07 '15 at 07:52
  • I'm not sure, I found the answer on another SO post. But it worked when I manually wrote the cookies and parsed them back in myself – Sean Clark Dec 07 '15 at 08:24
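
For readers who want to try the manual cookie save and restore described in these comments, here is a minimal sketch. The helper names `loadCookies` and `saveCookies` are hypothetical, and the `fs` argument is assumed to be PhantomJS's own `fs` module (the one exposing `exists`, `read`, and `write`), not Node's. A missing, empty, or malformed cookie file is treated as "no saved cookies" so the script can fall through to a normal login.

```javascript
// Hypothetical helpers. `fs` is assumed to be PhantomJS's fs module,
// i.e. require('fs') inside a PhantomJS/CasperJS script.

// Return the cookie array stored at `path`, or an empty array if the
// file is missing, empty, or does not contain a JSON array.
function loadCookies(fs, path) {
    if (!fs.exists(path)) {
        return [];
    }
    var text = fs.read(path);
    if (!text) {
        return [];
    }
    try {
        var parsed = JSON.parse(text);
        return Array.isArray(parsed) ? parsed : [];
    } catch (e) {
        return [];
    }
}

// Serialize the cookie array to JSON and overwrite the file at `path`.
function saveCookies(fs, path, cookies) {
    fs.write(path, JSON.stringify(cookies), 'w');
}

// Possible use in a CasperJS script (path taken from the question):
//
// var fs = require('fs');
// phantom.cookies = loadCookies(fs, "cookies/_cookies.txt");
// ... open the page and log in only if the sign-in form is present ...
// saveCookies(fs, "cookies/_cookies.txt", phantom.cookies);
```

Passing `fs` in as a parameter is a design choice for the sketch: it keeps the helpers free of globals, so they can be exercised outside PhantomJS with a stand-in object.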