
I need to scrape the content of a page that is protected by HTTP basic authentication. The site also uses SSL. What I have written so far:

Document document = Jsoup.connect("https://someuser:somepassword@somedomain.com").get();

But it doesn't work. Also tried:

Document document = Jsoup
                    .connect("https://somedomain.com").get();
                    .header("Authorization", "Basic " + base64login)
                    .get();

Where base64login is:

 private String title;
String username = "someuser";
String password = "somepass";
String login = username + ":" + password;
public String base64login = Base64.encodeToString(login.getBytes(), Base64.DEFAULT);

I don't know how to get it working. Can somebody help me?

Weizen

1 Answer


Without the URL it is hard to tell, but my guess is that your platform's default charset does not match what the web server expects. Try encoding the credentials explicitly as UTF-8:

public String base64login = new String(
    Base64.encodeBase64(login.getBytes(Charset.forName("UTF-8")))
    );

This uses the org.apache.commons.codec.binary.Base64 methods.

The login should be done as in your second approach, i.e. with the "Authorization" header. By the way, that snippet contains an error: header must be called before get, and the line with connect must not end in .get(); with its semicolon, because that ends the statement before the header is set.
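As a rough sketch of the header-based approach, the snippet below builds the Basic credentials with an explicit UTF-8 charset. It assumes java.util.Base64 is available (Java 8+, or Android API 26+) in place of the Commons Codec call; the BasicAuth class and its header method are hypothetical names introduced only for illustration, and the Jsoup call is shown as a comment because it needs the jsoup dependency and a real URL.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper (not part of Jsoup) that builds the value of an
// HTTP Basic "Authorization" header.
final class BasicAuth {
    private BasicAuth() {}

    static String header(String username, String password) {
        String login = username + ":" + password;
        // Use an explicit charset so the bytes do not depend on the platform default.
        byte[] raw = login.getBytes(StandardCharsets.UTF_8);
        // The basic encoder returns a single line without line separators.
        return "Basic " + Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        System.out.println(header("someuser", "somepass"));
        // Assumed usage with Jsoup: header(...) comes before get(),
        // and only one statement-ending semicolon appears, after get():
        // Document document = Jsoup.connect("https://somedomain.com")
        //         .header("Authorization", header("someuser", "somepass"))
        //         .get();
    }
}
```

If android.util.Base64 is used instead, as in the question, it may be worth trying the Base64.NO_WRAP flag, since Base64.DEFAULT appends a line break to the encoded output, which may not be accepted in a header value.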

luksch
  • Anyway the real error is org.jsoup.HttpStatusException: HTTP error fetching URL. Status=401 – Weizen Feb 20 '16 at 08:27
  • The 401 error hints towards a wrong user/pw combination. See http://www.checkupdown.com/status/E401.html – luksch Feb 20 '16 at 09:50
  • I'm pretty sure they're correct; I log in with those credentials via the web and it works. – Weizen Feb 20 '16 at 10:09
  • Maybe you need to provide a correct userAgent? Or some cookie? – luksch Feb 20 '16 at 10:17
  • I am providing a correct userAgent. I don't know what to do with cookies. Anyway, I don't think this is the problem. – Weizen Feb 20 '16 at 10:20
  • I can't reproduce this without having an account with the site. Good luck in finding the real culprit. – luksch Feb 20 '16 at 10:26