My answer applies only to form-based authentication (the most common kind of authentication on websites).
Basically, when you browse a website, you open a "session" on it. When you log in on the website, your session gets "authenticated" and you're granted access everywhere based on that.
Your browser identifies the corresponding session to the server thanks to a Session Id stored in a cookie.
So you must request the login page first, then the page you want, sending the cookie back with every request. The cookie is what links all the pages you browse into one session.
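If you would rather let cURL do the bookkeeping, its cookie jar options can carry the session from one request to the next. This is only a minimal sketch: fetchWithCookies() and the jar path are names I made up for illustration, and it assumes the site tracks the session with cookies alone.

```php
<?php
// Hypothetical helper: fetch $url while persisting cookies in the file $cookieJar.
function fetchWithCookies(string $url, string $cookieJar, array $post = [])
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow the redirect that usually comes after a login
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieJar); // cookies to send are read from this file
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);  // cookies received are written to this file
    if (!empty($post)) {
        // Setting POSTFIELDS to a string sends a url-encoded POST request.
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post));
    }
    $body = curl_exec($ch); // string on success, false on failure
    unset($ch);             // releasing the handle flushes the cookies to the jar
    return $body;
}
```

Calling it twice with the same jar file, first with the login fields and then with the URL of the protected page, should keep the session alive between the two requests.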
I faced the same problem a while ago and wrote a class that handles the cookie for me.
You can skim the class itself; the usage example further down is what matters. The class can also submit forms that implement CSRF protection.
This class has the following features:
- Works with token-based CSRF protection, by resubmitting the form's input fields
- Sends a "common" User-Agent. Some websites reject requests that don't send one
- Sends a Referer header. Some websites reject requests that don't send one (this is another anti-CSRF protection)
- Keeps the cookie across calls
File: WebClient.php
<?php
/**
 * WebClient
 *
 * Helper class to browse the web
 *
 * @author Bgi
 */
class WebClient
{
    private $ch;
    private $cookie = '';
    private $html;
    public function Navigate($url, $post = array())
    {
        curl_setopt($this->ch, CURLOPT_URL, $url);
        curl_setopt($this->ch, CURLOPT_COOKIE, $this->cookie);
        if (!empty($post)) {
            curl_setopt($this->ch, CURLOPT_POST, TRUE);
            curl_setopt($this->ch, CURLOPT_POSTFIELDS, $post);
        } else {
            // Reset to GET; otherwise the handle keeps POSTing after an earlier form submission.
            curl_setopt($this->ch, CURLOPT_HTTPGET, TRUE);
        }
        $response = $this->exec();
        if ($response['Code'] !== 200) {
            return FALSE;
        }
        //echo curl_getinfo($this->ch, CURLINFO_HEADER_OUT);
        return $response['Html'];
    }
    public function getInputs()
    {
        $return = array();
        $dom = new DOMDocument();
        @$dom->loadHtml($this->html);
        $inputs = $dom->getElementsByTagName('input');
        foreach ($inputs as $input) {
            if ($input->hasAttributes() && $input->attributes->getNamedItem('name') !== NULL) {
                $name = $input->attributes->getNamedItem('name')->value;
                $value = $input->attributes->getNamedItem('value');
                // Inputs without a value attribute are kept, with a NULL value.
                $return[$name] = ($value !== NULL) ? $value->value : NULL;
            }
        }
        return $return;
    }
    public function __construct()
    {
        $this->init();
    }
    public function __destruct()
    {
        $this->close();
    }
    private function init()
    {
        $this->ch = curl_init();
        curl_setopt($this->ch, CURLOPT_USERAGENT, "Mozilla/6.0 (Windows NT 6.2; WOW64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1");
        curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($this->ch, CURLOPT_MAXREDIRS, 5);
        curl_setopt($this->ch, CURLINFO_HEADER_OUT, TRUE);
        curl_setopt($this->ch, CURLOPT_HEADER, TRUE);
        curl_setopt($this->ch, CURLOPT_AUTOREFERER, TRUE);
    }
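    private function exec()
    {
        $headers = array();
        $html = '';
        // Return the response as a string instead of printing it.
        curl_setopt($this->ch, CURLOPT_RETURNTRANSFER, TRUE);
        $output = curl_exec($this->ch);
        $retcode = curl_getinfo($this->ch, CURLINFO_HTTP_CODE);
        if ($output !== FALSE && $retcode == 200) {
            // CURLINFO_HEADER_SIZE covers every header block, including those of followed redirects.
            $headerSize = curl_getinfo($this->ch, CURLINFO_HEADER_SIZE);
            $html = substr($output, $headerSize);
            $lines = explode("\n", trim(substr($output, 0, $headerSize)));
            foreach ($lines as $line) {
                // Split on the first colon only, so values that contain colons (URLs, dates) are kept.
                $kv = explode(':', $line, 2);
                if (count($kv) == 2) {
                    $headers[trim($kv[0])] = trim($kv[1]);
                }
            }
        }
        // TODO: it would deserve to be tested extensively (only the last Set-Cookie header is kept).
        if (!empty($headers['Set-Cookie'])) {
            // Send back only the "name=value" part, not attributes such as path or HttpOnly.
            $this->cookie = strtok($headers['Set-Cookie'], ';');
        }
        $this->html = $html;
        return array('Code' => $retcode, 'Headers' => $headers, 'Html' => $html);
    }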
    private function close()
    {
        curl_close($this->ch);
    }
}
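The cookie handling in exec() keeps a single Set-Cookie header, which is what the TODO is about. A site that sets several cookies would probably need something closer to the sketch below. It assumes you have the raw response header text at hand; buildCookieHeader() is a hypothetical helper, not part of the class above, and it ignores cookie attributes such as expiry or path.

```php
<?php
// Hypothetical helper: turn raw response headers into a value for the Cookie request header.
function buildCookieHeader(string $rawHeaders): string
{
    $cookies = [];
    foreach (explode("\n", $rawHeaders) as $line) {
        $line = trim($line);
        // Header names are case-insensitive, so match "set-cookie:" as well.
        if (stripos($line, 'Set-Cookie:') !== 0) {
            continue;
        }
        // Keep only the leading "name=value"; drop attributes such as Path or HttpOnly.
        $cookie = trim(substr($line, strlen('Set-Cookie:')));
        $nameValue = trim(explode(';', $cookie, 2)[0]);
        if (strpos($nameValue, '=') === false) {
            continue;
        }
        list($name, $value) = explode('=', $nameValue, 2);
        $cookies[$name] = $value; // a later cookie with the same name replaces the earlier one
    }
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return implode('; ', $pairs);
}
```

The returned string could be passed to CURLOPT_COOKIE in place of the raw Set-Cookie value.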
How do I use it?
In this example, I log in to a website, then browse to a page that contains a file upload form, then upload the file:
<?php
require_once('WebClient.php');
// $username, $passwd and $file are assumed to be defined before this point.
$url = 'http://example.com/administrator/index.php'; // This is a Joomla admin URL
$wc = new WebClient();
$page = $wc->Navigate($url);
if ($page === FALSE) {
    die('Failed to load login page.');
}
echo('Logging in...');
// Start from the form's own inputs so that the CSRF token is sent back.
$post = $wc->getInputs();
$post['username'] = $username;
$post['passwd'] = $passwd;
$page = $wc->Navigate($url, $post);
if ($page === FALSE) {
    die('Failed to post credentials.');
}
echo('Initializing installation...');
$page = $wc->Navigate($url.'?option=com_installer');
if ($page === FALSE) {
    die('Failed to access installer.');
}
echo('Installing...');
$post = $wc->getInputs();
// CURLFile (PHP 5.5+) replaces the old '@'.$file syntax, which PHP 7 no longer supports.
$post['install_package'] = new CURLFile($file);
$page = $wc->Navigate($url.'?option=com_installer&view=install', $post);
if ($page === FALSE) {
    die('Failed to upload file.');
}
echo('Done.');
The Navigate() method returns FALSE when the final response status is not 200, and the HTML content of the page otherwise.
Oh, and one last thing: don't use regexes to parse HTML; this is WRONG. There is a legendary StackOverflow answer about that: see here.
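To make the alternative concrete, here is a minimal sketch that pulls hidden fields (where CSRF tokens usually live) out of a page with DOMDocument and DOMXPath instead of a regex. extractHiddenInputs() is a hypothetical name, and the sketch assumes the fields are written as type="hidden" in lowercase.

```php
<?php
// Hypothetical helper: return the hidden form fields of an HTML page as name => value.
function extractHiddenInputs(string $html): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally; real-world HTML is rarely well-formed.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    $xpath = new DOMXPath($dom);
    // Select every <input> that is hidden and has a name attribute.
    foreach ($xpath->query('//input[@type="hidden" and @name]') as $input) {
        // getAttribute() returns an empty string when the attribute is missing.
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}
```

Merging the result into the POST array before submitting a form should send the token back, much like getInputs() does in the class.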