I need to scrape some data from a website that requires a login first. I can log in using curl; here is my login code:

// COOKIE_FILE is assumed to be a constant, defined elsewhere, that holds the cookie jar path
$login = 'https://example.com/login';
$ch = curl_init($login);

curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEJAR => COOKIE_FILE,
    CURLOPT_COOKIEFILE => COOKIE_FILE
]);

$response = curl_exec($ch);     
$re = '/<input type="hidden" name="csrf" value="(.*?)" \/>/m';

preg_match_all($re, $response, $matches, PREG_SET_ORDER, 0);

$arr = array(
    'email' => 'email@example.com',
    'password' => 'Password123',
    'csrf' => $matches[0][1]
);

curl_setopt_array($ch, [
    CURLOPT_URL => $login,
    CURLOPT_USERAGENT => 'Mozilla/5.0',
    CURLOPT_POST => true,
    CURLOPT_POSTFIELDS => http_build_query($arr),
    CURLOPT_COOKIEJAR => COOKIE_FILE,
    CURLOPT_COOKIEFILE => COOKIE_FILE,
    CURLOPT_FOLLOWLOCATION => true
]);

curl_exec($ch);
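
The regex in the login code only matches when the hidden input has exactly that attribute order and spacing. A minimal sketch of reading the same field with DOMDocument/DOMXPath is below. `extractCsrf` is a hypothetical helper name, and it assumes the field is still called `csrf`:

```php
<?php
/**
 * Return the value of the hidden "csrf" input found in $html,
 * or null when no such input exists.
 * (extractCsrf is a hypothetical helper, not part of any library.)
 */
function extractCsrf(string $html): ?string
{
    // DOMDocument::loadHTML() rejects an empty string on PHP 8.
    if ($html === '') {
        return null;
    }

    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $node  = $xpath->query("//input[@name='csrf']")->item(0);

    return $node instanceof DOMElement ? $node->getAttribute('value') : null;
}
```

With this in place, `'csrf' => extractCsrf($response)` could stand in for `$matches[0][1]`, and a `null` result signals that the login page did not contain the field.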

After logging in, I have to scrape 70-100 pages. I do that with a foreach loop on the same handle, but it takes forever. Here is my code:

$arr = [
    [
        'id' => '1',
        'csrf' => $matches[0][1] //same csrf as in login
    ],[
        'id' => '2',
        'csrf' => $matches[0][1] //same csrf as in login
    ],[
        ...
    ],[
        'id' => '100',
        'csrf' => $matches[0][1] //same csrf as in login
    ]
];

foreach($arr as $v){
    curl_setopt_array($ch,[
        CURLOPT_URL => 'https://example.com/submit',
        CURLOPT_USERAGENT => 'Mozilla/5.0',
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => http_build_query($v),
        CURLOPT_FOLLOWLOCATION => true
    ]);

    $return = curl_exec($ch);
    $info = curl_getinfo($ch);
    
    //do something with the returned data   
}

But when I try to use curl_multi, I can't keep the login alive and I get a 405 HTTP status code.

Is there a way to use a single curl handle for the login and curl_multi for the scraping? Thank you!

EDIT: Here is the code I'm using for curl_multi (I found it here on Stack Overflow):

function multiRequest($data, $options = array()) {
 
  // array of curl handles
  $curly = array();
  // data to be returned
  $result = array();
 
  // multi handle
  $mh = curl_multi_init();
 
  // loop through $data and create curl handles
  // then add them to the multi-handle
  foreach ($data as $id => $d) {
 
    $curly[$id] = curl_init();
 
    $url = (is_array($d) && !empty($d['url'])) ? $d['url'] : $d;
    curl_setopt($curly[$id], CURLOPT_URL, $url);
    curl_setopt($curly[$id], CURLOPT_USERAGENT, 'Mozilla/5.0');
    curl_setopt($curly[$id], CURLOPT_RETURNTRANSFER, 1);
 
    // post?
    if (is_array($d)) {
      if (!empty($d['post'])) {
        curl_setopt($curly[$id], CURLOPT_POST, true);
        curl_setopt($curly[$id], CURLOPT_POSTFIELDS, http_build_query($d['post']));
        curl_setopt($curly[$id], CURLOPT_FOLLOWLOCATION, true);
      }
    }
 
    // extra options?
    if (!empty($options)) {
      curl_setopt_array($curly[$id], $options);
    }
 
    curl_multi_add_handle($mh, $curly[$id]);
  }
 
  // execute the handles, waiting for socket activity instead of busy-looping
  $running = null;
  do {
    curl_multi_exec($mh, $running);
    if ($running > 0) {
      curl_multi_select($mh);
    }
  } while($running > 0);
 
 
  // get content and remove handles
  foreach($curly as $id => $c) {
    $result[$id] = curl_multi_getcontent($c);
    curl_multi_remove_handle($mh, $c);
  }
 
  // all done
  curl_multi_close($mh);
 
  return $result;
}
 
Emanuel Ones
  • `if i'm trying to use multi_curl i can't keep the login alive` - did you enable cookie sharing between the curl handle that logged in and the other curl handles? Otherwise, only the handle that logged in will actually be logged in. For cookie sharing, read up on [curl_share_init()](http://php.net/manual/en/function.curl-share-init.php) (although honestly, I have several times opted to clone the cookies instead of using actual cookie sharing). Also show us the curl_multi code you tried that didn't work; people **often** get curl_multi code wrong, and maybe we can point out what went wrong. – hanshenrik Feb 23 '19 at 09:25
  • @hanshenrik, I've added the curl_multi code that I'm using (I found it here on Stack Overflow) – Emanuel Ones Feb 23 '19 at 10:06
  • That curl_multi function is dangerous; add a max connections limit to it. If you have a list of 100,000 URLs, it will attempt to create 100,000 curl handles and 100,000 simultaneous connections, and you'll run out of socket handles or memory and crash. But did you dump the cookies of the handle that logged in and give them to the curl_multi handles? – hanshenrik Feb 23 '19 at 11:47
  • And by the way, [do not parse HTML with regex](https://stackoverflow.com/a/1732454/1067003), use DOMDocument/DOMXPath instead. ```$re = '/<input type="hidden" name="csrf" value="(.*?)" \/>/m'; preg_match_all($re, $response, $matches, PREG_SET_ORDER, 0);``` should actually be ```$domd = new DOMDocument(); @$domd->loadHTML($response); $xp = new DOMXPath($domd); $csrf = $xp->query("//input[@name='csrf']")->item(0)->getAttribute("value");``` – hanshenrik Feb 23 '19 at 11:50
  • I've managed to use your suggestion, `curl_share_init()`, and now it works. Thank you @hanshenrik! You might as well add it as an answer so I can mark it as accepted – Emanuel Ones Feb 23 '19 at 13:00
  • FYI it's __scrape__ (and __scraping__, __scraped__, __scraper__) not scrap – DisappointedByUnaccountableMod Apr 23 '21 at 09:31
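
Since no answer was posted, here is a rough sketch of what the `curl_share_init()` approach from the comments might look like, combined with the connection limit that was also suggested. The function names `makeCookieShare`, `makeSharedPost` and `multiPostShared` are made up for illustration, the default limit of 10 is arbitrary, and the code assumes PHP 7.3+ with the curl extension. It is not the exact code the asker ended up with.

```php
<?php
/**
 * Create a share handle so that every easy handle attached to it
 * reads and writes one common in-memory cookie store.
 */
function makeCookieShare()
{
    $share = curl_share_init();
    curl_share_setopt($share, CURLSHOPT_SHARE, CURL_LOCK_DATA_COOKIE);
    return $share;
}

/**
 * Build one POST handle that takes part in the shared cookie store.
 */
function makeSharedPost($share, string $url, array $fields)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_SHARE          => $share,
        CURLOPT_COOKIEFILE     => '',   // empty string enables the cookie engine
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_USERAGENT      => 'Mozilla/5.0',
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
    ]);
    return $ch;
}

/**
 * POST every entry of $posts to $url through curl_multi,
 * keeping at most $maxParallel transfers in flight.
 * Returns the response bodies under the same keys as $posts.
 */
function multiPostShared($share, string $url, array $posts, int $maxParallel = 10): array
{
    $mh      = curl_multi_init();
    $pending = $posts;   // requests not started yet
    $active  = [];       // key => easy handle currently attached to $mh
    $results = [];
    $running = 0;

    do {
        // Top up the window of in-flight transfers.
        while (count($active) < $maxParallel && $pending) {
            $key = array_key_first($pending);
            $active[$key] = makeSharedPost($share, $url, $pending[$key]);
            unset($pending[$key]);
            curl_multi_add_handle($mh, $active[$key]);
        }

        curl_multi_exec($mh, $running);
        if ($running > 0) {
            curl_multi_select($mh, 1.0);   // wait for activity instead of spinning
        }

        // Collect every transfer that has finished so far.
        while ($done = curl_multi_info_read($mh)) {
            $ch  = $done['handle'];
            $key = array_search($ch, $active, true);
            $results[$key] = curl_multi_getcontent($ch);
            curl_multi_remove_handle($mh, $ch);
            unset($active[$key]);
        }
    } while ($active || $pending);

    curl_multi_close($mh);
    return $results;
}
```

For the session to carry over, the login handle has to use the same share: create `$share = makeCookieShare();` first, add `CURLOPT_SHARE => $share` to the options of the `$ch` that logs in, and only then call something like `multiPostShared($share, 'https://example.com/submit', $arr, 10)`. Results can arrive in any order, which is why they are stored under the original array keys.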

0 Answers