I'm new to OOP in PHP and I'm trying to work out how to structure this kind of application. The application scrapes about 100 different websites.
I have a main class, "Scrap", that holds the methods common to all the websites, and inside the folder "Scripts" I have the classes that handle the particular aspects of each website I'm scraping. I have another folder called "Lib" that contains external libraries.
To illustrate, this is my file layout:
- Scrap.php
+ Scripts
    - Google.php
    - Yahoo.php
    - Stackoverflow.php
+ Lib
    + libScrap
        - LIB_parse.php
    + phpQuery
        - phpQuery.php
        - other files and folders...
Scrap.php contains the following:
<?php
// Includes
require(__DIR__ . '/Lib/libScrap/LIB_parse.php'); // resolve relative to this file, not the filesystem root
require(__DIR__ . '/Lib/phpQuery/phpQuery.php');
// Testing Scrap
$testing = new Scrap;
$testing->teste = $testing->getPage('http://www.yahoo.com','','off');
echo $testing->teste;
class Scrap {
    public function __construct() {
        // do things!
    }
    /*
     * This method grabs the entire page (HTML) at the given URL.
     * Ex: $htmlgrab->teste = $htmlgrab->getPage('http://testing.com/ofertas/','','off');
     * Returns the HTML of the given URL.
     */
    public function getPage($site, $proxy, $proxystatus) {
        $ch = curl_init();
        // Return the response as a string instead of printing it,
        // so no output buffering is needed.
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        if ($proxystatus == 'on') {
            curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
            curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, TRUE);
            curl_setopt($ch, CURLOPT_PROXY, $proxy);
        }
        curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
        curl_setopt($ch, CURLOPT_URL, $site);
        $html = curl_exec($ch); // execute the request; FALSE on failure
        curl_close($ch);        // release the handle before returning
        return $html;
    }
    /*
     *
     *
     */
    public function getLinks() {
        // do things!
    }
    /*
     * This method grabs the page title.
     * Ex: <title>This is the page title</title>
     * Returns "This is the page title"
     */
    public function getTitle() {
        // do things!
    }
}
?>
And inside the folder "Scripts" I will have files like this one:
<?php
require_once(__DIR__ . '/../Scrap.php'); // require_once: Scrap must not be declared twice
class Yahoo extends Scrap {
    public function doSomething() {
        // do things!
    }
}
?>
Final note: I need to instantiate all the classes defined in the folder "Scripts" in order to scrape the websites. My question is what the best way to instantiate about 100 classes is.
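To make the requirement concrete, here is a minimal sketch of one possible way to load and instantiate every class in a folder. It assumes that each file defines exactly one class named after the file (Yahoo.php defines Yahoo) and that constructors take no arguments. The function name loadScrapers is hypothetical and not part of the code above.

```php
<?php
/**
 * Hypothetical helper: loads every *.php file in $dir and instantiates the
 * class named after the file, keeping only subclasses of $baseClass.
 * Returns an array of instances keyed by class name.
 */
function loadScrapers(string $dir, string $baseClass): array
{
    $scrapers = [];
    // glob() can return false on error, so fall back to an empty list.
    $files = glob(rtrim($dir, '/') . '/*.php') ?: [];
    foreach ($files as $file) {
        require_once $file;
        // Assumption: file name without extension equals the class name.
        $class = basename($file, '.php');
        if (class_exists($class, false) && is_subclass_of($class, $baseClass)) {
            $scrapers[$class] = new $class();
        }
    }
    return $scrapers;
}

// Possible usage from a separate entry script next to Scrap.php:
//   foreach (loadScrapers(__DIR__ . '/Scripts', 'Scrap') as $name => $scraper) {
//       $scraper->doSomething();
//   }
```

With this shape, adding a new website only means dropping another file into "Scripts"; no central list of class names has to be maintained. Whether that convention suits the real project is an open design choice.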
I would appreciate any pointers on how to design this.
Best regards,
Sorry for my bad English.