I have an HTML page with a structure like this:

<div id="1">
  <div id="2">
    <div id="3">
      <div id="4">
        <div id="5">   
          <div id="photo">    
            <a id="photo" href="link">
              <img width="200" src="http://site.com/photo.jpg"> 
            </a> 
          </div>
          <div id="info"></div>
        </div>
      </div> 
    </div> 
  </div> 
</div> 

I need to get the img URL (http://site.com/...).

My code:

include('simple_html_dom.php');

// Create a DOM object from a URL
$html = file_get_html('http://site.com/123');


// Find all img tags with width="200" and print each src
foreach($html->find('img[width="200"]') as $e)
    echo $e->src . '<br>';

But it doesn't work for this site.
Maybe there is another way to get the image URL?

Mikhail Vladimirov
jhenya-d
  • you cannot have two elements having same id. please correct that first. – Kumar Saurabh Sinha Mar 06 '13 at 10:09
  • @SaurabhSinha - semantically true but I don't think simple-html-dom cares as it parses a flat file and will simply return the first occurrence. – Emissary Mar 06 '13 at 10:14
  • OP: What site? Are you sure the HTML your script is being served is the same as the HTML you are being served? `file_get_html` (I think) uses the native `file_get_contents`, which in turn sends a raw request with no extra headers; the likes of Facebook, for example, won't give you the content you are expecting without the User-Agent being specified. Can you `echo $html` and double-check that this is the structure you are expecting? – Emissary Mar 06 '13 at 10:17
  • for wordpress site pages it works correct – jhenya-d Mar 06 '13 at 10:17
  • page is like http://vk.com/durov – jhenya-d Mar 06 '13 at 10:20
  • There is no error in the code above, but there may be a problem fetching the data through file_get_contents(), or the URL may return data that is not HTML. – TheFoxLab Mar 06 '13 at 10:33

3 Answers

Should probably be $html->find('img[width=200]') without extra quotes around 200.

Mikhail Vladimirov
  • @user2054164 Your example works fine for me even with quotes around `200`. Can you insert `print_r ($html)` right after `$html = file_get_html('http://site.com/123');` and post here what does it print? – Mikhail Vladimirov Mar 06 '13 at 10:23
  • It prints nothing. The site is the main page of a vk.com user, like vk.com/durov – jhenya-d Mar 06 '13 at 10:31

As expected, the site serves different content based on the User-Agent. To get the HTML that you are expecting, you need to let the server know you want the "for browsers" version. For example, you could remove this line:

$html = file_get_html('http://vk.com/durov');

... and replace it with something like this:

$context = stream_context_create(array('http' => array(
  'header' => 'User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17'
)));
$html = str_get_html( file_get_contents('http://vk.com/durov', false, $context) );

I should note that the practice of spoofing the User-Agent is generally frowned upon and you should perhaps run this to see if the information contained suits your needs:

<?php
  header('Content-type: text/plain');
  echo file_get_contents('http://siteurl.com');

That script shows the source code that the site wants bots to see. For the site in question this is a lightweight version of the page, which from your point of view takes less time to process.
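
The empty result reported in the comments suggests it is worth checking what actually came back before handing it to the parser. Here is a minimal sketch of that idea, assuming PHP 7.1 or later; the function names `make_context` and `fetch_page` and the 10-second timeout are my own choices for illustration, not part of simple_html_dom:

```php
<?php
// Hypothetical helper: builds a stream context that sends the given User-Agent.
function make_context(string $userAgent)
{
    return stream_context_create(array('http' => array(
        'method'  => 'GET',
        'header'  => "User-Agent: $userAgent\r\n",
        'timeout' => 10, // seconds; an arbitrary value chosen for this sketch
    )));
}

// Hypothetical helper: returns the response body, or null when the request fails.
function fetch_page(string $url, string $userAgent): ?string
{
    // @ suppresses the warning so the caller can handle the failure explicitly.
    $body = @file_get_contents($url, false, make_context($userAgent));
    return $body === false ? null : $body;
}
```

With this in place you could call `fetch_page('http://vk.com/durov', $userAgent)`, stop early when it returns null or an empty string, and only then pass the body to `str_get_html`.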

Emissary
  • @user2054164 definitely works mate - I've just run [this](http://pastebin.com/2p6qAcBQ) which printed a single line `http://cs7003.vk.me/v7003685/1ddd/jZ8LZcwYN20.jpg` \[[link](http://cs7003.vk.me/v7003685/1ddd/jZ8LZcwYN20.jpg)\] – Emissary Mar 06 '13 at 11:03
  • You are right, I just get a lightweight page without the tags that I need; that's why I get an empty result. Is there a way to get the desired result? – jhenya-d Mar 06 '13 at 11:04
  • But if I need to be logged in to see a user profile, it returns an empty result. Is it possible to add authorization to this code? – jhenya-d Mar 06 '13 at 11:13
  • It's certainly possible. You're better off using cURL to fetch pages with the necessary POST requests and to store cookie data etc. That is another question in itself though; there are plenty of Stack Overflow examples on that topic, [like this](http://stackoverflow.com/questions/4873783/how-to-login-to-another-site-via-php). – Emissary Mar 06 '13 at 11:19

You could use a regular expression to find it, for example:

<?php 
$string = '
<div id="1">
  <div id="2">
    <div id="3">
      <div id="4">
        <div id="5">   
          <div id="photo">    
            <a id="photo" href="link">
              <img width="200" src="http://site.com/photo.jpg"> 
            </a> 
          </div>
          <div id="info"></div>
        </div>
      </div> 
    </div> 
  </div> 
</div> ';

// Match from "http" up to the next double quote (the first URL in the string)
$pattern = '/http[^"]+/';
preg_match($pattern, $string, $matches);
print_r($matches);

prints:

Array
(
    [0] => http://site.com/photo.jpg
)
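
Note that the pattern above returns the first `http...` string anywhere in the markup, so an absolute URL in an `href` would win over the image. A slightly stricter sketch ties the match to the `img` tag itself. The helper name `find_photo_src` is hypothetical, the sample string is a variant of the markup with an absolute `href` added, and the pattern assumes double-quoted attributes with `width` appearing before `src`, as in the question:

```php
<?php
// Hypothetical helper: returns the src of the first <img> with width="200", or null.
function find_photo_src(string $html): ?string
{
    // Assumes double-quoted attributes and that width comes before src.
    $pattern = '/<img\b[^>]*\bwidth="200"[^>]*\bsrc="([^"]+)"/i';
    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

// Sample markup with an absolute href, which the looser pattern would match first.
$sample = '<a id="photo" href="http://site.com/link">'
        . '<img width="200" src="http://site.com/photo.jpg"></a>';
echo find_photo_src($sample), "\n"; // prints http://site.com/photo.jpg
```

This is still a regular expression run against HTML, so it will break if the attribute order or quoting changes; for anything beyond a quick check, a real parser is likely the safer choice.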
Oli