
I'm trying to extract the email address and the phone number from a LinkedIn profile using jsoup. Each of these pieces of information is in a table. I have written code to extract them, but it doesn't work. The code should work on any LinkedIn profile. Any help or guidance would be much appreciated.

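import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// The class name is arbitrary; it only wraps main() so the snippet compiles.
public class LinkedInScraper {
    public static void main(String[] args) {
        try {
            String url = "https://fr.linkedin.com/";
            // Fetch the document over HTTP
            Document doc = Jsoup.connect(url).get();

            // Print the page title
            String title = doc.title();
            System.out.println("Nom & Prénom: " + title);

            // First method: print the text of every link inside the tables
            Elements tables = doc.select("div[class=more-info defer-load]").select("table");
            for (Element link : tables.select("ul li a")) {
                System.out.println(link.text());
            }

            // Second method: print each row as "label:value"
            for (Element table : tables) {
                for (Element row : table.select("tr")) {
                    // Each row holds one <th> label and one <td> value,
                    // so reading a second <td> would throw IndexOutOfBoundsException
                    Elements ths = row.select("th");
                    Elements tds = row.select("td");
                    if (!ths.isEmpty() && !tds.isEmpty()) {
                        System.out.println(ths.get(0).text() + ":" + tds.get(0).text());
                    }
                }
            }
        } catch (IOException e) {
            // Jsoup.connect(...).get() throws IOException on network or HTTP errors
            e.printStackTrace();
        }
    }
}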

Here is an example of the HTML that I'm trying to extract the data from (taken from a LinkedIn profile):

<table summary="Coordonnées en ligne">
   <tr>
      <th>E-mail</th>
      <td>
         <div id="email">
            <div id="email-view">
               <ul>
                  <li>
                     <a href="mailto:adam1adam@gmail.com">adam1adam@gmail.com</a>
                  </li>
               </ul>
            </div>
         </div>
      </td>
   </tr>
   <tr class="no-contact-info-data">
      <th>Messagerie instantanée</th>
      <td>
         <div id="im" class="editable-item">
         </div>
      </td>
   </tr>
   <tr class="address-book">
      <th>Carnet d’adresses</th>
      <td>
         <span class="address-book">
         <a title="Une nouvelle fenêtre s’ouvrira" class="address-book-edit" href="/editContact?editContact=&contactMemberID=368674763">Ajouter</a> des coordonnées.
         </span>
      </td>
   </tr>
</table>
<table summary="Coordonnées">
   <tr>
      <th>Téléphone</th>
      <td>
         <div id="phone" class="editable-item">
            <div id="phone-view">
               <ul>
                  <li>0021653191431&nbsp;(Mobile)</li>
               </ul>
            </div>
         </div>
      </td>
   </tr>
   <tr class="no-contact-info-data">
      <th>Adresse</th>
      <td>
         <div id="address" class="editable-item">
            <div id="address-view">
               <ul>
               </ul>
            </div>
         </div>
      </td>
   </tr>
</table>
AMI

1 Answer


To scrape the email address and the phone number, use CSS selectors that target the elements' id attributes.

    String email = doc.select("div#email-view > ul > li > a").attr("href");
    System.out.println(email);

    String phone = doc.select("div#phone-view > ul > li").text();   
    System.out.println(phone);

See CSS Selectors for more information.

Output

mailto:adam1adam@gmail.com
0021653191431 (Mobile)
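The values above still carry the `mailto:` scheme and the `(Mobile)` label. If only the bare address and number are wanted, a small post-processing step could look like the sketch below. The class and method names (`ContactCleaner`, `emailFromHref`, `phoneNumber`) are made up for this illustration and use only the standard library. The non-breaking-space handling is an assumption, in case the `&nbsp;` from the markup comes through as U+00A0 instead of a plain space.

```java
class ContactCleaner {

    // Drops a leading "mailto:" scheme (case-insensitive) and any "?..." query part.
    static String emailFromHref(String href) {
        String value = href.trim();
        if (value.regionMatches(true, 0, "mailto:", 0, 7)) {
            value = value.substring(7);
        }
        int query = value.indexOf('?');
        return query >= 0 ? value.substring(0, query) : value;
    }

    // Turns non-breaking spaces into plain spaces, then removes a trailing
    // parenthesised label such as "(Mobile)".
    static String phoneNumber(String text) {
        String normalised = text.replace('\u00A0', ' ').trim();
        return normalised.replaceFirst("\\s*\\([^)]*\\)$", "");
    }

    public static void main(String[] args) {
        System.out.println(emailFromHref("mailto:adam1adam@gmail.com"));
        System.out.println(phoneNumber("0021653191431\u00A0(Mobile)"));
    }
}
```

With the two strings from the output above, this prints `adam1adam@gmail.com` and `0021653191431`.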
Graham
Zack
  • I don't get any errors running this code, but it also doesn't return any result! I can't figure out where the problem is. – AMI Jul 28 '16 at 13:40
  • The problem is that LinkedIn contact information is not available unless you're logged in. You will need to log in and pass the cookie when you connect to the profile. Here is an example: http://stackoverflow.com/questions/31640844/login-to-website-through-jsoup-post-method-not-working – Zack Jul 28 '16 at 14:01
  • I tried to do exactly what the example you gave me does, but it keeps returning nothing! The weird thing is that I can actually fetch other data, like the user's experience and education, so I don't think the problem is related to the cookie, is it? – AMI Jul 28 '16 at 15:50
  • If you're using Jsoup.connect(url).get() then you're not passing cookies, which is like visiting LinkedIn.com without logging in. You have to be logged in to see contact info. You can see other public data without being logged in, but not contact info. – Zack Jul 28 '16 at 15:57
  • I get your point, and I tried to follow the example you gave me, but it didn't work. I sent a GET request to [https://www.linkedin.com/uas/login], then I sent a POST request with my email address and LinkedIn password to [https://www.linkedin.com/nhome/], but I have no idea how to navigate from there to the profile that I want to fetch. – AMI Jul 28 '16 at 21:55
  • @AAA try this solution: http://stackoverflow.com/questions/31280097/login-into-linkedin-with-jsoup – Zack Jul 29 '16 at 13:45
  • you have to add .cookie(previousConnect) – Salman Salman Jul 29 '16 at 14:51
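The cookie flow discussed in these comments can be sketched roughly as follows. This is not a verified LinkedIn login: the class and method names (`LoggedInFetch`, `hiddenFields`, `fetchProfile`) are made up, and the URLs and form field names are passed in as parameters because the real ones have to be read from the login form's markup. The sketch only shows the general jsoup pattern: GET the login page, POST the credentials with the cookies received, then reuse the cookies for the profile request.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LoggedInFetch {

    // Copies every hidden <input> of the login form (CSRF tokens and the like)
    // so that they can be sent back together with the credentials.
    static Map<String, String> hiddenFields(Document loginPage) {
        Map<String, String> fields = new HashMap<>();
        for (Element input : loginPage.select("form input[type=hidden]")) {
            fields.put(input.attr("name"), input.attr("value"));
        }
        return fields;
    }

    // All URL and field-name arguments are assumptions supplied by the caller.
    static Document fetchProfile(String loginUrl, String submitUrl, String profileUrl,
                                 String emailField, String email,
                                 String passwordField, String password) throws IOException {
        // 1. GET the login page to receive the initial cookies and hidden fields.
        Connection.Response loginPage = Jsoup.connect(loginUrl)
                .method(Connection.Method.GET)
                .execute();

        Map<String, String> form = hiddenFields(loginPage.parse());
        form.put(emailField, email);
        form.put(passwordField, password);

        // 2. POST the credentials together with the cookies from step 1.
        Connection.Response loggedIn = Jsoup.connect(submitUrl)
                .cookies(loginPage.cookies())
                .data(form)
                .method(Connection.Method.POST)
                .execute();

        // 3. Request the profile with the cookies from both responses.
        Map<String, String> cookies = new HashMap<>(loginPage.cookies());
        cookies.putAll(loggedIn.cookies());
        return Jsoup.connect(profileUrl).cookies(cookies).get();
    }
}
```

The returned `Document` can then be queried with the selectors from the answer, for example `doc.select("div#email-view > ul > li > a").attr("href")`. Whether the contact tables are actually present depends on the login having succeeded, which this sketch does not check.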