I have no idea why this works only when I right-click the page, copy the entire body as HTML into a file, and parse that file. When I fetch the page directly from the link with requests, I get 0 results.

Ex.

Sample HTML

bsoup.py

When I use this code:

from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

cards = soup.select('#__nuxt [data-test="UpCLineClampV2"]')

for card in cards:
    for strong in card.findChildren('strong', recursive=True):
        print('================================================')
        print(strong.next_sibling)

Result: ✅ Working, I can see 3 cover letters

➜  web_scrap git:(master) ✗ python3 bsoup.py
================================================

    hi!
I have experience with sending commands to Alexa.
The thing is that you'll need to send audio file (not text) to Alexa service.
Next thing is to decide what to do with response, which can have a lot of options: audio, playlist, audio question, etc.
But it is possible

================================================

    Answer my questions please

================================================

    I am an Expert in Alexa development

bsoupv2.py

I tried fetching the page from a link with requests instead.

URL Link updated

from bs4 import BeautifulSoup
import requests

headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }

url = "https://www.bunlongheng.com/raw/NGIwMTUzOTQtMTFlNS00NjYzLTk3MjEtYmU0Yzg5OGU1Yzc4"
req = requests.get(url, headers)
soup = BeautifulSoup(req.content, 'html.parser')

soup = soup.find(['body'])
cards = soup.select('#__nuxt [data-test="UpCLineClampV2"]')

for card in cards:
    for strong in card.findChildren('strong', recursive=True):
        print('================================================')
        print(strong.next_sibling)

Result: ❌ Not working, I see 0 cover letters, nothing!

➜  web_scrap git:(master) ✗ python3 bsoupv2.py
➜  web_scrap git:(master) ✗
code-8
  • Opening https://www.upwork.com/ab/applicants/1578029486353379328/applicants I see only a login page. You have to log in first to see the content. – Andrej Kesely Oct 19 '22 at 15:03
  • Try using req.text instead of req.content, as req.content returns bytes. (Edit: ah yes, as the comment above suggests, you have to provide auth within the request.) – py9 Oct 19 '22 at 15:05
  • Does this answer your question? [How to scrape a website which requires login using python and beautifulsoup?](https://stackoverflow.com/questions/23102833/how-to-scrape-a-website-which-requires-login-using-python-and-beautifulsoup) – 0stone0 Oct 19 '22 at 15:09
  • @0stone0: If we leave the Upwork link out of the picture and use the link to the raw text I provided, which doesn't require login, I bet I will get the same result. – code-8 Oct 19 '22 at 15:51
  • @AndrejKesely, I can update the post, let's use the link from the sample. – code-8 Oct 19 '22 at 15:52
  • @code-8 See my answer. The content on the server contains html entities which need to be parsed (unescaped). – Andrej Kesely Oct 19 '22 at 15:59

1 Answer

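Try applying html.unescape to the response from www.bunlongheng.com. The content on the server contains HTML entities, so the markup has to be unescaped before the parser can see any tags: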

import html
import requests
from bs4 import BeautifulSoup

headers = {
    "Access-Control-Allow-Origin": "*",
    "Access-Control-Allow-Methods": "GET",
    "Access-Control-Allow-Headers": "Content-Type",
    "Access-Control-Max-Age": "3600",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
}


url = "https://www.bunlongheng.com/raw/NGIwMTUzOTQtMTFlNS00NjYzLTk3MjEtYmU0Yzg5OGU1Yzc4"
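# Pass headers by keyword: the second positional argument of requests.get is params.
req = requests.get(url, headers=headers)
# Unescape the HTML entities in the response before handing it to the parser.
soup = BeautifulSoup(html.unescape(req.text), "html.parser")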

cards = soup.select('#__nuxt [data-test="UpCLineClampV2"]')

for card in cards:
    for strong in card.findChildren("strong", recursive=True):
        print("================================================")
        print(strong.next_sibling)

Prints:

================================================

    hi!
I have experience with sending commands to Alexa.
The thing is that you'll need to send audio file (not text) to Alexa service.
Next thing is to decide what to do with response, which can have a lot of options: audio, playlist, audio question, etc.
But it is possible
  
================================================

    Answer my questions please
  
================================================

    I am an Expert in Alexa development
  
================================================

    hi!
I have experience with sending commands to Alexa.
The thing is that you'll need to send audio file (not text) to Alexa service.
Next thing is to decide what to do with response, which can have a lot of options: audio, playlist, audio question, etc.
But it is possible
  
================================================

    Answer my questions please
  
================================================

    I am an Expert in Alexa development
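
To see why the selector matched nothing before unescaping, here is a minimal sketch using only the standard library. The sample string and the TagCollector and start_tags names are made up for illustration and are not taken from the real page; the point is only that entity-escaped markup contains no actual tags for a parser to find.

```python
import html
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the name of every start tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def start_tags(markup):
    """Return the list of start tags found in the given markup."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()
    return parser.tags


# Hypothetical sample: markup served as escaped entities instead of real tags.
escaped = "&lt;div id=&quot;__nuxt&quot;&gt;&lt;strong&gt;Cover letter&lt;/strong&gt; hi!&lt;/div&gt;"

print(start_tags(escaped))                 # no tags: the parser sees plain text
print(start_tags(html.unescape(escaped)))  # the div and strong tags are now visible
```

A quick check along the same lines is to look at req.text for "&lt;" sequences: if they appear where you expect tags, the response needs html.unescape before parsing.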
  
Andrej Kesely