BeautifulSoup - Python package for parsing HTML and XML documents
Extract content from a URL
import requests
url = "https://timesofindia.indiatimes.com/city/delhi"
r = requests.get(url)
print(r.text)

Add a function to save r.text to a file
import requests

# Function to write r.text to a file
def fetchAndSaveToFile(url, path):
    r = requests.get(url)
    # Open the file in write mode; UTF-8 avoids encoding errors on non-ASCII text
    with open(path, "w", encoding="utf-8") as f:
        f.write(r.text)

url = "https://timesofindia.indiatimes.com/city/delhi"
# Call the save function (the data/ directory must already exist)
fetchAndSaveToFile(url, "data/times.html")

Use with proxies, here Oxylabs, to access the URL
import requests

# Oxylabs proxy; the same proxy endpoint is used for both schemes
proxies = {
    "http": "http://customer-rcodewithharry:ActcitXccR8xbxs@pr.oxylabs.io:7777",
    "https": "http://customer-rcodewithharry:ActcitXccR8xbxs@pr.oxylabs.io:7777"
}

def fetchAndSaveToFile(url, path):
    # Pass proxies so the request is actually routed through the proxy
    r = requests.get(url, proxies=proxies)
    with open(path, "w", encoding="utf-8") as f:
        f.write(r.text)

url = "https://timesofindia.indiatimes.com/city/delhi"
fetchAndSaveToFile(url, "data/times.html")

Services like Oxylabs proxies can route each request through a new IP address, so a single IP address is less likely to be blocked or to exhaust its request limit for a website.
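A minimal sketch of the proxy setup above, written so the credentials are not hard-coded. The helper names build_proxies and fetch_via_proxy, the timeout value, and the PROXY_USER / PROXY_PASS environment variables are assumptions for illustration, not part of the original notes or of any Oxylabs API. Only the proxies= and timeout= arguments of requests.get are used.

```python
import requests
from urllib.parse import quote


def build_proxies(username, password, host="pr.oxylabs.io", port=7777):
    """Build the proxies mapping that requests expects (hypothetical helper)."""
    # Percent-encode credentials so characters like '@' or ':' do not break the URL
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    proxy_url = f"http://{user}:{pwd}@{host}:{port}"
    # Both schemes go through the same proxy endpoint
    return {"http": proxy_url, "https": proxy_url}


def fetch_via_proxy(url, proxies, timeout=10):
    """Fetch url through the given proxies and return the response body as text."""
    r = requests.get(url, proxies=proxies, timeout=timeout)
    r.raise_for_status()  # raise an exception on 4xx/5xx instead of saving an error page
    return r.text


# Example usage (assumes PROXY_USER and PROXY_PASS are set in the environment):
# import os
# proxies = build_proxies(os.environ["PROXY_USER"], os.environ["PROXY_PASS"])
# html = fetch_via_proxy("https://timesofindia.indiatimes.com/city/delhi", proxies)
```

Building the mapping in one place keeps the "http" and "https" keys in sync, which avoids the kind of key typo that makes requests silently skip the proxy.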
Access an .html file using BeautifulSoup
import requests
from bs4 import BeautifulSoup

# Open the file in read mode
with open("sample.html", "r") as f:
    html_doc = f.read()

soup = BeautifulSoup(html_doc, 'html.parser')
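Once the soup object exists, it can be queried for tags. A short sketch follows, using an inline HTML string in place of sample.html so it runs on its own. The markup, the "intro" class, and the example.com links are made up for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in for the contents of sample.html (hypothetical markup)
html_doc = """
<html><head><title>Sample page</title></head>
<body>
<p class="intro">Hello</p>
<a href="https://example.com/a">First</a>
<a href="https://example.com/b">Second</a>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Text of the <title> tag
title = soup.title.string

# href attribute of every <a> tag, in document order
links = [a.get("href") for a in soup.find_all("a")]

# Text of the first <p> tag with class "intro"
intro = soup.find("p", class_="intro").get_text()

print(title)
print(links)
print(intro)
```

find_all returns every match as a list, while find returns only the first match (or None if nothing matches), so check the result of find before calling get_text on real pages.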