Whole-site crawling in Python is a very useful skill: it lets us fetch large amounts of data quickly and automatically. Below is a simple Python program that crawls an entire website:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import os

def download_page(url):
    # Fetch a page with a browser-like User-Agent and a 30-second timeout.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'}
    r = requests.get(url, headers=headers, timeout=30)
    r.encoding = 'utf-8'
    return r.text

def get_links(html):
    # Collect the href of every <a> tag, skipping tags that have none.
    soup = BeautifulSoup(html, 'lxml')
    links = []
    for i in soup.find_all('a'):
        link = i.get('href')
        if link:
            links.append(link)
    return links

def downloads_urls(root_url, save_path):
    # Download every .html/.htm page linked from the root page into save_path.
    if not os.path.exists(save_path):
        os.makedirs(save_path)
    html = download_page(root_url)
    links = get_links(html)
    for link in links:
        link = link.strip()
        if not link.startswith('http'):
            # Resolve relative links against the root URL.
            link = urljoin(root_url, link)
        if link.endswith('.html') or link.endswith('.htm'):
            content = download_page(link)
            filename = link.split('/')[-1]
            with open(os.path.join(save_path, filename), 'w', encoding='utf-8') as f:
                f.write(content)

if __name__ == '__main__':
    url = 'https://www.example.com'
    save_path = r'D:\example'
    downloads_urls(url, save_path)
The program above uses requests and BeautifulSoup to fetch page content and parse the page structure, then uses os and the built-in open function to save each page to disk.
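One detail worth noting: the loop above follows any absolute link it finds, including links that point off-site, so a "whole-site" crawl can easily wander onto other domains. A minimal sketch of a same-domain check, where the helper name same_domain is our own assumption and not part of the program above:

from urllib.parse import urljoin, urlparse

def same_domain(root_url, link):
    # Resolve the link against the root, then compare hostnames;
    # only links on the same host should be downloaded.
    resolved = urljoin(root_url, link)
    return urlparse(resolved).netloc == urlparse(root_url).netloc

For example, same_domain('https://www.example.com', '/about.html') is True, while same_domain('https://www.example.com', 'https://other.org/x.html') is False.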
Of course, this is only a simple example; in practice you also have to account for a site's anti-crawling measures, its structure and scale, and so on. Still, this small program is enough to get you started with whole-site crawling in Python.
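For instance, the program above only fetches pages one link away from the start page, and it fires requests as fast as it can. A common next step, sketched here under the assumption that the download_page and get_links functions above are available (the crawl_site name is ours), is a breadth-first crawl with a visited set (so no page is fetched twice), a page cap (so the crawl stays bounded on large sites), and a short delay between requests (to reduce the chance of triggering anti-crawling defences):

import time
import requests
from collections import deque
from urllib.parse import urljoin

def crawl_site(root_url, max_pages=100, delay=1.0):
    # Breadth-first crawl: a queue of pages to visit plus a set of pages
    # already seen, capped at max_pages so large sites stay manageable.
    visited = set()
    queue = deque([root_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = download_page(url)  # download_page() from the example above
        except requests.RequestException:
            continue                   # skip pages that fail to download
        for link in get_links(html):   # get_links() from the example above
            absolute = urljoin(url, link.strip())
            if absolute.startswith(root_url) and absolute not in visited:
                queue.append(absolute)
        time.sleep(delay)              # pause between requests to stay polite
    return visited

Real anti-crawling measures (JavaScript rendering, CAPTCHAs, IP rate limits) need more than this, but a visited set and a polite delay are the usual starting point.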