Original code:
import requests
import re

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.188"
}

for i in range(0, 226, 25):
    zxcvzx = f"https://movie.douban.com/top250?start={i}&filter="
    response = requests.get(zxcvzx, headers=headers)
    a = re.findall("<img width=\"100\" alt=\"(.*?)\"", response.text, re.S)
    for vbn in a:
        print(vbn)
Optimized version:
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.188"
}

session = requests.Session()
session.headers.update(headers)


def get_movie_titles(url):
    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    movie_titles = soup.find_all("img", width="100", alt=True)
    return [title["alt"] for title in movie_titles]


def main():
    for i in range(0, 226, 25):
        url = f"https://movie.douban.com/top250?start={i}&filter="
        movie_titles = get_movie_titles(url)
        for title in movie_titles:
            print(title)


if __name__ == "__main__":
    main()
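What was optimized:
1. A Session object from the requests library is used. It reuses the underlying connection across the ten page requests to the same host, and the request headers are set once on the session instead of being passed on every call.
2. BeautifulSoup from the bs4 library parses the HTML content instead of a regular expression, so the extraction no longer depends on the exact text layout of the `<img>` tag.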
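
To illustrate the second point without hitting the network, here is a minimal offline sketch. The sample markup, the `PosterAltCollector` class, and the helper function names are all made up for illustration, and the standard library's `html.parser` stands in for BeautifulSoup so the snippet runs without extra packages. It only assumes that a page might list an `<img>` tag's attributes in a different order.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup (not taken from the real page).
# The second <img> lists alt before width; the third is not a 100px poster.
SAMPLE_HTML = """
<div class="pic"><img width="100" alt="Movie A" src="a.jpg"></div>
<div class="pic"><img alt="Movie B" width="100" src="b.jpg"></div>
<div class="pic"><img width="50" alt="Not a poster" src="c.jpg"></div>
"""

# The same pattern the original code uses.
PATTERN = "<img width=\"100\" alt=\"(.*?)\""


class PosterAltCollector(HTMLParser):
    """Collects the alt text of every <img> whose width attribute is "100"."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs, so order is irrelevant
        attr_map = dict(attrs)
        if tag == "img" and attr_map.get("width") == "100" and "alt" in attr_map:
            self.titles.append(attr_map["alt"])


def titles_by_regex(html):
    """Extract titles by matching the literal tag text."""
    return re.findall(PATTERN, html, re.S)


def titles_by_parser(html):
    """Extract titles by parsing tags and inspecting their attributes."""
    collector = PosterAltCollector()
    collector.feed(html)
    collector.close()
    return collector.titles


regex_titles = titles_by_regex(SAMPLE_HTML)
parser_titles = titles_by_parser(SAMPLE_HTML)
print("regex :", regex_titles)
print("parser:", parser_titles)
```

On this sample the regular expression finds only "Movie A", because it requires `width` to come immediately before `alt`, while the parser-based version finds both "Movie A" and "Movie B". The `soup.find_all("img", width="100", alt=True)` call in the optimized version matches on attributes in the same order-independent way.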