
Scraping "The English We Speak" from Kekenet with a Python crawler

piaodoo · Programming Tutorials · 2020-02-22 22:16:27 · Python tutorial

Source: the 52pojie (吾爱破解) forum

Background:
I've been learning English recently and came across a series called The English We Speak. It isn't an especially popular resource: the audio is fairly easy to find, but the transcripts are hard to come by (bundled downloads are nearly nonexistent), and some sites require registration and paid credits. I eventually found that Kekenet hosts the transcripts online, but with 300-plus episodes and no batch download, I decided to scrape them with Python and typeset them into a PDF for convenience, grabbing the audio along the way. The site being scraped is http://m.kekenet.com/menu/14439/index.shtml


Analysis
1. Both the audio and the transcript text are in the page's HTML source.
2. The audio link can be extracted directly with BeautifulSoup.
3. The text is trickier: its position isn't fixed, so only the enclosing parent tag can be selected reliably. Within the needed <p> tags, the content of <span> tags must be discarded, <strong> tags must be stripped (keeping their text), and <br/> must become a newline.
4. To solve point 3, use regular expressions: convert the tag to a string, delete <span> tags together with their content, delete the <strong> tags, replace <br/> with newlines, parse the string back into a tag, loop over the <p> tags, take only their text with get_text(), and concatenate everything for saving.
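The cleaning steps above can be sketched on a hand-made sample paragraph (the sample HTML below is invented for illustration; the real pages use the same tag pattern). Here the final tag-stripping is done with a plain regex instead of BeautifulSoup's get_text(), just to keep the sketch self-contained:

```python
import re

# Invented sample <p> tag mimicking a transcript paragraph on Kekenet.
sample = ('<p><strong>Feifei:</strong> Hello!'
          '<span class="time">00:12</span><br/>Welcome.</p>')

# Step 1: delete <span>...</span> together with its content.
s = re.sub(r'<span.*?span>', '', sample, flags=re.S | re.I)
# Step 2: delete <strong> and </strong> tags but keep the text between them.
s = re.sub(r'<strong>|</strong>', '', s, flags=re.I)
# Step 3: turn <br/> line breaks into newlines.
s = re.sub(r'<br/>', '\n', s, flags=re.I)
# Step 4: strip any remaining tags, leaving plain text.
text = re.sub(r'<.*?>', '', s)
print(text)
```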


import requests
from bs4 import BeautifulSoup
import re
from tqdm import tqdm

headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36'
}
def remove_span_tag(tag):
    # Drop <span>...</span> tags together with their content.
    content = str(tag)
    treated_content = re.sub(r'<span.*?span>', '', content, flags=re.S | re.I)
    result = BeautifulSoup(treated_content, 'lxml')
    return result

def remove_strong_tag(tag):
    # Strip <strong> and </strong> tags but keep the text between them.
    # Note the closing tag is </strong>, not <strong/>.
    content = str(tag)
    treated_content = re.sub(r'<strong>|</strong>', '', content, flags=re.I)
    result = BeautifulSoup(treated_content, 'lxml')
    return result

def remove2next1(string):
    # Collapse double newlines into single newlines.
    treated_content = re.sub(r'\n\n', '\n', string)
    return treated_content

def change_br2next(tag):
    # Replace <br/> line breaks with newline characters.
    content = str(tag)
    treated_content = re.sub(r'<br/>', '\n', content, flags=re.I)
    result = BeautifulSoup(treated_content, 'lxml')
    return result

def get_html(url):
    # Fetch a detail page and return its HTML, or None on a non-200 response.
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        response.encoding = 'utf-8'
        return response.text

def parse_audio_text(html):
    # Extract the episode title, the mp3 link and the cleaned transcript text.
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select('div.f-title')[0].string
    audio = soup.select('#show_mp3 > audio')[0].source['src']
    content = soup.select('#content > div > div.infoMain > div.f-y.w.hauto')[0]
    texts = content.select('p')
    result = ''
    for text in texts:
        result += change_br2next(remove_strong_tag(remove_span_tag(text))).get_text()
    result_text = remove2next1(result)
    return title, audio, result_text

def parse_index(html):
    soup = BeautifulSoup(html,'lxml')
    links = soup.select('.listItem')
    srcs = []
    for link in links:
        src = link.select('a')[0]['href']
        src = 'http://m.kekenet.com'+ src
        srcs.append(src)
    return srcs

def get_index(url):
    # Same as get_html; fetches a list (index) page.
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        response.encoding = 'utf-8'
        return response.text

def save_text(title, content):
    # The with-statement closes the file automatically.
    with open(title + '.txt', 'a', encoding='utf-8') as f:
        f.write(content)

def downloadFILE(url,name):
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
    }
    resp = requests.get(url=url,stream=True,headers=headers)
    content_size = int(int(resp.headers['Content-Length'])/1024)
    with open(name, "wb") as f:
        print("Pkg total size is:",content_size,'k,start...')
        for data in tqdm(iterable=resp.iter_content(1024),total=content_size,unit='k',desc=name):
            f.write(data)
        print(name , "download finished!")

if __name__ == "__main__":
    for i in range(1, 24):
        url = 'http://m.kekenet.com/menu/14439/List_{}.shtml'.format(i)
        html = get_index(url)
        srcs = parse_index(html)
        print('list', i)
        for src in srcs:
            detail_html = get_html(src)
            title, audio, result_text = parse_audio_text(detail_html)
            # Pull the episode number out of the title, zero-padded to 3 digits;
            # fall back to the full title if the pattern doesn't match.
            match = re.search('第(.*?)期', title, re.S)
            if match:
                title = match.group(1).zfill(3)
            print(audio)
            print(result_text)
            save_text(title, result_text)
            downloadFILE(audio, title + '.mp3')
  
# All 24 list pages are linked from http://m.kekenet.com/menu/14439/index.shtml


Finally, I tidied up the layout in Word and generated a PDF of 400-plus pages, as shown:
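Before the Word step, the per-episode text files have to be combined into one document. The post doesn't show how; a minimal sketch, assuming the files were saved as zero-padded names like 001.txt in the current folder (as the main loop above produces), could be:

```python
import glob

# Merge every saved 001.txt ... NNN.txt into one file, in episode order,
# so the result can be opened in Word and exported as a PDF.
with open('all_episodes.txt', 'w', encoding='utf-8') as out:
    for path in sorted(glob.glob('[0-9][0-9][0-9].txt')):
        with open(path, encoding='utf-8') as f:
            out.write(f.read())
        out.write('\n\n')  # blank line between episodes
```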

[Screenshot: the english we speak pdf.PNG, uploaded 2019-11-11]



The scraped results are on a network drive; help yourself if you need them. Now I can finally learn English in peace.
the english we speak (BBC authentic English)
Link: https://pan.baidu.com/s/1OKO6wo1hQ1xEIOQYHd62lQ
Extraction code: 3k9t




Original link: http://www.piaodoo.com/8062.html
