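I have been listening to Ximalaya FM (喜马拉雅FM) for two months now. Listening to the stories there, I feel I can hear myself in them, especially on the Ruixi channel (蕊希电台): it starts with the voice, pulls you into the story, and stays true to its conclusion. Thanks to Ximalaya FM for keeping me company these two months. I suppose I love it too much, so I started going after FM itself. QAQ
This post is extracted and adapted from the following posts.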
https://www.jianshu.com/p/8ff95111b18a
https://www.imooc.com/article/48315
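Problems to solve
1. The .m4a file addresses are stored in JSON text. Approach: inspect the page with F12 (browser developer tools) and read the data with json.loads.
2. Download all the audio files of other hosts as well.
3. Sort the downloaded files into categories. Approach: extract the host ID and pass it between requests with meta.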
III. First, a look at the results
I. Extracting the page source
1.1_Extract the trackId: "https://www.ximalaya.com/qinggan/321787/130991924"
1.2_Extract the IDs of other hosts
1.3_trackIds of all of a host's tracks: "http://www.ximalaya.com/revision/album/getTracksList?albumId=321787&pageNum=13"
1.4_Extract the .m4a file: https://www.ximalaya.com/revision/play/tracks?trackIds=35217881
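The two JSON endpoints in 1.3 and 1.4 differ only in their IDs, so the URLs can be assembled by small helpers. This is a minimal sketch: the function names are hypothetical, and the URL patterns are copied from the examples above rather than from any official API documentation.

```python
from urllib.parse import urlencode

# URL patterns copied from sections 1.3 and 1.4 (not an official API description).
TRACK_LIST_URL = "http://www.ximalaya.com/revision/album/getTracksList"
PLAY_INFO_URL = "https://www.ximalaya.com/revision/play/tracks"


def track_list_url(album_id, page_num):
    """Hypothetical helper: URL of one page of an album's track list."""
    return TRACK_LIST_URL + "?" + urlencode({"albumId": album_id, "pageNum": page_num})


def play_info_url(track_id):
    """Hypothetical helper: URL of the JSON that holds a track's .m4a address."""
    return PLAY_INFO_URL + "?" + urlencode({"trackIds": track_id})


print(track_list_url(321787, 13))
print(play_info_url(35217881))
```

With the IDs from the examples above, the two printed URLs match the ones listed in 1.3 and 1.4.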
II. Code setup: I will not go into detail on middlewares.py, settings.py, and items.py; see my earlier posts for those.
2.1_pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from os.path import basename

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.files import FilesPipeline


class XimaPipeline(FilesPipeline):

    def get_media_requests(self, item, info):
        # One download request per item; meta carries what file_path needs.
        yield scrapy.Request(
            item['m4_urls'],
            meta={'file_name': item['file_name'], 'm4_urls': item['m4_urls']},
        )

    def file_path(self, request, response=None, info=None, *, item=None):
        # The request received here is the one yielded by get_media_requests.
        meta = request.meta
        # Return a path relative to FILES_STORE (the pipeline prepends
        # FILES_STORE itself): <file_name>/<file name taken from the URL>.
        return meta['file_name'] + '/' + basename(meta['m4_urls'])

    def item_completed(self, results, item, info):
        # Keep the item only if at least one file was downloaded.
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem("Item contains no files")
        return item
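The folder layout produced by file_path (one folder per ID, with the file name taken from the download URL) can be tried outside Scrapy. The sketch below is only an illustration: `storage_path` is a hypothetical helper, the example.com address is a placeholder rather than a real Ximalaya URL, and stripping the query string is my own assumption about what a download URL might carry.

```python
import posixpath
from urllib.parse import urlparse


def storage_path(album_id, m4a_url):
    """Hypothetical helper: return '<album_id>/<file name>' for a download URL."""
    # Use only the path part of the URL, so a query string never leaks into the file name.
    file_name = posixpath.basename(urlparse(m4a_url).path)
    # Forward slashes keep the relative path identical on every operating system.
    return posixpath.join(str(album_id), file_name)


# Placeholder URL, made up for this example.
print(storage_path(321787, "https://example.com/audio/abc123.m4a?sign=1"))
# prints 321787/abc123.m4a
```

Building the path from separate components avoids hard-coding a Windows-only backslash separator.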
2.2_Spider code
# -*- coding: utf-8 -*-
import json
import random

import scrapy

from Xima.items import XimaItem
from Xima.settings import USER_AGENT


class XimaSpider(scrapy.Spider):
    name = 'xima'
    allowed_domains = ['www.ximalaya.com']
    start_urls = ['https://www.ximalaya.com/revision/seo/hotWordAlbums?id=321787&queryType=1']
    # Defined as a class attribute; the requests below do not pass it explicitly.
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Connection': 'keep-alive',
        'Content-Length': '11',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Host': 'www.ximalaya.com',
        'Origin': 'www.ximalaya.com',
        'Referer': 'https://www.ximalaya.com/revision/seo/hotWordAlbums?id=321787&queryType=1',
        'User-Agent': random.choice(USER_AGENT),
        'X-Requested-With': 'XMLHttpRequest',
    }

    def start_requests(self):
        yield scrapy.Request(self.start_urls[0], callback=self.parse_1)

    def parse_1(self, response):
        # Every related album found here gets its first 20 track-list pages requested.
        for each_url in json.loads(response.body)['data']['hotWordAlbums']:
            for i in range(20):
                new_url = ('http://www.ximalaya.com/revision/album/getTracksList?albumId='
                           + str(each_url['id']) + '&pageNum=' + str(i))
                # 'trackid' carries the album ID; it later becomes the folder name.
                yield scrapy.Request(new_url, callback=self.parse,
                                     meta={'trackid': str(each_url['id'])})

    def parse(self, response):
        # Parse the JSON once; pages past the end of an album have no tracks.
        tracks = json.loads(response.body)['data']['tracks']
        if tracks:
            for sel in tracks:
                stackids = sel['trackId']
                yield scrapy.Request(
                    'https://www.ximalaya.com/revision/play/tracks?trackIds=%s' % stackids,
                    callback=self.m4a,
                    meta={'trackid': response.meta['trackid']},
                )

    def m4a(self, response):
        # The .m4a address sits in the first entry of tracksForAudioPlay.
        src = json.loads(response.body)['data']['tracksForAudioPlay'][0]['src']
        if src:
            xima = XimaItem()
            xima['file_name'] = response.meta['trackid']
            xima['m4_urls'] = src
            yield xima
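The two JSON lookups the spider relies on can be checked offline before running a crawl. The sketch below is only an illustration: the payloads are made up and contain just the keys the spider reads (`data`, `tracks`, `trackId`, `tracksForAudioPlay`, `src`), the real responses presumably hold many more fields, and `track_ids` and `audio_src` are hypothetical helper names.

```python
import json

# Made-up payloads shaped like the keys the spider reads; values are placeholders.
TRACK_LIST_BODY = '{"data": {"tracks": [{"trackId": 111}, {"trackId": 222}]}}'
EMPTY_PAGE_BODY = '{"data": {"tracks": []}}'
PLAY_BODY = '{"data": {"tracksForAudioPlay": [{"src": "https://example.com/a/111.m4a"}]}}'


def track_ids(body):
    """Return the trackId of every track on one track-list page (empty list if none)."""
    data = json.loads(body).get("data") or {}
    return [track["trackId"] for track in data.get("tracks") or []]


def audio_src(body):
    """Return the first track's audio URL, or None when it is missing."""
    data = json.loads(body).get("data") or {}
    plays = data.get("tracksForAudioPlay") or []
    return plays[0].get("src") if plays else None


print(track_ids(TRACK_LIST_BODY))   # prints [111, 222]
print(track_ids(EMPTY_PAGE_BODY))   # prints []
print(audio_src(PLAY_BODY))         # prints https://example.com/a/111.m4a
```

Returning an empty list or None instead of raising makes it easy to skip pages beyond the end of an album, which matters here because the spider requests a fixed 20 pages for each album.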