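I. Preface
Many tutorials online rely on OCR. That approach has the advantage of being general: it works for any quiz show. Its drawbacks are just as clear: it is slow, and third-party OCR services limit the number of calls. The data endpoint described in this tutorial makes the data easy to fetch and is fast, but the endpoint changes over time and has to be updated promptly.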
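II. Walkthrough
1. Background
百万英雄 (Million Hero) is a quiz app that has recently become very popular. Players who answer all 12 questions correctly split the prize pool. The prizes are decent; I have taken part a few times, but only ever won small amounts, a few yuan at most. For easy questions whose answers can be found directly on Baidu, a tool that shows a reference answer in real time is a real comfort. So let's get started!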
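2. Sneak Peek
Here is the deployed result first. The server back end does the processing, the front end displays the result, and the measured delay is about 3 s:
Why build it this way? Because then other people can open it in a browser too; sharing the fun beats keeping it to yourself!
GitHub repository: https://github.com/Jack-Cherish/python-spider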
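3. Capturing Packets from the Xigua Video App
I assume you already know how to capture packets; I covered it in detail in my mobile app packet-capture tutorial. If not, please read that first: http://blog.csdn.net/c406495762/article/details/76850843
During a live quiz, packet capture turns up an endpoint like this:
The capture shows the request parameters. The value after heartbeat is a number that increases from one show to the next. Others, such as iid and device_id, are per-user information. At the very end of the URL is the rticket parameter, a timestamp that can be simulated with time.time().
Update, 2018-1-17: according to a friend, the only parameters in the URL that matter are heartbeat and rticket; the user information can be left out.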
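A minimal sketch of how such a URL might be assembled, assuming that only heartbeat and rticket matter, as noted above. The base URL `API_BASE` and the function name `build_api_url` are hypothetical placeholders (the real endpoint has to come from your own capture), and treating heartbeat as an ordinary query parameter is also an assumption. The millisecond resolution follows the `int(time.time()*1000)` used in the script in this post.

```python
import time
from urllib.parse import urlencode

# Hypothetical placeholder: replace with the endpoint found in your own capture.
API_BASE = "https://example.invalid/api"


def build_api_url(heartbeat, now=None):
    """Return the polling URL with a fresh millisecond timestamp as rticket."""
    if now is None:
        now = time.time()
    # Assumption: heartbeat is passed as a plain query parameter.
    params = {"heartbeat": heartbeat, "rticket": int(now * 1000)}
    return API_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_api_url(1))
```

Building the URL on every request, instead of once at start-up, keeps the rticket timestamp current for each poll.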
Note: the endpoint returns data only while a live quiz is in progress. When no show is live, no usable data comes back; the response is garbled.
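Because usable data only arrives during a live show, it helps to treat "no question found" as a normal outcome and not as an error. The sketch below is one hedged way to do that. It assumes, as the script in this post does, that the question text starts two characters after the first `*` and ends at the first question mark; accepting both the full-width and the ASCII question mark is an extra assumption. The function name `extract_question` is made up for illustration.

```python
def extract_question(raw):
    """Return the question text, or None when the response holds no question."""
    start = raw.find("*")
    if start == -1:
        return None  # no marker: probably no live show
    # Accept either a full-width or an ASCII question mark after the marker.
    ends = [i for i in (raw.find("?", start), raw.find("?", start)) if i != -1]
    if not ends:
        return None
    # Skip the marker and the one character that follows it (assumed layout).
    question = raw[start + 2:min(ends)]
    return question or None
```

A polling loop can then sleep and retry whenever the function returns None.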
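Fetch the data through this endpoint, parse it, and then search for the question on Baidu Zhidao: simple and efficient. With this idea in place, we can start writing code.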
# -*- coding: utf-8 -*-
"""
This code was written in a hurry. I meant to refactor it and polish the
comments before publishing, but I was busy and gave up on that, so please
improve it yourself. The style is far from tidy; please bear with it.
Author: Jack Cui
Website: https://cuijiahua.com
Note: this software is for learning and exchange only; do not use it for
any commercial purpose!
"""
import os
import re
import time
import urllib.parse

import requests
from bs4 import BeautifulSoup  # the 'lxml' parser used below needs the lxml package


class BaiWan():
    def __init__(self):
        # Baidu Zhidao search endpoint
        self.baidu = 'http://zhidao.baidu.com/search?'
        # Million Hero endpoint. Everyone's URL is different and contains phone
        # information, so it is not published here; capture your own.
        # Questions are welcome at https://cuijiahua.com/liuyan.html
        self.api = 'https://api-spe-ttl.ixigua.com/xxxxxxx={}'.format(int(time.time()*1000))

    # Fetch the current question and parse it
    def get_question(self):
        while True:
            if not os.path.exists('question.txt'):
                with open('question.txt', 'w', encoding='utf-8') as fw:
                    fw.write('百万英雄尚未出题请稍后!')
            # Poll until a question different from the last one shows up
            while True:
                req = requests.get(self.api, verify=False)
                req.encoding = 'utf-8'
                html = req.text
                print(html)
                if '*' not in html:
                    # No question in the response (for example, no live show): stop
                    return
                question_start = html.index('*')
                try:
                    # Try the full-width question mark first, then the ASCII one
                    question_end = html.index('?')
                except ValueError:
                    question_end = html.index('?')
                question = html[question_start:question_end][2:]
                if not question:
                    return
                with open('question.txt', 'r', encoding='utf-8') as fr:
                    text = fr.readline()
                if text != question:
                    print(question)
                    with open('question.txt', 'w', encoding='utf-8') as f:
                        f.write(question)
                    break
                time.sleep(1)
            # Split what follows the question into the three options
            temp = re.findall(r'[\u4e00-\u9fa5a-zA-Z0-9\+\-\*/]', html[question_end+1:])
            b_index = []
            print(temp)
            for index, each in enumerate(temp):
                if each == 'B':
                    b_index.append(index)
                elif each == 'P' and (len(temp) - index) <= 3:
                    b_index.append(index)
                    break
            if len(b_index) == 4:
                a = ''.join(temp[b_index[0] + 1:b_index[1]])
                b = ''.join(temp[b_index[1] + 1:b_index[2]])
                c = ''.join(temp[b_index[2] + 1:b_index[3]])
                alternative_answers = [a, b, c]
                # "Which of the following" questions search better with the options included
                if '下列' in question:
                    question = a + ' ' + b + ' ' + c + ' ' + question.replace('下列', '')
                elif '以下' in question:
                    question = a + ' ' + b + ' ' + c + ' ' + question.replace('以下', '')
            else:
                alternative_answers = []
            # Search for the answer using the question and the options
            self.search(question, alternative_answers)
            time.sleep(1)

    def search(self, question, alternative_answers):
        print(question)
        print(alternative_answers)
        infos = {"word": question}
        # Call the Baidu endpoint
        url = self.baidu + 'lm=0&rn=10&pn=0&fr=search&ie=gbk&' + urllib.parse.urlencode(infos, encoding='GB2312')
        print(url)
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36', }
        sess = requests.Session()
        req = sess.get(url=url, headers=headers, verify=False)
        req.encoding = 'gbk'
        bf = BeautifulSoup(req.text, 'lxml')
        answers = bf.find_all('dd', class_='dd answer')
        for answer in answers:
            print(answer.text)
        # Recommended answer
        recommend = ''
        if alternative_answers != []:
            # For each search result, record the first option it mentions
            best = []
            print('\n')
            for answer in answers:
                for each_answer in alternative_answers:
                    if each_answer in answer.text:
                        best.append(each_answer)
                        print(each_answer, end=' ')
                        print('\n')
                        break
            # Count how often each option was mentioned
            statistics = {}
            for each in best:
                if each not in statistics:
                    statistics[each] = 1
                else:
                    statistics[each] += 1
            # Negated questions: recommend the first option that no result mentions
            errors = ['没有', '不是', '不对', '不正确', '错误', '不包括', '不包含', '不在', '错']
            error_list = list(map(lambda x: x in question, errors))
            print(error_list)
            if sum(error_list) >= 1:
                for each_answer in alternative_answers:
                    if each_answer not in statistics:
                        recommend = each_answer
                        print('推荐答案:', recommend)
                        break
            elif statistics != {}:
                recommend = sorted(statistics.items(), key=lambda e: e[1], reverse=True)[0][0]
                print('推荐答案:', recommend)
        # Write the result to file.txt for the web front end
        with open('file.txt', 'w', encoding='utf-8') as f:
            f.write('问题:' + question)
            f.write('\n')
            f.write('*' * 50)
            f.write('\n')
            if alternative_answers != []:
                f.write('选项:')
                for each_answer in alternative_answers:
                    f.write(each_answer)
                    f.write(' ')
                f.write('\n')
                f.write('*' * 50)
                f.write('\n')
            f.write('参考答案:\n')
            for answer in answers:
                f.write(answer.text)
                f.write('\n')
            f.write('*' * 50)
            f.write('\n')
            if recommend != '':
                f.write('最终答案请自行斟酌!\t')
                f.write('推荐答案:' + recommend)


if __name__ == '__main__':
    bw = BaiWan()
    bw.get_question()
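That is all there is to fetching the data and looking up answers. It is simple, and the code is rather messy; experienced developers can rework it along these lines.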
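As one possible starting point for such a rework, the recommendation step can be pulled out into a small pure function that is easy to test without any network access. This is a sketch, not the author's implementation: the names `recommend` and `NEGATIONS` are illustrative, and `snippets` stands for the answer texts already scraped from Baidu Zhidao. The negation words are copied from the script above.

```python
from collections import Counter

# Negation words copied from the script above.
NEGATIONS = ['没有', '不是', '不对', '不正确', '错误', '不包括', '不包含', '不在', '错']


def recommend(question, options, snippets):
    """Pick the most-mentioned option, or an unmentioned one for negated questions."""
    votes = Counter()
    for snippet in snippets:
        for option in options:
            if option in snippet:
                votes[option] += 1
                break  # count at most one option per snippet, as the script does
    if any(word in question for word in NEGATIONS):
        # Negated question: the option nobody mentions is the likely answer.
        for option in options:
            if option not in votes:
                return option
        return ''
    if votes:
        return votes.most_common(1)[0][0]
    return ''
```

Keeping this logic free of I/O means the scraping code and the voting rule can change independently.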
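4. Website Deployment
I had never done back-end or front-end work; I spent a day learning as I went. The JavaScript was also picked up on the spot: programs found via Baidu that I only debugged. There are probably many clumsy or incorrect usages, so please bear with me; anyone interested is welcome to improve it.
These are some of the articles I read at the time:
Node.js and Socket.IO communication basics: 菜鸟学习nodejs–Socket.IO即时通讯 (a beginner's guide to Socket.IO instant messaging in Node.js)
Reading a txt file line by line in Node.js: Line-Reader
Scheduled tasks in Node.js: Node-Schedule
Back end, app.js: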
var http = require('http');
var fs = require('fs');
var schedule = require('node-schedule');
var lineReader = require('line-reader');

var MAX_LINES = 25;   // number of <p> slots in index.html
var message = {};

// Serve index.html for every request
var server = http.createServer(function (req, res) {
    fs.readFile('./index.html', function (error, data) {
        if (error) {
            res.writeHead(500, {'Content-Type': 'text/plain'});
            res.end('Failed to read index.html');
            return;
        }
        res.writeHead(200, {'Content-Type': 'text/html'});
        res.end(data, 'utf-8');
    });
}).listen(80);
console.log('Server running!');

// Read file.txt line by line into line1..line25; unused slots are set to 'f'.
// eachLine is asynchronous, so the padding is done once the last line arrives.
function messageGet() {
    var count = 0;
    var fresh = {};
    lineReader.eachLine('file.txt', function (line, last) {
        count++;
        if (count <= MAX_LINES) {
            console.log(line);
            fresh['line' + count] = line;
        }
        if (last) {
            for (var i = count + 1; i <= MAX_LINES; i++) {
                fresh['line' + i] = 'f';
            }
            message = fresh;
        }
    });
}

var io = require('socket.io').listen(server);

// Run messageGet every second (valid values for second are 0-59)
var rule = new schedule.RecurrenceRule();
var times = [];
for (var i = 0; i < 60; i++) {
    times.push(i);
}
rule.second = times;
schedule.scheduleJob(rule, function () {
    messageGet();
});

// Send the latest message whenever a browser connects
io.sockets.on('connection', function (socket) {
    socket.emit('users', message);
    socket.broadcast.emit('users', message);
    socket.on('disconnect', function () {
        console.log('User disconnected');
    });
});
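The payload that app.js sends is a plain object with the keys line1 through line25, where unused slots hold the sentinel 'f'. Since the rest of the project is Python, here is the same padding rule as a small Python sketch. The names `build_message`, `MAX_LINES`, and `EMPTY` are illustrative and not part of the original code.

```python
MAX_LINES = 25   # the page has 25 <p> slots, line1 .. line25
EMPTY = 'f'      # sentinel that the front end renders as an empty string


def build_message(lines):
    """Map up to MAX_LINES lines to keys 'line1'..'line25', padding with EMPTY."""
    message = {}
    for i in range(1, MAX_LINES + 1):
        message['line%d' % i] = lines[i - 1] if i <= len(lines) else EMPTY
    return message
```

Anything beyond the 25th line is dropped, because the page has nowhere to show it.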
Front end, index.html:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta http-equiv="refresh" content="2">
<title>Jack Cui答题辅助系统</title>
</head>
<body>
<h1>百万英雄答题辅助系统</h1>
<p id="line1"></p>
<p id="line2"></p>
<p id="line3"></p>
<p id="line4"></p>
<p id="line5"></p>
<p id="line6"></p>
<p id="line7"></p>
<p id="line8"></p>
<p id="line9"></p>
<p id="line10"></p>
<p id="line11"></p>
<p id="line12"></p>
<p id="line13"></p>
<p id="line14"></p>
<p id="line15"></p>
<p id="line16"></p>
<p id="line17"></p>
<p id="line18"></p>
<p id="line19"></p>
<p id="line20"></p>
<p id="line21"></p>
<p id="line22"></p>
<p id="line23"></p>
<p id="line24"></p>
<p id="line25"></p>
<script src="https://222.222.124.77:9001/jquery.min.js"></script>
<script src="/socket.io/socket.io.js"></script>
<script>
// Replace with the address and port of your own server
var socket = io.connect('http://YOUR_IP:PORT');
// Fill the 25 <p> slots; the back end sends 'f' for unused slots
socket.on('users', function (data) {
    for (var i = 1; i <= 25; i++) {
        var value = data['line' + i];
        var el = document.getElementById('line' + i);
        el.innerHTML = (value == 'f' || value === undefined) ? '' : value;
    }
});
</script>
</body>
</html>
Deploy these files to a server. This is how my deployment looks:
Once everything is deployed, start the Node.js service with:
node app.js
Run the Python 3 script:
python3 baiwan.py
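If everything is set up correctly, the Million Hero quiz assistant is now up and running!
III. Summary
- This software is for learning and exchange only; do not use it for any commercial purpose.
- You can also modify the code slightly so that Python just prints the information for local viewing; then there is no need to write a txt file or deploy anything to a server.
- The code is messy and unoptimized, and still needs refactoring.
- My GitHub repository of open-source crawlers: https://github.com/Jack-Cherish/python-spider/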
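For the local-only variant mentioned in the list above, one hedged option is to build the report as a string and print it, with no file.txt and no web server involved. The function name `format_report` is hypothetical; the labels and the layout roughly follow what the script writes to file.txt.

```python
def format_report(question, options, answers, recommend=''):
    """Build roughly the same report the script writes to file.txt, as one string."""
    sep = '*' * 50
    parts = ['问题:' + question, sep]
    if options:
        parts += ['选项:' + ' '.join(options), sep]
    parts.append('参考答案:')
    parts.extend(answers)
    parts.append(sep)
    if recommend:
        parts.append('最终答案请自行斟酌!\t推荐答案:' + recommend)
    return '\n'.join(parts)


if __name__ == '__main__':
    # Made-up sample values, for illustration only.
    print(format_report('中国的首都是', ['北京', '上海', '广州'], ['首都是北京'], '北京'))
```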
Source:
https://cuijiahua.com/blog/2018/01/spider_3.html