Automating Google keyword searches and collecting URLs on CentOS
QQ group: 397745473
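Environment setup
The CentOS desktop environment is required; selecting the desktop option during system installation is enough.
Installing RDP on CentOS 7, reference: https://www.cnblogs.com/lenmom/p/9516210.html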
    # xrdp comes from EPEL, so enable that repository first
    yum install -y epel-release
    yum install -y xrdp tigervnc-server tmux
    vncpasswd root
    yum -y update && yum -y upgrade

    # Lower the xrdp maximum color depth (max_bpp, bits per pixel) from 32 to 24,
    # otherwise the remote connection may fail
    sed -i 's/max_bpp=32/max_bpp=24/g' /etc/xrdp/xrdp.ini

    # Disable SELinux (takes effect after a reboot)
    sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config

    # Either turn the firewall off ...
    systemctl stop firewalld
    systemctl disable firewalld
    # ... or keep it running and open the RDP port instead
    firewall-cmd --permanent --zone=public --add-port=3389/tcp
    firewall-cmd --reload

    systemctl start xrdp
    systemctl enable xrdp
    systemctl enable sshd

    # tmux basics
    yum -y install tmux
    tmux new -s vsyour
    tmux ls
    tmux a -d -t vsyour
    # Split into left and right panes: prefix + %
    # Split into top and bottom panes: prefix + "
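Before opening an RDP client it can help to confirm that the xrdp port is reachable from the client machine. The following is a minimal sketch, not part of the original setup: the helper name `port_is_open` is made up for illustration, and it only checks that a TCP connection succeeds, not that xrdp itself is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False
```

Calling `port_is_open("<server address>", 3389)` from the client should return `True` once xrdp is running and the firewall rule above is in place; the address is whatever your CentOS machine uses.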
Installing software:
    # Anaconda
    # Reference: https://www.anaconda.com/
    wget https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh

    # PyCharm
    # Reference: https://www.cnblogs.com/niuli1987/p/9917650.html
    wget https://download.jetbrains.com/python/pycharm-professional-2018.1.tar.gz
    # Desktop shortcut (桌面 is the Desktop folder on a Chinese-locale install)
    ln -s /root/pycharm-2018.1/bin/pycharm.sh /root/桌面/pycharm

    # Xfce4 desktop environment
    yum install epel-release
    # Installs the Xfce4 desktop; other xfce4 modules can be added if needed
    yum groupinstall xfce4
    # Switch to the graphical target to enter Xfce
    sudo systemctl isolate graphical.target
    yum groupinstall "X Window system"
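Troubleshooting: the xrdp session on CentOS closes right after connecting
Anaconda conflicts with xrdp, so after a reboot the xrdp remote desktop session closes as soon as it connects. The fix is to comment out the lines that Anaconda added to the shell startup file and use the version below instead, which appends Anaconda to the end of PATH.
For reference, edit the file with: vim ~/.bashrc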
    alias rm='rm -i'
    alias cp='cp -i'
    alias mv='mv -i'

    if [ -f /etc/bashrc ]; then
        . /etc/bashrc
    fi

    . "/root/.acme.sh/acme.sh.env"

    # Anaconda goes last so that system binaries are found first
    export PATH="$PATH:/root/anaconda3/bin"
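The order of directories in PATH decides which copy of a program runs when the same name exists in more than one place. The source does not say which program clashes with xrdp, so the sketch below is only a way to check the lookup order yourself; the helper name `resolve_command` is hypothetical.

```python
import shutil
from typing import Optional


def resolve_command(name: str, path: str) -> Optional[str]:
    """Return the file that `name` resolves to when `path` is used as PATH, or None."""
    # shutil.which walks the directories in `path` from left to right
    # and returns the first executable match
    return shutil.which(name, path=path)
```

For example, `resolve_command("python", "/usr/bin:/root/anaconda3/bin")` returns the system interpreter if `/usr/bin/python` exists, because `/usr/bin` is listed before the Anaconda directory.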
References:
https://www.cnblogs.com/infoo/p/11239490.html
http://blog.sina.com.cn/s/blog_71bd750b010312q3.html
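Another case where the desktop would not load
I had not changed anything, yet the desktop stopped loading again.
So another round of fiddling followed.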
Reference: https://bugzilla.redhat.com/show_bug.cgi?id=1529419

    abrt-auto-reporting enabled
    1. Fresh install RHEL7.7alpha via mimi mode;
    2. Install GUI via the command "yum groupinstall "Server with GUI"", boot into GUI, it is not reproduce;
    3. Update libX11 (libX11-1.6.7-2.el7.x86_64), reboot and check the messages log, it still not reproduce.

After a reboot everything was back to normal.
Installing the browser driver

    wget https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-linux64.tar.gz
    tar -xvzf geckodriver-v0.23.0-linux64.tar.gz
    chmod +x geckodriver
    mv geckodriver /usr/local/bin/
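Installing an input method
Warning: after installing it I could no longer get into the desktop, and it took a long time to work out that this step was the cause. It is best to skip this step entirely.
Reference: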
https://jingyan.baidu.com/article/cbf0e500b791142eaa28932f.html
    yum remove ibus
    yum install ibus ibus-table
    yum install ibus ibus-table-wubi
    # Running these broke the setup: the desktop would no longer load
Automatic search code

    from bs4 import BeautifulSoup
    from selenium import webdriver
    import datetime
    import requests
    import json
    import time
    import random
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.common.action_chains import ActionChains


    class GetGoogleUrl(object):
        def __init__(self):
            self.printInfo('Fetching keywords....')
            self.keys = []
            # Collect the daily trending searches of the last 31 days
            for i in range(0, 31):
                pastTime = (datetime.datetime.now() - datetime.timedelta(days=i)).strftime('%Y%m%d')
                url = f'https://trends.google.com/trends/api/dailytrends?hl=en-US&tz=0&ed={pastTime}&geo=US&ns=15'
                try:
                    response = requests.get(url)
                    response.encoding = 'utf-8'
                    # The response starts with a ")]}'," guard line that must be removed before parsing
                    jdata = json.loads(response.text.strip().replace(")]}',\n", ''))
                    for x in jdata['default']['trendingSearchesDays'][0]['trendingSearches']:
                        self.keys.append(x['title']['query'])
                except Exception as e:
                    self.printInfo(f'Failed to fetch keywords from {url}: {e}')
                self.printInfo(f'Used URL {url}, {len(self.keys)} keywords so far')
            self.keys = set(self.keys)
            self.printInfo(f'{len(self.keys)} keywords left after deduplication')

            firefox_options = webdriver.FirefoxOptions()
            # Note: these two flags come from Chrome; Firefox's own private-mode flag is -private
            firefox_options.add_argument("--disable-infobars")  # hide the "controlled by automation" notice
            # firefox_options.add_argument('--headless')  # run the browser in headless mode
            firefox_options.add_argument("--incognito")  # private browsing
            self.Firefox_driver = webdriver.Firefox(options=firefox_options)

        def printInfo(self, string):
            nowTime = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')  # current time
            printContent = f'[*] {nowTime} {string}'
            print(printContent)

        def getUrl(self):
            """Append the result URLs of the current page to urls.txt.

            Returns True if there is a next page, False otherwise.
            """
            try:
                soup = BeautifulSoup(self.Firefox_driver.page_source, 'lxml')
                if soup:
                    for i in soup.find_all('div', attrs={'class': 'r'}):
                        try:
                            writeUrl = i.a.get('href')
                        except Exception as e:
                            self.printInfo(f'{i} failed to get URL: {e}')
                            continue
                        if writeUrl:
                            with open('urls.txt', 'a') as f:
                                f.write(writeUrl + '\n')
                    if soup.find('a', attrs={'id': 'pnnext'}):
                        return True
                    else:
                        return False
            except Exception as e:
                self.printInfo(f'{self.Firefox_driver.current_url} failed to read page source: {e}')
                return True

        def seachKey(self, key):
            self.printInfo(f'Searching: {key}')
            down = "var q=document.documentElement.scrollTop=100000"
            try:
                self.Firefox_driver.find_element(By.NAME, "q").clear()
                self.Firefox_driver.find_element(By.NAME, "q").send_keys(key, Keys.RETURN)
                time.sleep(random.randint(1, 60))
                pageNumber = 0
                while True:
                    pageNumber += 1
                    # Collect the URLs on this page; stop when there is no next page
                    if not self.getUrl():
                        time.sleep(random.randint(1, 60))
                        self.printInfo(f'Finished scraping {pageNumber} pages!')
                        break
                    time.sleep(random.randint(1, 10))
                    self.Firefox_driver.execute_script(down)  # scroll to the bottom
                    # self.Firefox_driver.execute_script('window.scrollTo(0,1000000)')
                    # Move the mouse over the "next" link before clicking it
                    ActionChains(self.Firefox_driver).move_to_element(
                        self.Firefox_driver.find_element(By.ID, "pnnext")).perform()
                    time.sleep(random.randint(1, 3))
                    self.Firefox_driver.find_element(By.ID, "pnnext").click()  # go to the next page
                    time.sleep(random.randint(1, 10))
            except Exception as e:
                self.printInfo(f'Search failed: {e}')
                # This error means the search box is hidden, so give up on this browser session
                tipMessage = '''Message: Unable to clear element that cannot be edited: <input name="q" type="hidden">'''
                if tipMessage in str(e):
                    return 'exit'

        def firefoxDriver(self, url):
            time.sleep(random.randint(1, 3))  # short pause, otherwise an error may occur
            self.Firefox_driver.implicitly_wait(1)
            self.Firefox_driver.get(url)
            keyNumber = 0
            for key in self.keys:
                keyNumber += 1
                self.printInfo(f'[{str(keyNumber).zfill(3)}]'.center(100, '+'))
                if self.seachKey(key) == 'exit':
                    return 'exit'


    if __name__ == '__main__':
        for _ in range(2):
            getGoogleUrl = GetGoogleUrl()
            if getGoogleUrl.firefoxDriver('https://www.google.com/ncr') == 'exit':
                getGoogleUrl.Firefox_driver.quit()
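The keyword step depends on one detail of the trends response: the body begins with a `)]}',` guard line, so it is not valid JSON until that prefix is removed. Here is a small standalone sketch of just that parsing step, so it can be tried without a browser or network access. The function name `parse_daily_trends` and the constant `XSSI_PREFIX` are made up for illustration, and unlike the script above this version walks every day in the response rather than only the first.

```python
import json

# Guard prefix that precedes the JSON payload (name chosen here for illustration)
XSSI_PREFIX = ")]}',"


def parse_daily_trends(raw: str) -> list:
    """Extract the trending search queries from a raw dailytrends response body."""
    text = raw.strip()
    if text.startswith(XSSI_PREFIX):
        # Drop the guard so the remainder is plain JSON
        text = text[len(XSSI_PREFIX):]
    data = json.loads(text)
    queries = []
    for day in data["default"]["trendingSearchesDays"]:
        for item in day["trendingSearches"]:
            queries.append(item["title"]["query"])
    return queries
```

Checking for the prefix with `startswith` instead of a blanket `replace` avoids touching the same character sequence if it ever appears inside a query string.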
Deduplication code

    import glob


    def fun_set(l):
        """Return the items of l with duplicates removed."""
        return list(set(l))


    # Read every URL from all urls*.txt files
    allDomain = []
    for fileName in glob.glob('urls*.txt'):
        with open(fileName, 'r') as f:
            allUrl = f.read().split('\n')
            print(f'{fileName}: read {len(allUrl)} URLs.')
            allDomain += allUrl

    # Keep only the host part, e.g. https://example.com/page -> example.com
    allDomain_1 = []
    for i in fun_set(allDomain):
        data = '/'.join(i.split('/')[2:3])
        if data:
            allDomain_1.append(data)

    # Print the unique domains with a running number
    n = 0
    for i in fun_set(allDomain_1):
        n += 1
        print(n, i)
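The host extraction above relies on splitting the URL at `/`. As an alternative sketch, the standard library's `urllib.parse.urlsplit` does the same job and copes with blank lines without special handling. The function name `unique_domains` is hypothetical, and lowercasing and sorting the hosts are additions that the original script does not perform.

```python
from urllib.parse import urlsplit


def unique_domains(urls):
    """Return the sorted, de-duplicated hosts found in an iterable of URL strings."""
    domains = set()
    for url in urls:
        # netloc is the "host[:port]" part; it is empty for blank or scheme-less lines
        host = urlsplit(url.strip()).netloc.lower()
        if host:
            domains.add(host)
    return sorted(domains)
```

Feeding it the lines of `urls.txt`, for example `unique_domains(open('urls.txt'))`, gives the same kind of list that the loop above prints.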
A new discovery
After installing the TunSafe or WireGuard client on Windows, I could open an ssh connection to the remote Linux machine directly from cmd.