
Looking for Python code to fetch a web page: how to extract strings from a page's HTML source

How does a Python crawler grab blocks of content from an HTML page?

import urllib.request
import re

def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    html = html.decode('GBK')  # decode with the page's declared charset
    return html

def getMeg(html):
    reg = re.compile(r'******')  # placeholder pattern: substitute the regex you need
    meglist = re.findall(reg, html)
    for meg in meglist:
        with open('out.txt', mode='a', encoding='utf-8') as file:
            file.write('%s\n' % meg)

if __name__ == "__main__":
    url = 'http://example.com'  # placeholder URL: substitute the page to fetch
    html = getHtml(url)
    getMeg(html)

Regex extraction: anchor the match on the keywords that appear just before and after the target.
Python makes it very convenient to fetch a page and filter its content. So, how do we extract 良玉的博客blog.uouo123.com from a page like the one below?

[Example HTML snippet, stripped of its markup when this page was archived; what survives is the line "window.quickReplyflag = true;" followed by the target text "良玉的博客blog.uouo123.com".]
The core code, implemented with a regular expression (Python 2 style, matching the original answer):

import re
import urllib2

opener = urllib2.build_opener()  # assumed setup; the original answer omitted it
html2 = opener.open(page).read()
# the tags around (.+?) were stripped when this page was archived;
# the pattern should anchor on the tags just before and after the target text
allfinds2 = re.findall(r'(.+?)', html2, re.S)
print allfinds2[0].strip()

The first statement opens the link; page is the URL of the article whose title we want to extract.
The second statement matches the fetched content with the regular expression. The pattern's tail is the closing tag that follows the title; to stop the match at the nearest closing tag rather than the last one, the non-greedy qualifier (.+?) is the crucial part.
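To see why the non-greedy qualifier matters, here is a minimal, self-contained comparison (the HTML string is made up for the demonstration):

import re

html = '<h2>first title</h2> ... <h2>second title</h2>'
# greedy: .+ runs to the LAST closing tag and swallows everything in between
print(re.findall(r'<h2>(.+)</h2>', html, re.S))
# -> ['first title</h2> ... <h2>second title']
# non-greedy: .+? stops at the NEAREST closing tag
print(re.findall(r'<h2>(.+?)</h2>', html, re.S))
# -> ['first title', 'second title']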

In Python 3.x, use the urllib.request module to fetch a page's source: urllib.request.urlopen opens the page and returns a response object, read() pulls the content out as raw bytes, and decode() turns those bytes into text using the page's encoding (the encoding can be read from the page source, e.g. <meta http-equiv="content-type" content="text/html;charset=gbk" />, so the example below uses gbk). The result is the page's source as a string.

As in the following example, which fetches this page's source:

import urllib.request

# the URL in the original post was lost; substitute the page you want to fetch
html = urllib.request.urlopen('http://...').read().decode('gbk')  # decode with the page's own encoding
print(html)

The documentation for urllib.request.urlopen follows:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)


Open the URL url, which can be either a string or a Request object.


data must be a bytes object specifying additional data to be sent to
the server, or None
if no such data is needed. data may also be an iterable object and in
that case Content-Length value must be specified in the headers. Currently HTTP
requests are the only ones that use data; the HTTP request will be a
POST instead of a GET when the data parameter is provided.


data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or
sequence of 2-tuples and returns a string in this format. It should be encoded
to bytes before being used as the data parameter. The charset parameter
in Content-Type
header may be used to specify the encoding. If charset parameter is not sent
with the Content-Type header, the server following the HTTP 1.1 recommendation
may assume that the data is encoded in ISO-8859-1 encoding. It is advisable to
use charset parameter with encoding used in Content-Type header with the Request.
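Putting that together, a minimal sketch of a POST request; the endpoint and form fields are hypothetical:

import urllib.parse
import urllib.request

# hypothetical endpoint and form fields, for illustration only
data = urllib.parse.urlencode({'q': 'python', 'page': 1}).encode('utf-8')
req = urllib.request.Request('http://example.com/search', data=data)
with urllib.request.urlopen(req) as resp:  # a data argument makes this a POST
    print(resp.read().decode('utf-8'))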


urllib.request module uses HTTP/1.1 and includes Connection:close header
in its HTTP requests.


The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the global
default timeout setting will be used). This actually only works for HTTP, HTTPS
and FTP connections.


If context is specified, it must be a ssl.SSLContext instance describing the various SSL
options. See HTTPSConnection for more details.


The optional cafile and capath parameters specify a set of
trusted CA certificates for HTTPS requests. cafile should point to a
single file containing a bundle of CA certificates, whereas capath
should point to a directory of hashed certificate files. More information can be
found in ssl.SSLContext.load_verify_locations().


The cadefault parameter is ignored.


For http and https urls, this function returns a http.client.HTTPResponse object which has the
following HTTPResponse
Objects methods.


For ftp, file, and data urls and requests explicitly handled by legacy URLopener and FancyURLopener classes, this function returns a
urllib.response.addinfourl object which can work as context manager and has methods such as


geturl() — return the URL of the resource retrieved,
commonly used to determine if a redirect was followed

info() — return the meta-information of the page, such
as headers, in the form of an email.message_from_string() instance (see Quick
Reference to HTTP Headers)

getcode() — return the HTTP status code of the response.


Raises URLError on errors.
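A short sketch tying the response methods, the timeout parameter, and the error case together (URL hypothetical):

import urllib.request
from urllib.error import URLError

try:
    with urllib.request.urlopen('http://example.com', timeout=10) as resp:
        print(resp.geturl())                 # final URL, after any redirect
        print(resp.getcode())                # HTTP status code, e.g. 200
        print(resp.info()['Content-Type'])   # response headers
except URLError as e:
    print('request failed:', e.reason)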


Note that None
may be returned if no handler handles the request (though the default installed
global OpenerDirector uses UnknownHandler to ensure this never happens).


In addition, if proxy settings are detected (for example, when a *_proxy environment
variable like http_proxy is set), ProxyHandler is default installed and makes sure the
requests are handled through the proxy.


The legacy urllib.urlopen function from Python 2.6 and earlier has
been discontinued; urllib.request.urlopen() corresponds to the old
urllib2.urlopen.
Proxy handling, which was done by passing a dictionary parameter to urllib.urlopen, can be
obtained by using ProxyHandler objects.
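A sketch of the ProxyHandler approach just mentioned; the proxy address is hypothetical:

import urllib.request

proxy = urllib.request.ProxyHandler({'http': 'http://10.0.0.1:8080'})  # hypothetical proxy
opener = urllib.request.build_opener(proxy)
urllib.request.install_opener(opener)  # later urlopen calls go through the proxy
html = urllib.request.urlopen('http://example.com').read()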



Changed in version 3.2: cafile
and capath were added.



Changed in version 3.2: HTTPS virtual
hosts are now supported if possible (that is, if ssl.HAS_SNI is true).



New in version 3.2: data can be
an iterable object.



Changed in version 3.3: cadefault
was added.



Changed in version 3.4.3: context
was added.



import urllib2
print urllib2.urlopen("http://www.baidu.com").read()

# urllib2.urlopen() returns a file-like object, so read() is used to get its contents
# the result can also be saved to a local file

outfile = open("output.html","w")
outfile.write(urllib2.urlopen("http://www.baidu.com").read())
outfile.close()
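For reference, a Python 3 equivalent of the urllib2 snippet above (urllib2 was folded into urllib.request in Python 3):

import urllib.request

html = urllib.request.urlopen("http://www.baidu.com").read()
print(html.decode('utf-8'))  # decode the bytes before printing

# save to a local file; the content is bytes, so open in binary mode
with open("output.html", "wb") as outfile:
    outfile.write(html)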

"How to scrape data from a web page (how to do web data scraping with Python)"
A: selenium is an automated testing tool that can also drive a browser to scrape page data. With the selenium library you can execute JavaScript, click buttons, fill in forms, and so on. Sample code from the answer (run together and truncated in the original): from selenium import webdriver; driver = webdriver.Chrome(); driver.get(url); button = driver.find_element_b...
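A runnable sketch of that selenium approach; it assumes chromedriver is installed, and the URL and selector are hypothetical:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()          # requires chromedriver on the PATH
driver.get('http://example.com')     # hypothetical URL
# locate an element; the CSS selector here is made up
button = driver.find_element(By.CSS_SELECTOR, 'button.submit')
button.click()
print(driver.page_source[:500])      # rendered HTML, after JavaScript has run
driver.quit()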

"Looking for Python code to fetch a web page"
A: In Python 3.x, use the urllib.request module to fetch a page's source: urllib.request.urlopen fetches the page, read() pulls the content out as bytes, and decode() turns the bytes into text using the page's encoding (readable from the page source; gbk in the example). That gives you the page's source. As in the following example, which fetches this page: imp...

"How to scrape a site's data with Python"
A: 1. First be clear about the target you want to scrape. To scrape a page's source information, first obtain the URL, then locate the target content. 2. Generate the URLs with a basic for loop. 3. Then simulate a browser request (requests.get(url)) to obtain the target page's source (req.text). 4. The target information is in that source; to extract it easily, parse the source with the BeautifulSoup library...
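A sketch of those four steps, assuming the requests and beautifulsoup4 packages; the URL pattern and target tag are hypothetical:

import requests
from bs4 import BeautifulSoup

# step 2: generate the URLs with a basic for loop (pattern is made up)
for page in range(1, 4):
    url = 'http://example.com/list?page=%d' % page
    # step 3: simulate a browser request and fetch the source
    req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    # step 4: parse the source with BeautifulSoup and pull out the targets
    soup = BeautifulSoup(req.text, 'html.parser')
    for title in soup.find_all('h2'):  # hypothetical target tag
        print(title.get_text(strip=True))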

"How to scrape page content with a Python crawler?"
A: Simulate the page request: act like a browser and open the target site. Get the data: once the site is open, the data we need can be fetched automatically. Save the data: once we have the data, it needs to be persisted to a local file, a database, or some other storage. So how do we write our own crawler in Python? The key library to introduce here is Requests. Using Requests ...
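A minimal fetch-then-persist loop with Requests, following the steps in that answer (URL hypothetical):

import requests

resp = requests.get('http://example.com')  # simulate the request and fetch the data
resp.raise_for_status()                    # fail loudly on HTTP errors
# persist the page to a local file
with open('page.html', 'w', encoding=resp.encoding or 'utf-8') as f:
    f.write(resp.text)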

"How to scrape specific content from a page with Python"
A: The simplest option is urllib; usage differs between Python 2.x and 3.x. Taking Python 2.x as the example: import urllib; html = urllib.urlopen(url); text = html.read(). For somewhat more complex jobs use the requests library, which supports all request types, cookies, headers, and so on. More complex still is selenium, which can scrape text generated by JavaScript. I built a simple crawler-challenge site: www.heibanke....

"Has anyone used Python's re module to scrape a page? Could you give an example? Thanks"
A: This is a very simple page-scraping script I wrote; it collects every link address on a given URL and fetches each link's title.

=== geturls.py ===
# coding: utf-8
import urllib
import urlparse
import re
import socket
import threading

# regex for links
urlre = re.compile(r"href=[\"']?([^ >\"']+)")
titlere = re.com...
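The script above is Python 2 (urllib, urlparse) and cut off; a sketch of the same idea in Python 3 follows, with the title regex filled in as an assumption since the original was truncated:

import re
import urllib.request
from urllib.parse import urljoin

urlre = re.compile(r"href=[\"']?([^ >\"']+)")        # link regex, as in the original
titlere = re.compile(r"<title>(.*?)</title>", re.S)  # assumed; truncated above

base = 'http://example.com'  # hypothetical start page
html = urllib.request.urlopen(base).read().decode('utf-8', errors='replace')
for link in urlre.findall(html):
    full = urljoin(base, link)  # resolve relative links
    try:
        page = urllib.request.urlopen(full, timeout=5).read().decode('utf-8', errors='replace')
        m = titlere.search(page)
        print(full, '->', m.group(1).strip() if m else '(no title)')
    except Exception as e:
        print(full, '-> error:', e)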

"How do I write a Python crawler?"
A: Fetch the page. With the necessary tools installed, we can officially start writing our crawler. Our first task is to scrape all the book information on Douban. Taking /subject/26986954/ as the example, first look at how to fetch the page's content. With the get() method provided by Python's requests library we can fetch a given page's content very simply, with code as follows. Extract the content: once the page content has been scraped, what we do next is...

"How to scrape dynamic page information with Python"
A: urllib cannot evaluate dynamic content, but a browser can: what a browser displays is the fully processed HTML document. That gives us a good angle on scraping dynamic pages. Python has a famous GUI library, PyQt, and although PyQt is a GUI library it ships QtWebKit, which is very practical here. Google's Chrome and Apple's Safari are both built on the WebKit engine, so we can...
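A sketch of that render-then-read idea; note that current PyQt5 ships QtWebEngine rather than the QtWebKit the answer names, so this assumes the PyQtWebEngine package:

import sys
from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEngineView

app = QApplication(sys.argv)
view = QWebEngineView()

def handle_html(html):
    print(html[:500])  # rendered HTML, including JavaScript-generated content
    app.quit()

# once the page (and its JavaScript) has finished loading, read the rendered source
view.loadFinished.connect(lambda ok: view.page().toHtml(handle_html))
view.load(QUrl('http://example.com'))  # hypothetical URL
app.exec_()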

"How to scrape a site's data with Python?"
A: The corresponding page source, which contains the data we need, is shown. 2. Matching the page structure, the main code is very simple: requests + BeautifulSoup, where requests fetches the page and BeautifulSoup parses it. A screenshot of the program run shows the data has been scraped successfully. For scraping a site's dynamic data (data that is not in the page source but in json or similar files), take the 人人贷 lending site as the example: 1. Here suppose we...
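For the dynamic case, the usual trick is to request the JSON endpoint directly instead of the HTML page; a sketch with a hypothetical endpoint and field names:

import requests

# hypothetical JSON endpoint, the kind found via the browser's network panel
url = 'http://example.com/api/loans?page=1'
data = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).json()
for item in data.get('list', []):  # the field names are assumptions
    print(item.get('title'), item.get('amount'))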

"How to scrape dynamic page content with Python?"
A: ... = zlib.decompress(respHtml, -zlib.MAX_WBITS); return respHtml; with example usage: url = "http://www.crifan.com"; respHtml = getUrlRespHtml(url). The complete helper is a library function; search for crifanLib.py. On scraping dynamic pages in general, see "Python topic tutorial: scraping sites, simulated login, scraping dynamic pages" (search for the title to find it)...
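The zlib.decompress(respHtml, -zlib.MAX_WBITS) fragment handles deflate-compressed response bodies; here is a sketch of a getUrlRespHtml-style helper reconstructed from that hint (the function name comes from the answer, the rest is an assumption):

import gzip
import urllib.request
import zlib

def getUrlRespHtml(url):
    req = urllib.request.Request(url, headers={
        'User-Agent': 'Mozilla/5.0',
        'Accept-Encoding': 'gzip, deflate',
    })
    resp = urllib.request.urlopen(req)
    respHtml = resp.read()
    encoding = resp.headers.get('Content-Encoding', '')
    if encoding == 'gzip':
        respHtml = gzip.decompress(respHtml)
    elif encoding == 'deflate':
        # negative wbits tells zlib to expect a raw deflate stream
        respHtml = zlib.decompress(respHtml, -zlib.MAX_WBITS)
    return respHtml

html = getUrlRespHtml("http://www.crifan.com")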

   
