url.openConnection + Rhino: how to scrape dynamic pages

Posted: 2018-06-02 09:10


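Scraping a web page is essentially simulating a client (a PC browser, a mobile client, and so on): send a request, receive the response data, and parse out the data you need. This is my own understanding; please tell me if it is wrong.
The three most commonly used request methods are GET, POST, and HEAD:
GET carries its data as part of the URL. The amount of data is limited in practice (browsers and servers cap URL length), and the parameters are exposed in the URL, so they end up in logs and browser history.
POST carries its data in the request body, and the protocol imposes no length limit on it. The body is not encrypted unless the connection uses HTTPS; it is only less exposed than a URL.
HEAD is like GET except that the response has no body; it is used to fetch the headers only.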
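To make the GET/POST difference concrete, here is a minimal sketch of building the name1=value1&name2=value2 string that GET appends to the URL and POST sends as the request body. The class name QueryStrings and the example.com address are placeholders introduced for illustration only.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the original post.
final class QueryStrings {
    private QueryStrings() {}

    /** Joins parameters as name1=value1&name2=value2, URL-encoding each part as UTF-8. */
    static String encode(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required to be available on every Java platform
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("sotype", "FULLNAME");
        params.put("PageIndex", "1");
        String query = encode(params);
        // GET: the data travels in the URL (example.com is a placeholder host)
        System.out.println("GET  http://example.com/search?" + query);
        // POST: the same string is written to the request body instead
        System.out.println("POST body: " + query);
    }
}
```

Encoding each name and value separately keeps characters such as & or spaces inside a value from being mistaken for separators.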
Enough preamble.
Fetching a page with a GET request
Download the page with URLConnection: read the data through an InputStream and write it to a file through a FileOutputStream.
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class DownloadHtml {
    /**
     * Downloads an HTML page.
     * @param srcPath  URL of the page to download
     * @param filePath local directory in which the downloaded page is stored
     * @param fileName name of the saved file
     */
    public static void downloadHtmlByNet(String srcPath, String filePath, String fileName) {
        try {
            URL url = new URL(srcPath);
            URLConnection conn = url.openConnection();
            // Connect timeout of 3 seconds
            conn.setConnectTimeout(3 * 1000);
            // Send a browser-like User-Agent so the server does not reject the crawler with a 403
            conn.setRequestProperty("User-Agent", "Mozilla/4.0 ( MSIE 5.0; Windows NT; DigExt)");
            // Create the target directory (including parents) if it does not exist
            File saveDir = new File(filePath);
            if (!saveDir.exists()) {
                saveDir.mkdirs();
            }
            File file = new File(saveDir, fileName);
            // try-with-resources closes both streams even if an exception is thrown
            try (InputStream in = conn.getInputStream();
                 FileOutputStream out = new FileOutputStream(file)) {
                byte[] bs = new byte[1024]; // 1 KB buffer
                int len;                    // number of bytes read in this pass
                // read() returns -1 at end of stream; see InputStream.read(byte[])
                while ((len = in.read(bs)) != -1) {
                    out.write(bs, 0, len);
                }
            }
            System.out.println("Download succeeded");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        // args[0]: URL of the page to download
        // args[1]: directory for the file, e.g. d:/resource/html/
        // args[2]: file name, e.g. "page.html"
        downloadHtmlByNet(args[0], args[1], args[2]);
    }
}
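The read/write loop above can also be delegated to java.nio.file.Files. The following is a sketch only; StreamSaver is a hypothetical name, and the helper works on any InputStream, so it can be tried without a network connection.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of the original post.
final class StreamSaver {
    private StreamSaver() {}

    /**
     * Copies everything from the stream into dir/fileName, creating the
     * directory if needed, and closes the stream afterwards.
     */
    static Path save(InputStream in, Path dir, String fileName) {
        try (InputStream source = in) {
            Files.createDirectories(dir);
            Path target = dir.resolve(fileName);
            // Files.copy does the buffered read/write loop internally
            Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a URLConnection named conn, the call would look like StreamSaver.save(conn.getInputStream(), Paths.get("d:/resource/html"), "page.html").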
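HttpClient began as a subproject of Apache Jakarta Commons and is now maintained under Apache HttpComponents. It provides an efficient, up-to-date, feature-rich client-side toolkit for the HTTP protocol. The code here uses the HttpClient 4.x API (DefaultHttpClient, which has been deprecated since 4.3).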
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.params.BasicHttpParams;
import org.apache.http.protocol.BasicHttpContext;
import org.apache.http.protocol.HttpContext;

// Enclosing class added so the listing compiles; its name is not from the original post.
public class HttpClientDownload {
    public static void downloadHtmlByNet(String srcPath, String filePath, String fileName) {
        DefaultHttpClient httpClient = new DefaultHttpClient(); // create the client
        BasicHttpParams httpParams = new BasicHttpParams();     // create the parameter set
        // Imitate a browser so the server does not block the crawler with a 403.
        // Alternative given in the original post:
        // "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0. Safari/537.36"
        String userAgent = "Mozilla/4.0 ( MSIE 5.0; Windows NT; DigExt)";
        httpParams.setParameter("http.useragent", userAgent);
        httpClient.setParams(httpParams);
        try {
            HttpGet httpGet = new HttpGet(srcPath);
            HttpContext httpContext = new BasicHttpContext();
            HttpResponse httpResponse = httpClient.execute(httpGet, httpContext);
            HttpEntity entity = httpResponse.getEntity();
            if (entity != null) {
                writeToFile(entity, filePath, fileName); // write the entity content to a file
            }
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Release the connections whether or not the request succeeded
            httpClient.getConnectionManager().shutdown();
        }
    }

    private static void writeToFile(HttpEntity entity, String filePath, String fileName) {
        // Create the target directory if it does not exist
        File saveDir = new File(filePath);
        if (!saveDir.exists()) {
            saveDir.mkdirs();
        }
        File file = new File(saveDir, fileName);
        try (InputStream in = entity.getContent();
             FileOutputStream out = new FileOutputStream(file)) {
            byte[] bs = new byte[1024]; // 1 KB buffer
            int len;                    // number of bytes read in this pass
            // read() returns -1 at end of stream
            while ((len = in.read(bs)) != -1) {
                out.write(bs, 0, len);
            }
            System.out.println("Download succeeded");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Sigh, I have forgotten most of what I once learned. Study more and take more notes; next time I will use POST to scrape useful data.
Implementing a crawler with URLConnection (handling redirects, pages that can only be fetched with a cookie set, and similar problems)
1. Key methods
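// sendPost1, getCookie2 and getCookie3 are called as members of the HttpURLContent class.
// They need java.io.*, java.net.*, java.util.ArrayList, java.util.List and java.util.Map.

/**
 * Sends a POST request to the given URL.
 * @param url    URL to send the request to
 * @param param  request parameters in the form name1=value1&name2=value2 (may be null)
 * @param encode character encoding of the requested page
 * @param cookie value for the Cookie request header
 * @return the response body of the remote resource
 */
public static String sendPost1(String url, String param, String encode, String cookie) {
    PrintWriter out = null;
    BufferedReader in = null;
    String result = "";
    try {
        URL realUrl = new URL(url);
        // Open the connection to the URL
        URLConnection conn = realUrl.openConnection();
        // Set the common request headers
        conn.setRequestProperty("accept", "*/*");
        conn.setRequestProperty("Accept-Language", "zh-CN,q=0.8");
        conn.setRequestProperty("Cache-Control", "max-age=0");
        conn.setRequestProperty("connection", "Keep-Alive");
        conn.setRequestProperty("Cookie", cookie);
        //conn.setRequestProperty("Host","www.zjtax.gov.cn");
        conn.setRequestProperty("user-agent",
                "Mozilla/4.0 ( MSIE 6.0; Windows NT 5.1;SV1)");
        // These two lines are required for a POST request
        conn.setDoOutput(true);
        conn.setDoInput(true);
        // Get the output stream of the URLConnection
        out = new PrintWriter(conn.getOutputStream());
        // Send the request parameters; print(null) would send the text "null", so guard against it
        if (param != null) {
            out.print(param);
        }
        // Flush the buffered output
        out.flush();
        // Read the response through a BufferedReader using the page's encoding
        in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), encode));
        String line;
        while ((line = in.readLine()) != null) {
            result += line;
        }
    } catch (Exception e) {
        System.out.println("Exception while sending the POST request! " + e);
        e.printStackTrace();
    } finally {
        // Close the output and input streams
        try {
            if (out != null) {
                out.close();
            }
            if (in != null) {
                in.close();
            }
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
    return result;
}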
/**
 * Obtains the cookie, walking through the site's redirects.
 * @param url URL to send the request to
 * @return cookie string of the form key=value;key=value;...
 */
public static String getCookie2(String url) {
    HttpURLConnection conn = null;
    try {
        URL realUrl = new URL(url);
        conn = (HttpURLConnection) realUrl.openConnection();
        conn.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
        conn.setRequestProperty("Accept-Encoding", "gzip, deflate, sdch");
        conn.setRequestProperty("Accept-Language", "zh-CN,q=0.8");
        conn.setRequestProperty("Cache-Control", "max-age=0");
        conn.setRequestProperty("connection", "Keep-Alive");
        //conn.setRequestProperty("Host","www.zjtax.gov.cn");
        conn.setRequestProperty("user-agent", "Mozilla/4.0 ( MSIE 6.0; Windows NT 5.1;SV1)");
        // Whether to follow HTTP redirects automatically; the default is true.
        // If the site does not redirect, this line is not needed.
        conn.setInstanceFollowRedirects(false);
        conn.setDoInput(true);
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
    } catch (Exception e) {
        e.printStackTrace();
        return ""; // no connection, so there is no cookie to return
    }
    String sessionId = "";
    String cookieVal = "";
    String key = null;
    // Print all response headers (debugging aid)
    Map<String, List<String>> map = conn.getHeaderFields();
    for (String key1 : map.keySet()) {
        System.out.println(key1 + "--->" + map.get(key1));
    }
    // Collect the cookies
    for (int i = 1; (key = conn.getHeaderFieldKey(i)) != null; i++) {
        if (key.equalsIgnoreCase("set-cookie")) {
            cookieVal = conn.getHeaderField(i);
            // Keep only name=value; drop attributes such as Path or Expires
            int semi = cookieVal.indexOf(";");
            if (semi >= 0) {
                cookieVal = cookieVal.substring(0, semi);
            }
            sessionId = sessionId + cookieVal + ";";
        }
    }
    // If the site does not redirect, the next four lines are not needed
    String location = conn.getHeaderField("Location"); // redirect target
    List<String> list = getCookie3(location, sessionId);
    List<String> list2 = getCookie3(list.get(1), sessionId + list.get(0));
    sessionId = sessionId + list2.get(0);
    return sessionId;
}
/**
 * Obtains the cookie from one step of a redirect chain.
 * @param url    URL to send the request to
 * @param cookie cookies collected so far, sent in the Cookie header
 * @return a list holding the new cookie string at index 0 and the redirect address at index 1
 */
public static List<String> getCookie3(String url, String cookie) {
    HttpURLConnection conn = null;
    try {
        URL realUrl = new URL(url);
        conn = (HttpURLConnection) realUrl.openConnection();
        conn.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
        conn.setRequestProperty("Accept-Encoding", "gzip, deflate, sdch");
        conn.setRequestProperty("Accept-Language", "zh-CN,q=0.8");
        conn.setRequestProperty("Cache-Control", "max-age=0");
        conn.setRequestProperty("connection", "Keep-Alive");
        //conn.setRequestProperty("Host","www.zjtax.gov.cn");
        conn.setRequestProperty("user-agent", "Mozilla/4.0 ( MSIE 6.0; Windows NT 5.1;SV1)");
        conn.setRequestProperty("Cookie", cookie);
        conn.setInstanceFollowRedirects(false);
        conn.setDoInput(true);
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
    } catch (Exception e) {
        e.printStackTrace();
    }
    String sessionId = "";
    String cookieVal = "";
    String key = null;
    String location = null;
    if (conn != null) {
        location = conn.getHeaderField("Location");
        for (int i = 1; (key = conn.getHeaderFieldKey(i)) != null; i++) {
            if (key.equalsIgnoreCase("set-cookie")) {
                cookieVal = conn.getHeaderField(i);
                // Keep only name=value; drop attributes such as Path or Expires
                int semi = cookieVal.indexOf(";");
                if (semi >= 0) {
                    cookieVal = cookieVal.substring(0, semi);
                }
                sessionId = sessionId + cookieVal + ";";
            }
        }
    }
    List<String> list = new ArrayList<String>();
    list.add(sessionId); // the cookie
    list.add(location);  // the redirect address
    return list;
}
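getCookie2 and getCookie3 pass the Location header straight to new URL(...). Servers are allowed to send a relative address there, in which case that call fails. A minimal sketch of resolving the header against the URL that produced it follows; Redirects is a hypothetical helper name, and whether the target site actually sends relative addresses is not stated in the post.

```java
import java.net.URI;

// Hypothetical helper, not part of the original post.
final class Redirects {
    private Redirects() {}

    /**
     * Resolves a Location header value against the request URL.
     * Returns null when there is no Location header, i.e. no redirect.
     */
    static String resolve(String requestUrl, String location) {
        if (location == null || location.isEmpty()) {
            return null;
        }
        // URI.resolve leaves absolute addresses untouched and
        // completes relative ones from the base URL
        return URI.create(requestUrl).resolve(location).toString();
    }
}
```

In getCookie2 this would be applied as Redirects.resolve(url, conn.getHeaderField("Location")) before handing the result to getCookie3.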
Also attached: the most basic methods for scraping with GET, scraping with POST, and obtaining a cookie.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

public class HttpURLContent {
    /**
     * Sends a GET request to the given URL.
     * @param url   URL to send the request to
     * @param param request parameters in the form name1=value1&name2=value2
     * @return the response body of the remote resource
     */
    public static String sendGet(String url, String param) {
        String result = "";
        BufferedReader in = null;
        try {
            String urlNameString = url + "?" + param;
            URL realUrl = new URL(urlNameString);
            // Open the connection to the URL
            URLConnection connection = realUrl.openConnection();
            // Set the common request headers
            connection.setRequestProperty("accept", "*/*");
            connection.setRequestProperty("connection", "Keep-Alive");
            connection.setRequestProperty("user-agent", "Mozilla/4.0 ( MSIE 6.0; Windows NT 5.1;SV1)");
            // Establish the actual connection
            connection.connect();
            // Read the response through a BufferedReader
            in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                result += line;
            }
        } catch (Exception e) {
            System.out.println("Exception while sending the GET request! " + e);
            e.printStackTrace();
        } finally {
            // Close the input stream
            try {
                if (in != null) {
                    in.close();
                }
            } catch (Exception e2) {
                e2.printStackTrace();
            }
        }
        return result;
    }

    /**
     * Sends a POST request to the given URL.
     * @param url   URL to send the request to
     * @param param request parameters in the form name1=value1&name2=value2
     * @return the response body of the remote resource
     */
    public static String sendPost(String url, String param) {
        PrintWriter out = null;
        BufferedReader in = null;
        String result = "";
        try {
            URL realUrl = new URL(url);
            // Open the connection to the URL
            URLConnection conn = realUrl.openConnection();
            // Set the common request headers
            conn.setRequestProperty("accept", "*/*");
            conn.setRequestProperty("connection", "Keep-Alive");
            conn.setRequestProperty("user-agent",
                    "Mozilla/4.0 ( MSIE 6.0; Windows NT 5.1;SV1)");
            // These two lines are required for a POST request
            conn.setDoOutput(true);
            conn.setDoInput(true);
            // Get the output stream of the URLConnection
            out = new PrintWriter(conn.getOutputStream());
            // Send the request parameters
            if (param != null) {
                out.print(param);
            }
            // Flush the buffered output
            out.flush();
            // Read the response through a BufferedReader
            in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                result += line;
            }
        } catch (Exception e) {
            System.out.println("Exception while sending the POST request! " + e);
            e.printStackTrace();
        } finally {
            // Close the output and input streams
            try {
                if (out != null) {
                    out.close();
                }
                if (in != null) {
                    in.close();
                }
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
        return result;
    }

    /**
     * Visits the page once and returns the cookies it sets.
     * @param url URL to send the request to
     * @return cookie string of the form key=value;key=value;...
     */
    public static String getCookie(String url) {
        HttpURLConnection conn = null;
        try {
            URL realUrl = new URL(url);
            conn = (HttpURLConnection) realUrl.openConnection();
            conn.setDoInput(true);
            conn.setDoOutput(true);
            conn.setRequestMethod("POST");
        } catch (Exception e) {
            e.printStackTrace();
            return ""; // no connection, so there is no cookie to return
        }
        String sessionId = "";
        String cookieVal = "";
        String key = null;
        // Collect the cookies
        for (int i = 1; (key = conn.getHeaderFieldKey(i)) != null; i++) {
            if (key.equalsIgnoreCase("set-cookie")) {
                cookieVal = conn.getHeaderField(i);
                // Keep only name=value; drop attributes such as Path or Expires
                int semi = cookieVal.indexOf(";");
                if (semi >= 0) {
                    cookieVal = cookieVal.substring(0, semi);
                }
                sessionId = sessionId + cookieVal + ";";
            }
        }
        return sessionId;
    }
}
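2. Summary of problems
Step 1: Use the most basic method and fetch the page directly. If you get the content, congratulations.
Step 2: If fetching directly yields nothing, fetch with a cookie set, i.e. conn.setRequestProperty("Cookie", cookie);
Step 3: The new problem is how to obtain the cookie. A cookie is issued the first time the page is visited, so visit the page once first and capture the cookie. That is what the getCookie(String url) method does.
Step 4: This is where it gets more complicated. For most of the pages I have scraped, the target page does not redirect. If it does, use the getCookie2() and getCookie3() methods to obtain the cookie.
This is the most troublesome scrape I have run into so far; it took me two days to solve. Keep going!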
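Steps 2 and 3 come down to turning the Set-Cookie headers of one response into the Cookie header of the next request. The following is a minimal sketch of that conversion as a pure function; CookieHeaders is a hypothetical name, and the sketch ignores cookie attributes such as Path, Domain, and expiry, just as the methods above do.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original post.
final class CookieHeaders {
    private CookieHeaders() {}

    /**
     * Builds a Cookie request-header value from Set-Cookie response-header
     * values by keeping only the leading name=value part of each one.
     */
    static String fromSetCookie(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        if (setCookieValues != null) {
            for (String value : setCookieValues) {
                if (value == null) {
                    continue;
                }
                int semi = value.indexOf(';');
                // Everything after the first ';' is an attribute, not part of the cookie
                String pair = (semi >= 0 ? value.substring(0, semi) : value).trim();
                if (!pair.isEmpty()) {
                    pairs.add(pair);
                }
            }
        }
        return String.join("; ", pairs);
    }
}
```

With an HttpURLConnection named conn, the input would typically come from conn.getHeaderFields().get("Set-Cookie"), and the result would go into setRequestProperty("Cookie", ...) on the next connection. Unlike substring(0, indexOf(";")), this does not throw when a cookie has no attributes.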
3. Test code
/**
 * Export tax rebate rate query.
 * Test URL:
 * http://www.zjtax.gov.cn/wcm/xchaxun/tuishui.?sotype=FULLNAME&sovalue=钢铁&PageIndex=1
 */
public HashMap<String, Object> getCktsls(String url) {
    // Get the cookie first
    String cookie = HttpURLContent.getCookie2("http://www.zjtax.gov.cn/wcm/xchaxun/tuishui.jsp");
    HashMap<String, Object> re = new HashMap<String, Object>();
    // Fetch the result
    String result = HttpURLContent.sendPost1(url, null, "utf-8", cookie);
    //System.out.println(result);
    // Everything below processes the result and depends on the actual page markup
    if (result.contains("<font color='#104194'>共")) { // results were found
        String[] result_arr = result.split("<font color='#104194'>共");
        String totalPage_str = result_arr[1].substring(0, result_arr[1].indexOf("页")).trim();
        List<Map<String, String>> mapList = new ArrayList<Map<String, String>>();
        String[] result_arr1 = result.split("class=\"gs_cx4_sp7\">");
        for (int i = 1; i < result_arr1.length; i++) {
            Map<String, String> map = new HashMap<String, String>();
            map.put("number", result_arr1[i].substring(0, result_arr1[i].indexOf("</span>")));
            String[] result_arr2 = result_arr1[i].split("\">");
            for (int j = 1; j < result_arr2.length; j++) {
                String value = "";
                if (j <= 5) value = result_arr2[j].substring(0, result_arr2[j].indexOf("</span>"));
                // The case labels were lost in the source; 1 to 5 are assumed from the j <= 5 guard
                switch (j) {
                    case 1: map.put("nsrmc", value); break;
                    case 2: map.put("type", value); break;
                    case 3: map.put("sdate", value); break;
                    case 4: map.put("edate", value); break;
                    case 5: map.put("sign", value); break;
                    default: break;
                }
            }
            mapList.add(map);
        }
        re.put("totalPage_str", totalPage_str);
        re.put("result", mapList);
    } else { // no results were found
        // The handling of this branch is missing from the source
    }
    return re;
}
Scraping pages and simulating login with HttpURLConnection and with HttpClient + Jsoup tag processing
Scraping with HttpURLConnection
// The package declaration (com.app.…) is truncated in the source and is omitted here.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;
import java.io.Writer;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class Html {
    private static final String loginURL = "http://login.goodjobs.cn/index.php/action/UserLogin";
    private static final String forwardURL = "http://user.goodjobs.cn/dispatcher.php/module/Personal/?skip_fill=1";

    /**
     * Logs in and returns the page that is only available after login.
     * @param params login user name and password
     * @throws Exception
     */
    public static String createHtml(String... params) throws Exception {
        URL url = new URL(loginURL);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setDoOutput(true);
        loginHtml(conn, params);
        return forwardHtml(conn, url);
    }

    /**
     * Submits the login form.
     * @param conn   connection to the login URL
     * @param params login user name and password
     * @throws Exception
     */
    private static void loginHtml(HttpURLConnection conn, String... params)
            throws Exception {
        OutputStreamWriter out = new OutputStreamWriter(conn.getOutputStream(), "GBK");
        StringBuffer buff = new StringBuffer();
        buff.append("memberName=" + URLEncoder.encode(params[0], "UTF-8")); // user name field of the page
        buff.append("&password=" + URLEncoder.encode(params[1], "UTF-8"));  // password field of the page
        out.write(buff.toString()); // send the parameters
        out.flush();
        out.close();
    }

    /**
     * Requests the target page using the cookie obtained at login.
     * @param conn connection on which the login was sent
     * @param url  URL of the login request (replaced by the target URL)
     * @throws Exception
     */
    public static String forwardHtml(HttpURLConnection conn, URL url) throws Exception {
        // Read the cookie; it records the identity for the next request.
        // getHeaderField(name) returns a single value even if several Set-Cookie headers were sent.
        String cookieVal = conn.getHeaderField("Set-Cookie");
        // Open a new connection
        url = new URL(forwardURL);
        conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        conn.setRequestProperty("User-Agent", "Mozilla/4.0 ( MSIE 6.0; Windows NT 5.1; SV1; Foxy/1; .NET CLR 2.0.50727;MEGAUPLOAD 1.0)");
        // Do not follow HTTP redirects on this connection
        // (the static setFollowRedirects would change the default for every connection)
        conn.setInstanceFollowRedirects(false);
        if (cookieVal != null) {
            // Send the cookie to prove our identity; otherwise the server treats us as not logged in
            conn.setRequestProperty("Cookie", cookieVal);
        }
        conn.connect();
        InputStream in = conn.getInputStream();
        BufferedReader buffReader = new BufferedReader(new InputStreamReader(in, "GBK"));
        String line = null;
        StringBuilder content = new StringBuilder();
        while ((line = buffReader.readLine()) != null) {
            content.append("\n").append(line);
        }
        //IOUtils.write(result, new FileOutputStream("d:/index.html"),"GBK");
        write(content.toString(), "d:/forward.html");
        buffReader.close();
        return content.toString();
    }

    /**
     * Writes the content to a file in GBK.
     * @param content  text to write
     * @param htmlPath path of the output file
     * @return true if the file was written
     */
    public static boolean write(String content, String htmlPath) {
        boolean flag = false;
        try {
            Writer out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(htmlPath), "GBK"));
            out.write("\n" + content);
            out.close();
            flag = true;
        } catch (FileNotFoundException ex) {
            ex.printStackTrace();
        } catch (UnsupportedEncodingException ex) {
            ex.printStackTrace();
        } catch (IOException ex) {
            ex.printStackTrace();
        }
        return flag;
    }

    public static void main(String[] args) throws Exception {
        String[] params = {"admin", "admin12"};
        System.out.println(createHtml(params));
    }
}
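The listing above decodes the response as GBK unconditionally, which garbles pages served in another encoding. One possible refinement, shown here only as a sketch, is to read the charset from the Content-Type response header and fall back to a default when none is declared. ResponseCharsets is a hypothetical name; pages that declare their encoding only in a meta tag are not covered.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original post.
final class ResponseCharsets {
    private ResponseCharsets() {}

    /**
     * Extracts the charset parameter from a Content-Type value such as
     * "text/html; charset=GBK". Returns the fallback when the header is
     * missing, has no charset parameter, or names an unknown charset.
     */
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for the "charset=" prefix
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=utf-8", StandardCharsets.ISO_8859_1));
    }
}
```

In forwardHtml this could replace the hard-coded "GBK" with ResponseCharsets.fromContentType(conn.getContentType(), Charset.forName("GBK")), since InputStreamReader also accepts a Charset.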
Scraping a page with HttpClient, without processing its styles
// The package declaration (com.app.…) is truncated in the source and is omitted here.
// This listing uses the Commons HttpClient 3.x API (org.apache.commons.httpclient).
import java.io.BufferedWriter;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;
import java.io.Writer;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.commons.httpclient.Cookie;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.NameValuePair;
import org.apache.commons.httpclient.cookie.CookiePolicy;
import org.apache.commons.httpclient.cookie.CookieSpec;
import org.apache.commons.httpclient.methods.PostMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class HttpClientHtml {
    private static final String SITE = "login.goodjobs.cn";
    private static final int PORT = 80;
    private static final String loginAction = "/index.php/action/UserLogin";
    private static final String forwardURL = "http://user.goodjobs.cn/dispatcher.php/module/Personal/?skip_fill=1";

    /**
     * Simulates the login.
     * @param LOGON_SITE   host of the login site
     * @param LOGON_PORT   port of the login site
     * @param login_Action path of the login action
     * @param params       user name and password
     * @throws Exception
     */
    private static HttpClient loginHtml(String LOGON_SITE, int LOGON_PORT, String login_Action, String... params) throws Exception {
        HttpClient client = new HttpClient();
        client.getHostConfiguration().setHost(LOGON_SITE, LOGON_PORT);
        // Submit the login form
        PostMethod post = new PostMethod(login_Action);
        NameValuePair userName = new NameValuePair("memberName", params[0]);
        NameValuePair password = new NameValuePair("password", params[1]);
        post.setRequestBody(new NameValuePair[] { userName, password });
        client.executeMethod(post);
        post.releaseConnection();
        // Inspect the cookies; the client keeps them in its state for later requests
        CookieSpec cookiespec = CookiePolicy.getDefaultSpec();
        Cookie[] cookies = cookiespec.match(LOGON_SITE, LOGON_PORT, "/", false,
                client.getState().getCookies());
        if (cookies != null) {
            if (cookies.length == 0) {
                System.out.println("No cookies were returned");
            } else {
                for (int i = 0; i < cookies.length; i++) {
                    System.out.println(cookies[i].toString());
                }
            }
        }
        return client;
    }

    /**
     * Fetches the required page after the simulated login.
     * @param client the logged-in client
     * @param newUrl URL of the page to fetch
     * @throws Exception
     */
    private static void createHtml(HttpClient client, String newUrl) throws Exception {
        PostMethod post = new PostMethod(newUrl);
        // Charset used for the body when the response does not declare one
        post.getParams().setParameter(HttpMethodParams.HTTP_CONTENT_CHARSET, "GBK");
        client.executeMethod(post);
        String content = post.getResponseBodyAsString();
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        //IOUtils.write(content, new FileOutputStream("d:/"+format.format(new Date())+".html"),"GBK");
        write(content, "d:/" + format.format(new Date()) + ".html");
        post.releaseConnection();
    }

    public static void main(String[] args) throws Exception {
        String[] params = {"admin", "admin123"};
        HttpClient client = loginHtml(SITE, PORT, loginAction, params);
        // Visit the required page
        createHtml(client, forwardURL);
        //System.out.println(UUID.randomUUID());
    }

    /**
     * Writes the content to a file in GBK.
     * @param content  text to write
     * @param htmlPath path of the output file
     * @return true if the file was written
     */
    public static boolean write(String content, String htmlPath) {
        boolean flag = false;
        try {
            Writer out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(htmlPath), "GBK"));
            out.write("\n" + content);
            out.close();
            flag = true;
        } catch (FileNotFoundException ex) {
            ex.printStackTrace();
        } catch (UnsupportedEncodingException ex) {
            ex.printStackTrace();
        } catch (IOException ex) {
            ex.printStackTrace();
        }
        return flag;
    }
}
Scraping a page with HttpClient and keeping its styling (linking to the CSS on the server)
// The package declaration (com.app.…) is truncated in the source and is omitted here.
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.commons.httpclient.Cookie;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.NameValuePair;
import org.apache.commons.httpclient.cookie.CookiePolicy;
import org.apache.commons.httpclient.cookie.CookieSpec;
import org.apache.commons.httpclient.methods.PostMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
// FileUtil is the author's own helper and is not shown in the post
import com.app.comom.FileUtil;

public class HttpClientHtml {
    private static final String SITE = "login.goodjobs.cn";
    private static final int PORT = 80;
    private static final String loginAction = "/index.php/action/UserLogin";
    private static final String forwardURL = "http://user.goodjobs.cn/dispatcher.php/module/Personal/?skip_fill=1";
    private static final String toUrl = "d:\\test\\";
    private static final String css = "http://user.goodjobs.cn/personal.css";
    private static final String Img = "http://user.goodjobs.cn/images";
    private static final String _JS = "http://user.goodjobs.cn/scripts/fValidate/fValidate.one.js";

    /**
     * Simulates the login.
     * @param LOGON_SITE   host of the login site
     * @param LOGON_PORT   port of the login site
     * @param login_Action path of the login action
     * @param params       user name and password
     * @throws Exception
     */
    private static HttpClient loginHtml(String LOGON_SITE, int LOGON_PORT, String login_Action, String... params) throws Exception {
        HttpClient client = new HttpClient();
        client.getHostConfiguration().setHost(LOGON_SITE, LOGON_PORT);
        // Submit the login form
        PostMethod post = new PostMethod(login_Action);
        NameValuePair userName = new NameValuePair("memberName", params[0]);
        NameValuePair password = new NameValuePair("password", params[1]);
        post.setRequestBody(new NameValuePair[] { userName, password });
        client.executeMethod(post);
        post.releaseConnection();
        // Inspect the cookies
        CookieSpec cookiespec = CookiePolicy.getDefaultSpec();
        Cookie[] cookies = cookiespec.match(LOGON_SITE, LOGON_PORT, "/", false,
                client.getState().getCookies());
        if (cookies != null) {
            if (cookies.length == 0) {
                System.out.println("No cookies were returned");
            } else {
                for (int i = 0; i < cookies.length; i++) {
                    System.out.println(cookies[i].toString());
                }
            }
        }
        return client;
    }

    /**
     * Fetches the required page after the simulated login and saves it.
     * @param client the logged-in client
     * @param newUrl URL of the page to fetch
     * @return path of the saved file
     * @throws Exception
     */
    private static String createHtml(HttpClient client, String newUrl) throws Exception {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        String filePath = toUrl + format.format(new Date()) + "_" + 1 + ".html";
        PostMethod post = new PostMethod(newUrl);
        // Charset used for the body when the response does not declare one
        post.getParams().setParameter(HttpMethodParams.HTTP_CONTENT_CHARSET, "GBK");
        client.executeMethod(post);
        String content = post.getResponseBodyAsString();
        FileUtil.write(content, filePath);
        System.out.println("\nFile written successfully!");
        post.releaseConnection();
        return filePath;
    }

    /**
     * Parses the saved HTML and rewrites its resource paths to absolute URLs.
     * @param filePath path of the saved HTML file
     * @param random   number used in the output file name
     * @return path of the rewritten file (the original return statement is missing; this is assumed)
     */
    private static String JsoupFile(String filePath, int random) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        File infile = new File(filePath);
        String url = toUrl + format.format(new Date()) + "_new_" + random + ".html";
        File outFile = new File(url);
        try {
            Document doc = Jsoup.parse(infile, "GBK");
            String html = "<!DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 4.01 Transitional//EN'>";
            StringBuffer sb = new StringBuffer();
            sb.append(html).append("\n");
            sb.append("<html>").append("\n");
            sb.append("<head>").append("\n");
            sb.append("<title>欢迎使用新安人才网个人专区</title>").append("\n");
            Elements meta = doc.getElementsByTag("meta");
            sb.append(meta.toString()).append("\n");
            // body
            Elements body = doc.getElementsByTag("body");
            // link: point the stylesheet href at the server copy
            Elements links = doc.select("link");
            for (Element link : links) {
                String hrefAttr = link.attr("href");
                if (hrefAttr.contains("/personal.css")) {
                    hrefAttr = hrefAttr.replace("/personal.css", css);
                    Element hrefVal = link.attr("href", hrefAttr); // update the href value
                    sb.append(hrefVal.toString()).append("\n");
                }
            }
            // script: point the script src at the server copy
            Elements scripts = doc.select("script");
            for (Element js : scripts) {
                String jsrc = js.attr("src");
                if (jsrc.contains("/fValidate.one.js")) {
                    String oldJS = "/scripts/fValidate/fValidate.one.js"; // previous script path
                    jsrc = jsrc.replace(oldJS, _JS);
                    Element val = js.attr("src", jsrc); // update the src value
                    sb.append(val.toString()).append("\n");
                }
            }
            // Close the head exactly once, whether or not a script matched
            sb.append("</head>");
            // src: rewrite every src attribute under body that points at /images
            Elements tags = body.select("*");
            for (Element tag : tags) {
                String src = tag.attr("src");
                if (src.contains("/images")) {
                    src = src.replace("/images", Img);
                    tag.attr("src", src); // update the src value
                }
            }
            sb.append(body.toString());
            sb.append("</html>");
            Writer out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outFile), "gbk"));
            out.write(sb.toString());
            out.close();
            System.out.println("Page crawl finished");
        } catch (IOException e) {
            e.printStackTrace();
        }
        return url;
    }

    public static void main(String[] args) throws Exception {
        String[] params = {"admin", "admin123"};
        HttpClient client = loginHtml(SITE, PORT, loginAction, params);
        // Visit the required page
        String path = createHtml(client, forwardURL);
        System.out.println(JsoupFile(path, 1));
    }
}
Scraping a page with HttpClient and keeping its styling (CSS downloaded from the site, saved as a txt file, and written into the HTML)
// The package declaration (com.app.…) is truncated in the source and is omitted here.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.commons.httpclient.Cookie;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.NameValuePair;
import org.apache.commons.httpclient.cookie.CookiePolicy;
import org.apache.commons.httpclient.cookie.CookieSpec;
import org.apache.commons.httpclient.methods.PostMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
// FileUtil is the author's own helper and is not shown in the post
import com.app.comom.FileUtil;

public class HttpClientHtml {
    private static final String SITE = "login.goodjobs.cn";
    private static final int PORT = 80;
    private static final String loginAction = "/index.php/action/UserLogin";
    private static final String forwardURL = "http://user.goodjobs.cn/dispatcher.php/module/Personal/?skip_fill=1";
    private static final String toUrl = "d:\\test\\";
    private static final String hostCss = "d:\\test\\style.txt";
    private static final String Img = "http://user.goodjobs.cn/images";
    private static final String _JS = "http://user.goodjobs.cn/scripts/fValidate/fValidate.one.js";

    /**
     * Simulates the login.
     * @param LOGON_SITE   host of the login site
     * @param LOGON_PORT   port of the login site
     * @param login_Action path of the login action
     * @param params       user name and password
     * @throws Exception
     */
    private static HttpClient loginHtml(String LOGON_SITE, int LOGON_PORT, String login_Action, String... params) throws Exception {
        HttpClient client = new HttpClient();
        client.getHostConfiguration().setHost(LOGON_SITE, LOGON_PORT);
        // Submit the login form
        PostMethod post = new PostMethod(login_Action);
        NameValuePair userName = new NameValuePair("memberName", params[0]);
        NameValuePair password = new NameValuePair("password", params[1]);
        post.setRequestBody(new NameValuePair[] { userName, password });
        client.executeMethod(post);
        post.releaseConnection();
        // Inspect the cookies
        CookieSpec cookiespec = CookiePolicy.getDefaultSpec();
        Cookie[] cookies = cookiespec.match(LOGON_SITE, LOGON_PORT, "/", false,
                client.getState().getCookies());
        if (cookies != null) {
            if (cookies.length == 0) {
                System.out.println("No cookies were returned");
            } else {
                for (int i = 0; i < cookies.length; i++) {
                    System.out.println(cookies[i].toString());
                }
            }
        }
        return client;
    }

    /**
     * Fetches the required page after the simulated login and saves it.
     * @param client the logged-in client
     * @param newUrl URL of the page to fetch
     * @return path of the saved file
     * @throws Exception
     */
    private static String createHtml(HttpClient client, String newUrl) throws Exception {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        String filePath = toUrl + format.format(new Date()) + "_" + 1 + ".html";
        PostMethod post = new PostMethod(newUrl);
        // Charset used for the body when the response does not declare one
        post.getParams().setParameter(HttpMethodParams.HTTP_CONTENT_CHARSET, "GBK");
        client.executeMethod(post);
        String content = post.getResponseBodyAsString();
        FileUtil.write(content, filePath);
        System.out.println("\nFile written successfully!");
        post.releaseConnection();
        return filePath;
    }

    /**
     * Parses the saved HTML, inlines the local CSS file and rewrites resource paths.
     * @param filePath path of the saved HTML file
     * @param random   number used in the output file name
     * @return path of the rewritten file (the original return statement is missing; this is assumed)
     */
    private static String JsoupFile(String filePath, int random) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        File infile = new File(filePath);
        String url = toUrl + format.format(new Date()) + "_new_" + random + ".html";
        File outFile = new File(url);
        try {
            Document doc = Jsoup.parse(infile, "GBK");
            String html = "<!DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 4.01 Transitional//EN'>";
            StringBuffer sb = new StringBuffer();
            sb.append(html).append("\n");
            sb.append("<html>").append("\n");
            sb.append("<head>").append("\n");
            sb.append("<title>欢迎使用新安人才网个人专区</title>").append("\n");
            Elements meta = doc.getElementsByTag("meta");
            sb.append(meta.toString()).append("\n");
            // Local CSS: copy the contents of style.txt into the head
            File cssFile = new File(hostCss);
            BufferedReader in = new BufferedReader(new FileReader(cssFile));
            String content = in.readLine();
            while (content != null) {
                //System.out.println(content);
                sb.append(content + "\n");
                content = in.readLine();
            }
            in.close();
            // body
            Elements body = doc.getElementsByTag("body");
            // script: point the script src at the server copy
            Elements scripts = doc.select("script");
            for (Element js : scripts) {
                String jsrc = js.attr("src");
                if (jsrc.contains("/fValidate.one.js")) {
                    String oldJS = "/scripts/fValidate/fValidate.one.js"; // previous script path
                    jsrc = jsrc.replace(oldJS, _JS);
                    Element val = js.attr("src", jsrc); // update the src value
                    sb.append(val.toString()).append("\n");
                }
            }
            // Close the head exactly once, whether or not a script matched
            sb.append("</head>");
            // src: rewrite every src attribute under body that points at /images
            Elements tags = body.select("*");
            for (Element tag : tags) {
                String src = tag.attr("src");
                if (src.contains("/images")) {
                    src = src.replace("/images", Img);
                    tag.attr("src", src); // update the src value
                }
            }
            sb.append(body.toString());
            sb.append("</html>");
            Writer out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outFile), "gbk"));
            out.write(sb.toString());
            out.close();
            System.out.println("Page crawl finished");
        } catch (IOException e) {
            e.printStackTrace();
        }
        return url;
    }

    public static void main(String[] args) throws Exception {
        String[] params = {"admin", "admin123"};
        HttpClient client = loginHtml(SITE, PORT, loginAction, params);
        // Generate the page
        String path = createHtml(client, forwardURL);
        System.out.println(JsoupFile(path, 1));
    }
}