Date: 2014-05-18

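How do I scrape a web page that is only visible after logging in?
I want to scrape the content of a page on a website, but that page can only be opened after logging in.

I have a username and password, and the following code logs in successfully: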

URLConnection connection = new URL("http://localhost/login.jsp?user=test&pswd=123").openConnection();

BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
String line;
while ((line = reader.readLine()) != null) {
    System.out.println(line);
}
After a successful login, the user's session is kept on the server side.

===================================================

However, when I fetch a page that requires authorization in the same way, I still get the not-logged-in response. The code is as follows:
connection = new URL("http://localhost/user.jsp").openConnection();

reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
while ((line = reader.readLine()) != null) {
    System.out.println(line);
}
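The read fails, which means the server has no session for this user.
My guess is that this happens because the URLConnection object was re-initialized.

So how can I keep the session from the login above and continue using it?
Or is there some other way to get past the authentication and scrape the page?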

------Solution--------------------
Try http://jakarta.apache.org/commons/httpclient/ and see whether it helps.
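
For reference, HttpClient keeps cookies between requests made through the same client instance, which is why it helps here. A similar effect should be possible with the JDK alone (Java 6 or later): once a `java.net.CookieManager` is installed as the default `CookieHandler`, `HttpURLConnection` stores the cookies it receives and sends them back on later requests. The following is only a minimal sketch under that assumption; the class name `CookieManagerSketch` and the helper names `enableSessionCookies` and `fetch` are made up for illustration, and the URLs are supplied on the command line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;

public class CookieManagerSketch {

    /** Installs a JVM-wide cookie store so later requests reuse received cookies. */
    public static CookieManager enableSessionCookies() {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(manager);
        return manager;
    }

    /** Sends a GET request and returns the response body as one string. */
    public static String fetch(String address) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(address).openConnection();
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        enableSessionCookies();
        // Fetch the URLs in the order given, e.g. login page first, protected page second.
        for (String address : args) {
            System.out.println(fetch(address));
        }
    }
}
```

Run it with the login URL first and the protected URL second, for example `java CookieManagerSketch "http://localhost/login.jsp?user=test&pswd=123" http://localhost/user.jsp`. Whether the second page then shows the logged-in content depends on the site tracking its session through cookies.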
------Solution--------------------
You need to add the Cookie when making the second request.
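
A detail that is easy to miss: the `Set-Cookie` value a server returns carries attributes (`Path`, `Expires`, `HttpOnly` and so on) after the `name=value` pair, while the `Cookie` request header should contain only the `name=value` pairs, separated by `; `. A rough sketch of that conversion is below; the class `CookieHeaderSketch`, the method `toCookieHeader`, and the sample header values are illustrative assumptions, not part of the listing that follows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CookieHeaderSketch {

    /**
     * Builds a Cookie request-header value from Set-Cookie response-header values.
     * Only the leading name=value pair of each value is kept; attributes such as
     * Path, Expires or HttpOnly are instructions to the client and are dropped.
     */
    public static String toCookieHeader(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        if (setCookieValues != null) {
            for (String value : setCookieValues) {
                if (value == null) {
                    continue;
                }
                // Everything before the first ';' is the name=value pair.
                String pair = value.split(";", 2)[0].trim();
                if (!pair.isEmpty()) {
                    pairs.add(pair);
                }
            }
        }
        return String.join("; ", pairs);
    }

    public static void main(String[] args) {
        // Hypothetical header values, as a server might send them after a login.
        List<String> received = Arrays.asList(
                "JSESSIONID=abc123; Path=/; HttpOnly",
                "theme=dark; Max-Age=3600");
        System.out.println(toCookieHeader(received)); // prints JSESSIONID=abc123; theme=dark
    }
}
```

With `HttpURLConnection`, the list would typically come from `getHeaderFields().get("Set-Cookie")` on the first response, and the result would be passed to `setRequestProperty("Cookie", ...)` on the second connection before connecting.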
import java.io.*;
import java.net.URL;
import java.net.HttpURLConnection;
import java.net.URLEncoder;

public class GetCookie {
    // Login URL. URLEncoder.encode(String) is deprecated but still works;
    // encode(String, "UTF-8") is the usual replacement.
    private String url =
        "http://www.aaaa.net/USER/user_login.asp?logid=" + URLEncoder.encode("nihao321") + "&pswd=" + URLEncoder.encode("nihao321");
    // Page that is only available to a logged-in user.
    private String url1 = "http://www.aaaa.net/user/per_data.asp";

    public GetCookie() {
        //get();
    }

    public String get() {
        String sCurrentLine = "";
        StringBuffer sTotalString = new StringBuffer("");
        String Cookie = "";
        try {
            System.out.println(url);
            java.io.InputStream l_urlStream;
            java.io.BufferedReader l_reader = null;

            // First request: open the protected page once so the server issues a session cookie.
            java.net.HttpURLConnection l_connection;
            java.net.URL l_url = new java.net.URL(this.url1);
            l_connection = (java.net.HttpURLConnection) l_url.openConnection();
            l_connection.connect();
            // Look up Set-Cookie by name rather than by position, which varies between servers.
            Cookie = l_connection.getHeaderField("Set-Cookie");

            // Second request: log in while sending that cookie back,
            // so the login is bound to the same session.
            java.net.HttpURLConnection l_connection_1;
            java.net.URL l_url_1 = new java.net.URL(this.url);
            l_connection_1 = (java.net.HttpURLConnection) l_url_1.openConnection();
            if (Cookie != null) {
                l_connection_1.addRequestProperty("Cookie", Cookie);
            }
            l_connection_1.connect();
            l_urlStream = l_connection_1.getInputStream();
            l_reader = new java.io.BufferedReader(new java.io.InputStreamReader(l_urlStream));
            // Collect the login response body.
            while ((sCurrentLine = l_reader.readLine()) != null) {
                sTotalString.append(sCurrentLine);
            }
            //System.out.print(sTotalString);
/*System.out.println(l_connection_1.getHeaderField(0));
System.out.println(l_connection_1.getHeaderField(1));
System.out.println(l_connection_1.getHeaderField(2));
System.out.println(l_connection_1.getHeaderField(3));
System.out.println(l_connection_1.getHeaderField(4))