URLConnection won't let me access the data of HTTP errors (404, 500, etc.)

2022-09-01 21:48:53

I'm building a crawler and need to get the data from the stream regardless of whether it's a 200 or not. cURL does this, and so does any standard browser.

The code below will not actually fetch the content of the request, even though there is some; instead, an exception is thrown carrying the HTTP error status code. I want the output regardless, is there a way? I'd prefer to use this library because it actually does persistent connections, which is perfect for the type of crawling I'm doing.

package test;

import java.net.*;
import java.io.*;

public class Test {

    public static void main(String[] args) {

         try {

            URL url = new URL("http://github.com/XXXXXXXXXXXXXX");
            URLConnection connection = url.openConnection();

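            // For 4xx/5xx responses, getInputStream() throws an IOException (e.g. FileNotFoundException for a 404) instead of returning the body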
            DataInputStream inStream = new DataInputStream(connection.getInputStream());
            String inputLine;

            while ((inputLine = inStream.readLine()) != null) {
                System.out.println(inputLine);
            }
            inStream.close();
        } catch (MalformedURLException me) {
            System.err.println("MalformedURLException: " + me);
        } catch (IOException ioe) {
            System.err.println("IOException: " + ioe);
        }
    }
}

Working, thanks. Here is what I came up with, just as a rough proof of concept:

import java.net.*;
import java.io.*;

public class Test {

    public static void main(String[] args) {

        URL url = null;
        URLConnection connection = null;
        String inputLine = "";

        try {

            url = new URL("http://verelo.com/asdfrwdfgdg");
            connection = url.openConnection();

            DataInputStream inStream = new DataInputStream(connection.getInputStream());

            while ((inputLine = inStream.readLine()) != null) {
                System.out.println(inputLine);
            }
            inStream.close();
        } catch (MalformedURLException me) {
            System.err.println("MalformedURLException: " + me);
        } catch (IOException ioe) {
            System.err.println("IOException: " + ioe);

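            // even though getInputStream() threw, the error body (e.g. the 404 page) is still available via the error stream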
            InputStream error = ((HttpURLConnection) connection).getErrorStream();

            try {
                int data = error.read();
                while (data != -1) {
                    //do something with data...
                    //System.out.println(data);
                    inputLine = inputLine + (char)data;
                    data = error.read();
                    //inputLine = inputLine + (char)data;
                }
                error.close();
            } catch (Exception ex) {
                try {
                    if (error != null) {
                        error.close();
                    }
                } catch (Exception e) {

                }
            }
        }

        System.out.println(inputLine);
    }
}

Answer 1

Simple:

URLConnection connection = url.openConnection();
InputStream is = connection.getInputStream();
if (connection instanceof HttpURLConnection) {
   HttpURLConnection httpConn = (HttpURLConnection) connection;
   int statusCode = httpConn.getResponseCode();
   if (statusCode != 200 /* or statusCode >= 200 && statusCode < 300 */) {
     is = httpConn.getErrorStream();
   }
}

You can consult the Javadoc for the explanation. The best way I've found to handle this is as follows:

URLConnection connection = url.openConnection();
InputStream is = null;
try {
    is = connection.getInputStream();
} catch (IOException ioe) {
    if (connection instanceof HttpURLConnection) {
        HttpURLConnection httpConn = (HttpURLConnection) connection;
        int statusCode = httpConn.getResponseCode();
        if (statusCode != 200) {
            is = httpConn.getErrorStream();
        }
    }
}
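
Once the InputStream has been chosen this way, reading it works the same whether it came from getInputStream or getErrorStream. The sketch below puts the whole round trip together; the class and method names (FetchBodyExample, fetchBody) are made up for illustration, and it assumes the response body is UTF-8 text:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class FetchBodyExample {

    // Illustrative helper: returns the response body for both 2xx and error responses.
    static String fetchBody(String address) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(address).openConnection();

        InputStream is;
        try {
            is = connection.getInputStream();      // regular body for 2xx responses
        } catch (IOException ioe) {
            is = connection.getErrorStream();      // error page for 4xx/5xx responses
            if (is == null) {
                throw ioe;                         // no body at all, e.g. the connection itself failed
            }
        }

        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(is, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        // Prints the page content even when the server responds with a 404.
        System.out.println(fetchBody("http://github.com/XXXXXXXXXXXXXX"));
    }
}

The null check matters because getErrorStream returns null when there is no error body to read, for example when the connection could not be established at all.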

Answer 2

After calling openConnection, you need to do the following (a short sketch follows the list):

  1. Cast the URLConnection to HttpURLConnection

  2. Call getResponseCode

  3. If the response was a success, use getInputStream, otherwise use getErrorStream

(The test for success should be 200 <= code < 300, because there are valid HTTP success codes other than 200.)
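
Putting those three steps together, a minimal sketch might look like the following; openResponseStream is just an illustrative name, and the success test is the 200 <= code < 300 range from the note above:

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ResponseStreamExample {

    // Illustrative helper implementing the three steps above.
    static InputStream openResponseStream(URL url) throws IOException {
        // 1. Cast the URLConnection to HttpURLConnection.
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();

        // 2. Call getResponseCode (this sends the request if it hasn't been sent yet).
        int code = connection.getResponseCode();

        // 3. Success: the regular body; otherwise: the error body.
        if (code >= 200 && code < 300) {
            return connection.getInputStream();
        } else {
            return connection.getErrorStream();    // may be null if there is no error body
        }
    }
}

Note that getResponseCode returns the status even for 4xx and 5xx responses and only throws for lower-level I/O failures, which is why no try/catch is needed around the check itself.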


"I'm building a crawler and need to get the data from the stream regardless of whether it's a 200 or not."

Note that if the code is a 4xx or 5xx, the "data" is likely to be some kind of error page.


The final point that should be made is that you should always respect the "robots.txt" file... and read the Terms of Service before you crawl/scrape the content of a site whose owners might care. Simply firing off GET requests is likely to annoy site owners... unless you've already come to some sort of "arrangement" with them.

