file_get_contents() gives me 403 Forbidden

php html http-headers screen-scraping

2022-08-30 16:53:22

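I have a partner who has created some content for me to scrape.
I can access the page with my browser, but when I try to use file_get_contents, I get a 403 Forbidden.

I've tried using stream_context_create, but that didn't help, possibly because I don't know what should go into it.

1) Is there any way I can scrape the data?
2) If not, and if the partner is not allowed to configure the server to give me access, what can I do then?

The code I've tried using: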

$opts = array(
  'http'=>array(
    'user_agent' => 'My company name',
    'method'=>"GET",
    'header'=> implode("\r\n", array(
      'Content-type: text/plain;'
    ))
  )
);

$context = stream_context_create($opts);

//Get header content
$_header = file_get_contents($partner_url,false, $context);

Answer 1

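This is not a problem in your script, but a security feature of your partner's web server.

It's hard to say exactly what is blocking you; most likely it is some kind of anti-scraping protection. If your partner has access to the web server's configuration, that could help pinpoint the cause.

What you can do is "fake a web browser" by setting the User-Agent header so that the request mimics a standard web browser.

I would recommend cURL for this; good documentation for it is easy to find.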

    // create curl resource
    $ch = curl_init();

    // set url
    curl_setopt($ch, CURLOPT_URL, "example.com");

    //return the transfer as a string
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');

    // $output contains the output string
    $output = curl_exec($ch);

    // close curl resource to free up system resources
    curl_close($ch);
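
If the request still fails, it can help to look at the status code the server actually returned instead of only the body. The following is a minimal sketch of that idea, not a definitive implementation: the function name fetch_with_user_agent, the returned array shape, and the timeout value are my own assumptions, not part of the answer above.

```php
<?php
// Hypothetical helper: fetch a URL with a browser-like User-Agent and
// return the HTTP status alongside the body, so a 403 is visible to the caller.
function fetch_with_user_agent(string $url, string $userAgent): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,       // return the body as a string
        CURLOPT_USERAGENT      => $userAgent, // value sent in the User-Agent header
        CURLOPT_FOLLOWLOCATION => true,       // follow redirects, if any
        CURLOPT_TIMEOUT        => 10,         // assumed timeout, in seconds
    ]);

    $body   = curl_exec($ch);                        // false on transport failure
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE); // 0 if no response was received
    $error  = curl_error($ch);                       // empty string when no error occurred

    // No explicit curl_close(): the handle is released when $ch goes out of scope.
    return [
        'status' => $status,
        'body'   => $body === false ? null : $body,
        'error'  => $error,
    ];
}
```

Calling it with the partner URL and a current browser's User-Agent string, then checking whether 'status' is 200 or still 403, tells you whether the User-Agent was the deciding factor or whether the server is filtering on something else (IP address, cookies, Referer, and so on).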

Answer 2

Set the user agent first:

ini_set('user_agent','Mozilla/4.0 (compatible; MSIE 6.0)');
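
If you would rather not change the setting globally, the same User-Agent can be passed per request through a stream context. Here is a rough sketch, assuming the goal is to keep using file_get_contents and also see the status code; the helper names last_status_code and fetch_with_context and the timeout value are made up for illustration.

```php
<?php
// Hypothetical helper: extract the status code from the response header lines.
function last_status_code(array $headers): ?int
{
    $status = null;
    foreach ($headers as $line) {
        // After redirects there can be several status lines; keep the last one.
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Hypothetical helper: file_get_contents with a per-request User-Agent.
function fetch_with_context(string $url, string $userAgent): array
{
    $context = stream_context_create([
        'http' => [
            'method'        => 'GET',
            'user_agent'    => $userAgent,
            'ignore_errors' => true, // return the body even for 4xx/5xx responses
            'timeout'       => 10,   // assumed timeout, in seconds
        ],
    ]);

    $body = @file_get_contents($url, false, $context);

    // PHP 8.4+ offers http_get_last_response_headers(); older versions
    // expose the same lines through the local $http_response_header variable.
    $headers = function_exists('http_get_last_response_headers')
        ? (http_get_last_response_headers() ?? [])
        : ($http_response_header ?? []);

    return [
        'status' => last_status_code($headers),
        'body'   => $body === false ? null : $body,
    ];
}
```

With ignore_errors enabled, a 403 no longer makes file_get_contents return false, so you can read both the status and whatever error page the server sent back, which sometimes says why the request was refused.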
