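I don't particularly like the approach taken by any of the existing answers.
Timo's code: it might sleep/select() during CURLM_CALL_MULTI_PERFORM, which is wrong. It might also fail to sleep when ($still_running > 0 && $exec != CURLM_CALL_MULTI_PERFORM), which can make the code spin at 100% CPU usage (of 1 core) for no reason.
Sudhir's code: it does not sleep while $still_running > 0, and keeps calling the async function curl_multi_exec() over and over until everything has been downloaded. That makes PHP use 100% CPU (of 1 core) until all downloads are done; in other words, it fails to sleep while downloading.
Here is an approach with neither of those issues: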
$websites = array(
    "http://google.com",
    "http://example.org"
    // $url2,
    // $url3,
    // ...
    // $url15
);
$mh = curl_multi_init();
foreach ($websites as $website) {
    $worker = curl_init($website);
    curl_setopt_array($worker, [
        CURLOPT_RETURNTRANSFER => 1
    ]);
    curl_multi_add_handle($mh, $worker);
}
for (;;) {
    $still_running = null;
    do {
        $err = curl_multi_exec($mh, $still_running);
    } while ($err === CURLM_CALL_MULTI_PERFORM);
    if ($err !== CURLM_OK) {
        // handle curl multi error?
    }
    if ($still_running < 1) {
        // all downloads completed
        break;
    }
    // some haven't finished downloading, sleep until more data arrives:
    curl_multi_select($mh, 1);
}
$results = [];
while (false !== ($info = curl_multi_info_read($mh))) {
    if ($info["result"] !== CURLE_OK) {
        // handle download error?
    }
    $results[curl_getinfo($info["handle"], CURLINFO_EFFECTIVE_URL)] = curl_multi_getcontent($info["handle"]);
    curl_multi_remove_handle($mh, $info["handle"]);
    curl_close($info["handle"]);
}
curl_multi_close($mh);
var_export($results);
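One thing to watch in the snippet above: results are keyed by CURLINFO_EFFECTIVE_URL, which is the URL as curl ended up using it, so a key may not be byte-for-byte identical to the string you put in $websites (and it will differ after a redirect if you enable CURLOPT_FOLLOWLOCATION). A minimal sketch of one way around that, assuming you would rather key by the URL you asked for: stash the requested URL on each handle with CURLOPT_PRIVATE and read it back with CURLINFO_PRIVATE. The function name fetch_all_keyed is made up for this illustration and is not part of the original answer.

```php
<?php
// Hypothetical helper (illustrative name): fetch every URL at once and key
// the results by the URL string that was passed in.
function fetch_all_keyed(array $urls): array
{
    $mh = curl_multi_init();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => 1,
            CURLOPT_PRIVATE => $url, // remember the requested URL on the handle
        ]);
        curl_multi_add_handle($mh, $ch);
    }
    // Same wait loop as above: run, then sleep until there is more to do.
    for (;;) {
        $still_running = null;
        do {
            $err = curl_multi_exec($mh, $still_running);
        } while ($err === CURLM_CALL_MULTI_PERFORM);
        if ($still_running < 1) {
            break;
        }
        curl_multi_select($mh, 1);
    }
    $results = [];
    while (false !== ($info = curl_multi_info_read($mh))) {
        $ch = $info["handle"];
        $key = curl_getinfo($ch, CURLINFO_PRIVATE); // the URL stored above
        // Assumption for this sketch: a failed transfer is recorded as false.
        $results[$key] = ($info["result"] === CURLE_OK) ? curl_multi_getcontent($ch) : false;
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}
```

Duplicate URLs in the input still collapse into a single key here; if that matters, key by the array index instead of the URL.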
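Note that one issue shared by all 3 approaches here (my answer, Sudhir's answer and Timo's answer) is that they open all connections simultaneously: if you have 1,000,000 websites to fetch, these scripts will try to open 1,000,000 connections at once. If you need to download only, say, 50 websites at a time, maybe try: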
$websites = array(
    "http://google.com",
    "http://example.org"
    // $url2,
    // $url3,
    // ...
    // $url15
);
var_dump(fetch_urls($websites, 50));
function fetch_urls(array $urls, int $max_connections, int $timeout_ms = 10000, bool $return_fault_reason = true): array
{
    if ($max_connections < 1) {
        throw new \InvalidArgumentException("max_connections MUST be >=1");
    }
    foreach ($urls as $key => $foo) {
        if (! is_string($foo)) {
            throw new \InvalidArgumentException("all urls must be strings!");
        }
        if (empty($foo)) {
            unset($urls[$key]); // ?
        }
    }
    unset($foo);
    // DISABLED for benchmarking purposes: $urls = array_unique($urls); // remove duplicates.
    $ret = array();
    $mh = curl_multi_init();
    $workers = array();
    $work = function () use (&$ret, &$workers, &$mh, $return_fault_reason) {
        // > If an added handle fails very quickly, it may never be counted as a running_handle
        while (1) {
            do {
                $err = curl_multi_exec($mh, $still_running);
            } while ($err === CURLM_CALL_MULTI_PERFORM);
            if ($still_running < count($workers)) {
                // some workers finished, fetch their response and close them
                break;
            }
            $cms = curl_multi_select($mh, 1);
            // var_dump('sr: ' . $still_running . " c: " . count($workers)." cms: ".$cms);
        }
        while (false !== ($info = curl_multi_info_read($mh))) {
            // echo "NOT FALSE!";
            // var_dump($info);
            if ($info['msg'] !== CURLMSG_DONE) {
                continue;
            }
            if ($info['result'] !== CURLE_OK) {
                if ($return_fault_reason) {
                    $ret[$workers[(int) $info['handle']]] = print_r(array(
                        false,
                        $info['result'],
                        "curl_exec error " . $info['result'] . ": " . curl_strerror($info['result'])
                    ), true);
                }
            } elseif (CURLE_OK !== ($err = curl_errno($info['handle']))) {
                if ($return_fault_reason) {
                    $ret[$workers[(int) $info['handle']]] = print_r(array(
                        false,
                        $err,
                        "curl error " . $err . ": " . curl_strerror($err)
                    ), true);
                }
            } else {
                $ret[$workers[(int) $info['handle']]] = curl_multi_getcontent($info['handle']);
            }
            curl_multi_remove_handle($mh, $info['handle']);
            assert(isset($workers[(int) $info['handle']]));
            unset($workers[(int) $info['handle']]);
            curl_close($info['handle']);
        }
        // echo "NO MORE INFO!";
    };
    foreach ($urls as $url) {
        while (count($workers) >= $max_connections) {
            // echo "TOO MANY WORKERS!\n";
            $work();
        }
        $neww = curl_init($url);
        if (! $neww) {
            trigger_error("curl_init() failed! probably means that max_connections is too high and you ran out of system resources", E_USER_WARNING);
            if ($return_fault_reason) {
                $ret[$url] = array(
                    false,
                    -1,
                    "curl_init() failed"
                );
            }
            continue;
        }
        $workers[(int) $neww] = $url;
        curl_setopt_array($neww, array(
            CURLOPT_RETURNTRANSFER => 1,
            // NOTE: the next two options turn off TLS certificate verification; drop them unless you accept that risk
            CURLOPT_SSL_VERIFYHOST => 0,
            CURLOPT_SSL_VERIFYPEER => 0,
            CURLOPT_TIMEOUT_MS => $timeout_ms
        ));
        curl_multi_add_handle($mh, $neww);
        // curl_multi_exec($mh, $unused_here); LIKELY TO BE MUCH SLOWER IF DONE IN THIS LOOP: TOO MANY SYSCALLS
    }
    while (count($workers) > 0) {
        // echo "WAITING FOR WORKERS TO BECOME 0!";
        // var_dump(count($workers));
        $work();
    }
    curl_multi_close($mh);
    return $ret;
}
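This downloads the entire list without ever fetching more than 50 URLs simultaneously. Even this approach stores all results in RAM, though, so it may still end up running out of memory; if you want to store the results in a database instead, modify the curl_multi_getcontent part to write each response to the database rather than to a variable that stays in RAM.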
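To make that last suggestion concrete, here is a rough sketch of one way it could look: the same bounded worker loop, but each finished response is handed to a callback and then dropped, so nothing accumulates in an array. The function name fetch_urls_each, the $on_done callback and its (url, body, error) signature are all made up for this illustration, and storing the URL via CURLOPT_PRIVATE is my choice for the sketch rather than what fetch_urls() does.

```php
<?php
// Hypothetical variant (illustrative names): cap concurrency and pass each
// finished response to $on_done instead of collecting it in memory.
function fetch_urls_each(array $urls, int $max_connections, callable $on_done): void
{
    if ($max_connections < 1) {
        throw new InvalidArgumentException("max_connections MUST be >=1");
    }
    $mh = curl_multi_init();
    $active = 0; // handles added to $mh and not yet reported
    $work = function () use ($mh, &$active, $on_done) {
        // Drive transfers, sleeping in between, until at least one has finished.
        for (;;) {
            do {
                $err = curl_multi_exec($mh, $still_running);
            } while ($err === CURLM_CALL_MULTI_PERFORM);
            if ($still_running < $active) {
                break;
            }
            curl_multi_select($mh, 1);
        }
        while (false !== ($info = curl_multi_info_read($mh))) {
            if ($info['msg'] !== CURLMSG_DONE) {
                continue;
            }
            $ch = $info['handle'];
            $url = curl_getinfo($ch, CURLINFO_PRIVATE); // URL stored when the handle was created
            if ($info['result'] === CURLE_OK) {
                $on_done($url, curl_multi_getcontent($ch), null);
            } else {
                $on_done($url, null, curl_strerror($info['result']));
            }
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
            --$active;
        }
    };
    foreach ($urls as $url) {
        while ($active >= $max_connections) {
            $work();
        }
        $ch = curl_init($url);
        if ($ch === false) {
            $on_done($url, null, "curl_init() failed");
            continue;
        }
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => 1,
            CURLOPT_PRIVATE => $url,
        ]);
        curl_multi_add_handle($mh, $ch);
        ++$active;
    }
    while ($active > 0) {
        $work();
    }
    curl_multi_close($mh);
}
```

The callback could, for example, run a prepared INSERT for each row, or write the body to a file. Each body still has to fit in memory while its own transfer is in flight, so peak usage is roughly bounded by $max_connections responses instead of the whole list.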