检测并从字符串中提取 URL?

2022-08-31 20:19:41

这是一个简单的问题,但我只是不明白。我想检测字符串中的url并将其替换为缩短的字符串。

我从stackoverflow中发现了这个表达式,但结果只是http

Pattern p = Pattern.compile("\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]",Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(str);
        boolean result = m.find();
        while (result) {
            for (int i = 1; i <= m.groupCount(); i++) {
                String url=m.group(i);
                str = str.replace(url, shorten(url));
            }
            result = m.find();
        }
        return html;

有没有更好的主意?


答案 1

让我继续说,我不是复杂案例正则表达式的大力倡导者。试图为这样的东西写出完美的表达是非常困难的。也就是说,我确实碰巧有一个用于检测URL的,并且它由通过的350行单元测试用例类支持。有人从一个简单的正则表达式开始,多年来,我们扩展了表达式和测试用例来处理我们发现的问题。这绝对不是微不足道的:

// Pattern for recognizing a URL, based off RFC 3986
private static final Pattern urlPattern = Pattern.compile(
        "(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
                + "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
                + "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

下面是使用它的示例:

Matcher matcher = urlPattern.matcher("foo bar http://example.com baz");
while (matcher.find()) {
    int matchStart = matcher.start(1);
    int matchEnd = matcher.end();
    // now you have the offsets of a URL match
}

答案 2
/**
 * Returns a list with all links contained in the input
 */
public static List<String> extractUrls(String text)
{
    List<String> containedUrls = new ArrayList<String>();
    String urlRegex = "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern pattern = Pattern.compile(urlRegex, Pattern.CASE_INSENSITIVE);
    Matcher urlMatcher = pattern.matcher(text);

    while (urlMatcher.find())
    {
        containedUrls.add(text.substring(urlMatcher.start(0),
                urlMatcher.end(0)));
    }

    return containedUrls;
}

例:

List<String> extractedUrls = extractUrls("Welcome to https://stackoverflow.com/ and here is another link http://www.google.com/ \n which is a great search engine");

for (String url : extractedUrls)
{
    System.out.println(url);
}

指纹:

https://stackoverflow.com/
http://www.google.com/