Using regular expressions to extract links and their text from a web page: solution

Date: 2014-05-19

I want to use a regular expression to extract the links and their text from a web page.

<img src="http://www.baidu.com/icon.png" /><a href="http://guide.sina.cn/?pos=1&vt=1">导航</a><a href="http://sina.cn/nc.php?pos=1&vt=1">新闻</a><a href="http://mil.sina.cn/?pos=1&vt=1">军事</a><a href="http://weibo.cn/?gotoreg=1&from=index&s2w=index&wm=ig_0001_index&pos=1&vt=1">微博</a><a href="http://finance.sina.cn/?sa=t60d13v512&pos=1&vt=1">股票</a><br/>

I want to get each href URL in this content together with its link text, for example:

http://guide.sina.cn/?pos=1&vt=1 导航
http://sina.cn/nc.php?pos=1&vt=1 新闻
...
http://finance.sina.cn/?sa=t60d13v512&pos=1&vt=1 股票

------Solution--------------------
The first method reads the data from a file; the second parses it.

Java code


import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkParser {

    /**
     * Reads the target file into memory.
     * @return the file contents with line breaks removed
     * @author wangjikuan
     */
    private static StringBuilder getSb() {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream("c:/xx.txt"), "gbk"))) {
            String s;
            while ((s = reader.readLine()) != null) {
                sb.append(s);
            }
        } catch (IOException e) {
            // FileNotFoundException is a subclass of IOException
            e.printStackTrace();
        }
        return sb;
    }

    /**
     * Parses the string and prints each link's URL and text.
     * @param sb the HTML to parse
     * @author wangjikuan
     */
    private static void parse(StringBuilder sb) {
        // Match each <a ...>...</a> element first, then look inside it
        Pattern p = Pattern.compile("<a.*?</a>");
        // href value: everything after href=" up to the next double quote
        Pattern p1 = Pattern.compile("(?<=href=\")[^\"]*");
        // link text: everything between the opening tag's > and the last <
        Pattern p2 = Pattern.compile("(?<=>).*(?=<)");

        Matcher m = p.matcher(sb.toString());
        while (m.find()) {
            String child = m.group();
            Matcher m1 = p1.matcher(child);
            if (m1.find()) {
                // trailing space separates the URL from the link text
                System.out.print(m1.group() + " ");
            }
            Matcher m2 = p2.matcher(child);
            if (m2.find()) {
                System.out.println(m2.group());
            }
        }
    }

    public static void main(String[] args) {
        parse(getSb());
    }
}
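
A note on the look-around pattern for href used above: a greedy `.*` between look-arounds runs to the last double quote in the tag, so it only returns a clean URL when href is the sole quoted attribute. The sketch below compares that greedy form with a negated character class. The class name HrefGreedinessDemo, the helper firstMatch, and the extra class="nav" attribute are made up for illustration and are not part of the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefGreedinessDemo {
    // Greedy: ".*" backtracks only as far as the LAST double quote in the tag.
    static final Pattern GREEDY = Pattern.compile("(?<=href=\").*(?=\")");
    // Negated class: "[^\"]*" cannot cross a double quote, so it stops at the first one.
    static final Pattern BOUNDED = Pattern.compile("(?<=href=\")[^\"]*");

    // Returns the first match of p in input, or null if there is none.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical tag with a second quoted attribute after href
        String tag = "<a href=\"http://sina.cn/nc.php?pos=1&vt=1\" class=\"nav\">News</a>";
        // prints: http://sina.cn/nc.php?pos=1&vt=1" class="nav
        System.out.println(firstMatch(GREEDY, tag));
        // prints: http://sina.cn/nc.php?pos=1&vt=1
        System.out.println(firstMatch(BOUNDED, tag));
    }
}
```

On the sample in the question both forms give the same result, because every anchor there has exactly one quoted attribute.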

------Solution--------------------
Java code
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkRegexDemo {
    public static void main(String args[]) {
        String str = "<img src=\"http://www.baidu.com/icon.png\" /><a href=\"http://guide.sina.cn/?pos=1&vt=1\">导航</a><a href=\"http://sina.cn/nc.php?pos=1&vt=1\">新闻</a><a href=\"http://mil.sina.cn/?pos=1&vt=1\">军事</a><a href=\"http://weibo.cn/?gotoreg=1&from=index&s2w=index&wm=ig_0001_index&pos=1&vt=1\">微博</a><a href=\"http://finance.sina.cn/?sa=t60d13v512&pos=1&vt=1\">股票</a><br/>";
        // Group 1 captures the href value, group 2 the link text up to the next tag
        String regex = "href=\"(.*?)\">(.*?)<";

        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(str);
        while (m.find()) {
            System.out.println(m.group(1));
            System.out.println(m.group(2));
            System.out.println("-------------");
        }
    }
}
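
The capture-group approach above assumes that href uses double quotes and is the last attribute before the closing >. A slightly more tolerant variant can be sketched as follows. It returns lines in the "URL text" format the question asks for, and accepts single or double quotes, upper-case tags, and extra attributes. The class name AnchorExtractor, the method extract, and the second sample anchor are assumptions made for illustration. For real-world pages a proper HTML parser is usually more reliable than any regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorExtractor {
    // group 1: the quote character, group 2: the href value, group 3: the link text
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns one "url text" line per anchor found in html.
    static List<String> extract(String html) {
        List<String> lines = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            lines.add(m.group(2) + " " + m.group(3).trim());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Shortened sample; the second anchor is a hypothetical variation
        String html = "<img src=\"http://www.baidu.com/icon.png\" />"
                + "<a href=\"http://guide.sina.cn/?pos=1&vt=1\">Guide</a>"
                + "<A class='nav' HREF='http://sina.cn/nc.php?pos=1&vt=1'>News</A><br/>";
        // prints:
        // http://guide.sina.cn/?pos=1&vt=1 Guide
        // http://sina.cn/nc.php?pos=1&vt=1 News
        extract(html).forEach(System.out::println);
    }
}
```

The backreference `\\1` makes the closing quote match whichever quote opened the value, and the `\\b` after `<a` keeps tags such as `<abbr>` from matching.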
