Date: 2014-05-18  Views: 20433

C# regex extraction. Thanks in advance for any help!
C# code

<div class="pagebox"><span class="pagebox_pre_nolink">上一页</span><span class="pagebox_num_nonce">1</span><span class="pagebox_num"><a target="_self" href="102641554-2.html" class="page">2</a></span><span class="pagebox_num"><a target="_self" href="102641554-3.html" class="page">3</a></span><span class="pagebox_num"><a target="_self" href="102641554-4.html" class="page">4</a></span><span class="pagebox_num"><a target="_self" href="102641554-5.html" class="page">5</a></span><span class="pagebox_next"><a href="102641554-2.html">下一页</a></span></div>




Expected output:
  102641554-2.html 2
  102641554-3.html 3
  102641554-4.html 4
  102641554-5.html 5

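In other words, for each <a> tag with class="page" I need both the href attribute value and the link text. The source data also contains other tags, so the class="page" condition is required.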

------Solution--------------------
href="([^"]+)"[^>]*>(\d+)</a>



1: 102641554-2.html
2: 2
1: 102641554-3.html
2: 3
1: 102641554-4.html
2: 4
1: 102641554-5.html
2: 5
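
For reference, here is a minimal sketch of how the pattern above could be applied in C#. The class name PageLinkDemo, the method name Extract, and the shortened sample input are my own and do not come from the thread. Note that this pattern does not test class="page" at all. It only skips the 下一页 (next page) link because that link's text is not numeric, so it may pick up other numeric links if the real page has any.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PageLinkDemo  // hypothetical name, for illustration only
{
    // Returns (href, text) pairs for every link whose text consists of digits only.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Regex.Matches(html, @"href=""([^""]+)""[^>]*>(\d+)</a>"))
            result.Add(new KeyValuePair<string, string>(m.Groups[1].Value, m.Groups[2].Value));
        return result;
    }

    static void Main()
    {
        // Shortened sample: one numbered page link and the "next page" link.
        string html =
            @"<span class=""pagebox_num""><a target=""_self"" href=""102641554-2.html"" class=""page"">2</a></span>" +
            @"<span class=""pagebox_next""><a href=""102641554-2.html"">下一页</a></span>";

        foreach (var pair in Extract(html))
            Console.WriteLine("{0} {1}", pair.Key, pair.Value);  // prints: 102641554-2.html 2
    }
}
```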

------Solution--------------------
C# code
            string str = @"<div class=""pagebox""><span class=""pagebox_pre_nolink"">上一页</span><span class=""pagebox_num_nonce"">1</span><span class=""pagebox_num""><a target=""_self"" href=""102641554-2.html"" class=""page"">2</a></span><span class=""pagebox_num""><a target=""_self"" href=""102641554-3.html"" class=""page"">3</a></span><span class=""pagebox_num""><a target=""_self"" href=""102641554-4.html"" class=""page"">4</a></span><span class=""pagebox_num""><a target=""_self"" href=""102641554-5.html"" class=""page"">5</a></span><span class=""pagebox_next""><a href=""102641554-2.html"">下一页</a></span></div>";
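            // Requires: using System; using System.Text.RegularExpressions;
            // Match <a> tags whose href is followed by class="page"; capture the URL and the link text.
            Regex reg = new Regex(@"(?is)<a[^>]*?href=(['""\s]?)(?<url>[^'""\s]+)\1[^>]*?class=""page""[^>]*?>(?<text>.*?)</a>");
            // Print "url text" for each page link found.
            foreach (Match m in reg.Matches(str))
                Console.WriteLine("{0} {1}", m.Groups["url"].Value, m.Groups["text"].Value);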
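
The pattern above expects href to come before class="page" inside the tag. If the source data might list the attributes in the other order, one possible variation is to check the class with a lookahead. This is only a sketch under my own assumptions: the name PageClassDemo and the sample input are hypothetical, and it assumes double-quoted attributes with the class value being exactly page.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PageClassDemo  // hypothetical name, for illustration only
{
    // The lookahead (?=[^>]*\bclass="page") requires class="page" somewhere in the
    // opening tag, so href may appear before or after it.
    static readonly Regex PageLink = new Regex(
        @"(?is)<a\b(?=[^>]*\bclass=""page"")[^>]*?\bhref=""(?<url>[^""]+)""[^>]*>(?<text>.*?)</a>");

    // Returns (href, text) pairs for every <a> tag that carries class="page".
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in PageLink.Matches(html))
            result.Add(new KeyValuePair<string, string>(m.Groups["url"].Value, m.Groups["text"].Value));
        return result;
    }

    static void Main()
    {
        // Assumed sample: class before href, class after href, and a link without the class.
        string html =
            @"<a class=""page"" href=""102641554-3.html"">3</a>" +
            @"<a target=""_self"" href=""102641554-4.html"" class=""page"">4</a>" +
            @"<a href=""102641554-2.html"">下一页</a>";

        foreach (var pair in Extract(html))
            Console.WriteLine("{0} {1}", pair.Key, pair.Value);
    }
}
```

For anything beyond a small, predictable fragment like this pager, a real HTML parser is usually the safer choice than a regex.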