Date: 2014-05-16

Jsoup Web Page Scraping and Analysis (1)

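Anyone who has parsed HTML documents in a Java program probably knows the open-source project htmlparser. I have used it too, but it has not been updated since 2006. My fundamentals are weak, so I never really understood how to extend it with custom tags, and I kept running into timeout problems. By chance I came across jsoup, which has been updated to version 1.7.2 and is easy to get started with. Here are some notes from using it: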

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.

Features: 1. Parse an HTML document; 2. Parse a body fragment.


Java code:
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    String html = "<html><head><title>First parse</title></head>"
      + "<body><p>Parsed HTML into a doc.</p></body></html>";
    Document doc = Jsoup.parse(html); // parse the document; doc.toString() converts it to text
    Element body = doc.body(); // get the body element; body.toString() converts it to text
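The feature list above mentions parsing a body fragment, which the snippet does not show directly. The following is a minimal sketch, assuming jsoup is on the classpath; the class name `ParseSketch`, its helper methods, and the sample markup are made up for illustration. `Jsoup.parseBodyFragment` wraps a partial snippet in a full document skeleton, and `select` takes a CSS selector.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup.
class ParseSketch {
    // Parse a complete HTML document and return the text of its <title>.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    // Parse a partial snippet as a body fragment and return the text of
    // the first <p>, or an empty string if the fragment has no <p>.
    static String firstParagraphOf(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        Element p = doc.body().select("p").first();
        return p == null ? "" : p.text();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>";
        System.out.println(titleOf(html));
        System.out.println(firstParagraphOf("<p>Just a fragment</p><p>second</p>"));
    }
}
```

Returning an empty string instead of null when nothing matches keeps the caller free of null checks; whether that suits a real program depends on how missing elements should be handled.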


Ways to obtain a document: 1. Load it from a local file; 2. Fetch it from a URL.


Java code:
    import java.io.File;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    /**
     * Use the static Jsoup.parse(File in, String charsetName, String baseUri) method.
     * The baseUri parameter is used to resolve URLs in the file that are relative paths.
     * If it is not needed, pass an empty string.
     */
    File input = new File("/tmp/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/"); // throws IOException
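To see what the baseUri argument changes, here is a small sketch, again assuming jsoup is on the classpath. It uses the `Jsoup.parse(String html, String baseUri)` overload rather than a file so that it stays self-contained; the class name `BaseUriSketch`, its helper method, and the sample link are hypothetical. `absUrl("href")` resolves a relative link against the base URI and returns an empty string when no absolute URL can be built.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup.
class BaseUriSketch {
    // Return the first link's href as an absolute URL, resolved against baseUri.
    // Returns "" if there is no link or the href cannot be made absolute.
    static String firstAbsoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<a href=\"/news/today.html\">News</a>";
        // With a base URI the relative link becomes absolute.
        System.out.println(firstAbsoluteLink(html, "http://example.com/"));
        // With an empty base URI there is nothing to resolve against.
        System.out.println(firstAbsoluteLink(html, "").isEmpty());
    }
}
```

Passing an empty base URI is fine when only text is extracted, but link harvesting generally needs the real page address so that relative links resolve.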


Java code:
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    /**
     * Fetch content directly from a URL. A timeout can be added. If the get method
     * does not work, use the post method.
     * In real use I ran into errors such as 404, 405, and 504;
     * changing get to post fixed it, or the other way around.
     * Once I understand why, I will come back and explain it properly.
     */
    Document doc1 = Jsoup.connect("http://www.hao123.com/").get(); // throws IOException
    String title = doc1.title(); // get the page title
    String content = doc1.toString(); // convert the page to text

    Document doc2 = Jsoup.connect("http://www.hao123.com")
      .data("query", "Java") // request parameter
      .userAgent("