Web crawler: Perl Mechanize module for scraping PDFs
I have a website with many PDFs uploaded, and I want to download all the PDFs present on the site. First I need to provide a username and password to the website. After searching for some time I found that the WWW::Mechanize package can do this. The problem arises when I want to make the search recursive: if a link does not point to a PDF, I should not discard it, but instead navigate to it and check whether the new page has links to PDFs. In this way the crawler should exhaustively search the entire website and download every PDF uploaded. Any suggestions on how to do this?
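A minimal sketch of the recursive approach with WWW::Mechanize, assuming the site uses HTTP Basic auth and that PDF links end in `.pdf` (the start URL and credentials are placeholders). It does a breadth-first walk, stays on the starting host, mirrors PDFs, and queues every other on-site link:

```perl
#!/usr/bin/perl
# Sketch only: assumes HTTP Basic auth and ".pdf"-suffixed links.
use strict;
use warnings;
use WWW::Mechanize;
use URI;

# True for links that look like PDFs (assumption: ".pdf" suffix).
sub is_pdf_url {
    my ($url) = @_;
    return URI->new($url)->path =~ /\.pdf\z/i ? 1 : 0;
}

sub crawl_pdfs {
    my ($start, $user, $pass) = @_;
    my $mech = WWW::Mechanize->new( autocheck => 0 );
    $mech->credentials( $user, $pass );    # HTTP Basic auth (assumption)

    my $host  = URI->new($start)->host;
    my %seen  = ( $start => 1 );
    my @queue = ($start);

    while ( my $url = shift @queue ) {
        $mech->get($url);
        next unless $mech->success && $mech->is_html;
        for my $link ( $mech->links ) {
            my $abs = $link->url_abs->as_string;
            next if $seen{$abs}++;                       # visit each URL once
            next unless URI->new($abs)->host eq $host;   # stay on this site
            if ( is_pdf_url($abs) ) {
                ( my $file = $abs ) =~ s{.*/}{};
                $mech->mirror( $abs, $file );            # download the PDF
            }
            else {
                push @queue, $abs;                       # recurse into the page
            }
        }
    }
}

crawl_pdfs( 'https://example.com/', 'user', 'secret' ) unless caller;
```

The `%seen` hash is what keeps the recursion from looping forever on sites that link back to themselves.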
I'd go with wget, which runs on a variety of platforms.
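For example, something along these lines (a sketch: the `--user`/`--password` flags assume HTTP Basic auth, and the domain is a placeholder):

```shell
# Recursive crawl that keeps only PDFs; credentials assume HTTP Basic auth.
wget --recursive --level=inf --no-parent \
     --accept '*.pdf' \
     --user=USERNAME --password=PASSWORD \
     https://example.com/
```

If the login is form-based rather than HTTP auth, you would first log in once with `--post-data` and `--save-cookies`, then run the recursive crawl with `--load-cookies`.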
If you want to do it in Perl, check CPAN for web crawlers.
You might want to decouple collecting the PDF URLs from downloading them. Crawling is a lengthy process, and it might be advantageous to be able to hand off the downloading tasks to separate worker processes.
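One hypothetical way to wire that up: have the crawler only append URLs to a file, then fan the downloads out with `xargs -P`. The snippet below stands in a sample URL list and uses `echo` in place of the real download command so it runs without network access:

```shell
# Stage 1 (stand-in): the crawler would write one PDF URL per line.
printf '%s\n' \
  'https://example.com/a.pdf' \
  'https://example.com/b.pdf' > pdf-urls.txt

# Stage 2: four parallel workers. For real downloads, replace `echo` with
# e.g. `wget --user=USERNAME --password=PASSWORD` (assuming HTTP Basic auth).
xargs -n1 -P4 echo < pdf-urls.txt
```

This also makes the whole job restartable: if a download dies, you still have the full URL list and can re-run stage 2 alone.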