web crawler - Perl Mechanize module for scraping PDFs
I have a website with many PDFs uploaded to it, and I want to download all the PDFs present on the website. To do so, I first need to provide a username and password to the website. After searching for some time, I found that the WWW::Mechanize package can do this work. The problem arises here: I want to make a recursive search of the website, meaning that if a link does not point to a PDF, I should not discard the link but should navigate to it and check whether the new page has links that point to PDFs. In this way I should exhaustively search the entire website and download all the PDFs uploaded. Any suggestions on how to do this?
I'd go with wget, which runs on a variety of platforms.
If you want to do it in Perl, check CPAN for web crawlers.
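If you would rather stay with WWW::Mechanize, which the question already mentions, a breadth-first crawl might look roughly like the sketch below. Treat it as an illustration under assumptions, not a finished crawler: the sub names `collect_pdf_urls` and `crawl_site` are made up, a link counts as a PDF only if its URL ends in `.pdf`, the crawl stays on the starting host, and the login is assumed to be HTTP authentication (a site with a login form would need something like `submit_form` instead of `credentials`).

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Breadth-first crawl (hypothetical helper).
#   $links_of : callback, page URL -> list of absolute URLs linked from it
#   $in_scope : callback, URL -> true if the crawl should follow it
# Returns the PDF URLs in the order they were discovered.
sub collect_pdf_urls {
    my ($start, $links_of, $in_scope) = @_;
    my @queue = ($start);
    my %seen  = ($start => 1);
    my @pdfs;
    while (defined(my $page = shift @queue)) {
        for my $link ($links_of->($page)) {
            (my $url = $link) =~ s/#.*\z//s;      # drop any #fragment
            next if $url eq '' || $seen{$url}++;  # visit each URL once
            if ($url =~ /\.pdf(?:\?.*)?\z/i) {    # assumption: PDFs end in .pdf
                push @pdfs, $url;
            }
            elsif ($in_scope->($url)) {
                push @queue, $url;                # not a PDF: crawl it later
            }
        }
    }
    return @pdfs;
}

# Wire the crawl to WWW::Mechanize (loaded only when actually used).
sub crawl_site {
    my ($start, $user, $pass) = @_;
    require WWW::Mechanize;
    require URI;
    my $mech = WWW::Mechanize->new(autocheck => 0);
    $mech->credentials($user, $pass);             # HTTP authentication only
    my $host = lc URI->new($start)->host;

    my $links_of = sub {
        my ($page) = @_;
        $mech->get($page);
        return () unless $mech->success && $mech->is_html;
        return map { $_->url_abs->as_string } $mech->links;
    };
    my $in_scope = sub {
        my $uri = URI->new($_[0]);
        return ($uri->scheme // '') =~ /^https?\z/
            && lc($uri->host) eq $host;           # stay on the same host
    };
    return collect_pdf_urls($start, $links_of, $in_scope);
}

# Usage: perl collect.pl START_URL USER PASS > pdfs.txt
if (@ARGV == 3) {
    print "$_\n" for crawl_site(@ARGV);
}
```

Keeping the traversal in `collect_pdf_urls` separate from the HTTP code means the queue and the `%seen` bookkeeping can be exercised with a fake link table before pointing the script at the real site.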
You might want to decouple collecting the PDF URLs from downloading them. Crawling is a lengthy process, and it might be advantageous to be able to hand off the downloading tasks to separate worker processes.
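One way to picture that hand-off, as a minimal sketch only: a separate script reads the collected URLs from standard input, splits them into batches, and forks one worker per batch. The sub names `partition`, `local_name` and `download_all`, the round-robin split, HTTP authentication via `credentials`, and naming each file after the last path segment of its URL are all assumptions for illustration; two PDFs with the same file name would overwrite each other here.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Split the URLs into $n round-robin batches, one per worker.
sub partition {
    my ($n, @urls) = @_;
    my @batches = map { [] } 1 .. $n;
    push @{ $batches[$_ % $n] }, $urls[$_] for 0 .. $#urls;
    return @batches;
}

# Local file name: last path segment of the URL, query string removed.
sub local_name {
    my ($url) = @_;
    (my $path = $url) =~ s/[?#].*\z//s;
    my ($name) = $path =~ m{([^/]+)\z};
    return $name // 'index.pdf';
}

# Fork one worker per batch; each worker logs in and saves its PDFs.
sub download_all {
    my ($workers, $user, $pass, @urls) = @_;
    require WWW::Mechanize;
    my @pids;
    for my $batch (partition($workers, @urls)) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                          # child process
            my $mech = WWW::Mechanize->new(autocheck => 0);
            $mech->credentials($user, $pass);     # HTTP authentication only
            for my $url (@$batch) {
                # ':content_file' writes the response body straight to disk
                $mech->get($url, ':content_file' => local_name($url));
                warn "failed: $url\n" unless $mech->success;
            }
            exit 0;
        }
        push @pids, $pid;                         # parent remembers the child
    }
    waitpid($_, 0) for @pids;                     # wait for every worker
}

# Usage: perl download.pl WORKERS USER PASS < pdfs.txt
if (@ARGV == 3) {
    my ($workers, $user, $pass) = @ARGV;
    chomp(my @urls = <STDIN>);
    download_all($workers, $user, $pass, grep { length } @urls);
}
```

Because the URL list is just a text file, the crawl can be rerun or resumed independently of the downloads, and the number of workers can be tuned without touching the crawler.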