[00:20] *** satoshi has joined #archiveteam
[00:37] *** hive-mind has joined #archiveteam
[00:37] *** hive-min1 has quit IRC (Read error: Connection reset by peer)
[00:38] *** BlueMax has joined #archiveteam
[01:56] *** sirvy_ has joined #archiveteam
[02:16] *** killsushi has quit IRC (Quit: Leaving)
[02:39] *** satoshi has quit IRC (Remote host closed the connection)
[02:41] *** Raccoon has joined #archiveteam
[02:47] *** Fusl has quit IRC (Quit: K-Lined)
[02:47] *** Fusl__ has quit IRC (Quit: K-Lined)
[02:48] *** Fusl has joined #archiveteam
[02:48] *** svchfoo3 sets mode: +o Fusl
[02:49] *** Fusl is now known as Fusl__
[02:49] *** Fusl_ sets mode: +o Fusl__
[02:49] *** Fusl has joined #archiveteam
[02:49] *** Fusl__ sets mode: +o Fusl
[02:50] *** Fusl_ sets mode: +o Fusl
[02:51] *** Fusl__ has quit IRC (Client Quit)
[02:52] *** Fusl__ has joined #archiveteam
[02:52] *** Fusl_ sets mode: +o Fusl__
[02:52] *** Fusl sets mode: +o Fusl__
[03:41] *** m007a83_ has joined #archiveteam
[03:44] *** qw3rty116 has joined #archiveteam
[03:45] *** m007a83 has quit IRC (Ping timeout: 252 seconds)
[03:50] *** qw3rty115 has quit IRC (Ping timeout: 600 seconds)
[03:54] *** odemgi_ has joined #archiveteam
[03:56] *** odemg has quit IRC (Read error: Operation timed out)
[04:00] *** odemgi has quit IRC (Read error: Operation timed out)
[04:10] *** odemg has joined #archiveteam
[04:46] *** Flashfloo has quit IRC (The Lounge - https://thelounge.chat)
[04:46] *** Flashfire has quit IRC (Quit: The Lounge - https://thelounge.chat)
[04:46] *** kiska has quit IRC (Quit: The Lounge - https://thelounge.chat)
[04:46] *** Flashfloo has joined #archiveteam
[04:46] *** kiska has joined #archiveteam
[04:46] *** Fusl sets mode: +o kiska
[04:46] *** Fusl__ sets mode: +o kiska
[04:46] *** Fusl_ sets mode: +o kiska
[04:46] *** Flashfire has joined #archiveteam
[05:21] *** cerca has quit IRC (Leaving)
[05:37] *** dhyan_nat has joined #archiveteam
[05:42] *** Ivy has quit IRC (Quit: Connection closed for inactivity)
[05:50] *** m007a83 has joined #archiveteam
[05:53] *** m007a83_ has quit IRC (Ping timeout: 252 seconds)
[07:46] *** jut has joined #archiveteam
[09:18] *** killsushi has joined #archiveteam
[09:38] *** magus_bgf has joined #archiveteam
[09:56] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[09:58] *** magus_bgf has quit IRC (Read error: Operation timed out)
[10:01] *** magus_bgf has joined #archiveteam
[10:14] Hey guys. I'm looking for advice. Need to archive (continuously) a few dozen sites, up to 100-200 thousand pages. Started with wget/bash, but they no longer cut it. Need something that supports incremental crawls, smart error handling/crawl delays/url parameter handling. Some reports would be nice, but preferably no database. Most importantly, it should be easy to restore a site from the archive, at least in html form
[10:14] (and from what I understand, restoring from warc is not). So, what would be a good tool for this?
[10:21] *** magus_bgf has quit IRC (Read error: Connection reset by peer)
[10:22] *** magus_bgf has joined #archiveteam
[10:25] it sounds like you have an exciting life of writing web crawler software ahead of you
[10:25] *** magus_bgf has quit IRC (Remote host closed the connection)
[10:28] *** magus_bgf has joined #archiveteam
[10:29] *** Dragnog has quit IRC (Ping timeout: 246 seconds)
[10:29] I think Heritrix supports incremental crawls?
[10:29] let's take this to #archiveteam-ot
[10:29] is it offtopic here? sorry
[10:34] *** magus_bgf has left Leaving
[11:31] *** zhongfu has quit IRC (Quit: cya losers)
[11:33] *** zhongfu has joined #archiveteam
[11:45] *** VADemon has quit IRC (Quit: left4dead)
[12:10] *** godane has joined #archiveteam
[12:31] *** BlueMax has quit IRC (Quit: Leaving)
[14:59] *** killsushi has quit IRC (Read error: Connection reset by peer)
[15:00] *** killsushi has joined #archiveteam
[15:29] *** BartoCH has quit IRC (Ping timeout: 615 seconds)
[15:31] *** deetwelve has quit IRC (Ping timeout: 745 seconds)
[15:37] *** deetwelve has joined #archiveteam
[15:46] *** killsushi has quit IRC (Quit: Leaving)
[16:05] *** dhyan_nat has joined #archiveteam
[16:42] *** godane has quit IRC (Ping timeout: 600 seconds)
[16:44] *** Selanda has quit IRC (Quit: Lost terminal)
[16:46] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[17:27] *** satoshi has joined #archiveteam
[18:02] *** bsmith093 has joined #archiveteam
[18:14] *** cerca has joined #archiveteam
[18:27] *** Selanda has joined #archiveteam
[18:49] *** BartoCH has joined #archiveteam
[19:01] *** thejsa_ has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
[19:02] *** thejsa has joined #archiveteam
[20:03] *** Cameron_D has quit IRC (Read error: Operation timed out)
[20:53] *** Ivy has joined #archiveteam
[21:57] *** Cameron_D has joined #archiveteam
[22:41] *** Pixi has quit IRC (Quit: Pixi)
[23:01] *** Pixi has joined #archiveteam