Time | Nickname | Message |
00:20 | | satoshi has joined #archiveteam |
00:37 | | hive-mind has joined #archiveteam |
00:37 | | hive-min1 has quit IRC (Read error: Connection reset by peer) |
00:38 | | BlueMax has joined #archiveteam |
01:56 | | sirvy_ has joined #archiveteam |
02:16 | | killsushi has quit IRC (Quit: Leaving) |
02:39 | | satoshi has quit IRC (Remote host closed the connection) |
02:41 | | Raccoon has joined #archiveteam |
02:47 | | Fusl has quit IRC (Quit: K-Lined) |
02:47 | | Fusl__ has quit IRC (Quit: K-Lined) |
02:48 | | Fusl has joined #archiveteam |
02:48 | | svchfoo3 sets mode: +o Fusl |
02:49 | | Fusl is now known as Fusl__ |
02:49 | | Fusl_ sets mode: +o Fusl__ |
02:49 | | Fusl has joined #archiveteam |
02:49 | | Fusl__ sets mode: +o Fusl |
02:50 | | Fusl_ sets mode: +o Fusl |
02:51 | | Fusl__ has quit IRC (Client Quit) |
02:52 | | Fusl__ has joined #archiveteam |
02:52 | | Fusl_ sets mode: +o Fusl__ |
02:52 | | Fusl sets mode: +o Fusl__ |
03:41 | | m007a83_ has joined #archiveteam |
03:44 | | qw3rty116 has joined #archiveteam |
03:45 | | m007a83 has quit IRC (Ping timeout: 252 seconds) |
03:50 | | qw3rty115 has quit IRC (Ping timeout: 600 seconds) |
03:54 | | odemgi_ has joined #archiveteam |
03:56 | | odemg has quit IRC (Read error: Operation timed out) |
04:00 | | odemgi has quit IRC (Read error: Operation timed out) |
04:10 | | odemg has joined #archiveteam |
04:46 | | Flashfloo has quit IRC (The Lounge - https://thelounge.chat) |
04:46 | | Flashfire has quit IRC (Quit: The Lounge - https://thelounge.chat) |
04:46 | | kiska has quit IRC (Quit: The Lounge - https://thelounge.chat) |
04:46 | | Flashfloo has joined #archiveteam |
04:46 | | kiska has joined #archiveteam |
04:46 | | Fusl sets mode: +o kiska |
04:46 | | Fusl__ sets mode: +o kiska |
04:46 | | Fusl_ sets mode: +o kiska |
04:46 | | Flashfire has joined #archiveteam |
05:21 | | cerca has quit IRC (Leaving) |
05:37 | | dhyan_nat has joined #archiveteam |
05:42 | | Ivy has quit IRC (Quit: Connection closed for inactivity) |
05:50 | | m007a83 has joined #archiveteam |
05:53 | | m007a83_ has quit IRC (Ping timeout: 252 seconds) |
07:46 | | jut has joined #archiveteam |
09:18 | | killsushi has joined #archiveteam |
09:38 | | magus_bgf has joined #archiveteam |
09:56 | | dhyan_nat has quit IRC (Read error: Operation timed out) |
09:58 | | magus_bgf has quit IRC (Read error: Operation timed out) |
10:01 | | magus_bgf has joined #archiveteam |
10:14 | magus_bgf | Hey guys. I'm looking for advice. Need to archive (continuously) a few dozen sites, up to 100-200 hundred thousand pages. Started with wget/bash, but they no longer cut it. Need something that supports incremental crawls, smart error handling/crawl delays/url parameter handling. Some reports would be nice, but preferably no database. Most importantly, it should be easy to restore a site from the archive, at least in html form |
10:14 | magus_bgf | (and from what I understand, restoring from warc is not). So, what would be a good tool for this? |
10:21 | | magus_bgf has quit IRC (Read error: Connection reset by peer) |
10:22 | | magus_bgf has joined #archiveteam |
10:25 | ivan_ | it sounds like you have an exciting life of writing web crawler software ahead of you |
10:25 | | magus_bgf has quit IRC (Remote host closed the connection) |
10:28 | | magus_bgf has joined #archiveteam |
10:29 | | Dragnog has quit IRC (Ping timeout: 246 seconds) |
10:29 | ivan_ | I think Heritrix supports incremental crawls? |
10:29 | ivan_ | let's take this to #archiveteam-ot |
10:29 | magus_bgf | is it offtopic here? sorry |
10:34 | | magus_bgf has left Leaving |
11:31 | | zhongfu has quit IRC (Quit: cya losers) |
11:33 | | zhongfu has joined #archiveteam |
11:45 | | VADemon has quit IRC (Quit: left4dead) |
12:10 | | godane has joined #archiveteam |
12:31 | | BlueMax has quit IRC (Quit: Leaving) |
14:59 | | killsushi has quit IRC (Read error: Connection reset by peer) |
15:00 | | killsushi has joined #archiveteam |
15:29 | | BartoCH has quit IRC (Ping timeout: 615 seconds) |
15:31 | | deetwelve has quit IRC (Ping timeout: 745 seconds) |
15:37 | | deetwelve has joined #archiveteam |
15:46 | | killsushi has quit IRC (Quit: Leaving) |
16:05 | | dhyan_nat has joined #archiveteam |
16:42 | | godane has quit IRC (Ping timeout: 600 seconds) |
16:44 | | Selanda has quit IRC (Quit: Lost terminal) |
16:46 | | dhyan_nat has quit IRC (Read error: Operation timed out) |
17:27 | | satoshi has joined #archiveteam |
18:02 | | bsmith093 has joined #archiveteam |
18:14 | | cerca has joined #archiveteam |
18:27 | | Selanda has joined #archiveteam |
18:49 | | BartoCH has joined #archiveteam |
19:01 | | thejsa_ has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.) |
19:02 | | thejsa has joined #archiveteam |
20:03 | | Cameron_D has quit IRC (Read error: Operation timed out) |
20:53 | | Ivy has joined #archiveteam |
21:57 | | Cameron_D has joined #archiveteam |
22:41 | | Pixi has quit IRC (Quit: Pixi) |
23:01 | | Pixi has joined #archiveteam |