Time |
Nickname |
Message |
15:28
🔗
|
balrog |
Nemo_bis: as the deadline for lists.apple.com gets closer, I'd like to re-archive just the 2014 messages. is there an easy way to grab the index URLs for these from the wget log and feed them to wget-warc to produce an "update warc"? |
15:28
🔗
|
balrog |
right now it's still going... there's a lot of stuff here. |
15:31
🔗
|
Nemo_bis |
can't you just use wget patterns so that it rejects anything not from 2014? |
15:31
🔗
|
balrog |
I could but then it would re-crawl a bunch of stuff |
15:31
🔗
|
Nemo_bis |
I've only archived pipermail archives, not that custom kind |
15:32
🔗
|
balrog |
they're custom but pretty simple; it's all html based |
15:32
🔗
|
balrog |
lists.apple.com/archives/LISTNAME/year/month/msg#####.html |
15:33
🔗
|
balrog |
I probably could run a regex on the log looking for urls matching lists.apple.com/archives/*/2014/January/index.html |
15:34
🔗
|
balrog |
hmm then again |
15:34
🔗
|
balrog |
that would miss stuff if currently there aren't posts from 2014 and someone adds a post |
15:34
🔗
|
balrog |
your method would probably be better |
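The regex idea balrog floats above (pull the 2014 index URLs out of the wget log) could be sketched roughly as follows. This is a minimal sketch, not what anyone in the channel actually ran. It assumes the log contains full `http://` or `https://` URLs and that index pages follow the `lists.apple.com/archives/LISTNAME/year/month/index.html` shape described in the chat. The function name `index_urls_2014` is hypothetical.

```python
import re

# Assumed URL shape, taken from the chat:
#   lists.apple.com/archives/LISTNAME/2014/MONTH/index.html
# LISTNAME and MONTH are matched loosely as "anything without a slash or space".
INDEX_URL = re.compile(
    r"https?://lists\.apple\.com/archives/[^/\s]+/2014/[^/\s]+/index\.html"
)


def index_urls_2014(log_text):
    """Return the unique 2014 index URLs found in log_text, in first-seen order."""
    seen = {}  # dicts keep insertion order, so this de-duplicates without sorting
    for match in INDEX_URL.finditer(log_text):
        seen.setdefault(match.group(0), None)
    return list(seen)
```

The resulting list could be written to a file, one URL per line, and passed to wget with `-i` alongside `--warc-file` to produce the "update warc". As balrog notes, this only finds indexes that already appeared in the log, so a list with no 2014 posts at crawl time would be missed.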
23:05
🔗
|
SketchCow |
DFJustin: That Chatnfiles FTP grab is going to be a month, I can feel it. |
23:06
🔗
|
SketchCow |
The bandwidth is essentially smoke signals and one of the indians is drunk |
23:41
🔗
|
DFJustin |
may be better off contacting the guy and working something out, there is a shoutout to you on the front page of chatnfiles.com |
23:49
🔗
|
DFJustin |
so it's not enemy territory |