[00:28] *** PotcFdk has joined #archiveteam-bs
[01:02] *** mistym has quit IRC (Quit: ZNC - http://znc.in)
[01:08] *** mr_archiv has quit IRC (Read error: Operation timed out)
[01:11] *** mr_archiv has joined #archiveteam-bs
[01:27] *** Soni has quit IRC (Ping timeout: 260 seconds)
[01:27] *** Soni has joined #archiveteam-bs
[02:05] *** wacky_ has quit IRC (Read error: Operation timed out)
[02:05] *** wacky has joined #archiveteam-bs
[02:26] *** K4k has quit IRC (Read error: Connection reset by peer)
[02:39] *** mistym has joined #archiveteam-bs
[02:51] JAA: thanks, I may try to use that to pull the webcomics
[03:04] JAA: you didn't have trouble with the cookies being URL decoded?
[03:04] You'll notice if you see links referring to http://.*\.subcultura\.es/mayor_edad
[03:04] In particular, forms referring to such addresses
[03:22] *** nyaomi has quit IRC (Quit: meow)
[04:36] *** ubahn has quit IRC (Ping timeout: 260 seconds)
[04:40] *** ubahn has joined #archiveteam-bs
[04:47] *** nyaomi has joined #archiveteam-bs
[04:48] *** qw3rty113 has joined #archiveteam-bs
[04:53] *** qw3rty112 has quit IRC (Read error: Operation timed out)
[04:57] *** pizzaiolo has quit IRC (pizzaiolo)
[05:30] SketchCow: i'm at 31k items so far this month/year
[07:45] *** nyaomi has quit IRC (Read error: Operation timed out)
[07:46] *** dashcloud has quit IRC (Read error: Operation timed out)
[07:46] *** dashcloud has joined #archiveteam-bs
[07:47] *** RichardG has quit IRC (Ping timeout: 250 seconds)
[08:21] *** nyaomi has joined #archiveteam-bs
[09:56] klondike: Nope, no problems with the cookies. I think Firefox sent it with + as well.
[09:56] *** pizzaiolo has joined #archiveteam-bs
[09:57] I've retrieved 705k URLs (16.3 GiB) so far; 1.74M are in the queue.
[10:10] Doing almost 1000 URLs per minute currently.
[11:20] *** BlueMaxim has quit IRC (Leaving)
[11:28] *** odemg has quit IRC (Ping timeout: 250 seconds)
[11:53] Uzerus: Can you try to figure out whether blog.pl stays online longer?
The homepage mentions that bloggers will still be able to archive their blogs until the end of February, but I'm not sure if that also means that the blogs will still be publicly accessible.
[12:11] *** odemg has joined #archiveteam-bs
[12:28] JAA: looks like subdomains of blog.pl will be accessible until 31.01.2018; users can probably save their content via "something like an admin panel?"
[12:29] so... we have 3 days, 10 hours 30 min
[12:30] Aww :-/
[12:38] BUT there is a chance that they will be accessible until the end of February! (maybe they forgot to change the date in that question) http://www.blog.pl/pomoc/49
[12:53] *** RichardG has joined #archiveteam-bs
[12:57] *** odemg has quit IRC (Read error: Operation timed out)
[13:10] *** qw3rty114 has joined #archiveteam-bs
[13:16] *** qw3rty113 has quit IRC (Read error: Operation timed out)
[13:19] *** odemg has joined #archiveteam-bs
[13:51] *** superkuh has quit IRC (Remote host closed the connection)
[13:54] *** superkuh has joined #archiveteam-bs
[14:06] *** qw3rty114 has quit IRC (Nettalk6 - www.ntalk.de)
[14:27] *** jacketcha has quit IRC (Read error: Connection reset by peer)
[14:44] *** Mateon1 has quit IRC (Quit: Mateon1)
[14:46] *** Mateon1 has joined #archiveteam-bs
[14:55] *** RichardG has quit IRC (Read error: Connection reset by peer)
[14:56] *** RichardG has joined #archiveteam-bs
[15:15] *** ubahn has quit IRC (Quit: ubahn)
[16:02] JAA: Do you know if the WARC files can be extracted for local browsing?
[16:07] *** K4k has joined #archiveteam-bs
[16:11] JAA: or, better said, is there a way to make wpull do both: save the files with local links in a folder and generate a WARC file?
[16:29] *** K4k has quit IRC (Read error: Connection reset by peer)
[16:33] *** K4k has joined #archiveteam-bs
[16:48] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[16:50] *** dashcloud has joined #archiveteam-bs
[16:53] *** Atom has joined #archiveteam-bs
[17:42] *** REiN^ has quit IRC (Read error: Operation timed out)
[18:28] *** Mateon1 has quit IRC (Read error: Operation timed out)
[18:28] *** Mateon1 has joined #archiveteam-bs
[18:54] *** Pixi has quit IRC (Quit: Pixi)
[19:01] *** Pixi has joined #archiveteam-bs
[19:25] so i got most of 1995 nightline episodes
[19:26] JAA: multiple runs of Sublist3r, 2467 subdomains of republika.pl
[19:47] *** Smiley has quit IRC (Ping timeout: 250 seconds)
[19:50] *** Smiley has joined #archiveteam-bs
[21:08] *** jacketcha has joined #archiveteam-bs
[21:18] *** icedice has joined #archiveteam-bs
[21:21] *** K4k has quit IRC (Read error: Connection reset by peer)
[21:27] *** BlueMaxim has joined #archiveteam-bs
[22:09] *** icedice2 has joined #archiveteam-bs
[22:09] klondike: You can extract WARC files, but the better way is to run a local wayback machine, e.g. with pywb.
[22:10] And yes, you can let it save files as well if you remove the --delete-after option.
[22:10] *** MrDignity has quit IRC (Remote host closed the connection)
[22:11] *** MrDignity has joined #archiveteam-bs
[22:12] *** icedice has quit IRC (Ping timeout: 245 seconds)
[22:12] JAA: I tried, but it crashed when a file with a trailing slash and one with no trailing slash conflicted :P
[22:13] I see. Well yeah, pywb then.
[22:14] Yeah, it'd be cool to have a hook to modify URLs xD
[22:14] So that you can say "well, this URL is canonicalized to this one, which you already have"
[22:14] I've been thinking about that and will likely implement it at some point.
[22:15] Hopefully, as time goes by, the number of sites with inconsistent use of URLs will be smaller (I hope)
[22:16] Hahahahaha
[22:16] Yeah, right.
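Editor's note on the [22:12] to [22:14] exchange: the hook being wished for could look roughly like the sketch below. This is not wpull's API; `canonicalize` and `should_fetch` are hypothetical names, and treating the trailing-slash and no-trailing-slash forms as the same resource is an assumption that will not hold on every site.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Map URLs that differ only by trailing slashes (or a fragment) to one form.

    Assumption: the site serves the same content for both forms.
    """
    parts = urlsplit(url)
    # Strip trailing slashes from the path, but keep the root path as "/".
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment; it is never sent to the server anyway.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


def should_fetch(url: str, seen: set) -> bool:
    """Return True only the first time a canonical form is encountered."""
    key = canonicalize(url)
    if key in seen:
        return False
    seen.add(key)
    return True


if __name__ == "__main__":
    seen = set()
    for u in ("http://ibosim.subcultura.es/tira/196/",
              "http://ibosim.subcultura.es/tira/196"):
        print(u, should_fetch(u, seen))
```

With a filter like this in front of the queue, the second URL is recognised as one "which you already have" and is skipped, so the file and directory names never collide on disk.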
[22:17] ;-)
[22:19] https://losangeles.craigslist.org/lac/zip/d/100s-of-vhs-recordings-and/6473256717.html
[22:27] *** MrDignity has quit IRC (Remote host closed the connection)
[22:28] *** MrDignity has joined #archiveteam-bs
[22:42] JAA: something I have noticed is that some of the comics are still updating; how should that be dealt with?
[22:53] ¯\_(ツ)_/¯
[22:54] klondike: Sometimes, I will try to run another grab of all content that was updated in the meantime, but only if that's relatively easy to do.
[22:54] Well, I can try to write a script to do that for the comics
[22:54] For snaps and forums it may be harder
[22:55] Is there a page which lists comics by date or something like that?
[22:55] I mean, something that goes further than those 20ish entries on the homepage.
[22:55] No
[22:55] :-/
[22:56] But I can try to get the 20ish entries on the homepage every minute or so
[22:56] To provide a listing
[22:56] What I could do once my grab is done is extract the maximum ID for each comic that was grabbed, then bruteforce upwards from that.
[22:57] E.g. if http://ibosim.subcultura.es/tira/196/ was grabbed, try 197, 198, 199, notice 199 doesn't exist, and abort.
[22:57] Ahh good idea
[22:57] Add a margin of say 5 IDs just in case
[22:58] And refresh the front page, previous comic and archive to make browsing easy
[22:58] *** icedice2 has quit IRC (Quit: Leaving)
[22:58] JAA: how about just checking the archive for new links?
[22:59] *** icedice has joined #archiveteam-bs
[23:01] I guess I could trick wpull into thinking that the homepage and archive page of each subdomain wasn't grabbed yet or something like that. Then it would find everything automatically.
[23:01] Status update, by the way: 1.39M URLs grabbed (31.8 GiB), 1.06M queued.
[23:02] Patroklo said to expect a size of 90 to 100
[23:02] (GB)
[23:02] That number is the size of the gzipped WARCs.
[23:03] And I hope it won't grow too large, because I don't have much space left on that machine.
:-P
[23:03] Keep in mind that avatars get duplicated across webcomics (consistenwhat?)
[23:03] Yeah, I noticed that.
[23:03] As I said, if you need another server I can pay for one
[23:04] I just need to move stuff to the Internet Archive already.
[23:04] I have a ton of old grabs here.
[23:04] Hum
[23:04] Like a 400ish GiB one of NeoGAF.
[23:04] Did I miss anything?
[23:05] If you have physical access to the server I can order a disk for you on Amazon
[23:05] I don't. But don't worry about it, I can still move stuff to other machines for now.
[23:06] Oki
[23:06] SketchCow: ?
[23:06] BTW JAA I tried using 8 concurrent connections without significant problems, you might want to update your scripts
[23:09] Anything big.
[23:11] Sorry, I'm not sure what you mean. Are you asking about my grabs or just in general?
[23:13] In general.
[23:13] It's not just you
[23:13] Ah
[23:13] *** Ravenloft has quit IRC (Read error: Connection reset by peer)
[23:15] Subcultura.es and Blog.pl are on fire. Nintendo is crushing another product, Miitomo this time. Can't think of anything else from the past few days.
[23:15] I guess you heard about the lkml archive stuff already.
[23:15] Oh, and have a look at the Craigslist link I posted here an hour ago.
[23:15] In case you didn't see it yet.
[23:18] klondike: Hm, I think I won't. The four workers already almost saturate one core, and if it continues at this speed, it should be close to finishing in about 24 hours.
[23:18] Oki :)
[23:21] *** icedice has quit IRC (Read error: Connection reset by peer)
[23:22] *** icedice has joined #archiveteam-bs
[23:49] *** Aranje has joined #archiveteam-bs
[23:59] *** Ravenloft has joined #archiveteam-bs
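Editor's note on the [22:56] to [22:57] idea of bruteforcing upwards from the highest grabbed comic ID: a minimal sketch follows. The function name `probe_new_ids` and the `exists` callback are hypothetical, and reading the "margin of say 5 IDs" as "stop only after 5 consecutive missing IDs" is one interpretation of the chat, not something the log confirms. The existence check (for example an HTTP request against a URL like http://ibosim.subcultura.es/tira/197/) is left to the caller.

```python
from typing import Callable, List


def probe_new_ids(max_known_id: int,
                  exists: Callable[[int], bool],
                  margin: int = 5) -> List[int]:
    """Walk IDs upwards from max_known_id + 1 and collect those that exist.

    Stops once `margin` consecutive IDs are missing, so a small gap in the
    numbering does not end the scan early.
    """
    found: List[int] = []
    misses = 0
    candidate = max_known_id + 1
    while misses < margin:
        if exists(candidate):
            found.append(candidate)
            misses = 0  # a hit resets the run of consecutive misses
        else:
            misses += 1
        candidate += 1
    return found


if __name__ == "__main__":
    # Stand-in for a real existence check; these IDs are made up.
    published = {197, 198, 201}
    print(probe_new_ids(196, lambda comic_id: comic_id in published))
```

With `margin=1` this reduces to the behaviour described at [22:57]: try the next ID, and abort at the first one that does not exist.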