#archiveteam-bs 2018-01-27,Sat

Time Nickname Message
00:28 🔗 PotcFdk has joined #archiveteam-bs
01:02 🔗 mistym has quit IRC (Quit: ZNC - http://znc.in)
01:08 🔗 mr_archiv has quit IRC (Read error: Operation timed out)
01:11 🔗 mr_archiv has joined #archiveteam-bs
01:27 🔗 Soni has quit IRC (Ping timeout: 260 seconds)
01:27 🔗 Soni has joined #archiveteam-bs
02:05 🔗 wacky_ has quit IRC (Read error: Operation timed out)
02:05 🔗 wacky has joined #archiveteam-bs
02:26 🔗 K4k has quit IRC (Read error: Connection reset by peer)
02:39 🔗 mistym has joined #archiveteam-bs
02:51 🔗 klondike JAA: thanks, I may try to use that to pull the webcomics
03:04 🔗 klondike JAA: you didn't have trouble with the cookies being URL decoded?
03:04 🔗 klondike You'll notice if you see links referring to http://.*\.subcultura\.es/mayor_edad
03:04 🔗 klondike In particular, forms referring to such addresses
03:22 🔗 nyaomi has quit IRC (Quit: meow)
04:36 🔗 ubahn has quit IRC (Ping timeout: 260 seconds)
04:40 🔗 ubahn has joined #archiveteam-bs
04:47 🔗 nyaomi has joined #archiveteam-bs
04:48 🔗 qw3rty113 has joined #archiveteam-bs
04:53 🔗 qw3rty112 has quit IRC (Read error: Operation timed out)
04:57 🔗 pizzaiolo has quit IRC (pizzaiolo)
05:30 🔗 godane SketchCow: i'm at 31k items so far this month/year
07:45 🔗 nyaomi has quit IRC (Read error: Operation timed out)
07:46 🔗 dashcloud has quit IRC (Read error: Operation timed out)
07:46 🔗 dashcloud has joined #archiveteam-bs
07:47 🔗 RichardG has quit IRC (Ping timeout: 250 seconds)
08:21 🔗 nyaomi has joined #archiveteam-bs
09:56 🔗 JAA klondike: Nope, no problems with the cookies. I think Firefox sent it with + as well.
09:56 🔗 pizzaiolo has joined #archiveteam-bs
09:57 🔗 JAA I've retrieved 705k URLs (16.3 GiB) so far; 1.74M are in the queue.
10:10 🔗 JAA Doing almost 1000 URLs per minute currently.
11:20 🔗 BlueMaxim has quit IRC (Leaving)
11:28 🔗 odemg has quit IRC (Ping timeout: 250 seconds)
11:53 🔗 JAA Uzerus: Can you try to figure out whether blog.pl stays online longer? The homepage mentions something that the bloggers will still be able to archive their blogs until end of February, but I'm not sure if that also means that the blogs will still be publicly accessible.
12:11 🔗 odemg has joined #archiveteam-bs
12:28 🔗 Uzerus JAA: looks like subdomains of blog.pl will be accessible until 31.01.2018; users can probably save their content through something like an admin panel
12:29 🔗 Uzerus so... we have 3 days, 10 hours 30 min
12:30 🔗 JAA Aww :-/
12:38 🔗 Uzerus BUT there is a chance that they will be accessible until the end of February! (maybe they forgot to change the date in that question) http://www.blog.pl/pomoc/49
12:53 🔗 RichardG has joined #archiveteam-bs
12:57 🔗 odemg has quit IRC (Read error: Operation timed out)
13:10 🔗 qw3rty114 has joined #archiveteam-bs
13:16 🔗 qw3rty113 has quit IRC (Read error: Operation timed out)
13:19 🔗 odemg has joined #archiveteam-bs
13:51 🔗 superkuh has quit IRC (Remote host closed the connection)
13:54 🔗 superkuh has joined #archiveteam-bs
14:06 🔗 qw3rty114 has quit IRC (Nettalk6 - www.ntalk.de)
14:27 🔗 jacketcha has quit IRC (Read error: Connection reset by peer)
14:44 🔗 Mateon1 has quit IRC (Quit: Mateon1)
14:46 🔗 Mateon1 has joined #archiveteam-bs
14:55 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
14:56 🔗 RichardG has joined #archiveteam-bs
15:15 🔗 ubahn has quit IRC (Quit: ubahn)
16:02 🔗 klondike JAA: Do you know if the WARC files can be extracted for local browsing?
16:07 🔗 K4k has joined #archiveteam-bs
16:11 🔗 klondike JAA: or better said, is there a way to make wpull do both: save the files with local links in a folder and generate a WARC file?
16:29 🔗 K4k has quit IRC (Read error: Connection reset by peer)
16:33 🔗 K4k has joined #archiveteam-bs
16:48 🔗 dashcloud has quit IRC (Read error: Connection reset by peer)
16:50 🔗 dashcloud has joined #archiveteam-bs
16:53 🔗 Atom has joined #archiveteam-bs
17:42 🔗 REiN^ has quit IRC (Read error: Operation timed out)
18:28 🔗 Mateon1 has quit IRC (Read error: Operation timed out)
18:28 🔗 Mateon1 has joined #archiveteam-bs
18:54 🔗 Pixi has quit IRC (Quit: Pixi)
19:01 🔗 Pixi has joined #archiveteam-bs
19:25 🔗 godane so i got most of 1995 nightline episodes
19:26 🔗 Uzerus JAA: multiple runs of Sublist3r, 2467 subdomains of republika.pl
19:47 🔗 Smiley has quit IRC (Ping timeout: 250 seconds)
19:50 🔗 Smiley has joined #archiveteam-bs
21:08 🔗 jacketcha has joined #archiveteam-bs
21:18 🔗 icedice has joined #archiveteam-bs
21:21 🔗 K4k has quit IRC (Read error: Connection reset by peer)
21:27 🔗 BlueMaxim has joined #archiveteam-bs
22:09 🔗 icedice2 has joined #archiveteam-bs
22:09 🔗 JAA klondike: You can extract WARC files, but the better way is to run a local wayback machine, e.g. with pywb.
22:10 🔗 JAA And yes, you can let it save files as well if you remove the --delete-after option.
22:10 🔗 MrDignity has quit IRC (Remote host closed the connection)
22:11 🔗 MrDignity has joined #archiveteam-bs
22:12 🔗 icedice has quit IRC (Ping timeout: 245 seconds)
22:12 🔗 klondike JAA: I tried but it crashed when a path with a trailing slash conflicted with the same path without one :P
22:13 🔗 JAA I see. Well yeah, pywb then.
22:14 🔗 klondike Yeah, it'd be cool to have a hook to modify URLs xD
22:14 🔗 klondike So that you can say "well, this URL canonicalizes to this one, which you already have"
22:14 🔗 JAA I've been thinking about that and will likely implement it at some point.
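
The canonicalization hook discussed above could look roughly like the sketch below. This is only an illustration in plain Python, not wpull's API: `canonicalize` and `SeenUrls` are hypothetical names, and treating a trailing slash as insignificant is an assumption that happens to fit the conflict described here but would be wrong for sites where `/foo` and `/foo/` serve different content.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Map URL variants to one canonical form (hypothetical rule set).

    Assumption: scheme and host are case-insensitive, fragments are
    irrelevant for fetching, and a trailing slash does not change the
    resource.
    """
    parts = urlsplit(url)
    # Strip trailing slashes, but keep a bare "/" for the site root.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class SeenUrls:
    """Track which canonical URLs have already been queued."""

    def __init__(self) -> None:
        self._seen = set()

    def add_if_new(self, url: str) -> bool:
        """Return True if the URL was not seen before in canonical form."""
        key = canonicalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


seen = SeenUrls()
print(seen.add_if_new("http://ibosim.subcultura.es/tira/196/"))  # first sighting
print(seen.add_if_new("http://ibosim.subcultura.es/tira/196"))   # same page, no slash
```

With such a hook the second variant would be recognised as a duplicate before it is written to disk, which is where the crash described at 22:12 occurred.
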
22:15 🔗 klondike Hopefully as time goes by the number of sites with inconsistent use of URLs will be smaller (I hope)
22:16 🔗 JAA Hahahahaha
22:16 🔗 JAA Yeah, right.
22:17 🔗 JAA ;-)
22:19 🔗 JAA https://losangeles.craigslist.org/lac/zip/d/100s-of-vhs-recordings-and/6473256717.html
22:27 🔗 MrDignity has quit IRC (Remote host closed the connection)
22:28 🔗 MrDignity has joined #archiveteam-bs
22:42 🔗 klondike JAA: something I have noticed is that some of the comics are still updating. How should that be dealt with?
22:53 🔗 JAA ¯\_(ツ)_/¯
22:54 🔗 JAA klondike: Sometimes, I will try to run another grab of all content that was updated in the meantime, but only if that's relatively easy to do.
22:54 🔗 klondike Well I can try to write a script to do that for the comics
22:54 🔗 klondike For snaps and forums it may be harder
22:55 🔗 JAA Is there a page which lists comics by date or something like that?
22:55 🔗 JAA I mean, something that goes further than those 20ish entries on the homepage.
22:55 🔗 klondike No
22:55 🔗 JAA :-/
22:56 🔗 klondike But I can try to get the 20ish entries on the homepage every minute or so
22:56 🔗 klondike To provide a listing
22:56 🔗 JAA What I could do once my grab is done is extract the maximum ID for each comic that was grabbed, then bruteforce upwards from that.
22:57 🔗 JAA E.g. if http://ibosim.subcultura.es/tira/196/ was grabbed, try 197, 198, 199, notice 199 doesn't exist, and abort.
22:57 🔗 klondike Ahh good idea
22:57 🔗 klondike Add a margin of say 5 ids just in case
22:58 🔗 klondike And refresh the front page, previous comic and archive to make browsing easy
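
The ID probing idea from 22:56 and 22:57, including the suggested margin of 5, can be sketched as follows. This is a minimal sketch rather than what was actually run: `probe_new_ids` and `tira_url` are hypothetical helpers, the `exists` callback stands in for a real HTTP check, and the URL pattern is taken only from the single example URL quoted above.

```python
from typing import Callable, List


def tira_url(subdomain: str, comic_id: int) -> str:
    """Build a strip URL following the pattern of the example at 22:57."""
    return f"http://{subdomain}.subcultura.es/tira/{comic_id}/"


def probe_new_ids(
    max_grabbed_id: int,
    exists: Callable[[int], bool],
    margin: int = 5,
) -> List[int]:
    """Walk upwards from the highest grabbed ID and collect IDs that exist.

    Probing stops only after `margin` consecutive misses, so a small gap
    in the numbering does not end the scan early.
    """
    found: List[int] = []
    misses = 0
    candidate = max_grabbed_id + 1
    while misses < margin:
        if exists(candidate):
            found.append(candidate)
            misses = 0  # a hit resets the gap counter
        else:
            misses += 1
        candidate += 1
    return found


# Stand-in for the live site: pretend these IDs were published after the grab.
published_later = {197, 198, 201}
new_ids = probe_new_ids(196, lambda i: i in published_later, margin=5)
print([tira_url("ibosim", i) for i in new_ids])
```

With `margin=1` this reduces to the original proposal (abort at the first missing ID); in a real run `exists` would issue a request and treat a 404 as a miss.
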
22:58 🔗 icedice2 has quit IRC (Quit: Leaving)
22:58 🔗 klondike JAA: how about just checking the archive for new links?
22:59 🔗 icedice has joined #archiveteam-bs
23:01 🔗 JAA I guess I could trick wpull into thinking that the homepage and archive page of each subdomain weren't grabbed yet or something like that. Then it would find everything automatically.
23:01 🔗 JAA Status update, by the way: 1.39M URLs grabbed (31.8 GiB), 1.06M queued.
23:02 🔗 klondike Patroklo said to expect a size of 90 to 100
23:02 🔗 klondike (GB)
23:02 🔗 JAA That number is the size of the gzipped WARCs.
23:03 🔗 JAA And I hope it won't grow too large, because I don't have much space left on that machine. :-P
23:03 🔗 klondike Keep in mind that avatars get duplicated across webcomics (consistenwhat?)
23:03 🔗 JAA Yeah, I noticed that.
23:03 🔗 klondike As I said if you need another server I can pay for one
23:04 🔗 JAA I just need to move stuff to the Internet Archive already.
23:04 🔗 JAA I have a ton of old grabs here.
23:04 🔗 klondike Hum
23:04 🔗 JAA Like a 400ish GiB one of NeoGAF.
23:04 🔗 SketchCow Did I miss anything?
23:05 🔗 klondike If you have physical access to the server I can order a disk for you on Amazon
23:05 🔗 JAA I don't. But don't worry about it, I can still move stuff to other machines for now.
23:06 🔗 klondike Oki
23:06 🔗 JAA SketchCow: ?
23:06 🔗 klondike BTW JAA I tried using 8 concurrent connections without significant problems, you might want to update your scripts
23:09 🔗 SketchCow Anything big.
23:11 🔗 JAA Sorry, I'm not sure what you mean. Are you asking about my grabs or just in general?
23:13 🔗 SketchCow In general.
23:13 🔗 SketchCow It's not just you
23:13 🔗 JAA Ah
23:13 🔗 Ravenloft has quit IRC (Read error: Connection reset by peer)
23:15 🔗 JAA Subcultura.es and Blog.pl are on fire. Nintendo is crushing another product, Miitomo this time. Can't think of anything else from the past few days.
23:15 🔗 JAA I guess you heard about the lkml archive stuff already.
23:15 🔗 JAA Oh, and have a look at the Craigslist link I posted here an hour ago.
23:15 🔗 JAA In case you didn't see it yet.
23:18 🔗 JAA klondike: Hm, I think I won't. The four workers already almost saturate one core, and if it continues at this speed, it should be close to finishing in about 24 hours.
23:18 🔗 klondike Oki :)
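
As a rough cross-check of the "about 24 hours" estimate, the two status updates in this log (09:57 and 23:01) give an average rate and a naive time-to-finish. The sketch below assumes the queue stops growing and that the average WARC size per URL stays constant, neither of which is guaranteed while the crawl is still discovering links; the variable names are illustrative only.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> float:
    """Minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


# Figures quoted in the log.
grabbed_at_0957 = 705_000
grabbed_at_2301 = 1_390_000
queued_at_2301 = 1_060_000
gib_at_2301 = 31.8  # gzipped WARC size so far

elapsed_min = minutes_between("09:57", "23:01")
rate_per_min = (grabbed_at_2301 - grabbed_at_0957) / elapsed_min

# Naive ETA: assumes no further URLs are added to the queue.
eta_hours = queued_at_2301 / rate_per_min / 60

# Naive size projection: assumes constant average size per URL.
projected_gib = gib_at_2301 * (grabbed_at_2301 + queued_at_2301) / grabbed_at_2301

print(f"average rate: {rate_per_min:.0f} URLs/min")
print(f"naive ETA: {eta_hours:.1f} hours")
print(f"naive final size: {projected_gib:.0f} GiB")
```

Under these assumptions the average works out to somewhat under 900 URLs per minute and the remaining queue to roughly 20 hours, which is in line with the 24-hour figure once newly discovered URLs are allowed for.
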
23:21 🔗 icedice has quit IRC (Read error: Connection reset by peer)
23:22 🔗 icedice has joined #archiveteam-bs
23:49 🔗 Aranje has joined #archiveteam-bs
23:59 🔗 Ravenloft has joined #archiveteam-bs
