#archiveteam-bs 2018-01-30,Tue


Time Nickname Message
00:06 🔗 klondike <JAA> Sometimes, I feel the urge to track down whoever wrote this shitty website code and strangle them. << I feel your pain, did you find the vulns yet?
00:07 🔗 JAA Hm?
00:08 🔗 klondike Not big stuff
00:43 🔗 jacketcha has quit IRC (Read error: Connection reset by peer)
00:58 🔗 pizzaiolo has quit IRC (pizzaiolo)
00:59 🔗 pizzaiolo has joined #archiveteam-bs
01:03 🔗 pizzaiolo has quit IRC (Client Quit)
01:03 🔗 pizzaiolo has joined #archiveteam-bs
01:14 🔗 Aranje has quit IRC (Read error: Operation timed out)
01:15 🔗 Aranje has joined #archiveteam-bs
01:21 🔗 Aranje has quit IRC (Read error: Operation timed out)
01:21 🔗 Aranje has joined #archiveteam-bs
01:56 🔗 dashcloud has joined #archiveteam-bs
02:39 🔗 pizzaiolo has quit IRC (pizzaiolo)
02:39 🔗 pizzaiolo has joined #archiveteam-bs
02:43 🔗 pizzaiolo has quit IRC (Client Quit)
02:43 🔗 pizzaiolo has joined #archiveteam-bs
03:34 🔗 pizzaiolo has quit IRC (pizzaiolo)
03:35 🔗 jacketcha has joined #archiveteam-bs
03:41 🔗 Aranje has quit IRC (Read error: Operation timed out)
03:41 🔗 Aranje has joined #archiveteam-bs
03:45 🔗 Aranje has quit IRC (Read error: Operation timed out)
03:45 🔗 Aranje has joined #archiveteam-bs
04:03 🔗 Dimtree has quit IRC (Read error: Operation timed out)
04:17 🔗 Aranje has quit IRC (Read error: Operation timed out)
04:21 🔗 Aranje has joined #archiveteam-bs
04:45 🔗 qw3rty117 has joined #archiveteam-bs
04:50 🔗 qw3rty116 has quit IRC (Read error: Operation timed out)
05:03 🔗 Aranje has quit IRC (Read error: Operation timed out)
05:03 🔗 Aranje has joined #archiveteam-bs
05:13 🔗 robogoat SketchCow: That isn't even possible, IIRC.
05:15 🔗 robogoat Deleted stuff from wikipedia is really deleted. They do periodic drops of full-meta-history, but it's not complete in the truest sense of the word AFAIK.
05:38 🔗 robogoat November full en wiki dump looks like it's 127GB 7zipped, not as bad as I thought.
05:49 🔗 Aranje has quit IRC (Read error: Operation timed out)
05:50 🔗 Aranje has joined #archiveteam-bs
07:05 🔗 Aranje has quit IRC (Read error: Operation timed out)
07:06 🔗 Aranje has joined #archiveteam-bs
07:14 🔗 Aranje has quit IRC (Read error: Operation timed out)
07:14 🔗 Aranje has joined #archiveteam-bs
07:29 🔗 Aranje has quit IRC (Read error: Operation timed out)
07:29 🔗 Aranje has joined #archiveteam-bs
07:54 🔗 omglolbah robogoat, I would presume it would be 'day 0' defined as 'when the solution is put into production' rather than 'when en.wikipedia.org came online'
08:35 🔗 wp494 has quit IRC (Read error: Operation timed out)
08:58 🔗 godane latest digitize tapes: https://www.patreon.com/posts/digitize-tapes-16689791
09:07 🔗 Aranje has quit IRC (Read error: Operation timed out)
09:07 🔗 Aranje has joined #archiveteam-bs
09:29 🔗 ZexaronS has joined #archiveteam-bs
09:45 🔗 JAA The Subcultura grab went down a number of rabbit holes overnight. Loads of crap like .../facebook.com/facebook.com/twitter.com/bullshit.com/...
09:47 🔗 JAA Also beautiful links such as http://subcultura.es/blogs/marieporm/tom-sandoval-wedding-is-ariana-madix-ready-to-tie-the-knot-after-pump-rules-divorce-29230/%3Cbr%20/%3Cbr%20/%3Cbr%20/%3Cbr%20/%3Cbr%20/%3Cbr%20/%3Cbr%20/%3Cbr%20/%3Cbr%20/%3Ephotos:
09:48 🔗 JAA Ignoring these brought the queue down from ~130k to 410. :-)
09:51 🔗 BlueMaxim has quit IRC (Read error: Connection reset by peer)
10:30 🔗 schbirid has joined #archiveteam-bs
11:13 🔗 JAA Queue has been empty for about 15 minutes. It's now retrying ~38k URLs which produced errors before (e.g. max_conn, timeouts).
11:19 🔗 pizzaiolo has joined #archiveteam-bs
11:25 🔗 Aranje has quit IRC (Read error: Operation timed out)
11:25 🔗 Aranje has joined #archiveteam-bs
11:40 🔗 godane looks like i will have another GB of ERIC archive files
11:40 🔗 godane its about 700+ files now
11:47 🔗 schbirid has quit IRC (Read error: Operation timed out)
11:49 🔗 Dimtree has joined #archiveteam-bs
11:49 🔗 schbirid has joined #archiveteam-bs
12:02 🔗 klondike "<robogoat> Deleted stuff from wikipedia is really deleted" Not articles, these can be restored by an operator, but not sure if forever.
12:03 🔗 klondike " <JAA> Ignoring these brought the queue down from ~130k to 410. :-)" XDDDDDDD
12:03 🔗 klondike Yeah spammers on the rise is one of the reasons why they closed.
12:28 🔗 Mateon1 has quit IRC (Ping timeout: 255 seconds)
12:29 🔗 Mateon1 has joined #archiveteam-bs
13:23 🔗 wp494 has joined #archiveteam-bs
14:37 🔗 robogoat klondike: you appear to be correct, now I'm wondering why I thought that.
14:38 🔗 robogoat Probably 'cause I'm not an operator, and so maybe I was looking for something and it wasn't available to me.
15:40 🔗 JAA I just tested the ArchiveBot WARC of playmxm.com with pywb. Unsurprisingly, it doesn't work properly.
15:40 🔗 JAA But at least the data itself is there. Looks like it grabbed images and videos as well.
15:42 🔗 JAA It might work better with a real MITM proxy since the URLs in the browser are still messed up in pywb, but I didn't test that.
17:29 🔗 JAA Heh, just saw that many of the errors were also broken URLs like the ones above, and I didn't throw those out before. Around 500 URLs waiting to be retried now, down from 18k. :-P
17:57 🔗 JAA Seeing some timeouts in the forums. I guess I'll have to requeue those URLs and increase the timeout.
18:02 🔗 klondike JAA: my initial testing showed times of around 1 second per request on some threads
18:02 🔗 klondike I wonder if they have a good index
18:03 🔗 JAA I'm sure they don't.
18:03 🔗 JAA I have a timeout of 30 seconds, I think, and multiple pages triggered that already.
18:03 🔗 JAA Ah no, 60 seconds even.
18:04 🔗 JAA But http://subcultura.es/foro/webcomics/2400 failed three times, for example.
18:07 🔗 klondike If you check at the bottom of the source, they show how long it took to generate the page
18:39 🔗 BnAboyZ has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 kisspunch has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 PurpleSym has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 Zebranky has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 MrRadar2 has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 BnARobin has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 jtn2 has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 Tenebrae has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 Fusl has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 hook54321 has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 ez has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 Polylith has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 Sk1d has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 Boppen has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 nyany has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 Kagee has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 altlabel has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 Xibalba has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 klondike has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 tuluu has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 SN4T14 has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 Kaz has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 Lord_Nigh has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 yuitimoth has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 Rai-chan has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 Aoede has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 Gfy has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 pikhq has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 svchfoo1 has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 tsr has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 nightpool has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 Dimtree has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 qw3rty117 has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 jacketcha has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 REiN^ has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 rolfoid has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 balrog has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 PotcFdk has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 Kimmer has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 rsznik has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 will has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 ivan has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 ndiddy has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 sep332_ has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 Odd0002 has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 robink has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 Jusque has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 chfoo has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 squires has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 bwn has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 C4K3 has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 unlobito has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 w0rp has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 JAA has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 twigfoot has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 beardicus has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 PoorHomie has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 Mayonaise has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 Kenshin has quit IRC (hub.efnet.us ircd.choopa.net)
18:39 🔗 bsmith093 has quit IRC (hub.efnet.us ircd.choopa.net)
18:40 🔗 slyphic has quit IRC (Read error: Operation timed out)
18:41 🔗 slyphic has joined #archiveteam-bs
18:44 🔗 Dimtree has joined #archiveteam-bs
18:44 🔗 qw3rty117 has joined #archiveteam-bs
18:44 🔗 nyany has joined #archiveteam-bs
18:44 🔗 REiN^ has joined #archiveteam-bs
18:44 🔗 rolfoid has joined #archiveteam-bs
18:44 🔗 balrog has joined #archiveteam-bs
18:44 🔗 tuluu has joined #archiveteam-bs
18:44 🔗 PotcFdk has joined #archiveteam-bs
18:44 🔗 Kimmer has joined #archiveteam-bs
18:44 🔗 rsznik has joined #archiveteam-bs
18:44 🔗 klondike has joined #archiveteam-bs
18:44 🔗 SN4T14 has joined #archiveteam-bs
18:44 🔗 BnAboyZ has joined #archiveteam-bs
18:44 🔗 will has joined #archiveteam-bs
18:44 🔗 Kaz has joined #archiveteam-bs
18:44 🔗 Kagee has joined #archiveteam-bs
18:44 🔗 Lord_Nigh has joined #archiveteam-bs
18:44 🔗 ivan has joined #archiveteam-bs
18:44 🔗 ndiddy has joined #archiveteam-bs
18:44 🔗 sep332_ has joined #archiveteam-bs
18:44 🔗 Odd0002 has joined #archiveteam-bs
18:44 🔗 robink has joined #archiveteam-bs
18:44 🔗 Jusque has joined #archiveteam-bs
18:44 🔗 chfoo has joined #archiveteam-bs
18:44 🔗 altlabel has joined #archiveteam-bs
18:44 🔗 kisspunch has joined #archiveteam-bs
18:44 🔗 yuitimoth has joined #archiveteam-bs
18:44 🔗 squires has joined #archiveteam-bs
18:44 🔗 PurpleSym has joined #archiveteam-bs
18:44 🔗 Zebranky has joined #archiveteam-bs
18:44 🔗 bwn has joined #archiveteam-bs
18:44 🔗 C4K3 has joined #archiveteam-bs
18:44 🔗 MrRadar2 has joined #archiveteam-bs
18:44 🔗 Sk1d has joined #archiveteam-bs
18:44 🔗 unlobito has joined #archiveteam-bs
18:44 🔗 bsmith093 has joined #archiveteam-bs
18:44 🔗 Rai-chan has joined #archiveteam-bs
18:44 🔗 w0rp has joined #archiveteam-bs
18:44 🔗 BnARobin has joined #archiveteam-bs
18:44 🔗 Aoede has joined #archiveteam-bs
18:44 🔗 JAA has joined #archiveteam-bs
18:44 🔗 jtn2 has joined #archiveteam-bs
18:44 🔗 twigfoot has joined #archiveteam-bs
18:44 🔗 Tenebrae has joined #archiveteam-bs
18:44 🔗 Gfy has joined #archiveteam-bs
18:44 🔗 ircd.choopa.net sets mode: +ooo balrog chfoo JAA
18:44 🔗 beardicus has joined #archiveteam-bs
18:44 🔗 PoorHomie has joined #archiveteam-bs
18:44 🔗 Fusl has joined #archiveteam-bs
18:44 🔗 Xibalba has joined #archiveteam-bs
18:44 🔗 pikhq has joined #archiveteam-bs
18:44 🔗 hook54321 has joined #archiveteam-bs
18:44 🔗 Mayonaise has joined #archiveteam-bs
18:44 🔗 Polylith has joined #archiveteam-bs
18:44 🔗 ez has joined #archiveteam-bs
18:44 🔗 svchfoo1 has joined #archiveteam-bs
18:44 🔗 Kenshin has joined #archiveteam-bs
18:44 🔗 Boppen has joined #archiveteam-bs
18:44 🔗 tsr has joined #archiveteam-bs
18:44 🔗 nightpool has joined #archiveteam-bs
18:44 🔗 ircd.choopa.net sets mode: +o svchfoo1
18:44 🔗 swebb sets mode: +o balrog
18:44 🔗 swebb sets mode: +o JAA
18:52 🔗 jschwart has joined #archiveteam-bs
18:57 🔗 WubTheCa1 has joined #archiveteam-bs
18:57 🔗 WubTheCap has quit IRC (Read error: Connection reset by peer)
18:57 🔗 WubTheCa1 is now known as WubTheCap
19:33 🔗 klondike JAA: 126 seconds... wow!
19:50 🔗 mistym has quit IRC (ZNC - http://znc.in)
20:04 🔗 JAA I'm now using a generic ignore pattern to handle these nasty infinite loops. Any URL with more than 10 path segments is ignored. I haven't seen any legitimate URLs like that, so it should be fine.
20:04 🔗 JAA Any URL under subcultura.es with more than 10 path segments*
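The rule JAA describes can be sketched roughly as follows. This is only an illustration in Python, not the actual ignore pattern used in the grab; the function name `should_ignore` and its parameters are hypothetical, and only the host and the threshold of 10 come from the messages above.

```python
from urllib.parse import urlsplit

MAX_PATH_SEGMENTS = 10  # threshold mentioned in the log


def should_ignore(url, host="subcultura.es", max_segments=MAX_PATH_SEGMENTS):
    """Return True if url is under host and has more than max_segments path segments.

    Hypothetical helper: illustrates the rule only, not the real ignore regex.
    """
    parts = urlsplit(url)
    hostname = parts.hostname or ""
    # Only apply the rule to the target site (and its subdomains).
    if hostname != host and not hostname.endswith("." + host):
        return False
    # Empty segments (leading, trailing, or doubled slashes) are not counted.
    segments = [s for s in parts.path.split("/") if s]
    return len(segments) > max_segments
```

A plain segment count is blunt, but it catches both kinds of loops seen in this grab (repeated domain names and repeated `%3Cbr%20` fragments in the path) without having to enumerate each broken pattern.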
20:52 🔗 SketchCow https://www.twitch.tv/textfilesdotcom
20:57 🔗 Smiley I'm kinda scared to click
20:57 🔗 * Smiley clicks anyway
20:57 🔗 astrid well you're the one named Slimey
20:57 🔗 Smiley I'm also wearing a hat :P
20:58 🔗 astrid is it a good hat?
20:58 🔗 Smiley SketchCow: not sure you care, but the audio is screwed here.
20:58 🔗 Smiley astrid: it's a warm hat?
20:58 🔗 astrid good.
20:58 🔗 Smiley it also fits over my headphones.
20:58 🔗 astrid now that's an achievement
21:42 🔗 schbirid is there a wget setting to make it quit after n 4xx/5xx errors?
21:58 🔗 JAA schbirid: As far as I know, nope.
21:58 🔗 JAA You'd have to do it with wget-lua or wpull and a hook script.
22:06 🔗 schbirid :\
22:06 🔗 schbirid thx
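The counting logic such a hook script would need could look something like the sketch below. It is plain Python and deliberately does not use any wpull or wget-lua API; the class name `ErrorBudget` and its `record` method are hypothetical, and wiring it into a real hook (and actually stopping the crawl) is left as an assumption.

```python
class ErrorBudget:
    """Count 4xx/5xx responses and signal when a limit is reached.

    Hypothetical helper: a hook script would call record() for every
    response and stop the crawl itself when it returns True.
    """

    def __init__(self, max_errors):
        self.max_errors = max_errors
        self.errors = 0

    def record(self, status_code):
        """Register one response; return True once max_errors is reached."""
        if 400 <= status_code <= 599:
            self.errors += 1
        return self.errors >= self.max_errors
```

This counts errors over the whole run; resetting `errors` on each successful response would instead stop only after n consecutive errors.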
22:08 🔗 JAA klondike: I've increased the timeout to five minutes and am now regrabbing those 62 URLs that failed repeatedly before.
22:09 🔗 JAA Plenty of internal server errors among those, which will most likely persist.
22:10 🔗 JAA But hopefully at least those forum errors go away.
22:21 🔗 klondike Internal Error, I timed out because my programming sucks :P
22:30 🔗 jschwart has quit IRC (Quit: Konversation terminated!)
22:36 🔗 JAA klondike: Any idea when exactly they'll shut down?
22:36 🔗 schbirid has quit IRC (Quit: Leaving)
22:37 🔗 klondike JAA: all I know is what they say on the post regarding times
22:37 🔗 JAA Alright
22:37 🔗 klondike Regarding method they say they'll remove the index.php files but keep the images a bit more
22:37 🔗 JAA Pretty much the only thing that I'm still missing are the forums.
22:37 🔗 JAA Well, part of them.
22:38 🔗 klondike And the updates since you started :P
22:38 🔗 JAA Right
22:39 🔗 JAA That's why I asked, mainly. Not sure if there will still be time to grab those.
22:40 🔗 JAA It's hard to estimate right now how much of the forums I'm still missing, but I guess it'll be a while until that completes.
22:41 🔗 klondike Hum
22:43 🔗 klondike Need me to fix a script to try fetching the updates of the last week or so on the comics?
22:43 🔗 JAA It's possible that my grab has some of those.
22:43 🔗 JAA It depends really.
22:43 🔗 klondike Well to be sincere each comic is a world
22:44 🔗 klondike And I was expecting to lose data as they haven't done a clean shutdown
22:44 🔗 JAA Yeah, they'd have to freeze the entire thing, then let us archive it or something like that.
22:45 🔗 * klondike *nods*
22:45 🔗 klondike Sadly things are the way they are :P
22:45 🔗 JAA Yep
22:46 🔗 klondike Anyways I'll try to make the script
22:47 🔗 klondike JAA: a list of all the URLS with an update in the last say week would be enough?
22:48 🔗 JAA klondike: A list of URLs is mostly useless. I can easily discover that. I'd just regrab the homepages and archive pages of all comics (as you suggested some days ago). Then it would automatically discover any new comics.
22:49 🔗 klondike You might want to add also the snaps and blog pages
22:50 🔗 JAA Those pages load much more quickly than forums and user profiles, so that would probably be done very quickly. The problem's just that there might not be enough time.
22:50 🔗 JAA I can't start this until my current grab has finished. (And I also don't want to, as I'm already putting a lot of stress on those servers through the forums crawl.)
22:50 🔗 klondike Hum
22:51 🔗 klondike So what can I do about it?
22:51 🔗 JAA Not sure, to be honest.
22:51 🔗 JAA Unless you want to start another grab based on my scripts or something.
22:52 🔗 JAA But no idea how much of the new content that would grab.
22:52 🔗 klondike I started one some time ago, my server CPU is at 100% xD
22:52 🔗 JAA Hehe, yep, that tends to happen.
22:52 🔗 klondike But I can set up an account for you if you want
22:52 🔗 JAA My grab currently uses just a few percent CPU because the forums are so slow.
22:55 🔗 klondike JAA: Want me to set up a shell account for you on my server?
22:55 🔗 klondike I'll set up wpull on it and you should be able to upload your dump DB if you need to
22:56 🔗 JAA Hm, not sure how that would help honestly. As I said, I can't start that regrab until the current grab is done. I could easily run another wpull process on my server (resource-wise).
22:59 🔗 klondike I can stop the grab I was doing from my side, that should free a few conns from the server
23:12 🔗 JAA klondike: FWIW, my grab just picked up http://subcultura.es/blogs/Neyebur/un-nuevo-comienzo-33105/ a few minutes ago.
23:12 🔗 JAA So at least some new content is in there.
23:18 🔗 pizzaiolo has quit IRC (pizzaiolo)
23:18 🔗 pizzaiolo has joined #archiveteam-bs
23:23 🔗 pizzaiolo has quit IRC (Client Quit)
23:23 🔗 BlueMaxim has joined #archiveteam-bs
23:23 🔗 pizzaiolo has joined #archiveteam-bs
23:53 🔗 Aranje has quit IRC (Read error: Operation timed out)
23:54 🔗 Aranje has joined #archiveteam-bs
