[00:06] Sometimes, I feel the urge to track down whoever wrote this shitty website code and strangle them. << I feel your pain, did you find the vulns yet?
[00:07] Hm?
[00:08] Not big stuff
[00:43] *** jacketcha has quit IRC (Read error: Connection reset by peer)
[00:58] *** pizzaiolo has quit IRC (pizzaiolo)
[00:59] *** pizzaiolo has joined #archiveteam-bs
[01:03] *** pizzaiolo has quit IRC (Client Quit)
[01:03] *** pizzaiolo has joined #archiveteam-bs
[01:14] *** Aranje has quit IRC (Read error: Operation timed out)
[01:15] *** Aranje has joined #archiveteam-bs
[01:21] *** Aranje has quit IRC (Read error: Operation timed out)
[01:21] *** Aranje has joined #archiveteam-bs
[01:56] *** dashcloud has joined #archiveteam-bs
[02:39] *** pizzaiolo has quit IRC (pizzaiolo)
[02:39] *** pizzaiolo has joined #archiveteam-bs
[02:43] *** pizzaiolo has quit IRC (Client Quit)
[02:43] *** pizzaiolo has joined #archiveteam-bs
[03:34] *** pizzaiolo has quit IRC (pizzaiolo)
[03:35] *** jacketcha has joined #archiveteam-bs
[03:41] *** Aranje has quit IRC (Read error: Operation timed out)
[03:41] *** Aranje has joined #archiveteam-bs
[03:45] *** Aranje has quit IRC (Read error: Operation timed out)
[03:45] *** Aranje has joined #archiveteam-bs
[04:03] *** Dimtree has quit IRC (Read error: Operation timed out)
[04:17] *** Aranje has quit IRC (Read error: Operation timed out)
[04:21] *** Aranje has joined #archiveteam-bs
[04:45] *** qw3rty117 has joined #archiveteam-bs
[04:50] *** qw3rty116 has quit IRC (Read error: Operation timed out)
[05:03] *** Aranje has quit IRC (Read error: Operation timed out)
[05:03] *** Aranje has joined #archiveteam-bs
[05:13] SketchCow: That isn't even possible, IIRC.
[05:15] Deleted stuff from wikipedia is really deleted. They do periodic drops of full-meta-history, but it's not complete in the truest sense of the word AFAIK.
[05:38] November full en wiki dump looks like it's 127GB 7zipped, not as bad as I thought.
[05:49] *** Aranje has quit IRC (Read error: Operation timed out)
[05:50] *** Aranje has joined #archiveteam-bs
[07:05] *** Aranje has quit IRC (Read error: Operation timed out)
[07:06] *** Aranje has joined #archiveteam-bs
[07:14] *** Aranje has quit IRC (Read error: Operation timed out)
[07:14] *** Aranje has joined #archiveteam-bs
[07:29] *** Aranje has quit IRC (Read error: Operation timed out)
[07:29] *** Aranje has joined #archiveteam-bs
[07:54] robogoat, I would presume it would be 'day 0' defined as 'when the solution is put into production' rather than 'when en.wikipedia.org came online'
[08:35] *** wp494 has quit IRC (Read error: Operation timed out)
[08:58] latest digitize tapes: https://www.patreon.com/posts/digitize-tapes-16689791
[09:07] *** Aranje has quit IRC (Read error: Operation timed out)
[09:07] *** Aranje has joined #archiveteam-bs
[09:29] *** ZexaronS has joined #archiveteam-bs
[09:45] The Subcultura grab went down a number of rabbit holes overnight. Loads of crap like .../facebook.com/facebook.com/twitter.com/bullshit.com/...
[09:47] Also beautiful links such as http://subcultura.es/blogs/marieporm/tom-sandoval-wedding-is-ariana-madix-ready-to-tie-the-knot-after-pump-rules-divorce-29230/%3Cbr%20/%3Cbr%20/%3Cbr%20/%3Cbr%20/%3Cbr%20/%3Cbr%20/%3Cbr%20/%3Cbr%20/%3Cbr%20/%3Ephotos:
[09:48] Ignoring these brought the queue down from ~130k to 410. :-)
[09:51] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[10:30] *** schbirid has joined #archiveteam-bs
[11:13] The queue has been empty for about 15 minutes. It's now retrying ~38k URLs which produced errors before (e.g. max_conn, timeouts).
[11:19] *** pizzaiolo has joined #archiveteam-bs
[11:25] *** Aranje has quit IRC (Read error: Operation timed out)
[11:25] *** Aranje has joined #archiveteam-bs
[11:40] looks like i will have another GB of ERIC archive files
[11:40] it's about 700+ files now
[11:47] *** schbirid has quit IRC (Read error: Operation timed out)
[11:49] *** Dimtree has joined #archiveteam-bs
[11:49] *** schbirid has joined #archiveteam-bs
[12:02] "Deleted stuff from wikipedia is really deleted" Not articles, these can be restored by an operator, but not sure if forever.
[12:03] "Ignoring these brought the queue down from ~130k to 410. :-)" XDDDDDDD
[12:03] Yeah spammers on the rise is one of the reasons why they closed.
[12:28] *** Mateon1 has quit IRC (Ping timeout: 255 seconds)
[12:29] *** Mateon1 has joined #archiveteam-bs
[13:23] *** wp494 has joined #archiveteam-bs
[14:37] klondike: you appear to be correct, now I'm wondering why I thought that.
[14:38] Probably 'cause I'm not an operator, and so maybe I was looking for something and it wasn't available to me.
[15:40] I just tested the ArchiveBot WARC of playmxm.com with pywb. Unsurprisingly, it doesn't work properly.
[15:40] But at least the data itself is there. Looks like it grabbed images and videos as well.
[15:42] It might work better with a real MITM proxy since the URLs in the browser are still messed up in pywb, but I didn't test that.
[17:29] Heh, just saw that many of the errors were also broken URLs like the ones above, and I didn't throw those out before. Around 500 URLs waiting to be retried now, down from 18k. :-P
[17:57] Seeing some timeouts in the forums. I guess I'll have to requeue those URLs and increase the timeout.
[18:02] JAA: my initial testing showed times of around 1 second per request on some threads
[18:02] I wonder if they have a good index
[18:03] I'm sure they don't.
[18:03] I have a timeout of 30 seconds, I think, and multiple pages triggered that already.
[18:03] Ah no, 60 seconds even.
[18:04] But http://subcultura.es/foro/webcomics/2400 failed three times, for example.
[18:07] If you check at the bottom of the source, they show how long it took to generate the page
[18:39] *** BnAboyZ has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** kisspunch has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** PurpleSym has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** Zebranky has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** MrRadar2 has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** BnARobin has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** jtn2 has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** Tenebrae has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** Fusl has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** hook54321 has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** ez has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** Polylith has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** Sk1d has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** Boppen has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** nyany has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** Kagee has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** altlabel has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** Xibalba has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** klondike has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** tuluu has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** SN4T14 has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** Kaz has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** Lord_Nigh has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** yuitimoth has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** Rai-chan has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** Aoede has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** Gfy has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** pikhq has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** svchfoo1 has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** tsr has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** nightpool has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** Dimtree has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** qw3rty117 has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** jacketcha has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** REiN^ has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** rolfoid has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** balrog has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** PotcFdk has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** Kimmer has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** rsznik has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** will has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** ivan has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** ndiddy has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** sep332_ has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** Odd0002 has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** robink has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** Jusque has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** chfoo has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** squires has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** bwn has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** C4K3 has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** unlobito has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** w0rp has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** JAA has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** twigfoot has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** beardicus has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** PoorHomie has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** Mayonaise has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** Kenshin has quit IRC (hub.efnet.us ircd.choopa.net)
[18:39] *** bsmith093 has quit IRC (hub.efnet.us ircd.choopa.net)
[18:40] *** slyphic has quit IRC (Read error: Operation timed out)
[18:41] *** slyphic has joined #archiveteam-bs
[18:44] *** Dimtree has joined #archiveteam-bs
[18:44] *** qw3rty117 has joined #archiveteam-bs
[18:44] *** nyany has joined #archiveteam-bs
[18:44] *** REiN^ has joined #archiveteam-bs
[18:44] *** rolfoid has joined #archiveteam-bs
[18:44] *** balrog has joined #archiveteam-bs
[18:44] *** tuluu has joined #archiveteam-bs
[18:44] *** PotcFdk has joined #archiveteam-bs
[18:44] *** Kimmer has joined #archiveteam-bs
[18:44] *** rsznik has joined #archiveteam-bs
[18:44] *** klondike has joined #archiveteam-bs
[18:44] *** SN4T14 has joined #archiveteam-bs
[18:44] *** BnAboyZ has joined #archiveteam-bs
[18:44] *** will has joined #archiveteam-bs
[18:44] *** Kaz has joined #archiveteam-bs
[18:44] *** Kagee has joined #archiveteam-bs
[18:44] *** Lord_Nigh has joined #archiveteam-bs
[18:44] *** ivan has joined #archiveteam-bs
[18:44] *** ndiddy has joined #archiveteam-bs
[18:44] *** sep332_ has joined #archiveteam-bs
[18:44] *** Odd0002 has joined #archiveteam-bs
[18:44] *** robink has joined #archiveteam-bs
[18:44] *** Jusque has joined #archiveteam-bs
[18:44] *** chfoo has joined #archiveteam-bs
[18:44] *** altlabel has joined #archiveteam-bs
[18:44] *** kisspunch has joined #archiveteam-bs
[18:44] *** yuitimoth has joined #archiveteam-bs
[18:44] *** squires has joined #archiveteam-bs
[18:44] *** PurpleSym has joined #archiveteam-bs
[18:44] *** Zebranky has joined #archiveteam-bs
[18:44] *** bwn has joined #archiveteam-bs
[18:44] *** C4K3 has joined #archiveteam-bs
[18:44] *** MrRadar2 has joined #archiveteam-bs
[18:44] *** Sk1d has joined #archiveteam-bs
[18:44] *** unlobito has joined #archiveteam-bs
[18:44] *** bsmith093 has joined #archiveteam-bs
[18:44] *** Rai-chan has joined #archiveteam-bs
[18:44] *** w0rp has joined #archiveteam-bs
[18:44] *** BnARobin has joined #archiveteam-bs
[18:44] *** Aoede has joined #archiveteam-bs
[18:44] *** JAA has joined #archiveteam-bs
[18:44] *** jtn2 has joined #archiveteam-bs
[18:44] *** twigfoot has joined #archiveteam-bs
[18:44] *** Tenebrae has joined #archiveteam-bs
[18:44] *** Gfy has joined #archiveteam-bs
[18:44] *** ircd.choopa.net sets mode: +ooo balrog chfoo JAA
[18:44] *** beardicus has joined #archiveteam-bs
[18:44] *** PoorHomie has joined #archiveteam-bs
[18:44] *** Fusl has joined #archiveteam-bs
[18:44] *** Xibalba has joined #archiveteam-bs
[18:44] *** pikhq has joined #archiveteam-bs
[18:44] *** hook54321 has joined #archiveteam-bs
[18:44] *** Mayonaise has joined #archiveteam-bs
[18:44] *** Polylith has joined #archiveteam-bs
[18:44] *** ez has joined #archiveteam-bs
[18:44] *** svchfoo1 has joined #archiveteam-bs
[18:44] *** Kenshin has joined #archiveteam-bs
[18:44] *** Boppen has joined #archiveteam-bs
[18:44] *** tsr has joined #archiveteam-bs
[18:44] *** nightpool has joined #archiveteam-bs
[18:44] *** ircd.choopa.net sets mode: +o svchfoo1
[18:44] *** swebb sets mode: +o balrog
[18:44] *** swebb sets mode: +o JAA
[18:52] *** jschwart has joined #archiveteam-bs
[18:57] *** WubTheCa1 has joined #archiveteam-bs
[18:57] *** WubTheCap has quit IRC (Read error: Connection reset by peer)
[18:57] *** WubTheCa1 is now known as WubTheCap
[19:33] JAA: 126 seconds... wow!
[19:50] *** mistym has quit IRC (ZNC - http://znc.in)
[20:04] I'm now using a generic ignore pattern to handle these nasty infinite loops. Any URL with more than 10 path segments is ignored. I haven't seen any legitimate URLs like that, so it should be fine.
[20:04] Any URL under subcultura.es with more than 10 path segments*
[20:52] https://www.twitch.tv/textfilesdotcom
[20:57] I'm kinda scared to click
[20:57] * Smiley clicks anyway
[20:57] well you're the one named Slimey
[20:57] I'm also wearing a hat :P
[20:58] is it a good hat?
[20:58] SketchCow: not sure you care, but the audio is screwed here.
[20:58] astrid: it's a warm hat?
[20:58] good.
[20:58] it also fits over my headphones.
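The ignore rule described at [20:04] (skip any URL under subcultura.es with more than 10 path segments) could be sketched roughly as below. This is not the actual ignore pattern used in the grab, which the log does not show; the function name `should_ignore` and its parameters are made up for illustration.

```python
from urllib.parse import urlsplit

MAX_SEGMENTS = 10  # threshold quoted in the log


def should_ignore(url, host="subcultura.es", max_segments=MAX_SEGMENTS):
    """Return True if `url` is under `host` and has more than `max_segments` path segments.

    Hypothetical helper: it only illustrates the rule from the log,
    not the real ignore pattern.
    """
    parts = urlsplit(url)
    hostname = (parts.hostname or "").lower()
    # Only apply the rule to the target site (and its subdomains).
    if hostname != host and not hostname.endswith("." + host):
        return False
    # Count non-empty path segments, so doubled or trailing slashes don't inflate the count.
    segments = [s for s in parts.path.split("/") if s]
    return len(segments) > max_segments
```

A URL such as http://subcultura.es/foro/webcomics/2400 has three segments and is kept, while the looping /facebook.com/facebook.com/... chains exceed the threshold and are dropped. URLs on other hosts are never ignored by this check, matching the correction in the second [20:04] message.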
[20:58] ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;now that's an achievement
[21:42] is there a wget setting to make it quit after n 4xx/5xx errors?
[21:58] schbirid: As far as I know, nope.
[21:58] You'd have to do it with wget-lua or wpull and a hook script.
[22:06] :\
[22:06] thx
[22:08] klondike: I've increased the timeout to five minutes and am now regrabbing those 62 URLs that failed repeatedly before.
[22:09] Plenty of internal server errors among those, which will most likely persist.
[22:10] But hopefully at least those forum errors go away.
[22:21] Internal Error, I timed out because my programming sucks :P
[22:30] *** jschwart has quit IRC (Quit: Konversation terminated!)
[22:36] klondike: Any idea when exactly they'll shut down?
[22:36] *** schbirid has quit IRC (Quit: Leaving)
[22:37] JAA: all I know is what they say on the post regarding times
[22:37] Alright
[22:37] Regarding method they say they'll remove the index.php files but keep the images a bit more
[22:37] Pretty much the only thing that I'm still missing are the forums.
[22:37] Well, part of them.
[22:38] And the updates since you started :P
[22:38] Right
[22:39] That's why I asked, mainly. Not sure if there will still be time to grab those.
[22:40] It's hard to estimate right now how much of the forums I'm still missing, but I guess it'll be a while until that completes.
[22:41] Hum
[22:43] Need me to fix a script to try fetching the updates of the last week or so on the comics?
[22:43] It's possible that my grab has some of those.
[22:43] It depends really.
[22:43] Well to be sincere each comic is a world
[22:44] And I was expecting to lose data as they haven't done a clean shutdown
[22:44] Yeah, they'd have to freeze the entire thing, then let us archive it or something like that.
[22:45] * klondike *nods*
[22:45] Sadly things are the way they are :P
[22:45] Yep
[22:46] Anyways I'll try to make the script
[22:47] JAA: a list of all the URLs with an update in the last say week would be enough?
[22:48] klondike: A list of URLs is mostly useless. I can easily discover that. I'd just regrab the homepages and archive pages of all comics (as you suggested some days ago). Then it would automatically discover any new comics.
[22:49] You might want to add also the snaps and blog pages
[22:50] Those pages load much more quickly than forums and user profiles, so that would probably be done very quickly. The problem's just that there might not be enough time.
[22:50] I can't start this until my current grab has finished. (And I also don't want to, as I'm already putting a lot of stress on those servers through the forums crawl.)
[22:50] Hum
[22:51] So what can I do about it?
[22:51] Not sure, to be honest.
[22:51] Unless you want to start another grab based on my scripts or something.
[22:52] But no idea how much of the new content that would grab.
[22:52] I started one sometime ago, my server CPU is at 100% xD
[22:52] Hehe, yep, that tends to happen.
[22:52] But I can set up an account for you if you want
[22:52] My grab currently uses just a few percent CPU because the forums are so slow.
[22:55] JAA: Want me to set up a shell account for you on my server?
[22:55] I'll set up wpull on it and you should be able to upload your dump DB if you need to
[22:56] Hm, not sure how that would help honestly. As I said, I can't start that regrab until the current grab is done. I could easily run another wpull process on my server (resource-wise).
[22:59] I can stop the grab I was doing from my side, that should free a few conns from the server
[23:12] klondike: FWIW, my grab just picked up http://subcultura.es/blogs/Neyebur/un-nuevo-comienzo-33105/ a few minutes ago.
[23:12] So at least some new content is in there.
[23:18] *** pizzaiolo has quit IRC (pizzaiolo)
[23:18] *** pizzaiolo has joined #archiveteam-bs
[23:23] *** pizzaiolo has quit IRC (Client Quit)
[23:23] *** BlueMaxim has joined #archiveteam-bs
[23:23] *** pizzaiolo has joined #archiveteam-bs
[23:53] *** Aranje has quit IRC (Read error: Operation timed out)
[23:54] *** Aranje has joined #archiveteam-bs
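On the [21:42] question about quitting after n 4xx/5xx errors: the [21:58] answer is that a hook script would be needed. The counting logic such a script would need might look something like the sketch below. The class `ErrorBudget` and its method names are hypothetical, and how it would be wired into wget-lua or wpull is deliberately left out, since the log does not describe those hook interfaces.

```python
class ErrorBudget:
    """Counts 4xx/5xx responses and signals when a crawl should stop.

    Hypothetical helper for illustration only; it performs no network
    requests and is not tied to any particular crawler.
    """

    def __init__(self, max_errors):
        self.max_errors = max_errors
        self.errors = 0

    def record(self, status_code):
        """Record one HTTP status code; return True once the error limit is reached."""
        if 400 <= status_code <= 599:
            self.errors += 1
        return self.errors >= self.max_errors


# Usage sketch: feed in the status of each response as it arrives.
budget = ErrorBudget(max_errors=3)
for status in [200, 404, 500, 200, 503]:
    if budget.record(status):
        print("error limit reached, stopping after", budget.errors, "errors")
        break
```

This counts errors over the whole run. Counting only consecutive errors (resetting on any 2xx) would be a different policy, and which one fits depends on whether a slow trickle of broken pages or a sudden outage is the concern.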