[00:01] JAA: in fact there may even be some new webcomics...
[00:01] *** Aranje has quit IRC (Read error: Operation timed out)
[00:01] http://deliriosprocatinoapocaliticos.subcultura.es/tira/1/
[00:02] *** Aranje has joined #archiveteam-bs
[00:02] * klondike mumbles in Spanish again
[00:51] *** jacketcha has joined #archiveteam-bs
[01:01] *** Pixi has quit IRC (Quit: Pixi)
[01:08] *** Pixi has joined #archiveteam-bs
[01:28] *** pizzaiolo has quit IRC (Ping timeout: 246 seconds)
[01:29] *** pizzaiolo has joined #archiveteam-bs
[01:33] *** pizzaiolo has quit IRC (Client Quit)
[01:33] *** pizzaiolo has joined #archiveteam-bs
[01:42] *** pizzaiolo has quit IRC (Remote host closed the connection)
[01:43] *** BlueMaxim has quit IRC (Leaving)
[01:44] *** ranavalon has quit IRC (Read error: Connection reset by peer)
[01:46] *** ranavalon has joined #archiveteam-bs
[01:47] *** ranavalon has quit IRC (Remote host closed the connection)
[01:47] *** ranavalon has joined #archiveteam-bs
[02:02] *** jacketcha has quit IRC (Read error: Connection reset by peer)
[02:11] *** jacketcha has joined #archiveteam-bs
[02:22] *** BlueMaxim has joined #archiveteam-bs
[02:31] *** mistym has joined #archiveteam-bs
[03:05] *** Aranje has quit IRC (Read error: Operation timed out)
[03:05] *** Aranje has joined #archiveteam-bs
[03:10] *** Aranje has quit IRC (Read error: Operation timed out)
[03:10] *** Aranje has joined #archiveteam-bs
[03:32] *** BlueMaxim has quit IRC (Leaving)
[03:51] *** BlueMaxim has joined #archiveteam-bs
[04:45] *** qw3rty118 has joined #archiveteam-bs
[04:47] *** jacketcha has quit IRC (Read error: Connection reset by peer)
[04:47] *** jacketcha has joined #archiveteam-bs
[04:49] *** qw3rty117 has quit IRC (Read error: Operation timed out)
[04:51] *** Aranje has quit IRC (Quit: Three sheets to the wind)
[05:08] *** ranav has joined #archiveteam-bs
[05:12] *** zhongfu has quit IRC (Remote host closed the connection)
[05:15] *** ranavalon has quit IRC (Read error: Operation timed out)
[05:47] *** antomatic has quit IRC (Ping timeout: 252 seconds)
[05:47] *** jacketcha has quit IRC (Read error: Connection reset by peer)
[05:48] *** jacketcha has joined #archiveteam-bs
[06:06] *** Stilett0 is now known as Stiletto
[06:30] *** antomatic has joined #archiveteam-bs
[06:30] *** swebb sets mode: +o antomatic
[06:40] *** antomatic has quit IRC (Read error: Operation timed out)
[06:40] *** antomatic has joined #archiveteam-bs
[06:40] *** swebb sets mode: +o antomatic
[07:25] *** pikhq has quit IRC (Ping timeout: 250 seconds)
[07:40] *** pikhq has joined #archiveteam-bs
[10:12] klondike: The grab finished a few hours ago. As far as I can tell, it grabbed the entire forums successfully, except for a few thread pages which always cause error 500 (e.g. http://subcultura.es/foro/taller/post/1285/10 ). A few comic pages have the same issue, by the way (e.g. http://twindragonscomic.subcultura.es/tira/26 ).
[10:13] *** schbirid has joined #archiveteam-bs
[10:13] Intriguing
[10:14] Meanwhile my comic only run is still fetching stuff :(
[10:14] So what now? I think refreshing the archives should take at most 4 hours.
[10:15] *** SilSte has quit IRC (Read error: Connection reset by peer)
[10:32] *** Mateon1 has quit IRC (Read error: Operation timed out)
[10:33] *** Mateon1 has joined #archiveteam-bs
[10:39] I'm busy with other stuff right now, but I'll set up a regrab of the comics (homepages of all subdomains + archives).
[10:39] When that's done, I'll see what I can do about blogs, snaps, and forums.
[11:30] About 15.5k URLs requeued.
[11:54] I'll requeue the webcomic directory (http://subcultura.es/directorio/ ) as well to discover any new comics.
[12:21] *** BlueMaxim has quit IRC (Leaving)
[13:44] Under 100 URLs remaining, but it's very slow again because forums.
[13:54] No new comics discovered in the directory regrab, so I guess I should have all of them.
[14:14] Regrabbed the snaps list (http://subcultura.es/snaps/, http://subcultura.es/snaps/pagina/32, etc.) and discovered 42 new snaps.
[14:34] Uh oh, I just got a number of spurious 404s, followed by a short connection refusal, followed by everything back to normal. I hope there aren't too many of those in the archives.
[14:34] Regrabbing blogs at the moment.
[14:54] That's done (including requeueing those 404s). All blog entry IDs since 33012 (the current post when I started my grab) except 33015 and 33085 (which don't seem to exist...?) are done.
[15:02] Forums, why you so slow? I can understand some slowness, but seriously, it's like they stored all posts including their metadata in text fields and have to scan the entire database every time anyone requests a page.
[15:52] *** ld1 has quit IRC (Quit: ld1)
[15:55] *** ld1 has joined #archiveteam-bs
[16:07] JAA: for the directory you'll need to regrab per category
[16:07] Or I can just give you the list of URLs from the top lists (that contains all if using the right command)
[16:08] JAA: I'd say I'm surprised at the forum slowness, but I'm not :P
[16:09] Yeah, I did regrab the category pages.
[16:09] okay, we are good then, I guess :)
[16:09] Updated parts of the forums are done as well. I decided to regrab all threads that have been updated since I started the grab entirely. There may be some duplication in there, but this way it should really capture everything up to now.
[16:10] I do find it interesting that the forums almost always take a bit over two minutes to render. Seems like they are doing something internally which takes too long (or is simply broken) and times out after two minutes or something like that.
[16:11] Hm, actually, now that I look at the archives, most pages take only 3-5 seconds.
[16:11] Nevermind then.
[16:12] Final statistics: 2.71 million URLs grabbed (some twice), 137 GiB of compressed WARCs.
[16:21] the URL shorteners, what happens with that data?
[16:22] JAA: cool! my webcomic only dump is still running U_U
[16:22] I thought my computer was more powerful than that to be sincere.
[16:23] JAA: Anyways you deserve all the credit for it. Thanks a lot!
[16:23] I have also learnt a bit out of this :)
[16:24] So JAA how can I pay you back? If you'll be at FOSDEM I'll gladly get you some beers.
[16:29] *** RichardG has quit IRC (Ping timeout: 506 seconds)
[16:35] Okay which one of you bought all the Scaleway boxes -_- it's almost all out of stock
[16:36] klondike: Thanks for telling me about it. I hate seeing communities die, and I'd never have discovered this one on my own.
[16:37] I'd like to come to FOSDEM, but I won't be there. Make a donation to IA if you want. Otherwise, don't worry about it. It's what I do and what I have these machines for. :-)
[16:38] Smiley: https://archive.org/search.php?query=subject%3A%22urlteam%22
[16:40] JAA: what prevents you from getting to fosdem, the place to stay or the transportation?
[16:40] Time
[16:41] Ahh, yeah, that I can't fix :P
[16:41] It's the worst enemy ever.
[16:42] Well I'm glad I have a job that allows me to combine both :)
[16:42] More or less...
[17:10] *** atrocity has quit IRC ()
[17:15] *** RichardG has joined #archiveteam-bs
[18:04] if i find something cheap SSD 60GB< RAM 2GB< can i get donation for that vps to run pipeline?
[18:05] i have ~1-2$/m with cpu like 1x1.5Ghz 1GB RAM and 20GB hdd ... using it for chat and ... trying archive something by grab-site ...
[18:07] and have no money for better vps :/ studying, just got a work on practics, temporary work... 3$/h for 2 weeks
[18:12] *** jschwart has joined #archiveteam-bs
[18:19] JAA: what do you think?
[19:35] I think if you want to help, learn python
[19:36] for example, 9$/m https://servercheap.net/vps-pricing.html 100GB SSD, 4GB RAM, 4 cores with OpenVZ
[19:37] or for 9$/m 50GB SSD, 4GB RAM, 2 cores with kvm
[19:40] 1Gbps
[19:41] yea, learn python, learn android and write apps, my income from ads/paid services will pay my vps...
[19:41] ...
[19:41] not my thought
[19:41] :D i am learning
[19:41] my thought was, paying for a vps for you won't really help anything that anyone can't just do themselves
[19:41] but, if you can learn python and start fixing/improving warriors, then things are great
[19:42] i want to, especially ftp project
[19:42] when i will know a little programming
[19:44] i was wondering if we can create warrior templates for projects
[19:46] we have one, URLTeam can be one template, other can be ftp, but some services works like using lot of them at the same time
[19:46] *** octothorp has quit IRC (Read error: Connection reset by peer)
[19:46] they should be designed to be a little atomic
[19:47] *** octothorp has joined #archiveteam-bs
[19:47] i think, maybe i am bad (beginner) but i am just thinking to make things right, with knowledge i have
[19:57] *** Soni has quit IRC (Ping timeout: 255 seconds)
[20:16] *** Soni has joined #archiveteam-bs
[20:38] *** RichardG has quit IRC (Read error: Connection reset by peer)
[21:07] *** RichardG has joined #archiveteam-bs
[21:35] *** Atom-- has joined #archiveteam-bs
[21:40] *** WubTheCap has quit IRC (Read error: Connection reset by peer)
[21:41] *** Atom has quit IRC (Read error: Operation timed out)
[21:44] *** WubTheCap has joined #archiveteam-bs
[21:45] JAA: seems some people made some last minute posts... ugh
[21:45] I generated the urls for those manually and fetched them, will send you the relevant files.
[21:46] Admins said they'll close in 1hour and 15 minutes
[21:46] *** Atom-- has quit IRC (Read error: Operation timed out)
[21:47] I didn't touch the forums though, too much work
[21:47] Same with the snaps, only comics and blogposts.
[22:01] *** pizzaiolo has joined #archiveteam-bs
[22:03] klondike: So they'll shut down at 00:00 local time?
[22:03] Yeah CET time
[22:03] I'm keeping an eye for any new blogposts/comics
[22:14] I think I should be able to easily grab all new comics, snaps, and blog posts up to now.
[22:14] Forums are a mess.
[22:15] *** pizzaiolo has quit IRC (Read error: Connection reset by peer)
[22:16] *** pizzaiolo has joined #archiveteam-bs
[22:18] Comics, snaps, and blogs up to now done.
[22:19] That took 34 seconds.
[22:21] (I just regrabbed the homepage, /snaps/, and /blogs/ and let it grab any new links it didn't see before, by the way.)
[22:22] Checked before that there was an overlap on all of those lists, so it should really be complete.
[22:22] *** schbirid has quit IRC (Quit: Leaving)
[22:22] Now I'm wondering if there's an easy way to get the forums.
[22:24] A hired gun shooting the admins until they give away the DB?
[22:24] (I was being sarcastic BTW)
[22:25] JAA I made a bit more uptodate snapshot for those though so that the archive and homepage of the comics is up to date for the updated comics
[22:26] That might be the easiest way, yes.
[22:26] Ah yeah, that's good.
[22:26] Not that easy to do for me.
[22:30] JAA: no worries I just use my own url list and a slightly modified hook :D
[22:34] Ok, I have a way to regrab the forum threads as well.
[22:38] *** jschwart has quit IRC (Quit: Konversation terminated!)
[22:38] It's a glorious five-line combination of sqlite3, grep, awk, tr, and sed. Sometimes, I'm a masochist.
[22:38] ¿No perl into the blend?
[22:38] You like it softcore ;)
[22:39] Well, I use grep's -P option, so there's Perl as well. :-P
[22:41] xD
[22:43] Forums done as well, that took about two minutes.
[22:46] Gah, fucking max_conn.
[22:46] *** pizzaiolo has quit IRC (Remote host closed the connection)
[22:48] Three new comics, one new snap, and two new blog posts in the past half hour.
[22:50] Yeah I get those manually :)
[22:51] The admins basically tried to get everybody on the chat thing, so lots of load I guess :(
[22:51] "the chat thing"?
[22:52] That forum thread?
[22:52] You need to log in
[22:52] http://subcultura.es/lab/chat
[22:52] Ah
[22:52] If you want feel free to use stduser:stduser
[22:52] I made that one when I was trying to figure out how the site worked
[22:53] Can't access anything at the moment.
[22:55] IA continues to have slowdown
[22:57] JAA: same here
[22:57] what's wrong? just a technical issue, or something else?
[22:57] haven't been paying attention to anything
[22:57] Kaz: most likely overloaded servers on subcultura
[22:57] eh
[22:58] one site doesn't kill the IA
[22:58] klondike: Lol, someone even created a new comic a few minutes ago.
[22:58] (my question was aimed at SketchCow if that wasn't clear)
[23:00] JAA: not specially relevant IMHO http://subcultura.es/webcomics/prueba111/344551.jpg
[23:00] It's just they slammed in a lot of code, did a lot of upgrades, inefficiencies are being found.
[23:00] ah, okay
[23:01] klondike: Yeah, just noticed it.
[23:01] Yup, they did shut down at midnight.
[23:01] And it's gone error 500 everywhere
[23:02] JAA: you only need the .warc.gz files true?
[23:03] Correct, those are the important ones.
[23:03] There should be ones with -00000.warc.gz etc., which are the actual data, and a -meta.warc.gz which has the wpull logs.
[23:03] Oki
[23:04] JAA: https://klondike.es/lastwarcs.tar
[23:05] Those are for the last minute updates except for that last comic
[23:07] Got it, thanks.
[23:09] As for the last comic I have the page saved with firefox, but I doubt there is any worth in it.
[23:14] JAA: nothing to thank me for. Thank you for all you have done.
[23:20] *** MrDignity has joined #archiveteam-bs
[23:28] *** BlueMaxim has joined #archiveteam-bs
[23:32] klondike: I have grabbed the prueba111 "comic" in time, I think.
[23:32] o/
[23:33] In my case all started failing xD
[23:33] Yeah, same here, had to retry a dozen times.
[23:33] I am not looking forward to processing these WARCs.
[23:33] So you don't need the firefox dumped page then?
[23:33] Heh because of the max_conns?
[23:34] Need to filter out the max_conn bullshit and the spurious 404s.
[23:34] Ah
[23:34] Is there a magic tool that can do that?
[23:35] I don't think so.
[23:35] I have another case like this.
[23:35] Well, now I have to recover all my work shit
[23:36] But I didn't really look into it yet because I like to procrastinate.
[23:36] But I can try to make a python tool for that later on when I am less stressed.
[23:36] Probably need to write something up with warcat or so.
[23:36] This is around one or two months from now FYI :P
[23:37] https://github.com/chfoo/warcat
[23:38] JAA, want to upload a copy of the WARCs somewhere so that there is a backup? I can try to prepare something on my tiny server
[23:40] Yeah, I need to look into that for all of my stuff actually.
[23:40] Eventually, it'll all end up at IA of course.
[23:40] Okay how much is "all of your stuff" in GB?
[23:41] I don't really know at the moment. Definitely over 1 TB.
[23:41] Ugh, I don't have so much storage on the server
[23:42] Yeah, I'll figure something out.
[23:42] More like 500G or so free
[23:42] A Hetzner storage box or something.
[23:43] I can try to fix something when I come back from FOSDEM though
[23:43] Maybe buy one of these USB3 HDDs and attach it to this computer so you can sftp into it
[23:43] Not ultra reliable or fast, but better than nothing.
[23:44] I should just get another HDD and put it in my tower.
[23:45] A Seagate Archive 8TB or something like that.
[23:45] What is preventing you from doing so?
[23:48] My laziness, mostly.
[23:49] Ahh, that can't be helped xD
[23:50] Also, my situation at home regarding computers and storage is really messy currently (lots of drives spread through multiple computers), and I kind of don't want to make that even worse while planning for a complete redesign.
[23:51] Get a permanent marker for the HDDs then?
[23:51] We use them at work, specially to write "DEAD don't use" on some of them :P
[23:52] DON'T DEAD USE INSIDE
[23:54] Helping the forensics team can be cool at times :P
[23:54] It's more about me never knowing where there's still free space (and how much), where a certain dataset is located, etc. Not really solvable with a marker.
[23:55] Just needs a full redesign really with a single location for all data.
[23:56] Anyway, getting off-topicky
[23:57] hook54321: I'm stopping the Catalan webcam grabs now.
[23:59] This is going to be fun to upload. :-|
[23:59] I have roughly 50k directories from those.
[23:59] With two files each
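
Note on the 23:34 exchange about filtering the WARCs: the log names no existing tool for this beyond a possible warcat-based script, and nothing had been written yet. What follows is only a minimal Python sketch of the filtering logic, assuming the responses have already been parsed out of the WARCs into (URL, status, body) records. The `Fetch` type, the `filter_fetches` function, and the `max_conn` marker string are all hypothetical and are not taken from wpull or warcat; the real error text served by the site does not appear in the log.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical marker for the overload error page; the real text is unknown.
OVERLOAD_MARKER = "max_conn"


@dataclass(frozen=True)
class Fetch:
    """One already-parsed response record (hypothetical structure)."""
    url: str
    status: int
    body: str


def filter_fetches(fetches: Iterable[Fetch]) -> List[Fetch]:
    """Drop overload error pages, and 404s for URLs that also succeeded."""
    records = list(fetches)
    # URLs with at least one clean 200 response somewhere in the grab.
    good_urls = {
        f.url
        for f in records
        if f.status == 200 and OVERLOAD_MARKER not in f.body
    }
    kept = []
    for f in records:
        if OVERLOAD_MARKER in f.body:
            continue  # overload error page, regardless of status code
        if f.status == 404 and f.url in good_urls:
            continue  # treated as spurious: the same URL was fetched fine
        kept.append(f)
    return kept


if __name__ == "__main__":
    sample = [
        Fetch("http://example.invalid/a", 404, "not found"),
        Fetch("http://example.invalid/a", 200, "page"),
        Fetch("http://example.invalid/b", 200, "error: max_conn exceeded"),
        Fetch("http://example.invalid/c", 404, "not found"),
    ]
    for f in filter_fetches(sample):
        print(f.url, f.status)
```

A 404 is only dropped when the same URL also has a clean 200, so pages that were never fetched successfully (such as blog IDs that may not exist) stay in the output for manual review.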