#archiveteam-bs 2018-01-31,Wed


Time Nickname Message
00:01 🔗 klondike JAA: in fact there may even be some new webcomics...
00:01 🔗 Aranje has quit IRC (Read error: Operation timed out)
00:01 🔗 klondike http://deliriosprocatinoapocaliticos.subcultura.es/tira/1/
00:02 🔗 Aranje has joined #archiveteam-bs
00:02 🔗 * klondike mumbles in Spanish again
00:51 🔗 jacketcha has joined #archiveteam-bs
01:01 🔗 Pixi has quit IRC (Quit: Pixi)
01:08 🔗 Pixi has joined #archiveteam-bs
01:28 🔗 pizzaiolo has quit IRC (Ping timeout: 246 seconds)
01:29 🔗 pizzaiolo has joined #archiveteam-bs
01:33 🔗 pizzaiolo has quit IRC (Client Quit)
01:33 🔗 pizzaiolo has joined #archiveteam-bs
01:42 🔗 pizzaiolo has quit IRC (Remote host closed the connection)
01:43 🔗 BlueMaxim has quit IRC (Leaving)
01:44 🔗 ranavalon has quit IRC (Read error: Connection reset by peer)
01:46 🔗 ranavalon has joined #archiveteam-bs
01:47 🔗 ranavalon has quit IRC (Remote host closed the connection)
01:47 🔗 ranavalon has joined #archiveteam-bs
02:02 🔗 jacketcha has quit IRC (Read error: Connection reset by peer)
02:11 🔗 jacketcha has joined #archiveteam-bs
02:22 🔗 BlueMaxim has joined #archiveteam-bs
02:31 🔗 mistym has joined #archiveteam-bs
03:05 🔗 Aranje has quit IRC (Read error: Operation timed out)
03:05 🔗 Aranje has joined #archiveteam-bs
03:10 🔗 Aranje has quit IRC (Read error: Operation timed out)
03:10 🔗 Aranje has joined #archiveteam-bs
03:32 🔗 BlueMaxim has quit IRC (Leaving)
03:51 🔗 BlueMaxim has joined #archiveteam-bs
04:45 🔗 qw3rty118 has joined #archiveteam-bs
04:47 🔗 jacketcha has quit IRC (Read error: Connection reset by peer)
04:47 🔗 jacketcha has joined #archiveteam-bs
04:49 🔗 qw3rty117 has quit IRC (Read error: Operation timed out)
04:51 🔗 Aranje has quit IRC (Quit: Three sheets to the wind)
05:08 🔗 ranav has joined #archiveteam-bs
05:12 🔗 zhongfu has quit IRC (Remote host closed the connection)
05:15 🔗 ranavalon has quit IRC (Read error: Operation timed out)
05:47 🔗 antomatic has quit IRC (Ping timeout: 252 seconds)
05:47 🔗 jacketcha has quit IRC (Read error: Connection reset by peer)
05:48 🔗 jacketcha has joined #archiveteam-bs
06:06 🔗 Stilett0 is now known as Stiletto
06:30 🔗 antomatic has joined #archiveteam-bs
06:30 🔗 swebb sets mode: +o antomatic
06:40 🔗 antomatic has quit IRC (Read error: Operation timed out)
06:40 🔗 antomatic has joined #archiveteam-bs
06:40 🔗 swebb sets mode: +o antomatic
07:25 🔗 pikhq has quit IRC (Ping timeout: 250 seconds)
07:40 🔗 pikhq has joined #archiveteam-bs
10:12 🔗 JAA klondike: The grab finished a few hours ago. As far as I can tell, it grabbed the entire forums successfully, except for a few thread pages which always cause error 500 (e.g. http://subcultura.es/foro/taller/post/1285/10 ). A few comic pages have the same issue, by the way (e.g. http://twindragonscomic.subcultura.es/tira/26 ).
10:13 🔗 schbirid has joined #archiveteam-bs
10:13 🔗 klondike Intriguing
10:14 🔗 klondike Meanwhile my comic-only run is still fetching stuff :(
10:14 🔗 klondike So what now? I think refreshing the archives should take at most 4 hours.
10:15 🔗 SilSte has quit IRC (Read error: Connection reset by peer)
10:32 🔗 Mateon1 has quit IRC (Read error: Operation timed out)
10:33 🔗 Mateon1 has joined #archiveteam-bs
10:39 🔗 JAA I'm busy with other stuff right now, but I'll set up a regrab of the comics (homepages of all subdomains + archives).
10:39 🔗 JAA When that's done, I'll see what I can do about blogs, snaps, and forums.
11:30 🔗 JAA About 15.5k URLs requeued.
11:54 🔗 JAA I'll requeue the webcomic directory (http://subcultura.es/directorio/ ) as well to discover any new comics.
12:21 🔗 BlueMaxim has quit IRC (Leaving)
13:44 🔗 JAA Under 100 URLs remaining, but it's very slow again because forums.
13:54 🔗 JAA No new comics discovered in the directory regrab, so I guess I should have all of them.
14:14 🔗 JAA Regrabbed the snaps list (http://subcultura.es/snaps/, http://subcultura.es/snaps/pagina/32, etc.) and discovered 42 new snaps.
14:34 🔗 JAA Uh oh, I just got a number of spurious 404s, followed by a short connection refusal, followed by everything back to normal. I hope there aren't too many of those in the archives.
14:34 🔗 JAA Regrabbing blogs at the moment.
14:54 🔗 JAA That's done (including requeueing those 404s). All blog entry IDs since 33012 (the current post when I started my grab) except 33015 and 33085 (which don't seem to exist...?) are done.
15:02 🔗 JAA Forums, why you so slow? I can understand some slowness, but seriously, it's like they stored all posts including their metadata in text fields and have to scan the entire database every time anyone requests a page.
15:52 🔗 ld1 has quit IRC (Quit: ld1)
15:55 🔗 ld1 has joined #archiveteam-bs
16:07 🔗 klondike JAA: for the directory you'll need to regrab per category
16:07 🔗 klondike Or I can just give you the list of URLs from the top lists (which contain all of them if you use the right command)
16:08 🔗 klondike JAA: I'd say I'm surprised at the forum slowness, but I'm not :P
16:09 🔗 JAA Yeah, I did regrab the category pages.
16:09 🔗 klondike okey, we are good then, I guess :)
16:09 🔗 JAA Updated parts of the forums are done as well. I decided to regrab in full all threads that have been updated since I started the grab. There may be some duplication in there, but this way it should really capture everything up to now.
16:10 🔗 JAA I do find it interesting that the forums almost always take a bit over two minutes to render. Seems like they are doing something internally which takes too long (or is simply broken) and times out after two minutes or something like that.
16:11 🔗 JAA Hm, actually, now that I look at the archives, most pages take only 3-5 seconds.
16:11 🔗 JAA Nevermind then.
16:12 🔗 JAA Final statistics: 2.71 million URLs grabbed (some twice), 137 GiB of compressed WARCs.
16:21 🔗 Smiley The URL shorteners, what happens with that data?
16:22 🔗 klondike JAA: cool! my webcomic-only dump is still running U_U
16:22 🔗 klondike I thought my computer was more powerful than that, to be honest.
16:23 🔗 klondike JAA: Anyways you deserve all the credit for it. Thanks a lot!
16:23 🔗 klondike I have also learnt a bit out of this :)
16:24 🔗 klondike So JAA how can I pay you back? If you'll be at FOSDEM I'll gladly get you some beers.
16:29 🔗 RichardG has quit IRC (Ping timeout: 506 seconds)
16:35 🔗 ThisAsYou Okay which one of you bought all the Scaleway boxes -_- it's almost all out of stock
16:36 🔗 JAA klondike: Thanks for telling me about it. I hate seeing communities die, and I'd never have discovered this one on my own.
16:37 🔗 JAA I'd like to come to FOSDEM, but I won't be there. Make a donation to IA if you want. Otherwise, don't worry about it. It's what I do and what I have these machines for. :-)
16:38 🔗 JAA Smiley: https://archive.org/search.php?query=subject%3A%22urlteam%22
16:40 🔗 klondike JAA: what prevents you from getting to fosdem, the place to stay or the transportation?
16:40 🔗 JAA Time
16:41 🔗 klondike Ahh, yeah, that I can't fix :P
16:41 🔗 JAA It's the worst enemy ever.
16:42 🔗 klondike Well, I'm glad I have a job that allows me to juggle both :)
16:42 🔗 klondike More or less...
17:10 🔗 atrocity has quit IRC ()
17:15 🔗 RichardG has joined #archiveteam-bs
18:04 🔗 Uzerus If I find something cheap (SSD 60 GB+, RAM 2 GB+), can I get a donation for that VPS to run a pipeline?
18:05 🔗 Uzerus I have a ~$1-2/month one with a CPU like 1x1.5 GHz, 1 GB RAM, and a 20 GB HDD ... using it for chat and ... trying to archive something with grab-site ...
18:07 🔗 Uzerus and I have no money for a better VPS :/ I'm studying, just got an internship, temporary work... $3/h for 2 weeks
18:12 🔗 jschwart has joined #archiveteam-bs
18:19 🔗 Uzerus JAA: what do you think?
19:35 🔗 Smiley I think if you want to help, learn python
19:36 🔗 Uzerus for example, 9$/m https://servercheap.net/vps-pricing.html 100GB SSD, 4GB RAM, 4 cores with OpenVZ
19:37 🔗 Uzerus or for 9$/m 50GB SSD, 4GB RAM, 2 cores with kvm
19:40 🔗 Uzerus 1Gbps
19:41 🔗 Uzerus yea, learn python, learn android and write apps, my income from ads/paid services will pay my vps...
19:41 🔗 Smiley ...
19:41 🔗 Smiley not my thought
19:41 🔗 Uzerus :D i am learning
19:41 🔗 Smiley my thought was, paying for a VPS for you won't really help with anything that anyone can't just do themselves
19:41 🔗 Smiley but, if you can learn python and start fixing/improving warriors, then things are great
19:42 🔗 Uzerus I want to, especially the FTP project
19:42 🔗 Uzerus once I know a little programming
19:44 🔗 Uzerus I was wondering if we can create warrior templates for projects
19:46 🔗 Uzerus We have one: URLTeam can be one template, another can be FTP, but some services work by using a lot of them at the same time
19:46 🔗 octothorp has quit IRC (Read error: Connection reset by peer)
19:46 🔗 Uzerus they should be designed to be a little atomic
19:47 🔗 octothorp has joined #archiveteam-bs
19:47 🔗 Uzerus I think, maybe I'm bad (a beginner), but I'm just trying to do things right with the knowledge I have
19:57 🔗 Soni has quit IRC (Ping timeout: 255 seconds)
20:16 🔗 Soni has joined #archiveteam-bs
20:38 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
21:07 🔗 RichardG has joined #archiveteam-bs
21:35 🔗 Atom-- has joined #archiveteam-bs
21:40 🔗 WubTheCap has quit IRC (Read error: Connection reset by peer)
21:41 🔗 Atom has quit IRC (Read error: Operation timed out)
21:44 🔗 WubTheCap has joined #archiveteam-bs
21:45 🔗 klondike JAA: seems some people made some last minute posts... ugh
21:45 🔗 klondike I generated the urls for those manually and fetched them, will send you the relevant files.
21:46 🔗 klondike Admins said they'll close in 1 hour and 15 minutes
21:46 🔗 Atom-- has quit IRC (Read error: Operation timed out)
21:47 🔗 klondike I didn't touch the forums though, too much work
21:47 🔗 klondike Same with the snaps, only comics and blogposts.
22:01 🔗 pizzaiolo has joined #archiveteam-bs
22:03 🔗 JAA klondike: So they'll shut down at 00:00 local time?
22:03 🔗 klondike Yeah CET time
22:03 🔗 klondike I'm keeping an eye for any new blogposts/comics
22:14 🔗 JAA I think I should be able to easily grab all new comics, snaps, and blog posts up to now.
22:14 🔗 JAA Forums are a mess.
22:15 🔗 pizzaiolo has quit IRC (Read error: Connection reset by peer)
22:16 🔗 pizzaiolo has joined #archiveteam-bs
22:18 🔗 JAA Comics, snaps, and blogs up to now done.
22:19 🔗 JAA That took 34 seconds.
22:21 🔗 JAA (I just regrabbed the homepage, /snaps/, and /blogs/ and let it grab any new links it didn't see before, by the way.)
22:22 🔗 JAA Checked before that there was an overlap on all of those lists, so it should really be complete.
22:22 🔗 schbirid has quit IRC (Quit: Leaving)
22:22 🔗 JAA Now I'm wondering if there's an easy way to get the forums.
22:24 🔗 klondike A hired gun shooting the admins until they give away the DB?
22:24 🔗 klondike (I was being sarcastic BTW)
22:25 🔗 klondike JAA: I made a slightly more up-to-date snapshot of those, though, so that the archive and homepage are current for the updated comics
22:26 🔗 JAA That might be the easiest way, yes.
22:26 🔗 JAA Ah yeah, that's good.
22:26 🔗 JAA Not that easy to do for me.
22:30 🔗 klondike JAA: no worries I just use my own url list and a slightly modified hook :D
22:34 🔗 JAA Ok, I have a way to regrab the forum threads as well.
22:38 🔗 jschwart has quit IRC (Quit: Konversation terminated!)
22:38 🔗 JAA It's a glorious five-line combination of sqlite3, grep, awk, tr, and sed. Sometimes, I'm a masochist.
22:38 🔗 klondike No Perl in the blend?
22:38 🔗 klondike You like it softcore ;)
22:39 🔗 JAA Well, I use grep's -P option, so there's Perl as well. :-P
22:41 🔗 klondike xD
22:43 🔗 JAA Forums done as well, that took about two minutes.
22:46 🔗 JAA Gah, fucking max_conn.
22:46 🔗 pizzaiolo has quit IRC (Remote host closed the connection)
22:48 🔗 JAA Three new comics, one new snap, and two new blog posts in the past half hour.
22:50 🔗 klondike Yeah I get those manually :)
22:51 🔗 klondike The admins basically tried to get everybody on the chat thing, so lots of load I guess :(
22:51 🔗 JAA "the chat thing"?
22:52 🔗 JAA That forum thread?
22:52 🔗 klondike You need to log in
22:52 🔗 klondike http://subcultura.es/lab/chat
22:52 🔗 JAA Ah
22:52 🔗 klondike If you want feel free to use stduser:stduser
22:52 🔗 klondike I made that one when I was trying to figure out how the site worked
22:53 🔗 JAA Can't access anything at the moment.
22:55 🔗 SketchCow IA continues to have slowdown
22:57 🔗 klondike JAA: same here
22:57 🔗 Kaz what's wrong? just a technical issue, or something else?
22:57 🔗 Kaz haven't been paying attention to anything
22:57 🔗 klondike Kaz: most likely overloaded servers on subcultura
22:57 🔗 Kaz eh
22:58 🔗 Kaz one site doesn't kill the IA
22:58 🔗 JAA klondike: Lol, someone even created a new comic a few minutes ago.
22:58 🔗 Kaz (my question was aimed at SketchCow if that wasn't clear)
23:00 🔗 klondike JAA: not especially relevant IMHO http://subcultura.es/webcomics/prueba111/344551.jpg
23:00 🔗 SketchCow It's just they slammed in a lot of code, did a lot of upgrades, inefficiencies are being found.
23:00 🔗 Kaz ah, okay
23:01 🔗 JAA klondike: Yeah, just noticed it.
23:01 🔗 JAA Yup, they did shut down at midnight.
23:01 🔗 klondike And it's gone error 500 everywhere
23:02 🔗 klondike JAA: you only need the .warc.gz files, right?
23:03 🔗 JAA Correct, those are the important ones.
23:03 🔗 JAA There should be ones with -00000.warc.gz etc., which are the actual data, and a -meta.warc.gz which has the wpull logs.
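The naming convention JAA describes can be sketched as a small sorter. This is only an illustration under assumptions: the regular expressions are inferred from the two suffixes mentioned in the channel (`-00000.warc.gz` and `-meta.warc.gz`), and the function names and example file names are hypothetical rather than part of wpull or any other tool.

```python
import re

# Assumed patterns, inferred only from the suffixes mentioned in the log.
DATA_RE = re.compile(r"-\d{5}\.warc\.gz$")   # e.g. something-00000.warc.gz
META_RE = re.compile(r"-meta\.warc\.gz$")    # e.g. something-meta.warc.gz


def classify_warc(filename):
    """Return 'data', 'meta', or 'other' based on the file name alone."""
    if META_RE.search(filename):
        return "meta"
    if DATA_RE.search(filename):
        return "data"
    return "other"


def split_warcs(filenames):
    """Group file names into data WARCs, meta WARCs, and everything else."""
    groups = {"data": [], "meta": [], "other": []}
    for name in filenames:
        groups[classify_warc(name)].append(name)
    return groups
```

For example, `split_warcs(["grab-00000.warc.gz", "grab-meta.warc.gz", "grab.log"])` would put the first name under `data`, the second under `meta`, and the third under `other`.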
23:03 🔗 klondike Oki
23:04 🔗 klondike JAA: https://klondike.es/lastwarcs.tar
23:05 🔗 klondike Those are for the last minute updates except for that last comic
23:07 🔗 JAA Got it, thanks.
23:09 🔗 klondike As for the last comic I have the page saved with firefox, but I doubt there is any worth in it.
23:14 🔗 klondike JAA: nothing to thank me for. Thank you for all you have done.
23:20 🔗 MrDignity has joined #archiveteam-bs
23:28 🔗 BlueMaxim has joined #archiveteam-bs
23:32 🔗 JAA klondike: I have grabbed the prueba111 "comic" in time, I think.
23:32 🔗 klondike o/
23:33 🔗 klondike In my case all started failing xD
23:33 🔗 JAA Yeah, same here, had to retry a dozen times.
23:33 🔗 JAA I am not looking forward to processing these WARCs.
23:33 🔗 klondike So you don't need the firefox dumped page then?
23:33 🔗 klondike Heh because of the max_conns?
23:34 🔗 JAA Need to filter out the max_conn bullshit and the spurious 404s.
23:34 🔗 klondike Ah
23:34 🔗 klondike Is there a magic tool that can do that?
23:35 🔗 JAA I don't think so.
23:35 🔗 JAA I have another case like this.
23:35 🔗 klondike Well, now I have to recover all my work shit
23:36 🔗 JAA But I didn't really look into it yet because I like to procrastinate.
23:36 🔗 klondike But I can try to make a python tool for that later on when I am less stressed.
23:36 🔗 JAA Probably need to write something up with warcat or so.
23:36 🔗 klondike This is around one or two months from now FYI :P
23:37 🔗 JAA https://github.com/chfoo/warcat
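The filtering JAA has in mind might look roughly like the sketch below. It is not based on warcat's API and does not read WARC files at all: it works on already-extracted (URL, status, body) tuples, and the `Response` type, the `MAX_CONN_MARKER` string, and the function names are all hypothetical. The actual text of the max_conn error page was not quoted in the channel. The rule used here is deliberately conservative: a 404 or max_conn response is dropped only when a clean response for the same URL also exists, so pages that are genuinely missing are kept.

```python
from typing import Iterable, List, NamedTuple


class Response(NamedTuple):
    url: str
    status: int
    body: str


# Hypothetical marker: the real error text was not quoted in the log.
MAX_CONN_MARKER = "max_conn"


def is_suspect(resp: Response) -> bool:
    """A response is suspect if it is a 404 or looks like a max_conn error page."""
    return resp.status == 404 or MAX_CONN_MARKER in resp.body


def filter_responses(responses: Iterable[Response]) -> List[Response]:
    """Drop suspect responses for URLs that also have a clean response."""
    responses = list(responses)
    good_urls = {r.url for r in responses if not is_suspect(r)}
    kept = []
    for r in responses:
        if is_suspect(r) and r.url in good_urls:
            continue  # a later or earlier retry succeeded, so discard this one
        kept.append(r)
    return kept
```

A URL that only ever returned 404 stays in the output, since there is no way to tell a spurious 404 from a real one without a successful retry.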
23:38 🔗 klondike JAA, want to upload a copy of the WARCs somewhere so that there is a backup? I can try to prepare something on my tiny server
23:40 🔗 JAA Yeah, I need to look into that for all of my stuff actually.
23:40 🔗 JAA Eventually, it'll all end up at IA of course.
23:40 🔗 klondike Okay, how much is "all of your stuff" in GB?
23:41 🔗 JAA I don't really know at the moment. Definitely over 1 TB.
23:41 🔗 klondike Ugh, I don't have so much storage on the server
23:42 🔗 JAA Yeah, I'll figure something out.
23:42 🔗 klondike More like 500G or so free
23:42 🔗 JAA A Hetzner storage box or something.
23:43 🔗 klondike I can try to fix something when I come back from FOSDEM though
23:43 🔗 klondike Maybe buy one of those USB3 HDDs and attach it to this computer so you can sftp into it
23:43 🔗 klondike Not ultra reliable or fast, but better than nothing.
23:44 🔗 JAA I should just get another HDD and put it in my tower.
23:45 🔗 JAA A Seagate Archive 8TB or something like that.
23:45 🔗 klondike What is preventing you from doing so?
23:48 🔗 JAA My laziness, mostly.
23:49 🔗 klondike Ahh, that can't be helped xD
23:50 🔗 JAA Also, my situation at home regarding computers and storage is really messy currently (lots of drives spread through multiple computers), and I kind of don't want to make that even worse while planning for a complete redesign.
23:51 🔗 klondike Get a permanent marker for the HDDs then?
23:51 🔗 klondike We use them at work, especially to write "DEAD don't use" on some of them :P
23:52 🔗 JAA DON'T DEAD USE INSIDE
23:54 🔗 klondike Helping the forensics team can be cool at times :P
23:54 🔗 JAA It's more about me never knowing where there's still free space (and how much), where a certain dataset is located, etc. Not really solvable with a marker.
23:55 🔗 JAA Just needs a full redesign really with a single location for all data.
23:56 🔗 JAA Anyway, getting off-topicky
23:57 🔗 JAA hook54321: I'm stopping the Catalan webcam grabs now.
23:59 🔗 JAA This is going to be fun to upload. :-|
23:59 🔗 JAA I have roughly 50k directories from those.
23:59 🔗 JAA With two files each
