#archiveteam-bs 2018-01-25,Thu


Time Nickname Message
00:25 🔗 cf godane I've been working on archiving that for years :)
00:25 🔗 cf had to restart when they changed format a year or so ago
00:25 🔗 cf but been steadily going since
00:27 🔗 Mateon1 has quit IRC (Read error: Operation timed out)
00:28 🔗 Mateon1 has joined #archiveteam-bs
00:30 🔗 cf looks like I've got through '90 done so far
00:32 🔗 godane i found maybe a way to grab 60 minutes
01:11 🔗 BlueMaxim has quit IRC (Leaving)
01:13 🔗 BlueMaxim has joined #archiveteam-bs
02:17 🔗 antomatic has quit IRC (Read error: Connection reset by peer)
02:18 🔗 antomatic has joined #archiveteam-bs
02:18 🔗 swebb sets mode: +o antomatic
02:18 🔗 decay_ has quit IRC (Read error: Operation timed out)
02:19 🔗 decay_ has joined #archiveteam-bs
02:46 🔗 username1 has joined #archiveteam-bs
02:48 🔗 zhongfu has quit IRC (Ping timeout: 260 seconds)
02:50 🔗 schbirid2 has quit IRC (Read error: Operation timed out)
03:09 🔗 zhongfu has joined #archiveteam-bs
03:53 🔗 underscor has quit IRC (Remote host closed the connection)
04:10 🔗 octothorp has quit IRC (Remote host closed the connection)
04:11 🔗 octothorp has joined #archiveteam-bs
04:40 🔗 nyaomi has quit IRC (Quit: meow)
04:51 🔗 qw3rty111 has joined #archiveteam-bs
04:57 🔗 qw3rty119 has quit IRC (Read error: Operation timed out)
05:03 🔗 nyaomi has joined #archiveteam-bs
05:33 🔗 nyaomi has quit IRC (Ping timeout: 245 seconds)
05:56 🔗 Fletcher has joined #archiveteam-bs
06:00 🔗 nyaomi has joined #archiveteam-bs
06:00 🔗 ranav has joined #archiveteam-bs
06:07 🔗 ranavalon has quit IRC (Read error: Operation timed out)
06:19 🔗 Pixi has quit IRC (Quit: Pixi)
06:28 🔗 Pixi has joined #archiveteam-bs
06:40 🔗 godane has quit IRC (Quit: Leaving.)
07:03 🔗 BlueMaxim has quit IRC (Read error: Connection reset by peer)
07:04 🔗 BlueMaxim has joined #archiveteam-bs
07:14 🔗 BlueMaxim has quit IRC (Ping timeout: 600 seconds)
07:16 🔗 BlueMaxim has joined #archiveteam-bs
07:24 🔗 godane has joined #archiveteam-bs
07:35 🔗 BlueMaxim has quit IRC (Ping timeout: 252 seconds)
07:47 🔗 BlueMaxim has joined #archiveteam-bs
07:56 🔗 BlueMaxim has quit IRC (Ping timeout: 252 seconds)
08:18 🔗 schbirid2 has joined #archiveteam-bs
08:23 🔗 username1 has quit IRC (Read error: Operation timed out)
08:29 🔗 BlueMaxim has joined #archiveteam-bs
09:06 🔗 pizzaiolo has joined #archiveteam-bs
10:03 🔗 BnAboyZ has quit IRC (Quit: Ping timeout (120 seconds))
10:08 🔗 BnAboyZ has joined #archiveteam-bs
10:17 🔗 SN4T14 has quit IRC (Ping timeout: 260 seconds)
10:22 🔗 SN4T14 has joined #archiveteam-bs
10:37 🔗 BlueMaxim has quit IRC (Read error: Connection reset by peer)
10:56 🔗 Fletcher_ has quit IRC (Read error: Operation timed out)
10:56 🔗 Fletcher_ has joined #archiveteam-bs
11:31 🔗 pizzaiolo has quit IRC (Remote host closed the connection)
11:33 🔗 pizzaiolo has joined #archiveteam-bs
11:36 🔗 pizzaiolo has quit IRC (Remote host closed the connection)
11:37 🔗 pizzaiolo has joined #archiveteam-bs
11:46 🔗 pizzaiolo has quit IRC (Remote host closed the connection)
11:47 🔗 pizzaiolo has joined #archiveteam-bs
11:47 🔗 pizzaiolo has quit IRC (Remote host closed the connection)
11:48 🔗 pizzaiolo has joined #archiveteam-bs
11:54 🔗 pizzaiolo has quit IRC (Remote host closed the connection)
11:55 🔗 pizzaiolo has joined #archiveteam-bs
11:55 🔗 pizzaiolo has quit IRC (Remote host closed the connection)
11:56 🔗 pizzaiolo has joined #archiveteam-bs
12:11 🔗 pizzaiolo has quit IRC (Remote host closed the connection)
12:13 🔗 pizzaiolo has joined #archiveteam-bs
12:14 🔗 pizzaiolo has quit IRC (Remote host closed the connection)
12:16 🔗 pizzaiolo has joined #archiveteam-bs
12:17 🔗 pizzaiolo has quit IRC (Remote host closed the connection)
12:19 🔗 pizzaiolo has joined #archiveteam-bs
12:19 🔗 pizzaiolo has quit IRC (Read error: Connection reset by peer)
12:21 🔗 pizzaiolo has joined #archiveteam-bs
12:22 🔗 pizzaiolo has quit IRC (Remote host closed the connection)
12:22 🔗 pizzaiolo has joined #archiveteam-bs
12:22 🔗 pizzaiolo has quit IRC (Read error: Connection reset by peer)
12:23 🔗 pizzaiolo has joined #archiveteam-bs
12:23 🔗 pizzaiolo has quit IRC (Read error: Connection reset by peer)
12:24 🔗 pizzaiolo has joined #archiveteam-bs
12:24 🔗 pizzaiolo has quit IRC (Remote host closed the connection)
12:28 🔗 pizzaiolo has joined #archiveteam-bs
12:28 🔗 pizzaiolo has quit IRC (Remote host closed the connection)
12:30 🔗 pizzaiolo has joined #archiveteam-bs
12:30 🔗 JAA sets mode: +b *!*pizzaiolo@186.205.2.*
12:30 🔗 pizzaiolo was kicked by JAA (Fix your connection please.)
12:47 🔗 RichardG_ has joined #archiveteam-bs
12:54 🔗 RichardG has quit IRC (Read error: Operation timed out)
12:56 🔗 RichardG_ has quit IRC (Read error: Connection reset by peer)
12:56 🔗 RichardG has joined #archiveteam-bs
13:30 🔗 JAA sets mode: -b *!*pizzaiolo@186.205.2.*
14:07 🔗 zyphlar has quit IRC (Max SendQ exceeded)
14:07 🔗 zyphlar has joined #archiveteam-bs
16:01 🔗 odemg has quit IRC (Quit: Leaving)
16:19 🔗 ubahn has joined #archiveteam-bs
16:42 🔗 klondike has joined #archiveteam-bs
16:44 🔗 klondike Hi, I want to report a popular Spanish DeviantArt-like site that is dying on the 31st of January as per http://subcultura.es/blogs/Neverwolf/anuncio-importante-32407/
16:45 🔗 klondike I have started archiving the site myself with httrack and managed to bypass age restrictions (and the we use cookies announcement), but there is no way I can get the whole thing done in time.
16:46 🔗 klondike Another hacker told me about you, so I'd like to know if and how I can get help from you archiving the site and its subdomains as you are a lot more skilled than I.
16:47 🔗 odemg has joined #archiveteam-bs
16:52 🔗 Igloo Ok, We can look into it
16:52 🔗 Igloo What are you using so far to do this? What limits have you come up against?
16:52 🔗 klondike2 has joined #archiveteam-bs
16:52 🔗 Igloo I see you mention cookies, have you generated a bunch of user accounts etc?
16:52 🔗 Igloo 16:52 < Igloo> Ok, We can look into it
16:52 🔗 Igloo 16:52 < Igloo> What are you using so far to do this? What limits have you come up against?
16:53 🔗 Igloo 16:52 -!- klondike2 [~klondike@c80-216-57-193.bredband.comhem.se] has joined #archiveteam-bs
16:53 🔗 Igloo 16:52 < Igloo> I see you mention cookies, have you generated a bunch of user accounts etc?
16:53 🔗 JAA Also, do you have any idea how large it is in total (how many subdomains, etc.)?
16:54 🔗 klondike2 Igloo: so far I have been using httrack, the list of domains I got from the ranking site but it should be reachable from the main domain
16:54 🔗 klondike2 Basically the site spans subcultura.es and *.subcultura.es
16:55 🔗 klondike2 All of them point to the same server though.
16:55 🔗 klondike2 I have used httrack, the main issue is that database accesses are slow (as said by an admin) so most dynamic pages (like forum posts or webcomic entries) are slow to generate
16:56 🔗 klondike has quit IRC (Quit: Page closed)
16:56 🔗 klondike2 is now known as klondike
16:56 🔗 klondike The total of subdomains is around 8100
16:57 🔗 klondike subcultura.es is the largest one as it contains a lot of things including author and user profiles amongst other things.
16:58 🔗 klondike I also have removed it from the list except for any content used from the subdomains as there is no way I can back up the whole thing.
16:58 🔗 klondike Regarding the limits so far
16:58 🔗 klondike 1. Some of the content from the subdomains is hosted on the main subcultura.es, so I'm afraid I may be missing something.
16:59 🔗 klondike 2. Some domains do start with a hyphen so I had to patch my server's glibc to be able to resolve those domains correctly
16:59 🔗 klondike 3. EU law requires a stupid banner saying they use cookies, I have figured out how to get rid of it.
17:01 🔗 klondike 4. Some sites are behind a "confirm you are over 18 y.o." post-wall; I have addressed that with cookies too.
17:01 🔗 klondike 5. I don't think more than 4 sockets sending requests from my IP will be appreciated; this is slow, so I had to exclude the main website (subcultura.es) except for content I got by reversing the modules of the subdomains.
17:02 🔗 JAA Re: 2. Wow, who the hell thought that's a good idea? I don't think such domain names are actually allowed by the RFCs even. Do you have an example?
17:02 🔗 klondike 6. Some sites have a hidden image that is shown only if you, as a registered user, carry out an action that can be done just once every 24 hours across the whole site, so I have also excluded that.
17:03 🔗 klondike JAA: http://--.subcultura.es/ it should work on Windows and Mac, not Linux though.
17:03 🔗 JAA Re: 4. Do those blocks ever appear on subcultura.es itself or only on subdomains? (I've clicked around a bit and didn't see one.)
17:04 🔗 klondike JAA: may appear on subcultura.es too, but I haven't found any. You can find them on http://666.subcultura.es for example.
17:04 🔗 klondike The related cookie for them is called Maria (the devs did have a good sense of humor)
17:04 🔗 JAA Checked RFC 1035, and that's definitely not legal: "<label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]"
17:05 🔗 JAA (I.e. every label of a domain has to start with a letter.)
17:05 🔗 klondike JAA: well, apparently the DNS doesn't complain about it and I can verify this works
17:06 🔗 JAA I guess it might depend on the resolver or something.
17:06 🔗 klondike It does
17:06 🔗 klondike https://twitter.com/klon/status/956334953473683456
17:07 🔗 klondike These are my comments about it, if you look for the debian/ubuntu bugs you'll see some other sites are affected
17:07 🔗 klondike Igloo: as for the user accounts, other than the hidden images (I think one per site) it shouldn't be necessary (I think)
17:10 🔗 JAA Interesting, thanks.
17:10 🔗 klondike I can share my current httrack filter, I'm mostly targeting the webcomics as I said.
17:11 🔗 JAA Also, I misread RFC 1035 earlier (well, read it too quickly); that <label> definition is only a recommendation to avoid problems.
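The label rule quoted from RFC 1035 above can be checked mechanically. The following is a minimal sketch of that preferred syntax only; the function name `is_preferred_label` is made up for illustration, and, as noted in the conversation, the grammar is a recommendation rather than something every resolver enforces.

```python
import re

# RFC 1035, section 2.3.1 (preferred name syntax):
#   <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
# A label starts with a letter, ends with a letter or digit, and has
# only letters, digits and hyphens in between; at most 63 characters.
_PREFERRED_LABEL = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_preferred_label(label: str) -> bool:
    """Return True if `label` fits the RFC 1035 preferred label syntax."""
    return len(label) <= 63 and _PREFERRED_LABEL.fullmatch(label) is not None


for candidate in ("subcultura", "--", "666"):
    print(candidate, is_preferred_label(candidate))
```

Under this strict reading both `--` and `666` fall outside the preferred syntax; RFC 1123 later relaxed the first-character rule to allow digits, so only the hyphen case remains unusual.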
17:13 🔗 JAA It looks like all subdomains of subcultura.es are simply CNAMEs to subcultura.es, so it should be easy enough to work around this in wpull.
17:13 🔗 klondike well I never used wpull but yes, all are cnames
17:13 🔗 JAA I think we might be able to grab subcultura.es through ArchiveBot.
17:14 🔗 JAA The subdomains will obviously need special treatment due to the cookies.
17:14 🔗 klondike I recommend you use the cookies also on the main domain, the session handling mechanism is the same AFAIK
17:15 🔗 JAA Yeah, I guess that would be best, but then we can't do it with ArchiveBot. (We can't set cookies there.)
17:17 🔗 klondike Even static cookies?
17:18 🔗 klondike Given how their code is designed, the "I agree with your cookie handling" cookie and the "I'm old enough" cookie are static.
17:18 🔗 JAA Yeah, we can't set any cookies manually in ArchiveBot.
17:19 🔗 JAA Not that it would be terribly difficult to implement, but nobody has done so until now.
17:21 🔗 klondike Is it python? If so I can try to do it after I sleep a bit :)
17:24 🔗 JAA ArchiveBot is our equivalent of Frankenstein's monster. It's written in Ruby and Python (plus some JavaScript/Haxe) and uses Redis and CouchDB.
17:26 🔗 JAA Also, it takes a long time for new code to become active on the pipelines (worker machines) because individual jobs run for months at a time.
17:26 🔗 JAA In other words, don't bother.
17:26 🔗 klondike But
17:26 🔗 klondike I can acquire a host on the same provider subcultura uses
17:27 🔗 klondike Set up a pipeline
17:27 🔗 klondike Add cookie support
17:27 🔗 klondike patch the glibc
17:27 🔗 klondike Try not to kill myself in the process :)
17:27 🔗 JAA Not worth the effort, to be honest.
17:28 🔗 klondike And then run it from there, will that work? It should also help minimize latencies
17:28 🔗 JAA It'd be easier to just run wpull directly.
17:28 🔗 JAA A machine in the same datacentre might be nice, but I'm not sure if it would really improve things.
17:29 🔗 JAA If the archival is limited by their server performance anyway, it probably doesn't matter.
17:30 🔗 klondike Well they also have some maximum connection limit
17:30 🔗 SmileyG has quit IRC (Read error: Operation timed out)
17:30 🔗 Smiley has joined #archiveteam-bs
17:30 🔗 JAA Yeah, but is that actually because of network issues or because more connections would simply overload their servers?
17:31 🔗 klondike I sincerely don't know
17:31 🔗 JAA I'm thinking about setting something up on my server in France. (That poor thing; it *just* finished Batoto...)
17:31 🔗 klondike I suspect because they hired some odd PaaS thing.
17:31 🔗 klondike The server is in Catalonia, the one I'm using is in Valencia with a Telefonica line.
17:33 🔗 klondike I know it has at least 100 Mbps download but it is a standard end-consumer fiber line so no guarantees about availability or the like.
17:33 🔗 JAA Yeah, I get a route via Madrid > Barcelona.
17:33 🔗 JAA I see.
17:34 🔗 klondike No luck with catnix? That's odd
17:34 🔗 pizzaiolo has joined #archiveteam-bs
17:35 🔗 JAA lol, the value of the Maria cookie is the first 12 numbers of the Fibonacci series.
17:36 🔗 klondike I told you these guys had a sense of humor, also the cookie with the registered user session is called... Oreo
17:37 🔗 JAA Is malditos_odiatoder the "I read the cookie disclaimer" cookie?
17:38 🔗 klondike yup
17:38 🔗 klondike value derp
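Since both cookies are described as static, a crawler could send them on every request. Below is a minimal sketch of building such a `Cookie` header. The cookie names and the `derp` value come from the conversation; the helper names are hypothetical, and the exact encoding of the `Maria` value is not given in the log, so it is left as a parameter rather than guessed.

```python
def build_static_cookies(maria_value: str) -> dict:
    """Collect the two static cookies discussed above.

    `maria_value` must be supplied by the caller: the log only says it is
    the first 12 Fibonacci numbers, not how they are written out.
    """
    return {
        "malditos_odiatoder": "derp",  # cookie-disclaimer cookie, value from the log
        "Maria": maria_value,          # age-confirmation cookie
    }


def cookie_header(cookies: dict) -> str:
    """Serialise a name -> value mapping into a Cookie request header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


# Example with an obviously fake placeholder for the Maria value.
print(cookie_header(build_static_cookies("PLACEHOLDER")))
```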
17:39 🔗 JAA Alright. What filters did you use so far?
17:39 🔗 klondike for httrack?
17:39 🔗 JAA Yeah
17:40 🔗 klondike -* (list of domains here) +*[name].subcultura.es/* +subcultura.es/fotos/* +subcultura.es/avatar/* +subcultura.es/img/* +subcultura.es/css/* +subcultura.es/personajes/* +subcultura.es/webcomics/* +subcultura.es/*.txt -*/"*
17:41 🔗 klondike fotos and avatar hold user profile pics and avatars
17:41 🔗 klondike img has static common images
17:41 🔗 klondike CSS the css data
17:41 🔗 klondike personajes contains the character pictures of some unconverted old comics
17:41 🔗 klondike webcomics has the rest of the custom content
17:42 🔗 klondike the .txt I'm just keeping for completeness (humans.txt and the like)
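The filter above is an allowlist: exclude everything, then re-include specific paths. A rough sketch of that behaviour follows, assuming that a later matching rule overrides an earlier one and using Python's `fnmatch` as a stand-in for httrack's own wildcard matching (the two differ in detail, for example in how `[name]` is interpreted, so that rule is left out). The rule list is a shortened version of the one quoted, and `allowed` is a made-up helper name.

```python
from fnmatch import fnmatchcase

# Shortened version of the quoted rules: "-" excludes, "+" includes.
RULES = [
    "-*",
    "+subcultura.es/img/*",
    "+subcultura.es/css/*",
    "+subcultura.es/webcomics/*",
    "+subcultura.es/*.txt",
]


def allowed(url: str, rules=RULES) -> bool:
    """Apply the rules in order; the last rule that matches decides."""
    verdict = False  # assumption for this sketch: nothing matches -> skip
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if fnmatchcase(url, pattern):
            verdict = sign == "+"
    return verdict


for url in ("subcultura.es/img/logo.png", "subcultura.es/foro/1"):
    print(url, allowed(url))
```

Writing the rules this way keeps the slow forum and profile pages out of the crawl unless a path is named explicitly.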
17:43 🔗 JAA Ah I see. I tend to do "grab everything except A, B, C".
17:47 🔗 klondike Yeah if you do that you end up on the forums and user profiles which take long to generate
17:48 🔗 klondike When I started I basically was alone and didn't even know you existed so I had to prioritize; from a cultural standpoint the comics have much more value than the forums
17:48 🔗 JAA True, but we probably want to archive those as well.
17:48 🔗 JAA Hmm
17:49 🔗 klondike Yes we want :)
17:50 🔗 klondike And a round by archivebot may be a good start for subcultura.es
17:53 🔗 klondike I can write code to detect the adult walls if needed
18:02 🔗 JAA I think my setup is working. I need to leave now, will finalise it and start the grab later.
18:03 🔗 JAA Reminder to myself to test whether --.subcultura.es is working correctly.
18:08 🔗 klondike I can send you patches for that but you'll need to recompile your libc
18:08 🔗 klondike Thanks JAA, knowing I'm not alone doing this helps a lot :)
18:27 🔗 Arctic has joined #archiveteam-bs
18:38 🔗 Arctic Miitomo is shutting down on May 9th.
18:38 🔗 Arctic Is there any way to archive the messages on the servers?
19:08 🔗 Arctic Perhaps we could use wget to grab the posts?...
19:19 🔗 rsznik has joined #archiveteam-bs
19:21 🔗 Arctic Hello rsznik!
19:21 🔗 Soni has quit IRC (Read error: Connection reset by peer)
19:55 🔗 Soni has joined #archiveteam-bs
19:58 🔗 Arctic Hello Soni!
20:06 🔗 Arctic has quit IRC (Quit: Page closed)
20:31 🔗 Ravenloft has joined #archiveteam-bs
20:33 🔗 BlueMaxim has joined #archiveteam-bs
21:14 🔗 JAA klondike: Thanks, but I won't do that. wpull has a resolve_dns hook, so I can just return the IP from there directly. Also gets rid of the potential network latency.
21:31 🔗 JAA --. works fine. :-)
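The workaround described above, answering DNS lookups from a hook instead of patching glibc, might look roughly like the sketch below. It is a standalone function only: how it is registered with wpull depends on the wpull version and is not shown, the IP address is a placeholder from the documentation range rather than the site's real address, and returning `None` to mean "use normal resolution" is an assumption about the hook's contract.

```python
from typing import Optional

SITE = "subcultura.es"
SITE_IP = "192.0.2.1"  # placeholder (TEST-NET-1), not the real server address


def resolve_dns(host: str) -> Optional[str]:
    """Return a fixed IP for the site and all of its subdomains.

    Because every subdomain is a CNAME to the main domain, names such as
    "--.subcultura.es" never need to reach the system resolver at all.
    """
    host = host.rstrip(".").lower()
    if host == SITE or host.endswith("." + SITE):
        return SITE_IP
    return None  # assumed to mean: fall back to ordinary DNS resolution


print(resolve_dns("--.subcultura.es"))
```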
21:40 🔗 pizzaiolo has quit IRC (Remote host closed the connection)
21:41 🔗 pizzaiolo has joined #archiveteam-bs
21:44 🔗 JAA Grab started.
21:46 🔗 Ravenloft has quit IRC (Ping timeout: 506 seconds)
22:27 🔗 JAA hook54321: I'll stop the Catalan webcam grabs at the end of the month.
22:29 🔗 Mateon1 has quit IRC (Read error: Operation timed out)
22:29 🔗 Mateon1 has joined #archiveteam-bs
22:39 🔗 Ravenloft has joined #archiveteam-bs
22:47 🔗 odemg has quit IRC (Read error: Operation timed out)
22:50 🔗 odemg has joined #archiveteam-bs
23:10 🔗 Ravenloft has quit IRC (Ping timeout: 252 seconds)
23:13 🔗 pizzaiolo has quit IRC (pizzaiolo)
