Time |
Nickname |
Message |
00:25
🔗
|
cf |
godane I've been working on archiving that for years :) |
00:25
🔗
|
cf |
had to restart when they changed format a year or so ago |
00:25
🔗
|
cf |
but been steadily going since |
00:27
🔗
|
|
Mateon1 has quit IRC (Read error: Operation timed out) |
00:28
🔗
|
|
Mateon1 has joined #archiveteam-bs |
00:30
🔗
|
cf |
looks like I've got through '90 done so far |
00:32
🔗
|
godane |
I may have found a way to grab 60 Minutes |
01:11
🔗
|
|
BlueMaxim has quit IRC (Leaving) |
01:13
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
02:17
🔗
|
|
antomatic has quit IRC (Read error: Connection reset by peer) |
02:18
🔗
|
|
antomatic has joined #archiveteam-bs |
02:18
🔗
|
|
swebb sets mode: +o antomatic |
02:18
🔗
|
|
decay_ has quit IRC (Read error: Operation timed out) |
02:19
🔗
|
|
decay_ has joined #archiveteam-bs |
02:46
🔗
|
|
username1 has joined #archiveteam-bs |
02:48
🔗
|
|
zhongfu has quit IRC (Ping timeout: 260 seconds) |
02:50
🔗
|
|
schbirid2 has quit IRC (Read error: Operation timed out) |
03:09
🔗
|
|
zhongfu has joined #archiveteam-bs |
03:53
🔗
|
|
underscor has quit IRC (Remote host closed the connection) |
04:10
🔗
|
|
octothorp has quit IRC (Remote host closed the connection) |
04:11
🔗
|
|
octothorp has joined #archiveteam-bs |
04:40
🔗
|
|
nyaomi has quit IRC (Quit: meow) |
04:51
🔗
|
|
qw3rty111 has joined #archiveteam-bs |
04:57
🔗
|
|
qw3rty119 has quit IRC (Read error: Operation timed out) |
05:03
🔗
|
|
nyaomi has joined #archiveteam-bs |
05:33
🔗
|
|
nyaomi has quit IRC (Ping timeout: 245 seconds) |
05:56
🔗
|
|
Fletcher has joined #archiveteam-bs |
06:00
🔗
|
|
nyaomi has joined #archiveteam-bs |
06:00
🔗
|
|
ranav has joined #archiveteam-bs |
06:07
🔗
|
|
ranavalon has quit IRC (Read error: Operation timed out) |
06:19
🔗
|
|
Pixi has quit IRC (Quit: Pixi) |
06:28
🔗
|
|
Pixi has joined #archiveteam-bs |
06:40
🔗
|
|
godane has quit IRC (Quit: Leaving.) |
07:03
🔗
|
|
BlueMaxim has quit IRC (Read error: Connection reset by peer) |
07:04
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
07:14
🔗
|
|
BlueMaxim has quit IRC (Ping timeout: 600 seconds) |
07:16
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
07:24
🔗
|
|
godane has joined #archiveteam-bs |
07:35
🔗
|
|
BlueMaxim has quit IRC (Ping timeout: 252 seconds) |
07:47
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
07:56
🔗
|
|
BlueMaxim has quit IRC (Ping timeout: 252 seconds) |
08:18
🔗
|
|
schbirid2 has joined #archiveteam-bs |
08:23
🔗
|
|
username1 has quit IRC (Read error: Operation timed out) |
08:29
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
09:06
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
10:03
🔗
|
|
BnAboyZ has quit IRC (Quit: Ping timeout (120 seconds)) |
10:08
🔗
|
|
BnAboyZ has joined #archiveteam-bs |
10:17
🔗
|
|
SN4T14 has quit IRC (Ping timeout: 260 seconds) |
10:22
🔗
|
|
SN4T14 has joined #archiveteam-bs |
10:37
🔗
|
|
BlueMaxim has quit IRC (Read error: Connection reset by peer) |
10:56
🔗
|
|
Fletcher_ has quit IRC (Read error: Operation timed out) |
10:56
🔗
|
|
Fletcher_ has joined #archiveteam-bs |
11:31
🔗
|
|
pizzaiolo has quit IRC (Remote host closed the connection) |
11:33
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
11:36
🔗
|
|
pizzaiolo has quit IRC (Remote host closed the connection) |
11:37
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
11:46
🔗
|
|
pizzaiolo has quit IRC (Remote host closed the connection) |
11:47
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
11:47
🔗
|
|
pizzaiolo has quit IRC (Remote host closed the connection) |
11:48
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
11:54
🔗
|
|
pizzaiolo has quit IRC (Remote host closed the connection) |
11:55
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
11:55
🔗
|
|
pizzaiolo has quit IRC (Remote host closed the connection) |
11:56
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
12:11
🔗
|
|
pizzaiolo has quit IRC (Remote host closed the connection) |
12:13
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
12:14
🔗
|
|
pizzaiolo has quit IRC (Remote host closed the connection) |
12:16
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
12:17
🔗
|
|
pizzaiolo has quit IRC (Remote host closed the connection) |
12:19
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
12:19
🔗
|
|
pizzaiolo has quit IRC (Read error: Connection reset by peer) |
12:21
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
12:22
🔗
|
|
pizzaiolo has quit IRC (Remote host closed the connection) |
12:22
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
12:22
🔗
|
|
pizzaiolo has quit IRC (Read error: Connection reset by peer) |
12:23
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
12:23
🔗
|
|
pizzaiolo has quit IRC (Read error: Connection reset by peer) |
12:24
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
12:24
🔗
|
|
pizzaiolo has quit IRC (Remote host closed the connection) |
12:28
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
12:28
🔗
|
|
pizzaiolo has quit IRC (Remote host closed the connection) |
12:30
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
12:30
🔗
|
|
JAA sets mode: +b *!*pizzaiolo@186.205.2.* |
12:30
🔗
|
|
pizzaiolo was kicked by JAA (Fix your connection please.) |
12:47
🔗
|
|
RichardG_ has joined #archiveteam-bs |
12:54
🔗
|
|
RichardG has quit IRC (Read error: Operation timed out) |
12:56
🔗
|
|
RichardG_ has quit IRC (Read error: Connection reset by peer) |
12:56
🔗
|
|
RichardG has joined #archiveteam-bs |
13:30
🔗
|
|
JAA sets mode: -b *!*pizzaiolo@186.205.2.* |
14:07
🔗
|
|
zyphlar has quit IRC (Max SendQ exceeded) |
14:07
🔗
|
|
zyphlar has joined #archiveteam-bs |
16:01
🔗
|
|
odemg has quit IRC (Quit: Leaving) |
16:19
🔗
|
|
ubahn has joined #archiveteam-bs |
16:42
🔗
|
|
klondike has joined #archiveteam-bs |
16:44
🔗
|
klondike |
Hi, I want to report a popular Spanish DeviantArt-like site that is dying on the 31st of January as per http://subcultura.es/blogs/Neverwolf/anuncio-importante-32407/ |
16:45
🔗
|
klondike |
I have started archiving the site myself with httrack and managed to bypass age restrictions (and the "we use cookies" announcement), but there is no way I can get the whole thing done in time. |
16:46
🔗
|
klondike |
Another hacker told me about you, so I'd like to know if and how I can get help from you archiving the site and its subdomains, as you are a lot more skilled than I. |
16:47
🔗
|
|
odemg has joined #archiveteam-bs |
16:52
🔗
|
Igloo |
Ok, We can look into it |
16:52
🔗
|
Igloo |
What are you using so far to do this? What limits have you come up against? |
16:52
🔗
|
|
klondike2 has joined #archiveteam-bs |
16:52
🔗
|
Igloo |
I see you mention cookies, have you generated a bunch of user accounts etc? |
16:52
🔗
|
Igloo |
16:52 < Igloo> Ok, We can look into it |
16:52
🔗
|
Igloo |
16:52 < Igloo> What are you using so far to do this? What limits have you come up against? |
16:53
🔗
|
Igloo |
16:52 -!- klondike2 [~klondike@c80-216-57-193.bredband.comhem.se] has joined #archiveteam-bs |
16:53
🔗
|
Igloo |
16:52 < Igloo> I see you mention cookies, have you generated a bunch of user accounts etc? |
16:53
🔗
|
JAA |
Also, do you have any idea how large it is in total (how many subdomains, etc.)? |
16:54
🔗
|
klondike2 |
Igloo: so far I have been using httrack, the list of domains I got from the ranking site but it should be reachable from the main domain |
16:54
🔗
|
klondike2 |
Basically the site spans subcultura.es and *.subcultura.es |
16:55
🔗
|
klondike2 |
All of them point to the same server though. |
16:55
🔗
|
klondike2 |
I have used httrack; the main issue is that database accesses are slow (as said by an admin), so most dynamic pages (like forum posts or webcomic entries) are slow to generate |
16:56
🔗
|
|
klondike has quit IRC (Quit: Page closed) |
16:56
🔗
|
|
klondike2 is now known as klondike |
16:56
🔗
|
klondike |
The total of subdomains is around 8100 |
16:57
🔗
|
klondike |
subcultura.es is the largest one as it contains a lot of things including author and user profiles amongst other things. |
16:58
🔗
|
klondike |
I also have removed it from the list except for any content used from the subdomains as there is no way I can back up the whole thing. |
16:58
🔗
|
klondike |
Regarding the limits so far |
16:58
🔗
|
klondike |
1. Some of the content from the subdomains is hosted on the main subcultura.es, so I'm afraid I may be missing something. |
16:59
🔗
|
klondike |
2. Some domains start with a hyphen, so I had to patch my server's glibc to be able to resolve those domains correctly |
16:59
🔗
|
klondike |
3. EU law requires a stupid banner saying they use cookies, I have figured out how to get rid of it. |
17:01
🔗
|
klondike |
4. Some sites are behind a "confirm you are over 18" POST wall; I have addressed that with cookies too. |
17:01
🔗
|
klondike |
5. I don't think more than 4 sockets sending requests from my IP will be appreciated, this is slow so I had to exclude the main website (subcultura.es) except for content I got by reversing the modules of the subdomains. |
17:02
🔗
|
JAA |
Re: 2. Wow, who the hell thought that's a good idea? I don't think such domain names are actually allowed by the RFCs even. Do you have an example? |
17:02
🔗
|
klondike |
6. Some sites have a hidden image if you as a registered user carry out an action that can only be done once every 24 hours in the whole site so I have also excluded that. |
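Limit 5 above (no more than 4 sockets from one IP) can be sketched as a bounded worker pool. This is only an illustration: `fetch_all`, `MAX_CONNECTIONS` and the `fetch` callable are hypothetical names, and httrack applies its own connection limit through its own settings rather than through code like this.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

# Assumption taken from the log: at most 4 simultaneous connections.
MAX_CONNECTIONS = 4


def fetch_all(urls: Iterable[str], fetch: Callable[[str], bytes]) -> List[bytes]:
    """Fetch every URL with at most MAX_CONNECTIONS requests in flight.

    `fetch` is a caller-supplied function (hypothetical) that downloads
    one URL and returns its body. Results come back in input order.
    """
    with ThreadPoolExecutor(max_workers=MAX_CONNECTIONS) as pool:
        return list(pool.map(fetch, urls))
```

The pool size is the only throttle here; a real crawl against a slow database-backed site would probably also want a delay between requests.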
17:03
🔗
|
klondike |
JAA: http://--.subcultura.es/ it should work on Windows and Mac, not Linux though. |
17:03
🔗
|
JAA |
Re: 4. Do those blocks ever appear on subcultura.es itself or only on subdomains? (I've clicked around a bit and didn't see one.) |
17:04
🔗
|
klondike |
JAA: may appear on subcultura.es too, but I haven't found any. You can find them on http://666.subcultura.es for example. |
17:04
🔗
|
klondike |
The related cookie for them is called Maria (the devs did have a good sense of humor) |
17:04
🔗
|
JAA |
Checked RFC 1035, and that's definitely not legal: "<label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]" |
17:05
🔗
|
JAA |
(I.e. every label of a domain has to start with a letter.) |
17:05
🔗
|
klondike |
JAA: well, apparently the DNS doesn't complain about it and I can verify this works |
17:06
🔗
|
JAA |
I guess it might depend on the resolver or something. |
17:06
🔗
|
klondike |
It does |
17:06
🔗
|
klondike |
https://twitter.com/klon/status/956334953473683456 |
17:07
🔗
|
klondike |
These are my comments about it, if you look for the debian/ubuntu bugs you'll see some other sites are affected |
17:07
🔗
|
klondike |
Igloo: as for the user accounts, other than the hidden images (I think one per site) it shouldn't be necessary (I think) |
17:10
🔗
|
JAA |
Interesting, thanks. |
17:10
🔗
|
klondike |
I can share my current httrack filter, I'm mostly targeting the webcomics as I said. |
17:11
🔗
|
JAA |
Also, I misread RFC 1035 earlier (well, read it too quickly); that <label> definition is only a recommendation to avoid problems. |
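The `<label>` rule quoted from RFC 1035 can be checked mechanically. The sketch below tests each label of a hostname against that preferred syntax (start with a letter, end with a letter or digit, only letters, digits and hyphens in between, at most 63 characters). The function name is made up for illustration, and, as noted above, failing this check does not mean a name cannot resolve.

```python
import re

# RFC 1035 section 2.3.1 preferred name syntax:
#   <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
PREFERRED_LABEL = re.compile(r"^[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?$")


def follows_preferred_syntax(hostname: str) -> bool:
    """Return True if every label matches the RFC 1035 preferred syntax."""
    labels = hostname.rstrip(".").split(".")
    return all(
        len(label) <= 63 and PREFERRED_LABEL.match(label) is not None
        for label in labels
    )


for name in ("subcultura.es", "--.subcultura.es", "666.subcultura.es"):
    print(name, follows_preferred_syntax(name))
```

Both `--.subcultura.es` and `666.subcultura.es` fail this strict check. RFC 1123 later relaxed the rule so that a label may start with a digit, but a leading hyphen remains outside both recommendations.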
17:13
🔗
|
JAA |
It looks like all subdomains of subcultura.es are simply CNAMEs to subcultura.es, so it should be easy enough to work around this in wpull. |
17:13
🔗
|
klondike |
well I never used wpull but yes, all are cnames |
17:13
🔗
|
JAA |
I think we might be able to grab subcultura.es through ArchiveBot. |
17:14
🔗
|
JAA |
The subdomains will obviously need special treatment due to the cookies. |
17:14
🔗
|
klondike |
I recommend you use the cookies also on the main domain, the session handling mechanism is the same AFAIK |
17:15
🔗
|
JAA |
Yeah, I guess that would be best, but then we can't do it with ArchiveBot. (We can't set cookies there.) |
17:17
🔗
|
klondike |
Even static cookies? |
17:18
🔗
|
klondike |
Given how their code is designed, the "I agree with cookie handling" cookie and the "I'm old enough" cookie are static. |
17:18
🔗
|
JAA |
Yeah, we can't set any cookies manually in ArchiveBot. |
17:19
🔗
|
JAA |
Not that it would be terribly difficult to implement, but nobody has done so until now. |
17:21
🔗
|
klondike |
Is it python? If so I can try to do it after I sleep a bit :) |
17:24
🔗
|
JAA |
ArchiveBot is our equivalent of Frankenstein's monster. It's written in Ruby and Python (plus some JavaScript/Haxe) and uses Redis and CouchDB. |
17:26
🔗
|
JAA |
Also, it takes a long time for new code to become active on the pipelines (worker machines) because individual jobs run for months at a time. |
17:26
🔗
|
JAA |
In other words, don't bother. |
17:26
🔗
|
klondike |
But |
17:26
🔗
|
klondike |
I can acquire a host on the same provider subcultura uses |
17:27
🔗
|
klondike |
Set up a pipeline |
17:27
🔗
|
klondike |
Add cookie support |
17:27
🔗
|
klondike |
patch the glibc |
17:27
🔗
|
klondike |
Try not to kill myself in the process :) |
17:27
🔗
|
JAA |
Not worth the effort, to be honest. |
17:28
🔗
|
klondike |
And then run it from there, will that work? It should also help minimize latencies |
17:28
🔗
|
JAA |
It'd be easier to just run wpull directly. |
17:28
🔗
|
JAA |
A machine in the same datacentre might be nice, but I'm not sure if it would really improve things. |
17:29
🔗
|
JAA |
If the archival is limited by their server performance anyway, it probably doesn't matter. |
17:30
🔗
|
klondike |
Well they also have some maximum connection limit |
17:30
🔗
|
|
SmileyG has quit IRC (Read error: Operation timed out) |
17:30
🔗
|
|
Smiley has joined #archiveteam-bs |
17:30
🔗
|
JAA |
Yeah, but is that actually because of network issues or because more connections would simply overload their servers? |
17:31
🔗
|
klondike |
I sincerely don't know |
17:31
🔗
|
JAA |
I'm thinking about setting something up on my server in France. (That poor thing; it *just* finished Batoto...) |
17:31
🔗
|
klondike |
I suspect it's because they hired some odd PaaS thing. |
17:31
🔗
|
klondike |
The server is in Catalonia, the one I'm using is in Valencia with a Telefonica line. |
17:33
🔗
|
klondike |
I know it has at least 100 Mbps download but it is a standard end-consumer fiber line so no guarantees about availability or the like. |
17:33
🔗
|
JAA |
Yeah, I get a route via Madrid > Barcelona. |
17:33
🔗
|
JAA |
I see. |
17:34
🔗
|
klondike |
No luck with catnix? That's odd |
17:34
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
17:35
🔗
|
JAA |
lol, the value of the Maria cookie is the first 12 numbers of the Fibonacci series. |
17:36
🔗
|
klondike |
I told you these guys had a sense of humor, also the cookie with the registered user session is called... Oreo |
17:37
🔗
|
JAA |
Is malditos_odiatoder the "I read the cookie disclaimer" cookie? |
17:38
🔗
|
klondike |
yup |
17:38
🔗
|
klondike |
value derp |
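The two static cookies discussed above could be sent as a fixed `Cookie` request header. A minimal sketch, assuming the cookie names from the log; the exact encoding of the `Maria` value is not given here, so it stays a placeholder, and `build_cookie_header` is a hypothetical helper.

```python
from typing import Mapping


def build_cookie_header(cookies: Mapping[str, str]) -> str:
    """Join name/value pairs into the value of a Cookie request header."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


# Placeholder: the log only says the value is the first 12 Fibonacci
# numbers, not how they are encoded, so the real string is not guessed.
MARIA_VALUE = "REPLACE_ME"

STATIC_COOKIES = {
    "Maria": MARIA_VALUE,            # "I'm over 18" cookie
    "malditos_odiatoder": "derp",    # "I read the cookie disclaimer" cookie
}

print("Cookie: " + build_cookie_header(STATIC_COOKIES))
```

Because both values are static, the same header can be attached to every request without any login or session handling.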
17:39
🔗
|
JAA |
Alright. What filters did you use so far? |
17:39
🔗
|
klondike |
for httrack? |
17:39
🔗
|
JAA |
Yeah |
17:40
🔗
|
klondike |
-* (list of domains here) +*[name].subcultura.es/* +subcultura.es/fotos/* +subcultura.es/avatar/* +subcultura.es/img/* +subcultura.es/css/* +subcultura.es/personajes/* +subcultura.es/webcomics/* +subcultura.es/*.txt -*/"* |
17:41
🔗
|
klondike |
fotos and avatar hold user profile pics and avatars |
17:41
🔗
|
klondike |
img has static common images |
17:41
🔗
|
klondike |
css has the CSS data |
17:41
🔗
|
klondike |
personajes contains the character pictures of some unconverted old comics |
17:41
🔗
|
klondike |
webcomics has the rest of the custom content |
17:42
🔗
|
klondike |
the .txt I'm just keeping for completeness (humans.txt and the like) |
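The filter above is an allow-list: exclude everything, then re-include selected paths. A rough Python model of that evaluation, assuming a later rule overrides an earlier one and using `fnmatch` globbing as a stand-in for httrack's own pattern syntax. The `*[name]` token and the elided domain list are simplified to `*.subcultura.es/*`, and `is_allowed` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple

# (sign, pattern) pairs in the order given in the log. URLs are matched
# without the scheme, e.g. "subcultura.es/img/logo.png".
RULES: List[Tuple[str, str]] = [
    ("-", "*"),
    ("+", "*.subcultura.es/*"),   # simplified from +*[name].subcultura.es/*
    ("+", "subcultura.es/fotos/*"),
    ("+", "subcultura.es/avatar/*"),
    ("+", "subcultura.es/img/*"),
    ("+", "subcultura.es/css/*"),
    ("+", "subcultura.es/personajes/*"),
    ("+", "subcultura.es/webcomics/*"),
    ("+", "subcultura.es/*.txt"),
    ("-", '*/"*'),                # drop malformed links containing /"
]


def is_allowed(url: str, rules: List[Tuple[str, str]] = RULES) -> bool:
    """Apply rules in order; the last matching rule decides."""
    allowed = False  # assumption: nothing is fetched unless a rule allows it
    for sign, pattern in rules:
        if fnmatchcase(url, pattern):
            allowed = sign == "+"
    return allowed


print(is_allowed("subcultura.es/webcomics/a.png"))  # allowed by a + rule
print(is_allowed("subcultura.es/foro/tema-1"))      # only the leading -* matches
```

This shows why the forums and user profiles on the main domain are skipped: no `+` rule covers them, so the leading `-*` still applies.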
17:43
🔗
|
JAA |
Ah I see. I tend to do "grab everything except A, B, C". |
17:47
🔗
|
klondike |
Yeah, if you do that you end up on the forums and user profiles, which take long to generate |
17:48
🔗
|
klondike |
When I started I was basically alone and didn't even know you existed, so I had to prioritize; from a cultural standpoint the comics have much more value than the forums |
17:48
🔗
|
JAA |
True, but we probably want to archive those as well. |
17:48
🔗
|
JAA |
Hmm |
17:49
🔗
|
klondike |
Yes we want :) |
17:50
🔗
|
klondike |
And a round by archivebot may be a good start for subcultura.es |
17:53
🔗
|
klondike |
I can write code to detect the adult walls if needed |
18:02
🔗
|
JAA |
I think my setup is working. I need to leave now, will finalise it and start the grab later. |
18:03
🔗
|
JAA |
Reminder to myself to test whether --.subcultura.es is working correctly. |
18:08
🔗
|
klondike |
I can send you patches for that but you'll need to recompile your libc |
18:08
🔗
|
klondike |
Thanks JAA knowing I'm not alone doing this helps a lot :) |
18:27
🔗
|
|
Arctic has joined #archiveteam-bs |
18:38
🔗
|
Arctic |
Miitomo is shutting down on May 9th. |
18:38
🔗
|
Arctic |
Is there any way to archive the messages on the servers? |
19:08
🔗
|
Arctic |
Perhaps we could use wget to grab the posts?... |
19:19
🔗
|
|
rsznik has joined #archiveteam-bs |
19:21
🔗
|
Arctic |
Hello rsznik! |
19:21
🔗
|
|
Soni has quit IRC (Read error: Connection reset by peer) |
19:55
🔗
|
|
Soni has joined #archiveteam-bs |
19:58
🔗
|
Arctic |
Hello Soni! |
20:06
🔗
|
|
Arctic has quit IRC (Quit: Page closed) |
20:31
🔗
|
|
Ravenloft has joined #archiveteam-bs |
20:33
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
21:14
🔗
|
JAA |
klondike: Thanks, but I won't do that. wpull has a resolve_dns hook, so I can just return the IP from there directly. Also gets rid of the potential network latency. |
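The idea of answering DNS lookups from a hook, instead of patching glibc, might look roughly like this. It is a sketch under assumptions: the IP address is a placeholder from the documentation range, returning `None` is assumed to mean "fall back to normal resolution", and how the function is registered with wpull depends on the wpull version and is not shown.

```python
from typing import Optional

SITE_DOMAIN = "subcultura.es"
# Hypothetical placeholder (RFC 5737 documentation range), not the real server.
SITE_IP = "203.0.113.10"


def resolve_dns(host: str) -> Optional[str]:
    """Return a fixed IP for the site and all of its subdomains.

    Every subdomain is a CNAME to the main domain, so one address covers
    them all, including names such as "--.subcultura.es" that some
    resolvers refuse to look up.
    """
    host = host.rstrip(".").lower()
    if host == SITE_DOMAIN or host.endswith("." + SITE_DOMAIN):
        return SITE_IP
    return None  # assumption: signals "use the default resolver"


print(resolve_dns("--.subcultura.es"))
print(resolve_dns("example.org"))
```

Skipping the lookup this way also removes the resolver round-trip for each of the roughly 8100 subdomains.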
21:31
🔗
|
JAA |
--. works fine. :-) |
21:40
🔗
|
|
pizzaiolo has quit IRC (Remote host closed the connection) |
21:41
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
21:44
🔗
|
JAA |
Grab started. |
21:46
🔗
|
|
Ravenloft has quit IRC (Ping timeout: 506 seconds) |
22:27
🔗
|
JAA |
hook54321: I'll stop the Catalan webcam grabs at the end of the month. |
22:29
🔗
|
|
Mateon1 has quit IRC (Read error: Operation timed out) |
22:29
🔗
|
|
Mateon1 has joined #archiveteam-bs |
22:39
🔗
|
|
Ravenloft has joined #archiveteam-bs |
22:47
🔗
|
|
odemg has quit IRC (Read error: Operation timed out) |
22:50
🔗
|
|
odemg has joined #archiveteam-bs |
23:10
🔗
|
|
Ravenloft has quit IRC (Ping timeout: 252 seconds) |
23:13
🔗
|
|
pizzaiolo has quit IRC (pizzaiolo) |