#archiveteam-bs 2018-01-26,Fri


Time Nickname Message
00:56 🔗 zyphlar has quit IRC (Max SendQ exceeded)
00:56 🔗 zyphlar has joined #archiveteam-bs
00:58 🔗 odemg arkiver, SketchCow we know about this? https://the-eye.eu/public/Books/IT%20Various/Learning%20SPARQL%2C%202nd%20Edition.pdf
00:58 🔗 odemg that was meant to be this: https://www.kotaku.com.au/2018/01/miitomo-is-shutting-down-in-may/
01:47 🔗 octothorp has quit IRC (Read error: Connection reset by peer)
01:47 🔗 octothorp has joined #archiveteam-bs
02:04 🔗 username1 has joined #archiveteam-bs
02:10 🔗 schbirid2 has quit IRC (Read error: Operation timed out)
02:27 🔗 zyphlar has quit IRC (Max SendQ exceeded)
02:27 🔗 zyphlar has joined #archiveteam-bs
02:44 🔗 zyphlar has quit IRC (Max SendQ exceeded)
02:44 🔗 zyphlar has joined #archiveteam-bs
04:39 🔗 ubahn has quit IRC (Ping timeout: 260 seconds)
04:41 🔗 ubahn has joined #archiveteam-bs
04:50 🔗 qw3rty112 has joined #archiveteam-bs
04:56 🔗 qw3rty111 has quit IRC (Read error: Operation timed out)
05:49 🔗 ranav has quit IRC (Remote host closed the connection)
05:50 🔗 ranavalon has joined #archiveteam-bs
06:58 🔗 godane https://www.reddit.com/r/linux/comments/7swi6r/kernelorg_is_collecting_pre2000_lkml_archives/
07:43 🔗 username1 has quit IRC (Quit: Leaving)
08:51 🔗 JAA odemg: Yeah, two people mentioned it previously in the main chan.
08:54 🔗 JAA The subcultura.es grab is running well. 212k URLs retrieved (2.5 GiB), 1.2M in the queue.
09:53 🔗 pizzaiolo has joined #archiveteam-bs
10:09 🔗 dashcloud has quit IRC (Ping timeout: 493 seconds)
10:11 🔗 dashcloud has joined #archiveteam-bs
10:23 🔗 schbirid has joined #archiveteam-bs
10:41 🔗 Valentine has joined #archiveteam-bs
11:44 🔗 BlueMaxim has quit IRC (Leaving)
12:08 🔗 schbirid has quit IRC (Ping timeout: 252 seconds)
12:13 🔗 schbirid has joined #archiveteam-bs
12:27 🔗 klondike You seem to be doing better than me JAA, you didn't hit the maxconn limit?
12:41 🔗 JAA klondike: I don't think so. What does their limiting look like? I have only seen very few timeouts, connection resets, 429s, etc. so far.
12:41 🔗 klondike JAA: looks like a redirect to a page saying maxconn
12:41 🔗 JAA Hmm
12:41 🔗 klondike Better said, to a page with maxconn in the URL itself
12:42 🔗 JAA Ah, ok, let me check.
12:42 🔗 JAA Nope, no such URLs in the logs.
12:43 🔗 JAA Oh, max_conn
12:43 🔗 JAA Yeah, I got a few of those.
12:44 🔗 klondike Those may need refetching
12:45 🔗 JAA It's a 302 redirect to http://subcultura.es/max_conn to be precise.
12:45 🔗 klondike Yeah that could be
12:45 🔗 klondike And an infinite one too
12:46 🔗 JAA Yeah, just saw that in the logs, lol.
12:49 🔗 JAA Hmm, this will be a bit annoying.
12:49 🔗 klondike JAA: if you reuse http/1.1 connections it should be okay
12:51 🔗 JAA I'm not sure if wpull does that, to be honest.
12:54 🔗 klondike It should unless you use --no-http-keep-alive
12:54 🔗 klondike The question is for how long does it keep the connection
13:02 🔗 JAA I've implemented a workaround. 302s to that max_conn page are now considered errors.
13:03 🔗 JAA Meaning those URLs will be retried at the end.
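The workaround can be sketched in isolation (this is not JAA's actual wpull hook, just the check it would need; the function name is hypothetical):

```python
# Treat a 302 whose Location header points at the max_conn page as a
# retryable error instead of following the (infinite) redirect.
MAX_CONN_URL = "http://subcultura.es/max_conn"

def is_max_conn_redirect(status_code, location):
    # Only a 302 landing exactly on the max_conn page counts.
    return status_code == 302 and location == MAX_CONN_URL

assert is_max_conn_redirect(302, "http://subcultura.es/max_conn")
assert not is_max_conn_redirect(302, "http://subcultura.es/user/florvieja/")
assert not is_max_conn_redirect(200, None)
```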
13:05 🔗 klondike Cool
13:06 🔗 klondike JAA: what happens with the ones that have already failed?
13:06 🔗 JAA I'm marking those as errors manually.
13:07 🔗 klondike Oh, I hope you didn't get many :(
13:07 🔗 JAA There are just under 1000 occurrences of "max_conn" in the log files.
13:08 🔗 JAA But there are two lines for each URL (request + response), and many of them are the infinite loops.
13:08 🔗 klondike :(
13:09 🔗 JAA Only 32 URLs actually failed.
13:10 🔗 JAA At most, that is.
13:11 🔗 JAA There are 32 URLs which produced a 302 redirect, but I can't easily check which of these redirected to the max_conn page.
13:11 🔗 JAA Probably most of them though.
13:11 🔗 JAA Out of over 300k URLs by now, I'd say that's a pretty good ratio. :-)
13:12 🔗 klondike I'm thinking, do you prefer if I set up a dedicated server in Spain for you?
13:12 🔗 klondike I called their hosting service yesterday and they no longer offer servers with root access, but I can try to find one somewhere else.
13:16 🔗 JAA I don't think it would make too big of a difference. The main bottlenecks are Subcultura's server (some requests take quite a long time) and wpull's HTML parsing, not the network.
13:17 🔗 klondike Well I can pay for something with a fast processor too :P
13:18 🔗 klondike Not for subcultura but for your side
13:19 🔗 JAA I was about to say, Subcultura would certainly appreciate a better server. :-P
13:21 🔗 klondike I doubt I can fix anything on that side, but I can ask the admin I'm talking with
13:26 🔗 JAA I've switched wpull's HTML parsing to lxml. That should be significantly faster.
13:28 🔗 klondike Is there any difference between one and the other?
13:30 🔗 JAA html5lib is said to be a bit more robust, but it's pure-Python and therefore really slow. lxml is based on libxml2, i.e. C, and seems to work well enough for almost all cases.
13:31 🔗 JAA By the way, I only had to mark 13 URLs as errors; the other 19 were already considered errors because of the redirect loops (I assume).
13:32 🔗 klondike Aha
13:32 🔗 klondike Well the subcultura codebase is from 2012 or so, not much HTML5 I guess
13:33 🔗 JAA lxml should handle HTML5 just fine. I think it's only edge cases and broken markup which *can* lead to errors.
13:40 🔗 klondike Ahh
13:41 🔗 klondike there might be some broken markup
13:41 🔗 klondike Maybe, I really can't say, I found some weird links through httrack
13:49 🔗 JAA Yeah, I've seen stuff like http://http://conmaskara.blogspot.com.es/.blogspot.com/ for example.
13:50 🔗 JAA But the markup regarding that link is fine.
13:50 🔗 JAA (It's from http://subcultura.es/user/florvieja/ )
13:54 🔗 JAA Well, I don't really see a speedup compared to before I switched to lxml. So I guess the HTML parser isn't limiting after all.
13:56 🔗 klondike Well the command top may help you see CPU usage there.
13:57 🔗 JAA Yeah, I am monitoring CPU usage, but it fluctuates strongly.
13:57 🔗 JAA htop > top, by the way.
13:58 🔗 klondike I guess I'm too old for all those fancy things like htop :P
14:00 🔗 JAA Average load on the machine did go down compared to before, by the way. It's just that the request rate is still about the same.
14:21 🔗 REiN^ has quit IRC (Read error: Operation timed out)
14:41 🔗 klondike *shrug*
14:42 🔗 * klondike mumbles in Spanish
14:55 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
14:57 🔗 RichardG has joined #archiveteam-bs
14:59 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
15:01 🔗 RichardG has joined #archiveteam-bs
15:02 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
15:07 🔗 Uzerus JAA: can you analyze my piece of code? i don't know why my code is running without the things inside the if
15:07 🔗 Uzerus look,
15:08 🔗 Uzerus https://pastebin.com/stdekh5H
15:09 🔗 Uzerus it looks like it's not processing the for loop and goes to else:
15:10 🔗 schbirid which one?
15:11 🔗 Uzerus all
15:11 🔗 schbirid are your files empty?
15:11 🔗 klondike Uzerus: is it python2 or 3?
15:12 🔗 Uzerus https://pastebin.com/yK18JEKu for example, python 3
15:12 🔗 Uzerus 3.5
15:13 🔗 klondike Ahh
15:13 🔗 Uzerus file ignore: empty, file done is not empty
15:13 🔗 klondike You see you have one open in mode rt and the other in mode r, true?
15:13 🔗 Uzerus but... it can make sense IF the file has been opened as empty and not saved!
15:14 🔗 JAA You're also opening the files many, many times instead of just once.
15:14 🔗 Uzerus yup JAA, will repair it
15:15 🔗 JAA readlines() returns a list of lines, but each line still has the linebreak at the end. Did you account for that?
15:16 🔗 Uzerus eh, no
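JAA's readlines() warning is easy to demonstrate; a minimal sketch with an in-memory file (the URLs are made up):

```python
# readlines() keeps the trailing newline on every element, so naive
# equality checks against stripped strings silently fail.
import io

f = io.StringIO("http://example.com/a\nhttp://example.com/b\n")
lines = f.readlines()
assert lines[0] == "http://example.com/a\n"   # newline still attached
assert lines[0] != "http://example.com/a"     # so this comparison fails
# Strip the line endings before comparing:
cleaned = [line.rstrip("\n") for line in lines]
assert cleaned == ["http://example.com/a", "http://example.com/b"]
```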
15:16 🔗 klondike Uzerus: also your execution time will be N*M with N being the input file lines and M the ignore file lines.
15:17 🔗 Uzerus i lost ~6 hours on how click works, useless documentation on the website
15:17 🔗 JAA Use argparse next time. click seems to be a wrapper around optparse, which should just die already.
15:17 🔗 Uzerus klondike: it's a prototype, i want to do regex ignores in the future
15:18 🔗 Uzerus not copy & paste from grab-site f.e.
15:18 🔗 klondike Oki Uzerus just saying because if the ignore file can fit in memory, a python set will make that O(N+M)
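klondike's set idea sketched out; `filter_urls` and the sample data are hypothetical, not Uzerus's actual script:

```python
# Load the ignore list once into a set for O(1) membership tests,
# turning the O(N*M) nested scan into O(N+M) overall.
def filter_urls(input_lines, ignore_lines):
    ignore = {line.rstrip("\n") for line in ignore_lines}  # O(M) build
    kept = []
    for line in input_lines:          # O(N) scan, O(1) lookup each
        url = line.rstrip("\n")
        if url not in ignore:
            kept.append(url)
    return kept

urls = ["http://a.example/\n", "http://b.example/\n", "http://c.example/\n"]
ignores = ["http://b.example/\n"]
assert filter_urls(urls, ignores) == ["http://a.example/", "http://c.example/"]
```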
15:20 🔗 Uzerus JAA: fortunately i have a version of this in ARGPARSE, ppl on freenode's #python told me "argparse is not the easy way"
15:22 🔗 JAA No idea what they mean by that.
15:23 🔗 JAA You might have to implement the file existence check yourself, but otherwise, it would be very similar (just function calls instead of decorators, and you have a namespace object in the end instead of individual variables).
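A hedged sketch of what the argparse version JAA describes looks like: plain function calls instead of click decorators, and one Namespace object holding everything at the end. The option names and filenames are hypothetical:

```python
# argparse is in the standard library; no third-party wrapper needed.
import argparse

parser = argparse.ArgumentParser(description="Filter a URL list.")
parser.add_argument("--input", required=True, help="gzipped input log")
parser.add_argument("--ignore", required=True, help="ignore list file")
parser.add_argument("--done", required=True, help="resume/done file")
# (argparse.FileType could even open and validate the files for you.)

args = parser.parse_args(["--input", "log.gz", "--ignore", "ignore.txt",
                          "--done", "done.txt"])
assert args.input == "log.gz"       # all values live on one namespace
assert args.ignore == "ignore.txt"
```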
15:30 🔗 klondike JAA: also you have a lot of duplicated code
15:32 🔗 JAA Uzerus: ^
15:34 🔗 klondike Uzerus: here you can see a reasonably simple way to have different opening functions https://pastebin.com/FSMDd2jn
15:34 🔗 klondike Basically if the interface is the same, assign the function you want to call to a variable and then use the variable to do the call
15:35 🔗 JAA You could even do with (gzip.open if gzipp else open)(...) as logfile: but that's harder to read.
15:35 🔗 Uzerus hah, klondike, nice code
15:36 🔗 klondike Uzerus: nah, it's just one of many small tricks I have learnt over time.
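klondike's pastebin is a dead link now, but the trick he describes — assigning the opener function to a variable and calling through it — can be sketched like this (the filename and data are hypothetical):

```python
# Pick the opener once based on the filename, then use it uniformly;
# gzip.open and the builtin open share the same interface here.
import gzip
import os
import tempfile

def read_lines(path):
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as logfile:   # "rt" gives text mode for both
        return [line.rstrip("\n") for line in logfile]

# Demo with a temporary gzipped file:
tmp = os.path.join(tempfile.mkdtemp(), "log.gz")
with gzip.open(tmp, "wt") as f:
    f.write("http://a.example/\nhttp://b.example/\n")
assert read_lines(tmp) == ["http://a.example/", "http://b.example/"]
```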
15:37 🔗 Uzerus last thing on my mind is that the file was not saved (and the program is operating on an empty file all the time)
15:39 🔗 klondike Uzerus: just cat the file and see
15:39 🔗 klondike (Supposing you are running this on a Linux system)
15:39 🔗 klondike you can also do this (since it's gzipped)
15:39 🔗 Uzerus it saves to the 'done' file but it does not save to the 'out' file
15:39 🔗 klondike gzip -dc file
15:40 🔗 Uzerus only the input file is gzipped, in read-only mode
15:40 🔗 JAA less and zless are also useful.
15:41 🔗 Uzerus JAA: i am just learning :D
15:41 🔗 JAA Also, zcat if you really want to output the entire file.
15:41 🔗 JAA Because remembering options is hard.
15:41 🔗 JAA But in my experience, you rarely actually want to print the entire file. less also gives you the advantage of being able to search for stuff etc.
15:41 🔗 klondike Uzerus: less and zless are other commands :)
15:42 🔗 Uzerus not only that, i never made any program in my life, it's a very useful experience, building this app
15:47 🔗 JAA klondike: My Subcultura grab is now in the forums and has slowed down to a crawl. :-/
15:48 🔗 klondike I'd say I'm surprised but... :P
15:48 🔗 JAA Yeah
15:48 🔗 JAA Oh well
15:57 🔗 klondike I still would kill for just a full backup of the site even if that meant having to modify their old PHP code to generate the pages locally xD
16:01 🔗 JAA Probably wouldn't even have to modify much if anything at all.
16:01 🔗 JAA But yeah, if they're willing to give out the data, that's an option in principle.
16:04 🔗 JAA data + all relevant information to set up a similar system*
16:05 🔗 klondike I don't think they'll do that though
16:05 🔗 klondike At least didn't look like
16:14 🔗 JAA Yeah, I'm not surprised. Most people wouldn't do that.
16:18 🔗 Uzerus analyzed my code once more, i have a logical bug....
16:21 🔗 Uzerus heh, mindduck
16:23 🔗 Uzerus ewww... how to do that ...
16:36 🔗 icedice has joined #archiveteam-bs
17:03 🔗 Uzerus is it possible that too many variables (information) can cause a bug?
17:04 🔗 Uzerus JAA:?
17:11 🔗 JAA Uzerus: You could run out of memory, but other than that, I'd be quite surprised if you managed to break things by creating many variables.
17:22 🔗 klondike Unless you overwrite something
17:22 🔗 klondike For example in python 2 you can do True=False and enjoy the mess ;)
17:23 🔗 Uzerus WHAT?!
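klondike's example only compiles on Python 2; Python 3 rejects `True = False` as a SyntaxError. But shadowing ordinary builtins still works fine there and causes the same class of mess, as this sketch shows:

```python
# Rebinding a builtin name hides the builtin until the binding is removed.
list = [1, 2, 3]          # shadows the built-in list type
try:
    list("abc")           # the name now refers to our list, not the type
    shadowed = False
except TypeError:
    shadowed = True
assert shadowed
del list                  # drop the shadow; the builtin is visible again
assert list("ab") == ["a", "b"]
```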
17:25 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
17:26 🔗 Kimmer has quit IRC (Read error: Operation timed out)
17:26 🔗 icedice has quit IRC (Ping timeout: 250 seconds)
17:27 🔗 Uzerus https://stackoverflow.com/questions/19007383/compare-two-different-files-line-by-line-in-python
17:27 🔗 Uzerus lol
17:28 🔗 Uzerus ok, but i must first process the files by regex, later i can add "files" to this... em...
17:31 🔗 Uzerus em... :/ there are so many ways that look like they work! but mostly they do not work in the end xD
17:32 🔗 Uzerus but ok, i am gaining knowledge... ... but why does this not work?!
17:32 🔗 Uzerus :/
17:33 🔗 klondike What doesn't work?
17:37 🔗 klondike JAA: seems my htcollect also got hit by the max_conn bug
17:37 🔗 Uzerus my script, anyway ill rewrite it again
17:38 🔗 klondike JAA: Can you share your script? I'd like to make something able to download the comics
17:38 🔗 Uzerus ill stay with click instead of argparse
17:38 🔗 klondike Uzerus: mind pasting it and saying what's it supposed to do?
17:39 🔗 Uzerus https://pastebin.com/stdekh5H
17:40 🔗 klondike What options are you calling it with? What is it intended to do?
17:41 🔗 Uzerus it should check the ignorefile line by line and compare; when found equal, break and go to the next line from the logfile/inputfile... if it is not equal in ignore, go to done and check there, if found, break, else write to done and to the output file
17:42 🔗 Uzerus ignore is for clearing the list, done is for when we resume some work, or have something in parts to process
17:43 🔗 klondike And what is the problem? Are entries on done not filtered?
17:44 🔗 BartoCH has joined #archiveteam-bs
18:08 🔗 klondike Well I have to leave now, I'll answer your questions later.
18:22 🔗 Uzerus klondike: content between if ... else is not processing
18:22 🔗 kvieta has quit IRC (Quit: greedo shot first)
18:22 🔗 kvieta- is now known as kvieta
18:23 🔗 Uzerus especially in donefile, it looks like it always goes to else: continue
18:25 🔗 JAA Print out what you're comparing in the condition so you can see why they're never the same.
18:26 🔗 JAA By the way, your domainfrom* functions are most likely not doing what you want them to do.
18:26 🔗 JAA In particular, look at what the ''.join lines are doing.
18:27 🔗 Uzerus they are converting list to string
18:27 🔗 Uzerus cos comparing lists is a little... nonsense i think
18:28 🔗 JAA Well yeah, they do, but not in the way you think.
18:29 🔗 JAA Just print the values after those lines, or of the return value, and you'll see what I mean.
18:29 🔗 JAA By the way, a tip regarding printing: print(repr(variable)) can be useful.
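Both of JAA's tips in one minimal sketch: ''.join glues split pieces back together without separators, and repr() makes the damage (and hidden newlines) visible. The URL is taken from the conversation above:

```python
# ''.join inserts the empty string between pieces, so split parts are
# reassembled with the slashes gone:
parts = "http://subcultura.es/user/florvieja/".split("/")
joined = "".join(parts)
assert joined == "http:subcultura.esuserflorvieja"

# repr() shows invisible characters, handy when two "equal-looking"
# strings refuse to compare equal:
line = "http://subcultura.es/\n"
assert line != "http://subcultura.es/"
assert repr(line) == "'http://subcultura.es/\\n'"
```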
18:30 🔗 Uzerus but... wait, i will run it with a bigger wait time to have the beginning of the log
18:31 🔗 Uzerus the script is designed to always print debugging info, ... when processing a file and line, print the line
18:31 🔗 Kaz -ot?
18:31 🔗 godane has quit IRC (Ping timeout: 506 seconds)
18:31 🔗 Uzerus but it's not printing, lel, maybe cos of multiple with open... as...
18:31 🔗 JAA Yeah, this belongs in #archiveteam-ot.
18:33 🔗 JAA klondike: https://gist.github.com/JustAnotherArchivist/3e2b6c9e7c276a79a60c137b67a20798
18:34 🔗 JAA That's my current code. This is using wpull 1.2.3, not the more recent 2.0.x version (which is so buggy that it's hardly usable).
18:35 🔗 Uzerus i found next logic bug in my script ...
18:35 🔗 JAA Uzerus: -ot...
18:38 🔗 JAA (klondike, you might want to join #archiveteam-ot if you want to continue that discussion.)
18:42 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
18:44 🔗 BartoCH has joined #archiveteam-bs
18:54 🔗 octothorp has quit IRC (Read error: Connection reset by peer)
18:56 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
19:18 🔗 RichardG has joined #archiveteam-bs
19:22 🔗 Atom has quit IRC (Read error: Operation timed out)
19:50 🔗 BartoCH has joined #archiveteam-bs
20:08 🔗 Kimmer has joined #archiveteam-bs
20:13 🔗 schbirid has quit IRC (Quit: Leaving)
20:30 🔗 jrwr has quit IRC (Max SendQ exceeded)
20:30 🔗 Ravenloft has joined #archiveteam-bs
20:31 🔗 zyphlar has quit IRC (Read error: Connection reset by peer)
20:31 🔗 zyphlar has joined #archiveteam-bs
20:31 🔗 jrwr has joined #archiveteam-bs
20:31 🔗 Mateon1 has quit IRC (Read error: Operation timed out)
20:32 🔗 Mateon1 has joined #archiveteam-bs
20:32 🔗 svchfoo1 sets mode: +o jrwr
21:20 🔗 REiN^ has joined #archiveteam-bs
21:32 🔗 octothorp has joined #archiveteam-bs
21:52 🔗 schbirid has joined #archiveteam-bs
22:04 🔗 MrDignity has joined #archiveteam-bs
22:23 🔗 godane has joined #archiveteam-bs
23:02 🔗 PotcFdk has quit IRC (~'o'/)
23:03 🔗 ranav has joined #archiveteam-bs
23:07 🔗 ranavalon has quit IRC (Read error: Operation timed out)
23:11 🔗 BlueMaxim has joined #archiveteam-bs
