[00:56] *** zyphlar has quit IRC (Max SendQ exceeded)
[00:56] *** zyphlar has joined #archiveteam-bs
[00:58] arkiver, SketchCow we know about this? https://the-eye.eu/public/Books/IT%20Various/Learning%20SPARQL%2C%202nd%20Edition.pdf
[00:58] that was meant to be this: https://www.kotaku.com.au/2018/01/miitomo-is-shutting-down-in-may/
[01:47] *** octothorp has quit IRC (Read error: Connection reset by peer)
[01:47] *** octothorp has joined #archiveteam-bs
[02:04] *** username1 has joined #archiveteam-bs
[02:10] *** schbirid2 has quit IRC (Read error: Operation timed out)
[02:27] *** zyphlar has quit IRC (Max SendQ exceeded)
[02:27] *** zyphlar has joined #archiveteam-bs
[02:44] *** zyphlar has quit IRC (Max SendQ exceeded)
[02:44] *** zyphlar has joined #archiveteam-bs
[04:39] *** ubahn has quit IRC (Ping timeout: 260 seconds)
[04:41] *** ubahn has joined #archiveteam-bs
[04:50] *** qw3rty112 has joined #archiveteam-bs
[04:56] *** qw3rty111 has quit IRC (Read error: Operation timed out)
[05:49] *** ranav has quit IRC (Remote host closed the connection)
[05:50] *** ranavalon has joined #archiveteam-bs
[06:58] https://www.reddit.com/r/linux/comments/7swi6r/kernelorg_is_collecting_pre2000_lkml_archives/
[07:43] *** username1 has quit IRC (Quit: Leaving)
[08:51] odemg: Yeah, two people mentioned it previously in the main chan.
[08:54] The subcultura.es grab is running well. 212k URLs retrieved (2.5 GiB), 1.2M in the queue.
[09:53] *** pizzaiolo has joined #archiveteam-bs
[10:09] *** dashcloud has quit IRC (Ping timeout: 493 seconds)
[10:11] *** dashcloud has joined #archiveteam-bs
[10:23] *** schbirid has joined #archiveteam-bs
[10:41] *** Valentine has joined #archiveteam-bs
[11:44] *** BlueMaxim has quit IRC (Leaving)
[12:08] *** schbirid has quit IRC (Ping timeout: 252 seconds)
[12:13] *** schbirid has joined #archiveteam-bs
[12:27] You seem to be doing better than me, JAA. You didn't hit the maxconn limit?
[12:41] klondike: I don't think so. What does their limiting look like? I have only seen very few timeouts, connection resets, 429s, etc. so far.
[12:41] JAA: looks like a redirect to a page saying maxconn
[12:41] Hmm
[12:41] Better said, to a page with maxconn in the URL itself
[12:42] Ah, OK, let me check.
[12:42] Nope, no such URLs in the logs.
[12:43] Oh, max_conn
[12:43] Yeah, I got a few of those.
[12:44] Those may need refetching
[12:45] It's a 302 redirect to http://subcultura.es/max_conn to be precise.
[12:45] Yeah, that could be
[12:45] And an infinite one too
[12:46] Yeah, just saw that in the logs, lol.
[12:49] Hmm, this will be a bit annoying.
[12:49] JAA: if you reuse HTTP/1.1 connections, it should be okay
[12:51] I'm not sure if wpull does that, to be honest.
[12:54] It should, unless you use --no-http-keep-alive
[12:54] The question is for how long it keeps the connection
[13:02] I've implemented a workaround. 302s to that max_conn page are now considered errors.
[13:03] Meaning those URLs will be retried at the end.
[13:05] Cool
[13:06] JAA: what happens with the ones that have already failed?
[13:06] I'm marking those as errors manually.
[13:07] Oh, I hope you didn't get many :(
[13:07] There are just under 1000 occurrences of "max_conn" in the log files.
[13:08] But there are two lines for each URL (request + response), and many of them are the infinite loops.
[13:08] :(
[13:09] Only 32 URLs actually failed.
[13:10] At most, that is.
[13:11] There are 32 URLs which produced a 302 redirect, but I can't easily check which of these redirected to the max_conn page.
[13:11] Probably most of them though.
[13:11] Out of over 300k URLs by now, I'd say that's a pretty good ratio. :-)
[13:12] I'm thinking, would you prefer it if I set up a dedicated server in Spain for you?
[13:12] I called their hosting service yesterday and they no longer offer servers with root access, but I can try to find one somewhere else.
[13:16] I don't think it would make too big of a difference. The main bottlenecks are Subcultura's server (some requests take quite a long time) and wpull's HTML parsing, not the network.
[13:17] Well, I can pay for something with a fast processor too :P
[13:18] Not for Subcultura, but for your side
[13:19] I was about to say, Subcultura would certainly appreciate a better server. :-P
[13:21] I doubt I can fix anything on that side, but I can ask the admin I'm talking with
[13:26] I've switched wpull's HTML parsing to lxml. That should be significantly faster.
[13:28] Is there any difference between one and the other?
[13:30] html5lib is said to be a bit more robust, but it's pure Python and therefore really slow. lxml is based on libxml2, i.e. C, and seems to work well enough for almost all cases.
[13:31] By the way, I only had to mark 13 URLs as errors; the other 19 were already considered errors because of the redirect loops (I assume).
[13:32] Aha
[13:32] Well, the Subcultura codebase is from 2012 or so, not much HTML5 I guess
[13:33] lxml should handle HTML5 just fine. I think it's only edge cases and broken markup which *can* lead to errors.
[13:40] Ahh
[13:41] there might be some broken markup
[13:41] Maybe, I really can't say. I found some weird links through httrack
[13:49] Yeah, I've seen stuff like http://http://conmaskara.blogspot.com.es/.blogspot.com/ for example.
[13:50] But the markup around that link is fine.
[13:50] (It's from http://subcultura.es/user/florvieja/ )
[13:54] Well, I don't really see a speedup compared to before I switched to lxml. So I guess the HTML parser isn't limiting after all.
[13:56] Well, the command top may help you see CPU usage there.
[13:57] Yeah, I am monitoring CPU usage, but it fluctuates strongly.
[13:57] htop > top, by the way.
[13:58] I guess I'm too old for all those fancy things like htop :P
[14:00] Average load on the machine did go down compared to before, by the way. It's just that the request rate is still about the same.
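JAA's switch of wpull's HTML parsing to lxml can be illustrated with a minimal, hypothetical sketch (this is not wpull's actual integration; the page snippet and function name are made up for illustration):

```python
import lxml.html

# Pull link targets out of a page, roughly what a crawler's HTML-parsing
# stage does. lxml wraps libxml2 (C), which is why it is much faster than
# the pure-Python html5lib while still tolerating sloppy markup.
def extract_links(html):
    doc = lxml.html.fromstring(html)
    return doc.xpath('//a/@href')

# Unquoted attributes and unclosed tags are handled leniently:
page = '<html><body><p><a href=/user/florvieja/>profile</body></html>'
print(extract_links(page))  # → ['/user/florvieja/']
```

As the conversation notes, only genuinely broken markup tends to trip lxml up; an ordinary pre-HTML5 codebase parses fine.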
[14:21] *** REiN^ has quit IRC (Read error: Operation timed out)
[14:41] *shrug*
[14:42] * klondike mumbles in Spanish
[14:55] *** RichardG has quit IRC (Read error: Connection reset by peer)
[14:57] *** RichardG has joined #archiveteam-bs
[14:59] *** RichardG has quit IRC (Read error: Connection reset by peer)
[15:01] *** RichardG has joined #archiveteam-bs
[15:02] *** RichardG has quit IRC (Read error: Connection reset by peer)
[15:07] JAA: can you analyze my piece of code? I don't know why my code is running without the things inside the if
[15:07] look,
[15:08] https://pastebin.com/stdekh5H
[15:09] it looks like it's not processing the for loop and goes to else:
[15:10] which one?
[15:11] all
[15:11] are your files empty?
[15:11] Uzerus: is it Python 2 or 3?
[15:12] https://pastebin.com/yK18JEKu for example, Python 3
[15:12] 3.5
[15:13] Ahh
[15:13] the ignore file is empty, the done file is not empty
[15:13] You see you have one open in mode rt and the other in mode r, true?
[15:13] but... it can make sense IF the file has been opened as empty and not saved!
[15:14] You're also opening the files many, many times instead of just once.
[15:14] yup JAA, will repair it
[15:15] readlines() returns a list of lines, but each line still has the linebreak at the end. Did you account for that?
[15:16] eh, no
[15:16] Uzerus: also your execution time will be N*M, with N being the input file lines and M the ignore file lines.
[15:17] I lost ~6 hours on how click works, useless documentation on the website
[15:17] Use argparse next time. click seems to be a wrapper around optparse, which should just die already.
[15:17] klondike: it's a prototype, I want to do regex ignores in the future
[15:18] not copy & paste from grab-site, f.e.
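JAA's suggestion to switch from click to argparse might look like this for a script of this shape; all option names below are hypothetical, since the pastebin contents are not in the log:

```python
import argparse

# argparse equivalent of a click-style CLI: plain function calls instead of
# decorators, and a single namespace object at the end instead of individual
# variables. The options are invented for illustration.
parser = argparse.ArgumentParser(
    description='Filter URLs from a log against ignore/done lists.')
parser.add_argument('logfile', help='input log, possibly gzipped')
parser.add_argument('--ignore-file', default='ignore.txt',
                    help='patterns to skip, one per line')
parser.add_argument('--done-file', default='done.txt',
                    help='URLs already processed')
parser.add_argument('--out-file', default='out.txt',
                    help='where new URLs are written')

args = parser.parse_args(['access.log', '--out-file', 'urls.txt'])
print(args.logfile, args.out_file)  # → access.log urls.txt
```

One difference from click: argparse does not validate on its own that the named files exist; that check has to be implemented by hand.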
[15:18] Oki Uzerus, just saying, because if the ignore file can fit in memory, a Python set will make that O(N+M)
[15:20] JAA: fortunately I have a version of this in argparse, people on freenode's #python told me "argparse is not the easy way"
[15:22] No idea what they mean by that.
[15:23] You might have to implement the file existence check yourself, but otherwise, it would be very similar (just function calls instead of decorators, and you have a namespace object in the end instead of individual variables).
[15:30] JAA: also you have a lot of duplicated code
[15:32] Uzerus: ^
[15:34] Uzerus: here you can see a reasonably simple way to have different opening functions https://pastebin.com/FSMDd2jn
[15:34] Basically, if the interface is the same, assign the function you want to call to a variable and then use the variable to do the call
[15:35] You could even do with (gzip.open if gzipp else open)(...) as logfile: but that's harder to read.
[15:35] hah, klondike, nice code
[15:36] Uzerus: nah, it's just one of many small tricks I have learnt over time.
[15:37] the last thing on my mind is that the file was not saved (and the program is operating on an empty file all the time)
[15:39] Uzerus: just cat the file and see
[15:39] (Supposing you are running this on a Linux system)
[15:39] you can also do this (since it's gzipped)
[15:39] it saves to the 'done' file but it does not save to the 'out' file
[15:39] gzip -dc file
[15:40] gzipped is only the input file, in read-only mode
[15:40] less and zless are also useful.
[15:41] JAA: I am just learning :D
[15:41] Also, zcat if you really want to output the entire file.
[15:41] Because remembering options is hard.
[15:41] But in my experience, you rarely actually want to print the entire file. less also gives you the advantage of being able to search for stuff etc.
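Uzerus's script itself (the pastebin links) is not reproduced in the log, so this is a hypothetical reconstruction folding in the advice from the conversation: open each file once, strip the trailing newlines, keep the ignore/done lists in sets for O(N+M), and choose the opener via klondike's function-in-a-variable trick:

```python
import gzip

# Hypothetical reconstruction of the filter script with the fixes applied:
# - every file is opened exactly once (not once per input line)
# - trailing newlines are stripped before comparing
# - ignore/done are sets, so each lookup is O(1) and the run is O(N+M)
# - the input opener is picked once: gzip.open for .gz logs, open otherwise
def filter_log(log_path, ignore_path, done_path, out_path, gzipped=False):
    opener = gzip.open if gzipped else open
    with open(ignore_path) as f:
        ignore = {line.rstrip('\n') for line in f}
    with open(done_path) as f:
        done = {line.rstrip('\n') for line in f}
    with opener(log_path, 'rt') as log, \
         open(out_path, 'w') as out, \
         open(done_path, 'a') as done_file:
        for line in log:
            url = line.rstrip('\n')
            if url in ignore or url in done:
                continue
            done.add(url)
            done_file.write(url + '\n')
            out.write(url + '\n')
```

Writing through one long-lived with block (rather than reopening files per line) also avoids the "file was not saved" symptom Uzerus describes: buffered output is flushed when the block exits.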
[15:41] Uzerus: less and zless are different commands :)
[15:42] not only that, I never made any program in my life before, it's a very useful experience, building this app
[15:47] klondike: My Subcultura grab is now in the forums and has slowed down to a crawl. :-/
[15:48] I'd say I'm surprised but... :P
[15:48] Yeah
[15:48] Oh well
[15:57] I still would kill for just a full backup of the site, even if that meant having to modify their old PHP code to generate the pages locally xD
[16:01] Probably wouldn't even have to modify much, if anything at all.
[16:01] But yeah, if they're willing to give out the data, that's an option in principle.
[16:04] data + all relevant information to set up a similar system*
[16:05] I don't think they'll do that though
[16:05] At least it didn't look like it
[16:14] Yeah, I'm not surprised. Most people wouldn't do that.
[16:18] analyzed my code once more, I have a logic bug....
[16:21] heh, mindduck
[16:23] ewww... how to do that ...
[16:36] *** icedice has joined #archiveteam-bs
[17:03] is it possible that too many variables (information) can make a bug?
[17:04] JAA:?
[17:11] Uzerus: You could run out of memory, but other than that, I'd be quite surprised if you managed to break things by creating many variables.
[17:22] Unless you overwrite something
[17:22] For example, in Python you can do True=False and enjoy the mess ;)
[17:23] WHAT?!
[17:25] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[17:26] *** Kimmer has quit IRC (Read error: Operation timed out)
[17:26] *** icedice has quit IRC (Ping timeout: 250 seconds)
[17:27] https://stackoverflow.com/questions/19007383/compare-two-different-files-line-by-line-in-python
[17:27] lol
[17:28] ok, but I must first process the files by regex, later I can add "files" to this... em...
[17:31] em... :/ there are so many ways that can work and look like they work! but they do not work in the end, mostly xD
[17:32] but ok, I am gaining knowledge... ... but why does this not work?!
[17:32] :/
[17:33] What doesn't work?
[17:37] JAA: seems my htcollect also got hit by the max_conn bug
[17:37] my script, anyway, I'll rewrite it again
[17:38] JAA: Can you share your script? I'd like to make something able to download the comics
[17:38] I'll stay with click instead of argparse
[17:38] Uzerus: mind pasting it and saying what it's supposed to do?
[17:39] https://pastebin.com/stdekh5H
[17:40] What options are you calling it with? What is it intended to do?
[17:41] it should check the ignore file line by line and compare; when an equal line is found, break and go to the next line from the logfile/inputfile... if it's not equal in ignore, go to done and check there; if found, break; else write to done and to the output file
[17:42] ignore is for the clear list, done is for when we resume some work, or have something in parts to process
[17:43] And what is the problem? Are entries in done not filtered?
[17:44] *** BartoCH has joined #archiveteam-bs
[18:08] Well, I have to leave now, I'll answer your questions later.
[18:22] klondike: content between if ... else is not processing
[18:22] *** kvieta has quit IRC (Quit: greedo shot first)
[18:22] *** kvieta- is now known as kvieta
[18:23] especially in the done file, it looks like it always goes to else: continue
[18:25] Print out what you're comparing in the condition so you can see why they're never the same.
[18:26] By the way, your domainfrom* functions are most likely not doing what you want them to do.
[18:26] In particular, look at what the ''.join lines are doing.
[18:27] they are converting a list to a string
[18:27] cos comparing lists is a little... nonsense, I think
[18:28] Well yeah, they do, but not in the way you think.
[18:29] Just print the values after those lines, or the return value, and you'll see what I mean.
[18:29] By the way, a tip regarding printing: print(repr(variable)) can be useful.
[18:30] but... wait, I will run with a bigger wait time to have the beginning of the log
[18:31] the script is designed to always print debugging info... when processing a file and line, print the line
[18:31] -ot?
[18:31] *** godane has quit IRC (Ping timeout: 506 seconds)
[18:31] but it's not printing, lel, maybe cos of the multiple with open... as...
[18:31] Yeah, this belongs in #archiveteam-ot.
[18:33] klondike: https://gist.github.com/JustAnotherArchivist/3e2b6c9e7c276a79a60c137b67a20798
[18:34] That's my current code. This is using wpull 1.2.3, not the more recent 2.0.x version (which is so buggy that it's hardly usable).
[18:35] I found the next logic bug in my script ...
[18:35] Uzerus: -ot...
[18:38] (klondike, you might want to join #archiveteam-ot if you want to continue that discussion.)
[18:42] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[18:44] *** BartoCH has joined #archiveteam-bs
[18:54] *** octothorp has quit IRC (Read error: Connection reset by peer)
[18:56] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[19:18] *** RichardG has joined #archiveteam-bs
[19:22] *** Atom has quit IRC (Read error: Operation timed out)
[19:50] *** BartoCH has joined #archiveteam-bs
[20:08] *** Kimmer has joined #archiveteam-bs
[20:13] *** schbirid has quit IRC (Quit: Leaving)
[20:30] *** jrwr has quit IRC (Max SendQ exceeded)
[20:30] *** Ravenloft has joined #archiveteam-bs
[20:31] *** zyphlar has quit IRC (Read error: Connection reset by peer)
[20:31] *** zyphlar has joined #archiveteam-bs
[20:31] *** jrwr has joined #archiveteam-bs
[20:31] *** Mateon1 has quit IRC (Read error: Operation timed out)
[20:32] *** Mateon1 has joined #archiveteam-bs
[20:32] *** svchfoo1 sets mode: +o jrwr
[21:20] *** REiN^ has joined #archiveteam-bs
[21:32] *** octothorp has joined #archiveteam-bs
[21:52] *** schbirid has joined #archiveteam-bs
[22:04] *** MrDignity has joined #archiveteam-bs
[22:23] *** godane has joined #archiveteam-bs
[23:02] *** PotcFdk has quit IRC (~'o'/)
[23:03] *** ranav has joined #archiveteam-bs
[23:07] *** ranavalon has quit IRC (Read error: Operation timed out)
[23:11] *** BlueMaxim has joined #archiveteam-bs