[00:56] *** zyphlar has quit IRC (Max SendQ exceeded)
[00:56] *** zyphlar has joined #archiveteam-bs
[00:58] arkiver, SketchCow we know about this? https://the-eye.eu/public/Books/IT%20Various/Learning%20SPARQL%2C%202nd%20Edition.pdf
[00:58] that was meant to be this: https://www.kotaku.com.au/2018/01/miitomo-is-shutting-down-in-may/
[01:47] *** octothorp has quit IRC (Read error: Connection reset by peer)
[01:47] *** octothorp has joined #archiveteam-bs
[02:04] *** username1 has joined #archiveteam-bs
[02:10] *** schbirid2 has quit IRC (Read error: Operation timed out)
[02:27] *** zyphlar has quit IRC (Max SendQ exceeded)
[02:27] *** zyphlar has joined #archiveteam-bs
[02:44] *** zyphlar has quit IRC (Max SendQ exceeded)
[02:44] *** zyphlar has joined #archiveteam-bs
[04:39] *** ubahn has quit IRC (Ping timeout: 260 seconds)
[04:41] *** ubahn has joined #archiveteam-bs
[04:50] *** qw3rty112 has joined #archiveteam-bs
[04:56] *** qw3rty111 has quit IRC (Read error: Operation timed out)
[05:49] *** ranav has quit IRC (Remote host closed the connection)
[05:50] *** ranavalon has joined #archiveteam-bs
[06:58] https://www.reddit.com/r/linux/comments/7swi6r/kernelorg_is_collecting_pre2000_lkml_archives/
[07:43] *** username1 has quit IRC (Quit: Leaving)
[08:51] odemg: Yeah, two people mentioned it previously in the main chan.
[08:54] The subcultura.es grab is running well. 212k URLs retrieved (2.5 GiB), 1.2M in the queue.
[09:53] *** pizzaiolo has joined #archiveteam-bs
[10:09] *** dashcloud has quit IRC (Ping timeout: 493 seconds)
[10:11] *** dashcloud has joined #archiveteam-bs
[10:23] *** schbirid has joined #archiveteam-bs
[10:41] *** Valentine has joined #archiveteam-bs
[11:44] *** BlueMaxim has quit IRC (Leaving)
[12:08] *** schbirid has quit IRC (Ping timeout: 252 seconds)
[12:13] *** schbirid has joined #archiveteam-bs
[12:27] You seem to be doing better than me, JAA. You didn't hit the maxconn limit?
[12:41] klondike: I don't think so. What does their limiting look like? I have only seen very few timeouts, connection resets, 429s, etc. so far.
[12:41] JAA: looks like a redirect to a page saying maxconn
[12:41] Hmm
[12:41] Better said, to a page with maxconn in the URL itself
[12:42] Ah, OK, let me check.
[12:42] Nope, no such URLs in the logs.
[12:43] Oh, max_conn
[12:43] Yeah, I got a few of those.
[12:44] Those may need refetching
[12:45] It's a 302 redirect to http://subcultura.es/max_conn to be precise.
[12:45] Yeah, that could be
[12:45] And an infinite one too
[12:46] Yeah, just saw that in the logs, lol.
[12:49] Hmm, this will be a bit annoying.
[12:49] JAA: if you reuse HTTP/1.1 connections, it should be okay
[12:51] I'm not sure if wpull does that, to be honest.
[12:54] It should, unless you use --no-http-keep-alive
[12:54] The question is for how long it keeps the connection
[13:02] I've implemented a workaround. 302s to that max_conn page are now considered errors.
[13:03] Meaning those URLs will be retried at the end.
[13:05] Cool
[13:06] JAA: what happens with the ones that have already failed?
[13:06] I'm marking those as errors manually.
[13:07] Oh, I hope you didn't get many :(
[13:07] There are just under 1000 occurrences of "max_conn" in the log files.
[13:08] But there are two lines for each URL (request + response), and many of them are the infinite loops.
[13:08] :(
[13:09] Only 32 URLs actually failed.
[13:10] At most, that is.
[13:11] There are 32 URLs which produced a 302 redirect, but I can't easily check which of these redirected to the max_conn page.
[13:11] Probably most of them though.
[13:11] Out of over 300k URLs by now, I'd say that's a pretty good ratio. :-)
[13:12] I'm thinking, would you prefer it if I set up a dedicated server in Spain for you?
[13:12] I called their hosting service yesterday and they no longer offer servers with root access, but I can try to find one somewhere else.
[13:16] I don't think it would make too big of a difference. The main bottlenecks are Subcultura's server (some requests take quite a long time) and wpull's HTML parsing, not the network.
[13:17] Well, I can pay for something with a fast processor too :P
[13:18] Not for Subcultura, but for your side
[13:19] I was about to say, Subcultura would certainly appreciate a better server. :-P
[13:21] I doubt I can fix anything on that side, but I can ask the admin I'm talking with
[13:26] I've switched wpull's HTML parsing to lxml. That should be significantly faster.
[13:28] Is there any difference between one and the other?
[13:30] html5lib is said to be a bit more robust, but it's pure Python and therefore really slow. lxml is based on libxml2, i.e. C, and seems to work well enough for almost all cases.
[13:31] By the way, I only had to mark 13 URLs as errors; the other 19 were already considered errors because of the redirect loops (I assume).
[13:32] Aha
[13:32] Well, the Subcultura codebase is from 2012 or so, not much HTML5 I guess
[13:33] lxml should handle HTML5 just fine. I think it's only edge cases and broken markup which *can* lead to errors.
[13:40] Ahh
[13:41] there might be some broken markup
[13:41] Maybe, I really can't say. I found some weird links through httrack
[13:49] Yeah, I've seen stuff like http://http://conmaskara.blogspot.com.es/.blogspot.com/ for example.
[13:50] But the markup around that link is fine.
[13:50] (It's from http://subcultura.es/user/florvieja/ )
[13:54] Well, I don't really see a speedup compared to before I switched to lxml. So I guess the HTML parser isn't limiting after all.
[13:56] Well, the command top may help you see CPU usage there.
[13:57] Yeah, I am monitoring CPU usage, but it fluctuates strongly.
[13:57] htop > top, by the way.
[13:58] I guess I'm too old for all those fancy things like htop :P
[14:00] Average load on the machine did go down compared to before, by the way. It's just that the request rate is still about the same.
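JAA's switch of wpull's HTML parsing to lxml can be illustrated with a minimal, hypothetical sketch (this is not wpull's actual integration; the page snippet and function name are made up for illustration):

```python
import lxml.html

# Pull link targets out of a page, roughly what a crawler's HTML-parsing
# stage does. lxml wraps libxml2 (C), which is why it is much faster than
# the pure-Python html5lib while still tolerating sloppy markup.
def extract_links(html):
    doc = lxml.html.fromstring(html)
    return doc.xpath('//a/@href')

# Unquoted attributes and unclosed tags are handled leniently:
page = '<html><body><p><a href=/user/florvieja/>profile</body></html>'
print(extract_links(page))  # → ['/user/florvieja/']
```

As the conversation notes, only genuinely broken markup tends to trip lxml up; an ordinary pre-HTML5 codebase parses fine.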
[14:21] *** REiN^ has quit IRC (Read error: Operation timed out)
[14:41] *shrug*
[14:42] * klondike mumbles in Spanish
[14:55] *** RichardG has quit IRC (Read error: Connection reset by peer)
[14:57] *** RichardG has joined #archiveteam-bs
[14:59] *** RichardG has quit IRC (Read error: Connection reset by peer)
[15:01] *** RichardG has joined #archiveteam-bs
[15:02] *** RichardG has quit IRC (Read error: Connection reset by peer)
[15:07] JAA: can you analyze my piece of code? I don't know why my code is running without the things inside the if
[15:07] look,
[15:08] https://pastebin.com/stdekh5H
[15:09] it looks like it's not processing the for loop and goes to else:
[15:10] which one?
[15:11] all
[15:11] are your files empty?
[15:11] Uzerus: is it Python 2 or 3?
[15:12] https://pastebin.com/yK18JEKu for example, Python 3
[15:12] 3.5
[15:13] Ahh
[15:13] the ignore file is empty, the done file is not empty
[15:13] You see you have one open in mode rt and the other in mode r, true?
[15:13] but... it can make sense IF the file has been opened as empty and not saved!
[15:14] You're also opening the files many, many times instead of just once.
[15:14] yup JAA, will repair it
[15:15] readlines() returns a list of lines, but each line still has the linebreak at the end. Did you account for that?
[15:16] eh, no
[15:16] Uzerus: also your execution time will be N*M, with N being the input file lines and M the ignore file lines.
[15:17] I lost ~6 hours on how click works, useless documentation on the website
[15:17] Use argparse next time. click seems to be a wrapper around optparse, which should just die already.
[15:17] klondike: it's a prototype, I want to do regex ignores in the future
[15:18] not copy & paste from grab-site, f.e.
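JAA's suggestion to switch from click to argparse might look like this for a script of this shape; all option names below are hypothetical, since the pastebin contents are not in the log:

```python
import argparse

# argparse equivalent of a click-style CLI: plain function calls instead of
# decorators, and a single namespace object at the end instead of individual
# variables. The options are invented for illustration.
parser = argparse.ArgumentParser(
    description='Filter URLs from a log against ignore/done lists.')
parser.add_argument('logfile', help='input log, possibly gzipped')
parser.add_argument('--ignore-file', default='ignore.txt',
                    help='patterns to skip, one per line')
parser.add_argument('--done-file', default='done.txt',
                    help='URLs already processed')
parser.add_argument('--out-file', default='out.txt',
                    help='where new URLs are written')

args = parser.parse_args(['access.log', '--out-file', 'urls.txt'])
print(args.logfile, args.out_file)  # → access.log urls.txt
```

One difference from click: argparse does not validate on its own that the named files exist; that check has to be implemented by hand.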
[15:18] Oki Uzerus, just saying, because if the ignore file can fit in memory, a Python set will make that O(N+M)
[15:20] JAA: fortunately I have a version of this in argparse, people on freenode's #python told me "argparse is not the easy way"
[15:22] No idea what they mean by that.
[15:23] You might have to implement the file existence check yourself, but otherwise, it would be very similar (just function calls instead of decorators, and you have a namespace object in the end instead of individual variables).
[15:30] JAA: also you have a lot of duplicated code
[15:32] Uzerus: ^
[15:34] Uzerus: here you can see a reasonably simple way to have different opening functions https://pastebin.com/FSMDd2jn
[15:34] Basically, if the interface is the same, assign the function you want to call to a variable and then use the variable to do the call
[15:35] You could even do with (gzip.open if gzipp else open)(...) as logfile: but that's harder to read.
[15:35] hah, klondike, nice code
[15:36] Uzerus: nah, it's just one of many small tricks I have learnt over time.
[15:37] the last thing on my mind is that the file was not saved (and the program is operating on an empty file all the time)
[15:39] Uzerus: just cat the file and see
[15:39] (Supposing you are running this on a Linux system)
[15:39] you can also do this (since it's gzipped)
[15:39] it saves to the 'done' file but it does not save to the 'out' file
[15:39] gzip -dc file
[15:40] gzipped is only the input file, in read-only mode
[15:40] less and zless are also useful.
[15:41] JAA: I am just learning :D
[15:41] Also, zcat if you really want to output the entire file.
[15:41] Because remembering options is hard.
[15:41] But in my experience, you rarely actually want to print the entire file. less also gives you the advantage of being able to search for stuff etc.
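Uzerus's script itself (the pastebin links) is not reproduced in the log, so this is a hypothetical reconstruction folding in the advice from the conversation: open each file once, strip the trailing newlines, keep the ignore/done lists in sets for O(N+M), and choose the opener via klondike's function-in-a-variable trick:

```python
import gzip

# Hypothetical reconstruction of the filter script with the fixes applied:
# - every file is opened exactly once (not once per input line)
# - trailing newlines are stripped before comparing
# - ignore/done are sets, so each lookup is O(1) and the run is O(N+M)
# - the input opener is picked once: gzip.open for .gz logs, open otherwise
def filter_log(log_path, ignore_path, done_path, out_path, gzipped=False):
    opener = gzip.open if gzipped else open
    with open(ignore_path) as f:
        ignore = {line.rstrip('\n') for line in f}
    with open(done_path) as f:
        done = {line.rstrip('\n') for line in f}
    with opener(log_path, 'rt') as log, \
         open(out_path, 'w') as out, \
         open(done_path, 'a') as done_file:
        for line in log:
            url = line.rstrip('\n')
            if url in ignore or url in done:
                continue
            done.add(url)
            done_file.write(url + '\n')
            out.write(url + '\n')
```

Writing through one long-lived with block (rather than reopening files per line) also avoids the "file was not saved" symptom Uzerus describes: buffered output is flushed when the block exits.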
[15:41] Uzerus: less and zless are different commands :)
[15:42] not only that, I never made any program in my life before, it's a very useful experience, building this app
[15:47] klondike: My Subcultura grab is now in the forums and has slowed down to a crawl. :-/
[15:48] I'd say I'm surprised but... :P
[15:48] Yeah
[15:48] Oh well
[15:57] I still would kill for just a full backup of the site, even if that meant having to modify their old PHP code to generate the pages locally xD
[16:01] Probably wouldn't even have to modify much, if anything at all.
[16:01] But yeah, if they're willing to give out the data, that's an option in principle.
[16:04] data + all relevant information to set up a similar system*
[16:05] I don't think they'll do that though
[16:05] At least it didn't look like it
[16:14] Yeah, I'm not surprised. Most people wouldn't do that.
[16:18] analyzed my code once more, I have a logic bug....
[16:21] heh, mindduck
[16:23] ewww... how to do that ...
[16:36] *** icedice has joined #archiveteam-bs
[17:03] is it possible that too many variables (information) can make a bug?
[17:04] JAA:?
[17:11] Uzerus: You could run out of memory, but other than that, I'd be quite surprised if you managed to break things by creating many variables.
[17:22] Unless you overwrite something
[17:22] For example, in Python you can do True=False and enjoy the mess ;)
[17:23] WHAT?!
[17:25] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[17:26] *** Kimmer has quit IRC (Read error: Operation timed out)
[17:26] *** icedice has quit IRC (Ping timeout: 250 seconds)
[17:27] https://stackoverflow.com/questions/19007383/compare-two-different-files-line-by-line-in-python
[17:27] lol
[17:28] ok, but I must first process the files by regex, later I can add "files" to this... em...
[17:31] em... :/ there are so many ways that can work and look like they work! but they do not work in the end, mostly xD
[17:32] but ok, I am gaining knowledge... ... but why does this not work?!
[17:32] :/
[17:33] What doesn't work?
[17:37] JAA: seems my htcollect also got hit by the max_conn bug
[17:37] my script, anyway, I'll rewrite it again
[17:38] JAA: Can you share your script? I'd like to make something able to download the comics
[17:38] I'll stay with click instead of argparse
[17:38] Uzerus: mind pasting it and saying what it's supposed to do?
[17:39] https://pastebin.com/stdekh5H
[17:40] What options are you calling it with? What is it intended to do?
[17:41] it should check the ignore file line by line and compare; when an equal line is found, break and go to the next line from the logfile/inputfile... if it's not equal in ignore, go to done and check there; if found, break; else write to done and to the output file
[17:42] ignore is for the clear list, done is for when we resume some work, or have something in parts to process
[17:43] And what is the problem? Are entries in done not filtered?
[17:44] *** BartoCH has joined #archiveteam-bs
[18:08] Well, I have to leave now, I'll answer your questions later.
[18:22] klondike: content between if ... else is not processing
[18:22] *** kvieta has quit IRC (Quit: greedo shot first)
[18:22] *** kvieta- is now known as kvieta
[18:23] especially in the done file, it looks like it always goes to else: continue
[18:25] Print out what you're comparing in the condition so you can see why they're never the same.
[18:26] By the way, your domainfrom* functions are most likely not doing what you want them to do.
[18:26] In particular, look at what the ''.join lines are doing.
[18:27] they are converting a list to a string
[18:27] cos comparing lists is a little... nonsense, I think
[18:28] Well yeah, they do, but not in the way you think.
[18:29] Just print the values after those lines, or the return value, and you'll see what I mean.
[18:29] By the way, a tip regarding printing: print(repr(variable)) can be useful.
[18:30] but... wait, I will run with a bigger wait time to have the beginning of the log
[18:31] the script is designed to always print debugging info... when processing a file and line, print the line
[18:31] -ot?
[18:31] *** godane has quit IRC (Ping timeout: 506 seconds)
[18:31] but it's not printing, lel, maybe cos of the multiple with open... as...
[18:31] Yeah, this belongs in #archiveteam-ot.
[18:33] klondike: https://gist.github.com/JustAnotherArchivist/3e2b6c9e7c276a79a60c137b67a20798
[18:34] That's my current code. This is using wpull 1.2.3, not the more recent 2.0.x version (which is so buggy that it's hardly usable).
[18:35] I found the next logic bug in my script ...
[18:35] Uzerus: -ot...
[18:38] (klondike, you might want to join #archiveteam-ot if you want to continue that discussion.)
[18:42] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[18:44] *** BartoCH has joined #archiveteam-bs
[18:54] *** octothorp has quit IRC (Read error: Connection reset by peer)
[18:56] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[19:18] *** RichardG has joined #archiveteam-bs
[19:22] *** Atom has quit IRC (Read error: Operation timed out)
[19:50] *** BartoCH has joined #archiveteam-bs
[20:08] *** Kimmer has joined #archiveteam-bs
[20:13] *** schbirid has quit IRC (Quit: Leaving)
[20:30] *** jrwr has quit IRC (Max SendQ exceeded)
[20:30] *** Ravenloft has joined #archiveteam-bs
[20:31] *** zyphlar has quit IRC (Read error: Connection reset by peer)
[20:31] *** zyphlar has joined #archiveteam-bs
[20:31] *** jrwr has joined #archiveteam-bs
[20:31] *** Mateon1 has quit IRC (Read error: Operation timed out)
[20:32] *** Mateon1 has joined #archiveteam-bs
[20:32] *** svchfoo1 sets mode: +o jrwr
[21:20] *** REiN^ has joined #archiveteam-bs
[21:32] *** octothorp has joined #archiveteam-bs
[21:52] *** schbirid has joined #archiveteam-bs
[22:04] *** MrDignity has joined #archiveteam-bs
[22:23] *** godane has joined #archiveteam-bs
[23:02] *** PotcFdk has quit IRC (~'o'/)
[23:03] *** ranav has joined #archiveteam-bs
[23:07] *** ranavalon has quit IRC (Read error: Operation timed out)
[23:11] *** BlueMaxim has joined #archiveteam-bs