Time |
Nickname |
Message |
00:56
🔗
|
|
zyphlar has quit IRC (Max SendQ exceeded) |
00:56
🔗
|
|
zyphlar has joined #archiveteam-bs |
00:58
🔗
|
odemg |
arkiver, SketchCow we know about this? https://the-eye.eu/public/Books/IT%20Various/Learning%20SPARQL%2C%202nd%20Edition.pdf |
00:58
🔗
|
odemg |
that was meant to be this: https://www.kotaku.com.au/2018/01/miitomo-is-shutting-down-in-may/ |
01:47
🔗
|
|
octothorp has quit IRC (Read error: Connection reset by peer) |
01:47
🔗
|
|
octothorp has joined #archiveteam-bs |
02:04
🔗
|
|
username1 has joined #archiveteam-bs |
02:10
🔗
|
|
schbirid2 has quit IRC (Read error: Operation timed out) |
02:27
🔗
|
|
zyphlar has quit IRC (Max SendQ exceeded) |
02:27
🔗
|
|
zyphlar has joined #archiveteam-bs |
02:44
🔗
|
|
zyphlar has quit IRC (Max SendQ exceeded) |
02:44
🔗
|
|
zyphlar has joined #archiveteam-bs |
04:39
🔗
|
|
ubahn has quit IRC (Ping timeout: 260 seconds) |
04:41
🔗
|
|
ubahn has joined #archiveteam-bs |
04:50
🔗
|
|
qw3rty112 has joined #archiveteam-bs |
04:56
🔗
|
|
qw3rty111 has quit IRC (Read error: Operation timed out) |
05:49
🔗
|
|
ranav has quit IRC (Remote host closed the connection) |
05:50
🔗
|
|
ranavalon has joined #archiveteam-bs |
06:58
🔗
|
godane |
https://www.reddit.com/r/linux/comments/7swi6r/kernelorg_is_collecting_pre2000_lkml_archives/ |
07:43
🔗
|
|
username1 has quit IRC (Quit: Leaving) |
08:51
🔗
|
JAA |
odemg: Yeah, two people mentioned it previously in the main chan. |
08:54
🔗
|
JAA |
The subcultura.es grab is running well. 212k URLs retrieved (2.5 GiB), 1.2M in the queue. |
09:53
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
10:09
🔗
|
|
dashcloud has quit IRC (Ping timeout: 493 seconds) |
10:11
🔗
|
|
dashcloud has joined #archiveteam-bs |
10:23
🔗
|
|
schbirid has joined #archiveteam-bs |
10:41
🔗
|
|
Valentine has joined #archiveteam-bs |
11:44
🔗
|
|
BlueMaxim has quit IRC (Leaving) |
12:08
🔗
|
|
schbirid has quit IRC (Ping timeout: 252 seconds) |
12:13
🔗
|
|
schbirid has joined #archiveteam-bs |
12:27
🔗
|
klondike |
You seem to be doing better than me JAA, you didn't hit the maxconn limit? |
12:41
🔗
|
JAA |
klondike: I don't think so. What does their limiting look like? I have only seen very few timeouts, connection resets, 429s, etc. so far. |
12:41
🔗
|
klondike |
JAA: looks like a redirect to a page saying maxconn |
12:41
🔗
|
JAA |
Hmm |
12:41
🔗
|
klondike |
Better said, to a page with maxconn in the URL itself |
12:42
🔗
|
JAA |
Ah, ok, let me check. |
12:42
🔗
|
JAA |
Nope, no such URLs in the logs. |
12:43
🔗
|
JAA |
Oh, max_conn |
12:43
🔗
|
JAA |
Yeah, I got a few of those. |
12:44
🔗
|
klondike |
Those may need refetching |
12:45
🔗
|
JAA |
It's a 302 redirect to http://subcultura.es/max_conn to be precise. |
12:45
🔗
|
klondike |
Yeah that could be |
12:45
🔗
|
klondike |
And an infinite one too |
12:46
🔗
|
JAA |
Yeah, just saw that in the logs, lol. |
12:49
🔗
|
JAA |
Hmm, this will be a bit annoying. |
12:49
🔗
|
klondike |
JAA: if you reuse http/1.1 connections it should be okay |
12:51
🔗
|
JAA |
I'm not sure if wpull does that, to be honest. |
12:54
🔗
|
klondike |
It should unless you use --no-http-keep-alive |
12:54
🔗
|
klondike |
The question is how long it keeps the connection open |
13:02
🔗
|
JAA |
I've implemented a workaround. 302s to that max_conn page are now considered errors. |
13:03
🔗
|
JAA |
Meaning those URLs will be retried at the end. |
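The workaround described above can be sketched roughly as follows. This is not JAA's actual wpull hook (the log does not show it). It is a minimal, library-independent illustration in which `is_max_conn_redirect` and `classify` are hypothetical helper names, and only the 302 status and the `/max_conn` path come from the conversation.

```python
from urllib.parse import urljoin, urlsplit

# Path taken from the log above; the helper names below are hypothetical.
MAX_CONN_PATH = "/max_conn"


def is_max_conn_redirect(status_code, location, request_url):
    """Return True if the response is a 302 pointing at the max_conn page."""
    if status_code != 302 or not location:
        return False
    # The Location header may be relative, so resolve it against the request URL.
    target = urljoin(request_url, location)
    return urlsplit(target).path.rstrip("/") == MAX_CONN_PATH


def classify(status_code, location, request_url):
    """Label a response 'error' (to be retried later) or 'ok'."""
    if is_max_conn_redirect(status_code, location, request_url):
        return "error"
    return "ok"
```

Treating the redirect as an error instead of following it also avoids the infinite redirect loop mentioned earlier, since the crawler never requests the max_conn page itself.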
13:05
🔗
|
klondike |
Cool |
13:06
🔗
|
klondike |
JAA: what happens with the ones that have already failed? |
13:06
🔗
|
JAA |
I'm marking those as errors manually. |
13:07
🔗
|
klondike |
Oh, I hope you didn't get many :( |
13:07
🔗
|
JAA |
There are just under 1000 occurrences of "max_conn" in the log files. |
13:08
🔗
|
JAA |
But there are two lines for each URL (request + response), and many of them are the infinite loops. |
13:08
🔗
|
klondike |
:( |
13:09
🔗
|
JAA |
Only 32 URLs actually failed. |
13:10
🔗
|
JAA |
At most, that is. |
13:11
🔗
|
JAA |
There are 32 URLs which produced a 302 redirect, but I can't easily check which of these redirected to the max_conn page. |
13:11
🔗
|
JAA |
Probably most of them though. |
13:11
🔗
|
JAA |
Out of over 300k URLs by now, I'd say that's a pretty good ratio. :-) |
13:12
🔗
|
klondike |
I'm thinking, would you prefer that I arrange a dedicated server in Spain for you? |
13:12
🔗
|
klondike |
I called their hosting service yesterday and they no longer offer servers with root access, but I can try to find one somewhere else. |
13:16
🔗
|
JAA |
I don't think it would make too big of a difference. The main bottlenecks are Subcultura's server (some requests take quite a long time) and wpull's HTML parsing, not the network. |
13:17
🔗
|
klondike |
Well I can pay for something with a fast processor too :P |
13:18
🔗
|
klondike |
Not for subcultura but for your side |
13:19
🔗
|
JAA |
I was about to say, Subcultura would certainly appreciate a better server. :-P |
13:21
🔗
|
klondike |
I doubt I can fix anything on that side, but I can ask the admin I'm talking with |
13:26
🔗
|
JAA |
I've switched wpull's HTML parsing to lxml. That should be significantly faster. |
13:28
🔗
|
klondike |
Is there any difference between one and the other? |
13:30
🔗
|
JAA |
html5lib is said to be a bit more robust, but it's pure-Python and therefore really slow. lxml is based on libxml2, i.e. C, and seems to work well enough for almost all cases. |
13:31
🔗
|
JAA |
By the way, I only had to mark 13 URLs as errors; the other 19 were already considered errors because of the redirect loops (I assume). |
13:32
🔗
|
klondike |
Aha |
13:32
🔗
|
klondike |
Well the subcultura codebase is from 2012 or so, not much HTML5 I guess |
13:33
🔗
|
JAA |
lxml should handle HTML5 just fine. I think it's only edge cases and broken markup which *can* lead to errors. |
13:40
🔗
|
klondike |
Ahh |
13:41
🔗
|
klondike |
there might be some broken markup |
13:41
🔗
|
klondike |
Maybe, I really can't say, I found some weird links through httrack |
13:49
🔗
|
JAA |
Yeah, I've seen stuff like http://http://conmaskara.blogspot.com.es/.blogspot.com/ for example. |
13:50
🔗
|
JAA |
But the markup regarding that link is fine. |
13:50
🔗
|
JAA |
(It's from http://subcultura.es/user/florvieja/ ) |
13:54
🔗
|
JAA |
Well, I don't really see a speedup compared to before I switched to lxml. So I guess the HTML parser isn't limiting after all. |
13:56
🔗
|
klondike |
Well the command top may help you see CPU usage there. |
13:57
🔗
|
JAA |
Yeah, I am monitoring CPU usage, but it fluctuates strongly. |
13:57
🔗
|
JAA |
htop > top, by the way. |
13:58
🔗
|
klondike |
I guess I'm too old for all those fancy things like htop :P |
14:00
🔗
|
JAA |
Average load on the machine did go down compared to before, by the way. It's just that the request rate is still about the same. |
14:21
🔗
|
|
REiN^ has quit IRC (Read error: Operation timed out) |
14:41
🔗
|
klondike |
*shrug* |
14:42
🔗
|
* |
klondike mumbles in Spanish |
14:55
🔗
|
|
RichardG has quit IRC (Read error: Connection reset by peer) |
14:57
🔗
|
|
RichardG has joined #archiveteam-bs |
14:59
🔗
|
|
RichardG has quit IRC (Read error: Connection reset by peer) |
15:01
🔗
|
|
RichardG has joined #archiveteam-bs |
15:02
🔗
|
|
RichardG has quit IRC (Read error: Connection reset by peer) |
15:07
🔗
|
Uzerus |
JAA: can you analyze my piece of code? I don't know why my code runs without executing the things inside the if |
15:07
🔗
|
Uzerus |
look, |
15:08
🔗
|
Uzerus |
https://pastebin.com/stdekh5H |
15:09
🔗
|
Uzerus |
it looks like it is not processing the for loop and goes straight to else: |
15:10
🔗
|
schbirid |
which one? |
15:11
🔗
|
Uzerus |
all |
15:11
🔗
|
schbirid |
are your files mpty? |
15:11
🔗
|
schbirid |
empty |
15:11
🔗
|
klondike |
Uzerus: is it python2 or 3? |
15:12
🔗
|
Uzerus |
https://pastebin.com/yK18JEKu for example, python 3 |
15:12
🔗
|
Uzerus |
3.5 |
15:13
🔗
|
klondike |
Ahh |
15:13
🔗
|
Uzerus |
file ignore: empty, file done is not empty |
15:13
🔗
|
klondike |
You see you have one open in mode rt and the other in mode r, true? |
15:13
🔗
|
Uzerus |
but... it can make sense IF the file has been opened as empty and not saved! |
15:14
🔗
|
JAA |
You're also opening the files many, many times instead of just once. |
15:14
🔗
|
Uzerus |
yup JAA, will repair it |
15:15
🔗
|
JAA |
readlines() returns a list of lines, but each line still has the linebreak at the end. Did you account for that? |
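A small illustration of the point about `readlines()`, using an in-memory file so that it runs on its own. The file contents are made up for the example and are not taken from Uzerus's pastebin.

```python
import io

# In-memory stand-in for the ignore file (contents are invented).
ignore_file = io.StringIO("example.com\nexample.org\n")

raw_lines = ignore_file.readlines()
# Every element still ends with "\n", so a bare "example.com" is not found.
found_raw = "example.com" in raw_lines

# Strip the line terminator before comparing.
clean_lines = [line.rstrip("\n") for line in raw_lines]
found_clean = "example.com" in clean_lines

print(repr(raw_lines[0]), found_raw, found_clean)
# prints: 'example.com\n' False True
```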
15:16
🔗
|
Uzerus |
eh, no |
15:16
🔗
|
klondike |
Uzerus: also your execution time will be N*M with N being the input file lines and M the ignore file lines. |
15:17
🔗
|
Uzerus |
I lost ~6 hours figuring out how click works, useless documentation on the website |
15:17
🔗
|
JAA |
Use argparse next time. click seems to be a wrapper around optparse, which should just die already. |
15:17
🔗
|
Uzerus |
klondike: it's a prototype, I want to do regex ignores in the future |
15:18
🔗
|
Uzerus |
not copy & paste from grab-site f.e. |
15:18
🔗
|
klondike |
Oki Uzerus just saying because if the ignore file can fit in memory, a python set will make that O(N+M) |
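A sketch of the complexity difference klondike describes, assuming the ignore entries fit in memory and are compared as exact strings. Both function names are hypothetical.

```python
def filter_with_list(input_lines, ignore_lines):
    """O(N*M): each input line is compared against the whole ignore list."""
    return [line for line in input_lines if line not in ignore_lines]


def filter_with_set(input_lines, ignore_lines):
    """O(N+M) on average: build the set once, then each lookup is a hash probe."""
    ignored = set(ignore_lines)
    return [line for line in input_lines if line not in ignored]
```

Both return the same result; only the cost of each membership test changes. Regex-based ignores, as planned for later, cannot use a set lookup in this way.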
15:20
🔗
|
Uzerus |
JAA: fortunately I have a version of this in argparse, people in Freenode's #python told me "argparse is not the easy way" |
15:22
🔗
|
JAA |
No idea what they mean by that. |
15:23
🔗
|
JAA |
You might have to implement the file existence check yourself, but otherwise, it would be very similar (just function calls instead of decorators, and you have a namespace object in the end instead of individual variables). |
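A minimal sketch of the argparse approach, with the file existence check written by hand as a `type` callback. The option names (`logfile`, `--ignore`, `--gzip`) are assumptions for illustration and do not come from Uzerus's script.

```python
import argparse
import os


def existing_file(path):
    """argparse 'type' callback: reject paths that are not existing files."""
    if not os.path.isfile(path):
        raise argparse.ArgumentTypeError("file not found: {}".format(path))
    return path


def build_parser():
    """Build the parser; option names here are illustrative only."""
    parser = argparse.ArgumentParser(description="Filter a log against ignore lists.")
    parser.add_argument("logfile", type=existing_file, help="input log file")
    parser.add_argument("--ignore", type=existing_file, help="file with lines to skip")
    parser.add_argument("--gzip", action="store_true", help="input is gzip-compressed")
    return parser
```

Calling `build_parser().parse_args()` returns the namespace object mentioned above, so the values are read as `args.logfile`, `args.ignore` and `args.gzip` rather than arriving as separate function parameters.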
15:30
🔗
|
klondike |
JAA: also you have a lot of duplicated code |
15:32
🔗
|
JAA |
Uzerus: ^ |
15:34
🔗
|
klondike |
Uzerus: here you can see a reasonably simple way to have different opening functions https://pastebin.com/FSMDd2jn |
15:34
🔗
|
klondike |
Basically if the interface is the same, assign the function you want to call to a variable and then use the variable to do the call |
15:35
🔗
|
JAA |
You could even do with (gzip.open if gzipp else open)(...) as logfile: but that's harder to read. |
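The trick can be sketched like this. The contents of klondike's pastebin are not reproduced in the log, so this is an independent illustration with a hypothetical `open_log` helper.

```python
import gzip


def open_log(path, gzipped):
    """Open a log file for reading as text, handling gzip if requested."""
    # Both callables accept (path, mode), so either can be bound to `opener`.
    opener = gzip.open if gzipped else open
    return opener(path, "rt")
```

It would be used as `with open_log("input.log.gz", True) as logfile:`, which keeps the rest of the processing code identical for both kinds of input.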
15:35
🔗
|
Uzerus |
hah, klondike, nice code |
15:36
🔗
|
klondike |
Uzerus: nah, it's just one of many small tricks I have learnt over time. |
15:37
🔗
|
Uzerus |
the last thing on my mind is that the file was not saved (and the program is operating on an empty file all the time) |
15:39
🔗
|
klondike |
Uzerus: just cat the file and see |
15:39
🔗
|
klondike |
(Supposing you are running this on a Linux system) |
15:39
🔗
|
klondike |
you can also do this (since it's gzipped) |
15:39
🔗
|
Uzerus |
it saves to the 'done' file but it does not save to the 'out' file |
15:39
🔗
|
klondike |
gzip -dc file |
15:40
🔗
|
Uzerus |
only the input file is gzipped, in read-only mode |
15:40
🔗
|
JAA |
less and zless are also useful. |
15:41
🔗
|
Uzerus |
JAA: i am just learning :D |
15:41
🔗
|
JAA |
Also, zcat if you really want to output the entire file. |
15:41
🔗
|
JAA |
Because remembering options is hard. |
15:41
🔗
|
JAA |
But in my experience, you rarely actually want to print the entire file. less also gives you the advantage of being able to search for stuff etc. |
15:41
🔗
|
klondike |
Uzerus: less and zless are other commands :) |
15:42
🔗
|
Uzerus |
not only that, I never wrote any program in my life, it's a very useful experience, building that app |
15:47
🔗
|
JAA |
klondike: My Subcultura grab is now in the forums and has slowed down to a crawl. :-/ |
15:48
🔗
|
klondike |
I'd say I'm surprised but... :P |
15:48
🔗
|
JAA |
Yeah |
15:48
🔗
|
JAA |
Oh well |
15:57
🔗
|
klondike |
I still would kill for just a full backup of the site even if that meant having to modify their old PHP code to generate the pages locally xD |
16:01
🔗
|
JAA |
Probably wouldn't even have to modify much if anything at all. |
16:01
🔗
|
JAA |
But yeah, if they're willing to give out the data, that's an option in principle. |
16:04
🔗
|
JAA |
data + all relevant information to set up a similar system* |
16:05
🔗
|
klondike |
I don't think they'll do that though |
16:05
🔗
|
klondike |
At least it didn't look like it |
16:14
🔗
|
JAA |
Yeah, I'm not surprised. Most people wouldn't do that. |
16:18
🔗
|
Uzerus |
analyzed my code once more, I have a logic bug.... |
16:21
🔗
|
Uzerus |
heh, mindduck |
16:23
🔗
|
Uzerus |
ewww... how to do that ... |
16:36
🔗
|
|
icedice has joined #archiveteam-bs |
17:03
🔗
|
Uzerus |
is it possible that too many variables (information) can cause a bug? |
17:04
🔗
|
Uzerus |
JAA:? |
17:11
🔗
|
JAA |
Uzerus: You could run out of memory, but other than that, I'd be quite surprised if you managed to break things by creating many variables. |
17:22
🔗
|
klondike |
Unless you overwrite something |
17:22
🔗
|
klondike |
For example in Python 2 you can do True=False and enjoy the mess ;) |
17:23
🔗
|
Uzerus |
WHAT?! |
17:25
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
17:26
🔗
|
|
Kimmer has quit IRC (Read error: Operation timed out) |
17:26
🔗
|
|
icedice has quit IRC (Ping timeout: 250 seconds) |
17:27
🔗
|
Uzerus |
https://stackoverflow.com/questions/19007383/compare-two-different-files-line-by-line-in-python |
17:27
🔗
|
Uzerus |
lol |
17:28
🔗
|
Uzerus |
ok, but i must first process the files by regex, later i can add "files" to this... em... |
17:31
🔗
|
Uzerus |
em... :/ there are so many ways that look like they work! but mostly they do not work in the end xD |
17:32
🔗
|
Uzerus |
but ok, I am gaining knowledge... but why does this not work?! |
17:32
🔗
|
Uzerus |
:/ |
17:33
🔗
|
klondike |
What doesn't work? |
17:37
🔗
|
klondike |
JAA: seems my htcollect also got hit by the max_conn bug |
17:37
🔗
|
Uzerus |
my script, anyway ill rewrite it again |
17:38
🔗
|
klondike |
JAA: Can you share your script? I'd like to make something able to download the comics |
17:38
🔗
|
Uzerus |
ill stay with click instead of argparse |
17:38
🔗
|
klondike |
Uzerus: mind pasting it and saying what's it supposed to do? |
17:39
🔗
|
Uzerus |
https://pastebin.com/stdekh5H |
17:40
🔗
|
klondike |
What options are you calling it with? What is it intended to do? |
17:41
🔗
|
Uzerus |
it should check the ignore file line by line and compare; when it finds an equal line, break and go to the next line from logfile/inputfile... if there is no equal line in ignore, go to done and check there; if found, break, else write to done and to the output file |
17:42
🔗
|
Uzerus |
ignore is for cleaning the list, done is for when we resume some work, or have something to process in parts |
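One possible shape for the behaviour Uzerus describes, written as a sketch and not as a fix for the pastebin script, whose contents are not shown here. It assumes exact line matching, plain-text files, and hypothetical names `load_lines` and `process`. Each file is opened once, and line terminators are stripped before comparison.

```python
def load_lines(path):
    """Read a file into a set of lines without trailing newlines."""
    try:
        with open(path, "rt") as handle:
            return {line.rstrip("\n") for line in handle}
    except FileNotFoundError:
        # A missing ignore or done file is treated as empty.
        return set()


def process(input_path, ignore_path, done_path, output_path):
    """Copy new lines from input to output, skipping ignored and done ones."""
    ignored = load_lines(ignore_path)
    done = load_lines(done_path)
    written = 0
    # Append mode keeps earlier results when a run is resumed.
    with open(input_path, "rt") as source, \
            open(done_path, "at") as done_file, \
            open(output_path, "at") as output:
        for line in source:
            entry = line.rstrip("\n")
            if entry in ignored or entry in done:
                continue
            done.add(entry)
            done_file.write(entry + "\n")
            output.write(entry + "\n")
            written += 1
    return written
```

Running `process` a second time on the same input writes nothing new, because every line is already recorded in the done file.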
17:43
🔗
|
klondike |
And what is the problem? Are entries on done not filtered? |
17:44
🔗
|
|
BartoCH has joined #archiveteam-bs |
18:08
🔗
|
klondike |
Well I have to leave now, I'll answer your questions later. |
18:22
🔗
|
Uzerus |
klondike: the content between if ... else is not being processed |
18:22
🔗
|
|
kvieta has quit IRC (Quit: greedo shot first) |
18:22
🔗
|
|
kvieta- is now known as kvieta |
18:23
🔗
|
Uzerus |
especially in donefile, looks like it always goes to else: continue |
18:25
🔗
|
JAA |
Print out what you're comparing in the condition so you can see why they're never the same. |
18:26
🔗
|
JAA |
By the way, your domainfrom* functions are most likely not doing what you want them to do. |
18:26
🔗
|
JAA |
In particular, look at what the ''.join lines are doing. |
18:27
🔗
|
Uzerus |
they are converting list to string |
18:27
🔗
|
Uzerus |
cos comparing lists is a little... nonsense i think |
18:28
🔗
|
JAA |
Well yeah, they do, but not in the way you think. |
18:29
🔗
|
JAA |
Just print the values after those lines, or of the return value, and you'll see what I mean. |
18:29
🔗
|
JAA |
By the way, a tip regarding printing: print(repr(variable)) can be useful. |
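A short illustration of both hints, using made-up values since the `domainfrom*` functions are not shown in the log. It shows what `''.join` does to a list of strings and why `repr()` is more informative than a bare `print()`.

```python
parts = ["example", "com"]

# ''.join glues the items together with nothing in between.
joined = "".join(parts)
print(repr(joined))  # 'examplecom'

# Joining with "." is what rebuilds a dotted host name from its parts.
dotted = ".".join(parts)
print(repr(dotted))  # 'example.com'

# print() alone cannot tell a string from a number; repr() can.
value_a = "42"
value_b = 42
print(value_a, value_b)              # 42 42
print(repr(value_a), repr(value_b))  # '42' 42
```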
18:30
🔗
|
Uzerus |
but... wait, I will run it with a longer wait time to capture the beginning of the log |
18:31
🔗
|
Uzerus |
script is designed to print always debugging info, ... when processing file and line, print line |
18:31
🔗
|
Kaz |
-ot? |
18:31
🔗
|
|
godane has quit IRC (Ping timeout: 506 seconds) |
18:31
🔗
|
Uzerus |
but it's not printing, lel, maybe cos of multiple with open... as... |
18:31
🔗
|
JAA |
Yeah, this belongs in #archiveteam-ot. |
18:33
🔗
|
JAA |
klondike: https://gist.github.com/JustAnotherArchivist/3e2b6c9e7c276a79a60c137b67a20798 |
18:34
🔗
|
JAA |
That's my current code. This is using wpull 1.2.3, not the more recent 2.0.x version (which is so buggy that it's hardly usable). |
18:35
🔗
|
Uzerus |
i found next logic bug in my script ... |
18:35
🔗
|
JAA |
Uzerus: -ot... |
18:38
🔗
|
JAA |
(klondike, you might want to join #archiveteam-ot if you want to continue that discussion.) |
18:42
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
18:44
🔗
|
|
BartoCH has joined #archiveteam-bs |
18:54
🔗
|
|
octothorp has quit IRC (Read error: Connection reset by peer) |
18:56
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
19:18
🔗
|
|
RichardG has joined #archiveteam-bs |
19:22
🔗
|
|
Atom has quit IRC (Read error: Operation timed out) |
19:50
🔗
|
|
BartoCH has joined #archiveteam-bs |
20:08
🔗
|
|
Kimmer has joined #archiveteam-bs |
20:13
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
20:30
🔗
|
|
jrwr has quit IRC (Max SendQ exceeded) |
20:30
🔗
|
|
Ravenloft has joined #archiveteam-bs |
20:31
🔗
|
|
zyphlar has quit IRC (Read error: Connection reset by peer) |
20:31
🔗
|
|
zyphlar has joined #archiveteam-bs |
20:31
🔗
|
|
jrwr has joined #archiveteam-bs |
20:31
🔗
|
|
Mateon1 has quit IRC (Read error: Operation timed out) |
20:32
🔗
|
|
Mateon1 has joined #archiveteam-bs |
20:32
🔗
|
|
svchfoo1 sets mode: +o jrwr |
21:20
🔗
|
|
REiN^ has joined #archiveteam-bs |
21:32
🔗
|
|
octothorp has joined #archiveteam-bs |
21:52
🔗
|
|
schbirid has joined #archiveteam-bs |
22:04
🔗
|
|
MrDignity has joined #archiveteam-bs |
22:23
🔗
|
|
godane has joined #archiveteam-bs |
23:02
🔗
|
|
PotcFdk has quit IRC (~'o'/) |
23:03
🔗
|
|
ranav has joined #archiveteam-bs |
23:07
🔗
|
|
ranavalon has quit IRC (Read error: Operation timed out) |
23:11
🔗
|
|
BlueMaxim has joined #archiveteam-bs |