[08:33] soultcer: Caught an utf8 decode error when submitting the result back to the tracker o_o http://pejsta.nu/1148
[08:35] I guess it's related to me specifying an username for the scraper
[08:39] Okay, snipurl.com seems to have rate limiting as well
[08:47] UnicodeDecodeError: 'utf8' codec can't decode bytes in position 2-4: invalid data
[08:47] so, apparently not from the username :p I'll take a looksie
[08:52] hmm, maybe related to me running this in python2.6
[11:51] Ran it on python2.7 - with issues as well >_>
[15:35] ersi: Those utf8 errors usually happened when web.py was trying to parse POST data. The gzipped result file is sent via POST and web.py always tried to decode it.
[15:36] What web.py version are you running?
[15:39] soultcer: Seems like I'm running web.py-0.37
[15:39] Are you running a newer/older version?
[15:40] 0.34
[15:40] Exciting
[15:40] But if there is a bug in the tracker so it doesn't work with web.py 0.37, I rather fix the bug in the tracker
[15:40] Yeah, that's the sane approach. :-)
[15:41] Unfortunately the whole tracker is pretty much a big hack, so fixing bugs is hard :D
[15:42] Oh, I'm used to that from my work. Everything is a hack. Especially "Enterprise" software
[15:54] I'm sure snipurls have rate limiting now btw. Gonna investigate that after/if I get this working :)
[15:54] ersi: Try creating a directory called "files" in the tracker directory
[15:54] Alright
[15:55] The utf8 error was just shown because web.py caught some kind of exception and tried to display debug information about that
[15:57] Ugly, but fair enough. I created the directory, I'll see if it works better now :)
[15:57] Well, yes and no. It created a tmpfile there. But it still hands off 500's
[15:59] Ah, and the tmpfile is the actual list of codes|urls - gzipped.
[15:59] Run the app with debug disabled so it just shows the original exception (web.config.debug = False in tracker.py)
[15:59] Will do
[16:00] I captured the traffic with tcpdump btw, if that'd be of any help - meaning we got the HTTP PUT and all
[16:01] The exception will probably be more useful
[16:03] It'll come any second now (I'm too lazy to create smaller tasks.. currently chunks of 50 codes)
[16:03] Hehe when I debug test I simply modify tinyback to only do 10 lookups and then stop. (Just don't forget to remove that "feature" before you run it on the real tracker again)
[16:05] Heh, d'oh! :-)
[16:06] On a side note; I've been thinking about "How to kick start a web crawl" and this is a great source (ie output from URL shorteners) of seed urls.. You get quite the randomness
[16:07] Here we go! Finally, an exception! http://pejsta.nu/1149
[16:07] woot
[16:08] Yes, lot of randomness but also probably a lot of spam sites. Spammers love URL shorteners
[16:09] Hehe, since I don't specify a username - I guess I trigger this issue ^_^
[16:09] data variable seems to be defined in 'if username'
[16:09] yeah, you can probably simply move the data = outside the if and it will work
[16:10] Or should the last line be within the first if?
[16:10] ie. db.update() call
[16:12] No, it should be outside the if
[16:12] Inside the if is the code that does the stats for every user
[16:12] Outside is the code that does the stats for the url shorteners
[16:13] If a downloader doesn't have a username, we still want to count his task in the url shortener statistic
[16:13] Yay, now it works.
[16:14] Man, luckily I put the whole thing into a single DB transaction. This means all contributions by downloaders with username were rejected, but at least the task would be reassigned.
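The "data defined inside 'if username'" bug discussed above can be sketched roughly as follows. This is only an illustration of the pattern, not the tracker's actual code: the function names, the arguments, and the plain dicts standing in for the database tables are all hypothetical.

```python
def record_task_buggy(username, service, count, user_stats, service_stats):
    """Hypothetical reconstruction of the bug: 'data' only exists if a username was given."""
    if username:
        data = {"service": service, "count": count}
        # Per-user statistics (only for downloaders that sent a username).
        user_stats[username] = user_stats.get(username, 0) + data["count"]
    # Per-shortener statistics: 'data' is unbound here when username is empty,
    # so this line raises UnboundLocalError for anonymous downloaders.
    service_stats[data["service"]] = service_stats.get(data["service"], 0) + data["count"]


def record_task_fixed(username, service, count, user_stats, service_stats):
    """Same logic with the 'data =' assignment moved outside the if."""
    data = {"service": service, "count": count}
    if username:
        # Per-user statistics stay inside the if.
        user_stats[username] = user_stats.get(username, 0) + data["count"]
    # Per-shortener statistics are updated for every task, with or without a username.
    service_stats[data["service"]] = service_stats.get(data["service"], 0) + data["count"]
```

With the fixed variant, a task submitted without a username still counts towards the shortener statistic and simply skips the per-user part.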
[16:14] Without the transaction, all work done by unassigned users would have been lost with no way for me to know which had been lost :-/
[16:15] * ersi shrugs
[16:45] ersi: What kind of rate limiting did you find btw?
[16:50] soultcer: I've only concluded that there is rate limiting - which block/holds request for a period - then it resumes as usual
[16:50] I'll investigate the rate soon
[16:54] Also, snipurl hands out URLs sequentially, but they did not start with "0" as first ID, but apparently with "20wa5rt" apparently
[16:54] haha, wtf
[16:54] So.. what would the next logical code be, after "20wa5rt"? :D
[16:55] "20wa5ru"?
[16:56] In that case yes, since they do numbers before letters
[16:56] And they never assign a code with -_~ in it, even though it would be a valid code
[16:59] Huh
[16:59] Hm, maybe we should setup GitHub Notifications for tinyarchive to here as well
[17:04] oh, hehe. I pushed up a branch with the same fix :p I'll just delete it
[17:05] soultcer: If you write "closes issue #1" or "fixes issue #1" in your commit message, github will automatically close the issue for you btw
[17:05] Ah
[17:06] I actually made the commit before you opened the issue, I just forgot to push to github
[17:06] ah :)
[17:06] Hmm, how do I delete a remote branch?
[17:06] git push :
[17:07] as in; "git push origin :fixing-issue-1" (if that's my branch name) ?
[17:07] Yes
[17:07] Cool
[17:08] I should have just let you made the commit, after all it's your bug and your fix
[17:08] I just went through the tracker.tinyarchive.org logs, and apparently nobody ever tried to submit a task with no username
[17:09] cool :D
[17:09] That, or it ain't get logged?
[17:09] Nah, every exception gets logged
[17:09] Regarding the bug fix, don't worry about it. I'll hang on and fix more of these :-)
[17:10] There are tons of more stupid bugs in both the tinyback and the tracker code
[17:10] What are you running tinyarchive under btw? Apache with mod_wsgi? Nginx with wsgi/gunicorn?
[17:10] If you want to do something more fun you can also play around with other stuff like making the graphs more useful
[17:10] I'm not much of a UI guy :)
[17:10] Besides, I got to report an issue - that's always fun \o/
[17:11] I'll get to investigating snipurl's rate limiting now.
[17:18] How do I 'checkout' a branch I don't have locally, but is available remotely?
[17:21] By the way: 2012-12-07 18:20:20,659 tinyback.Reaper DEBUG: Fetching code 6~, try 1
[17:21] 2012-12-07 18:20:43,827 tinyback.Reaper DEBUG: Code 6~ leads to URL 'http://ganga-japan.com/unsecured-payday-loans-immediate-cash-without-collateral.html'
[17:21] Seems like they do use ~ though :)
[17:35] Only if you select it as custom "nickname"
[17:38] Oh, I see
[22:41] yay...got tinyback running also on my raspberry pi, now sleep time..
[22:51] heh
[22:51] cool
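The "what comes after 20wa5rt" question from earlier in the log can be sketched as a simple increment over an ordered alphabet. The alphabet here is an assumption based only on what was said in the chat (numbers before letters, no -_~ in sequentially assigned codes); lowercase-only letters and the name next_code are guesses made for the example, not anything confirmed about snipurl or tinyback.

```python
import string

# Assumed ordering: digits first, then lowercase letters; -_~ never assigned sequentially.
ALPHABET = string.digits + string.ascii_lowercase


def next_code(code, alphabet=ALPHABET):
    """Return the code that follows 'code' when counting in the given alphabet."""
    chars = list(code)
    i = len(chars) - 1
    while i >= 0:
        pos = alphabet.index(chars[i])  # raises ValueError for characters outside the alphabet
        if pos + 1 < len(alphabet):
            # No carry needed: bump this position and stop.
            chars[i] = alphabet[pos + 1]
            return "".join(chars)
        # Last symbol of the alphabet: wrap to the first symbol and carry to the left.
        chars[i] = alphabet[0]
        i -= 1
    # Every position overflowed, so the code grows by one character.
    return alphabet[1] + "".join(chars)


print(next_code("20wa5rt"))
```

Custom "nickname" codes such as 6~ fall outside this alphabet and would need to be handled separately.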