#urlteam 2012-12-07,Fri


Time Nickname Message
08:33 🔗 ersi soultcer: Caught a utf8 decode error when submitting the result back to the tracker o_o http://pejsta.nu/1148
08:35 🔗 ersi I guess it's related to me specifying a username for the scraper
08:39 🔗 ersi Okay, snipurl.com seems to have rate limiting as well
08:47 🔗 ersi UnicodeDecodeError: 'utf8' codec can't decode bytes in position 2-4: invalid data
08:47 🔗 ersi so, apparently not from the username :p I'll take a looksie
08:52 🔗 ersi hmm, maybe related to me running this in python2.6
11:51 🔗 ersi Ran it on python2.7 - with issues as well >_>
15:35 🔗 soultcer ersi: Those utf8 errors usually happened when web.py was trying to parse POST data. The gzipped result file is sent via POST and web.py always tried to decode it.
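
To illustrate why decoding a gzipped POST body as text fails, here is a minimal sketch. It is not web.py's code; the payload contents and the helper name `try_decode` are made up for the example. It only shows that gzip output is binary data and is not valid UTF-8.

```python
import gzip

# Hypothetical "code|url" result line, compressed the way a result file might be.
payload = gzip.compress(b"abc|http://example.com/\n")


def try_decode(data):
    """Return None if data is valid UTF-8, else the UnicodeDecodeError raised."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc
    return None


# Every gzip stream starts with the magic bytes 0x1f 0x8b, and 0x8b can never
# start a UTF-8 sequence, so decoding the body as text always fails.
error = try_decode(payload)
print(type(error).__name__)

# Treating the body as bytes and decompressing it first works fine.
print(gzip.decompress(payload).decode("utf-8"))
```
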
15:36 🔗 soultcer What web.py version are you running?
15:39 🔗 ersi soultcer: Seems like I'm running web.py-0.37
15:39 🔗 ersi Are you running a newer/older version?
15:40 🔗 soultcer 0.34
15:40 🔗 ersi Exciting
15:40 🔗 soultcer But if there is a bug in the tracker so it doesn't work with web.py 0.37, I rather fix the bug in the tracker
15:40 🔗 ersi Yeah, that's the sane approach. :-)
15:41 🔗 soultcer Unfortunately the whole tracker is pretty much a big hack, so fixing bugs is hard :D
15:42 🔗 ersi Oh, I'm used to that from my work. Everything is a hack. Especially "Enterprise" software
15:54 🔗 ersi I'm sure snipurls have rate limiting now btw. Gonna investigate that after/if I get this working :)
15:54 🔗 soultcer ersi: Try creating a directory called "files" in the tracker directory
15:54 🔗 ersi Alright
15:55 🔗 soultcer The utf8 error was just shown because web.py caught some kind of exception and tried to display debug information about that
15:57 🔗 ersi Ugly, but fair enough. I created the directory, I'll see if it works better now :)
15:57 🔗 ersi Well, yes and no. It created a tmpfile there. But it still hands off 500's
15:59 🔗 ersi Ah, and the tmpfile is the actual list of codes|urls - gzipped.
15:59 🔗 soultcer Run the app with debug disabled so it just shows the original exception (web.config.debug = False in tracker.py)
15:59 🔗 ersi Will do
16:00 🔗 ersi I captured the traffic with tcpdump btw, if that'd be of any help - meaning we got the HTTP PUT and all
16:01 🔗 soultcer The exception will probably be more useful
16:03 🔗 ersi It'll come any second now (I'm too lazy to create smaller tasks.. currently chunks of 50 codes)
16:03 🔗 soultcer Hehe when I debug test I simply modify tinyback to only do 10 lookups and then stop. (Just don't forget to remove that "feature" before you run it on the real tracker again)
16:05 🔗 ersi Heh, d'oh! :-)
16:06 🔗 ersi On a side note; I've been thinking about "How to kick start a web crawl" and this is a great source (ie output from URL shorteners) of seed urls.. You get quite the randomness
16:07 🔗 ersi Here we go! Finally, an exception! http://pejsta.nu/1149
16:07 🔗 ersi woot
16:08 🔗 soultcer Yes, lot of randomness but also probably a lot of spam sites. Spammers love URL shorteners
16:09 🔗 ersi Hehe, since I don't specify a username - I guess I trigger this issue ^_^
16:09 🔗 ersi data variable seems to be defined in 'if username'
16:09 🔗 soultcer yeah, you can probably simply move the data = outside the if and it will work
16:10 🔗 ersi Or should the last line be within the first if?
16:10 🔗 ersi ie. db.update() call
16:12 🔗 soultcer No, it should be outside the if
16:12 🔗 soultcer Inside the if is the code that does the stats for every user
16:12 🔗 soultcer Outside is the code that does the stats for the url shorteners
16:13 🔗 soultcer If a downloader doesn't have a username, we still want to count his task in the url shortener statistic
16:13 🔗 ersi Yay, now it works.
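
The bug discussed here can be sketched roughly as follows. This is not the actual tracker code: `FakeDB`, `finish_task_buggy`, `finish_task_fixed`, and the table and column names are all hypothetical. The sketch only reproduces the pattern of a variable assigned inside `if username` and used after it.

```python
class FakeDB:
    """Hypothetical stand-in for the tracker's database handle."""

    def __init__(self):
        self.calls = []

    def update(self, table, **values):
        # Record the call instead of touching a real database.
        self.calls.append((table, values))


def finish_task_buggy(db, service, username, count):
    if username:
        data = {"finished": count}  # only assigned when a username is given
        db.update("user_stats", username=username, **data)
    # Raises UnboundLocalError when username is empty or None.
    db.update("service_stats", service=service, **data)


def finish_task_fixed(db, service, username, count):
    data = {"finished": count}  # moved outside the if
    if username:
        # Per-user statistics only make sense when there is a username.
        db.update("user_stats", username=username, **data)
    # Per-shortener statistics are counted for every finished task.
    db.update("service_stats", service=service, **data)
```

With the assignment outside the `if`, an anonymous task still updates the shortener statistics while the per-user update is skipped.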
16:14 🔗 soultcer Man, luckily I put the whole thing into a single DB transaction. This means all contributions by downloaders without a username were rejected, but at least the task would be reassigned.
16:14 🔗 soultcer Without the transaction, all work done by unassigned users would have been lost with no way for me to know which had been lost :-/
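
The point about the single transaction can be illustrated with a small sketch using Python's `sqlite3` module. The real tracker's database and schema are not shown in this log, so the tables, the `submit` helper, and the simulated failure are all assumptions. The idea is only that when a later step raises, the earlier writes are rolled back and the task stays assigned.

```python
import sqlite3


def make_db():
    """Create a hypothetical in-memory schema with one assigned task."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT)")
    conn.execute("CREATE TABLE results (task_id INTEGER, code TEXT, url TEXT)")
    conn.execute("INSERT INTO tasks (id, status) VALUES (1, 'assigned')")
    conn.commit()
    return conn


def submit(conn, task_id, rows, fail=False):
    """Store results and mark the task done, all in one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO results (task_id, code, url) VALUES (?, ?, ?)",
                [(task_id, code, url) for code, url in rows],
            )
            conn.execute(
                "UPDATE tasks SET status = 'done' WHERE id = ?", (task_id,)
            )
            if fail:
                # Simulates a bug in a later step, such as the statistics update.
                raise RuntimeError("simulated bug in the statistics step")
        return True
    except RuntimeError:
        return False


def task_status(conn, task_id):
    row = conn.execute("SELECT status FROM tasks WHERE id = ?", (task_id,))
    return row.fetchone()[0]


def result_count(conn):
    return conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
```

After a failed `submit`, the task is still marked as assigned and no partial results are stored, so it can be handed out again.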
16:15 🔗 * ersi shrugs
16:45 🔗 soultcer ersi: What kind of rate limiting did you find btw?
16:50 🔗 ersi soultcer: I've only concluded that there is rate limiting - which blocks/holds requests for a period - then it resumes as usual
16:50 🔗 ersi I'll investigate the rate soon
16:54 🔗 soultcer Also, snipurl hands out URLs sequentially, but they did not start with "0" as first ID, but apparently with "20wa5rt"
16:54 🔗 ersi haha, wtf
16:54 🔗 ersi So.. what would the next logical code be, after "20wa5rt"? :D
16:55 🔗 ersi "20wa5ru"?
16:56 🔗 soultcer In that case yes, since they do numbers before letters
16:56 🔗 soultcer And they never assign a code with -_~ in it, even though it would be a valid code
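
A minimal sketch of "numbers before letters" sequential codes, assuming an alphabet of the digits 0-9 followed by lowercase a-z. The log does not say whether uppercase letters take part, and `ALPHABET` and `next_code` are names invented for this example.

```python
# Assumed ordering: digits first, then lowercase letters. The characters
# "-", "_" and "~" are left out because they are never assigned sequentially.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def next_code(code):
    """Return the code that follows `code`, counting like an odometer."""
    chars = list(code)
    position = len(chars) - 1
    while position >= 0:
        index = ALPHABET.index(chars[position])
        if index + 1 < len(ALPHABET):
            # No carry needed: bump this character and stop.
            chars[position] = ALPHABET[index + 1]
            return "".join(chars)
        # Last symbol of the alphabet: wrap to the first and carry left.
        chars[position] = ALPHABET[0]
        position -= 1
    # Every position wrapped around, so the code grows by one character.
    return ALPHABET[1] + "".join(chars)


print(next_code("20wa5rt"))  # 20wa5ru
```
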
16:59 🔗 ersi Huh
16:59 🔗 ersi Hm, maybe we should setup GitHub Notifications for tinyarchive to here as well
17:04 🔗 ersi oh, hehe. I pushed up a branch with the same fix :p I'll just delete it
17:05 🔗 ersi soultcer: If you write "closes issue #1" or "fixes issue #1" in your commit message, github will automatically close the issue for you btw
17:05 🔗 soultcer Ah
17:06 🔗 soultcer I actually made the commit before you opened the issue, I just forgot to push to github
17:06 🔗 ersi ah :)
17:06 🔗 ersi Hmm, how do I delete a remote branch?
17:06 🔗 soultcer git push <remote> :<branch-to-delete>
17:07 🔗 ersi as in; "git push origin :fixing-issue-1" (if that's my branch name) ?
17:07 🔗 soultcer Yes
17:07 🔗 ersi Cool
17:08 🔗 soultcer I should have just let you make the commit, after all it's your bug and your fix
17:08 🔗 soultcer I just went through the tracker.tinyarchive.org logs, and apparently nobody ever tried to submit a task with no username
17:09 🔗 ersi cool :D
17:09 🔗 ersi That, or it ain't get logged?
17:09 🔗 soultcer Nah, every exception gets logged
17:09 🔗 ersi Regarding the bug fix, don't worry about it. I'll hang on and fix more of these :-)
17:10 🔗 soultcer There are tons of more stupid bugs in both the tinyback and the tracker code
17:10 🔗 ersi What are you running tinyarchive under btw? Apache with mod_wsgi? Nginx with wsgi/gunicorn?
17:10 🔗 soultcer If you want to do something more fun you can also play around with other stuff like making the graphs more useful
17:10 🔗 ersi I'm not much of a UI guy :)
17:10 🔗 ersi Besides, I got to report an issue - that's always fun \o/
17:11 🔗 ersi I'll get to investigating snipurl's rate limiting now.
17:18 🔗 ersi How do I 'checkout' a branch I don't have locally, but is available remotely?
17:21 🔗 ersi By the way: 2012-12-07 18:20:20,659 tinyback.Reaper DEBUG: Fetching code 6~, try 1
17:21 🔗 ersi 2012-12-07 18:20:43,827 tinyback.Reaper DEBUG: Code 6~ leads to URL 'http://ganga-japan.com/unsecured-payday-loans-immediate-cash-without-collateral.html'
17:21 🔗 ersi Seems like they do use ~ though :)
17:35 🔗 soultcer Only if you select it as custom "nickname"
17:38 🔗 ersi Oh, I see
22:41 🔗 deathy yay...got tinyback running also on my raspberry pi, now sleep time..
22:51 🔗 chronomex heh
22:51 🔗 chronomex cool
