[00:09] --amend ;)
[00:10] I think the warrior does a git pull, so it will try to awkwardly merge commits when I overwrite them.
[00:10] And it is usually a bad idea anyway to change already published git history
[00:26] oh, yeah, okay
[08:06] soultcer: Nice that you added the leaderboard. :)
[19:50] soultcer: So, if I'm interested in writing a scraper - do you have any tips?
[19:53] Hmm, I guess it's hard to find a shortener that has predictable URLs
[20:03] Snipurl.com supposedly has 6,715,253,742 URLs
[20:17] Alright, seems I've got the basic structure, I guess
[20:19] when is the next urlteam .torrent coming out?
[20:20] According to the site, at the end of 2012
[20:22] soultcer: What's running @ tracker.tinyarchive.org/v1/? Is that checked into tinyback? I guess it isn't :o
[21:30] iirc we haven't done a lot of shortener scraping this year
[21:30] swebb was crazy busy, he unshortened every link from the twitter spritzer feed (2% of all twitter posts)
[21:31] I'm just interested in A) how does the tracker work? and B) what does it take to "support a scraper", really
[21:31] Mmmh
[21:31] ersi: The tracker code lives in a separate repository, gimme a sec, I'll put it on GitHub
[21:31] sure thang
[21:31] Yea, well, one of my computers was crazy busy. :)
[21:31] any info on how to set up the script (not the warrior)? After cloning, is it run.py or something else? And what about parameters like the tracker and others?
[21:32] I've written a very basic skeleton for http://snipurl.com, which looks similar to ur1.ca
[21:33] 195M unwound URLs in my Twitter DB as of this morning.
[21:33] woah :)
[21:33] 16.4GB of MySQL storage space.
[21:34] Uncompressed, of course.
[21:34] I've only filled up that much database storage by generating fake test data
[21:34] david@kat:/media/tinyarchive/database/data$ ls -lh | grep -E '(bitly|tinyurl)'
[21:34] -rw-r----- 1 david david 95G Dec 6 21:34 tinyurl.db
[21:34] -rw-r----- 1 david david 265G Dec 6 21:34 bitly.db
[21:34] That's a lot!
[21:35] BerkeleyDB with B-tree indexes
[21:36] Yea, I used to use those too. Much nicer than flat files.
[21:36] ersi: https://github.com/ArchiveTeam/tinyarchive
[21:36] hm, not a bad idea
[21:36] Is there anything that I can do/take over re urlteam work?
[21:36] much better than the flat files we were using, yes?
[21:37] chronomex: About 10 times the size of the compressed .txt.xz files
[21:37] For working with them: great. For distributing: bad.
[21:37] well, .xz crunches the text files down to about 2-5% of their size
[21:37] so doesn't sound too bad
[21:38] right, but you can run an extract from the db
[21:38] yes?
[21:38] I think it's 25% for the shortener URLs. The rest is index overhead from the B-trees
[21:38] aye
[21:38] wait, what?
[21:38] I'm not parsing that correctly
[21:38] Holy fuck @ 350GB+
[21:38] .txt.xz files: 50 GB, .txt files: 200 GB, .db files: 500 GB
[21:39] Yea, mine will only be a couple of GB compressed.
[21:39] ah, I see
[21:39] ersi: Help is always welcome. If you know a little bit of Python, writing an additional scraper will be easy. Did you look at the services.py file in the tinyback repository?
[21:39] soultcer: I've written one for snipurl.com
[21:40] Cool, can you git push it somewhere/post a diff?
[21:40] sure
[21:41] (I'll try to move the tinyback repo to the archiveteam organization so that in the future everyone here can just push to it. It's just a bit complicated because the warrior pulls directly from the repo)
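
For reference, a rough sketch of what a scraper service along these lines might look like. The real interface lives in tinyback's services.py and is not quoted in this log; CodeBlockedException is a real tinyback exception name (it shows up in the commit messages later on), but every other name, method, and status-code choice below is an illustrative guess, not tinyback's actual API.

    import requests

    class CodeBlockedException(Exception):
        """Shortener refuses to resolve this code (e.g. password-protected)."""

    class SnipurlSketch:
        # Actual snipurl charset, as established later in this log
        charset = "0123456789abcdefghijklmnopqrstuvwxyz-_~"

        def fetch(self, code):
            """Resolve one short code; return the long URL, or None if
            the code does not exist."""
            resp = requests.get("http://snipurl.com/" + code,
                                allow_redirects=False, timeout=30)
            if resp.status_code in (301, 302):
                return resp.headers["Location"]
            if resp.status_code == 404:
                return None
            # Anything else (e.g. a password-protected link) is treated
            # as blocked; the exact status codes are service-specific.
            raise CodeBlockedException("code %r: HTTP %d"
                                       % (code, resp.status_code))
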
[21:42] I guess you could add me to your repo and I could push up the branch
[21:43] so... any help with running the script? :)
[21:43] deathy: Which script?
[21:43] deathy: Do you have seesaw-kit?
[21:43] just `./run-pipeline tinyback/pipeline.py deathy` - simple as that
[21:43] yep, on the machine where I ran the webshots script :)
[21:44] ersi: Okay, you should now be able to push. I didn't even know GitHub let you do that on unpaid accounts :D
[21:45] Yeah, it's pretty nifty :)
[21:45] ok... so no running two pipelines at the same time with the default address/port? (also have dailybooth running there right now)
[21:47] deathy: You can use --disable-web-server to disable the webserver or --port to change the port of the webserver (try ./run-pipeline --help for more options)
[21:47] deathy: add ./run-pipeline --port or --disable-web-server
[21:47] deathy: `run-pipeline` without any commands or with --help for all options >_>
[21:49] thanks. seems to have started ok now
[21:49] wat, "You can't push to git://github.com/soult/tinyback.git"
[21:49] I guess.. unpaid accounts can't share then
[21:49] unless you fork and pull
[21:49] Your github account is named "ersi", right?
[21:49] isn't that the "Git Read-Only" link?
[21:50] oh, lol
[21:50] correct
[21:50] There we go
[21:51] [tinyback] ersi created add-snipurl-service (+1 new commit): https://github.com/soult/tinyback/commit/c09f2289eb13
[21:51] tinyback/add-snipurl-service c09f228 Erik Simmesgård: Initial work on adding a scraper service for snipurl
[21:51] \o/
[21:51] I explicitly disabled the github shorturl service for the IRC bot :D
[21:52] :D
[21:55] I maybe should write that down somewhere; here is how I usually add new services:
[21:55] a) Check the service, find out the charset
[21:55] b) See how it redirects, and how it shows that a URL doesn't exist
[21:55] c) Bombard the shortener with random URLs to see how it handles too many requests (will it block the scraper like bit.ly does, will it throw a specific HTTP error, etc.)
[21:56] Makes sense. I've done very little of a), most of b), but none of c) so far
[21:57] d) Find out how it handles spam links. Most shorteners have a page like this: ow.ly/aU90
[21:58] ersi: Doing a) and b) is already a lot of work, especially since none of those things have adequate documentation
[21:58] Yeah
[21:58] ersi: Did you look at the test-definitions folder? I use it to write simple "tests" that verify that the shortener does not suddenly change its API/HTTP responses
[21:59] Yeah, I added one for snipurl.com with a few random ones. I saw your more exhaustive ones for bitly and owly - good inspiration
[22:00] Haha, I'm an idiot, wondering why I didn't find that file - because I forgot to git checkout the snipurl branch :D
[22:00] :D
[22:00] yeah, branches that are pushed to remotes confuse the fuck out of me
[22:06] Haha, snipurl guys are idiots. They allow you to specify a password for links. Once a password has been set for any long URL, no other short URL without a password can be created for the same URL
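
The a)-c) routine above boils down to a throwaway probe script. A minimal sketch, assuming the requests library and snipurl's charset; the codes fired off here are random junk, and the point is simply to watch the status codes, Location headers, and how the service reacts to a burst of requests:

    import random
    import requests

    # Probe for steps a)-c): fire random codes at the shortener and watch
    # how it answers (redirect, not-found, or rate-limit behaviour).
    CHARSET = "0123456789abcdefghijklmnopqrstuvwxyz-_~"

    def probe(code):
        resp = requests.get("http://snipurl.com/" + code,
                            allow_redirects=False, timeout=30)
        return resp.status_code, resp.headers.get("Location")

    if __name__ == "__main__":
        for _ in range(20):
            code = "".join(random.choice(CHARSET) for _ in range(5))
            print(code, probe(code))
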
[22:12] [tinyback] soult pushed 2 new commits to add-snipurl-service: https://github.com/soult/tinyback/compare/c09f2289eb13...ba5438d766cd
[22:12] tinyback/add-snipurl-service 2371ced David Triendl: services.Snipurl: Throw CodeBlockedException on links that require a private key
[22:12] tinyback/add-snipurl-service ba5438d David Triendl: services.Snipurl: ~ is valid character
[22:13] ersi: I extended the code a little bit: it now handles short URLs that require a "private key" as "not found"
[22:15] any easy way to see how many tasks I've sent to the tracker? The tracker page only shows the top 10...
[22:17] soultcer: cool
[22:17] deathy: no
[22:19] sqlite> SELECT name, count FROM statistics JOIN service ON service_id = id WHERE username = 'deathy';
[22:19] bitly|39
[22:19] isgd|20
[22:19] owly|50
[22:19] tinyurl|52
[22:19] ooh, do me :)
[22:19] I think I should add an option to show all users
[22:19] isgd|1216
[22:19] bitly|1187
[22:19] klam|235
[22:19] owly|1508
[22:19] tinyurl|1402
[22:19] Awesome.
[22:20] thanks. good to know it's really working ok and not just sending 1-2 items :)
[22:22] Okay, so the next step in scraping a URL shortener is adding it to the main branch and increasing the version number.
[22:23] Clients always submit the version number when fetching a task, so that the tracker can tell outdated clients to update
[22:24] I initiated a pull request on GitHub between the branches
[22:24] If you feel it's good to go, shoot - else I'll keep working on it until it's better
[22:25] I'd like to take it for a test spin though - is it easy to do? I guess it's "just" creating some tasks with create_task.py and running tracker.py?
[22:26] Yeah, simply create the database (cat schema.sql | sqlite3 tasks.sqlite), run ./tracker.py (requires web.py) and use create_task.py (don't forget to change the tracker URL there too)
[22:28] * ersi nods
[22:29] hmm, should it background immediately?
[22:30] Though you can also do it without the tracker: http://pastebin.com/GxuNiQaY
[22:30] ersi: Ah right, change the last line in the tracker to this:
[22:30] if __name__ == "__main__":
[22:30]     app.run()
[22:30] Gotcha ;-)
[22:30] This will enable the built-in webserver. Otherwise it needs to be called from WSGI
[22:31] yeah
[22:31] I've messed around with Flask previously
[22:33] Hm, weird. sqlite3 didn't like the service table from schema.sql
[22:33] I made some schema changes when I added fancy graphs for the tracker, maybe I messed something up
[22:34] I'll check it out
[22:34] "finished_tasks_count INTEGER NOT NULL DEFAULT 0," should not have a comma at the end
[22:34] oh
[22:34] haha, yeah - just saw that
[22:35] When you run the task, be sure to enable debug output; it gives some insight into how snipurl creates short URLs
[22:40] ersi: Did it work?
[22:41] I'm trying to find out how to actually create tasks :)
[22:42] The create_task file is a bit complicated, because I use it to split up a long range (say 00000-zzzzz) into small tasks that only have 600 codes each
[22:43] So something like sequence_from_to(tracker, "snipurl", "abcde....z-_~", "2500000", "250zzzz", 600) should do
[22:46] Oh my
[22:46] sequence_generator sure didn't like that
[22:47] Exception handling combined with nice error messages is for pussies
[22:48] It can't find a substring, which I find absurd.
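
The per-user query shown above is easy to wrap into the "show all users" option soultcer mentions. A small sketch against the tracker's tasks.sqlite, assuming only the statistics/service schema that the query itself implies:

    import sqlite3
    import sys

    # Wraps the per-user statistics query from the log above; run with no
    # argument to list every user. Assumes statistics(username, service_id,
    # count) and service(id, name), as implied by that query.
    def stats(db_path, username=None):
        conn = sqlite3.connect(db_path)
        query = ("SELECT username, name, count FROM statistics "
                 "JOIN service ON service_id = id")
        if username is None:
            rows = conn.execute(query + " ORDER BY username, name")
        else:
            rows = conn.execute(query + " WHERE username = ?", (username,))
        for user, service, count in rows:
            print("%s: %s|%d" % (user, service, count))
        conn.close()

    if __name__ == "__main__":
        stats("tasks.sqlite", sys.argv[1] if len(sys.argv) > 1 else None)
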
[22:49] http://pejsta.nu/1147
[22:50] Ah, I successfully created a task now
[22:50] You said "using the charset "abc", create a sequence from "1" to "10", split into 10 tasks"
[22:50] But there is no 1 or 0 in the charset, so how should it know how to increment from a "1"?
[22:50] True that.
[22:51] Oh, so that's why your line didn't work either? '2', '5' and '0' weren't in the charset?
[22:52] Oh, right, my bad
[22:52] I'm starting to get the hang of it now :)
[22:52] 0123456789abcdefghijklmnopqrstuvwxyz-_~ <-- actual charset of snipurl
[22:53] Indeed
[22:53] I guess it doesn't help that it's 23:53 and I'm sleepy :D
[22:56] Hehe
[22:56] Now you know most of the tracker/tinyback stuff anyway
[22:56] I'll keep a scraper running overnight to see what happens, and then I can merge your code tomorrow
[22:59] * soultcer is off to bed
[22:59] Nighty :)
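
The charset confusion above falls out naturally once you treat a code as a base-N number over the service's charset, which is essentially what sequence_from_to has to do when it splits a range into 600-code tasks. A standalone sketch of that idea, not the real create_task.py code:

    # Treat each code as a base-N number over the charset and split the
    # range into fixed-size tasks. Standalone sketch, not create_task.py.
    CHARSET = "0123456789abcdefghijklmnopqrstuvwxyz-_~"

    def code_to_int(code, charset=CHARSET):
        n = 0
        for char in code:
            # str.index raises ValueError for a code like "1" against
            # charset "abc" - the exact failure discussed above.
            n = n * len(charset) + charset.index(char)
        return n

    def int_to_code(n, length, charset=CHARSET):
        code = ""
        for _ in range(length):
            n, digit = divmod(n, len(charset))
            code = charset[digit] + code
        return code

    def split_range(start, stop, task_size, charset=CHARSET):
        """Yield (first, last) code pairs covering start..stop inclusive."""
        lo, hi = code_to_int(start, charset), code_to_int(stop, charset)
        for first in range(lo, hi + 1, task_size):
            last = min(first + task_size - 1, hi)
            yield (int_to_code(first, len(start), charset),
                   int_to_code(last, len(start), charset))

    if __name__ == "__main__":
        import itertools
        # First few tasks for the snipurl range discussed in the log above
        for task in itertools.islice(split_range("2500000", "250zzzz", 600), 3):
            print(task)
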