#urlteam 2012-12-06,Thu


Time Nickname Message
00:09 πŸ”— chronomex --amend ;)
00:10 πŸ”— soultcer I think the warrior does a git pull, so it will try to awkwardly merge commits when I overwrite them.
00:10 πŸ”— soultcer And it is usually a bad idea anyways to change already published git history
00:26 πŸ”— chronomex oh, yeah, okay
08:06 πŸ”— ersi soultcer: Nice that you added a Leaderboard. :)
19:50 πŸ”— ersi soultcer: So, if I'm interested in writing a scraper - do you have any tips?
19:53 πŸ”— ersi Hmm, I guess it's hard to find a shortener which has predictable URLs
20:03 πŸ”— ersi Snipurl.com supposedly has 6,715,253,742 URLs
20:17 πŸ”— ersi Alright, seems I got the basic structure I guess
20:19 πŸ”— deathy when is the next urlteam .torrent coming out?
20:20 πŸ”— ersi According to the site, at the end of 2012
20:22 πŸ”— ersi soultcer: What's running @ tracker.tinyarchive.org/v1/? Is that checked into tinyback? I guess it isn't :o
21:30 πŸ”— chronomex iirc we haven't done a lot of shortener scraping this year
21:30 πŸ”— soultcer swebb was crazy busy, he unshortened every link from the twitter spritzer feed (2% of all twitter posts)
21:31 πŸ”— ersi I'm just interested in A) How does the tracker work? B) What does it take to 'support a scraper', really?
21:31 πŸ”— ersi Mmmh
21:31 πŸ”— soultcer ersi: The tracker code lives in a separate repository, gimme a sec I'll put it on github
21:31 πŸ”— ersi sure thang
21:31 πŸ”— swebb Yea, well one of my computers was crazy busy. :)
21:31 πŸ”— deathy any info on how to set up the script (not warrior)? After cloning, is it run.py or something else? And what about parameters like the tracker and others?
21:32 πŸ”— ersi I've written the very basic skeleton for http://snipurl.com which looks similar to ur1.ca
21:33 πŸ”— swebb 195M unwound urls in my twitter DB as of this morning.
21:33 πŸ”— ersi woah :)
21:33 πŸ”— swebb 16.4GB of mysql storage space.
21:34 πŸ”— swebb uncompressed of course.
21:34 πŸ”— ersi I've only filled up that much database storage by generating fake test data
21:34 πŸ”— soultcer david@kat:/media/tinyarchive/database/data$ ls -lh | grep -E '(bitly|tinyurl)'
21:34 πŸ”— soultcer -rw-r----- 1 david david 265G Dec 6 21:34 bitly.db
21:34 πŸ”— soultcer -rw-r----- 1 david david 95G Dec 6 21:34 tinyurl.db
21:34 πŸ”— swebb That's a lot!
21:35 πŸ”— soultcer BerkeleyDB with B-Tree indexes
21:36 πŸ”— swebb Yea, I used to use those too. Much nicer than flat-file.
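A minimal sketch of reading one of the BerkeleyDB B-tree files listed above with Python 2's stdlib bsddb module, assuming the database simply maps short codes to long URLs; the actual key/value layout is not stated in this log.

    import bsddb

    # Open the B-tree file read-only; the returned object behaves like a dictionary.
    db = bsddb.btopen("tinyurl.db", "r")
    if db.has_key("abc12"):              # "abc12" is a made-up short code
        print("abc12 -> " + db["abc12"])
    db.close()
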
21:36 πŸ”— soultcer ersi: https://github.com/ArchiveTeam/tinyarchive
21:36 πŸ”— chronomex hm, not a bad idea
21:36 πŸ”— ersi Is there anything that I can do/take over re urlteam work?
21:36 πŸ”— chronomex much better than the flat files we were using, yes?
21:37 πŸ”— soultcer chronomex: About 10 times the size of the compressed .txt.xz files
21:37 πŸ”— soultcer For working with them: Great. For distributing: Bad.
21:37 πŸ”— chronomex well .xz crunches the text files down to about 2-5% size
21:37 πŸ”— chronomex so doesn't sound too bad
21:38 πŸ”— chronomex right, but you can run an extract from the db
21:38 πŸ”— chronomex yes?
21:38 πŸ”— soultcer I think it's 25% for the shortener URLs. The rest is index overhead of the btrees
21:38 πŸ”— chronomex aye
21:38 πŸ”— chronomex wait, what?
21:38 πŸ”— chronomex I'm not parsing that correctly
21:38 πŸ”— ersi Holy fuck @ 350GB+
21:38 πŸ”— soultcer .txt.xz files: 50 GB, .txt files: 200 GB, .db files: 500 GB
21:39 πŸ”— swebb Yea, mine will only be a couple of GB compressed.
21:39 πŸ”— chronomex ah, I see
21:39 πŸ”— soultcer ersi: Help is always welcome. If you know a little bit of python, writing an additional scraper will be easy. Did you look at the services.py file in the tinyback repository?
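A standalone sketch of the core fetch logic such a scraper needs, based only on details stated in this log (the snipurl charset given later and the convention of treating blocked codes as an exception); it is not the actual tinyback service interface, which lives in services.py.

    import httplib

    CHARSET = "0123456789abcdefghijklmnopqrstuvwxyz-_~"  # actual snipurl charset, per later in this log


    class CodeBlockedError(Exception):
        """Stand-in for tinyback's CodeBlockedException."""


    def fetch(code):
        """Return the long URL behind snipurl.com/<code>, or None if the code is unused."""
        conn = httplib.HTTPConnection("snipurl.com")
        conn.request("HEAD", "/" + code)
        resp = conn.getresponse()
        status, location = resp.status, resp.getheader("Location")
        conn.close()
        if status in (301, 302):
            return location
        if status == 404:
            return None
        raise CodeBlockedError("unexpected HTTP %d for %s" % (status, code))
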
21:39 πŸ”— ersi soultcer: I've written one for snipurl.com
21:40 πŸ”— soultcer Cool, can you git push it somewhere/post a diff?
21:40 πŸ”— ersi sure
21:41 πŸ”— soultcer (I'll try to move the tinyback repo to the archiveteam organization so that in the future everyone here can just push to it. It's just a bit complicated because the warrior directly pulls from the repo)
21:42 πŸ”— ersi I guess you could add me to your repo and I could push up the branch
21:43 πŸ”— deathy so...running the script help anyone? :)
21:43 πŸ”— soultcer deathy: Which script?
21:43 πŸ”— ersi deathy: Do you have seesaw-kit?
21:43 πŸ”— ersi just `./run-pipeline tinyback/pipeline.py deathy` - simple as that
21:43 πŸ”— deathy yep. on machine where I ran webshots script :)
21:44 πŸ”— soultcer ersi: Okay you should now be able to push. I didn't even know github let you do that on unpaid accounts :D
21:45 πŸ”— ersi Yeah, it's pretty nifty :)
21:45 πŸ”— deathy ok... so no running two pipelines at same time with default address/port? (also have dailybooth running there right now)
21:47 πŸ”— soultcer deathy: You can use --disable-web-server to disable the webserver or --port to change the port of the webserver (try ./run-pipeline --help for more options)
21:47 πŸ”— ersi deathy: add ./run-pipeline --port <something other than 8001> or --disable-web-server
21:47 πŸ”— ersi deathy: `run-pipeline` without any commands or with --help for all options >_>
21:49 πŸ”— deathy thanks. seems to have started ok now
21:49 πŸ”— ersi wat, "You can't push to git://github.com/soult/tinyback.git"
21:49 πŸ”— ersi I guess.. unpaid accounts can't share then
21:49 πŸ”— ersi unless you fork and pull
21:49 πŸ”— soultcer Your github account is named "ersi", right?
21:49 πŸ”— deathy isn't that the "Git Read-Only" link?
21:50 πŸ”— ersi oh, lol
21:50 πŸ”— ersi correct
21:50 πŸ”— ersi There we go
21:51 πŸ”— GitHub67 [tinyback] ersi created add-snipurl-service (+1 new commit): https://github.com/soult/tinyback/commit/c09f2289eb13
21:51 πŸ”— GitHub67 tinyback/add-snipurl-service c09f228 Erik Simmesgård: Initial work on adding a scraper service for snipurl
21:51 πŸ”— ersi \o/
21:51 πŸ”— soultcer I explicitly disabled the github shorturl service for the IRC bot :D
21:52 πŸ”— ersi :D
21:55 πŸ”— soultcer I maybe should write that down somewhere, here is how I usually add new services:
21:55 πŸ”— soultcer a) Check service, find out the charset
21:55 πŸ”— soultcer b) See how it redirects, and how it shows that a URL doesn't exist
21:55 πŸ”— soultcer c) Bombard the shortener with random URLs to see how it handles too many requests (will it block the scraper like bit.ly does, will it throw a specific http error, etc)
21:56 πŸ”— ersi Makes sense. I've done very little of a), most of b), and none of c) so far
21:57 πŸ”— soultcer d) Find out how it handles spam links. Most shorteners have a page like this: ow.ly/aU90
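A throwaway probe for steps b) and c) above: request a few codes, including random ones, and print the raw status and Location header to see how the shortener redirects, what "not found" looks like, and whether repeated requests trip rate limiting. The host and codes are just examples taken from this conversation.

    import httplib
    import random
    import string


    def probe(host, code):
        conn = httplib.HTTPConnection(host)
        conn.request("HEAD", "/" + code)
        resp = conn.getresponse()
        print("%s/%s -> %d %s, Location: %s" % (
            host, code, resp.status, resp.reason, resp.getheader("Location")))
        conn.close()


    probe("ow.ly", "aU90")   # the spam page mentioned above (step d)
    for _ in range(20):      # step c): watch for blocks or unusual HTTP errors
        probe("ow.ly", "".join(random.choice(string.ascii_lowercase) for _ in range(4)))
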
21:58 πŸ”— soultcer ersi: Doing a and b is already a lot of work, especially since none of those things have adequate documentation
21:58 πŸ”— ersi Yeah
21:58 πŸ”— soultcer ersi: Did you look at the test-definitions folder? I use it to write simple "tests" that verify that the shortener does not suddenly change its API/HTTP responses
21:59 πŸ”— ersi Yeah, I added one for snipurl.com for a few random ones. I saw your more exhaustive ones for bitly and owly - good inspiration
22:00 πŸ”— soultcer Haha I'm an idiot, wondering why I didn't find that file, because I forgot to git checkout the snipurl branch :D
22:00 πŸ”— ersi :D
22:00 πŸ”— ersi yeah, branches that are pushed to remotes confuse the fuck out of me
22:06 πŸ”— soultcer Haha, snipurl guys are idiots. They allow you to specify a password for links. Once a password has been set for any long URL, no other short URL without a password can be created for the same URL
22:12 πŸ”— GitHub62 [tinyback] soult pushed 2 new commits to add-snipurl-service: https://github.com/soult/tinyback/compare/c09f2289eb13...ba5438d766cd
22:12 πŸ”— GitHub62 tinyback/add-snipurl-service 2371ced David Triendl: services.Snipurl: Throw CodeBlockedException on links that require a private key
22:12 πŸ”— GitHub62 tinyback/add-snipurl-service ba5438d David Triendl: services.Snipurl: ~ is valid character
22:13 πŸ”— soultcer ersi: I extended the code a little bit: it now handles short URLs that require a "private key" as "not found"
22:15 πŸ”— deathy any easy way to see how many tasks I've sent to the tracker? Tracker page only shows top 10 ..
22:17 πŸ”— ersi soultcer: cool
22:17 πŸ”— ersi deathy: no
22:19 πŸ”— soultcer sqlite> SELECT name, count FROM statistics JOIN service ON service_id = id WHERE username = 'deathy';
22:19 πŸ”— soultcer bitly|39
22:19 πŸ”— soultcer isgd|20
22:19 πŸ”— soultcer owly|50
22:19 πŸ”— soultcer tinyurl|52
22:19 πŸ”— ersi ooh, do me :)
22:19 πŸ”— soultcer I think I should add an option to show all users
22:19 πŸ”— soultcer isgd|1216
22:19 πŸ”— soultcer bitly|1187
22:19 πŸ”— soultcer klam|235
22:19 πŸ”— soultcer owly|1508
22:19 πŸ”— soultcer tinyurl|1402
22:19 πŸ”— ersi Awesome.
22:20 πŸ”— deathy thanks. good to know it's really working ok and not just sending 1-2 items :)
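A hedged sketch of the "show all users" variant soultcer mentions above, reusing the statistics/service join from his query; the database filename is assumed to be the tracker's tasks.sqlite.

    import sqlite3

    conn = sqlite3.connect("tasks.sqlite")
    rows = conn.execute(
        "SELECT username, name, count FROM statistics "
        "JOIN service ON service_id = id ORDER BY username, name")
    for username, service, count in rows:
        print("%s: %s|%d" % (username, service, count))
    conn.close()
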
22:22 πŸ”— soultcer Okay, so the next step in scraping a URL shortener is adding it to the main branch and increasing the version number.
22:23 πŸ”— soultcer Clients always submit the version number when fetching a task, so that the tracker can tell outdated clients to update
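A purely illustrative sketch of that handshake; the field names, return values, and comparison rule are guesses rather than the real tinyarchive tracker protocol.

    # Hypothetical: the client reports its version when asking for a task,
    # and the tracker refuses anything older than the current release.
    TRACKER_VERSION = 3            # bumped whenever the main branch changes


    def fetch_task(client_version):
        if client_version < TRACKER_VERSION:
            return {"error": "outdated client, please update tinyback"}
        return {"service": "snipurl", "start": "2500000", "end": "250zzzz"}  # example task
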
22:24 πŸ”— ersi I initiated a pull request on GitHub between the branches
22:24 πŸ”— ersi If you feel it's good to go, shoot - else I'll keep working on it until it's better
22:25 πŸ”— ersi I'd like to take it for a test spin though - is it easy to do? I guess it's "just" to create some tasks with create_task.py and run tracker.py?
22:26 πŸ”— soultcer Yeah, simply create the database (cat schema.sql | sqlite3 tasks.sqlite), run ./tracker.py (requires web.py) and use create_task.py (don't forget to change the tracker url there too)
22:28 πŸ”— * ersi nods
22:29 πŸ”— ersi hmm, should it background immediately?
22:30 πŸ”— soultcer Though you can also do it without the tracker: http://pastebin.com/GxuNiQaY
22:30 πŸ”— soultcer ersi: Ah right, change the last line in the tracker to this:
22:30 πŸ”— soultcer if __name__ == "__main__":
22:30 πŸ”— soultcer app.run()
22:30 πŸ”— ersi Gotcha ;-)
22:30 πŸ”— soultcer This will enable the built-in webserver. Otherwise it needs to be called from wsgi
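For reference, a minimal web.py script showing both entry points described above: app.wsgifunc() for WSGI deployment and the __main__ guard for the built-in webserver. The route and handler are placeholders, not the real tracker code.

    import web

    urls = ("/", "Index")            # placeholder route


    class Index:
        def GET(self):
            return "tracker is up"


    app = web.application(urls, globals())
    application = app.wsgifunc()     # what a WSGI server would import

    if __name__ == "__main__":
        app.run()                    # built-in webserver for local testing
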
22:31 πŸ”— ersi yeah
22:31 πŸ”— ersi I've messed around with Flask previously
22:33 πŸ”— ersi Hm, weird. sqlite3 didn't like the service table from schema.sql
22:33 πŸ”— soultcer I made some schema changes when I added fancy graphs for the tracker, maybe I messed something up
22:34 πŸ”— ersi I'll check it out
22:34 πŸ”— soultcer "finished_tasks_count INTEGER NOT NULL DEFAULT 0," should not have a comma at the end
22:34 πŸ”— ersi oh
22:34 πŸ”— ersi haha, yeah - just saw that
22:35 πŸ”— soultcer When you run the task, be sure to enable debug output; it gives some insight into how snipurl creates short URLs
22:40 πŸ”— soultcer ersi: Did it work?
22:41 πŸ”— ersi I'm trying to find out how to actually create tasks :)
22:42 πŸ”— soultcer The create_task.py file is a bit complicated, because I use it to split up a long range (say 00000-zzzzz) into small tasks that only have 600 codes each
22:43 πŸ”— soultcer So something like sequence_from_to(tracker, "snipurl", "abcde....z-_~", "2500000", "250zzzz", 600) should do
22:46 πŸ”— ersi Oh my
22:46 πŸ”— ersi sequence_generator sure didn't like that
22:47 πŸ”— soultcer Exception handling combined with nice error messages is for pussies
22:48 πŸ”— ersi It doesn't find the substring, which I find absurd.
22:49 πŸ”— ersi http://pejsta.nu/1147
22:50 πŸ”— ersi Ah, I successfully created a task now
22:50 πŸ”— soultcer You said "Using the charset "abc" create a sequence from "1" to "10", split into 10 tasks"
22:50 πŸ”— soultcer But there is no 1 or 0 in the charset, so how should it know how to increment from a "1"?
22:50 πŸ”— ersi True that.
22:51 πŸ”— ersi Oh, so that's why your line didn't work either? '2', '5' and '0' weren't in the charset?
22:52 πŸ”— soultcer Oh, right, my bad
22:52 πŸ”— ersi I'm starting to get the hang of it now :)
22:52 πŸ”— soultcer 0123456789abcdefghijklmnopqrstuvwxyz-_~ <-- actual charset of snipurl
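A self-contained re-derivation of what create_task.py's splitting presumably does, using the actual snipurl charset above: treat codes as fixed-length base-39 numbers and cut the start-to-end range into tasks of 600 codes each. This is an illustration, not the real tinyarchive code.

    CHARSET = "0123456789abcdefghijklmnopqrstuvwxyz-_~"   # actual snipurl charset


    def code_to_int(code):
        value = 0
        for char in code:
            value = value * len(CHARSET) + CHARSET.index(char)
        return value


    def int_to_code(value, length):
        chars = []
        for _ in range(length):
            value, digit = divmod(value, len(CHARSET))
            chars.append(CHARSET[digit])
        return "".join(reversed(chars))


    def split_range(start, end, task_size=600):
        lo, hi = code_to_int(start), code_to_int(end)
        for first in range(lo, hi + 1, task_size):
            last = min(first + task_size - 1, hi)
            yield int_to_code(first, len(start)), int_to_code(last, len(end))


    # The range from the chat, "2500000" to "250zzzz", split into 600-code tasks:
    tasks = list(split_range("2500000", "250zzzz"))
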
22:53 πŸ”— ersi Indeed
22:53 πŸ”— ersi I guess it doesn't help that it's 23:53 and I'm sleepy :D
22:56 πŸ”— soultcer Hehe
22:56 πŸ”— soultcer Now you know most of the tracker/tinyback stuff anyway
22:56 πŸ”— soultcer I'll keep a scraper running overnight to see what happens, and then I can merge your code tomorrow
22:59 πŸ”— * soultcer is off to bed
22:59 πŸ”— ersi Nighty :)
