#urlteam 2012-12-06,Thu


Time	Nickname	Message
00:09 ^🔗	chronomex	--amend ;)
00:10 ^🔗	soultcer	I think the warrior does a git pull, so it will try to awkwardly merge commits when I overwrite them.
00:10 ^🔗	soultcer	And it is usually a bad idea anyways to change already published git history
00:26 ^🔗	chronomex	oh, yeah, okay
08:06 ^🔗	ersi	soultcer: Nice that you added - Leaderboard. :)
19:50 ^🔗	ersi	soultcer: So, if I'm interested in writing a scraper - do you have any tips?
19:53 ^🔗	ersi	Hmm, I guess it's hard to find a shortener which has predictable URLs
20:03 ^🔗	ersi	Snipurl.com supposedly have 6,715,253,742 URLs
20:17 ^🔗	ersi	Alright, seems I got the basic structure I guess
20:19 ^🔗	deathy	when is the next urlteam .torrent coming out?
20:20 ^🔗	ersi	According to the site, in the end of 2012
20:22 ^🔗	ersi	soultcer: What's running @ tracker.tinyarchive.org/v1/? Is that checked into tinyback? I guess it isn't :o
21:30 ^🔗	chronomex	iirc we haven't done a lot of shortener scraping this year
21:30 ^🔗	soultcer	swebb was crazy busy, he unshortened every link from the twitter spritzer feed (2% of all twitter posts)
21:31 ^🔗	ersi	I'm just interested in A) How does the tracker work? B) What does it take to 'support a Scraper'? really
21:31 ^🔗	ersi	Mmmh
21:31 ^🔗	soultcer	ersi: The tracker code lives in a separate repository, gimme a sec I'll put it on github
21:31 ^🔗	ersi	sure thang
21:31 ^🔗	swebb	Yea, well one of my computers was crazy busy. :)
21:31 ^🔗	deathy	any info on how to setup script(not warrior)? after clone is it run.py or something else? and what about parameters like tracker and others ?
21:32 ^🔗	ersi	I've written the very basic skeleton for http://snipurl.com which looks similar to ur1.ca
21:33 ^🔗	swebb	195M unwound urls in my twitter DB as of this morning.
21:33 ^🔗	ersi	woah :)
21:33 ^🔗	swebb	16.4GB of mysql storage space.
21:34 ^🔗	swebb	uncompressed of course.
21:34 ^🔗	ersi	I've only filled up that much database storage by generating fake test data
21:34 ^🔗	soultcer	david@kat:/media/tinyarchive/database/data$ ls -lh | grep -E '(bitly|tinyurl)'
21:34 ^🔗	soultcer	-rw-r----- 1 david david 265G Dec 6 21:34 bitly.db
21:34 ^🔗	soultcer	-rw-r----- 1 david david 95G Dec 6 21:34 tinyurl.db
21:34 ^🔗	swebb	That's a lot!
21:35 ^🔗	soultcer	BerkeleyDB with B-Tree indexes
21:36 ^🔗	swebb	Yea, I used to use those too. Much nicer than flat-file.
21:36 ^🔗	soultcer	ersi: https://github.com/ArchiveTeam/tinyarchive
21:36 ^🔗	chronomex	hm, not a bad idea
21:36 ^🔗	ersi	Is there anything that I can do/take over re urlteam work?
21:36 ^🔗	chronomex	much better than the flat files we were using, yes?
21:37 ^🔗	soultcer	chronomex: About 10 times the size of the compressed .txt.xz files
21:37 ^🔗	soultcer	For working with them: Great. For distributing: Bad.
21:37 ^🔗	chronomex	well .xz crunches the text files down to about 2-5% size
21:37 ^🔗	chronomex	so doesn't sound too bad
21:38 ^🔗	chronomex	right, but you can run an extract from the db
21:38 ^🔗	chronomex	yes?
21:38 ^🔗	soultcer	I think it's 25% for the shortener URLs. The rest is index overhead of the btrees
21:38 ^🔗	chronomex	aye
21:38 ^🔗	chronomex	wait, what?
21:38 ^🔗	chronomex	I'm not parsing that correctly
21:38 ^🔗	ersi	Holy fuck @ 350GB+
21:38 ^🔗	soultcer	.txt.xz files: 50 GB, .txt files: 200 GB, .db files: 500 GB
21:39 ^🔗	swebb	Yea, mine will only be a couple of GB compressed.
21:39 ^🔗	chronomex	ah, I see
21:39 ^🔗	soultcer	ersi: Help is always welcome. If you know a little bit of python, writing an additional scraper will be easy. Did you look at the services.py file in the tinyback repository?
21:39 ^🔗	ersi	soultcer: I've written one for snipurl.com
21:40 ^🔗	soultcer	Cool, can you git push it somewhere/post a diff?
21:40 ^🔗	ersi	sure
21:41 ^🔗	soultcer	(I'll try to move the tinyback repo to the archiveteam organization so that in the future everyone here can just push to it. It's just a bit complicated because the warrior directly pulls from the repo)
21:42 ^🔗	ersi	I guess you could add me to your repo and I could push up the branch
21:43 ^🔗	deathy	so...running the script help anyone? :)
21:43 ^🔗	soultcer	deathy: Which script?
21:43 ^🔗	ersi	deathy: Do you have seesaw-kit?
21:43 ^🔗	ersi	just `./run-pipeline tinyback/pipeline.py deathy` - simple as that
21:43 ^🔗	deathy	yep. on machine where I ran webshots script :)
21:44 ^🔗	soultcer	ersi: Okay you should now be able to push. I didn't even know github let you do that on unpaid accounts :D
21:45 ^🔗	ersi	Yeah, it's pretty nifty :)
21:45 ^🔗	deathy	ok... so no running two pipelines at same time with default address/port? (also have dailybooth running there right now)
21:47 ^🔗	soultcer	deathy: You can use --disable-web-server to disable the webserver or --port to change the port of the webserver (try ./pipeline --help for more options)
21:47 ^🔗	ersi	deathy: add ./run-pipeline --port <something other than 8001> or --disable-web-server
21:47 ^🔗	ersi	deathy: `run-pipeline` without any commands or with --help for all options >_>
21:49 ^🔗	deathy	thanks. seems to have started ok now
21:49 ^🔗	ersi	wat, "You can't push to git://github.com/soult/tinyback.git"
21:49 ^🔗	ersi	I guess.. unpaid accounts can't share then
21:49 ^🔗	ersi	unless you fork and pull
21:49 ^🔗	soultcer	Your github account is named "ersi", right?
21:49 ^🔗	deathy	isn't that the "Git Read-Only" link?
21:50 ^🔗	ersi	oh, lol
21:50 ^🔗	ersi	correct
21:50 ^🔗	ersi	There we go
21:51 ^🔗	GitHub67	[tinyback] ersi created add-snipurl-service (+1 new commit): https://github.com/soult/tinyback/commit/c09f2289eb13
21:51 ^🔗	GitHub67	tinyback/add-snipurl-service c09f228 Erik Simmesgård: Initial work on adding a scraper service for snipurl
21:51 ^🔗	ersi	\o/
21:51 ^🔗	soultcer	I explicitly disabled the github shorturl service for the IRC bot :D
21:52 ^🔗	ersi	:D
21:55 ^🔗	soultcer	I maybe should write that down somewhere, here is how I usually add new services:
21:55 ^🔗	soultcer	a) Check service, find out the charset
21:55 ^🔗	soultcer	b) See how it redirects, and how it shows that an URL doesn't exist
21:55 ^🔗	soultcer	c) Bombard the shortener with random URLs to see how it handles too many requests (will it block the scraper like bit.ly does, will it throw a specific http error, etc)
21:56 ^🔗	ersi	Makes sense. I've done a very limited amount of a) and most of b), but none of c) so far
21:57 ^🔗	soultcer	d) Find out how it handles spam links. Most shorteners have a page like this: ow.ly/aU90
21:58 ^🔗	soultcer	ersi: Doing a and b is already a lot of work, especially since none of those things have adequate documentation
21:58 ^🔗	ersi	Yeah
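
Steps b) to d) above largely come down to telling a handful of HTTP responses apart. A minimal sketch in Python of what such a classification could look like. The function name `classify_response`, the status code sets and the "redirect back to the shortener's own host means a spam/blocked page" rule are all assumptions for illustration, not tinyback's actual code; each shortener has to be checked by hand.

```python
from urllib.parse import urlparse

# Assumed status codes; real shorteners may use different ones.
REDIRECT_CODES = {301, 302, 303, 307, 308}
RATE_LIMIT_CODES = {403, 429, 503}


def classify_response(status, location=None, service_host=None):
    """Classify one shortener response from its status code and Location header.

    Returns one of "found", "flagged", "not_found", "rate_limited", "unknown".
    """
    if status in REDIRECT_CODES and location:
        target_host = urlparse(location).hostname
        # Assumption: a redirect that stays on the shortener's own host
        # (or a relative redirect) is a warning/spam page, not a real target.
        if service_host and target_host in (None, service_host):
            return "flagged"
        return "found"
    if status == 404:
        return "not_found"
    if status in RATE_LIMIT_CODES:
        return "rate_limited"
    return "unknown"
```

Anything that comes back as "unknown" is a hint that the shortener does something the scraper does not handle yet.
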
21:58 ^🔗	soultcer	ersi: Did you look at the test-definitions folder: I use it to write simple "tests" that verify that the shortener does not suddenly change its API/HTTP responses
21:59 ^🔗	ersi	Yeah, I added one for snipurls.com for a few random ones. I saw your more exhaustive ones for bitly and owly - good inspiration
22:00 ^🔗	soultcer	Haha I'm an idiot, wondering why I didn't find that file, because I forgot to git checkout the snipurl branch :D
22:00 ^🔗	ersi	:D
22:00 ^🔗	ersi	yeah, branches that are pushed to remotes confuses the fuck out of me
22:06 ^🔗	soultcer	Haha, snipurl guys are idiots. They allow you to specify a password for links. Once a password has been set for any long URL, no other short URL without a password can be created for the same URL
22:12 ^🔗	GitHub62	[tinyback] soult pushed 2 new commits to add-snipurl-service: https://github.com/soult/tinyback/compare/c09f2289eb13...ba5438d766cd
22:12 ^🔗	GitHub62	tinyback/add-snipurl-service 2371ced David Triendl: services.Snipurl: Throw CodeBlockedException on links that require a private key
22:12 ^🔗	GitHub62	tinyback/add-snipurl-service ba5438d David Triendl: services.Snipurl: ~ is valid character
22:13 ^🔗	soultcer	ersi: I extended the code a little bit: It now handles shorturl that require "private key" as "not found"
22:15 ^🔗	deathy	any easy way to see how many tasks I've sent to the tracker? Tracker page only shows top 10 ..
22:17 ^🔗	ersi	soultcer: cool
22:17 ^🔗	ersi	deathy: no
22:19 ^🔗	soultcer	sqlite> SELECT name, count FROM statistics JOIN service ON service_id = id WHERE username = 'deathy';
22:19 ^🔗	soultcer	bitly|39
22:19 ^🔗	soultcer	isgd|20
22:19 ^🔗	soultcer	owly|50
22:19 ^🔗	soultcer	tinyurl|52
22:19 ^🔗	ersi	ooh, do me :)
22:19 ^🔗	soultcer	I think I should add an option to show all users
22:19 ^🔗	soultcer	isgd|1216
22:19 ^🔗	soultcer	bitly|1187
22:19 ^🔗	soultcer	klam|235
22:19 ^🔗	soultcer	owly|1508
22:19 ^🔗	soultcer	tinyurl|1402
22:19 ^🔗	ersi	Awesome.
22:20 ^🔗	deathy	thanks. good to know it's really working ok and not just send 1-2 items :)
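
The per-user lookup soultcer ran by hand can be sketched in Python with the standard `sqlite3` module. The table layout below is only inferred from the query in the log (a `service` table with `id` and `name`, a `statistics` table with `service_id`, `username` and `count`); the real tracker schema may have more columns. The `ORDER BY` and the bound parameter are additions, and the demo rows are deathy's numbers from above.

```python
import sqlite3


def tasks_per_service(conn, username):
    """Return (service name, finished task count) pairs for one user."""
    rows = conn.execute(
        "SELECT name, count FROM statistics "
        "JOIN service ON service_id = id "
        "WHERE username = ? ORDER BY name",
        (username,),
    )
    return rows.fetchall()


def build_demo_db():
    """Create an in-memory database with an assumed, minimal schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE service (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE statistics (
            service_id INTEGER NOT NULL,
            username TEXT NOT NULL,
            count INTEGER NOT NULL
        );
        """
    )
    services = [(1, "bitly"), (2, "isgd"), (3, "owly"), (4, "tinyurl")]
    conn.executemany("INSERT INTO service VALUES (?, ?)", services)
    stats = [(1, "deathy", 39), (2, "deathy", 20), (3, "deathy", 50), (4, "deathy", 52)]
    conn.executemany("INSERT INTO statistics VALUES (?, ?, ?)", stats)
    return conn


if __name__ == "__main__":
    for name, count in tasks_per_service(build_demo_db(), "deathy"):
        print(f"{name}|{count}")
```
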
22:22 ^🔗	soultcer	Okay, so the next step in scraping an URL shortener is adding it to the main branch and increasing the version number.
22:23 ^🔗	soultcer	Clients always submit the version number when fetching a task, so that the tracker can tell outdated clients to update
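
A rough Python sketch of the version check described above: the client sends its version with every task request, and the tracker refuses to hand out work to clients that are too old. The names `MINIMUM_CLIENT_VERSION` and `handle_task_request`, the integer version and the response dictionaries are hypothetical; the log does not say how the real tracker represents any of this.

```python
# Hypothetical value; bumped whenever a new service is merged.
MINIMUM_CLIENT_VERSION = 3


def handle_task_request(client_version, pending_tasks):
    """Decide what to answer a client that asks for a task.

    pending_tasks is a list of task descriptions; one is removed when handed out.
    """
    if client_version < MINIMUM_CLIENT_VERSION:
        # Outdated client: tell it to update instead of giving it work.
        return {"status": "update_required", "minimum_version": MINIMUM_CLIENT_VERSION}
    if not pending_tasks:
        return {"status": "no_tasks"}
    return {"status": "ok", "task": pending_tasks.pop(0)}
```

This is why a new scraper needs a version bump: clients that predate it do not know the new service and should not be given tasks for it.
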
22:24 ^🔗	ersi	I initiated a pull request on GitHub between the branches
22:24 ^🔗	ersi	If you feel it's good to go, shoot - else I'll keep working on it until it's better
22:25 ^🔗	ersi	I'd like to take it for a test spin though - is it easy to do? I guess it's "just" to create some tasks with create_task.py and run tracker.py?
22:26 ^🔗	soultcer	Yeah, simply create the database (cat scheme.sql | sqlite3 tasks.sqlite), run ./tracker.py (requires web.py) and use create_task.py (don't forget to change the tracker url there too)
22:28 ^🔗	*	ersi nods
22:29 ^🔗	ersi	hmm, should it background immediately?
22:30 ^🔗	soultcer	Though you can also do it without the tracker: http://pastebin.com/GxuNiQaY
22:30 ^🔗	soultcer	ersi: Ah right, change the last line in the tracker to this:
22:30 ^🔗	soultcer	if __name__ == "__main__":
22:30 ^🔗	soultcer	app.run()
22:30 ^🔗	ersi	Gotcha ;-)
22:30 ^🔗	soultcer	This will enable the built-in webserver. Otherwise it needs to be called from wsgi
22:31 ^🔗	ersi	yeah
22:31 ^🔗	ersi	I've messed around with Flask previously
22:33 ^🔗	ersi	Hm, weird. sqlite3 didn't like the service table from schema.sql
22:33 ^🔗	soultcer	I made some schema changes when I added fancy graphs for the tracker, maybe I messed something uo
22:33 ^🔗	soultcer	*up
22:34 ^🔗	ersi	I'll check it out
22:34 ^🔗	soultcer	"finished_tasks_count INTEGER NOT NULL DEFAULT 0," should not have a comma at the end
22:34 ^🔗	ersi	oh
22:34 ^🔗	ersi	haha, yeah - just saw that
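
The trailing-comma problem is easy to reproduce with Python's `sqlite3` module: SQLite rejects a column list that ends in a comma right before the closing parenthesis. In the sketch below only the `finished_tasks_count` line comes from the log; the other columns and the helper name `schema_error` are made up for the demonstration.

```python
import sqlite3

# Only the last column definition is quoted from the log; the rest is assumed.
BROKEN_SCHEMA = """
CREATE TABLE service (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    finished_tasks_count INTEGER NOT NULL DEFAULT 0,
);
"""

# Same schema with the trailing comma removed.
FIXED_SCHEMA = BROKEN_SCHEMA.replace("DEFAULT 0,", "DEFAULT 0")


def schema_error(sql):
    """Load the schema into a throwaway database; return the error text or None."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(sql)
    except sqlite3.OperationalError as exc:
        return str(exc)
    finally:
        conn.close()
    return None


if __name__ == "__main__":
    print("broken:", schema_error(BROKEN_SCHEMA))
    print("fixed: ", schema_error(FIXED_SCHEMA))
```
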
22:35 ^🔗	soultcer	When you run the task, be sure to enable debug output, it gives some insight into how snipurl creates shorturls
22:40 ^🔗	soultcer	ersi: Did it work?
22:41 ^🔗	ersi	I'm trying to find out how to actually create tasks :)
22:42 ^🔗	soultcer	The create_task.py file is a bit complicated, because I use it to split up a long range (say 00000-zzzzz) into small tasks that only have 600 codes each
22:43 ^🔗	soultcer	So something like sequence_from_to(tracker, "snipurl", "abcde....z-_~", "2500000", "250zzzz", 600) should do
22:46 ^🔗	ersi	Oh my
22:46 ^🔗	ersi	sequence_generator sure didn't like that
22:47 ^🔗	soultcer	Exception handling combined with nice error messages is for pussies
22:48 ^🔗	ersi	It doesn't find substring, which I find, absurd.
22:49 ^🔗	ersi	http://pejsta.nu/1147
22:50 ^🔗	ersi	Ah, I successfully created a task now
22:50 ^🔗	soultcer	You said "Using the charset "abc" create a sequence from "1" to "10", split into 10 tasks"
22:50 ^🔗	soultcer	But there is no 1 or 0 in the charset, how should it know how to increment from a "1"?
22:50 ^🔗	ersi	True that.
22:51 ^🔗	ersi	Oh, so that's why your line didn't work either? '2', '5' and '0' wasn't in charset?
22:52 ^🔗	soultcer	Oh, right, my bad
22:52 ^🔗	ersi	I'm starting to get a hang of it now :)
22:52 ^🔗	soultcer	0123456789abcdefghijklmnopqrstuvwxyz-_~ <-- actual charset of snipurl
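
The idea behind splitting a range into tasks can be sketched in Python by treating each short code as a number written in base len(charset). This is not the code from create_task.py; `split_range` and its helpers are hypothetical names, and the sketch assumes start and stop codes of equal length. It also shows why a start code containing characters outside the charset cannot work: there is no digit value to count from.

```python
def code_to_int(code, charset):
    """Interpret a short code as a number in base len(charset)."""
    value = 0
    for ch in code:
        value = value * len(charset) + charset.index(ch)
    return value


def int_to_code(value, charset, length):
    """Inverse of code_to_int, padded to a fixed length."""
    chars = []
    for _ in range(length):
        value, digit = divmod(value, len(charset))
        chars.append(charset[digit])
    return "".join(reversed(chars))


def split_range(charset, start, stop, per_task):
    """Split the inclusive range start..stop into (first, last) code pairs."""
    if len(start) != len(stop):
        raise ValueError("start and stop must have the same length")
    for code in (start, stop):
        missing = sorted(set(code) - set(charset))
        if missing:
            raise ValueError(f"{code!r} uses characters not in charset: {missing}")
    low, high = code_to_int(start, charset), code_to_int(stop, charset)
    if low > high:
        raise ValueError("start must not come after stop")
    tasks = []
    while low <= high:
        last = min(low + per_task - 1, high)
        tasks.append((int_to_code(low, charset, len(start)),
                      int_to_code(last, charset, len(start))))
        low = last + 1
    return tasks


if __name__ == "__main__":
    snipurl_charset = "0123456789abcdefghijklmnopqrstuvwxyz-_~"
    tasks = split_range(snipurl_charset, "2500000", "250zzzz", 600)
    print(len(tasks), "tasks, first:", tasks[0])
```

With the corrected snipurl charset the range from the log splits cleanly, while split_range("abc", "1", "10", 10) raises a ValueError instead of failing somewhere deep inside the generator.
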
22:53 ^🔗	ersi	Indeed
22:53 ^🔗	ersi	I guess it doesn't help that it's 23:53 and I'm sleepy :D
22:56 ^🔗	soultcer	Hehe
22:56 ^🔗	soultcer	Now you know most of the tracker/tinyback stuff anyway
22:56 ^🔗	soultcer	I'll keep a scraper running overnight to see what happens, and then I can merge your code tomorrow
22:59 ^🔗	*	soultcer is off to bed
22:59 ^🔗	ersi	Nighty :)
