Time |
Nickname |
Message |
00:09
π
|
chronomex |
--amend ;) |
00:10
π
|
soultcer |
I think the warrior does a git pull, so it will try to awkwardly merge commits when I overwrite them. |
00:10
π
|
soultcer |
And it is usually a bad idea anyways to change already published git history |
00:26
π
|
chronomex |
oh, yeah, okay |
08:06
π
|
ersi |
soultcer: Nice that you added - Leaderboard. :) |
19:50
π
|
ersi |
soultcer: So, if I'm interested in writing a scraper - do you have any tips? |
19:53
π
|
ersi |
Hmm, I guess it's hard to find a shortener which has predictable URLs |
20:03
π
|
ersi |
Snipurl.com supposedly has 6,715,253,742 URLs |
20:17
π
|
ersi |
Alright, seems I got the basic structure I guess |
20:19
π
|
deathy |
when is the next urlteam .torrent coming out? |
20:20
π
|
ersi |
According to the site, in the end of 2012 |
20:22
π
|
ersi |
soultcer: What's running @ tracker.tinyarchive.org/v1/? Is that checked into tinyback? I guess it isn't :o |
21:30
π
|
chronomex |
iirc we haven't done a lot of shortener scraping this year |
21:30
π
|
soultcer |
swebb was crazy busy, he unshortened every link from the twitter spritzer feed (2% of all twitter posts) |
21:31
π
|
ersi |
I'm just interested in A) How does the tracker work? B) What does it take to 'support a Scraper'? really |
21:31
π
|
ersi |
Mmmh |
21:31
π
|
soultcer |
ersi: The tracker code lives in a separate repository, gimme a sec I'll put it on github |
21:31
π
|
ersi |
sure thang |
21:31
π
|
swebb |
Yea, well one of my computers was crazy busy. :) |
21:31
π
|
deathy |
any info on how to set up the script (not the warrior)? after cloning, is it run.py or something else? and what about parameters like the tracker and others? |
21:32
π
|
ersi |
I've written the very basic skeleton for http://snipurl.com which looks similar to ur1.ca |
21:33
π
|
swebb |
195M unwound urls in my twitter DB as of this morning. |
21:33
π
|
ersi |
woah :) |
21:33
π
|
swebb |
16.4GB of mysql storage space. |
21:34
π
|
swebb |
uncompressed of course. |
21:34
π
|
ersi |
I've only filled up that much database storage by generating fake test data |
21:34
π
|
soultcer |
-rw-r----- 1 david david 95G Dec 6 21:34 tinyurl.db |
21:34
π
|
soultcer |
-rw-r----- 1 david david 265G Dec 6 21:34 bitly.db |
21:34
π
|
soultcer |
david@kat:/media/tinyarchive/database/data$ ls -lh | grep -E '(bitly|tinyurl)' |
21:34
π
|
swebb |
That's a lot! |
21:35
π
|
soultcer |
BerkeleyDB with B-Tree indexes |
21:36
π
|
swebb |
Yea, I used to use those too. Much nicer than flat-file. |
21:36
π
|
soultcer |
ersi: https://github.com/ArchiveTeam/tinyarchive |
21:36
π
|
chronomex |
hm, not a bad idea |
21:36
π
|
ersi |
Is there anything that I can do/take over re urlteam work? |
21:36
π
|
chronomex |
much better than the flat files we were using, yes? |
21:37
π
|
soultcer |
chronomex: About 10 times the size of the compressed .txt.xz files |
21:37
π
|
soultcer |
For working with them: Great. For distributing: Bad. |
21:37
π
|
chronomex |
well .xz crunches the text files down to about 2-5% size |
21:37
π
|
chronomex |
so doesn't sound too bad |
21:38
π
|
chronomex |
right, but you can run an extract from the db |
21:38
π
|
chronomex |
yes? |
21:38
π
|
soultcer |
I think it's 25% for the shortener URLs. The rest is index overhead of the btrees |
21:38
π
|
chronomex |
aye |
21:38
π
|
chronomex |
wait, what? |
21:38
π
|
chronomex |
I'm not parsing that correctly |
21:38
π
|
ersi |
Holy fuck @ 350GB+ |
21:38
π
|
soultcer |
.txt.xz files: 50 GB, .txt files: 200 GB, .db files: 500 GB |
21:39
π
|
swebb |
Yea, mine will only be a couple of GB compressed. |
21:39
π
|
chronomex |
ah, I see |
21:39
π
|
soultcer |
ersi: Help is always welcome. If you know a little bit of python, writing an additional scraper will be easy. Did you look at the services.py file in the tinyback repository? |
21:39
π
|
ersi |
soultcer: I've written one for snipurl.com |
21:40
π
|
soultcer |
Cool, can you git push it somewhere/post a diff? |
21:40
π
|
ersi |
sure |
21:41
π
|
soultcer |
(I'll try to move the tinyback repo to the archiveteam organization so that in the future everyone here can just push to it. It's just a bit complicated because the warrior directly pulls from the repo) |
21:42
π
|
ersi |
I guess you could add me to your repo and I could push up the branch |
21:43
π
|
deathy |
so...running the script help anyone? :) |
21:43
π
|
soultcer |
deathy: Which script? |
21:43
π
|
ersi |
deathy: Do you have seesaw-kit? |
21:43
π
|
ersi |
just `./run-pipeline tinyback/pipeline.py deathy` - simple as that |
21:43
π
|
deathy |
yep. on machine where I ran webshots script :) |
21:44
π
|
soultcer |
ersi: Okay you should now be able to push. I didn't even know github let you do that on unpaid accounts :D |
21:45
π
|
ersi |
Yeah, it's pretty nifty :) |
21:45
π
|
deathy |
ok... so no running two pipelines at same time with default address/port? (also have dailybooth running there right now) |
21:47
π
|
soultcer |
deathy: You can use --disable-web-server to disable the webserver or --port to change the port of the webserver (try ./run-pipeline --help for more options) |
21:47
π
|
ersi |
deathy: add ./run-pipeline --port <something other than 8001> or --disable-web-server |
21:47
π
|
ersi |
deathy: `run-pipeline` without any commands or with --help for all options >_> |
21:49
π
|
deathy |
thanks. seems to have started ok now |
21:49
π
|
ersi |
wat, "You can't push to git://github.com/soult/tinyback.git" |
21:49
π
|
ersi |
I guess.. unpaid accounts can't share then |
21:49
π
|
ersi |
unless you fork and pull |
21:49
π
|
soultcer |
Your github account is named "ersi", right? |
21:49
π
|
deathy |
isn't that the "Git Read-Only" link? |
21:50
π
|
ersi |
oh, lol |
21:50
π
|
ersi |
correct |
21:50
π
|
ersi |
There we go |
21:51
π
|
GitHub67 |
[tinyback] ersi created add-snipurl-service (+1 new commit): https://github.com/soult/tinyback/commit/c09f2289eb13 |
21:51
π
|
GitHub67 |
tinyback/add-snipurl-service c09f228 Erik Simmesgård: Initial work on adding a scraper service for snipurl |
21:51
π
|
ersi |
\o/ |
21:51
π
|
soultcer |
I explicitly disabled the github shorturl service for the IRC bot :D |
21:52
π
|
ersi |
:D |
21:55
π
|
soultcer |
I maybe should write that down somewhere, here is how I usually add new services: |
21:55
π
|
soultcer |
a) Check service, find out the charset |
21:55
π
|
soultcer |
b) See how it redirects, and how it shows that an URL doesn't exist |
21:55
π
|
soultcer |
c) Bombard the shortener with random URLs to see how it handles too many requests (will it block the scraper like bit.ly does, will it throw a specific http error, etc) |
21:56
π
|
ersi |
Makes sense. I've done a limited amount of a) and most of b), but none of c) so far |
21:57
π
|
soultcer |
d) Find out how it handles spam links. Most shorteners have a page like this: ow.ly/aU90 |
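Steps b) to d) above can be sketched roughly as follows. This is a minimal illustration, not tinyback's actual code: the `classify` and `probe` helpers are hypothetical names, and the status codes treated as "blocked" are an assumption, since each shortener signals rate limiting differently.

```python
import http.client
from urllib.parse import urlsplit


def classify(status, location):
    """Map a raw HTTP response onto a coarse outcome for one short code."""
    if status in (301, 302, 303, 307, 308) and location:
        return "redirect"   # the short code exists and points somewhere
    if status in (404, 410):
        return "not_found"  # the short code does not exist
    if status in (403, 429, 503):
        return "blocked"    # assumption: common rate-limit or ban signals
    return "unknown"        # needs manual inspection (e.g. a spam warning page)


def probe(short_url, timeout=10):
    """Fetch one short URL without following the redirect."""
    parts = urlsplit(short_url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        # http.client never follows redirects, so the Location header is visible.
        conn.request("HEAD", parts.path or "/")
        response = conn.getresponse()
        location = response.getheader("Location")
        return response.status, location, classify(response.status, location)
    finally:
        conn.close()
```

Calling `probe()` on a handful of known-good, known-missing, and random codes shows which bucket each lands in; anything reported as "unknown" is a hint that the service needs special handling.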
21:58
π
|
soultcer |
ersi: Doing a and b is already a lot of work, especially since none of those things have adequate documentation |
21:58
π
|
ersi |
Yeah |
21:58
π
|
soultcer |
ersi: Did you look at the test-definitions folder? I use it to write simple "tests" that verify that the shortener does not suddenly change its API/HTTP responses |
21:59
π
|
ersi |
Yeah, I added one for snipurl.com with a few random ones. I saw your more exhaustive ones for bitly and owly - good inspiration |
22:00
π
|
soultcer |
Haha I'm an idiot, wondering why I didn't find that file, because I forgot to git checkout the snipurl branch :D |
22:00
π
|
ersi |
:D |
22:00
π
|
ersi |
yeah, branches that are pushed to remotes confuses the fuck out of me |
22:06
π
|
soultcer |
Haha, snipurl guys are idiots. They allow you to specify a password for links. Once a password has been set for any long URL, no other short URL without a password can be created for the same URL |
22:12
π
|
GitHub62 |
[tinyback] soult pushed 2 new commits to add-snipurl-service: https://github.com/soult/tinyback/compare/c09f2289eb13...ba5438d766cd |
22:12
π
|
GitHub62 |
tinyback/add-snipurl-service 2371ced David Triendl: services.Snipurl: Throw CodeBlockedException on links that require a private key |
22:12
π
|
GitHub62 |
tinyback/add-snipurl-service ba5438d David Triendl: services.Snipurl: ~ is valid character |
22:13
π
|
soultcer |
ersi: I extended the code a little bit: It now handles short URLs that require a "private key" as "not found" |
22:15
π
|
deathy |
any easy way to see how many tasks I've sent to the tracker? Tracker page only shows top 10 .. |
22:17
π
|
ersi |
soultcer: cool |
22:17
π
|
ersi |
deathy: no |
22:19
π
|
soultcer |
sqlite> SELECT name, count FROM statistics JOIN service ON service_id = id WHERE username = 'deathy'; |
22:19
π
|
soultcer |
bitly|39 |
22:19
π
|
soultcer |
isgd|20 |
22:19
π
|
soultcer |
owly|50 |
22:19
π
|
soultcer |
tinyurl|52 |
22:19
π
|
ersi |
ooh, do me :) |
22:19
π
|
soultcer |
I think I should add an option to show all users |
22:19
π
|
soultcer |
isgd|1216 |
22:19
π
|
soultcer |
bitly|1187 |
22:19
π
|
soultcer |
klam|235 |
22:19
π
|
soultcer |
owly|1508 |
22:19
π
|
soultcer |
tinyurl|1402 |
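The "show all users" option mentioned above could be built on a query like the one in this sketch. The table and column names are taken from the query quoted in the log; the column types and the trimmed-down sample rows are assumptions made for illustration.

```python
import sqlite3

# In-memory stand-in for the tracker database (schema details are assumed).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE service (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE statistics (
    service_id INTEGER NOT NULL,
    username TEXT NOT NULL,
    count INTEGER NOT NULL
);
INSERT INTO service VALUES (1, 'bitly'), (2, 'isgd');
INSERT INTO statistics VALUES
    (1, 'deathy', 39), (2, 'deathy', 20),
    (1, 'ersi', 1187), (2, 'ersi', 1216);
""")


def totals_per_user(connection):
    """Return (username, total finished tasks) for every user, highest first."""
    return connection.execute(
        "SELECT username, SUM(count) AS total "
        "FROM statistics GROUP BY username ORDER BY total DESC"
    ).fetchall()


def per_service(connection, username):
    """Same per-service breakdown as the query quoted in the log."""
    return connection.execute(
        "SELECT name, count FROM statistics "
        "JOIN service ON service_id = id WHERE username = ? ORDER BY name",
        (username,),
    ).fetchall()
```

With the sample rows, `totals_per_user(conn)` lists ersi ahead of deathy, and `per_service(conn, "deathy")` reproduces the bitly and isgd lines shown above.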
22:19
π
|
ersi |
Awesome. |
22:20
π
|
deathy |
thanks. good to know it's really working ok and not just sending 1-2 items :) |
22:22
π
|
soultcer |
Okay, so the next step in scraping an URL shortener is adding it to the main branch and increasing the version number. |
22:23
π
|
soultcer |
Clients always submit the version number when fetching a task, so that the tracker can tell outdated clients to update |
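A rough sketch of that version handshake, seen from the tracker's side. The function name, the response fields, and the `MINIMUM_VERSION` constant are all hypothetical; the log only states that clients send their version when fetching a task.

```python
MINIMUM_VERSION = 3  # hypothetical: bumped whenever a new service is added


def handle_task_request(client_version, pending_tasks):
    """Decide what to answer a client that asks for work."""
    if client_version < MINIMUM_VERSION:
        # Outdated clients get no task, only a hint to update.
        return {"status": "update_required", "minimum_version": MINIMUM_VERSION}
    if not pending_tasks:
        return {"status": "no_tasks"}
    # Hand out the oldest pending task.
    return {"status": "ok", "task": pending_tasks.pop(0)}
```

Rejecting old clients before handing out a task means a client that does not yet know a newly added service never receives work for it.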
22:24
π
|
ersi |
I initiated a pull request on GitHub between the branches |
22:24
π
|
ersi |
If you feel it's good to go, shoot - else I'll keep working on it until it's better |
22:25
π
|
ersi |
I'd like to take it for a test spin though - is it easy to do? I guess it's "just" to create some tasks with create_task.py and run tracker.py? |
22:26
π
|
soultcer |
Yeah, simply create the database (cat schema.sql | sqlite3 tasks.sqlite), run ./tracker.py (requires web.py) and use create_task.py (don't forget to change the tracker url there too) |
22:28
π
|
* |
ersi nods |
22:29
π
|
ersi |
hmm, should it background immediately? |
22:30
π
|
soultcer |
Though you can also do it without the tracker: http://pastebin.com/GxuNiQaY |
22:30
π
|
soultcer |
ersi: Ah right, change the last line in the tracker to this: |
22:30
π
|
soultcer |
if __name__ == "__main__": |
22:30
π
|
soultcer |
app.run() |
22:30
π
|
ersi |
Gotcha ;-) |
22:30
π
|
soultcer |
This will enable the built-in webserver. Otherwise it needs to be called from wsgi |
22:31
π
|
ersi |
yeah |
22:31
π
|
ersi |
I've messed around with Flask previously |
22:33
π
|
ersi |
Hm, weird. sqlite3 didn't like the service table from schema.sql |
22:33
π
|
soultcer |
I made some schema changes when I added fancy graphs for the tracker, maybe I messed something up |
22:34
π
|
ersi |
I'll check it out |
22:34
π
|
soultcer |
"finished_tasks_count INTEGER NOT NULL DEFAULT 0," should not have a comma at the end |
22:34
π
|
ersi |
oh |
22:34
π
|
ersi |
haha, yeah - just saw that |
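The trailing-comma problem is easy to reproduce in isolation. In this sketch only the `finished_tasks_count` line comes from the log; the other columns are placeholders, not the real service table.

```python
import sqlite3

# Only the finished_tasks_count line is from the log; the rest is assumed.
BROKEN = """
CREATE TABLE service (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    finished_tasks_count INTEGER NOT NULL DEFAULT 0,
);
"""

# Same statement without the comma after the last column definition.
FIXED = BROKEN.replace("DEFAULT 0,\n", "DEFAULT 0\n")


def try_schema(sql):
    """Run a schema against a throwaway database and report the outcome."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(sql)
        return "ok"
    except sqlite3.OperationalError as exc:
        return "error: %s" % exc
    finally:
        conn.close()
```

`try_schema(BROKEN)` reports a syntax error, while `try_schema(FIXED)` succeeds, which matches what sqlite3 complained about when loading the schema.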
22:35
π
|
soultcer |
When you run the task, be sure to enable debug output, it gives some insight into how snipurl creates shorturls |
22:40
π
|
soultcer |
ersi: Did it work? |
22:41
π
|
ersi |
I'm trying to find out how to actually create tasks :) |
22:42
π
|
soultcer |
The create_task.py file is a bit complicated, because I use it to split up a long range (say 00000-zzzzz) into small tasks that only have 600 codes each |
22:43
π
|
soultcer |
So something like sequence_from_to(tracker, "snipurl", "abcde....z-_~", "2500000", "250zzzz", 600) should do |
22:46
π
|
ersi |
Oh my |
22:46
π
|
ersi |
sequence_generator sure didn't like that |
22:47
π
|
soultcer |
Exception handling combined with nice error messages is for pussies |
22:48
π
|
ersi |
It doesn't find substring, which I find, absurd. |
22:49
π
|
ersi |
http://pejsta.nu/1147 |
22:50
π
|
ersi |
Ah, I successfully created a task now |
22:50
π
|
soultcer |
You said "Using the charset "abc" create a sequence from "1" to "10", split into 10 tasks" |
22:50
π
|
soultcer |
But there is no 1 or 0 in the charset, how should it know how to increment from a "1"? |
22:50
π
|
ersi |
True that. |
22:51
π
|
ersi |
Oh, so that's why your line didn't work either? '2', '5' and '0' weren't in the charset? |
22:52
π
|
soultcer |
Oh, right, my bad |
22:52
π
|
ersi |
I'm starting to get a hang of it now :) |
22:52
π
|
soultcer |
0123456789abcdefghijklmnopqrstuvwxyz-_~ <-- actual charset of snipurl |
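The range splitting discussed above, including why a start code with characters outside the charset fails, can be sketched like this. It is an independent illustration, not the code in create_task.py: the helper names are hypothetical, and it assumes fixed-length codes ordered by their position in the charset.

```python
CHARSET = "0123456789abcdefghijklmnopqrstuvwxyz-_~"  # snipurl charset from the log


def code_to_int(code, charset):
    """Interpret a short code as a number in base len(charset)."""
    value = 0
    for char in code:
        index = charset.find(char)
        if index == -1:
            # This is the failure hit above: '1' cannot be incremented in "abc".
            raise ValueError("character %r is not in the charset" % char)
        value = value * len(charset) + index
    return value


def int_to_code(value, charset, length):
    """Inverse of code_to_int, padded to a fixed length."""
    chars = []
    for _ in range(length):
        value, remainder = divmod(value, len(charset))
        chars.append(charset[remainder])
    return "".join(reversed(chars))


def split_range(start, stop, charset, chunk_size):
    """Yield inclusive (first, last) code pairs of at most chunk_size codes."""
    if len(start) != len(stop):
        raise ValueError("start and stop must have the same length")
    low = code_to_int(start, charset)
    high = code_to_int(stop, charset)
    while low <= high:
        end = min(low + chunk_size - 1, high)
        yield (int_to_code(low, charset, len(start)),
               int_to_code(end, charset, len(start)))
        low = end + 1
```

For example, `list(split_range("2500000", "250zzzz", CHARSET, 600))` produces tasks of at most 600 codes each, starting at 2500000 and ending at 250zzzz, whereas passing the charset "abc" with a start code of "1" raises a `ValueError` instead of a confusing substring error.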
22:53
π
|
ersi |
Indeed |
22:53
π
|
ersi |
I guess it doesn't help that it's 23:53 and I'm sleepy :D |
22:56
π
|
soultcer |
Hehe |
22:56
π
|
soultcer |
Now you know most of the tracker/tinyback stuff anyway |
22:56
π
|
soultcer |
I'll keep a scraper running overnight to see what happens, and then I can merge your code tomorrow |
22:59
π
|
* |
soultcer is off to bed |
22:59
π
|
ersi |
Nighty :) |