#archiveteam 2011-12-30,Fri


Time Nickname Message
00:46 🔗 arrith til about rsync's -K
00:46 🔗 arrith i've used symlinks with cfv a bunch but not with rsync i guess
01:57 🔗 SketchCow I'll take it. yeah
02:26 🔗 undersco2 I have successfully made it to MI!
02:26 🔗 undersco2 (alive)
02:28 🔗 undersco2 Dec_29_2011_19:21:00 SketchCow gui77: ext2 or thereabouts
02:29 🔗 undersco2 ext2 has some really funky limits, and no journaling
02:29 🔗 undersco2 I'd recommend ext3/4 or JFS
02:29 🔗 undersco2 (jfs is about as solid as you can get as far as data integrity if you lose power/blow up/knock out the sata cable/etc)
02:31 🔗 undersco2 Oh
02:31 🔗 undersco2 chronomex already beat me at saying that
02:31 🔗 * undersco2 reading through logs
02:39 🔗 dnova I just applied for the stephen hawking thing
02:39 🔗 dnova job
02:39 🔗 dnova shit for pay but seems like an adventure
02:50 🔗 godane i got 98 episodes of systm backup
02:50 🔗 godane almost done
02:55 🔗 SketchCow There's a stephen hawking thing?
02:58 🔗 dnova yeah
02:58 🔗 dnova saw it on slashdot
02:58 🔗 dnova http://www.news.com.au/breaking-news/stephen-hawking-wants-you-to-find-him-a-voice/story-e6frfku0-1226233018306#ixzz1hxAssZOx
03:01 🔗 SketchCow Be careful, hawking's kind of a dick
03:01 🔗 SketchCow But I mean, he can't, you know, hit you
03:01 🔗 dnova I've heard rumors of such :(
03:02 🔗 dnova haha
03:16 🔗 undersco2 hahaha
03:41 🔗 bsmith093 godane: where will these eps be when you're done uploading them?
04:18 🔗 SketchCow Just took a shipment of 1.4gb of geocities.
04:18 🔗 SketchCow yay
04:19 🔗 PatC woot
04:20 🔗 NotGLaDOS Mmm geocities
04:22 🔗 dashcloud this is new geocities beyond what you currently had?
04:26 🔗 SketchCow http://www.archive.org/details/geocities-jcn-pack
04:26 🔗 SketchCow Yes
04:27 🔗 arrith is the geocities stuff in a standard enough format to diff the new stuff against it?
04:27 🔗 arrith and easily see what's old/new
04:27 🔗 dashcloud that's amazing
04:29 🔗 SketchCow No idea.
04:29 🔗 SketchCow But I have to move to the next thing.
04:29 🔗 SketchCow Although you're right, I should add a .txt.
04:35 🔗 SketchCow cat geocities.jcn-grab.tar.bz2 | bunzip2 -- | tar vtf - | awk '{print $3," ",$4," ",$5," ",$6}' > geocities.jcn-grab.txt
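[editor's note: a minimal sketch of an equivalent command — GNU tar can decompress bzip2 itself with -j, so the cat and bunzip2 stages aren't needed:]

```shell
# Same listing in one tar invocation; fields 3-6 of tar's verbose
# output are size, date, time, and name.
tar -tvjf geocities.jcn-grab.tar.bz2 \
  | awk '{print $3," ",$4," ",$5," ",$6}' > geocities.jcn-grab.txt
```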
04:39 🔗 bsmith093 i have a script i found and tweaked to pull fanfiction.net stories and put them in the folder named for the author. how do i sort those folders by first letter? for example books/author would go in books/A/author
04:39 🔗 bsmith093 there are about 50k folders at this point so automated is preferred
04:46 🔗 arrith for i in {A..Z}; do echo mkdir "$i" && echo mv "$i"* "$i"/"; done
04:46 🔗 arrith that should work for uppercase, separate thing needed with {a..z} for lowercase
04:46 🔗 arrith few typos in there too heh
04:47 🔗 arrith shopt -s nocaseglob
04:48 🔗 arrith apparently can be used to turn off case sensitivity in globbing, so that should help
04:50 🔗 arrith http://stackoverflow.com/questions/156916/case-insensitive-glob-on-zsh-bash
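[editor's note: a cleaned-up sketch of the idea in the one-liner above, which as arrith notes has a stray quote in the pasted form — bucket each author directory under an uppercased first-letter directory, handling lower- and uppercase names in one pass:]

```shell
# Move every author directory into an uppercase first-letter bucket,
# so alice/ and Alice/ both end up under A/.
for d in */; do
  d="${d%/}"
  [ -d "$d" ] || continue                    # no dirs at all: skip
  case "$d" in [A-Z0-9]) continue ;; esac    # skip buckets themselves
  bucket=$(printf '%.1s' "$d" | tr '[:lower:]' '[:upper:]')
  mkdir -p "$bucket"
  mv -- "$d" "$bucket/"
done
```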
04:50 🔗 SketchCow http://www.archive.org/details/geocities-jcn-pack&reCache=1 Done
04:50 🔗 arrith nice
04:50 🔗 arrith undersco2: jfs vs ext? i've looked into jfs but just for performance and ended up with ext3/xfs. dunno about data integrity.
05:04 🔗 chronomex jfs is a bit slower
05:04 🔗 chronomex but it runs like a tank
05:04 🔗 chronomex highly recommended
05:05 🔗 arrith chronomex: would you say using fossil / zfs-fuse be too extreme, say if one's goal is data integrity?
05:05 🔗 chronomex I used ZFS-fuse for a while
05:05 🔗 chronomex it turned many of my symlinks into 0000-permission empty files.
05:05 🔗 chronomex not recommended
05:10 🔗 arrith ouch
05:11 🔗 arrith yeah it's this weird good/bad tradeoff with that stuff, namely zfs and btrfs. like they're supposed to be super good in terms of hashing your data and detecting random bit flip errors but their various implementations have quite a bit of instability in other areas
05:16 🔗 chronomex I tried btrfs too
05:16 🔗 chronomex accidentally filled up my disk
05:16 🔗 chronomex couldn't delete anything because it had nowhere to copy-on-write the metadata
05:16 🔗 SketchCow The big thing now is how do I deal with these google groups.
05:16 🔗 SketchCow I think I have to move things around.
05:16 🔗 SketchCow I'm inclined to wait to January, let more people finish filtering on.
05:16 🔗 chronomex SketchCow: google groups? hm. I think we should go after yahoo groups maybe.
05:16 🔗 SketchCow We're done, dude.
05:16 🔗 SketchCow We have them all.
05:17 🔗 chronomex all the messages?
05:17 🔗 chronomex don't be silly.
05:17 🔗 SketchCow No, no, the files.
05:18 🔗 SketchCow root@teamarchive-0:/2/FTP/swebb/GOOGLEGROUPS/groups-k# ls
05:18 🔗 SketchCow k19tcnh2-pages.zip kauri-files.zip kiska-pages.zip kongoni-pages.zip
05:18 🔗 SketchCow k2chocolategenova-pages.zip kauri-pages.zip kismamak-es-vedonok-csoportja-pages.zip konwersacje-jezykowe-aegee-krakow-files.zip
05:18 🔗 SketchCow kadez-pages.zip kaybedecek-bir-yok-ki-files.zip kjsce_electronics_2005-files.zip ko-se-slaze-da-je-marija-petrovic-naj-naj-razredna-pages.zip
05:18 🔗 SketchCow kahrmann-blog-files.zip kaybedecek-bir-yok-ki-pages.zip kjsce_electronics_2005-pages.zip kroatologija2010-pages.zip
05:18 🔗 SketchCow kaliann-pages.zip kenya-chess-forum-pages.zip kkaiengineer-pages.zip kup4ino-files.zip
05:18 🔗 SketchCow kaliteyonetimi2011-pages.zip kenya-welcome-files.zip knjizevnost-pages.zip kurdistan-dolamari-pages.zip
05:18 🔗 SketchCow kamash--pages.zip kenya-welcome-pages.zip knowledgeharvester-pages.zip kurniandikojaya-files.zip
05:18 🔗 SketchCow kami-roke-pages.zip khuyen-mai-vnn-pages.zip koan2008-files.zip
05:19 🔗 SketchCow kanisiusinfo-files.zip kik555-pages.zip koan2008-pages.zip
05:19 🔗 SketchCow katotakkyu-pages.zip kinhte34tsnt1992-1997-pages.zip kolna3rab-pages.zip
05:19 🔗 SketchCow see
05:20 🔗 chronomex I see
05:21 🔗 chronomex it'd be really bitchen to suck all the messages out of yahoo groups
05:21 🔗 SketchCow Archive: koan2008-files.zip
05:21 🔗 SketchCow Length Date Time Name
05:21 🔗 SketchCow --------- ---------- ----- ----
05:21 🔗 SketchCow 7148029 2011-07-18 00:35 koan2008/03 It's Easy.mp3
05:21 🔗 SketchCow 2535424 2011-07-18 00:35 koan2008/15 The Wind.mp3
05:21 🔗 SketchCow 217168 2011-07-18 00:35 koan2008/Archive Welcome sunset copy.jpg
05:21 🔗 SketchCow 219648 2011-07-18 00:35 koan2008/Koan2008 Goes Google Group Newletter & FAQs.doc
05:21 🔗 SketchCow 6156 2011-07-18 00:35 koan2008/__welcome.txt
05:21 🔗 SketchCow 958 2011-07-18 00:35 koan2008/1-a-cup-of-tea.txt
05:21 🔗 SketchCow Archive: koan2008-pages.zip
05:21 🔗 SketchCow Length Date Time Name
05:21 🔗 SketchCow --------- ---------- ----- ----
05:21 🔗 SketchCow 12507 2011-07-18 00:35 koan2008/10-the-last-poem-of-hoshin-commentary-enough-is-enough.txt
05:21 🔗 SketchCow 10228 2011-07-18 00:35 koan2008/11-the-story-of-shunkai-extra-innocence.txt
05:21 🔗 SketchCow 2310 2011-07-18 00:35 koan2008/12-happy-chinaman.txt
05:21 🔗 SketchCow 2175 2011-07-18 00:35 koan2008/13-a-buddha.txt
05:21 🔗 SketchCow 6841 2011-07-18 00:35 koan2008/14-muddy-road-commentary-mind-body-i-the-fascinating-body.txt
05:23 🔗 SketchCow Anyway, I really am inclined to make directories of these.
05:23 🔗 SketchCow Like an item with, say, 200 of these groups.
05:24 🔗 SketchCow But it'll still be a few thousand details I'm worried about.
05:24 🔗 SketchCow SOMEONE MENTIONED THIS IN AN ARTICLE
05:28 🔗 godane i got 33gb of systm
05:29 🔗 godane large and hd version
05:29 🔗 godane large only when that was the high bitrate version there
05:30 🔗 godane hd after episode 36
05:45 🔗 SketchCow HEY EVERYONE
05:46 🔗 SketchCow http://www.prelinger.com/dmapa.html
05:47 🔗 dnova I am unqualified
05:55 🔗 arrith is google groups planned to be mirrored?
06:05 🔗 bsmith093 way to go, Jason, you're gonna apply right?
06:07 🔗 dnova I doubt that
06:48 🔗 Coderjoe filemaker?
06:49 🔗 arrith once was apple's userfacing db
06:49 🔗 Coderjoe i know what it is. i'm just surprised it still sees use
06:50 🔗 arrith ah
06:50 🔗 arrith i've seen it before but i always have to look it up
06:58 🔗 SketchCow I'm WAY too busy
06:58 🔗 SketchCow I can't live in SF
07:07 🔗 Coderjoe yeah... geography is my problem as well... plus a couple other requirements
07:42 🔗 godane uploading ctrl alt chicken
07:43 🔗 godane a very brief cooking show on rev3
07:44 🔗 Coderjoe http://www.youtube.com/watch?v=IlBmbt8IVv4
07:48 🔗 SketchCow Working on my audio soundscape for Barcelona art group
07:48 🔗 SketchCow Oh yeah, 2:49am before a Jan 1 deadline
07:48 🔗 SketchCow You know it
07:49 🔗 SketchCow Should be 40 minutes long.
07:49 🔗 SketchCow 4 minutes done so far.
08:14 🔗 godane i'm getting very slow upload speed to archive.org
08:15 🔗 godane its like 40kbytes
08:15 🔗 godane i should be getting 200kbytes
08:16 🔗 SketchCow http://www.archive.org/download/jamendo-002843/02.mp3
08:16 🔗 SketchCow They do runs at night
08:16 🔗 SketchCow I can't use that track, but that track is awesome.
08:27 🔗 godane SketchCow: so archive.org gets hammered at night?
08:28 🔗 godane thats why its slow
08:33 🔗 bsmith093 i assumed you moved to sf when you got the IA position?
08:34 🔗 bsmith093 or is that mostly telepresence
08:34 🔗 chronomex nobody can make SketchCow stay in one place
08:34 🔗 chronomex nobody
08:38 🔗 ersi nobody puts baby in corner >_>
08:40 🔗 SketchCow Nobody tells smash to stop smash
08:41 🔗 BlueMax Bam bam bam bam
08:41 🔗 SketchCow Soundscape up to 7 minutes out of 40.
08:41 🔗 chronomex nobody makes google release a functioning copy of android
08:41 🔗 SketchCow Nobody makes rape truck stop raping or stop being a truck
08:42 🔗 chronomex what the fuck is rape truck
08:42 🔗 SketchCow 2,620,000 results in google image search for rape truck
08:42 🔗 SketchCow 452 for "rape truck"
08:42 🔗 chronomex these are all vans
08:43 🔗 arrith 28c3: How governments have tried to block Tor 1:25:40 https://www.youtube.com/watch?v=DX46Qv_b7F4
08:44 🔗 Nemo_bis are Google groups messages being downloaded?
08:45 🔗 SketchCow http://i46.photobucket.com/albums/f102/edou812/DONK.jpg
08:45 🔗 SketchCow archiveteam mobile is here
08:45 🔗 chronomex awwwwyeah
08:47 🔗 arrith so since questions about google groups messages are going unanswered i take it they're not being grabbed
08:48 🔗 chronomex that's a safe bet
08:48 🔗 SketchCow Not at the moment.
08:48 🔗 SketchCow The files and pages were
08:49 🔗 SketchCow There's no indication the messages are going away
08:50 🔗 arrith ah alright
09:02 🔗 yipdw| f
09:02 🔗 yipdw| oops
09:03 🔗 Nemo_bis where should one look for a complete archive of Usenet messages?
09:03 🔗 Nemo_bis I don't understand how complete my uni. news server is
09:15 🔗 arrith Nemo_bis: google groups is as complete of one as there can be i think
09:15 🔗 arrith Nemo_bis: there's been some stuff about people finding long lost usenet archives and sending them to google and they add them to google groups
09:15 🔗 Nemo_bis yep, but it's not easy to download
09:21 🔗 arrith yeah ;/
09:23 🔗 arrith maybe something using NNTP
09:41 🔗 DFJustin a little of the source data is on archive.org http://www.archive.org/details/utzoo-wiseman-usenet-archive
09:51 🔗 Nemo_bis yep, saw it, but it's very little
09:51 🔗 chronomex it's also very old
09:51 🔗 chronomex clearly worthless
09:51 🔗 Nemo_bis I tried command line but couldn't find any tool which easily allows to download all content of all groups
09:51 🔗 Nemo_bis so I used a GUI software but it was very slow
09:51 🔗 Nemo_bis downloaded some 600 MiB in a night IIRC
09:52 🔗 DFJustin back when google groups first started they mentioned using some kind of publicly available CD-ROMs for some of it too, those may be around someplace too
09:52 🔗 Nemo_bis hmm
09:52 🔗 DFJustin the stuff they got from dejanews is not available though
09:53 🔗 Nemo_bis http://tidbits.com/article/3229
09:53 🔗 Nemo_bis is this it?
09:54 🔗 DFJustin probably
10:02 🔗 Nemo_bis I suspect ftp.sterling.com is no longer active
10:22 🔗 arrith Nemo_bis: i know emacs has newsreader things and you can do anything in emacslisp
10:22 🔗 arrith there's gotta be some fairly easily scriptable nntp clients/libraries. python or perl should have a bunch, probably also ruby
10:22 🔗 Nemo_bis looks like my uni's news server was started in 1996 only, no idea whether this makes it useful
10:22 🔗 arrith if not just some existing client
10:23 🔗 Nemo_bis too hard for me, I'm a newbie
10:23 🔗 arrith Nemo_bis: that's a place to start
10:23 🔗 arrith Nemo_bis: http://learnpythonthehardway.org
10:23 🔗 Nemo_bis aww
10:23 🔗 arrith get un-newbie'd :P
10:24 🔗 Nemo_bis sudo apt-get install unnewbieator
10:25 🔗 Nemo_bis also, what format should I look for? the software I tried saved in some sqdblite format
10:25 🔗 chronomex mbox is good
10:25 🔗 chronomex maildir is good
10:25 🔗 chronomex anything that can be turned into mbox or maildir is good
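[editor's note: a minimal sketch tying the two suggestions together — a scriptable NNTP fetch (as arrith suggested, in Python) feeding an mbox file, one of the formats chronomex recommends. Assumes a reachable news server; nntplib was stdlib through Python 3.12:]

```python
import mailbox

def store_article(mbox, lines):
    # join one article's raw lines (bytes) back into a single message
    mbox.add(b"\n".join(lines) + b"\n")

def archive_group(server, group, out_path):
    # nntplib left the stdlib in Python 3.13; imported lazily so the
    # mbox half of this sketch works without it.
    import nntplib
    box = mailbox.mbox(out_path)
    with nntplib.NNTP(server) as nntp:
        _, count, first, last, _ = nntp.group(group)
        for num in range(first, last + 1):
            try:
                _, info = nntp.article(str(num))
            except nntplib.NNTPTemporaryError:
                continue  # expired or cancelled article
            store_article(box, info.lines)
    box.flush()
```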
10:30 🔗 Nemo_bis nothing else?
10:30 🔗 Nemo_bis example: http://toolserver.org/~nemobis/tmp/italia.venezia.discussioni.sqlitedb
10:30 🔗 Nemo_bis can't remember the software name
10:32 🔗 arrith sqlite is at least super open. so you might have to write your own thing to turn it into an mbox but it's totally doable
10:34 🔗 Nemo_bis yes, it seems to store them in mbox format somehow, you can read emails in plain text
10:34 🔗 arrith wonder what's in the sqlite db then
10:36 🔗 DFJustin http://web.archive.org/web/20110514012530/http://groups.google.com/group/google.public.support.general/msg/d88f36fb3e2c0aac
10:39 🔗 DFJustin ironically you can't access this message on live google groups anymore, they've made the whole newsgroup non-public for some reason
10:39 🔗 arrith DFJustin: good link
10:39 🔗 arrith that's odd
10:39 🔗 arrith but yeah dejanews and various people in universities sending backups. there's some guy talking about his story of some usenet archives that made the news rounds not too long ago
10:40 🔗 SketchCow First 20 minutes finished!
10:40 🔗 SketchCow \o/
10:40 🔗 arrith woo!
10:51 🔗 arrith wait
10:52 🔗 arrith jan 1st is at least 48 hrs away or so
10:52 🔗 SketchCow Yep, I'm ahead of crunch
10:52 🔗 SketchCow Poor crunch
10:52 🔗 arrith ah, good
10:53 🔗 arrith well, around irc i've seen people talking about going out drinking tonight. either people like to go out drinking on thursdays or they're forgetting days.
11:03 🔗 chronomex I go drinking wednesdays :)
11:14 🔗 SketchCow hahahahahah
11:14 🔗 SketchCow I sent in the beta to the people
11:14 🔗 SketchCow Someone just mailed me, basically saying "Please, do you have a phone number? There has been a great misunderstanding and we need to talk."
11:14 🔗 SketchCow Ha ha
11:15 🔗 SketchCow You know the best part? This project I made? Fucking awesome. Even if there's a huge misunderstanding and they want none of it, it's getting released.
11:19 🔗 godane in about 5 hours i will have ctrl alt chicken uploaded
11:52 🔗 BlueMax SketchCow: project?
17:56 🔗 chronomex it's great. you'll love it.
20:13 🔗 Coderjoe :-\
20:13 🔗 Coderjoe http://arstechnica.com/tech-policy/news/2011/12/godaddy-wins-and-loses-move-your-domain-day-over-sopa.ars
20:13 🔗 Coderjoe (note that the story is just using DNS server info...)
20:16 🔗 yipdw| Coderjoe: I'm going to guess most of GoDaddy's customers do not give a shit about SOPA
20:16 🔗 yipdw| and it's not because they're malicious
20:17 🔗 yipdw| they just don't give a shit about all these cyberlaw things
20:17 🔗 yipdw| I'm not sure how to make them care, either
20:17 🔗 Coderjoe there is speculation that the transfer-out count is low because they only looked at the one day and not the full counts since the can opened.
20:18 🔗 Coderjoe plus there is speculation that godaddy themselves transferred a number of sites they were holding in from elsewhere.
20:18 🔗 Coderjoe and then the whole slowdown of transfer auth code emails
20:18 🔗 Coderjoe too many people don't care about what the government is doing, online or off
20:19 🔗 yipdw| I think that even after you factor that in it does not impact GoDaddy's numbers at all
20:19 🔗 Coderjoe "that's boring. let's watch (sport)!"
20:19 🔗 yipdw| their reputation is already pretty shit, so
20:19 🔗 yipdw| we actually use GoDaddy as an SSL certificate authority
20:19 🔗 yipdw| at work
20:20 🔗 yipdw| we will not switch for the next few years, because SSL certificates are expensive as fuck
20:20 🔗 bsmith093 how much?
20:20 🔗 yipdw| I don't know if this will be enough impetus to not renew with GoDaddy, but I'm going to bring it up in the next SCRUM
20:20 🔗 yipdw| bsmith093: for a wildcard domain? a ton
20:20 🔗 yipdw| er, wildcard certificate
20:21 🔗 yipdw| $200/year
20:21 🔗 bsmith093 quick irc question how do i register my nick on this server?
20:23 🔗 yipdw| oh, actually, never mind
20:23 🔗 yipdw| I thought we had our GoDaddy SSL certificate for a few years; turns out it expires September 9, 2012
20:23 🔗 yipdw| that's not that much of an investment
20:25 🔗 yipdw| bsmith093: efnet doesn't run registration services
20:25 🔗 bsmith093 well thats annoying
20:25 🔗 Coderjoe at all. even channels.
20:26 🔗 Coderjoe though I am surprised by the occasional appearance of chanfix
20:26 🔗 Coderjoe if a channel goes opless, that bot pops in and ops someone (probably the person with the oldest join time)
20:27 🔗 yipdw| http://www.efnet.org/chanfix/
20:27 🔗 yipdw| that's the only IRC service I know of, anyway
20:28 🔗 bsmith093 anyway http://fanficdownloader-fork.googlecode.com/files/fanficdownloader-fork0.0.1.7z that link is to the thing im using for the fanfiction.net downloads. unpack, cd into it, and run ./automate x01 or some other x file; those are the split link lists for the download scripts. thats the rest of the folders; im doing x00, which is the first 200K stories
20:29 🔗 bsmith093 i didnt write it, but i tweaked the config enough that its faster to just distribute that
20:29 🔗 yipdw| have you verified that it actually grabs everything you want?
20:30 🔗 yipdw| multi-chapter stories, author profiles, etc
20:30 🔗 yipdw| also whether it records request/response metadata
20:30 🔗 bsmith093 the stories anyway, the reviews can wait and i can figure out calibre rename epub by tags later
20:30 🔗 yipdw| admittedly on fanfiction.net some of the response headers are not very useful
20:30 🔗 bsmith093 multi chap, yes, profiles no, but this is good for now, also no warc, AFAIK
20:31 🔗 yipdw| I'll look into making one that uses wget-warc
20:31 🔗 bsmith093 great please do
20:43 🔗 gui77 SketchCow: hey any update on getting me an rsync slot to start uploading? i'm running out of space xD
21:10 🔗 godane Ctrl_Alt_Chicken done: www.archive.org/details/Ctrl_Alt_Chicken
21:20 🔗 gui77 hey guys when running the upload script for mobileme do i need to manually erase already uploaded files?
21:22 🔗 chronomex no, it should skip them on future uploads
21:22 🔗 chronomex they will stay around
21:22 🔗 gui77 right right, but i meant, after i upload the files, i don't need them anymore, right?
21:22 🔗 gui77 if so it'd be better to delete them once they're finished to make room for more
21:23 🔗 gui77 or should i keep even the ones i've already uploaded?
21:23 🔗 chronomex I prefer you didn't, but I will defer to someone more involved with memac
21:24 🔗 gui77 cool mate thanks :) it's my first time so don't really know how everything works yet
21:24 🔗 chronomex aye
21:49 🔗 godane i added time travel support to slitaz-tank project
21:50 🔗 godane this way source can be touched to update the timestamp when the local clock is behind
21:52 🔗 arrith one alt topic: Friendliness - Understanding - Camaraderie - Kloseness
21:53 🔗 chronomex not feeling it, arrith
21:53 🔗 arrith anyone have experience downloading an entire subreddit?
21:53 🔗 chronomex keep trying :P
21:53 🔗 arrith chronomex: haha, tough room
21:54 🔗 SketchCow Toughest room there is
21:55 🔗 arrith ahh
21:55 🔗 arrith well dang, something i had fear'd:
21:55 🔗 arrith "
21:55 🔗 arrith "This script made me figure out that reddit lists only the last 1000 posts! Older posts are hidden. If you have a direct link to them, fine, otherwise they are gone :( So this script will only list 40 pages. This is a limitation of reddit.
21:55 🔗 chronomex fuckres
21:55 🔗 arrith i remember there was some limit, forgot where. now how to get around it hmm
21:57 🔗 arrith "There's no limit on the number of saved links, just a limit on listing sizes. For some reason (probably performance) reddit will never return a list of more than 1000 things (whether they are submissions, posts, messages, or saved links). They are still there, but you can't see them, unless you remove (unsave, in the case of saved links) the most recent ones. You could also try sorting in another way, of course, but that will still return 1000 things at most."
21:57 🔗 arrith http://www.reddit.com/r/help/comments/jhxmr/why_are_we_limited_in_the_number_of_links_we_can/
21:58 🔗 gui77 could maybe links be scraped off search engines?
21:59 🔗 gui77 and perhaps urls checked just by bruteforce to find real (existing) combinations? in that case the /jhxmr/
21:59 🔗 gui77 although that would take ages
21:59 🔗 arrith yeah a guy suggested doing a search engine thing in /r/help, but google supposedly misses a bunch
22:00 🔗 arrith actually, bruteforcing the key is an interesting idea
22:00 🔗 arrith since i do have time on my side (i think)
22:01 🔗 arrith actually dangit, that's a huge number. 5^36
22:02 🔗 arrith ~14.55 * 10^12 TB if each was only 1 byte
22:03 🔗 arrith a reddit mirror is starting to be more and more something i want to look into
22:06 🔗 gui77 arrith: how did you get 5^36?
22:06 🔗 gui77 http://rorr.im/about
22:06 🔗 gui77 these guys do it by scraping the content as soon as it's up and just adding it to a rolling database
22:07 🔗 arrith gui77: seem to be 5 characters in the id, and the id seems to be made up of the lowercase alphabet with either 0-9 or 1-9 digits. 26 letters in the alphabet plus 10 digits (0-9) gives 36.
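[editor's note: a side check on the arithmetic — with 5 positions and 36 symbols per position, the exponent goes on the length, so the id space is 36**5 (about 60 million), not 5**36:]

```python
# Count of possible 5-character base-36 ids (a-z plus 0-9).
symbols = 26 + 10          # alphabet plus digits
length = 5                 # id length
print(symbols ** length)   # 60466176
```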
22:07 🔗 arrith ah saw those in my search results but i assumed they wouldn't have a db up to dl.. i'll look at their site now
22:07 🔗 gui77 arrith of course how did i not see that haha
22:07 🔗 gui77 they open sourced some pytohn modules to mirror content
22:07 🔗 gui77 maybe some can be helpful
22:07 🔗 arrith that's an idea
22:08 🔗 gui77 you could also think about asking reddit directly, they might very well be sympathetic to the cause and make it easier for us
22:08 🔗 arrith yeah, maybe for AT. me as just some guy they don't seem to be into
22:09 🔗 arrith "Removing the 1000-pages crawl limit" http://groups.google.com/group/reddit-dev/browse_thread/thread/cbf05aa83dd03de5
22:09 🔗 gui77 AT?
22:09 🔗 arrith gui77: archive team
22:10 🔗 arrith woo rorrim is in python
22:10 🔗 gui77 ah ok
22:11 🔗 arrith well i suppose crawling rorr.im to fill in gaps is one option
22:13 🔗 gui77 getting the old stuff is the issue, right?
22:13 🔗 arrith yep
22:13 🔗 arrith periodic new crawling can get all the new stuff
22:16 🔗 bsmith093 just like for ffnet archive
22:17 🔗 bsmith093 there will be gaps and they probably will never be filled, but the near end will just keep filling up and crawling that will work beautifully
22:17 🔗 gui77 yeah but there's a LOT of backlog on reddit
22:18 🔗 gui77 well practically most of reddit as it is now wouldn't be scrapeable, right?
22:22 🔗 gui77 what if you crawl existing reddit pages for more reddit links?
22:24 🔗 arrith gui77: that'd be a good bet
22:25 🔗 gui77 any decent (worth having) content will be linked to multiple times in the future, for sure
22:25 🔗 gui77 it'd take time but if you did that, the only stuff you wouldn't have would be stuff that is old, unpopular, and never linked to. doesn't seem like a major loss
22:26 🔗 arrith goal is to get it alll
22:26 🔗 gui77 yeah, i know, but better than nothing i guess :/
22:27 🔗 arrith yeah
22:27 🔗 gui77 ooh i have an idea
22:27 🔗 gui77 reddit.com/random
22:27 🔗 gui77 that gives you a random submission
22:28 🔗 gui77 it'll redirect to something of the sort reddit.com/tb/xxxxx where xxxxx is the 5char code
22:28 🔗 gui77 could you maybe just hit those repeatedly?
22:29 🔗 arrith haha
22:29 🔗 arrith yeah, that's probably a decent bit more feasible than enumerating the ids
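[editor's note: a rough sketch of gui77's /random idea — follow the redirect and record the base-36 id from the final URL. That the id lands in a /tb/<id> or /comments/<id>/ path is an assumption, as is the User-Agent string:]

```python
import re
import urllib.request

# Matches the submission id in /tb/<id> or /comments/<id>/ paths.
ID_RE = re.compile(r"/(?:tb|comments)/([0-9a-z]+)")

def random_submission_id():
    req = urllib.request.Request(
        "https://www.reddit.com/random",
        headers={"User-Agent": "archiveteam-sketch/0.1"},  # hypothetical UA
    )
    with urllib.request.urlopen(req) as resp:
        m = ID_RE.search(resp.geturl())  # geturl() reflects redirects
    return m.group(1) if m else None
```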
22:30 🔗 gui77 yeah. although i'm not sure if that'll return those of any subreddit or just those you're subscribed to
22:30 🔗 gui77 oh btw you don't happen to know how i can check that the upload script finished uploading a profile for mobileme?
22:32 🔗 arrith ah can't say that i do, there might be a status page somewhere
22:36 🔗 soultcer Hm, the whole reddit discussion reminds me. Reddit is kind of an url shortener: redd.it/nwcbb
22:38 🔗 arrith good point. similar tactics to archiving those are a good idea to look into
22:40 🔗 soultcer Haha I have a feeling they won't like it if they get a couple thousand requests per second from a single IP
22:40 🔗 bsmith093 sleep 5
22:40 🔗 bsmith093 ms
22:40 🔗 gui77 oh no of course they won't. but if it's distributed, and slow
22:41 🔗 gui77 the old stuff won't run away, we can take our time downloading
22:41 🔗 soultcer But what if users submit faster than we download?
22:41 🔗 arrith soultcer: cross that bridge when we come to it :P
22:41 🔗 gui77 although i think it'd be easier to get in touch with people over at reddit. helping us out somehow might be easier for them than having to deal with the added load
22:41 🔗 arrith soultcer: besides getting the older stuff, i see that as the most pressing issue
22:41 🔗 gui77 soultcer: we don't need thousands of requests per second to scrape only new content, do we?
22:42 🔗 soultcer "Give us a database dump or we (D)DoS you!" <-- :D
22:42 🔗 soultcer From my perspective only the "reddit content id" to "target URL" mapping is of interest, but I guess you want stuff like subreddit, name of submitter, points, too?
22:43 🔗 arrith soultcer: yeah. i mean i'd be for taking what we can get
22:44 🔗 arrith maybe doing a faster crawl on the url level then going back over at the page level
22:44 🔗 godane did we backup berliOS?
22:44 🔗 soultcer I think so
22:44 🔗 soultcer http://archiveteam.org/index.php?title=BerliOS
22:45 🔗 arrith godane: they ended up announcing they weren't really going down but switching to new leadership. so it wasn't really necessary
22:45 🔗 arrith godane: although it hasn't been confirmed if there's anything that might be lost in the switch
22:45 🔗 arrith one thing i remembered is google reader keeps a kind of cache, so google reader might be something
22:45 🔗 soultcer Anyone else have a problem with wiki sites and Firefox 9 where the wiki navigation bar is suddenly on the bottom of the page?
22:48 🔗 gui77 man i really need to free some disk space :/
22:48 🔗 gui77 soultcer: wiki's fine for me on ff9
22:49 🔗 soultcer Weird
22:50 🔗 gui77 tried clearing cache and all?
22:50 🔗 soultcer Yup
22:51 🔗 soultcer Might just be an addon of mine
22:51 🔗 arrith i figure the internet archive probably has some bits of reddit and it looks fine except nsfw pages seem to not have been crawled due to a "you must be at least eighteen to view this reddit" page
22:53 🔗 gui77 anyone know how to gracefully stop the upload script?
22:53 🔗 arrith gui77: pretty sure you do ctrl-c and it finishes up the last thing it's working on, check the docs though
22:53 🔗 gui77 arrith: doesn't ctrl-c immediately stop it?
22:54 🔗 gui77 what docs? the wiki?
22:55 🔗 gui77 the readme is very short and vague
22:55 🔗 dashcloud I read the reddit post there- the reddit person at the end says they'd much prefer to work out a way to get a data dump than to require use of a scraper
22:56 🔗 arrith dashcloud: yeah i suppose contacting them for a data dump would be a good first step
22:57 🔗 arrith how does the Archive Team do that kind of thing btw? just anyone or are there people that've done it before and are good at it?
22:59 🔗 soultcer dashcloud: Which post? Can you give me a link?
23:00 🔗 dashcloud http://groups.google.com/group/reddit-dev/browse_thread/thread/cbf05aa83dd03de5
23:00 🔗 dashcloud some research guy asks about the 1000 post limit
23:01 🔗 soultcer Thanks
23:01 🔗 dashcloud thank arrith- that's how I saw the link
23:28 🔗 gui77 arrith: you were right about ctrl-c. it immediately stops it, but since it's rsync it'll just pick up where it left off!
23:52 🔗 arrith gui77: ah good
23:53 🔗 arrith yeah that guy is just some guy too, as in not part of an org or something that reddit might want to help out
23:53 🔗 arrith ohh
23:53 🔗 arrith "tl;dr If you need old data, we'd much rather work out a way to get you a data dump than to have you scrape. "
23:55 🔗 arrith one thing that doesn't cover is stuff that has been deleted from reddit
