#archiveteam-bs 2013-02-04,Mon

Time Nickname Message
01:32 🔗 swebb chronomex: Yea, I'm doing the auto-op stuff.
01:33 🔗 chronomex ok, my hostname has changed; I'm now coming from numbertron.com (and never from gir.seattlewireless.net)
08:02 🔗 BlueMax I hate how one of the best ZX Spectrum emulators on Windows isn't free
08:41 🔗 godane i broke the 2400+ mark in my g4video-web collection
08:56 🔗 SmileyG hmmm
08:56 🔗 SmileyG punchfork has a date, but xanga doesn't..
08:56 🔗 SmileyG is punchfork done too?
08:56 🔗 BlueMax ...why did they call their site "punchfork"?
08:56 🔗 BlueMax that just sounds painful.
08:57 🔗 SmileyG ;)
08:57 🔗 SmileyG Hmmm
08:57 🔗 * SmileyG adds another suggestion for the warrior
08:59 🔗 SmileyG https://github.com/ArchiveTeam/seesaw-kit/issues/19
13:14 🔗 swebb chronomex: ok, I've updated the auto-op to use your new source domain. Welcome back!
13:26 🔗 SmileyG hmmm
13:26 🔗 SmileyG SCO requesting permission to "loose" documents
13:26 🔗 SmileyG suggest IA offers free storage and digital conversion
13:29 🔗 ersi EFF has taken similar things up previously AFAIK
13:29 🔗 ersi I think that's why SketchCow got a few pallets of paper previously
13:33 🔗 SmileyG heh, I was joking, but it would be rather amusing
13:38 🔗 ersi I think you'd need several warehouses though
14:44 🔗 xk_id I'm here
14:45 🔗 ersi enjoy the cat and mouse game of Banhammer 3k
14:46 🔗 xk_id I just need a bunch of cheap machines on which to run the spider and an ssh tunnel, I think.
14:46 🔗 ersi throw in more hosts, go slower/faster in turns, switch patterns/user agents etc
14:46 🔗 xk_id what's the name for providers of such services?
14:46 🔗 ersi lowendbox.com
14:47 🔗 xk_id I just found out some guys called rackspace also provide 512 MB RAM machines with full root for 2p an hour
14:47 🔗 ersi yeah
14:48 🔗 ersi there's a bunch of providers.. linode is another one
14:48 🔗 ersi slicehost
14:48 🔗 ersi etc etc
14:48 🔗 ersi You can even get VM's at Google these days, that'd be interesting. I'd like to see them ban Google
14:48 🔗 xk_id hah
14:49 🔗 ersi https://cloud.google.com/products/compute-engine
14:49 🔗 xk_id wow
14:49 🔗 xk_id well, if I can get 5 machines in 5 different countries
14:50 🔗 xk_id and iterate each time I get banned...
14:50 🔗 xk_id it should work, right?
14:50 🔗 xk_id and thanks for the recommendations, there seems to be a plethora of choices available
14:50 🔗 ersi I enjoy killing services
14:50 🔗 ersi And seeing how you most likely hammer the shit out of that social network, it's my pleasure
14:50 🔗 ersi :D
14:50 🔗 xk_id Nooooo
14:51 🔗 xk_id I promise you I am not...
14:51 🔗 xk_id I ran *one* single threaded crawler
14:51 🔗 ersi Well, you're probably as hard to spot as a naked man on a Town Square
14:51 🔗 xk_id for most of the time
14:51 🔗 ersi I'd just request a bunch of IPs for each VM
14:51 🔗 xk_id I only started running a second one from my laptop for about an hour, before the first one got banned.
14:51 🔗 ersi until they're smart and block the whole provider
14:51 🔗 xk_id delays between requests were between 0.8 and 1s
14:52 🔗 ersi Well, it doesn't look like a large social network.. I'd not be surprised if they got super sucky infra
14:55 🔗 xk_id is it illegal what I'm doing?
14:57 🔗 ersi I dunno, maybe. Depends on country, county and how tech literate your justice system is in general
14:57 🔗 ersi Then again, most people probably break a few laws every now and then. There's plenty of laws.
14:58 🔗 xk_id ?
14:58 🔗 xk_id is it a bad idea to explain to the rackspace customer support what I am doing
14:59 🔗 ersi Why would you, though?
14:59 🔗 xk_id I wanted to know if they have something useful for me.
14:59 🔗 xk_id I had to explain what my requirements were
14:59 🔗 ersi Well, you just want a VM and possibly a few IPs
15:00 🔗 xk_id I've already mentioned crawling, and now they ask if I can provide more details about usage
15:00 🔗 Schbirid mention it is a research project
15:00 🔗 xk_id k
15:04 🔗 xk_id is there any possibility I might get accused of ddos'ing them?
15:04 🔗 xk_id or am I just becoming paranoid?
15:04 🔗 Schbirid anything is possible :(
15:04 🔗 xk_id don't joke..
15:05 🔗 ersi We're not joking
15:05 🔗 ersi then again, what the fuck do we care. We eat services for breakfast occasionally
15:06 🔗 xk_id have you ever had problems?
15:07 🔗 ersi sure, of course services fight back
15:07 🔗 Schbirid what was that poetry site named again?
15:07 🔗 ersi Schbirid: Lulu.
15:08 🔗 ersi Or well, poetry.com
15:08 🔗 Schbirid i think jason mentioned their struggle in his talks
15:08 🔗 Schbirid http://ascii.textfiles.com/archives/3278 maybe
15:11 🔗 xk_id heh
15:11 🔗 xk_id funny title :)
15:11 🔗 xk_id well
15:12 🔗 xk_id not finishing my degree would probably be worse than being sued
15:12 🔗 xk_id that aside, I hope my supervisor is illiterate enough not to realise what I'm actually up to.
15:12 🔗 xk_id (illiterate in respect to IT)
15:14 🔗 SmileyG hmmm
15:14 🔗 SmileyG what IS your degree btw?
15:15 🔗 SmileyG just randomly out of interest?
15:16 🔗 DFJustin http://www.nytimes.com/2013/02/04/world/africa/saving-timbuktus-priceless-artifacts-from-militants-clutches.html?_r=0
15:17 🔗 xk_id uh, two rackspace guys asked me online what I want to use the servers for. so now a third guy called me to ask the same thing
15:17 🔗 xk_id Oh, my degree is IT/management consultancy. Not very romantic.
15:17 🔗 ersi rackspace does that
15:17 🔗 ersi they phone all new customers
15:17 🔗 xk_id But I managed to get away with an interesting dissertation topic
15:17 🔗 ersi xk_id: That's a.. degree?
15:18 🔗 xk_id well, it's not what it's called
15:18 🔗 xk_id but it's what it is, essentially
15:18 🔗 xk_id if you're a really good student, you end up an IT/mngmt consultant.
15:18 🔗 xk_id but the degree is called Information Management for Business
15:18 🔗 ersi In other words, can't get a proper employment?
15:19 🔗 xk_id hmmm? I thought those guys are pretty sorted
15:19 🔗 ersi I don't have much time for people who instantly turn into consultants
15:19 🔗 ersi Got a pretty bad rep
15:19 🔗 xk_id well, as you can see, I kind of drift away from my degree
15:19 🔗 xk_id :P
15:19 🔗 xk_id my dissertation is on network science
16:14 🔗 SmileyG you have an interesting dissertation on a crap sounding degree.
16:14 🔗 SmileyG I was the other way around,
16:14 🔗 xk_id what was your degree and dissertation?
16:14 🔗 SmileyG Comp Sci
16:14 🔗 SmileyG and errr
16:15 🔗 SmileyG multiplatform location aware social gaming and interaction tool
16:15 🔗 SmileyG a website which found gamers with similar interests who were located nearby.
16:15 🔗 SmileyG I didn't even build the site.
16:15 🔗 SmileyG It was more fun looking at all the issues surrounding the idea.
16:16 🔗 SmileyG From: wtf is the point, anyone online can game anyone else so why be local?
16:27 🔗 xk_id heh
16:28 🔗 xk_id I doubt I'll get a good grade on my dissertation, despite the extreme effort I'm putting in it. Because I think it's the kind of looking at issues that you mentioned which is expected.. Not so much trying to crawl a website without getting banned..
16:28 🔗 xk_id I'm doing it wrong.
16:29 🔗 Schbirid the result does not count, the approach, working, thoughts etc do. at least in germany
16:50 🔗 xk_id same here. and I'm not following it at all.
16:50 🔗 xk_id I'm very retarded, I don't know what I'm thinking...
16:50 🔗 xk_id I got too enamoured with this topic.....
16:50 🔗 xk_id I have less than 2 months left
16:50 🔗 xk_id oh, god.
16:51 🔗 SmileyG I did that
16:51 🔗 SmileyG step back
16:51 🔗 SmileyG drink something
16:51 🔗 SmileyG the crazy thing is, while doing that
16:51 🔗 SmileyG I re-wrote my wife's dissertation, fixing all her grammar and stuff.
16:51 🔗 SmileyG I'm dyslexic, but it was a good way to not think about mine at the time D:
16:52 🔗 Schbirid procrastination is evil
16:56 🔗 xk_id btw, sorry, I know you won't be able to help, but I really need to tell this to someone to get it off my chest. I caught my crawler malfunctioning (i.e extracting incorrect data from the webpage; 2 pages in the friendlist, only extracted the first one). Now I'm running everything again and it works well. I have no, absolutely no idea what could be going on, how often it happens, and why
16:56 🔗 * xk_id rips the last hair on his scalp
16:56 🔗 SmileyG write about the bug!!!!!!!
16:57 🔗 Schbirid what he said!
16:57 🔗 xk_id can I do that?!
16:57 🔗 SmileyG lol that's kind of like the point.
16:57 🔗 SmileyG :/
16:57 🔗 SmileyG Or it was in CompSci
16:57 🔗 SmileyG It was never about the end project, it's about the journey.
16:57 🔗 SmileyG Show how you adapt to problems
16:57 🔗 Schbirid yeah
16:57 🔗 SmileyG Show how you've used your learning of the last 3 years to get around issues.
17:00 🔗 xk_id I'm wondering whether I should restrict the scope of my dissertation to the crawler, tbh
17:04 🔗 tef xk_id: you know that thing social networks do? return a fail whale page
17:04 🔗 Schbirid speak to your supervisor (or whatever that is called) if possible
17:04 🔗 tef that
17:05 🔗 xk_id tef: sorry?
17:05 🔗 tef xk_id: every so often you get a crap page and you have to hit f5. your crawler needs to have the same data
17:05 🔗 tef behaviour even
17:05 🔗 xk_id Schbirid: I tried, I dunno, I think I'm going really wrong about the whole thing. I'm 110% about results.
17:06 🔗 tef you need to be stricter about your scraper, and ensure it fails fast if the page is not what it expects, and says 'page error' rather than 'page ok, no data'
17:07 🔗 xk_id I haven't implemented any tests for page errors.
17:07 🔗 xk_id gather.com didn't seem that dynamic, I thought it won't do that sort of crap
17:08 🔗 tef heh
17:08 🔗 tef web pages fail
17:08 🔗 tef they fail more when you hammer them
17:08 🔗 xk_id I shall not sob!
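A minimal sketch of what tef describes above: treat an unexpected page as an error instead of "page ok, no data", and retry it the way a person would hit F5. Everything named here is an assumption for illustration: the `friend-list` container, the `friend` link class, and the `fetch` callable are hypothetical and do not reflect Gather.com's real markup or xk_id's Node.js crawler.

```python
import re
import time


class PageError(Exception):
    """Raised when a fetched page does not look like the page we expected."""


def parse_friend_page(html):
    """Extract friend names from one friend-list page, or raise PageError.

    The markup checked here (a 'friend-list' container and 'friend' links)
    is hypothetical; a real scraper would check whatever the site emits.
    """
    # Fail fast: without the expected container this is an error page,
    # not a user with zero friends.
    if 'id="friend-list"' not in html:
        raise PageError("friend-list container missing")
    return re.findall(r'<a class="friend"[^>]*>([^<]+)</a>', html)


def fetch_with_retry(fetch, url, attempts=3, delay=1.0, sleep=time.sleep):
    """Fetch and parse a page, retrying on PageError with a growing delay."""
    last_error = None
    for attempt in range(attempts):
        try:
            return parse_friend_page(fetch(url))
        except PageError as err:
            last_error = err
            if attempt < attempts - 1:
                # Back off before the next try: delay, 2*delay, 4*delay, ...
                sleep(delay * (2 ** attempt))
    raise PageError(f"gave up on {url} after {attempts} attempts: {last_error}")
```

An empty friend list still parses cleanly as `[]`, so the crawler can tell "no friends" apart from "the site served a junk page", which is the distinction that was missing when only the first of two pages got extracted.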
17:30 🔗 xk_id if only there was somewhere on the user's profile the total number of friends..
17:52 🔗 soultcer xk_id: What website are you scraping?
17:52 🔗 xk_id Gather.com
17:54 🔗 soultcer What's the crawler written in?
17:54 🔗 xk_id Node.js
19:21 🔗 xk_id \o/
19:21 🔗 xk_id crawler operational on cloudshards.com
20:46 🔗 * xk_id cries with joy
20:47 🔗 xk_id I found the bug
20:47 🔗 ersi \o/
20:49 🔗 xk_id programming is.. well.. it's something. the joys and the pains certainly balance each other....
20:49 🔗 xk_id :)
20:50 🔗 xk_id It's as much beautiful as it is horrible.
20:50 🔗 xk_id Those balances in life..
20:52 🔗 ersi Heh, yeah.
21:13 🔗 schbiridi xk_id: document it, you got another page written :)
21:14 🔗 xk_id :)
21:15 🔗 xk_id I really thought my dissertation should resemble more a journal article, rather than a sort of reflective/auto-biographical piece. but your suggestions are consistent with what I think my supervisor has been trying to explain to me for a while...
21:38 🔗 SmileyG explain what you're going to do
21:39 🔗 SmileyG explain what you did
21:39 🔗 SmileyG explain everything else.
22:11 🔗 xk_id I'm successfully running my crawlers on two VPSs. that I'm paying $3/month
22:12 🔗 xk_id *that I'm paying $3/month for
22:12 🔗 xk_id I don't think it's too bad
22:12 🔗 xk_id and there seem to be lots of providers like that
22:15 🔗 chronomex I like the 21st century :)
22:15 🔗 xk_id that's exactly what I was thinking too :)
22:19 🔗 SmileyG we do live in that future I dreamt of as a child :o
22:19 🔗 godane so i have uploaded 2679 g4tv.com web videos
22:20 🔗 SmileyG :O
22:39 🔗 omf_ godane, are you just working your way through that site?
22:47 🔗 godane i'm working on the videos
22:48 🔗 godane i can't get everything by myself
22:48 🔗 omf_ do you have a list of what is left and what is done?
22:50 🔗 godane i got most of the forums so i think i can do the forums
22:50 🔗 godane i also have the feed
22:51 🔗 godane i want to grab this: http://www.g4tv.com/techtvvault/index.html
22:51 🔗 godane but there is no easy way i think
22:52 🔗 godane very old techtv articles
22:53 🔗 godane i'm getting triumph of the nerds
22:54 🔗 godane i'm also getting secret life of machines
22:55 🔗 godane that is only found on p2p i think
22:55 🔗 godane the author even says to get it from p2p
22:55 🔗 godane http://en.wikipedia.org/wiki/Secret_Life_of_Machines
22:57 🔗 godane it looks like it was released on video tape and dvd
22:58 🔗 chronomex secret life of machines is cool
22:58 🔗 godane it's old enough that it shouldn't get darked
23:11 🔗 xk_id do you guys archive vimeo/youtube?
23:11 🔗 xk_id or, does anybody, for the matter?
23:11 🔗 xk_id better I should google..
23:11 🔗 balrog "72 hours of video are uploaded to YouTube every minute"
23:11 🔗 balrog good luck.
23:11 🔗 xk_id I had no idea that was the scale
23:16 🔗 godane i think IA tries to back up the ones with bigger videos
23:16 🔗 chronomex IA sucks in the videos that are mentioned on the twitter feed they archive
23:16 🔗 godane that's my thought
23:17 🔗 godane i also think it would be best to just also back up the videos said users have uploaded
23:18 🔗 chronomex sure
23:18 🔗 godane its a way to do sort of a tree grab of youtube
23:19 🔗 godane then you can start looking at playlists and grab those videos and all those users' videos
23:19 🔗 godane and etc
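The "tree grab" godane outlines above is roughly a breadth-first walk: start from known videos, follow each one to its uploader's other videos and to playlists, and repeat. A rough sketch under stated assumptions: `related` is a hypothetical callable standing in for whatever lookups a real crawler would do, and no real YouTube API is used.

```python
from collections import deque


def tree_grab(seeds, related, limit=1000):
    """Breadth-first walk over videos, starting from the seed video ids.

    related(video) is a hypothetical lookup returning the ids reachable from
    one video (same uploader's uploads, playlists that contain it, and so on).
    Returns the visit order, capped at `limit` videos.
    """
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    queue = deque(seeds)
    seen = set(seeds)
    order = []
    while queue and len(order) < limit:
        video = queue.popleft()
        order.append(video)
        for other in related(video):
            # Only queue videos we have not already scheduled.
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return order
```

The `limit` cap matters at the scale balrog quotes: without it the walk keeps expanding for as long as new uploaders and playlists turn up.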
23:57 🔗 dashcloud xk_id: can you access any of the things you want over IPv6? There's a vastly larger number of IPv6 addresses available for you if you could
23:58 🔗 dashcloud also, if you're really stuck, there's a Plan Z: bulletproof hosting (which is something to consider only if everything else fails)
