#archiveteam 2012-10-04,Thu

Time Nickname Message
00:26 πŸ”— SketchCow Huh.
00:31 πŸ”— godane hey SketchCow
00:33 πŸ”— BlueMax I assume we know about Webshots already?
00:36 πŸ”— BlueMax http://www.webshots.com/ turning into another service, deleting all user photos that do not conform to said service
00:56 πŸ”— SketchCow Yeah. I'm thinking we might need to go after this. Maybe.
00:57 πŸ”— godane closes on December 1, 2012.
00:57 πŸ”— godane so that may give us some time to decide to grab it
00:59 πŸ”— godane it uses flash to display images
01:01 πŸ”— godane i'm grabbing theregister.co.uk by year
01:02 πŸ”— godane there is also wget.log file
01:07 πŸ”— BlueMax how many photos does it have, 600 million?
01:14 πŸ”— SketchCow It says that.
01:19 πŸ”— joepie91 godane: where do you see it use flash?
01:20 πŸ”— godane http://www.webshots.com/pro/photo/3334729?path=/artist-kevin-mcneal
01:21 πŸ”— godane the picture is blocked by my flash blocker
01:21 πŸ”— joepie91 oh indeed
01:21 πŸ”— joepie91 that's weird
01:21 πŸ”— joepie91 ooooo
01:21 πŸ”— joepie91 that's for the pro part of the site
01:21 πŸ”— joepie91 the members part doesn't use flash it seems
01:22 πŸ”— godane ok
01:22 πŸ”— joepie91 also, the picture URLs are very easy to extract from the page source so that's not a problem
01:22 πŸ”— godane ok
01:23 πŸ”— joepie91 SketchCow: want me to have a run and collect as many usernames as possible?
01:23 πŸ”— joepie91 from webshots
01:24 πŸ”— joepie91 they seem to have a fairly parse-able index, but it seems limited to showing 10k users per category
02:15 πŸ”— joepie91 hm. webshots is pretty big.
02:20 πŸ”— Nintendud how i shot web
02:34 πŸ”— joepie91 hmhm: http://i.imgur.com/X7qwe.png
02:38 πŸ”— joepie91 well, looks like it started fetching users
02:38 πŸ”— joepie91 http://i.imgur.com/zlFdz.png
02:41 πŸ”— swagstaff Y'all on top of this? http://www.buzzfeed.com/katienotopoulos/your-internet-photos-are-already-starting-to-die
02:43 πŸ”— swagstaff "... However, buried deep within their http://help.getsmileapp.com/customer/portal/articles/708519-what-if-i-don%E2%80%99t-do-anything- is the bad news. The bad news is that if you don’t log into your old Webshots account and confirm it as a new Smile account, all your photos will be deleted. ...."
02:44 πŸ”— joepie91 swagstaff: http://i.imgur.com/zlFdz.png :)
02:44 πŸ”— joepie91 I'm not sure if there are any plans to archive everything
02:44 πŸ”— joepie91 but I'm already generating a list of all users, just in case
02:47 πŸ”— swagstaff Glad to see the Team is Ever Alert. Good luck if you decide to archive.
02:49 πŸ”— joepie91 alright, time to sleep... with a bit of luck it's done gathering usernames by tomorrow :D
02:50 πŸ”— Nintendud joepie91: nice.
02:50 πŸ”— Nintendud my warrior has been bored recently.
02:52 πŸ”— joepie91 heh
02:52 πŸ”— joepie91 also, not that it's much use since it's not really distributed, but if anyone wants the source of said script, git clone http://git.cryto.net/repo/projects/joepie91/webshots
02:53 πŸ”— joepie91 very hacky and simple, but it works :P
02:57 πŸ”— arkhive 690 million photos apparently.
02:58 πŸ”— joepie91 yup
02:59 πŸ”— joepie91 time for sleep
02:59 πŸ”— joepie91 goodnight all
03:28 πŸ”— * DFJustin anticipates the cathedral of butthurt as photographers realize that flash doesn't keep you from saving the photos
05:18 πŸ”— SketchCow I was wrong.
05:18 πŸ”— SketchCow By the way.
05:18 πŸ”— SketchCow Wayback machine has indexed 186 billion webpages.
05:18 πŸ”— SketchCow And is expecting to do 240 billion.
05:18 πŸ”— SketchCow Billion.
05:27 πŸ”— DFJustin (☉ε　⊙ﾉ)ﾉ
05:40 πŸ”— BlueMax I think I just crapped my pants at that number
06:04 πŸ”— Nemo_bis too bad Google no longer has that childish count of indexed pages
06:07 πŸ”— Nemo_bis hmm "In 2012, Google has indexed over 30 trillion web pages, 100 billion queries per month"
09:35 πŸ”— alard I've started a Webshots page on the wiki: http://archiveteam.org/index.php?title=Webshots
09:36 πŸ”— alard (There are some nice comments on this "photo of the day": http://travel.webshots.com/photo/2248078140105543869vCJpvs )
11:28 πŸ”— alard https://github.com/ArchiveTeam/webshots-grab/
11:29 πŸ”— SmileyG oooo code
11:30 πŸ”— * SmileyG looks and attempts to learn lua in 10 minutes before giving up
11:33 πŸ”— SmileyG yup, I have no clue wtf that does :<
11:36 πŸ”— joepie91 All photos will be removed by December 1. Until then, you may use message boards and you may search, browse and view images, but you won’t be able to upload or download images.
11:36 πŸ”— joepie91 nasty.
11:38 πŸ”— joepie91 "hi, we're going to close down the site and tell you in advance but you can't download anything anymore by the time you're aware of it"
11:38 πŸ”— joepie91 not that flash is a terribly good protection, but ok :P
11:38 πŸ”— SmileyG CHALLENGE ACCEPTED!
11:42 πŸ”— joepie91 lol
11:44 πŸ”— SmileyG hmmm
11:44 πŸ”— SmileyG is it a sign that I accidentally named the script webshits.sh?
11:45 πŸ”— joepie91 hahaha
11:46 πŸ”— joepie91 mmm
11:46 πŸ”— joepie91 AT wiki frontpage really needs an update
12:35 πŸ”— joepie91 http://i.imgur.com/EMv89.png
12:36 πŸ”— joepie91 found about 950k users so far
12:36 πŸ”— joepie91 re: webshots
12:36 πŸ”— SmileyG joepie91: how? :o
12:37 πŸ”— joepie91 SmileyG: I started out by crawling all 'top members' for all categories
12:37 πŸ”— joepie91 script just parses all usernames out of the page, deduplicates
12:37 πŸ”— SmileyG hmmm
12:37 πŸ”— joepie91 and writes the whole list to a file
12:37 πŸ”— joepie91 it's been running for a night or so now
12:37 πŸ”— SmileyG nice, where are these scripts :S
12:37 πŸ”— joepie91 I estimate it's through about half of the users now
12:38 πŸ”— joepie91 (it just collects usernames, nothing else, though)
12:38 πŸ”— joepie91 git clone http://git.cryto.net/repo/projects/joepie91/webshots
12:38 πŸ”— SmileyG ahhh
12:38 πŸ”— joepie91 it's a stupidly simple script though :P
12:38 πŸ”— joepie91 (regex is fun!)
12:39 πŸ”— SmileyG :/
12:40 πŸ”— SmileyG I'm trying to understand this and i just urgh.
12:41 πŸ”— joepie91 SmileyG: what part are you having trouble understanding?
12:41 πŸ”— SmileyG actually your code kind of makes sense
12:41 πŸ”— SmileyG but I just don't know how I'd ever write it :<
12:41 πŸ”— joepie91 what it basically does is this:
12:42 πŸ”— joepie91 retrieve community index, find all "top members" links, retrieve all those links, and for each of them find all pagination links
12:42 πŸ”— joepie91 then for every page of every category ('top members' link), it finds all user-page URLs in the page
12:42 πŸ”— joepie91 and extracts the username from that
12:42 πŸ”— joepie91 that's pretty much it
12:43 πŸ”— joepie91 structure may seem a bit odd because I'm trying to prevent it from loading the first page of a category twice
12:43 πŸ”— joepie91 so the first page (which is the destination of the 'top members' link) is added separately
12:43 πŸ”— joepie91 to the 'queue'
12:43 πŸ”— joepie91 if I hadn't done that, it would've sufficed to just use a few nested foreach loops and I'd be done
12:43 πŸ”— joepie91 :P
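
The crawl joepie91 describes above can be sketched roughly as follows. This is not the script from his repository, only a minimal Python illustration of the same idea: fetch the community index, follow every "top members" link, queue the pagination links of each category, and pull usernames out of the user-page URLs with a regex. The three URL patterns and the `fetch` callable are assumptions made for the example, since the log does not show the actual Webshots markup.

```python
import re
from collections import deque

# Hypothetical URL patterns: the real Webshots markup is not shown in the log.
TOP_MEMBERS_RE = re.compile(r'href="(/topmembers/[^"?]+)"')
PAGINATION_RE = re.compile(r'href="(/topmembers/[^"?]+\?page=\d+)"')
USER_RE = re.compile(r'href="/user/([A-Za-z0-9_-]+)"')


def collect_usernames(fetch, index_url):
    """Crawl every 'top members' category and return a set of usernames.

    `fetch` is any callable that maps a URL to that page's HTML as a string.
    """
    usernames = set()   # a set deduplicates for free
    seen = set()        # URLs that are already queued
    queue = deque()     # entries are (url, html or None)

    for category_url in TOP_MEMBERS_RE.findall(fetch(index_url)):
        if category_url in seen:
            continue
        seen.add(category_url)
        first_page = fetch(category_url)
        # The first page had to be downloaded to discover the pagination
        # links, so its HTML is queued directly instead of being fetched again.
        queue.append((category_url, first_page))
        for page_url in PAGINATION_RE.findall(first_page):
            if page_url not in seen:
                seen.add(page_url)
                queue.append((page_url, None))

    while queue:
        url, html = queue.popleft()
        if html is None:
            html = fetch(url)
        usernames.update(USER_RE.findall(html))

    return usernames
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library. The returned set could then be written one name per line to a file such as users.txt.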
12:56 πŸ”— SmileyG :/ archive.org down :?
13:10 πŸ”— joepie91 works for me
13:59 πŸ”— joepie91 whoop
13:59 πŸ”— joepie91 986098
13:59 πŸ”— joepie91 root@aarnist:~/webshots/webshots# cat users.txt | wc -l
14:00 πŸ”— joepie91 seems it's done :)
14:00 πŸ”— joepie91 a remark though:
14:00 πŸ”— joepie91 there were *very* few duplicates
14:00 πŸ”— joepie91 that makes me think that this is really only a portion of the webshots users
14:00 πŸ”— joepie91 every category's "top users" listing only shows 100 pages max
14:01 πŸ”— joepie91 SmileyG, SketchCow, any suggestions as to how to get more usernames?
14:04 πŸ”— joepie91 yeah, I was afraid of this: https://www.google.nl/search?sugexp=chrome,mod=8&sourceid=chrome&ie=UTF-8&q=site%3Acommunity.webshots.com+inurl%3Auser
14:04 πŸ”— joepie91 about 11.400.000 results
14:04 πŸ”— joepie91 that's about 11 times as much as I have now
14:07 πŸ”— jiphex Safari is quite clever. If you open a link like google.com/search?q=something%20something - it parses the query and puts the query into the google search address bar thing as if you'd typed it yourself
14:15 πŸ”— SmileyG hmmm
14:15 πŸ”— SmileyG not really
14:15 πŸ”— SmileyG POST.
14:15 πŸ”— SmileyG :D
14:17 πŸ”— jiphex erm, wrong channel :/
14:22 πŸ”— joepie91 SmileyG: http://aarnist.cryto.net:81/webshots/users.txt
14:22 πŸ”— joepie91 also, SmileyG, #webshots
14:22 πŸ”— alard (and everyone else is welcome too, of course)
14:23 πŸ”— joepie91 :P
17:34 πŸ”— alard SketchCow: Can you make a webshots rsync area on fos?
17:46 πŸ”— SketchCow Yes.
17:46 πŸ”— SketchCow How big do we think this is going to be? Probably big, huh.
17:48 πŸ”— alard Big, yes.
17:49 πŸ”— SketchCow I'm in that channel, let's discuss it there.
19:53 πŸ”— SketchCow Cool thing.
19:53 πŸ”— SketchCow Someone has donated a massive pile of recorded-off-vcr news programs from the 1980s.
19:53 πŸ”— SketchCow So yeah
19:55 πŸ”— SketchCow I'm packing up the first range of Cinch and putting it on the archive.
19:55 πŸ”— SketchCow 446gb of audio!
19:55 πŸ”— joepie91 Cinch?
19:56 πŸ”— chronomex three pounds of flax!
20:05 πŸ”— SketchCow Cinch.FM, one of the sillier shutdowns.
20:11 πŸ”— alard SketchCow: Keep in mind that this is not the main website, but the discussion boards: http://archive.org/details/archiveteam-city-of-heroes-main
20:12 πŸ”— alard I have a copy of the City of Heroes website (the documentation), but I haven't uploaded that yet.
20:41 πŸ”— joepie91 in other news: minus is basically dead, they only allow multimedia uploads now and want to move away from general purpose file storage
20:42 πŸ”— chronomex pirated movies ONLY
20:43 πŸ”— joepie91 lol
20:43 πŸ”— Aranje truth
20:43 πŸ”— chronomex worked for megaupload, right?
20:43 πŸ”— chronomex that guy made -piles- of money
20:44 πŸ”— Aranje It's sounding like he'll get to keep it too, pretty shortly
20:44 πŸ”— Aranje on a long enough (or short enough!) timescale, minus will be a raging success :D
20:45 πŸ”— godane i'm still trying to find old webuser magazines for my "collection"
20:45 πŸ”— godane some of the missing issues only link to dead file servers now
20:46 πŸ”— godane or the servers aren't there anymore
20:47 πŸ”— joepie91 hai Aranje
20:47 πŸ”— joepie91 long time no speak
20:47 πŸ”— joepie91 also #webshots
20:47 πŸ”— joepie91 :P
20:57 πŸ”— SketchCow So, to explain - I just call it the "Main" city of heroes grab because we want to do one last supplemental grab later.
20:57 πŸ”— SketchCow That's what I meant.
20:59 πŸ”— alard SketchCow: Yes, I thought so. Shall I rsync you the warcs of the main website so you can add them to the item? (The 'alard' rsync space on fos is gone.)
21:18 πŸ”— SketchCow How big?
21:26 πŸ”— alard 3.6GB
21:28 πŸ”— SketchCow I guess I need an alard place.
21:28 πŸ”— SketchCow Let me get that going.
21:57 πŸ”— oli joepie91: ...
21:57 πŸ”— joepie91 lol
21:57 πŸ”— joepie91 hai
21:57 πŸ”— oli i got lots of capacity in LA and the hetzner box
21:57 πŸ”— joepie91 see #webshots
21:57 πŸ”— joepie91 :P
