Time | Nickname | Message
00:20 | yipdw | why can you not search Google Apps for the App Passwords page
00:20 | yipdw | come on Google, index yourself
00:21 | * | yipdw can never friggin find this damn thing
00:22 | xmc | who indexes the indexer
00:23 | yipdw | interestingly, you can search in the Google Apps Admin site for settings
00:23 | yipdw | I wonder if nobody at Google uses the App Passwords page enough for it to matter
00:23 | yipdw | because why in the world would you ever use an app that wasn't web-based
00:34 | | godane has joined #archiveteam-bs
01:00 | | powerKit2 has joined #archiveteam-bs
01:00 | powerKit2 | https://catalogd.archive.org/log/599682290 ...did this task break?
01:00 | xmc | should be still running
01:02 | powerKit2 | -shrug- it just seemed to be taking longer than it should
01:03 | xmc | when they break, the row in /history/ turns red and there's an error message in the log
01:03 | xmc | unless something is deeply wrong
01:04 | xmc | if the derive for a 20-minute video doesn't complete in six hours, then that's cause for worry
01:04 | xmc | but an hour? ehhh
01:05 | powerKit2 | I think the longest video in the item is 3 hours and 20 minutes
01:05 | | nickname_ has quit IRC (Read error: Operation timed out)
01:06 | xmc | oh, well, then that's... yeah
01:06 | xmc | take two aspirin and call me in the morning
01:08 | powerKit2 | I'm guessing this is why people don't typically upload 39 gigabytes of video onto the Internet Archive.
01:09 | powerKit2 | I just figured it'd be kinda mean to dump 121+ individual items into community video.
01:20 | powerKit2 | Anyway, I've been meaning to start recording my videos in FFv1 from now on. Can the archive derive from video encoded that way?
01:23 | xmc | well. an item should be a work that stands on its own
01:23 | xmc | not three works, not half a work
01:23 | xmc | how you define this ... hard to say
01:26 | | Yoshimura has quit IRC (Remote host closed the connection)
01:28 | powerKit2 | Honestly, I just didn't want to go through 121 random videos with non-descriptive names and figure out what each one was.
01:28 | xmc | fair
01:30 | powerKit2 | Anyway, before I start recording my future videos in FFv1, can the archive actually derive from them?
01:30 | xmc | what is ffv1
01:30 | powerKit2 | https://en.wikipedia.org/wiki/FFV1
01:30 | xmc | i suggest you make a short test video and upload it into a test item and see what happens
01:31 | xmc | test items get deleted after a month
01:31 | powerKit2 | I think it'd work, it looks like derive.php uses libavcodec, which includes FFV1.
01:33 | powerKit2 | Yeah, I'll just make a test video later and see.
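A quick sketch of how such a test clip could be produced, assuming a stock ffmpeg build (the testsrc lavfi source and the ffv1 encoder ship with it; the output filename is illustrative):

    ffmpeg -f lavfi -i testsrc=duration=10:size=640x480:rate=30 -c:v ffv1 -level 3 ffv1-test.mkv

That generates ten seconds of synthetic video and encodes it as FFV1 version 3 in a Matroska container, small enough to throw into a disposable test item.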
02:12 | | powerKit2 has quit IRC (Quit: Page closed)
02:46 | | zenguy has quit IRC (Ping timeout: 370 seconds)
02:57 | | Yoshimura has joined #archiveteam-bs
03:03 | | zenguy has joined #archiveteam-bs
03:12 | | n00b184 has joined #archiveteam-bs
04:05 | | Ravenloft has quit IRC (Read error: Connection reset by peer)
04:55 | | krazedkat has quit IRC (Leaving)
05:06 | | Sk1d has quit IRC (Ping timeout: 250 seconds)
05:07 | godane | i'm at 995k items now
05:07 | godane | less than 5k items away from 1 million items
05:08 | godane | also nasa docs are almost done
05:13 | | Sk1d has joined #archiveteam-bs
05:26 | | mst__ has joined #archiveteam-bs
05:35 | | mst__ has quit IRC (Quit: bye)
06:49 | | Asparagir has joined #archiveteam-bs
07:08 | whopper | National Library of Australia's PANDORA internet archive... I had no idea this thing existed - http://pandora.nla.gov.au/
07:08 | whopper | 485,506,170 files and 25.66 TB
07:26 | | turnkit has joined #archiveteam-bs
07:27 | turnkit | Anyone heard of college "viewbooks" -- they are basically mini booklets describing a college for prospective students. I'm considering trying to create a large collection of them.
07:27 | turnkit | I found a site that has them sort of aggregated already: https://issuu.com/search?q=viewbook
07:28 | turnkit | but many of them are marked "no download"
07:28 | turnkit | and if I go to different college sites I can find them. But I think it'd basically be a lot of manual searching to get one from each college for each year that they were available.
07:29 | turnkit | The part I am interested in is finding what college clubs each college had each year.
07:29 | turnkit | I would think someone already has indexed this but I haven't found an index of college clubs yet.
07:30 | turnkit | anyone happen to already stumble on a college viewbook pdf collection that I could use to extract that info?
07:30 | turnkit | i guess... https://www.google.com/search?q=college+viewbook+type%3A.pdf
07:31 | turnkit | Can I just run that into wget somehow? (time to listen to the man)
07:33 | | ravetcofx has quit IRC (Read error: Operation timed out)
07:49 | yipdw | turnkit: if you've got a list of URLs, yeah, you can feed those into wget/wpull/whatever
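A sketch of that wget route, assuming a hypothetical urls.txt with one viewbook URL per line (all flags shown are standard wget options):

    wget --input-file=urls.txt --warc-file=viewbooks --wait=1 --random-wait

--input-file reads the URL list, --warc-file additionally records the responses into a WARC for archiving, and the two wait flags space the requests out politely.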
07:50 | turnkit | this is a pretty dumb question but do you know an easy way to get google results into a list? I guess I could save the whole page then grep or sed for http:// but it seems like there should be a simpler way
07:50 | yipdw | unfortunately I don't know of any Google search scraper offhand that'll do this
07:50 | yipdw | the main difficulty is that Google builds a lot of bot checks into the search
07:51 | turnkit | I found an SEO plugin that claims to save Google results as CSV but it was bloaty
07:51 | turnkit | Well I found how to change the Google setting to get 100 results per page -- that sort of helps
07:52 | turnkit | ? http://www.labnol.org/internet/google-web-scraping/28450/
07:53 | turnkit | oh that doesn't work -- I found that last week and couldn't figure it out
07:54 | turnkit | i guess this is more basic than I thought.... stumbling around. https://www.google.com/search?&q=scrape+google+search+results+into+links
07:56 | yipdw | so, the basics are not too bad; if you keep a human-like pace and don't give yourself away obviously (e.g. by using the default curl/wget user-agent) you'll probably be fine just grabbing each search page
07:56 | yipdw | and parsing out the links with nokogiri/beautifulsoup/whatever
07:57 | yipdw | the problem comes when people go "oh, one process is good, let me scale up to 47"
07:57 | yipdw | and then they wonder why they are getting no results
07:59 | yipdw | you will have to deal with getting the URL out of the Google URL redirect thingy
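A minimal sketch of that whole approach in Python, assuming the requests and beautifulsoup4 libraries are installed; Google's result markup changes often, so treat the link extraction as illustrative rather than stable:

    import time
    import urllib.parse

    import requests
    from bs4 import BeautifulSoup

    def google_result_links(query, pages=3, pause=30):
        # Browser-like User-Agent, so we don't announce ourselves with
        # the default requests/curl/wget one.
        headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
        for page in range(pages):
            resp = requests.get(
                "https://www.google.com/search",
                params={"q": query, "start": page * 10},
                headers=headers,
            )
            resp.raise_for_status()
            soup = BeautifulSoup(resp.text, "html.parser")
            for a in soup.find_all("a", href=True):
                # Google wraps each result in a /url?q=<real URL>&... redirect;
                # pull the real URL back out of the q parameter.
                if a["href"].startswith("/url?"):
                    qs = urllib.parse.parse_qs(urllib.parse.urlparse(a["href"]).query)
                    if "q" in qs:
                        yield qs["q"][0]
            time.sleep(pause)  # keep the human-like pace described above

Something like `for url in google_result_links("college viewbook filetype:pdf"): print(url)` would then produce a list that can be fed to wget as sketched earlier.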
08:06 | yipdw | turnkit: e.g. https://gitlab.peach-bun.com/snippets/44, quick scripting
08:26 | turnkit | I'll check that out. Thanks!
08:32 | | turnkit_ has joined #archiveteam-bs
09:10 | | krazedkat has joined #archiveteam-bs
09:11 | | GE has joined #archiveteam-bs
09:51 | | turnkit_ has quit IRC (Ping timeout: 268 seconds)
10:05 | | Smiley has joined #archiveteam-bs
10:07 | | SmileyG has quit IRC (Ping timeout: 250 seconds)
10:38 | | GE has quit IRC (Quit: zzz)
10:42 | | turnkit has quit IRC (Quit: Page closed)
10:44 | | BlueMaxim has quit IRC (Quit: Leaving)
11:21 | | n00b184 has quit IRC (Ping timeout: 268 seconds)
12:35 | | GE has joined #archiveteam-bs
14:00 | | SilSte has joined #archiveteam-bs
14:08 | | tfgbd_znc has quit IRC (Read error: Operation timed out)
14:09 | | tfgbd_znc has joined #archiveteam-bs
14:11 | | SilSte has quit IRC (Read error: Connection reset by peer)
14:12 | | SilSte has joined #archiveteam-bs
14:49 | | sep332_ has quit IRC (konversation out)
14:51 | | sep332_ has joined #archiveteam-bs
14:54 | | Start has quit IRC (Quit: Disconnected.)
15:50 | | Ravenloft has joined #archiveteam-bs
16:07 | | ravetcofx has joined #archiveteam-bs
16:25 | | Shakespea has joined #archiveteam-bs
16:25 | Shakespea | Afternoon
16:26 | Shakespea | I found an interesting site
16:27 | Shakespea | www.oldapps.com any possibility of getting it archived? I would mention this on the website, but owing to some unfortunate misunderstandings I can't raise the matter there at the moment.
16:28 | Aoede | "This web page at download.oldapps.com has been reported to contain unwanted software and has been blocked"
16:28 | Aoede | thanks firefox
16:28 | Shakespea | Are you using an ad-blocker?
16:29 | Shakespea | It loaded fine for me
16:29 | Shakespea | http://www.oldapps.com/index.php being the full URL
16:30 | Aoede | Loads fine, just doesn't let me download anything. Weird
16:30 | Shakespea | The useful thing is that it seems to have older versions of some 'sharing' tools... ;)
16:30 | Aoede | :D
16:31 | Shakespea | I also noted - www.mdgx.com
16:31 | Shakespea | Which has support files going back nearly 20 years
16:32 | Shakespea | (And which probably should be mirrored at some point)
16:39 | Shakespea | And I'm down by 2 on my 3 suggestions this month :(
16:39 | Aoede | mdgx was grabbed by archivebot in 2015
16:40 | Aoede | http://archive.fart.website/archivebot/viewer/job/49n9f
16:42 | Aoede | oldapps.com in 2014 http://archive.fart.website/archivebot/viewer/job/7dvez
16:42 | Shakespea | Aoede: Thanks... mdgx gets updated quite a bit though... so I hope it's on a regular schedule :)
16:43 | Aoede | Want me to throw it in Archivebot?
16:43 | Shakespea | Feel free, if it's possible to do an incremental
16:44 | Shakespea | The one thing I can never find online is old sewing patterns though....
16:48 | Aoede | Dunno if incremental is possible
16:48
🔗
|
Sanqui |
I think OldApps may be covered, not sure though. |
17:09
🔗
|
Shakespea |
Aoede: My next query would be to look into whether wget has an 'incremental' option in it, as it save badnwidth if you only have to add a few new files vs the whole site. |
17:10
🔗
|
Shakespea |
If you want to throw it in the bot anyway , don't let me stop you :) |
17:10 | xmc | wget does
17:11 | xmc | --continue
17:11 | Shakespea | xmc: I meant "date incremental", i.e. grab everything that's changed since we last took a sample...
17:11 | xmc | yep
17:11 | Aoede | --warc-dedup?
17:12 | Shakespea | Aoede: Possibly...
17:12 | xmc | --continue --mirror will crawl the site but only download files that are different
17:12 | xmc | i'm not sure exactly how it works, to be honest
17:12 | Shakespea | Thanks
17:12 | xmc | /topic unofficial wget user group
17:13 | xmc | anyway. wget --continue --mirror will probably do what you want. but test first
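In wget's terms, --mirror is shorthand for -r -N -l inf --no-remove-listing, and the -N (timestamping) part is what skips unchanged files: wget compares the server's Last-Modified date against the local copy, while --continue resumes partially downloaded files. A sketch of the two variants discussed here, with illustrative WARC names:

    wget --continue --mirror http://www.mdgx.com/
    wget --mirror --warc-file=mdgx-new --warc-cdx --warc-dedup=mdgx-old.cdx http://www.mdgx.com/

The second line is one reading of the --warc-dedup suggestion: URLs whose payload already appears in the older CDX index are written to the new WARC as revisit records instead of being stored again.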
17:13 | Shakespea | My third suggestion for this month would be to ask who's archiving "adult" fiction sites like asstr, Fictionmania etc
17:13 | Shakespea | These can apparently vanish without warning ...
17:13 | xmc | i'm not aware of an active project for those sites
17:14 | xmc | you're welcome to start one
17:14 | Shakespea | I can't use the wiki at the moment, owing to some unfortunate misunderstandings...
17:14 | yipdw | archivebot's crawlers support incremental fetch to the degree the site itself makes it possible to determine what's changed
17:14 | yipdw | archivebot itself does not
17:14 | yipdw | good news is you can use wget/wpull to do that manually until that situation's resolved
17:15 | Shakespea | Thank you for that explanation.
17:16 | xmc | doesn't it use the If-Modified-Since: header?
17:16 | yipdw | wget can use that yeah
17:16 | yipdw | but a website doesn't have to send that or send one that makes any sense
17:17 | yipdw | er, sorry, wget uses Last-Modified
17:17 | xmc | sent by the client
17:17 | xmc | ah
17:17 | yipdw | it's not clear to me whether wget does conditional GETs yet
17:17 | xmc | yes. the web is garbage, and we try to layer useful things over that
17:18 | yipdw | yeah whoops
17:18 | yipdw | I confused If-Modified-Since with Last-Modified, go me
17:18 | xmc | np
17:20 | yipdw | they're only different parts of the request
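To untangle the two headers: Last-Modified is a response header, and If-Modified-Since is the request header a client echoes it back in. A minimal sketch of the conditional GET being discussed, in Python with requests, assuming the server actually implements it:

    import requests

    url = "http://www.mdgx.com/"  # example target from the discussion above

    # First fetch: remember the server's Last-Modified response header.
    first = requests.get(url)
    stamp = first.headers.get("Last-Modified")

    # Later fetch: echo it back as If-Modified-Since. An unchanged page
    # comes back as 304 Not Modified with an empty body, so nothing is
    # re-downloaded.
    if stamp:
        again = requests.get(url, headers={"If-Modified-Since": stamp})
        print(again.status_code)  # 304 if unchanged, 200 if it changed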
17:20 | Shakespea | But still in theory possible not to have to grab a whole site multiple times...
17:21 | Shakespea | (which some may still want to do for other reasons, of course...)
17:21 | Shakespea | Thanks...
17:22 | Shakespea | BTW my fourth of 3 suggestions for archiving this month (sorry) would be news sites on Trump that are pre-election, before his lawyers get to them ;)
17:22 | * | Shakespea out
17:22 | | Shakespea has left
17:28 | xmc | uh
17:44 | yipdw | typing on the edge of chaos
17:44
🔗
|
|
computerf has quit IRC (Read error: Operation timed out) |
18:13
🔗
|
|
computerf has joined #archiveteam-bs |
19:42
🔗
|
|
Start has joined #archiveteam-bs |
19:46
🔗
|
|
kristian_ has joined #archiveteam-bs |
19:53
🔗
|
|
krazedkat has quit IRC (Read error: Operation timed out) |
20:09
🔗
|
|
Start has quit IRC (Remote host closed the connection) |
20:50
🔗
|
|
Start has joined #archiveteam-bs |
20:52
🔗
|
|
Start has quit IRC (Client Quit) |
20:54
🔗
|
|
Start has joined #archiveteam-bs |
20:59
🔗
|
|
Start has quit IRC (Client Quit) |
21:03
🔗
|
|
Start has joined #archiveteam-bs |
21:11
🔗
|
|
Yoshimura has quit IRC (Remote host closed the connection) |
21:44
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
21:49
🔗
|
|
BartoCH has joined #archiveteam-bs |
21:54
🔗
|
|
Start has quit IRC (Remote host closed the connection) |
22:24
🔗
|
|
krazedkat has joined #archiveteam-bs |
23:04
🔗
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
23:14
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
23:17
🔗
|
|
GE has quit IRC (Quit: zzz) |
23:17 | godane | so i should be past 1 million items by the morning
23:29 | xmc | wow!
23:29 | | Start has joined #archiveteam-bs