#archiveteam-bs 2016-11-16,Wed


Time Nickname Message
00:20 🔗 yipdw why can you not search Google Apps for the App Passwords page
00:20 🔗 yipdw come on Google, index yourself
00:21 🔗 * yipdw can never friggin find this damn thing
00:22 🔗 xmc who indexes the indexer
00:23 🔗 yipdw interestingly, you can search in the Google Apps Admin site for settings
00:23 🔗 yipdw I wonder if nobody at Google uses the App Passwords page enough for it to matter
00:23 🔗 yipdw because why in the world would you ever use an app that wasn't web-based
00:34 🔗 godane has joined #archiveteam-bs
01:00 🔗 powerKit2 has joined #archiveteam-bs
01:00 🔗 powerKit2 https://catalogd.archive.org/log/599682290 ...did this task break?
01:00 🔗 xmc should be still running
01:02 🔗 powerKit2 -shrug- it just seemed to be taking longer than it should
01:03 🔗 xmc when they break the row in /history/ turns red and there's an error message in the log
01:03 🔗 xmc unless something is deeply wrong
01:04 🔗 xmc if the derive for a 20 minute video doesn't complete in six hours, then that's cause for worry
01:04 🔗 xmc but an hour? ehhh
01:05 🔗 powerKit2 I think the longest video in the item is 3 hours and 20 minutes
01:05 🔗 nickname_ has quit IRC (Read error: Operation timed out)
01:06 🔗 xmc oh, well, then that's yeah
01:06 🔗 xmc take two aspirin and call me in the morning
01:08 🔗 powerKit2 I'm guessing this is why people don't typically upload 39 gigabytes of video onto the Internet Archive.
01:09 🔗 powerKit2 I just figured it'd be kinda mean to dump 121+ individual items into community video.
01:20 🔗 powerKit2 Anyway, I've been meaning to start recording my videos in FFv1 from now on. Can the archive derive from video encoded that way?
01:23 🔗 xmc well. an item should be a work that stands on its own
01:23 🔗 xmc not three works, not half a work
01:23 🔗 xmc how you define this ... hard to say
01:26 🔗 Yoshimura has quit IRC (Remote host closed the connection)
01:28 🔗 powerKit2 Honestly, I just didn't want to go through 121 random videos with non-descriptive names and figure out what each one was.
01:28 🔗 xmc fair
01:30 🔗 powerKit2 Anyway, before I start recording my future videos in FFv1, can the archive actually derive from them?
01:30 🔗 xmc what is ffv1
01:30 🔗 powerKit2 https://en.wikipedia.org/wiki/FFV1
01:30 🔗 xmc i suggest you make a short test video and upload it into a test item and see what happens
01:30 🔗 xmc test items get deleted after a month
01:31 🔗 powerKit2 I think it'd work; it looks like derive.php uses libavcodec, which includes FFV1.
01:33 🔗 powerKit2 Yeah, I'll just make a test video later and see.
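A minimal sketch for producing such a test clip with ffmpeg, assuming a build with the lavfi testsrc generator; the duration, resolution, frame rate, and output name are all illustrative:

```sh
# Generate a short synthetic test video encoded as FFV1 version 3 in a
# Matroska container (parameters are illustrative; adjust as needed).
ffmpeg -f lavfi -i testsrc=duration=5:size=640x480:rate=30 \
       -c:v ffv1 -level 3 ffv1-test.mkv
```

Uploading the result into a throwaway test item, as suggested above, shows directly whether the deriver accepts the codec.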
02:12 🔗 powerKit2 has quit IRC (Quit: Page closed)
02:46 🔗 zenguy has quit IRC (Ping timeout: 370 seconds)
02:57 🔗 Yoshimura has joined #archiveteam-bs
03:03 🔗 zenguy has joined #archiveteam-bs
03:12 🔗 n00b184 has joined #archiveteam-bs
04:05 🔗 Ravenloft has quit IRC (Read error: Connection reset by peer)
04:55 🔗 krazedkat has quit IRC (Leaving)
05:06 🔗 Sk1d has quit IRC (Ping timeout: 250 seconds)
05:07 🔗 godane i'm at 995k items now
05:07 🔗 godane less than 5k items away from 1 million items
05:08 🔗 godane also nasa docs are almost done
05:13 🔗 Sk1d has joined #archiveteam-bs
05:26 🔗 mst__ has joined #archiveteam-bs
05:35 🔗 mst__ has quit IRC (Quit: bye)
06:49 🔗 Asparagir has joined #archiveteam-bs
07:08 🔗 whopper National Library of Australia's PANDORA internet archive... I had no idea this thing existed - http://pandora.nla.gov.au/
07:08 🔗 whopper 485,506,170 files and 25.66 TB
07:26 🔗 turnkit has joined #archiveteam-bs
07:27 🔗 turnkit Anyone heard of college "viewbooks" -- they are basically mini booklets describing a college for prospective students. I'm considering trying to create a large collection of them.
07:27 🔗 turnkit I found a site that has them sort of aggregated already: https://issuu.com/search?q=viewbook
07:28 🔗 turnkit but many of them are marked "no download"
07:28 🔗 turnkit and if I go to different college sites I can find them, but I think it'd basically be a lot of manual searching to get one from each college for each year they were available.
07:29 🔗 turnkit The part I am interested in is finding what college clubs each college had each year.
07:29 🔗 turnkit I would think someone already has indexed this but I haven't found an index of college clubs yet.
07:30 🔗 turnkit anyone happen to already stumble on a college viewbook pdf collection that I could use to extract that info?
07:30 🔗 turnkit i guess... https://www.google.com/search?q=college+viewbook+type%3A.pdf
07:31 🔗 turnkit Can I just run that into wget somehow? (time to listen to the man)
07:33 🔗 ravetcofx has quit IRC (Read error: Operation timed out)
07:49 🔗 yipdw turnkit: if you've got a list of URLs, yeah, you can feed those into wget/wpull/whatever
07:50 🔗 turnkit this is a pretty dumb question but do you know an easy way to get Google results into a list? I guess I could save the whole page then grep or sed for http:// but seems like there should be a simpler way
07:50 🔗 yipdw unfortunately I don't know of any Google search scraper offhand that'll do this
07:50 🔗 yipdw the main difficulty is that Google builds a lot of bot checks into the search
07:51 🔗 turnkit I found an SEO plugin that claims to save Google results as CSV but it was bloaty
07:51 🔗 turnkit Well, I found out how to change the Google setting to get 100 results per page -- that sort of helps
07:52 🔗 turnkit ? http://www.labnol.org/internet/google-web-scraping/28450/
07:53 🔗 turnkit oh that doesn't work -- I found that last week and couldn't figure it out
07:54 🔗 turnkit i guess this is more basic than I thought.... stumbling around. https://www.google.com/search?&q=scrape+google+search+results+into+links
07:56 🔗 yipdw so, the basics are not too bad; if you keep a human-like pace and don't give yourself away obviously (e.g. by using the default curl/wget user-agent or whatever) you'll probably be fine just grabbing each search page
07:56 🔗 yipdw and parsing out the links with nokogiri/beautifulsoup/whatever
07:57 🔗 yipdw the problem comes when people go "oh, one process is good, let me scale up to 47"
07:57 🔗 yipdw and then they wonder why they are getting no results
07:59 🔗 yipdw you will have to deal with getting the URL out of the Google URL redirect thingy
08:06 🔗 yipdw turnkit: e.g. https://gitlab.peach-bun.com/snippets/44, quick scripting
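A rough shell sketch along the lines yipdw describes (distinct from the linked snippet): the query, user agent, and pacing are illustrative, the extracted URLs may still be percent-encoded, and Google may block request patterns that don't look human:

```sh
# Fetch one results page, pull out the /url?q= redirect wrappers Google puts
# around result links, strip the wrapper, and keep the unique targets.
curl -s -A 'Mozilla/5.0' \
     'https://www.google.com/search?q=college+viewbook+filetype:pdf&num=100' |
  grep -oE '/url\?q=[^&"]+' |
  sed 's|^/url?q=||' |
  sort -u > urls.txt

# keep a human-like pace before grabbing the next page of results
sleep 60

# then feed the list to wget, as discussed earlier
wget -i urls.txt
```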
08:26 🔗 turnkit I'll check that out. Thanks!
08:32 🔗 turnkit_ has joined #archiveteam-bs
09:10 🔗 krazedkat has joined #archiveteam-bs
09:11 🔗 GE has joined #archiveteam-bs
09:51 🔗 turnkit_ has quit IRC (Ping timeout: 268 seconds)
10:05 🔗 Smiley has joined #archiveteam-bs
10:07 🔗 SmileyG has quit IRC (Ping timeout: 250 seconds)
10:38 🔗 GE has quit IRC (Quit: zzz)
10:42 🔗 turnkit has quit IRC (Quit: Page closed)
10:44 🔗 BlueMaxim has quit IRC (Quit: Leaving)
11:21 🔗 n00b184 has quit IRC (Ping timeout: 268 seconds)
12:35 🔗 GE has joined #archiveteam-bs
14:00 🔗 SilSte has joined #archiveteam-bs
14:08 🔗 tfgbd_znc has quit IRC (Read error: Operation timed out)
14:09 🔗 tfgbd_znc has joined #archiveteam-bs
14:11 🔗 SilSte has quit IRC (Read error: Connection reset by peer)
14:12 🔗 SilSte has joined #archiveteam-bs
14:49 🔗 sep332_ has quit IRC (konversation out)
14:51 🔗 sep332_ has joined #archiveteam-bs
14:54 🔗 Start has quit IRC (Quit: Disconnected.)
15:50 🔗 Ravenloft has joined #archiveteam-bs
16:07 🔗 ravetcofx has joined #archiveteam-bs
16:25 🔗 Shakespea has joined #archiveteam-bs
16:25 🔗 Shakespea Afternoon
16:26 🔗 Shakespea I found an interesting site
16:27 🔗 Shakespea www.oldapps.com any possibility of getting it archived? I would mention this on the website, but owing to some unfortunate misunderstandings I can't raise the matter there at the moment.
16:28 🔗 Aoede "This web page at download.oldapps.com has been reported to contain unwanted software and has been blocked "
16:28 🔗 Aoede thanks firefox
16:28 🔗 Shakespea Are you using an ad-blocker?
16:28 🔗 Shakespea It loaded fine for me
16:29 🔗 Shakespea http://www.oldapps.com/index.php being the full URL
16:29 🔗 Aoede Loads fine, just doesn't let me download anything. Weird
16:30 🔗 Shakespea The useful thing is that it seems to have older versions of some 'sharing' tools... ;)
16:30 🔗 Aoede :D
16:30 🔗 Shakespea I also noted - www.mdgx.com
16:31 🔗 Shakespea Which has support files going back nearly 20 years
16:31 🔗 Shakespea (And which probably should be mirrored at some point)
16:32 🔗 Shakespea And I'm down by 2 on my 3 suggestions this month :(
16:39 🔗 Aoede mdgx was grabbed by archivebot in 2015
16:39 🔗 Aoede http://archive.fart.website/archivebot/viewer/job/49n9f
16:40 🔗 Aoede oldapps.com in 2014 http://archive.fart.website/archivebot/viewer/job/7dvez
16:42 🔗 Shakespea Aoede: Thanks... mdgx gets updated quite a bit though... so I hope it's on a regular schedule :)
16:42 🔗 Aoede Want me to throw it in Archivebot?
16:43 🔗 Shakespea Feel free, if it's possible to do an incremental
16:43 🔗 Shakespea The one thing I can never find online is old sewing patterns though....
16:44 🔗 Aoede Dunno if incremental is possible
16:48 🔗 Sanqui I think OldApps may be covered, not sure though.
17:09 🔗 Shakespea Aoede: My next query would be to look into whether wget has an 'incremental' option in it, as it saves bandwidth if you only have to add a few new files vs the whole site.
17:10 🔗 Shakespea If you want to throw it in the bot anyway , don't let me stop you :)
17:10 🔗 xmc wget does
17:10 🔗 xmc --continue
17:11 🔗 Shakespea xmc: I meant "date incremental", i.e. grab everything that's changed since we last took a sample...
17:11 🔗 xmc yep
17:11 🔗 Aoede --warc-dedup?
17:11 🔗 Shakespea Aoede: Possibly...
17:12 🔗 xmc --continue --mirror will crawl the site but only download files that are different
17:12 🔗 xmc i'm not sure exactly how it works, to be honest
17:12 🔗 Shakespea Thanks
17:12 🔗 xmc /topic unofficial wget user group
17:12 🔗 xmc anyway. wget --continue --mirror will probably do what you want. but test first
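A hedged sketch of that kind of re-crawl: --mirror turns on timestamping (-N), so wget skips files the server reports as unchanged, and --warc-dedup can point at the CDX index from a previous grab so duplicate payloads are written as revisit records instead of being stored again. The filenames here are illustrative, and skipping only works if the server sends usable Last-Modified headers:

```sh
# Re-crawl the site, skipping unchanged files, writing a dated WARC plus a CDX
# index, and deduplicating against the CDX from the previous crawl.
wget --mirror --no-parent \
     --warc-file=mdgx-$(date +%Y%m%d) --warc-cdx \
     --warc-dedup=mdgx-previous.cdx \
     http://www.mdgx.com/
```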
17:13 🔗 Shakespea My third suggestion for this month would be to ask who's archiving "adult" fiction sites like asstr, Fictionmania, etc.
17:13 🔗 Shakespea These can apparently vanish without warning ...
17:13 🔗 xmc i'm not aware of an active project for those sites
17:13 🔗 xmc you're welcome to start one
17:14 🔗 Shakespea I can't use the wiki at the moment, owing to some unfortunate misunderstandings...
17:14 🔗 yipdw archivebot's crawlers support incremental fetch to the degree the site itself makes it possible to determine what's changed
17:14 🔗 yipdw archivebot itself does not
17:14 🔗 yipdw good news is you can use wget/wpull to do that manually until that situation's resolved
17:14 🔗 Shakespea Thank you for that explanation.
17:15 🔗 xmc doesn't it use the If-Modified-Since: header ?
17:16 🔗 yipdw wget can use that yeah
17:16 🔗 yipdw but a website doesn't have to send that or send one that makes any sense
17:16 🔗 yipdw er, sorry, wget uses Last-Modified
17:17 🔗 xmc sent by the client
17:17 🔗 xmc ah
17:17 🔗 yipdw it's not clear to me whether wget does conditional GETs yet
17:17 🔗 xmc yes. the web is garbage, and we try to layer useful things over that
17:17 🔗 yipdw yeah whoops
17:18 🔗 yipdw I confused If-Modified-Since with Last-Modified, go me
17:18 🔗 xmc np
17:18 🔗 yipdw they're only different parts of the request
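To untangle the pair: Last-Modified is a response header the server attaches to a file, and If-Modified-Since is the request header a client sends back with that timestamp on its next visit; a 304 Not Modified reply means nothing needs re-downloading. A quick conditional-GET check with curl (the date and URL are illustrative, and the server has to actually honor conditional requests):

```sh
# Prints 304 if the page is unchanged since the given date, 200 otherwise.
curl -s -o /dev/null -w '%{http_code}\n' \
     -H 'If-Modified-Since: Wed, 16 Nov 2016 00:00:00 GMT' \
     http://www.mdgx.com/
```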
17:20 🔗 Shakespea But still in theory possible not to have to grab a whole site multiple times...
17:20 🔗 Shakespea ( which some may still want to do for other reasons, of course...)
17:21 🔗 Shakespea Thanks ....
17:21 🔗 Shakespea BTW my fourth of 3 suggestions to archive this month (sorry) would be pre-election news sites on Trump, before his lawyers get to them ;)
17:22 🔗 * Shakespea out
17:22 🔗 Shakespea has left
17:22 🔗 xmc uh
17:28 🔗 yipdw typing on the edge of chaos
17:44 🔗 computerf has quit IRC (Read error: Operation timed out)
18:13 🔗 computerf has joined #archiveteam-bs
19:42 🔗 Start has joined #archiveteam-bs
19:46 🔗 kristian_ has joined #archiveteam-bs
19:53 🔗 krazedkat has quit IRC (Read error: Operation timed out)
20:09 🔗 Start has quit IRC (Remote host closed the connection)
20:50 🔗 Start has joined #archiveteam-bs
20:52 🔗 Start has quit IRC (Client Quit)
20:54 🔗 Start has joined #archiveteam-bs
20:59 🔗 Start has quit IRC (Client Quit)
21:03 🔗 Start has joined #archiveteam-bs
21:11 🔗 Yoshimura has quit IRC (Remote host closed the connection)
21:44 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
21:49 🔗 BartoCH has joined #archiveteam-bs
21:54 🔗 Start has quit IRC (Remote host closed the connection)
22:24 🔗 krazedkat has joined #archiveteam-bs
23:04 🔗 Stiletto has quit IRC (Read error: Operation timed out)
23:14 🔗 BlueMaxim has joined #archiveteam-bs
23:17 🔗 GE has quit IRC (Quit: zzz)
23:17 🔗 godane so i should be past 1 million items by the morning
23:17 🔗 xmc wow!
23:29 🔗 Start has joined #archiveteam-bs
