#archiveteam-ot 2020-01-07,Tue

Time Nickname Message
00:41 🔗 kiska has quit IRC (Remote host closed the connection)
00:41 🔗 Flashfire has quit IRC (Remote host closed the connection)
00:42 🔗 kiska has joined #archiveteam-ot
00:42 🔗 Flashfire has joined #archiveteam-ot
00:45 🔗 superkuh_ is now known as superkuh
01:07 🔗 godane has quit IRC (Quit: Leaving.)
01:07 🔗 godane has joined #archiveteam-ot
01:11 🔗 josey I've been given a login for VampireFreaks (shutting down 1st Feb). I'm going to try to scrape what I can, but if ArchiveTeam are planning a scrape I've got permission to pass on the credentials.
01:13 🔗 Joseph_ has joined #archiveteam-ot
01:19 🔗 VerifiedJ has quit IRC (Read error: Operation timed out)
01:27 🔗 nepeat has joined #archiveteam-ot
01:47 🔗 atphoenix josey, I did start an archivebot scrape, but the Javascript is interfering with parts. What level of contact do you have with the site?
01:49 🔗 atphoenix The web scrape method is the most basic way to preserve. Fancier involves figuring out APIs. Fanciest involves direct access to the actual files or database tables. Usually the information saved is only that which can be fully public.
01:49 🔗 atphoenix Other information obviously has complications
01:54 🔗 josey Thanks for starting an archivebot scrape. I don't have any contact or connection to the site. It might be worth contacting the site owner about a database, but I don't know what he'd say.
01:54 🔗 josey Which bits is Javascript interfering with?
01:55 🔗 godane has quit IRC (Ping timeout: 610 seconds)
01:56 🔗 atphoenix main thing I know about right now is the photo albums use JS to show the full image. ArchiveBot can only find the thumbnails.
01:56 🔗 atphoenix also, it doesn't look like older journal entries are visible. Maybe that is different after being logged in? IDK
02:00 🔗 josey Do we have any idea when the scrape will complete? I could go through the scrape and get the full size images, and grep for "You do not have permission to view this content" so I know what to get with the account.
02:00 🔗 josey Would the login info be useful for you?
02:07 🔗 josey re. older journal entries, the oldest I've seen so far is from 2015.
02:07 🔗 josey I'm just off to bed. I'll check back in tomorrow.
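The grep idea josey describes at 02:00 could be sketched roughly as below. This is only an illustration: it assumes the crawled pages have already been extracted to plain files under a local directory (ArchiveBot itself produces WARC files, so an extraction step would come first), and the function name `pages_needing_login` is made up for this example.

```python
from pathlib import Path

# The message quoted in the log; pages containing it presumably need a login.
MARKER = "You do not have permission to view this content"


def pages_needing_login(root, marker=MARKER):
    """Return the sorted paths of files under `root` whose text contains `marker`."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            # errors="replace" keeps the scan going over files that are not valid UTF-8.
            text = path.read_text(encoding="utf-8", errors="replace")
            if marker in text:
                hits.append(path)
    return sorted(hits)
```

The resulting list is what would then be re-fetched with the logged-in account.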
02:16 🔗 godane has joined #archiveteam-ot
02:19 🔗 DogsRNice has quit IRC (Read error: Connection reset by peer)
03:08 🔗 fdstw has joined #archiveteam-ot
03:09 🔗 fdstw has quit IRC (Quit: Leaving)
03:10 🔗 cerca has quit IRC (Remote host closed the connection)
03:11 🔗 X-Scale` has joined #archiveteam-ot
03:11 🔗 fdstw has joined #archiveteam-ot
03:14 🔗 godane has quit IRC (Read error: Connection reset by peer)
03:17 🔗 X-Scale has quit IRC (Ping timeout: 610 seconds)
03:17 🔗 X-Scale` is now known as X-Scale
03:19 🔗 qnisz has joined #archiveteam-ot
03:25 🔗 fdstw has quit IRC (Read error: Operation timed out)
03:29 🔗 X-Scale` has joined #archiveteam-ot
03:34 🔗 X-Scale has quit IRC (Read error: Operation timed out)
03:34 🔗 X-Scale` is now known as X-Scale
03:53 🔗 atphoenix josey, IDK what the scrape timeline looks like. You can check progress in the http://dashboard.at.ninjawedding.org/?showNicks=1 and look for vampirefreaks or job id dqnnylxugllw9uyhtou49q4yc
03:54 🔗 atphoenix 87k in queue 422k done
04:13 🔗 qw3rty__ has joined #archiveteam-ot
04:17 🔗 qw3rty_ has quit IRC (Ping timeout: 276 seconds)
04:17 🔗 m007a83 has joined #archiveteam-ot
04:17 🔗 m007a83 has quit IRC (Read error: Connection reset by peer)
04:18 🔗 m007a83 has joined #archiveteam-ot
04:19 🔗 qnisz has quit IRC (Quit: Leaving)
05:00 🔗 godane has joined #archiveteam-ot
05:36 🔗 Raccoon` If Seagate is willing to just give a goofy Youtuber about $50k in free harddrives, I wonder what AT/IA can get by asking nicely. https://www.youtube.com/watch?v=eCz-IixxR_k
05:38 🔗 ivan Internet Archive Sponsored by Seagate
05:39 🔗 Raccoon` We needs a marketing team that gives good mouth spin.
06:32 🔗 astrid JAA: there are many non indexed items in archiveteam_youtube
06:35 🔗 astrid iirc there was a mass noindex'ing because just please stop using up space to mirror youtube
07:38 🔗 oxguy3 has quit IRC (My MacBook has gone to sleep. ZZZzzz…)
07:38 🔗 oxguy3 has joined #archiveteam-ot
07:38 🔗 oxguy3 has quit IRC (Client Quit)
07:39 🔗 oxguy3 has joined #archiveteam-ot
07:39 🔗 oxguy3 has quit IRC (Client Quit)
07:41 🔗 oxguy3 has joined #archiveteam-ot
07:41 🔗 oxguy3 has quit IRC (Client Quit)
07:42 🔗 oxguy3 has joined #archiveteam-ot
07:42 🔗 oxguy3 has quit IRC (Client Quit)
07:42 🔗 oxguy3 has joined #archiveteam-ot
07:42 🔗 oxguy3 has quit IRC (Client Quit)
07:43 🔗 oxguy3 has joined #archiveteam-ot
07:43 🔗 oxguy3 has quit IRC (Client Quit)
07:44 🔗 oxguy3 has joined #archiveteam-ot
07:44 🔗 oxguy3 has quit IRC (Client Quit)
07:45 🔗 oxguy3 has joined #archiveteam-ot
07:45 🔗 oxguy3 has quit IRC (Client Quit)
07:46 🔗 oxguy3 has joined #archiveteam-ot
07:46 🔗 oxguy3 has quit IRC (Client Quit)
07:46 🔗 oxguy3 has joined #archiveteam-ot
07:46 🔗 oxguy3 has quit IRC (Client Quit)
07:47 🔗 oxguy3 has joined #archiveteam-ot
07:47 🔗 oxguy3 has quit IRC (Client Quit)
07:52 🔗 dhyan_nat has joined #archiveteam-ot
08:33 🔗 dhyan_nat has quit IRC (Read error: Connection reset by peer)
08:33 🔗 dhyan_nat has joined #archiveteam-ot
09:15 🔗 atphoenix Raccoon`, can we get WD drives instead? :)
09:16 🔗 Raccoon` "Hey WD, Seagate offered us 100 10TB drives, would you guys do 200?"
09:18 🔗 atphoenix exactly
09:19 🔗 atphoenix in fairness, that video channel was started in 2008 and got the drives in 2017. I guess ArchiveTeam is about the same age and is due for drives :)
09:34 🔗 atphoenix DigDug told is about it in #archivebot
09:34 🔗 atphoenix told us*
09:34 🔗 atphoenix wrong channel, derp
09:35 🔗 DigiDigi has quit IRC (Remote host closed the connection)
09:51 🔗 ivan I am sure that hard drive companies are in the business of giving away free drives to weirdos and charities
10:04 🔗 Raccoon` If it means a big sponsored ad at the bottom of every page for, say, 5 years
10:06 🔗 Raccoon` And the rights to assert "Official Sponsor of the Internet Archive -- ``Preserving our history for them.``" or some touching slogan that doesn't exist yet.
10:31 🔗 josey9 has joined #archiveteam-ot
10:31 🔗 OrIdow6 As I remember (though going through in a few minutes now, I can't find any examples), there's sort of a shift of attitude in early discussion of the IA from considering overt commercial "partnerships" like this as being potentially useful to opposing them on semi-philosophical grounds of independence
10:36 🔗 OrIdow6 By the way, here's Brewster Kahle without glasses: https://web.archive.org/web/20030305092730im_/http://chronicle.com/photos/v44/i26/4426a271.jpg
10:36 🔗 josey has quit IRC (Ping timeout: 745 seconds)
10:42 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
10:51 🔗 atphoenix that independence thing is probably the key. IA probably doesn't want any heavy dependency (life or death level) on a commercial partnership
11:30 🔗 josey has joined #archiveteam-ot
11:37 🔗 josey9 has quit IRC (Ping timeout: 745 seconds)
12:00 🔗 JAA astrid: Yup, and I'm pretty sure ola_norsk was told about that at the time because he was one of the "lol tubeup!" users.
14:35 🔗 jrwr Fionera: Hello nerd
14:35 🔗 Fionera hi :D
14:51 🔗 kiska Hewo!
15:01 🔗 Mateon1 has quit IRC (Remote host closed the connection)
15:01 🔗 Mateon1 has joined #archiveteam-ot
15:10 🔗 Fionera smh i am already ages here
15:52 🔗 josey I can't see the job for vampirefreaks (id dqnnylxugllw9uyhtou49q4yc) on the ArchiveBot tracker anymore. Does that mean it's done? Is it possible to get a list of URLs so I can crawl the links that require a login?
15:59 🔗 josey atphoenix: ^
16:18 🔗 josey Are the nonindexed YouTube items still stored by IA?
16:31 🔗 systwi has quit IRC (Ping timeout: 622 seconds)
16:38 🔗 marked1 AT can use drives if IA does not want them
16:40 🔗 systwi has joined #archiveteam-ot
16:54 🔗 Craigle has quit IRC (Quit: The Lounge - https://thelounge.chat)
16:54 🔗 Craigle has joined #archiveteam-ot
16:57 🔗 atphoenix josey, check https://archive.fart.website/archivebot/viewer/
16:58 🔗 atphoenix https://archive.fart.website/archivebot/viewer/job/dqnny
16:59 🔗 atphoenix 5.4 GiB
17:00 🔗 atphoenix yes the vampirefreaks crawl finished. I know that the photos are thumbnails.
17:01 🔗 atphoenix a custom crawl of the photo URLs could fix those gaps
17:02 🔗 DigiDigi has joined #archiveteam-ot
17:25 🔗 DigiDigi has quit IRC (Remote host closed the connection)
17:31 🔗 DigiDigi has joined #archiveteam-ot
18:06 🔗 jamiew has joined #archiveteam-ot
18:12 🔗 jamiew has quit IRC (zzz)
18:13 🔗 jamiew has joined #archiveteam-ot
18:21 🔗 NickN00b has joined #archiveteam-ot
18:23 🔗 NickN00b SketchCow: ah damn it, I thought I saw you at magfest, but I wasn't sure if it was you. I should've stopped and said hi
19:09 🔗 jamiew has quit IRC (Textual IRC Client: www.textualapp.com)
21:01 🔗 schbirid has joined #archiveteam-ot
21:33 🔗 apache2_ has quit IRC (Remote host closed the connection)
21:33 🔗 jodizzle has quit IRC (Read error: Operation timed out)
21:33 🔗 jodizzle has joined #archiveteam-ot
21:35 🔗 Jens has quit IRC (Remote host closed the connection)
21:35 🔗 apache2 has joined #archiveteam-ot
21:36 🔗 Jens has joined #archiveteam-ot
21:45 🔗 apache2 has quit IRC (Remote host closed the connection)
21:47 🔗 apache2 has joined #archiveteam-ot
21:47 🔗 Jens has quit IRC (Remote host closed the connection)
21:48 🔗 Jens has joined #archiveteam-ot
21:55 🔗 ivan has quit IRC (Quit: Leaving)
21:56 🔗 jodizzle has quit IRC (Quit: ZNC 1.7.1 - https://znc.in)
21:56 🔗 jodizzle has joined #archiveteam-ot
22:11 🔗 ivan has joined #archiveteam-ot
22:12 🔗 svchfoo3 sets mode: +o ivan
22:12 🔗 svchfoo1 sets mode: +o ivan
22:25 🔗 schbirid has quit IRC (Quit: Leaving)
22:35 🔗 Joseph__ has joined #archiveteam-ot
22:35 🔗 Joseph_ has quit IRC (Read error: Connection reset by peer)
22:39 🔗 Mateon1 has quit IRC (Ping timeout: 258 seconds)
22:39 🔗 SketchCow It was me.
22:41 🔗 SootBectr So when I was running the manual python yahoo groups download I saw the nice retry with backoff timer when there's an http error. Is that all handled by a specific part of the code that I could use to run a manual download of my own? I'm looking through the code on github to see if I can find it
22:42 🔗 josey Thanks atphoenix for the links.
22:44 🔗 SootBectr e.g. I have a list of ~400k image urls and want retries and proper logging of any failures (my 5min bash script just checks filesize and retries if it's 0)
22:48 🔗 BlueMax has joined #archiveteam-ot
22:54 🔗 SootBectr or is there a recommended script/tool that I could try?
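For what SootBectr describes (retries with a backoff timer, plus a record of every failure), a minimal standard-library sketch might look like the following. It is not the code from the yahoo groups downloader, which isn't shown in this log; the names `http_get`, `fetch_with_retry` and `download_all`, and the default of 5 attempts with a 1-second base delay, are assumptions made for illustration.

```python
import logging
import time
import urllib.error
import urllib.request

log = logging.getLogger("fetch")


def http_get(url, timeout=30):
    """Fetch a URL and return the body bytes (raises on HTTP or network errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_with_retry(url, fetch=http_get, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Return the body, or None once every attempt has failed."""
    for attempt in range(1, max_attempts + 1):
        try:
            body = fetch(url)
            if body:  # an empty body counts as a failure, like the 0-byte file check
                return body
            log.warning("empty body for %s (attempt %d)", url, attempt)
        except (urllib.error.URLError, OSError) as exc:
            log.warning("error for %s (attempt %d): %s", url, attempt, exc)
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2x, 4x, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
    log.error("giving up on %s", url)
    return None


def download_all(urls, **kwargs):
    """Return (results, failures): bodies keyed by URL, and the list of URLs that failed."""
    results, failures = {}, []
    for url in urls:
        body = fetch_with_retry(url, **kwargs)
        if body is None:
            failures.append(url)
        else:
            results[url] = body
    return results, failures
```

For a list of ~400k URLs, the bodies would more sensibly be written to disk as they arrive rather than held in a dict, and the `failures` list saved to a file so a later run can retry only those.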
22:58 🔗 Craigle has quit IRC (Quit: The Lounge - https://thelounge.chat)
23:51 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
