#archiveteam-bs 2019-01-16,Wed


Time Nickname Message
00:29 🔗 Oddly has quit IRC (Ping timeout: 255 seconds)
00:52 🔗 VerfiedJ has quit IRC (Quit: Leaving)
01:14 🔗 omarroth has quit IRC (Quit: Konversation terminated!)
01:47 🔗 t3 PurpleSym: So I'm trying to upload WARCs to the Internet Archive, but I get the HTTP status code 503. Am I being throttled?
01:48 🔗 Kaz IA usually sends 429s, I think
02:15 🔗 Sk1d has quit IRC (Read error: Operation timed out)
02:43 🔗 qw3rty113 has joined #archiveteam-bs
02:44 🔗 qw3rty112 has quit IRC (Ping timeout: 600 seconds)
02:46 🔗 Flashfire How long does the savenow page usually take to grab all the outlinks? One of mine has been trying for an hour now
02:48 🔗 t3 Kaz: Should I contact Internet Archive?
02:49 🔗 benjinsmi has quit IRC (Quit: Leaving)
02:49 🔗 Kaz Try again later/tomorrow, then email
02:49 🔗 Flashfire http://playerthree.com/ Saving... I don't think it should have been doing this for an hour, do you?
02:50 🔗 benjins has joined #archiveteam-bs
02:52 🔗 Flashfire Kaz, sorry to bother you, but do you have any opinion on the savenow outlinks capture taking so long?
02:53 🔗 Kaz Not a clue
02:53 🔗 Kaz I didn't even realise it did outlinks
02:53 🔗 Flashfire https://web.archive.org/save/ you can choose to grab outlinks
03:11 🔗 wp494 has quit IRC (Ping timeout: 260 seconds)
03:11 🔗 wp494 has joined #archiveteam-bs
03:15 🔗 t3 Flashfire: Well maybe they're running low on space or bandwidth or something.
03:15 🔗 t3 Because they're throttling me too.
03:15 🔗 t3 I think.
03:22 🔗 Kaz oh actually, looks like IA returns a 503 for slowdown
03:22 🔗 Kaz see: http://monitor.archive.org/stats/s3.php
03:33 🔗 jianaran has joined #archiveteam-bs
03:33 🔗 jianaran c&p from #archiveteam, as it should probably have been here instead: " hi all, I'm looking to archive some twitter accounts of niche importance with a habit of regularly deleting their tweets. This is all fairly new to me, so could someone please point me in a good direction to learn how to do this?"
03:34 🔗 jianaran I'm reasonably familiar with Python and can do a bit of bash scripting, but am far from an expert.
03:52 🔗 jodizzle jianaran: I think a decent workflow at the moment is using snscrape to produce a file full of tweet URLs and then using some other tool to request and save the URLs.
03:53 🔗 jodizzle Here's the link to snscrape: https://github.com/JustAnotherArchivist/snscrape
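A minimal sketch of that first step, assuming snscrape's twitter-user scraper and a hypothetical account name; by default snscrape prints one tweet URL per line:

    # hypothetical: dump all currently visible tweet URLs for one account, one per line
    snscrape twitter-user exampleuser > exampleuser-tweet-urls.txt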
03:53 🔗 jianaran That's pretty much where I've got to, but I'm struggling with turning the list of tweet URLs into a usefully-formatted series of tweets (plus attached media!)
03:54 🔗 exoire has quit IRC (Read error: Operation timed out)
03:54 🔗 jodizzle What tool are you using to download the tweets? And what do you mean by usefully-formatted?
03:55 🔗 jianaran Well, I've only really tried wget
03:56 🔗 jianaran Usefully formatted: anything that preserves the link between text and media
03:57 🔗 jianaran As I said, I'm brand new to this so I don't really have any good idea of what the best solution would be
04:05 🔗 Flashfire put them in a text file and !ao < with archivebot
04:05 🔗 Flashfire single line per link
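For reference, the list submission would look something like this, assuming the newline-separated list has been uploaded to some reachable (hypothetical) URL:

    !ao < https://example.com/exampleuser-tweet-urls.txt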
04:06 🔗 jianaran I'd like to automate and run the job daily (or weekly, at least). Is that acceptable with archivebot?
04:08 🔗 Kaz https://github.com/ludios/grab-site
04:08 🔗 Kaz probably better if you want to do it automatically at that rate
04:08 🔗 jodizzle jianaran: Hmm I actually don't know what options would be required to make wget grab the tweets appropriately, but you could check out grab-site: https://github.com/ludios/grab-site#twittercomuser
04:08 🔗 jodizzle oops beat me to it
04:11 🔗 jodizzle But yeah you could set up snscrape and grab-site as a cron job or something in order to grab a user's tweets periodically
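A hypothetical crontab entry for that kind of periodic grab, assuming the snscrape and grab-site steps are wrapped in a script at a made-up path:

    # run the (hypothetical) wrapper script every night at 03:00 and append its output to a log
    0 3 * * * /home/archiver/bin/grab-tweets.sh exampleuser >> /home/archiver/logs/grab-tweets.log 2>&1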
04:11 🔗 jianaran That looks great, thanks. I'll see if I can get it working
04:11 🔗 jianaran has quit IRC (Read error: Connection reset by peer)
04:11 🔗 jodizzle Using archivebot has the benefit of the links you archive being visible in the Wayback Machine, but you would have to have Flashfire or someone do it for you each time.
04:12 🔗 Flashfire not me lol I was naughty
04:12 🔗 Flashfire No voice for me
04:12 🔗 Flashfire someone else
04:13 🔗 Flashfire Actually I think you can do !ao < without voice
04:13 🔗 jodizzle Does that archive without recursion?
04:13 🔗 Flashfire Yes
04:18 🔗 newbie85 has joined #archiveteam-bs
04:18 🔗 newbie85 is now known as jianaran
04:24 🔗 jodizzle omarroth: Is the 'archived_video_ids.csv' really a CSV file, or is it just a list of links? I think I downloaded it correctly but it doesn't seem to have any columns.
04:36 🔗 Sk1d has joined #archiveteam-bs
04:45 🔗 Fusl i keep forgetting that this channel exists lol
04:46 🔗 qw3rty114 has joined #archiveteam-bs
04:48 🔗 jodizzle Actually, does grab-site handle things like embedded video on tweets?
04:48 🔗 jodizzle I don't think it does right?
04:49 🔗 Sk1d has quit IRC (Read error: Operation timed out)
04:49 🔗 qw3rty113 has quit IRC (Read error: Operation timed out)
04:53 🔗 Sk1d has joined #archiveteam-bs
04:55 🔗 odemgi has joined #archiveteam-bs
04:56 🔗 odemg has quit IRC (Ping timeout: 265 seconds)
04:57 🔗 odemgi_ has quit IRC (Read error: Operation timed out)
05:06 🔗 Sk1d has quit IRC (Read error: Operation timed out)
05:06 🔗 Frogging has quit IRC (Ping timeout: 252 seconds)
05:08 🔗 odemg has joined #archiveteam-bs
05:08 🔗 Frogging has joined #archiveteam-bs
05:09 🔗 Sk1d has joined #archiveteam-bs
05:19 🔗 Fusl have an example url?
05:34 🔗 jianaran OK! Been off playing with that for a while, and I've now got grab-site working and making a WARC
05:35 🔗 jianaran However, it doesn't seem to have pulled the embedded images (at least, nowhere that I can find). I'm using Webrecorder player to view the WARC. Am I viewing it wrong, or does grab-site not save images for the tweets?
06:04 🔗 jodizzle Fusl: Here's a random tweet with a video that I found: https://twitter.com/9GAG/status/1085416357049524224
06:10 🔗 Fusl nope
06:10 🔗 Fusl root@archiveteam:/data# fgrep video.twimg.com twitter.com-9GAG-status-1085416357049524224-2019-01-16-bd5a242a/wpull.log
06:10 🔗 Fusl root@archiveteam:/data#
06:10 🔗 Fusl yeah, that's because it gets the video URL from an XHR request JSON file
06:11 🔗 jianaran does grab-site get embedded images in tweets? I feel I've heard it does, but I can't seem to get it working
06:13 🔗 Fusl yep it does
06:13 🔗 Fusl root@archiveteam:/data# fgrep DvexosMXcAEvDTk.jpg twitter.com-OhNoItsFusl-status-1078526120318898176-2019-01-16-69f0b5b4/wpull.log
06:13 🔗 Fusl 2019-01-16 06:12:20,744 - wpull.processor.web - INFO - Fetching ‘https://pbs.twimg.com/media/DvexosMXcAEvDTk.jpg’.
06:13 🔗 Fusl 2019-01-16 06:12:20,844 - wpull.processor.web - INFO - Fetched ‘https://pbs.twimg.com/media/DvexosMXcAEvDTk.jpg’: 200 OK. Length: 96506 [image/jpeg].
06:17 🔗 Sk1d has quit IRC (Read error: Operation timed out)
06:21 🔗 Sk1d has joined #archiveteam-bs
06:29 🔗 psi JAA: https://mastodon.mit.edu/@dukhovni/101424659386821523 they seem to have more of these issues so might be worth backing up in case it goes down permanently
06:34 🔗 Sk1d has quit IRC (Read error: Operation timed out)
06:36 🔗 jodizzle jianaran: grab-site definitely gets at least some images, not sure why they wouldn't show up in Webrecorder though
06:37 🔗 jianaran Sorry, to clarify: it seems to get the images at the embedded size, but doesn't seem to be fetching the full-size versions (what you get when clicking on images embedded in tweets)
06:39 🔗 Sk1d has joined #archiveteam-bs
06:53 🔗 Sk1d has quit IRC (Read error: Operation timed out)
06:56 🔗 jodizzle jianaran: Ah okay. I'm not sure if there's a way to get all that in a "usefully-formatted" way without simulating a browser (or at least the clicking-behavior).
06:56 🔗 Sk1d has joined #archiveteam-bs
06:56 🔗 jodizzle But someone else here might know better
06:56 🔗 jianaran jodizzle: simulating a browser isn't the worst thing in the world, if necessary. But do you know how to scrape the original size embedded media in the first place?
06:57 🔗 Flashfire use chromebot in #archivebot to simulate clicking
06:57 🔗 jodizzle IIRC you can get it by just appending either ':orig' or ':large' to the end of a Twitter image URL
06:57 🔗 Flashfire alternatively build crocoite yourself
06:58 🔗 jodizzle e.g., 'https://pbs.twimg.com/media/whateverlongstring.jpg:orig'
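A small bash sketch of that trick, assuming a hypothetical file of already-collected pbs.twimg.com URLs; the ':orig' suffix is simply appended to each URL before fetching:

    # hypothetical: re-fetch collected Twitter images at original resolution
    while read -r url; do
        wget "${url}:orig"
    done < image-urls.txt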
06:59 🔗 jodizzle Oh I was actually wondering about chromebot Flashfire
07:00 🔗 Flashfire https://github.com/PromyLOPh/crocoite
07:00 🔗 jodizzle Yeah just found it. This seems neat.
07:03 🔗 jodizzle Though IIRC chromebot is pretty slow right?
07:06 🔗 Flashfire only when recursion is enabled, mainly because I don't think you can add ignores
07:39 🔗 jianaran has quit IRC (Read error: Operation timed out)
08:11 🔗 SketchCow So, I've been spending months, and am now in high gear, moving hundreds of thousands of files around ARCHIVE.ORG
08:45 🔗 turnkit_ has joined #archiveteam-bs
08:46 🔗 turnkit has quit IRC (Read error: Operation timed out)
08:48 🔗 turnkit has joined #archiveteam-bs
08:53 🔗 turnkit_ has quit IRC (Ping timeout: 360 seconds)
09:01 🔗 BlueMax has quit IRC (Quit: Leaving)
09:41 🔗 BlueMax has joined #archiveteam-bs
09:41 🔗 m007a83_ has joined #archiveteam-bs
09:46 🔗 m007a83 has quit IRC (Read error: Operation timed out)
09:49 🔗 m007a83_ has quit IRC (Read error: Operation timed out)
10:03 🔗 Oddly has joined #archiveteam-bs
10:18 🔗 Mateon1 has quit IRC (Quit: Mateon1)
10:20 🔗 Mateon1 has joined #archiveteam-bs
10:30 🔗 Oddly has quit IRC (Ping timeout: 255 seconds)
10:42 🔗 Sk1d has quit IRC (Read error: Operation timed out)
10:46 🔗 Sk1d has joined #archiveteam-bs
10:48 🔗 LFlare has quit IRC (Quit: Ping timeout (120 seconds))
10:49 🔗 LFlare has joined #archiveteam-bs
10:53 🔗 Wigser has joined #archiveteam-bs
10:54 🔗 Wigser Hi
10:55 🔗 Wigser has quit IRC (Client Quit)
11:00 🔗 Sk1d has quit IRC (Read error: Operation timed out)
11:02 🔗 Sk1d has joined #archiveteam-bs
11:24 🔗 BlueMax has quit IRC (Quit: Leaving)
12:10 🔗 fredgido has quit IRC (Ping timeout: 633 seconds)
12:13 🔗 wp494 has quit IRC (Read error: Operation timed out)
12:13 🔗 wp494 has joined #archiveteam-bs
12:40 🔗 JAA psi: Sounds good to me. I'll throw it into ArchiveBot.
12:41 🔗 psi Lovely thanks
12:46 🔗 fredgido has joined #archiveteam-bs
12:56 🔗 mistym has quit IRC (Ping timeout: 506 seconds)
12:56 🔗 mistym has joined #archiveteam-bs
13:30 🔗 Oddly has joined #archiveteam-bs
13:35 🔗 VerfiedJ has joined #archiveteam-bs
13:42 🔗 Sk1d has quit IRC (Read error: Operation timed out)
13:48 🔗 Sk1d has joined #archiveteam-bs
14:00 🔗 Sk1d has quit IRC (Read error: Operation timed out)
14:05 🔗 Sk1d has joined #archiveteam-bs
14:17 🔗 Sk1d has quit IRC (Read error: Operation timed out)
14:20 🔗 Sk1d has joined #archiveteam-bs
16:38 🔗 yano I ended up finding these bittorrents of/for ArchiveTeam:
16:39 🔗 yano magnet:?xt=urn:btih:7a318721571616333b993dd6172597deaa748083&dn=urlteam_2016-05-19-18-17-02
16:39 🔗 yano magnet:?xt=urn:btih:1a00e5a54aa599d63cd5a3dc084760228d90f407&dn=archiveteam_newssites_20180217081616
16:40 🔗 yano magnet:?xt=urn:btih:4cf5896b507f3ca6f50819a2788e99dfa5bcb58b&dn=urlteam
16:41 🔗 yano magnet:?xt=urn:btih:82667bfe6bbeb2e928f583687071543552a59225&dn=astrid_archivebot_www_robotsandcomputers_com_20180708
16:47 🔗 Kaz they sound like they're just IA items
17:02 🔗 Mateon1 has quit IRC (Ping timeout: 360 seconds)
17:02 🔗 Mateon1 has joined #archiveteam-bs
17:18 🔗 Arctic has joined #archiveteam-bs
18:07 🔗 Oddly has quit IRC (Ping timeout: 255 seconds)
18:21 🔗 Sk1d has quit IRC (Read error: Operation timed out)
18:24 🔗 Sk1d has joined #archiveteam-bs
18:32 🔗 Arctic has quit IRC (Quit: Page closed)
18:59 🔗 achip has quit IRC (Read error: Operation timed out)
19:02 🔗 achip has joined #archiveteam-bs
19:51 🔗 m007a83 has joined #archiveteam-bs
20:01 🔗 second has quit IRC (Quit: ZNC 1.6.5 - http://znc.in)
20:04 🔗 second has joined #archiveteam-bs
21:08 🔗 BlueMax has joined #archiveteam-bs
21:11 🔗 wp494 has quit IRC (Read error: Operation timed out)
21:11 🔗 wp494 has joined #archiveteam-bs
21:21 🔗 Despatche has joined #archiveteam-bs
21:23 🔗 schbirid has quit IRC (Remote host closed the connection)
21:24 🔗 Despatche has quit IRC (Remote host closed the connection)
21:28 🔗 Despatche has joined #archiveteam-bs
21:29 🔗 Despatche has quit IRC (Read error: Connection reset by peer)
21:29 🔗 Despatche has joined #archiveteam-bs
21:52 🔗 Despatche has quit IRC (Read error: Operation timed out)
21:56 🔗 Despatche has joined #archiveteam-bs
22:14 🔗 omarroth has joined #archiveteam-bs
22:16 🔗 omarroth jodizzle: Sorry for the delay. It's a list of newline-separated video IDs; Postgres accepts it as a valid CSV file. It may need a column name on the first line to work for you
22:18 🔗 jodizzle omarroth: Oh no I think I can read it fine, I just wanted to make sure I wasn't missing anything.
22:19 🔗 jodizzle So are the IDs ones that definitely had annotations, or just IDs that you checked at all?
22:19 🔗 omarroth Those are the IDs that we checked. I'm still going through everything but I expect most of those had some form of annotation data
22:25 🔗 jodizzle Okay. When I get a chance I'll check the IDs against videos I have downloaded.
22:25 🔗 omarroth Thank you!
22:25 🔗 jodizzle I also know ivan_ was a big youtube hoarder
22:26 🔗 ivan_ odemg is even bigger
22:27 🔗 ivan_ also had the foresight to grab h264 instead of vp9 (f'ing iOS devices)
22:27 🔗 omarroth Please send any annotation data my way!
22:30 🔗 Sk1d has quit IRC (Read error: Operation timed out)
22:35 🔗 Sk1d has joined #archiveteam-bs
22:35 🔗 BlueMax has quit IRC (Quit: Leaving)
22:37 🔗 Hani has quit IRC (Read error: Operation timed out)
22:45 🔗 Hani has joined #archiveteam-bs
22:48 🔗 Sk1d has quit IRC (Read error: Operation timed out)
22:51 🔗 Sk1d has joined #archiveteam-bs
23:00 🔗 erinmoon has quit IRC (Quit: WeeChat 2.1)
23:03 🔗 newbie96 has joined #archiveteam-bs
23:05 🔗 marked has quit IRC (Read error: Connection reset by peer)
23:05 🔗 newbie96 Hi all. Following on from yesterday's discussion regarding archiving a twitter account that regularly deletes its posts, I've now got a nightly cronjob running to snscrape -> web-grab the tweets into a WARC, and also download all images at original resolution with ripme.
23:05 🔗 newbie96 is now known as jianaran
23:05 🔗 jianaran (whoops)
23:08 🔗 jianaran This seems to work, *but* I'm generating a complete WARC of all currently-available tweets every day. This is obviously pretty inefficient. Two possible solutions:
23:08 🔗 jianaran -Maintain a file with all previously scraped tweets, and have the daily job diff snscrape's output against this file, web-grab the diff, then update the file
23:08 🔗 jianaran -Regularly merge the obtained WARCs to produce a single growing WARC containing all grabbed tweets
23:09 🔗 jianaran Any thoughts? I'm not very familiar with the WARC format: how easy would it be to merge WARCs to produce a single archive of each URL, keeping the oldest version (so as not to overwrite content with the inevitable 404 page)?
23:09 🔗 marked has joined #archiveteam-bs
23:10 🔗 JAA I would suggest going with the first idea. Pure merging of WARCs is extremely easy (just concatenate them), but deduplication is a different beast. It can be done, but it's definitely more complex than keeping a list of grabbed tweets and using 'comm' to filter them out from the current snscrape output.
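For the record, a sketch of the trivial concatenation JAA mentions, with hypothetical file names (gzipped WARCs are just gzip members, so appending them produces a valid combined WARC):

    # hypothetical: combine daily WARCs into a single file
    cat exampleuser-2019-01-15.warc.gz exampleuser-2019-01-16.warc.gz > exampleuser-combined.warc.gz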
23:11 🔗 jodizzle Isn't there an option in snscrape to get tweets newer than some date?
23:12 🔗 jodizzle Only get tweets newer than some date, I mean
23:12 🔗 JAA There is: snscrape twitter-search 'from:username since:2019-01-10'
23:12 🔗 jodizzle Ah okay, so I think that would be another solution right? Just only get tweets newer than the date you last scraped.
23:12 🔗 jianaran That could be useful, but I feel that getting all *new* tweets would be safer (if a nightly run doesn't execute for whatever reason, the next day should grab everything that was missed)
23:13 🔗 jodizzle You could configure your cron job to write the date to a file or something when it runs.
23:13 🔗 jodizzle Then read it back at the beginning of the job.
23:13 🔗 JAA when it completes successfully*
23:13 🔗 jodizzle Right
23:14 🔗 JAA That's exactly what I did with my ArchiveBot item listing thingy.
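A bash sketch of that incremental approach, with hypothetical file names; the date is only recorded when the scrape succeeds, so a missed night is picked up on the next run:

    # read the date of the last successful run, defaulting to an old date on the first run
    last_run=$(cat last-run.txt 2>/dev/null || echo 2019-01-01)
    # scrape only tweets newer than that date; record today's date only if the scrape succeeds
    snscrape twitter-search "from:exampleuser since:${last_run}" > new-tweet-urls.txt \
        && date +%F > last-run.txt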
23:14 🔗 jianaran that's true
23:15 🔗 jianaran OK, so that approach would give me lots of small but non-overlapping WARCs, which should be easy enough to concatenate
23:18 🔗 jodizzle JAA: What is the ArchiveBot thingy?
23:19 🔗 JAA jodizzle: https://github.com/JustAnotherArchivist/archivebot-archives
23:20 🔗 JAA It's a mediocre replacement for the ArchiveBot viewer, which doesn't always list all data. I stopped the automatic updates late last year though because it needs a rewrite.
23:23 🔗 Sk1d has quit IRC (Read error: Operation timed out)
23:24 🔗 BlueMax has joined #archiveteam-bs
23:27 🔗 jianaran OK! How does this look, for a .sh script to be run nightly as a cronjob: https://pastebin.com/5VSBRm77
23:27 🔗 Sk1d has joined #archiveteam-bs
23:27 🔗 jianaran ($username is hardcoded: if I want to start scraping more than one account, it shouldn't be too hard to loop the whole thing and read from a list of usernames)
23:32 🔗 JAA jianaran: comm requires sorted input, so you'll have to do something like: comm -23 <(sort /twitter-archival/$username-snscrape) <(sort /twitter-archival/$username-snscrape-archive)
23:34 🔗 SmileyG has joined #archiveteam-bs
23:34 🔗 JAA snscrape produces output sorted by decreasing tweet ID, but comm needs lexicographically ascending sorted files (e.g. tweet ID 100 would come before 19).
23:34 🔗 JAA (snscrape's output order is actually not guaranteed. It just prints whatever Twitter's servers return.)
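Putting JAA's corrections together, a hypothetical sketch of the relevant part of the nightly script, using the file names quoted from the pastebin above:

    username=exampleuser
    # fresh scrape of everything currently visible on the account
    snscrape twitter-search "from:${username}" > "/twitter-archival/${username}-snscrape"
    # keep only URLs not already in the archive list (comm needs sorted input)
    comm -23 <(sort "/twitter-archival/${username}-snscrape") \
             <(sort "/twitter-archival/${username}-snscrape-archive") \
             > "/twitter-archival/${username}-new"
    # ...hand ${username}-new to grab-site/wget here, then mark the URLs as archived
    cat "/twitter-archival/${username}-new" >> "/twitter-archival/${username}-snscrape-archive"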
23:36 🔗 Smiley has quit IRC (Ping timeout: 265 seconds)
23:40 🔗 Sk1d has quit IRC (Read error: Operation timed out)
23:44 🔗 Sk1d has joined #archiveteam-bs
23:44 🔗 exoire has joined #archiveteam-bs
23:46 🔗 jianaran JAA: thanks, that's really helpful.
23:59 🔗 omarroth has quit IRC (Ping timeout: 268 seconds)
