[00:29] *** Oddly has quit IRC (Ping timeout: 255 seconds)
[00:52] *** VerfiedJ has quit IRC (Quit: Leaving)
[01:14] *** omarroth has quit IRC (Quit: Konversation terminated!)
[01:47] PurpleSym: So I'm trying to upload WARCs to the Internet Archive, but I get the HTTP status code 503. Am I being throttled?
[01:48] IA usually sends 429s, I think
[02:15] *** Sk1d has quit IRC (Read error: Operation timed out)
[02:43] *** qw3rty113 has joined #archiveteam-bs
[02:44] *** qw3rty112 has quit IRC (Ping timeout: 600 seconds)
[02:46] How long does the savenow page usually take to grab all the outlinks? One of mine has been trying for an hour now.
[02:48] Kaz: Should I contact Internet Archive?
[02:49] *** benjinsmi has quit IRC (Quit: Leaving)
[02:49] Try again later/tomorrow, then email
[02:49] http://playerthree.com/ Saving... I don't think it should have been doing this for an hour, do you?
[02:50] *** benjins has joined #archiveteam-bs
[02:52] Kaz: sorry to bother you, but do you have any opinion on the savenow outlinks capture taking so long?
[02:53] Not a clue
[02:53] I didn't even realise it did outlinks
[02:53] https://web.archive.org/save/ you can choose to grab outlinks
[03:11] *** wp494 has quit IRC (Ping timeout: 260 seconds)
[03:11] *** wp494 has joined #archiveteam-bs
[03:15] Flashfire: Well, maybe they're running low on space or bandwidth or something.
[03:15] Because they're throttling me too.
[03:15] I think.
[03:22] oh actually, looks like IA returns a 503 for slowdown
[03:22] see: http://monitor.archive.org/stats/s3.php
[03:33] *** jianaran has joined #archiveteam-bs
[03:33] c&p from #archiveteam, as it should probably have been here instead: "hi all, I'm looking to archive some twitter accounts of niche importance with a habit of regularly deleting their tweets. This is all fairly new to me, so could someone please point me in a good direction to learn how to do this?"
[03:34] I'm reasonably familiar with Python and can do a bit of bash scripting, but am far from an expert.
[03:52] jianaran: I think a decent workflow at the moment is using snscrape to produce a file full of tweet URLs and then using some other tool to request and save the URLs.
[03:53] Here's the link to snscrape: https://github.com/JustAnotherArchivist/snscrape
[03:53] That's pretty much what I've got to, but I'm struggling with turning the list of tweet URLs into a usefully-formatted series of tweets (plus attached media!)
[03:54] *** exoire has quit IRC (Read error: Operation timed out)
[03:54] What tool are you using to download the tweets? And what do you mean by usefully-formatted?
[03:55] Well, I've only really tried wget
[03:56] Usefully formatted: anything that preserves the link between text and media
[03:57] As I said, I'm brand new to this, so I don't really have any good idea of what the best solution would be
[04:05] put them in a text file and !ao < with archivebot
[04:05] single line per link
[04:06] I'd like to automate and run the job daily (or weekly, at least). Is that acceptable with archivebot?
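
(A minimal sketch of the snscrape step suggested at 03:52, producing the one-URL-per-line text file that !ao < at 04:05 expects. The account name and output path are placeholders, not from the log.)

    # List all tweet URLs for one account, one URL per line
    # ("exampleuser" and the output file name are hypothetical).
    snscrape twitter-user exampleuser > exampleuser-tweet-urls.txt
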
[04:08] https://github.com/ludios/grab-site
[04:08] probably better if you want to do it automatically at that rate
[04:08] jianaran: Hmm, I actually don't know what options would be required to make wget grab the tweets appropriately, but you could check out grab-site: https://github.com/ludios/grab-site#twittercomuser
[04:08] oops, beat me to it
[04:11] But yeah, you could set up snscrape and grab-site as a cron job or something in order to grab a user's tweets periodically
[04:11] That looks great, thanks. I'll see if I can get it working
[04:11] *** jianaran has quit IRC (Read error: Connection reset by peer)
[04:11] Using archivebot has the benefit of the links you archive being visible in the Wayback Machine, but you would have to have Flashfire or someone do it for you each time.
[04:12] not me lol, I was naughty
[04:12] No voice for me
[04:12] someone else
[04:13] actually, I think you can do !ao < without voice
[04:13] Does that archive without recursion?
[04:13] Yes
[04:18] *** newbie85 has joined #archiveteam-bs
[04:18] *** newbie85 is now known as jianaran
[04:24] omarroth: Is the 'archived_video_ids.csv' really a CSV file, or is it just a list of links? I think I downloaded it correctly, but it doesn't seem to have any columns.
[04:36] *** Sk1d has joined #archiveteam-bs
[04:45] i keep forgetting that this channel exists lol
[04:46] *** qw3rty114 has joined #archiveteam-bs
[04:48] Actually, does grab-site handle things like embedded video on tweets?
[04:48] I don't think it does, right?
[04:49] *** Sk1d has quit IRC (Read error: Operation timed out)
[04:49] *** qw3rty113 has quit IRC (Read error: Operation timed out)
[04:53] *** Sk1d has joined #archiveteam-bs
[04:55] *** odemgi has joined #archiveteam-bs
[04:56] *** odemg has quit IRC (Ping timeout: 265 seconds)
[04:57] *** odemgi_ has quit IRC (Read error: Operation timed out)
[05:06] *** Sk1d has quit IRC (Read error: Operation timed out)
[05:06] *** Frogging has quit IRC (Ping timeout: 252 seconds)
[05:08] *** odemg has joined #archiveteam-bs
[05:08] *** Frogging has joined #archiveteam-bs
[05:09] *** Sk1d has joined #archiveteam-bs
[05:19] have an example url?
[05:34] OK! Been off playing with that for a while, and I've now got grab-site working and making a WARC
[05:35] However, it doesn't seem to have pulled the embedded images (at least, nowhere that I can find). I'm using Webrecorder Player to view the WARC. Am I viewing it wrong, or does grab-site not save images for the tweets?
[06:04] Fusl: Here's a random tweet with a video that I found: https://twitter.com/9GAG/status/1085416357049524224
[06:10] nope
[06:10] root@archiveteam:/data# fgrep video.twimg.com twitter.com-9GAG-status-1085416357049524224-2019-01-16-bd5a242a/wpull.log
[06:10] root@archiveteam:/data#
[06:10] yeah, that's because it gets the video url from an xhr request json file
[06:11] does grab-site get embedded images in tweets? I feel I've heard it does, but I can't seem to get it working
[06:13] yep, it does
[06:13] root@archiveteam:/data# fgrep DvexosMXcAEvDTk.jpg twitter.com-OhNoItsFusl-status-1078526120318898176-2019-01-16-69f0b5b4/wpull.log
[06:13] 2019-01-16 06:12:20,744 - wpull.processor.web - INFO - Fetching ‘https://pbs.twimg.com/media/DvexosMXcAEvDTk.jpg’.
[06:13] 2019-01-16 06:12:20,844 - wpull.processor.web - INFO - Fetched ‘https://pbs.twimg.com/media/DvexosMXcAEvDTk.jpg’: 200 OK. Length: 96506 [image/jpeg].
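
(A hedged sketch of the snscrape + grab-site cron job idea from 04:11. The --1 flag, fetch just the page and its requisites without recursing, is taken from grab-site's README; the account name and all paths are placeholders.)

    #!/bin/bash
    # Nightly tweet grab: list the account's tweet URLs with snscrape,
    # then fetch each one into its own WARC with grab-site.
    set -e
    snscrape twitter-user exampleuser > /tmp/exampleuser-urls.txt
    while read -r url; do
        grab-site --1 "$url"   # page + requisites only, no recursion
    done < /tmp/exampleuser-urls.txt

(Scheduled with something like a crontab entry of 0 3 * * * /path/to/grab-tweets.sh.)
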
[06:17] *** Sk1d has quit IRC (Read error: Operation timed out)
[06:21] *** Sk1d has joined #archiveteam-bs
[06:29] JAA: https://mastodon.mit.edu/@dukhovni/101424659386821523 they seem to have more of these issues, so it might be worth backing up in case it goes down permanently
[06:34] *** Sk1d has quit IRC (Read error: Operation timed out)
[06:36] jianaran: grab-site definitely gets at least some images, not sure why they wouldn't show up in Webrecorder though
[06:37] sorry, to clarify: it seems to get the embedded level of detail, but doesn't seem to be fetching the full-size image (what you get when clicking on images embedded into tweets)
[06:39] *** Sk1d has joined #archiveteam-bs
[06:53] *** Sk1d has quit IRC (Read error: Operation timed out)
[06:56] jianaran: Ah, okay. I'm not sure if there's a way to get all that in a "usefully-formatted" way without simulating a browser (or at least the clicking behaviour).
[06:56] *** Sk1d has joined #archiveteam-bs
[06:56] But someone else here might know better
[06:56] jodizzle: simulating a browser isn't the worst thing in the world, if necessary. But do you know how to scrape the original-size embedded media in the first place?
[06:57] use chromebot in #archivebot to simulate clicking
[06:57] IIRC you can get it by just appending either ':orig' or ':large' to the end of a twitter image
[06:57] alternatively, build crocoite yourself
[06:58] e.g., 'https://pbs.twimg.com/media/whateverlongstring.jpg:orig'
[06:59] Oh, I was actually wondering about chromebot, Flashfire
[07:00] https://github.com/PromyLOPh/crocoite
[07:00] Yeah, just found it. This seems neat.
[07:03] Though IIRC chromebot is pretty slow, right?
[07:06] only when recursion is enabled, mainly because I don't think you can add ignores
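
(The ':orig' tip from 06:57 can be applied mechanically. A small untested sketch, assuming a file of pbs.twimg.com media URLs; the file name is hypothetical.)

    # Append ':orig' to each Twitter media URL so the server returns
    # the original-resolution file instead of the embedded size.
    sed -E 's#^(https://pbs\.twimg\.com/media/[^:]+)$#\1:orig#' media-urls.txt
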
[07:39] *** jianaran has quit IRC (Read error: Operation timed out)
[08:11] So, I've been spending months, and am now in high gear, moving hundreds of thousands of files around ARCHIVE.ORG
[08:45] *** turnkit_ has joined #archiveteam-bs
[08:46] *** turnkit has quit IRC (Read error: Operation timed out)
[08:48] *** turnkit has joined #archiveteam-bs
[08:53] *** turnkit_ has quit IRC (Ping timeout: 360 seconds)
[09:01] *** BlueMax has quit IRC (Quit: Leaving)
[09:41] *** BlueMax has joined #archiveteam-bs
[09:41] *** m007a83_ has joined #archiveteam-bs
[09:46] *** m007a83 has quit IRC (Read error: Operation timed out)
[09:49] *** m007a83_ has quit IRC (Read error: Operation timed out)
[10:03] *** Oddly has joined #archiveteam-bs
[10:18] *** Mateon1 has quit IRC (Quit: Mateon1)
[10:20] *** Mateon1 has joined #archiveteam-bs
[10:30] *** Oddly has quit IRC (Ping timeout: 255 seconds)
[10:42] *** Sk1d has quit IRC (Read error: Operation timed out)
[10:46] *** Sk1d has joined #archiveteam-bs
[10:48] *** LFlare has quit IRC (Quit: Ping timeout (120 seconds))
[10:49] *** LFlare has joined #archiveteam-bs
[10:53] *** Wigser has joined #archiveteam-bs
[10:54] Hi
[10:55] *** Wigser has quit IRC (Client Quit)
[11:00] *** Sk1d has quit IRC (Read error: Operation timed out)
[11:02] *** Sk1d has joined #archiveteam-bs
[11:24] *** BlueMax has quit IRC (Quit: Leaving)
[12:10] *** fredgido has quit IRC (Ping timeout: 633 seconds)
[12:13] *** wp494 has quit IRC (Read error: Operation timed out)
[12:13] *** wp494 has joined #archiveteam-bs
[12:40] psi: Sounds good to me. I'll throw it into ArchiveBot.
[12:41] Lovely, thanks
[12:46] *** fredgido has joined #archiveteam-bs
[12:56] *** mistym has quit IRC (Ping timeout: 506 seconds)
[12:56] *** mistym has joined #archiveteam-bs
[13:30] *** Oddly has joined #archiveteam-bs
[13:35] *** VerfiedJ has joined #archiveteam-bs
[13:42] *** Sk1d has quit IRC (Read error: Operation timed out)
[13:48] *** Sk1d has joined #archiveteam-bs
[14:00] *** Sk1d has quit IRC (Read error: Operation timed out)
[14:05] *** Sk1d has joined #archiveteam-bs
[14:17] *** Sk1d has quit IRC (Read error: Operation timed out)
[14:20] *** Sk1d has joined #archiveteam-bs
[16:38] I ended up finding these bittorrents of/for ArchiveTeam:
[16:39] magnet:?xt=urn:btih:7a318721571616333b993dd6172597deaa748083&dn=urlteam_2016-05-19-18-17-02
[16:39] magnet:?xt=urn:btih:1a00e5a54aa599d63cd5a3dc084760228d90f407&dn=archiveteam_newssites_20180217081616
[16:40] magnet:?xt=urn:btih:4cf5896b507f3ca6f50819a2788e99dfa5bcb58b&dn=urlteam
[16:41] magnet:?xt=urn:btih:82667bfe6bbeb2e928f583687071543552a59225&dn=astrid_archivebot_www_robotsandcomputers_com_20180708
[16:47] they sound like they're just IA items
[17:02] *** Mateon1 has quit IRC (Ping timeout: 360 seconds)
[17:02] *** Mateon1 has joined #archiveteam-bs
[17:18] *** Arctic has joined #archiveteam-bs
[18:07] *** Oddly has quit IRC (Ping timeout: 255 seconds)
[18:21] *** Sk1d has quit IRC (Read error: Operation timed out)
[18:24] *** Sk1d has joined #archiveteam-bs
[18:32] *** Arctic has quit IRC (Quit: Page closed)
[18:59] *** achip has quit IRC (Read error: Operation timed out)
[19:02] *** achip has joined #archiveteam-bs
[19:51] *** m007a83 has joined #archiveteam-bs
[20:01] *** second has quit IRC (Quit: ZNC 1.6.5 - http://znc.in)
[20:04] *** second has joined #archiveteam-bs
[21:08] *** BlueMax has joined #archiveteam-bs
[21:11] *** wp494 has quit IRC (Read error: Operation timed out)
[21:11] *** wp494 has joined #archiveteam-bs
[21:21] *** Despatche has joined #archiveteam-bs
[21:23] *** schbirid has quit IRC (Remote host closed the connection)
[21:24] *** Despatche has quit IRC (Remote host closed the connection)
[21:28] *** Despatche has joined #archiveteam-bs
[21:29] *** Despatche has quit IRC (Read error: Connection reset by peer)
[21:29] *** Despatche has joined #archiveteam-bs
[21:52] *** Despatche has quit IRC (Read error: Operation timed out)
[21:56] *** Despatche has joined #archiveteam-bs
[22:14] *** omarroth has joined #archiveteam-bs
[22:16] jodizzle: Sorry for the delay. It's a list of newline-separated video IDs; postgres accepts it as a valid CSV file. It may need a column name on the first line to work for you
[22:18] omarroth: Oh no, I think I can read it fine, I just wanted to make sure I wasn't missing anything.
[22:19] So are the IDs ones that definitely had annotations, or just IDs that you checked at all?
[22:19] Those are the IDs that we checked. I'm still going through everything, but I expect most of those had some form of annotation data
[22:25] Okay. When I get a chance, I'll check the IDs against videos I have downloaded.
[22:25] Thank you!
[22:25] I also know ivan_ was a big youtube hoarder
[22:26] odemg is even bigger
[22:27] also had the foresight to grab h264 instead of vp9 (f'ing iOS devices)
[22:27] Please send any annotation data my way!
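
(One hedged way to do the 22:25 "check the IDs against videos I have downloaded" step, assuming the file really is the plain newline-separated ID list omarroth describes and that local files are named <videoid>.mp4, a made-up layout.)

    # IDs that appear both in the checked list and on local disk;
    # comm requires both inputs to be sorted.
    sort archived_video_ids.csv > checked-ids.sorted
    ls /path/to/videos | sed -E 's/\.mp4$//' | sort > local-ids.sorted
    comm -12 checked-ids.sorted local-ids.sorted
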
[22:30] *** Sk1d has quit IRC (Read error: Operation timed out)
[22:35] *** Sk1d has joined #archiveteam-bs
[22:35] *** BlueMax has quit IRC (Quit: Leaving)
[22:37] *** Hani has quit IRC (Read error: Operation timed out)
[22:45] *** Hani has joined #archiveteam-bs
[22:48] *** Sk1d has quit IRC (Read error: Operation timed out)
[22:51] *** Sk1d has joined #archiveteam-bs
[23:00] *** erinmoon has quit IRC (Quit: WeeChat 2.1)
[23:03] *** newbie96 has joined #archiveteam-bs
[23:05] *** marked has quit IRC (Read error: Connection reset by peer)
[23:05] Hi all. Following on from yesterday's discussion re: archiving a twitter account that regularly deletes old posts, I've now got a nightly cronjob running to snscrape -> web-grab the tweets into a WARC, and also download all images at original resolution with ripme.
[23:05] *** newbie96 is now known as jianaran
[23:05] (whoops)
[23:08] This seems to work, *but* I'm generating a complete WARC of all currently-available tweets every day. This is obviously pretty inefficient. Two possible solutions:
[23:08] - Maintain a file with all previously scraped tweets, and have the daily job diff snscrape's output against this file, web-grab the diff, then update the file
[23:08] - Regularly merge the obtained WARCs to produce a single growing WARC containing all grabbed tweets
[23:09] Any thoughts? I'm not very familiar with the WARC format: how easy would it be to merge WARCs to produce a single archive of each URL, keeping the oldest version (so as not to overwrite content with the inevitable 404 page)?
[23:09] *** marked has joined #archiveteam-bs
[23:10] I would suggest going with the first idea. Pure merging of WARCs is extremely easy (just concatenate them), but deduplication is a different beast. It can be done, but it's definitely more complex than keeping a list of grabbed tweets and using 'comm' to filter them out from the current snscrape output.
[23:11] Isn't there an option in snscrape to get tweets newer than some date?
[23:12] Only get tweets newer than some date, I mean
[23:12] There is: snscrape twitter-search 'from:username since:2019-01-10'
[23:12] Ah okay, so I think that would be another solution, right? Just only get tweets newer than the date you last scraped.
[23:12] That could be useful, but I feel that getting all *new* tweets would be safer (if a nightly run doesn't execute for whatever reason, the next day should grab everything that was missed)
[23:13] You could configure your cron job to write the date to a file or something when it runs.
[23:13] Then read it back at the beginning of the job.
[23:13] when it completes successfully*
[23:13] Right
[23:14] That's exactly what I did with my ArchiveBot item listing thingy.
[23:14] that's true
[23:15] OK, so that approach would give me lots of small but non-overlapping WARCs, which should be easy enough to concatenate
[23:18] JAA: What is the ArchiveBot thingy?
[23:19] jodizzle: https://github.com/JustAnotherArchivist/archivebot-archives
[23:20] It's a mediocre replacement for the ArchiveBot viewer, which doesn't always list all data. I stopped the automatic updates late last year though because it needs a rewrite.
[23:23] *** Sk1d has quit IRC (Read error: Operation timed out)
[23:24] *** BlueMax has joined #archiveteam-bs
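
(A sketch of the "write the date on successful completion" idea discussed at 23:12-23:13. The snscrape twitter-search syntax is taken from the log; the paths, account name, and fallback date are placeholders.)

    #!/bin/bash
    set -e
    stamp=/twitter-archival/exampleuser-last-run
    # Fall back to an arbitrary early date on the very first run.
    since=$(cat "$stamp" 2>/dev/null || echo 2019-01-01)
    snscrape twitter-search "from:exampleuser since:$since" > /tmp/new-tweet-urls.txt
    # ... web-grab /tmp/new-tweet-urls.txt into a WARC here ...
    # Only reached if everything above exited successfully.
    date +%Y-%m-%d > "$stamp"
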
[23:27] OK! How does this look, for a .sh script to be run nightly as a cronjob: https://pastebin.com/5VSBRm77
[23:27] *** Sk1d has joined #archiveteam-bs
[23:27] ($username is hardcoded: if I want to start scraping more than one account, it shouldn't be too hard to loop the whole thing and read from a list of usernames)
[23:32] jianaran: comm requires sorted input, so you'll have to do something like: comm -23 <(sort /twitter-archival/$username-snscrape) <(sort /twitter-archival/$username-snscrape-archive)
[23:34] *** SmileyG has joined #archiveteam-bs
[23:34] snscrape produces output sorted by decreasing tweet ID, but comm needs lexicographically ascending sorted files (e.g. tweet ID 100 would come before 19).
[23:34] (snscrape's output order is actually not guaranteed. It just prints whatever Twitter's servers return.)
[23:36] *** Smiley has quit IRC (Ping timeout: 265 seconds)
[23:40] *** Sk1d has quit IRC (Read error: Operation timed out)
[23:44] *** Sk1d has joined #archiveteam-bs
[23:44] *** exoire has joined #archiveteam-bs
[23:46] JAA: thanks, that's really helpful.
[23:59] *** omarroth has quit IRC (Ping timeout: 268 seconds)
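
(The pastebin script itself is not preserved in the log. As a closing illustration, here is one hypothetical way the pieces discussed tonight could fit together, including JAA's comm fix from 23:32; all names and paths are invented, and grab-site's --1 flag is assumed from its README.)

    #!/bin/bash
    set -e
    username=exampleuser
    dir=/twitter-archival
    touch "$dir/$username-snscrape-archive"
    snscrape twitter-user "$username" > "$dir/$username-snscrape"
    # comm needs lexicographically sorted input (per 23:32).
    comm -23 <(sort "$dir/$username-snscrape") \
             <(sort "$dir/$username-snscrape-archive") > "$dir/$username-new"
    while read -r url; do
        grab-site --1 "$url"   # page + requisites only, no recursion
    done < "$dir/$username-new"
    # Record the grabbed URLs only after the grabs succeeded.
    sort -u "$dir/$username-snscrape-archive" "$dir/$username-new" \
        -o "$dir/$username-snscrape-archive"

(Per 23:10, the many small WARCs this produces can later be merged by simple concatenation.)
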